Continuing to copy from pandas-cookbook. We filter sidewalk noise complaints, first in Brooklyn and then across all boroughs, with a digression on how to get a NumPy array...
I'd like to know which borough has the most noise complaints. First, we'll take a look at the data to see what it looks like... huge ugly table :)
# The usual preamble
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Make the graphs a bit prettier, and bigger
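# pd.set_option('display.mpl_style', 'default') was removed from later pandas releases;
# matplotlib's built-in 'ggplot' style is used here as a stand-in.
plt.style.use('ggplot')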
plt.rcParams['figure.figsize'] = (15, 5)
# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000)  # 'display.line_width' is the older, deprecated name
pd.set_option('display.max_columns', 60)
Let's continue with our NYC 311 service requests example.
complaints = pd.read_csv('../data/311-service-requests.csv')
complaints[:5]
To get the noise complaints, we need to find the rows where the "Complaint Type" column is "Noise - Street/Sidewalk". I'll show you how to do that, and then explain what's going on.
noise_complaints = complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]
noise_complaints[:3]
If you look at noise_complaints, you'll see that this worked, and it only contains complaints with the right complaint type. But how does this work? Let's take it apart, starting with the inner expression:
complaints['Complaint Type'] == "Noise - Street/Sidewalk"
This is a big array of Trues and Falses, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where the array is True. Note that for row filtering the boolean array must have exactly as many entries as the dataframe has rows.
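The two steps can be tried on a tiny table. This is only a sketch: the `toy` dataframe and its values are made up for illustration and are not part of the 311 dataset.

```python
import pandas as pd

# Toy data standing in for the 311 dataset (values are invented for illustration)
toy = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking', 'Noise - Street/Sidewalk'],
    'Borough': ['BROOKLYN', 'QUEENS', 'MANHATTAN'],
})

# Step 1: the comparison yields one boolean per row
mask = toy['Complaint Type'] == 'Noise - Street/Sidewalk'
mask_list = mask.tolist()

# Step 2: indexing with the mask keeps only the rows where it is True
filtered = toy[mask]
kept_boroughs = filtered['Borough'].tolist()

# A boolean list whose length differs from the number of rows is rejected
try:
    toy[[True, False]]
    length_error = False
except ValueError:
    length_error = True

print(mask_list, kept_boroughs, length_error)
```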
You can also combine more than one condition with the & operator like this:
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]
Or if we just wanted a few columns:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']][:10]
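The row filter and the column selection can also be written as one indexing step with `.loc`. A minimal sketch on made-up data (the `toy` dataframe is hypothetical). The parentheses matter when the comparisons are written inline, because `&` binds more tightly than `==`.

```python
import pandas as pd

# Toy data standing in for the 311 dataset (values are invented for illustration)
toy = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Noise - Street/Sidewalk', 'Illegal Parking'],
    'Borough': ['BROOKLYN', 'QUEENS', 'BROOKLYN'],
    'Descriptor': ['Loud Talking', 'Loud Music/Party', 'Blocked Hydrant'],
})

# Each comparison is wrapped in parentheses before combining with &
both = (toy['Complaint Type'] == 'Noise - Street/Sidewalk') & (toy['Borough'] == 'BROOKLYN')

# .loc takes the row mask and the list of columns together
subset = toy.loc[both, ['Borough', 'Descriptor']]
print(subset)
```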
3.2 A digression about numpy arrays
On the inside, the type of a column is pd.Series
pd.Series([1,2,3])
and pandas Series are backed by numpy arrays. If you add .values to the end of a Series like this one, you'll get its underlying numpy array:
np.array([1,2,3])
pd.Series([1,2,3]).values
So this boolean-array-selection business is actually something that works with any numpy array:
arr = np.array([1,2,3])
arr != 2
arr[arr != 2]
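Masks combine on plain numpy arrays in the same way as on dataframes. A small sketch, assuming pandas 0.24 or later for `to_numpy()` (on older versions `.values` plays the same role):

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3])
arr = series.to_numpy()  # the underlying numpy array, like series.values

# Elementwise AND of two boolean arrays; parentheses are needed around each comparison
mask = (arr != 2) & (arr > 1)
selected = arr[mask]
print(mask, selected)
```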
3.3 So, which borough has the most noise complaints?
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
noise_complaints = complaints[is_noise]
noise_complaints['Borough'].value_counts()
It's Manhattan! But what if we wanted to divide by the total number of complaints, to make it make a bit more sense? That would be easy too:
noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()
noise_complaint_counts / complaint_counts
Oops, why was that zero? That's no good. This is because of integer division in Python 2. Let's fix it by converting complaint_counts into an array of floats. (In Python 3, / is always true division, so the conversion is not needed there.)
noise_complaint_counts / complaint_counts.astype(float)
(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar')
So Manhattan really does complain more about noise than the other boroughs! Neat.
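The division works because pandas lines up the two `value_counts()` results by borough name, not by position. A sketch on made-up data (the `toy` dataframe and its counts are invented and say nothing about the real 311 numbers). The `groupby` line at the end is an alternative way to get the same shares, not the method used above.

```python
import pandas as pd

# Toy data: 1 of 2 Manhattan rows and 1 of 4 Brooklyn rows are noise complaints
toy = pd.DataFrame({
    'Borough': ['MANHATTAN', 'MANHATTAN', 'BROOKLYN', 'BROOKLYN', 'BROOKLYN', 'BROOKLYN'],
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking',
                       'Noise - Street/Sidewalk', 'Illegal Parking',
                       'Illegal Parking', 'Illegal Parking'],
})

is_noise = toy['Complaint Type'] == 'Noise - Street/Sidewalk'
noise_counts = toy[is_noise]['Borough'].value_counts()
all_counts = toy['Borough'].value_counts()

# Division aligns the two Series on their index labels (the borough names)
share = noise_counts / all_counts.astype(float)

# Alternative: the mean of a boolean column is the fraction of True values
share_alt = is_noise.groupby(toy['Borough']).mean()
print(share, share_alt)
```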