Continuing to copy from pandas-cookbook. We filter sidewalk noise complaints, first in Brooklyn and then across all boroughs, with a digression on getting a NumPy array...
I'd like to know which borough has the most noise complaints. First, we'll take a look at the data to see what it looks like... huge ugly table :)
# The usual preamble
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Make the graphs a bit prettier, and bigger
# (the old 'display.mpl_style' pandas option no longer exists; a matplotlib style is used instead)
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)
# This was necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13. The option was later renamed from 'display.line_width' to 'display.width'.
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
Let's continue with our NYC 311 service requests example.
complaints = pd.read_csv('../data/311-service-requests.csv')
complaints[:5]
To get the noise complaints, we need to find the rows where the "Complaint Type" column is "Noise - Street/Sidewalk". I'll show you how to do that, and then explain what's going on.
noise_complaints = complaints[complaints['Complaint Type'] == "Noise - Street/Sidewalk"]
noise_complaints[:3]
If you look at noise_complaints, you'll see that this worked, and it only contains complaints with the right complaint type. But how does this work? Let's deconstruct it into two pieces. The first piece is the comparison on its own:
complaints['Complaint Type'] == "Noise - Street/Sidewalk"
This is a big array of True and False values, one for each row in our dataframe. When we index our dataframe with this array, we get just the rows where our boolean array evaluated to True. It's important to note that for row filtering by a boolean array, the boolean array must have the same length as our dataframe's index.
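To see the mask on its own, here is a minimal sketch on a tiny made-up dataframe. The toy data below is hypothetical and only stands in for the 311 dataset. It shows that the comparison yields one boolean per row, and that indexing with it keeps only the True rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the 311 dataset
toy = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking',
                       'Noise - Street/Sidewalk', 'Blocked Driveway'],
    'Borough': ['BROOKLYN', 'QUEENS', 'MANHATTAN', 'BROOKLYN'],
})

# The comparison produces a boolean Series, one value per row
mask = toy['Complaint Type'] == 'Noise - Street/Sidewalk'
print(mask.tolist())   # [True, False, True, False]

# Indexing with the mask keeps only the rows where it is True
filtered = toy[mask]
print(len(mask) == len(toy), len(filtered))
```

Note that the filtered dataframe keeps the original row labels (0 and 2 here) rather than renumbering them.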
You can also combine more than one condition with the & operator like this:
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
in_brooklyn = complaints['Borough'] == "BROOKLYN"
complaints[is_noise & in_brooklyn][:5]
Or if we just wanted a few columns:
complaints[is_noise & in_brooklyn][['Complaint Type', 'Borough', 'Created Date', 'Descriptor']][:10]
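Two details are easy to trip over when combining conditions, so here is a small sketch on hypothetical toy data (not the 311 dataset). If the comparisons are written inline rather than stored in variables first, each one needs its own parentheses, because & binds more tightly than ==. And .loc can filter rows and pick columns in one step, as an alternative to chaining two sets of brackets.

```python
import pandas as pd

# Hypothetical toy data standing in for the 311 dataset
toy = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking',
                       'Noise - Street/Sidewalk', 'Noise - Street/Sidewalk'],
    'Borough': ['BROOKLYN', 'BROOKLYN', 'MANHATTAN', 'BROOKLYN'],
    'Descriptor': ['Loud Talking', 'Blocked Hydrant',
                   'Loud Music/Party', 'Loud Music/Party'],
})

# Inline conditions need parentheses: & binds more tightly than ==
both = (toy['Complaint Type'] == 'Noise - Street/Sidewalk') & (toy['Borough'] == 'BROOKLYN')

# .loc filters rows and picks columns in a single indexing step
subset = toy.loc[both, ['Borough', 'Descriptor']]
print(subset)
```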
3.2 A digression about numpy arrays
On the inside, the type of a column is pd.Series
pd.Series([1,2,3])
and pandas Series are backed by numpy arrays (at least for plain numeric columns). If you add .values to the end of any Series, you'll get its underlying numpy array:
np.array([1,2,3])
pd.Series([1,2,3]).values
So this boolean-array-selection business is actually something that works with any numpy array:
arr = np.array([1,2,3])
arr != 2
arr[arr != 2]
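As a slightly fuller sketch of the same idea: newer pandas versions also offer .to_numpy() as the recommended way to get the array, and numpy masks can be combined element-wise just like the dataframe conditions earlier. The values are the same toy numbers as in the lines above.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])

# .to_numpy() is the currently recommended accessor; .values also works
arr = s.to_numpy()
print(type(arr))

# Comparing an array to a scalar gives a boolean array of the same shape
mask = arr != 2
print(mask)        # [ True False  True]
print(arr[mask])   # [1 3]

# Masks combine element-wise with & (and), | (or) and ~ (not)
print(arr[(arr > 1) & ~(arr == 3)])   # [2]
```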
3.3 So, which borough has the most noise complaints?
is_noise = complaints['Complaint Type'] == "Noise - Street/Sidewalk"
noise_complaints = complaints[is_noise]
noise_complaints['Borough'].value_counts()
It's Manhattan! But raw counts favor boroughs that file more complaints overall. What if we wanted to divide by the total number of complaints in each borough, to get a share that is easier to compare? That would be easy too:
noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()
noise_complaint_counts / complaint_counts
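Oops, why was that zero? That's no good. This is because of integer division in Python 2. (In Python 3, / always performs true division, so the result above is already a float and the conversion below is harmless but unnecessary.) Let's fix it by converting complaint_counts into an array of floats.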
noise_complaint_counts / complaint_counts.astype(float)
(noise_complaint_counts / complaint_counts.astype(float)).plot(kind='bar')
So Manhattan really does complain more about noise than the other boroughs! Neat.
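One thing worth knowing about the division above: pandas matches the two value_counts() results by borough label, not by position. The sketch below uses hypothetical toy data (not the 311 dataset) to show this, including what happens when a borough has no noise complaints at all.

```python
import pandas as pd

# Hypothetical toy data standing in for the 311 dataset
toy = pd.DataFrame({
    'Complaint Type': ['Noise - Street/Sidewalk', 'Illegal Parking',
                       'Noise - Street/Sidewalk', 'Illegal Parking',
                       'Noise - Street/Sidewalk', 'Illegal Parking'],
    'Borough': ['MANHATTAN', 'MANHATTAN', 'BROOKLYN',
                'BROOKLYN', 'BROOKLYN', 'QUEENS'],
})

is_noise = toy['Complaint Type'] == 'Noise - Street/Sidewalk'
noise_counts = toy[is_noise]['Borough'].value_counts()
total_counts = toy['Borough'].value_counts()

# Division aligns on borough labels; QUEENS has no noise rows, so it comes out as NaN
ratio = noise_counts / total_counts

# Treat "no noise complaints" as a share of 0 and sort for readability
print(ratio.fillna(0).sort_values(ascending=False))
```

Whether a missing borough should count as 0 or stay NaN is a judgment call; fillna(0) is just one reasonable choice here.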