Tuesday, March 17, 2015

6 - String Operations - Which month was the snowiest (snow and temperature in 2012)

Copy-pasted from pandas-cookbook. We take the 2012 weather table from the previous (fifth) chapter, filter the 'Weather' column by keyword with contains('Snow'), and plot snowfall over the year. Then we use resample() to find the monthly median temperatures and combine snowfall and temperature in neat charts.
You'll see that the 'Weather' column has a text description of the weather that was going on each hour. We'll assume it's snowing if the text description contains "Snow"... pandas provides vectorized string functions, to make it easy to operate on columns containing text. There are some great examples in the documentation.
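
To get a feel for what "vectorized string functions" means before touching the real data, here is a minimal sketch. The four descriptions are made up for illustration (only loosely modeled on the 'Weather' column); the `.str` methods themselves are standard pandas.

```python
import pandas as pd

# Hypothetical mini-sample, loosely modeled on the 'Weather' column.
weather = pd.Series(['Fog', 'Freezing Drizzle,Fog', 'Snow', 'Mostly Cloudy'])

# Each .str method is applied to every element, with no explicit loop.
lowered = weather.str.lower()                     # lower-case every description
n_conditions = weather.str.split(',').str.len()   # number of comma-separated conditions per row
has_fog = weather.str.contains('Fog')             # boolean mask, one value per row

print(lowered.tolist())
print(n_conditions.tolist())
print(has_fog.tolist())
```

The point is that every call returns a new Series of the same length, which is what makes the result usable as a filter or as input to further aggregation.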

In [7]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# The pandas option 'display.mpl_style' has been removed; use a matplotlib style instead
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 3)

We saw earlier that pandas is really good at dealing with dates. It is also amazing with strings! We're going to go back to our weather data from Chapter 5, here.

In [8]:
weather_2012 = pd.read_csv('../data/weather_2012.csv', parse_dates=True, index_col='Date/Time')
weather_2012[:5]
Out[8]:
                     Temp (C)  Dew Point Temp (C)  Rel Hum (%)  Wind Spd (km/h)  Visibility (km)  Stn Press (kPa)               Weather
Date/Time
2012-01-01 00:00:00      -1.8                -3.9           86                4              8.0           101.24                   Fog
2012-01-01 01:00:00      -1.8                -3.7           87                4              8.0           101.24                   Fog
2012-01-01 02:00:00      -1.8                -3.4           89                7              4.0           101.26  Freezing Drizzle,Fog
2012-01-01 03:00:00      -1.5                -3.2           88                6              4.0           101.27  Freezing Drizzle,Fog
2012-01-01 04:00:00      -1.5                -3.3           88                7              4.8           101.23                   Fog

5 rows × 7 columns
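
If you don't have the CSV file at hand, the effect of `parse_dates=True` together with `index_col` can be tried on a few rows typed in by hand. This is only a sketch: the text below keeps just two of the seven columns, and quoting the field that contains a comma is my assumption about how such a value has to be written, not a statement about the real file.

```python
import io
import pandas as pd

# First three rows of the table above, re-typed as CSV text (two columns only).
# The quoted field keeps the comma in 'Freezing Drizzle,Fog' from splitting the column.
csv_text = '''Date/Time,Temp (C),Weather
2012-01-01 00:00:00,-1.8,Fog
2012-01-01 01:00:00,-1.8,Fog
2012-01-01 02:00:00,-1.8,"Freezing Drizzle,Fog"
'''

# parse_dates=True asks pandas to parse the index column as dates.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col='Date/Time')

print(type(sample.index).__name__)
print(sample['Weather'].iloc[2])
```

Having a real datetime index is what later allows the data to be resampled by month.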

6.1 String operations

In [9]:
weather_description = weather_2012['Weather']
is_snowing = weather_description.str.contains('Snow')

This gives us a binary vector, which is a bit hard to look at, so we'll plot it.

In [10]:
# Not super useful
is_snowing[:5]
Out[10]:
Date/Time
2012-01-01 00:00:00    False
2012-01-01 01:00:00    False
2012-01-01 02:00:00    False
2012-01-01 03:00:00    False
2012-01-01 04:00:00    False
Name: Weather, dtype: bool
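
A side note on `str.contains`: by default it is case-sensitive, and missing descriptions need some thought. The sketch below uses made-up values (including a lower-case one and a NaN) to show the `case` and `na` parameters; whether the real 'Weather' column has such values is not something the data above tells us.

```python
import numpy as np
import pandas as pd

# Hypothetical descriptions, including a lower-case value and a missing one.
desc = pd.Series(['Snow', 'snow showers', 'Rain', np.nan, 'Snow,Fog'])

# Default call: case-sensitive; how the missing entry is reported
# depends on the dtype and pandas version.
default_match = desc.str.contains('Snow')

# Ignore case and explicitly treat missing descriptions as "not snowing".
robust_match = desc.str.contains('snow', case=False, na=False)

print(robust_match.tolist())
print(int(robust_match.sum()))
```

With `na=False` the result is a clean boolean mask, which is convenient if it is going to be converted to numbers and averaged.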
In [11]:
# More useful! (cast to float so the plot gets numeric data)
is_snowing.astype(float).plot()
Out[11]:
<matplotlib.axes.AxesSubplot at 0x10b930650>

6.2 Use resampling to find the snowiest month

If we wanted the median temperature each month, we could use the resample() method like this:

In [12]:
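# resample(..., how=...) was removed from pandas; call the aggregation as a method ('M' is spelled 'ME' from pandas 2.2 on)
weather_2012['Temp (C)'].resample('M').median().plot(kind='bar')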
Out[12]:
<matplotlib.axes.AxesSubplot at 0x10bd89810>

Unsurprisingly, July and August are the warmest.
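
To see what resampling by month does on something small enough to check by hand, here is a sketch with five invented readings. It passes a `pd.offsets.MonthEnd()` object instead of a string alias; that is my choice for the example, so that it does not depend on which alias spelling the installed pandas accepts.

```python
import pandas as pd

# Hypothetical readings: three in January, two in February.
index = pd.to_datetime(['2012-01-05', '2012-01-15', '2012-01-25',
                        '2012-02-10', '2012-02-20'])
temps = pd.Series([-10.0, -2.0, -6.0, 1.0, 3.0], index=index)

# Bucket the readings by calendar month and take the median of each bucket.
# Each bucket is labeled with the last day of its month.
monthly_median = temps.resample(pd.offsets.MonthEnd()).median()

print(monthly_median)
```

The January bucket holds -10, -6 and -2, so its median is -6; February holds 1 and 3, so its median is 2.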

So we can think of snowiness as being a bunch of 1s and 0s instead of Trues and Falses:

In [13]:
is_snowing.astype(float)[:10]
Out[13]:
Date/Time
2012-01-01 00:00:00    0
2012-01-01 01:00:00    0
2012-01-01 02:00:00    0
2012-01-01 03:00:00    0
2012-01-01 04:00:00    0
2012-01-01 05:00:00    0
2012-01-01 06:00:00    0
2012-01-01 07:00:00    0
2012-01-01 08:00:00    0
2012-01-01 09:00:00    0
Name: Weather, dtype: float64

and then use resample to find the fraction of time it was snowing each month (the mean of the 0s and 1s):

In [14]:
is_snowing.astype(float).resample('M').mean()
Out[14]:
Date/Time
2012-01-31    0.240591
2012-02-29    0.162356
2012-03-31    0.087366
2012-04-30    0.015278
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.038889
2012-12-31    0.251344
Freq: M, Name: Weather, dtype: float64
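
Why does the mean work here? The mean of a column of 0s and 1s is simply the number of 1s divided by the number of rows. A small sketch with invented flags checks this; it groups by month number instead of resampling, only to avoid empty months between January and July in such a tiny sample.

```python
import pandas as pd

# Hypothetical hourly flags: 2 of 4 January observations have snow, 0 of 2 in July.
index = pd.to_datetime(['2012-01-01 00:00', '2012-01-01 01:00',
                        '2012-01-01 02:00', '2012-01-01 03:00',
                        '2012-07-01 00:00', '2012-07-01 01:00'])
flags = pd.Series([True, False, True, False, False, False], index=index)

as_float = flags.astype(float)   # True -> 1.0, False -> 0.0

# Group by calendar month and compare the mean with (number of 1s) / (number of rows).
by_month = as_float.groupby(as_float.index.month)
fraction = by_month.mean()
manual = by_month.sum() / by_month.count()

print(fraction.to_dict())
```

Both ways of computing it give the same numbers, so a value like 0.25 in the monthly output means snow was reported in about a quarter of that month's hourly observations.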
In [15]:
is_snowing.astype(float).resample('M').mean().plot(kind='bar')
Out[15]:
<matplotlib.axes.AxesSubplot at 0x10c189890>

So now we know! In 2012, December was the snowiest month. Also, this graph suggests something that I feel -- it starts snowing pretty abruptly in November, and then tapers off slowly and takes a long time to stop, with the last snow usually being in April or May.
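
The snowiest month can also be read off without looking at the chart. The sketch below retypes the twelve monthly values from Out[14] and asks pandas for the label of the largest one; `idxmax` and `sort_values` are standard Series methods.

```python
import pandas as pd

# Monthly snowiness values copied from Out[14] above.
snowiness = pd.Series(
    [0.240591, 0.162356, 0.087366, 0.015278, 0.0, 0.0,
     0.0, 0.0, 0.0, 0.0, 0.038889, 0.251344],
    index=pd.to_datetime(['2012-01-31', '2012-02-29', '2012-03-31', '2012-04-30',
                          '2012-05-31', '2012-06-30', '2012-07-31', '2012-08-31',
                          '2012-09-30', '2012-10-31', '2012-11-30', '2012-12-31']),
)

snowiest = snowiness.idxmax()                        # index label of the largest value
top3 = snowiness.sort_values(ascending=False).head(3)  # three snowiest months, largest first

print(snowiest)
print(top3)
```

This agrees with the chart: December comes first, followed by January and February.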

6.3 Plotting temperature and snowiness stats together

We can also combine these two statistics (temperature, and snowiness) into one dataframe and plot them together:

In [16]:
# resample(..., how=...) was removed from pandas; call the aggregation as a method
temperature = weather_2012['Temp (C)'].resample('M').median()
is_snowing = weather_2012['Weather'].str.contains('Snow')
snowiness = is_snowing.astype(float).resample('M').mean()

# Name the columns
temperature.name = "Temperature"
snowiness.name = "Snowiness"

We'll use concat again to combine the two statistics into a single dataframe.

In [17]:
stats = pd.concat([temperature, snowiness], axis=1)
stats
Out[17]:
            Temperature  Snowiness
2012-01-31        -7.05   0.240591
2012-02-29        -4.10   0.162356
2012-03-31         2.60   0.087366
2012-04-30         6.30   0.015278
2012-05-31        16.05   0.000000
2012-06-30        19.60   0.000000
2012-07-31        22.90   0.000000
2012-08-31        22.20   0.000000
2012-09-30        16.10   0.000000
2012-10-31        11.30   0.000000
2012-11-30         1.05   0.038889
2012-12-31        -2.85   0.251344

12 rows × 2 columns
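
Two details of `pd.concat(..., axis=1)` are worth seeing on a toy example: the Series names become the column names (which is why the columns were named first), and rows are matched by index label rather than by position. The values and the 'Jan'/'Feb'/'Mar' labels below are invented for the sketch.

```python
import pandas as pd

# Hypothetical series whose index labels only partly overlap.
temperature = pd.Series([-7.0, -4.0, 2.5], index=['Jan', 'Feb', 'Mar'], name='Temperature')
snowiness = pd.Series([0.24, 0.16], index=['Jan', 'Feb'], name='Snowiness')

# axis=1 places the series side by side as columns, aligned on the index.
stats = pd.concat([temperature, snowiness], axis=1)

print(stats)
```

'Mar' has no snowiness value, so that cell is filled with NaN. In the real data both series come from the same monthly resampling, so their indexes match and there are no gaps.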

In [18]:
stats.plot(kind='bar')
Out[18]:
<matplotlib.axes.AxesSubplot at 0x10c51f690>

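Uh, that didn't work so well because the two columns are on very different scales: temperature runs from about -7 to 23, while snowiness never exceeds about 0.25, so its bars are barely visible. We can do better by plotting them on two separate graphs: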

In [19]:
stats.plot(kind='bar', subplots=True, figsize=(15, 10))
Out[19]:
array([<matplotlib.axes.AxesSubplot object at 0x10c268650>,
       <matplotlib.axes.AxesSubplot object at 0x10c7c1390>], dtype=object)
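
Another option, not used in the cookbook, is to keep one chart and give snowiness its own y-axis on the right via the `secondary_y` argument of `DataFrame.plot`. The sketch below is an assumption-laden variant: it retypes only four rows of the stats table with short month labels, and the axis labels are my own wording. The `matplotlib.use('Agg')` line is only there so it runs as a plain script; in a notebook with `%matplotlib inline` it can be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Four of the monthly rows from the stats table above, with short labels.
stats = pd.DataFrame(
    {'Temperature': [-7.05, 6.30, 22.90, -2.85],
     'Snowiness': [0.240591, 0.015278, 0.0, 0.251344]},
    index=['Jan', 'Apr', 'Jul', 'Dec'],
)

# secondary_y puts the listed columns on a separate right-hand y-axis.
ax = stats.plot(kind='bar', secondary_y=['Snowiness'], figsize=(15, 3))
ax.set_ylabel('Median temperature (C)')
ax.right_ax.set_ylabel('Fraction of hours with snow')
```

Compared with `subplots=True`, this keeps both statistics in the same picture, at the cost of a chart with two scales that the reader has to keep apart.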

