Friday, February 14, 2014

Pandas Data Frames and Kevin Durant 2012-13 Game Log with Mahdi Yusuf

Mahdi Yusuf recorded a 20-minute screencast using data that I managed to find by its file name, Kevin Durant 2012-13 Game Log. In it he analyzes the performance of the player Kevin Durant...
The CSV file had to be edited beforehand. The post then covers creating a new table, converting a string such as "40:00" into a number of seconds (2400), and grouping. It ends with a chart built with the vincent module (Vega).
Importing vincent caused some trouble because the command "from IPython.display import display, HTML, Javascript" was not documented. The solution was found thanks to http://nbviewer.ipython.org/gist/anonymous/5436794
In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo('BM7j6YGOv7U')
Out[1]:

Introduction to Pandas DataFrames

In [60]:
# do not forget to run this import if vincent needs it
from IPython.display import display, HTML, Javascript
# %pylab inline
In [3]:
import pandas as pd
In [7]:
pd.set_option('display.max_columns',None)
In [8]:
# I forgot ... I've already imported pandas
from pandas import DataFrame, Series
In [61]:
# https://pypi.python.org/pypi/vincent
import vincent
vincent.core.initialize_notebook()
<IPython.core.display.Javascript at 0xb3fb780>
<IPython.core.display.Javascript at 0xb3fb780>

Importing csv data

As you can see, the CSV file has two defects. First, two items are missing in the header row (... 'Tm','','Opp','','GS',...). Second, and more serious, the header row is repeated after every 20 rows. The video does not mention either problem...
I removed the repeated header rows and inserted the two missing header string literals, 'H/A' and 'W/L'.
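The same cleanup can also be done in pandas instead of editing the file by hand. The following is only a minimal sketch under stated assumptions: the in-memory text is a hypothetical miniature of the raw export (7 columns instead of 30), and the repeated header row is assumed to be recognizable by the literal 'Rk' in its first cell.

```python
import io
import pandas as pd

# Hypothetical miniature of the raw export: two header cells are empty
# and the header row is repeated in the middle of the data.
raw = io.StringIO(
    "Rk,G,Tm,,Opp,,PTS\n"
    "1,1,OKC,@,SAS,L (-2),23\n"
    "2,2,OKC,,POR,W (+14),23\n"
    "Rk,G,Tm,,Opp,,PTS\n"
    "3,3,OKC,,ATL,L (-9),22\n"
)

# Full list of names, including the two that are missing in the file.
names = ['Rk', 'G', 'Tm', 'H/A', 'Opp', 'W/L', 'PTS']

# Read every row as text, without treating any row as a header.
df = pd.read_csv(raw, header=None, names=names, dtype=str)

# Drop every copy of the header row, then restore numeric types.
df = df[df['Rk'] != 'Rk'].reset_index(drop=True)
for col in ['Rk', 'G', 'PTS']:
    df[col] = pd.to_numeric(df[col])

print(df)
```

With the real file, the `names` list would be the 30-item `columns` list defined in the next cell.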
In [11]:
columns=['Rk','G','Date','Age','Tm','H/A','Opp','W/L','GS','MP','FG','FGA','FG%','3P','3PA','3P%','FT','FTA','FT%','ORB','DRB','TRB','AST','STL','BLK','TOV','PF','PTS','GmSc','+/-']
In [37]:
data=pd.read_csv('../data/KevinDuran2012-13GameLog.csv',names=columns)
data.head()
Out[37]:
Rk G Date Age Tm H/A Opp W/L GS MP FG FGA FG% 3P 3PA 3P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS GmSc +/-
0 1 1 2012-11-01 24-033 OKC @ SAS L (-2) 1 40:39 9 18 0.500 1 2 0.500 4 5 0.800 2 12 14 5 2 0 4 1 23 19.7 2
1 2 2 2012-11-02 24-034 OKC NaN POR W (+14) 1 42:41 7 14 0.500 1 3 0.333 8 12 0.667 2 15 17 7 0 2 6 1 23 20.2 13
2 3 3 2012-11-04 24-036 OKC NaN ATL L (-9) 1 42:05 7 17 0.412 1 4 0.250 7 8 0.875 1 11 12 8 3 2 6 3 22 19.3 -8
3 4 4 2012-11-06 24-038 OKC NaN TOR W (+20) 1 29:09 4 11 0.364 1 4 0.250 6 6 1.000 0 6 6 3 2 0 4 3 15 9.6 23
4 5 5 2012-11-08 24-040 OKC @ CHI W (+6) 1 38:23 11 19 0.579 0 2 0.000 2 2 1.000 1 3 4 1 3 3 6 1 24 16.1 -3
5 rows × 30 columns
The explanations of all the column abbreviations can be found in the header row of the source HTML table (see the tip attributes):
In []:
...<tr class="">
  <th data-stat="ranker" align="right" class="ranker sort_default_asc show_partial_when_sorting" tip="Rank">Rk</th>
  <th data-stat="game_season" align="right" class="tooltip" tip="Season Game">G</th>
  <th data-stat="date_game" align="left" class="tooltip sort_default_asc">Date</th>
  <th data-stat="age" align="center" class="tooltip sort_default_asc" tip="Age of Player at the start of February 1st of that season.">Age</th>
  <th data-stat="team_id" align="left" class="tooltip sort_default_asc" tip="Team">Tm</th>
  <th data-stat="game_location" align="center" class="tooltip"></th>
  <th data-stat="opp_id" align="left" class="tooltip sort_default_asc" tip="Opponent">Opp</th>
  <th data-stat="game_result" align="center" class="tooltip"></th>
  <th data-stat="gs" align="right" class="tooltip" tip="Games Started">GS</th>
  <th data-stat="mp" align="right" class="tooltip" tip="Minutes Played">MP</th>
  <th data-stat="fg" align="right" class="tooltip" tip="Field Goals">FG</th>
  <th data-stat="fga" align="right" class="tooltip" tip="Field Goal Attempts">FGA</th>
  <th data-stat="fg_pct" align="right" class="tooltip" tip="Field Goal Percentage">FG%</th>
  <th data-stat="fg3" align="right" class="tooltip" tip="3-Point Field Goals">3P</th>
  <th data-stat="fg3a" align="right" class="tooltip" tip="3-Point Field Goal Attempts">3PA</th>
  <th data-stat="fg3_pct" align="right" class="tooltip" tip="3-Point Field Goal Percentage">3P%</th>
  <th data-stat="ft" align="right" class="tooltip" tip="Free Throws">FT</th>
  <th data-stat="fta" align="right" class="tooltip" tip="Free Throw Attempts">FTA</th>
  <th data-stat="ft_pct" align="right" class="tooltip" tip="Free Throw Percentage">FT%</th>
  <th data-stat="orb" align="right" class="tooltip" tip="Offensive Rebounds">ORB</th>
  <th data-stat="drb" align="right" class="tooltip" tip="Defensive Rebounds">DRB</th>
  <th data-stat="trb" align="right" class="tooltip" tip="Total Rebounds">TRB</th>
  <th data-stat="ast" align="right" class="tooltip" tip="Assists">AST</th>
  <th data-stat="stl" align="right" class="tooltip" tip="Steals">STL</th>
  <th data-stat="blk" align="right" class="tooltip" tip="Blocks">BLK</th>
  <th data-stat="tov" align="right" class="tooltip" tip="Turnovers">TOV</th>
  <th data-stat="pf" align="right" class="tooltip" tip="Personal Fouls">PF</th>
  <th data-stat="pts" align="right" class="tooltip" tip="Points">PTS</th>
  <th data-stat="game_score" align="right" class="tooltip" tip="Game Score">GmSc</th>
  <th data-stat="plus_minus" align="right" class="tooltip" tip="Plus/Minus">+/-</th>
...</table>

Deleting columns

In [38]:
del data['Rk']
del data['H/A']
del data['Tm']
#del data['Opp']
del data['GS']
del data['W/L']
data.head()
Out[38]:
G Date Age Opp MP FG FGA FG% 3P 3PA 3P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS GmSc +/-
0 1 2012-11-01 24-033 SAS 40:39 9 18 0.500 1 2 0.500 4 5 0.800 2 12 14 5 2 0 4 1 23 19.7 2
1 2 2012-11-02 24-034 POR 42:41 7 14 0.500 1 3 0.333 8 12 0.667 2 15 17 7 0 2 6 1 23 20.2 13
2 3 2012-11-04 24-036 ATL 42:05 7 17 0.412 1 4 0.250 7 8 0.875 1 11 12 8 3 2 6 3 22 19.3 -8
3 4 2012-11-06 24-038 TOR 29:09 4 11 0.364 1 4 0.250 6 6 1.000 0 6 6 3 2 0 4 3 15 9.6 23
4 5 2012-11-08 24-040 CHI 38:23 11 19 0.579 0 2 0.000 2 2 1.000 1 3 4 1 3 3 6 1 24 16.1 -3
5 rows × 25 columns
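The del statements above modify data in place. A minimal sketch of a non-destructive alternative, assuming a pandas version that supports the `columns=` keyword of `DataFrame.drop`; the small frame below is a hypothetical stand-in for the real game log:

```python
import pandas as pd

# Hypothetical stand-in for the game log (only a few of the 30 columns).
data = pd.DataFrame({
    'Rk': [1, 2],
    'Tm': ['OKC', 'OKC'],
    'Opp': ['SAS', 'POR'],
    'PTS': [23, 23],
})

# drop() returns a new frame and leaves `data` untouched,
# so the dropped columns stay available for later steps.
trimmed = data.drop(columns=['Rk', 'Tm'])

print(list(trimmed.columns))
print(list(data.columns))
```

Because the original frame is kept, a column that turns out to be needed later does not require re-reading the CSV file.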

Field Goals Made/Field Goals Attempted per minute

In [20]:
data[['MP','FG','FGA']].dtypes
Out[20]:
MP     object
FG      int64
FGA     int64
dtype: object
In [27]:
temp=data[['MP','FG','FGA']]
MP has dtype object (a string), which is not convenient for us. Let us try to convert it into seconds...
In [22]:
import time
import datetime
In [26]:
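def string_to_seconds(minutes):
    # Parse an 'MM:SS' string such as '40:39' and return the total number of seconds.
    # Note: '%M' only accepts values from 0 to 59.
    parsed=time.strptime(str(minutes),'%M:%S')
    return datetime.timedelta(minutes=parsed.tm_min, seconds=parsed.tm_sec).total_seconds()

# print() with parentheses works in both Python 2 and Python 3
print(string_to_seconds('40:00'))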
2400.0
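As an alternative to time.strptime, the conversion can be sketched with pandas string methods. This is not the approach used in the video, just one possible variant; the five values are the first rows of the 'MP' column shown in data.head() above.

```python
import pandas as pd

# First rows of the 'MP' column as shown in data.head() above.
mp = pd.Series(['40:39', '42:41', '42:05', '29:09', '38:23'])

# Split 'MM:SS' into two integer columns: 0 = minutes, 1 = seconds.
parts = mp.str.split(':', expand=True).astype(int)

# Total seconds; unlike '%M' in strptime, this also accepts 60 or more minutes.
seconds = parts[0] * 60 + parts[1]

print(seconds.tolist())  # [2439, 2561, 2525, 1749, 2303]
```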

And now we can take every element of the 'MP' column and convert it to seconds
In [29]:
temp['MP']=temp['MP'].map(string_to_seconds)
temp.head()
-c:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead

Out[29]:
     MP  FG  FGA
0  2439   9   18
1  2561   7   14
2  2525   7   17
3  1749   4   11
4  2303  11   19
5 rows × 3 columns
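The SettingWithCopyWarning above appears because temp was created by selecting columns from data, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it, sketched here on a hypothetical two-row frame with a helper named `to_seconds` made up for this example, is to take an explicit copy before assigning:

```python
import pandas as pd

# Hypothetical two-row stand-in for the game log.
data = pd.DataFrame({
    'MP': ['40:39', '42:41'],
    'FG': [9, 7],
    'FGA': [18, 14],
    'PTS': [23, 23],
})

def to_seconds(mmss):
    # 'MM:SS' -> total number of seconds
    minutes, seconds = str(mmss).split(':')
    return int(minutes) * 60 + int(seconds)

# .copy() makes `temp` an independent frame, so assigning to it is unambiguous.
temp = data[['MP', 'FG', 'FGA']].copy()
temp['MP'] = temp['MP'].map(to_seconds)

print(temp['MP'].tolist())
print(data['MP'].tolist())  # the original column is unchanged
```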
In [30]:
temp.dtypes
Out[30]:
MP     float64
FG       int64
FGA      int64
dtype: object
In [34]:
# Attempts per minute - create new column
temp['FGA/M']=temp['FGA']*60/temp['MP']
temp['FG/M']=temp['FG']*60/temp['MP']
temp.head()
-c:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead

Out[34]:
     MP  FG  FGA      FG/M     FGA/M
0  2439   9   18  0.221402  0.442804
1  2561   7   14  0.163998  0.327997
2  2525   7   17  0.166337  0.403960
3  1749   4   11  0.137221  0.377358
4  2303  11   19  0.286583  0.495007
5 rows × 5 columns
In [35]:
temp.describe()
Out[35]:
                MP         FG        FGA       FG/M      FGA/M
count    81.000000  81.000000  81.000000  81.000000  81.000000
mean   2310.271605   9.024691  17.691358   0.234632   0.455967
std     346.315355   2.554287   5.001605   0.057639   0.094907
min    1426.000000   4.000000   8.000000   0.095390   0.267499
25%    2146.000000   7.000000  14.000000   0.202703   0.387812
50%    2346.000000   9.000000  17.000000   0.232288   0.447344
75%    2525.000000  10.000000  21.000000   0.267857   0.513619
max    2980.000000  16.000000  31.000000   0.411487   0.699029
8 rows × 5 columns

Let us form a new subset for grouping

In [39]:
data.head()
Out[39]:
G Date Age Opp MP FG FGA FG% 3P 3PA 3P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS GmSc +/-
0 1 2012-11-01 24-033 SAS 40:39 9 18 0.500 1 2 0.500 4 5 0.800 2 12 14 5 2 0 4 1 23 19.7 2
1 2 2012-11-02 24-034 POR 42:41 7 14 0.500 1 3 0.333 8 12 0.667 2 15 17 7 0 2 6 1 23 20.2 13
2 3 2012-11-04 24-036 ATL 42:05 7 17 0.412 1 4 0.250 7 8 0.875 1 11 12 8 3 2 6 3 22 19.3 -8
3 4 2012-11-06 24-038 TOR 29:09 4 11 0.364 1 4 0.250 6 6 1.000 0 6 6 3 2 0 4 3 15 9.6 23
4 5 2012-11-08 24-040 CHI 38:23 11 19 0.579 0 2 0.000 2 2 1.000 1 3 4 1 3 3 6 1 24 16.1 -3
5 rows × 25 columns
We deleted the 'Opp' column earlier, so we have to rerun data=pd.read_csv('../data/KevinDuran2012-13GameLog.csv',names=columns) in the cell under the 'Importing csv data' heading of this post (and keep the del data['Opp'] line commented out)...
In [40]:
group_by_opp=data.groupby('Opp')
In [42]:
group_by_opp.size()
Out[42]:
Opp
ATL    2
BOS    2
BRK    2
CHA    2
CHI    2
CLE    2
DAL    4
DEN    4
DET    2
GSW    4
HOU    3
IND    2
LAC    3
LAL    4
MEM    3
MIA    2
MIL    1
MIN    4
NOH    4
NYK    2
ORL    2
PHI    2
PHO    4
POR    4
SAC    3
SAS    4
TOR    2
UTA    4
WAS    2
dtype: int64
Now, if you want to figure out how many shots he took against a particular team:
In [43]:
group_by_opp.sum()
Out[43]:
G FG FGA FG% 3P 3PA 3P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS GmSc +/-
Opp
ATL 28 21 40 1.021 5 12 0.750 16 18 1.775 2 23 25 11 3 4 12 4 63 50.8 -6
BOS 76 15 36 0.825 3 11 0.429 19 19 2.000 0 14 14 6 2 1 6 6 52 35.5 8
BRK 50 20 33 1.210 4 7 1.400 15 16 1.750 0 10 10 11 3 1 6 7 59 49.1 1
CHA 77 12 20 1.250 3 7 0.833 10 11 1.857 1 11 12 11 0 5 4 2 37 37.8 60
CHI 61 17 38 0.895 1 6 0.250 8 8 2.000 3 17 20 7 3 5 10 1 43 31.4 19
CLE 54 17 37 0.944 4 9 0.929 20 25 1.640 2 17 19 5 1 1 5 3 58 42.4 16
DAL 183 42 89 1.954 13 24 2.456 45 46 3.833 2 34 36 11 7 6 17 7 142 106.8 26
DEN 206 33 75 1.818 6 25 0.950 44 49 3.484 2 28 30 20 10 5 18 10 116 90.0 19
DET 14 17 38 0.927 2 4 1.000 15 15 2.000 1 21 22 5 1 4 5 2 51 39.7 15
GSW 182 38 67 2.269 9 14 2.601 29 33 3.424 2 33 35 31 8 6 17 5 114 106.9 38
HOU 99 25 53 1.343 6 15 0.944 23 26 2.646 2 19 21 17 5 4 10 7 79 64.7 26
IND 97 22 45 0.994 3 8 0.867 14 17 1.657 2 15 17 5 0 0 6 1 61 40.1 22
LAC 113 29 63 1.400 10 20 1.541 34 39 2.705 3 19 22 16 8 0 14 4 102 78.9 18
LAL 161 45 92 1.960 9 23 1.597 40 46 3.429 3 25 28 18 4 4 8 12 139 106.4 56
MEM 124 33 65 1.589 2 7 1.250 25 28 2.678 3 22 25 14 4 4 11 5 93 71.8 4
MIA 80 23 45 1.024 3 9 0.700 24 26 1.818 1 14 15 7 1 3 9 10 73 49.8 -16
MIL 74 10 19 0.526 1 2 0.500 9 10 0.900 1 7 8 5 0 1 2 0 30 25.3 19
MIN 189 44 74 2.372 5 12 2.083 29 29 4.000 1 28 29 19 7 7 14 12 122 103.3 14
NOH 107 32 53 2.463 9 16 2.217 27 30 3.690 2 35 37 21 5 2 6 1 100 101.1 73
NYK 138 16 37 0.862 2 8 0.333 27 30 1.800 0 11 11 13 0 5 11 2 61 44.4 -2
ORL 136 15 29 1.033 3 8 0.867 18 23 1.568 2 15 17 5 6 2 11 4 51 38.9 -2
PHI 46 18 37 0.988 3 10 0.583 24 27 1.782 1 13 14 8 5 3 4 3 63 55.2 12
PHO 169 37 76 1.958 6 15 1.200 30 30 4.000 1 20 21 10 4 6 2 4 110 89.9 78
POR 190 36 63 2.295 5 16 1.222 19 24 3.542 4 32 36 21 3 6 16 5 96 80.6 52
SAC 148 27 42 1.922 6 12 1.517 24 26 2.750 1 22 23 16 8 3 9 6 84 81.8 41
SAS 164 32 64 2.000 4 8 2.167 25 27 3.550 3 32 35 20 5 8 19 6 93 75.1 7
TOR 37 10 22 0.909 3 9 0.650 14 15 1.889 1 12 13 10 3 1 5 4 37 33.6 51
UTA 212 32 52 2.441 5 9 1.750 33 37 3.596 0 32 32 17 7 7 17 9 102 89.6 34
WAS 106 13 29 0.874 4 8 1.333 19 20 1.900 0 13 13 14 3 1 6 1 49 44.5 32
29 rows × 21 columns
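Two caveats about the summed table above. Summing a ratio column such as FG% gives a number with no direct meaning (1.021 for ATL), and in newer pandas versions sum() may also try to process the text columns unless numeric_only=True is passed. A minimal sketch, using only the two ATL games from the game log, that sums the counts and then recomputes the percentage from the totals (the column name `FG%_total` is made up for this example):

```python
import pandas as pd

# The two games against ATL, reduced to a few columns.
games = pd.DataFrame({
    'Opp': ['ATL', 'ATL'],
    'MP': ['42:05', '40:52'],   # text column, skipped by numeric_only=True
    'FG': [7, 14],
    'FGA': [17, 23],
    'FG%': [0.412, 0.609],
})

totals = games.groupby('Opp').sum(numeric_only=True)

# Recompute the shooting percentage from the summed makes and attempts.
totals['FG%_total'] = totals['FG'] / totals['FGA']

print(totals.loc['ATL', 'FG%'])        # sum of two ratios, not a percentage
print(totals.loc['ATL', 'FG%_total'])  # 21 / 40 = 0.525
```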
In [44]:
data[data.Opp=='ATL']
Out[44]:
G Date Age Opp MP FG FGA FG% 3P 3PA 3P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PTS GmSc +/-
2 3 2012-11-04 24-036 ATL 42:05 7 17 0.412 1 4 0.25 7 8 0.875 1 11 12 8 3 2 6 3 22 19.3 -8
24 25 2012-12-19 24-081 ATL 40:52 14 23 0.609 4 8 0.50 9 10 0.900 1 12 13 3 0 2 6 1 41 31.5 2
2 rows × 25 columns
In [62]:
# select the two columns first, then sum only those
field_goal_per_team=group_by_opp[['FGA','FG']].sum()
field_goal_per_team
Out[62]:
FGA FG
Opp
ATL 40 21
BOS 36 15
BRK 33 20
CHA 20 12
CHI 38 17
CLE 37 17
DAL 89 42
DEN 75 33
DET 38 17
GSW 67 38
HOU 53 25
IND 45 22
LAC 63 29
LAL 92 45
MEM 65 33
MIA 45 23
MIL 19 10
MIN 74 44
NOH 53 32
NYK 37 16
ORL 29 15
PHI 37 18
PHO 76 37
POR 63 36
SAC 42 27
SAS 64 32
TOR 22 10
UTA 52 32
WAS 29 13
29 rows × 2 columns
In [63]:
stacked=vincent.StackedBar(field_goal_per_team)
In [64]:
stacked.legend(title='Field Goals')
stacked.scales['x'].padding=0.1
display(stacked)
<vincent.charts.Bar at 0xb66af98>
In [66]:
stacked.display()
<IPython.core.display.Javascript at 0xbea6588>
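Note that FGA already includes FG, so stacking the two columns produces bars whose height is FG + FGA. If the goal is a bar whose total height equals the attempts, one option is to stack made and missed shots instead. A minimal sketch with three teams from the table above; the column name `Missed` is an assumption for this example, and the resulting frame could presumably be passed to vincent.StackedBar in the same way as field_goal_per_team:

```python
import pandas as pd

# Three rows of field_goal_per_team, copied from the output above.
field_goal_per_team = pd.DataFrame(
    {'FGA': [40, 36, 33], 'FG': [21, 15, 20]},
    index=pd.Index(['ATL', 'BOS', 'BRK'], name='Opp'),
)

# Made and missed shots add up to the attempts.
made_missed = pd.DataFrame({
    'FG': field_goal_per_team['FG'],
    'Missed': field_goal_per_team['FGA'] - field_goal_per_team['FG'],
})

print(made_missed)
```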