Поиск по блогу

четверг, 16 апреля 2015 г.

Пойди туда не знаю - куда... pandas.dtypes[0].name - страдания "чайника"

Здесь я начал с процесса сравнения файлов в папке (обнаружил, что их два вида), потом я попытался найти методы для того, чтобы рассортировать файлы по (сочетанию) типам столбцов. Обнаружил еще один "подвид" (отрицательные числа в скобках). В процессе поиска я пришел к выводу, что надо фильтровать не по типам, а по регулярным выражениям ... но этот черновик получился довольно длинным, потому экзерсисы с регулярными выражениями в следующим файле. Много времени ущло на поиск команд. Найдены всего лишь: (pdtemp.dtypes[0].name == 'int64'), pdtemp.columns = colnames_3_0-12, ** А начинается пост с констатации методических ошибок...

Когда одновременно приходится осваивать новые методы и решать конкретные задачи, то мой мозг отказывается соображать. Я не могу понять Pandas... ну, это модно "списать" на Pandas, а вот то, что я не могу сообразить, как фильтровать файлы... Это раздражает. Я успокаиваю себя тем, что "переработал", надо отдохнуть...
Я сразу же нашел "правильный" вариант (с регулярными выражениями), но меня заколдобило: "А как фиьтровать по типам, как вообще это работает?" Решил "между делом" разобораться, не разобрался (оказалось, что там Numpy надо копать), но команды нашел..., причем, искать пришлось методом тыка, обычные методы поиска не давали результата.
Но потерял уйму времени.
Утешительные рекомендации самому себе: Если ты не можешь быстро найти подсказки на Stackoverflof, то это решение гениальное, или тупое. Не позволяй себе гениалных решений, пока не освоил матчасть. А типовые решения надо искать, формулируя ВОПРОС, ва не вариант (гениального) ответа.
Проще надо быть!

In [3]:
import os
import pandas as pd
import numpy as np
In [1]:
DIRPATH = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/'
filterfilenames = 'eng_car-sales-in'
# При считывании каждого файла во фрейм нужно будет добавлять столбец дат, 
# для этого мы используем список и словарь:
months = ['january', 'february', 'march', 'april', 'may', 'june',
        'july', 'august','september', 'october','november', 'december']
    
dict = {'january':'31-1', 'february':'28-2', 'march':'31-3',
        'april':'30-4', 'may':'31-5', 'june':'30-6',
        'july':'31-7', 'august':'31-8','september':'30-9', 
        'october':'31-10','november':'30-11', 'december':'31-12'}
In [2]:
COLNAMES =['Model', 'Brandt', \
           '2014y3', '2013y3', 'u1413y3', '1413y3', \
           '2014m3', '2013m3', 'u1413m3', '1413m3', 'Datem']
In [87]:
#import pdb; pdb.set_trace()  #Если раскоментировать, то можно испоьзовать паузы    
pdall =  pd.DataFrame()
for csvfile in os.listdir(DIRPATH):
    if csvfile.startswith(filterfilenames):
        csvfile_path = os.path.join(DIRPATH, csvfile)
        
        pdtemp = pd.read_csv(csvfile_path, error_bad_lines=False, \
                             na_values=['-','n/a'], \
                             quotechar='"', sep=',', thousands=',')
        #pdtemp['Datem'] = addDatem(csvfile)                          #Закомментировать 
        #pdall = pd.concat([pdall, pdtemp], ignore_index=True)        #Закомментировать
        #pd.set_option('display.max_columns'= 9, False)
        pd.set_option('display.width', 1000)
        print (csvfile)
        print (len(pdtemp.columns)) 
        print (pdtemp).head(3)
        print (pdtemp).tail(2)
        pdtemp.info()
        print (' ')
#print(pdall)                      
eng_car-sales-in-april-2014.csv
12
   1   Granta     Lada  45570  52765 (7,195) -14%  13077  15596 (2,519) -16%  Unnamed: 11
0  2  Solaris  Hyundai  35218  35941   (723)  -2%   9774  10553   (779)  -7%          NaN
1  3  New Rio      KIA  29971  28211   1,760   6%   9045   8430     615   7%          NaN
2  4   Duster  Renault  28153  25945   2,208   9%   7606   7600       6   0%          NaN
     1     Granta        Lada  45570  52765 (7,195) -14%  13077  15596 (2,519)  -16%  Unnamed: 11
22  24  Outlander  Mitsubishi   9447   8015   1,432  18%   1526   2447   (921)  -38%          NaN
23  25    X-Trail      Nissan   9177   8767     410   5%   1436   1614   (178)  -11%          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 12 columns):
1              24 non-null int64
Granta         24 non-null object
Lada           24 non-null object
45570          24 non-null int64
52765          24 non-null int64
(7,195)        23 non-null object
-14%           23 non-null object
13077          24 non-null int64
15596          24 non-null int64
(2,519)        23 non-null object
-16%           23 non-null object
Unnamed: 11    0 non-null float64
dtypes: float64(1), int64(5), object(6)
memory usage: 1.9+ KB
 
eng_car-sales-in-august-2014.csv
13
   1   Granta     Lada  11,860  14,951  (3,091)  1.1 Granta.1   Lada.1  95,154  113,738 (18,584)  Unnamed: 12
0  2  Solaris  Hyundai    9086    8590      496    2  Solaris  Hyundai   74438    75081    (643)          NaN
1  3  New Rio      KIA    5999    7159  (1,160)    3  New Rio      KIA   60032    59046      986          NaN
2  4    Logan  Renault    4390    4012      378    4   Duster  Renault   51137    53611  (2,474)          NaN
     1   Granta   Lada  11,860  14,951 (3,091)  1.1 Granta.1 Lada.1  95,154  113,738 (18,584)  Unnamed: 12
22  24  Patriot    UAZ    1963    1427     536   24    Astra   Opel   16539    26346  (9,807)          NaN
23  25     CX-5  Mazda    1790    1789       1   25     CX-5  Mazda   15047    11575    3,472          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 13 columns):
1              24 non-null int64
Granta         24 non-null object
Lada           24 non-null object
11,860         24 non-null int64
14,951         24 non-null int64
(3,091)        24 non-null object
1.1            24 non-null int64
Granta.1       24 non-null object
Lada.1         24 non-null object
95,154         24 non-null int64
113,738        24 non-null int64
(18,584)       24 non-null object
Unnamed: 12    0 non-null float64
dtypes: float64(1), int64(6), object(6)
memory usage: 2.1+ KB
 
eng_car-sales-in-december-2014.csv
13
   1   Granta     Lada  13,405  12,923      482  1.1 Granta.1   Lada.1  152,810  166,951 (14,141)  Unnamed: 12
0  2  Solaris  Hyundai    9502    8824      678    2  Solaris  Hyundai   114644   113991      653          NaN
1  3  New Rio      KIA    9298    6917    2,381    3  New Rio      KIA    93648    89788    3,860          NaN
2  4   Duster  Renault    6562    7976  (1,414)    4   Duster  Renault    76138    83702  (7,564)          NaN
     1 Granta   Lada  13,405  12,923    482  1.1 Granta.1  Lada.1  152,810  166,951 (14,141)  Unnamed: 12
22  24   CX-5  Mazda    3281    2196  1,085   24  Corolla  Toyota    27704    30644  (2,940)          NaN
23  25  Mokka   Opel    3121    1260  1,861   25     CX-5   Mazda    24953    19725    5,228          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 13 columns):
1              24 non-null int64
Granta         24 non-null object
Lada           24 non-null object
13,405         24 non-null int64
12,923         24 non-null int64
482            24 non-null object
1.1            24 non-null int64
Granta.1       24 non-null object
Lada.1         24 non-null object
152,810        24 non-null int64
166,951        24 non-null int64
(14,141)       24 non-null object
Unnamed: 12    0 non-null float64
dtypes: float64(1), int64(6), object(6)
memory usage: 2.1+ KB
 
eng_car-sales-in-july-2014.csv
13
   1   Granta     Lada  11,819  14,542  (2,723)  1.1 Granta.1   Lada.1  83,294  98,787 (15,493)  Unnamed: 12
0  2  Solaris  Hyundai    9778    9482      296    2  Solaris  Hyundai   65352   66491  (1,139)          NaN
1  3  New Rio      KIA    6853    7651    (798)    3  New Rio      KIA   54033   51887    2,146          NaN
2  4   Duster  Renault    5694    7348  (1,654)    4   Duster  Renault   46761   48058  (1,297)          NaN
     1   Granta    Lada  11,819  14,542  (2,723)  1.1   Granta.1      Lada.1  83,294  98,787 (15,493)  Unnamed: 12
22  24  Qashqai  Nissan    2017    3242  (1,225)   24      Astra        Opel   15019   22532  (7,513)          NaN
23  25    Mokka    Opel    1938     490    1,448   25  Outlander  Mitsubishi   13545   13079      466          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 13 columns):
1              24 non-null int64
Granta         24 non-null object
Lada           24 non-null object
11,819         24 non-null int64
14,542         24 non-null int64
(2,723)        23 non-null object
1.1            24 non-null int64
Granta.1       24 non-null object
Lada.1         24 non-null object
83,294         24 non-null int64
98,787         24 non-null int64
(15,493)       24 non-null object
Unnamed: 12    0 non-null float64
dtypes: float64(1), int64(6), object(6)
memory usage: 2.1+ KB
 
eng_car-sales-in-june-2014.csv
12
   1     Lada   Granta  12,958  17,009 (4,051) -24%  71,475  84,245 (12,770) -15%  Unnamed: 11
0  2  Hyundai  Solaris   10565   10575    (10)   0%   55574   57009  (1,435)  -3%          NaN
1  3      KIA  New Rio    8696    8047     649   8%   47180   44236    2,944   7%          NaN
2  4     Lada   Largus    6981    5034   1,947  39%   34920   24702   10,218  41%          NaN
     1   Lada Granta  12,958  17,009 (4,051) -24%  71,475  84,245 (12,770) -15%  Unnamed: 11
22  24  Mazda   CX-5    2063    1656     407  25%   14450    8164    6,286  77%          NaN
23  25     VW  Jetta    2038    1792     246  14%   10932   10636      296   3%          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 12 columns):
1              24 non-null int64
Lada           24 non-null object
Granta         24 non-null object
12,958         24 non-null int64
17,009         23 non-null float64
(4,051)        23 non-null object
-24%           23 non-null object
71,475         24 non-null int64
84,245         23 non-null float64
(12,770)       23 non-null object
-15%           23 non-null object
Unnamed: 11    0 non-null float64
dtypes: float64(3), int64(3), object(6)
memory usage: 1.9+ KB
 
eng_car-sales-in-may-2014.csv
12
   1     Lada   Granta  12,947  14,471  -1,524 -11%  58,517  67,236  -8,719 -13%  Unnamed: 11
0  2  Hyundai  Solaris    9791   10493    -702  -7%   45009   46434   -1425  -3%          NaN
1  3      KIA  New Rio    8513    7978     535   7%   38484   36189    2295   6%          NaN
2  4     Lada   Largus    7448    5289    2159  41%   27939   19668    8271  42%          NaN
     1        Lada     Granta  12,947  14,471  -1,524  -11%  58,517  67,236  -8,719 -13%  Unnamed: 11
22  24  Mitsubishi  Outlander    1469    1812    -343  -19%   10916    9827    1089  11%          NaN
23  25      Nissan       Juke    1242    2111    -869  -41%   10831    9861     970  10%          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 12 columns):
1              24 non-null int64
Lada           24 non-null object
Granta         24 non-null object
12,947         24 non-null int64
14,471         23 non-null float64
-1,524         23 non-null float64
-11%           23 non-null object
58,517         24 non-null int64
67,236         23 non-null float64
-8,719         23 non-null float64
-13%           23 non-null object
Unnamed: 11    0 non-null float64
dtypes: float64(5), int64(3), object(4)
memory usage: 2.1+ KB
 
eng_car-sales-in-november-2014.csv
13
   1  New Rio      KIA  10,834  7,038    3796  1.1   Granta     Lada  139,405  154,028 (14623)  Unnamed: 12
0  2   Granta     Lada   10520  12190  (1670)    2  Solaris  Hyundai   105142   105167    (25)          NaN
1  3  Solaris  Hyundai   10134   9672     462    3  New Rio      KIA    84350    82871    1479          NaN
2  4    Logan  Renault    7152   4036    3116    4   Duster  Renault    69576    75726  (6150)          NaN
     1 New Rio  KIA  10,834  7,038    3796  1.1   Granta    Lada  139,405  154,028  (14623)  Unnamed: 12
22  24   Jetta   VW    2543   1432    1111   24    Focus    Ford    24104    59552  (35448)          NaN
23  25     New  KIA    2542   3833  (1291)   25  X-Trail  Nissan    22251    24021   (1770)          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 13 columns):
1              24 non-null int64
New Rio        24 non-null object
KIA            24 non-null object
10,834         24 non-null int64
7,038          24 non-null int64
3796           24 non-null object
1.1            24 non-null int64
Granta         24 non-null object
Lada           24 non-null object
139,405        24 non-null int64
154,028        24 non-null int64
(14623)        24 non-null object
Unnamed: 12    0 non-null float64
dtypes: float64(1), int64(6), object(6)
memory usage: 2.1+ KB
 
eng_car-sales-in-october-2014.csv
13
   1   Granta     Lada  16,807  13,357    3,450  1.1 Granta.1   Lada.1  128,885  141,838 (12,953)  Unnamed: 12
0  2  Solaris  Hyundai   10569    9738      831    2  Solaris  Hyundai    95008    95495    (487)          NaN
1  3  New Rio      KIA    7100    8155  (1,055)    3  New Rio      KIA    73516    75833  (2,317)          NaN
2  4   Largus     Lada    6561    5765      796    4   Duster  Renault    62567    68163  (5,596)          NaN
     1 Granta       Lada  16,807  13,357    3,450  1.1   Granta.1      Lada.1  128,885  141,838  (12,953)  Unnamed: 12
18  24   CX-5      Mazda    2124    1977      147   24  Outlander  Mitsubishi    20857    20461       396          NaN
19  25  Cruze  Chevrolet    2034    3808  (1,774)   25      Astra        Opel    19623    33995  (14,372)          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 13 columns):
1              20 non-null int64
Granta         20 non-null object
Lada           20 non-null object
16,807         20 non-null int64
13,357         20 non-null int64
3,450          20 non-null object
1.1            20 non-null int64
Granta.1       20 non-null object
Lada.1         20 non-null object
128,885        20 non-null int64
141,838        20 non-null int64
(12,953)       20 non-null object
Unnamed: 12    0 non-null float64
dtypes: float64(1), int64(6), object(6)
memory usage: 1.7+ KB
 
eng_car-sales-in-september-2014.csv
13
   1   Granta     Lada  16,924  14,743    2,181  1.1 Granta.1   Lada.1  112,078  128,481 (16,403)  Unnamed: 12
0  2  Solaris  Hyundai   10001   10676    (675)    2  Solaris  Hyundai    84439    85757  (1,318)          NaN
1  3  New Rio      KIA    6384    8632  (2,248)    3  New Rio      KIA    66416    67678  (1,262)          NaN
2  4   Kalina     Lada    5705    5305      400    4   Duster  Renault    56789    60426  (3,637)          NaN
     1   Granta       Lada  16,924  14,743    2,181  1.1   Granta.1      Lada.1  112,078  128,481  (16,403)  Unnamed: 12
22  24  Corolla     Toyota    2169    3555  (1,386)   24      Astra        Opel    18014    30157  (12,143)          NaN
23  25    Cruze  Chevrolet    2088    5070  (2,982)   25  Outlander  Mitsubishi    17574    17451       123          NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 13 columns):
1              24 non-null int64
Granta         24 non-null object
Lada           24 non-null object
16,924         24 non-null int64
14,743         24 non-null int64
2,181          24 non-null object
1.1            24 non-null int64
Granta.1       24 non-null object
Lada.1         24 non-null object
112,078        24 non-null int64
128,481        24 non-null int64
(16,403)       24 non-null object
Unnamed: 12    0 non-null float64
dtypes: float64(1), int64(6), object(6)
memory usage: 2.1+ KB
 

Для мая не сработал параметр thousands=','

In [ ]:
# %load '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/eng_car-sales-in-may-2014.csv'
"1","Lada","Granta","12,947","14,471","-1,524","-11%","58,517","67,236","-8,719","-13%",
"2","Hyundai","Solaris","9,791","10,493","-702","-7%","45,009","46,434","-1,425","-3%",
"3","KIA","New Rio","8,513","7,978","535","7%","38,484","36,189","2,295","6%",
"4","Lada","Largus","7,448","5,289","2,159","41%","27,939","19,668","8,271","42%",
"5","Renault","Duster","6,835","7,231","-396","-5%","34,988","33,176","1,812","5%",
"6","Lada","Kalina","5,730","5,778","-48","-1%","31,594","34,173","-2,579","-8%",
"7","VW","Polo","5,393","6,155","-762","-12%","26,759","27,953","-1,194","-4%",
"8","Renault","Logan","5,130","4,791","339","7%","19,472","21,465","-1,993","-9%",
"9","Nissan","Almera","4,456","1,163","3,293","283%","21,406","1,759","19,647","1117%",
"10","Nissan","Qashqai","4,031","2,297","1,734","75%","11,168","13,721","-2,553","-19%",
"11","Lada","Priora","3,695","5,377","-1,682","-31%","21,868","26,872","-5,004","-19%",
"12","Toyota","RAV 4","3,457","4,503","-1,046","-23%","15,958","15,913","45","0%",
"13","Lada","4x4","3,286","3,540","-254","-7%","16,217","18,139","-1,922","-11%",
"14","Hyundai","ix35","3,172","2,337","835","36%","14,910","11,995","2,915","24%",
"15","Škoda","Octavia A7","3,167","-","-","-","16,338","-","-","-",
"16","Chevrolet","Niva","2,658","3,975","-1,317","-33%","18,192","21,723","-3,531","-16%",
"17","Toyota","Camry","2,650","2,737","-87","-3%","12,988","12,811","177","1%",
"18","Renault","Sandero","2,537","4,016","-1,479","-37%","16,314","18,480","-2,166","-12%",
"19","KIA","Sportage","2,368","2,890","-522","-18%","12,347","13,248","-901","-7%",
"20","Toyota","Corolla","2,316","2,289","27","1%","12,777","10,003","2,774","28%",
"21","Chevrolet","Cruze","2,041","4,202","-2,161","-51%","14,404","18,767","-4,363","-23%",
"22","Ford","Focus","1,904","5,942","-4,038","-68%","12,827","28,220","-15,393","-55%",
"23","Opel","Astra","1,592","3,116","-1,524","-49%","11,493","15,196","-3,703","-24%",
"24","Mitsubishi","Outlander","1,469","1,812","-343","-19%","10,916","9,827","1,089","11%",
"25","Nissan","Juke","1,242","2,111","-869","-41%","10,831","9,861","970","10%",

Этот май - баловник

In [ ]:
#Как такое может быть?
12,947         24 non-null int64
14,471         23 non-null float64
# мы же задали thousands=','
# для одной строчки он сработал
# а для другой - нет?

Вот копипаст всего info для мая:

In [ ]:
Data columns (total 12 columns):
1              24 non-null int64
Lada           24 non-null object
Granta         24 non-null object
12,947         24 non-null int64
14,471         23 non-null float64
-1,524         23 non-null float64
-11%           23 non-null object
58,517         24 non-null int64
67,236         23 non-null float64
-8,719         23 non-null float64
-13%           23 non-null object
Unnamed: 11    0 non-null float64
dtypes: float64(5), int64(3), object(4)

Может быть это сбой? Перезагрузимся в новый DataFrame:

In [89]:
csvfile_path = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/eng_car-sales-in-may-2014.csv'
pd_may = pd.read_csv(csvfile_path, error_bad_lines=False, \
                             na_values=['-','n/a'], \
                             quotechar='"', sep=',', thousands=',')
In [90]:
pd_may.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 12 columns):
1              24 non-null int64
Lada           24 non-null object
Granta         24 non-null object
12,947         24 non-null int64
14,471         23 non-null float64
-1,524         23 non-null float64
-11%           23 non-null object
58,517         24 non-null int64
67,236         23 non-null float64
-8,719         23 non-null float64
-13%           23 non-null object
Unnamed: 11    0 non-null float64
dtypes: float64(5), int64(3), object(4)
memory usage: 2.1+ KB

Не мог понять, в чем дело, на всякий случай распечатал таблицу (ниже). Единственное отличие столбцов формата Int в том, что там нет ячеек Nan (строка 13). Странно эта подстраховка происходит... Почему не object?

In [91]:
pd_may
Out[91]:
1 Lada Granta 12,947 14,471 -1,524 -11% 58,517 67,236 -8,719 -13% Unnamed: 11
0 2 Hyundai Solaris 9791 10493 -702 -7% 45009 46434 -1425 -3% NaN
1 3 KIA New Rio 8513 7978 535 7% 38484 36189 2295 6% NaN
2 4 Lada Largus 7448 5289 2159 41% 27939 19668 8271 42% NaN
3 5 Renault Duster 6835 7231 -396 -5% 34988 33176 1812 5% NaN
4 6 Lada Kalina 5730 5778 -48 -1% 31594 34173 -2579 -8% NaN
5 7 VW Polo 5393 6155 -762 -12% 26759 27953 -1194 -4% NaN
6 8 Renault Logan 5130 4791 339 7% 19472 21465 -1993 -9% NaN
7 9 Nissan Almera 4456 1163 3293 283% 21406 1759 19647 1117% NaN
8 10 Nissan Qashqai 4031 2297 1734 75% 11168 13721 -2553 -19% NaN
9 11 Lada Priora 3695 5377 -1682 -31% 21868 26872 -5004 -19% NaN
10 12 Toyota RAV 4 3457 4503 -1046 -23% 15958 15913 45 0% NaN
11 13 Lada 4x4 3286 3540 -254 -7% 16217 18139 -1922 -11% NaN
12 14 Hyundai ix35 3172 2337 835 36% 14910 11995 2915 24% NaN
13 15 Škoda Octavia A7 3167 NaN NaN NaN 16338 NaN NaN NaN NaN
14 16 Chevrolet Niva 2658 3975 -1317 -33% 18192 21723 -3531 -16% NaN
15 17 Toyota Camry 2650 2737 -87 -3% 12988 12811 177 1% NaN
16 18 Renault Sandero 2537 4016 -1479 -37% 16314 18480 -2166 -12% NaN
17 19 KIA Sportage 2368 2890 -522 -18% 12347 13248 -901 -7% NaN
18 20 Toyota Corolla 2316 2289 27 1% 12777 10003 2774 28% NaN
19 21 Chevrolet Cruze 2041 4202 -2161 -51% 14404 18767 -4363 -23% NaN
20 22 Ford Focus 1904 5942 -4038 -68% 12827 28220 -15393 -55% NaN
21 23 Opel Astra 1592 3116 -1524 -49% 11493 15196 -3703 -24% NaN
22 24 Mitsubishi Outlander 1469 1812 -343 -19% 10916 9827 1089 11% NaN
23 25 Nissan Juke 1242 2111 -869 -41% 10831 9861 970 10% NaN

Проверим июнь

In [92]:
csvfile_path = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/eng_car-sales-in-june-2014.csv'
pd_june = pd.read_csv(csvfile_path, error_bad_lines=False, \
                             na_values=['-','n/a'], \
                             quotechar='"', sep=',', thousands=',')
In [93]:
pd_june
Out[93]:
1 Lada Granta 12,958 17,009 (4,051) -24% 71,475 84,245 (12,770) -15% Unnamed: 11
0 2 Hyundai Solaris 10565 10575 (10) 0% 55574 57009 (1,435) -3% NaN
1 3 KIA New Rio 8696 8047 649 8% 47180 44236 2,944 7% NaN
2 4 Lada Largus 6981 5034 1,947 39% 34920 24702 10,218 41% NaN
3 5 Renault Duster 6079 7534 (1,455) -19% 41067 40710 357 1% NaN
4 6 VW Polo 5378 6767 (1,389) -21% 32137 34720 (2,583) -7% NaN
5 7 Renault Logan 5233 4426 807 18% 24705 25891 (1,186) -5% NaN
6 8 Nissan Qashqai 4188 2425 1,763 73% 15356 16146 (790) -5% NaN
7 9 Lada Kalina 3872 2908 964 33% 35466 37081 (1,615) -4% NaN
8 10 Nissan Almera 3733 1177 2,556 217% 25139 2936 22,203 756% NaN
9 11 Toyota Camry 3199 3447 (248) -7% 16187 16258 (71) 0% NaN
10 12 Hyundai ix35 3177 3209 (32) -1% 18087 15204 2,883 19% NaN
11 13 Škoda Octavia A7 3078 979 2,099 214% 19416 979 18,437 1883% NaN
12 14 Lada Priora 2985 5009 (2,024) -40% 24853 31881 (7,028) -22% NaN
13 15 Toyota RAV 4 2921 4002 (1,081) -27% 18879 19915 (1,036) -5% NaN
14 16 KIA New Cee'd 2810 2925 (115) -4% 13236 15780 (2,544) -16% NaN
15 17 Renault Sandero 2785 3758 (973) -26% 19099 22238 (3,139) -14% NaN
16 18 Chevrolet Niva 2660 4132 (1,472) -36% 20852 25855 (5,003) -19% NaN
17 19 Lada 4x4 2459 3750 (1,291) -34% 18676 21889 (3,213) -15% NaN
18 20 Skoda Rapid 2341 NaN NaN NaN 4270 NaN NaN NaN NaN
19 21 Toyota Corolla 2214 2628 (414) -16% 14991 12631 2,360 19% NaN
20 22 Ford Focus 2106 5944 (3,838) -65% 14933 34164 (19,231) -56% NaN
21 23 KIA Sportage 2098 3107 (1,009) -32% 14445 16355 (1,910) -12% NaN
22 24 Mazda CX-5 2063 1656 407 25% 14450 8164 6,286 77% NaN
23 25 VW Jetta 2038 1792 246 14% 10932 10636 296 3% NaN

Теперь мы делаем шаг назад и работаем с последним объектом pdall

In [94]:
pd_june.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 12 columns):
1              24 non-null int64
Lada           24 non-null object
Granta         24 non-null object
12,958         24 non-null int64
17,009         23 non-null float64
(4,051)        23 non-null object
-24%           23 non-null object
71,475         24 non-null int64
84,245         23 non-null float64
(12,770)       23 non-null object
-15%           23 non-null object
Unnamed: 11    0 non-null float64
dtypes: float64(3), int64(3), object(6)
memory usage: 1.9+ KB

Так или иначе, запятые - разделители разрядов во всех случаях обрабатываются (убираются) верно. Так что, нам не так уж и важно, какой числовой формат...

In [6]:
pdtemp.columns
Out[6]:
Index([u'1', u'Granta', u'Lada', u'16,924', u'14,743', u'2,181', u'1.1', u'Granta.1', u'Lada.1', u'112,078', u'128,481', u'(16,403)', u'Unnamed: 12'], dtype='object')
In [12]:
pdtemp.columns[0], pdtemp.columns[-1], pdtemp.columns[6]
Out[12]:
('1', 'Unnamed: 12', '1.1')

Попробуем идентифицировать шаблон по сочетанию столбцов

Данные в нулевом и шестом столбцах должны иметь формат Int, а в 1,2 и 7,8 - это строки. Но санчала проверим, что мы получаем:

In [13]:
pdtemp.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24 entries, 0 to 23
Data columns (total 13 columns):
1              24 non-null int64
Granta         24 non-null object
Lada           24 non-null object
16,924         24 non-null int64
14,743         24 non-null int64
2,181          24 non-null object
1.1            24 non-null int64
Granta.1       24 non-null object
Lada.1         24 non-null object
112,078        24 non-null int64
128,481        24 non-null int64
(16,403)       24 non-null object
Unnamed: 12    0 non-null float64
dtypes: float64(1), int64(6), object(6)
memory usage: 2.1+ KB
In [28]:
pdtemp.dtypes[0], pdtemp.dtypes[6]
Out[28]:
(dtype('int64'), dtype('int64'))

Все не так просто, мы получили объекты dtype (numpy) в приложении внизу есть перечень своййств. Я попробовал 'name', подошло:

In [29]:
numpydtype = pdtemp.dtypes[0]
In [30]:
numpydtype.name
Out[30]:
'int64'
In [32]:
numpydtype.type
Out[32]:
numpy.int64
In [41]:
 pdtemp.dtypes
Out[41]:
1                int64
Granta          object
Lada            object
16,924           int64
14,743           int64
2,181           object
1.1              int64
Granta.1        object
Lada.1          object
112,078          int64
128,481          int64
(16,403)        object
Unnamed: 12    float64
dtype: object

Начинаем с апреля

Первые три строчки (номер, Модель, Марка), шестая со знаком процента, а 7 и 8-я - цифры.

In [ ]:
Data columns (total 12 columns):
1              24 non-null int64
Granta         24 non-null object
Lada           24 non-null object
45570          24 non-null int64
52765          24 non-null int64
(7,195)        23 non-null object
-14%           23 non-null object
13077          24 non-null int64
15596          24 non-null int64
(2,519)        23 non-null object
-16%           23 non-null object
Unnamed: 11    0 non-null float64
dtypes: float64(1), int64(5), object(6)

Формат отличаеся от мая с июнем. И здесь самое время проверить, есть ли здесь Nan

In [95]:
csvfile_path = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/eng_car-sales-in-april-2014.csv'
pd_april = pd.read_csv(csvfile_path, error_bad_lines=False, \
                             na_values=['-','n/a'], \
                             quotechar='"', sep=',', thousands=',')
In [96]:
pd_april
Out[96]:
1 Granta Lada 45570 52765 (7,195) -14% 13077 15596 (2,519) -16% Unnamed: 11
0 2 Solaris Hyundai 35218 35941 (723) -2% 9774 10553 (779) -7% NaN
1 3 New Rio KIA 29971 28211 1,760 6% 9045 8430 615 7% NaN
2 4 Duster Renault 28153 25945 2,208 9% 7606 7600 6 0% NaN
3 5 Kalina Lada 25864 28395 (2,531) -9% 7731 8857 (1,126) -13% NaN
4 6 Polo VW 21366 21798 (432) -2% 5964 6637 (673) -10% NaN
5 7 Largus Lada 20491 14379 6,112 43% 5889 4209 1,680 40% NaN
6 8 Priora Lada 18173 21495 (3,322) -15% 5212 4962 250 5% NaN
7 9 Almera Nissan 16950 596 16,354 2744% 5798 483 5,315 1100% NaN
8 10 Niva Chevrolet 15534 17748 (2,214) -12% 4983 5724 (741) -13% NaN
9 11 Logan Renault 14342 16674 (2,332) -14% 3796 4890 (1,094) -22% NaN
10 12 Sandero Renault 13777 14464 (687) -5% 3796 4084 (288) -7% NaN
11 13 Octavia A7 Škoda 13171 0 NaN NaN 3753 0 NaN NaN NaN
12 14 4x4 Lada 12931 14599 (1,668) -11% 3784 3808 (24) -1% NaN
13 15 RAV 4 Toyota 12501 11410 1,091 10% 4391 4974 (583) -12% NaN
14 16 Cruze Chevrolet 12363 14565 (2,202) -15% 3014 4284 (1,270) -30% NaN
15 17 ix35 Hyundai 11738 9658 2,080 22% 3119 2030 1,089 54% NaN
16 18 Focus Ford 10923 22278 (11,355) -51% 2549 5546 (2,997) -54% NaN
17 19 Corolla Toyota 10461 7714 2,747 36% 2450 2418 32 1% NaN
18 20 Camry Toyota 10338 10074 264 3% 2987 2822 165 6% NaN
19 21 Sportage KIA 9979 10358 (379) -4% 2526 2857 (331) -12% NaN
20 22 Astra Opel 9901 12080 (2,179) -18% 2049 3221 (1,172) -36% NaN
21 23 Juke Nissan 9589 7750 1,839 24% 2837 1657 1,180 71% NaN
22 24 Outlander Mitsubishi 9447 8015 1,432 18% 1526 2447 (921) -38% NaN
23 25 X-Trail Nissan 9177 8767 410 5% 1436 1614 (178) -11% NaN

Да, действительно, здесь нет "Nan"... Оставляем эту версию рабочей. Это значит, что нам в дальнейшем нужно будет задавать условия для обоих вариантов Int и Float

In [ ]:
colnames_3_0_11 =['n1_y', 'Model', 'Brandt', \
                  '2014y3', '2013y3', 'u1413y3', '1413y3', \
                  '2014m3', '2013m3', 'u1413m3', '1413m3', 'Datem']
In [97]:
true_3_0_11 =   [(pdtemp.dtypes[0].name == 'int64') &
                 (pdtemp.dtypes[1].name == 'object') & 
                 (pdtemp.dtypes[2].name == 'object') & 
                 (pdtemp.dtypes[6].name == 'object') & 
                 (pdtemp.dtypes[7].name == 'int64 | Float64') & 
                 (pdtemp.dtypes[8].name == 'int64 | Float64')]
In [98]:
true_3_0_11
Out[98]:
[False]

Май отличается не количеством столбцов, а

In [ ]:
 

Повторяются три первых столбца (номер, Модель, Марка) два раза

In [85]:
colnames_3_0_12 = ['n1_y', 'Model_y', 'Brandt_y', \
                   '2014y3', '2013y3', '1413y3', \
                   'n1_m', 'Model_m', 'Brandt_m', \
                   '2014m3', '2013m3', '1413m3', 'Datem']
In [86]:
true_3_0_12 =   [(pdtemp.dtypes[0].name == 'int64') &
                 (pdtemp.dtypes[1].name == 'object') & 
                 (pdtemp.dtypes[2].name == 'object') & 
                 (pdtemp.dtypes[6].name == 'int64') & 
                 (pdtemp.dtypes[7].name == 'object') & 
                 (pdtemp.dtypes[8].name == 'object')]
In [ ]:
 
In [75]:
true_3_0_12
Out[75]:
[True]
In [ ]:
if true_3_0_12 = True:
    pdtemp.columns = colnames_3_0-12
In [ ]:
'u1413y3', 'u1413m3', 
In [46]:
 pdtemp.loc[:, pdtemp.dtypes == np.float64].tail()
Out[46]:
Unnamed: 12
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
In [65]:
ppd = pdtemp.loc[:, pdtemp.dtypes == np.int64].tail()
In [66]:
ppd.info
Out[66]:
<bound method DataFrame.info of      1  16,924  14,743  1.1  112,078  128,481
19  21    2494    3222   21    21412    24485
20  22    2478    3807   22    20689    50406
21  23    2219     948   23    20582    24464
22  24    2169    3555   24    18014    30157
23  25    2088    5070   25    17574    17451>
In [ ]:
ppd.
In [56]:
pdtemp.columns[pdtemp.dtypes == np.int64]
Out[56]:
Index([u'1', u'16,924', u'14,743', u'1.1', u'112,078', u'128,481'], dtype='object')
In [61]:
pdtemp.columns.get_indexer
Out[61]:
<bound method Index.get_indexer of Index([u'1', u'Granta', u'Lada', u'16,924', u'14,743', u'2,181', u'1.1', u'Granta.1', u'Lada.1', u'112,078', u'128,481', u'(16,403)', u'Unnamed: 12'], dtype='object')>
In [64]:
pdtemp.columns.order()
Out[64]:
Index([u'(16,403)', u'1', u'1.1', u'112,078', u'128,481', u'14,743', u'16,924', u'2,181', u'Granta', u'Granta.1', u'Lada', u'Lada.1', u'Unnamed: 12'], dtype='object')

Приложения: dtype

In [31]:
dir(numpydtype)
Out[31]:
['__class__',
 '__delattr__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'alignment',
 'base',
 'byteorder',
 'char',
 'descr',
 'fields',
 'flags',
 'hasobject',
 'isalignedstruct',
 'isbuiltin',
 'isnative',
 'itemsize',
 'kind',
 'metadata',
 'name',
 'names',
 'newbyteorder',
 'num',
 'shape',
 'str',
 'subdtype',
 'type']
In [ ]:
 


Посты чуть ниже также могут вас заинтересовать

Комментариев нет:

Отправить комментарий