Поиск по блогу

пятница, 17 апреля 2015 г.

Попытка доисследовать, чем же отличаются файлы в aerbu_2014_all_csv/3aug2/ приостановлена

Здесь я начал было вручную импортировать файлы в объекты DataFrame (Pandas), но пост разросся из-за того, что пришлось пробовать все подряд: последовательность обработки файлов (начнем с апреля-мая), анализ содержания файлов (структура столбцов), опции для метода pd.read_csv, регулярные выражения. Слишком много разных тем для одного поста. Потому решил остановиться, чтобы в следующих постах освоить простые методы для работы со столбцами. Здесь же неплохой вариант подготовки к пофайловой обработке csv-таблиц. Так что вернемся к этому посту позже.

I. Доисследуем, чем же отличаются файлы, начало в Пойди туда не знаю - куда

Просто скопируем несколько ячеек с кодом

In [29]:
import os
import pandas as pd
import numpy as np
from cStringIO import StringIO
In [ ]:
DIRPATH = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/'
filterfilenames = 'eng_car-sales-in'
# При считывании каждого файла во фрейм нужно будет добавлять столбец дат, 
# для этого мы используем список и словарь:
months = ['january', 'february', 'march', 'april', 'may', 'june',
        'july', 'august','september', 'october','november', 'december']
    
dict = {'january':'31-1', 'february':'28-2', 'march':'31-3',
        'april':'30-4', 'may':'31-5', 'june':'30-6',
        'july':'31-7', 'august':'31-8','september':'30-9', 
        'october':'31-10','november':'30-11', 'december':'31-12'}
In [ ]:
COLNAMES =['Model', 'Brandt', \
           '2014y3', '2013y3', 'u1413y3', '1413y3', \
           '2014m3', '2013m3', 'u1413m3', '1413m3', 'Datem']
In [ ]:
#import pdb; pdb.set_trace()  #Если раскоментировать, то можно испоьзовать паузы    
pdall =  pd.DataFrame()
for csvfile in os.listdir(DIRPATH):
    if csvfile.startswith(filterfilenames):
        csvfile_path = os.path.join(DIRPATH, csvfile)
        
        pdtemp = pd.read_csv(csvfile_path, error_bad_lines=False, \
                             na_values=['-','n/a'], \
                             quotechar='"', sep=',', thousands=',')
        #pdtemp['Datem'] = addDatem(csvfile)                          
        #pdall = pd.concat([pdall, pdtemp], ignore_index=True)        
        #pd.set_option('display.max_columns'= 9, False)
        pd.set_option('display.width', 1000)
        print (csvfile)
        print (len(pdtemp.columns)) 
        print (pdtemp).head(3)
        print (pdtemp).tail(2)
        pdtemp.info()
        print (' ')
#print(pdall)                      

II. Не будем импортировать все файлы, а рассмотрим май и апрель

In [ ]:
# %load '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/eng_car-sales-in-may-2014.csv'
"1","Lada","Granta","12,947","14,471","-1,524","-11%","58,517","67,236","-8,719","-13%",
"2","Hyundai","Solaris","9,791","10,493","-702","-7%","45,009","46,434","-1,425","-3%",
"3","KIA","New Rio","8,513","7,978","535","7%","38,484","36,189","2,295","6%",
"4","Lada","Largus","7,448","5,289","2,159","41%","27,939","19,668","8,271","42%",
"5","Renault","Duster","6,835","7,231","-396","-5%","34,988","33,176","1,812","5%",
"6","Lada","Kalina","5,730","5,778","-48","-1%","31,594","34,173","-2,579","-8%",
"7","VW","Polo","5,393","6,155","-762","-12%","26,759","27,953","-1,194","-4%",
"8","Renault","Logan","5,130","4,791","339","7%","19,472","21,465","-1,993","-9%",
"9","Nissan","Almera","4,456","1,163","3,293","283%","21,406","1,759","19,647","1117%",
"10","Nissan","Qashqai","4,031","2,297","1,734","75%","11,168","13,721","-2,553","-19%",
"11","Lada","Priora","3,695","5,377","-1,682","-31%","21,868","26,872","-5,004","-19%",
"12","Toyota","RAV 4","3,457","4,503","-1,046","-23%","15,958","15,913","45","0%",
"13","Lada","4x4","3,286","3,540","-254","-7%","16,217","18,139","-1,922","-11%",
"14","Hyundai","ix35","3,172","2,337","835","36%","14,910","11,995","2,915","24%",
"15","Škoda","Octavia A7","3,167","-","-","-","16,338","-","-","-",
"16","Chevrolet","Niva","2,658","3,975","-1,317","-33%","18,192","21,723","-3,531","-16%",
"17","Toyota","Camry","2,650","2,737","-87","-3%","12,988","12,811","177","1%",
"18","Renault","Sandero","2,537","4,016","-1,479","-37%","16,314","18,480","-2,166","-12%",
"19","KIA","Sportage","2,368","2,890","-522","-18%","12,347","13,248","-901","-7%",
"20","Toyota","Corolla","2,316","2,289","27","1%","12,777","10,003","2,774","28%",
"21","Chevrolet","Cruze","2,041","4,202","-2,161","-51%","14,404","18,767","-4,363","-23%",
"22","Ford","Focus","1,904","5,942","-4,038","-68%","12,827","28,220","-15,393","-55%",
"23","Opel","Astra","1,592","3,116","-1,524","-49%","11,493","15,196","-3,703","-24%",
"24","Mitsubishi","Outlander","1,469","1,812","-343","-19%","10,916","9,827","1,089","11%",
"25","Nissan","Juke","1,242","2,111","-869","-41%","10,831","9,861","970","10%",

header: row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None.
Explicitly pass header=0 to be able to replace existing names.
The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped).
Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True (the default), so header=0 denotes the first line of data rather than the first line of the file.

names: List of column names to use as column names. To replace header existing in file, explicitly pass header=0.

In [25]:
# month first
colnames_3_0_11_april =['n1_y', 'Model', 'Brandt', \
                  '2014y3', '2013y3', 'u1413y3', '1413y3', \
                  '2014m3', '2013m3', 'u1413m3', '1413m3', 'Datem']
In [20]:
# year first
colnames_3_0_11_may =['n1_y', 'Model', 'Brandt', \
                      '2014m3', '2013m3', 'u1413m3', '1413m3', \
                      '2014y3', '2013y3', 'u1413y3', '1413y3', 'Datem']
In [9]:
colnames_3_0_12 = ['n1_y', 'Model_y', 'Brandt_y', \
                   '2014y3', '2013y3', '1413y3', \
                   'n1_m', 'Model_m', 'Brandt_m', \
                   '2014m3', '2013m3', '1413m3', 'Datem']

colnames_3_0_12

In [21]:
csvfile_path = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/eng_car-sales-in-may-2014.csv'
pd_may = pd.read_csv(csvfile_path, error_bad_lines=False, \
                     na_values=['-','n/a'], names=colnames_3_0_11_may, \
                     quotechar='"', sep=',', thousands=',')
In [22]:
pd_may.head()
Out[22]:
n1_y Model Brandt 2014m3 2013m3 u1413m3 1413m3 2014y3 2013y3 u1413y3 1413y3 Datem
0 1 Lada Granta 12947 14471 -1524 -11% 58517 67236 -8719 -13% NaN
1 2 Hyundai Solaris 9791 10493 -702 -7% 45009 46434 -1425 -3% NaN
2 3 KIA New Rio 8513 7978 535 7% 38484 36189 2295 6% NaN
3 4 Lada Largus 7448 5289 2159 41% 27939 19668 8271 42% NaN
4 5 Renault Duster 6835 7231 -396 -5% 34988 33176 1812 5% NaN
In [26]:
csvfile_path = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/3aug2/eng_car-sales-in-april-2014.csv'
pd_april = pd.read_csv(csvfile_path, error_bad_lines=False, \
                     na_values=['-','n/a'], names=colnames_3_0_11_april, \
                     quotechar='"', sep=',', thousands=',')

В мае были изменен порядок столбцов (сначала нарастающий итог, а потом - месяц).

In [27]:
pd_april.tail() 
Out[27]:
n1_y Model Brandt 2014y3 2013y3 u1413y3 1413y3 2014m3 2013m3 u1413m3 1413m3 Datem
20 21 Sportage KIA 9979 10358 (379) -4% 2526 2857 (331) -12% NaN
21 22 Astra Opel 9901 12080 (2,179) -18% 2049 3221 (1,172) -36% NaN
22 23 Juke Nissan 9589 7750 1,839 24% 2837 1657 1,180 71% NaN
23 24 Outlander Mitsubishi 9447 8015 1,432 18% 1526 2447 (921) -38% NaN
24 25 X-Trail Nissan 9177 8767 410 5% 1436 1614 (178) -11% NaN

Очуметь! На кой им понадобилось менять порядок столбцов?
Куда денешься от тайной мысли все автоматизировать. И гонишь ее от себя, но она только притворяется, что ушла, и возникает в голове в виде плоховербализированного бормотания: "Как же определить такую перестановку столбцов? Сравнить числа в двух парах столбцов... Нарастающий итог должен быть больше... А там проблем может быть... с форматами выше крыши... А форматы можно привести..."
Пресекаем эти вредные мысли! Не думать на эту тему!!!

III. Пробуем варанты для регулярных выражений

В посте Методы Pandas для работы со столбцами я уже собрал примеры и ссылки, их и прмем за основу

In [41]:
pd_april[pd_april['2014y3']>25000]
Out[41]:
n1_y Model Brandt 2014y3 2013y3 u1413y3 1413y3 2014m3 2013m3 u1413m3 1413m3 Datem
0 1 Granta Lada 45570 52765 (7,195) -14% 13077 15596 (2,519) -16% NaN
1 2 Solaris Hyundai 35218 35941 (723) -2% 9774 10553 (779) -7% NaN
2 3 New Rio KIA 29971 28211 1,760 6% 9045 8430 615 7% NaN
3 4 Duster Renault 28153 25945 2,208 9% 7606 7600 6 0% NaN
4 5 Kalina Lada 25864 28395 (2,531) -9% 7731 8857 (1,126) -13% NaN

Нет, здесь я расеткся по древу. Начал с анализа файлов, а вот сейчас пытаюсь освоить регулярные выражения. Пробуем освоить "с ходу":

In [ ]:
pd_april[pd_april['2014y3'].str.isin(stk_list)]
In [39]:
pd_april[pd_april['2014y3'].str.contains(r'^9.+')] # ^ means start of string
                                                     # [0-9]{3} means any three digits
                                                     # $ means end of string
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-39-04fa29875cb0> in <module>()
----> 1 pd_april[pd_april['2014y3'].str.contains(r'^9.+')] # ^ means start of string
      2                                                      # [0-9]{3} means any three digits
      3                                                      # $ means end of string

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1772         if isinstance(key, (Series, np.ndarray, Index, list)):
   1773             # either boolean or fancy integer index
-> 1774             return self._getitem_array(key)
   1775         elif isinstance(key, DataFrame):
   1776             return self._getitem_frame(key)

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_array(self, key)
   1816             return self.take(indexer, axis=0, convert=False)
   1817         else:
-> 1818             indexer = self.ix._convert_to_indexer(key, axis=1)
   1819             return self.take(indexer, axis=1, convert=True)
   1820 

/usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1141                     if isinstance(obj, tuple) and is_setter:
   1142                         return {'key': obj}
-> 1143                     raise KeyError('%s not in index' % objarr[mask])
   1144 
   1145                 return _values_from_object(indexer)

KeyError: '[ nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan\n  nan  nan  nan  nan  nan  nan  nan  nan  nan  nan] not in index'

"С ходу" не получилось, займемся всем по отдельности в следующих постах



Посты чуть ниже также могут вас заинтересовать

1 комментарий: