Поиск по блогу

среда, 29 апреля 2015 г.

Шаблон Pandas для очистки второй таблицы после копипаста из текстового AEB

Копипастим в последнюю ячейку одну из трех таблиц из текстового файла, редактируем таблицу вручную. Потом загружаем содержание ячейки в DataFrame, добавляем столбец даты, убираем из столбцов символы '%" , заменяем '(' на '-'сохраняем все в 'eng_car_sales_in_feubrary_2015.csv'.
.isin .str.contains .str.replace('[().]', '')

In [2]:
import pandas as pd
import numpy as np
from cStringIO import StringIO
In [17]:
pd.set_option('display.width', 5000) 
pd.set_option('display.max_columns', 60)
In [ ]:
colnames_2015_t1 = ['Brandt', \
                  '2015m1', '2014m1', '1514m1', \
                  '2015y1', '2014y1', '1514y1']
In [20]:
colnames_2015_t2 = ['Brandt', '2015pm2', '2014pm2', '1514pm2',\
                  '2015m2', '2014m2', '1514m2', \
                  '2015py2', '2014py2', '1514py2',
                  '2015y2', '2014y2', '1514y2']
In [ ]:
colnames_2015_t3 = ['n1_m3', 'Model_m3', 'Brandt_m3', \
                    '2014m3', '2013m3', '1413m3', \
                    'n1_y3', 'Model_y3', 'Brandt_y3', \
                    '2014y3', '2013y3', '1413y3', 'Datem']

Редактировать

In [34]:
tt = t2
cl = colnames_2015_t2
In [35]:
pdt2 = pd.read_csv(StringIO(tt), names=cl, skiprows=1, \
                     na_values=['-','n/a'], skip_blank_lines=True, \
                     sep=' {4,}', thousands=',')
In [25]:
print pdt2[:5]
print pdt2.tail(10)
                   Brandt  2015pm2  2014pm2 1514pm2  2015m2  2014m2  1514m2  2015py2  2014py2 1514py2  2015y2  2014y2  1514y2
0  AVTOVAZ-RENAULT-NISSAN     35.5     31.8     3.7   45509   65608  -30.6%     34.7     31.6     3.1   84534  113671  -25.6%
1                 AVTOVAZ     18.4     15.0     3.4   23639   30896  -23.5%     16.9     15.2     1.7   41131   54543  -24.6%
2                 RENAULT      7.1      8.1   (1.0)    9100   16721  -45.6%      7.3      8.2   (0.9)   17909   29615  -39.5%
3                  NISSAN      7.4      8.3   (0.9)    9447   17158  -44.9%      7.6      7.9   (0.3)   18548   28209  -34.2%
4                INFINITI      0.3      0.4   (0.1)     419     833  -49.7%      0.5      0.4     0.1    1109    1304  -15.0%
     Brandt  2015pm2  2014pm2 1514pm2  2015m2  2014m2  1514m2  2015py2  2014py2 1514py2  2015y2  2014y2  1514y2
64      BAW        0      0.1   (0.1)      24     118  -79.7%        0      0.1   (0.1)      53     231  -77.1%
65    HAIMA        0      0.0     0.0      20      21   -4.8%        0      0.0     0.0      61      41   48.8%
66    FOTON        0      0.0     0.0       3       3    0.0%        0      0.0     0.0       4       5  -20.0%
67  CHANGAN      NaN      NaN     NaN     NaN      50     NaN      NaN      NaN     NaN     NaN      50     NaN
68   LUXGEN      NaN      0.0     NaN     NaN      15     NaN      NaN      NaN     NaN     NaN      27     NaN
69    TAGAZ      NaN      0.0     NaN     NaN      22     NaN      NaN      0.0     NaN     NaN      28     NaN
70      FAW      NaN      0.1     NaN     NaN     234     NaN      NaN      0.1     NaN     NaN     536     NaN
71      ZAZ      NaN      0.1     NaN     NaN     113     NaN      NaN      0.1     NaN     NaN     214     NaN
72      IZH      NaN      0.0     NaN     NaN       5     NaN      NaN      0.0     NaN     NaN      15     NaN
73   BOGDAN      NaN      0.0     NaN     NaN      14     NaN      NaN      0.0     NaN     NaN      70     NaN

А что стало со строками Мерседеса (тут сначала была лестница, потом я ее исправил и скопировал назад в конец поста)

In [ ]:
 MERCEDES-BENZ                    3.6        1.9         1.7        4,627          3,999           15.7%           3.3         2.0        1.3         8,126          7,038    15.5%
 MERCEDES-BENZ cars               3.2        1.7         1.5        4,151          3,520           17.9%           3.0         1.7        1.3         7,326          6,230    17.6% 
                                                                                                           

 MERCEDES-BENZ vans               0.3        0.2         0.1         444            459            -3.3%           0.3         0.2        0.1         749           771      -2.9%
                                                                                                              
 MERCEDES-BENZ smart              0.0        0.0         0.0          32            20            60.0%            0.0         0.0        0.0          51             37       37.8%

Сначала вспомнил .isin

In [36]:
pdt2[pdt2['Brandt'].isin(['MERCEDES-BENZ','cars','vans','smart'])]
Out[36]:
Brandt 2015pm2 2014pm2 1514pm2 2015m2 2014m2 1514m2 2015py2 2014py2 1514py2 2015y2 2014y2 1514y2
17 MERCEDES-BENZ 3.6 1.9 1.7 4627 3999 15.7% 3.3 2 1.3 8126 7038 15.5%
In [ ]:
А потом  проверил .str.contains(r'MERCEDES')
In [37]:
pdt2[pdt1['Brandt'].str.contains(r'MERCEDES')]
Out[37]:
Brandt 2015pm2 2014pm2 1514pm2 2015m2 2014m2 1514m2 2015py2 2014py2 1514py2 2015y2 2014y2 1514y2
17 MERCEDES-BENZ 3.6 1.9 1.7 4627 3999 15.7% 3.3 2.0 1.3 8126 7038 15.5%
18 MERCEDES-BENZ cars 3.2 1.7 1.5 4151 3520 17.9% 3.0 1.7 1.3 7326 6230 17.6%
19 MERCEDES-BENZ vans 0.3 0.2 0.1 444 459 -3.3% 0.3 0.2 0.1 749 771 -2.9%
20 MERCEDES-BENZ smart 0.0 0.0 0.0 32 20 60.0% 0.0 0.0 0.0 51 37 37.8%

Добавим и почистим столбцы

In [38]:
pdt2['Datem'] = '28-2-2015'
In [39]:
pdt2.info(verbose=True, buf=None, max_cols=None, memory_usage=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 72 entries, 0 to 71
Data columns (total 14 columns):
Brandt     72 non-null object
2015pm2    63 non-null float64
2014pm2    65 non-null float64
1514pm2    58 non-null object
2015m2     63 non-null float64
2014m2     66 non-null float64
1514m2     58 non-null object
2015py2    63 non-null float64
2014py2    64 non-null float64
1514py2    58 non-null object
2015y2     63 non-null float64
2014y2     66 non-null float64
1514y2     58 non-null object
Datem      72 non-null object
dtypes: float64(8), object(6)
memory usage: 6.8+ KB
In [40]:
pdt2['1514y2'] = pdt2['1514y2'].str.replace('%','').astype(float)
pdt2['1514m2'] = pdt2['1514m2'].str.replace('%','').astype(float)
In [51]:
pdt2['1514pm2'] = pdt2['1514pm2'].str.replace('(', '-')
pdt2['1514pm2'] = pdt2['1514pm2'].str.replace(')', '').astype(float)
In [52]:
pdt2['1514py2'] = pdt2['1514py2'].str.replace('(', '-')
pdt2['1514py2'] = pdt2['1514py2'].str.replace(')', '').astype(float)
In [ ]:
########### Тест 
pdt2['Brandt'].str.replace('*','').astype(object)
pdt2['Brandt'].tail()
  1. LCV sales are included into total brand’s sales if exist in the product line of the brand (marked with*); reported separately for some brands. LCV<3,5t (with several exceptions reaching highest range of 6t). Some updates concerning LCV data may occur.
In [54]:
pdt2[:5], pdt2.tail(10)
Out[54]:
(                   Brandt  2015pm2  2014pm2  1514pm2  2015m2  2014m2  1514m2  2015py2  2014py2  1514py2  2015y2  2014y2  1514y2      Datem
 0  AVTOVAZ-RENAULT-NISSAN     35.5     31.8      3.7   45509   65608   -30.6     34.7     31.6      3.1   84534  113671   -25.6  28-2-2015
 1                 AVTOVAZ     18.4     15.0      3.4   23639   30896   -23.5     16.9     15.2      1.7   41131   54543   -24.6  28-2-2015
 2                 RENAULT      7.1      8.1     -1.0    9100   16721   -45.6      7.3      8.2     -0.9   17909   29615   -39.5  28-2-2015
 3                  NISSAN      7.4      8.3     -0.9    9447   17158   -44.9      7.6      7.9     -0.3   18548   28209   -34.2  28-2-2015
 4                INFINITI      0.3      0.4     -0.1     419     833   -49.7      0.5      0.4      0.1    1109    1304   -15.0  28-2-2015,
      Brandt  2015pm2  2014pm2  1514pm2  2015m2  2014m2  1514m2  2015py2  2014py2  1514py2  2015y2  2014y2  1514y2      Datem
 62      BAW        0      0.1     -0.1      24     118   -79.7        0      0.1     -0.1      53     231   -77.1  28-2-2015
 63    HAIMA        0      0.0      0.0      20      21    -4.8        0      0.0      0.0      61      41    48.8  28-2-2015
 64    FOTON        0      0.0      0.0       3       3     0.0        0      0.0      0.0       4       5   -20.0  28-2-2015
 65  CHANGAN      NaN      NaN      NaN     NaN      50     NaN      NaN      NaN      NaN     NaN      50     NaN  28-2-2015
 66   LUXGEN      NaN      0.0      NaN     NaN      15     NaN      NaN      NaN      NaN     NaN      27     NaN  28-2-2015
 67    TAGAZ      NaN      0.0      NaN     NaN      22     NaN      NaN      0.0      NaN     NaN      28     NaN  28-2-2015
 68      FAW      NaN      0.1      NaN     NaN     234     NaN      NaN      0.1      NaN     NaN     536     NaN  28-2-2015
 69      ZAZ      NaN      0.1      NaN     NaN     113     NaN      NaN      0.1      NaN     NaN     214     NaN  28-2-2015
 70      IZH      NaN      0.0      NaN     NaN       5     NaN      NaN      0.0      NaN     NaN      15     NaN  28-2-2015
 71   BOGDAN      NaN      0.0      NaN     NaN      14     NaN      NaN      0.0      NaN     NaN      70     NaN  28-2-2015)
In [55]:
pdt2.info(verbose=True, buf=None, max_cols=None, memory_usage=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 72 entries, 0 to 71
Data columns (total 14 columns):
Brandt     72 non-null object
2015pm2    63 non-null float64
2014pm2    65 non-null float64
1514pm2    58 non-null float64
2015m2     63 non-null float64
2014m2     66 non-null float64
1514m2     58 non-null float64
2015py2    63 non-null float64
2014py2    64 non-null float64
1514py2    58 non-null float64
2015y2     63 non-null float64
2014y2     66 non-null float64
1514y2     58 non-null float64
Datem      72 non-null object
dtypes: float64(12), object(2)
memory usage: 7.9+ KB

Редактировать файл и путь для записи

In [58]:
ttname = 'eng_car_sales_in_feubrary_2015.csv'
ttfolder = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2015_all_csv/2/'
In [57]:
# Проверить имена столбцов
colnames = list(pdt1.columns.values)
colnames
Out[57]:
['Brandt',
 '2015pm2',
 '2014pm2',
 '1514pm2',
 '2015m2',
 '2014m2',
 '1514m2',
 '2015py2',
 '2014py2',
 '1514py2',
 '2015y2',
 '2014y2',
 '1514y2',
 'Datem']

Сохраним файл

In [59]:
pdt2.to_csv(ttfolder + ttname, 
        sep=';', 
        #float_format='%.8f',
        columns=colnames,
        index=False)

Приложение (добавить номер, например t3)

In [33]:
t2 =""""
 AVTOVAZ-RENAULT-NISSAN          35.5       31.8         3.7       45,509         65,608         -30.6%           34.7        31.6        3.1        84,534       113,671     -25.6%
 AVTOVAZ                         18.4       15.0         3.4       23,639         30,896         -23.5%           16.9        15.2        1.7        41,131         54,543    -24.6%
 RENAULT                          7.1        8.1       (1.0)        9,100         16,721         -45.6%            7.3         8.2       (0.9)       17,909         29,615    -39.5%
 NISSAN                           7.4        8.3       (0.9)        9,447         17,158         -44.9%            7.6         7.9       (0.3)       18,548         28,209    -34.2%
 INFINITI                         0.3        0.4       (0.1)         419            833          -49.7%            0.5         0.4        0.1         1,109          1,304    -15.0%
 Datsun                           2.3          -          -        2,904             -              -              2.4           -          -        5,837             -         -
 VW Group                        11.1       10.6         0.5       14,188         21,849         -35.1%           11.3        10.5        0.8        27,527         37,836    -27.2%
 VOLKSWAGEN cars                  4.5        5.5       (1.0)        5,804         11,390         -49.0%            4.9         5.6       (0.7)       11,969         19,989    -40.1%
 VOLKSWAGEN vans                  0.5        0.5         0.0         672           1,058         -36.5%            0.5         0.6       (0.1)        1,280          2,007    -36.2%
 AUDI                             1.6        1.4         0.2        2,092          2,905         -28.0%            1.4         1.3        0.1         3,502          4,528    -22.7%
 SEAT                             0.0        0.1       (0.1)           2            112          -98.2%            0.0         0.1       (0.1)          33           184      -82.1%
 ŠKODA                            4.4        3.1         1.3        5,618          6,384         -12.0%            4.4         3.1        1.3        10,743         11,128    -3.5%
 HYUNDAI                         10.3        6.7         3.6       13,233         13,901           -4.8%          10.6         6.9        3.7        25,940         24,945    4.0%
 KIA                              9.8        6.4         3.4       12,563         13,303          -5.6%            9.8         6.8        3.0        23,909         24,502     -2.4%
 TOYOTA Group                     6.2        6.1         0.1       7,934          12,606         -37.1%            6.5         6.0        0.5        15,773         21,632    -27.1%
 TOYOTA                           5.6        5.8       (0.2)        7,143         11,892         -39.9%            5.9         5.6        0.3        14,292         20,211    -29.3%
 LEXUS                            0.6        0.3         0.3         791            714            10.8%           0.6         0.4        0.2         1,481          1,421    4.2%
 MERCEDES-BENZ                    3.6        1.9         1.7        4,627          3,999           15.7%           3.3         2.0        1.3         8,126          7,038    15.5%
 MERCEDES-BENZ cars               3.2        1.7         1.5        4,151          3,520           17.9%           3.0         1.7        1.3         7,326          6,230    17.6% 
                                                                                                           

 MERCEDES-BENZ vans               0.3        0.2         0.1         444            459            -3.3%           0.3         0.2        0.1         749           771      -2.9%
                                                                                                              
 MERCEDES-BENZ smart              0.0        0.0         0.0          32            20            60.0%            0.0         0.0        0.0          51             37       37.8%

 GM Group                         3.3        9.3       (6.0)        4,201         19,155         -78.1%            3.4         9.2       (5.8)        8,362         32,936    -74.6%
 OPEL                             0.7        3.2       (2.5)         912           6,610         -86.2%            0.8         3.0       (2.2)        1,985         10,891    -81.8%
 CHEVROLET                        2.5        6.0       (3.5)        3,255         12,448         -73.9%            2.6         6.1       (3.5)        6,305         21,868    -71.2%
 CADILLAC                         0.0        0.0         0.0          34             97          -64.9%            0.0         0.0        0.0          72             177     -59.3%
 BMW Group                        2.9        1.7         1.2        3,666          3,504          4.6%             2.6         1.7        0.9         6,302          6,126     2.9%
 BMW                              2.8        1.6         1.2        3,550          3,356           5.8%            2.5         1.6        0.9         6,095          5,896    3.4%
 MINI                             0.1        0.1         0.0         116            148          -21.6%            0.1         0.1        0.0          207           230      -10.0%
 MITSUBISHI                       2.6        3.3       (0.7)        3,323          6,817         -51.3%            2.7         3.3       (0.6)        6,543         11,824    -44.7%
 GAZ LCV                          2.3        2.2         0.1        2,970          4,527         -34.4%            2.0         2.2       (0.2)        4,798          7,787    -38.4%
 MAZDA                            1.9        1.9         0.0        2,381          3,899         -38.9%            2.0         1.9        0.1         4,765          6,747    -29.4%
 UAZ                              1.8        1.4         0.4        2,260          2,903         -22.1%            1.9         1.4        0.5         4,513          5,104    -11.6%
 DAEWOO                           1.3        2.2       (0.9)        1,639          4,536         -63.9%            1.5         2.3       (0.8)        3,654          8,254    -55.7%
 FORD                             1.1        3.1       (2.0)        1,399          6,306         -77.8%            1.3         2.9       (1.6)        3,212         10,556    -69.6%
 JAGUAR LAND ROVER                1.0        1.0         0.0        1,258          2,024         -37.8%            0.9         0.8        0.1         2,092         3,040     -31.2%
 LAND ROVER                       0.9        0.9         0.0        1,183          1,867         -36.6%            0.8         0.8        0.0         1,958          2,789    -29.8%
  JAGUAR                           0.1        0.1         0.0          75            157          -52.2%            0.1         0.1        0.0          134            251     -46.6%
 SUBARU                           0.7        0.7         0.0         836           1,374         -39.2%            0.7         0.6        0.1         1,752          2,312    -24.2%
 PSA PEUGEOT                      0.6        2.2       (1.6)         757           4,546         -83.3%            0.7         2.3       (1.6)        1,655          8,109    -79.6%
 CITROЁN
 PEUGEOT                          0.3        1.1       (0.8)         377           2,354         -84.0%            0.3         1.2       (0.9)         795           4,197    -81.1%
 CITROЁN                          0.3        1.1       (0.8)         380           2,192         -82.7%            0.4         1.1       (0.7)         860           3,912    -78.0%
 VOLVO                            0.6        0.6         0.0         757           1,143         -33.8%            0.4         0.4        0.0         1,027          1,545    -33.5%
 LIFAN                            0.6        0.7       (0.1)         721           1,548         -53.4%            0.6         0.7       (0.1)        1,484          2,666    -44.3%
 GEELY                            0.5        0.8       (0.3)         681           1,687         -59.6%            0.5         0.8       (0.3)        1,131          2,845    -60.2%
 CHRYSLER Group                   0.5        0.6       (0.1)         589           1,208         -51.2%            0.4         0.6       (0.2)        1,097          2,044    -46.3%
 FIAT                             0.3        0.3         0.0         345            581          -40.6%            0.3         0.3        0.0          641            988     -35.1%
 CHRYSLER                         0.0        0.0         0.0           8             14          -42.9%            0.0         0.0        0.0           12            21      -42.9%
 JEEP                             0.2        0.3       (0.1)         235            609          -61.4%            0.2         0.3       (0.1)         440           1,029    -57.2%
 DODGE                                -      0.0          -            -              4               -             -          0.0          -            -             6        -
 ALFA ROMEO                       0.0          -          -            1              -               -            0.0           -          -            4             -        -
 SSANGYONG                        0.4        1.0       (0.6)         561           2,007         -72.0%            0.5         0.9       (0.4)        1,294          3,337    -61.2%
 SUZUKI                           0.3        0.8       (0.5)         416           1,707         -75.6%            0.4         0.9       (0.5)         871           3,189    -72.7%
 CHERY                            0.3        0.8       (0.5)         402           1,709         -76.5%            0.3         0.8       (0.5)         770           2,846    -72.9%
 GREAT WALL                       0.3        0.7       (0.4)         402           1,402         -71.3%            0.3         0.6       (0.3)         807           2,334    -65.4%
 PORSCHE                          0.3        0.1         0.2         381            174          119.0%            0.2         0.1        0.1          575            300      91.7%
 HONDA Group                      0.3        1.2       (0.9)         341           2,484         -86.3%            0.3         1.2       (0.9)         798           4,413    -81.9%
 HONDA                            0.2        1.2       (1.0)         252           2,484         -89.9%            0.3         1.2       (0.9)         636           4,413    -85.6%
 ACURA                            0.1          -          -           89              -               -            0.1           -          -          162             -        -
 BRILLIANCE                       0.1          -          -          144              -               -            0.1           -          -          241             -        -
 ISUZU                            0.1        0.0         0.1          74              5         1380.0%            0.0         0.0        0.0          103            33      212.1%
 JAC                              0.0          -          -           38              -               -            0.0           -          -           53             -        -
 BAW                              0.0        0.1       (0.1)          24            118          -79.7%            0.0         0.1       (0.1)          53            231     -77.1%
 HAIMA                            0.0        0.0         0.0          20             21            -4.8%           0.0         0.0        0.0           61            41      48.8%
 FOTON                            0.0        0.0         0.0           3              3            0.0%            0.0         0.0        0.0            4             5      -20.0%
 CHANGAN                              -        -          -            -             50               -             -            -          -            -            50        -
 LUXGEN                               -      0.0          -            -             15               -             -            -          -            -            27        -
 TAGAZ                                -      0.0          -            -             22               -             -          0.0          -            -            28        -
 FAW                                  -      0.1          -            -            234               -             -          0.1          -            -            536       -
 ZAZ                                  -      0.1          -            -            113               -             -          0.1          -            -            214       -
 IZH                                  -      0.0          -            -              5               -             -          0.0          -            -            15        -
 BOGDAN                               -      0.0          -          -             14              -                -          0.0          -         -               70         -

"""

В str.replace можно использовать регулярные выражения. Ниже пример со множеством символов

In [43]:
sss = pdt2['1514pm2']
In [48]:
sss.str.replace('[().]', '')
Out[48]:
0      37
1      34
2      10
3      09
4      01
5     NaN
6      05
7      10
8      00
9      02
10     01
11     13
12     36
13     34
14     01
...
57     10
58    NaN
59    NaN
60     01
61    NaN
62     01
63     00
64     00
65    NaN
66    NaN
67    NaN
68    NaN
69    NaN
70    NaN
71    NaN
Name: 1514pm2, Length: 72, dtype: object
In [49]:
sss 
Out[49]:
0       3.7
1       3.4
2     (1.0)
3     (0.9)
4     (0.1)
5       NaN
6       0.5
7     (1.0)
8       0.0
9       0.2
10    (0.1)
11      1.3
12      3.6
13      3.4
14      0.1
...
57    (1.0)
58      NaN
59      NaN
60      0.1
61      NaN
62    (0.1)
63      0.0
64      0.0
65      NaN
66      NaN
67      NaN
68      NaN
69      NaN
70      NaN
71      NaN
Name: 1514pm2, Length: 72, dtype: object
In [50]:
sss.str.replace('(', '-')
Out[50]:
0       3.7
1       3.4
2     -1.0)
3     -0.9)
4     -0.1)
5       NaN
6       0.5
7     -1.0)
8       0.0
9       0.2
10    -0.1)
11      1.3
12      3.6
13      3.4
14      0.1
...
57    -1.0)
58      NaN
59      NaN
60      0.1
61      NaN
62    -0.1)
63      0.0
64      0.0
65      NaN
66      NaN
67      NaN
68      NaN
69      NaN
70      NaN
71      NaN
Name: 1514pm2, Length: 72, dtype: object


Посты чуть ниже также могут вас заинтересовать

Комментариев нет:

Отправить комментарий