We looped over the files in a folder, cut the month name out of each file name, took the last day of that month from a dictionary and appended the year, then added a column and converted it from string to date (object -> datetime64[ns]). All of this comes with excerpts from, and links to, two manuals and the Pandas documentation, against a background of reflections on the algorithm...

A typical task: a folder holds a couple of dozen fairly short files that my little spider scraped. They need to be looped over, checked, cleaned and glued together. I already know that Pandas raises an error if one of the rows contains an extra comma (the delimiter). Maybe it is simpler to read (and write) line by line, or to use the csv module? Simple is better than complex. I don't know yet which is better in terms of effort... Errors should never pass silently... and in terms of missed errors. But I have to learn Pandas anyway, so I simply code only what is needed right now and hope that everything will optimize itself. In the face of ambiguity, refuse the temptation to guess... after all, Although practicality beats purity.

In [46]:

import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Code fragments for the PDF parser (AEBto3tables)
Copy-paste of the second part on data preparation, "Data Wrangling with Pandas" (part 2)

How to loop over the files in a folder, read their names, open and edit them... and write them back to disk.

These techniques are needed to fix recurring errors in the files. Such errors are inevitable... and the files are "typical", i.e. they all share the same typical errors: for example, rows in most of the files contain "Mercedes-Benz","vans" instead of "Mercedes-Benz vans".
Another typical example: adding to each table a column built from the file name (for example, from the file name eng_car-sales-in-april-2014.csv we form the date column 2014-4).

In [ ]:

import os

In [3]:

path = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/1aug2'
!ls '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/1aug2'

000_eng_car-sales-in-april-2014.csv  eng_car-sales-in-november-2014.csv
eng_car-sales-in-april-2014.csv      eng_car-sales-in-october-2014.csv
eng_car-sales-in-august-2014.csv     eng_car-sales-in-september-2014.csv
eng_car-sales-in-december-2014.csv   sales-in-december_2013_eng_final.csv
eng_car-sales-in-july-2014.csv      sales-in-february_2014_eng_final.csv
eng_car-sales-in-june-2014.csv      sales-in-january_2014_eng_final_1.csv
eng_car-sales-in-may-2014.csv      sales-in-march_2014_eng_final.csv

In [ ]:

data = {}
for csvfile in os.listdir(path):
    csvfile_path = os.path.join(path, csvfile)
    if os.path.isfile(csvfile_path):
        with open(csvfile_path, 'r') as my_file:
            data[csvfile] = my_file.read()  # key by file name (dir_entry was undefined)
# If I want to iterate over the lines of a file, I will need one more nested loop
# Let's hold off on that; maybe Pandas is better?

This file-looping code seems simpler to me than opening a Pandas DataFrame object. In the previous post I used the DataFrame and ran into, or rather discovered, that very "Mercedes-Benz","vans" error. I also discovered that Pandas can read a file line by line, i.e. in essence it is a wrapper around the standard file-reading method. I did not overthink it: I went through all the files by hand in a graphical editor (as befits a dumb lamer) and fixed the rows in every file. I should have written a little script in bash or cmd..., but I don't believe such scripts will be needed again... this was an anomaly of the .pdf source... Most of the tables are neat, carefully planed down in Scrapy... in short, I'm being lazy.
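The little script that was skipped here does not have to be bash; a minimal Python sketch could look like the one below. The function name `fix_split_brand`, its parameters and the two demo files are hypothetical; the only thing taken from the post is the broken fragment "Mercedes-Benz","vans".

```python
import os
import tempfile


def fix_split_brand(folder, bad='"Mercedes-Benz","vans"', good='"Mercedes-Benz vans"'):
    """Replace a recurring broken fragment in every .csv file of `folder`.

    Returns the names of the files that were actually rewritten.
    """
    changed = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if not (os.path.isfile(full_path) and name.endswith('.csv')):
            continue
        with open(full_path, 'r', encoding='utf-8') as fh:
            text = fh.read()
        if bad in text:
            # Rewrite the file only when there is something to fix.
            with open(full_path, 'w', encoding='utf-8') as fh:
                fh.write(text.replace(bad, good))
            changed.append(name)
    return changed


# Demo on a throwaway folder (made-up content, not the real AEB files).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, 'a.csv'), 'w', encoding='utf-8') as fh:
    fh.write('"Lada","1"\n"Mercedes-Benz","vans","2"\n')
with open(os.path.join(demo_dir, 'b.csv'), 'w', encoding='utf-8') as fh:
    fh.write('"Lada","3"\n')

changed_files = fix_split_brand(demo_dir)
print(changed_files)
```

Files without the fragment are left untouched, so the script can be re-run safely after new files arrive.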

Maybe I should pass every file in the folder through a DataFrame object: if one opens for each file, then the table formatting is fine... well, it seems I have made up my mind.
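In the spirit of "Errors should never pass silently", the "does it open as a DataFrame?" check can report the broken file instead of skipping bad rows. This is only a sketch: `check_csv` is a hypothetical helper, and the two in-memory CSV texts are invented for the demo. `pd.errors.ParserError` is the exception pandas raises when a row has more fields than expected.

```python
import io

import pandas as pd


def check_csv(source):
    """Try to load one CSV; return (ok, message) instead of raising."""
    try:
        frame = pd.read_csv(source)
    except pd.errors.ParserError as exc:
        # A row with more fields than the rows before it ends up here.
        return False, str(exc)
    return True, '%d rows, %d columns' % frame.shape


good_csv = io.StringIO('brand,sales\nLada,1\nKIA,2\n')
bad_csv = io.StringIO('brand,sales\nLada,1\nKIA,2,3\n')  # extra comma in the last row

good_result = check_csv(good_csv)
bad_result = check_csv(bad_csv)
print(good_result)
print(bad_result[0])
```

In a loop over `os.listdir(path)` the same helper would take a file path instead of a `StringIO` object.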

Although never is often better than right now... No sooner said than done

In [47]:

# The usual preamble
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15, 5)

In [49]:

import os

In [56]:

for csvfile in os.listdir(path):
    csvfile_path = os.path.join(path, csvfile)
    pd000 = pd.read_csv(csvfile_path, error_bad_lines=False)
    print ('******      ' + csvfile)
    pd000.info(verbose=True, buf=None, max_cols=None, memory_usage=True, null_counts=True)

******      eng_car-sales-in-april-2014.csv
<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, Lada to Alfa Romeo
Data columns (total 7 columns):
2014          60 non-null int64
2013          60 non-null object
YoY %         60 non-null object
2014.1        60 non-null object
2013.1        60 non-null object
YoY %.1       60 non-null object
Unnamed: 6    0 non-null float64
dtypes: float64(1), int64(1), object(5)
memory usage: 2.3+ KB
******      eng_car-sales-in-august-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 65 entries, 0 to 64
Data columns (total 8 columns):
Lada          65 non-null object
26,467        65 non-null object
39,079        65 non-null object
-32%          65 non-null object
247,289       65 non-null object
303,357       65 non-null object
-18%          65 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.8+ KB
******      eng_car-sales-in-december-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 61
Data columns (total 8 columns):
Lada          62 non-null object
35,315        62 non-null object
38,948        62 non-null object
-9%           62 non-null object
387,307       62 non-null object
456,309       62 non-null object
-15%          62 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.7+ KB
******      eng_car-sales-in-july-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59 entries, 0 to 58
Data columns (total 8 columns):
Lada          59 non-null object
28,014        59 non-null object
37,549        59 non-null object
-25%          59 non-null object
220,822       59 non-null object
264,278       59 non-null object
-16%          59 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.5+ KB
******      eng_car-sales-in-june-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 59 entries, 0 to 58
Data columns (total 8 columns):
Lada          59 non-null object
30,114        59 non-null object
37,177        59 non-null object
-19%          59 non-null object
192,808       59 non-null object
226,729       59 non-null object
-15%          59 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.5+ KB
******      eng_car-sales-in-may-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 59
Data columns (total 8 columns):
Lada          60 non-null object
34,061        60 non-null object
38,025        60 non-null object
-10%          60 non-null object
162,694       60 non-null object
189,552       60 non-null object
-14%          60 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.6+ KB
******      eng_car-sales-in-november-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 61
Data columns (total 8 columns):
Lada          62 non-null object
30,402        62 non-null object
36,509        62 non-null object
-17%          62 non-null object
351,992       62 non-null object
417,361       62 non-null object
-16%          62 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.7+ KB
******      eng_car-sales-in-october-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66 entries, 0 to 65
Data columns (total 8 columns):
Lada          66 non-null object
37,788        66 non-null object
37,484        66 non-null object
1%            66 non-null object
321,590       66 non-null object
380,852       66 non-null object
-16%          66 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.8+ KB
******      eng_car-sales-in-september-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 64 entries, 0 to 63
Data columns (total 8 columns):
Lada          64 non-null object
36,513        64 non-null object
40,011        64 non-null object
-9%           64 non-null object
283,802       64 non-null object
343,368       64 non-null object
-17%          64 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.8+ KB
******      sales-in-december_2013_eng_final.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57 entries, 0 to 56
Data columns (total 8 columns):
Lada          57 non-null object
456 309       57 non-null object
537 625       57 non-null object
-15%          57 non-null object
38 948        57 non-null object
43 354        57 non-null object
-10%          57 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.4+ KB
******      sales-in-february_2014_eng_final.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 56 entries, 0 to 55
Data columns (total 8 columns):
Lada          56 non-null object
54 543        56 non-null object
66 947        56 non-null object
-19%          56 non-null object
30 896        56 non-null object
36910         56 non-null object
-16%          56 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.4+ KB
******      sales-in-january_2014_eng_final_1.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 56 entries, 0 to 55
Data columns (total 5 columns):
Association of European Businesses    56 non-null object
Phone.:                               56 non-null int64
+7 (495) 234 27 64                    56 non-null object
E-mail: info@aebrus.ru                56 non-null object
Unnamed: 4                            0 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 2.0+ KB
******      sales-in-march_2014_eng_final.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58 entries, 0 to 57
Data columns (total 8 columns):
Lada          58 non-null object
91603         58 non-null int64
107427        58 non-null object
-15%          58 non-null object
37060         58 non-null int64
40480         58 non-null object
-8%           58 non-null object
Unnamed: 7    0 non-null float64
dtypes: float64(1), int64(2), object(5)
memory usage: 2.9+ KB
******      000_eng_car-sales-in-april-2014.csv
<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, Lada to Mitsubishi
Data columns (total 7 columns):
2014          9 non-null int64
2013          9 non-null int64
YoY %         9 non-null object
2014.1        9 non-null int64
2013.1        9 non-null int64
YoY %.1       9 non-null object
Unnamed: 6    0 non-null float64
dtypes: float64(1), int64(4), object(2)
memory usage: 468.0+ bytes

We run the loop over the files, read each of them into the same object and look at what was read. By the way, 60 rows of a table eat about 2.5 KB of memory..., so 60,000 rows should take about 2.5 MB accordingly... that needs checking...
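A rough way to check the memory estimate is to build two frames of different length and compare `DataFrame.memory_usage(deep=True)`. The frame layout below (one string column, one integer column) and the helper name `frame_bytes` are assumptions made for the demo; the real tables have more columns, so the absolute numbers will differ.

```python
import numpy as np
import pandas as pd


def frame_bytes(n_rows):
    """Total memory of a small demo frame with n_rows rows, in bytes."""
    df = pd.DataFrame({
        'brand': ['Lada'] * n_rows,                  # string column
        'sales': np.arange(n_rows, dtype='int64'),   # numeric column
    })
    # deep=True also counts the Python string objects, not just the pointers.
    return int(df.memory_usage(index=True, deep=True).sum())


small = frame_bytes(60)
large = frame_bytes(60000)
print(small, large, round(large / small))
```

The growth is close to linear in the number of rows, with a small fixed overhead for the index, so scaling 2.5 KB by a factor of 1000 is a reasonable first guess.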

And what about trying to understand why the files have different numbers of rows? Is it like that in the original, or are my parsers acting up? We have no time for such trifles here, we need to finish processing the files..., but let's remember the question for later: why do the files have different numbers of rows?

For a start, we coped quite well with the initial check of the files. Next we need to add a date column to each file... and maybe other columns... we don't think, we just do what is needed right now.
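One thing the info() listings above hint at: in most files the columns are named 'Lada', '26,467', '39,079' and so on, i.e. the first data row was apparently taken as the header, and the numbers stay `object` because of the thousands separator. A hedged sketch of reading such a file follows; the column names and the sample rows are made up, while `header=None`, `names` and `thousands` are regular `read_csv` parameters.

```python
import io

import pandas as pd

# Two invented rows in the same shape as the listings: quoted numbers with commas.
csv_text = '"Lada","26,467","39,079","-32%"\n"KIA","15,005","16,202","-7%"\n'

# Hypothetical column names; the real files have no header row to take them from.
columns = ['brand', 'month_now', 'month_prev', 'yoy']

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,      # do not treat the first data row as column names
    names=columns,
    thousands=',',    # "26,467" is read as the integer 26467
)
print(df.dtypes)
print(df)
```

Files that use a space as the thousands separator ("456 309") would need `thousands=' '` instead, which is one more reason to inspect each group of files separately.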

Parse the file name to get "april" out of it

In [4]:

csvfile0 = "000_eng_car-sales-in-april-2014.csv"

In [7]:

csvfile0.split('-'), csvfile0.split('-')[3]

Out[7]:

(['000_eng_car', 'sales', 'in', 'april', '2014.csv'], 'april')
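`split('-')[3]` works for names like eng_car-sales-in-april-2014.csv, but the folder listing also contains names like sales-in-december_2013_eng_final.csv, where splitting on '-' yields only three parts. A regular expression can cover both patterns. This is a sketch; the pattern and the function name `month_and_year` are my own assumptions about how the names are built.

```python
import re

# "sales-in-<month>" followed by "-" or "_" and a four-digit year.
MONTH_YEAR = re.compile(r'sales-in-([a-z]+)[-_](\d{4})')


def month_and_year(filename):
    """Return (month_name, year) extracted from a report file name."""
    match = MONTH_YEAR.search(filename.lower())
    if match is None:
        raise ValueError('no month/year found in %r' % filename)
    return match.group(1), int(match.group(2))


for name in ('000_eng_car-sales-in-april-2014.csv',
             'sales-in-december_2013_eng_final.csv',
             'sales-in-january_2014_eng_final_1.csv'):
    print(name, month_and_year(name))
```

Taking the year from the name as well avoids hard-coding '-2014', which would be wrong for the December 2013 file.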

In [40]:

# NB: the name dict shadows the built-in type; it is kept because the cells below use it
dict = {'january':'31-1', 'february':'28-2', 'march':'31-3', 'april':'30-4', 'may':'31-5', 'june':'30-6',
        'july':'31-7', 'august':'31-8','september':'30-9', 'october':'31-10','november':'30-11', 'december':'31-12'}

A dictionary is the Pythonic solution. I use it because Python by default puts the first day of the month into a date object when the day is not given, while my data cover the whole month. If I leave 4-2014 and forget about it half a year later, I will never figure out why all my data have "shifted" by a month...
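A hand-written table of month lengths is easy to mistype and ignores leap years, so as an alternative the last day can be computed with the standard `calendar` module. This is a sketch, not the approach used in the post; `MONTH_NAMES`, `MONTHS` and `month_end` are hypothetical names.

```python
import calendar
from datetime import date

# English month names are listed explicitly so the lookup does not depend on the locale.
MONTH_NAMES = ['january', 'february', 'march', 'april', 'may', 'june',
               'july', 'august', 'september', 'october', 'november', 'december']
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def month_end(month_name, year):
    """Return the last calendar day of the given month as a date object."""
    month = MONTHS[month_name.lower()]
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


print(month_end('april', 2014))
print(month_end('february', 2016))
```

An unknown month name raises `KeyError`, which is preferable to silently producing a wrong date.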

In [37]:

dict['april'].isdigit()

Out[37]:

False

In [38]:

dt = dict['april'] + '-2014'
dt

Out[38]:

'30-4-2014'

In [43]:

theyear = '-2014'
dict[csvfile0.split('-')[3]] + theyear

Out[43]:

'30-4-2014'

This is roughly how we will build the strings for the date column. Now I need either to append this string to the end of every row of every file (run an iterator), or to open each file as a Pandas object...

*-...Little girl, what do you want: to have your head torn off, or to go to the dacha?

To the dacha!!!*
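For comparison, the road not taken (appending the date string to every row with an iterator) could be sketched with the standard `csv` module as follows. The helper `append_column` and the sample rows are assumptions for illustration only.

```python
import csv
import io


def append_column(src, dst, value):
    """Copy CSV rows from src to dst, adding `value` as the last field of each row."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator='\n')
    for row in reader:
        writer.writerow(row + [value])


# In-memory stand-ins for an input file and an output file.
source = io.StringIO('Lada,128633,151527\nKIA,60033,60027\n')
target = io.StringIO()

append_column(source, target, '30-4-2014')
result_text = target.getvalue()
print(result_text)
```

With real files, `src` and `dst` would be objects returned by `open(..., newline='')`, as the csv module documentation recommends.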

Add the column with Pandas (...To the dacha!)

Create an object from the file

In [62]:

csvfile_path1 = "/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/1aug2/" +"000_eng_car-sales-in-april-2014.csv"
pd001 = pd.read_csv(csvfile_path1, error_bad_lines=False)

In [64]:

pd001.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, Lada to Mitsubishi
Data columns (total 7 columns):
2014          9 non-null int64
2013          9 non-null int64
YoY %         9 non-null object
2014.1        9 non-null int64
2013.1        9 non-null int64
YoY %.1       9 non-null object
Unnamed: 6    0 non-null float64
dtypes: float64(1), int64(4), object(2)
memory usage: 468.0+ bytes

In [ ]:

And quickly add the column

In [67]:

pd001['dat'] = dt

In [68]:

pd001

Out[68]:

	2014	2013	YoY %	2014.1	2013.1	YoY %.1	Unnamed: 6	dat
Lada	128633	151527	-15%	37030	44100	-16%	NaN	30-4-2014
Renault*	63647	67208	-5%	17395	19178	-9%	NaN	30-4-2014
KIA	60033	60027	0%	17744	18303	-3%	NaN	30-4-2014
Nissan*	57579	44188	30%	11835	8272	43%	NaN	30-4-2014
Hyundai*	57240	56582	1%	15933	15868	0%	NaN	30-4-2014
Toyota*	50160	44610	12%	15103	15278	-1%	NaN	30-4-2014
Chevrolet	48586	52489	-7%	13279	16083	-17%	NaN	30-4-2014
VW	45828	49477	-7%	11497	14203	-19%	NaN	30-4-2014
Mitsubishi	27943	26322	6%	6101	7182	-15%	NaN	30-4-2014

Later we will need to convert this column to a date object, so let's check right away how that is done

Now let's recall how to read a string into a date object: datetime.strptime("21/11/06 16:30", "%d/%m/%y %H:%M")

In [18]:

from datetime import datetime, date, time

In [39]:

datetime.strptime(dt, "%d-%m-%Y")

Out[39]:

datetime.datetime(2014, 4, 30, 0, 0)

If we use Pandas, there is a different date object here: Time Series / Date functionality

...pandas has proven very successful as a tool for working with time series data, especially in the financial data analysis space. With the 0.8 release, we have further improved the time series API in pandas by leaps and bounds. Using the new NumPy datetime64 dtype, we have consolidated a large number of features from other Python libraries like scikits.timeseries as well as created a tremendous amount of new functionality for manipulating time series data.

And we need the column of strings that we added to the object to be read as a column of dates...

Let's start from the end; the detailed solution is here: datetime.strptime(segments.st_time.ix[0], '%m/%d/%y %H:%M')

In [97]:

pd001['dat'].apply(lambda d: datetime.strptime(d, "%d-%m-%Y"))

Out[97]:

Lada         2014-04-30
Renault*     2014-04-30
KIA          2014-04-30
Nissan*      2014-04-30
Hyundai*     2014-04-30
Toyota*      2014-04-30
Chevrolet    2014-04-30
VW           2014-04-30
Mitsubishi   2014-04-30
Name: dat, dtype: datetime64[ns]

We got what we needed: the date format is datetime64[ns]. As a convenience, Pandas has a to_datetime method that will parse and convert an entire Series of formatted strings into datetime objects.

In [99]:

pd.to_datetime(pd001.dat)  # ==   pd.to_datetime(pd001['dat'])

Out[99]:

Lada         2014-04-30
Renault*     2014-04-30
KIA          2014-04-30
Nissan*      2014-04-30
Hyundai*     2014-04-30
Toyota*      2014-04-30
Chevrolet    2014-04-30
VW           2014-04-30
Mitsubishi   2014-04-30
Name: dat, dtype: datetime64[ns]
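Strings like '30-4-2014' are day-first, so it may be safer to spell the format out rather than rely on inference. The sketch below uses invented sample values; `format` and `errors='coerce'` are regular `pd.to_datetime` parameters, and unparseable entries become NaT.

```python
import pandas as pd

# Two well-formed day-month-year strings and one piece of junk.
raw = pd.Series(['30-4-2014', '31-5-2014', 'not a date'])

# An explicit format removes the day/month guesswork; bad values turn into NaT.
parsed = pd.to_datetime(raw, format='%d-%m-%Y', errors='coerce')
print(parsed)
```

To keep the result, assign it back to the frame, for example `pd001['dat'] = pd.to_datetime(pd001['dat'], format='%d-%m-%Y')`; the conversion does not modify the column in place.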

How to detect the date format automatically?

The dateutil package includes a parser that attempts to detect the format of the date strings, and convert them automatically.

In [100]:

from dateutil.parser import parse

In [101]:

parse(pd001.dat[0])

Out[101]:

datetime.datetime(2014, 4, 30, 0, 0)
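Automatic detection has to guess when both numbers could be a month. '30-4-2014' is safe only because 30 cannot be a month; a first-of-the-month date would be read month-first unless `dayfirst=True` is passed. A small sketch with made-up strings:

```python
from datetime import datetime

from dateutil.parser import parse

unambiguous = parse('30-4-2014')               # 30 can only be the day
guessed = parse('1-4-2014')                    # ambiguous: read as month-day-year
day_first = parse('1-4-2014', dayfirst=True)   # explicitly day-month-year

print(unambiguous, guessed, day_first)
```

For columns built from a known template, an explicit format string is therefore the more predictable choice.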

It turns out there is a magic parser that not only parses all sorts of dates but also detects the format by itself. Here is the mention of the parser in the documentation, CSV & Text files.

date_parser: function to use to parse strings into datetime objects. If parse_dates is True (False by default), it defaults to the very robust dateutil.parser. Specifying this implicitly sets parse_dates as True. You can also use functions from community supported date converters from date_converters.py

As far as I understand, this parser is switched on automatically when loading CSV files if you change the parse_dates setting. We will try this thing when the occasion arises (on the pdf files).
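A minimal sketch of what switching on `parse_dates` looks like. The CSV text and the column names are invented, and the dates are written in ISO form so that no day/month guessing is involved.

```python
import io

import pandas as pd

# Invented sample: a date column stored as ISO strings in the file.
csv_text = 'brand,sales,dat\nLada,37030,2014-04-30\nKIA,17744,2014-04-30\n'

# parse_dates takes a list of columns that should be converted while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['dat'])
print(df.dtypes)
```

Without `parse_dates` the `dat` column would be loaded as plain strings and would need a separate `pd.to_datetime` call.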

In [104]:

dir(parser)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-104-9a73fb79304f> in <module>()
----> 1 dir(parser)

NameError: name 'parser' is not defined

Further details and notes from the documentation

It turned out that I have a poor idea of all these objects, so I mustered my patience and, while reading my old post, did not forget to study the Pandas documentation. What follows is, in fact, a record of that process.

In [69]:

pd001.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9 entries, Lada to Mitsubishi
Data columns (total 8 columns):
2014          9 non-null int64
2013          9 non-null int64
YoY %         9 non-null object
2014.1        9 non-null int64
2013.1        9 non-null int64
YoY %.1       9 non-null object
Unnamed: 6    0 non-null float64
dat           9 non-null object
dtypes: float64(1), int64(4), object(3)
memory usage: 504.0+ bytes

How to work with a date column in Pandas is described very well here: copy-paste of the second part on data preparation, "Data Wrangling with Pandas" (part 2)

In [70]:

# Above we created the column in In [67]: pd001['dat'] = dt
pd001.dat.dtype

Out[70]:

dtype('O')

In [72]:

pd001.dtypes

Out[72]:

2014            int64
2013            int64
YoY %          object
2014.1          int64
2013.1          int64
YoY %.1        object
Unnamed: 6    float64
dat            object
dtype: object

What dtypes are

The main types stored in pandas objects are float, int, bool, datetime64[ns], timedelta[ns], and object. In addition these dtypes have item sizes, e.g. int64 and int32. A convenient dtypes attribute for DataFrames returns a Series with the data type of each column.

The category type is not mentioned here, but it does appear in the example over here

A universal way to convert types: astype and object-conversion

In [74]:

pd001.astype('object').dtypes

Out[74]:

2014          object
2013          object
YoY %         object
2014.1        object
2013.1        object
YoY %.1       object
Unnamed: 6    object
dat           object
dtype: object

This converts all columns of the table to the object dtype at once (not to strings; that would be astype(str)). The question can also be settled for each column separately..., there is a special method for that. We will apply it not to the whole DataFrame but to a single column, so as not to clobber the string columns

...convert_objects is a method to try to force conversion of types from the object dtype to other types. To force conversion of specific types that are number like, e.g. could be a string that represents a number, pass convert_numeric=True. This will force strings and numbers alike to be numbers if possible, otherwise they will be set to np.nan.

In [76]:

pd001['2014'].convert_objects(convert_numeric=True).dtypes

Out[76]:

dtype('int64')

In [79]:

pd001['YoY %'].convert_objects(convert_numeric=True).dtypes

Out[79]:

dtype('O')

I did not understand what dtype('O') means (it is NumPy's notation for the object dtype) and, out of fright, printed the table

In [80]:

pd001.head()

Out[80]:

	2014	2013	YoY %	2014.1	2013.1	YoY %.1	Unnamed: 6	dat
Lada	128633	151527	-15%	37030	44100	-16%	NaN	30-4-2014
Renault*	63647	67208	-5%	17395	19178	-9%	NaN	30-4-2014
KIA	60033	60027	0%	17744	18303	-3%	NaN	30-4-2014
Nissan*	57579	44188	30%	11835	8272	43%	NaN	30-4-2014
Hyundai*	57240	56582	1%	15933	15868	0%	NaN	30-4-2014
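For the record, `dtype('O')` is simply NumPy's short code for the generic Python-object dtype, which pandas has traditionally used for columns of strings such as 'YoY %'. A quick way to convince oneself, with invented sample values:

```python
import numpy as np
import pandas as pd

# A column of percent strings, stored with the generic object dtype.
yoy = pd.Series(['-15%', '-5%', '0%'], dtype=object)

same_dtype = (yoy.dtype == np.dtype('O'))          # 'O' is the code for object
kind = yoy.dtype.kind                              # one-letter kind code
element_types = {type(value).__name__ for value in yoy}

print(same_dtype, kind, element_types)
```

So the column holds ordinary Python `str` objects, and converting it to numbers needs an explicit cleaning step for the '%' sign.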

Strictly speaking, the convert_objects method is meant for converting the whole table at once, while for exercises with individual columns it is better to use astype... That is what the documentation says (link above), and I still have not understood how these methods differ... Here is the most acceptable example of converting a column

In [81]:

pd001['2014'].astype('int32').dtypes

Out[81]:

dtype('int32')
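The 'YoY %' column stayed `object` because of the '%' sign. One possible way to turn such strings into numbers is to strip the sign and use `pd.to_numeric`, which exists in current pandas (where, to my knowledge, `convert_objects` is no longer available). The sample values and the 'n/a' placeholder are assumptions.

```python
import pandas as pd

# Percent strings as they appear in the tables, plus one invented placeholder.
yoy = pd.Series(['-15%', '-5%', '0%', '30%', 'n/a'], dtype=object)

# Strip the trailing '%', convert to numbers (bad values become NaN), scale to a fraction.
yoy_fraction = pd.to_numeric(yoy.str.rstrip('%'), errors='coerce') / 100
print(yoy_fraction)
```

Whether to store -15 or -0.15 is a matter of taste; the division by 100 can be dropped if percent points are preferred.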

And here is how to convert everything in a row to the datetime64[ns] format: object-conversion

To force conversion to datetime64[ns], pass convert_dates='coerce'. This will convert any datetime-like object to dates, forcing other values to NaT. This might be useful if you are reading in data which is mostly dates, but occasionally has non-dates intermixed and you want to represent as missing.

In [87]:

s = pd.Series([datetime(2001,1,1,0,0), 'foo', 1.0, 1, pd.Timestamp('20010104'),'20010105'], dtype='O')

In [88]:

s

Out[88]:

0    2001-01-01 00:00:00
1                    foo
2                      1
3                      1
4    2001-01-04 00:00:00
5               20010105
dtype: object

In [89]:

s.dtypes

Out[89]:

dtype('O')

In [90]:

s.convert_objects(convert_dates='coerce')

Out[90]:

0   2001-01-01
1          NaT
2          NaT
3          NaT
4   2001-01-04
5   2001-01-05
dtype: datetime64[ns]

In [91]:

s.dtypes

Out[91]:

dtype('O')

Obviously, the technique Selecting columns based on dtype is also worth remembering

In [94]:

pd001.dtypes

Out[94]:

2014            int64
2013            int64
YoY %          object
2014.1          int64
2013.1          int64
YoY %.1        object
Unnamed: 6    float64
dat            object
dtype: object

In [96]:

pd001.select_dtypes(include=['int64'])

Out[96]:

	2014	2013	2014.1	2013.1
Lada	128633	151527	37030	44100
Renault*	63647	67208	17395	19178
KIA	60033	60027	17744	18303
Nissan*	57579	44188	11835	8272
Hyundai*	57240	56582	15933	15868
Toyota*	50160	44610	15103	15278
Chevrolet	48586	52489	13279	16083
VW	45828	49477	11497	14203
Mitsubishi	27943	26322	6101	7182

In [71]:

help(pd001.dat.dtype)

Help on dtype object:

class dtype(__builtin__.object)
 |  dtype(obj, align=False, copy=False)
 |  
 |  Create a data type object.
 |  
 |  A numpy array is homogeneous, and contains elements described by a
 |  dtype object. A dtype object can be constructed from different
 |  combinations of fundamental numeric types.
 |  
 |  Parameters
 |  ----------
 |  obj
 |      Object to be converted to a data type object.
 |  align : bool, optional
 |      Add padding to the fields to match what a C compiler would output
 |      for a similar C-struct. Can be ``True`` only if `obj` is a dictionary
 |      or a comma-separated string. If a struct dtype is being created,
 |      this also sets a sticky alignment flag ``isalignedstruct``.
 |  copy : bool, optional
 |      Make a new copy of the data-type object. If ``False``, the result
 |      may just be a reference to a built-in data-type object.
 |  
 |  See also
 |  --------
 |  result_type
 |  
 |  Examples
 |  --------
 |  Using array-scalar type:
 |  
 |  >>> np.dtype(np.int16)
 |  dtype('int16')
 |  
 |  Record, one field name 'f1', containing int16:
 |  
 |  >>> np.dtype([('f1', np.int16)])
 |  dtype([('f1', '<i2')])
 |  
 |  Record, one field named 'f1', in itself containing a record with one field:
 |  
 |  >>> np.dtype([('f1', [('f1', np.int16)])])
 |  dtype([('f1', [('f1', '<i2')])])
 |  
 |  Record, two fields: the first field contains an unsigned int, the
 |  second an int32:
 |  
 |  >>> np.dtype([('f1', np.uint), ('f2', np.int32)])
 |  dtype([('f1', '<u4'), ('f2', '<i4')])
 |  
 |  Using array-protocol type strings:
 |  
 |  >>> np.dtype([('a','f8'),('b','S10')])
 |  dtype([('a', '<f8'), ('b', '|S10')])
 |  
 |  Using comma-separated field formats.  The shape is (2,3):
 |  
 |  >>> np.dtype("i4, (2,3)f8")
 |  dtype([('f0', '<i4'), ('f1', '<f8', (2, 3))])
 |  
 |  Using tuples.  ``int`` is a fixed type, 3 the field's shape.  ``void``
 |  is a flexible type, here of size 10:
 |  
 |  >>> np.dtype([('hello',(np.int,3)),('world',np.void,10)])
 |  dtype([('hello', '<i4', 3), ('world', '|V10')])
 |  
 |  Subdivide ``int16`` into 2 ``int8``'s, called x and y.  0 and 1 are
 |  the offsets in bytes:
 |  
 |  >>> np.dtype((np.int16, {'x':(np.int8,0), 'y':(np.int8,1)}))
 |  dtype(('<i2', [('x', '|i1'), ('y', '|i1')]))
 |  
 |  Using dictionaries.  Two fields named 'gender' and 'age':
 |  
 |  >>> np.dtype({'names':['gender','age'], 'formats':['S1',np.uint8]})
 |  dtype([('gender', '|S1'), ('age', '|u1')])
 |  
 |  Offsets in bytes, here 0 and 25:
 |  
 |  >>> np.dtype({'surname':('S25',0),'age':(np.uint8,25)})
 |  dtype([('surname', '|S25'), ('age', '|u1')])
 |  
 |  Methods defined here:
 |  
 |  __eq__(...)
 |      x.__eq__(y) <==> x==y
 |  
 |  __ge__(...)
 |      x.__ge__(y) <==> x>=y
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(...)
 |      x.__gt__(y) <==> x>y
 |  
 |  __hash__(...)
 |      x.__hash__() <==> hash(x)
 |  
 |  __le__(...)
 |      x.__le__(y) <==> x<=y
 |  
 |  __len__(...)
 |      x.__len__() <==> len(x)
 |  
 |  __lt__(...)
 |      x.__lt__(y) <==> x<y
 |  
 |  __mul__(...)
 |      x.__mul__(n) <==> x*n
 |  
 |  __ne__(...)
 |      x.__ne__(y) <==> x!=y
 |  
 |  __reduce__(...)
 |  
 |  __repr__(...)
 |      x.__repr__() <==> repr(x)
 |  
 |  __rmul__(...)
 |      x.__rmul__(n) <==> n*x
 |  
 |  __setstate__(...)
 |  
 |  __str__(...)
 |      x.__str__() <==> str(x)
 |  
 |  newbyteorder(...)
 |      newbyteorder(new_order='S')
 |      
 |      Return a new dtype with a different byte order.
 |      
 |      Changes are also made in all fields and sub-arrays of the data type.
 |      
 |      Parameters
 |      ----------
 |      new_order : string, optional
 |          Byte order to force; a value from the byte order
 |          specifications below.  The default value ('S') results in
 |          swapping the current byte order.
 |          `new_order` codes can be any of::
 |      
 |           * 'S' - swap dtype from current to opposite endian
 |           * {'<', 'L'} - little endian
 |           * {'>', 'B'} - big endian
 |           * {'=', 'N'} - native order
 |           * {'|', 'I'} - ignore (no change to byte order)
 |      
 |          The code does a case-insensitive check on the first letter of
 |          `new_order` for these alternatives.  For example, any of '>'
 |          or 'B' or 'b' or 'brian' are valid to specify big-endian.
 |      
 |      Returns
 |      -------
 |      new_dtype : dtype
 |          New dtype object with the given change to the byte order.
 |      
 |      Notes
 |      -----
 |      Changes are also made in all fields and sub-arrays of the data type.
 |      
 |      Examples
 |      --------
 |      >>> import sys
 |      >>> sys_is_le = sys.byteorder == 'little'
 |      >>> native_code = sys_is_le and '<' or '>'
 |      >>> swapped_code = sys_is_le and '>' or '<'
 |      >>> native_dt = np.dtype(native_code+'i2')
 |      >>> swapped_dt = np.dtype(swapped_code+'i2')
 |      >>> native_dt.newbyteorder('S') == swapped_dt
 |      True
 |      >>> native_dt.newbyteorder() == swapped_dt
 |      True
 |      >>> native_dt == swapped_dt.newbyteorder('S')
 |      True
 |      >>> native_dt == swapped_dt.newbyteorder('=')
 |      True
 |      >>> native_dt == swapped_dt.newbyteorder('N')
 |      True
 |      >>> native_dt == native_dt.newbyteorder('|')
 |      True
 |      >>> np.dtype('<i2') == native_dt.newbyteorder('<')
 |      True
 |      >>> np.dtype('<i2') == native_dt.newbyteorder('L')
 |      True
 |      >>> np.dtype('>i2') == native_dt.newbyteorder('>')
 |      True
 |      >>> np.dtype('>i2') == native_dt.newbyteorder('B')
 |      True
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  alignment
 |      The required alignment (bytes) of this data-type according to the compiler.
 |      
 |      More information is available in the C-API section of the manual.
 |  
 |  base
 |  
 |  byteorder
 |      A character indicating the byte-order of this data-type object.
 |      
 |      One of:
 |      
 |      ===  ==============
 |      '='  native
 |      '<'  little-endian
 |      '>'  big-endian
 |      '|'  not applicable
 |      ===  ==============
 |      
 |      All built-in data-type objects have byteorder either '=' or '|'.
 |      
 |      Examples
 |      --------
 |      
 |      >>> dt = np.dtype('i2')
 |      >>> dt.byteorder
 |      '='
 |      >>> # endian is not relevant for 8 bit numbers
 |      >>> np.dtype('i1').byteorder
 |      '|'
 |      >>> # or ASCII strings
 |      >>> np.dtype('S2').byteorder
 |      '|'
 |      >>> # Even if specific code is given, and it is native
 |      >>> # '=' is the byteorder
 |      >>> import sys
 |      >>> sys_is_le = sys.byteorder == 'little'
 |      >>> native_code = sys_is_le and '<' or '>'
 |      >>> swapped_code = sys_is_le and '>' or '<'
 |      >>> dt = np.dtype(native_code + 'i2')
 |      >>> dt.byteorder
 |      '='
 |      >>> # Swapped code shows up as itself
 |      >>> dt = np.dtype(swapped_code + 'i2')
 |      >>> dt.byteorder == swapped_code
 |      True
 |  
 |  char
 |      A unique character code for each of the 21 different built-in types.
 |  
 |  descr
 |      Array-interface compliant full description of the data-type.
 |      
 |      The format is that required by the 'descr' key in the
 |      `__array_interface__` attribute.
 |  
 |  fields
 |      Dictionary of named fields defined for this data type, or ``None``.
 |      
 |      The dictionary is indexed by keys that are the names of the fields.
 |      Each entry in the dictionary is a tuple fully describing the field::
 |      
 |        (dtype, offset[, title])
 |      
 |      If present, the optional title can be any object (if it is a string
 |      or unicode then it will also be a key in the fields dictionary,
 |      otherwise it's meta-data). Notice also that the first two elements
 |      of the tuple can be passed directly as arguments to the ``ndarray.getfield``
 |      and ``ndarray.setfield`` methods.
 |      
 |      See Also
 |      --------
 |      ndarray.getfield, ndarray.setfield
 |      
 |      Examples
 |      --------
 |      >>> dt = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])
 |      >>> print dt.fields
 |      {'grades': (dtype(('float64',(2,))), 16), 'name': (dtype('|S16'), 0)}
 |  
 |  flags
 |      Bit-flags describing how this data type is to be interpreted.
 |      
 |      Bit-masks are in `numpy.core.multiarray` as the constants
 |      `ITEM_HASOBJECT`, `LIST_PICKLE`, `ITEM_IS_POINTER`, `NEEDS_INIT`,
 |      `NEEDS_PYAPI`, `USE_GETITEM`, `USE_SETITEM`. A full explanation
 |      of these flags is in C-API documentation; they are largely useful
 |      for user-defined data-types.
 |  
 |  hasobject
 |      Boolean indicating whether this dtype contains any reference-counted
 |      objects in any fields or sub-dtypes.
 |      
 |      Recall that what is actually in the ndarray memory representing
 |      the Python object is the memory address of that object (a pointer).
 |      Special handling may be required, and this attribute is useful for
 |      distinguishing data types that may contain arbitrary Python objects
 |      and data-types that won't.
 |  
 |  isalignedstruct
 |      Boolean indicating whether the dtype is a struct which maintains
 |      field alignment. This flag is sticky, so when combining multiple
 |      structs together, it is preserved and produces new dtypes which
 |      are also aligned.
 |  
 |  isbuiltin
 |      Integer indicating how this dtype relates to the built-in dtypes.
 |      
 |      Read-only.
 |      
 |      =  ========================================================================
 |      0  if this is a structured array type, with fields
 |      1  if this is a dtype compiled into numpy (such as ints, floats etc)
 |      2  if the dtype is for a user-defined numpy type
 |         A user-defined type uses the numpy C-API machinery to extend
 |         numpy to handle a new array type. See
 |         :ref:`user.user-defined-data-types` in the Numpy manual.
 |      =  ========================================================================
 |      
 |      Examples
 |      --------
 |      >>> dt = np.dtype('i2')
 |      >>> dt.isbuiltin
 |      1
 |      >>> dt = np.dtype('f8')
 |      >>> dt.isbuiltin
 |      1
 |      >>> dt = np.dtype([('field1', 'f8')])
 |      >>> dt.isbuiltin
 |      0
 |  
 |  isnative
 |      Boolean indicating whether the byte order of this dtype is native
 |      to the platform.
 |  
 |  itemsize
 |      The element size of this data-type object.
 |      
 |      For 18 of the 21 types this number is fixed by the data-type.
 |      For the flexible data-types, this number can be anything.
 |  
 |  kind
 |      A character code (one of 'biufcOSUV') identifying the general kind of data.
 |      
 |      =  ======================
 |      b  boolean
 |      i  signed integer
 |      u  unsigned integer
 |      f  floating-point
 |      c  complex floating-point
 |      O  object
 |      S  (byte-)string
 |      U  Unicode
 |      V  void
 |      =  ======================
 |  
 |  metadata
 |  
 |  name
 |      A bit-width name for this data-type.
 |      
 |      Un-sized flexible data-type objects do not have this attribute.
 |  
 |  names
 |      Ordered list of field names, or ``None`` if there are no fields.
 |      
 |      The names are ordered according to increasing byte offset. This can be
 |      used, for example, to walk through all of the named fields in offset order.
 |      
 |      Examples
 |      --------
 |      >>> dt = np.dtype([('name', np.str_, 16), ('grades', np.float64, (2,))])
 |      >>> dt.names
 |      ('name', 'grades')
 |  
 |  num
 |      A unique number for each of the 21 different built-in types.
 |      
 |      These are roughly ordered from least-to-most precision.
 |  
 |  shape
 |      Shape tuple of the sub-array if this data type describes a sub-array,
 |      and ``()`` otherwise.
 |  
 |  str
 |      The array-protocol typestring of this data-type object.
 |  
 |  subdtype
 |      Tuple ``(item_dtype, shape)`` if this `dtype` describes a sub-array, and
 |      None otherwise.
 |      
 |      The *shape* is the fixed shape of the sub-array described by this
 |      data type, and *item_dtype* the data type of the array.
 |      
 |      If a field whose dtype object has this attribute is retrieved,
 |      then the extra dimensions implied by *shape* are tacked on to
 |      the end of the retrieved array.
 |  
 |  type
 |      The type object used to instantiate a scalar of this data-type.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __new__ = <built-in method __new__ of type object>
 |      T.__new__(S, ...) -> a new object with type S, a subtype of T
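
A few of the attributes from the help text above are easier to remember after trying them out. This is only a minimal sketch: the structured dtype follows the docstring examples, except that the name field is declared as 'U16' (my assumption, so that the result does not depend on the Python version), and the variable names are mine.

```python
import numpy as np

# Structured dtype modelled on the docstring examples above.
# 'U16' (16 unicode characters, 4 bytes each) is an assumption made for this sketch.
dt = np.dtype([('name', 'U16'), ('grades', np.float64, (2,))])

# names: field names ordered by increasing byte offset.
names = dt.names

# fields: each entry is a tuple (dtype, offset[, title]).
grades_dt, grades_offset = dt.fields['grades'][:2]

# kind: one-character code for the general kind of data.
kind = np.dtype('f8').kind

# itemsize: element size in bytes.
itemsize = np.dtype('i2').itemsize

# isbuiltin: 1 for a dtype compiled into numpy, 0 for a structured type.
builtin_flags = (np.dtype('i2').isbuiltin, dt.isbuiltin)

# subdtype: (item_dtype, shape) when the dtype describes a sub-array.
sub_item, sub_shape = grades_dt.subdtype
```

Here `grades_offset` is the byte position at which the grades field starts inside each record, that is, right after the 16-character name.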


Monday, March 30, 2015

Pandas or file.read... A typical task: a folder holds a couple of dozen relatively short files...

How to iterate over the files in a folder, read their names, open and edit them... and write them back to disk.

Although never is often better than right now ... No sooner said than done

Parse the file name to pull "april" out of it

Add a column with Pandas (...off to the dacha!)

Now let's recall how to read a string into a date object: datetime.strptime("21/11/06 16:30", "%d/%m/%y %H:%M")

With Pandas there is a different date object: Time Series / Date functionality

And we need the column of strings we added to the object to be read as a column of dates...

How to detect the date format automatically?

Details and notes from the documentation follow

Working with a date column in Pandas is described very well here: Copy-paste of the second part on data preparation, "Data Wrangling with Pandas" (part 2)

What dtypes are

A universal way to convert types: astype and object-conversion

And here is how to convert everything at once to datetime64[ns]: object-conversion

Obviously, the "Selecting columns based on dtype" technique is worth remembering too
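
The steps in this outline can be strung together roughly as follows. This is a sketch under assumptions, not the exact code from the post: the file naming pattern (`data_april.csv`), the year 2015, the `LAST_DAY` dictionary, the column name `date` and the helper names are all hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical lookup: month name -> last day of that month (year assumed to be 2015).
LAST_DAY = {'march': '2015-03-31', 'april': '2015-04-30'}


def month_from_name(filename):
    """Return the month name embedded in a name like 'data_april.csv'."""
    stem = os.path.splitext(filename)[0]
    return stem.split('_')[-1].lower()


def load_folder(folder):
    """Read every CSV in the folder, tag its rows with a date, and glue the files together."""
    frames = []
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith('.csv'):
            continue
        df = pd.read_csv(os.path.join(folder, filename))
        # At this point the new column holds plain strings.
        df['date'] = LAST_DAY[month_from_name(filename)]
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True)
    # Convert the string column into real dates.
    combined['date'] = pd.to_datetime(combined['date'], format='%Y-%m-%d')
    return combined


# Tiny demo on two throwaway files in a temporary folder.
demo_dir = tempfile.mkdtemp()
for month in ('march', 'april'):
    with open(os.path.join(demo_dir, 'data_%s.csv' % month), 'w') as fh:
        fh.write('model,sales\nA,1\nB,2\n')

result = load_folder(demo_dir)
```

Passing an explicit `format` to `pd.to_datetime` avoids guessing the date layout; a string that does not match it raises an error instead of slipping through silently.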
