
Tuesday, March 31, 2015

Pandas: how best to merge files... A reference for read_csv, to_csv, and DataFrame methods

This is the third post in a row on this topic. We have already worked through the methods for cleaning and converting columns in a DataFrame(). Now, before merging all the files in the folder into one, let's collect here the options and properties of read_csv and to_csv. There are several dozen of them, and the DataFrame object has several hundred methods, so this post turned out to be a reference.

In [1]:
import os
In [2]:
path = '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/1aug2'
!ls '/media/MYLINUXLIVE/Documents/Xpdf/aerbu_2014_all_csv/1aug2'
000_eng_car-sales-in-april-2014.csv  eng_car-sales-in-november-2014.csv
eng_car-sales-in-april-2014.csv      eng_car-sales-in-october-2014.csv
eng_car-sales-in-august-2014.csv     eng_car-sales-in-september-2014.csv
eng_car-sales-in-december-2014.csv   sales-in-december_2013_eng_final.csv
eng_car-sales-in-july-2014.csv      sales-in-february_2014_eng_final.csv
eng_car-sales-in-june-2014.csv      sales-in-january_2014_eng_final_1.csv
eng_car-sales-in-may-2014.csv      sales-in-march_2014_eng_final.csv

As before, we put off the question of how to organize the selection (filtering) of files and focus on how to merge (concatenate) them. In the previous post we found that the files can differ not only in the number of rows but also in the number of columns. Moreover, the columns being merged can have different formats and different characters in related columns.
Here is an example (from the previous post):

In [ ]:
******      eng_car-sales-in-september-2014.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 64 entries, 0 to 63
Data columns (total 8 columns):    ## Insert column headers
Lada          64 non-null object
36,513        64 non-null object   # Comma as thousands separator
40,011        64 non-null object   # Comma as thousands separator
-9%           64 non-null object   # Strip the % character
283,802       64 non-null object   # Comma as thousands separator
343,368       64 non-null object   # Comma as thousands separator
-17%          64 non-null object   # Strip the % character
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.8+ KB
******      sales-in-december_2013_eng_final.csv
<class 'pandas.core.frame.DataFrame'>
Int64Index: 57 entries, 0 to 56
Data columns (total 8 columns):    ## Insert column headers
Lada          57 non-null object
456 309       57 non-null object   # Strip the space (thousands separator)
537 625       57 non-null object   # Strip the space (thousands separator)
-15%          57 non-null object   # Strip the % character
38 948        57 non-null object   # Strip the space (thousands separator)
43 354        57 non-null object   # Strip the space (thousands separator)
-10%          57 non-null object   # Strip the % character
Unnamed: 7    0 non-null float64
dtypes: float64(1), object(7)
memory usage: 2.4+ KB

Every file has 8 columns (one extra because of a trailing comma at the end of each line). The January file is the exception: it lacks the two year-to-date columns, so they will have to be inserted.
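A minimal sketch of that insertion with DataFrame.insert, which puts a new column at a given position. The column names and jan_path here are hypothetical placeholders, not taken from the files:

In [ ]:
import numpy as np
import pandas as pd

# Hypothetical: the January file carries only the monthly columns, so
# NaN-filled year-to-date columns are inserted at positions 4-6 to align
# its layout with the other frames before merging.
df_jan = pd.read_csv(jan_path, header=None,
                     names=['brand', 'month_2014', 'month_2013', 'month_pct'])
for pos, name in [(4, 'ytd_2014'), (5, 'ytd_2013'), (6, 'ytd_pct')]:
    df_jan.insert(pos, name, np.nan)   # new NaN column at position pos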

For reference, here are Python's file open modes (they matter for the mode option of to_csv below); a raw-file merging sketch follows the table.

Character Meaning
'r' open for reading (default)
'w' open for writing, truncating the file first
'a' open for writing, appending to the end of the file if it exists
'b' binary mode
't' text mode (default)
'+' open a disk file for updating (reading and writing)
'U' universal newlines mode (for backwards compatibility; should not be used in new code)
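Before involving pandas at all, the files can be glued at the text level with modes 'w' and 'a'. A minimal sketch, assuming all files share the same column layout (all_sales.csv is a hypothetical output name):

In [ ]:
import os

out_path = os.path.join(path, 'all_sales.csv')   # hypothetical output file
with open(out_path, 'w') as out:                 # 'w' truncates the target first
    for name in sorted(os.listdir(path)):
        if name.endswith('.csv') and name != 'all_sales.csv':
            with open(os.path.join(path, name)) as src:  # 'r' is the default mode
                out.write(src.read())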
In [ ]:
pd000 = pd.read_csv(csvfile_path, error_bad_lines=False)  # skip malformed lines instead of raising

Options of read_csv (the link to the documentation is at the bottom of the post); a reading sketch follows the list.

  • filepath_or_buffer: Either a string path to a file, URL (including http, ftp, and S3 locations), or any object with a read method (such as an open file or StringIO).
  • sep or delimiter: A delimiter / separator to split fields on. read_csv is capable of inferring the delimiter automatically in some cases by “sniffing.” The separator may be specified as a regular expression; for instance you may use ‘|\s*’ to indicate a pipe plus arbitrary whitespace.
  • delim_whitespace: Parse whitespace-delimited (spaces or tabs) file (much faster than using a regular expression)
  • compression: decompress 'gzip' and 'bz2' formats on the fly.
  • dialect: string or csv.Dialect instance to expose more ways to specify the file format
  • dtype: A data type name or a dict of column name to data type. If not specified, data types will be inferred. (Unsupported with engine='python')
  • header: row number(s) to use as the column names, and the start of the data. Defaults to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns E.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example are skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True (the default), so header=0 denotes the first line of data rather than the first line of the file.
  • skip_blank_lines: whether to skip over blank lines rather than interpreting them as NaN values
  • skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows
  • index_col: column number, column name, or list of column numbers/names, to use as the index (row labels) of the resulting DataFrame. By default, it will number the rows without using any column, unless there is one more data column than there are headers, in which case the first column is taken as the index.
  • names: List of column names to use as column names. To replace header existing in file, explicitly pass header=0.
  • na_values: optional list of strings to recognize as NaN (missing values), either in addition to or in lieu of the default set.
  • true_values: list of strings to recognize as True
  • false_values: list of strings to recognize as False
  • keep_default_na: whether to include the default set of missing values in addition to the ones specified in na_values
  • parse_dates: if True then the index will be parsed as dates (False by default). You can specify more complicated options to parse a subset of columns or a combination of columns into a single date column (list of ints or names, list of lists, or dict):
    [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column
    [[1, 3]] -> combine columns 1 and 3 and parse as a single date column
    {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
  • keep_date_col: if True, then date component columns passed into parse_dates will be retained in the output (False by default).
  • date_parser: function to use to parse strings into datetime objects. If parse_dates is True, it defaults to the very robust dateutil.parser. Specifying this implicitly sets parse_dates as True. You can also use functions from community supported date converters from date_converters.py
  • dayfirst: if True then uses the DD/MM international/European date format (This is False by default)
  • thousands: specifies the thousands separator. If not None, this character will be stripped from numeric dtypes. However, if it is the first character in a field, that column will be imported as a string. In the PythonParser, if not None, then parser will try to look for it in the output and parse relevant data to numeric dtypes. Because it has to essentially scan through the data again, this causes a significant performance hit so only use if necessary.
  • lineterminator : string (length 1), default None, Character to break file into lines. Only valid with C parser
  • quotechar : string, The character to used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
  • quoting : int, Controls whether quotes should be recognized. Values are taken from csv.QUOTE_* values. Acceptable values are 0, 1, 2, and 3 for QUOTE_MINIMAL, QUOTE_ALL, QUOTE_NONE, and QUOTE_NONNUMERIC, respectively.
  • skipinitialspace : boolean, default False, Skip spaces after delimiter
  • escapechar : string, to specify how to escape quoted data
  • comment: Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines, fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment=’#’, parsing ‘#empty\n1,2,3\na,b,c’ with header=0 will result in ‘1,2,3’ being treated as the header.
  • nrows: Number of rows to read out of the file. Useful to only read a small portion of a large file
  • iterator: If True, return a TextFileReader to enable reading a file into memory piece by piece
  • chunksize: A number of rows to be used to “chunk” a file into pieces. Causes a TextFileReader object to be returned (see the pandas documentation on iterating and chunking).
  • skip_footer: number of lines to skip at bottom of file (default 0) (Unsupported with engine='c')
  • converters: a dictionary of functions for converting values in certain columns, where keys are either integers or column labels
  • encoding: a string representing the encoding to use for decoding unicode data, e.g. 'utf-8' or 'latin-1'. Full list of Python standard encodings
  • verbose: show number of NA values inserted in non-numeric columns
  • squeeze: if True then output with only one column is turned into Series
  • error_bad_lines: if False then any lines causing an error will be skipped instead of raising an exception
  • usecols: a subset of columns to return, results in much faster parsing time and lower memory usage.
  • mangle_dupe_cols: boolean, default True, then duplicate columns will be specified as ‘X.0’...’X.N’, rather than ‘X’...’X’
  • tupleize_cols: boolean, default False, if False, convert a list of tuples to a multi-index of columns, otherwise, leave the column index as a list of tuples
  • float_precision : string, default None. Specifies which converter the C engine should use for floating-point values. The options are None for the ordinary converter, ‘high’ for the high-precision converter, and ‘round_trip’ for the round-trip converter.
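Putting a few of these options together for the files above. This is a sketch, not the final merging code: the column names are hypothetical, thousands=',' handles the comma-separated files (the space-separated 'final' files would need thousands=' '), and the % columns still arrive as strings, to be cleaned as in the previous post.

In [ ]:
import os
import pandas as pd

names = ['brand', 'month_2014', 'month_2013', 'month_pct',
         'ytd_2014', 'ytd_2013', 'ytd_pct', 'empty']    # hypothetical names
df = pd.read_csv(os.path.join(path, 'eng_car-sales-in-september-2014.csv'),
                 header=None, names=names,  # the files carry no header row
                 thousands=',',             # '36,513' -> 36513
                 error_bad_lines=False)     # skip malformed lines
df = df.drop('empty', axis=1)  # the 8th column exists only because of the trailing comma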

We know that everything can be written into a single file (there is the mode option: Python write mode, default ‘w’). Quite a few other to_csv options are available as well; an append-mode sketch follows the list.

  • path_or_buf: A string path to the file to write or a StringIO
  • sep : Field delimiter for the output file (default ”,”)
  • na_rep: A string representation of a missing value (default ‘’)
  • float_format: Format string for floating point numbers
  • cols: Columns to write (default None)
  • header: Whether to write out the column names (default True)
  • index: whether to write row (index) names (default True)
  • index_label: Column label(s) for index column(s) if desired. If None (default), and header and index are True, then the index names are used. (A sequence should be given if the DataFrame uses MultiIndex).
  • mode : Python write mode, default ‘w’
  • encoding: a string representing the encoding to use if the contents are non-ASCII, for python versions prior to 3
  • line_terminator: Character sequence denoting line end (default ‘\n’)
  • quoting: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL)
  • quotechar: Character used to quote fields (default ‘”’)
  • doublequote: Control quoting of quotechar in fields (default True)
  • escapechar: Character used to escape sep and quotechar when appropriate (default None)
  • chunksize: Number of rows to write at a time
  • tupleize_cols: If False (default), write as a list of tuples, otherwise write in an expanded line format suitable for read_csv
  • date_format: Format string for datetime objects
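A minimal append-mode sketch, assuming df and df2 are two already-cleaned frames and merged.csv is a hypothetical output name: the first write creates the file with a header, later writes append rows without repeating it.

In [ ]:
import os

merged_path = os.path.join(path, 'merged.csv')                # hypothetical name
df.to_csv(merged_path, index=False, mode='w')                 # create/truncate, write header
df2.to_csv(merged_path, index=False, mode='a', header=False)  # append without a header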

The documentation lists about a hundred properties; here is one example:

In [ ]:
append(other[, ignore_index, verify_integrity]) 
# Append rows of other to the end of this frame, returning a new object.

This method can be used to append another (for example, temporary) frame, into which the CSV files are read one by one in a loop; here is an example: Reading Multiple CSV Files into Python Pandas Dataframe. A sketch of such a loop follows.
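A minimal sketch, assuming all the frames end up with matching columns. Collecting the frames in a list and calling pd.concat once is cheaper than calling DataFrame.append inside the loop, since append copies the data on every call:

In [ ]:
import os
import pandas as pd

frames = []
for name in sorted(os.listdir(path)):
    if name.endswith('.csv'):
        frames.append(pd.read_csv(os.path.join(path, name), header=None))
big = pd.concat(frames, ignore_index=True)  # one copy instead of one per append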

In [4]:
import pandas as pd
dir(pd)
Out[4]:
['Categorical',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'ExcelFile',
 'ExcelWriter',
 'Expr',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int64Index',
 'LooseVersion',
 'MultiIndex',
 'NaT',
 'Panel',
 'Panel4D',
 'Period',
 'PeriodIndex',
 'Series',
 'SparseArray',
 'SparseDataFrame',
 'SparseList',
 'SparsePanel',
 'SparseSeries',
 'SparseTimeSeries',
 'Term',
 'TimeGrouper',
 'TimeSeries',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'WidePanel',
 '__builtins__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__name__',
 '__package__',
 '__path__',
 '__version__',
 '_np_version',
 '_np_version_under1p8',
 '_np_version_under1p9',
 '_sparse',
 '_testing',
 'algos',
 'bdate_range',
 'compat',
 'computation',
 'concat',
 'core',
 'crosstab',
 'cut',
 'date_range',
 'datetime',
 'datetools',
 'describe_option',
 'eval',
 'ewma',
 'ewmcorr',
 'ewmcov',
 'ewmstd',
 'ewmvar',
 'ewmvol',
 'expanding_apply',
 'expanding_corr',
 'expanding_corr_pairwise',
 'expanding_count',
 'expanding_cov',
 'expanding_kurt',
 'expanding_max',
 'expanding_mean',
 'expanding_median',
 'expanding_min',
 'expanding_quantile',
 'expanding_skew',
 'expanding_std',
 'expanding_sum',
 'expanding_var',
 'factorize',
 'fama_macbeth',
 'get_dummies',
 'get_option',
 'get_store',
 'groupby',
 'hashtable',
 'index',
 'infer_freq',
 'info',
 'io',
 'isnull',
 'json',
 'lib',
 'load',
 'lreshape',
 'match',
 'melt',
 'merge',
 'msgpack',
 'notnull',
 'np',
 'offsets',
 'ols',
 'option_context',
 'options',
 'ordered_merge',
 'pandas',
 'parser',
 'period_range',
 'pivot',
 'pivot_table',
 'plot_params',
 'pnow',
 'qcut',
 'read_clipboard',
 'read_csv',
 'read_excel',
 'read_fwf',
 'read_gbq',
 'read_hdf',
 'read_html',
 'read_json',
 'read_msgpack',
 'read_pickle',
 'read_sql',
 'read_sql_query',
 'read_sql_table',
 'read_stata',
 'read_table',
 'reset_option',
 'rolling_apply',
 'rolling_corr',
 'rolling_corr_pairwise',
 'rolling_count',
 'rolling_cov',
 'rolling_kurt',
 'rolling_max',
 'rolling_mean',
 'rolling_median',
 'rolling_min',
 'rolling_quantile',
 'rolling_skew',
 'rolling_std',
 'rolling_sum',
 'rolling_var',
 'rolling_window',
 'save',
 'scatter_matrix',
 'set_eng_float_format',
 'set_option',
 'show_versions',
 'sparse',
 'stats',
 'timedelta_range',
 'to_datetime',
 'to_msgpack',
 'to_pickle',
 'to_timedelta',
 'tools',
 'tseries',
 'tslib',
 'unique',
 'util',
 'value_counts',
 'value_range',
 'version',
 'wide_to_long']
In [5]:
help(pd)
Help on package pandas:

NAME
    pandas

FILE
    /usr/local/lib/python2.7/dist-packages/pandas/__init__.py

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    =====================================================================
    
    See http://pandas.sourceforge.net for full documentation. Otherwise, see the
    docstrings of the various objects in the pandas namespace:
    
    Series
    DataFrame
    Panel
    Index
    DatetimeIndex
    HDFStore
    bdate_range
    date_range
    read_csv
    read_fwf
    read_table
    ols

PACKAGE CONTENTS
    _sparse
    _testing
    algos
    compat (package)
    computation (package)
    core (package)
    hashtable
    index
    info
    io (package)
    json
    lib
    msgpack
    parser
    rpy (package)
    sandbox (package)
    sparse (package)
    stats (package)
    tests (package)
    tools (package)
    tseries (package)
    tslib
    util (package)
    version

SUBMODULES
    datetools
    offsets

DATA
    IndexSlice = <pandas.core.indexing._IndexSlice object>
    NaT = NaT
    __docformat__ = 'restructuredtext'
    __version__ = '0.15.2'
    describe_option = <pandas.core.config.CallableDynamicDoc object>
    get_option = <pandas.core.config.CallableDynamicDoc object>
    options = <pandas.core.config.DictWrapper object>
    plot_params = {'xaxis.compat': False}
    reset_option = <pandas.core.config.CallableDynamicDoc object>
    set_option = <pandas.core.config.CallableDynamicDoc object>

VERSION
    0.15.2



