A post in which I first try to combine several different techniques for quick data analysis (parsing, table processing, exploratory data analysis). For parsing I intended to use requests + lxml. But it turned out that Pandas loads csv from the web just as well as from a local disk. And for a short fragment (for example, a section's table of contents) it is easier to copy and paste the tags. This is an example of how I got bogged down in assorted small questions, because I started studying how to size (scale) charts ('figure.figsize') and how to format tables (for example, 'display.max_columns').

From here on, we will look specifically for tutorials on GitHub + Nbviewer.

My brain needs to grow several hundred thousand synaptic connections before I remember the basic techniques of working with Pandas. Besides a first-pass classification of sections and subsections, I also have to absorb hundreds of "small" techniques: where to put a comma or quotes, whether Pandas strips trailing spaces, how to adjust chart sizes, how to format lines in tables, and so on. So if I start solving my own tasks right away, I will spend an order of magnitude more time searching for answers than if I work through someone else's SIMILAR examples. The question is how to find these "similar examples". Here I recorded the wrong sequence of my actions: how I try to solve my own tasks but get stuck in a "similar" example.

In several previous posts I described individual techniques (commands), for example merge, because I had made a list of features I would need for my tasks. Now it is clear that this was the main mistake. I should have started with a "similar example", pandas-cookbook. I start on it in the next post; here is the log of my failures.

In []:

import requests
headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.com'  # This is another valid field
}
url = ''  # left empty in the original notebook; set the page address before running
response = requests.get(url, headers=headers)

In []:

from lxml import html
tree = html.fromstring(response.text)
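
For a short fragment such as a section's table of contents, the pasted tags can be parsed directly, with no HTTP request at all. A minimal sketch, assuming lxml is installed; the HTML snippet, its class name and the link targets are made up for illustration and do not come from any real page:

```python
from lxml import html

# Hypothetical fragment, pasted by hand instead of being downloaded.
snippet = """
<ul class="toc">
  <li><a href="/ch1">Chapter 1: Reading from a CSV</a></li>
  <li><a href="/ch2">Chapter 2: Selecting data</a></li>
</ul>
"""

tree = html.fromstring(snippet)

# Relative XPath: every <a> that sits directly inside an <li> of the fragment.
links = [(a.text_content().strip(), a.get("href")) for a in tree.xpath(".//li/a")]

for title, href in links:
    print(title, "->", href)
```

The same `xpath` call should work on a tree built from `response.text`, so the copy-paste route and the requests route differ only in where the string comes from.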

Chapter 6: String operations! Which month was the snowiest?
pandas-cookbook
Pandas API Reference

... following the recipes from the article "HTML Scraping"
Requests: HTTP for Humans
List of HTTP header fields
Authenticate to Google Drive and download spreadsheet with Python urllib2/requests

Parsing XML and HTML with lxml

In [3]:

%matplotlib inline
import pandas as pd
import numpy as np

In [19]:

import matplotlib.pyplot as plt
#pd.set_option('display.mpl_style', 'default')
plt.rcParams['figure.figsize'] = (15, 5)

In [10]:

!pwd

/media/SL-63-X86_6/w8/IPython Notebooks/2015_3

In [20]:

bikes = pd.read_csv('./pandas-cookbook/data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')
bikes['Berri 1'].plot()

Out[20]:

<matplotlib.axes.AxesSubplot at 0xb21720c>
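
What each `read_csv` argument in the cell above does is easier to see on a tiny input. A minimal sketch, assuming a made-up three-row sample in the same layout as bikes.csv (semicolon separator, day-first dates); the numbers are copied from the first rows shown later in this post, but the sample itself is not the real file:

```python
import io
import pandas as pd

# Made-up sample in the same layout as bikes.csv.
sample = io.StringIO(
    "Date;Berri 1;Rachel1\n"
    "01/01/2012;35;16\n"
    "02/01/2012;83;43\n"
    "03/01/2012;135;58\n"
)

df = pd.read_csv(
    sample,
    sep=";",               # fields are separated by semicolons, not commas
    parse_dates=["Date"],  # convert the Date column to datetime values
    dayfirst=True,         # read 02/01/2012 as 2 January, not 1 February
    index_col="Date",      # use the parsed dates as the row index
)

second_date = df.index[1]
berri_total = df["Berri 1"].sum()
print(second_date, berri_total)
```

`read_csv` accepts a local path, a URL or a file-like object in the same first argument, which is why the local and the remote versions of the call in this post differ only in that argument.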

In [21]:

same_url = "https://raw.githubusercontent.com/jvns/pandas-cookbook/master/data/bikes.csv"

In [23]:

plt.rcParams['figure.figsize'] = (20, 5)

In [24]:

bikes = pd.read_csv(same_url, sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')
bikes['Berri 1'].plot()

Out[24]:

<matplotlib.axes.AxesSubplot at 0xb72e80c>
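
The two `plt.rcParams['figure.figsize']` cells change a global default, so every later figure is affected. A minimal sketch of the difference between that global setting and a per-figure size, assuming a made-up five-value series in place of `bikes['Berri 1']` and the non-interactive Agg backend so that it runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for running headless
import matplotlib.pyplot as plt
import pandas as pd

# Global default: figures created from now on are 20 x 5 inches.
plt.rcParams["figure.figsize"] = (20, 5)

# Made-up series standing in for bikes['Berri 1'].
s = pd.Series([35, 83, 135, 144, 197], name="Berri 1")

# Figure 1 picks up the global default.
fig_default, ax_default = plt.subplots()
s.plot(ax=ax_default)
default_size = tuple(float(v) for v in fig_default.get_size_inches())

# Figure 2 overrides the size for this one figure only.
fig_small, ax_small = plt.subplots(figsize=(8, 3))
s.plot(ax=ax_small)
small_size = tuple(float(v) for v in fig_small.get_size_inches())

print(default_size, small_size)
```

Passing `figsize` for a single figure leaves `rcParams` untouched, which avoids having to reset the global value between cells.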

In [28]:

# Set some Pandas options
pd.set_option('html', False)
#pd.set_option('max_columns', 30)
#pd.set_option('max_rows', 20)

In [29]:

bikes[:5]

Out[29]:

            Berri 1  BrÃ©beuf (donnÃ©es non disponibles)  \
Date                                                       
2012-01-01       35                                  NaN   
2012-01-02       83                                  NaN   
2012-01-03      135                                  NaN   
2012-01-04      144                                  NaN   
2012-01-05      197                                  NaN   

            CÃ´te-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  \
Date                                                                        
2012-01-01                       0             38             51       26   
2012-01-02                       1             68            153       53   
2012-01-03                       2            104            248       89   
2012-01-04                       1            116            318      111   
2012-01-05                       2            124            330       97   

            Pierre-Dupuy  Rachel1  St-Urbain (donnÃ©es non disponibles)  
Date                                                                     
2012-01-01            10       16                                   NaN  
2012-01-02             6       43                                   NaN  
2012-01-03             3       58                                   NaN  
2012-01-04             8       61                                   NaN  
2012-01-05            13       95                                   NaN

In [44]:

pd.set_option('precision', 2)
bikes[:5]

Out[44]:

            Berri 1  BrÃ©beuf (donnÃ©es non disponibles)  \
Date                                                       
2012-01-01       35                                  NaN   
2012-01-02       83                                  NaN   
2012-01-03      135                                  NaN   
2012-01-04      144                                  NaN   
2012-01-05      197                                  NaN   

            CÃ´te-Sainte-Catherine  Maisonneuve 1  Maisonneuve 2  du Parc  \
Date                                                                        
2012-01-01                       0             38             51       26   
2012-01-02                       1             68            153       53   
2012-01-03                       2            104            248       89   
2012-01-04                       1            116            318      111   
2012-01-05                       2            124            330       97   

            Pierre-Dupuy  Rachel1  St-Urbain (donnÃ©es non disponibles)  
Date                                                                     
2012-01-01            10       16                                   NaN  
2012-01-02             6       43                                   NaN  
2012-01-03             3       58                                   NaN  
2012-01-04             8       61                                   NaN  
2012-01-05            13       95                                   NaN

In [46]:

pd.set_option('html', True)
pd.set_option('display.max_columns', 7)
bikes[:5]

Out[46]:

	Berri 1	BrÃ©beuf (donnÃ©es non disponibles)	CÃ´te-Sainte-Catherine	...	Pierre-Dupuy	Rachel1	St-Urbain (donnÃ©es non disponibles)
Date
2012-01-01	35	NaN	0	...	10	16	NaN
2012-01-02	83	NaN	1	...	6	43	NaN
2012-01-03	135	NaN	2	...	3	58	NaN
2012-01-04	144	NaN	1	...	8	61	NaN
2012-01-05	197	NaN	2	...	13	95	NaN

5 rows × 9 columns

In [48]:

pd.reset_option('display.max_columns', 'precision')
bikes[:5]

Out[48]:

	Berri 1	BrÃ©beuf (donnÃ©es non disponibles)	CÃ´te-Sainte-Catherine	Maisonneuve 1	Maisonneuve 2	du Parc	Pierre-Dupuy	Rachel1	St-Urbain (donnÃ©es non disponibles)
Date
2012-01-01	35	NaN	0	38	51	26	10	16	NaN
2012-01-02	83	NaN	1	68	153	53	6	43	NaN
2012-01-03	135	NaN	2	104	248	89	3	58	NaN
2012-01-04	144	NaN	1	116	318	111	8	61	NaN
2012-01-05	197	NaN	2	124	330	97	13	95	NaN
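
The option calls in the cells above use short patterns such as 'html' and 'precision', and pass two names to `reset_option`. A minimal sketch of the same workflow with full option names, assuming a recent pandas: `reset_option` is documented as taking a single pattern, so it is called once per option here, and short patterns may match more than one option in newer versions.

```python
import pandas as pd

# Remember the starting values so the effect of each call can be checked.
initial_cols = pd.get_option("display.max_columns")
initial_precision = pd.get_option("display.precision")

# Change one option by its full name.
pd.set_option("display.max_columns", 7)
changed_cols = pd.get_option("display.max_columns")

# Reset one option per call.
pd.reset_option("display.max_columns")
restored_cols = pd.get_option("display.max_columns")

# Temporary change: the old value comes back when the block ends.
with pd.option_context("display.precision", 2):
    inside_precision = pd.get_option("display.precision")
outside_precision = pd.get_option("display.precision")

print(changed_cols, restored_cols, inside_precision, outside_precision)
```

`option_context` is convenient in a notebook because a formatting experiment in one cell cannot leak into the next one.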

In [49]:

pd.reset_option('encoding', 'UTF-8')
bikes[:5]

Out[49]:

	Berri 1	BrÃ©beuf (donnÃ©es non disponibles)	CÃ´te-Sainte-Catherine	Maisonneuve 1	Maisonneuve 2	du Parc	Pierre-Dupuy	Rachel1	St-Urbain (donnÃ©es non disponibles)
Date
2012-01-01	35	NaN	0	38	51	26	10	16	NaN
2012-01-02	83	NaN	1	68	153	53	6	43	NaN
2012-01-03	135	NaN	2	104	248	89	3	58	NaN
2012-01-04	144	NaN	1	116	318	111	8	61	NaN
2012-01-05	197	NaN	2	124	330	97	13	95	NaN

In [50]:

pd.describe_option('encoding')

display.encoding : str/unicode
    Defaults to the detected encoding of the console.
    Specifies the encoding to be used for strings returned by to_string,
    these are generally strings meant to be displayed on the console.
    [default: UTF-8] [currently: UTF-8]

In [51]:

pd.reset_option('encoding', 'latin1')
bikes[:5]

Out[51]:

	Berri 1	BrÃ©beuf (donnÃ©es non disponibles)	CÃ´te-Sainte-Catherine	Maisonneuve 1	Maisonneuve 2	du Parc	Pierre-Dupuy	Rachel1	St-Urbain (donnÃ©es non disponibles)
Date
2012-01-01	35	NaN	0	38	51	26	10	16	NaN
2012-01-02	83	NaN	1	68	153	53	6	43	NaN
2012-01-03	135	NaN	2	104	248	89	3	58	NaN
2012-01-04	144	NaN	1	116	318	111	8	61	NaN
2012-01-05	197	NaN	2	124	330	97	13	95	NaN

In [52]:

pd.describe_option('encoding')

display.encoding : str/unicode
    Defaults to the detected encoding of the console.
    Specifies the encoding to be used for strings returned by to_string,
    these are generally strings meant to be displayed on the console.
    [default: UTF-8] [currently: UTF-8]
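
The headers above (for example 'BrÃ©beuf') look like UTF-8 bytes that were decoded as Latin-1. If that is what happened here (an assumption, the notebook does not confirm it), the fix belongs in the `encoding` argument of `read_csv`: by its own description, `display.encoding` only concerns strings returned by to_string for the console, so changing or resetting it cannot repair text that was already decoded wrongly. A minimal sketch with a made-up one-column file:

```python
import io
import pandas as pd

# Made-up file content, stored as UTF-8 bytes (an assumption about the real data).
raw = "Date;Brébeuf\n01/01/2012;35\n".encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 garbles the accented letter.
wrong = pd.read_csv(io.BytesIO(raw), sep=";", encoding="latin1")

# Decoding with the matching encoding keeps it intact.
right = pd.read_csv(io.BytesIO(raw), sep=";", encoding="utf-8")

print(wrong.columns[1])
print(right.columns[1])

# An already garbled label can be repaired by reversing the wrong decode.
repaired = wrong.columns[1].encode("latin1").decode("utf-8")
print(repaired)
```

If the real file is in fact Latin-1, as the `encoding='latin1'` argument in this post assumes, then the garbling would have to come from somewhere else, for example from the way the notebook output was rendered or saved.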

In [38]:

pd.describe_option('html')

display.notebook_repr_html : boolean
    When True, IPython notebook will use html representation for
    pandas objects (if it is available).
    [default: True] [currently: False]

In [39]:

pd.get_option('html')

Out[39]:

False

In [40]:

pd.describe_option()

display.chop_threshold : float or None
    if set to a float value, all float values smaller then the given threshold
    will be displayed as exactly 0 by repr and friends.
    [default: None] [currently: None]

display.colheader_justify : 'left'/'right'
    Controls the justification of column headers. used by DataFrameFormatter.
    [default: right] [currently: right]

display.column_space No description available.
    [default: 12] [currently: 12]

display.date_dayfirst : boolean
    When True, prints and parses dates with the day first, eg 20/01/2005
    [default: False] [currently: False]

display.date_yearfirst : boolean
    When True, prints and parses dates with the year first, eg 2005/01/20
    [default: False] [currently: False]

display.encoding : str/unicode
    Defaults to the detected encoding of the console.
    Specifies the encoding to be used for strings returned by to_string,
    these are generally strings meant to be displayed on the console.
    [default: UTF-8] [currently: UTF-8]

display.expand_frame_repr : boolean
    Whether to print out the full DataFrame repr for wide DataFrames across
    multiple lines, `max_columns` is still respected, but the output will
    wrap-around across multiple "pages" if its width exceeds `display.width`.
    [default: True] [currently: True]

display.float_format : callable
    The callable should accept a floating point number and return
    a string with the desired format of the number. This is used
    in some places like SeriesFormatter.
    See core.format.EngFormatter for an example.
    [default: None] [currently: None]

display.height : int
    Deprecated.
    [default: 60] [currently: 20]
    (Deprecated, use `display.max_rows` instead.)

display.large_repr : 'truncate'/'info'
    For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
    show a truncated table (the default from 0.13), or switch to the view from
    df.info() (the behaviour in earlier versions of pandas).
    [default: truncate] [currently: truncate]

display.line_width : int
    Deprecated.
    [default: 80] [currently: 80]
    (Deprecated, use `display.width` instead.)

display.max_categories : int
    This sets the maximum number of categories pandas should output when printing
    out a `Categorical` or a Series of dtype "category".
    [default: 8] [currently: 8]

display.max_columns : int
    If max_cols is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the width of the terminal and print a truncated object which fits
    the screen width. The IPython notebook, IPython qtconsole, or IDLE
    do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 20] [currently: 30]

display.max_colwidth : int
    The maximum width in characters of a column in the repr of
    a pandas data structure. When the column overflows, a "..."
    placeholder is embedded in the output.
    [default: 50] [currently: 50]

display.max_info_columns : int
    max_info_columns is used in DataFrame.info method to decide if
    per column information will be printed.
    [default: 100] [currently: 100]

display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions then specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 20]

display.max_seq_items : int or None
    when pretty-printing a long sequence, no more then `max_seq_items`
    will be printed. If items are omitted, they will be denoted by the
    addition of "..." to the resulting string.

    If set to None, the number of items to be printed is unlimited.
    [default: 100] [currently: 100]

display.memory_usage : bool or None
    This specifies if the memory usage of a DataFrame should be displayed when
    df.info() is called.
    [default: True] [currently: True]

display.mpl_style : bool
    Setting this to 'default' will modify the rcParams used by matplotlib
    to give plots a more pleasing visual style by default.
    Setting this to None/False restores the values to their initial value.
    [default: None] [currently: default]

display.multi_sparse : boolean
    "sparsify" MultiIndex display (don't display repeated
    elements in outer levels within groups)
    [default: True] [currently: True]

display.notebook_repr_html : boolean
    When True, IPython notebook will use html representation for
    pandas objects (if it is available).
    [default: True] [currently: False]

display.pprint_nest_depth : int
    Controls the number of nested levels to process when pretty-printing
    [default: 3] [currently: 3]

display.precision : int
    Floating point output precision (number of significant digits). This is
    only a suggestion
    [default: 7] [currently: 7]

display.show_dimensions : boolean or 'truncate'
    Whether to print out dimensions at the end of DataFrame repr.
    If 'truncate' is specified, only print out the dimensions if the
    frame is truncated (e.g. not display all rows and/or columns)
    [default: truncate] [currently: truncate]

display.width : int
    Width of the display in characters. In case python/IPython is running in
    a terminal this can be set to None and pandas will correctly auto-detect
    the width.
    Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
    terminal and hence it is not possible to correctly detect the width.
    [default: 80] [currently: 80]

io.excel.xls.writer : string
    The default Excel writer engine for 'xls' files. Available options:
    'xlwt' (the default).
    [default: xlwt] [currently: xlwt]

io.excel.xlsm.writer : string
    The default Excel writer engine for 'xlsm' files. Available options:
    'openpyxl' (the default).
    [default: openpyxl] [currently: openpyxl]

io.excel.xlsx.writer : string
    The default Excel writer engine for 'xlsx' files. Available options:
    'openpyxl' (the default), 'xlsxwriter'.
    [default: openpyxl] [currently: openpyxl]

io.hdf.default_format : format
    default format writing format, if None, then
    put will default to 'fixed' and append will default to 'table'
    [default: None] [currently: None]

io.hdf.dropna_table : boolean
    drop ALL nan rows when appending to a table
    [default: True] [currently: True]

mode.chained_assignment : string
    Raise an exception, warn, or no action if trying to use chained assignment,
    The default is warn
    [default: warn] [currently: warn]

mode.sim_interactive : boolean
    Whether to simulate interactive mode for purposes of testing
    [default: False] [currently: False]

mode.use_inf_as_null : boolean
    True means treat None, NaN, INF, -INF as null (old way),
    False means None and NaN are null, but INF, -INF are not null
    (new way).
    [default: False] [currently: False]


Thursday, March 12, 2015

I got bogged down in pandas-cookbook instead of writing my own code. From here on we use GitHub+Nbviewer
