The corresponding section in "Python for data analysis" starts with examples from NumPy. Applied to tables (and DataFrame objects), concatenation means appending the rows or columns of a second table to an existing one. The typical task I want to learn to solve: a folder contains tables of the same kind (which I scraped into separate files); how do I turn them into a single table? There are possible variations in the shape of the tables, in adding parameter columns..., but here I want to understand how best to do the concatenation. So I start with an example found on nbviewer.ipython.org, and the concatenation examples come at the end of the post.

On the documentation page I found the link Grab data from multiple excel files and merge them into a single dataframe, which makes it clear that Pandas offers nothing special for concatenating several files. You have to write a loop over the files in the folder.

In []:

import os

# List to hold file names
FileNames = []

# Your path will be different, please modify the path below.
os.chdir(r"C:\Users\david\notebooks\pandas")

# Find any file that ends with ".xlsx"
for files in os.listdir("."):
    if files.endswith(".xlsx"):
        FileNames.append(files)
        
FileNames
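Once the file names are collected, the remaining step is to read each file and stack the results. Below is a minimal sketch of that loop for csv files, not the method from the linked notebook. The helper name `concat_csv_folder`, the `source_file` column, and the two demo tables are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def concat_csv_folder(folder, pattern="*.csv"):
    """Read every file matching `pattern` in `folder` and stack the rows."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        frame = pd.read_csv(path)
        # Hypothetical extra column: remember which file each row came from
        frame["source_file"] = path.name
        frames.append(frame)
    # ignore_index=True replaces the per-file row numbers with one new range
    return pd.concat(frames, ignore_index=True)


# Demo on a throwaway folder holding two tiny, made-up tables
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"model": ["A", "B"], "sales": [10, 20]}).to_csv(
        Path(tmp) / "jan.csv", index=False)
    pd.DataFrame({"model": ["A", "C"], "sales": [30, 40]}).to_csv(
        Path(tmp) / "feb.csv", index=False)
    combined = concat_csv_folder(tmp)

print(combined)
```

Collecting the frames in a list and calling `pd.concat` once is usually preferable to concatenating inside the loop, because each `concat` call copies the data.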

I have already done something similar in the AEBto3tables code... see Methodology for reconfiguring and debugging the AEBto3tables pdf parser using IPyNotebook; here is a fragment:

In []:

def alltxt2csv(self, path=TXTDATAFOLDER):
    """
    Loop over a folder of preliminarily converted txt files
    and call the parsers for tables 1-3 on each of them.
    The extension is taken with os.path.splitext, so file names
    containing additional dots ('.') are handled as well.
    """

    for dir_entry in os.listdir(path):
        # Split "name.txt" into ("name", ".txt")
        filename, extension = os.path.splitext(dir_entry)
        if extension == '.txt':
            txtfile_path = os.path.join(path, dir_entry)
            csvfilename = filename + '.csv'

            if os.path.isfile(txtfile_path):
                self.tab1fromtxt(txtfile_path, csvfilename)
                self.tab2fromtxt(txtfile_path, csvfilename)
                self.tab3fromtxt(txtfile_path, csvfilename)

Next, at last, we work through the concatenation examples: Pandas - Concatenating Along an Axis

pandas-docs
Pandas part 1
Pandas part 2
statistical-analysis-python-tutorial
Pandas
Numpy User Guide
Python for data analysis

In []:

objs        # List or dict of pandas objects to be concatenated. The only required argument
axis        # Axis to concatenate along; defaults to 0
join        # One of 'inner', 'outer', defaulting to 'outer'; whether to intersect (inner) or union
            # (outer) the indexes along the other axes
join_axes   # Specific indexes to use for the other n-1 axes instead of performing union/intersection logic
keys        # Values to associate with objects being concatenated, forming a hierarchical index along the
            # concatenation axis. Can either be a list or array of arbitrary values, an array of tuples, or a list of
            # arrays (if multiple level arrays passed in levels)
levels      # Specific indexes to use as hierarchical index level or levels if keys passed
names       # Names for created hierarchical levels if keys and / or levels passed
verify_integrity  # Check new axis in concatenated object for duplicates and raise exception if so. By default
                  # (False) allows duplicates
ignore_index      # Do not preserve indexes along concatenation axis, instead producing a new
                  # range(total_length) index
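A few of the arguments listed above are easier to grasp on a small case. The sketch below uses two made-up tables (`jan` and `feb` are hypothetical names, not data from this post) to show what `keys` with `names`, `ignore_index`, and `verify_integrity` do.

```python
import pandas as pd

# Two made-up tables that share the row label "A"
jan = pd.DataFrame({"sales": [10, 20]}, index=["A", "B"])
feb = pd.DataFrame({"sales": [30, 40]}, index=["A", "C"])

# keys + names: label each input, producing a two-level row index
labeled = pd.concat([jan, feb], keys=["jan", "feb"], names=["month", "model"])

# ignore_index: drop the original labels and number the rows 0..n-1
renumbered = pd.concat([jan, feb], ignore_index=True)

# verify_integrity: refuse to build a result with duplicate labels ("A" here)
try:
    pd.concat([jan, feb], verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(labeled)
print(renumbered)
print(duplicate_error)
```

For the task of joining files from a folder, `keys` is a convenient way to keep track of which file each row came from without adding a separate column.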

Another kind of data combination operation is alternatively referred to as concatenation, binding, or stacking. NumPy has a concatenate function for doing this with raw NumPy arrays:

In [1]:

%matplotlib inline
import pandas as pd
import numpy as np

In [2]:

# Set some Pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 20)

In [3]:

arr = np.arange(12).reshape((3, 4))

In [4]:

arr

Out[4]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [5]:

np.concatenate([arr, arr], axis=1)

Out[5]:

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [7]:

np.concatenate([arr, arr], axis=0)

Out[7]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In the context of pandas objects such as Series and DataFrame, having labeled axes enable you to further generalize array concatenation. In particular, you have a number of additional things to think about:

• If the objects are indexed differently on the other axes, should the collection of axes be unioned or intersected?
• Do the groups need to be identifiable in the resulting object?
• Does the concatenation axis matter at all?

The concat function in pandas provides a consistent way to address each of these concerns. I’ll give a number of examples to illustrate how it works. Suppose we have three Series with no index overlap:

In [9]:

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

In [10]:

pd.concat([s1, s2, s3])

Out[10]:

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

A nice example: it shows that a sequence of strings can be used as the index...

By default concat works along axis=0, producing another Series. If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns):

In [11]:

pd.concat([s1, s2, s3], axis=1)

Out[11]:

    0   1   2
a   0 NaN NaN
b   1 NaN NaN
c NaN   2 NaN
d NaN   3 NaN
e NaN   4 NaN
f NaN NaN   5
g NaN NaN   6

In this case there is no overlap on the other axis, which as you can see is the sorted union (the 'outer' join) of the indexes. You can instead intersect them by passing join='inner':

In [16]:

s4 = pd.concat([s1 * 5, s3])
s4

Out[16]:

a    0
b    5
f    5
g    6
dtype: int64

In [14]:

pd.concat([s1, s4], axis=1)

Out[14]:

    0  1
a   0  0
b   1  5
f NaN  5
g NaN  6

In [15]:

pd.concat([s1, s4], axis=1, join='inner')

Out[15]:

   0  1
a  0  0
b  1  5

You can even specify the axes to be used on the other axes with join_axes:

In [17]:

pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])

Out[17]:

    0   1
a   0   0
c NaN NaN
b   1   5
e NaN NaN
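Note that the `join_axes` argument was removed from `pd.concat` in pandas 1.0, so the cell above only runs on older versions. A sketch of one way to get the same row selection on current versions, assuming `reindex` is an acceptable substitute, is to concatenate first and then reindex the result:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s3 = pd.Series([5, 6], index=["f", "g"])
s4 = pd.concat([s1 * 5, s3])

# Concatenate with the default outer join, then pick the rows explicitly;
# labels missing from the result ("c", "e") become rows of NaN
result = pd.concat([s1, s4], axis=1).reindex(["a", "c", "b", "e"])

print(result)
```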

Wednesday, 11 March 2015

Pandas - Concatenating Along an Axis. Getting ready to join a dozen csv files into one
