The merge command can be used with the join options left, right, inner, outer, and so on. You can designate different key columns in different tables, use several key columns in each table, use indexes as keys, and more. The main examples are taken from the book "Python for data analysis".

pandas-docs
Pandas part 1
Pandas part 2
statistical-analysis-python-tutorial
Pandas
Numpy User Guide
Python for data analysis

Combining and Merging Data Sets from Python for data analysis

Data contained in pandas objects can be combined together in a number of built-in ways:
pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
pandas.concat glues or stacks together objects along an axis.
combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

I will address each of these and give a number of examples. They’ll be utilized in examples throughout the rest of the book.
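As a quick orientation, the three operations listed above can be sketched side by side. The frames and series below are hypothetical toy data, not taken from the book, and the snippet is only a minimal illustration of what each call does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, not from the original post
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'c'], 'y': [10, 30]})

# merge: database-style join on the shared 'key' column (inner join by default)
merged = pd.merge(left, right, on='key')

# concat: stack the two frames on top of each other along axis 0
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# combine_first: fill the missing values of s1 with values from s2, aligned by index
s1 = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])
patched = s1.combine_first(s2)

print(merged)
print(stacked)
print(patched)
```

Only the key 'a' exists in both toy frames, so the merge keeps a single row, while concat keeps all four rows and leaves NaN where a column is absent on one side.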

In []:

left        # DataFrame to be merged on the left side
right       # DataFrame to be merged on the right side
how         # One of 'inner', 'outer', 'left' or 'right'. 'inner' by default
on          # Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys
            # given, will use the intersection of the column names in left and right as the join keys
left_on     # Columns in left DataFrame to use as join keys
right_on    # Analogous to left_on for right DataFrame
left_index  # Use row index in left as its join key (or keys, if a MultiIndex)
right_index # Analogous to left_index
sort        # Sort merged data lexicographically by join keys; False by default in current pandas releases. Leaving it
            # disabled gives better performance in some cases on large datasets
suffixes    # Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y'). For
            # example, if 'data' in both DataFrame objects, would appear as 'data_x' and 'data_y' in result
copy        # If False, avoid copying data into resulting data structure in some exceptional cases. By default always copies
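The left_index and right_index arguments from the list above are not demonstrated elsewhere in this post, so here is a minimal sketch. The frames left1 and right1 and their column names are assumptions made up for illustration: one frame keeps the key in a column, the other keeps it in the index.

```python
import pandas as pd

# Hypothetical data: left1 stores the key in a column, right1 stores it in the index
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'c'], 'value': range(4)})
right1 = pd.DataFrame({'group_val': [3.5, 7.0]}, index=['a', 'b'])

# Join the 'key' column of left1 against the index of right1 (inner join by default)
by_index = pd.merge(left1, right1, left_on='key', right_index=True)

print(by_index)
```

The row with key 'c' has no match in the index of right1, so the default inner join drops it.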

In [2]:

%matplotlib inline
import pandas as pd
import numpy as np

In [4]:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 
                 'data1': range(7)})

In [9]:

df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 
                 'data2': range(3)})

In [10]:

print(df1, df2)

(   data1 key
0      0   b
1      1   b
2      2   a
3      3   c
4      4   a
5      5   a
6      6   b,    data2 key
0      0   a
1      1   b
2      2   d)

This is an example of a many-to-one merge situation; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling merge with these objects we obtain:

In [11]:

pd.merge(df1, df2)

Out[11]:

	data1	key	data2
0	0	b	1
1	1	b	1
2	6	b	1
3	2	a	0
4	4	a	0
5	5	a	0

This is a "many-to-one" example: the first table has several rows with the same key (three each for "a" and "b"), while the second table has only one row per key value. The result contains only the rows whose key values appear in both tables (which is why there is no "c"), but it does contain every possible combination. To verify this, consider the following example:

In []:

# Leave df1 unchanged
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 
                 'data1': range(7)})

In [12]:

df2 = pd.DataFrame({'key': ['a', 'b', 'b'], 
                 'data2': range(3)})

Here we make a single change in the line 'key': ['a', 'b', 'b']: the last 'd' is replaced with 'b', which gives a many-to-many case. Here the notion of "all possible combinations" becomes more obvious.

In [13]:

pd.merge(df1, df2)

Out[13]:

	data1	key	data2
0	0	b	1
1	0	b	2
2	1	b	1
3	1	b	2
4	6	b	1
5	6	b	2
6	2	a	0
7	4	a	0
8	5	a	0
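The "all possible combinations" behaviour can be checked numerically: for each key, the result should contain (rows in df1) times (rows in df2) rows. The sketch below recomputes that count and, as an additional assumption beyond the original post, uses the validate argument of merge to confirm that this is no longer a many-to-one join.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'b'],
                    'data2': range(3)})

merged = pd.merge(df1, df2, on='key')

# Expected row count: per key, rows in df1 multiplied by rows in df2.
# Keys present on one side only give NaN after alignment and are dropped.
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()
expected_rows = int((left_counts * right_counts).dropna().sum())

# validate='many_to_one' requires unique keys on the right side;
# 'b' is duplicated in df2, so pandas raises MergeError here.
try:
    pd.merge(df1, df2, on='key', validate='many_to_one')
    is_many_to_one = True
except pd.errors.MergeError:
    is_many_to_one = False

print(len(merged), expected_rows, is_many_to_one)
```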

Note that I didn’t specify which column to join on. If not specified, merge uses the overlapping column names as the keys. It’s a good practice to specify explicitly, though:

In []:

# Explicit is better than implicit
pd.merge(df1, df2, on='key')

If the keys have different names in the two tables, there is no need to rename them; you can specify them explicitly:

In []:

pd.merge(df3, df4, left_on='lkey', right_on='rkey')
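The frames df3 and df4 are not defined anywhere in this post. A self-contained version might look like the sketch below, where the contents of df3 and df4 are an assumption: they simply reuse the earlier data with the key columns renamed to 'lkey' and 'rkey'.

```python
import pandas as pd

# Assumed definitions: the post does not show how df3 and df4 were built
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})

# Match df3['lkey'] against df4['rkey']; both key columns stay in the result
merged = pd.merge(df3, df4, left_on='lkey', right_on='rkey')

print(merged)
```

Because the key columns have different names, the result keeps both 'lkey' and 'rkey', and they hold identical values in every row.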

That was the intersection, but there are other options

You probably noticed that the 'c' and 'd' values and associated data are missing from the result. By default merge does an 'inner' join; the keys in the result are the intersection.
Other possible options are 'left', 'right', and 'outer'.

The outer join takes the union of the keys, combining the effect of applying both left and right joins:

In [19]:

pd.merge(df1, df2, how='outer')

Out[19]:

	data1	key	data2
0	0	b	1
1	0	b	2
2	1	b	1
3	1	b	2
4	6	b	1
5	6	b	2
6	2	a	0
7	4	a	0
8	5	a	0
9	3	c	NaN

We got the union of the tables, which includes the key "c" (last row) that is present in only one of the tables.
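To see which table each row of an outer join came from, merge accepts indicator=True, which adds a '_merge' column. This option is not used in the original post. The sketch below uses the first version of df2 (the one that still contains 'd'), so that unmatched keys appear on both sides.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
# The first version of df2 from this post, with the unmatched key 'd'
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Map every key to the origin reported by the indicator column
origin = dict(zip(outer['key'], outer['_merge'].astype(str)))

print(outer)
print(origin)
```

Here 'c' is reported as left_only and 'd' as right_only, which matches the rows that get NaN in the outer join.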

In [21]:

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data1': range(5)})

In [22]:

pd.merge(df1, df2, on='key', how='left')

Out[22]:

	data1_x	key	data1_y
0	0	b	1
1	0	b	3
2	1	b	1
3	1	b	3
4	2	a	0
5	2	a	2
6	3	c	NaN
7	4	a	0
8	4	a	2
9	5	b	1
10	5	b	3

In [24]:

pd.merge(df1, df2, on='key', how='right')

Out[24]:

	data1_x	key	data1_y
0	0	b	1
1	1	b	1
2	5	b	1
3	0	b	3
4	1	b	3
5	5	b	3
6	2	a	0
7	4	a	0
8	2	a	2
9	4	a	2
10	NaN	d	4

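Many-to-many joins form the Cartesian product of the rows. Since there were 3 'b' rows in the left DataFrame and 2 in the right one, there are 6 'b' rows in the result. The join method only affects the distinct key values appearing in the result. Note that the next cell passes no on argument, and because both frames here share the columns key and data1, merge joins on both of them, so only the rows where both values match survive: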

In [23]:

pd.merge(df1, df2, how='inner')

Out[23]:

	data1	key
0	1	b
1	2	a
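The two-row result above follows from the fact that both frames contain a column named data1, so the implicit join uses both shared columns as keys. A short comparison with an explicit on='key', using the same frames, makes the difference visible:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data1': range(5)})

# No 'on' given: merge joins on every shared column, here both 'key' and 'data1'
implicit = pd.merge(df1, df2, how='inner')

# Explicit key: join on 'key' only; the overlapping 'data1' columns get suffixes
explicit = pd.merge(df1, df2, on='key', how='inner')

print(implicit)
print(explicit)
```

With the explicit key the result has 3 × 2 rows for 'b' plus 2 × 2 rows for 'a', and the two data1 columns appear as data1_x and data1_y.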

In [25]:

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'], 
                 'key2': ['one', 'two', 'one'], 
                 'lval': [1, 2, 3]})

In [26]:

right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'], 
                   'key2': ['one', 'one', 'one', 'two'], 
                   'rval': [4, 5, 6, 7]})

In [27]:

pd.merge(left, right, on=['key1', 'key2'], how='outer')

Out[27]:

	key1	key2	lval	rval
0	foo	one	1	4
1	foo	one	1	5
2	foo	two	2	NaN
3	bar	one	3	6
4	bar	two	NaN	7
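The cells above pass a list of column names to on in order to merge on several keys. One way to think about the result is that each (key1, key2) pair acts as a single combined key. The sketch below rebuilds the same frames and checks, for the outer join, that the pairs in the result are the union of the pairs on both sides.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

outer = pd.merge(left, right, on=['key1', 'key2'], how='outer')

# Treat each (key1, key2) combination as one tuple-valued key
left_pairs = set(zip(left['key1'], left['key2']))
right_pairs = set(zip(right['key1'], right['key2']))
result_pairs = set(zip(outer['key1'], outer['key2']))

print(result_pairs == left_pairs | right_pairs)
```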

A last issue to consider in merge operations is the treatment of overlapping column names. While you can address the overlap manually (see the later section on renaming axis labels), merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects:

In [28]:

pd.merge(left, right, on='key1')

Out[28]:

	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

In [29]:

pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Out[29]:

	key1	key2_left	lval	key2_right	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7
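The manual alternative mentioned above, renaming the overlapping column before merging, could look like the following sketch. The new column name 'right_key2' is an arbitrary choice for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Rename the overlapping column on one side so that no suffixes are needed
right_renamed = right.rename(columns={'key2': 'right_key2'})
manual = pd.merge(left, right_renamed, on='key1')

print(manual)
```

The rows are the same as in the suffixes example; only the column labels differ, with the left column keeping its original name 'key2'.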


2 comments:

SPECULARI, February 9, 2017 at 16:22
Hello! I am learning the basics of Pandas, and the following problem came up: how do I combine two tables, one with 1 rows × 3023 columns and the other with 4 columns × 3023 rows?
First table (info):

MultiIndex: 3023 entries, (2017-01-03 00:00:00+00:00, Equity(45218 [NXTD])) to (2017-01-03 00:00:00+00:00, Equity(50554 [HEBT]))
Data columns (total 4 columns):
exchange_id 3023 non-null category
market_cap 3023 non-null float64
market_cap_yr_ago 2854 non-null float64
security_type 3023 non-null category
dtypes: category(2), float64(2)
memory usage: 77.1+ KB

Second table (info):

DatetimeIndex: 1 entries, 2017-01-03 to 2017-01-03
Columns: 3023 entries, Equity(45218 [NXTD]) to Equity(50554 [HEBT])
dtypes: float64(3023)
memory usage: 23.6 KB
Wednesday, March 11, 2015

How to use merge DataFrames in Pandas (Python)
