Pandas數(shù)據(jù)合并

imelee 2017-03-04

展開全文

merage

內(nèi)連接

左外連接

右外連接

全外連接

示例

join

concat

merage

pandas提供了一個類似于關系數(shù)據(jù)庫的連接(join)操作的方法merage,可以根據(jù)一個或多個鍵將不同DataFrame中的行連接起來

語法如下merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)

參數(shù)說明：

left與right：兩個不同的DataFrame

how：指的是合并(連接)的方式有inner(內(nèi)連接),left(左外連接),right(右外連接),outer(全外連接);默認為inner

on : 指的是用于連接的列索引名稱。必須存在右右兩個DataFrame對象中，如果沒有指定且其他參數(shù)也未指定則以兩個DataFrame的列名交集做為連接鍵

left_on：左則DataFrame中用作連接鍵的列名;這個參數(shù)中左右列名不相同，但代表的含義相同時非常有用。

right_on：右則DataFrame中用作連接鍵的列名

left_index：使用左則DataFrame中的行索引做為連接鍵

right_index：使用右則DataFrame中的行索引做為連接鍵

sort：默認為True，將合并的數(shù)據(jù)進行排序。在大多數(shù)情況下設置為False可以提高性能

suffixes：字符串值組成的元組，用于指定當左右DataFrame存在相同列名時在列名后面附加的后綴名稱，默認為('_x','_y')

copy：默認為True,總是將數(shù)據(jù)復制到數(shù)據(jù)結(jié)構(gòu)中；大多數(shù)情況下設置為False可以提高性能

indicator：在 0.17.0中還增加了一個顯示合并數(shù)據(jù)中來源情況；如只來自己于左邊(left_only)、兩者(both)

內(nèi)連接

符合連接條件和查詢條件的數(shù)據(jù)行，相當于數(shù)據(jù)庫中的jion,示例SQL語句 SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;

或SELECT * FROM df1,df2 where df1.key=df2.key

對應pandas語句pd.merge(df1, df2, on='key')

左外連接

符合連接條件和查詢條件的數(shù)據(jù)行并返回左表中不符合連接條件單符合查詢條件的數(shù)據(jù)行，相當于數(shù)據(jù)庫中的left outer join,示例SQL語句 -- show all records from df1 SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;

對應pandas語句pd.merge(df1, df2, on='key', how='left')

右外連接

符合連接條件和查詢條件的數(shù)據(jù)行并返回右表中不符合連接條件單符合查詢條件的數(shù)據(jù)行，相當于數(shù)據(jù)庫中的right outer join,示例SQL語句 -- show all records from df2 SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;

對應pandas語句pd.merge(df1, df2, on='key', how='right')

全外連接

符合連接條件和查詢條件的數(shù)據(jù)行并返回左表和左表中不符合連接條件單符合查詢條件的數(shù)據(jù)行。全外連接相當于左外連接與左外連接的合集(去掉重復),相當于數(shù)據(jù)庫中的full outer join,示例SQL語句 -- show all records from both tables SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;

對應pandas語句pd.merge(df1, df2, on='key', how='outer')

更詳細的參見官方網(wǎng)站

示例#coding=utf-8 from pandas import Series,DataFrame,merge import numpy as np data=DataFrame([{"id":0,"name":'lxh',"age":20,"cp":'lm'},{"id":1,"name":'xiao',"age":40,"cp":'ly'},{"id":2,"name":'hua',"age":4,"cp":'yry'},{"id":3,"name":'be',"age":70,"cp":'old'}]) data1=DataFrame([{"id":100,"name":'lxh','cs':10},{"id":101,"name":'xiao','cs':40},{"id":102,"name":'hua2','cs':50}]) data2=DataFrame([{"id":0,"name":'lxh','cs':10},{"id":101,"name":'xiao','cs':40},{"id":102,"name":'hua2','cs':50}]) print "單個列名做為內(nèi)鏈接的連接鍵\r\n",merge(data,data1,on="name",suffixes=('_a','_b')) print "多列名做為內(nèi)鏈接的連接鍵\r\n",merge(data,data2,on=("name","id")) print '不指定on則以兩個DataFrame的列名交集做為連接鍵\r\n',merge(data,data2) #這里使用了id與name #使用右邊的DataFrame的行索引做為連接鍵 ##設置行索引名稱 indexed_data1=data1.set_index("name") print "使用右邊的DataFrame的行索引做為連接鍵\r\n",merge(data,indexed_data1,left_on='name',right_index=True) print '左外連接\r\n',merge(data,data1,on="name",how="left",suffixes=('_a','_b')) print '左外連接1\r\n',merge(data1,data,on="name",how="left") print '右外連接\r\n',merge(data,data1,on="name",how="right") data3=DataFrame([{"mid":0,"mname":'lxh','cs':10},{"mid":101,"mname":'xiao','cs':40},{"mid":102,"mname":'hua2','cs':50}]) #當左右兩個DataFrame的列名不同，當又想做為連接鍵時可以使用left_on與right_on來指定連接鍵 print "使用left_on與right_on來指定列名字不同的連接鍵\r\n",merge(data,data3,left_on=["name","id"],right_on=["mname","mid"])

輸出單個列名做為內(nèi)鏈接的連接鍵 age cp id_a name cs id_b 0 20 lm 0 lxh 10 100 1 40 ly 1 xiao 40 101 多列名做為內(nèi)鏈接的連接鍵 age cp id name cs 0 20 lm 0 lxh 10 不指定on則以兩個DataFrame的列名交集做為連接鍵 age cp id name cs 0 20 lm 0 lxh 10 使用右邊的DataFrame的行索引做為連接鍵 age cp id_x name cs id_y 0 20 lm 0 lxh 10 100 1 40 ly 1 xiao 40 101 左外連接 age cp id_a name cs id_b 0 20 lm 0 lxh 10 100 1 40 ly 1 xiao 40 101 2 4 yry 2 hua NaN NaN 3 70 old 3 be NaN NaN 左外連接1 cs id_x name age cp id_y 0 10 100 lxh 20 lm 0 1 40 101 xiao 40 ly 1 2 50 102 hua2 NaN NaN NaN 右外連接 age cp id_x name cs id_y 0 20 lm 0 lxh 10 100 1 40 ly 1 xiao 40 101 2 NaN NaN NaN hua2 50 102 使用left_on與right_on來指定列名字不同的連接鍵 age cp id name cs mid mname 0 20 lm 0 lxh 10 0 lxh

join

join方法提供了一個簡便的方法用于將兩個DataFrame中的不同的列索引合并成為一個DataFramejoin(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False):

其中參數(shù)的意義與merge方法基本相同,只是join方法默認為左外連接how=left

示例#coding=utf-8 from pandas import Series,DataFrame,merge data=DataFrame([{"id":0,"name":'lxh',"age":20,"cp":'lm'},{"id":1,"name":'xiao',"age":40,"cp":'ly'},{"id":2,"name":'hua',"age":4,"cp":'yry'},{"id":3,"name":'be',"age":70,"cp":'old'}],index=['a','b','c','d']) data1=DataFrame([{"sex":0},{"sex":1},{"sex":2}],index=['a','b','e']) print '使用默認的左連接\r\n',data.join(data1) #這里可以看出自動屏蔽了data中沒有的index=e 那一行的數(shù)據(jù) print '使用右連接\r\n',data.join(data1,how="right") #這里出自動屏蔽了data1中沒有index=c,d的那行數(shù)據(jù)；等價于data1.join(data) print '使用內(nèi)連接\r\n',data.join(data1,how='inner') print '使用全外連接\r\n',data.join(data1,how='outer')

輸出使用默認的左連接 age cp id name sex a 20 lm 0 lxh 0 b 40 ly 1 xiao 1 c 4 yry 2 hua NaN d 70 old 3 be NaN 使用右連接 age cp id name sex a 20 lm 0 lxh 0 b 40 ly 1 xiao 1 e NaN NaN NaN NaN 2 使用內(nèi)連接 age cp id name sex a 20 lm 0 lxh 0 b 40 ly 1 xiao 1 使用全外連接 age cp id name sex a 20 lm 0 lxh 0 b 40 ly 1 xiao 1 c 4 yry 2 hua NaN d 70 old 3 be NaN e NaN NaN NaN NaN 2

concat

concat方法相當于數(shù)據(jù)庫中的全連接(UNION ALL),可以指定按某個軸進行連接,也可以指定連接的方式join(outer,inner 只有這兩種)。與數(shù)據(jù)庫不同的時concat不會去重，要達到去重的效果可以使用drop_duplicates方法 concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True):

參數(shù)說明

Parameters---------- objs : a sequence or mapping of Series, DataFrame, or Panel objects If a dict is passed, the sorted keys will be used as the `keys` argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised axis : {0, 1, ...}, default 0 The axis to concatenate along join : {'inner', 'outer'}, default 'outer' How to handle indexes on other axis(es) join_axes : list of Index objects Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic verify_integrity : boolean, default False Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation keys : sequence, default None If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level levels : list of sequences, default None Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys names : list, default None Names for the levels in the resulting hierarchical index ignore_index : boolean, default False If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the the index values on the other axes are still respected in the join. copy : boolean, default True If False, do not copy data unnecessarily

示例#coding=utf-8 from pandas import Series,DataFrame,concat df1 = DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'], 'rank': range(1, 4)}) df2 = DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'], 'rank': [1, 4, 5]}) print '按軸進行內(nèi)連接\r\n',concat([df1,df2],join="inner",axis=1) print '進行外連接并指定keys(行索引)\r\n',concat([df1,df2],keys=['a','b']) #這里有重復的數(shù)據(jù) print '去重后\r\n',concat([df1,df2],ignore_index=True).drop_duplicates()

輸出

```python

按軸進行內(nèi)連接

city rank city rank

0 Chicago 1 Chicago 1

1 San Francisco 2 Boston 4

2 New York City 3 Los Angeles 5

進行外連接并指定keys(行索引)

city rank

a 0 Chicago 1

1 San Francisco 2

2 New York City 3