19-Pandas merge合并操作

发布于:2024-07-03 ⋅ 阅读:(13) ⋅ 点赞:(0)

Pandas merge合并操作(主键合并数据)

Pandas 提供的 merge() 函数能够进行高效的合并操作,这与 SQL 关系型数据库的 join用法非常相似。从字面意思上不难理解,merge 翻译为“合并”,指的是将两个 DataFrame 数据表按照指定的规则进行连接,最后拼接成一个新的 DataFrame 数据表。

merge() 函数的法格式如下:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,left_index=False, right_index=False, sort=True,suffixes=('_x', '_y'), copy=True)

参数说明,如下表所示:

参数名称 说明
left/right 两个不同的 DataFrame 对象。
on 指定用于连接的键(即列标签的名字),该键必须同时存在于左右两个 DataFrame 中,如果没有指定,并且其他参数也未指定, 那么将会以两个 DataFrame 的列名交集做为连接键。
left_on 指定左侧 DataFrame 中作连接键的列名。该参数在左、右列标签名不相同,但表达的含义相同时非常有用。
right_on 指定左侧 DataFrame 中作连接键的列名。
left_index 布尔参数,默认为 False。如果为 True 则使用左侧 DataFrame 的行索引作为连接键,若 DataFrame 具有多层 索引(MultiIndex),则层的数量必须与连接键的数量相等。
right_index 布尔参数,默认为 False。如果为 True 则使用左侧 DataFrame 的行索引作为连接键。
how 要执行的合并类型,从 {‘left’, ‘right’, ‘outer’, ‘inner’} 中取值,默认为“inner”内连接。
sort 布尔值参数,默认为True,它会将合并后的数据进行排序;若设置为 False,则按照 how 给定的参数值进行排序。
suffixes 字符串组成的元组。当左右 DataFrame 存在相同列名时,通过该参数可以在相同的列名后附加后缀名,默认为(‘_x’,‘_y’)。
copy 默认为 True,表示对数据进行复制。

注意:Pandas 库的 merge() 支持各种内外连接,与其相似的还有 join() 函数(默认为左连接)。

准备两个不同的DataFrame,这里从文件中读取:

文件empdata.csv:

EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,SMITH,CLERK,7902.0,1980-12-17,800,,20
7499,ALLEN,SALESMAN,7698.0,1981-02-20,1600,300.0,30
7521,WARD,SALESMAN,7698.0,1981-02-22,1250,500.0,30
7566,JONES,MANAGER,7839.0,1981-04-02,2975,,20
7654,MARTIN,SALESMAN,7698.0,1981-09-28,1250,1400.0,30
7698,BLAKE,MANAGER,7839.0,1981-05-01,2850,,30
7782,CLARK,MANAGER,7839.0,1981-06-09,2450,,10
7788,SCOTT,ANALYST,7566.0,1987-04-19,3000,,20
7839,KING,PRESIDENT,,1981-11-17,5000,,10
7844,TURNER,SALESMAN,7698.0,1981-09-08,1500,0.0,30
7876,ADAMS,CLERK,7788.0,1987-05-23,1100,,20
7900,JAMES,CLERK,7698.0,1981-12-03,950,,30
7902,FORD,ANALYST,7566.0,1981-12-03,3000,,20
7934,MILLER,CLERK,7782.0,1982-01-23,1300,,10

文件deptdata.csv:

DEPTNO,DNAME,LOC
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

下面创建两个不同的 DataFrame,然后对它们进行合并操作:

import pandas as pd
import numpy as np

df_emp  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')
df_dept  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv')

print("员工表中的df数据:\n",df_emp)
print("部门表中的df数据:\n",df_dept)

输出如下:

员工表中的df数据:
     EMPNO   ENAME        JOB     MGR    HIREDATE   SAL    COMM  DEPTNO
0    7369   SMITH      CLERK  7902.0  1980-12-17   800     NaN      20
1    7499   ALLEN   SALESMAN  7698.0  1981-02-20  1600   300.0      30
2    7521    WARD   SALESMAN  7698.0  1981-02-22  1250   500.0      30
3    7566   JONES    MANAGER  7839.0  1981-04-02  2975     NaN      20
4    7654  MARTIN   SALESMAN  7698.0  1981-09-28  1250  1400.0      30
5    7698   BLAKE    MANAGER  7839.0  1981-05-01  2850     NaN      30
6    7782   CLARK    MANAGER  7839.0  1981-06-09  2450     NaN      10
7    7788   SCOTT    ANALYST  7566.0  1987-04-19  3000     NaN      20
8    7839    KING  PRESIDENT     NaN  1981-11-17  5000     NaN      10
9    7844  TURNER   SALESMAN  7698.0  1981-09-08  1500     0.0      30
10   7876   ADAMS      CLERK  7788.0  1987-05-23  1100     NaN      20
11   7900   JAMES      CLERK  7698.0  1981-12-03   950     NaN      30
12   7902    FORD    ANALYST  7566.0  1981-12-03  3000     NaN      20
13   7934  MILLER      CLERK  7782.0  1982-01-23  1300     NaN      10
部门表中的df数据:
    DEPTNO       DNAME       LOC
0      10  ACCOUNTING  NEW YORK
1      20    RESEARCH    DALLAS
2      30       SALES   CHICAGO
3      40  OPERATIONS    BOSTON
1) 在单个键上进行合并操作

通过 on 参数指定一个连接键,然后对上述 DataFrame 进行合并操作:

import pandas as pd
import numpy as np

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv')
# 通过on指定合并的主键
print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO"))

输出结果:

    EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0    7782   CLARK    MANAGER  7839.0  ...     NaN      10  ACCOUNTING  NEW YORK
1    7839    KING  PRESIDENT     NaN  ...     NaN      10  ACCOUNTING  NEW YORK
2    7934  MILLER      CLERK  7782.0  ...     NaN      10  ACCOUNTING  NEW YORK
3    7369   SMITH      CLERK  7902.0  ...     NaN      20    RESEARCH    DALLAS
4    7566   JONES    MANAGER  7839.0  ...     NaN      20    RESEARCH    DALLAS
5    7788   SCOTT    ANALYST  7566.0  ...     NaN      20    RESEARCH    DALLAS
6    7876   ADAMS      CLERK  7788.0  ...     NaN      20    RESEARCH    DALLAS
7    7902    FORD    ANALYST  7566.0  ...     NaN      20    RESEARCH    DALLAS
8    7499   ALLEN   SALESMAN  7698.0  ...   300.0      30       SALES   CHICAGO
9    7521    WARD   SALESMAN  7698.0  ...   500.0      30       SALES   CHICAGO
10   7654  MARTIN   SALESMAN  7698.0  ...  1400.0      30       SALES   CHICAGO
11   7698   BLAKE    MANAGER  7839.0  ...     NaN      30       SALES   CHICAGO
12   7844  TURNER   SALESMAN  7698.0  ...     0.0      30       SALES   CHICAGO
13   7900   JAMES      CLERK  7698.0  ...     NaN      30       SALES   CHICAGO

[14 rows x 10 columns]
2) 在多个键上进行合并操作

下面示例,指定多个键来合并上述两个 DataFrame 对象:(这里使用自定义数据了)

import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4],
   'Name': ['Smith', 'Maiki', 'Hunter', 'Hilen'],
   'subject_id':['sub1','sub2','sub4','sub6']})
right = pd.DataFrame({
    'id':[1,2,3,4],
   'Name': ['Bill', 'Lucy', 'Jack', 'Mike'],
   'subject_id':['sub2','sub4','sub3','sub6']})
print(pd.merge(left,right,on=['id','subject_id']))

输出结果:

   id Name_x subject_id Name_y
0   4  Hilen       sub6   Mike

使用how参数合并

通过how参数可以确定 DataFrame 中要包含哪些键,如果在左表、右表都不存的键,那么合并后该键对应的值为 NaN。为了便于大家学习,我们将 how 参数和与其等价的 SQL 语句做了总结:

Merge方法 等效 SQL 描述
left LEFT OUTER JOIN 使用左侧对象的key
right RIGHT OUTER JOIN 使用右侧对象的key
outer FULL OUTER JOIN 使用左右两侧所有key的并集
inner INNER JOIN 使用左右两侧key的交集
1) left join
import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv')
# 为了演示,这里将雇员7369的部门删除了,运行如下:
print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="left"))

输出结果:能发现7369的DEPTNO,DNAME,LOC都为NaN

    EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0    7369   SMITH      CLERK  7902.0  ...     NaN     NaN         NaN       NaN
1    7499   ALLEN   SALESMAN  7698.0  ...   300.0    30.0       SALES   CHICAGO
2    7521    WARD   SALESMAN  7698.0  ...   500.0    30.0       SALES   CHICAGO
3    7566   JONES    MANAGER  7839.0  ...     NaN    20.0    RESEARCH    DALLAS
4    7654  MARTIN   SALESMAN  7698.0  ...  1400.0    30.0       SALES   CHICAGO
5    7698   BLAKE    MANAGER  7839.0  ...     NaN    30.0       SALES   CHICAGO
6    7782   CLARK    MANAGER  7839.0  ...     NaN    10.0  ACCOUNTING  NEW YORK
7    7788   SCOTT    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
8    7839    KING  PRESIDENT     NaN  ...     NaN    10.0  ACCOUNTING  NEW YORK
9    7844  TURNER   SALESMAN  7698.0  ...     0.0    30.0       SALES   CHICAGO
10   7876   ADAMS      CLERK  7788.0  ...     NaN    20.0    RESEARCH    DALLAS
11   7900   JAMES      CLERK  7698.0  ...     NaN    30.0       SALES   CHICAGO
12   7902    FORD    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
13   7934  MILLER      CLERK  7782.0  ...     NaN    10.0  ACCOUNTING  NEW YORK

[14 rows x 10 columns]
2) right join
import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv')
# 部门40因为没有员工,所以对应员工的信息全部为NaN
print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="right"))

输出结果:(部门40因为没有员工,所以对应员工的信息全部为NaN)

     EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0   7782.0   CLARK    MANAGER  7839.0  ...     NaN      10  ACCOUNTING  NEW YORK
1   7839.0    KING  PRESIDENT     NaN  ...     NaN      10  ACCOUNTING  NEW YORK
2   7934.0  MILLER      CLERK  7782.0  ...     NaN      10  ACCOUNTING  NEW YORK
3   7369.0   SMITH      CLERK  7902.0  ...     NaN      20    RESEARCH    DALLAS
4   7566.0   JONES    MANAGER  7839.0  ...     NaN      20    RESEARCH    DALLAS
5   7788.0   SCOTT    ANALYST  7566.0  ...     NaN      20    RESEARCH    DALLAS
6   7876.0   ADAMS      CLERK  7788.0  ...     NaN      20    RESEARCH    DALLAS
7   7902.0    FORD    ANALYST  7566.0  ...     NaN      20    RESEARCH    DALLAS
8   7499.0   ALLEN   SALESMAN  7698.0  ...   300.0      30       SALES   CHICAGO
9   7521.0    WARD   SALESMAN  7698.0  ...   500.0      30       SALES   CHICAGO
10  7654.0  MARTIN   SALESMAN  7698.0  ...  1400.0      30       SALES   CHICAGO
11  7698.0   BLAKE    MANAGER  7839.0  ...     NaN      30       SALES   CHICAGO
12  7844.0  TURNER   SALESMAN  7698.0  ...     0.0      30       SALES   CHICAGO
13  7900.0   JAMES      CLERK  7698.0  ...     NaN      30       SALES   CHICAGO
14     NaN     NaN        NaN     NaN  ...     NaN      40  OPERATIONS    BOSTON

[15 rows x 10 columns]
3) outer join(并集)
import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv')
# 部门40因为没有员工,所以对应员工的信息全部为NaN,又因为7369没对应的部门,所以部门信息全部为Nan
print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="outer"))

输出结果:

     EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0   7369.0   SMITH      CLERK  7902.0  ...     NaN     NaN         NaN       NaN
1   7499.0   ALLEN   SALESMAN  7698.0  ...   300.0    30.0       SALES   CHICAGO
2   7521.0    WARD   SALESMAN  7698.0  ...   500.0    30.0       SALES   CHICAGO
3   7654.0  MARTIN   SALESMAN  7698.0  ...  1400.0    30.0       SALES   CHICAGO
4   7698.0   BLAKE    MANAGER  7839.0  ...     NaN    30.0       SALES   CHICAGO
5   7844.0  TURNER   SALESMAN  7698.0  ...     0.0    30.0       SALES   CHICAGO
6   7900.0   JAMES      CLERK  7698.0  ...     NaN    30.0       SALES   CHICAGO
7   7566.0   JONES    MANAGER  7839.0  ...     NaN    20.0    RESEARCH    DALLAS
8   7788.0   SCOTT    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
9   7876.0   ADAMS      CLERK  7788.0  ...     NaN    20.0    RESEARCH    DALLAS
10  7902.0    FORD    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
11  7782.0   CLARK    MANAGER  7839.0  ...     NaN    10.0  ACCOUNTING  NEW YORK
12  7839.0    KING  PRESIDENT     NaN  ...     NaN    10.0  ACCOUNTING  NEW YORK
13  7934.0  MILLER      CLERK  7782.0  ...     NaN    10.0  ACCOUNTING  NEW YORK
14     NaN     NaN        NaN     NaN  ...     NaN    40.0  OPERATIONS    BOSTON

[15 rows x 10 columns]
4) inner join(交集)
import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv')
# 部门40因为没有员工,,又因为7369没对应的部门,所以部门40和员工7369没显示
print(pd.merge(df_emp_left,df_dept_right,on="DEPTNO",how="inner"))

输出结果:

    EMPNO   ENAME        JOB     MGR  ...    COMM  DEPTNO       DNAME       LOC
0    7499   ALLEN   SALESMAN  7698.0  ...   300.0    30.0       SALES   CHICAGO
1    7521    WARD   SALESMAN  7698.0  ...   500.0    30.0       SALES   CHICAGO
2    7654  MARTIN   SALESMAN  7698.0  ...  1400.0    30.0       SALES   CHICAGO
3    7698   BLAKE    MANAGER  7839.0  ...     NaN    30.0       SALES   CHICAGO
4    7844  TURNER   SALESMAN  7698.0  ...     0.0    30.0       SALES   CHICAGO
5    7900   JAMES      CLERK  7698.0  ...     NaN    30.0       SALES   CHICAGO
6    7566   JONES    MANAGER  7839.0  ...     NaN    20.0    RESEARCH    DALLAS
7    7788   SCOTT    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
8    7876   ADAMS      CLERK  7788.0  ...     NaN    20.0    RESEARCH    DALLAS
9    7902    FORD    ANALYST  7566.0  ...     NaN    20.0    RESEARCH    DALLAS
10   7782   CLARK    MANAGER  7839.0  ...     NaN    10.0  ACCOUNTING  NEW YORK
11   7839    KING  PRESIDENT     NaN  ...     NaN    10.0  ACCOUNTING  NEW YORK
12   7934  MILLER      CLERK  7782.0  ...     NaN    10.0  ACCOUNTING  NEW YORK

注意:当 a 与 b 进行内连操作时 a.join(b) 不等于 b.join(a)。

5)使用join函数

如果左右两个表的主键名相同,可以使用join函数。

import pandas as pd

df_emp_left  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\empdata.csv')
df_dept_right  = pd.read_csv('C:\\Users\\qwy\Desktop\data\\deptdata.csv')
print(df_emp_left.join(df_dept_right,how="outer",rsuffix='1'))

运行结果:

    EMPNO   ENAME        JOB     MGR  ... DEPTNO  DEPTNO1       DNAME       LOC
0    7369   SMITH      CLERK  7902.0  ...    NaN     10.0  ACCOUNTING  NEW YORK
1    7499   ALLEN   SALESMAN  7698.0  ...   30.0     20.0    RESEARCH    DALLAS
2    7521    WARD   SALESMAN  7698.0  ...   30.0     30.0       SALES   CHICAGO
3    7566   JONES    MANAGER  7839.0  ...   20.0     40.0  OPERATIONS    BOSTON
4    7654  MARTIN   SALESMAN  7698.0  ...   30.0      NaN         NaN       NaN
5    7698   BLAKE    MANAGER  7839.0  ...   30.0      NaN         NaN       NaN
6    7782   CLARK    MANAGER  7839.0  ...   10.0      NaN         NaN       NaN
7    7788   SCOTT    ANALYST  7566.0  ...   20.0      NaN         NaN       NaN
8    7839    KING  PRESIDENT     NaN  ...   10.0      NaN         NaN       NaN
9    7844  TURNER   SALESMAN  7698.0  ...   30.0      NaN         NaN       NaN
10   7876   ADAMS      CLERK  7788.0  ...   20.0      NaN         NaN       NaN
11   7900   JAMES      CLERK  7698.0  ...   30.0      NaN         NaN       NaN
12   7902    FORD    ANALYST  7566.0  ...   20.0      NaN         NaN       NaN
13   7934  MILLER      CLERK  7782.0  ...   10.0      NaN         NaN       NaN

网站公告

今日签到

点亮在社区的每一天
去签到