Python Pandas - Find difference between two data frames
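I have two data frames, df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) that is the difference between the two?
In other words, a data frame that has all the rows/columns in df1 that are not in df2?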
By using `drop_duplicates`:

```python
pd.concat([df1, df2]).drop_duplicates(keep=False)
```
The method above only works for data frames that contain no duplicate rows themselves. For example:

```python
df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})
```
It produces the output below, which is wrong, because the duplicated row of df1 is dropped as well.
Wrong output:

```
pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]:
   A  B
1  2  3
```
Correct output:

```
Out[656]:
   A  B
1  2  3
2  3  4
3  3  4
```
How to achieve that?
Method 1: using `isin` with `tuple`

```
df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]
Out[657]:
   A  B
1  2  3
2  3  4
3  3  4
```
Method 2: `merge` with `indicator`

```
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
Out[421]:
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
```
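Method 2 leaves the helper column `_merge` in the result. A minimal sketch of wrapping it so that only the original columns come back; the function name `rows_only_in_left` is hypothetical, and de-duplicating the right-hand frame is an added precaution, not part of the answer above:

```python
import pandas as pd

def rows_only_in_left(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `left` that have no exact match in `right` (hypothetical helper)."""
    # De-duplicate `right` so the left merge cannot multiply rows of `left`.
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    # Keep rows found only on the left, then drop the helper column.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

df1 = pd.DataFrame({"A": [1, 2, 3, 3], "B": [2, 3, 4, 4]})
df2 = pd.DataFrame({"A": [1], "B": [2]})
df3 = rows_only_in_left(df1, df2)
print(df3)
```

Duplicate rows inside `df1` survive here, which is the behaviour the question asks for.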
For rows, try the following, where `cols` is the column (or list of columns) to join on:

```python
m = df1.merge(df2, on=cols, how='outer', suffixes=['', '_'], indicator=True)
```
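The merged frame `m` is only an intermediate result; the differing rows are the ones whose `_merge` value is not `both`. A small sketch of that filtering step, assuming made-up sample data with `Name` as the join column:

```python
import pandas as pd

# Hypothetical sample data; 'Name' plays the role of `cols`.
df1 = pd.DataFrame({"Name": ["John", "Mike", "Smith"], "Age": [23, 45, 12]})
df2 = pd.DataFrame({"Name": ["John", "Smith"], "Age": [23, 12]})
cols = ["Name"]

m = df1.merge(df2, on=cols, how="outer", suffixes=("", "_"), indicator=True)

# Rows present only in df1, restricted to df1's own columns.
only_in_df1 = m.loc[m["_merge"] == "left_only", df1.columns]
# Rows present only in df2 (none here, because df2 is a subset of df1).
only_in_df2 = m.loc[m["_merge"] == "right_only"]

print(only_in_df1)
```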
For columns, try this:

```python
set(df1.columns).symmetric_difference(df2.columns)
```
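A short illustration with made-up column names. The symmetric difference returns columns that appear in exactly one of the two frames; `Index.difference` is shown next to it as a one-sided alternative when only the columns missing from `df2` are wanted:

```python
import pandas as pd

# Hypothetical frames that share columns A and B.
df1 = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
df2 = pd.DataFrame({"A": [1], "B": [2], "D": [4]})

# Columns that exist in exactly one of the two frames.
in_one_only = set(df1.columns).symmetric_difference(df2.columns)

# Columns of df1 that df2 does not have.
only_in_df1 = df1.columns.difference(df2.columns)

print(in_one_only, list(only_in_df1))
```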
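Method 1 of the accepted answer will not work for data frames that contain NaN, because NaN does not compare equal to itself. One way around this is to compare string representations instead:

```python
df1[~df1.astype(str).apply(tuple, axis=1).isin(df2.astype(str).apply(tuple, axis=1))]
```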
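A small sketch of the string-based comparison on made-up data containing NaN. After `astype(str)` every NaN becomes the text `'nan'`, so rows with missing values can be matched like any other row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row of df1 also appears in df2 and contains NaN.
df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [2.0, 5.0, np.nan]})
df2 = pd.DataFrame({"A": [np.nan], "B": [5.0]})

# Build one string tuple per row, then test membership row by row.
key1 = df1.astype(str).apply(tuple, axis=1)
key2 = df2.astype(str).apply(tuple, axis=1)
df3 = df1[~key1.isin(key2)]

print(df3)
```

One caveat of this approach: values are compared as text, so `1` in an integer column and `1.0` in a float column count as different.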
Edit 2: I figured out a new solution that does not require setting the index:

```python
newdf = pd.concat([df1, df2]).drop_duplicates(keep=False)
```
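OK, I found that the highest-voted answer already contains what I came up with. And yes, this code can only be used when neither of the two data frames contains duplicates.
I have a tricky method. First we set 'Name' as the index of the two data frames given in the question. Since both data frames share the same 'Name' values, we can drop the index of the "smaller" df from the "bigger" df.
Here is the code.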
```python
df1.set_index('Name', inplace=True)
df2.set_index('Name', inplace=True)
newdf = df1.drop(df2.index)
```
Perhaps a simpler one-liner, which works with identical or different column names. It also works when df2['Name2'] contains duplicate values.

```python
newDf = df1.set_index('Name1').drop(df2['Name2'])
```
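A quick check of the one-liner on made-up data, where the key column has a different name in each frame and `Name2` contains a repeated value:

```python
import pandas as pd

# Hypothetical frames; 'John' appears twice in df2['Name2'].
df1 = pd.DataFrame({"Name1": ["John", "Mike", "Smith"], "Age": [23, 45, 12]})
df2 = pd.DataFrame({"Name2": ["John", "John", "Smith"]})

# Drop every row of df1 whose name is listed in df2.
newDf = df1.set_index("Name1").drop(df2["Name2"])

print(newDf)
```

Note that `drop` raises a `KeyError` if `df2['Name2']` holds a name that is missing from `df1`; passing `errors="ignore"` is one way to tolerate that case.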
```python
import pandas as pd

# given
df1 = pd.DataFrame({'Name': ['John', 'Mike', 'Smith', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
                    'Age': [23, 45, 12, 34, 27, 44, 28, 39, 40]})
df2 = pd.DataFrame({'Name': ['John', 'Smith', 'Wale', 'Tom', 'Menda', 'Yuswa'],
                    'Age': [23, 12, 34, 44, 28, 40]})

# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)

# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)

# df1
#    Age   Name
# 0   23   John
# 1   45   Mike
# 2   12  Smith
# 3   34   Wale
# 4   27  Marry
# 5   44    Tom
# 6   28  Menda
# 7   39   Bolt
# 8   40  Yuswa

# df2
#    Age   Name
# 0   23   John
# 1   12  Smith
# 2   34   Wale
# 3   44    Tom
# 4   28  Menda
# 5   40  Yuswa

# df_1notin2
#    Age   Name
# 0   45   Mike
# 1   27  Marry
# 2   39   Bolt
```
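One thing to watch with the column-by-column `isin` test above: it checks `Name` and `Age` independently, so a row of df1 counts as "in df2" whenever its name and its age each occur somewhere in df2, even in different rows. A small sketch on made-up data that shows the effect, with a merge-based row-wise check for comparison:

```python
import pandas as pd

# Hypothetical data: every Name and every Age of df1 occurs in df2,
# but never together in the same row.
df1 = pd.DataFrame({"Name": ["John", "Mike"], "Age": [23, 45]})
df2 = pd.DataFrame({"Name": ["John", "Mike"], "Age": [45, 23]})

# Column-by-column test: reports no difference at all.
per_column = df1[~(df1["Name"].isin(df2["Name"]) & df1["Age"].isin(df2["Age"]))]

# Row-wise test: match on both columns at once.
merged = df1.merge(df2.drop_duplicates(), on=["Name", "Age"], how="left", indicator=True)
row_wise = merged.loc[merged["_merge"] == "left_only", ["Name", "Age"]]

print(len(per_column), len(row_wise))
```

On the data in the answer above the two checks agree, because no name or age is shared between different people there.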
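In addition to the accepted answer, I would like to propose a more general solution that can find the 2D set difference of two data frames with any index and columns (they need not coincide between the two frames). It compares values with `np.isclose`, so a tolerance can be set for float elements:

```python
import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame,
                            df_old: pd.DataFrame,
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""
    # Align both frames on the union of their row and column labels
    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)

    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)

    # True wherever the two aligned frames differ beyond the tolerance
    mask_diff = ~np.isclose(new, old, rtol, atol)

    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)

    # One row per differing cell, with the new and old value side by side
    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)

    df_diff.columns = ["New", "Old"]

    return df_diff
```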
Example:

In [1]

```python
df1 = pd.DataFrame({'A': [2, 1, 2], 'C': [2, 1, 2]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [1, 1]})

print("df1:\n", df1, "\n")

print("df2:\n", df2, "\n")

diff = get_dataframe_setdiff2d(df1, df2)

print("diff:\n", diff, "\n")
```

Out [1]

```
df1:
   A  C
0  2  2
1  1  1
2  2  2

df2:
   A  B
0  1  1
1  1  1

diff:
     New  Old
0 A  2.0  1.0
  B  NaN  1.0
  C  2.0  NaN
1 B  NaN  1.0
  C  1.0  NaN
2 A  2.0  NaN
  C  2.0  NaN
```
Finding the difference by index. This assumes df2 is a subset of df1 and that the index labels are carried over when the subset is taken:

```
import numpy as np
import pandas as pd

# General form (an Index is used as the indexer, since recent pandas versions reject a set)
df1.loc[df1.index.symmetric_difference(df2.index)].dropna()

# Example
df1 = pd.DataFrame({"gender": np.random.choice(['m', 'f'], size=5),
                    "subject": np.random.choice(["bio", "phy", "chem"], size=5)},
                   index=[1, 2, 3, 4, 5])

df2 = df1.loc[[1, 3, 5]]

df1

  gender subject
1      f     bio
2      m    chem
3      f     phy
4      m     bio
5      f     bio

df2

  gender subject
1      f     bio
3      f     phy
5      f     bio

df3 = df1.loc[df1.index.symmetric_difference(df2.index)].dropna()

df3

  gender subject
2      m    chem
4      m     bio
```
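Because df2 is a subset of df1 in this example, a one-sided `Index.difference` selects the same rows and needs no `dropna`. A minimal sketch with fixed values in place of the random ones, so the result is reproducible:

```python
import pandas as pd

# Fixed sample values (the answer above draws them at random).
df1 = pd.DataFrame({"gender": ["f", "m", "f", "m", "f"],
                    "subject": ["bio", "chem", "phy", "bio", "bio"]},
                   index=[1, 2, 3, 4, 5])
df2 = df1.loc[[1, 3, 5]]

# Index labels present in df1 but not in df2.
df3 = df1.loc[df1.index.difference(df2.index)]

print(df3)
```

This relies on the index labels identifying rows; if df2 has been re-indexed since it was cut from df1, a value-based method such as the merge approach is the safer choice.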
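A slight variation of @liangli's solution that does not require changing the index of the existing data frames:

```python
# Drop the rows of df1 whose 'Name' also appears in df2 (isin-based form; both frames keep their index)
newdf = df1.drop(df1.index[df1['Name'].isin(df2['Name'])])
```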