Outputting differences in two Pandas dataframes side by side - highlighting the difference

Question

I am trying to highlight exactly what changed between two dataframes.

Suppose I have two Python Pandas dataframes:

    "StudentRoster Jan-1":
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.11                     False                Graduated
    113  Zoe    4.12                     True

    "StudentRoster Jan-2":
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.21                     False                Graduated
    113  Zoe    4.12                     False                On vacation

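My goal is to output an HTML table that:

- Identifies rows that have changed (could be int, float, boolean, string)
- Outputs rows with the same, OLD and NEW values (ideally into an HTML table) so the consumer can clearly see what changed between the two dataframes:

```
"StudentRoster Difference Jan-1 - Jan-2":
id   Name   score                    isEnrolled           Comment
112  Nick   was 1.11| now 1.21       False                Graduated
113  Zoe    4.12                     was True | now False was "" | now "On vacation"
```

I suppose I could do a row-by-row and column-by-column comparison, but is there an easier way?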

Andy Hayden · Accepted Answer

The first part is similar to Constantine's answer: you can get a boolean Series marking which rows differ*:

    In [21]: ne = (df1 != df2).any(axis=1)

    In [22]: ne
    Out[22]:
    0    False
    1     True
    2     True
    dtype: bool

Then we can see which entries have changed:

    In [23]: ne_stacked = (df1 != df2).stack()

    In [24]: changed = ne_stacked[ne_stacked]

    In [25]: changed.index.names = ['id', 'col']

    In [26]: changed
    Out[26]:
    id  col
    1   score         True
    2   isEnrolled    True
        Comment       True
    dtype: bool

Here the first entry is the index and the second is the column that has changed.

    In [27]: difference_locations = np.where(df1 != df2)

    In [28]: changed_from = df1.values[difference_locations]

    In [29]: changed_to = df2.values[difference_locations]

    In [30]: pd.DataFrame({'from': changed_from, 'to': changed_to}, index=changed.index)
    Out[30]:
                   from           to
    id col
    1  score       1.11         1.21
    2  isEnrolled  True        False
       Comment     None  On vacation

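* Note: it is important that df1 and df2 share the same index here. To overcome this ambiguity, you can make sure you only look at the shared labels using df1.index & df2.index (in pandas 2.0 and later, df1.index.intersection(df2.index), because & no longer performs a set intersection on an Index), but I'll leave that as an exercise.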
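The exercise mentioned in the note could be sketched roughly as follows. This is only an illustration under stated assumptions: `changed_cells` is a hypothetical helper name, both frames are assumed to have the same columns in the same order and unique index labels, and the extra row for "Amy" is invented so that the two indexes differ.

```python
import numpy as np
import pandas as pd


def changed_cells(df1, df2):
    """Return a from/to table for cells that differ on shared row labels.

    Assumes both frames have the same columns in the same order and
    unique index labels (hypothetical helper, not part of pandas).
    """
    # Compare only the row labels that are present in both frames.
    shared = df1.index.intersection(df2.index)
    left, right = df1.loc[shared], df2.loc[shared]
    # Row/column positions of every cell that differs.
    rows, cols = np.where((left != right).to_numpy())
    index = pd.MultiIndex.from_arrays(
        [shared[rows], left.columns[cols]], names=["id", "col"]
    )
    return pd.DataFrame(
        {"from": left.values[rows, cols], "to": right.values[rows, cols]},
        index=index,
    )


columns = ["Name", "score", "isEnrolled", "Comment"]
jan1 = pd.DataFrame(
    [["Jack", 2.17, True, "He was late to class"],
     ["Nick", 1.11, False, "Graduated"],
     ["Zoe", 4.12, True, ""]],
    columns=columns, index=pd.Index([111, 112, 113], name="id"),
)
# Row 114 ("Amy") is made up so that the indexes are not identical.
jan2 = pd.DataFrame(
    [["Jack", 2.17, True, "He was late to class"],
     ["Nick", 1.21, False, "Graduated"],
     ["Zoe", 4.12, False, "On vacation"],
     ["Amy", 3.30, True, ""]],
    columns=columns, index=pd.Index([111, 112, 113, 114], name="id"),
)

result = changed_cells(jan1, jan2)
print(result)
```

Row 114 is ignored because it exists only in the second frame. Note that a plain `!=` still reports NaN versus NaN as a change, so missing values need the extra masking shown in a later answer.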

Ted Petrou · Answer

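Highlighting the difference between two DataFrames

It is possible to use the DataFrame style property to highlight the background color of the cells where there is a difference.

Using the example data from the original question

The first step is to concatenate the DataFrames horizontally with the concat function and distinguish each frame with the keys parameter:

    df_all = pd.concat([df.set_index('id'), df2.set_index('id')],
                       axis='columns', keys=['First', 'Second'])
    df_all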

It is probably easier to swap the column levels and put the same column names next to each other:

    df_final = df_all.swaplevel(axis='columns')[df.columns[1:]]
    df_final

Now it is much easier to spot the differences between the frames. But we can go further and use the style property to highlight the cells that are different. We define a custom function to do this, as described in this part of the documentation.

    def highlight_diff(data, color='yellow'):
        attr = 'background-color: {}'.format(color)
        other = data.xs('First', axis='columns', level=-1)
        return pd.DataFrame(np.where(data.ne(other, level=0), attr, ''),
                            index=data.index, columns=data.columns)

    df_final.style.apply(highlight_diff, axis=None)

This will also highlight cells where both values are missing. You can either fill them or provide extra logic so that they are not highlighted.
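One possible form of that "extra logic" is sketched below. It is an assumption-laden variant, not the answer's own code: `highlight_diff_skip_nan` is a hypothetical name, the column keys 'First' and 'Second' are taken from the concat step above, and the score for id 113 is set to NaN in both frames purely to exercise the both-missing case.

```python
import numpy as np
import pandas as pd


def highlight_diff_skip_nan(data, color="yellow"):
    """Build a frame of CSS strings marking 'Second' cells that differ.

    Assumes two column levels: the original column name on level 0 and
    the frame key ('First' / 'Second') on level 1.
    """
    attr = "background-color: {}".format(color)
    styles = pd.DataFrame("", index=data.index, columns=data.columns)
    for name in data.columns.get_level_values(0).unique():
        first = data[(name, "First")]
        second = data[(name, "Second")]
        # NaN != NaN is True, so drop cells where both sides are missing.
        changed = first.ne(second) & ~(first.isna() & second.isna())
        styles.loc[changed, (name, "Second")] = attr
    return styles


df = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, np.nan],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", None],
})
df2 = pd.DataFrame({
    "id": [111, 112, 113],
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, np.nan],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
})

df_all = pd.concat([df.set_index("id"), df2.set_index("id")],
                   axis="columns", keys=["First", "Second"])
df_final = df_all.swaplevel(axis="columns")[df.columns[1:]]

styles = highlight_diff_skip_nan(df_final)
print(styles)
```

In a notebook the function can be passed to the styler in the same way as before, `df_final.style.apply(highlight_diff_skip_nan, axis=None)`. Unlike the original, this sketch colors only the 'Second' cell of each differing pair.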

James Owers · Answer

This answer simply extends @Andy Hayden's, making it resilient when numeric fields are NaN, and wraps it up into a function.

    import pandas as pd
    import numpy as np


    def diff_pd(df1, df2):
        """Identify differences between two pandas DataFrames"""
        assert (df1.columns == df2.columns).all(), \
            "DataFrame column names are different"
        if any(df1.dtypes != df2.dtypes):
            # Data types are different, trying to convert
            df2 = df2.astype(df1.dtypes)
        if df1.equals(df2):
            return None
        else:
            # need to account for np.nan != np.nan returning True
            diff_mask = (df1 != df2) & ~(df1.isnull() & df2.isnull())
            ne_stacked = diff_mask.stack()
            changed = ne_stacked[ne_stacked]
            changed.index.names = ['id', 'col']
            difference_locations = np.where(diff_mask)
            changed_from = df1.values[difference_locations]
            changed_to = df2.values[difference_locations]
            return pd.DataFrame({'from': changed_from, 'to': changed_to},
                                index=changed.index)

So with your data (slightly edited to have a NaN in the score column):

    import sys
    if sys.version_info[0] < 3:
        from StringIO import StringIO
    else:
        from io import StringIO

    DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 "He was late to class"
    112  Nick   1.11                     False                "Graduated"
    113  Zoe    NaN                      True                 " "
    """)
    DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 "He was late to class"
    112  Nick   1.21                     False                "Graduated"
    113  Zoe    NaN                      False                "On vacation"
    """)
    # raw string so that \s is not treated as a string escape
    df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
    df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
    diff_pd(df1, df2)

Output:

                    from           to
    id  col
    112 score       1.11         1.21
    113 isEnrolled  True        False
        Comment           On vacation
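For comparison, pandas 1.1 and later also provide `DataFrame.compare`, which does not report NaNs in matching positions as differences. The sketch below applies it to the same roster data; the variable names are illustrative, and both frames are assumed to be identically labeled, which `compare` requires.

```python
import numpy as np
import pandas as pd

index = pd.Index([111, 112, 113], name="id")
roster_jan1 = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, np.nan],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", " "],
}, index=index)
roster_jan2 = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, np.nan],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
}, index=index)

# Only rows and columns containing at least one difference are kept;
# the result has two column levels: (column name, 'self' / 'other').
diff = roster_jan1.compare(roster_jan2)
print(diff)
```

Row 111 and the Name column drop out because nothing changed there, and cells that are equal within a retained row show up as NaN.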

journois · Answer

I faced this issue, but found an answer before finding this post:

Based on unutbu's answer, load your data...

    import pandas as pd
    import io

    texts = ['''\
    id   Name   score                    isEnrolled       Date
    111  Jack                            True             2013-05-01 12:00:00
    112  Nick   1.11                     False            2013-05-12 15:05:23
         Zoe    4.12                     True
    ''',

             '''\
    id   Name   score                    isEnrolled       Date
    111  Jack   2.17                     True             2013-05-01 12:00:00
    112  Nick   1.21                     False
         Zoe    4.12                     False            2013-05-01 12:00:00''']

    df1 = pd.read_fwf(io.BytesIO(texts[0]), widths=[5,7,25,17,20], parse_dates=[4])
    df2 = pd.read_fwf(io.BytesIO(texts[1]), widths=[5,7,25,17,20], parse_dates=[4])

... define your diff function...

    def report_diff(x):
        return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then you can simply use a Panel to conclude:

    # Python 2 and pandas < 0.25 only: pd.Panel was removed in pandas 0.25
    my_panel = pd.Panel(dict(df1=df1,df2=df2))
    print my_panel.apply(report_diff, axis=0)

    #          id  Name        score    isEnrolled                       Date
    #0        111  Jack   nan | 2.17          True        2013-05-01 12:00:00
    #1        112  Nick  1.11 | 1.21         False  2013-05-12 15:05:23 | NaT
    #2  nan | nan   Zoe         4.12  True | False  NaT | 2013-05-01 12:00:00

By the way, if you are in IPython Notebook, you may like to use a colored diff function that picks a color depending on whether the cells are different, equal, or null on the left or right:

    from IPython.display import HTML
    pd.options.display.max_colwidth = 500  # You need this, otherwise pandas
    #                          will limit your HTML strings to 50 characters

    def report_diff(x):
        if x[0]==x[1]:
            return unicode(x[0].__str__())
        elif pd.isnull(x[0]) and pd.isnull(x[1]):
            return u'<table style="background-color:#00ff00;font-weight:bold;">'+\
                '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', 'nan')
        elif pd.isnull(x[0]) and not pd.isnull(x[1]):
            return u'<table style="background-color:#ffff00;font-weight:bold;">'+\
                '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % ('nan', x[1])
        elif not pd.isnull(x[0]) and pd.isnull(x[1]):
            return u'<table style="background-color:#0000ff;font-weight:bold;">'+\
                '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0],'nan')
        else:
            return u'<table style="background-color:#ff0000;font-weight:bold;">'+\
                '<tr><td>%s</td></tr><tr><td>%s</td></tr></table>' % (x[0], x[1])

    HTML(my_panel.apply(report_diff, axis=0).to_html(escape=False))

unutbu · Answer

    import pandas as pd
    import io

    texts = ['''\
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.11                     False                Graduated
    113  Zoe    4.12                     True
    ''',

             '''\
    id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 He was late to class
    112  Nick   1.21                     False                Graduated
    113  Zoe    4.12                     False                On vacation''']

    # io.StringIO, because texts holds str (io.BytesIO only accepts bytes on Python 3)
    df1 = pd.read_fwf(io.StringIO(texts[0]), widths=[5,7,25,21,20])
    df2 = pd.read_fwf(io.StringIO(texts[1]), widths=[5,7,25,21,20])
    df = pd.concat([df1,df2])

    print(df)
    #     id  Name  score isEnrolled               Comment
    # 0  111  Jack   2.17       True  He was late to class
    # 1  112  Nick   1.11      False             Graduated
    # 2  113   Zoe   4.12       True                   NaN
    # 0  111  Jack   2.17       True  He was late to class
    # 1  112  Nick   1.21      False             Graduated
    # 2  113   Zoe   4.12      False           On vacation

    df.set_index(['id', 'Name'], inplace=True)
    print(df)
    #           score isEnrolled               Comment
    # id  Name
    # 111 Jack   2.17       True  He was late to class
    # 112 Nick   1.11      False             Graduated
    # 113 Zoe    4.12       True                   NaN
    # 111 Jack   2.17       True  He was late to class
    # 112 Nick   1.21      False             Graduated
    # 113 Zoe    4.12      False           On vacation

    def report_diff(x):
        # .iloc makes the positional lookup explicit
        return x.iloc[0] if x.iloc[0] == x.iloc[1] else '{} | {}'.format(*x)

    changes = df.groupby(level=['id', 'Name']).agg(report_diff)
    print(changes)

prints

                    score    isEnrolled               Comment
    id  Name
    111 Jack         2.17          True  He was late to class
    112 Nick  1.11 | 1.21         False             Graduated
    113 Zoe          4.12  True | False     nan | On vacation
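Since the question asks for an HTML table with "was ... | now ..." cells, the same idea can be pushed one step further. The following is a minimal sketch rather than a definitive implementation: `diff_to_html` is a hypothetical helper, and both frames are assumed to share the same unique index and the same columns.

```python
import pandas as pd


def diff_to_html(old, new):
    """Render only the changed rows as HTML, marking changed cells.

    Assumes `old` and `new` have identical, unique index labels and
    identical columns (hypothetical helper, not part of pandas).
    """
    # A cell counts as changed unless both sides are missing.
    changed = (old != new) & ~(old.isna() & new.isna())
    # Start from the new values and overwrite only the changed cells.
    report = new.astype(object)
    for col in old.columns:
        for idx in changed.index[changed[col].to_numpy()]:
            report.at[idx, col] = "was {} | now {}".format(
                old.at[idx, col], new.at[idx, col]
            )
    return report.loc[changed.any(axis=1)].to_html()


index = pd.Index([111, 112, 113], name="id")
old = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.11, 4.12],
    "isEnrolled": [True, False, True],
    "Comment": ["He was late to class", "Graduated", ""],
}, index=index)
new = pd.DataFrame({
    "Name": ["Jack", "Nick", "Zoe"],
    "score": [2.17, 1.21, 4.12],
    "isEnrolled": [True, False, False],
    "Comment": ["He was late to class", "Graduated", "On vacation"],
}, index=index)

html = diff_to_html(old, new)
print(html)
```

Unchanged rows (here id 111) are dropped, while unchanged cells in a changed row keep their current value.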

cge · Answer

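If your two dataframes have the same ids in them, then finding out what changed is actually pretty easy. Just doing frame1 != frame2 will give you a boolean DataFrame where each True marks data that has changed. From that, you can easily get the index of each changed row by doing changedids = frame1.index[np.any(frame1 != frame2,axis=1)].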
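As a small illustration of that one-liner, here is a sketch with made-up frames that share the same ids; the data mirrors the question's roster, and the variable names are only illustrative.

```python
import pandas as pd

ids = pd.Index([111, 112, 113], name="id")
frame1 = pd.DataFrame(
    {"score": [2.17, 1.11, 4.12], "isEnrolled": [True, False, True]}, index=ids
)
frame2 = pd.DataFrame(
    {"score": [2.17, 1.21, 4.12], "isEnrolled": [True, False, False]}, index=ids
)

# Boolean frame: True wherever a cell differs between the two frames.
changed_mask = frame1 != frame2

# Index labels of the rows that contain at least one changed cell.
changed_ids = frame1.index[changed_mask.any(axis=1).to_numpy()]
print(list(changed_ids))  # [112, 113]
```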

jur · Answer

Another approach using concat and drop_duplicates:

    import sys
    if sys.version_info[0] < 3:
        from StringIO import StringIO
    else:
        from io import StringIO
    import pandas as pd

    DF1 = StringIO("""id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 "He was late to class"
    112  Nick   1.11                     False                "Graduated"
    113  Zoe    NaN                      True                 " "
    """)
    DF2 = StringIO("""id   Name   score                    isEnrolled           Comment
    111  Jack   2.17                     True                 "He was late to class"
    112  Nick   1.21                     False                "Graduated"
    113  Zoe    NaN                      False                "On vacation"
    """)
    # raw string so that \s is not treated as a string escape
    df1 = pd.read_table(DF1, sep=r'\s+', index_col='id')
    df2 = pd.read_table(DF2, sep=r'\s+', index_col='id')
    #%%
    dictionary = {1:df1,2:df2}
    df=pd.concat(dictionary)
    df.drop_duplicates(keep=False)

Output:

           Name  score isEnrolled      Comment
      id
    1 112  Nick   1.11      False    Graduated
      113   Zoe    NaN       True
    2 112  Nick   1.21      False    Graduated
      113   Zoe    NaN      False  On vacation

Aaron N. Brock · Answer

After fiddling around with @journois's answer, I was able to get it to work using MultiIndex instead of Panel, because of Panel's deprecation.

First, create some dummy data:

    df1 = pd.DataFrame({
        'id': ['111', '222', '333', '444', '555'],
        'let': ['a', 'b', 'c', 'd', 'e'],
        'num': ['1', '2', '3', '4', '5']
    })
    df2 = pd.DataFrame({
        'id': ['111', '222', '333', '444', '666'],
        'let': ['a', 'b', 'c', 'D', 'f'],
        'num': ['1', '2', 'Three', '4', '6'],
    })

Then, define your diff function. In this case I'll use report_diff from his answer unchanged:

    def report_diff(x):
        return x[0] if x[0] == x[1] else '{} | {}'.format(*x)

Then, concatenate the data into a MultiIndex dataframe:

    df_all = pd.concat(
        [df1.set_index('id'), df2.set_index('id')],
        axis='columns',
        keys=['df1', 'df2'],
        join='outer'
    )
    # named df_final because the next step refers to it by that name
    df_final = df_all.swaplevel(axis='columns')[df1.columns[1:]]

And finally, apply report_diff down each column group:

    df_final.groupby(level=0, axis=1).apply(lambda frame: frame.apply(report_diff, axis=1))

This outputs:

             let        num
    111        a          1
    222        b          2
    333        c  3 | Three
    444    d | D          4
    555  e | nan    5 | nan
    666  nan | f    nan | 6

And that is all!
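Grouping along columns with `groupby(..., axis=1)` is deprecated in recent pandas releases, so here is a sketch of the same steps that loops over the top-level column names instead. It is one possible adaptation, not the answer's own code: `report_diff` is rewritten to look the two values up by the keys 'df1' and 'df2' rather than by position.

```python
import pandas as pd


def report_diff(pair):
    """Return the shared value, or 'old | new' when the two differ."""
    old, new = pair["df1"], pair["df2"]
    return old if old == new else "{} | {}".format(old, new)


df1 = pd.DataFrame({
    "id": ["111", "222", "333", "444", "555"],
    "let": ["a", "b", "c", "d", "e"],
    "num": ["1", "2", "3", "4", "5"],
})
df2 = pd.DataFrame({
    "id": ["111", "222", "333", "444", "666"],
    "let": ["a", "b", "c", "D", "f"],
    "num": ["1", "2", "Three", "4", "6"],
})

df_all = pd.concat(
    [df1.set_index("id"), df2.set_index("id")],
    axis="columns", keys=["df1", "df2"], join="outer",
)
df_final = df_all.swaplevel(axis="columns")[df1.columns[1:]]

# df_final[name] is a two-column frame ('df1', 'df2') for each column name.
column_names = df_final.columns.get_level_values(0).unique()
result = pd.DataFrame(
    {name: df_final[name].apply(report_diff, axis=1) for name in column_names}
)
print(result)
```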

Hubbitus · Answer

Extending @cge's answer to make the result more readable:

    a[a != b][np.any(a != b, axis=1)].join(
        DataFrame('a<->b', index=a.index, columns=['a<=>b'])
    ).join(
        b[a != b][np.any(a != b, axis=1)],
        rsuffix='_b', how='outer'
    ).fillna('')

Full demonstration example:

    import numpy as np
    from pandas import DataFrame

    a = DataFrame(np.random.randn(7,3), columns=list('ABC'))
    b = a.copy()
    b.iloc[0,2] = np.nan
    b.iloc[1,0] = 7
    b.iloc[3,1] = 77
    b.iloc[4,2] = 777

    a[a != b][np.any(a != b, axis=1)].join(
        DataFrame('a<->b', index=a.index, columns=['a<=>b'])
    ).join(
        b[a != b][np.any(a != b, axis=1)],
        rsuffix='_b', how='outer'
    ).fillna('')

Aziz Alto · Answer

Here is another way, using select and merge:

    In [6]: # first lets create some dummy dataframes with some column(s) different
       ...: df1 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': range(20,25)})
       ...: df2 = pd.DataFrame({'a': range(-5,0), 'b': range(10,15), 'c': [20] + list(range(101,105))})

    In [7]: df1
    Out[7]:
       a   b   c
    0 -5  10  20
    1 -4  11  21
    2 -3  12  22
    3 -2  13  23
    4 -1  14  24

    In [8]: df2
    Out[8]:
       a   b    c
    0 -5  10   20
    1 -4  11  101
    2 -3  12  102
    3 -2  13  103
    4 -1  14  104

    In [10]: # make condition over the columns you want to compare
        ...: condition = df1['c'] != df2['c']
        ...:
        ...: # select rows from each dataframe where the condition holds
        ...: diff1 = df1[condition]
        ...: diff2 = df2[condition]

    In [11]: # merge the selected rows (dataframes) with some suffixes (optional)
        ...: diff1.merge(diff2, on=['a','b'], suffixes=('_before', '_after'))
    Out[11]:
       a   b  c_before  c_after
    0 -4  11        21      101
    1 -3  12        22      102
    2 -2  13        23      103
    3 -1  14        24      104


Mehmet Öner Yalçın · Answer

Below I implement a function that finds the asymmetrical difference between two data frames (based on set difference for pandas). Gist: https://Gist.github.com/oneryalcin/68cf25f536a25e65f0b3c84f9c118e

    def diff_df(df1, df2, how="left"):
        """
        Find Difference of rows for given two dataframes
        this function is not symmetric, means
            diff(x, y) != diff(y, x)
        however
            diff(x, y, how='left') == diff(y, x, how='right')

        Ref: https://stackoverflow.com/questions/18180763/set-difference-for-pandas/40209800#40209800
        """
        if (df1.columns != df2.columns).any():
            raise ValueError("Two dataframe columns must match")

        if df1.equals(df2):
            return None
        elif how == 'right':
            return pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
        elif how == 'left':
            return pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
        else:
            raise ValueError('how parameter supports only "left" or "right" keywords')

Example:

    df1 = pd.DataFrame(d1)
    Out[1]:
                    Comment  Name  isEnrolled  score
    0  He was late to class  Jack        True   2.17
    1             Graduated  Nick       False   1.11
    2                         Zoe        True   4.12

    df2 = pd.DataFrame(d2)
    Out[2]:
                    Comment  Name  isEnrolled  score
    0  He was late to class  Jack        True   2.17
    1           On vacation   Zoe        True   4.12

    diff_df(df1, df2)
    Out[3]:
         Comment  Name  isEnrolled  score
    1  Graduated  Nick       False   1.11
    2              Zoe        True   4.12

    diff_df(df2, df1)
    Out[4]:
           Comment Name  isEnrolled  score
    1  On vacation  Zoe        True   4.12

    # This gives the same result as above
    diff_df(df1, df2, how='right')
    Out[22]:
           Comment Name  isEnrolled  score
    1  On vacation  Zoe        True   4.12
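A related way to get the same kind of one-sided row difference is an outer merge with `indicator=True`. This is a different technique from the concat and drop_duplicates trick above, shown only as a sketch: `rows_only_in` is a hypothetical helper, and `d1` and `d2` are reconstructed from the printed frames, since the answer does not show them.

```python
import pandas as pd


def rows_only_in(df1, df2, side="left_only"):
    """Return rows found in only one frame ('left_only' or 'right_only').

    Merges on all shared columns; assumes both frames have the same
    columns (hypothetical helper, not part of pandas).
    """
    merged = df1.merge(df2, how="outer", indicator=True)
    return merged.loc[merged["_merge"] == side].drop(columns="_merge")


# Reconstructed from the output above; the original d1/d2 are not shown.
d1 = {
    "Comment": ["He was late to class", "Graduated", ""],
    "Name": ["Jack", "Nick", "Zoe"],
    "isEnrolled": [True, False, True],
    "score": [2.17, 1.11, 4.12],
}
d2 = {
    "Comment": ["He was late to class", "On vacation"],
    "Name": ["Jack", "Zoe"],
    "isEnrolled": [True, True],
    "score": [2.17, 4.12],
}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

left_only = rows_only_in(df1, df2, side="left_only")
right_only = rows_only_in(df1, df2, side="right_only")
print(left_only)
print(right_only)
```

The merge builds a fresh index, so the original row labels are not preserved in the result.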