Aggregation in pandas

Question

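How do I perform aggregation with pandas?
No DataFrame after aggregation! What happened?
How do I aggregate mainly string columns (to lists, tuples, strings with a separator)?
How do I aggregate counts?
How do I create a new column filled with aggregated values?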

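I have seen these recurring questions about various aspects of the pandas aggregation functionality. Most of the information about aggregation and its use cases is currently fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.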

This Q&A is meant to be the next installment in a series of helpful user guides.

Please note that this post is not a replacement for the documentation on aggregation and on groupby.

jezrael · Accepted Answer

Question 1

How do I perform aggregation with pandas?

See the expanded aggregation documentation.

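Aggregating functions are the ones that reduce the dimension of the returned object. This means the output Series/DataFrame has the same number of rows as the original or fewer. Some common aggregating functions are tabulated below: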

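| Function | Description |
|---|---|
| mean() | Compute mean of groups |
| sum() | Compute sum of group values |
| size() | Compute group sizes |
| count() | Compute count of group |
| std() | Standard deviation of groups |
| var() | Compute variance of groups |
| sem() | Standard error of the mean of groups |
| describe() | Generate descriptive statistics |
| first() | Compute first of group values |
| last() | Compute last of group values |
| nth() | Take nth value, or a subset if n is a list |
| min() | Compute min of group values |
| max() | Compute max of group values |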

    import numpy as np
    import pandas as pd

    np.random.seed(123)

    df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one'],
                       'C' : np.random.randint(5, size=6),
                       'D' : np.random.randint(5, size=6),
                       'E' : np.random.randint(5, size=6)})
    print(df)
         A      B  C  D  E
    0  foo    one  2  3  0
    1  foo    two  4  1  0
    2  bar  three  2  1  1
    3  foo    two  1  0  3
    4  bar    two  3  1  4
    5  foo    one  2  1  0

Aggregation by filtered columns and Cython-implemented functions:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
    print(df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5

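If no column is selected after groupby, the aggregate function is applied to every column that is not a grouping key, here every column except A and B: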

    df2 = df.groupby(['A', 'B'], as_index=False).sum()
    print(df2)
         A      B  C  D  E
    0  bar  three  2  1  1
    1  bar    two  3  1  4
    2  foo    one  4  4  0
    3  foo    two  5  1  3

You can also aggregate only some of the columns by passing them as a list after the groupby call:

    # Select several columns with a list; the tuple form ['C', 'D'] fails on recent pandas
    df3 = df.groupby(['A', 'B'], as_index=False)[['C', 'D']].sum()
    print(df3)
         A      B  C  D
    0  bar  three  2  1
    1  bar    two  3  1
    2  foo    one  4  4
    3  foo    two  5  1

You get the same results with the function DataFrameGroupBy.agg:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].agg('sum')
    print(df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5

    df2 = df.groupby(['A', 'B'], as_index=False).agg('sum')
    print(df2)
         A      B  C  D  E
    0  bar  three  2  1  1
    1  bar    two  3  1  4
    2  foo    one  4  4  0
    3  foo    two  5  1  3

To apply multiple functions to one column, use a list of tuples, each holding the name of the new column and the aggregate function:

    df4 = (df.groupby(['A', 'B'])['C']
             .agg([('average','mean'),('total','sum')])
             .reset_index())
    print(df4)
         A      B  average  total
    0  bar  three      2.0      2
    1  bar    two      3.0      3
    2  foo    one      2.0      4
    3  foo    two      2.5      5

If you want to apply multiple functions to all columns, pass the same list of tuples without selecting a column:

    df5 = (df.groupby(['A', 'B'])
             .agg([('average','mean'),('total','sum')]))
    print(df5)
                    C             D             E
              average total average total average total
    A   B
    bar three     2.0     2     1.0     1     1.0     1
        two       3.0     3     1.0     1     4.0     4
    foo one       2.0     4     2.0     4     0.0     0
        two       2.5     5     0.5     1     1.5     3

The result then has a MultiIndex in the columns:

    print(df5.columns)
    MultiIndex(levels=[['C', 'D', 'E'], ['average', 'total']],
               labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

To convert it to ordinary columns, flatten the MultiIndex using map with join:

    df5.columns = df5.columns.map('_'.join)
    df5 = df5.reset_index()
    print(df5)
         A      B  C_average  C_total  D_average  D_total  E_average  E_total
    0  bar  three        2.0        2        1.0        1        1.0        1
    1  bar    two        3.0        3        1.0        1        4.0        4
    2  foo    one        2.0        4        2.0        4        0.0        0
    3  foo    two        2.5        5        0.5        1        1.5        3
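The `'_'.join` call assumes every level of the column labels is a string. As a minimal sketch of the same flattening written out by hand, the example below formats each (column, function) pair with an f-string, which also works for non-string labels. The sample values are copied from the frame printed earlier, and the name `wide` is only illustrative.

```python
import pandas as pd

# Sample values copied from the frame printed earlier in this answer
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one'],
                   'C': [2, 4, 2, 1, 3, 2],
                   'D': [3, 1, 1, 0, 1, 1]})

wide = df.groupby(['A', 'B']).agg(['mean', 'sum'])

# Each column label is a tuple such as ('C', 'mean'); build one flat name per tuple
wide.columns = [f'{col}_{func}' for col, func in wide.columns]
wide = wide.reset_index()

print(wide.columns.tolist())
```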

Another solution is to pass a list of aggregate functions, flatten the MultiIndex, and then use str.replace to get the other column names:

    df5 = df.groupby(['A', 'B']).agg(['mean','sum'])

    df5.columns = (df5.columns.map('_'.join)
                      .str.replace('sum','total')
                      .str.replace('mean','average'))
    df5 = df5.reset_index()
    print(df5)
         A      B  C_average  C_total  D_average  D_total  E_average  E_total
    0  bar  three        2.0        2        1.0        1        1.0        1
    1  bar    two        3.0        3        1.0        1        4.0        4
    2  foo    one        2.0        4        2.0        4        0.0        0
    3  foo    two        2.5        5        0.5        1        1.5        3

If you want to specify the aggregate function for each column separately, pass a dictionary:

    df6 = (df.groupby(['A', 'B'], as_index=False)
             .agg({'C':'sum','D':'mean'})
             .rename(columns={'C':'C_total', 'D':'D_average'}))
    print(df6)
         A      B  C_total  D_average
    0  bar  three        2        1.0
    1  bar    two        3        1.0
    2  foo    one        4        2.0
    3  foo    two        5        0.5
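The dictionary plus rename step can also be written in one call with named aggregation, which pandas has supported since version 0.25: each keyword is the output column name and each value is a (column, function) pair. The sketch below rebuilds the same result; the sample values are copied from the frame printed earlier and the name `out` is illustrative.

```python
import pandas as pd

# Sample values copied from the frame printed earlier in this answer
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one'],
                   'C': [2, 4, 2, 1, 3, 2],
                   'D': [3, 1, 1, 0, 1, 1]})

# keyword = new column name, value = (source column, aggregate function)
out = (df.groupby(['A', 'B'], as_index=False)
         .agg(C_total=('C', 'sum'), D_average=('D', 'mean')))
print(out)
```

This avoids both the MultiIndex in the columns and the separate rename.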

You can pass a custom function too:

    def func(x):
        # x is the Series of one group: add its first and last value
        return x.iat[0] + x.iat[-1]

    df7 = (df.groupby(['A', 'B'], as_index=False)
             .agg({'C':'sum','D': func})
             .rename(columns={'C':'C_total', 'D':'D_sum_first_and_last'}))
    print(df7)
         A      B  C_total  D_sum_first_and_last
    0  bar  three        2                     2
    1  bar    two        3                     2
    2  foo    one        4                     4
    3  foo    two        5                     1
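A custom aggregation function receives the values of one group as a Series and has to return a single value. As a small additional sketch, the hypothetical helper `value_range` below computes the spread of D within each group of A; the helper and the output column name `D_range` are not part of the original answer.

```python
import pandas as pd

# Sample values copied from the frame printed earlier in this answer
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'D': [3, 1, 1, 0, 1, 1]})

def value_range(s: pd.Series) -> int:
    """Hypothetical helper: difference between the largest and smallest value of one group."""
    return s.max() - s.min()

# agg calls value_range once per group and collects the scalar results
out = df.groupby('A')['D'].agg(value_range).reset_index(name='D_range')
print(out)
```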

Question 2

No DataFrame after aggregation! What happened?

Aggregation by two or more columns:

    df1 = df.groupby(['A', 'B'])['C'].sum()
    print(df1)
    A    B
    bar  three    2
         two      3
    foo  one      4
         two      5
    Name: C, dtype: int32

First check the Index and the type of the pandas object:

    print(df1.index)
    MultiIndex(levels=[['bar', 'foo'], ['one', 'three', 'two']],
               labels=[[0, 0, 1, 1], [1, 2, 0, 2]],
               names=['A', 'B'])

    print(type(df1))
    <class 'pandas.core.series.Series'>

There are two ways to turn a Series with a MultiIndex into a DataFrame with ordinary columns:

Add the parameter as_index=False:

    df1 = df.groupby(['A', 'B'], as_index=False)['C'].sum()
    print(df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5

Use Series.reset_index:

    df1 = df.groupby(['A', 'B'])['C'].sum().reset_index()
    print(df1)
         A      B  C
    0  bar  three  2
    1  bar    two  3
    2  foo    one  4
    3  foo    two  5
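To make the difference concrete, the sketch below builds the result all three ways and prints the type of each, then compares the two DataFrame variants. The sample values are copied from the frame printed earlier; the variable names are illustrative.

```python
import pandas as pd

# Sample values copied from the frame printed earlier in this answer
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one'],
                   'C': [2, 4, 2, 1, 3, 2]})

as_series = df.groupby(['A', 'B'])['C'].sum()                   # Series with a MultiIndex
via_param = df.groupby(['A', 'B'], as_index=False)['C'].sum()   # DataFrame, keys as columns
via_reset = as_series.reset_index()                             # DataFrame, keys as columns

print(type(as_series).__name__, type(via_param).__name__, type(via_reset).__name__)
print(via_param.equals(via_reset))
```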

If you group by one column:

    df2 = df.groupby('A')['C'].sum()
    print(df2)
    A
    bar    5
    foo    9
    Name: C, dtype: int32

... you get a Series with an Index:

    print(df2.index)
    Index(['bar', 'foo'], dtype='object', name='A')

    print(type(df2))
    <class 'pandas.core.series.Series'>

The solution is the same as for the Series with a MultiIndex:

    df2 = df.groupby('A', as_index=False)['C'].sum()
    print(df2)
         A  C
    0  bar  5
    1  foo  9

    df2 = df.groupby('A')['C'].sum().reset_index()
    print(df2)
         A  C
    0  bar  5
    1  foo  9

Question 3

How do I aggregate mainly string columns (to `list`s, `tuple`s, `strings with separator`)?

    df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C' : ['three', 'one', 'two', 'two', 'three','two', 'one'],
                       'D' : [1,2,3,2,3,1,2]})
    print(df)
       A      B      C  D
    0  a    one  three  1
    1  c    two    one  2
    2  b  three    two  3
    3  b    two    two  2
    4  a    two  three  3
    5  c    one    two  1
    6  b  three    one  2

Instead of an aggregation function, you can pass list, tuple, or set to convert the column:

    df1 = df.groupby('A')['B'].agg(list).reset_index()
    print(df1)
       A                    B
    0  a           [one, two]
    1  b  [three, two, three]
    2  c           [two, one]
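The same pattern with `tuple` and `set` looks as follows. This is a short sketch using columns A and B of the frame above; note that `set` drops duplicates within a group and has no defined order, so the printed order of its elements may vary.

```python
import pandas as pd

# Columns A and B copied from the frame printed above
df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one', 'three']})

as_tuples = df.groupby('A')['B'].agg(tuple).reset_index()  # keeps order and duplicates
as_sets = df.groupby('A')['B'].agg(set).reset_index()      # unique values only, unordered

print(as_tuples)
print(as_sets)
```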

An alternative is to use GroupBy.apply:

    df1 = df.groupby('A')['B'].apply(list).reset_index()
    print(df1)
       A                    B
    0  a           [one, two]
    1  b  [three, two, three]
    2  c           [two, one]

To convert to strings with a separator, use .join, which works only if the column holds strings:

    df2 = df.groupby('A')['B'].agg(','.join).reset_index()
    print(df2)
       A                B
    0  a          one,two
    1  b  three,two,three
    2  c          two,one

If it is a numeric column, use a lambda function with astype to convert the values to strings first:

    df3 = (df.groupby('A')['D']
             .agg(lambda x: ','.join(x.astype(str)))
             .reset_index())
    print(df3)
       A      D
    0  a    1,3
    1  b  3,2,2
    2  c    2,1

Another solution is to convert to strings before the groupby:

    df3 = (df.assign(D = df['D'].astype(str))
             .groupby('A')['D']
             .agg(','.join).reset_index())
    print(df3)
       A      D
    0  a    1,3
    1  b  3,2,2
    2  c    2,1

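To convert all columns, do not pass a list of columns after groupby. Column D is missing from the result because of the automatic exclusion of "nuisance" columns, which here means every numeric column is dropped. This behavior belongs to older pandas releases; to my knowledge, pandas 2.0 and later raise a TypeError instead of dropping such columns silently.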

    df4 = df.groupby('A').agg(','.join).reset_index()
    print(df4)
       A                B            C
    0  a          one,two  three,three
    1  b  three,two,three  two,two,one
    2  c          two,one      one,two

So to get all columns, it is necessary to convert every column to strings inside the aggregation:

    df5 = (df.groupby('A')
             .agg(lambda x: ','.join(x.astype(str)))
             .reset_index())
    print(df5)
       A                B            C      D
    0  a          one,two  three,three    1,3
    1  b  three,two,three  two,two,one  3,2,2
    2  c          two,one      one,two    2,1
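If only the string columns are wanted, one way to avoid depending on version-specific handling of numeric columns is to select the string columns explicitly before aggregating. The sketch below hard-codes the column list `['B', 'C']` as an assumption about which columns hold strings.

```python
import pandas as pd

# Frame copied from the example printed above
df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'B': ['one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C': ['three', 'one', 'two', 'two', 'three', 'two', 'one'],
                   'D': [1, 2, 3, 2, 3, 1, 2]})

# Assumption: B and C are the string columns; the numeric column D is left out on purpose
string_cols = ['B', 'C']
out = df.groupby('A')[string_cols].agg(','.join).reset_index()
print(out)
```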

Question 4

How do I aggregate counts?

    df = pd.DataFrame({'A' : ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one', 'three'],
                       'C' : ['three', np.nan, np.nan, 'two', 'three','two', 'one'],
                       'D' : [np.nan,2,3,2,3,np.nan,2]})
    print(df)
       A      B      C    D
    0  a    one  three  NaN
    1  c    two    NaN  2.0
    2  b  three    NaN  3.0
    3  b    two    two  2.0
    4  a    two  three  3.0
    5  c    one    two  NaN
    6  b  three    one  2.0

The function GroupBy.size returns the size of each group:

    df1 = df.groupby('A').size().reset_index(name='COUNT')
    print(df1)
       A  COUNT
    0  a      2
    1  b      3
    2  c      2

The function GroupBy.count excludes missing values:

    df2 = df.groupby('A')['C'].count().reset_index(name='COUNT')
    print(df2)
       A  COUNT
    0  a      2
    1  b      2
    2  c      1

Use the same function without selecting a column to count the non-missing values of several columns at once:

    df3 = df.groupby('A').count().add_suffix('_COUNT').reset_index()
    print(df3)
       A  B_COUNT  C_COUNT  D_COUNT
    0  a        2        2        1
    1  b        3        2        3
    2  c        2        1        1
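To see the difference between the two functions side by side, the sketch below passes both names to agg for column C, which contains missing values. The columns A and C are copied from the frame above; the name `out` is illustrative.

```python
import numpy as np
import pandas as pd

# Columns A and C copied from the frame printed above
df = pd.DataFrame({'A': ['a', 'c', 'b', 'b', 'a', 'c', 'b'],
                   'C': ['three', np.nan, np.nan, 'two', 'three', 'two', 'one']})

# 'size' counts all rows of a group, 'count' only the non-missing values
out = df.groupby('A')['C'].agg(['size', 'count']).reset_index()
print(out)
```

Groups b and c each contain one NaN in C, so their count is one lower than their size.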

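A related function is Series.value_counts. It returns a Series containing the counts of unique values in descending order, so that the first element is the most frequently occurring one. It excludes NaN values by default.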

    df4 = (df['A'].value_counts()
                  .rename_axis('A')
                  .reset_index(name='COUNT'))
    print(df4)
       A  COUNT
    0  b      3
    1  a      2
    2  c      2

If you want the same output as with groupby + size, add Series.sort_index:

    df5 = (df['A'].value_counts()
                  .sort_index()
                  .rename_axis('A')
                  .reset_index(name='COUNT'))
    print(df5)
       A  COUNT
    0  a      2
    1  b      3
    2  c      2
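Because value_counts skips NaN by default, missing values only show up in the result when `dropna=False` is passed. A minimal sketch on column C of the frame above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Column C copied from the frame printed above
df = pd.DataFrame({'C': ['three', np.nan, np.nan, 'two', 'three', 'two', 'one']})

without_nan = df['C'].value_counts()             # NaN excluded (default)
with_nan = df['C'].value_counts(dropna=False)    # NaN counted as its own entry

print(without_nan.sum(), with_nan.sum())
print(with_nan)
```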

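Question 5

How do I create a new column filled with aggregated values?

The method GroupBy.transform returns an object that is indexed the same way (and has the same size) as the one being grouped.

See the pandas documentation for more information.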

    np.random.seed(123)

    df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                       'B' : ['one', 'two', 'three','two', 'two', 'one'],
                       'C' : np.random.randint(5, size=6),
                       'D' : np.random.randint(5, size=6)})
    print(df)
         A      B  C  D
    0  foo    one  2  3
    1  foo    two  4  1
    2  bar  three  2  1
    3  foo    two  1  0
    4  bar    two  3  1
    5  foo    one  2  1

    df['C1'] = df.groupby('A')['C'].transform('sum')
    df['C2'] = df.groupby(['A','B'])['C'].transform('sum')

    # Select several columns with a list; the tuple form ['C','D'] fails on recent pandas
    df[['C3','D3']] = df.groupby('A')[['C','D']].transform('sum')
    df[['C4','D4']] = df.groupby(['A','B'])[['C','D']].transform('sum')
    print(df)
         A      B  C  D  C1  C2  C3  D3  C4  D4
    0  foo    one  2  3   9   4   9   5   4   4
    1  foo    two  4  1   9   5   9   5   5   1
    2  bar  three  2  1   5   2   5   2   2   1
    3  foo    two  1  0   9   5   9   5   5   1
    4  bar    two  3  1   5   3   5   2   3   1
    5  foo    one  2  1   9   4   9   5   4   4
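Because the transformed result is aligned with the original rows, it can be combined directly with the original column. As a small sketch of one typical use, the example below computes each row's share of its group total; the column name `C_share` is illustrative and the values of A and C are copied from the frame above.

```python
import pandas as pd

# Columns A and C copied from the frame printed above
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [2, 4, 2, 1, 3, 2]})

# One group total per original row, in the original row order
group_total = df.groupby('A')['C'].transform('sum')

# Row-wise division works because both Series share the same index
df['C_share'] = df['C'] / group_total
print(df)
```

Within each group of A the shares add up to 1.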