Python pandas: DF.groupby().agg(), column reference in agg()

Question

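As a concrete problem, say I have a DataFrame DF

      Word tag  count
    0    a   S     30
    1  the   S     20
    2    a   T     60
    3   an   T      5
    4  the   T     10

For every "Word", I want to find the "tag" that has the highest "count". So the result would be something like

      Word tag  count
    1  the   S     20
    2    a   T     60
    3   an   T      5

I don't care about the count column, or whether the order/index is preserved or scrambled. Returning a dictionary {'the': 'S', ...} is just fine.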

I was hoping I could do

    DF.groupby(['Word']).agg(lambda x: x['tag'][ x['count'].argmax() ] )

but it doesn't work. I can't access the column information.

More abstractly, what does the function in agg(function) see as its argument?

By the way, is .agg() the same as .aggregate()?

Many thanks.

unutbu · Accepted Answer

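agg is the same as aggregate. Its callable is passed the columns of the DataFrame (Series objects), one at a time. That is why the lambda in the question cannot look up x['tag'] or x['count']: each x is a single column, not a whole row group.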
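To see this for yourself, here is a minimal sketch that records what agg hands to the callable. The names spy and seen are made up for this illustration, and the DataFrame is the one from the question. The exact dispatch inside pandas may vary between versions, so treat this as an experiment rather than a specification.

```python
import pandas as pd

df = pd.DataFrame({
    'Word': 'a the a an the'.split(),
    'tag': list('SSTTT'),
    'count': [30, 20, 60, 5, 10],
})

seen = []  # type name of every object passed to the callable


def spy(col):
    # Hypothetical helper: remember what we were given, then return
    # any scalar so that agg has something to aggregate to.
    seen.append(type(col).__name__)
    return len(col)


result = df.groupby('Word').agg(spy)

print(set(seen))   # the kinds of objects the callable received
print(result)      # one row per Word, one column per non-grouping column
```

Because each call only sees one column of one group, any logic that needs two columns at once (such as "the tag at the position of the largest count") has to go through a different route.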

You could use idxmax to collect the index labels of the rows with the maximum count:

    idx = df.groupby('Word')['count'].idxmax()
    print(idx)

yields

    Word
    a      2
    an     3
    the    1
    Name: count

Then use loc to select those rows, restricted to the Word and tag columns:

    print(df.loc[idx, ['Word', 'tag']])

yields

      Word tag
    2    a   T
    3   an   T
    1  the   S

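Note that idxmax returns index labels. df.loc can be used to select rows by label. But if the index is not unique, that is, if there are rows with duplicate index labels, then df.loc will select all rows with the labels listed in idx. So make sure that df.index.is_unique is True if you use idxmax with df.loc.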
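A small sketch of that pitfall, using the question's data with a deliberately duplicated index (the index values are invented for the demonstration). Resetting the index first is one possible way around it, assuming the original labels do not need to be preserved.

```python
import pandas as pd

df = pd.DataFrame(
    {'Word': 'a the a an the'.split(),
     'tag': list('SSTTT'),
     'count': [30, 20, 60, 5, 10]},
    index=[0, 1, 0, 1, 2],  # duplicated labels on purpose
)
print(df.index.is_unique)  # False

# idxmax returns labels, and each label matches more than one row here,
# so loc returns extra rows that are not the per-group maxima.
idx = df.groupby('Word')['count'].idxmax()
picked = df.loc[idx, ['Word', 'tag']]
print(len(picked))  # more than the 3 groups we asked about

# One possible workaround: give the frame a unique index first.
safe = df.reset_index(drop=True)
safe_idx = safe.groupby('Word')['count'].idxmax()
fixed = safe.loc[safe_idx, ['Word', 'tag']]
print(fixed)
```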

Alternatively, you could use apply. apply's callable is passed a sub-DataFrame, which gives you access to all the columns:

    import pandas as pd

    df = pd.DataFrame({'Word': 'a the a an the'.split(),
                       'tag': list('SSTTT'),
                       'count': [30, 20, 60, 5, 10]})

    # subf is the sub-DataFrame for one Word; pick the tag at the label of its max count
    print(df.groupby('Word').apply(lambda subf: subf['tag'][subf['count'].idxmax()]))

yields

    Word
    a      T
    an     T
    the    S

Using idxmax and loc is typically faster than apply, especially for large DataFrames. Using IPython's %timeit:

    N = 10000
    df = pd.DataFrame({'Word': 'a the a an the'.split()*N,
                       'tag': list('SSTTT')*N,
                       'count': [30, 20, 60, 5, 10]*N})

    def using_apply(df):
        return df.groupby('Word').apply(lambda subf: subf['tag'][subf['count'].idxmax()])

    def using_idxmax_loc(df):
        idx = df.groupby('Word')['count'].idxmax()
        return df.loc[idx, ['Word', 'tag']]

    In [22]: %timeit using_apply(df)
    100 loops, best of 3: 7.68 ms per loop

    In [23]: %timeit using_idxmax_loc(df)
    100 loops, best of 3: 5.43 ms per loop

If you want a dictionary mapping words to tags, then you could use set_index and to_dict like this:

    In [36]: df2 = df.loc[idx, ['Word', 'tag']].set_index('Word')

    In [37]: df2
    Out[37]:
         tag
    Word
    a      T
    an     T
    the    S

    In [38]: df2.to_dict()['tag']
    Out[38]: {'a': 'T', 'an': 'T', 'the': 'S'}
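As a variation on the same idea, here is a sketch that selects the tag column as a Series before converting, so that to_dict returns the flat mapping directly instead of a nested one. The variable name word_to_tag is just for illustration, and the small DataFrame from the question is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Word': 'a the a an the'.split(),
    'tag': list('SSTTT'),
    'count': [30, 20, 60, 5, 10],
})

idx = df.groupby('Word')['count'].idxmax()

# After set_index('Word'), the 'tag' column is a Series indexed by Word;
# Series.to_dict() maps each index label to its value.
word_to_tag = df.loc[idx].set_index('Word')['tag'].to_dict()
print(word_to_tag)  # {'a': 'T', 'an': 'T', 'the': 'S'}
```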

Jeff · Answer

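Here's a simple way to figure out what is being passed to the function that unutbu's solution then "applies":

    In [33]: def f(x):
       ....:     print(type(x))
       ....:     print(x)
       ....:

    In [34]: df.groupby('Word').apply(f)
    <class 'pandas.core.frame.DataFrame'>
      Word tag  count
    0    a   S     30
    2    a   T     60
    <class 'pandas.core.frame.DataFrame'>
      Word tag  count
    0    a   S     30
    2    a   T     60
    <class 'pandas.core.frame.DataFrame'>
      Word tag  count
    3   an   T      5
    <class 'pandas.core.frame.DataFrame'>
      Word tag  count
    1  the   S     20
    4  the   T     10

The first group is printed twice in this output. Older pandas versions called the function on the first group an extra time, so this is presumably an artifact of the version used, not a duplicated group.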

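Your function just operates on a sub-section of the frame in which the grouped variable (in this case 'Word') has the same value in every row. If you pass your own function, you have to deal with aggregating the potentially non-string columns yourself; standard functions like 'sum' do this for you.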

sum automatically does NOT aggregate on the string columns:

    In [41]: df.groupby('Word').sum()
    Out[41]:
          count
    Word
    a        90
    an        5
    the      30
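Whether sum() skips string columns on its own appears to depend on the pandas version (as far as I know, the default changed in pandas 2.x so that strings are concatenated instead). A minimal sketch of two ways to make the numeric-only behaviour explicit, assuming the installed pandas accepts the numeric_only keyword on the groupby sum:

```python
import pandas as pd

df = pd.DataFrame({
    'Word': 'a the a an the'.split(),
    'tag': list('SSTTT'),
    'count': [30, 20, 60, 5, 10],
})

by_word = df.groupby('Word')

# Option 1: ask explicitly for numeric columns only.
totals = by_word.sum(numeric_only=True)
print(totals)

# Option 2: select the column to aggregate before summing.
count_totals = by_word['count'].sum()
print(count_totals)
```

Selecting the column first (option 2) does not rely on any default and makes the intent visible at the call site.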

Here you ARE aggregating on all columns:

    In [42]: df.groupby('Word').apply(lambda x: x.sum())
    Out[42]:
            Word tag  count
    Word
    a         aa  ST     90
    an        an   T      5
    the   thethe  ST     30

You can do pretty much anything within the function:

    In [43]: df.groupby('Word').apply(lambda x: x['count'].sum())
    Out[43]:
    Word
    a       90
    an       5
    the     30