Pandas文字列を含むデータフレームのピボット-「集計する数値型がありません」エラー

Question

このエラーについてはかなりの数の質問がありますが、周りを見渡しても、まだ解決策を見つけることができません。文字列でデータフレームをピボットして、行データを列にしようとしていますが、これまでのところうまくいきません。

私のdfの形状

<class 'pandas.core.frame.DataFrame'> Int64Index: 515932 entries, 0 to 515931 Data columns (total 5 columns): id 515932 non-null object cc_contact_id 515932 non-null object Network_Name 515932 non-null object question 515932 non-null object response_answer 515932 non-null object dtypes: object(5) memory usage: 23.6+ MB

サンプル形式

id contact_id question response_answer 16 137519 2206 State Ca 17 137520 2206 State Ca 18 137521 2206 State Ca 19 137522 2206 State Ca 20 137523 2208 City Lancaster 21 137524 2208 City Lancaster 22 137525 2208 City Lancaster 23 137526 2208 City Lancaster 24 137527 2208 Trip_End Location Home 25 137528 2208 Trip_End Location Home 26 137529 2208 Trip_End Location Home 27 137530 2208 Trip_End Location Home

ピボットしたいもの

id contact_id State City Trip_End Location 16 137519 2206 Ca None None None 20 137523 2208 None Lancaster None None 24 137527 2208 None None None Home etc. etc.

ここで、questionの値が列になり、response_answerが含まれます対応する列、およびIDの保持

私が試したこと

unified_df = pd.DataFrame(unified_data, columns=target_table_headers, dtype=object) pivot_table = unified_df.pivot_table('response_answer',['id','cc_contact_id'],'question') # OR pivot_table = unified_df.pivot_table('response_answer','question')

DataError：集計する数値型がありません

文字列値でデータフレームをピボットする方法は何ですか？

cwharland · Accepted Answer

pivot_tableのデフォルトのaggfuncはnp.sumであり、文字列の処理方法がわからないため、インデックスを適切に指定する必要があります。次のようなことを試す：

pivot_table = unified_df.pivot_table(index=['id', 'contact_id'], columns='question', values='response_answer', aggfunc=lambda x: ' '.join(x))

これにより、id, contact_idペアごとに1行が明示的に設定され、questionでresponse_answer値のセットがピボットされます。 aggfuncは、生データ内の同じ質問に対する複数の回答がある場合、それらをスペースで連結することを保証します。 pivot_tableの構文は、pandasバージョンによって異なります。

ここに簡単な例があります：

In [24]: import pandas as pd In [25]: import random In [26]: df = pd.DataFrame({'id':[100*random.randint(10, 50) for _ in range(100)], 'question': [str(random.randint(0,3)) for _ in range(100)], 'response': [str(random.randint(100,120)) for _ in range(100)]}) In [27]: df.head() Out[27]: id question response 0 3100 1 116 1 4500 2 113 2 5000 1 120 3 3900 2 103 4 4300 0 117 In [28]: df.info() <class 'pandas.core.frame.DataFrame'> Int64Index: 100 entries, 0 to 99 Data columns (total 3 columns): id 100 non-null int64 question 100 non-null object response 100 non-null object dtypes: int64(1), object(2) memory usage: 3.1+ KB In [29]: df.pivot_table(index='id', columns='question', values='response', aggfunc=lambda x: ' '.join(x)).head() Out[29]: question 0 1 2 3 id 1000 110 120 NaN 100 NaN 1100 NaN 106 108 104 NaN 1200 104 113 119 NaN 101 1300 102 NaN 116 108 120 1400 NaN NaN 116 NaN

johnInHome · Answer

いくつかの方法があります。

1

df1 = df.groupby(["id","contact_id","Network_Name","question"])['response_answer'].aggregate(lambda x: x).unstack().reset_index() df1.columns=df1.columns.tolist() print (df1)

2

df1 = df.set_index(["id","contact_id","Network_Name","question"])['response_answer'].unstack().reset_index() df1.columns=df1.columns.tolist() print (df1)

３

df1 = df.groupby(["id","contact_id","Network_Name","question"])['response_answer'].aggregate('first').unstack().reset_index() df1.columns=df1.columns.tolist() print (df1)

4

df1 = df.pivot_table(index=["id","contact_id","Network_Name"], columns='question', values=['response_answer'], aggfunc='first') df1.columns = df1.columns.droplevel() df1 = df1.reset_index() df1.columns=df1.columns.tolist() print (df1)

同じans。

 id contact_id Network_Name City State Trip_End_Location 0 16 137519 2206 None Ca None 1 17 137520 2206 None Ca None 2 18 137521 2206 None Ca None 3 19 137522 2206 None Ca None 4 20 137523 2208 Lancaster None None 5 21 137524 2208 Lancaster None None 6 22 137525 2208 Lancaster None None 7 23 137526 2208 Lancaster None None 8 24 137527 2208 None None Home 9 25 137528 2208 None None Home 10 26 137529 2208 None None Home 11 27 137530 2208 None None Home