Flattening nested JSON in a pandas DataFrame

Question

I am trying to load a JSON file into a pandas DataFrame, and I found that it contains some nested JSON. Here is a sample:

{'events': [{'id': 142896214, 'playerId': 37831, 'teamId': 3157, 'matchId': 2214569, 'matchPeriod': '1H', 'eventSec': 0.8935539999999946, 'eventId': 8, 'eventName': 'Pass', 'subEventId': 85, 'subEventName': 'Simple pass', 'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}], 'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}

I used the following code to load the JSON into a DataFrame:

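    import json
    import pandas as pd

    with open('EVENTS.json') as f:
        jsonstr = json.load(f)

    # pd.json_normalize is the public name since pandas 1.0;
    # older releases exposed it as pd.io.json.json_normalize.
    df = pd.json_normalize(jsonstr['events'])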

Looking at the output of df.head(), I found that two columns, positions and tags, are still nested.

I tried to flatten them with the following code:

    Position_data = pd.json_normalize(
        data=jsonstr['events'],
        record_path='positions',
        meta=['x', 'y', 'x', 'y'],
    )

This raised the following error:

KeyError: "Try running with errors='ignore' as key 'x' is not always present"

How can I flatten positions and tags (the columns that hold nested data)?

Thanks, Zep

calestini · Accepted Answer

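If you are looking for a more general way to unfold multiple levels of hierarchy from JSON, you can use recursion to flatten each record and a list comprehension to reshape the data. One alternative is shown below.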

    def flatten_json(nested_json, exclude=['']):
        """Flatten json object with nested keys into a single level.

        Args:
            nested_json: A nested json object.
            exclude: Keys to exclude from output.

        Returns:
            The flattened json object if successful, None otherwise.
        """
        out = {}

        def flatten(x, name='', exclude=exclude):
            if type(x) is dict:
                # Recurse into each key, extending the prefix with the key name
                for a in x:
                    if a not in exclude:
                        flatten(x[a], name + a + '_')
            elif type(x) is list:
                # Recurse into each item, extending the prefix with its index
                i = 0
                for a in x:
                    flatten(a, name + str(i) + '_')
                    i += 1
            else:
                # Leaf value: drop the trailing '_' from the accumulated name
                out[name[:-1]] = x

        flatten(nested_json)
        return out

You can then apply it to your data, regardless of how many nested levels it has:

New sample data

    this_dict = {'events': [
        {'id': 142896214, 'playerId': 37831, 'teamId': 3157, 'matchId': 2214569,
         'matchPeriod': '1H', 'eventSec': 0.8935539999999946, 'eventId': 8,
         'eventName': 'Pass', 'subEventId': 85, 'subEventName': 'Simple pass',
         'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
         'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]},
        {'id': 142896214, 'playerId': 37831, 'teamId': 3157, 'matchId': 2214569,
         'matchPeriod': '1H', 'eventSec': 0.8935539999999946, 'eventId': 8,
         'eventName': 'Pass', 'subEventId': 85, 'subEventName': 'Simple pass',
         'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}, {'x': 51, 'y': 49}],
         'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]},
    ]}

Usage

    pd.DataFrame([flatten_json(x) for x in this_dict['events']])

    Out[1]:
              id  playerId  teamId  matchId matchPeriod  eventSec  eventId  \
    0  142896214     37831    3157  2214569          1H  0.893554        8
    1  142896214     37831    3157  2214569          1H  0.893554        8

      eventName  subEventId subEventName  positions_0_x  positions_0_y  \
    0      Pass          85  Simple pass             51             49
    1      Pass          85  Simple pass             51             49

       positions_1_x  positions_1_y  tags_0_id tags_0_tag_label  positions_2_x  \
    0             40             53       1801         accurate            NaN
    1             40             53       1801         accurate           51.0

       positions_2_y
    0            NaN
    1           49.0

This flatten_json code is not mine; I have seen it here and here, without much certainty about the original source.
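If one row per position is preferable to one column per position, pd.json_normalize can likely do that directly. The following is only a minimal sketch and is not part of the code above: it assumes pandas 1.0 or later, uses one event from the question trimmed to a few fields, and the choice of id and matchId as meta keys is purely illustrative. It also suggests why the call in the question fails: meta keys are looked up on each event, whereas x and y exist only inside the positions records.

```python
import pandas as pd

# One event from the question, trimmed to a few fields for brevity.
events = [
    {
        "id": 142896214,
        "matchId": 2214569,
        "eventName": "Pass",
        "positions": [{"x": 51, "y": 49}, {"x": 40, "y": 53}],
        "tags": [{"id": 1801, "tag": {"label": "accurate"}}],
    }
]

# One row per position. meta keys are read from each event, so keys that
# live inside the position records ('x', 'y') cannot be listed there.
positions_long = pd.json_normalize(
    events, record_path="positions", meta=["id", "matchId"]
)

# One row per tag. record_prefix avoids a name clash between the tag's own
# 'id' and the event 'id' pulled in through meta (assumed prefix: 'tags_').
tags_long = pd.json_normalize(
    events, record_path="tags", meta=["id"], record_prefix="tags_", sep="_"
)

print(positions_long)
print(tags_long)
```

The long layout avoids the NaN padding that appears in the wide layout when events have different numbers of positions, at the cost of repeating the event id on every row.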

Trenton McKinney · Answer

data = {'events': [{'id': 142896214, 'playerId': 37831, 'teamId': 3157, 'matchId': 2214569, 'matchPeriod': '1H', 'eventSec': 0.8935539999999946, 'eventId': 8, 'eventName': 'Pass', 'subEventId': 85, 'subEventName': 'Simple pass', 'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}], 'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

Create the DataFrame:

    import pandas as pd

    df = pd.DataFrame.from_dict(data)
    df = df['events'].apply(pd.Series)

Flatten positions with pd.Series:

    df_p = df['positions'].apply(pd.Series)
    df_p_0 = df_p[0].apply(pd.Series)
    df_p_1 = df_p[1].apply(pd.Series)

Rename the columns for positions[0] and positions[1]:

    df_p_0.columns = ['pos_0_x', 'pos_0_y']
    df_p_1.columns = ['pos_1_x', 'pos_1_y']

Flatten tags with pd.Series:

    df_t = df.tags.apply(pd.Series)
    df_t = df_t[0].apply(pd.Series)
    df_t_t = df_t.tag.apply(pd.Series)

Rename id and label:

    df_t = df_t.rename(columns={'id': 'tags_id'})
    df_t_t.columns = ['tags_tag_label']

Combine them all with pd.concat:

    df_new = pd.concat([df, df_p_0, df_p_1, df_t.tags_id, df_t_t], axis=1)

Drop the old columns:

    df_new = df_new.drop(['positions', 'tags'], axis=1)
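The steps above hard-code two positions and one tag per event. As a rough sketch of how they might be generalized to lists of varying length, the snippet below wraps the same idea in a helper. The helper name expand_list_column is hypothetical, the sample is trimmed to a few fields (the second event repeats the first with a third position added), and the column names carry an extra list index (for example tags_0_id rather than tags_id). It assumes pandas 1.0 or later for pd.json_normalize.

```python
import pandas as pd

# Trimmed sample: the second event has one more position than the first.
data = {"events": [
    {"id": 142896214, "eventName": "Pass",
     "positions": [{"x": 51, "y": 49}, {"x": 40, "y": 53}],
     "tags": [{"id": 1801, "tag": {"label": "accurate"}}]},
    {"id": 142896214, "eventName": "Pass",
     "positions": [{"x": 51, "y": 49}, {"x": 40, "y": 53}, {"x": 51, "y": 49}],
     "tags": [{"id": 1801, "tag": {"label": "accurate"}}]},
]}


def expand_list_column(df, column, prefix):
    """Expand a column of lists of dicts into <prefix>_<i>_<key> columns.

    Hypothetical helper written for this example only.
    """
    # One column per list position; shorter lists are padded with missing values.
    items = pd.DataFrame(df[column].tolist(), index=df.index)
    parts = []
    for i in items.columns:
        # Replace missing entries with {} so json_normalize only sees dicts.
        records = [r if isinstance(r, dict) else {} for r in items[i]]
        part = pd.json_normalize(records, sep="_")
        part.index = df.index
        parts.append(part.add_prefix(f"{prefix}_{i}_"))
    if not parts:
        # Every list was empty: nothing to expand.
        return pd.DataFrame(index=df.index)
    return pd.concat(parts, axis=1)


df = pd.DataFrame(data["events"])
df_new = pd.concat(
    [
        df.drop(columns=["positions", "tags"]),
        expand_list_column(df, "positions", "pos"),
        expand_list_column(df, "tags", "tags"),
    ],
    axis=1,
)
print(df_new.columns.tolist())
```

Events with fewer positions get missing values in the extra columns, which also turns those columns into floats.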