How do I load data from a MongoDB collection into a pandas DataFrame?

Question

I am new to pandas (well, to all things "programming"...), but I have been encouraged to give it a try. I have a MongoDB database, "test", with a collection called "tweets". I access the database in ipython:

    import sys
    import pymongo
    from pymongo import Connection

    connection = Connection()
    db = connection.test
    tweets = db.tweets

The documents in tweets have the following structure:

entities': {u'hashtags': [], u'symbols': [], u'urls': [], u'user_mentions': []}, u'favorite_count': 0, u'favorited': False, u'filter_level': u'medium', u'geo': {u'coordinates': [placeholder coordinate, -placeholder coordinate], u'type': u'Point'}, u'id': 349223842700472320L, u'id_str': u'349223842700472320', u'in_reply_to_screen_name': None, u'in_reply_to_status_id': None, u'in_reply_to_status_id_str': None, u'in_reply_to_user_id': None, u'in_reply_to_user_id_str': None, u'lang': u'en', u'place': {u'attributes': {}, u'bounding_box': {u'coordinates': [[[placeholder coordinate, placeholder coordinate], [-placeholder coordinate, placeholder coordinate], [-placeholder coordinate, placeholder coordinate], [-placeholder coordinate, placeholder coordinate]]], u'type': u'Polygon'}, u'country': u'placeholder country', u'country_code': u'example', u'full_name': u'name, xx', u'id': u'user id', u'name': u'name', u'place_type': u'city', u'url': u'http://api.Twitter.com/1/geo/id/1820d77fb3f65055.json'}, u'retweet_count': 0, u'retweeted': False, u'source': u'<a href="http://Twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', u'text': u'example text', u'truncated': False, u'user': {u'contributors_enabled': False, u'created_at': u'Sat Jan 22 13:42:59 +0000 2011', u'default_profile': False, u'default_profile_image': False, u'description': u'example description', u'favourites_count': 100, u'follow_request_sent': None, u'followers_count': 100, u'following': None, u'friends_count': 100, u'geo_enabled': True, u'id': placeholder_id, u'id_str': u'placeholder_id', u'is_translator': False, u'lang': u'en', u'listed_count': 0, u'location': u'example place', u'name': u'example name', u'notifications': None, u'profile_background_color': u'000000', u'profile_background_image_url': u'http://a0.twimg.com/images/themes/theme19/bg.gif', u'profile_background_image_url_https': u'https://si0.twimg.com/images/themes/theme19/bg.gif', u'profile_background_tile': False, 
u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/241527685/1363314054', u'profile_image_url': u'http://a0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg', u'profile_image_url_https': u'https://si0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg', u'profile_link_color': u'000000', u'profile_sidebar_border_color': u'FFFFFF', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'profile_use_background_image': False, u'protected': False, u'screen_name': placeholder screen_name', u'statuses_count': xxxx, u'time_zone': u'placeholder time_zone', u'url': None, u'utc_offset': -21600, u'verified': False}}

Now, as far as I understand, pandas' main data structure (a spreadsheet-like table) is called a DataFrame. How can I load the data from my "tweets" collection into a pandas DataFrame? And how can I query for a subdocument within the database?

waitingkuo · Accepted Answer

Materialize the cursor you get back from MongoDB into a list before passing it to DataFrame:

    import pandas as pd

    df = pd.DataFrame(list(tweets.find()))
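As a minimal sketch of the same idea, the helper below materializes any cursor-like iterable of documents and optionally drops MongoDB's `_id` column. The name `cursor_to_dataframe` is hypothetical (it is not part of pandas or PyMongo), and a plain iterator of made-up dicts stands in for `tweets.find()` so the snippet runs without a database.

```python
import pandas as pd


def cursor_to_dataframe(cursor, drop_id=True):
    """Materialize an iterable of documents (e.g. a PyMongo cursor) into a DataFrame."""
    # list() exhausts the cursor, so every document is pulled into memory at once.
    df = pd.DataFrame(list(cursor))
    if drop_id and "_id" in df.columns:
        df = df.drop(columns="_id")
    return df


# Stand-in for tweets.find(); the field values are invented for illustration.
fake_cursor = iter([
    {"_id": 1, "text": "example text", "lang": "en", "retweet_count": 0},
    {"_id": 2, "text": "another example", "lang": "en", "retweet_count": 3},
])

df = cursor_to_dataframe(fake_cursor)
print(df)
```

With a real collection this would be called as `cursor_to_dataframe(tweets.find())`. For the subdocument part of the question, MongoDB filters accept dot notation on embedded fields, for example `tweets.find({"user.lang": "en"})`.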

Mark Unsworth · Answer

If you have data in MongoDB like this:

    [
        {
            "name": "Adam",
            "age": 27,
            "address": {
                "number": 4,
                "street": "Main Road",
                "city": "Oxford"
            }
        },
        {
            "name": "Steve",
            "age": 32,
            "address": {
                "number": 78,
                "street": "High Street",
                "city": "Cambridge"
            }
        }
    ]

You can load the data straight into a DataFrame like this:

    from pandas import DataFrame

    # find({}) with an empty filter returns every document in the collection
    df = DataFrame(list(db.collection_name.find({})))

And you will get this output:

df.head()

|    | name    | age | address                                                      |
|----|---------|-----|--------------------------------------------------------------|
| 1  | "Adam"  | 27  | {"number": 4, "street": "Main Road", "city": "Oxford"}       |
| 2  | "Steve" | 32  | {"number": 78, "street": "High Street", "city": "Cambridge"} |

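However, each subdocument shows up as a single JSON-like object inside its cell. If you want to flatten the objects so that the subdocument properties appear as individual columns, you can use json_normalize without any parameters.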

    # pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
    from pandas import json_normalize

    datapoints = list(db.collection_name.find({}))
    df = json_normalize(datapoints)
    df.head()

This gives you a DataFrame in the following form:

|    | name  | age | address.number | address.street | address.city |
|----|-------|-----|----------------|----------------|--------------|
| 1  | Adam  | 27  | 4              | "Main Road"    | "Oxford"     |
| 2  | Steve | 32  | 78             | "High Street"  | "Cambridge"  |
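A minimal sketch of the flattening step, using the two sample documents as an in-memory list in place of a query result so that it runs without a database. It uses the top-level `pd.json_normalize`, available since pandas 1.0. The variable names `flat`, `flat_underscore`, and `oxford` are only illustrative.

```python
import pandas as pd

# Stand-in for list(db.collection_name.find({})).
datapoints = [
    {"name": "Adam", "age": 27,
     "address": {"number": 4, "street": "Main Road", "city": "Oxford"}},
    {"name": "Steve", "age": 32,
     "address": {"number": 78, "street": "High Street", "city": "Cambridge"}},
]

# Nested keys become dotted column names such as "address.city".
flat = pd.json_normalize(datapoints)
print(flat.columns.tolist())

# sep changes the separator used when joining nested keys.
flat_underscore = pd.json_normalize(datapoints, sep="_")

# Once flattened, a subdocument field can be filtered like any other column.
oxford = flat[flat["address.city"] == "Oxford"]
print(oxford)
```

Flattening on the pandas side is convenient for small collections; for large ones it may be cheaper to filter on the server first and flatten only the matching documents.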

saimadhu.polamuri · Answer

You can load MongoDB data into a pandas DataFrame with this code. It works for me; hopefully it will work for you too.

    import pymongo
    import pandas as pd
    from pymongo import MongoClient  # replaces Connection, which was removed in PyMongo 3.0

    connection = MongoClient()
    db = connection.database_name
    input_data = db.collection_name
    data = pd.DataFrame(list(input_data.find()))
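The `Connection` class that appears in the question was removed in PyMongo 3.0, and `MongoClient` is its replacement. The sketch below wraps the load in a small helper. It assumes a MongoDB server on the default `localhost:27017` and reuses the placeholder names `database_name` and `collection_name`. Both function names are hypothetical, and the second function is defined but never called here, so the snippet imports cleanly without PyMongo or a running server.

```python
import pandas as pd


def load_collection(collection, query=None, projection=None):
    """Run find() on a PyMongo-style collection and return the result as a DataFrame."""
    # An empty filter matches every document; projection limits the returned fields.
    cursor = collection.find(query or {}, projection)
    return pd.DataFrame(list(cursor))


def load_from_local_mongo():
    """Assumes a MongoDB server on localhost:27017; not called automatically."""
    # Imported lazily so load_collection stays usable without PyMongo installed.
    from pymongo import MongoClient

    client = MongoClient()  # default host and port
    collection = client["database_name"]["collection_name"]
    # The projection {"_id": 0} asks the server to leave out the _id field.
    return load_collection(collection, projection={"_id": 0})
```

Passing a filter and a projection to `find()` keeps the work on the server, which usually matters more than anything done afterwards in pandas when the collection is large.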