How to import data from MongoDB to pandas?

Question

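I have a large amount of data in a MongoDB collection that I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

EDIT: The MongoDB collection contains sensor values tagged with date and time. The sensor values are of float datatype.

Sample data: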

    {
        "_cls" : "SensorReport",
        "_id" : ObjectId("515a963b78f6a035d9fa531b"),
        "_types" : [ "SensorReport" ],
        "Readings" : [
            { "a" : 0.958069536790466, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"), "b" : 6.296118156595, "_cls" : "Reading" },
            { "a" : 0.95574014778624, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"), "b" : 6.29651468650064, "_cls" : "Reading" },
            { "a" : 0.953648289182713, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"), "b" : 7.29679823731148, "_cls" : "Reading" },
            { "a" : 0.955931884300997, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"), "b" : 6.29642922525632, "_cls" : "Reading" },
            { "a" : 0.95821381, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"), "b" : 7.28956613, "_cls" : "Reading" },
            { "a" : 4.95821335, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"), "b" : 6.28956574, "_cls" : "Reading" },
            { "a" : 9.95821341, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"), "b" : 0.28956488, "_cls" : "Reading" },
            { "a" : 1.95667927, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"), "b" : 0.29115237, "_cls" : "Reading" }
        ],
        "latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
        "sensorName" : "56847890-0",
        "reportCount" : 8
    }

waitingkuo · Accepted Answer

pymongo might give you a hand. The following is the code I'm using:

    import pandas as pd
    from pymongo import MongoClient


    def _connect_mongo(host, port, username, password, db):
        """A util for making a connection to mongo."""
        if username and password:
            mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
            conn = MongoClient(mongo_uri)
        else:
            conn = MongoClient(host, port)
        return conn[db]


    def read_mongo(db, collection, query=None, host='localhost', port=27017,
                   username=None, password=None, no_id=True):
        """Read from Mongo and store into a DataFrame."""
        # Connect to MongoDB
        db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

        # Make a query to the specific DB and collection
        # (None stands in for an empty filter, avoiding a mutable default argument)
        cursor = db[collection].find(query or {})

        # Expand the cursor and construct the DataFrame
        df = pd.DataFrame(list(cursor))

        # Delete the _id column (it is absent when the query matched nothing)
        if no_id and '_id' in df.columns:
            del df['_id']

        return df

saimadhu.polamuri · Answer

You can load your MongoDB data into a pandas DataFrame using this code. It works for me. Hopefully it will for you too.

    import pandas as pd
    from pymongo import MongoClient

    # Connect to the default host and port (localhost:27017)
    client = MongoClient()
    db = client.database_name
    collection = db.collection_name

    # Pull every document in the collection into a DataFrame
    data = pd.DataFrame(list(collection.find()))

shx2 · Answer

Monary does exactly that, and it's super fast. (another link)

See this cool post, which includes a quick tutorial and some timings.

Cy Bu · Answer

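As per PEP 20 (the Zen of Python), simple is better than complex:

    import pandas as pd

    df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include conditions in find() as you would when working with a regular MongoDB database, or use find_one() to get only one element from the database, etc.

And voila!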
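As a rough illustration of "including conditions", the sketch below builds a date-range filter for the `latestReportTime` field from the sample document. The helper name `time_range_filter` and the chosen dates are hypothetical, and the commented-out `find()` call assumes a live connection with placeholder database and collection names.

```python
from datetime import datetime


def time_range_filter(field, start, end):
    """Build a MongoDB filter matching documents with start <= field < end."""
    return {field: {"$gte": start, "$lt": end}}


# Hypothetical one-day window around the sample document's latestReportTime.
query = time_range_filter(
    "latestReportTime",
    datetime(2013, 4, 2),
    datetime(2013, 4, 3),
)

# With a live connection (assumed names), the filter would be passed to find();
# the second argument is a projection that leaves out the _id field:
#   docs = db.database_name.collection_name.find(query, {"_id": 0})
#   df = pd.DataFrame.from_records(docs)
```

Filtering on the server this way keeps documents outside the window from ever being transferred or loaded into the DataFrame.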

fengwt · Answer

    import pandas as pd
    from odo import odo

    # Source is given as a MongoDB URI followed by ::<collection name>
    data = odo('mongodb://localhost/db::collection', pd.DataFrame)

Dennis Golomazov · Answer

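To deal with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try the Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.

A few useful articles to start off:

Introducing Blaze Expressions (with MongoDB query example)
ReproduceIt: Reddit word count
Difference between Dask Arrays and Blaze

And an article which shows what amazing things are possible with the Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 GB of Reddit comments in seconds).

P.S. I'm not affiliated with any of these technologies.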

Ikar Pohorský · Answer

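Another option I found very useful is:

    import pandas as pd

    cursor = my_collection.find()
    # json_normalize is exposed as pd.json_normalize since pandas 1.0;
    # it expects a dict or a list of dicts, so the cursor is expanded first
    df = pd.json_normalize(list(cursor))

This way you get the unfolding of nested MongoDB documents for free.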
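For documents shaped like the sample in the question, where the values of interest sit in a nested `Readings` array, a sketch along the following lines may help. It uses the `record_path` and `meta` parameters of `pd.json_normalize` to get one row per reading. The in-memory `docs` list stands in for `list(my_collection.find())`, and the timestamps are written as strings here only to keep the example self-contained (pymongo would normally return `datetime` objects).

```python
import pandas as pd

# Two readings copied from the sample document in the question; with a live
# connection this list would come from list(my_collection.find()).
docs = [
    {
        "sensorName": "56847890-0",
        "Readings": [
            {"a": 0.958069536790466, "b": 6.296118156595,
             "ReadingUpdatedDate": "2013-04-02T08:26:35.297Z"},
            {"a": 0.95574014778624, "b": 6.29651468650064,
             "ReadingUpdatedDate": "2013-04-02T08:27:09.963Z"},
        ],
    }
]

# One row per element of "Readings", with sensorName repeated on each row.
df = pd.json_normalize(docs, record_path="Readings", meta=["sensorName"])

# Parse the timestamps and use them as the index for time-series work.
df["ReadingUpdatedDate"] = pd.to_datetime(df["ReadingUpdatedDate"])
df = df.set_index("ReadingUpdatedDate").sort_index()
```

The result has the float columns `a` and `b` indexed by reading time, which is usually a convenient starting point for resampling or plotting.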

Deo Leung · Answer

Using

    pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to generate small chunks and concatenate them at the end:

    import pandas as pd


    def iterator2dataframes(iterator, chunk_size: int):
        """Turn an iterator into multiple small pandas.DataFrame

        This is a balance between memory and efficiency
        """
        records = []
        frames = []
        for i, record in enumerate(iterator):
            records.append(record)
            # Flush a full chunk into its own DataFrame
            if i % chunk_size == chunk_size - 1:
                frames.append(pd.DataFrame(records))
                records = []
        # Flush the final, partially filled chunk
        if records:
            frames.append(pd.DataFrame(records))
        # pd.concat raises on an empty list, so handle an empty iterator
        return pd.concat(frames) if frames else pd.DataFrame()
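If the chunks can be reduced one at a time, a generator variant avoids holding all of them in memory at once. The following is only a sketch of that idea, not part of the answer above: `iter_dataframes` is a hypothetical name, and a plain generator of dicts stands in for a pymongo cursor.

```python
from itertools import islice

import pandas as pd


def iter_dataframes(iterator, chunk_size):
    """Yield DataFrames of at most chunk_size records each from an iterator."""
    iterator = iter(iterator)
    while True:
        # Take the next chunk_size records (fewer at the end of the iterator)
        records = list(islice(iterator, chunk_size))
        if not records:
            return
        yield pd.DataFrame(records)


# Stand-in for a cursor: any iterable of dicts works the same way.
fake_cursor = ({"a": i, "b": i * 2} for i in range(10))

# Reduce each chunk as it arrives instead of concatenating all of them.
total_b = sum(chunk["b"].sum() for chunk in iter_dataframes(fake_cursor, 4))
```

This only pays off when the analysis can be expressed per chunk (sums, counts, filtered subsets); if the full DataFrame is needed anyway, concatenating at the end as above is the simpler choice.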

Jeff · Answer

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records.
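A minimal sketch of the CSV route follows. The `mongoexport` command in the comment uses assumed database, collection, and output file names, and the in-memory string stands in for the exported file, including an assumed ISO-8601 timestamp format, so the example runs without a database.

```python
import io

import pandas as pd

# Stand-in for a file written by mongoexport in CSV mode, e.g. (assumed
# database, collection, and file names):
#   mongoexport --db=mydb --collection=reports --type=csv \
#       --fields=sensorName,latestReportTime,reportCount --out=reports.csv
exported_csv = io.StringIO(
    "sensorName,latestReportTime,reportCount\n"
    "56847890-0,2013-04-02T08:43:55.463Z,8\n"
)

# parse_dates turns the timestamp column into datetimes instead of strings.
df = pd.read_csv(exported_csv, parse_dates=["latestReportTime"])
```

CSV export works best for flat fields like these; nested arrays such as `Readings` do not map cleanly onto columns, so the JSON route is likely the better fit for them.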

Rafael Valero · Answer

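Following this great answer by waitingkuo, I would like to add the possibility of doing that using chunksize, in line with .read_sql() and .read_csv(). I extend the answer from Deo Leung by avoiding going one by one through each 'record' of the 'iterator' / 'cursor'. I will borrow the previous read_mongo function.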

    import pandas as pd
    from pymongo import MongoClient


    def read_mongo(db, collection, query=None, host='localhost', port=27017,
                   username=None, password=None, chunksize=100, no_id=True):
        """Read from Mongo in chunks and store into a DataFrame."""
        query = query or {}

        # Connect to MongoDB
        # db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
        client = MongoClient(host=host, port=port)

        # Select the specific DB and collection
        coll = client[db][collection]

        # Offsets at which each chunk starts; count_documents replaces
        # cursor.count(), which was removed in PyMongo 4
        total = coll.count_documents(query)
        skips = range(0, total, int(chunksize))

        # Build one DataFrame per chunk, including the final partial chunk
        frames = []
        for skip in skips:
            # Slicing the cursor applies skip and limit on the server side
            df_aux = pd.DataFrame(list(coll.find(query)[skip:skip + int(chunksize)]))
            if no_id and '_id' in df_aux.columns:
                del df_aux['_id']
            frames.append(df_aux)

        # Concatenate the chunks into a single DataFrame
        return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

Jordy Cuan · Answer

A similar approach to those of Rafael Valero, waitingkuo and Deo Leung, using pagination:

    import pandas as pd


    def read_mongo(db, collection, query=None,
                   host='localhost', port=27017, username=None, password=None,
                   chunksize=100, page_num=1, no_id=True):
        # Connect to MongoDB (_connect_mongo is the helper from waitingkuo's answer)
        db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

        # Calculate number of documents to skip
        skips = chunksize * (page_num - 1)

        # Use None instead of a mutable {} as the default argument.
        # Sorry, this link is in Spanish:
        # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
        if not query:
            query = {}

        # Make a query to the specific DB and Collection
        cursor = db[collection].find(query).skip(skips).limit(chunksize)

        # Expand the cursor and construct the DataFrame
        df = pd.DataFrame(list(cursor))

        # Delete the _id (the column is absent when the page is empty)
        if no_id and '_id' in df.columns:
            del df['_id']

        return df
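To walk through every page with a function like this, one option is a small loop that keeps requesting pages until an empty one comes back. The sketch below assumes a page function that returns an empty DataFrame past the last page; `read_all_pages` and `fake_fetch` are hypothetical names, and a plain list stands in for the collection so the example runs without a database.

```python
import pandas as pd


def read_all_pages(fetch_page, chunksize=100):
    """Call fetch_page(page_num, chunksize) until an empty page comes back."""
    frames = []
    page_num = 1
    while True:
        page = fetch_page(page_num, chunksize)
        if page.empty:
            break
        frames.append(page)
        page_num += 1
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Stand-in data source; a real fetch function would wrap a paginated
# read_mongo(..., chunksize=chunksize, page_num=page_num) call.
_records = [{"a": i} for i in range(7)]


def fake_fetch(page_num, chunksize):
    # Same skip arithmetic as the paginated read_mongo
    skips = chunksize * (page_num - 1)
    return pd.DataFrame(_records[skips:skips + chunksize])


df = read_all_pages(fake_fetch, chunksize=3)
```

Note that skip-based pagination against a real collection generally wants an explicit sort so that page boundaries stay stable between calls.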