How can I convert MongoDB bsondump output to JSON using Python?

Question

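I have a large number of .bson files from a MongoDB dump. I am running bsondump on the command line and piping its output to Python on stdin. This converts the BSON to "JSON", but what arrives is a string that does not appear to be valid JSON.

For example, an incoming line looks like this: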

{ "_id" : ObjectId( "4d9b642b832a4c4fb2000000" ), "acted_at" : Date( 1302014955933 ), "created_at" : Date( 1302014955933 ), "updated_at" : Date( 1302014955933 ), "_platform_id" : 3, "guid" : 72106535190265857 }

I believe this is Mongo Extended JSON.

When I read in such a line and do:

json_line = json.dumps(line)

I get:

"{ \"_id\" : ObjectId( \"4d9b642b832a4c4fb2000000\" ), \"acted_at\" : Date( 1302014955933 ), \"created_at\" : Date( 1302014955933 ), \"updated_at\" : Date( 1302014955933 ), \"_platform_id\" : 3, \"guid\" : 72106535190265857 }
"

Which is still <type 'str'>.

I have also tried

json_line = json.dumps(line, default=json_util.default)

(see pymongo json_util; spam detection prevents a third link), which seems to produce the same output as the dumps call above. loads gives an error:

json_line = json.loads(line, object_hook=json_util.object_hook)
ValueError: No JSON object could be decoded
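The two symptoms above can be reproduced with the standard library alone. The following is a minimal sketch, not the asker's actual script: json.dumps serialises, so a string goes in and a quoted string comes out, while json.loads parses and rejects the bare ObjectId( ... ) call. The variable names and the shortened sample lines are illustrative assumptions.

```python
import json

# Shortened versions of the sample line; both are assumptions for illustration.
tengen_line = '{ "_id" : ObjectId( "4d9b642b832a4c4fb2000000" ), "_platform_id" : 3 }'
strict_line = '{ "_id" : { "$oid" : "4d9b642b832a4c4fb2000000" }, "_platform_id" : 3 }'

# dumps serialises: given a str, it returns that str as a quoted JSON string.
encoded = json.dumps(tengen_line)
print(type(encoded).__name__)

# loads parses: the bare ObjectId( ... ) call is not JSON, so it is rejected.
try:
    json.loads(tengen_line)
    tengen_parsed = True
except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
    tengen_parsed = False
    print("rejected:", exc)

# The Strict-mode spelling of the same document parses into a dict.
doc = json.loads(strict_line)
print(type(doc).__name__, doc["_platform_id"])
```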

So, how can I transform a string of TenGen JSON into parseable JSON? (The end goal is to stream tab-separated data to another database.)

Fabian Fagerholm · Accepted Answer

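What you have is a dump in Mongo Extended JSON in TenGen mode (see here). Some possible ways to go:

If you can dump again, use Strict output mode through the MongoDB REST API. That should give you real JSON instead of what you have now.
Use bson from http://pypi.python.org/pypi/bson/ to read the BSON you already have into Python data structures, and then do whatever processing you need on those (possibly outputting JSON).
Use the MongoDB Python bindings to connect to the database, get the data into Python, and do whatever processing you need. (If needed, you could set up a local MongoDB instance and import your dumped files into it.)
Convert the Mongo Extended JSON from TenGen mode to Strict mode. You could develop a separate filter to do it (read from stdin, replace TenGen structures with Strict structures, and output the result on stdout), or you could do it as you process the input.

Here's an example using Python and regular expressions: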

import json, re
from bson import json_util

# open in text mode so the str regex patterns below apply
with open("data.tengenjson", "r") as f:
    # read the entire input; in a real application,
    # you would want to read a chunk at a time
    bsondata = f.read()

# convert the TenGen JSON to Strict JSON
# here, I just convert the ObjectId and Date structures,
# but it's easy to extend to cover all structures listed at
# http://www.mongodb.org/display/DOCS/Mongo+Extended+JSON
jsondata = re.sub(r'ObjectId\s*\(\s*\"(\S+)\"\s*\)',
                  r'{"$oid": "\1"}',
                  bsondata)
jsondata = re.sub(r'Date\s*\(\s*(\S+)\s*\)',
                  r'{"$date": \1}',
                  jsondata)

# now we can parse this as JSON, and use MongoDB's object_hook
# function to get rich Python data structures inside a dictionary
# (json.loads expects a single document here)
data = json.loads(jsondata, object_hook=json_util.object_hook)

# just print the output for demonstration, along with the type
print(data)
print(type(data))

# serialise to JSON and print
print(json_util.dumps(data))
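The separate-filter option mentioned above could look roughly like the sketch below, which applies the same two substitutions one line at a time so the whole dump never has to sit in memory. It uses only the standard library; the names tengen_to_strict and filter_stream are hypothetical, and it assumes bsondump emits one document per line. The patterns are deliberately naive: text such as Date( 1 ) inside a string value would also be rewritten.

```python
import io
import json
import re

# Patterns for the two TenGen constructs that appear in the question's sample.
OBJECT_ID = re.compile(r'ObjectId\s*\(\s*"([0-9a-fA-F]+)"\s*\)')
DATE = re.compile(r'Date\s*\(\s*(-?\d+)\s*\)')


def tengen_to_strict(line):
    """Rewrite ObjectId( ... ) and Date( ... ) as Strict-mode JSON text."""
    line = OBJECT_ID.sub(r'{"$oid": "\1"}', line)
    return DATE.sub(r'{"$date": \1}', line)


def filter_stream(src, dst):
    """Convert src to dst line by line, skipping blank lines."""
    for line in src:
        if line.strip():
            # json.loads raises on any construct the patterns did not cover
            doc = json.loads(tengen_to_strict(line))
            dst.write(json.dumps(doc) + "\n")


# Demonstration on an in-memory stream built from the question's sample line.
sample = io.StringIO(
    '{ "_id" : ObjectId( "4d9b642b832a4c4fb2000000" ), '
    '"acted_at" : Date( 1302014955933 ), "_platform_id" : 3 }\n'
)
out = io.StringIO()
filter_stream(sample, out)
print(out.getvalue(), end="")
```

In a shell pipeline, passing sys.stdin and sys.stdout to filter_stream instead of the in-memory streams would let the script sit directly after bsondump.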

Depending on your goal, one of these should be a reasonable starting point.

bauman.space · Answer

Loading an entire bson file into Python memory is expensive.

If you want to stream the documents in rather than reading the whole file and decoding everything at once, you can try this library.

https://github.com/bauman/python-bson-streaming

from bsonstream import KeyValueBSONInput
from sys import argv

for file in argv[1:]:
    f = open(file, 'rb')
    # remove fast string match if not needed
    stream = KeyValueBSONInput(fh=f, fast_string_prematch="somthing")
    for id, dict_data in stream:
        if id:
            ...  # process dict_data
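For context on why streaming is possible at all: each BSON document begins with a little-endian int32 giving its total length and ends with a zero byte, and a dump file is a plain concatenation of such documents. The sketch below is not how python-bson-streaming is implemented; it is a standard-library illustration of that framing, and iter_raw_documents is a hypothetical name. It yields raw bytes only and leaves decoding to a BSON library.

```python
import io
import struct


def iter_raw_documents(fh):
    """Yield each BSON document in a binary file object as raw bytes."""
    while True:
        header = fh.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack("<i", header)  # length includes the prefix
        if length < 5:
            raise ValueError("invalid document length")
        body = fh.read(length - 4)
        if len(body) != length - 4 or body[-1:] != b"\x00":
            raise ValueError("truncated or malformed document")
        yield header + body


# Two hand-built documents: {} and {"a": 1} with an int32 value.
empty_doc = b"\x05\x00\x00\x00\x00"
int_doc = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
dump = io.BytesIO(empty_doc + int_doc)

for raw in iter_raw_documents(dump):
    print(len(raw), "bytes")
```

Only one document is held in memory at a time, so each chunk can be decoded and discarded before the next one is read.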

Emily S · Answer

You can convert the documents in a bson file like this:

>>> import bson
>>> bs = open('file.bson', 'rb').read()
>>> for valid_dict in bson.decode_all( bs ):
....

Each valid_dict element will be a valid Python dict that you can convert to JSON.
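One caveat when converting such a dict: decoded documents usually hold values such as datetime objects and ObjectId instances, which plain json.dumps refuses to serialise. A minimal sketch of a fallback handler follows. It uses a standard-library datetime as a stand-in for decoded data, and the function name fallback is an assumption; json_util.dumps from PyMongo, used in the accepted answer, is the ready-made alternative.

```python
import datetime
import json

# Stand-in for one decoded document; real ones would come from a BSON decoder.
valid_dict = {
    "_platform_id": 3,
    "acted_at": datetime.datetime(2011, 4, 5, 14, 49, 15),
}


def fallback(value):
    """Called by json.dumps for any value it cannot serialise itself."""
    if isinstance(value, datetime.datetime):
        return value.isoformat()
    return str(value)  # for example, an ObjectId becomes its hex string


encoded = json.dumps(valid_dict, default=fallback)
print(encoded)
```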

Maviles · Answer

You can strip out the data type wrappers and get strict JSON using a regular expression:

import json
import re

# Returns an iterator that converts each line of the file into a dict.
def readBsonFile(filename):
    with open(filename, "r") as data_in:
        for line in data_in:
            # convert the TenGen JSON to Strict JSON
            # by replacing each Type( value ) with the bare value
            jsondata = re.sub(r'\:\s*\S+\s*\(\s*(\S+)\s*\)',
                              r':\1',
                              line)

            # parse as JSON
            line_out = json.loads(jsondata)

            yield line_out
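Since the question's end goal is tab-separated output, dicts like the ones yielded above could be written with the standard csv module. This is a sketch under assumptions: the column list FIELDS and the helper name write_tsv are made up for illustration, and the sample dict mirrors the question's document after the type wrappers have been stripped.

```python
import csv
import io

# Assumed column order; keys not listed here are dropped from the output.
FIELDS = ["_id", "acted_at", "_platform_id", "guid"]


def write_tsv(docs, out):
    """Write one tab-separated row per dict, in FIELDS order."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    for doc in docs:
        writer.writerow(doc)


docs = [{
    "_id": "4d9b642b832a4c4fb2000000",
    "acted_at": 1302014955933,
    "updated_at": 1302014955933,  # not in FIELDS, so it is ignored
    "_platform_id": 3,
    "guid": 72106535190265857,
}]
buf = io.StringIO()
write_tsv(docs, buf)
print(buf.getvalue(), end="")
```

Because write_tsv accepts any iterable, a generator can be passed straight in and rows are written as they are produced.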