How can I lazily read multiple JSON values from a file/stream in Python?

Question

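I'd like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately json.load() just .read()s until end-of-file; there doesn't seem to be any way to use it to read a single object or to lazily iterate over the objects.

Is there any way to do this? Using the standard library would be ideal, but if there's a third-party library I'd use that instead.

At the moment I'm putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.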

Example Use

example.py

    import my_json as json
    import sys

    for o in json.iterload(sys.stdin):
        print("Working on a", type(o))

in.txt

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

Example session

    $ python3.2 example.py < in.txt
    Working on a dict
    Working on a int
    Working on a int
    Working on a list
    Working on a int
    Working on a int
    Working on a int

Nic Watson · Accepted Answer

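Here's a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse properly. The only limitation is that the file must be seekable.

    def stream_read_json(fn):
        import json
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    # Succeeds only when exactly one value is left in the file.
                    obj = json.load(f)
                    yield obj
                    return
                except json.JSONDecodeError as e:
                    # e.pos is where the extra data starts, i.e. the length
                    # of the first value. Re-read just that much and parse it.
                    f.seek(start_pos)
                    json_str = f.read(e.pos)
                    obj = json.loads(json_str)
                    start_pos += e.pos
                    yield obj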
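To make the role of the exception concrete, here is a minimal sketch (not part of the original answer; the sample string is made up). It shows what json.JSONDecodeError reports when a second value follows the first: the pos attribute is the index where the extra data starts, so everything before it belongs to the first value.

```python
import json

text = '{"a": 1} 2'  # hypothetical input: two concatenated JSON values

try:
    json.loads(text)
except json.JSONDecodeError as e:
    # The "Extra data" error points at the index where the second value begins.
    message = e.msg
    first_len = e.pos

first = json.loads(text[:first_len])  # parse only the first value
rest = text[first_len:]               # what is still left to parse
```

Repeating this on rest until json.loads succeeds is, in essence, what the generator above does with a seekable file.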

Edit: just noticed that this will only work for Python >= 3.5. On earlier versions, failures raise a ValueError, and you have to parse the position out of the message string, e.g.

    def stream_read_json(fn):
        import json
        import re
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    obj = json.load(f)
                    yield obj
                    return
                except ValueError as e:
                    f.seek(start_pos)
                    # Pull the character offset out of the error message.
                    end_pos = int(re.match(r'Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                           e.args[0]).groups()[0])
                    json_str = f.read(end_pos)
                    obj = json.loads(json_str)
                    start_pos += end_pos
                    yield obj
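As a quick check of the regular expression, the sketch below applies it to a hard-coded message. The message text is an assumption about what the older "Extra data" ValueError looks like; it is not output captured from the answer.

```python
import re

# Assumed sample of the pre-3.5 ValueError text, hard-coded for illustration.
legacy_message = "Extra data: line 1 column 10 - line 1 column 11 (char 9 - 10)"

pattern = re.compile(r"Extra data: line \d+ column \d+ .*\(char (\d+).*\)")
match = pattern.match(legacy_message)

# The first captured number is the offset where the extra data starts.
end_pos = int(match.group(1))
```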

Thomas K · Answer

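JSON generally isn't very good for this sort of incremental use; there's no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.

The object-per-line solution that you're using is seen elsewhere too. Scrapy calls it 'JSON lines'.

You can do it slightly more Pythonically:

    for jsonline in f:
        yield json.loads(jsonline)   # or do the processing in this loop

I think this is about the best way: it doesn't rely on any third-party libraries, and it's easy to understand what's going on. I've used it in some of my own code as well.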
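A self-contained sketch of the one-value-per-line approach follows. The function name iter_json_lines and the sample data are made up for illustration, and skipping blank lines is an extra assumption rather than something the answer requires.

```python
import io
import json


def iter_json_lines(stream):
    """Yield one decoded value per non-blank line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# io.StringIO stands in for an open file or sys.stdin.
sample = io.StringIO('{"foo": ["bar", "baz"]}\n1\n\n[]\n')
values = list(iter_json_lines(sample))
```

Because the generator pulls one line at a time from the stream, only the current record needs to be in memory.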

Krumelur · Answer

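A little late maybe, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case it was impossible. The only feasible way to do this generically is to implement a proper tokenizer.

After not finding a generic-enough and reasonably well-performing solution, I ended up doing this myself, writing the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you though). To get some kind of performance out of it, it is written as a C module.

Example:

    from splitstream import splitfile

    for jsonstr in splitfile(sys.stdin, format="json"):
        yield json.loads(jsonstr)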

Benedict · Answer

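This is a pretty nasty problem actually, because you have to stream in lines, but pattern match across multiple lines against braces, and also pattern match JSON. It's a sort of JSON pre-parse followed by a JSON parse. JSON is, in comparison to other formats, easy to parse, so it's not always necessary to go for a parsing library. Nevertheless, how should we solve these conflicting issues?

Generators to the rescue!

The beauty of generators for a problem like this is that you can stack them on top of each other, gradually abstracting away the difficulty of the problem whilst maintaining laziness. I also considered using the mechanism for passing values back into a generator (send()) but fortunately found I didn't need to use that.

To solve the first of the problems you need some sort of streamingfinditer, as a streaming version of re.finditer. My attempt at this below pulls in lines as needed (uncomment the debug statement to see) whilst still returning matches. I actually then modified it slightly to yield non-matched lines as well as matches (marked as 0 or 1 in the first part of the yielded tuple).

    import re

    def streamingfinditer(pat, stream):
        for s in stream:
            # print("Read next line: " + s)
            while 1:
                m = re.search(pat, s)
                if not m:
                    yield (0, s)              # no (further) match in this line
                    break
                yield (1, m.group())          # matched text up to and including a brace
                s = re.split(pat, s, 1)[1]    # continue with the rest of the line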

With that, it's then possible to match up until braces, account each time for whether the braces are balanced, and then return either simple or compound objects as appropriate.

    braces = '{}[]'
    whitespaceesc = ' \t'
    bracesesc = '\\' + '\\'.join(braces)
    balancemap = dict(zip(braces, [1, -1, 1, -1]))
    bracespat = '[' + bracesesc + ']'
    nobracespat = '[^' + bracesesc + ']*'
    untilbracespat = nobracespat + bracespat

    def simpleorcompoundobjects(stream):
        obj = ""
        unbalanced = 0
        for (c, m) in streamingfinditer(re.compile(untilbracespat), stream):
            if (c == 0):  # remainder of line returned, nothing interesting
                if (unbalanced == 0):
                    yield (0, m)
                else:
                    obj += m
            if (c == 1):  # match returned
                if (unbalanced == 0):
                    yield (0, m[:-1])
                    obj += m[-1]
                else:
                    obj += m
                unbalanced += balancemap[m[-1]]
                if (unbalanced == 0):
                    yield (1, obj)
                    obj = ""

This returns tuples as follows:

    (0,"String of simple non-braced objects easy to parse")
    (1,"{ 'Compound' : 'objects' }")

Basically that's the nasty part done. We now just have to do the final level of parsing as we see fit. For example we can use Jeremy Roman's iterload function (Thanks!) to do parsing for a single line:

    def streamingiterload(stream):
        for c, o in simpleorcompoundobjects(stream):
            for x in iterload(o):
                yield x

Test it:

    of = open("test.json", "w")
    of.write("""[ "hello" ] { "goodbye" : 1 } 1 2 { } 2 9 78 4 5 { "animals" : [ "dog" , "lots of mice" , "cat" ] }
    """)
    of.close()
    # open & stream the json
    f = open("test.json", "r")
    for o in streamingiterload(f.readlines()):
        print(o)
    f.close()

I get these results (and if you turn on that debug line, you'll see it pulls in the lines as needed):

    [u'hello']
    {u'goodbye': 1}
    1
    2
    {}
    2
    9
    78
    4
    5
    {u'animals': [u'dog', u'lots of mice', u'cat']}

This won't work for all situations. Due to the implementation of the json library, it is impossible to work entirely correctly without reimplementing the parser yourself.
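One situation that any brace-counting pre-parser has to watch for is a brace inside a string literal. The sketch below is an illustration only, not the code from this answer: naive_split is a hypothetical helper that counts braces with no knowledge of JSON strings.

```python
def naive_split(text):
    """Split text into chunks of balanced braces, ignoring string syntax.

    Deliberately naive: a brace inside a JSON string is counted like any other.
    """
    chunks = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch in "{[":
            if depth == 0:
                start = i
            depth += 1
        elif ch in "}]":
            depth -= 1
            if depth == 0:
                chunks.append(text[start:i + 1])
    return chunks


good = naive_split('{"a": 1} [2, 3]')
# The "}" inside the string closes the object too early, so the only chunk
# produced is cut off in the middle of the string and is not valid JSON.
bad = naive_split('{"a": "}"} {"b": 2}')
```

Handling this correctly means tracking string and escape state as well, which is the point at which the pre-parser starts to become a tokenizer.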

Jeremy Roman · Answer

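You can certainly do this. You just have to use raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files you can modify it to only read from the file as necessary without much difficulty.

    import json
    from json.decoder import WHITESPACE

    def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
        # Accept either a file-like object or a string. The original tested
        # isinstance(string_or_fp, file), which only exists in Python 2.
        if hasattr(string_or_fp, "read"):
            string = string_or_fp.read()
        else:
            string = str(string_or_fp)

        decoder = cls(**kwargs)
        # Skip leading whitespace, then decode one value at a time,
        # skipping the whitespace after each value.
        idx = WHITESPACE.match(string, 0).end()
        while idx < len(string):
            obj, end = decoder.raw_decode(string, idx)
            yield obj
            idx = WHITESPACE.match(string, end).end()

Usage: just as you requested, it's a generator.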
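The cls and **kwargs parameters are what let a caller customise decoding. The sketch below is a string-only restatement of the same idea (iterload_text is a hypothetical name, and the sample input is made up) showing a decoder option, parse_float, being forwarded to json.JSONDecoder.

```python
import json
from decimal import Decimal
from json.decoder import WHITESPACE


def iterload_text(text, cls=json.JSONDecoder, **kwargs):
    """Yield each JSON value found in text; kwargs are passed to the decoder."""
    decoder = cls(**kwargs)
    idx = WHITESPACE.match(text, 0).end()
    while idx < len(text):
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
        idx = WHITESPACE.match(text, idx).end()


# parse_float is a standard JSONDecoder option; here floats become Decimal.
values = list(iterload_text('1.10 {"price": 2.50} []', parse_float=Decimal))
```

Passing the start index to raw_decode, rather than slicing the string, avoids copying the remaining text for every value.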

Tarun Lalwani · Answer

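I think a better way of doing it would be to use a state machine. Below is sample code that I worked out by converting the NodeJS code at the link below to Python. (The first version used the nonlocal keyword, which is only available in Python 3, so it did not work on Python 2.)

Edit-1: Updated and made the code compatible with Python 2

Edit-2: Updated and added a Python 3 only version as well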

https://gist.github.com/creationix/5992451

Python 3 only version

    # A streaming byte oriented JSON parser.  Feed it a single byte at a time and
    # it will emit complete objects as it comes across them.  Whitespace within and
    # between objects is ignored.  This means it can parse newline delimited JSON.
    import math


    def json_machine(emit, next_func=None):
        def _value(byte_data):
            if not byte_data:
                return

            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _value  # Ignore whitespace

            if byte_data == 0x22:  # "
                return string_machine(on_value)

            if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
                return number_machine(byte_data, on_number)

            if byte_data == 0x7b:  # {
                return object_machine(on_value)

            if byte_data == 0x5b:  # [
                return array_machine(on_value)

            if byte_data == 0x74:  # t
                return constant_machine(TRUE, True, on_value)

            if byte_data == 0x66:  # f
                return constant_machine(FALSE, False, on_value)

            if byte_data == 0x6e:  # n
                return constant_machine(NULL, None, on_value)

            if next_func == _value:
                raise Exception("Unexpected 0x" + str(byte_data))

            return next_func(byte_data)

        def on_value(value):
            emit(value)
            return next_func

        def on_number(number, byte):
            emit(number)
            return _value(byte)

        next_func = next_func or _value
        return _value


    TRUE = [0x72, 0x75, 0x65]
    FALSE = [0x61, 0x6c, 0x73, 0x65]
    NULL = [0x75, 0x6c, 0x6c]


    def constant_machine(bytes_data, value, emit):
        i = 0
        length = len(bytes_data)

        def _constant(byte_data):
            nonlocal i
            if byte_data != bytes_data[i]:
                i += 1
                raise Exception("Unexpected 0x" + str(byte_data))

            i += 1
            if i < length:
                return _constant
            return emit(value)

        return _constant


    def string_machine(emit):
        string = ""

        def _string(byte_data):
            nonlocal string

            if byte_data == 0x22:  # "
                return emit(string)

            if byte_data == 0x5c:  # backslash
                return _escaped_string

            if byte_data & 0x80:  # UTF-8 handling
                return utf8_machine(byte_data, on_char_code)

            if byte_data < 0x20:  # ASCII control character
                raise Exception("Unexpected control character: 0x" + str(byte_data))

            string += chr(byte_data)
            return _string

        def _escaped_string(byte_data):
            nonlocal string

            if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # quote, backslash, slash
                string += chr(byte_data)
                return _string

            if byte_data == 0x62:  # b
                string += "\b"
                return _string

            if byte_data == 0x66:  # f
                string += "\f"
                return _string

            if byte_data == 0x6e:  # n
                string += "\n"
                return _string

            if byte_data == 0x72:  # r
                string += "\r"
                return _string

            if byte_data == 0x74:  # t
                string += "\t"
                return _string

            if byte_data == 0x75:  # u
                return hex_machine(on_char_code)

        def on_char_code(char_code):
            nonlocal string
            string += chr(char_code)
            return _string

        return _string


    # Nestable state machine for UTF-8 Decoding.
    def utf8_machine(byte_data, emit):
        left = 0
        num = 0

        def _utf8(byte_data):
            nonlocal num, left
            if (byte_data & 0xc0) != 0x80:
                raise Exception("Invalid byte in UTF-8 character: 0x" + format(byte_data, "x"))

            left = left - 1

            num |= (byte_data & 0x3f) << (left * 6)
            if left:
                return _utf8
            return emit(num)

        if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
            left = 1
            num = (byte_data & 0x1f) << 6
            return _utf8

        if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
            left = 2
            num = (byte_data & 0xf) << 12
            return _utf8

        if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
            left = 3
            num = (byte_data & 0x07) << 18
            return _utf8

        raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data))


    # Nestable state machine for hex escaped characters
    def hex_machine(emit):
        left = 4
        num = 0

        def _hex(byte_data):
            nonlocal num, left

            if 0x30 <= byte_data < 0x40:
                i = byte_data - 0x30
            elif 0x61 <= byte_data <= 0x66:
                i = byte_data - 0x57
            elif 0x41 <= byte_data <= 0x46:
                i = byte_data - 0x37
            else:
                raise Exception("Expected hex char in string hex escape")

            left -= 1
            num |= i << (left * 4)

            if left:
                return _hex
            return emit(num)

        return _hex


    def number_machine(byte_data, emit):
        sign = 1
        number = 0
        decimal = 0   # fraction digits, accumulated as an integer
        places = 0    # how many fraction digits have been read
        esign = 1
        exponent = 0

        def _mid(byte_data):
            if byte_data == 0x2e:  # .
                return _decimal

            return _later(byte_data)

        def _number(byte_data):
            nonlocal number
            if 0x30 <= byte_data < 0x40:
                number = number * 10 + (byte_data - 0x30)
                return _number

            return _mid(byte_data)

        def _start(byte_data):
            if byte_data == 0x30:
                return _mid

            if 0x30 < byte_data < 0x40:
                return _number(byte_data)

            raise Exception("Invalid number: 0x" + str(byte_data))

        def _decimal(byte_data):
            nonlocal decimal, places
            if 0x30 <= byte_data < 0x40:
                # The original computed (decimal + digit) / 10 here, which
                # misplaces every fraction digit after the first.
                decimal = decimal * 10 + (byte_data - 0x30)
                places += 1
                return _decimal

            return _later(byte_data)

        def _later(byte_data):
            if byte_data == 0x45 or byte_data == 0x65:  # E e
                return _esign

            return _done(byte_data)

        def _esign(byte_data):
            nonlocal esign
            if byte_data == 0x2b:  # +
                return _exponent

            if byte_data == 0x2d:  # -
                esign = -1
                return _exponent

            return _exponent(byte_data)

        def _exponent(byte_data):
            nonlocal exponent
            if 0x30 <= byte_data < 0x40:
                exponent = exponent * 10 + (byte_data - 0x30)
                return _exponent

            return _done(byte_data)

        def _done(byte_data):
            value = number
            if places:
                value += decimal / float(10 ** places)
            value *= sign
            if exponent:
                value *= math.pow(10, esign * exponent)

            return emit(value, byte_data)

        # This check sat above the nested definitions in the original, so a
        # leading minus returned before _decimal, _later etc. were defined.
        if byte_data == 0x2d:  # -
            sign = -1
            return _start

        return _start(byte_data)


    def array_machine(emit):
        array_data = []

        def _array(byte_data):
            if byte_data == 0x5d:  # ]
                return emit(array_data)

            return json_machine(on_value, _comma)(byte_data)

        def on_value(value):
            array_data.append(value)

        def _comma(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _comma  # Ignore whitespace

            if byte_data == 0x2c:  # ,
                return json_machine(on_value, _comma)

            if byte_data == 0x5d:  # ]
                return emit(array_data)

            raise Exception("Unexpected byte: 0x" + str(byte_data) + " in array body")

        return _array


    def object_machine(emit):
        object_data = {}
        key = None

        def _object(byte_data):
            if byte_data == 0x7d:  # }
                return emit(object_data)

            return _key(byte_data)

        def _key(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _object  # Ignore whitespace

            if byte_data == 0x22:
                return string_machine(on_key)

            raise Exception("Unexpected byte: 0x" + str(byte_data))

        def on_key(result):
            nonlocal key
            key = result
            return _colon

        def _colon(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _colon  # Ignore whitespace

            if byte_data == 0x3a:  # :
                return json_machine(on_value, _comma)

            raise Exception("Unexpected byte: 0x" + str(byte_data))

        def on_value(value):
            object_data[key] = value

        def _comma(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _comma  # Ignore whitespace

            if byte_data == 0x2c:  # ,
                return _key

            if byte_data == 0x7d:  # }
                return emit(object_data)

            raise Exception("Unexpected byte: 0x" + str(byte_data))

        return _object

Python 2 compatible version

    # A streaming byte oriented JSON parser.  Feed it a single byte at a time and
    # it will emit complete objects as it comes across them.  Whitespace within and
    # between objects is ignored.  This means it can parse newline delimited JSON.
    import math


    def json_machine(emit, next_func=None):
        def _value(byte_data):
            if not byte_data:
                return

            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _value  # Ignore whitespace

            if byte_data == 0x22:  # "
                return string_machine(on_value)

            if byte_data == 0x2d or (0x30 <= byte_data < 0x40):  # - or 0-9
                return number_machine(byte_data, on_number)

            if byte_data == 0x7b:  # {
                return object_machine(on_value)

            if byte_data == 0x5b:  # [
                return array_machine(on_value)

            if byte_data == 0x74:  # t
                return constant_machine(TRUE, True, on_value)

            if byte_data == 0x66:  # f
                return constant_machine(FALSE, False, on_value)

            if byte_data == 0x6e:  # n
                return constant_machine(NULL, None, on_value)

            if next_func == _value:
                raise Exception("Unexpected 0x" + str(byte_data))

            return next_func(byte_data)

        def on_value(value):
            emit(value)
            return next_func

        def on_number(number, byte):
            emit(number)
            return _value(byte)

        next_func = next_func or _value
        return _value


    TRUE = [0x72, 0x75, 0x65]
    FALSE = [0x61, 0x6c, 0x73, 0x65]
    NULL = [0x75, 0x6c, 0x6c]


    def constant_machine(bytes_data, value, emit):
        # Mutable state lives in a dict because Python 2 has no nonlocal.
        local_data = {"i": 0, "length": len(bytes_data)}

        def _constant(byte_data):
            # nonlocal i, length
            if byte_data != bytes_data[local_data["i"]]:
                local_data["i"] += 1
                raise Exception("Unexpected 0x" + format(byte_data, "x"))

            local_data["i"] += 1

            if local_data["i"] < local_data["length"]:
                return _constant
            return emit(value)

        return _constant


    def string_machine(emit):
        local_data = {"string": ""}

        def _string(byte_data):
            # nonlocal string

            if byte_data == 0x22:  # "
                return emit(local_data["string"])

            if byte_data == 0x5c:  # backslash
                return _escaped_string

            if byte_data & 0x80:  # UTF-8 handling
                return utf8_machine(byte_data, on_char_code)

            if byte_data < 0x20:  # ASCII control character
                raise Exception("Unexpected control character: 0x" + format(byte_data, "x"))

            local_data["string"] += chr(byte_data)
            return _string

        def _escaped_string(byte_data):
            # nonlocal string

            if byte_data == 0x22 or byte_data == 0x5c or byte_data == 0x2f:  # quote, backslash, slash
                local_data["string"] += chr(byte_data)
                return _string

            if byte_data == 0x62:  # b
                local_data["string"] += "\b"
                return _string

            if byte_data == 0x66:  # f
                local_data["string"] += "\f"
                return _string

            if byte_data == 0x6e:  # n
                local_data["string"] += "\n"
                return _string

            if byte_data == 0x72:  # r
                local_data["string"] += "\r"
                return _string

            if byte_data == 0x74:  # t
                local_data["string"] += "\t"
                return _string

            if byte_data == 0x75:  # u
                return hex_machine(on_char_code)

        def on_char_code(char_code):
            # nonlocal string
            local_data["string"] += chr(char_code)
            return _string

        return _string


    # Nestable state machine for UTF-8 Decoding.
    def utf8_machine(byte_data, emit):
        local_data = {"left": 0, "num": 0}

        def _utf8(byte_data):
            # nonlocal num, left
            if (byte_data & 0xc0) != 0x80:
                raise Exception("Invalid byte in UTF-8 character: 0x" + format(byte_data, "x"))

            local_data["left"] -= 1

            local_data["num"] |= (byte_data & 0x3f) << (local_data["left"] * 6)
            if local_data["left"]:
                return _utf8
            return emit(local_data["num"])

        if 0xc0 <= byte_data < 0xe0:  # 2-byte UTF-8 Character
            local_data["left"] = 1
            local_data["num"] = (byte_data & 0x1f) << 6
            return _utf8

        if 0xe0 <= byte_data < 0xf0:  # 3-byte UTF-8 Character
            local_data["left"] = 2
            local_data["num"] = (byte_data & 0xf) << 12
            return _utf8

        if 0xf0 <= byte_data < 0xf8:  # 4-byte UTF-8 Character
            local_data["left"] = 3
            local_data["num"] = (byte_data & 0x07) << 18
            return _utf8

        raise Exception("Invalid byte in UTF-8 string: 0x" + str(byte_data))


    # Nestable state machine for hex escaped characters
    def hex_machine(emit):
        local_data = {"left": 4, "num": 0}

        def _hex(byte_data):
            # nonlocal num, left
            i = 0  # Parse the hex byte

            if 0x30 <= byte_data < 0x40:
                i = byte_data - 0x30
            elif 0x61 <= byte_data <= 0x66:
                i = byte_data - 0x57
            elif 0x41 <= byte_data <= 0x46:
                i = byte_data - 0x37
            else:
                raise Exception("Expected hex char in string hex escape")

            local_data["left"] -= 1
            local_data["num"] |= i << (local_data["left"] * 4)

            if local_data["left"]:
                return _hex
            return emit(local_data["num"])

        return _hex


    def number_machine(byte_data, emit):
        # "decimal" holds the fraction digits as an integer and "places"
        # counts them; "places" is an addition to the original state.
        local_data = {"sign": 1, "number": 0, "decimal": 0, "places": 0,
                      "esign": 1, "exponent": 0}

        def _mid(byte_data):
            if byte_data == 0x2e:  # .
                return _decimal

            return _later(byte_data)

        def _number(byte_data):
            # nonlocal number
            if 0x30 <= byte_data < 0x40:
                local_data["number"] = local_data["number"] * 10 + (byte_data - 0x30)
                return _number

            return _mid(byte_data)

        def _start(byte_data):
            if byte_data == 0x30:
                return _mid

            if 0x30 < byte_data < 0x40:
                return _number(byte_data)

            raise Exception("Invalid number: 0x" + format(byte_data, "x"))

        def _decimal(byte_data):
            # nonlocal decimal
            if 0x30 <= byte_data < 0x40:
                # The original computed (decimal + digit) / 10 here, which
                # misplaces digits and truncates to 0 under Python 2 division.
                local_data["decimal"] = local_data["decimal"] * 10 + (byte_data - 0x30)
                local_data["places"] += 1
                return _decimal

            return _later(byte_data)

        def _later(byte_data):
            if byte_data == 0x45 or byte_data == 0x65:  # E e
                return _esign

            return _done(byte_data)

        def _esign(byte_data):
            # nonlocal esign
            if byte_data == 0x2b:  # +
                return _exponent

            if byte_data == 0x2d:  # -
                local_data["esign"] = -1
                return _exponent

            return _exponent(byte_data)

        def _exponent(byte_data):
            # nonlocal exponent
            if 0x30 <= byte_data < 0x40:
                local_data["exponent"] = local_data["exponent"] * 10 + (byte_data - 0x30)
                return _exponent

            return _done(byte_data)

        def _done(byte_data):
            value = local_data["number"]
            if local_data["places"]:
                value += local_data["decimal"] / float(10 ** local_data["places"])
            value *= local_data["sign"]
            if local_data["exponent"]:
                value *= math.pow(10, local_data["esign"] * local_data["exponent"])

            return emit(value, byte_data)

        # This check sat above the nested definitions in the original, so a
        # leading minus returned before _decimal, _later etc. were defined.
        if byte_data == 0x2d:  # -
            local_data["sign"] = -1
            return _start

        return _start(byte_data)


    def array_machine(emit):
        local_data = {"array_data": []}

        def _array(byte_data):
            if byte_data == 0x5d:  # ]
                return emit(local_data["array_data"])

            return json_machine(on_value, _comma)(byte_data)

        def on_value(value):
            # nonlocal array_data
            local_data["array_data"].append(value)

        def _comma(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _comma  # Ignore whitespace

            if byte_data == 0x2c:  # ,
                return json_machine(on_value, _comma)

            if byte_data == 0x5d:  # ]
                return emit(local_data["array_data"])

            raise Exception("Unexpected byte: 0x" + str(byte_data) + " in array body")

        return _array


    def object_machine(emit):
        local_data = {"object_data": {}, "key": ""}

        def _object(byte_data):
            # nonlocal object_data, key
            if byte_data == 0x7d:  # }
                return emit(local_data["object_data"])

            return _key(byte_data)

        def _key(byte_data):
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _object  # Ignore whitespace

            if byte_data == 0x22:
                return string_machine(on_key)

            raise Exception("Unexpected byte: 0x" + format(byte_data, "x"))

        def on_key(result):
            # nonlocal object_data, key
            local_data["key"] = result
            return _colon

        def _colon(byte_data):
            # nonlocal object_data, key
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _colon  # Ignore whitespace

            if byte_data == 0x3a:  # :
                return json_machine(on_value, _comma)

            raise Exception("Unexpected byte: 0x" + str(byte_data))

        def on_value(value):
            # nonlocal object_data, key
            local_data["object_data"][local_data["key"]] = value

        def _comma(byte_data):
            # nonlocal object_data
            if byte_data == 0x09 or byte_data == 0x0a or byte_data == 0x0d or byte_data == 0x20:
                return _comma  # Ignore whitespace

            if byte_data == 0x2c:  # ,
                return _key

            if byte_data == 0x7d:  # }
                return emit(local_data["object_data"])

            raise Exception("Unexpected byte: 0x" + str(byte_data))

        return _object

Testing it

    if __name__ == "__main__":
        test_json = """[1,2,"3"] {"name": "tarun"} 1 2 3 [{"name":"a", "data": [1, null,2]}]
    """

        def found_json(data):
            print(data)

        state = json_machine(found_json)

        # Feed the parser one byte (here: one character code) at a time.
        for char in test_json:
            state = state(ord(char))

The output of the same is

    [1, 2, '3']
    {'name': 'tarun'}
    1
    2
    3
    [{'name': 'a', 'data': [1, None, 2]}]
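The pattern the parser is built on, where each state is a function that consumes one byte and returns the next state, can be seen in a much smaller setting. The sketch below is not derived from the gist; integer_machine is a hypothetical toy that only recognises whitespace-separated non-negative integers.

```python
def integer_machine(emit):
    """Return the start state of a byte-fed machine that emits integers."""

    def _idle(byte):
        if byte in b" \t\r\n":
            return _idle                      # stay idle on whitespace
        if 0x30 <= byte <= 0x39:
            return _digits(byte - 0x30)       # first digit of a number
        raise ValueError("unexpected byte 0x%02x" % byte)

    def _digits(value):
        # Each call builds a state that remembers the value read so far.
        def _state(byte):
            if 0x30 <= byte <= 0x39:
                return _digits(value * 10 + (byte - 0x30))
            emit(value)                       # number ended: report it
            return _idle(byte)                # re-dispatch the terminating byte
        return _state

    return _idle


found = []
state = integer_machine(found.append)
for byte in b"12 7\n305 ":                    # iterating bytes yields ints
    state = state(byte)
```

As in the full parser, a number is only emitted once the byte after it arrives, which is why the input ends with a trailing space.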

wuliang · Answer

I'd like to provide a solution. The key idea is to "try" to decode: if it fails, feed it more input; otherwise use the offset information to prepare the next decoding.

However, the current json module can't tolerate whitespace at the head of the string to be decoded, so I have to strip it off.

    import sys
    import json

    def iterload(file):
        buffer = ""
        dec = json.JSONDecoder()
        for line in file:
            buffer = buffer.strip(" \n\r\t") + line.strip(" \n\r\t")
            while True:
                try:
                    r = dec.raw_decode(buffer)
                except ValueError:
                    # Incomplete (or empty) buffer: read another line.
                    break
                yield r[0]
                buffer = buffer[r[1]:].strip(" \n\r\t")


    for o in iterload(sys.stdin):
        print("Working on a", type(o), o)
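The try-then-feed idea described above can be seen on a single buffer. This is a minimal sketch with made-up input, separate from the answer's generator: an incomplete buffer makes raw_decode raise, and once more text is appended it returns the value together with the offset where decoding stopped.

```python
import json

dec = json.JSONDecoder()

buffer = '{"foo": ["bar",'          # hypothetical partial input
try:
    dec.raw_decode(buffer)
    complete = True
except ValueError:                  # json.JSONDecodeError is a ValueError
    complete = False

buffer += ' "baz"]} 1'              # "feed" the rest of the value
obj, end = dec.raw_decode(buffer)

# Everything after the offset is kept for the next decoding attempt.
rest = buffer[end:].strip()
```

Note that a truncated value and genuinely malformed JSON both raise the same exception here, so a reader built this way keeps buffering if the input is broken.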

I have tested it with several txt files, and it works fine.

(in1.txt)

{"foo": ["bar", "baz"] } 1 2 [ ] 4 {"foo1": ["bar1", {"foo2":{"A":1, "B":3}, "DDD":4}] } 5 6

(in2.txt)

{"foo" : ["bar", "baz"] } 1 2 [ ] 4 5 6

(in.txt, your initial example)

{"foo": ["bar", "baz"]} 1 2 [] 4 5 6

(output for Benedict's test case)

    python test.py < in.txt
    ('Working on a', <type 'list'>, [u'hello'])
    ('Working on a', <type 'dict'>, {u'goodbye': 1})
    ('Working on a', <type 'int'>, 1)
    ('Working on a', <type 'int'>, 2)
    ('Working on a', <type 'dict'>, {})
    ('Working on a', <type 'int'>, 2)
    ('Working on a', <type 'int'>, 9)
    ('Working on a', <type 'int'>, 78)
    ('Working on a', <type 'int'>, 4)
    ('Working on a', <type 'int'>, 5)
    ('Working on a', <type 'dict'>, {u'animals': [u'dog', u'lots of mice', u'cat']})

sigpwned · Answer

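I used @wuliang's elegant solution. The simple approach (read a byte, try to decode, read a byte, try to decode, ...) worked, but unfortunately it was very slow.

In my case, I was trying to read "pretty-printed" JSON objects of the same object type from a file. This allowed me to optimize the approach: I could read the file line by line, only decoding when I found a line that contained exactly "}":

    import json

    def iterload(stream):
        buf = ""
        dec = json.JSONDecoder()
        for line in stream:
            line = line.rstrip()
            buf = buf + line
            if line == "}":
                # raw_decode returns a (value, end_index) tuple.
                yield dec.raw_decode(buf)
                buf = ""

If you happen to be working with one-per-line compact JSON that escapes newlines in string literals, then you can safely simplify this approach even more:

    def iterload(stream):
        dec = json.JSONDecoder()
        for line in stream:
            yield dec.raw_decode(line)

Obviously, these simple approaches only work for very specific kinds of JSON. However, if these assumptions hold, these solutions work correctly and quickly.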
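The sketch below shows the first variant on input it is suited to. It assumes the objects were written with json.dumps(..., indent=2), so that only the closing brace of a top-level object appears alone at column 0; iter_pretty_objects and the sample records are made up for illustration, and it yields the decoded value rather than the raw_decode tuple.

```python
import io
import json


def iter_pretty_objects(stream):
    """Yield objects from a stream of pretty-printed JSON objects."""
    buf = []
    for line in stream:
        buf.append(line)
        # Nested closing braces are indented, so only a top-level "}"
        # compares equal after stripping trailing whitespace.
        if line.rstrip() == "}":
            yield json.loads("".join(buf))
            buf = []


records = [{"id": 1, "tags": {"a": True}}, {"id": 2, "tags": {}}]
text = "\n".join(json.dumps(r, indent=2) for r in records) + "\n"
parsed = list(iter_pretty_objects(io.StringIO(text)))
```

The decoder runs once per object instead of once per line or byte, which is where the speed-up comes from.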

user3542882 · Answer

Here's mine:

    import simplejson as json
    from simplejson import JSONDecodeError


    class StreamJsonListLoader():
        """
        When you have a big JSON file containing a list, such as

        [{
            ...
        },
        {
            ...
        },
        {
            ...
        },
        ...
        ]

        and it's too big to be practically loaded into memory and parsed by json.load,
        this class comes to the rescue. It lets you lazy-load the large json list.
        """

        def __init__(self, filename_or_stream):
            if type(filename_or_stream) == str:
                self.stream = open(filename_or_stream)
            else:
                self.stream = filename_or_stream

            if not self.stream.read(1) == '[':
                raise NotImplementedError('Only JSON-streams of lists (that start with a [) are supported.')

        def __iter__(self):
            return self

        def next(self):
            read_buffer = self.stream.read(1)
            while True:
                try:
                    json_obj = json.loads(read_buffer)

                    if not self.stream.read(1) in [',', ']']:
                        raise Exception('JSON seems to be malformed: object is not followed by comma (,) or end of list (]).')
                    return json_obj
                except JSONDecodeError:
                    # Not a complete object yet: read up to the next "}" and retry.
                    next_char = self.stream.read(1)
                    read_buffer += next_char
                    while next_char != '}':
                        next_char = self.stream.read(1)
                        if next_char == '':
                            raise StopIteration
                        read_buffer += next_char

        __next__ = next  # Python 3 looks up __next__ rather than next

hetepeperfan · Answer

If you use a json.JSONDecoder instance you can use its raw_decode member function. It returns a tuple of the Python representation of the JSON value and an index to where the parsing stopped. This makes it easy to slice (or seek in a stream object) the remaining JSON values. I'm not that happy about the extra while loop to skip over the whitespace between the different JSON values in the input, but it gets the job done in my opinion.

    import json

    def yield_multiple_value(f):
        '''
        parses multiple JSON values from a file.
        '''
        vals_str = f.read()
        decoder = json.JSONDecoder()
        try:
            nread = 0
            while nread < len(vals_str):
                val, n = decoder.raw_decode(vals_str[nread:])
                nread += n
                # Skip over whitespace because of bug, below.
                while nread < len(vals_str) and vals_str[nread].isspace():
                    nread += 1
                yield val
        except json.JSONDecodeError as e:
            pass
        return

The next version is much shorter and eats the part of the string that is already parsed. It seems that for some reason a second call to json.JSONDecoder.raw_decode() fails when the first character in the string is whitespace; that is also the reason why I skip over the whitespace in the while loop above...

    def yield_multiple_value(f):
        '''
        parses multiple JSON values from a file.
        '''
        vals_str = f.read()
        decoder = json.JSONDecoder()
        while vals_str:
            val, n = decoder.raw_decode(vals_str)
            # remove the read characters from the start.
            vals_str = vals_str[n:]
            # remove leading white space because a second call to decoder.raw_decode()
            # fails when the string starts with whitespace, and
            # I don't understand why...
            vals_str = vals_str.lstrip()
            yield val
        return
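The failure on leading whitespace is how raw_decode behaves: it expects a JSON value to start exactly at the index it is given, whereas decode skips leading whitespace before delegating to it. The sketch below (iter_values is a hypothetical name, and the input is adapted from the question's sample) shows the failure and a variant that passes a start index instead of slicing the string.

```python
import json

decoder = json.JSONDecoder()

# raw_decode does not skip leading whitespace, so this raises.
try:
    decoder.raw_decode(" 1")
    leading_space_ok = True
except json.JSONDecodeError:
    leading_space_ok = False


def iter_values(text):
    """Yield each JSON value in text without copying the remainder."""
    idx, n = 0, len(text)
    while True:
        while idx < n and text[idx].isspace():
            idx += 1                      # skip whitespace between values
        if idx >= n:
            return
        # The returned index is absolute, so it can be reused directly.
        value, idx = decoder.raw_decode(text, idx)
        yield value


values = list(iter_values('{"foo": ["bar", "baz"]} 1 2 []\n4 5 6\n'))
```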

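In the documentation of the json.JSONDecoder class, the description of the method raw_decode (https://docs.python.org/3/library/json.html#encoders-and-decoders) contains the following:

This can be used to decode a JSON document from a string that may have extraneous data at the end.

And this extraneous data can easily be another JSON value. In other words, the method might have been written with this purpose in mind.

Running the function above on input.txt, I obtain the example output presented in the original question.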