Can Python 3 pickle handle bytes objects larger than 4 GB?

Question

Based on this comment and the referenced documentation, pickle protocol 4.0+, available from Python 3.4+, should be able to pickle bytes objects larger than 4 GB.

However, using Python 3.4.3 or Python 3.5.0b2 on Mac OS X 10.10.4, I get an error when I pickle a large byte array:

    >>> import pickle
    >>> x = bytearray(8 * 1000 * 1000 * 1000)
    >>> fp = open("x.dat", "wb")
    >>> pickle.dump(x, fp, protocol = 4)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OSError: [Errno 22] Invalid argument

Is there a bug in my code, or am I misunderstanding the documentation?

lunguini · Answer

Here is a simple workaround for issue 24658. Use pickle.loads or pickle.dumps, and move the resulting bytes object into or out of the file in chunks of size 2**31 - 1.

    import pickle
    import os.path

    file_path = "pkl.pkl"
    n_bytes = 2**31
    max_bytes = 2**31 - 1
    data = bytearray(n_bytes)

    ## write
    # On Python < 3.8 the default protocol is 3; pass protocol=4 for objects over 4 GB.
    bytes_out = pickle.dumps(data)
    with open(file_path, 'wb') as f_out:
        for idx in range(0, len(bytes_out), max_bytes):
            f_out.write(bytes_out[idx:idx+max_bytes])

    ## read
    bytes_in = bytearray(0)
    input_size = os.path.getsize(file_path)
    with open(file_path, 'rb') as f_in:
        for _ in range(0, input_size, max_bytes):
            bytes_in += f_in.read(max_bytes)
    data2 = pickle.loads(bytes_in)

    assert(data == data2)
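The same idea can be wrapped in two small helpers. This is only a sketch under assumptions: the names dump_in_chunks and load_in_chunks and the max_bytes parameter are hypothetical (they do not come from the answer), and the chunk size is a parameter so the logic can be tried on small data first.

```python
import os
import pickle
import tempfile

MAX_BYTES = 2**31 - 1  # chunk size used by the workaround above


def dump_in_chunks(obj, file_path, max_bytes=MAX_BYTES):
    """Pickle obj in memory, then write it in slices of at most max_bytes."""
    payload = memoryview(pickle.dumps(obj, protocol=4))  # slicing a memoryview does not copy
    with open(file_path, "wb") as f_out:
        for idx in range(0, len(payload), max_bytes):
            f_out.write(payload[idx:idx + max_bytes])


def load_in_chunks(file_path, max_bytes=MAX_BYTES):
    """Read the file back in reads of at most max_bytes, then unpickle."""
    remaining = os.path.getsize(file_path)
    payload = bytearray()
    with open(file_path, "rb") as f_in:
        while remaining > 0:
            chunk = f_in.read(min(max_bytes, remaining))
            if not chunk:
                raise EOFError("file ended before the expected size was read")
            payload += chunk
            remaining -= len(chunk)
    return pickle.loads(payload)
```

Calling dump_in_chunks(data, "pkl.pkl") and then load_in_chunks("pkl.pkl") should round-trip the object. Note that the whole pickle still has to fit in memory on both sides.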

Martin Thoma · Answer

To sum up what was answered in the comments:

Yes, Python can pickle bytes objects larger than 4 GB. The observed error is caused by a bug in the implementation (see Issue24658).

Sam Cohan · Answer

Here is the full workaround, though it seems pickle.load no longer tries to read a huge file in a single call (I am on Python 3.5.2), so strictly speaking only the dumping side needs this to work properly.

    import pickle


    class MacOSFile(object):

        def __init__(self, f):
            self.f = f

        def __getattr__(self, item):
            # Delegate everything except read/write to the wrapped file.
            return getattr(self.f, item)

        def read(self, n):
            # print("reading total_bytes=%s" % n, flush=True)
            if n >= (1 << 31):
                buffer = bytearray(n)
                idx = 0
                while idx < n:
                    # Parentheses matter: 1 << 31 - 1 would parse as 1 << 30.
                    batch_size = min(n - idx, (1 << 31) - 1)
                    # print("reading bytes [%s,%s)..." % (idx, idx + batch_size), end="", flush=True)
                    buffer[idx:idx + batch_size] = self.f.read(batch_size)
                    # print("done.", flush=True)
                    idx += batch_size
                return buffer
            return self.f.read(n)

        def write(self, buffer):
            n = len(buffer)
            print("writing total_bytes=%s..." % n, flush=True)
            idx = 0
            while idx < n:
                batch_size = min(n - idx, (1 << 31) - 1)
                print("writing bytes [%s, %s)... " % (idx, idx + batch_size), end="", flush=True)
                self.f.write(buffer[idx:idx + batch_size])
                print("done.", flush=True)
                idx += batch_size


    def pickle_dump(obj, file_path):
        with open(file_path, "wb") as f:
            return pickle.dump(obj, MacOSFile(f), protocol=pickle.HIGHEST_PROTOCOL)


    def pickle_load(file_path):
        with open(file_path, "rb") as f:
            return pickle.load(MacOSFile(f))

markhor · Answer

Reading a file in 2 GB chunks takes twice as much memory as needed if bytes concatenation is performed. My approach to loading pickles is based on bytearray:

    class MacOSFile(object):

        def __init__(self, f):
            self.f = f

        def __getattr__(self, item):
            return getattr(self.f, item)

        def read(self, n):
            if n >= (1 << 31):
                # Preallocate once and fill in place instead of concatenating.
                buffer = bytearray(n)
                pos = 0
                while pos < n:
                    # Parentheses matter: 1 << 31 - 1 would parse as 1 << 30.
                    size = min(n - pos, (1 << 31) - 1)
                    chunk = self.f.read(size)
                    buffer[pos:pos + size] = chunk
                    pos += size
                return buffer
            return self.f.read(n)

Usage:

    with open("/path", "rb") as fin:
        obj = pickle.load(MacOSFile(fin))
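A possible variant, sketched here as an assumption rather than as part of the answer, fills the preallocated buffer with readinto() so that no temporary chunk object is created at all. The class name ChunkedReader and the chunk_size parameter are hypothetical; the small default override in the usage below exists only so the logic can be exercised on in-memory data.

```python
import io
import pickle

CHUNK = (1 << 31) - 1  # largest single read attempted


class ChunkedReader(object):
    """Hypothetical wrapper: large reads go through readinto() in bounded chunks."""

    def __init__(self, f, chunk_size=CHUNK):
        self.f = f
        self.chunk_size = chunk_size

    def __getattr__(self, item):
        # Everything not defined here is delegated to the wrapped file.
        return getattr(self.f, item)

    def readinto(self, b):
        view = memoryview(b)
        pos = 0
        while pos < len(view):
            got = self.f.readinto(view[pos:pos + self.chunk_size])
            if not got:  # end of file
                break
            pos += got
        return pos

    def read(self, n=-1):
        if n is None or n < 0 or n < self.chunk_size:
            return self.f.read(n)
        buffer = bytearray(n)
        got = self.readinto(buffer)
        # Trim only when the file ended before n bytes were available.
        return buffer if got == n else buffer[:got]
```

Usage mirrors the wrapper above, for example pickle.load(ChunkedReader(fin)) inside a with open(..., "rb") block. Unlike slice assignment from read(), this version also copes with short reads, since it advances by the number of bytes actually received.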

ihopethiswillfi · Answer

I had the same issue and fixed it by upgrading to Python 3.6.8.

This seems to be the PR that did it: https://github.com/python/cpython/pull/9937

Yohan Obadia · Answer

You can specify the protocol for the dump. If you do pickle.dump(obj, file, protocol=4), it should work.
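To check which protocol a pickle was actually written with, one option is to look at its first two bytes. The sketch below assumes a hypothetical helper name, pickle_protocol_of; it relies on the fact that pickles written with protocol 2 or higher begin with the PROTO opcode (0x80) followed by the protocol number. The protocol matters here because protocols below 4 cannot store bytes objects over 4 GB; the traceback in the question, which already passes protocol = 4, has a different cause.

```python
import pickle


def pickle_protocol_of(data):
    """Return the protocol number recorded in a pickle, or None for protocols 0 and 1."""
    if data[:1] == b"\x80":  # PROTO opcode, present from protocol 2 onwards
        return data[1]
    return None


# The defaults depend on the interpreter version, so print rather than assume.
print("default protocol:", pickle.DEFAULT_PROTOCOL)
print("highest protocol:", pickle.HIGHEST_PROTOCOL)
print("explicit protocol=4 gives:", pickle_protocol_of(pickle.dumps(b"example", protocol=4)))
```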

raditya gumay · Answer

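I also ran into this issue. To solve it, I split my code into several iterations. Say I have 50,000 documents for which I need to compute tf-idf and then run kNN classification. When I iterate over all 50,000 directly, I get "that error". So, to solve the problem, I chunk the work: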

    tokenized_documents = self.load_tokenized_preprocessing_documents()
    idf = self.load_idf_41227()
    doc_length = len(documents)
    for iteration in range(0, 9):
        tfidf_documents = []
        for index in range(iteration, 4000):
            doc_tfidf = []
            for term in idf.keys():
                tf = self.term_frequency(term, tokenized_documents[index])
                doc_tfidf.append(tf * idf[term])
            doc = documents[index]
            tfidf = [doc_tfidf, doc[0], doc[1]]
            tfidf_documents.append(tfidf)
            print("{} from {} document {}".format(index, doc_length, doc[0]))
        self.save_tfidf_41227(tfidf_documents, iteration)
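The general pattern of processing a long list in fixed-size pieces and pickling each piece to its own file could look roughly like this. It is a generic sketch, not the poster's code: iter_chunks, save_in_chunks, the transform callback and the chunk_%04d.pkl file naming are all hypothetical, with transform standing in for the per-document tf-idf computation.

```python
import os
import pickle
import tempfile


def iter_chunks(items, chunk_size):
    """Yield consecutive, non-overlapping slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def save_in_chunks(items, out_dir, chunk_size, transform=lambda item: item):
    """Apply transform to each item and pickle the results one chunk per file."""
    paths = []
    for number, chunk in enumerate(iter_chunks(items, chunk_size)):
        path = os.path.join(out_dir, "chunk_%04d.pkl" % number)
        with open(path, "wb") as f_out:
            pickle.dump([transform(item) for item in chunk], f_out, protocol=4)
        paths.append(path)
    return paths
```

Each file then stays well below the size at which a single write fails, and the chunks can be loaded back one at a time with pickle.load.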