How do I use a boto3 client with Python multiprocessing?

Question

The code looks like this:

    import multiprocessing as mp
    from functools import partial

    import boto3
    import numpy as np

    s3 = boto3.client('s3')

    def _something(**kwargs):
        # Some mixed integer programming stuff related to the variable archive
        return np.array(some_variable_related_to_archive)

    def do(s3):
        archive = np.load(s3.get_object('some_key'))  # Simplified -- details not relevant
        pool = mp.Pool()
        sub_process = partial(_something, slack=0.1)
        parts = np.array_split(archive, some_int)
        target_parts = np.array(things)
        out = pool.starmap(sub_process, [x for x in zip(parts, target_parts)])  # Error occurs at this line
        pool.close()
        pool.join()

    do(s3)

Error:

_pickle.PicklingError: Can't pickle <class 'botocore.client.S3'>: attribute lookup S3 on botocore.client failed

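I have very limited experience with Python's multiprocessing library, and I don't understand why the error above is thrown when the S3 client is not a parameter of any of the functions run in the pool. If the archive file is loaded from disk instead of from S3, the code runs fine.

Any help/guidance would be appreciated.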

RNHTTR · Accepted Answer

Objects passed to Pool.starmap() must be picklable, and S3 clients are not picklable. Moving the S3 client's work outside of the function that calls Pool.starmap() can solve the problem:

    import multiprocessing as mp
    from functools import partial

    import boto3
    import numpy as np

    s3 = boto3.client('s3')
    # Move the s3 call here, outside of the do() function
    archive = np.load(s3.get_object('some_key'))  # Simplified -- details not relevant

    def _something(**kwargs):
        # Some mixed integer programming stuff related to the variable archive
        return np.array(some_variable_related_to_archive)

    def do(archive):  # pass the previously loaded archive, and not the s3 object, into the function
        pool = mp.Pool()
        sub_process = partial(_something, slack=0.1)
        parts = np.array_split(archive, some_int)
        target_parts = np.array(things)
        out = pool.starmap(sub_process, [x for x in zip(parts, target_parts)])
        pool.close()
        pool.join()

    do(archive)  # pass the previously loaded archive, and not the s3 object, into the function
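To see what the error message is describing, the sketch below tries to pickle an object before it is handed to a pool. It deliberately does not use boto3: `make_dynamic_client` is a hypothetical stand-in that builds its class at runtime with `type()`, which is, as far as I understand, roughly how botocore creates its client classes, so pickle cannot look the class up by name. `is_picklable` is likewise an illustrative helper, not part of any library.

```python
import pickle


def make_dynamic_client():
    """Return an instance of a class created at runtime (stand-in for an S3 client)."""
    # The class is never bound to a module-level name, so pickle's
    # attribute lookup for "S3" on the defining module fails.
    cls = type("S3", (), {})
    return cls()


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False if it raises."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


if __name__ == "__main__":
    print(is_picklable([1.0, 2.0, 3.0]))        # plain data
    print(is_picklable(make_dynamic_client()))  # dynamically created class
```

Running a check like this on each argument (and on anything captured by the worker function) is one way to find out which object triggers the PicklingError.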

Pablo Andres Perez Quevedo · Answer

Well, I solved it in a fairly simple way: instead of the client, I used a smaller, less complex object, namely the Bucket class.

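However, you need to take the following post into account: "Can't pickle when using multiprocessing Pool.map()". I put every object related to boto3 outside of any function or class. Some other posts suggest placing the s3 objects and functions inside the function you are trying to parallelize in order to avoid overhead, but I have not tried that yet. In fact, the code I am sharing saves the information as msgpack files.

My code example is as follows (outside of any class or function). I hope it helps.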

    import boto3
    import pandas as pd
    from botocore.exceptions import ClientError
    from pathos.pools import ProcessPool

    # Note: module, ncpus and files are not defined in this snippet.
    s3 = boto3.resource('s3')
    s3_bucket_name = 'bucket-name'
    s3_bucket = s3.Bucket(s3_bucket_name)

    def msgpack_dump_s3(df, filename):
        try:
            # DataFrame.to_msgpack() requires pandas < 1.0
            s3_bucket.put_object(Body=df.to_msgpack(), Key=filename)
            print(module, filename + " successfully saved into s3 bucket '" + s3_bucket.name + "'")
        except Exception as e:
            # logging all the others as warning
            print(module, "Failed saving to bucket. Continuing. {}".format(e))

    def msgpack_load_s3(filename):
        try:
            return s3_bucket.Object(filename).get()['Body'].read()
        except ClientError as ex:
            if ex.response['Error']['Code'] == 'NoSuchKey':
                print(module, 'No object found - returning None')
                return None
            else:
                print(module, "Failed loading from bucket. {}".format(ex))
                raise ex
        except Exception as e:
            # logging all the others as warning
            print(module, "Failed loading from bucket. Continuing. {}".format(e))
            return

    def upper_function():

        def function_to_parallelize(filename):
            file = msgpack_load_s3(filename)
            if file is not None:
                df = pd.read_msgpack(file)  # requires pandas < 1.0
                # do something
                print('\t\t\tSaving updated info...')
                msgpack_dump_s3(df, filename)

        pool = ProcessPool(nodes=ncpus)
        # do an asynchronous map, then get the results
        results = pool.imap(function_to_parallelize, files)
        print("...")
        print(list(results))
        # while not results.ready():
        #     time.sleep(5)
        #     print(".", end=' ')