Python: sharing a dictionary of pandas DataFrames across multiprocessing

Question

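I have a dictionary of pandas DataFrames in Python. The total size of this dictionary is about 2GB. However, when I share it across 16 multiprocessing subprocesses (in the subprocesses I only read the data in the dict, without modifying it), it takes 32GB of RAM. So I would like to ask whether I can share this dictionary across the processes without copying it. I tried converting it to manager.dict(), but that seems to take too long. What would be the most standard way to achieve this? Thank you.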

bivouac0 · Accepted Answer

The best solution I've found (and it only works for some types of problems) is to use a client/server setup built on Python's BaseManager and SyncManager classes. To do this, you first set up a server that serves a proxy class for the data.

DataServer.py

    #!/usr/bin/env python3
    from multiprocessing.managers import SyncManager
    import numpy

    # Global for storing the data to be served
    gData = {}

    # Proxy class to be shared with different processes
    # Don't put big data in here since that will force it to be piped to the
    # other process when instantiated there, instead just return a portion of
    # the global data when requested.
    class DataProxy(object):
        def __init__(self):
            pass

        def getData(self, key, default=None):
            global gData
            return gData.get(key, default)

    if __name__ == '__main__':
        port = 5000

        print('Simulate loading some data')
        for i in range(1000):
            gData[i] = numpy.random.rand(1000)

        # Start the server on address (host, port)
        print('Serving data. Press <ctrl>-c to stop.')
        class myManager(SyncManager): pass
        myManager.register('DataProxy', DataProxy)
        # authkey must be a byte string on Python 3
        mgr = myManager(address=('', port), authkey=b'DataProxy01')
        server = mgr.get_server()
        server.serve_forever()

Run the above once and leave it running. Below is the client class you use to access the data.

DataClient.py

    from multiprocessing.managers import BaseManager
    import psutil   # 3rd party module for process info (not strictly required)

    # Grab the shared proxy class.  All methods in that class will be available here
    class DataClient(object):
        def __init__(self, port):
            assert self._checkForProcess('DataServer.py'), 'Must have DataServer running'
            class myManager(BaseManager): pass
            myManager.register('DataProxy')
            self.mgr = myManager(address=('localhost', port), authkey=b'DataProxy01')
            self.mgr.connect()
            self.proxy = self.mgr.DataProxy()

        # Verify the server is running (not required).
        # This compares process names, so it may fail if the server was
        # launched as "python DataServer.py" rather than executed directly.
        @staticmethod
        def _checkForProcess(name):
            for proc in psutil.process_iter():
                if proc.name() == name:
                    return True
            return False

Below is the test code to try this with multiprocessing.

TestMP.py

    #!/usr/bin/env python3
    import time
    import multiprocessing as mp
    import numpy
    from DataClient import DataClient

    # Confusing, but the "proxy" will be global to each subprocess,
    # it's not shared across all processes.
    gProxy = None
    gMode = None
    gDummy = None

    def init(port, mode):
        global gProxy, gMode, gDummy
        gProxy = DataClient(port).proxy
        gMode = mode
        gDummy = numpy.random.rand(1000)  # Same as the dummy in the server
        #print('Init proxy ', id(gProxy), 'in ', mp.current_process())

    def worker(key):
        global gProxy, gMode, gDummy
        if 0 == gMode:    # get from proxy
            array = gProxy.getData(key)
        elif 1 == gMode:  # bypass retrieve to test difference
            array = gDummy
        else:
            assert 0, 'unknown mode: %s' % gMode
        for i in range(1000):
            x = sum(array)
        return x

    if __name__ == '__main__':
        port = 5000
        maxkey = 1000
        numpts = 100

        for mode in [1, 0]:
            for nprocs in [16, 1]:
                if 0 == mode: print('Using client/server and %d processes' % nprocs)
                if 1 == mode: print('Using local data and %d processes' % nprocs)
                keys = [numpy.random.randint(0, maxkey) for k in range(numpts)]
                pool = mp.Pool(nprocs, initializer=init, initargs=(port, mode))
                start = time.time()
                ret_data = pool.map(worker, keys, chunksize=1)
                print('   took %4.3f seconds' % (time.time() - start))
                pool.close()

When I run this on my machine I get...

    Using local data and 16 processes
       took 0.695 seconds
    Using local data and 1 processes
       took 5.849 seconds
    Using client/server and 16 processes
       took 0.811 seconds
    Using client/server and 1 processes
       took 5.956 seconds

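Whether this works for you in your multiprocessing system depends on how often you have to get the data. There is a small overhead associated with each transfer. You can see this if you reduce the number of iterations in the x=sum(array) loop. At some point you'll spend more time getting data than working on it.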
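The per-transfer overhead can be measured directly. The following is a minimal sketch, not the benchmark above: to stay self-contained it runs the manager server in a background thread of the same process and binds to an ephemeral port, and the names DATA, ServerManager, ClientManager, start_server, connect, and time_work are made up for illustration. It times a fetch followed by a shrinking amount of work, once through the proxy and once against the local dictionary.

```python
import threading
import time
from multiprocessing.managers import BaseManager

# Hypothetical stand-in for the served data: 10 lists of 1000 floats.
DATA = {key: [float(i) for i in range(1000)] for key in range(10)}

class DataProxy:
    """Server-side object; its public methods are callable through the proxy."""
    def getData(self, key, default=None):
        return DATA.get(key, default)

class ServerManager(BaseManager):
    pass

class ClientManager(BaseManager):
    pass

ServerManager.register('DataProxy', DataProxy)  # server knows how to build it
ClientManager.register('DataProxy')             # client only knows the name

def start_server(authkey=b'demo'):
    """Serve DataProxy from a daemon thread; return the bound (host, port)."""
    mgr = ServerManager(address=('127.0.0.1', 0), authkey=authkey)
    server = mgr.get_server()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.address

def connect(address, authkey=b'demo'):
    """Connect to a running server and return a proxy to a DataProxy object."""
    mgr = ClientManager(address=address, authkey=authkey)
    mgr.connect()
    return mgr.DataProxy()

def time_work(fetch, key, inner_loops, repeats=20):
    """Average seconds for one fetch followed by inner_loops sums."""
    start = time.perf_counter()
    for _ in range(repeats):
        array = fetch(key)
        for _ in range(inner_loops):
            sum(array)
    return (time.perf_counter() - start) / repeats

if __name__ == '__main__':
    proxy = connect(start_server())
    for inner_loops in (1000, 100, 10, 1):
        remote = time_work(proxy.getData, 0, inner_loops)
        local = time_work(DATA.get, 0, inner_loops)
        print('inner loops=%4d  proxy %.6fs  local %.6fs'
              % (inner_loops, remote, local))
```

As the inner loop shrinks, the proxy timings should level off near the round-trip cost while the local timings keep dropping; the exact numbers depend on the machine.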

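Besides multiprocessing, I also like this pattern because I only have to load my big array data once in the server program, and it stays loaded until I kill the server. That means I can run a bunch of separate scripts against the data and they execute quickly; I don't have to wait for the data to load.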

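While the approach here is somewhat similar to using a database, it has the advantage of working with any type of Python object, not just simple DB tables of strings, ints, and so on. I've found that using a DB is a bit faster for those simple types, but for me it tends to be more work programmatically, and my data doesn't always port over easily to a database.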
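One caveat on "any type of Python object": assuming the manager's default pickle-based serializer, whatever a proxy method returns has to be picklable. The sketch below uses the standard pickle module directly as a stand-in for what travels over the connection; the helper survives_transfer and the sample record are made up for illustration.

```python
import datetime
import pickle
import threading
from decimal import Decimal

def survives_transfer(obj):
    """Return True if obj round-trips through pickle and compares equal."""
    try:
        return pickle.loads(pickle.dumps(obj)) == obj
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

# Hypothetical record mixing types that do not map neatly onto DB columns.
record = {
    'when': datetime.datetime(2020, 1, 1, 12, 30),
    'price': Decimal('19.99'),
    'tags': {'a', 'b'},
    'shape': (3, (4, 5)),
    'nested': [{'x': 1.5}, None],
}

if __name__ == '__main__':
    # Nested plain-Python data can be pickled and restored.
    print(survives_transfer(record))
    # A lock holds OS-level state and cannot be pickled.
    print(survives_transfer(threading.Lock()))
```

Objects that fail this kind of check would need a server-side method that returns a picklable summary or portion instead.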