Is shared read-only data copied to different processes for multiprocessing?

Question

The piece of code that I have looks something like this:

    from multiprocessing import Pool

    glbl_array = ...  # a 3 GB array

    def my_func(args, def_param=glbl_array):
        # do stuff on args and def_param
        pass

    if __name__ == '__main__':
        pool = Pool(processes=4)
        pool.map(my_func, range(1000))

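Is there a way to make sure (or at least encourage) that the different processes share `glbl_array` rather than each getting a copy of it? If there is no way to stop the copy, I will go with a memmapped array, but my access patterns are not very regular, so I expect memmapped arrays to be slower. The above seemed like the first thing to try. This is on Linux. I just want some advice from Stack Overflow and do not want to annoy the sysadmin. Do you think it would help if the second parameter were a genuinely immutable object such as `glbl_array.tostring()`?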

pv. · Accepted Answer

You can use the shared memory facilities of `multiprocessing` together with Numpy fairly easily:

    import multiprocessing
    import ctypes
    import numpy as np

    shared_array_base = multiprocessing.Array(ctypes.c_double, 10*10)
    shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
    shared_array = shared_array.reshape(10, 10)

    #-- edited 2015-05-01: the assert check below checks the wrong thing
    #   with recent versions of Numpy/multiprocessing. That no copy is made
    #   is indicated by the fact that the program prints the output shown below.
    ## No copy was made
    ##assert shared_array.base.base is shared_array_base.get_obj()

    # Parallel processing
    def my_func(i, def_param=shared_array):
        shared_array[i,:] = i

    if __name__ == '__main__':
        pool = multiprocessing.Pool(processes=4)
        pool.map(my_func, range(10))

        print(shared_array)

which prints

    [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
     [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
     [ 2.  2.  2.  2.  2.  2.  2.  2.  2.  2.]
     [ 3.  3.  3.  3.  3.  3.  3.  3.  3.  3.]
     [ 4.  4.  4.  4.  4.  4.  4.  4.  4.  4.]
     [ 5.  5.  5.  5.  5.  5.  5.  5.  5.  5.]
     [ 6.  6.  6.  6.  6.  6.  6.  6.  6.  6.]
     [ 7.  7.  7.  7.  7.  7.  7.  7.  7.  7.]
     [ 8.  8.  8.  8.  8.  8.  8.  8.  8.  8.]
     [ 9.  9.  9.  9.  9.  9.  9.  9.  9.  9.]]
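In place of the disabled assert, one way to convince yourself that no copy is made is to write through the NumPy view and read the value back through the `multiprocessing.Array` object. This is only a minimal sketch, assuming NumPy is installed; the names `flat_view`, `value_seen_by_wrapper` and `views_share_memory` and the index (2, 3) are chosen for illustration and are not part of the answer.

```python
import ctypes
import multiprocessing

import numpy as np

# Same construction as in the answer: a shared buffer plus a NumPy view on it.
shared_array_base = multiprocessing.Array(ctypes.c_double, 10 * 10)
flat_view = np.ctypeslib.as_array(shared_array_base.get_obj())
shared_array = flat_view.reshape(10, 10)

# Write through the NumPy view ...
shared_array[2, 3] = 42.0

# ... and read back through the multiprocessing wrapper.
# Row 2, column 3 of a 10x10 C-ordered array is flat index 2 * 10 + 3 = 23.
value_seen_by_wrapper = shared_array_base[23]

# reshape() returned a view here, so both arrays use the same buffer.
views_share_memory = np.shares_memory(flat_view, shared_array)

if __name__ == "__main__":
    print(value_seen_by_wrapper, views_share_memory)
```

If the value written through the view shows up in the wrapper, the NumPy array is aliasing the shared buffer rather than holding a private copy of it.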

However, Linux has copy-on-write semantics on fork(), so even without using `multiprocessing.Array`, the data will not be copied unless it is written to.
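The copy-on-write route can be sketched with the standard library alone. The example below is an illustration under stated assumptions, not the answerer's code: `BIG_DATA`, `partial_sum` and `parallel_sum` are hypothetical names, a small `array.array` stands in for the 3 GB array, and the "fork" start method is requested explicitly, so it only runs on POSIX systems. The children never receive the data as an argument; they read the module-level object inherited from the parent.

```python
import array
import multiprocessing as mp

# Hypothetical stand-in for the large read-only array from the question.
BIG_DATA = array.array("d", range(200_000))


def partial_sum(start, stop, conn):
    """Sum one slice of the inherited global and send the result back."""
    # memoryview slicing avoids copying the slice inside the child.
    conn.send(sum(memoryview(BIG_DATA)[start:stop]))
    conn.close()


def parallel_sum(n_workers=4):
    """Sum BIG_DATA with forked children that only read the global."""
    ctx = mp.get_context("fork")  # assumption: POSIX, fork is available
    chunk = len(BIG_DATA) // n_workers
    jobs = []
    for w in range(n_workers):
        start = w * chunk
        stop = len(BIG_DATA) if w == n_workers - 1 else start + chunk
        # With duplex=False the first end receives and the second end sends.
        recv_end, send_end = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=partial_sum, args=(start, stop, send_end))
        proc.start()
        send_end.close()  # the parent keeps only the receiving end
        jobs.append((proc, recv_end))
    total = 0.0
    for proc, recv_end in jobs:
        total += recv_end.recv()
        proc.join()
    return total


if __name__ == "__main__":
    print(parallel_sum() == sum(BIG_DATA))
```

One caveat: copy-on-write works per memory page, and CPython updates reference counts even when objects are only read, so a large structure made of many small Python objects can still end up copied page by page. A single contiguous buffer, such as the data of a NumPy array or the `array.array` used here, is not touched by that bookkeeping when its elements are read.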

taku-y · Answer

The following code works on Win7 and Mac (it may work on Linux, but it has not been tested there).

    import multiprocessing
    import ctypes
    import numpy as np

    #-- edited 2015-05-01: the assert check below checks the wrong thing
    #   with recent versions of Numpy/multiprocessing. That no copy is made
    #   is indicated by the fact that the program prints the output shown below.
    ## No copy was made
    ##assert shared_array.base.base is shared_array_base.get_obj()

    shared_array = None

    def init(shared_array_base):
        # Runs once in each worker: build a NumPy view on the shared buffer.
        global shared_array
        shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
        shared_array = shared_array.reshape(10, 10)

    # Parallel processing
    def my_func(i):
        shared_array[i, :] = i

    if __name__ == '__main__':
        shared_array_base = multiprocessing.Array(ctypes.c_double, 10*10)

        pool = multiprocessing.Pool(processes=4, initializer=init,
                                    initargs=(shared_array_base,))
        pool.map(my_func, range(10))

        shared_array = np.ctypeslib.as_array(shared_array_base.get_obj())
        shared_array = shared_array.reshape(10, 10)
        print(shared_array)

Brian White · Answer

If you are on Windows, which does not support fork() (unless you use Cygwin), pv's answer does not work. Globals are not made available to child processes.

Instead, you must pass the shared memory during the initialization of the Pool/Process, like this:

    #! /usr/bin/python

    import time
    from multiprocessing import Process, Queue, Array

    def f(q, a):
        # First message: the array still holds its initial values.
        m = q.get()
        print(m)
        print(a[0], a[1], a[2])
        # Second message: the parent has overwritten the shared array by now.
        m = q.get()
        print(m)
        print(a[0], a[1], a[2])

    if __name__ == '__main__':
        a = Array('B', (1, 2, 3), lock=False)
        q = Queue()
        p = Process(target=f, args=(q, a))
        p.start()
        q.put([1, 2, 3])
        time.sleep(1)
        a[0:3] = (4, 5, 6)
        q.put([4, 5, 6])
        p.join()

(It's not numpy and it's not good code, but it illustrates the point ;-)

RichardB · Answer

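If you are looking for an option that works efficiently on Windows and that copes well with irregular access patterns, branching, and other scenarios where you may need to analyze different matrices based on a combination of a shared-memory matrix and process-local data, the mathDict toolkit in the ParallelRegression package was designed to handle this exact situation.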