Running multiple TensorFlow sessions concurrently

Question

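I am trying to run several TensorFlow sessions concurrently on a CentOS 7 machine with 64 CPUs. My colleague reports that he can use the following two blocks of code to get a parallel speedup on his machine, which uses 4 cores: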

mnist.py

import numpy as np
import input_data
from PIL import Image
import tensorflow as tf
import time


def main(randint):
    print 'Set new seed:', randint
    np.random.seed(randint)
    tf.set_random_seed(randint)
    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

    # Setting up the softmax architecture
    x = tf.placeholder("float", [None, 784])
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)

    # Setting up the cost function
    y_ = tf.placeholder("float", [None, 10])
    cross_entropy = -tf.reduce_sum(y_*tf.log(y))
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

    # Initialization
    init = tf.initialize_all_variables()
    sess = tf.Session(
        config=tf.ConfigProto(
            inter_op_parallelism_threads=1,
            intra_op_parallelism_threads=1
        )
    )
    sess.run(init)

    for i in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

    print sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels})


if __name__ == "__main__":
    t1 = time.time()
    main(0)
    t2 = time.time()
    print "time spent: {0:.2f}".format(t2 - t1)

parallel.py

import multiprocessing
import numpy as np

import mnist
import time

t1 = time.time()
p1 = multiprocessing.Process(target=mnist.main,args=(np.random.randint(10000000),))
p2 = multiprocessing.Process(target=mnist.main,args=(np.random.randint(10000000),))
p3 = multiprocessing.Process(target=mnist.main,args=(np.random.randint(10000000),))
p1.start()
p2.start()
p3.start()
p1.join()
p2.join()
p3.join()
t2 = time.time()
print "time spent: {0:.2f}".format(t2 - t1)

In particular, he says that he observes:

Running a single process took: 39.54 seconds
Running three processes took: 54.16 seconds

However, when I run the code, I get:

python mnist.py ==> Time spent: 5.14
python parallel.py ==> Time spent: 37.65

As you can see, I get a significant slowdown from using multiprocessing, whereas my colleague does not. Does anyone have any insight into why this might be happening and what can be done to fix it?

Edit

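Here is some example output. Note that although the data appears to load in parallel, the training of the individual models looks very sequential in the output (this can be verified by watching CPU usage in top while the program runs).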

#$ python parallel.py
Set new seed: 9672406
Extracting MNIST_data/train-images-idx3-ubyte.gz
Set new seed: 4790824
Extracting MNIST_data/train-images-idx3-ubyte.gz
Set new seed: 8011659
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 1
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 1
0.9136
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 1
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 1
0.9149
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 1
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 1
0.8931
time spent: 41.36

Another edit

I wanted to confirm that the issue lies with TensorFlow and not with multiprocessing, so I replaced the contents of mnist.py with a big loop as follows:

def main(randint):
    c = 0
    for i in xrange(100000000):
        c += i

This gives the output:

#$ python mnist.py ==> time spent: 5.16
#$ python parallel.py ==> time spent: 4.86

So I don't think the problem here is with multiprocessing itself.

Guy Coder · Answer

From a comment by the OP (user1936768):

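I have some good news: it turns out that, on my system at least, my trial programs didn't run long enough for the other instances of TF to start up. When I put a longer-running example program in main, I do indeed see concurrent computation.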
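One way to check for this effect is to time a stand-in workload at a short and a long size, both serially and in separate processes. The following is only a sketch under stated assumptions: it is written for Python 3, `busy_work` is a hypothetical CPU-bound placeholder for a training run (it is not TensorFlow code), and the iteration counts are arbitrary. If per-process startup dominates, the short workload should show little or no speedup while the long one should.

```python
import multiprocessing
import time


def busy_work(n_iterations):
    """Hypothetical CPU-bound stand-in for one training run."""
    total = 0
    for i in range(n_iterations):
        total += i
    return total


def time_serial(n_workers, n_iterations):
    """Run the workload n_workers times, one after another."""
    start = time.time()
    for _ in range(n_workers):
        busy_work(n_iterations)
    return time.time() - start


def time_parallel(n_workers, n_iterations):
    """Run the workload in n_workers separate processes at once."""
    procs = [
        multiprocessing.Process(target=busy_work, args=(n_iterations,))
        for _ in range(n_workers)
    ]
    start = time.time()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.time() - start


if __name__ == "__main__":
    # Arbitrary sizes: one short run and one long run.
    for n_iterations in (10**4, 10**7):
        serial = time_serial(3, n_iterations)
        parallel = time_parallel(3, n_iterations)
        print("n={0}: serial {1:.2f}s, parallel {2:.2f}s".format(
            n_iterations, serial, parallel))
```

The timings depend on the machine, so no expected output is given. The same comparison can be repeated with the real training function once the run is long enough to outweigh startup.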

Robert Nishihara · Answer

This can be done elegantly with Ray, a library for parallel and distributed Python, which lets you train your models in parallel from a single Python script.

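This has the advantage of letting you parallelize "classes" by turning them into "actors", which can be hard to do with regular Python multiprocessing. This matters because initializing the TensorFlow graph is often the expensive part. If you create an actor and then call its train method multiple times, the cost of initializing the graph is amortized.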

import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
from PIL import Image
import ray
import tensorflow as tf
import time


@ray.remote
class TrainingActor(object):
    def __init__(self, seed):
        print('Set new seed:', seed)
        np.random.seed(seed)
        tf.set_random_seed(seed)
        self.mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)

        # Setting up the softmax architecture.
        self.x = tf.placeholder('float', [None, 784])
        W = tf.Variable(tf.zeros([784, 10]))
        b = tf.Variable(tf.zeros([10]))
        self.y = tf.nn.softmax(tf.matmul(self.x, W) + b)

        # Setting up the cost function.
        self.y_ = tf.placeholder('float', [None, 10])
        cross_entropy = -tf.reduce_sum(self.y_*tf.log(self.y))
        self.train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

        # Initialization
        self.init = tf.initialize_all_variables()
        self.sess = tf.Session(
            config=tf.ConfigProto(
                inter_op_parallelism_threads=1,
                intra_op_parallelism_threads=1
            )
        )

    def train(self):
        self.sess.run(self.init)

        for i in range(1000):
            batch_xs, batch_ys = self.mnist.train.next_batch(100)
            self.sess.run(self.train_step, feed_dict={self.x: batch_xs, self.y_: batch_ys})

        correct_prediction = tf.equal(tf.argmax(self.y, 1), tf.argmax(self.y_, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, 'float'))

        return self.sess.run(accuracy, feed_dict={self.x: self.mnist.test.images,
                                                  self.y_: self.mnist.test.labels})


if __name__ == '__main__':
    # Start Ray.
    ray.init()

    # Create 3 actors.
    training_actors = [TrainingActor.remote(seed) for seed in range(3)]

    # Make them all train in parallel.
    accuracy_ids = [actor.train.remote() for actor in training_actors]
    print(ray.get(accuracy_ids))

    # Start new training runs in parallel.
    accuracy_ids = [actor.train.remote() for actor in training_actors]
    print(ray.get(accuracy_ids))

If you only want to create one copy of the dataset instead of having each actor read it, you can rewrite things as follows. Under the hood, this uses the Plasma shared-memory object store and the Apache Arrow data format.

@ray.remote
class TrainingActor(object):
    def __init__(self, mnist, seed):
        self.mnist = mnist
        ...

    ...


if __name__ == "__main__":
    ray.init()

    # Read the mnist dataset and put it into shared memory once
    # so that workers don't create their own copies.
    mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)
    mnist_id = ray.put(mnist)

    training_actors = [TrainingActor.remote(mnist_id, seed) for seed in range(3)]

You can find more details in the Ray documentation. Note: I am one of the Ray developers.

Yaroslav Bulatov · Answer

One possibility is that your sessions are each trying to use all 64 cores and stomping on each other. Perhaps try setting NUM_CORES to a lower value for each session:

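# NUM_CORES is a value you choose, e.g. the number of threads per session.
# The ConfigProto must be passed as the config keyword argument;
# the first positional argument of tf.Session is the target.
sess = tf.Session(
    config=tf.ConfigProto(inter_op_parallelism_threads=NUM_CORES,
                          intra_op_parallelism_threads=NUM_CORES))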
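As a rough illustration of how NUM_CORES might be chosen, the sketch below splits the machine's cores evenly between the processes you plan to launch. It is written for Python 3, and `threads_per_process` is a hypothetical helper introduced here, not part of TensorFlow. An even split is only one possible policy.

```python
import multiprocessing


def threads_per_process(n_processes, total_cores=None):
    """Split the available cores evenly, giving each process at least one.

    Hypothetical helper: total_cores defaults to the machine's CPU count.
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    if total_cores is None:
        total_cores = multiprocessing.cpu_count()
    return max(1, total_cores // n_processes)


# Three training processes on the 64-CPU machine from the question.
NUM_CORES = threads_per_process(n_processes=3, total_cores=64)
print(NUM_CORES)  # prints 21

# The value would then be passed to each session's config, e.g.:
# tf.ConfigProto(inter_op_parallelism_threads=NUM_CORES,
#                intra_op_parallelism_threads=NUM_CORES)
```

Whether this helps depends on the workload; the question's code already sets both thread counts to 1, so it is mainly relevant when sessions are left at their defaults.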