ResourceExhaustedErrorを理解する：形状を持つテンソルを割り当てるときのOOM

Question

私はテンソルフローを使用して思考のスキップモデルを実装しようとしています。現在のバージョンが配置されていますここ。

現在、マシンの1つのGPU（合計2つのGPU）を使用しており、GPU情報は

_2017-09-06 11:29:32.657299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate (GHz) 1.683 pciBusID 0000:02:00.0 Total memory: 10.91GiB Free memory: 10.75GiB _

しかし、モデルにデータをフィードしようとしたときにOOMを取得しました。私は次のようにデバッグしようとします：

sess.run(tf.global_variables_initializer())を実行した直後に次のスニペットを使用します

_ logger.info('Total: {} params'.format( np.sum([ np.prod(v.get_shape().as_list()) for v in tf.trainable_variables() ]))) _

_2017-09-06 11:29:51,333 INFO main main.py:127 - Total: 62968629 params_を取得しました。すべてが_240Mb_を使用している場合、おおよそ_tf.float32_になります。 _tf.global_variables_の出力は

_[<tf.Variable 'embedding/embedding_matrix:0' shape=(155229, 200) dtype=float32_ref>, <tf.Variable 'encoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>, <tf.Variable 'encoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>, <tf.Variable 'encoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>, <tf.Variable 'encoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>, <tf.Variable 'decoder/weights:0' shape=(200, 155229) dtype=float32_ref>, <tf.Variable 'decoder/biases:0' shape=(155229,) dtype=float32_ref>, <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>, <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>, <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>, <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>, <tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>, <tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>, <tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>, <tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>, <tf.Variable 'global_step:0' shape=() dtype=int32_ref>] _

私のトレーニングフレーズには、_(164652, 3, 30)_、つまり_sample_size x 3 x time_step_という形のデータ配列があります。ここでの_3_は、前の文、現在の文、次の文を意味します。このトレーニングデータのサイズは約_57Mb_で、loaderに保存されます。次に、ジェネレータ関数を記述して文を取得します。

_def iter_batches(self, batch_size=128, time_major=True, shuffle=True): num_samples = len(self._sentences) if shuffle: samples = self._sentences[np.random.permutation(num_samples)] else: samples = self._sentences batch_start = 0 while batch_start < num_samples: batch = samples[batch_start:batch_start + batch_size] lens = (batch != self._vocab[self._vocab.pad_token]).sum(axis=2) y, x, z = batch[:, 0, :], batch[:, 1, :], batch[:, 2, :] if time_major: yield (y.T, lens[:, 0]), (x.T, lens[:, 1]), (z.T, lens[:, 2]) else: yield (y, lens[:, 0]), (x, lens[:, 1]), (z, lens[:, 2]) batch_start += batch_size _

トレーニングループは次のようになります

_for Epoch in num_epochs: batches = loader.iter_batches(batch_size=args.batch_size) try: (y, y_lens), (x, x_lens), (z, z_lens) = next(batches) _, summaries, loss_val = sess.run( [train_op, train_summary_op, st.loss], feed_dict={ st.inputs: x, st.sequence_length: x_lens, st.previous_targets: y, st.previous_target_lengths: y_lens, st.next_targets: z, st.next_target_lengths: z_lens }) except StopIteraton: ... _

次に、OOMを取得しました。 tryの本文全体（データをフィードしない）をコメント化すると、スクリプトは正常に実行されます。

なぜこのような小さなデータスケールでOOMを取得したのか、私にはわかりません。 _nvidia-smi_を使用する

_Wed Sep 6 12:03:37 2017 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 384.59 Driver Version: 384.59 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 GeForce GTX 108... Off | 00000000:02:00.0 Off | N/A | | 0% 44C P2 60W / 275W | 10623MiB / 11172MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 1 GeForce GTX 108... Off | 00000000:03:00.0 Off | N/A | | 0% 43C P2 62W / 275W | 10621MiB / 11171MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | 0 32748 C python3 10613MiB | | 1 32748 C python3 10611MiB | +-----------------------------------------------------------------------------+ _

私はできませんテンソルフロー以降の私のスクリプトのactualGPU使用状況を確認できません最初は常にすべてのメモリを盗みます。ここでの実際の問題は、これをデバッグする方法がわからないことです。

StackOverflowでOOMに関するいくつかの投稿を読みました。それらのほとんどは、大きなテストセットデータをモデルに供給し、データを小さなバッチで供給することで問題を回避できるときに発生しました。しかし、エラーが_[3840 x 155229]_のサイズの行列を割り当てようとするだけなので、なぜこのような小さなデータとパラメーターの組み合わせが私の11Gb 1080Tiで問題になるのかわかりません。（デコーダーの出力行列3840 = 30(time_steps) x 128(batch_size)、_155229_はvocab_sizeです）。

_2017-09-06 12:14:45.787566: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ********************************************************************************************xxxxxxxx 2017-09-06 12:14:45.787597: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[3840,155229] 2017-09-06 12:14:45.788735: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[3840,155229] [[Node: decoder/previous_decoder/Add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/previous_decoder/MatMul, decoder/biases/read)]] 2017-09-06 12:14:45.790453: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2857 get requests, put_count=2078 evicted_count=1000 eviction_rate=0.481232 and unsatisfied allocation rate=0.657683 2017-09-06 12:14:45.790482: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110 Traceback (most recent call last): File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1139, in _do_call return fn(*args) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1121, in _run_fn status, run_metadata) File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__ next(self.gen) File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status pywrap_tensorflow.TF_GetCode(status)) tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3840,155229] [[Node: decoder/previous_decoder/Add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/previous_decoder/MatMul, decoder/biases/read)]] [[Node: GradientDescent/update/_146 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="Edge_2166_GradientDescent/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]] During handling of the above exception, another exception occurred: _

どんな助けでもありがたいです。前もって感謝します。

Deniz Beker · Accepted Answer

問題を1つずつ分けましょう。

すべてのメモリを事前に割り当てるテンソルフローについては、次のコードスニペットを使用して、テンソルフローに必要なときにいつでもメモリを割り当てさせることができます。状況がわかるように。

_gpu_options = tf.GPUOptions(allow_growth=True) session = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options)) _

これは、必要に応じて、tf.Session()ではなくtf.InteractiveSession()と同じように機能します。

サイズについての2番目のことは、ネットワークのサイズに関する情報がないため、何が問題なのかを推定することはできません。ただし、すべてのネットワークを段階的にデバッグすることもできます。たとえば、1つのレイヤーのみでネットワークを作成し、その出力を取得し、セッションとフィードの値を1回作成し、消費するメモリの量を視覚化します。メモリ不足になるポイントが見つかるまで、このデバッグセッションを繰り返します。

3840 x 155229の出力は本当に大きな出力であることに注意してください。これは、約6億個のニューロンを意味し、1層あたり最大2.22 GBです。同様のサイズのレイヤーがある場合、それらすべてが合計されてGPUメモリがかなり高速にいっぱいになります。

また、これは順方向専用です。このレイヤーをトレーニングに使用している場合、オプティマイザーによって追加されたバックプロパゲーションとレイヤーは、このサイズを2倍にします。したがって、トレーニングの場合、出力レイヤーにのみ5 GBを消費します。

モデルをGPUに合わせるために、ネットワークを修正し、バッチサイズ/パラメーター数を減らすことをお勧めします

rocksyne · Answer

これは技術的に意味をなさないかもしれませんが、しばらく実験した後、これは私が見つけたものです。

環境：Ubuntu 16.04

nvidia-smi

インストールされているNvidiaグラフィックカードの合計メモリ消費量を取得します。例はこの画像に示すとおりです

ニューラルネットワークを実行すると、消費量は次のように変化する可能性があります

通常、メモリ消費はPythonに与えられます。何らかの奇妙な理由で、このプロセスが正常に終了しない場合、メモリは解放されません。ニューラルネットワークアプリケーションの別のインスタンスを実行しようとすると、メモリ割り当てエラーが発生します。難しい方法は、プロセスIDを使用してこのプロセスを終了する方法を理解しようとすることです。例、プロセスID 2794で、次のことができます

Sudo kill -9 2794

簡単な方法は、コンピュータを再起動して再試行することです。ただし、コード関連のバグの場合は機能しません。

上記のプロセスが機能しない場合は、GPUまたはCPUメモリに適合しないデータバッチサイズを使用している可能性があります。

できることは、入力データのバッチサイズまたは空間次元（長さ、幅、奥行き）を減らすことです。これは機能する可能性がありますが、RAMが不足する可能性があります。

RAM=を節約する最も確実な方法は、関数発生器を使用することであり、それ自体が主題です。

Jibin Mathew · Answer

メモリを使い果たしています。バッチサイズを減らすことができます。これにより、トレーニングプロセスが遅くなりますが、データを適合させることができます。