What is "Device interconnect StreamExecutor with strength 1 Edge matrix"?

Question

I have four NVIDIA GTX 1080 graphics cards, and when I initialize a session I see the following console output:

    Adding visible gpu devices: 0, 1, 2, 3
    Device interconnect StreamExecutor with strength 1 Edge matrix:
         0 1 2 3
    0:   N Y N N
    1:   Y N N N
    2:   N N N Y
    3:   N N Y N

I also have two NVIDIA M60 Tesla graphics cards, and their initialization looks like this:

    Adding visible gpu devices: 0, 1, 2, 3
    Device interconnect StreamExecutor with strength 1 Edge matrix:
         0 1 2 3
    0:   N N N N
    1:   N N N N
    2:   N N N N
    3:   N N N N

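I also noticed that this output changed for the 1080 GPUs after the last update from 1.6 to 1.8. It used to look something like one of the following (I don't remember exactly; this is from memory only):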

    Adding visible gpu devices: 0, 1, 2, 3
    Device interconnect StreamExecutor with strength 1 Edge matrix:
         0 1 2 3             0 1 2 3
    0:   Y N N N        0:   N N Y N
    1:   N Y N N   or   1:   N N N Y
    2:   N N Y N        2:   Y N N N
    3:   N N N Y        3:   N Y N N

My questions are:

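What is this device interconnect?
What influence does it have on computation power?
Why does it differ between GPUs?
Can it change over time for hardware reasons (failures, driver inconsistencies, and so on)?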

McAngus · Accepted Answer

TL;DR

What is this device interconnect?

As Almog David stated in the comments, it tells you whether one GPU has direct memory access to another.

What influence does it have on computation power?

The only effect is on multi-GPU training: data transfer between two GPUs is faster if they have a device interconnect.

Why does it differ between GPUs?

It depends on the topology of the hardware setup. A motherboard has only so many PCI-e slots connected by the same bus. (Check the topology with nvidia-smi topo -m.)

Can it change over time for hardware reasons (failures, driver inconsistencies, and so on)?

I don't think the order can change over time unless NVIDIA changes the default enumeration scheme. There is a little more detail here

Explanation

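This message is generated in the BaseGPUDeviceFactory::CreateDevices function. It iterates over each pair of devices in the given order and calls cuDeviceCanAccessPeer. As Almog David mentioned in the comments, this indicates whether DMA can be performed between the two devices.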

You can run a small test to confirm that the order matters. Consider the following snippet:

    # test.py
    import tensorflow as tf

    # allow growth to take up minimal resources
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True

    sess = tf.Session(config=config)

Now let's check the output with different device orders in CUDA_VISIBLE_DEVICES:

    $ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 test.py
    ...
    2019-03-26 15:26:16.111423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
    2019-03-26 15:26:18.635894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 Edge matrix:
    2019-03-26 15:26:18.635965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
    2019-03-26 15:26:18.635974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y N N
    2019-03-26 15:26:18.635982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N N N
    2019-03-26 15:26:18.635987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N Y
    2019-03-26 15:26:18.636010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N Y N
    ...

    $ CUDA_VISIBLE_DEVICES=2,0,1,3 python3 test.py
    ...
    2019-03-26 15:26:30.090493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
    2019-03-26 15:26:32.758272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 Edge matrix:
    2019-03-26 15:26:32.758349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
    2019-03-26 15:26:32.758358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N Y
    2019-03-26 15:26:32.758364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N Y N
    2019-03-26 15:26:32.758389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N Y N N
    2019-03-26 15:26:32.758412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y N N N
    ...
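Both matrices are consistent with a single fixed physical topology in which GPUs 0 and 1 can reach each other and GPUs 2 and 3 can reach each other; only the logical numbering changes. The following is a minimal sketch of the pairwise loop, not TensorFlow's actual code. It needs no GPU: the PEER_PAIRS set and the can_access_peer function are hypothetical stand-ins for the driver query, chosen to match the machine above.

```python
# Hypothetical physical topology: GPUs 0<->1 and 2<->3 can access each other.
# In TensorFlow the answer comes from the CUDA driver, not from a fixed set.
PEER_PAIRS = {frozenset((0, 1)), frozenset((2, 3))}


def can_access_peer(a, b):
    """Stand-in for the driver's peer-access query between physical GPUs."""
    return frozenset((a, b)) in PEER_PAIRS


def edge_matrix(visible_devices):
    """Build the Y/N rows for physical GPU ids given in CUDA_VISIBLE_DEVICES order."""
    rows = []
    for src in visible_devices:
        row = ["Y" if src != dst and can_access_peer(src, dst) else "N"
               for dst in visible_devices]
        rows.append(" ".join(row))
    return rows


if __name__ == "__main__":
    for order in ([0, 1, 2, 3], [2, 0, 1, 3]):
        print("CUDA_VISIBLE_DEVICES=" + ",".join(map(str, order)))
        for i, row in enumerate(edge_matrix(order)):
            print(f"{i}: {row}")
```

With the order 2,0,1,3, logical device 0 is physical GPU 2, whose only peer (physical GPU 3) is now logical device 3, which is why the Y in the first row moves to the last column.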

You can get a more detailed description of the connections by running nvidia-smi topo -m. For example:

            GPU0  GPU1  GPU2  GPU3  CPU Affinity
    GPU0     X    PHB   SYS   SYS   0-7,16-23
    GPU1    PHB    X    SYS   SYS   0-7,16-23
    GPU2    SYS   SYS    X    PHB   8-15,24-31
    GPU3    SYS   SYS   PHB    X    8-15,24-31

    Legend:

      X    = Self
      SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
      NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
      PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
      PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
      PIX  = Connection traversing a single PCIe switch
      NV#  = Connection traversing a bonded set of # NVLinks

I believe that the further down the legend a connection type appears, the faster the transfer.
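As a rough illustration, the matrix can be parsed and each GPU's closest peer picked using the legend order as a ranking. This is only a sketch under assumptions: the parse_topology, rank, and closest_peer helpers are hypothetical, the "later in the legend is faster" ranking is the belief stated above rather than a measurement, and real nvidia-smi output may contain extra columns (for example network adapters) that this parser ignores.

```python
# Connection codes in the order the nvidia-smi legend lists them.
# Ranking a later position as "faster" is an assumption, not a benchmark.
LEGEND_ORDER = ["SYS", "NODE", "PHB", "PXB", "PIX", "NV"]

# Sample matrix copied from the output shown above.
SAMPLE = """\
        GPU0  GPU1  GPU2  GPU3  CPU Affinity
GPU0     X    PHB   SYS   SYS   0-7,16-23
GPU1    PHB    X    SYS   SYS   0-7,16-23
GPU2    SYS   SYS    X    PHB   8-15,24-31
GPU3    SYS   SYS   PHB    X    8-15,24-31
"""


def parse_topology(text):
    """Return {(src, dst): code} for every ordered pair of distinct GPUs."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    gpus = [tok for tok in lines[0] if tok.startswith("GPU")]
    links = {}
    for row in lines[1:]:
        if not row[0].startswith("GPU"):
            continue  # skip legend lines or other non-GPU rows
        src = row[0]
        for dst, code in zip(gpus, row[1:1 + len(gpus)]):
            if src != dst:
                links[(src, dst)] = code
    return links


def rank(code):
    """Position of a connection code in the legend (NV1, NV2, ... count as NV)."""
    key = "NV" if code.startswith("NV") else code
    return LEGEND_ORDER.index(key)


def closest_peer(links, gpu):
    """Peer of `gpu` whose connection type appears latest in the legend."""
    candidates = {dst: code for (src, dst), code in links.items() if src == gpu}
    return max(candidates, key=lambda dst: rank(candidates[dst]))


if __name__ == "__main__":
    links = parse_topology(SAMPLE)
    for gpu in ("GPU0", "GPU1", "GPU2", "GPU3"):
        peer = closest_peer(links, gpu)
        print(gpu, "->", peer, "via", links[(gpu, peer)])
```

On the sample machine this pairs GPU0 with GPU1 and GPU2 with GPU3, the same pairs that are marked Y in the first edge matrix.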