Memory bandwidth on an AMD 24-core server

Question

I need some help determining whether the memory bandwidth I'm seeing under Linux on my server is normal. Here is the server's spec:

    HP ProLiant DL165 G7
    2x AMD Opteron 6164 HE 12-Core
    40 GB RAM (10 x 4GB DDR1333)
    Debian 6.0

Using mbw on this server, I get the following numbers:

    foo1:~# mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.58047  MiB: 1024.00000  Copy: 1764.082 MiB/s
    1    Method: MEMCPY   Elapsed: 0.58012  MiB: 1024.00000  Copy: 1765.152 MiB/s
    2    Method: MEMCPY   Elapsed: 0.58010  MiB: 1024.00000  Copy: 1765.201 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.58023  MiB: 1024.00000  Copy: 1764.811 MiB/s
    0    Method: DUMB     Elapsed: 0.36174  MiB: 1024.00000  Copy: 2830.778 MiB/s
    1    Method: DUMB     Elapsed: 0.35869  MiB: 1024.00000  Copy: 2854.817 MiB/s
    2    Method: DUMB     Elapsed: 0.35848  MiB: 1024.00000  Copy: 2856.481 MiB/s
    AVG  Method: DUMB     Elapsed: 0.35964  MiB: 1024.00000  Copy: 2847.310 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.23546  MiB: 1024.00000  Copy: 4348.860 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.23544  MiB: 1024.00000  Copy: 4349.230 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.23544  MiB: 1024.00000  Copy: 4349.359 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.23545  MiB: 1024.00000  Copy: 4349.149 MiB/s

On one of my other servers (based on an Intel Xeon E3-1270):

    foo2:~# mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.18960  MiB: 1024.00000  Copy: 5400.901 MiB/s
    1    Method: MEMCPY   Elapsed: 0.18922  MiB: 1024.00000  Copy: 5411.690 MiB/s
    2    Method: MEMCPY   Elapsed: 0.18944  MiB: 1024.00000  Copy: 5405.491 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.18942  MiB: 1024.00000  Copy: 5406.024 MiB/s
    0    Method: DUMB     Elapsed: 0.14838  MiB: 1024.00000  Copy: 6901.200 MiB/s
    1    Method: DUMB     Elapsed: 0.14818  MiB: 1024.00000  Copy: 6910.561 MiB/s
    2    Method: DUMB     Elapsed: 0.14820  MiB: 1024.00000  Copy: 6909.628 MiB/s
    AVG  Method: DUMB     Elapsed: 0.14825  MiB: 1024.00000  Copy: 6907.127 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.04362  MiB: 1024.00000  Copy: 23477.623 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.04262  MiB: 1024.00000  Copy: 24025.151 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.04258  MiB: 1024.00000  Copy: 24048.849 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.04294  MiB: 1024.00000  Copy: 23847.599 MiB/s

For reference, here is what I get on my Intel-based laptop:

    laptop:~$ mbw -n 3 1024
    Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
    Using 262144 bytes as blocks for memcpy block copy test.
    Getting down to business... Doing 3 runs per test.
    0    Method: MEMCPY   Elapsed: 0.40566  MiB: 1024.00000  Copy: 2524.269 MiB/s
    1    Method: MEMCPY   Elapsed: 0.38458  MiB: 1024.00000  Copy: 2662.638 MiB/s
    2    Method: MEMCPY   Elapsed: 0.38876  MiB: 1024.00000  Copy: 2634.043 MiB/s
    AVG  Method: MEMCPY   Elapsed: 0.39300  MiB: 1024.00000  Copy: 2605.600 MiB/s
    0    Method: DUMB     Elapsed: 0.30707  MiB: 1024.00000  Copy: 3334.745 MiB/s
    1    Method: DUMB     Elapsed: 0.30425  MiB: 1024.00000  Copy: 3365.653 MiB/s
    2    Method: DUMB     Elapsed: 0.30342  MiB: 1024.00000  Copy: 3374.849 MiB/s
    AVG  Method: DUMB     Elapsed: 0.30491  MiB: 1024.00000  Copy: 3358.328 MiB/s
    0    Method: MCBLOCK  Elapsed: 0.07875  MiB: 1024.00000  Copy: 13003.670 MiB/s
    1    Method: MCBLOCK  Elapsed: 0.08374  MiB: 1024.00000  Copy: 12228.034 MiB/s
    2    Method: MCBLOCK  Elapsed: 0.07635  MiB: 1024.00000  Copy: 13411.216 MiB/s
    AVG  Method: MCBLOCK  Elapsed: 0.07961  MiB: 1024.00000  Copy: 12862.006 MiB/s

So according to mbw, my laptop is 3 times faster than the server!!! Please help me explain this. I also tried mounting a ramdisk and benchmarking it with dd, and I get similar differences, so I don't think mbw is to blame.
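To make the comparison concrete, the averages can be put side by side. This is only a small Python sketch: the figures are copied from the AVG lines of the three mbw runs, while the dictionary layout and the `speedup` helper are names made up for illustration.

```python
# Average copy rates in MiB/s, copied from the "AVG" lines of the three mbw runs.
# The short machine labels are just dictionary keys chosen for this sketch.
avg_mib_s = {
    "foo1": {"MEMCPY": 1764.811, "DUMB": 2847.310, "MCBLOCK": 4349.149},
    "foo2": {"MEMCPY": 5406.024, "DUMB": 6907.127, "MCBLOCK": 23847.599},
    "laptop": {"MEMCPY": 2605.600, "DUMB": 3358.328, "MCBLOCK": 12862.006},
}


def speedup(machine, method, baseline="foo1"):
    """How many times faster `machine` is than `baseline` for one mbw method."""
    return avg_mib_s[machine][method] / avg_mib_s[baseline][method]


for method in ("MEMCPY", "DUMB", "MCBLOCK"):
    for machine in ("foo2", "laptop"):
        print(f"{method:8s} {machine:7s} {speedup(machine, method):.2f}x foo1")
```

On these numbers the roughly 3x gap between the laptop and foo1 shows up in the MCBLOCK test; for MEMCPY and DUMB the laptop is ahead by a smaller margin.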

I checked the BIOS settings, and the memory appears to be running at full speed. According to the hosting company, the modules are all OK.

Could this have something to do with NUMA? It seems that Node Interleaving is disabled on this server. Would enabling it (and thus turning NUMA off) make a difference?

    foo1:~# numactl --hardware
    available: 4 nodes (0-3)
    node 0 cpus: 0 1 2 3 4 5
    node 0 size: 8190 MB
    node 0 free: 7898 MB
    node 1 cpus: 6 7 8 9 10 11
    node 1 size: 12288 MB
    node 1 free: 12073 MB
    node 2 cpus: 18 19 20 21 22 23
    node 2 size: 12288 MB
    node 2 free: 12034 MB
    node 3 cpus: 12 13 14 15 16 17
    node 3 size: 8192 MB
    node 3 free: 8032 MB
    node distances:
    node   0   1   2   3
      0:  10  20  20  20
      1:  20  10  20  20
      2:  20  20  10  20
      3:  20  20  20  10
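The node sizes in that output are not uniform, which is easier to see once they are tabulated. A rough Python sketch that parses only the `node N size:` lines; the `node_sizes_mb` helper is hypothetical, and the modules-per-node estimate assumes every module is 4 GB, as stated in the server spec.

```python
import re

# The "size" lines copied from the numactl --hardware output above.
NUMACTL_SIZES = """\
node 0 size: 8190 MB
node 1 size: 12288 MB
node 2 size: 12288 MB
node 3 size: 8192 MB
"""

MODULE_MB = 4096  # assumption: every DIMM is a 4 GB module


def node_sizes_mb(text):
    """Map NUMA node number to its reported size in MB."""
    return {int(n): int(mb) for n, mb in re.findall(r"node (\d+) size: (\d+) MB", text)}


sizes = node_sizes_mb(NUMACTL_SIZES)
total_mb = sum(sizes.values())
# Round because a node may report slightly less than the installed amount.
modules_per_node = {node: round(mb / MODULE_MB) for node, mb in sizes.items()}

for node in sorted(sizes):
    print(f"node {node}: {sizes[node]:6d} MB  (about {modules_per_node[node]} modules)")
print(f"total: {total_mb} MB")
```

Under that assumption, two nodes hold about two modules and two hold about three, so the memory is not spread evenly across the four nodes.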

Update:

I disabled NUMA (in the BIOS, plus numa=off on the Linux boot command line) and also disabled ECC. No change, though: I get the same numbers as above.

Update 2:

Here is the memory layout according to dmidecode:

    PROC 1 DIMM 1
    PROC 1 DIMM 4
    PROC 1 DIMM 7
    PROC 1 DIMM 10
    PROC 1 DIMM 12
    PROC 2 DIMM 1
    PROC 2 DIMM 4
    PROC 2 DIMM 7
    PROC 2 DIMM 10
    PROC 2 DIMM 12

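These are all 4GB Samsung modules (part number M393B5270CH0-CH9).

I checked HP's documentation on how to populate the memory in this server. If I understand it correctly, the modules currently in DIMM 12 should be placed in the DIMM 3 slots. Could such a misconfiguration explain the results I'm getting?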

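Update 3:

I removed two modules so that each side has 4x4GB (4-4), placed in slots 1-4-7-10. Unfortunately, I see no difference in the benchmarks. Shouldn't the server now be able to use all four channels? I also tried the stream benchmark with multiple threads, but the results are very disappointing. The only thing I can think of is to ask the hosting company to replace the whole server...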

Update 4:

I must have done something wrong yesterday when I tested the last setup (32 GB) with stream, because today I'm getting great results:

    foo1:~# ./stream
    -------------------------------------------------------------
    STREAM version $Revision: 5.9 $
    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION Word.
    -------------------------------------------------------------
    Array size = 2000000, Offset = 0
    Total memory required = 45.8 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Number of Threads requested = 24
    -------------------------------------------------------------
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    Printing one line per active thread....
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 703 microseconds.
       (= 703 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function      Rate (MB/s)   Avg time     Min time     Max time
    Copy:       36873.0022       0.0009       0.0009       0.0010
    Scale:      34699.5160       0.0009       0.0009       0.0010
    Add:        30868.8427       0.0016       0.0016       0.0017
    Triad:      25558.7904       0.0019       0.0019       0.0020
    -------------------------------------------------------------
    Solution Validates
    -------------------------------------------------------------
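As a sanity check on those figures, the reported rates can be tied back to the array size. STREAM counts two arrays' worth of traffic for Copy and Scale and three for Add and Triad, and reports MB/s with 1 MB = 10^6 bytes. A short Python sketch, with the array size and rates copied from the output above; the `implied_seconds` helper is a name invented here.

```python
ARRAY_ELEMENTS = 2_000_000  # "Array size = 2000000"
BYTES_PER_WORD = 8          # "8 bytes per DOUBLE PRECISION Word"

# Arrays read plus arrays written per element for each STREAM kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Rate (MB/s)" column copied from the run above.
reported_mb_s = {
    "Copy": 36873.0022,
    "Scale": 34699.5160,
    "Add": 30868.8427,
    "Triad": 25558.7904,
}


def implied_seconds(kernel):
    """Time per pass implied by the reported rate (STREAM uses 1 MB = 1e6 bytes)."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * ARRAY_ELEMENTS
    return bytes_moved / (reported_mb_s[kernel] * 1e6)


for kernel in reported_mb_s:
    print(f"{kernel:6s} {implied_seconds(kernel) * 1e6:7.0f} microseconds per pass")
```

The implied times agree with the rounded "Min time" column, so each pass lasts only one to two milliseconds at this array size.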

(I abandoned mbw because it only runs in single-threaded mode. I still get the same crappy results with it on this server.)

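So the problem must have been the last two 4GB modules, which forced the server to run in single-channel mode, as @chx points out below. The only question left now is whether I can still get the full bandwidth with 40 GB. Could I use 2 x 8GB + 6 x 4GB? Does it matter which channels the bigger modules go in?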

chx · Accepted Answer

You are forcing the system to operate in single-channel (!) mode by using 5-5 modules per CPU instead of 4-4 or 8-8. That's the reason. Remove one module from each CPU (1-1) and report back.
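The 4-4 or 8-8 rule of thumb can be checked mechanically against the dmidecode layout from the question. This Python sketch only counts modules per processor and tests whether the count is a multiple of four; the slot list is copied from the question, while the helper names and the "multiple of four" test are simplifications assumed here, not HP's documented population rules.

```python
from collections import Counter

# (processor, DIMM slot) pairs as listed in the dmidecode layout in the question.
populated_slots = [
    ("PROC 1", 1), ("PROC 1", 4), ("PROC 1", 7), ("PROC 1", 10), ("PROC 1", 12),
    ("PROC 2", 1), ("PROC 2", 4), ("PROC 2", 7), ("PROC 2", 10), ("PROC 2", 12),
]

# Assumption: four memory channels per G34 socket (quad channel).
CHANNELS_PER_SOCKET = 4


def modules_per_processor(slots):
    """Count populated DIMM slots for each processor."""
    return Counter(proc for proc, _slot in slots)


def is_even_population(module_count, channels=CHANNELS_PER_SOCKET):
    """True when the modules can be spread equally over all channels."""
    return module_count > 0 and module_count % channels == 0


for proc, count in sorted(modules_per_processor(populated_slots).items()):
    verdict = "even" if is_even_population(count) else "uneven"
    print(f"{proc}: {count} modules ({verdict} across {CHANNELS_PER_SOCKET} channels)")
```

For the original layout this reports five modules per processor, which cannot be divided equally over four channels. It does not check which slot belongs to which channel, so it is no substitute for the vendor's population guide.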

The 6164 is a G34 socket CPU and can operate in quad-channel mode if the memory modules are set up correctly. Your setup is the worst possible.
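For a sense of scale, the theoretical peak can be estimated from the channel count. The Python sketch below assumes the "DDR1333" in the spec means DDR3-1333 (1333 million transfers per second over a 64-bit channel) and that both sockets run all four channels; these are nominal peaks that real workloads do not reach, and `peak_gb_s` is a helper invented for this example.

```python
# Assumption: "DDR1333" in the server spec means DDR3-1333.
TRANSFERS_PER_SECOND = 1333e6  # nominal data rate per channel
BUS_WIDTH_BYTES = 8            # 64-bit data bus per channel


def peak_gb_s(channels):
    """Nominal peak bandwidth in GB/s (1 GB = 1e9 bytes) for a number of channels."""
    return channels * TRANSFERS_PER_SECOND * BUS_WIDTH_BYTES / 1e9


for label, channels in [
    ("one channel", 1),
    ("one socket, quad channel", 4),
    ("two sockets, quad channel each", 8),
]:
    print(f"{label:32s} {peak_gb_s(channels):6.1f} GB/s")
```

A single-threaded test such as mbw cannot be expected to approach the two-socket figure, since one thread draws on only part of the machine's memory system; a multithreaded benchmark is the fairer comparison against it.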