std::vector performance regression when enabling C++11

Question

I have found an interesting performance regression in a small C++ snippet when I enable C++11:

    #include <vector>

    struct Item
    {
        int a;
        int b;
    };

    int main()
    {
        const std::size_t num_items = 10000000;
        std::vector<Item> container;
        container.reserve(num_items);
        for (std::size_t i = 0; i < num_items; ++i) {
            container.push_back(Item());
        }
        return 0;
    }

With g++ (GCC) 4.8.2 20131219 (prerelease) and C++03 I get:

    milian:/tmp$ g++ -O3 main.cpp && perf stat -r 10 ./a.out

    Performance counter stats for './a.out' (10 runs):

          35.206824 task-clock                # 0.988 CPUs utilized      ( +- 1.23% )
                  4 context-switches          # 0.116 K/sec              ( +- 4.38% )
                  0 cpu-migrations            # 0.006 K/sec              ( +- 66.67% )
                849 page-faults               # 0.024 M/sec              ( +- 6.02% )
         95,693,808 cycles                    # 2.718 GHz                ( +- 1.14% ) [49.72%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
         95,282,359 instructions              # 1.00 insns per cycle     ( +- 0.65% ) [75.27%]
         30,104,021 branches                  # 855.062 M/sec            ( +- 0.87% ) [77.46%]
              6,038 branch-misses             # 0.02% of all branches    ( +- 25.73% ) [75.53%]

        0.035648729 seconds time elapsed                                 ( +- 1.22% )

With C++11 enabled, on the other hand, performance degrades significantly:

    milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

    Performance counter stats for './a.out' (10 runs):

          86.485313 task-clock                # 0.994 CPUs utilized      ( +- 0.50% )
                  9 context-switches          # 0.104 K/sec              ( +- 1.66% )
                  2 cpu-migrations            # 0.017 K/sec              ( +- 26.76% )
                798 page-faults               # 0.009 M/sec              ( +- 8.54% )
        237,982,690 cycles                    # 2.752 GHz                ( +- 0.41% ) [51.32%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
        135,730,319 instructions              # 0.57 insns per cycle     ( +- 0.32% ) [75.77%]
         30,880,156 branches                  # 357.057 M/sec            ( +- 0.25% ) [75.76%]
              4,188 branch-misses             # 0.01% of all branches    ( +- 7.59% ) [74.08%]

        0.087016724 seconds time elapsed                                 ( +- 0.50% )

Can someone explain this? So far my experience has been that the STL gets faster when C++11 is enabled, especially thanks to move semantics.

EDIT: As suggested, using container.emplace_back(); instead brings the performance on par with the C++03 version. How can the C++03 version achieve the same for _push_back_?

    milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

    Performance counter stats for './a.out' (10 runs):

          36.229348 task-clock                # 0.988 CPUs utilized      ( +- 0.81% )
                  4 context-switches          # 0.116 K/sec              ( +- 3.17% )
                  1 cpu-migrations            # 0.017 K/sec              ( +- 36.85% )
                798 page-faults               # 0.022 M/sec              ( +- 8.54% )
         94,488,818 cycles                    # 2.608 GHz                ( +- 1.11% ) [50.44%]
    <not supported> stalled-cycles-frontend
    <not supported> stalled-cycles-backend
         94,851,411 instructions              # 1.00 insns per cycle     ( +- 0.98% ) [75.22%]
         30,468,562 branches                  # 840.991 M/sec            ( +- 1.07% ) [76.71%]
              2,723 branch-misses             # 0.01% of all branches    ( +- 9.84% ) [74.81%]

        0.036678068 seconds time elapsed                                 ( +- 0.80% )
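For anyone who wants to try both variants side by side, here is a minimal sketch. The helper names fill_with_push_back and fill_with_emplace_back are made up for illustration; only the loop bodies come from the question. Both should leave the container with the same contents, so any timing difference comes from how the element gets there.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Item { int a; int b; };

// Hypothetical helper: the loop from the question, unchanged.
std::vector<Item> fill_with_push_back(std::size_t num_items) {
    std::vector<Item> container;
    container.reserve(num_items);
    for (std::size_t i = 0; i < num_items; ++i) {
        // Builds a value-initialized temporary and passes it by reference.
        container.push_back(Item());
    }
    return container;
}

// Hypothetical helper: the variant from the EDIT (needs C++11).
std::vector<Item> fill_with_emplace_back(std::size_t num_items) {
    std::vector<Item> container;
    container.reserve(num_items);
    for (std::size_t i = 0; i < num_items; ++i) {
        // Value-initializes the element in place; no temporary is passed.
        container.emplace_back();
    }
    return container;
}
```

In both cases every element ends up with a == 0 and b == 0.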

Ali · Accepted Answer

I can reproduce your results on my machine with the options you give in your post.

However, if I also enable link-time optimization (I pass the _-flto_ flag to gcc 4.7.2 as well), the results are identical:

(I am compiling your original code, with container.push_back(Item());)

    $ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out

    Performance counter stats for './a.out' (10 runs):

          35.426793 task-clock                # 0.986 CPUs utilized           ( +- 1.75% )
                  4 context-switches          # 0.116 K/sec                   ( +- 5.69% )
                  0 CPU-migrations            # 0.006 K/sec                   ( +- 66.67% )
             19,801 page-faults               # 0.559 M/sec
         99,028,466 cycles                    # 2.795 GHz                     ( +- 1.89% ) [77.53%]
         50,721,061 stalled-cycles-frontend   # 51.22% frontend cycles idle   ( +- 3.74% ) [79.47%]
         25,585,331 stalled-cycles-backend    # 25.84% backend cycles idle    ( +- 4.90% ) [73.07%]
        141,947,224 instructions              # 1.43 insns per cycle
                                              # 0.36 stalled cycles per insn  ( +- 0.52% ) [88.72%]
         37,697,368 branches                  # 1064.092 M/sec                ( +- 0.52% ) [88.75%]
             26,700 branch-misses             # 0.07% of all branches         ( +- 3.91% ) [83.64%]

        0.035943226 seconds time elapsed                                      ( +- 1.79% )

    $ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out

    Performance counter stats for './a.out' (10 runs):

          35.510495 task-clock                # 0.988 CPUs utilized           ( +- 2.54% )
                  4 context-switches          # 0.101 K/sec                   ( +- 7.41% )
                  0 CPU-migrations            # 0.003 K/sec                   ( +-100.00% )
             19,801 page-faults               # 0.558 M/sec                   ( +- 0.00% )
         98,463,570 cycles                    # 2.773 GHz                     ( +- 1.09% ) [77.71%]
         50,079,978 stalled-cycles-frontend   # 50.86% frontend cycles idle   ( +- 2.20% ) [79.41%]
         26,270,699 stalled-cycles-backend    # 26.68% backend cycles idle    ( +- 8.91% ) [74.43%]
        141,427,211 instructions              # 1.44 insns per cycle
                                              # 0.35 stalled cycles per insn  ( +- 0.23% ) [87.66%]
         37,366,375 branches                  # 1052.263 M/sec                ( +- 0.48% ) [88.61%]
             26,621 branch-misses             # 0.07% of all branches         ( +- 5.28% ) [83.26%]

        0.035953916 seconds time elapsed

As for the reasons, one needs to look at the generated assembly code (_g++ -std=c++11 -O3 -S regr.cpp_). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining the function

    void std::vector<Item,std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&)

fails in C++11 mode with the default _inline-limit_.

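This failed inline has a domino effect. Not because the function is being called (it is not even called!) but because the code has to be prepared for it: if it is called, the function arguments (_Item.a_ and _Item.b_) must already be in the right place. This leads to pretty messy code.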

Here is the relevant part of the generated code for the case where inlining succeeds:

    .L42:
        testq   %rbx, %rbx  # container$D13376$_M_impl$_M_finish
        je      .L3         #,
        movl    $0, (%rbx)  #, container$D13376$_M_impl$_M_finish_136->a
        movl    $0, 4(%rbx) #, container$D13376$_M_impl$_M_finish_136->b
    .L3:
        addq    $8, %rbx    #, container$D13376$_M_impl$_M_finish
        subq    $1, %rbp    #, ivtmp.106
        je      .L41        #,
    .L14:
        cmpq    %rbx, %rdx  # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
        jne     .L42        #,

This is a nice and compact for loop. Now let's compare it to the failed-inline case:

    .L49:
        testq   %rax, %rax      # D.15772
        je      .L26            #,
        movq    16(%rsp), %rdx  # D.13379, D.13379
        movq    %rdx, (%rax)    # D.13379, *D.15772_60
    .L26:
        addq    $8, %rax        #, tmp75
        subq    $1, %rbx        #, ivtmp.117
        movq    %rax, 40(%rsp)  # tmp75, container.D.13376._M_impl._M_finish
        je      .L48            #,
    .L28:
        movq    40(%rsp), %rax  # container.D.13376._M_impl._M_finish, D.15772
        cmpq    48(%rsp), %rax  # container.D.13376._M_impl._M_end_of_storage, D.15772
        movl    $0, 16(%rsp)    #, D.13379.a
        movl    $0, 20(%rsp)    #, D.13379.b
        jne     .L49            #,
        leaq    16(%rsp), %rsi  #,
        leaq    32(%rsp), %rdi  #,
        call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

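This code is cluttered, and there is a lot more going on in the loop than in the previous case. Before the function call (the last line shown), the arguments must be placed appropriately: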

        leaq    16(%rsp), %rsi  #,
        leaq    32(%rsp), %rdi  #,
        call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

Even though the call is never actually executed, the loop still sets things up for it beforehand on every iteration:

        movl    $0, 16(%rsp)    #, D.13379.a
        movl    $0, 20(%rsp)    #, D.13379.b

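This is what makes the code messy. If there is no function call because inlining succeeds, there are only 2 move instructions in the loop and nothing touches _%rsp_ (the stack pointer). If the inlining fails, however, we get 6 moves and a lot of traffic through _%rsp_.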
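The shape of the problem can be pictured with a hand-written container. This is only a sketch under assumptions, not libstdc++'s implementation: TinyVec, grow_and_append and the noinline attribute (a GCC/Clang extension) are stand-ins chosen for illustration. The fast path stores the element directly; the slow path lives in a separate function that takes the element by reference, so the caller needs the element to exist in memory before it can make that call, even if the call never happens at run time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

struct Item { int a; int b; };

// Hypothetical minimal growable buffer for illustration only.
struct TinyVec {
    Item* begin = nullptr;
    Item* finish = nullptr;
    Item* end_of_storage = nullptr;

    TinyVec() = default;
    TinyVec(const TinyVec&) = delete;
    TinyVec& operator=(const TinyVec&) = delete;
    ~TinyVec() { std::free(begin); }

    // Slow path, deliberately kept out of line (GCC/Clang attribute).
    // Taking the value by reference means the caller must have it in memory.
    __attribute__((noinline)) void grow_and_append(const Item& value) {
        const std::size_t old_size = static_cast<std::size_t>(finish - begin);
        const std::size_t new_cap = old_size ? 2 * old_size : 1;
        Item* fresh = static_cast<Item*>(std::malloc(new_cap * sizeof(Item)));
        if (!fresh) std::abort();
        if (old_size) std::memcpy(fresh, begin, old_size * sizeof(Item));
        ::new (static_cast<void*>(fresh + old_size)) Item(value);
        std::free(begin);
        begin = fresh;
        finish = fresh + old_size + 1;
        end_of_storage = fresh + new_cap;
    }

    // Fast path: small enough to be inlined into the caller's loop.
    void push_back(const Item& value) {
        if (finish != end_of_storage) {
            ::new (static_cast<void*>(finish)) Item(value);
            ++finish;
        } else {
            grow_and_append(value);
        }
    }
};
```

When a compiler can inline both paths, it is free to keep the two zero stores as direct writes into the buffer. With the slow path out of line, it may instead materialize the temporary on the stack on every iteration, which matches the extra stores to 16(%rsp) and 20(%rsp) in the listing above.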

Just to substantiate my theory (note the _-finline-limit_), both runs in C++11 mode:

    $ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out

    Performance counter stats for './a.out' (10 runs):

          84.739057 task-clock                # 0.993 CPUs utilized           ( +- 1.34% )
                  8 context-switches          # 0.096 K/sec                   ( +- 2.22% )
                  1 CPU-migrations            # 0.009 K/sec                   ( +- 64.01% )
             19,801 page-faults               # 0.234 M/sec
        266,809,312 cycles                    # 3.149 GHz                     ( +- 0.58% ) [81.20%]
        206,804,948 stalled-cycles-frontend   # 77.51% frontend cycles idle   ( +- 0.91% ) [81.25%]
        129,078,683 stalled-cycles-backend    # 48.38% backend cycles idle    ( +- 1.37% ) [69.49%]
        183,130,306 instructions              # 0.69 insns per cycle
                                              # 1.13 stalled cycles per insn  ( +- 0.85% ) [85.35%]
         38,759,720 branches                  # 457.401 M/sec                 ( +- 0.29% ) [85.43%]
             24,527 branch-misses             # 0.06% of all branches         ( +- 2.66% ) [83.52%]

        0.085359326 seconds time elapsed                                      ( +- 1.31% )

    $ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out

    Performance counter stats for './a.out' (10 runs):

          37.790325 task-clock                # 0.990 CPUs utilized           ( +- 2.06% )
                  4 context-switches          # 0.098 K/sec                   ( +- 5.77% )
                  0 CPU-migrations            # 0.011 K/sec                   ( +- 55.28% )
             19,801 page-faults               # 0.524 M/sec
        104,699,973 cycles                    # 2.771 GHz                     ( +- 2.04% ) [78.91%]
         58,023,151 stalled-cycles-frontend   # 55.42% frontend cycles idle   ( +- 4.03% ) [78.88%]
         30,572,036 stalled-cycles-backend    # 29.20% backend cycles idle    ( +- 5.31% ) [71.40%]
        140,669,773 instructions              # 1.34 insns per cycle
                                              # 0.41 stalled cycles per insn  ( +- 1.40% ) [88.14%]
         38,117,067 branches                  # 1008.646 M/sec                ( +- 0.65% ) [89.38%]
             27,519 branch-misses             # 0.07% of all branches         ( +- 4.01% ) [86.16%]

        0.038187580 seconds time elapsed                                      ( +- 2.05% )

Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.
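Where perf is not available, a rough wall-clock timing of the same loop can serve as a sanity check. This is a minimal sketch, assuming std::chrono::steady_clock is precise enough for the purpose; time_fill_us is a hypothetical helper name, and a single run like this is far noisier than perf stat -r 10. Since it needs C++11, it can only compare C++11 builds against each other, for example with different _-finline-limit_ values or with and without _-flto_.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct Item { int a; int b; };

// Hypothetical helper: runs the loop from the question and returns the
// elapsed time in microseconds; the final size is reported via final_size.
long long time_fill_us(std::size_t num_items, std::size_t* final_size) {
    typedef std::chrono::steady_clock clock;
    std::vector<Item> container;
    container.reserve(num_items);

    const clock::time_point start = clock::now();
    for (std::size_t i = 0; i < num_items; ++i) {
        container.push_back(Item());
    }
    const clock::time_point stop = clock::now();

    // Reading the size keeps the result of the loop observable.
    *final_size = container.size();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Calling time_fill_us(10000000, &n) several times and taking the minimum gives a steadier number than a single call.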

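So what is the takeaway from this story? That failed inlines can cost you a lot, and that you should make full use of what the compiler offers: I can only recommend link-time optimization. It has given my programs a significant speedup (up to 2.5x), and all I had to do was pass the _-flto_ flag. That is a pretty good deal! ;)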

However, I do not recommend littering your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as white space anyway.)

Great question, +1!