`>` vs. `>=` in bubble sort causes a significant performance difference

Question

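I just stumbled upon something. At first I thought it might be a case of branch misprediction, as in this case, but I cannot explain why branch misprediction would cause this behaviour.

I implemented two versions of bubble sort in Java and ran some performance tests: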

    import java.util.Random;

    public class BubbleSortAnnomaly {

        public static void main(String... args) {
            final int ARRAY_SIZE = Integer.parseInt(args[0]);
            final int LIMIT = Integer.parseInt(args[1]);
            final int RUNS = Integer.parseInt(args[2]);

            int[] a = new int[ARRAY_SIZE];
            int[] b = new int[ARRAY_SIZE];
            Random r = new Random();
            for (int run = 0; RUNS > run; ++run) {
                // Fill both arrays with the same random values in [0, LIMIT).
                for (int i = 0; i < ARRAY_SIZE; i++) {
                    a[i] = r.nextInt(LIMIT);
                    b[i] = a[i];
                }

                System.out.print("Sorting with sortA: ");
                long start = System.nanoTime();
                int swaps = bubbleSortA(a);
                System.out.println(
                        (System.nanoTime() - start) + " ns. "
                        + "It used " + swaps + " swaps.");

                System.out.print("Sorting with sortB: ");
                start = System.nanoTime();
                swaps = bubbleSortB(b);
                System.out.println(
                        (System.nanoTime() - start) + " ns. "
                        + "It used " + swaps + " swaps.");
            }
        }

        // Variant A: swaps only when the left element is strictly greater.
        public static int bubbleSortA(int[] a) {
            int counter = 0;
            for (int i = a.length - 1; i >= 0; --i) {
                for (int j = 0; j < i; ++j) {
                    if (a[j] > a[j + 1]) {
                        swap(a, j, j + 1);
                        ++counter;
                    }
                }
            }
            return (counter);
        }

        // Variant B: also swaps equal elements.
        public static int bubbleSortB(int[] a) {
            int counter = 0;
            for (int i = a.length - 1; i >= 0; --i) {
                for (int j = 0; j < i; ++j) {
                    if (a[j] >= a[j + 1]) {
                        swap(a, j, j + 1);
                        ++counter;
                    }
                }
            }
            return (counter);
        }

        private static void swap(int[] a, int j, int i) {
            int h = a[i];
            a[i] = a[j];
            a[j] = h;
        }
    }

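As you can see, the only difference between the two sorting methods is `>` vs. `>=`. When running the program with `java BubbleSortAnnomaly 50000 10 10`, you would naturally expect sortB to be slower than sortA, because it has to execute more `swap(...)` calls. But I got the following (or similar) output on three different machines: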

    Sorting with sortA: 4.214 seconds. It used  564960211 swaps.
    Sorting with sortB: 2.278 seconds. It used 1249750569 swaps.
    Sorting with sortA: 4.199 seconds. It used  563355818 swaps.
    Sorting with sortB: 2.254 seconds. It used 1249750348 swaps.
    Sorting with sortA: 4.189 seconds. It used  560825110 swaps.
    Sorting with sortB: 2.264 seconds. It used 1249749572 swaps.
    Sorting with sortA: 4.17  seconds. It used  561924561 swaps.
    Sorting with sortB: 2.256 seconds. It used 1249749766 swaps.
    Sorting with sortA: 4.198 seconds. It used  562613693 swaps.
    Sorting with sortB: 2.266 seconds. It used 1249749880 swaps.
    Sorting with sortA: 4.19  seconds. It used  561658723 swaps.
    Sorting with sortB: 2.281 seconds. It used 1249751070 swaps.
    Sorting with sortA: 4.193 seconds. It used  564986461 swaps.
    Sorting with sortB: 2.266 seconds. It used 1249749681 swaps.
    Sorting with sortA: 4.203 seconds. It used  562526980 swaps.
    Sorting with sortB: 2.27  seconds. It used 1249749609 swaps.
    Sorting with sortA: 4.176 seconds. It used  561070571 swaps.
    Sorting with sortB: 2.241 seconds. It used 1249749831 swaps.
    Sorting with sortA: 4.191 seconds. It used  559883210 swaps.
    Sorting with sortB: 2.257 seconds. It used 1249749371 swaps.

When I set the LIMIT parameter to, for example, 50000 (`java BubbleSortAnnomaly 50000 50000 10`), I get the expected results:

    Sorting with sortA: 3.983 seconds. It used  625941897 swaps.
    Sorting with sortB: 4.658 seconds. It used  789391382 swaps.

I ported the program to C++ to determine whether this problem is Java-specific. Here is the C++ code:

    #include <cstdlib>
    #include <iostream>

    #include <omp.h>

    #ifndef ARRAY_SIZE
    #define ARRAY_SIZE 50000
    #endif

    #ifndef LIMIT
    #define LIMIT 10
    #endif

    #ifndef RUNS
    #define RUNS 10
    #endif

    void swap(int * a, int i, int j)
    {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }

    // Variant A: swaps only when the left element is strictly greater.
    int bubbleSortA(int * a)
    {
        const int LAST = ARRAY_SIZE - 1;
        int counter = 0;
        for (int i = LAST; 0 < i; --i)
        {
            for (int j = 0; j < i; ++j)
            {
                int next = j + 1;
                if (a[j] > a[next])
                {
                    swap(a, j, next);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    // Variant B: also swaps equal elements.
    int bubbleSortB(int * a)
    {
        const int LAST = ARRAY_SIZE - 1;
        int counter = 0;
        for (int i = LAST; 0 < i; --i)
        {
            for (int j = 0; j < i; ++j)
            {
                int next = j + 1;
                if (a[j] >= a[next])
                {
                    swap(a, j, next);
                    ++counter;
                }
            }
        }
        return (counter);
    }

    int main()
    {
        int * a = (int *) malloc(ARRAY_SIZE * sizeof(int));
        int * b = (int *) malloc(ARRAY_SIZE * sizeof(int));

        for (int run = 0; RUNS > run; ++run)
        {
            // Fill both arrays with the same pseudo-random values in [0, LIMIT).
            for (int idx = 0; ARRAY_SIZE > idx; ++idx)
            {
                a[idx] = std::rand() % LIMIT;
                b[idx] = a[idx];
            }

            std::cout << "Sorting with sortA: ";
            double start = omp_get_wtime();
            int swaps = bubbleSortA(a);
            std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                      << " swaps." << std::endl;

            std::cout << "Sorting with sortB: ";
            start = omp_get_wtime();
            swaps = bubbleSortB(b);
            std::cout << (omp_get_wtime() - start) << " seconds. It used " << swaps
                      << " swaps." << std::endl;
        }

        free(a);
        free(b);

        return (0);
    }

This program shows the same behaviour. Can someone explain exactly what is going on here?

Running sortB first and then sortA does not change the results.

uesp · Accepted Answer

I think it may indeed be due to branch prediction. If you compare the number of swaps with the number of inner-loop iterations, you find:

Limit = 10

A = 560M swaps / 1250M loops
B = 1250M swaps / 1250M loops (0.02% fewer swaps than loops)

Limit = 50000

A = 627M swaps / 1250M loops
B = 850M swaps / 1250M loops

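So in the Limit == 10 case, the swap is performed 99.98% of the time in the B sort, which is obviously favourable for the branch predictor. In the Limit == 50000 case, the swap is hit seemingly at random only 68% of the time, so the branch predictor is far less helpful.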
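The ratios above can be reproduced at a smaller scale. The following is a minimal sketch, not the asker's benchmark: the class `BranchRatio`, its methods, the array size of 2000, and the fixed seed are all assumptions made for illustration. It measures only how often the swap condition is true, not the running time.

```java
import java.util.Random;

// Hypothetical helper class, not part of the original post.
class BranchRatio {

    // Bubble-sorts the array in place and returns the fraction of
    // inner-loop comparisons for which the swap condition was true.
    // orEqual == false uses '>' (sortA), orEqual == true uses '>=' (sortB).
    static double takenFraction(int[] a, boolean orEqual) {
        long taken = 0;
        long total = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                boolean cond = orEqual ? a[j] >= a[j + 1] : a[j] > a[j + 1];
                if (cond) {
                    int h = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = h;
                    ++taken;
                }
                ++total;
            }
        }
        return total == 0 ? 0.0 : (double) taken / total;
    }

    // Builds an array of pseudo-random values in [0, limit).
    static int[] randomArray(int size, int limit, long seed) {
        Random r = new Random(seed);
        int[] a = new int[size];
        for (int i = 0; i < size; i++) {
            a[i] = r.nextInt(limit);
        }
        return a;
    }

    public static void main(String[] args) {
        int size = 2000; // assumed size, much smaller than the 50000 in the question
        for (int limit : new int[] {10, 50000}) {
            int[] a = randomArray(size, limit, 42L);
            int[] b = a.clone();
            System.out.printf("LIMIT=%d  '>' taken: %.4f  '>=' taken: %.4f%n",
                    limit, takenFraction(a, false), takenFraction(b, true));
        }
    }
}
```

With a small LIMIT, the `>=` variant should report a fraction very close to 1, because each pass can see the condition fail at most LIMIT - 1 times before it is carrying the largest value.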

Petr · Answer

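I think this can indeed be explained by branch misprediction.

Consider, for example, LIMIT = 11 and sortB. On the first iteration of the outer loop it will very quickly stumble upon one of the elements equal to 10. So it will have a[j] = 10, and therefore a[j] will definitely be >= a[next], as there are no elements greater than 10. It will therefore perform a swap, then take one step in j only to find a[j] = 10 once again (the same swapped value). So once again a[j] >= a[next] holds, and so on. Every comparison except a few at the very beginning will be true. The following iterations of the outer loop run the same way.

It is not the same for sortA. It starts roughly the same way, stumbles upon a[j] = 10, and does some swaps in a similar manner, but only up to the point where it also finds a[next] = 10. Then the condition is false and no swap is done. And so on: every time it stumbles on a[next] = 10, the condition is false and no swap is done. Therefore this condition is true 10 times out of 11 (for values of a[next] from 0 to 9) and false in 1 case out of 11. It is no surprise that branch prediction fails.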
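To make the argument above concrete, here is a rough sketch that feeds the outcome of the swap condition into a toy 2-bit saturating counter and reports how often that counter guesses wrong. This is an assumption-laden illustration: the class `PredictorSketch` and its methods are hypothetical, and real CPUs use far more sophisticated predictors, so the numbers only show the trend, not actual hardware miss rates.

```java
import java.util.Random;

// Hypothetical helper class, not part of the original post.
class PredictorSketch {

    // Bubble-sorts the array in place and returns the misprediction rate of
    // a toy 2-bit saturating counter on the swap condition.
    // orEqual == false uses '>' (sortA), orEqual == true uses '>=' (sortB).
    static double missRate(int[] a, boolean orEqual) {
        int state = 2; // 0 and 1 predict "no swap", 2 and 3 predict "swap"
        long misses = 0;
        long total = 0;
        for (int i = a.length - 1; i >= 0; --i) {
            for (int j = 0; j < i; ++j) {
                boolean taken = orEqual ? a[j] >= a[j + 1] : a[j] > a[j + 1];
                boolean predicted = state >= 2;
                if (predicted != taken) {
                    ++misses;
                }
                if (taken) {
                    int h = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = h;
                    if (state < 3) {
                        ++state;
                    }
                } else if (state > 0) {
                    --state;
                }
                ++total;
            }
        }
        return total == 0 ? 0.0 : (double) misses / total;
    }

    // Builds an array of pseudo-random values in [0, limit).
    static int[] randomArray(int size, int limit, long seed) {
        Random r = new Random(seed);
        int[] a = new int[size];
        for (int i = 0; i < size; i++) {
            a[i] = r.nextInt(limit);
        }
        return a;
    }

    public static void main(String[] args) {
        int[] a = randomArray(2000, 11, 7L); // LIMIT = 11 as in the example above
        int[] b = a.clone();
        System.out.printf("toy miss rate sortA: %.4f%n", missRate(a, false));
        System.out.printf("toy miss rate sortB: %.4f%n", missRate(b, true));
    }
}
```

For sortB the condition can fail at most 10 times per pass when LIMIT = 11, so even this simple counter is wrong only rarely, while sortA keeps producing a mix of true and false outcomes.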

fala · Answer

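Using the C++ code provided (with the time counting removed) and the `perf stat` command, I got results that confirm the branch-miss theory.

With Limit = 10, BubbleSortB benefits greatly from branch prediction (0.01% misses), but with Limit = 50000 branch prediction fails even more often (15.65% misses) than in BubbleSortA (12.69% and 12.76% misses respectively).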

BubbleSortA, Limit = 10:

    Performance counter stats for './bubbleA.out':

       46670.947364 task-clock                #    0.998 CPUs utilized
                 73 context-switches          #    0.000 M/sec
                 28 CPU-migrations            #    0.000 M/sec
                379 page-faults               #    0.000 M/sec
    117,298,787,242 cycles                    #    2.513 GHz
    117,471,719,598 instructions              #    1.00  insns per cycle
     25,104,504,912 branches                  #  537.904 M/sec
      3,185,376,029 branch-misses             #   12.69% of all branches

       46.779031563 seconds time elapsed

BubbleSortA, Limit = 50000:

    Performance counter stats for './bubbleA.out':

       46023.785539 task-clock                #    0.998 CPUs utilized
                 59 context-switches          #    0.000 M/sec
                  8 CPU-migrations            #    0.000 M/sec
                379 page-faults               #    0.000 M/sec
    118,261,821,200 cycles                    #    2.570 GHz
    119,230,362,230 instructions              #    1.01  insns per cycle
     25,089,204,844 branches                  #  545.136 M/sec
      3,200,514,556 branch-misses             #   12.76% of all branches

       46.126274884 seconds time elapsed

BubbleSortB, Limit = 10:

    Performance counter stats for './bubbleB.out':

       26091.323705 task-clock                #    0.998 CPUs utilized
                 28 context-switches          #    0.000 M/sec
                  2 CPU-migrations            #    0.000 M/sec
                379 page-faults               #    0.000 M/sec
     64,822,368,062 cycles                    #    2.484 GHz
    137,780,774,165 instructions              #    2.13  insns per cycle
     25,052,329,633 branches                  #  960.179 M/sec
          3,019,138 branch-misses             #    0.01% of all branches

       26.149447493 seconds time elapsed

BubbleSortB, Limit = 50000:

    Performance counter stats for './bubbleB.out':

       51644.210268 task-clock                #    0.983 CPUs utilized
              2,138 context-switches          #    0.000 M/sec
                 69 CPU-migrations            #    0.000 M/sec
                378 page-faults               #    0.000 M/sec
    144,600,738,759 cycles                    #    2.800 GHz
    124,273,104,207 instructions              #    0.86  insns per cycle
     25,104,320,436 branches                  #  486.101 M/sec
      3,929,572,460 branch-misses             #   15.65% of all branches

       52.511233236 seconds time elapsed

Captain Man · Answer

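Edit 2: This answer is probably wrong in most cases. Below, where I say that everything above is fact, that still holds, but the lower portion is not true for most processor architectures; see the comments. However, it is still theoretically possible that some JVM on some OS/architecture does this, but that JVM is probably poorly implemented or the architecture is a weird one. Also, this is theoretically possible only in the sense that most conceivable things are theoretically possible, so I would take the last portion with a grain of salt.

First off, I am not sure about the C++, but I can talk a little about the Java.

Here is some code: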

    public class Example {

        public static boolean less(final int a, final int b) {
            return a < b;
        }

        public static boolean lessOrEqual(final int a, final int b) {
            return a <= b;
        }
    }

Running `javap -c` on it gives the bytecode:

    public class Example {
      public Example();
        Code:
           0: aload_0
           1: invokespecial #8                  // Method java/lang/Object."<init>":()V
           4: return

      public static boolean less(int, int);
        Code:
           0: iload_0
           1: iload_1
           2: if_icmpge     7
           5: iconst_1
           6: ireturn
           7: iconst_0
           8: ireturn

      public static boolean lessOrEqual(int, int);
        Code:
           0: iload_0
           1: iload_1
           2: if_icmpgt     7
           5: iconst_1
           6: ireturn
           7: iconst_0
           8: ireturn
    }

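The only difference is `if_icmpge` (branch if greater than or equal) versus `if_icmpgt` (branch if greater than).

Everything above is fact. The rest is my best guess as to how `if_icmpge` and `if_icmpgt` are handled, based on a college course I took on assembly language. To get a better answer, you should look up how your JVM handles these. My guess is that C++ compiles down to a similar operation.

Edit: The documentation for `if_i<cond>` is here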

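The way computers compare numbers is by subtracting one from the other and checking whether the result is 0 or not. So when evaluating a < b, it subtracts b from a and checks the sign of the result to see whether it is less than 0 (a - b < 0). To evaluate a <= b, though, it has to perform an additional step and subtract 1 (a - b - 1 < 0).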
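As a small check of the arithmetic only: for integers where the subtraction does not overflow, a < b is equivalent to a - b < 0, and a <= b is equivalent to a - b - 1 < 0. The sketch below verifies this over a small range. The class `SubtractCompare` and its methods are hypothetical names, and the code says nothing about how a JVM or CPU actually implements these comparisons.

```java
// Hypothetical helper class, not part of the original post.
class SubtractCompare {

    // a < b expressed as a sign test on the difference (assumes no overflow).
    static boolean lessViaSub(int a, int b) {
        return a - b < 0;
    }

    // a <= b expressed with the extra "subtract 1" step (assumes no overflow).
    static boolean lessOrEqualViaSub(int a, int b) {
        return a - b - 1 < 0;
    }

    // Returns true if both formulations agree with the built-in operators
    // for every pair (a, b) with lo <= a, b <= hi.
    static boolean agreesOnRange(int lo, int hi) {
        for (int a = lo; a <= hi; a++) {
            for (int b = lo; b <= hi; b++) {
                if (lessViaSub(a, b) != (a < b)) {
                    return false;
                }
                if (lessOrEqualViaSub(a, b) != (a <= b)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("agree on [-100, 100]: " + agreesOnRange(-100, 100));
    }
}
```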

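Normally this is a very minuscule difference, but this isn't just any code, this is freaking bubble sort! The comparison sits in the innermost loop, so on average it is executed O(n^2) times.

Yes, it may have something to do with branch prediction. I am not sure, as I am not an expert on that, but I think this may also play a non-negligible role.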