整数インデックスで配列テーブルを検索する最も速い方法は何ですか？

Question

大量のデータを移動するビデオ処理アプリケーションがあります。

本質的に多くの計算は一度だけ計算する必要があり、再利用できるので、物事をスピードアップするために、ルックアップテーブルを作成しました。

ただし、すべてのルックアップが処理時間の30％を占めるようになりました。 RAMが遅いのではないかと思いますが、それでももう少し最適化してみたいと思います。

現在、私は以下を持っています：

public readonly int[] largeArray = new int[3000*2000]; public readonly int[] lookUp = new int[width*height];

次に、ポインターp（width * y + xと同等）を使用してルックアップを実行し、結果をフェッチします。

int[] newResults = new int[width*height]; int p = 0; for (int y = 0; y < height; y++) { for (int x = 0; x < width; x++, p++) { newResults[p] = largeArray[lookUp[p]]; } }

最適化するために配列全体をコピーすることはできないことに注意してください。また、アプリケーションは非常にマルチスレッド化されています。

関数スタックの短縮が進んでいたため、ゲッターはなく、読み取り専用配列から直接取得しました。

私もushortへの変換を試みましたが、遅くなるようです（Wordのサイズが原因であると理解しています）。

IntPtrの方が高速でしょうか？どうすればいいですか？

以下は、時間分布のスクリーンショットです。

Marc Gravell · Accepted Answer

ここで行っていることは、事実上「収集」であるように見えます。最近のCPUには、特にVPGATHER**。これは.NET Core 3で公開されており、shouldは以下のように機能します。これはシングルループのシナリオです（おそらくここからダブルループバージョンを取得できます）。

最初の結果：

AVX enabled: False; slow loop from 0 e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb for 524288 loops: 1524ms AVX enabled: True; slow loop from 1024 e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb for 524288 loops: 667ms

コード：

using System; using System.Diagnostics; using System.Runtime.InteropServices; using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86; static class P { static int Gather(int[] source, int[] index, int[] results, bool avx) { // normally you wouldn't have avx as a parameter; that is just so // I can turn it off and on for the test; likewise the "int" return // here is so I can monitor (in the test) how much we did in the "old" // loop, vs AVX2; in real code this would be void return int y = 0; if (Avx2.IsSupported && avx) { var iv = MemoryMarshal.Cast<int, Vector256<int>>(index); var rv = MemoryMarshal.Cast<int, Vector256<int>>(results); unsafe { fixed (int* sPtr = source) { // note: here I'm assuming we are trying to fill "results" in // a single outer loop; for a double-loop, you'll probably need // to slice the spans for (int i = 0; i < rv.Length; i++) { rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4); } } } // move past everything we've processed via SIMD y += rv.Length * Vector256<int>.Count; } // now do anything left, which includes anything not aligned to 256 bits, // plus the "no AVX2" scenario int result = y; int end = results.Length; // hoist, since this is not the JIT recognized pattern for (; y < end; y++) { results[y] = source[index[y]]; } return result; } static void Main() { // invent some random data var Rand = new Random(12345); int size = 1024 * 512; int[] data = new int[size]; for (int i = 0; i < data.Length; i++) data[i] = Rand.Next(255); // build a fake index int[] index = new int[1024]; for (int i = 0; i < index.Length; i++) index[i] = Rand.Next(size); int[] results = new int[1024]; void GatherLocal(bool avx) { // prove that we're getting the same data Array.Clear(results, 0, results.Length); int from = Gather(data, index, results, avx); Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}"); for (int i = 0; i < 32; i++) { Console.Write(results[i].ToString("x2")); } Console.WriteLine(); const int TimeLoop = 1024 * 512; var watch = Stopwatch.StartNew(); for (int i = 0; i < TimeLoop; i++) Gather(data, index, results, avx); watch.Stop(); Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms"); Console.WriteLine(); } GatherLocal(false); if (Avx2.IsSupported) GatherLocal(true); } }