What is the fastest way to find the N vectors in a list that are closest to a vector X?

Question

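I have a large (~20,000) unsorted list of large (~200-dimensional) vectors. I can create a new vector of the same size, and I want to find the top N (usually around 10) existing vectors in the list that are closest to it, as defined by cosine similarity. My current approach is to generate a list of vector differences, sort it, and take the top 10, but that takes quite a while per comparison (2 or 3 seconds), and I will be running many comparisons at a time.

I am also free to preprocess the list of vectors (it is only "unsorted" because I don't know whether a list of n-dimensional vectors can be sorted at all). The vectors will always be the same size. For context, the vectors are Word2vec results.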

John Forkosh · Answer

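I had the same problem: finding the n smallest of N integers. Note that finding the single smallest value takes N comparisons. So my solution uses n*N comparisons to find the n smallest, rather than the N^2 of a simple bubble sort. Here is the C code, along with a simple test driver I wrote for this post...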

    /* --- standard headers --- */
    #include <stdio.h>
    #include <stdlib.h>
    /*===========================================================================
     * Function:    nsmallest ( n, nx, x )
     * Purpose:     finds the n smallest values in x[nx], returning their indexes
     * --------------------------------------------------------------------------
     * Arguments:   n (I)       int containing number of smallest x[nx]'s
     *                          whose indexes are to be returned
     *              nx (I)      int containing number of values in x[nx]
     *              x (I)       int* containing nx values, the indexes
     *                          of whose smallest n values are to be returned
     * Returns:     (int *)     list of indexes containing the smallest
     *                          n values in x[nx].
     * --------------------------------------------------------------------------
     * Notes:     o indexes[] is a static 999-entry buffer that is also used
     *              as one slot of scratch space, so n must be at most 998,
     *              and each call overwrites the previous result
     *=========================================================================*/
    int *nsmallest ( int n, int nx, int *x ) {
      static int indexes[999];    /* returned indexes of n smallest x[nx]'s */
      int ix = 0,                 /* x[] index */
          index=0, jndex=0,       /* indexes[] indexes */
          nindexes = 1;           /* number of smallest x[]'s found so far */
      indexes[0] = 0;             /* init indexes[] list with first x[] */
      for ( ix=1; ix<nx; ix++ ) { /* search for n smallest x[nx]'s */
        for ( index=0; index<nindexes; index++ ) { /* compare x[ix] to indexes[] */
          if ( x[ix] < x[indexes[index]] ) { /* put ix before indexes[index] */
            for ( jndex=nindexes-1; jndex>=index; jndex-- ) /* work backwards */
              indexes[jndex+1] = indexes[jndex]; /* move each indexes[] "down" */
            indexes[index] = ix;  /* put current ix in now-vacant slot */
            break;                /* no need for further comparisons */
          } /* --- end-of-if(x[ix]<x[indexes[index]]) --- */
        } /* --- end-of-for(index) --- */
        if ( nindexes < n ) {     /* still need more smallest x[nx]'s */
          if ( index >= nindexes ) indexes[nindexes] = ix; /* ix in last slot */
          nindexes++; }           /* count another smallest x[nx] */
      } /* --- end-of-for(ix) --- */
      return ( indexes );         /* indexes of n smallest x[nx]'s to caller */
    } /* --- end-of-function nsmallest() --- */

    #ifdef TESTDRIVE
    int main ( int argc, char *argv[] ) {
      int n    = ( argc>1? atoi(argv[1]) : 10 ),
          nx   = ( argc>2? atoi(argv[2]) : 9999 ),
          seed = ( argc>3? atoi(argv[3]) : 987654321 );
      double xmax = ( argc>4? (double)atoi(argv[4]) : 999999.0 );
      int x[99999], ix=0, *indexes=NULL;
      if ( n<1 || n>998 || nx<n || nx>99999 ) { /* stay inside both buffers */
        fprintf(stderr,"need 1<=n<=998 and n<=nx<=99999\n");
        exit ( 1 ); }
      srand(seed);
      for ( ix=0; ix<nx; ix++ )   /* fill x[] with random test values */
        x[ix] = (int)( xmax*((double)rand())/((double)RAND_MAX) );
      indexes = nsmallest(n,nx,x);
      printf("%d smallest x[%d]'s...\n",n,nx);
      for ( ix=0; ix<n; ix++ )
        printf(" %d) x[%d] = %d\n", ix+1,indexes[ix],x[indexes[ix]]);
      exit ( 0 );
    } /* --- end-of-main() nsmallest test driver --- */
    #endif
    /* ------------------------ end-of-file nsmallest.c ---------------------- */

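Edit: I originally wrote the above quick-and-dirty for my own purposes, which were not particularly time-critical, so the code as posted was fine for me. But after posting it and taking another look, I noticed that the "index" loop runs forward, from smallest to largest. That means it has to go through the entire loop for each candidate number before it can discard that number.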

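So, mostly just for kicks, I rewrote it to run backward, from largest to smallest. A candidate can then be discarded immediately if it is already larger than the largest of the small numbers currently in the list. I also replaced the "jndex" loop, which makes room for a newly found small number, with a single memmove() (though that was only a very minor improvement).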
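The rewritten code was not posted, so the following is only a minimal sketch of what such a version might look like, not the author's actual implementation. The function name nsmallest_fast and the caller-supplied out[] buffer (used here in place of the static indexes[] array) are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical rewrite: stores the indexes of the n smallest values of
   x[nx] in out[n], smallest first, and returns how many were stored
   (n, or nx when nx < n). out[] must have room for n ints. */
int nsmallest_fast(int n, int nx, const int *x, int *out) {
    int count = 0;                    /* entries currently held in out[] */
    assert(n > 0 && nx >= 0 && x != NULL && out != NULL);
    for (int ix = 0; ix < nx; ix++) {
        /* List is full and the candidate is not below the largest value
           kept so far: discard it after a single comparison. */
        if (count == n && x[ix] >= x[out[count - 1]])
            continue;
        /* Scan backward, from largest to smallest, for the slot. */
        int pos = count;
        while (pos > 0 && x[ix] < x[out[pos - 1]])
            pos--;
        /* Shift the tail down one slot with a single memmove(); when the
           list is already full, its last (largest) entry falls off. */
        int nmove = (count < n ? count : n - 1) - pos;
        memmove(&out[pos + 1], &out[pos], (size_t)nmove * sizeof out[0]);
        out[pos] = ix;
        if (count < n)
            count++;
    }
    return count;
}
```

Because the comparison in the backward scan is strict, equal values keep their original relative order. Passing the output buffer in also removes the 999-entry limit and makes the function safe to call repeatedly without overwriting an earlier result.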

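And now, for a test finding the 150 smallest numbers out of 999000 (I increased the x[] array size for this test), the time dropped from 0.337 seconds to 0.012 seconds. Basically, almost every number is discarded immediately, because it is rare to come across a candidate that is smaller than the ones already in the list. So to find the n smallest numbers you do only a little more than N comparisons, far fewer than the earlier n*N.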
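A rough estimate shows why so few candidates survive the first comparison. It assumes the N values are distinct and arrive in random order, which the rand()-generated test data only approximates. Under that assumption, the i-th value is among the n smallest seen so far with probability n/i once i > n, so the expected number of insertions I is:

```latex
\mathbb{E}[I] \;=\; n + \sum_{i=n+1}^{N} \frac{n}{i}
\;\approx\; n\left(1 + \ln\frac{N}{n}\right),
\qquad
n = 150,\; N = 999000:\quad
\mathbb{E}[I] \approx 150\,(1 + \ln 6660) \approx 1.5 \times 10^{3}.
```

On this estimate, only about 0.15% of the 999000 candidates ever reach the insertion step. Already-sorted descending input would be the worst case, since every value would then have to be inserted.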