Millions of 3D points: how to find the 10 closest to a given point

Question

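A point in 3-d is defined by (x, y, z). The distance d between any two points (X, Y, Z) and (x, y, z) is d = Sqrt[(X-x)^2 + (Y-y)^2 + (Z-z)^2]. Now there are a million entries in a file; each entry is some point in space, in no specific order. Given any point (a, b, c), find the 10 points nearest to it. How would you store the million points, and how would you retrieve those 10 points from that data structure?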

jfs · Accepted Answer

A million points is a small number. The most straightforward approach works here (code based on KDTree is slower when only one point is queried).

Brute-force approach (time ~1 second)

    #!/usr/bin/env python
    # Python 2 script (print statement)
    import numpy

    NDIM = 3 # number of dimensions

    # read points into array
    a = numpy.fromfile('million_3D_points.txt', sep=' ')
    a.shape = a.size / NDIM, NDIM

    point = numpy.random.uniform(0, 100, NDIM) # choose random point
    print 'point:', point
    d = ((a-point)**2).sum(axis=1)  # compute squared distances
    ndx = d.argsort() # indirect sort

    # print 10 nearest points to the chosen one
    import pprint
    pprint.pprint(zip(a[ndx[:10]], d[ndx[:10]]))

Run it:

    $ time python nearest.py
    point: [ 69.06310224 2.23409409 50.41979143]
    [(array([ 69., 2., 50.]), 0.23500677815852947),
     (array([ 69., 2., 51.]), 0.39542392750839772),
     (array([ 69., 3., 50.]), 0.76681859086988302),
     (array([ 69., 3., 50.]), 0.76681859086988302),
     (array([ 69., 3., 51.]), 0.9272357402197513),
     (array([ 70., 2., 50.]), 1.1088022980015722),
     (array([ 70., 2., 51.]), 1.2692194473514404),
     (array([ 70., 2., 51.]), 1.2692194473514404),
     (array([ 70., 3., 51.]), 1.801031260062794),
     (array([ 69., 1., 51.]), 1.8636121147970444)]

    real    0m1.122s
    user    0m1.010s
    sys     0m0.120s

Here is the script that generates the million 3D points:

    #!/usr/bin/env python
    import random
    for _ in xrange(10**6):
        print ' '.join(str(random.randrange(100)) for _ in range(3))

Output:

    $ head million_3D_points.txt
    18 56 26
    19 35 74
    47 43 71
    82 63 28
    43 82 0
    34 40 16
    75 85 69
    88 58 3
    0 63 90
    81 78 98

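You could use that code to test more complex data structures and algorithms (for example, whether they actually consume less memory or run faster than the simplest approach above). It is worth noting that, at the moment, this is the only answer that contains working code.

Solution based on KDTree (time ~1.4 seconds)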

    #!/usr/bin/env python
    import numpy

    NDIM = 3 # number of dimensions

    # read points into array
    a = numpy.fromfile('million_3D_points.txt', sep=' ')
    a.shape = a.size / NDIM, NDIM

    point = [ 69.06310224, 2.23409409, 50.41979143] # use the same point as above
    print 'point:', point

    from scipy.spatial import KDTree

    # find 10 nearest points
    tree = KDTree(a, leafsize=a.shape[0]+1)
    distances, ndx = tree.query([point], k=10)

    # print 10 nearest points to the chosen one
    print a[ndx]

Run it:

    $ time python nearest_kdtree.py
    point: [69.063102240000006, 2.2340940900000001, 50.419791429999997]
    [[[ 69. 2. 50.]
      [ 69. 2. 51.]
      [ 69. 3. 50.]
      [ 69. 3. 50.]
      [ 69. 3. 51.]
      [ 70. 2. 50.]
      [ 70. 2. 51.]
      [ 70. 2. 51.]
      [ 70. 3. 51.]
      [ 69. 1. 51.]]]

    real    0m1.359s
    user    0m1.280s
    sys     0m0.080s

Partial sort in C++ (time ~1.1 seconds)

    // $ g++ nearest.cc && (time ./a.out < million_3D_points.txt )
    #include <algorithm>
    #include <iostream>
    #include <vector>

    #include <boost/lambda/lambda.hpp>  // _1
    #include <boost/lambda/bind.hpp>    // bind()
    #include <boost/tuple/tuple_io.hpp>

    namespace {
      typedef double coord_t;
      typedef boost::tuple<coord_t,coord_t,coord_t> point_t;

      coord_t distance_sq(const point_t& a, const point_t& b) { // or boost::geometry::distance
        coord_t x = a.get<0>() - b.get<0>();
        coord_t y = a.get<1>() - b.get<1>();
        coord_t z = a.get<2>() - b.get<2>();
        return x*x + y*y + z*z;
      }
    }

    int main() {
      using namespace std;
      using namespace boost::lambda; // _1, _2, bind()

      // read array from stdin
      vector<point_t> points;
      cin.exceptions(ios::badbit); // throw exception on bad input
      coord_t x,y,z;
      while(cin >> x >> y >> z) // stop at the first failed read so no stale point is stored
        points.push_back(boost::make_tuple(x,y,z));

      // use point value from previous examples
      point_t point(69.06310224, 2.23409409, 50.41979143);
      cout << "point: " << point << endl;  // 1.14s

      // find 10 nearest points using partial_sort()
      // Complexity: O(N)*log(m) comparisons (O(N)*log(N) worst case for the implementation)
      const size_t m = 10;
      partial_sort(points.begin(), points.begin() + m, points.end(),
                   bind(less<coord_t>(), // compare by distance to the point
                        bind(distance_sq, _1, point),
                        bind(distance_sq, _2, point)));
      for_each(points.begin(), points.begin() + m, cout << _1 << "\n"); // 1.16s
    }

Run it:

    $ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
    point: (69.0631 2.23409 50.4198)
    (69 2 50)
    (69 2 51)
    (69 3 50)
    (69 3 50)
    (69 3 51)
    (70 2 50)
    (70 2 51)
    (70 2 51)
    (70 3 51)
    (69 1 51)

    real    0m1.152s
    user    0m1.140s
    sys     0m0.010s

Priority queue in C++ (time ~1.2 seconds)

    #include <algorithm>           // make_heap
    #include <functional>          // binary_function<>
    #include <iostream>

    #include <boost/range.hpp>     // boost::begin(), boost::end()
    #include <boost/tr1/tuple.hpp> // get<>, tuple<>, cout <<

    namespace {
      typedef double coord_t;
      typedef std::tr1::tuple<coord_t,coord_t,coord_t> point_t;

      // calculate distance (squared) between points `a` & `b`
      coord_t distance_sq(const point_t& a, const point_t& b) {
        // boost::geometry::distance() squared
        using std::tr1::get;
        coord_t x = get<0>(a) - get<0>(b);
        coord_t y = get<1>(a) - get<1>(b);
        coord_t z = get<2>(a) - get<2>(b);
        return x*x + y*y + z*z;
      }

      // read from input stream `in` to the point `point_out`
      std::istream& getpoint(std::istream& in, point_t& point_out) {
        using std::tr1::get;
        return (in >> get<0>(point_out) >> get<1>(point_out) >> get<2>(point_out));
      }

      // Adaptable binary predicate that defines whether the first
      // argument is nearer than the second one to given reference point
      template<class T>
      class less_distance : public std::binary_function<T, T, bool> {
        const T& point;
      public:
        less_distance(const T& reference_point) : point(reference_point) {}

        bool operator () (const T& a, const T& b) const {
          return distance_sq(a, point) < distance_sq(b, point);
        }
      };
    }

    int main() {
      using namespace std;

      // use point value from previous examples
      point_t point(69.06310224, 2.23409409, 50.41979143);
      cout << "point: " << point << endl;

      const size_t nneighbours = 10; // number of nearest neighbours to find
      point_t points[nneighbours+1];

      // populate `points`
      for (size_t i = 0; getpoint(cin, points[i]) && i < nneighbours; ++i)
        ;

      less_distance<point_t> less_distance_point(point);
      make_heap  (boost::begin(points), boost::end(points), less_distance_point);

      // Complexity: O(N*log(m))
      while(getpoint(cin, points[nneighbours])) {
        // add points[-1] to the heap; O(log(m))
        push_heap(boost::begin(points), boost::end(points), less_distance_point);
        // remove (move to last position) the most distant from the
        // `point` point; O(log(m))
        pop_heap (boost::begin(points), boost::end(points), less_distance_point);
      }

      // print results
      push_heap  (boost::begin(points), boost::end(points), less_distance_point);
      //   O(m*log(m))
      sort_heap  (boost::begin(points), boost::end(points), less_distance_point);
      for (size_t i = 0; i < nneighbours; ++i) {
        cout << points[i] << ' ' << distance_sq(points[i], point) << '\n';
      }
    }

Run it:

    $ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
    point: (69.0631 2.23409 50.4198)
    (69 2 50) 0.235007
    (69 2 51) 0.395424
    (69 3 50) 0.766819
    (69 3 50) 0.766819
    (69 3 51) 0.927236
    (70 2 50) 1.1088
    (70 2 51) 1.26922
    (70 2 51) 1.26922
    (70 3 51) 1.80103
    (69 1 51) 1.86361

    real    0m1.174s
    user    0m1.180s
    sys     0m0.000s

Linear search-based approach (time ~1.15 seconds)

    // $ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
    #include <algorithm>           // sort
    #include <functional>          // binary_function<>
    #include <iostream>

    #include <boost/foreach.hpp>
    #include <boost/range.hpp>     // begin(), end()
    #include <boost/tr1/tuple.hpp> // get<>, tuple<>, cout <<

    #define foreach BOOST_FOREACH

    namespace {
      typedef double coord_t;
      typedef std::tr1::tuple<coord_t,coord_t,coord_t> point_t;

      // calculate distance (squared) between points `a` & `b`
      coord_t distance_sq(const point_t& a, const point_t& b);

      // read from input stream `in` to the point `point_out`
      std::istream& getpoint(std::istream& in, point_t& point_out);

      // Adaptable binary predicate that defines whether the first
      // argument is nearer than the second one to given reference point
      class less_distance : public std::binary_function<point_t, point_t, bool> {
        const point_t& point;
      public:
        explicit less_distance(const point_t& reference_point)
            : point(reference_point) {}
        bool operator () (const point_t& a, const point_t& b) const {
          return distance_sq(a, point) < distance_sq(b, point);
        }
      };
    }

    int main() {
      using namespace std;

      // use point value from previous examples
      point_t point(69.06310224, 2.23409409, 50.41979143);
      cout << "point: " << point << endl;
      less_distance nearer(point);

      const size_t nneighbours = 10; // number of nearest neighbours to find
      point_t points[nneighbours];

      // populate `points`
      foreach (point_t& p, points)
        if (! getpoint(cin, p))
          break;

      // Complexity: O(N*m)
      point_t current_point;
      while(cin) {
        getpoint(cin, current_point); //NOTE: `cin` fails after the last
                                      //point, so one can't lift it up to
                                      //the while condition

        // move to the last position the most distant from the
        // `point` point; O(m)
        foreach (point_t& p, points)
          if (nearer(current_point, p))
            // found point that is nearer to the `point`

            //NOTE: could use insert (on sorted sequence) & break instead
            //of swap but in that case it might be better to use
            //heap-based algorithm altogether
            std::swap(current_point, p);
      }

      // print results;  O(m*log(m))
      sort(boost::begin(points), boost::end(points), nearer);
      foreach (point_t p, points)
        cout << p << ' ' << distance_sq(p, point) << '\n';
    }

    namespace {
      coord_t distance_sq(const point_t& a, const point_t& b) {
        // boost::geometry::distance() squared
        using std::tr1::get;
        coord_t x = get<0>(a) - get<0>(b);
        coord_t y = get<1>(a) - get<1>(b);
        coord_t z = get<2>(a) - get<2>(b);
        return x*x + y*y + z*z;
      }

      std::istream& getpoint(std::istream& in, point_t& point_out) {
        using std::tr1::get;
        return (in >> get<0>(point_out) >> get<1>(point_out) >> get<2>(point_out));
      }
    }

Measurements show that most of the time is spent reading the array from the file; the actual computation takes comparatively little time.

Will · Answer

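If the million entries are already in a file, there is no need to load them all into an in-memory data structure. Just keep an array of the ten nearest points found so far, scan over the million points, and update your top-ten list as you go.

This is O(n) in the number of points.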
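A minimal Python sketch of this single-pass scan, assuming each line holds three whitespace-separated numbers as in the generated file above. The function name `nearest_from_stream` and the in-memory sample are illustrative, not part of the answer; the candidates are kept in a small sorted list so the current tenth-nearest is always the last entry.

```python
import bisect
import io


def nearest_from_stream(lines, target, k=10):
    """Scan "x y z" lines once, keeping only the k nearest points seen so far."""
    best = []  # sorted list of (squared_distance, point), at most k entries
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        p = tuple(float(f) for f in fields)
        d2 = sum((a - b) ** 2 for a, b in zip(p, target))
        if len(best) < k:
            bisect.insort(best, (d2, p))
        elif d2 < best[-1][0]:
            bisect.insort(best, (d2, p))
            best.pop()  # drop the point that is now the (k+1)-th nearest
    return best


# Any iterable of lines works; an open file object is read lazily, line by line.
sample = io.StringIO("0 0 0\n5 5 5\n1 1 1\n9 9 9\n2 2 2\n")
result = nearest_from_stream(sample, (1.0, 1.0, 1.0), k=3)
print(result)
```

With the real data this would be called as `nearest_from_stream(open('million_3D_points.txt'), (a, b, c))`, so memory use stays constant regardless of file size.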

mipadi · Answer

You could store the points in a k-dimensional tree (kd-tree). Kd-trees are optimized for nearest-neighbor searches (finding the n points closest to a given point).
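As a rough illustration of the idea, here is a small pure-Python kd-tree with a k-nearest query. It is a sketch under simple assumptions (all points fit in memory, the tree is built once by median splits, nodes are plain tuples); the names `build_kdtree` and `k_nearest` are made up for this example, and a library implementation such as `scipy.spatial.KDTree` would normally be preferred.

```python
import heapq


def dist_sq(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Return a nested (point, left, right) tuple, or None for an empty set."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through x, y, z
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # the median becomes the splitting node
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def k_nearest(tree, target, k=10):
    """Return the k nearest points as (squared_distance, point), nearest first."""
    heap = []  # max-heap emulated with negated distances: (-d2, point)

    def visit(node, depth):
        if node is None:
            return
        point, left, right = node
        d2 = dist_sq(point, target)
        if len(heap) < k:
            heapq.heappush(heap, (-d2, point))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, point))
        axis = depth % len(target)
        diff = target[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near, depth + 1)
        # Cross the splitting plane only if a closer point could lie beyond it.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far, depth + 1)

    visit(tree, 0)
    return sorted((-neg_d2, p) for neg_d2, p in heap)
```

Building the tree costs more than one brute-force scan, so this only pays off when the same point set is queried many times.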

Krystian · Answer

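I think this is a tricky question that tests whether you try to overdo things.

Consider the simplest algorithm people have already given above: keep a table of the ten best candidates so far and go through all the points one by one. If you find a point closer than any of the ten best so far, replace it. What is the complexity? We have to look at each point from the file once, calculate its distance (or actually the square of the distance), and compare it with the 10th closest point. If it is better, insert it in the appropriate place in the 10-best-so-far table.

So what is the complexity? We look at each point once, so that is n distance computations and n comparisons. If a point is better, we need to insert it in the right position. This requires some more comparisons, but it is a constant factor, since the table of best candidates has a constant size of 10.

We end up with an algorithm that runs in linear time, O(n) in the number of points.

But what is the lower bound for such an algorithm? If there is no order in the input data, we have to look at each point to see whether it is one of the closest ones. So, as far as I can see, the lower bound is Omega(n), and thus the above algorithm is optimal.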

Agnel Kurian · Answer

You don't need to calculate the distance. The square of the distance is enough for your needs, and it should be faster: in other words, you can skip the sqrt bit.

David Z · Answer

This isn't a homework question, is it? ;-)

My thought: iterate over all the points and put them into a min-heap or bounded priority queue, keyed by distance from the target.
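In Python, the standard library's `heapq.nsmallest` already behaves like a bounded priority queue for this purpose. A short sketch, where the function name `k_nearest` and the sample points are assumptions for illustration:

```python
import heapq


def k_nearest(points, target, k=10):
    """Return the k points nearest to target, nearest first."""
    def dist_sq(p):
        return sum((a - b) ** 2 for a, b in zip(p, target))
    # nsmallest accepts any iterable, so points may also be a generator.
    return heapq.nsmallest(k, points, key=dist_sq)


points = [(0, 0, 0), (5, 5, 5), (1, 1, 1), (9, 9, 9), (2, 2, 2)]
print(k_nearest(points, (1, 1, 1), k=2))
```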

Kai · Answer

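This question is essentially testing your knowledge of, or intuition for, space partitioning algorithms. I'd say that storing the data in an octree is your best bet. Octrees are commonly used in 3D engines that handle just this kind of problem (storing millions of vertices, ray tracing, collision detection, and so on). Lookup time will be on the order of log(n) in the worst-case scenario (I believe).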

Jeff Meatball Yang · Answer

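Straightforward algorithm:

Store the points as a list of tuples, scan over the points computing the distance, and keep a "closest" list.

More creative:

Group the points into regions (such as the cube described by "0,0,0" to "50,50,50", or "0,0,0" to "-20,-20,-20") so that you can "index" into them from the target point. Check which cube the target lies in, and search only the points in that cube. If there are fewer than 10 points in that cube, check the "neighboring" cubes, and so on.

On further thought, this isn't a very good algorithm: if your target point is closer to the wall of a cube than to its 10 nearest points, you will have to search the neighboring cubes as well.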
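The wall problem can be handled by searching cubes in growing shells around the target's cube and stopping only when no unvisited cube can hold a closer point. The Python sketch below makes that concrete; the cube size `cell`, the dictionary-of-lists grid, and the names `build_grid` and `grid_nearest` are all assumptions for illustration, not something the answer specifies.

```python
import math
from collections import defaultdict
from itertools import product


def build_grid(points, cell):
    """Bucket points into axis-aligned cubes with edge length `cell`."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(c / cell) for c in p)].append(p)
    return grid


def grid_nearest(grid, cell, target, k=10):
    """Search cubes in growing shells around the cube containing the target."""
    if not grid:
        return []
    home = tuple(math.floor(c / cell) for c in target)
    # Largest shell index that can still contain an occupied cube.
    max_ring = max(max(abs(a - b) for a, b in zip(key, home)) for key in grid)
    found = []  # (squared_distance, point) candidates
    for ring in range(max_ring + 1):
        for offset in product(range(-ring, ring + 1), repeat=3):
            if max(abs(o) for o in offset) != ring:
                continue  # interior cubes were handled by earlier shells
            key = tuple(h + o for h, o in zip(home, offset))
            for p in grid.get(key, ()):
                found.append((sum((a - b) ** 2 for a, b in zip(p, target)), p))
        found.sort()
        del found[k:]
        # Every unvisited point is more than ring * cell away from the target,
        # so the search can stop once the k-th candidate is within that bound.
        if len(found) == k and found[-1][0] <= (ring * cell) ** 2:
            break
    return found
```

The sketch favors clarity over speed: a target far outside the occupied region makes it walk through many empty shells, which a real implementation would avoid.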

I would go with the kd-tree approach: find the closest node, then remove (or mark) that closest node and search again for a new closest node. Rinse and repeat.

Kirk Broadhurst · Answer

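For any two points P1 (x1, y1, z1) and P2 (x2, y2, z2), if the distance between the points is d, then all of the following must be true:

    |x1 - x2| <= d
    |y1 - y2| <= d
    |z1 - z2| <= d

Hold the 10 closest points as you iterate over the entire set, but also hold the distance to the 10th closest. Before calculating the distance for each point you look at, use these three conditions to greatly reduce the work.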
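A Python sketch of this pre-check, assuming the points are tuples and the candidates are kept in a small sorted list (the function name `k_nearest_pruned` and the `skipped` counter are illustrative). Note that the scan still visits every point, so it remains O(n); the check only saves the full distance computation for points that fail it.

```python
import bisect
import math


def k_nearest_pruned(points, target, k=10):
    """Return (best, skipped): the k nearest points and how many were pruned."""
    best = []             # sorted (squared_distance, point) pairs
    limit = float("inf")  # distance to the current k-th nearest point
    skipped = 0
    for p in points:
        # If any single coordinate differs by more than `limit`, the point
        # cannot be closer than the current k-th nearest, so skip it.
        if any(abs(a - b) > limit for a, b in zip(p, target)):
            skipped += 1
            continue
        d2 = sum((a - b) ** 2 for a, b in zip(p, target))
        if len(best) < k or d2 < best[-1][0]:
            bisect.insort(best, (d2, p))
            del best[k:]  # keep only the k nearest
            if len(best) == k:
                limit = math.sqrt(best[-1][0])
    return best, skipped
```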

Yiling · Answer

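This is basically a combination of the first two answers above me. Since the points are in a file, there is no need to keep them in memory. Instead of an array or a min-heap, I would use a max-heap, since you only want to check for distances smaller than that of the 10th closest point. With an array, you would need to compare each newly calculated distance with all 10 of the distances you kept. With a min-heap, you have to perform 3 comparisons for every newly calculated distance. With a max-heap, you perform only 1 comparison when the newly calculated distance is greater than the root node.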
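Python's `heapq` only provides a min-heap, so a common way to sketch the max-heap described here is to store negated squared distances; the root is then the farthest of the points currently kept. The function name `k_nearest_maxheap` and the sample data are assumptions for illustration.

```python
import heapq


def k_nearest_maxheap(points, target, k=10):
    """Return the k nearest points as (squared_distance, point), nearest first."""
    heap = []  # entries are (-squared_distance, point); heap[0] is the farthest kept
    for p in points:
        d2 = sum((a - b) ** 2 for a, b in zip(p, target))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, p))
        elif d2 < -heap[0][0]:  # the single comparison against the root
            heapq.heapreplace(heap, (-d2, p))
    return sorted((-neg_d2, p) for neg_d2, p in heap)


points = [(0, 0, 0), (5, 5, 5), (1, 1, 1), (9, 9, 9), (2, 2, 2)]
print(k_nearest_maxheap(points, (1, 1, 1), k=3))
```

Most points fail the root comparison and are discarded immediately, which is the single-comparison case described above; only points that beat the root pay the O(log k) cost of `heapreplace`.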

Unreason · Answer

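This question needs further definition.

1) The decision about algorithms that pre-index the data depends very much on whether you can hold the whole data set in memory.

With kd-trees and octrees you do not have to hold the data in memory, and performance benefits from that fact: not only is the memory footprint lower, but you also do not have to read the whole file.

With brute force you have to read the whole file and recompute the distances for each new point you search for.

Still, this might not be important to you.

2) Another factor is how many times you will have to search for a point.

As J.F. Sebastian states, brute force is sometimes faster even on large data sets, but note that his benchmarks include reading the whole data set from disk, and that they measure only a single search.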

Rubys · Answer

Calculate the distance for each of them, then do Select(1..10, n) in O(n) time. That would be the naive algorithm, I guess.
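One way to sketch the selection step in Python is a randomized quickselect, which runs in expected (not worst-case) linear time; a worst-case O(n) Select would need something like median-of-medians pivoting. The function name `select_k_smallest` is illustrative, and NumPy users could reach for `numpy.argpartition` instead.

```python
import random


def select_k_smallest(items, k, key):
    """Return the k items with the smallest key, in arbitrary order."""
    items = list(items)
    if k >= len(items):
        return items
    result = []
    while k > 0:
        pivot = key(random.choice(items))
        lower = [x for x in items if key(x) < pivot]
        equal = [x for x in items if key(x) == pivot]
        if k <= len(lower):
            items = lower  # all k answers lie strictly below the pivot
        elif k <= len(lower) + len(equal):
            result += lower + equal[:k - len(lower)]
            k = 0
        else:
            result += lower + equal  # keep these, continue above the pivot
            k -= len(lower) + len(equal)
            items = [x for x in items if key(x) > pivot]
    return result
```

The ten selected points come back unordered; sorting just those ten afterwards costs a constant amount of extra work.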