Efficient way to compute a Vandermonde matrix

Question

I'm computing the Vandermonde matrix of a fairly large 1D array. The natural and clean way to do this is np.vander(). However, I found that it is about 2.5x slower than a list-comprehension-based approach.

    In [43]: x = np.arange(5000)

    In [44]: N = 4

    In [45]: %timeit np.vander(x, N, increasing=True)
    155 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    # one of the listed approaches from the documentation
    In [46]: %timeit np.flip(np.column_stack([x**(N-1-i) for i in range(N)]), axis=1)
    65.3 µs ± 235 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    In [47]: np.all(np.vander(x, N, increasing=True) == np.flip(np.column_stack([x**(N-1-i) for i in range(N)]), axis=1))
    Out[47]: True

I'm trying to understand where the bottleneck is and why the native np.vander() implementation is about 2.5x slower.

Efficiency matters for my implementation, so even faster alternatives are welcome.

Paul Panzer · Accepted Answer

Here are some more methods, some of which are quite a bit faster (on my computer) than what has been posted so far.

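The most important observation, I think, is that it really depends a lot on how many degrees you want. Exponentiation (which I believe is special-cased for small integer exponents) only makes sense for small exponent ranges. The more exponents there are, the better the multiplication-based approaches fare.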

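I'd like to highlight a multiply.accumulate-based method (ma), which is similar to numpy's built-in approach but faster (and not because I skimped on checks; nnc, numpy-no-checks, demonstrates this). For all but the smallest exponent ranges it is actually the fastest for me.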
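A minimal standalone sketch of the accumulate idea, for illustration only. The name `vander_accumulate` is hypothetical, and promoting to at least `np.int64` is an assumption made here rather than a rule from the answer; the `f_ma` function in the benchmark code is the version that was actually timed.

```python
import numpy as np

def vander_accumulate(x, n):
    """Increasing-power Vandermonde matrix built with a running product (sketch)."""
    x = np.asarray(x)
    # Assumption: promote to at least int64 so small int inputs do not wrap.
    out = np.empty((len(x), n), dtype=np.promote_types(x.dtype, np.int64))
    if n > 0:
        out[:, 0] = 1
    if n > 1:
        # Read-only broadcast view of x with shape (n - 1, len(x)); no copies of x.
        base = np.broadcast_to(x, (n - 1, len(x)))
        # Running product along axis 0, written into the transposed tail of `out`,
        # so column j of `out` ends up holding x**j.
        np.multiply.accumulate(base, axis=0, out=out[:, 1:].T)
    return out

x = np.arange(10, dtype=np.int32)
V = vander_accumulate(x, 5)
```

The result is C-contiguous because `out` is allocated in its final layout and only written through a view.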

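For reasons I do not understand, the numpy implementation does three things that are, to the best of my knowledge, slow and unnecessary: (1) it makes quite a few copies of the base vector; (2) it makes them non-contiguous; (3) it does the accumulation in place, which I believe forces buffering.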
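The three points can be seen on a toy array. This is a sketch that mirrors the pattern of np.vander's body for the increasing case; the variable names are illustrative, and it shows the memory layout only, not the cost of any buffering.

```python
import numpy as np

x = np.arange(1, 6)
n = 4

v = np.empty((len(x), n), dtype=np.int64)
v[:, 0] = 1

tail = v[:, 1:]                      # strided view into v, not a separate array
tail[...] = x[:, None]               # (1) writes n - 1 copies of x into the view
is_contig = tail.flags.c_contiguous  # (2) False: rows of `tail` skip column 0 of v

# (3) input and output of the accumulation are the same view (in place)
np.multiply.accumulate(tail, axis=1, out=tail)
```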

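Another thing I'd like to mention is that the fastest method for small ranges of ints (out_e_1, essentially a manual version of ma) is slowed down by a factor of more than two by the simple precaution of promoting to a larger dtype (safe_e_1, arguably a bit of a misnomer).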
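To see why the promotion matters, here is a small sketch with hand-picked values (not taken from the benchmark). Powers computed in int32 wrap around silently once they exceed the int32 range, while the same product in int64 stays exact.

```python
import numpy as np

b = np.array([100, 1000], dtype=np.int32)

# Stays int32 throughout: 100**5 and 1000**5 do not fit, so the values wrap.
narrow = b * b * b * b * b

# Promote first, then multiply: exact for these inputs.
b64 = b.astype(np.int64)
wide = b64 * b64 * b64 * b64 * b64   # [10**10, 10**15]
```

This is consistent with the unpromoted int32 variants being reported as "apparently failed" for n_e=8 in the timings: their output no longer matches the np.vander reference.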

The broadcasting-based methods are called bc_*, where * indicates the broadcast axis (b for base, e for exp). 'cheat' means the result is non-contiguous.
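A short sketch of what "non-contiguous" means here, using small illustrative arrays. Broadcasting along the exponent axis and then transposing returns a view with swapped strides; `np.ascontiguousarray` pays for a copy to restore C order.

```python
import numpy as np

x = np.arange(6)
e = np.arange(4)

cheat = (x ** e[:, None]).T          # shape (6, 4), but only a transposed view
fixed = np.ascontiguousarray(cheat)  # same values, copied into C order
direct = x[:, None] ** e             # broadcast along the base axis: C order already

flags = (cheat.flags.c_contiguous, fixed.flags.c_contiguous, direct.flags.c_contiguous)
# flags == (False, True, True)
```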

Timings (best of 3):

    rep=100 n_b=5000 n_e=4 b_tp=<class 'numpy.int32'> e_tp=<class 'numpy.int32'>
    vander                0.16699657 ms
    bc_b                  0.09595204 ms
    bc_e                  0.07959786 ms
    ma                    0.10755240 ms
    nnc                   0.16459018 ms
    out_e_1               0.02037535 ms
    out_e_2               0.02656622 ms
    safe_e_1              0.04652272 ms
    safe_e_2              0.04081079 ms
    cheat bc_e_cheat      0.04668466 ms

    rep=100 n_b=5000 n_e=8 b_tp=<class 'numpy.int32'> e_tp=<class 'numpy.int32'>
    vander                0.25086462 ms
    bc_b             apparently failed
    bc_e             apparently failed
    ma                    0.15843041 ms
    nnc                   0.24713077 ms
    out_e_1          apparently failed
    out_e_2          apparently failed
    safe_e_1              0.15970622 ms
    safe_e_2              0.19672418 ms
    bc_e_cheat       apparently failed

    rep=100 n_b=5000 n_e=4 b_tp=<class 'float'> e_tp=<class 'numpy.int32'>
    vander                0.16225773 ms
    bc_b                  0.53315020 ms
    bc_e                  0.56200830 ms
    ma                    0.07626799 ms
    nnc                   0.16059748 ms
    out_e_1               0.03653416 ms
    out_e_2               0.04043702 ms
    safe_e_1              0.04060494 ms
    safe_e_2              0.04104209 ms
    cheat bc_e_cheat      0.52966076 ms

    rep=100 n_b=5000 n_e=8 b_tp=<class 'float'> e_tp=<class 'numpy.int32'>
    vander                0.24542852 ms
    bc_b                  2.03353578 ms
    bc_e                  2.04281270 ms
    ma                    0.11075758 ms
    nnc                   0.24212880 ms
    out_e_1               0.14809043 ms
    out_e_2               0.19261359 ms
    safe_e_1              0.15206112 ms
    safe_e_2              0.19308420 ms
    cheat bc_e_cheat      1.99176601 ms

Code:

    import numpy as np
    import types
    from timeit import repeat

    prom={np.dtype(np.int32): np.dtype(np.int64), np.dtype(float): np.dtype(float)}

    def RI(k, N, dt, top=100):
        return np.random.randint(0, top if top else N, (k, N)).astype(dt)

    def RA(k, N, dt, top=None):
        return np.add.outer(np.zeros((k,), int), np.arange(N)%(top if top else N)).astype(dt)

    def RU(k, N, dt, top=100):
        return (np.random.random((k, N))*(top if top else N)).astype(dt)

    def data(k, N_b, N_e, dt_b, dt_e, b_fun=RI, e_fun=RA):
        b = list(b_fun(k, N_b, dt_b))
        e = list(e_fun(k, N_e, dt_e))
        return b, e

    def f_vander(b, e):
        return np.vander(b, len(e), increasing=True)

    def f_bc_b(b, e):
        return b[:, None]**e

    def f_bc_e(b, e):
        return np.ascontiguousarray((b**e[:, None]).T)

    def f_ma(b, e):
        out = np.empty((len(b), len(e)), prom[b.dtype])
        out[:, 0] = 1
        np.multiply.accumulate(np.broadcast_to(b, (len(e)-1, len(b))), axis=0, out=out[:, 1:].T)
        return out

    def f_nnc(b, e):
        out = np.empty((len(b), len(e)), prom[b.dtype])
        out[:, 0] = 1
        out[:, 1:] = b[:, None]
        np.multiply.accumulate(out[:, 1:], out=out[:, 1:], axis=1)
        return out

    def f_out_e_1(b, e):
        out = np.empty((len(b), len(e)), b.dtype)
        out[:, 0] = 1
        out[:, 1] = b
        out[:, 2] = c = b*b
        for i in range(3, len(e)):
            c*=b
            out[:, i] = c
        return out

    def f_out_e_2(b, e):
        out = np.empty((len(b), len(e)), b.dtype)
        out[:, 0] = 1
        out[:, 1] = b
        out[:, 2] = b*b
        for i in range(3, len(e)):
            out[:, i] = out[:, i-1] * b
        return out

    def f_safe_e_1(b, e):
        out = np.empty((len(b), len(e)), prom[b.dtype])
        out[:, 0] = 1
        out[:, 1] = b
        out[:, 2] = c = (b*b).astype(prom[b.dtype])
        for i in range(3, len(e)):
            c*=b
            out[:, i] = c
        return out

    def f_safe_e_2(b, e):
        out = np.empty((len(b), len(e)), prom[b.dtype])
        out[:, 0] = 1
        out[:, 1] = b
        out[:, 2] = b*b
        for i in range(3, len(e)):
            out[:, i] = out[:, i-1] * b
        return out

    def f_bc_e_cheat(b, e):
        return (b**e[:, None]).T

    for params in [(100, 5000, 4, np.int32, np.int32),
                   (100, 5000, 8, np.int32, np.int32),
                   (100, 5000, 4, float, np.int32),
                   (100, 5000, 8, float, np.int32)]:
        k = params[0]
        dat = data(*params)
        ref = f_vander(dat[0][0], dat[1][0])
        print('rep={} n_b={} n_e={} b_tp={} e_tp={}'.format(*params))
        for name, func in list(globals().items()):
            if not name.startswith('f_') or not isinstance(func, types.FunctionType):
                continue
            try:
                assert np.allclose(ref, func(dat[0][0], dat[1][0]))
                if not func(dat[0][0], dat[1][0]).flags.c_contiguous:
                    print('cheat', end=' ')
                print("{:16s}{:16.8f} ms".format(name[2:], np.min(repeat(
                    'f(b.pop(), e.pop())', setup='b, e = data(*p)',
                    globals={'f':func, 'data':data, 'p':params}, number=k)) * 1000 / k))
            except:
                print("{:16s} apparently failed".format(name[2:]))

cs95 · Answer

How about broadcasted exponentiation?

    %timeit (x ** np.arange(N)[:, None]).T
    43 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Sanity check:

    np.all((x ** np.arange(N)[:, None]).T == np.vander(x, N, increasing=True))
    True

The caveat here is that this speedup is possible only if your input array x has an integer dtype. As @Warren Weckesser pointed out in a comment, broadcasted exponentiation slows down for floating-point arrays.
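To check the dtype caveat on your own machine outside IPython, something like the following sketch could be used. The helper name `bc_power` and the repeat count are arbitrary choices, and no particular timings are implied; results will vary by hardware and NumPy version.

```python
import numpy as np
from timeit import timeit

N = 4
x_int = np.arange(5000, dtype=np.int64)
x_float = x_int.astype(float)

def bc_power(x, n):
    """Broadcasted exponentiation; returns a transposed (non-contiguous) view."""
    return (x ** np.arange(n)[:, None]).T

runs = 200  # arbitrary; raise it for steadier numbers
for label, x in (("int", x_int), ("float", x_float)):
    t_bc = timeit(lambda: bc_power(x, N), number=runs) / runs
    t_vander = timeit(lambda: np.vander(x, N, increasing=True), number=runs) / runs
    print(f"{label:5s}  broadcast {t_bc * 1e6:8.1f} µs   np.vander {t_vander * 1e6:8.1f} µs")
```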

As for why np.vander is slow, take a look at the source code:

    x = asarray(x)
    if x.ndim != 1:
        raise ValueError("x must be a one-dimensional array or sequence.")
    if N is None:
        N = len(x)

    v = empty((len(x), N), dtype=promote_types(x.dtype, int))
    tmp = v[:, ::-1] if not increasing else v

    if N > 0:
        tmp[:, 0] = 1
    if N > 1:
        tmp[:, 1:] = x[:, None]
        multiply.accumulate(tmp[:, 1:], out=tmp[:, 1:], axis=1)

    return v

The function has to cater to many more use cases than yours, so it uses a more general method of computation that is reliable but slower (I'm specifically pointing at multiply.accumulate).

As a matter of interest, I found another way of computing the Vandermonde matrix:

    %timeit x[:, None] ** np.arange(N)
    150 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

It does the same thing, but is much slower. The answer lies in the fact that the operation is broadcast, but inefficiently.

On the flip side, for float arrays, this actually ends up performing the best.