Most efficient way to find the mode in a numpy array

Question

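I have a 2D array containing integers (both positive and negative). Each row represents the values over time for a particular spatial site, whereas each column represents the values for the various spatial sites at a given time.

So if the array is like: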

1 3 4 2 2 7
5 2 2 1 4 1
3 3 2 2 1 1

The result should be

1 3 2 2 2 1

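Note that when there are multiple values for the mode, any one of them (selected randomly) may be set as the mode.

I can iterate over the columns finding the mode one at a time, but I was hoping numpy might have some built-in function to do that. Or maybe there is a trick to find it efficiently without looping.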

fgb · Accepted Answer

Check scipy.stats.mode() (inspired by @tom10's comment):

    import numpy as np
    from scipy import stats

    a = np.array([[1, 3, 4, 2, 2, 7],
                  [5, 2, 2, 1, 4, 1],
                  [3, 3, 2, 2, 1, 1]])

    m = stats.mode(a)
    print(m)

Output:

ModeResult(mode=array([[1, 3, 2, 2, 1, 1]]), count=array([[1, 2, 2, 2, 1, 2]]))

As you can see, it returns both the mode and the counts. You can select the modes directly via m[0]:

print(m[0])

Output:

[[1 3 2 2 1 1]]
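
The shape of the result depends on the SciPy version: the output above keeps a length-1 leading axis, while recent releases (1.11 and later, as far as I know) drop the reduced axis by default. Here is a minimal sketch, assuming the 2D per-column case from the question, that reads the named fields and flattens them so it behaves the same either way. The names `modes` and `counts` are just for illustration.

```python
import numpy as np
from scipy import stats

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# axis=0 computes one mode per column
result = stats.mode(a, axis=0)

# np.ravel gives a flat array whether or not the length-1 axis was kept
modes = np.ravel(result.mode)
counts = np.ravel(result.count)

print(modes)
print(counts)
```

When several values tie within a column, scipy.stats.mode reports the smallest of them.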

Devin Cairns · Answer

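Update

The scipy.stats.mode function has been significantly optimized since this post, and would be the recommended method.

Old answer

This is a tricky problem, since there is not much out there for calculating the mode along an axis. The solution is straightforward for 1-D arrays, where numpy.bincount is handy, along with numpy.unique with the return_counts argument set to True. The most common n-dimensional function I see is scipy.stats.mode, although it is prohibitively slow, especially for large arrays with many unique values. As a solution, I've developed this function, and use it heavily: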

    import numpy

    def mode(ndarray, axis=0):
        # Check inputs
        ndarray = numpy.asarray(ndarray)
        ndim = ndarray.ndim
        if ndarray.size == 1:
            return (ndarray[0], 1)
        elif ndarray.size == 0:
            raise Exception('Cannot compute mode on empty array')
        try:
            # Normalize negative axes and reject out-of-range ones
            axis = range(ndarray.ndim)[axis]
        except IndexError:
            raise Exception('Axis "{}" incompatible with the {}-dimension array'.format(axis, ndim))

        # If array is 1-D and numpy version is >= 1.9 numpy.unique will suffice
        version = tuple(int(part) for part in numpy.__version__.split('.')[:2])
        if ndim == 1 and version >= (1, 9):
            modals, counts = numpy.unique(ndarray, return_counts=True)
            index = numpy.argmax(counts)
            return modals[index], counts[index]

        # Sort array
        sort = numpy.sort(ndarray, axis=axis)
        # Create array to transpose along the axis and get padding shape
        transpose = numpy.roll(numpy.arange(ndim)[::-1], axis)
        shape = list(sort.shape)
        shape[axis] = 1
        # Create a boolean array along strides of unique values
        strides = numpy.concatenate([numpy.zeros(shape=shape, dtype='bool'),
                                     numpy.diff(sort, axis=axis) == 0,
                                     numpy.zeros(shape=shape, dtype='bool')],
                                    axis=axis).transpose(transpose).ravel()
        # Count the stride lengths
        counts = numpy.cumsum(strides)
        counts[~strides] = numpy.concatenate([[0], numpy.diff(counts[~strides])])
        counts[strides] = 0
        # Get shape of padded counts and slice to return to the original shape
        shape = numpy.array(sort.shape)
        shape[axis] += 1
        shape = shape[transpose]
        slices = [slice(None)] * ndim
        slices[axis] = slice(1, None)
        # Reshape and compute final counts (index with a tuple, not a list)
        counts = counts.reshape(shape).transpose(transpose)[tuple(slices)] + 1

        # Find maximum counts and return modals/counts
        slices = [slice(None, i) for i in sort.shape]
        del slices[axis]
        index = list(numpy.ogrid[tuple(slices)])
        index.insert(axis, numpy.argmax(counts, axis=axis))
        return sort[tuple(index)], counts[tuple(index)]

Result:

    In [2]: a = numpy.array([[1, 3, 4, 2, 2, 7],
                             [5, 2, 2, 1, 4, 1],
                             [3, 3, 2, 2, 1, 1]])

    In [3]: mode(a)
    Out[3]: (array([1, 3, 2, 2, 1, 1]), array([1, 2, 2, 2, 1, 2]))

Some benchmarks:

    In [4]: import scipy.stats

    In [5]: a = numpy.random.randint(1,10,(1000,1000))

    In [6]: %timeit scipy.stats.mode(a)
    10 loops, best of 3: 41.6 ms per loop

    In [7]: %timeit mode(a)
    10 loops, best of 3: 46.7 ms per loop

    In [8]: a = numpy.random.randint(1,500,(1000,1000))

    In [9]: %timeit scipy.stats.mode(a)
    1 loops, best of 3: 1.01 s per loop

    In [10]: %timeit mode(a)
    10 loops, best of 3: 80 ms per loop

    In [11]: a = numpy.random.random((200,200))

    In [12]: %timeit scipy.stats.mode(a)
    1 loops, best of 3: 3.26 s per loop

    In [13]: %timeit mode(a)
    1000 loops, best of 3: 1.75 ms per loop

EDIT: Provided more background and modified the approach to be more memory-efficient.
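
For reference, the two 1-D approaches this answer mentions (numpy.unique with return_counts=True, and numpy.bincount) can be sketched as follows. The sample array `x` and the variable names are made up for illustration.

```python
import numpy as np

x = np.array([1, 3, 3, 2, 2, 3])

# numpy.unique returns the sorted distinct values and how often each occurs
values, counts = np.unique(x, return_counts=True)
mode_unique = values[np.argmax(counts)]

# numpy.bincount counts occurrences of each non-negative integer,
# so the index of the largest bin is the mode
mode_bincount = np.bincount(x).argmax()

print(mode_unique, mode_bincount)
```

numpy.bincount only accepts non-negative integers, whereas numpy.unique also handles negative values and floats.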

Lean Bravo · Answer

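Expanding on this method, here it is applied to finding the mode of data where you may need the index into the actual array, to see how far away the value is from the center of the distribution.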

    (_, idx, counts) = np.unique(a, return_index=True, return_counts=True)
    index = idx[np.argmax(counts)]
    mode = a[index]

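Remember to discard the mode when more than one value ties for the highest count. Note that np.argmax(counts) returns only the first such index, so a check like (counts == counts.max()).sum() > 1 is needed rather than taking its length. Also, to validate whether the mode is actually representative of the central distribution of your data, you can check whether it falls inside your standard deviation interval.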
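
A minimal sketch of detecting such ties on a 1-D array, assuming made-up sample data (the names `is_max`, `tied_values`, and `has_tie` are illustrative, not part of the answer above):

```python
import numpy as np

a = np.array([4, 1, 2, 2, 3, 1])

values, first_idx, counts = np.unique(a, return_index=True, return_counts=True)

# Boolean mask of every distinct value that reaches the highest count
is_max = counts == counts.max()
tied_values = values[is_max]
tied_first_positions = first_idx[is_max]

# More than one value at the maximum count means the mode is ambiguous
has_tie = tied_values.size > 1

print(tied_values, tied_first_positions, has_tie)
```

Here both 1 and 2 occur twice, so `has_tie` is True and the mode would be discarded under the rule above.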

Ali_Ayub · Answer

I think a very simple way would be to use the Counter class. You can then use the most_common() method of the Counter instance.

For 1-D arrays:

    import numpy as np
    from collections import Counter

    nparr = np.arange(10)
    nparr[2] = 6
    nparr[3] = 6  # 6 is now the mode
    mode = Counter(nparr).most_common(1)
    # mode will be [(6,3)] to give the count of the most occurring value, so ->
    print(mode[0][0])

For multi-dimensional arrays (little difference):

    import numpy as np
    from collections import Counter

    nparr = np.arange(10)
    nparr[2] = 6
    nparr[3] = 6
    # same thing, but reshaped into a 2-D ndarray; the shape must hold exactly
    # 10 elements, so (2, 5) is used here instead of (10, 2, 5)
    nparr = nparr.reshape((2, 5))
    mode = Counter(nparr.flatten()).most_common(1)  # just use the .flatten() method
    # mode will be [(6,3)] to give the count of the most occurring value, so ->
    print(mode[0][0])

This may or may not be an efficient implementation, but it is convenient.
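
Note that flattening gives the mode of the whole array, not one mode per column as the question asks. A minimal sketch of applying Counter column by column, using the question's array (the name `column_modes` is illustrative):

```python
from collections import Counter

import numpy as np

a = np.array([[1, 3, 4, 2, 2, 7],
              [5, 2, 2, 1, 4, 1],
              [3, 3, 2, 2, 1, 1]])

# a.T iterates over the columns of a; tolist() converts numpy scalars
# to plain Python ints before counting
column_modes = [Counter(col.tolist()).most_common(1)[0][0] for col in a.T]

print(column_modes)
```

This still loops in Python, so it trades speed for simplicity. In columns where every value occurs equally often, which value is reported depends on Counter's tie ordering.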

zeliha_bektas · Answer

    from collections import Counter

    n = int(input())  # number of values (read but not otherwise used)
    data = sorted([int(i) for i in input().split()])
    mode = sorted(sorted(Counter(data).items()), key=lambda x: x[1], reverse=True)[0][0]
    print(mode)

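Counter(data) counts the frequency of each value and returns a Counter (a dict subclass). sorted(Counter(data).items()) sorts by key, not by frequency. Finally, the result needs to be sorted by frequency with a second sorted call using key = lambda x: x[1]. reverse = True tells Python to sort the frequencies from largest to smallest.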
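
To make the two-step sort concrete, here is a small sketch on made-up data. Because Python's sorted is stable (including with reverse=True), values with equal frequency keep the ascending order from the first sort, so the smallest tied value ends up first.

```python
from collections import Counter

data = [3, 1, 3, 1, 2]

# First sort: (value, count) pairs ordered by value
by_value = sorted(Counter(data).items())

# Second sort: ordered by count, largest first; ties keep the order from by_value
by_count = sorted(by_value, key=lambda x: x[1], reverse=True)

mode = by_count[0][0]
print(by_value, by_count, mode)
```

Both 1 and 3 occur twice here, and the result is 1, the smaller of the two.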

Def_Os · Answer

A neat solution that uses only numpy (neither scipy nor the Counter class):

    import numpy as np

    A = np.array([[1,3,4,2,2,7],
                  [5,2,2,1,4,1],
                  [3,3,2,2,1,1]])

    # For each column, count occurrences of every value and take the most frequent
    np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=A)

    array([1, 3, 2, 2, 1, 1])
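
One caveat: np.bincount only accepts non-negative integers, while the question allows negative values. A possible workaround, sketched here under the assumption that the value range is small enough for the bins to fit in memory, is to shift the data by its minimum before counting and shift the result back. The helper name `column_modes` and the sample array are hypothetical.

```python
import numpy as np

def column_modes(arr):
    """Per-column mode for an integer array that may contain negatives."""
    arr = np.asarray(arr)
    offset = arr.min()
    # Shift so the smallest value becomes 0, which bincount can handle
    shifted = arr - offset
    modes = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=shifted)
    # Undo the shift to recover the original values
    return modes + offset

A = np.array([[-1, 3, 4],
              [5, -2, 4],
              [-1, -2, 0]])

result = column_modes(A)
print(result)
```

As with the original one-liner, argmax returns the first maximum, so ties resolve to the smallest value in the column.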