一次元の観測データで異常値を検出するPythonの方法

Question

指定されたデータについて、外れ値（95％の信頼レベルまたは95％の分位関数または必要なもので定義される）をnan値として設定します。以下は、現在使用しているデータとコードです。誰かが私をさらに説明できたら嬉しいです。

import numpy as np, matplotlib.pyplot as plt data = np.random.Rand(1000)+5.0 plt.plot(data) plt.xlabel('observation number') plt.ylabel('recorded value') plt.show()

Joe Kington · Answer

percentileを使用する際の問題は、外れ値として識別されるポイントがサンプルサイズの関数であることです。

外れ値をテストする方法は非常に多くありますが、それらをどのように分類するかを考えてください。理想的には、アプリオリ情報を使用する必要があります（たとえば、「この値より上/下の値はすべて非現実的です...」）

ただし、一般的であまり合理的でない外れ値テストは、「絶対偏差の中央値」に基づいてポイントを削除することです。

N次元の場合の実装を次に示します（ここにある論文のコードから： https://github.com/joferkington/oost_paper_code/blob/master/utilities.py ）：

def is_outlier(points, thresh=3.5): """ Returns a boolean array with True if points are outliers and False otherwise. Parameters: ----------- points : An numobservations by numdimensions array of observations thresh : The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers. Returns: -------- mask : A numobservations-length boolean array. References: ---------- Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and Handle Outliers", The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. """ if len(points.shape) == 1: points = points[:,None] median = np.median(points, axis=0) diff = np.sum((points - median)**2, axis=-1) diff = np.sqrt(diff) med_abs_deviation = np.median(diff) modified_z_score = 0.6745 * diff / med_abs_deviation return modified_z_score > thresh

これは以前の回答の1つと非常に似ていますが、サンプルサイズの影響を詳細に説明したかったのです。

パーセンタイルベースの外れ値テスト（@CTZhuの答えに似ています）と、さまざまなサンプルサイズの中央値絶対偏差（MAD）テストを比較してみましょう。

import numpy as np import matplotlib.pyplot as plt import seaborn as sns def main(): for num in [10, 50, 100, 1000]: # Generate some data x = np.random.normal(0, 0.5, num-3) # Add three outliers... x = np.r_[x, -3, -10, 12] plot(x) plt.show() def mad_based_outlier(points, thresh=3.5): if len(points.shape) == 1: points = points[:,None] median = np.median(points, axis=0) diff = np.sum((points - median)**2, axis=-1) diff = np.sqrt(diff) med_abs_deviation = np.median(diff) modified_z_score = 0.6745 * diff / med_abs_deviation return modified_z_score > thresh def percentile_based_outlier(data, threshold=95): diff = (100 - threshold) / 2.0 minval, maxval = np.percentile(data, [diff, 100 - diff]) return (data < minval) | (data > maxval) def plot(x): fig, axes = plt.subplots(nrows=2) for ax, func in Zip(axes, [percentile_based_outlier, mad_based_outlier]): sns.distplot(x, ax=ax, rug=True, hist=False) outliers = x[func(x)] ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False) kwargs = dict(y=0.95, x=0.05, ha='left', va='top') axes[0].set_title('Percentile-based Outliers', **kwargs) axes[1].set_title('MAD-based Outliers', **kwargs) fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14) main()

enter image description here

MADベースの分類子はサンプルサイズに関係なく正しく機能しますが、パーセンタイルベースの分類子は実際に外れ値であるかどうかにかかわらず、サンプルサイズが大きいほど多くのポイントを分類します。

shivangi dhakad · Answer

1次元データの外れ値の検出は、その分布に依存します

1-正規分布：

データ値は、期待される範囲にほぼ均等に分布します：この場合、平均を含むすべてのメソッドを簡単に使用できます。標準偏差（95％または99.7％）は、正規分布データ（中心極限定理と標本平均の標本分布）に応じて、非常に効果的な方法です。カーンアカデミーの統計と確率-サンプリング配布ライブラリで説明されています。

もう1つの方法は、平均ではなくデータポイントの信頼区間が必要な場合の予測区間です。

データ値はある範囲にランダムに分布しています：平均は「異常値」（非常に小さいまたは典型的ではないデータセットの大きな値）中央値は、数値データセットの中心を測定する別の方法です。

中央値絶対偏差-中央値からすべてのポイントの距離を中央値で測定する方法 http：//www.itl .nist.gov/div898/handbook/eda/section3/eda35h.htm -上記のJoe Kingtonの回答で説明されているように、良い説明があります

2-対称分布：Zスコアの計算としきい値がそれに応じて変更される場合は、やはり絶対偏差の中央値が良い方法です

説明： http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/

3-非対称分布：ダブルMAD-ダブル中央値絶対偏差上記のリンクの説明

参照用のmy pythonコードを添付：

 def is_outlier_doubleMAD(self,points): """ FOR ASSYMMETRIC DISTRIBUTION Returns : filtered array excluding the outliers Parameters : the actual data Points array Calculates median to divide data into 2 halves.(skew conditions handled) Then those two halves are treated as separate data with calculation same as for symmetric distribution.(first answer) Only difference being , the thresholds are now the median distance of the right and left median with the actual data median """ if len(points.shape) == 1: points = points[:,None] median = np.median(points, axis=0) medianIndex = (points.size/2) leftData = np.copy(points[0:medianIndex]) rightData = np.copy(points[medianIndex:points.size]) median1 = np.median(leftData, axis=0) diff1 = np.sum((leftData - median1)**2, axis=-1) diff1 = np.sqrt(diff1) median2 = np.median(rightData, axis=0) diff2 = np.sum((rightData - median2)**2, axis=-1) diff2 = np.sqrt(diff2) med_abs_deviation1 = max(np.median(diff1),0.000001) med_abs_deviation2 = max(np.median(diff2),0.000001) threshold1 = ((median-median1)/med_abs_deviation1)*3 threshold2 = ((median2-median)/med_abs_deviation2)*3 #if any threshold is 0 -> no outliers if threshold1==0: threshold1 = sys.maxint if threshold2==0: threshold2 = sys.maxint #multiplied by a factor so that only the outermost points are removed modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1 modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2 filtered1 = [] i = 0 for data in modified_z_score1: if data < threshold1: filtered1.append(leftData[i]) i += 1 i = 0 filtered2 = [] for data in modified_z_score2: if data < threshold2: filtered2.append(rightData[i]) i += 1 filtered = filtered1 + filtered2 return filtered

sergeyf · Answer

http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers のコードを適合させました。JoeKingtonと同じ結果になりますが、L1を使用していますL2距離の代わりに距離であり、非対称分布をサポートしています。元のRコードにはJoeの0.6745乗数がありませんでしたので、このスレッド内で一貫性を保つために乗数も追加しました。必要かどうかは100％確実ではありませんが、リンゴ同士の比較を行います。

def doubleMADsfromMedian(y,thresh=3.5): # warning: this function does not check for NAs # nor does it address issues when # more than 50% of your data have identical values m = np.median(y) abs_dev = np.abs(y - m) left_mad = np.median(abs_dev[y <= m]) right_mad = np.median(abs_dev[y >= m]) y_mad = left_mad * np.ones(len(y)) y_mad[y > m] = right_mad modified_z_score = 0.6745 * abs_dev / y_mad modified_z_score[y == m] = 0 return modified_z_score > thresh

CT Zhu · Answer

つかいます np.percentile @Martinが示唆したように：

percentiles = np.percentile(data, [2.5, 97.5]) # or =>, <= for within 95% data[(percentiles[0]<data) & (percentiles[1]>data)] # set the outliners to np.nan data[(percentiles[0]>data) | (percentiles[1]<data)] = np.nan

jimseeve · Answer

2つの標準偏差（または1.96）の外側にあるものを削除することもできます。

import random def outliers(tmp): """tmp is a list of numbers""" outs = [] mean = sum(tmp)/(1.0*len(tmp)) var = sum((tmp[i] - mean)**2 for i in range(0, len(tmp)))/(1.0*len(tmp)) std = var**0.5 outs = [tmp[i] for i in range(0, len(tmp)) if abs(tmp[i]-mean) > 1.96*std] return outs lst = [random.randrange(-10, 55) for _ in range(40)] print lst print outliers(lst)