使用方法pandas IQRでフィルター？

Question

IQR（Q1-1.5IQRとQ3 + 1.5IQRの間の値）で列をフィルタリングする組み込みの方法はありますか？また、pandasで推奨される他の一般的なフィルタリングも歓迎します。

Romain · Answer

私の知る限り、最もコンパクトな表記法はqueryメソッドによってもたらされるようです。

# Some test data np.random.seed(33454) df = ( # A standard distribution pd.DataFrame({'nb': np.random.randint(0, 100, 20)}) # Adding some outliers .append(pd.DataFrame({'nb': np.random.randint(100, 200, 2)})) # Reseting the index .reset_index(drop=True) ) # Computing IQR Q1 = df['nb'].quantile(0.25) Q3 = df['nb'].quantile(0.75) IQR = Q3 - Q1 # Filtering Values between Q1-1.5IQR and Q3+1.5IQR filtered = df.query('(@Q1 - 1.5 * @IQR) <= nb <= (@Q3 + 1.5 * @IQR)')

次に、結果をプロットして違いを確認します。左の箱ひげ図の外れ値（183の十字）は、フィルター処理された系列にはもう現れないことがわかります。

# Ploting the result to check the difference df.join(filtered, rsuffix='_filtered').boxplot()

この回答から、このトピックで post を書いたので、より多くの情報を見つけることができました。

MLenthousiast · Answer

Series.between（）を使用する別のアプローチ：

iqr = df['col'][df['col'].between(df['col'].quantile(.25), df['col'].quantile(.75), inclusive=True)]

引き出された：

q1 = df['col'].quantile(.25) q3 = df['col'].quantile(.75) mask = d['col'].between(q1, q2, inclusive=True) iqr = d.loc[mask, 'col']

kd88 · Answer

これにより、df列のIQRにあるcolumnのサブセットが得られます。

def subset_by_iqr(df, column, whisker_width=1.5): """Remove outliers from a dataframe by column, including optional whiskers, removing rows for which the column value are less than Q1-1.5IQR or greater than Q3+1.5IQR. Args: df (`:obj:pd.DataFrame`): A pandas dataframe to subset column (str): Name of the column to calculate the subset from. whisker_width (float): Optional, loosen the IQR filter by a factor of `whisker_width` * IQR. Returns: (`:obj:pd.DataFrame`): Filtered dataframe """ # Calculate Q1, Q2 and IQR q1 = df[column].quantile(0.25) q3 = df[column].quantile(0.75) iqr = q3 - q1 # Apply filter with respect to IQR, including optional whiskers filter = (df[column] >= q1 - whisker_width*iqr) & (df[column] <= q3 + whisker_width*iqr) return df.loc[filter] # Example for whiskers = 1.5, as requested by the OP df_filtered = subset_by_iqr(df, 'column_name', whisker_width=1.5)

Ayyasamy · Answer

また、IQRを計算して、以下のコードを使用することもできます。 IQRの下限と上限に基づいて、各列に表示される外れ値の値を置き換えます。このコードは、データフレームの各列を通過し、外れ値を見つけるために行のすべての値を調べるのではなく、外れ値のみをフィルタリングすることで1つずつ機能します。

関数：

 def mod_outlier(df): df1 = df.copy() df = df._get_numeric_data() q1 = df.quantile(0.25) q3 = df.quantile(0.75) iqr = q3 - q1 lower_bound = q1 -(1.5 * iqr) upper_bound = q3 +(1.5 * iqr) for col in col_vals: for i in range(0,len(df[col])): if df[col][i] < lower_bound[col]: df[col][i] = lower_bound[col] if df[col][i] > upper_bound[col]: df[col][i] = upper_bound[col] for col in col_vals: df1[col] = df[col] return(df1)

関数呼び出し：

df = mod_outlier(df)

danek · Answer

別のアプローチではSeries.clipを使用します。

q = s.quantile([.25, .75]) s = s[~s.clip(*q).isin(q)]

詳細は次のとおりです。

s = pd.Series(np.randon.randn(100)) q = s.quantile([.25, .75]) # calculate lower and upper bounds s = s.clip(*q) # assigns values outside boundary to boundary values s = s[~s.isin(q)] # take only observations within bounds

これを使用してデータフレーム全体dfをフィルタリングするのは簡単です。

def iqr(df, colname, bounds = [.25, .75]): s = df[colname] q = s.quantile(bounds) return df[~s.clip(*q).isin(q)]

注：メソッドは境界自体を除外します。