Eliminate all data above a given percentile

Question

I have a pandas DataFrame called data with a column called ms. I want to eliminate all the rows where data.ms is above the 95th percentile. For now, I'm doing this:

    limit = data.ms.describe(90)['95%']
    valid_data = data[data['ms'] < limit]

It works, but I'd like to generalize it to any percentile. What's the best way to do that?

Phillip Cloud · Accepted Answer

Use the Series.quantile() method:

    # Assumes: from pandas import DataFrame; from numpy.random import randn
    In [48]: cols = list('abc')
    In [49]: df = DataFrame(randn(10, len(cols)), columns=cols)
    In [50]: df.a.quantile(0.95)
    Out[50]: 1.5776961953820687

To filter out the rows of df where df.a is greater than or equal to the 95th percentile, do:

    In [72]: df[df.a < df.a.quantile(.95)]
    Out[72]:
           a      b      c
    0 -1.044 -0.247 -1.149
    2  0.395  0.591  0.764
    3 -0.564 -2.059  0.232
    4 -0.707 -0.736 -1.345
    5  0.978 -0.099  0.521
    6 -0.974  0.272 -0.649
    7  1.228  0.619 -0.849
    8 -0.170  0.458 -0.515
    9  1.465  1.019  0.966
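To generalize the filter above to an arbitrary percentile, the quantile can be passed in as a parameter. The following is a minimal sketch; the helper name `drop_above_quantile` and the sample data are hypothetical and not part of the original answer.

```python
import numpy as np
import pandas as pd


def drop_above_quantile(df, column, q):
    """Return the rows of df whose `column` value is strictly below the q-quantile.

    q is a fraction in [0, 1], e.g. 0.95 for the 95th percentile.
    (Hypothetical helper, for illustration only.)
    """
    limit = df[column].quantile(q)
    return df[df[column] < limit]


# Made-up sample data: 100 rows with ms = 0.0, 1.0, ..., 99.0
data = pd.DataFrame({"ms": np.arange(100.0)})
valid_data = drop_above_quantile(data, "ms", 0.95)
```

Because the comparison is strict (`<`), rows exactly equal to the limit are dropped as well; switch to `<=` if those rows should be kept.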

2diabolos.com · Answer

NumPy is much faster than pandas for this.

    import numpy as np
    np.percentile(df.a, 95)  # note: the percentile is given in percent (5 means 5%)

is equivalent, but about 3 times faster than:

    df.a.quantile(.95)  # as you already noticed, here it is ".95", not "95"

For your code, that gives:

    df[df.a < np.percentile(df.a, 95)]
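As a quick check of the equivalence claimed above, the two calls can be compared on made-up data. This sketch assumes the default (linear) interpolation of both functions, and all variable names are illustrative. It also shows one difference worth knowing: `Series.quantile` skips NaN values, whereas `np.percentile` returns NaN if the input contains any.

```python
import numpy as np
import pandas as pd

# Made-up data: 1000 draws from a standard normal distribution
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(1000))

via_numpy = np.percentile(s.to_numpy(), 95)  # percentile given in percent
via_pandas = s.quantile(0.95)                # quantile given as a fraction
same = bool(np.isclose(via_numpy, via_pandas))

# With a missing value appended, the two results diverge
s_nan = pd.concat([s, pd.Series([np.nan])], ignore_index=True)
numpy_with_nan = np.percentile(s_nan.to_numpy(), 95)  # NaN propagates
pandas_with_nan = s_nan.quantile(0.95)                # NaN is skipped
```

If the column may contain missing values and the NumPy route is preferred, `np.nanpercentile` ignores NaN entries.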