Random sample of a subset of a pandas DataFrame

Question

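Say I have a DataFrame with 100,000 entries and want to split it into 100 sections of 1000 entries each.

How do I take a random sample of, say, size 50 from just one of the 100 sections? The data set is already ordered, so the first 1000 results are the first section, the next 1000 are the second section, and so on.

Many thanks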

Andy Hayden · Answer

You can use the sample method*:

    In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], columns=["A", "B"])

    In [12]: df.sample(2)
    Out[12]:
       A  B
    0  1  2
    2  5  6

    In [13]: df.sample(2)
    Out[13]:
       A  B
    3  7  8
    0  1  2

*On one of the section DataFrames.
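As a rough sketch of what the footnote means, assuming the ordered layout from the question (100 consecutive blocks of 1000 rows), you could slice one block out with iloc and call sample on it. The column name value, the block index i, and the random_state seed are placeholders chosen for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Stand-in for the real data: 100,000 ordered rows (hypothetical column name).
df = pd.DataFrame({"value": np.arange(100_000)})

block_size = 1000
i = 3  # zero-based index of the section to sample from (assumed)

# Rows i*1000 .. (i+1)*1000 - 1 make up section i.
section = df.iloc[i * block_size:(i + 1) * block_size]

# 50 rows drawn without replacement; random_state only makes the draw repeatable.
sample = section.sample(n=50, random_state=0)
```

The sampled rows keep their original index labels, so you can still tell where in the full DataFrame they came from.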

Note: If the sample size is larger than the size of the DataFrame, this raises an error unless you sample with replacement.

    In [14]: df.sample(5)
    ValueError: Cannot take a larger sample than population when 'replace=False'

    In [15]: df.sample(5, replace=True)
    Out[15]:
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8
    1  3  4

jpjandrade · Answer

One solution is to use the choice function from numpy.

Say you want 50 entries out of 1000, you can use:

    import numpy as np

    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed = df.iloc[chosen_idx]

This is of course not considering your block structure. If you want a 50-item sample from block i, for example, you can do:

    import numpy as np

    # i is the zero-based block index (i = 0 is the first 1000 rows)
    block_start_idx = 1000 * i
    chosen_idx = np.random.choice(1000, replace=False, size=50)
    df_trimmed_from_block_i = df.iloc[block_start_idx + chosen_idx]
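The block-offset idea above could be wrapped in a small helper, sketched here under a few assumptions: the function name sample_from_block, its parameters, and the toy column value are hypothetical, and it uses NumPy's newer Generator API (np.random.default_rng) rather than np.random.choice so that a seed can be passed in.

```python
import numpy as np
import pandas as pd


def sample_from_block(df, i, block_size=1000, n=50, seed=None):
    """Return n rows drawn without replacement from the i-th block of df."""
    rng = np.random.default_rng(seed)
    # Positions inside the block, each chosen at most once.
    offsets = rng.choice(block_size, size=n, replace=False)
    # Shift the positions to where block i starts in the full DataFrame.
    return df.iloc[i * block_size + offsets]


# Toy data standing in for the 100,000-row DataFrame.
df = pd.DataFrame({"value": np.arange(100_000)})
picked = sample_from_block(df, i=2, seed=0)
```

Passing a seed is optional; leaving it as None gives a different sample on each call.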

GeneralCode · Answer

This is a nice place for recursion.

    import numpy as np

    def main2():
        rows = 8  # say you have 8 rows; with real data, use the actual row count
        rands = []
        for i in range(rows):
            gen = fun(rands, rows)
            rands.append(gen)
        print(rands)  # now range through random values

    def fun(rands, rows):
        # Draw an index in [0, rows); redraw recursively if it was already used.
        gen = np.random.randint(0, rows)
        if gen in rands:
            return fun(rands, rows)
        return gen

    if __name__ == "__main__":
        main2()

output: [6, 0, 7, 1, 3, 5, 4, 2]
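Because the recursive version keeps redrawing until it finds an unused index, its result is a random ordering of 0 to rows - 1. A non-recursive way to get the same kind of result, offered here as an alternative technique and not as part of the original answer, is NumPy's permutation; the variable names below are illustrative.

```python
import numpy as np

rows = 8  # same toy size as above
rng = np.random.default_rng()

# Every index 0..rows-1 exactly once, in random order.
order = rng.permutation(rows)

# Taking a prefix gives a sample without replacement, e.g. 3 distinct indices.
first_three = order[:3]
```

This avoids the repeated redraws, which become more frequent as the list of used indices fills up.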