Read a small random sample from a big CSV file into a Python data frame

Question

The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random rows from it and compute some simple statistics on the resulting data frame?

dlm · Answer

Assuming the CSV file has no header:

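    import pandas
    import random

    n = 1000000  # number of records in file
    s = 10000  # desired sample size
    filename = "data.txt"
    skip = sorted(random.sample(range(n), n - s))
    # header=None because the file has no header row;
    # otherwise the first kept row would be used as column names
    df = pandas.read_csv(filename, skiprows=skip, header=None)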

It would be better if read_csv had a keeprows option, or if skiprows took a callback function instead of a list.

With a header and unknown file length:

    import pandas
    import random

    filename = "data.txt"
    with open(filename) as f:
        n = sum(1 for line in f) - 1  # number of records in file (excludes header)
    s = 10000  # desired sample size
    # the 0-indexed header will not be included in the skip list
    skip = sorted(random.sample(range(1, n + 1), n - s))
    df = pandas.read_csv(filename, skiprows=skip)

exp1orer · Answer

@dlm's answer is great, but since v0.20.0 skiprows accepts a callable. The callable receives the row number as its argument.

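If you can specify what percentage of lines you want rather than how many lines, you do not even need to get the file size, and you only need to read through the file once. Assuming a header on the first row: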

    import pandas as pd
    import random

    p = 0.01  # 1% of the lines
    # keep the header, then take only 1% of lines
    # if random from [0,1] interval is greater than 0.01 the row will be skipped
    df = pd.read_csv(
        filename,
        header=0,
        skiprows=lambda i: i > 0 and random.random() > p
    )

Or, if you want to take every nth line:

    n = 100  # every 100th line = 1% of the lines
    df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
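The callable form of skiprows can also be combined with the exact-size sampling from @dlm's answer. The following is only a sketch under stated assumptions: the demo file, `demo_filename`, the `keep` set and the fixed seed are all made up for illustration, and the file is still read twice (once to count rows, once to parse).

```python
import os
import random
import tempfile

import pandas as pd

# Hypothetical demo file: a header plus 1,000 data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,value\n")
    tmp.writelines(f"{i},{i * 2}\n" for i in range(1000))
demo_filename = tmp.name

# First pass: count the data rows (header excluded).
with open(demo_filename) as f:
    n = sum(1 for _ in f) - 1
s = 50  # desired sample size

rng = random.Random(0)  # seeded only to make the demo repeatable
# File line numbers to keep: data rows are lines 1..n, the header is line 0.
keep = set(rng.sample(range(1, n + 1), s))
keep.add(0)

# Second pass: skip every line whose number is not in the keep set.
df = pd.read_csv(demo_filename, header=0, skiprows=lambda i: i not in keep)
os.remove(demo_filename)
print(df.shape)
```

Holding the s row numbers to keep in a set, rather than a sorted list of the n - s rows to skip, keeps the bookkeeping small when s is much smaller than n.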

Bar · Answer

This is not in pandas, but it achieves the same result much faster through bash.

shuf -n 100000 data/original.tsv > data/sample.tsv

The shuf command shuffles the input, and the -n argument indicates how many lines you want in the output.

Related question: https://unix.stackexchange.com/q/108581

Benchmark on a 7M-line CSV (the 2008 file):

Top answer:

    def pd_read():
        filename = "2008.csv"
        n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
        s = 100000  # desired sample size
        # the 0-indexed header will not be included in the skip list
        skip = sorted(random.sample(range(1, n + 1), n - s))
        df = pandas.read_csv(filename, skiprows=skip)
        df.to_csv("temp.csv")

    %time pd_read()
    CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
    Wall time: 18.9 s

Using shuf:

    time shuf -n 100000 2008.csv > temp.csv

    real    0m1.583s
    user    0m1.445s
    sys     0m0.136s

So shuf is about 12x faster and, importantly, does not read the whole file into memory.

desktable · Answer

Here is an algorithm that does not require counting the number of lines in the file beforehand, so you only need to read the file once.

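Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), then with probability m/i it uses that sample to replace a randomly chosen one of the samples already selected.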

By doing so, for any i > m, you always hold a subset of m samples randomly selected from the first i samples.
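The claim can be checked by induction. This is a sketch of the standard argument, not part of the original answer. Let p_i (a symbol introduced here) be the probability that any given one of the first i samples is currently kept, and assume p_i = m/i, which holds at i = m because all of the first m samples are kept. When sample i+1 arrives, a kept sample survives if the new sample is not selected, or if it is selected but replaces one of the other m - 1 slots:

```latex
p_{i+1}
  = p_i \left[ \left( 1 - \frac{m}{i+1} \right)
      + \frac{m}{i+1} \cdot \frac{m-1}{m} \right]
  = \frac{m}{i} \cdot \frac{i}{i+1}
  = \frac{m}{i+1}
```

The new sample itself is kept with probability m/(i+1) as well, so after every step each sample seen so far is equally likely to be in the selection.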

See the code below:

    import random

    n_samples = 10
    samples = []

    # f is an already opened file object
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples * 1. / (i+1):
            samples[random.randint(0, n_samples-1)] = line
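The same idea can be wrapped in a reusable function. The sketch below is a common variant of this reservoir sampling scheme: it draws one random index per line and replaces a slot only when that index falls inside the reservoir, which gives the same selection probability m/(i+1). The function name `reservoir_sample`, the `seed` parameter and the in-memory `rows` list are assumptions made for illustration.

```python
import random


def reservoir_sample(lines, m, seed=None):
    """Return up to m items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < m:
            # Fill the reservoir with the first m items.
            reservoir.append(line)
        else:
            # j is uniform on 0..i, so j < m happens with probability m/(i+1);
            # in that case j is also a uniformly random slot to overwrite.
            j = rng.randrange(i + 1)
            if j < m:
                reservoir[j] = line
    return reservoir


# Hypothetical stand-in for the data lines of a large CSV file.
rows = [f"{i},{i % 7}\n" for i in range(10_000)]
sample = reservoir_sample(rows, 10, seed=42)
print(len(sample))
```

With a real file, one option is to read the header with `next(f)`, pass the open file object `f` as `lines`, and then parse the result with `pandas.read_csv(io.StringIO(header + "".join(sample)))`.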

queise · Answer

The following code first reads the header, and then a random sample of the remaining lines:

    import pandas as pd
    import numpy as np

    filename = 'hugedatafile.csv'
    nlinesfile = 10000000
    nlinesrandomsample = 10000
    lines2skip = np.random.choice(np.arange(1, nlinesfile + 1),
                                  (nlinesfile - nlinesrandomsample),
                                  replace=False)
    df = pd.read_csv(filename, skiprows=lines2skip)

Vagner Guedes · Answer

No pandas!

    import random
    from os import fstat
    from sys import exit

    # Binary mode: Python 3 only allows relative seeks on binary files
    f = open('/usr/share/dict/words', 'rb')

    # Number of lines to be read
    lines_to_read = 100

    # Minimum and maximum bytes that will be randomly skipped
    min_bytes_to_skip = 10000
    max_bytes_to_skip = 1000000

    def is_EOF():
        return f.tell() >= fstat(f.fileno()).st_size

    # To accumulate the read lines
    sampled_lines = []

    for n in range(lines_to_read):
        bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
        f.seek(bytes_to_skip, 1)
        # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
        # Skip current entire line
        f.readline()
        if not is_EOF():
            sampled_lines.append(f.readline().decode())
        else:
            # Go to the beginning of the file ...
            f.seek(0, 0)
            # ... and skip lines again
            f.seek(bytes_to_skip, 1)
            # If it has reached the EOF again
            if is_EOF():
                print("You have skipped more lines than your file has")
                print("Reduce the values of:")
                print("   min_bytes_to_skip")
                print("   max_bytes_to_skip")
                exit(1)
            else:
                f.readline()
                sampled_lines.append(f.readline().decode())

    print(sampled_lines)

In the end you will have a sampled_lines list. What kind of statistics do you mean?
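A more compact take on the same byte-seeking idea is sketched below. It is an illustration under stated assumptions rather than a drop-in replacement: the function name `sample_lines_by_seek`, the `seed` parameter and the demo file are made up, the file is assumed to be non-empty and to have no header, lines can repeat, and lines that follow long lines are picked more often, so the sample is only approximately uniform.

```python
import os
import random
import tempfile


def sample_lines_by_seek(path, k, seed=None):
    """Return k lines, each read just after a uniformly random byte offset."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    sampled = []
    with open(path, "rb") as f:  # binary mode so arbitrary offsets are valid
        while len(sampled) < k:
            f.seek(rng.randrange(size))
            f.readline()         # discard the (probably partial) current line
            line = f.readline()  # first complete line after the offset
            if not line:         # ran off the end: wrap around to the first line
                f.seek(0)
                line = f.readline()
            sampled.append(line.decode().rstrip("\n"))
    return sampled


# Hypothetical demo file with 1,000 rows of the form "i,i*i".
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("".join(f"{i},{i * i}\n" for i in range(1000)))
demo_path = tmp.name

demo_lines = sample_lines_by_seek(demo_path, 5, seed=0)
os.remove(demo_path)
print(demo_lines)
```

Because nothing is counted or scanned, the cost depends on the number of lines requested rather than on the file size, which is the main appeal of this approach for very large files.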

Zhongjun 'Mark' Jin · Answer

Use subsample:

    pip install subsample
    subsample -n 1000 file.csv > file_1000_sample.csv

Joran Beasley · Answer

    from random import randint

    class magic_checker:
        # Sentinel for iter(): compares equal once target_count lines have been read
        def __init__(self, target_count):
            self.target = target_count
            self.count = 0

        def __eq__(self, x):
            self.count += 1
            return self.count >= self.target

    min_target = 100000
    max_target = min_target * 2
    nlines = randint(100, 1000)
    seek_target = randint(min_target, max_target)

    with open("big.csv") as f:
        f.seek(seek_target)
        f.readline()  # discard this line
        rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

    # do something to process the lines you got returned .. perhaps just a split
    print(rand_lines)
    print(rand_lines[0].split(","))

Something like that should work, I think.