How to get a line count cheaply in Python?

Question

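I need to get the line count of a large file (hundreds of thousands of lines) in Python. What is the most efficient way, both memory- and time-wise?

At the moment I do:

    def file_len(fname):
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1

Is it possible to do any better?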

Yuval Adam · Accepted Answer

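You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n characters it contains, and return that result.

Do you have a better way of doing that without reading the entire file? I'm not sure... The best solution will always be I/O-bound. The best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.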
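To make the memory point concrete, here is a minimal Python 3 sketch that compares peak allocation for a streaming count against building the full list of lines. The function names (count_streaming, count_readlines, peak_bytes) are illustrative assumptions, not part of the answer, and tracemalloc only sees allocations made by Python itself.

```python
import tracemalloc


def count_streaming(path):
    """Iterate line by line; only one line is held at a time."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_readlines(path):
    """Build the full list of lines first, then take its length."""
    with open(path, "rb") as f:
        return len(f.readlines())


def peak_bytes(func, path):
    """Return (result, peak traced allocation in bytes) for func(path)."""
    tracemalloc.start()
    try:
        result = func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Both functions still read every byte of the file, so the difference should show up in memory rather than in the amount of I/O.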

Kyle · Answer

One line, probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))

Ryan Ginstrom · Answer

I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mmap) (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times and calculated the average run time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2 GB RAM, 2 GHz AMD processor

Here are my results (average seconds per run):

    mapcount    : 0.465599966049
    simplecount : 0.756399965286
    bufcount    : 0.546800041199
    opcount     : 0.718600034714

Edit: numbers for Python 2.6:

    mapcount    : 0.471799945831
    simplecount : 0.634400033951
    bufcount    : 0.468800067902
    opcount     : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6.

Here is the code:

    from __future__ import with_statement
    import time
    import mmap
    import random
    from collections import defaultdict

    def mapcount(filename):
        f = open(filename, "r+")
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines

    def simplecount(filename):
        lines = 0
        for line in open(filename):
            lines += 1
        return lines

    def bufcount(filename):
        f = open(filename)
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization

        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines

    def opcount(fname):
        with open(fname) as f:
            for i, l in enumerate(f):
                pass
        return i + 1

    counts = defaultdict(list)

    for i in range(5):
        for func in [mapcount, simplecount, bufcount, opcount]:
            start_time = time.time()
            assert func("big_file.txt") == 1209138
            counts[func].append(time.time() - start_time)

    for key, vals in counts.items():
        print key.__name__, ":", sum(vals) / float(len(vals))
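The benchmark code above targets Python 2. For readers on Python 3, the memory-mapped variant might look roughly like the sketch below. The name mapcount_py3 is an assumption for illustration. Opening the file read-only with mmap.ACCESS_READ avoids needing write permission (the "r+" mode above requires it), and the empty-file check is there because mmap refuses to map a zero-length file.

```python
import mmap
import os


def mapcount_py3(filename):
    """Count lines by calling readline() on a read-only memory map."""
    with open(filename, "rb") as f:
        # mmap cannot map a zero-length file, so handle that case first.
        if os.fstat(f.fileno()).st_size == 0:
            return 0
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            lines = 0
            # readline() returns b"" once the end of the map is reached.
            while buf.readline():
                lines += 1
            return lines
```

Like the original mapcount, this counts a final line that lacks a trailing newline, because readline() still returns its bytes.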

Michael Bacon · Answer

I had to post this on a similar question until my reputation score jumped a bit.

All of these solutions ignore one way to make this run considerably faster: using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3 you'll default into Unicode.)

Using a modified version of the timing tool, I believe the following code is faster (and marginally more Pythonic) than any of the solutions offered:

    def rawcount(filename):
        f = open(filename, 'rb')
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.raw.read

        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)

        return lines

Using a separate generator function, this runs a smidge faster:

    def _make_gen(reader):
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024*1024)

    def rawgencount(filename):
        f = open(filename, 'rb')
        f_gen = _make_gen(f.raw.read)
        return sum( buf.count(b'\n') for buf in f_gen )

This can be done completely with generator expressions in-line using itertools, but it gets pretty weird looking:

    from itertools import (takewhile,repeat)

    def rawincount(filename):
        f = open(filename, 'rb')
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum( buf.count(b'\n') for buf in bufgen )

Here are my timings:

    function      average, s  min, s  ratio
    rawincount    0.0043      0.0041  1.00
    rawgencount   0.0044      0.0042  1.01
    rawcount      0.0048      0.0045  1.09
    bufcount      0.008       0.0068  1.64
    wccount       0.01        0.0097  2.35
    itercount     0.014       0.014   3.41
    opcount       0.02        0.02    4.83
    kylecount     0.021       0.021   5.05
    simplecount   0.022       0.022   5.25
    mapcount      0.037       0.031   7.46
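The timing tool itself is not shown in the answer. A table like this could be produced with something along the lines of the following sketch, which uses the standard timeit module. The helper name time_counters, the two example counters, and the choice to compute the ratio against the fastest average are all assumptions made for illustration.

```python
import timeit


def rawcount(filename, buf_size=1024 * 1024):
    """Count newline bytes using unbuffered binary reads."""
    lines = 0
    # buffering=0 returns the raw, unbuffered file object directly.
    with open(filename, "rb", buffering=0) as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count(b"\n")
            buf = f.read(buf_size)
    return lines


def simplecount(filename):
    """Count lines by plain iteration over the file."""
    with open(filename, "rb") as f:
        return sum(1 for _ in f)


def time_counters(funcs, filename, repeat=5):
    """Return {name: (average_s, min_s, ratio_to_fastest_average)}."""
    stats = {}
    for func in funcs:
        # One call per measurement, repeated `repeat` times.
        runs = timeit.repeat(lambda: func(filename), number=1, repeat=repeat)
        stats[func.__name__] = (sum(runs) / len(runs), min(runs))
    fastest = min(avg for avg, _ in stats.values())
    return {name: (avg, low, avg / fastest)
            for name, (avg, low) in stats.items()}
```

Results depend heavily on the operating system's file cache, so the first run on a cold file is usually not comparable with later runs.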

Ólafur Waage · Answer

You could execute a subprocess and run wc -l filename:

    import subprocess

    def file_len(fname):
        p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
                                                  stderr=subprocess.PIPE)
        result, err = p.communicate()
        if p.returncode != 0:
            raise IOError(err)
        return int(result.strip().split()[0])
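On Python 3.7 or later, the same idea can be written more compactly with subprocess.run. This is only a sketch: the function name wc_line_count is hypothetical, and it assumes a POSIX wc is on the PATH. Passing the arguments as a list keeps the shell out of the picture, so file names with spaces or shell metacharacters are handled safely.

```python
import subprocess


def wc_line_count(path):
    """Return the newline count reported by `wc -l` for path."""
    result = subprocess.run(
        ["wc", "-l", "--", path],  # "--" stops option parsing before the path
        check=True,                # raise CalledProcessError on failure
        capture_output=True,
        text=True,
    )
    # wc prints "<count> <path>"; the count is the first field.
    return int(result.stdout.split()[0])
```

Note that wc -l counts newline characters, so a final line without a trailing newline is not included.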

Martlark · Answer

Here is a Python program that uses the multiprocessing library to distribute the line counting across machines/cores. In my test, it reduced the time to count a 20-million-line file from 26 seconds to 7 seconds on an 8-core 64-bit Windows server. Note: not using memory mapping makes things much slower.

    import multiprocessing, sys, time, os, mmap
    import logging, logging.handlers

    def init_logger(pid):
        console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
        logger = logging.getLogger()  # New logger at root level
        logger.setLevel( logging.INFO )
        logger.handlers.append( logging.StreamHandler() )
        logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )

    def getFileLineCount( queues, pid, processes, file1 ):
        init_logger(pid)
        logging.info( 'start' )

        physical_file = open(file1, "r")
        #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

        m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )

        #work out file size to divide up line counting

        fSize = os.stat(file1).st_size
        chunk = (fSize / processes) + 1

        lines = 0

        #get where I start and stop
        _seedStart = chunk * (pid)
        _seekEnd = chunk * (pid+1)
        seekStart = int(_seedStart)
        seekEnd = int(_seekEnd)

        if seekEnd < int(_seekEnd + 1):
            seekEnd += 1

        if _seedStart < int(seekStart + 1):
            seekStart += 1

        if seekEnd > fSize:
            seekEnd = fSize

        #find where to start
        if pid > 0:
            m1.seek( seekStart )
            #read next line
            l1 = m1.readline()  # need to use readline with memory mapped files
            seekStart = m1.tell()

        #tell previous rank my seek start to make their seek end

        if pid > 0:
            queues[pid-1].put( seekStart )
        if pid < processes-1:
            seekEnd = queues[pid].get()

        m1.seek( seekStart )
        l1 = m1.readline()

        while len(l1) > 0:
            lines += 1
            l1 = m1.readline()
            if m1.tell() > seekEnd or len(l1) == 0:
                break

        logging.info( 'done' )
        # add up the results
        if pid == 0:
            for p in range(1,processes):
                lines += queues[0].get()
            queues[0].put(lines) # the total lines counted
        else:
            queues[0].put(lines)

        m1.close()
        physical_file.close()

    if __name__ == '__main__':
        init_logger( 'main' )
        if len(sys.argv) > 1:
            file_name = sys.argv[1]
        else:
            logging.fatal( 'parameters required: file-name [processes]' )
            exit()

        t = time.time()
        processes = multiprocessing.cpu_count()
        if len(sys.argv) > 2:
            processes = int(sys.argv[2])
        queues=[] # a queue for each process
        for pid in range(processes):
            queues.append( multiprocessing.Queue() )
        jobs=[]
        prev_pipe = 0
        for pid in range(processes):
            p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
            p.start()
            jobs.append(p)

        jobs[0].join() #wait for counting to finish
        lines = queues[0].get()

        logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )
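The partitioning idea can be sketched more simply if each worker counts newline bytes rather than whole lines: a byte range then needs no alignment to line boundaries and no hand-off between neighbours. The following Python 3 sketch is not the author's program. The names split_ranges, count_newlines_in_range and count_lines_by_ranges are hypothetical, and the ranges are evaluated sequentially here for clarity. Because each range is independent, the calls could instead be handed to worker processes, for example through multiprocessing.Pool.

```python
import os


def split_ranges(size, parts):
    """Split [0, size) into at most `parts` contiguous half-open byte ranges."""
    if size == 0:
        return []
    step = -(-size // parts)  # ceiling division
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def count_newlines_in_range(path, start, end, buf_size=1024 * 1024):
    """Count newline bytes between offsets start (inclusive) and end (exclusive)."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        remaining = end - start
        while remaining > 0:
            buf = f.read(min(buf_size, remaining))
            if not buf:
                break
            count += buf.count(b"\n")
            remaining -= len(buf)
    return count


def count_lines_by_ranges(path, parts=4):
    """Sum the per-range newline counts; each range is independent work."""
    size = os.path.getsize(path)
    return sum(count_newlines_in_range(path, start, end)
               for start, end in split_ranges(size, parts))
```

Whether splitting the work actually helps depends on whether the job is limited by CPU or by disk throughput.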

Daniel Lee · Answer

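I would use Python's file object method readlines, as follows:

    with open(input_file) as foo:
        lines = len(foo.readlines())

This opens the file, creates a list of the lines in the file, counts the length of the list, saves that to a variable, and closes the file again.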

radtek · Answer

Here is what I use; it seems pretty clean:

    import subprocess

    def count_file_lines(file_path):
        """
        Counts the number of lines in a file using wc utility.
        :param file_path: path to file
        :return: int, no of lines
        """
        num = subprocess.check_output(['wc', '-l', file_path])
        num = num.split()  # split on any whitespace; works for bytes and str
        return int(num[0])

Update: this is marginally faster than using pure Python, but at the cost of memory usage. Subprocess will fork a new process with the same memory footprint as the parent process while it executes your command.

pkit · Answer

    def file_len(full_path):
        """ Count number of lines in a file."""
        f = open(full_path)
        nr_of_lines = sum(1 for line in f)
        f.close()
        return nr_of_lines

Scott Persinger · Answer

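This version gets a small (4-8%) improvement from reusing a constant buffer, so it should avoid memory and GC overhead:

    lines = 0
    buffer = bytearray(2048)
    with open(filename, 'rb') as f:  # readinto() needs a binary file
        n = f.readinto(buffer)
        while n > 0:
            # count only the n bytes just read, not stale data
            lines += buffer.count(b'\n', 0, n)
            n = f.readinto(buffer)

You can play around with the buffer size and maybe see a little improvement.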
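To experiment with the buffer size, the loop can be wrapped in a function that takes the size as a parameter. This is a sketch for Python 3, and the name count_lines_readinto is an assumption. It restricts the count to the bytes actually filled by each readinto() call, since a short final read leaves old data in the rest of the buffer.

```python
def count_lines_readinto(filename, buf_size=2048):
    """Count newline bytes while reusing one preallocated buffer."""
    buffer = bytearray(buf_size)
    lines = 0
    with open(filename, "rb") as f:
        while True:
            n = f.readinto(buffer)  # number of bytes written into buffer
            if not n:
                break
            # Only the first n bytes are fresh; the rest may be stale.
            lines += buffer.count(b"\n", 0, n)
    return lines
```

Calling it with several sizes, for example 2048, 2 ** 16 and 2 ** 20, and timing each call is one way to find a good value for a given machine.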

ChillarAnand · Answer

Kyle's answer

    num_lines = sum(1 for line in open('my_file.txt'))

is probably best. An alternative is

    num_lines = len(open('my_file.txt').read().splitlines())

Here is a performance comparison of the two:

    In [20]: timeit sum(1 for line in open('Charts.ipynb'))
    100000 loops, best of 3: 9.79 µs per loop

    In [21]: timeit len(open('Charts.ipynb').read().splitlines())
    100000 loops, best of 3: 12 µs per loop

TheExorcist · Answer

One-line solution:

    import os
    os.system("wc -l filename")

My snippet:

    os.system('wc -l *.txt')

    0 bar.txt
    1000 command.txt
    3 test_file.txt
    1003 total

1'' · Answer

A one-line bash solution similar to this answer, using the modern subprocess.check_output function:

    import subprocess

    def line_count(file):
        return int(subprocess.check_output('wc -l {}'.format(file), shell=True).split()[0])

jeffpkamp · Answer

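This is the fastest thing I have found using pure Python. You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be the sweet spot on my computer.

    from functools import partial

    buffer = 2**16
    with open(myfile) as f:
        print(sum(x.count('\n') for x in iter(partial(f.read, buffer), '')))

I found the answer in "Why is reading lines from stdin much slower in C++ than Python?" and tweaked it just a tiny bit. It is a very good read for understanding how to count lines quickly, though wc -l is still about 75% faster than anything else.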
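The iter(callable, sentinel) idiom used here keeps calling f.read(buffer) until it returns the sentinel value. A binary-mode variant for Python 3 might look like the sketch below. The function name count_newlines is an assumption, and reading bytes avoids decoding the text at all.

```python
from functools import partial


def count_newlines(path, buf_size=2 ** 16):
    """Sum newline bytes over fixed-size chunks read in binary mode."""
    with open(path, "rb") as f:
        # iter() calls f.read(buf_size) repeatedly and stops at b"" (EOF).
        chunks = iter(partial(f.read, buf_size), b"")
        return sum(chunk.count(b"\n") for chunk in chunks)
```

The sentinel must match the mode: b"" for a binary file, '' for a text file.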

Texom512 · Answer

This code is short and clear. It's probably the best way:

    num_lines = open('yourfile.ext').read().count('\n')

BandGap · Answer

Just to complete the methods above, I tried a variant with the fileinput module:

    import fileinput as fi

    def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

And I passed a file with 60 million lines to all of the methods stated above:

    mapcount    : 6.1331050396
    simplecount : 4.588793993
    opcount     : 4.42918205261
    filecount   : 43.2780818939
    bufcount    : 0.170812129974

It's a little surprising to me that fileinput is that bad and scales far worse than all the other methods...

Silent Spectator · Answer

A simple method:

num_lines = len(list(open('myfile.txt')))

Mykola Kharechko · Answer

As for me, this variant will be the fastest:

    #!/usr/bin/env python

    def main():
        f = open('filename')
        lines = 0
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization

        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        print(lines)

    if __name__ == '__main__':
        main()

Reasons: buffering is faster than reading line by line, and string.count is also very fast.

Andrés Torres · Answer

    print(open('file.txt', 'r').read().count("\n") + 1)

Andrew Jaffe · Answer

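The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

    with open(filename) as f:
        return len(list(f))

This is more concise than your explicit loop, and it avoids the enumerate.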

Dummy · Answer

I have modified the buffer case like this:

    def CountLines(filename):
        f = open(filename)
        try:
            lines = 1
            buf_size = 1024 * 1024
            read_f = f.read # loop optimization
            buf = read_f(buf_size)

            # Empty file
            if not buf:
                return 0

            while buf:
                lines += buf.count('\n')
                buf = read_f(buf_size)

            return lines
        finally:
            f.close()

Now empty files and the last line (without \n) are also counted.
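One thing to watch with this approach: starting from lines = 1 also adds one when the file already ends with a newline. A minimal sketch that only adds the extra line when the last byte is not a newline could look like this. The function name is an assumption, and the code is written for Python 3 in binary mode.

```python
def count_lines_including_unterminated(filename, buf_size=1024 * 1024):
    """Count newline bytes, plus one for a final line with no newline."""
    lines = 0
    last_byte = b""
    with open(filename, "rb") as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                break
            lines += buf.count(b"\n")
            last_byte = buf[-1:]  # remember the last byte seen so far
    # A non-empty file whose last byte is not a newline has one more line.
    if last_byte and last_byte != b"\n":
        lines += 1
    return lines
```

With this rule an empty file gives 0, "a\n" gives 1, and "a\nb" gives 2.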

Lerner Zhang · Answer

If you want to get the line count cheaply in Python on Linux, I recommend this method:

    import os
    print(os.popen("wc -l file_path").readline().split()[0])

file_path can be either an absolute file path or a relative path. Hope this helps.

pyanon · Answer

    # enumerate() starts at 0, so the last index is one less than the line count
    count = max(enumerate(open(filename)))[0] + 1

odwl · Answer

What about this:

    import itertools

    def file_len(fname):
        counts = itertools.count()
        with open(fname) as f:
            for _ in f:
                next(counts)  # advance the counter once per line
        return next(counts)

onetwopunch · Answer

What about this one-liner:

    file_length = len(open('myfile.txt','r').read().split('\n'))

It takes 0.003 sec on a 3900-line file, timed this way:

    def c():
        import time
        s = time.time()
        file_length = len(open('myfile.txt','r').read().split('\n'))
        print(time.time() - s)

mdwhatcott · Answer

    def line_count(path):
        count = 0
        with open(path) as lines:
            for count, l in enumerate(lines, start=1):
                pass
        return count

leba-lev · Answer

How about this?

    import fileinput
    import sys

    counter = 0
    for line in fileinput.input([sys.argv[1]]):
        counter += 1

    fileinput.close()
    print(counter)

Karthik · Answer

If the file can fit into memory, then:

    with open(fname, 'rb') as f:  # binary mode so that split(b'\n') works
        count = len(f.read().split(b'\n')) - 1

Jet Blue · Answer

If all the lines in your file are the same length (and contain only ASCII characters)*, you can do the following very cheaply:

    fileSize     = os.path.getsize( pathToFile )  # file size in bytes
    bytesPerLine = someInteger                    # don't forget to account for the newline character
    numLines     = fileSize // bytesPerLine

* I suspect more effort would be needed to determine the number of bytes in a line if Unicode characters like é are used.
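As a sketch of how bytesPerLine might be obtained without hard-coding it, the first line can be read in binary mode and its byte length (newline included) used as the divisor. The function name count_fixed_width_lines is hypothetical. The divisibility check is only a cheap sanity test and does not prove that every line really has the same length.

```python
import os


def count_fixed_width_lines(path):
    """Estimate the line count as file size // byte length of the first line."""
    with open(path, "rb") as f:
        first = f.readline()  # includes the trailing newline byte(s)
    if not first:
        return 0  # empty file
    size = os.path.getsize(path)
    bytes_per_line = len(first)
    if size % bytes_per_line:
        raise ValueError("file size is not a multiple of the first line's length")
    return size // bytes_per_line
```

Because the length is measured in bytes, non-ASCII text is fine as long as every line occupies the same number of bytes, and the last line must end with a newline for the division to come out even.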

jciloa · Answer

    def count_text_file_lines(path):
        with open(path, 'rt') as file:
            line_count = sum(1 for _line in file)
        return line_count

Jenny Yue Jin · Answer

Another possibility:

    import subprocess

    def num_lines_in_file(fpath):
        return int(subprocess.check_output('wc -l %s' % fpath, shell=True).strip().split()[0])

0x90 · Answer

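Create an executable script file named count.py:

    #!/usr/bin/python
    import sys

    count = 0
    for line in sys.stdin:
        count += 1

And then pipe the file's content into the Python script: cat huge.txt | ./count.py. Pipes work in PowerShell as well, so you will end up counting the number of lines.

For me, on Linux it was 30% faster than the following:

    count = 0
    with open('huge.txt') as f:
        for line in f:  # iterate over the file so each line is counted
            count += 1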

Victor · Answer

You can use the subprocess module as follows:

    import os
    import subprocess
    Number_lines = int( (subprocess.Popen( 'wc -l {0}'.format( Filename ), shell=True, stdout=subprocess.PIPE).stdout).readlines()[0].split()[0] )

where Filename is the absolute path of the file.