Reading a file with a specified delimiter in place of the newline

Question

I have a file in which lines are separated by the delimiter '.'. I want to read this file line by line, where a line is determined by the presence of '.' rather than by a newline.

One way is:

    f = open('file', 'r')
    for line in f.read().strip().split('.'):
        pass  # ... do some work
    f.close()

But this is not memory efficient if the file is too large. Rather than reading the whole file at once, I want to read it line by line.

open supports a 'newline' parameter, but as its documentation notes, this parameter accepts only None, '', '\n', '\r', and '\r\n' as input.
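For illustration, the restriction can be checked directly. This is a minimal sketch: the temporary file and its contents are assumptions made up for the demonstration, and the only point is that open raises ValueError for a newline value outside the accepted set.

```python
import os
import tempfile

# Write a small sample file whose records are separated by "."
# (the file and its contents are made up for this demonstration).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first.second.third")
    path = tmp.name

try:
    open(path, newline=".")
except ValueError as exc:
    # open() refuses any newline value outside the accepted set
    rejected = True
    print("open() rejected the delimiter:", exc)
else:
    rejected = False
finally:
    os.remove(path)
```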

Is there a way to read the lines of a file efficiently, but based on a pre-specified delimiter?

NPE · Accepted Answer

You could use a generator:

    def myreadlines(f, newline):
        buf = ""
        while True:
            # hand out every complete line already in the buffer
            while newline in buf:
                pos = buf.index(newline)
                yield buf[:pos]
                buf = buf[pos + len(newline):]
            chunk = f.read(4096)
            if not chunk:
                # end of file: whatever is left is the last line
                yield buf
                break
            buf += chunk

    with open('file') as f:
        for line in myreadlines(f, "."):
            print(line)
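To see how this kind of buffering behaves when a delimiter straddles a chunk boundary, here is a minimal self-contained sketch of the same idea run against an in-memory stream. The name read_delimited, the chunk_size parameter, and the sample text are assumptions made for illustration; they are not part of the answer itself.

```python
import io


def read_delimited(f, delimiter, chunk_size=4096):
    """Yield pieces of text stream `f` separated by `delimiter` (hypothetical helper)."""
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            # End of input: whatever is left is the final piece (may be empty).
            yield buf
            return
        buf += chunk
        # Split off every complete piece; keep the unfinished tail in buf.
        *pieces, buf = buf.split(delimiter)
        yield from pieces


# A tiny chunk size forces the two-character delimiter "::" to be
# split across reads, which the buffer has to handle.
stream = io.StringIO("alpha::beta::gamma")
pieces = list(read_delimited(stream, "::", chunk_size=3))
print(pieces)
```

Only the unfinished tail is held between reads, so memory use is bounded by the chunk size plus the longest single piece rather than by the file size.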

Bruno Gomes · Answer

The easiest way would be to preprocess the file so that it has newlines where you want them.

Here is an example using Perl (assuming you want the string 'abc' to become the newline):

    perl -pe 's/abc/\n/g' text.txt > processed_text.txt

If you also want to ignore the original newlines, use the following instead:

    perl -ne 's/\n//; s/abc/\n/g; print' text.txt > processed_text.txt

Dev Aggarwal · Answer

Here is a more efficient answer, using FileIO and bytearray, that I used for parsing a PDF file:

    import io
    import re

    # the end-of-line chars, separated by a `|` (logical OR)
    EOL_REGEX = b'\r\n|\r|\n'

    # the end-of-file marker
    EOF = b'%%EOF'


    def readlines(fio):
        buf = bytearray(4096)
        while True:
            n = fio.readinto(buf)
            if not n:
                # file ended without an EOF marker
                break
            data = buf[:n]  # only the bytes actually read
            try:
                yield data[: data.index(EOF)]
            except ValueError:
                pass
            else:
                break
            # note: a line that spans two chunks is yielded in two parts
            for line in re.split(EOL_REGEX, data):
                yield line


    with io.FileIO("test.pdf") as fio:
        for line in readlines(fio):
            ...

The example above also handles a custom EOF. If you don't want that, use this:

    import io
    import os
    import re

    # the end-of-line chars, separated by a `|` (logical OR)
    EOL_REGEX = b'\r\n|\r|\n'


    def readlines(fio, size):
        buf = bytearray(4096)
        while True:
            if fio.tell() >= size:
                break
            n = fio.readinto(buf)
            # split only the bytes actually read, not stale buffer contents
            for line in re.split(EOL_REGEX, buf[:n]):
                yield line


    size = os.path.getsize("test.pdf")
    with io.FileIO("test.pdf") as fio:
        for line in readlines(fio, size):
            ...
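Splitting each chunk on its own means a line that crosses a 4096-byte boundary comes out in two parts. If whole lines are needed, one option is to carry the unfinished tail over to the next read. The following is a sketch under assumptions: iter_lines, pending, and chunk_size are hypothetical names, and an in-memory io.BytesIO stands in for the PDF file.

```python
import io
import re

# Same end-of-line alternatives as above, compiled once.
EOL_REGEX = re.compile(rb"\r\n|\r|\n")


def iter_lines(raw, chunk_size=4096):
    """Yield lines from a binary stream, joining lines that span chunks."""
    buf = bytearray(chunk_size)
    pending = b""  # unfinished line carried over from earlier reads
    while True:
        n = raw.readinto(buf)
        if not n:
            break
        pending += bytes(buf[:n])
        # A trailing "\r" may be the first half of "\r\n"; wait for more data.
        if pending.endswith(b"\r"):
            continue
        *lines, pending = EOL_REGEX.split(pending)
        yield from lines
    # End of stream: flush what is left, without a spurious empty last line.
    if pending:
        lines = EOL_REGEX.split(pending)
        if lines[-1] == b"":
            lines.pop()
        yield from lines


# A chunk size of 4 puts "\r" and "\n" of the first line ending in different reads.
sample = io.BytesIO(b"one\r\ntwo\rthree\nfour")
lines = list(iter_lines(sample, chunk_size=4))
print(lines)
```

The cost is one extra copy per chunk into pending, which gives up part of the zero-copy benefit of readinto in exchange for complete lines.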