Splitting a large file into chunks without splitting entries

Question

I have a fairly large .msg file formatted in the UIEE format.

$ wc -l big_db.msg
8726593 big_db.msg

Essentially, the file is made up of entries of various lengths that look something like this:

UR|1
AA|Condon, Richard
TI|Prizzi's Family
CN|Collectable- Good/Good
MT|FICTION
PU|G.P. Putnam & Sons
DP|1986
ED|First Printing.
BD|Hard Cover
NT|0399132104
KE|MAFIA
KE|FICTION
PR|44.9
XA|4
XB|1
XC|BO
XD|S

UR|10
AA|Gariepy, Henry
TI|Portraits of Perseverance
CN|Good/No Jacket
MT|SOLD
PU|Victor Books
DP|1989
BD|Mass Market Paperback
NT|1989 tpb g 100 meditations from the Book of Job "This book...help you
NT| persevere through the struggles of your life..."
KE|Bible
KE|religion
KE|Job
KE|meditations
PR|28.4
XA|4
XB|5
XC|BO
XD|S

This is an example of two entries, separated by a blank line. I want to split this big file into smaller files without breaking any entry across two files.

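Each entry is separated from the next by a completely blank line. I want to break this 8.7 million line file into 15 files. I understand that tools like split exist, but I'm not sure how to make the split happen only at blank lines so that no single entry ends up spread over multiple files.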

mikeserv · Accepted Answer

Here's a solution that could work:

seq 1 $(((lines=$(wc -l </tmp/file))/16+1)) $lines |
sed 'N;s|\(.*\)\(\n\)\(.*\)|\1d;\1,\3w /tmp/uptoline\3\2\3|;P;$d;D' |
sed -ne :nl -ne '/\n$/!{N;bnl}' -nf - /tmp/file

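It works by letting the first sed write the second sed's script. The second sed first gathers input lines until it encounters a blank line, then writes the collected lines out to a file. The first sed generates the script that tells the second one where to write its output. In my test case that script looked like this: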

1d;1,377w /tmp/uptoline377
377d;377,753w /tmp/uptoline753
753d;753,1129w /tmp/uptoline1129
1129d;1129,1505w /tmp/uptoline1505
1505d;1505,1881w /tmp/uptoline1881
1881d;1881,2257w /tmp/uptoline2257
2257d;2257,2633w /tmp/uptoline2633
2633d;2633,3009w /tmp/uptoline3009
3009d;3009,3385w /tmp/uptoline3385
3385d;3385,3761w /tmp/uptoline3761
3761d;3761,4137w /tmp/uptoline4137
4137d;4137,4513w /tmp/uptoline4513
4513d;4513,4889w /tmp/uptoline4889
4889d;4889,5265w /tmp/uptoline5265
5265d;5265,5641w /tmp/uptoline5641
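To see the generator stage on its own, the first sed can be fed a short list of boundaries by hand. This is only a minimal sketch: the boundaries 1, 4, 7 and 10 are made up for illustration, and it assumes a sed in which `.` matches an embedded newline in the pattern space (GNU sed does).

```shell
# Feed four hypothetical boundary line numbers to the generator stage.
# N joins each boundary with the next one, the s command turns the pair
# into one "Nd;N,Mw file" command, P prints it, and D keeps the second
# boundary around as the start of the next pair.
generated=$(printf '%s\n' 1 4 7 10 |
    sed 'N;s|\(.*\)\(\n\)\(.*\)|\1d;\1,\3w /tmp/uptoline\3\2\3|;P;$d;D')
printf '%s\n' "$generated"
```

With GNU sed this should print three commands, one per pair of neighbouring boundaries, in the same shape as the script shown above.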

I tested it like this:

printf '%s\nand\nmore\nlines\nhere\n\n' $(seq 1000) >/tmp/file

This gave me a file of 6000 lines, which looked like this:

<iteration#>
and
more
lines
here
#blank

...repeated 1000 times.

After running the script above:

set -- /tmp/uptoline*
echo $# total splitfiles
for splitfile
do
    echo $splitfile
    wc -l <$splitfile
    tail -n6 $splitfile
done

OUTPUT

15 total splitfiles
/tmp/uptoline1129
378
188
and
more
lines
here

/tmp/uptoline1505
372
250
and
more
lines
here

/tmp/uptoline1881
378
313
and
more
lines
here

/tmp/uptoline2257
378
376
and
more
lines
here

/tmp/uptoline2633
372
438
and
more
lines
here

/tmp/uptoline3009
378
501
and
more
lines
here

/tmp/uptoline3385
378
564
and
more
lines
here

/tmp/uptoline3761
372
626
and
more
lines
here

/tmp/uptoline377
372
62
and
more
lines
here

/tmp/uptoline4137
378
689
and
more
lines
here

/tmp/uptoline4513
378
752
and
more
lines
here

/tmp/uptoline4889
372
814
and
more
lines
here

/tmp/uptoline5265
378
877
and
more
lines
here

/tmp/uptoline5641
378
940
and
more
lines
here

/tmp/uptoline753
378
125
and
more
lines
here

slm · Answer

Using the suggestion of csplit:

Splitting based on line numbers

$ csplit file.txt <num lines> "{repetitions}"

Example

Say I have a file with 1000 lines in it.

$ seq 1000 > file.txt
$ csplit file.txt 100 "{8}"
288
400
400
400
400
400
400
400
400
405

This results in files like these:

$ wc -l xx*
  99 xx00
 100 xx01
 100 xx02
 100 xx03
 100 xx04
 100 xx05
 100 xx06
 100 xx07
 100 xx08
 101 xx09
   1 xx10
1001 total

You can get around the static limitation of having to specify the number of repetitions by pre-calculating it from the number of lines in your particular file.

$ lines=100
$ echo $lines
100
$ rep=$(( ($(wc -l file.txt | cut -d" " -f1) / $lines) -2 ))
$ echo $rep
8
$ csplit file.txt 100 "{$rep}"
288
400
400
400
400
400
400
400
400
405
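The same calculation can be wrapped into a small script. The following is a sketch rather than a polished tool: the scratch directory and the piece. prefix are choices made here for illustration, and the repetition formula is simply the one used above.

```shell
# Work in a scratch directory so the generated pieces do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1000 > file.txt

lines=100                        # desired number of lines per piece
total=$(wc -l < file.txt)        # reading from stdin avoids the cut step
rep=$(( total / lines - 2 ))     # same formula as above

# -s silences the byte counts and -f sets the output prefix.
csplit -s -f piece. file.txt "$lines" "{$rep}"
pieces=$(ls piece.* | wc -l)
echo "rep=$rep pieces=$pieces"
```

For the 1000-line sample this splits before lines 100, 200, ..., 900, so the first piece holds 99 lines and the last one 101.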

Splitting based on blank lines

If, on the other hand, you simply want to split a file at the blank lines it contains, you can use this form of csplit:

$ csplit file2.txt '/^$/' "{*}"

Example

Say I've added 4 blank lines to the file.txt above and saved the result as file2.txt. You can see where they were manually added like so:

$ grep -A1 -B1 "^$" file2.txt
20

21
--
72

73
--
112

113
--
178

179

The above shows that I added them between the corresponding numbers in my sample file. Now when I run the csplit command:

$ csplit file2.txt '/^$/' "{*}"
51
157
134
265
3290

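You can see that the file has been split at each of the 4 blank lines, giving five pieces (xx00 through xx04); each piece after the first begins with the blank line it was split on: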

$ grep -A1 -B1 '^$' xx0*
xx01:
xx01-21
--
xx02:
xx02-73
--
xx03:
xx03-113
--
xx04:
xx04-179

Stéphane Chazelas · Answer

If you don't care about the order of the records, you could do:

gawk -vRS= '{printf "%s", $0 RT > "file.out." (NR-1)%15}' file.in

Otherwise, you'd need to get the number of records first, to know how many to put in each output file:

gawk -vRS= -v "n=$(gawk -vRS= 'END {print NR}' file.in)" '
  {printf "%s", $0 RT > "file.out." int((NR-1)*15/n)}' file.in
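Where gawk is not available, roughly the same order-preserving split can be sketched with the paragraph mode of POSIX awk (RS set to the empty string). RT is gawk-specific, so this variant writes a fixed blank-line separator after every record instead of the original one. The names demo.in and demo.out.N, the 30 sample entries and the chunk count of 3 are placeholders chosen for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build 30 small entries, each followed by exactly one blank line.
i=1
while [ "$i" -le 30 ]; do
    printf 'UR|%s\nTI|Title %s\n\n' "$i" "$i"
    i=$((i + 1))
done > demo.in

chunks=3
# First pass: count the records.
n=$(awk 'BEGIN { RS = "" } END { print NR }' demo.in)
# Second pass: record NR goes to chunk int((NR-1)*chunks/n), as above.
awk -v n="$n" -v chunks="$chunks" '
    BEGIN { RS = "" }
    {
        out = "demo.out." int((NR - 1) * chunks / n)
        printf "%s\n\n", $0 > out
    }' demo.in

wc -l demo.out.*
```

Because the separator is rewritten rather than copied, runs of several blank lines in the input would collapse to a single blank line in the output; the gawk version with RT preserves them.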

hornj · Answer

If you only need to split at the end of a line, you should be able to do that with the -l option of split.

If you need to split at a blank line, here's how I would do it in ksh. I haven't tested it, and it's probably not ideal, but something along these lines should work:

filenum=0
counter=0
limit=580000
while IFS= read -r LINE
do
    (( counter = counter + 1 ))
    # Once the line limit is reached, switch files at the next blank line
    if (( counter >= limit )) && [[ -z $LINE ]]
    then
        (( filenum = filenum + 1 ))
        counter=0
    fi
    print -r -- "$LINE" >>"big_db$filenum.msg"
done <big_db.msg
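The same idea (count lines, then switch output files at the first blank line after the limit) can also be sketched in awk, which is likely to be much faster than a shell read loop on millions of lines. The file name demo.msg, the five sample entries and the limit of 4 are made up for illustration; unlike the loop above, this version leaves the blank line at the end of the part it closes.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Five 3-line entries (two fields plus the blank separator): 15 lines in total.
for i in 1 2 3 4 5; do
    printf 'UR|%s\nTI|Title %s\n\n' "$i" "$i"
done > demo.msg

# Write every line to the current part. Once at least "limit" lines have
# been written, the next blank line closes the part and starts a new one.
awk -v limit=4 '
    BEGIN { part = 0 }
    { out = FILENAME "." part; print > out; count++ }
    count >= limit && $0 == "" { close(out); part++; count = 0 }
' demo.msg

wc -l demo.msg.*
```

For the real file, a limit of about 580000 as in the loop above would give roughly 15 parts.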

dchirikov · Answer

Try awk:

awk 'BEGIN{RS="\n\n"}{print $0 > (FILENAME"."FNR)}' big_db.msg

David Z · Answer

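If you don't care about the order of the records but you're particular about getting a certain number of output files, Stephane's answer is the way I would go. But I have a feeling you might care more about specifying a size that each output file shouldn't exceed. That actually makes things easier, because you can read through the input file and collect records until you reach that size, then start a new output file. If that works for you, most programming languages can handle the task with a short script. Here's an awk implementation: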

BEGIN {
    RS = "\n\n"
    ORS = "\n\n"
    maxlen = (maxlen == 0 ? 500000 : maxlen)
    oi = 1
}
{
    # +2 accounts for the blank-line separator written after each record
    reclen = length($0) + 2
    if (n + reclen > maxlen) {
        oi++
        n = 0
    }
    n += reclen
    print $0 > (FILENAME"."oi)
}

Put this in a file, say program.awk, and run it with awk -v maxlen=10000 -f program.awk big_db.msg, where the value of maxlen is the maximum number of bytes you want in any one file. It uses 500k as the default.

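If you want a set number of files, probably the easiest way is to divide the size of your input file by the number of files you want, then add a bit to that number to get maxlen. For example, to get 15 files out of 8726593 bytes (the figure in the question is actually a line count from wc -l; use wc -c to get the byte size), divide by 15 to get 581773 and add some, so pass maxlen=590000 or maxlen=600000. If you need to do this repeatedly, the program could be set up to do that calculation itself.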
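That arithmetic can be sketched as a small shell helper. compute_maxlen is a hypothetical name introduced here, and the slack of 10000 bytes is only a guess at "a bit"; nothing guarantees that a given slack yields exactly the requested number of files.

```shell
# compute_maxlen SIZE_IN_BYTES FILE_COUNT SLACK   (hypothetical helper)
# Rounds SIZE/COUNT up to a whole number, then adds SLACK bytes of headroom.
compute_maxlen() {
    echo $(( ($1 + $2 - 1) / $2 + $3 ))
}

# Worked example with the numbers from the text. For a real file the size
# would come from something like: size=$(wc -c < big_db.msg)
maxlen=$(compute_maxlen 8726593 15 10000)
echo "$maxlen"
# The result would then be passed on as:
#   awk -v maxlen="$maxlen" -f program.awk big_db.msg
```

With these inputs the helper yields 591773, which falls in the range suggested above.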