AWK programming: splitting a large file into smaller files based on a pattern

Question

I have a large file, input.dat, as shown below.

    kpoint1 : 0.0000 0.0000 0.0000
    band No. band energies occupation
    1 -52.8287 2.00000
    2 -52.7981 2.00000
    3 -52.7981 2.00000
    kpoint2 : 0.0000 0.0000 0.0000
    band No. band energies occupation
    1 -52.8287 2.00000
    2 -52.7981 2.00000
    3 -52.7981 2.00000

I need to split the file into two smaller files, as follows.

kpoint1.dat:

    kpoint1 : 0.0000 0.0000 0.0000
    band No. band energies occupation
    1 -52.8287 2.00000
    2 -52.7981 2.00000
    3 -52.7981 2.00000

and kpoint2.dat:

    kpoint2 : 0.0000 0.0000 0.0000
    band No. band energies occupation
    1 -52.8287 2.00000
    2 -52.7981 2.00000
    3 -52.7981 2.00000

I wrote a small script for this, shown below.

    for j in {1..2}
    do
    awk '$1=="kpoint'$j'" {for(i=1; i<=3; i++){getline; print}}' tmp7 >kpoint'$j'.dat
    done

The script creates output files with the intended names, but all of the files are empty. Can anyone help me fix this?

muru · Accepted Answer

This can be done entirely in awk:

    $ awk '$1 ~ /kpoint[0-9]/ { file = $1 ".dat" } {print > file}' file
    $ head kpoint*
    ==> kpoint1.dat <==
    kpoint1 : 0.0000 0.0000 0.0000
    band No. band energies occupation
    1 -52.8287 2.00000
    2 -52.7981 2.00000
    3 -52.7981 2.00000

    ==> kpoint2.dat <==
    kpoint2 : 0.0000 0.0000 0.0000
    band No. band energies occupation
    1 -52.8287 2.00000
    2 -52.7981 2.00000
    3 -52.7981 2.00000

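Awk also supports > file for redirection, but it behaves slightly differently from the shell: the file is truncated only when it is first opened, and later prints to the same name are appended to it (see the GNU awk manual for details).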
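For readers less familiar with awk, the same logic can be sketched in Python. This is only an illustration under assumptions, not part of the answer above: the function name split_by_header and the out_dir parameter are hypothetical, and unlike the awk one-liner it silently skips any lines that appear before the first kpoint header.

```python
import re
from pathlib import Path


def split_by_header(lines, out_dir="."):
    """Write each line to <first word>.dat, switching files at every kpoint header.

    Hypothetical helper: mirrors the awk rule '$1 ~ /kpoint[0-9]/'.
    Returns the names of the files created, in order of first appearance.
    """
    out_dir = Path(out_dir)
    header = re.compile(r"kpoint[0-9]")
    current = None   # file object currently being written to
    written = []     # file names created so far
    try:
        for line in lines:
            fields = line.split()
            if fields and header.search(fields[0]):
                if current is not None:
                    current.close()
                name = fields[0] + ".dat"
                # Truncate on first use, append if the same header shows up again,
                # which is how awk's '>' redirection behaves within one run.
                mode = "a" if name in written else "w"
                if name not in written:
                    written.append(name)
                current = open(out_dir / name, mode)
            if current is not None:
                current.write(line)
    finally:
        if current is not None:
            current.close()
    return written
```

It could be called as `with open("input.dat") as fd: split_by_header(fd)`, since a file object yields its lines one at a time.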

Sergiy Kolodyazhnyy · Answer

muru's answer is the simplest, but there are a few other ways to do this without awk.

Perl

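The awk approach is basically to write to a particular filename, and to change that filename only when kpoint is encountered at the start of a line. The same approach works in Perl.

    $ perl -ane '$p=$F[0] if $F[0] =~ /kpoint/;open($f,">>",$p . ".dat"); print $f $_' input.txt

Here is how this works: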

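The -a flag gives us the special array @F of words, split automatically from each line of the input file. So $F[0] refers to the first word, just like $1 in awk.
$p=$F[0] if $F[0] =~ /kpoint/ changes $p (meant to be a prefix variable) only when the first word of the line contains kpoint. A possible improvement to that pattern match is /^ *kpoint/.
On each iteration we open, for appending, the file named by joining $p with the string .dat. Note that the appending part matters: for a clean run you will probably want to remove any old kpoint files first. If you want the files always freshly created and overwritten, the original command can be rewritten as:
```
$ perl -ane 'if ($F[0] =~ /kpoint/){$p=$F[0]; open($f,">",$p . ".dat")}; print $f $_' input.txt
```
Finally, print $f $_ simply prints the current line to whichever file is currently open.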

split

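From the example, each entry appears to consist of 5 lines. If that number is constant, we can split the file on that basis with split, without relying on pattern matching. Specifically, with this command: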

    $ split --additional-suffix=".dat" --numeric-suffixes=1 -l 5 input.txt kpoint

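The options in this command are as follows:

--additional-suffix=".dat" is the static .dat suffix added to each file that is created
--numeric-suffixes=1 adds a changing number, starting at 1, to each file name
-l 5 splits the input file every 5 lines
input.txt is the file we are trying to split
kpoint is the static file name prefix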

And here is how this works in practice:

    $ split --additional-suffix=".dat" --numeric-suffixes=1 -l 5 input.txt kpoint
    $ cat kpoint01.dat
    kpoint1 : 0.0000 0.0000 0.0000
    band No. band energies occupation
    1 -52.8287 2.00000
    2 -52.7981 2.00000
    3 -52.7981 2.00000
    $ cat kpoint02.dat
    kpoint2 : 0.0000 0.0000 0.0000
    band No. band energies occupation
    1 -52.8287 2.00000
    2 -52.7981 2.00000
    3 -52.7981 2.00000

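Optionally, you can add --suffix-length=1 to shorten each numeric suffix, giving kpoint1 instead of kpoint01, but this can become a problem if there is a large number of kpoints, since a single digit leaves room for only a few files.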
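The fixed-size split can also be sketched in Python, which makes the naming scheme explicit. This is a rough illustration under assumptions, not a reimplementation of split: the function split_fixed and all of its parameters are hypothetical names, and where split stops with an error once its suffixes run out, this sketch simply lets the number grow wider.

```python
from pathlib import Path


def split_fixed(lines, lines_per_file=5, prefix="kpoint", suffix=".dat",
                width=2, out_dir="."):
    """Write every `lines_per_file` lines to <prefix><NN><suffix>, numbered from 1.

    Hypothetical helper loosely mirroring
    split --additional-suffix=.dat --numeric-suffixes=1 -l 5.
    Returns the list of file names written.
    """
    out_dir = Path(out_dir)
    names = []
    chunk = []

    def flush():
        # Write the buffered lines, if any, to the next numbered file.
        if chunk:
            name = f"{prefix}{len(names) + 1:0{width}d}{suffix}"
            (out_dir / name).write_text("".join(chunk))
            names.append(name)
            chunk.clear()

    for line in lines:
        chunk.append(line)
        if len(chunk) == lines_per_file:
            flush()
    flush()  # a final, possibly shorter, chunk
    return names
```

As with split itself, this only gives the intended result if every entry really has the same number of lines; one longer or shorter entry shifts every file after it.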

Alternative awk

This is similar to muru's answer, but takes a slightly different approach: it uses a different pattern match and builds the filename variable via sprintf().

    $ awk '/^\ *kpoint/{f=sprintf("%s.dat",$1)};{print > f}' input.txt

Python

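The awk and split approaches are shorter, but other tools such as Python are well suited to text processing, and can be used to implement more verbose but practical solutions.

The script below does exactly that, and works on the idea of looking backwards at a list of lines we have saved. The script keeps storing lines until it encounters kpoint at the start of a line, which means we have reached a new entry, and also that the previous entry needs to be written to its respective file.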

    #!/usr/bin/env python3
    import sys

    def write_entry(pref,line_list):
        # this function writes the actual file for each entry
        with open(".".join([pref,"dat"]),"w") as entry_file:
            entry_file.write("".join(line_list))

    def main():
        prefix = ""
        old_prefix = ""
        entry=[]
        with open(sys.argv[1]) as fd:
            for line in fd:
                # if we encounter kpoint string, that's a signal
                # that we need to write out the list of things
                if line.strip().startswith('kpoint'):
                    prefix=line.strip().split()[0]
                    # This if statement counters special case
                    # when we just started reading the file
                    if not old_prefix:
                        old_prefix = prefix
                        entry.append(line)
                        continue
                    write_entry(old_prefix,entry)
                    old_prefix = prefix
                    entry=[]
                # Keep storing lines. This works nicely after old
                # entry has been cleared out.
                entry.append(line)
        # since we're looking backwards, we need one last call
        # to write last entry when input file has been closed
        write_entry(old_prefix,entry)

    if __name__ == '__main__':
        main()

Pure bash

Pretty much the same idea as the Perl approach: write everything to a particular filename, and change the filename only when a line containing kpoint is found.

    #!/usr/bin/env bash
    while IFS= read -r line;
    do
        case "$line" in
            # We found next entry. Use word-splitting to get
            # filename into fname variable, and truncate that filename
            *kpoint[0-9]*) read fname trash <<< "$line" &&
                           echo "$line" > "$fname".dat ;;
            # That's just a line within entry. Append to
            # current working file
            *) echo "$line" >> "$fname".dat ;;
        esac
    done < "$1"
    # Just in case there are trailing lines that weren't processed
    # in while loop, append them to last filename
    [ -n "$line" ] && echo "$line" >> "$fname".dat