UNIX tool to subtract text files?

Question

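I have a large file made up of semicolon-delimited text fields, laid out as one big table. It is sorted. I also have a smaller file made up of the same kind of text fields. At some point, someone concatenated this small file with other files and then sorted the result to produce the large file described above. I would like to subtract the lines of the small file from the big one (that is, for each line in the small file, if a matching string exists in the big file, delete that line from the big file).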

The files look roughly like this:

GenericClass1; 1; 2; NA; 3; 4;
GenericClass1; 5; 6; NA; 7; 8;
GenericClass2; 1; 5; NA; 3; 8;
GenericClass2; 2; 6; NA; 4; 1;

etc.

Is there an elegant way to do this, or do I have to use awk?

terdon · Accepted Answer

You can use grep. Give it the small file as its list of patterns and tell it to print only the lines that do not match:

grep -vxFf file.txt bigfile.txt > newbigfile.txt

The options used are:

 -F, --fixed-strings
        Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)
 -f FILE, --file=FILE
        Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing. (-f is specified by POSIX.)
 -v, --invert-match
        Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
 -x, --line-regexp
        Select only those matches that exactly match the whole line. (-x is specified by POSIX.)
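To see what the command above does, here is a small sketch you can run in an empty scratch directory. The sample rows are made up in the question's format, and which rows end up in file.txt is purely an assumption for illustration.

```shell
# Run this in an empty scratch directory: it creates bigfile.txt,
# file.txt and newbigfile.txt in the current directory.

# Hypothetical sample data in the format shown in the question.
cat > bigfile.txt <<'EOF'
GenericClass1; 1; 2; NA; 3; 4;
GenericClass1; 5; 6; NA; 7; 8;
GenericClass2; 1; 5; NA; 3; 8;
GenericClass2; 2; 6; NA; 4; 1;
EOF

# The small file: the lines we want removed from the big one.
cat > file.txt <<'EOF'
GenericClass1; 5; 6; NA; 7; 8;
GenericClass2; 2; 6; NA; 4; 1;
EOF

# -f file.txt : read the patterns from file.txt, one per line
# -F          : treat each pattern as a literal string, not a regex
# -x          : a pattern must match the whole line
# -v          : print the lines that match none of the patterns
grep -vxFf file.txt bigfile.txt > newbigfile.txt

cat newbigfile.txt
# GenericClass1; 1; 2; NA; 3; 4;
# GenericClass2; 1; 5; NA; 3; 8;
```

Note that grep removes every line of bigfile.txt that equals a line of file.txt, so if a line occurs several times in the big file, all copies go.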

Ulrich Schwarz · Answer

comm is your friend:

NAME
        comm - compare two sorted files line by line

SYNOPSIS
        comm [OPTION]... FILE1 FILE2

DESCRIPTION
        Compare sorted files FILE1 and FILE2 line by line.

        With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.

        -1     suppress column 1 (lines unique to FILE1)
        -2     suppress column 2 (lines unique to FILE2)
        -3     suppress column 3 (lines that appear in both files)

(Because it takes advantage of the sort order, comm will probably perform better than grep.)

For example:

comm -1 -3 file.txt bigfile.txt > newbigfile.txt
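comm expects both inputs to be sorted in the same collation order, and the question only says that the big file is sorted. The sketch below therefore sorts the small file first, assuming the C locale is acceptable for your data. The sample rows and the name file.sorted.txt are hypothetical. Run it in an empty scratch directory.

```shell
# Use one fixed collation order for sort and comm (an assumption;
# pick whatever locale the big file was originally sorted in).
export LC_ALL=C

# Hypothetical big file, already sorted; one line appears twice.
cat > bigfile.txt <<'EOF'
GenericClass1; 1; 2; NA; 3; 4;
GenericClass1; 5; 6; NA; 7; 8;
GenericClass1; 5; 6; NA; 7; 8;
GenericClass2; 1; 5; NA; 3; 8;
GenericClass2; 2; 6; NA; 4; 1;
EOF

# Hypothetical small file, deliberately not in sorted order.
cat > file.txt <<'EOF'
GenericClass2; 2; 6; NA; 4; 1;
GenericClass1; 5; 6; NA; 7; 8;
EOF

# Sort the small file; check (without rewriting) that the big one is sorted.
sort file.txt > file.sorted.txt
sort -c bigfile.txt

# -13 is the same as -1 -3: keep only column 2, the lines unique to bigfile.txt.
comm -13 file.sorted.txt bigfile.txt > newbigfile.txt

cat newbigfile.txt
# GenericClass1; 1; 2; NA; 3; 4;
# GenericClass1; 5; 6; NA; 7; 8;
# GenericClass2; 1; 5; NA; 3; 8;
```

Unlike the grep approach, comm pairs lines off one for one: the duplicated line occurs twice in bigfile.txt and once in the small file, so one copy survives. If the small file really was concatenated into the big one, that is probably the subtraction you want.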