How to find all files containing various strings from a long list of string combinations?

Question

I am still very new to command-line tools (I use the Mac OS X terminal), so I hope I haven't missed the answer somewhere, but I have searched for hours.

I have a text file (let's call it strings.txt) containing 200 combinations of 3 strings. [Edit 2017/01/30] The first 5 lines look like this:

"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

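Note that I am fine with changing strings.txt to any other format, as long as the bigrams / two-word phrases (such as surveillance data in line 1) stay together. (That is, for the answer by @MichaelVehrs below, the quotation marks can be removed if necessary.)

Now I want to search a directory of more than 800 files for the files that contain at least one of the string combinations (anywhere in the file). My initial idea was to use egrep with a pattern file, like this: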

egrep -i -l -r -f strings.txt file_directory

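However, I can only get this to work if there is one string per line. That is not what I want, because I need the identified files to contain all three strings of a given pattern. Is there a way to add some sort of AND operator to the grep pattern file? Or is there another way to achieve this with a different function or tool? Thanks a lot!

Edit 2017/01/30

The answer by @MichaelVehrs below was very helpful. I edited it to the following: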

while read one two three four five six
do
  grep -ilFr "$one $two" *files* | xargs grep -ilFr "$three $four" | xargs grep -ilFr "$five $six"
done < *patternfile* | sort -u

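This answer works when the pattern file contains strings without quotation marks. Unfortunately, it only seems to match the pattern on the first line of the pattern file. Does anybody know why?

Edit 2017/01/29

A similar question about grepping for multiple values has been asked before, but I need AND logic to match one of the three-string combinations from the pattern file strings.txt in the other files. I understand that the format of strings.txt may need to change for the matching to work, and I would appreciate suggestions.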

George Vasiliou · Accepted Answer

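Since agrep does not seem to be present on your system, have a look at this alternative based on sed and awk, which applies a grep-like search with an AND operation using patterns read from a local file.

PS: Since you use OS X, I am not sure whether the awk version you have supports the usage below.

awk can simulate grep with an AND operation over multiple patterns like this: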
awk '/pattern1/ && /pattern2/ && /pattern3/'
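As a quick illustration of that idiom, here is a minimal sketch that wraps a three-pattern condition in a shell function and feeds it a few lines on standard input. The function name and_match and the sample lines are made up for this example. Only lines that contain all three phrases are printed, so the AND applies per line, not per file.

```shell
# and_match (hypothetical helper): print only the input lines that
# contain all three phrases, i.e. a per-line AND of three regex matches.
and_match() {
  awk '/surveillance data/ && /surveillance technology/ && /cctv camera/'
}

# Made-up sample input: only the second line contains all three phrases.
printf '%s\n' \
  'All surveillance data are locked.' \
  'They use surveillance technology and gather surveillance data by one cctv camera.' \
  'A cctv camera is a kind of surveillance technology.' |
  and_match
```

The match is case-sensitive. Since tolower() is part of standard awk, writing each test as tolower($0) ~ /pattern/ with lower-case patterns is one way to approximate grep -i.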

So you could transform your pattern file from this:

$ cat ./tmp/d1.txt
"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

to this:

$ sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' ./tmp/d1.txt
/surveillance data/ && /surveillance technology/ && /cctv camera/
/social media/ && /surveillance techniques/ && /enforcement agencies/
/social control/ && /surveillance camera/ && /social security/
/surveillance data/ && /security guards/ && /social networking/
/surveillance mechanisms/ && /cctv surveillance/ && /contemporary surveillance/

PS: You can redirect the output to another file by appending >anotherfile at the end, or you can use the sed -i option to change the search-terms pattern file in place.
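The redirect variant could look like the sketch below. The helper name to_awk_patterns and the file names strings.txt and awk_patterns.txt are assumptions made for the example; the sed expressions are the ones from the answer, minus the g flag on the two anchored substitutions, which can only match once per line anyway. The demo works in a scratch directory so nothing existing is touched.

```shell
# to_awk_patterns (hypothetical helper): convert lines such as
#   "a b" "c d" "e f"
# into awk conditions such as
#   /a b/ && /c d/ && /e f/
# The result goes to standard output; the input file is not modified.
to_awk_patterns() {
  sed 's/" "/\/ \&\& \//g; s/^"/\//; s/"$/\//' "$1"
}

# Demo in a scratch directory (file names are made up for this example).
workdir=$(mktemp -d)
printf '%s\n' '"social media" "surveillance techniques" "enforcement agencies"' \
  > "$workdir/strings.txt"
to_awk_patterns "$workdir/strings.txt" > "$workdir/awk_patterns.txt"
cat "$workdir/awk_patterns.txt"
rm -r "$workdir"
```

Writing to a new file also sidesteps a portability wrinkle: the sed shipped with macOS expects a backup-suffix argument after -i (as in sed -i '' ...), whereas GNU sed does not. Note too that the phrases end up inside awk regular expressions, so a phrase containing a slash or a regex metacharacter would need escaping first.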

Then you only need to feed awk with the awk-formatted patterns from this pattern file:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt #d1.txt = my test pattern file

Alternatively, you can leave the original pattern file untouched and apply sed to each of its lines as it is read, like this:

while IFS= read -r line;do
  line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line")
  awk "$line" *.txt
done <./tmp/d1.txt

Or as a one-liner:

$ while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt

The commands above return the correct AND results for my test files, which look like this:

$ cat d2.txt
This guys over there have the required surveillance technology to do the job.
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
$ cat d3.txt
All surveillance data are locked.
All surveillance data are locked and guarded by security guards.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Results:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt
#or
while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Update:
The awk solution above prints the matching lines from the txt files.
If you want to display the file name instead of the contents, use the following awk where needed:

awk "$line""{print FILENAME}" *.txt
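Because that command prints the file name once per matching line, the same file can be listed several times. A minimal sketch of one way to get each name only once is shown here; the function name list_matching_files is hypothetical, and the pattern file is assumed to be already in awk form (one condition per line).

```shell
# list_matching_files (hypothetical helper)
# usage: list_matching_files AWK_PATTERN_FILE FILE...
# AWK_PATTERN_FILE holds one awk condition per line, for example
#   /surveillance data/ && /surveillance technology/ && /cctv camera/
list_matching_files() {
  patterns=$1
  shift
  while IFS= read -r line; do
    # An empty condition would match every line, so skip blank lines.
    [ -n "$line" ] || continue
    awk "$line"'{print FILENAME}' "$@"
  done < "$patterns" | sort -u   # one name per file, however many lines matched
}

# Example call, with the file names used in this answer:
# list_matching_files ./tmp/d1.txt *.txt
```

This still requires all three phrases to occur on the same line of a file, exactly like the commands above.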

Stéphane Chazelas · Answer

I'd use perl, something like:

perl -MFile::Find -MClone=clone -lne '
  # parse the strings.txt input, here looking for the sequences of
  # 0 or more characters (.*?) in between two " characters
  for (/"(.*?)"/g) {
    # @needle is an array of associative arrays whose keys
    # are the "strings" for each line.
    $needle[$n]{$_} = undef;
  }
  $n++;

  END{
    sub wanted {
      return unless -f; # only regular files
      my $needle_clone = clone(\@needle);
      if (open FILE, "<", $_) {
        LINE: while (<FILE>) {
          # read the file line by line
          for (my $i = 0; $i < $n; $i++) {
            for my $s (keys %{$needle_clone->[$i]}) {
              if (index($_, $s)>=0) {
                # if the string is found, we delete it from the associative
                # array.
                delete $needle_clone->[$i]{$s};
                unless (%{$needle_clone->[$i]}) {
                  # if the associative array is empty, that means we have
                  # found all the strings for that $i, that means we can
                  # stop processing, and the file matches
                  print $File::Find::name;
                  last LINE;
                }
              }
            }
          }
        }
        close FILE;
      }
    }
    find(\&wanted, ".")
  }' /path/to/strings.txt

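That keeps the number of string searches to a minimum: a string that has been found is deleted and not searched for again, and a file is abandoned as soon as one combination has been found in full.

Here, we are processing the files line by line. If the files are reasonably small, you could process each one as a whole, which would simplify the code a bit and might improve performance.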

Note that it expects the list file to be in this format:

"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

with any number (it does not have to be 3) of double-quoted strings on each line. The quoted strings themselves cannot contain double-quote characters, and the double quotes are not part of the text being searched for. For example, suppose the list file contained:

"A" "B"
"1" "2" "3"

It would then report the paths of all regular files in the current directory and below that contain either

both A and B
or (not an exclusive or) all of 1, 2 and 3

anywhere in them.

Michael Vehrs · Answer

The problem is a bit awkward, but you could approach it like this:

while read one two three four five six
do
  grep -lF "$one $two" *files* | xargs grep -lF "$three $four" | xargs grep -lF "$five $six"
done < patterns | sort -u

This assumes that your pattern file contains exactly six words per line (three patterns of two words each). The logical AND is achieved by chaining three consecutive filters (grep). Note that this is not particularly efficient. An awk solution would probably be faster.
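The same three-stage AND can also be written without xargs, testing each file with a chain of grep -q calls joined by &&. This is a variant sketch, not the command above: the function name match_six is made up, and the pattern file is assumed to hold six unquoted words per line. It avoids two xargs pitfalls: file names containing spaces get split into several arguments, and some implementations (GNU xargs without -r, for example) still run the next grep once, with no file operands, when an earlier stage lists no files.

```shell
# match_six (hypothetical helper)
# usage: match_six PATTERN_FILE FILE...
# PATTERN_FILE: six unquoted words per line, read as three two-word phrases.
match_six() {
  patterns=$1
  shift
  while read -r one two three four five six; do
    [ -n "$six" ] || continue            # skip blank or short lines
    for f in "$@"; do
      # Each grep -q acts as one filter; the && chain lets a file
      # through only if all three phrases occur somewhere in it.
      grep -qF -e "$one $two" "$f" &&
        grep -qF -e "$three $four" "$f" &&
        grep -qF -e "$five $six" "$f" &&
        printf '%s\n' "$f"
    done
  done < "$patterns" | sort -u
}

# Example call:
# match_six patterns *.txt
```

Adding -i to each grep makes the match case-insensitive, as in the edited command in the question.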

George Vasiliou · Answer

Here is another approach that seems to work in my tests.

I copied your strings file data to a file named d1.txt and moved it to a separate directory (i.e. tmp), so that grep will not later match the strings file against itself (d1.txt).

Then insert a semicolon between the search terms in this strings file (d1.txt in my case) with the following command: sed -i 's/" "/";"/g' ./tmp/d1.txt

$ cat ./tmp/d1.txt
"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"
$ sed -i 's/" "/";"/g' ./tmp/d1.txt
$ cat ./tmp/d1.txt
"surveillance data";"surveillance technology";"cctv camera"
"social media";"surveillance techniques";"enforcement agencies"
"social control";"surveillance camera";"social security"
"surveillance data";"security guards";"social networking"
"surveillance mechanisms";"cctv surveillance";"contemporary surveillance"

Then remove the double quotes with the command sed -i 's/"//g' ./tmp/d1.txt. PS: This may not really be necessary, but I removed the double quotes for my test.

$ sed -i 's/"//g' ./tmp/d1.txt && cat ./tmp/d1.txt
surveillance data;surveillance technology;cctv camera
social media;surveillance techniques;enforcement agencies
social control;surveillance camera;social security
surveillance data;security guards;social networking
surveillance mechanisms;cctv surveillance;contemporary surveillance
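The two sed passes can also be combined into a single command that writes the semicolon-separated form to standard output and leaves the original file as it is. A minimal sketch, in which the helper name to_agrep_patterns is made up for the example:

```shell
# to_agrep_patterns (hypothetical helper): turn lines such as
#   "a b" "c d" "e f"
# into
#   a b;c d;e f
# on standard output, without modifying the input file.
to_agrep_patterns() {
  sed 's/" "/;/g; s/"//g' "$1"
}

# Example call, using the file names from this answer:
# to_agrep_patterns strings.txt > ./tmp/d1.txt
```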

Now you can grep all the files in the current directory using the program agrep, which is designed precisely to provide multi-pattern grep with an AND operation.

agrep requires multiple patterns to be separated by a semicolon ; in order to be evaluated as AND.

For my test, I created two sample files with the following contents:

$ cat d2.txt
This guys over there have the required surveillance technology to do the job.
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
$ cat d3.txt
All surveillance data are locked.
All surveillance data are locked and guarded by security guards.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Running agrep in the current directory returns the correct lines (with AND) and the file names:

$ while IFS= read -r line;do agrep "$line" *;done<./tmp/d1.txt
d2.txt: The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
d3.txt: There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)