Grep strings within subgroups of lines in a txt file

Question

I have a file like this:

AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
SMC_N PF02463.14 x_00004
AAA_29 PF13555.1 x_00004
DUF258 PF03193.11 x_00005
AAA_15 PF13175.1 x_00005
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005
SMC_N PF02463.14 x_00005
AAA_15 PF13175.1 x_00006
AAA_21 PF13304.1 x_00006
AAA_22 PF13401.1 x_00007
SMC_N PF02463.14 x_00007

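Here, for each block of lines that have the same string in column 3 (e.g. x_00004), I want to grep only the lines containing certain strings, and only when those strings occur together within the block.

I know I can use grep -f <file containing string> <file to scan>, but I can't find a way to apply the first step. I think awk could help me here, but I don't really know how to go about it.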

I would like something like this:

AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005

Basically, I want to grep the lines containing PF13304.1 or PF13401.1 only if both occur in lines that share the same field 3.

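PF13304.1 and PF13401.1 are only an example; sometimes I may look for the presence of three strings within a block. One complication is that the strings I am looking for are not necessarily on consecutive lines in the file to scan.

All the strings I want to grep are also listed in a txt file. Since I want to match them with the grep command, I can arrange them as needed.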

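By contrast, the lines

AAA_21 PF13304.1 x_00006
AAA_22 PF13401.1 x_00007

should not be included, because the strings I want to grep do not share field 3 there; that is, they are not both present in subgroup x_00006, nor both present in subgroup x_00007.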

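So, from a logical point of view, I want to:

Open the file
Split the lines into groups according to field 3, so that each group holds the lines with the same string in field 3
Within each subgroup, grep the strings I am looking for, but only if all of them are present in that block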
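
The three steps listed above can be sketched in Python. This is only a minimal illustration under stated assumptions: the data are whitespace-separated with the search strings in field 2 and the group name in field 3, and the names filter_groups and SAMPLE are hypothetical, not part of any answer in this thread.

```python
SAMPLE = """\
AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
SMC_N PF02463.14 x_00004
AAA_29 PF13555.1 x_00004
DUF258 PF03193.11 x_00005
AAA_15 PF13175.1 x_00005
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005
SMC_N PF02463.14 x_00005
AAA_15 PF13175.1 x_00006
AAA_21 PF13304.1 x_00006
AAA_22 PF13401.1 x_00007
SMC_N PF02463.14 x_00007
"""


def filter_groups(text, wanted):
    """Return lines whose field 2 is in `wanted`, restricted to groups
    (field 3) in which every string of `wanted` occurs."""
    wanted = set(wanted)
    groups = {}  # field 3 -> list of (field 2, line), in order of appearance
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        groups.setdefault(fields[2], []).append((fields[1], line))

    kept = []
    for members in groups.values():
        present = {acc for acc, _ in members}
        if wanted <= present:  # all wanted strings occur in this group
            kept.extend(line for acc, line in members if acc in wanted)
    return kept


if __name__ == "__main__":
    print("\n".join(filter_groups(SAMPLE, ["PF13304.1", "PF13401.1"])))
```

Because the lines are collected in a dictionary keyed on field 3, this sketch does not require the lines of a group to be adjacent in the file. It compares field 2 exactly rather than by substring, which is an assumption about where the strings occur.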

Sergiy Kolodyazhnyy · Answer

This can be done fairly simply in Python:

$ cat input.txt | ./find_strings.py PF13304.1 PF13401.1
AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005
AAA_21 PF13304.1 x_00006
AAA_22 PF13401.1 x_00007

Contents of find_strings.py:

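#!/usr/bin/env python
import sys

strings = sys.argv[1:]  # search strings from the command line

for line in sys.stdin:
    for string in strings:
        if string in line:
            print(line.strip())  # call form works on Python 2 and 3
            break  # print a line only once even if several strings match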

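This works by redirecting the contents of the input file to the script's stdin stream, reading the stream line by line, and searching each line for any of the arguments supplied on the command line. A fairly simple approach.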

glenn jackman · Answer

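It's certainly not as simple as grep. This program:

scans the text file and accumulates "blocks" in which the 3rd field is the same string
when a block is complete, invokes grep and collects the output
if the number of output lines equals the number of search terms, prints the grep output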

awk '
    function grep(block, m, grep_out, cmd, line, i) {
        m = 0
        delete grep_out
        cmd = "grep -f " ARGV[1]    # define the grep command
        print block |& cmd          # invoke grep, and send the block of text as stdin
        close(cmd, "to")            # close greps stdin so we can start reading the output
        # read from grep until no more output
        while ((cmd |& getline line) > 0)
            grep_out[m++] = line
        close(cmd)
        # did grep find all search terms? If yes, print the output
        if (length(grep_out) == nterms)
            for (i=0; i<m; i++)
                print grep_out[i]
    }
    # read the search terms file, just to count the number of lines
    NR == FNR {
        nterms++
        next
    }
    # if we detect a new block, call grep and start a new block
    section != $3 {
        if (block) grep(block)
        block = ""
        section = $3
    }
    {block = block $0 RS}           # accumulate the lines in this block
    END {if (block) grep(block)}    # also call grep at end of file
' fileContainingStrings fileToScan

It produces this output:

AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005
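
For readers without gawk (the |& coprocess operator used above is a gawk extension), the same block-by-block idea can be sketched in Python. This is a rough equivalent rather than a translation: grep_blocks is a hypothetical name, matching is by plain substring as with grep's fixed strings, and, like the awk version, the count check assumes each search term matches exactly one line per block.

```python
from itertools import groupby


def grep_blocks(lines, terms):
    """Walk consecutive blocks sharing field 3; keep a block's matching
    lines only when their count equals the number of search terms."""
    # drop blank or malformed lines so that field 3 always exists
    rows = [l.rstrip("\n") for l in lines if len(l.split()) >= 3]
    out = []
    # groupby only merges *adjacent* rows with the same key,
    # which mirrors how the awk script detects a new block
    for _, block in groupby(rows, key=lambda l: l.split()[2]):
        hits = [l for l in block if any(t in l for t in terms)]
        if len(hits) == len(terms):
            out.extend(hits)
    return out


if __name__ == "__main__":
    import sys
    # usage: script.py fileContainingStrings < fileToScan
    with open(sys.argv[1]) as fh:
        search_terms = [t.strip() for t in fh if t.strip()]
    print("\n".join(grep_blocks(sys.stdin, search_terms)))
```

If the groups are not contiguous in the input, sorting on field 3 first would be needed for this sketch to see each group as a single block.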

Thor · Answer

So, if I understand you correctly, you want to find all subgroups that contain all of the patterns you specify. This can be done with sort and awk, for example:

# make sure subgroups are adjacent
sort -k3,3 infile |

# add a newline between subgroups, this allows the next
# invocation of awk to read each subgroup as a record
awk 'NR > 1 && p!=$3 { printf "\n" } { p=$3 } 1' |

# match the desired patterns and print the subgroup name
awk '/\<PF13304\.1\>/ && /\<PF13401\.1\>/ { print $3 }' RS=

Output:

x_00004
x_00005

Based on the output above, you can now extract the relevant lines from infile. Append the following to the pipe above (joined with a further |):

while read sgrp; do
  grep -E "\b(PF13304\.1|PF13401\.1)\b +$sgrp\$" infile
done

Output:

AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005
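
The grep pattern in the loop above hard-codes the escaped search strings. As a hedged illustration of the same extraction step, the following Python sketch builds an equivalent pattern from arbitrary terms using re.escape, so that a dot in a term is matched literally. The function name extract_lines is hypothetical, and the subgroup names are assumed to come from the first stage of the pipeline.

```python
import re


def extract_lines(lines, terms, subgroups):
    """For each subgroup name, return the lines that end with
    '<term> <spaces> <subgroup>' for any of the given terms."""
    # escape each term so characters such as '.' lose their regex meaning
    alternation = "|".join(re.escape(t) for t in terms)
    out = []
    for sgrp in subgroups:
        # same shape as the grep -E pattern: \b(term1|term2)\b +subgroup$
        pattern = re.compile(
            r"\b(?:" + alternation + r")\b +" + re.escape(sgrp) + r"$"
        )
        out.extend(l for l in lines if pattern.search(l))
    return out


if __name__ == "__main__":
    with open("infile") as fh:  # file name taken from the answer above
        rows = [l.rstrip("\n") for l in fh]
    wanted = ["PF13304.1", "PF13401.1"]
    print("\n".join(extract_lines(rows, wanted, ["x_00004", "x_00005"])))
```

As in the shell loop, the output is ordered by subgroup rather than by position in the original file.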

Peter.O · Answer

The following awk script matches the literal strings listed one per line in match_file against data_file:

awk 'function endgroup() {
         gmc=0                              # group match count
         for( gi=1; gi<=gz; gi++ ) {        # step through all lines in a group
             split(group[gi],g)             # split one group line
             for( lix in lms )              # for each literal match string index
                 if( lix == g[2] )          # does literal match string = group record $2
                     mrec[++gmc]=group[gi]  # group matched record array, and inc match count
         }
         if( gmc==lmz ) for( mri=1; mri<=lmz; mri++ ) print mrec[mri]
         delete group; gz=0
     }
     BEGIN{ p3=FS } # an impossible previous value of $3 of "data_file"
     # process "match_file"
     NR==FNR {
         lms[$0]    # build array with literal match strings as indices
         lmz++      # literal match strings array size
         next
     }
     # process "data_file"
     p3!=$3 && p3!=FS { endgroup() }
     { group[++gz]=$0; p3=$3 }
     END{ if( p3!=FS ) endgroup() }
' match_file data_file

Output:

AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005

A.B. · Answer

Something like this?

awk '(/x_00004/ || /x_00005/) && (/PF13401.1/ || /PF13304.1/)' your_file

Or this, which is essentially the same but uses a grouping that is easier to read:

awk '(/x_00004/ && (/PF13401.1/ || /PF13304.1/)) || (/x_00005/ && (/PF13401.1/ || /PF13304.1/))' your_file

Example

Input file

cat foo

AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
SMC_N PF02463.14 x_00004
AAA_29 PF13555.1 x_00004
DUF258 PF03193.11 x_00005
AAA_15 PF13175.1 x_00005
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005
SMC_N PF02463.14 x_00005
AAA_15 PF13175.1 x_00006
AAA_21 PF13304.1 x_00006
AAA_22 PF13401.1 x_00007
SMC_N PF02463.14 x_00007

Command

awk '(/x_00004/ || /x_00005/) && (/PF13401.1/ || /PF13304.1/)' foo

AAA_21 PF13304.1 x_00004
AAA_22 PF13401.1 x_00004
AAA_21 PF13304.1 x_00005
AAA_22 PF13401.1 x_00005