awk: parse and write to another file

Question

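I have records in an XML file like the ones below. I need to search for <keyword>SEARCH</keyword> and, if it is present, take the entire record (from <record> to </record>) and write it to another file.

Below is my awk code, which runs inside a loop. $1 holds each record's value line by line.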

if(index($1,"SEARCH")>0) { print $1>> "output.txt" }

This logic has two problems:

1. It writes only the <keyword>SEARCH</keyword> element to output.txt, not the entire record (from <record> to </record>).
2. SEARCH can also appear inside the <detail> tag, and this code writes that tag to output.txt as well.

XML file:

    <record category="xyz">
      <person ssn="" e-i="E">
        <title xsi:nil="true"/>
        <position xsi:nil="true"/>
        <names>
          <first_name/>
          <last_name></last_name>
        <aliases>
          <alias>CDP</alias>
        </aliases>
        <keywords>
          <keyword xsi:nil="true"/>
          <keyword>SEARCH</keyword>
        </keywords>
        <external_sources>
          <uri>http://www.google.com</uri>
          <detail>SEARCH is present in abc for xyz reason</detail>
        </external_sources>
      </details>
    </record>
    <record category="abc">
      <person ssn="" e-i="F">
        <title xsi:nil="true"/>
        <position xsi:nil="true"/>
        <names>
          <first_name/>
          <last_name></last_name>
        <aliases>
          <alias>CDP</alias>
        </aliases>
        <keywords>
          <keyword xsi:nil="true"/>
          <keyword>DONTSEARCH</keyword>
        </keywords>
        <external_sources>
          <uri>http://www.google.com</uri>
          <detail>SEARCH is not present in abc for xyz reason</detail>
        </external_sources>
      </details>
    </record>

Sobrique · Accepted Answer

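I'm going to assume that what you posted is a sample, because it isn't valid XML. If that assumption is wrong, my answer doesn't apply... but in that case you really need to hit whoever gave you the XML with a rolled-up copy of the XML spec and demand that they "fix it".

But really, awk and regular expressions are not the right tool for this job. An XML parser is. And with a parser, doing what you want is absurdly simple: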

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;

    # Parse your file - this will error if it's invalid.
    my $twig = XML::Twig -> new -> parsefile ( 'your_xml' );
    # Set output format. Optional.
    $twig -> set_pretty_print('indented_a');

    # Iterate all the 'record' nodes off the root.
    foreach my $record ( $twig -> get_xpath ( './record' ) ) {
        # If - beneath this record - we have a node anywhere (that's what // means)
        # with a tag of 'keyword' and content of 'SEARCH',
        # print the whole record.
        if ( $record -> get_xpath ( './/keyword[string()="SEARCH"]' ) ) {
            $record -> print;
        }
    }

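xpath is quite like regular expressions in some ways, but it reads more like a directory path. That means it is context aware and can cope with XML structure.

In the above, ./ means "below the current node", so:

    $twig -> get_xpath ( './record' )

means any "top level" <record> tags.

But .// means "at any level below the current node", so it searches recursively:

    $twig -> get_xpath ( './/search' )

would get any <search> nodes at any level.

The square brackets denote a condition. That is either a function (e.g. text() to get the text of the node) or an attribute. For example, //category[@name] finds any category with a name attribute, and //category[@name="xyz"] filters those further.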
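For readers without Perl at hand, the same path expressions can be tried with Python's standard-library xml.etree.ElementTree, which supports a limited XPath subset. This is only a sketch under stated assumptions: SAMPLE is a cut-down, hypothetical version of the test data, the xsi:nil attributes are left out because ElementTree rejects an undeclared namespace prefix, and the text predicate is written as [.='SEARCH'] (Python 3.7+) rather than the string() form used with XML::Twig.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down sample; not the exact XML from the question.
SAMPLE = """<XML>
  <record category="xyz">
    <keywords><keyword/><keyword>SEARCH</keyword></keywords>
    <detail>SEARCH is present in abc for xyz reason</detail>
  </record>
  <record category="abc">
    <keywords><keyword>DONTSEARCH</keyword></keywords>
    <detail>SEARCH is not present in abc for xyz reason</detail>
  </record>
</XML>"""

root = ET.fromstring(SAMPLE)

# './record' : <record> elements directly below the current node.
top_level = root.findall("./record")

# './/keyword' : <keyword> elements at any depth below the current node.
all_keywords = root.findall(".//keyword")

# '[@category]' and "[@category='xyz']" : conditions on an attribute.
with_category = root.findall("./record[@category]")
only_xyz = root.findall("./record[@category='xyz']")

# "[.='SEARCH']" : condition on the element's own text (Python 3.7+).
matching = [r for r in top_level if r.findall(".//keyword[.='SEARCH']")]
matching_categories = [r.get("category") for r in matching]
print(matching_categories)
```

Because the condition is attached to keyword, the SEARCH text inside <detail> in the second record does not count as a match.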

XML used for testing:

    <XML>
      <record category="xyz">
        <person ssn="" e-i="E">
          <title xsi:nil="true"/>
          <position xsi:nil="true"/>
          <details>
            <names>
              <first_name/>
              <last_name></last_name>
            </names>
            <aliases>
              <alias>CDP</alias>
            </aliases>
            <keywords>
              <keyword xsi:nil="true"/>
              <keyword>SEARCH</keyword>
            </keywords>
            <external_sources>
              <uri>http://www.google.com</uri>
              <detail>SEARCH is present in abc for xyz reason</detail>
            </external_sources>
          </details>
        </person>
      </record>
      <record category="abc">
        <person ssn="" e-i="F">
          <title xsi:nil="true"/>
          <position xsi:nil="true"/>
          <details>
            <names>
              <first_name/>
              <last_name></last_name>
            </names>
            <aliases>
              <alias>CDP</alias>
            </aliases>
            <keywords>
              <keyword xsi:nil="true"/>
              <keyword>DONTSEARCH</keyword>
            </keywords>
            <external_sources>
              <uri>http://www.google.com</uri>
              <detail>SEARCH is not present in abc for xyz reason</detail>
            </external_sources>
          </details>
        </person>
      </record>
    </XML>

Output:

    <record category="xyz">
      <person e-i="E" ssn="">
        <title xsi:nil="true" />
        <position xsi:nil="true" />
        <details>
          <names>
            <first_name/>
            <last_name></last_name>
          </names>
          <aliases>
            <alias>CDP</alias>
          </aliases>
          <keywords>
            <keyword xsi:nil="true" />
            <keyword>SEARCH</keyword>
          </keywords>
          <external_sources>
            <uri>http://www.google.com</uri>
            <detail>SEARCH is present in abc for xyz reason</detail>
          </external_sources>
        </details>
      </person>
    </record>

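Note: the above just prints the matching records to STDOUT. In my opinion that is not such a great idea, not least because it doesn't print the surrounding XML structure, so the result isn't actually "valid" XML if you have more than one record (there is no "root" node).

So instead, to accomplish exactly what you're asking, I would do this: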

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig -> new -> parsefile ('your_file.xml');
    $twig -> set_pretty_print('indented_a');

    # Delete every record that has no <keyword>SEARCH</keyword> beneath it.
    foreach my $record ( $twig -> get_xpath ( './record' ) ) {
        if ( not $record -> findnodes ( './/keyword[string()="SEARCH"]' ) ) {
            $record -> delete;
        }
    }

    # Write what is left of the document, root element included.
    open ( my $output, '>', "output.txt" ) or die $!;
    print {$output} $twig -> sprint;
    close ( $output );

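This inverts the logic instead: it deletes the records you don't want from the parsed data structure in memory, then prints the whole remaining structure (including the XML headers) to a new file called "output.txt".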
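The same "delete what you don't want, then serialize the whole document" idea can be sketched with Python's standard library. This is an illustration rather than a port of the Perl script: keep_matching_records and SAMPLE are hypothetical names, the sample omits the xsi:nil attributes because ElementTree rejects an undeclared namespace prefix, and the keyword is assumed not to contain a single quote.

```python
import xml.etree.ElementTree as ET


def keep_matching_records(xml_text, keyword="SEARCH"):
    """Return the document with every <record> lacking the keyword removed."""
    root = ET.fromstring(xml_text)
    # Iterate over a copy: removing children while iterating the live
    # element would skip siblings.
    for record in list(root.findall("./record")):
        # Assumes keyword contains no single quote (it is pasted into the path).
        if not record.findall(f".//keyword[.='{keyword}']"):
            root.remove(record)
    # Serialize the whole tree, so the root element is preserved.
    return ET.tostring(root, encoding="unicode")


# Hypothetical, cut-down sample; not the exact XML from the question.
SAMPLE = """<XML>
  <record category="xyz">
    <keywords><keyword>SEARCH</keyword></keywords>
    <detail>SEARCH is present in abc for xyz reason</detail>
  </record>
  <record category="abc">
    <keywords><keyword>DONTSEARCH</keyword></keywords>
    <detail>SEARCH is not present in abc for xyz reason</detail>
  </record>
</XML>"""

result = keep_matching_records(SAMPLE)
print(result)
```

The returned string can then be written to output.txt with an ordinary open(..., "w") call. Since the root element survives, the output stays well-formed however many records match.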

Firefly · Answer

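If I've understood correctly, this could be an awk solution:

    /^<record/ {
        x1 = "";
        # Collect lines until one ends with "record>".
        while (match($0, "record>$") == 0) {
            x1 = x1 $0 "\n";
            if ((getline) <= 0) break;   # stop at end of input
        }
        x1 = x1 $0;
        if (x1 ~ />SEARCH</) {
            print x1 > "output.txt";
        }
    }

This extracts each block from <record> to </record> that contains the key "SEARCH" into the output file.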

Costas · Answer

Besides that, awk (like any other text processor) is not an appropriate XML parsing tool:

    awk '
    lines {
        lines = lines "\n" $0
    }
    /<\/record/ {
        if (lines ~ /keyword>SEARCH</)
            print lines
        lines = ""
    }
    /<record/ {
        lines = $0
    }
    ' <input.txt >output.txt

The same with sed:

    sed -n '/<record/{:1;N;/<\/record/!b1;/keyword>SEARCH</p;}' <input.txt >output.txt
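The buffering idea behind the awk and sed one-liners (collect lines from <record to </record, then test the collected block) can be written out step by step in Python. This is a rough sketch, not a parser: extract_records and SAMPLE are hypothetical names, and it assumes that a line never holds the end of one record together with the start of the next.

```python
def extract_records(lines, needle="<keyword>SEARCH</keyword>"):
    """Yield each <record>...</record> block whose text contains needle."""
    buffer = None  # None means "currently outside a record"
    for line in lines:
        # "<record" is not a substring of "</record>", so this only
        # triggers on an opening tag.
        if buffer is None and "<record" in line:
            buffer = []
        if buffer is not None:
            buffer.append(line)
            if "</record" in line:
                block = "\n".join(buffer)
                if needle in block:
                    yield block
                buffer = None


# Hypothetical, cut-down sample with one tag per line.
SAMPLE = """<record category="xyz">
<keyword>SEARCH</keyword>
<detail>SEARCH is present</detail>
</record>
<record category="abc">
<keyword>DONTSEARCH</keyword>
<detail>SEARCH is not present</detail>
</record>"""

blocks = list(extract_records(SAMPLE.splitlines()))
print(len(blocks))
```

Matching on the full <keyword>SEARCH</keyword> string keeps the <detail> text from producing false hits. Like the awk and sed versions, it still depends on how the file happens to be laid out.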