Filtering XML documents that match a specific ID

Question

Suppose I have a file containing many XML documents, like this:

    <a>
      <b>
      ...
    </a>
    in between xml documents there may be plain text log messages
    <x>
      ...
    </x>
    ...

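How can I filter this file to show only those XML documents in which a given regex matches at least one of the document's lines? I'm talking about a simple text match here, so the regex matching part may be entirely unaware of the underlying format, XML.

You can assume that the opening and closing tags of the root element are always on their own lines (possibly padded with white-space), and that they are used only as root elements, i.e. tags with the same name do not appear below the root element. This should make it possible to get the job done without resorting to XML-aware tools.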

igal · Accepted Answer

Summary

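I wrote a Python solution, a Bash solution, and an Awk solution. The idea is the same for all of the scripts: go through the file line by line and use flag variables to keep track of state (i.e. whether or not we're currently inside an XML subdocument and whether or not we've found a matching line).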

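In the Python script I read all of the lines into a list and keep track of the list index where the current XML subdocument begins, so that I can print the current subdocument once we reach its closing tag. I check each line against the regex pattern and use a flag to keep track of whether or not to output the current subdocument when we're done processing it.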
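To make the index tracking concrete, here is a minimal sketch of just that part, separated from the regex matching. The function name `find_subdocuments` and the in-memory `sample` list are illustrative assumptions, not part of the script in this answer; the sketch relies on the question's guarantee that root tags sit on their own lines.

```python
import re

def find_subdocuments(lines):
    """Return (start, end) index pairs, one per XML subdocument."""
    spans = []
    start_index = None   # index of the opening root tag, None when outside
    closing_tag = None   # closing tag we are currently waiting for
    for index, raw in enumerate(lines):
        line = raw.strip()
        if start_index is None:
            # Outside a subdocument: look for a root opening tag
            opening = re.match(r'<(\w+)>$', line)
            if opening:
                start_index = index
                closing_tag = '</' + opening.group(1) + '>'
        elif line == closing_tag:
            # Inside a subdocument and we just reached its end
            spans.append((start_index, index))
            start_index = closing_tag = None
    return spans

# Hypothetical in-memory copy of the test data
sample = [
    "<a>\n",
    "  <b>\n",
    "    string to search for: stuff\n",
    "  </b>\n",
    "</a>\n",
    "in between xml documents there may be plain text log messages\n",
    "<x>\n",
    "  unicode string: øæå\n",
    "</x>\n",
]
print(find_subdocuments(sample))  # [(0, 4), (6, 8)]
```

Each pair can then be used as `lines[start:end + 1]` to recover the subdocument, which is the slicing the full script performs when it prints.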

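In the Bash script I use a temporary file as a buffer to store the current XML subdocument, and I wait until it has been completely written before using grep to check whether it contains a line matching the given regex.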

The Awk script is similar to the Bash script, but I use an Awk array for the buffer instead of a file.
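The buffer-then-test approach of the Bash and Awk scripts can also be sketched in Python, which may make the idea easier to follow. This is a rough illustration under the same assumptions as the question (root tags on their own lines); `grep_documents` and `sample` are hypothetical names, and the in-memory list stands in for the temporary file or Awk array.

```python
import re

def grep_documents(lines, pattern):
    """Yield each XML subdocument (as one string) containing a matching line."""
    regex = re.compile(pattern)
    buffer = None        # lines of the current subdocument, None when outside
    closing_tag = None   # closing tag we are currently waiting for
    for raw in lines:
        line = raw.strip()
        if buffer is None:
            # Outside a subdocument: start buffering at a root opening tag
            opening = re.match(r'<(\w+)>$', line)
            if opening:
                buffer = [raw]
                closing_tag = '</' + opening.group(1) + '>'
        else:
            buffer.append(raw)
            if line == closing_tag:
                # Document complete: test the whole buffer, tag lines included
                if any(regex.search(item) for item in buffer):
                    yield ''.join(buffer)
                buffer = closing_tag = None

# Hypothetical in-memory copy of the test data
sample = [
    "<a>\n",
    "  <b>\n",
    "    string to search for: stuff\n",
    "  </b>\n",
    "</a>\n",
    "in between xml documents there may be plain text log messages\n",
    "<x>\n",
    "  unicode string: øæå\n",
    "</x>\n",
]
for document in grep_documents(sample, 'stuff'):
    print(document, end='')
```

Because it is a generator that keeps only one subdocument in memory at a time, the same function should also work on an open file object or `sys.stdin` instead of a list.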

Test Data File

I checked the scripts against the following data file (data.xml), based on the sample data given in your question:

    <a>
      <b>
        string to search for: stuff
      </b>
    </a>
    in between xml documents there may be plain text log messages
    <x>
      unicode string: øæå
    </x>

Python Solution

Here is a simple Python script that does what you want:

    #!/usr/bin/env python2
    # -*- encoding: ascii -*-
    """xmlgrep.py"""

    import sys
    import re

    invert_match = False
    if sys.argv[1] == '-v' or sys.argv[1] == '--invert-match':
        invert_match = True
        # Drop one argument so the regex and filename positions line up
        sys.argv.pop(0)

    regex = sys.argv[1]

    # Open the XML-ish file (or fall back to stdin if no filename was given)
    with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

        # Read all of the data into a list
        lines = xmlfile.readlines()

        # Use flags to keep track of which XML subdocument we're in
        # and whether or not we've found a match in that document
        start_index = closing_tag = regex_match = False

        # Iterate through all the lines
        for index, line in enumerate(lines):

            # Remove trailing and leading white-space
            line = line.strip()

            # If we have a start_index then we're inside an XML document
            if start_index is not False:

                # If this line is a closing tag then reset the flags
                # and print the document if we found a match
                if line == closing_tag:
                    if regex_match != invert_match:
                        print(''.join(lines[start_index:index+1]))
                    start_index = closing_tag = regex_match = False

                # If this line is NOT a closing tag then we
                # search the current line for a match
                elif re.search(regex, line):
                    regex_match = True

            # If we do NOT have a start_index then we're either at the
            # beginning of a new XML subdocument or we're in between
            # XML subdocuments
            else:

                # Check for an opening tag for a new XML subdocument
                match = re.match(r'^<(\w+)>$', line)
                if match:

                    # Store the current line number
                    start_index = index

                    # Construct the matching closing tag
                    closing_tag = '</' + match.groups()[0] + '>'

Here is how you would run the script to search for the string "stuff":

    python xmlgrep.py stuff data.xml

And here is the output:

    <a>
      <b>
        string to search for: stuff
      </b>
    </a>

Here is how you would run the script to search for the string "øæå":

    python xmlgrep.py øæå data.xml

And here is the output:

    <x>
      unicode string: øæå
    </x>

You can also specify -v or --invert-match to search for the documents that do not match, and you can read from stdin:

    cat data.xml | python xmlgrep.py -v stuff
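The script decides whether to print a subdocument with the comparison `regex_match != invert_match`, which acts as an exclusive-or of the two flags. The small sketch below enumerates all four combinations; the helper name `should_print` is an illustrative assumption and does not appear in the script.

```python
def should_print(regex_match, invert_match):
    """Print a document when exactly one of the two flags is set."""
    return regex_match != invert_match

# Enumerate every combination of the two flags
for regex_match in (False, True):
    for invert_match in (False, True):
        print(regex_match, invert_match, should_print(regex_match, invert_match))
```

Without -v a document is printed only if some line matched; with -v it is printed only if no line matched.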

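Bash Solution

Here is a Bash implementation of the same basic algorithm. It uses a flag to keep track of whether the current line belongs to an XML document, and it uses a temporary file as a buffer to store each XML document while it is being processed.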

    #!/usr/bin/env bash
    # xmlgrep.sh

    # Get the filename and search pattern from the command-line
    FILENAME="$1"
    REGEX="$2"

    # Use flags to keep track of which XML subdocument we're in
    XML_DOC=false
    CLOSING_TAG=""

    # Use a temporary file to store the current XML subdocument
    TEMPFILE="$(mktemp)"

    # Reset the internal field separator to preserve white-space
    export IFS=''

    # Iterate through all the lines of the file
    # (-r keeps backslashes in the input intact)
    while read -r LINE; do

        # If we're already in an XML subdocument then update
        # the temporary file and check to see if we've reached
        # the end of the document
        if "${XML_DOC}"; then

            # Append the line to the temp-file
            echo "${LINE}" >> "${TEMPFILE}"

            # If this line is a closing tag then reset the flags
            if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
                XML_DOC=false
                CLOSING_TAG=""

                # Print the document if it contains the match pattern
                if grep -Pq "${REGEX}" "${TEMPFILE}"; then
                    cat "${TEMPFILE}"
                fi
            fi

        # Otherwise we check to see if we've reached
        # the beginning of a new XML subdocument
        elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

            # Extract the tag-name
            TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

            # Construct the corresponding closing tag
            CLOSING_TAG="</${TAG_NAME}>"

            # Set the XML_DOC flag so we know we're inside an XML subdocument
            XML_DOC=true

            # Start storing the subdocument in the temporary file
            echo "${LINE}" > "${TEMPFILE}"
        fi
    done < "${FILENAME}"

    # Clean up the temporary file
    rm -f "${TEMPFILE}"

Here is how you would run the script to search for the string "stuff":

    bash xmlgrep.sh data.xml 'stuff'

And here is the corresponding output:

    <a>
      <b>
        string to search for: stuff
      </b>
    </a>

Here is how you would run the script to search for the string "øæå":

    bash xmlgrep.sh data.xml 'øæå'

And here is the corresponding output:

    <x>
      unicode string: øæå
    </x>

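Awk Solution

Here is an awk solution. My awk isn't great, so it's pretty rough. It uses the same basic idea as the Bash and Python scripts: it stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document, it prints the document if it contains any lines matching the given regular expression. Here is the script: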

    #!/usr/bin/env gawk
    # xmlgrep.awk

    # Variables:
    #
    #   XML_DOC
    #       XML_DOC=1 if the current line is inside an XML document.
    #
    #   CLOSING_TAG
    #       Stores the closing tag for the current XML document.
    #
    #   BUFFER_LENGTH
    #       Stores the number of lines in the current XML document.
    #
    #   MATCH
    #       MATCH=1 if we found a matching line in the current XML document.
    #
    #   PATTERN
    #       The regular expression pattern to match against (given as a command-line argument).
    #

    # Initialize Variables
    BEGIN{
        XML_DOC=0;
        CLOSING_TAG="";
        BUFFER_LENGTH=0;
        MATCH=0;
    }
    {
        if (XML_DOC==1) {

            # If we're inside an XML block, add the current line to the buffer
            BUFFER[BUFFER_LENGTH]=$0;
            BUFFER_LENGTH++;

            # If we've reached a closing tag, reset the XML_DOC and CLOSING_TAG flags
            if ($0 ~ CLOSING_TAG) {
                XML_DOC=0;
                CLOSING_TAG="";

                # If there was a match then output the XML document.
                # Iterate by index: "for (i in BUFFER)" does not guarantee line order.
                if (MATCH==1) {
                    for (i = 0; i < BUFFER_LENGTH; i++) {
                        print BUFFER[i];
                    }
                }
            }
            # If we found a matching line then update the MATCH flag
            else {
                if ($0 ~ PATTERN) {
                    MATCH=1;
                }
            }
        }
        else {

            # If we reach a new opening tag then start storing the data in the buffer
            if ($0 ~ /<[a-z]+>/) {

                # Set the XML_DOC flag
                XML_DOC=1;

                # Reset the buffer
                delete BUFFER;
                BUFFER[0]=$0;
                BUFFER_LENGTH=1;

                # Reset the match flag
                MATCH=0;

                # Compute the corresponding closing tag
                match($0, /<([a-z]+)>/, match_groups);
                CLOSING_TAG="</" match_groups[1] ">";
            }
        }
    }

Here is how you would call it:

    gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

And here is the corresponding output:

    <x>
      unicode string: øæå
    </x>