Using Python iterparse for large XML files

Question

I need to write a parser in Python that can process some extremely large files (> 2 GB) on a computer without much memory (only 2 GB). I wanted to use iterparse in lxml to do it.

My file has the following format:

    <item>
      <title>Item 1</title>
      <desc>Description 1</desc>
    </item>
    <item>
      <title>Item 2</title>
      <desc>Description 2</desc>
    </item>

and so far my solution is:

    from lxml import etree

    # MYFILE is the path to the large XML file
    context = etree.iterparse(MYFILE, tag='item')

    for event, elem in context:
        # the child tag in the sample file is <desc>
        print(elem.xpath('desc/text()'))

    del context

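Unfortunately, this solution still eats up a lot of memory. I think the problem is that after dealing with each "item" I need to do something to clean up the empty children. Can anyone suggest what I might do after processing my data to clean up properly?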

unutbu · Accepted Answer

Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove its descendants and also removes the preceding siblings.

    from lxml import etree

    def fast_iter(context, func, *args, **kwargs):
        """
        http://lxml.de/parsing.html#modifying-the-tree
        Based on Liza Daly's fast_iter
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                while ancestor.getprevious() is not None:
                    del ancestor.getparent()[0]
        del context

    def process_element(elem):
        # the child tag in the sample file is <desc>
        print(elem.xpath('desc/text()'))

    # MYFILE is the path to the large XML file
    context = etree.iterparse(MYFILE, tag='item')
    fast_iter(context, process_element)

Daly's article is an excellent read, especially if you are processing large XML files.

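Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive about removing other elements that are no longer needed.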

The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.

    import lxml.etree as ET
    import textwrap
    import io

    def setup_ABC():
        content = textwrap.dedent('''\
          <root>
            <A1>
              <B1></B1>
              <C>1<D1></D1></C>
              <E1></E1>
            </A1>
            <A2>
              <B2></B2>
              <C>2<D></D></C>
              <E2></E2>
            </A2>
          </root>
            ''')
        return content

    def show(elem):
        # Serialize to str (not bytes) and leave out the trailing whitespace
        return ET.tostring(elem, encoding='unicode', with_tail=False)

    def study_fast_iter():
        def orig_fast_iter(context, func, *args, **kwargs):
            for event, elem in context:
                print('Processing {e}'.format(e=show(elem)))
                func(elem, *args, **kwargs)
                print('Clearing {e}'.format(e=show(elem)))
                elem.clear()
                while elem.getprevious() is not None:
                    print('Deleting {p}'.format(
                        p=(elem.getparent()[0]).tag))
                    del elem.getparent()[0]
            del context

        def mod_fast_iter(context, func, *args, **kwargs):
            """
            http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
            Author: Liza Daly
            See also http://effbot.org/zone/element-iterparse.htm
            """
            for event, elem in context:
                print('Processing {e}'.format(e=show(elem)))
                func(elem, *args, **kwargs)
                # It's safe to call clear() here because no descendants will be
                # accessed
                print('Clearing {e}'.format(e=show(elem)))
                elem.clear()
                # Also eliminate now-empty references from the root node to elem
                for ancestor in elem.xpath('ancestor-or-self::*'):
                    print('Checking ancestor: {a}'.format(a=ancestor.tag))
                    while ancestor.getprevious() is not None:
                        print(
                            'Deleting {p}'.format(p=(ancestor.getparent()[0]).tag))
                        del ancestor.getparent()[0]
            del context

        # BytesIO needs bytes, so encode the sample document first
        content = setup_ABC().encode('utf-8')
        context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
        orig_fast_iter(context, lambda elem: None)
        # Processing <C>1<D1/></C>
        # Clearing <C>1<D1/></C>
        # Deleting B1
        # Processing <C>2<D/></C>
        # Clearing <C>2<D/></C>
        # Deleting B2

        print('-' * 80)
        """
        The improved fast_iter deletes A1. The original fast_iter does not.
        """
        content = setup_ABC().encode('utf-8')
        context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
        mod_fast_iter(context, lambda elem: None)
        # Processing <C>1<D1/></C>
        # Clearing <C>1<D1/></C>
        # Checking ancestor: root
        # Checking ancestor: A1
        # Checking ancestor: C
        # Deleting B1
        # Processing <C>2<D/></C>
        # Clearing <C>2<D/></C>
        # Checking ancestor: root
        # Checking ancestor: A2
        # Deleting A1
        # Checking ancestor: C
        # Deleting B2

    study_fast_iter()

Steven · Answer

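iterparse() lets you do work while the tree is being built. That means that unless you remove what you no longer need, you will still end up with the whole tree in memory at the end.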

For more information, read this by the author of the original ElementTree implementation (it also applies to lxml).

Stefan · Answer

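In my experience, iterparse with or without element.clear (see F. Lundh and L. Daly) cannot always cope with very large XML files: it goes well for a while, then memory consumption suddenly goes through the roof and a memory error occurs or the system crashes. If you run into the same problem, maybe you can use the same solution: the expat parser. See also F. Lundh or the following example, which uses the OP's XML snippet (plus two umlauts to check that there are no encoding issues):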

    import xml.parsers.expat
    from collections import deque

    def iter_xml(inpath: str, outpath: str) -> None:
        def handle_cdata_end():
            nonlocal in_cdata
            in_cdata = False

        def handle_cdata_start():
            nonlocal in_cdata
            in_cdata = True

        def handle_data(data: str):
            nonlocal in_cdata
            if not in_cdata and open_tags and open_tags[-1] == 'desc':
                # escape backslashes and newlines so each record stays on one line
                data = data.replace('\\', '\\\\').replace('\n', '\\n')
                outfile.write(data + '\n')

        def handle_endtag(tag: str):
            while open_tags:
                open_tag = open_tags.pop()
                if open_tag == tag:
                    break

        def handle_starttag(tag: str, attrs: 'Dict[str, str]'):
            open_tags.append(tag)

        open_tags = deque()
        in_cdata = False
        parser = xml.parsers.expat.ParserCreate()
        parser.CharacterDataHandler = handle_data
        parser.EndCdataSectionHandler = handle_cdata_end
        parser.EndElementHandler = handle_endtag
        parser.StartCdataSectionHandler = handle_cdata_start
        parser.StartElementHandler = handle_starttag
        with open(inpath, 'rb') as infile:
            with open(outpath, 'w', encoding='utf-8') as outfile:
                parser.ParseFile(infile)

    iter_xml('input.xml', 'output.txt')

input.xml:

    <root>
      <item>
        <title>Item 1</title>
        <desc>Description 1ä</desc>
      </item>
      <item>
        <title>Item 2</title>
        <desc>Description 2ü</desc>
      </item>
    </root>

output.txt:

    Description 1ä
    Description 2ü

Elazar Leibovich · Answer

Why not use the "callback" approach of sax?
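As a rough illustration of what the callback approach could look like for the question's file, here is a minimal sketch using the standard library's xml.sax. The class name DescHandler and the helper extract_descriptions are made up for this example, and collecting the results in a list is only for demonstration; for a multi-gigabyte file you would probably process each description inside endElement instead.

```python
import io
import xml.sax


class DescHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <desc> element."""

    def __init__(self):
        super().__init__()
        self.in_desc = False
        self.parts = []
        self.descriptions = []

    def startElement(self, name, attrs):
        if name == "desc":
            self.in_desc = True
            self.parts = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so buffer them.
        if self.in_desc:
            self.parts.append(content)

    def endElement(self, name):
        if name == "desc":
            self.descriptions.append("".join(self.parts))
            self.in_desc = False


def extract_descriptions(source):
    """Parse a file name or binary file object; no tree is ever built."""
    handler = DescHandler()
    xml.sax.parse(source, handler)
    return handler.descriptions


sample = io.BytesIO(
    b"<root>"
    b"<item><title>Item 1</title><desc>Description 1</desc></item>"
    b"<item><title>Item 2</title><desc>Description 2</desc></item>"
    b"</root>"
)
print(extract_descriptions(sample))
```

Because the parser only fires callbacks and never builds a tree, there is nothing to clear afterwards; the price is that you have to track state such as in_desc yourself.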

Ash Upadhyay · Answer

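Note that iterparse still builds a tree, just like parse, but you can safely rearrange or remove parts of the tree while parsing. For example, to parse large files, you can get rid of elements as soon as you have processed them: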

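    from xml.etree.ElementTree import iterparse

    for event, elem in iterparse(source):
        if elem.tag == "record":
            # ... process record elements ...
            elem.clear()

The above pattern has one drawback: it does not clear the root element, so you end up with a single element that has lots of empty child elements. If your files are huge, rather than just large, this might be a problem. To work around this, you need to get your hands on the root element. The easiest way to do this is to enable start events and save a reference to the first element in a variable:

    # get an iterable
    context = iterparse(source, events=("start", "end"))

    # turn it into an iterator
    context = iter(context)

    # get the root element
    event, root = next(context)

    for event, elem in context:
        if event == "end" and elem.tag == "record":
            # ... process record elements ...
            root.clear()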

So this is a question of incremental parsing. This link gives a detailed answer; for a summary, you can refer to the above.
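To make the difference between clearing each element and clearing the root concrete, here is a small sketch that counts how many children are still attached to the root after parsing. It uses the standard library's xml.etree.ElementTree rather than lxml, and the function name children_left_on_root and the generated sample data are assumptions made for this illustration.

```python
import io
import xml.etree.ElementTree as ET


def children_left_on_root(xml_bytes, clear_root):
    """Parse incrementally and return how many children the root still holds."""
    context = iter(ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            # ... process the record here ...
            if clear_root:
                root.clear()   # detaches every record seen so far
            else:
                elem.clear()   # empties the record but leaves it attached
    return len(root)


# Hypothetical sample: 500 small records under one root.
SAMPLE = b"<root>" + b"<record><name>x</name></record>" * 500 + b"</root>"

print(children_left_on_root(SAMPLE, clear_root=False))  # 500 empty records remain
print(children_left_on_root(SAMPLE, clear_root=True))   # 0
```

With elem.clear() the root keeps one empty child per record, which is the drawback described above; with root.clear() those children are dropped as parsing proceeds.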

Jason Argo · Answer

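The only problem with the root.clear() method is that it returns NoneTypes. This means you cannot, for instance, edit the data you parse with string methods like replace() or title(). That said, this is an optimal method to use if you are just parsing the data as is.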
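One possible reading of this is that an element's text becomes None once it has been cleared, so string methods fail if they are called afterwards. Under that assumption, a sketch of a workaround is to copy the string out before calling root.clear(). The function name titled_descriptions and the sample data are invented for this example, and it uses the standard library's xml.etree.ElementTree.

```python
import io
import xml.etree.ElementTree as ET


def titled_descriptions(source):
    """Return each item's <desc> text in title case, clearing the tree as we go."""
    results = []
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # keep a reference to the root element
    for event, elem in context:
        if event == "end" and elem.tag == "item":
            # Copy the string out first; str methods work on the copy.
            desc = elem.findtext("desc", default="")
            results.append(desc.title())
            root.clear()  # now it is safe to drop the parsed elements
    return results


sample = io.BytesIO(
    b"<root>"
    b"<item><title>Item 1</title><desc>first description</desc></item>"
    b"<item><title>Item 2</title><desc>second description</desc></item>"
    b"</root>"
)
print(titled_descriptions(sample))

# After clear(), the text is gone, which is where the None comes from.
leftover = ET.fromstring("<desc>first description</desc>")
leftover.clear()
print(leftover.text)  # None
```

The strings collected before clearing are ordinary Python str objects, so replace(), title() and the like keep working on them after the elements have been discarded.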