Finding elements by attribute with lxml

Question

I need to parse an XML file to extract some data. I only need the elements that have certain attributes. Here is an example of the document:

    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>

Here I only want to retrieve the articles whose type is "news". What is the most efficient and elegant way to do this with lxml?

I tried the find method, but the result is not very nice:

    from lxml import etree

    f = etree.parse("myfile")
    root = f.getroot()
    articles = root.getchildren()[0]
    article_list = articles.findall('article')
    for article in article_list:
        if "type" in article.keys():
            if article.attrib['type'] == 'news':
                content = article.find('content')
                content = content.text

Devin Jeanpierre · Accepted Answer

You can use XPath, e.g. root.xpath("//article[@type='news']")

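This XPath expression returns a list of all <article/> elements that have a "type" attribute with the value "news". You can then iterate over the list to do whatever you need, or pass it on wherever it is needed.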
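As a minimal sketch of the iteration step described above, assuming Python 3 and the sample document from the question (the names XML, news_articles and news_texts are only illustrative):

```python
from lxml import etree

# Sample document from the question.
XML = """
<root>
  <articles>
    <article type="news"><content>some text</content></article>
    <article type="info"><content>some text</content></article>
    <article type="news"><content>some text</content></article>
  </articles>
</root>
"""

root = etree.fromstring(XML)

# xpath() returns a plain Python list of the matching elements.
news_articles = root.xpath("//article[@type='news']")

# Iterate over the matches and read the text of each <content> child.
news_texts = [article.findtext("content") for article in news_articles]
print(len(news_articles), news_texts)
```

Because the result is an ordinary list, it can be looped over, sliced, or handed to another function like any other list.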

To get only the text content, you can extend the XPath like so:

    from lxml import etree

    root = etree.fromstring("""
    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>
    """)

    # text() selects the text nodes of each matching <content> element
    print(root.xpath("//article[@type='news']/content/text()"))

This prints ['some text', 'some text']. If you want the content elements themselves rather than their text, use "//article[@type='news']/content" instead, and so on.
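The element-returning variant can be sketched as follows. The distinct texts ("first", "second", "third") are made up here so the filtering is visible, and the second query uses lxml's XPath variables (keyword arguments to xpath()) as one possible way to avoid building the expression by hand; the variable name t is arbitrary.

```python
from lxml import etree

# Same structure as the question's document, with made-up distinct texts.
root = etree.fromstring(
    "<root><articles>"
    '<article type="news"><content>first</content></article>'
    '<article type="info"><content>second</content></article>'
    '<article type="news"><content>third</content></article>'
    "</articles></root>"
)

# Select the <content> elements themselves rather than their text nodes.
content_elements = root.xpath("//article[@type='news']/content")
content_texts = [element.text for element in content_elements]

# The attribute value can also be supplied as an XPath variable ($t).
wanted_type = "info"
info_texts = root.xpath("//article[@type=$t]/content/text()", t=wanted_type)

print(content_texts)
print(info_texts)
```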

Kjir · Answer

Just for reference, you can get the same result with findall:

    from lxml import etree

    root = etree.fromstring("""
    <root>
        <articles>
            <article type="news">
                <content>some text</content>
            </article>
            <article type="info">
                <content>some text</content>
            </article>
            <article type="news">
                <content>some text</content>
            </article>
        </articles>
    </root>
    """)

    articles = root.find("articles")
    # findall accepts an attribute predicate in its path expression
    article_list = articles.findall("article[@type='news']/content")
    for a in article_list:
        print(a.text)
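To check that the two approaches agree, here is a small sketch that runs both against the same tree. It assumes Python 3, and the element texts are made up for illustration. findall paths are relative to the element they are called on, so the search from the root starts with ".//" here.

```python
from lxml import etree

# Same structure as the question's document, with made-up distinct texts.
root = etree.fromstring(
    "<root><articles>"
    '<article type="news"><content>first</content></article>'
    '<article type="info"><content>second</content></article>'
    '<article type="news"><content>third</content></article>'
    "</articles></root>"
)

# findall: relative path with an attribute predicate, returns elements.
via_findall = [c.text for c in root.findall(".//article[@type='news']/content")]

# xpath: full XPath expression, here returning the text nodes directly.
via_xpath = root.xpath("//article[@type='news']/content/text()")

print(via_findall)
print(via_findall == list(via_xpath))
```

Which one to prefer is largely a matter of taste for a query this simple; xpath() accepts full XPath 1.0 expressions, while findall handles a smaller path syntax.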