Get all text inside a tag in lxml

Question

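I'd like to write a code snippet that grabs all of the text inside the <content> tag with lxml, in all three instances below, including the code tags. I tried tostring(getchildren()), but that misses the text in between the tags. I didn't have much luck searching the API for a relevant function. Could you help me out?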

    <!--1-->
    <content>
    <div>Text inside tag</div>
    </content>
    #should return "<div>Text inside tag</div>"

    <!--2-->
    <content>
    Text with no tag
    </content>
    #should return "Text with no tag"

    <!--3-->
    <content>
    Text outside tag <div>Text inside tag</div>
    </content>
    #should return "Text outside tag <div>Text inside tag</div>"

albertov · Accepted Answer

Try:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        parts = ([node.text] +
                 list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
                 [node.tail])
        # filter removes possible Nones in texts and tails
        return ''.join(filter(None, parts))

Example:

    from lxml import etree
    node = etree.fromstring("""<content> Text outside tag <div>Text <em>inside</em> tag</div> </content>""")
    stringify_children(node)

Produces: ' Text outside tag <div>Text <em>inside</em> tag</div> '

Ed Summers · Answer

Does text_content() do what you need?

Arthur Debert · Answer

Use the node.itertext() method, like so:

    ''.join(node.itertext())

Sandeep · Answer

The following snippet, which uses Python generators, works perfectly and is very efficient.

    ''.join(node.itertext()).strip()
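For reference, here is a minimal sketch of what this gives on the three inputs from the question. The sample strings are condensed from the question and `all_text` is just an illustrative helper name. Note that the inner `<div>` markup is dropped, so this fits only when plain text is wanted rather than the serialized children.

```python
from lxml import etree

samples = [
    "<content><div>Text inside tag</div></content>",
    "<content>Text with no tag</content>",
    "<content>Text outside tag <div>Text inside tag</div></content>",
]

def all_text(node):
    # itertext() yields the text and tails of the element and its
    # descendants in document order; markup is not included
    return ''.join(node.itertext()).strip()

for xml in samples:
    print(repr(all_text(etree.fromstring(xml))))
```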

anana · Answer

A version of albertov's stringify-content that solves the bugs reported by hoju:

    def stringify_children(node):
        from lxml.etree import tostring
        from itertools import chain
        return ''.join(
            chunk for chunk in chain(
                (node.text,),
                chain(*((tostring(child, with_tail=False), child.tail) for child in node.getchildren())),
                (node.tail,)) if chunk)

d3day · Answer

    from urllib.request import urlopen  # urllib2.urlopen in Python 2
    from lxml import etree
    url = 'some_url'

Get the URL:

    test = urlopen(url)
    page = test.read()

Get all the HTML code, including the table tag:

    tree = etree.HTML(page)

XPath selector:

    table = tree.xpath("xpath_here")[0]  # xpath() returns a list of matches
    res = etree.tostring(table)

res is the HTML code of the table; this did the job for me.

So you can use xpath_text() to extract the tag content and tostring() to extract the tags including their content:

    div = tree.xpath("//div")[0]
    div_res = etree.tostring(div)

    text = tree.xpath_text("//content")

or text = tree.xpath("//content/text()")

    div_3 = tree.xpath("//content")[0]
    div_3_res = etree.tostring(div_3, encoding='unicode').strip('<content>').rstrip('</')

This last line with the strip method is not nice, but it just works.

Percival Ulysses · Answer

Defining stringify_children this way may be less complicated:

    from lxml import etree

    def stringify_children(node):
        s = node.text
        if s is None:
            s = ''
        for child in node:
            s += etree.tostring(child, encoding='unicode')
        return s

or in one line

    return (node.text if node.text is not None else '') + ''.join((etree.tostring(child, encoding='unicode') for child in node))

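The rationale is the same as in this answer: leave the serialization of the child nodes to lxml. The tail of node is not interesting in this case, since it lies "behind" the end tag. Note that the encoding argument can be changed to suit your needs.

Another possible solution is to serialize the node itself and then strip away the start and end tags:

    def stringify_children(node):
        s = etree.tostring(node, encoding='unicode', with_tail=False)
        return s[s.index(node.tag) + 1 + len(node.tag): s.rindex(node.tag) - 2]

This is somewhat horrible, though. The code is correct only if node has no attributes, and I don't think anyone would want to use it even then.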

Deepan Prabhu Babu · Answer

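One of the simplest code snippets that actually worked for me, following the documentation at http://lxml.de/tutorial.html#using-xpath-to-find-text, is

    etree.tostring(html, method="text")

where html is the node/tag whose complete text you are trying to read. Note, though, that it does not get rid of script and style tags.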
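A minimal sketch of this call on a made-up fragment, assuming Python 3. Passing `encoding="unicode"` makes `tostring()` return a `str` instead of bytes.

```python
from lxml import etree

node = etree.fromstring(
    "<content>Text outside tag <div>Text <em>inside</em> tag</div></content>"
)

# method="text" drops all markup and keeps only the text and tails
text = etree.tostring(node, method="text", encoding="unicode")
print(text)
```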

bwingenroth · Answer

In response to @Richard's comment above, if you patch stringify_children to read:

      parts = ([node.text] +
    --         list(chain(*([c.text, tostring(c), c.tail] for c in node.getchildren()))) +
    ++         list(chain(*([tostring(c)] for c in node.getchildren()))) +
               [node.tail])

it seems to avoid the duplication he refers to.

Joshmaker · Answer

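I know that this is an old question, but this is a common problem and I have a solution that seems simpler than the ones suggested so far:

    import lxml.html

    def stringify_children(node):
        """Given a LXML tag, return contents as a string

        >>> html = "<p><strong>Sample sentence</strong> with tags.</p>"
        >>> node = lxml.html.fragment_fromstring(html)
        >>> stringify_children(node)
        '<strong>Sample sentence</strong> with tags.'
        """
        if node is None or (len(node) == 0 and not getattr(node, 'text', None)):
            return ""
        node.attrib.clear()  # modifies the node so the opening tag has a known length
        opening_tag = len(node.tag) + 2
        closing_tag = -(len(node.tag) + 3)
        # with_tail=False keeps trailing text from shifting the closing-tag slice
        return lxml.html.tostring(node, encoding='unicode', with_tail=False)[opening_tag:closing_tag]

Unlike some of the other answers to this question, this solution preserves all of the tags contained within the node, and it attacks the problem from a different angle than the other working solutions.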

Hrabal · Answer

lxml has a method for that:

    node.text_content()
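As far as I know, `text_content()` is defined on elements produced by the `lxml.html` parsers rather than on plain `lxml.etree` elements, so the sketch below parses a made-up fragment with `lxml.html.fragment_fromstring`. Like `itertext()`, it returns the text only, without the inner tags.

```python
import lxml.html

node = lxml.html.fragment_fromstring(
    "<div>Text outside tag <p>Text <em>inside</em> tag</p></div>"
)

# text_content() concatenates the text of the element and all its descendants
text = node.text_content()
print(text)
```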

sergzach · Answer

Here is a working solution. We can get the content together with the parent tag and then cut the parent tag from the output.

    import re
    from lxml import etree

    def _tostr_with_tags(parent_element, html_entities=False):
        RE_CUT = r'^<([\w-]+)>(.*)</([\w-]+)>$'
        # tostring() returns ASCII bytes by default; decode so the str patterns apply
        content_with_parent = etree.tostring(parent_element, with_tail=False).decode('ascii')

        def _replace_html_entities(s):
            RE_ENTITY = r'&#(\d+);'

            def repl(m):
                return chr(int(m.group(1)))  # unichr() in Python 2

            replaced = re.sub(RE_ENTITY, repl, s, flags=re.MULTILINE|re.UNICODE)
            return replaced

        if not html_entities:
            content_with_parent = _replace_html_entities(content_with_parent)

        content_with_parent = content_with_parent.strip()  # remove 'white' characters on margins

        start_tag, content_without_parent, end_tag = re.findall(RE_CUT, content_with_parent, flags=re.UNICODE|re.MULTILINE|re.DOTALL)[0]

        if start_tag != end_tag:
            raise Exception('Start tag does not match to end tag while getting content with tags.')

        return content_without_parent

parent_element must be of type Element.

Note that if you want the text content (rather than HTML entities in the text), you can leave the html_entities parameter as False.