Only extracting text from this element, not its children

Question

I want to extract only the text from the topmost element of my soup. However, soup.text gives the text of all child elements as well.

I have:

    import BeautifulSoup
    soup = BeautifulSoup.BeautifulSoup('<html>yes<b>no</b></html>')
    print soup.text

The output of this is yesno. I want simply yes.

What is the best way of achieving this?

Edit: I also want yes to be output when parsing '<html><b>no</b>yes</html>'.

jbochi · Accepted Answer

What about .find(text=True)?

    >>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').find(text=True)
    u'yes'
    >>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').find(text=True)
    u'no'

Edit:

I think I understand what you want now. Try this:

    >>> BeautifulSoup.BeautifulSOAP('<html><b>no</b>yes</html>').html.find(text=True, recursive=False)
    u'yes'
    >>> BeautifulSoup.BeautifulSOAP('<html>yes<b>no</b></html>').html.find(text=True, recursive=False)
    u'yes'

TigrisC · Answer

You could use contents:

    >>> print soup.html.contents[0]
    yes

Or, to get all the text nodes directly under html, use findAll(text=True, recursive=False):

    >>> soup = BeautifulSoup.BeautifulSOAP('<html>x<b>no</b>yes</html>')
    >>> soup.html.findAll(text=True, recursive=False)
    [u'x', u'yes']

Joined up to form a single string:

    >>> ''.join(soup.html.findAll(text=True, recursive=False))
    u'xyes'
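For readers on Python 3, the same idea can be sketched with the newer bs4 package (beautifulsoup4) instead of the BeautifulSoup 3 module used in this thread. This is only a minimal sketch under that assumption: the helper name own_text is made up for illustration, and the built-in "html.parser" backend is an arbitrary choice.

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Join only the strings that are direct children of `tag` (hypothetical helper)."""
    # string=True matches text nodes; recursive=False stops at direct children,
    # so text inside nested tags such as <b> is skipped.
    return "".join(tag.find_all(string=True, recursive=False))


for markup in ("<html>yes<b>no</b></html>", "<html><b>no</b>yes</html>"):
    soup = BeautifulSoup(markup, "html.parser")
    print(own_text(soup.html))
```

Both markup strings should print yes, because the text inside the b tag is never visited.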

mzjn · Answer

You might want to look into lxml's soupparser module, which has XPath support:

    >>> from lxml.html.soupparser import fromstring
    >>> s1 = '<html>yes<b>no</b></html>'
    >>> s2 = '<html><b>no</b>yes</html>'
    >>> soup1 = fromstring(s1)
    >>> soup2 = fromstring(s2)
    >>> soup1.xpath("text()")
    ['yes']
    >>> soup2.xpath("text()")
    ['yes']
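The XPath text() step selects the text nodes that are direct children of the element. In the ElementTree data model, which lxml elements also follow, those nodes are stored as the element's own .text plus the .tail of each child. A rough sketch of that lookup, assuming well-formed markup and using the standard library's xml.etree.ElementTree rather than lxml (direct_text is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET


def direct_text(elem):
    """Return the text nodes that sit directly inside `elem` (hypothetical helper)."""
    parts = []
    if elem.text:
        # Text between the opening tag and the first child.
        parts.append(elem.text)
    for child in elem:
        if child.tail:
            # Text that follows this child's closing tag, still inside `elem`.
            parts.append(child.tail)
    return parts


print(direct_text(ET.fromstring("<html>yes<b>no</b></html>")))
print(direct_text(ET.fromstring("<html><b>no</b>yes</html>")))
```

Unlike soupparser, ET.fromstring raises an error on malformed HTML, so this only illustrates where the text lives and is not a drop-in replacement.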