Using BeautifulSoup to extract text between line breaks (e.g. <br /> tags)

Question

I have the following HTML inside a larger document:

    <br />
    Important Text 1
    <br />
    <br />
    Not Important Text
    <br />
    Important Text 2
    <br />
    Important Text 3
    <br />
    <br />
    Non Important Text
    <br />
    Important Text 4
    <br />

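I'm currently using BeautifulSoup to get at other elements in the HTML, but I haven't been able to find a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each of the <br /> elements, but I can't find a way to get the text in between them. Any help would be greatly appreciated. Thanks.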

Mark Longair · Accepted Answer

If you just want any text that sits between two <br /> tags, you could do something like the following:

    # BeautifulSoup 3 API, Python 2 syntax
    from BeautifulSoup import BeautifulSoup, NavigableString, Tag

    input = '''<br />
    Important Text 1
    <br />
    <br />
    Not Important Text
    <br />
    Important Text 2
    <br />
    Important Text 3
    <br />
    <br />
    Non Important Text
    <br />
    Important Text 4
    <br />'''

    soup = BeautifulSoup(input)

    for br in soup.findAll('br'):
        # The node right after this <br /> must be a text node...
        next_s = br.nextSibling
        if not (next_s and isinstance(next_s,NavigableString)):
            continue
        # ...and the node after that text must be another <br />.
        next2_s = next_s.nextSibling
        if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
            # Skip text nodes that contain only whitespace.
            text = str(next_s).strip()
            if text:
                print "Found:", next_s

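But perhaps I've misunderstood your question? Your description of the problem doesn't seem to match the "important" / "not important" labels in your example data, so I've gone with the description ;)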
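For anyone on Python 3 with the newer `bs4` package, here is a rough port of the loop above. It is only a sketch: the helper name `text_between_brs`, the variable name `html`, and the choice of the built-in `"html.parser"` backend are my own assumptions, not part of the original answer. It applies the same rule (a text node whose neighbours on both sides are `<br />` tags) and returns the stripped text instead of printing the raw node.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />"""


def text_between_brs(markup):
    """Return the stripped text nodes that sit directly between two <br> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for br in soup.find_all("br"):
        # The node right after this <br> must be a text node.
        next_s = br.next_sibling
        if not isinstance(next_s, NavigableString):
            continue
        # The node after that text must be another <br>.
        next2_s = next_s.next_sibling
        if isinstance(next2_s, Tag) and next2_s.name == "br":
            text = str(next_s).strip()
            if text:  # ignore whitespace-only text nodes
                found.append(text)
    return found


for line in text_between_brs(html):
    print("Found:", line)
```

Like the original loop, this makes no attempt to tell "important" lines from unimportant ones; it reports every non-empty text node that is bracketed by `<br />` tags.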

Ken Kinder · Answer

So, for testing purposes, let's assume this chunk of HTML is inside a span tag:

    x = """<span><br />
    Important Text 1
    <br />
    <br />
    Not Important Text
    <br />
    Important Text 2
    <br />
    Important Text 3
    <br />
    <br />
    Non Important Text
    <br />
    Important Text 4
    <br /></span>"""

Now parse it and find the span tag:

    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(x)  # parse the string defined above
    y = soup.find('span')

If you iterate over the generator returned by y.childGenerator(), you get both the br tags and the text:

    In [4]: for a in y.childGenerator(): print type(a), str(a)
       ....:
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> Important Text 1
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'>
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> Not Important Text
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> Important Text 2
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> Important Text 3
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'>
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> Non Important Text
    <type 'instance'> <br />
    <class 'BeautifulSoup.NavigableString'> Important Text 4
    <type 'instance'> <br />
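The same idea can be sketched for Python 3 and `bs4`, where the children are available through the `.children` iterator. The helper name `child_texts` and the `"html.parser"` backend are assumptions made for this illustration. Rather than printing every child, it keeps only the text nodes and drops the ones that are empty after stripping.

```python
from bs4 import BeautifulSoup, NavigableString

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""


def child_texts(markup):
    """Collect the non-empty text children of the first <span> in markup."""
    soup = BeautifulSoup(markup, "html.parser")
    span = soup.find("span")
    if span is None:
        return []
    texts = []
    for child in span.children:
        # Children are either Tag objects (<br>) or NavigableString text nodes.
        if isinstance(child, NavigableString):
            text = str(child).strip()
            if text:
                texts.append(text)
    return texts


for text in child_texts(x):
    print(text)
```

Because the wrapping tag gives a single parent to iterate over, there is no need to hop from sibling to sibling; the trade-off is that you have to know which container holds the lines you want.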

Pontios · Answer

The following worked for me:

    for br in soup.findAll('br'):
        if str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
            print br.contents[0]
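A caveat if you try this with Python 3 and `bs4`: as far as I can tell, with the built-in `"html.parser"` backend a `<br />` is treated as an empty element, so `br.contents` is an empty list and indexing it fails. Below is a hedged sketch that uses a text child when the parser produced one and otherwise falls back to the node that follows the tag. The helper name `text_after_each_br` is made up for this example.

```python
from bs4 import BeautifulSoup, NavigableString


def text_after_each_br(markup):
    """Return the non-empty text found inside or right after each <br> tag."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for br in soup.find_all("br"):
        # Use the first child if the parser nested something inside <br>;
        # otherwise fall back to the node that follows the tag.
        candidate = br.contents[0] if br.contents else br.next_sibling
        if isinstance(candidate, NavigableString):
            text = str(candidate).strip()
            if text:
                results.append(text)
    return results


print(text_after_each_br("<br /> Important Text 1 <br /> <br /> Important Text 2 <br />"))
```

Unlike the accepted answer, this does not require a closing `<br />` after the text, so it also picks up a trailing line that has no `<br />` after it.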