Using BeautifulSoup to find an HTML tag that contains certain text

Question

I'm trying to get the elements in an HTML document that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

So, the previous would match by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

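I'm able to get all the text that matches (see the line above). But I want the parent element of the text to match, so I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to be returned, not the text matches.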

Ideas?

nosklo · Accepted Answer

from BeautifulSoup import BeautifulSoup
import re
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text)
for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
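For anyone on Python 3, the same approach can be sketched with the bs4 package. This is only a minimal sketch, assuming bs4 is installed and the built-in html.parser is acceptable; the helper name `parents_of_matching_text` and its `tag_name` filter are made up for illustration and are not part of the answer above. The filter is there because the text search itself does not care which tag the text sits in.

```python
import re

from bs4 import BeautifulSoup

HTML_TEXT = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

PATTERN = re.compile(r" #\S{11}")


def parents_of_matching_text(soup, pattern, tag_name=None):
    """Return the parent tags of all text nodes that match `pattern`.

    `tag_name` is a hypothetical optional filter: when given, only
    parents with that tag name are kept.
    """
    # Searching by string alone yields text nodes, so step up to the parent.
    parents = [node.parent for node in soup.find_all(string=pattern)]
    if tag_name is not None:
        parents = [tag for tag in parents if tag.name == tag_name]
    return parents


soup = BeautifulSoup(HTML_TEXT, "html.parser")
for elem in parents_of_matching_text(soup, PATTERN, tag_name="h2"):
    print(elem)
```

Calling the helper without `tag_name` keeps every parent, whatever its tag.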

Bruno Bronosky · Answer

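BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.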

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text)
# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')
pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}
print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
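Since a search can hand back either a text node or a tag depending on the criteria and the library version, a small normalizing helper may be handy before calling .parent blindly. The sketch below assumes bs4 (4.4 or later, where `string=` is the current spelling of `text=`) on Python 3; `owning_tag` is a hypothetical helper name, not something the library provides.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString


def owning_tag(node):
    """Return the tag for a search hit: a text node's parent, or the tag itself."""
    return node.parent if isinstance(node, NavigableString) else node


soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

text_hit = soup.find(string=pattern)       # string criterion only: a text node
tag_hit = soup.find("h2", string=pattern)  # tag name plus string: a tag under bs4

# Both hits resolve to the very same <h2> element.
print(owning_tag(text_hit) is owning_tag(tag_hit))
```

Without the helper, calling .parent on the second result would step up past the h2 to whatever contains it.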

T.C. Proctor · Answer

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

from bs4 import BeautifulSoup
import re
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
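Once the h2 tags themselves come back, they can serve as the starting point for walking the tree, which is what the question was after. Here is a rough sketch, assuming bs4 on Python 3; the sample document (each heading followed by a p element) is invented for illustration and is not from the question.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document: each heading is directly followed by a paragraph.
html_text = """
<h2>this is cool #12345678901</h2><p>first body</p>
<h2>this is nothing</h2><p>second body</p>
<h2>this is blah #124445678901</h2><p>third body</p>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

bodies = []
for heading in soup.find_all("h2", string=pattern):
    # Step sideways from the matched heading to the next <p> sibling.
    paragraph = heading.find_next_sibling("p")
    if paragraph is not None:
        bodies.append(paragraph.get_text())

print(bodies)
```

Only the paragraphs that follow a matching heading are collected; the one after "this is nothing" is skipped.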