How do I find tags that contain specific text with Beautiful Soup?

Question

I have the following HTML (line breaks marked with \n):

...
<tr>
  <td class="pos">
    "Some text:"
    <br>
    <strong>some value</strong>
  </td>
</tr>
<tr>
  <td class="pos">
    "Fixed text:"
    <br>
    <strong>text I am looking for</strong>
  </td>
</tr>
<tr>
  <td class="pos">
    "Some other text:"
    <br>
    <strong>some other value</strong>
  </td>
</tr>
...

How do I find the text I am looking for? The code below returns the first value found, so I need to filter by the fixed text ("Fixed text:") somehow.

result = soup.find('td', {'class' :'pos'}).find('strong').text

Update. If I use the following code:

title = soup.find('td', text = re.compile(ur'Fixed text:(.*)', re.DOTALL), attrs = {'class': 'pos'})
self.response.out.write(str(title.string).decode('utf8'))

then it returns just "Fixed text:".

user130076 · Accepted Answer

You can pass a regular expression to the text parameter of findAll, like so:

import BeautifulSoup
import re

columns = soup.findAll('td', text = re.compile('your regex here'), attrs = {'class' : 'pos'})
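The snippet above targets the old BeautifulSoup 3 API. As a rough sketch of the same idea on bs4 with Python 3, one option is to search the text nodes directly with a regex and then walk up to the enclosing cell. The helper name `value_after_label` and the trimmed sample HTML are my own assumptions for illustration, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (assumed sample data).
html = """
<table>
<tr><td class="pos">
 "Some text:"
 <br>
 <strong>some value</strong>
</td></tr>
<tr><td class="pos">
 "Fixed text:"
 <br>
 <strong>text I am looking for</strong>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def value_after_label(soup, label):
    """Return the <strong> text of the td.pos cell containing `label`, or None."""
    pattern = re.compile(re.escape(label))
    # Search text nodes rather than tags: the <td> has several children,
    # so it has no single string of its own to match against.
    node = soup.find(string=pattern)
    if node is None:
        return None
    cell = node.find_parent("td", class_="pos")
    if cell is None:
        return None
    strong = cell.find("strong")
    return strong.get_text(strip=True) if strong else None


print(value_after_label(soup, "Fixed text:"))
```

Escaping the label with `re.escape` keeps characters such as `:` or `.` literal; drop it if you really want to pass a pattern.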

Bruno Bronosky · Answer

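This post led me to my answer, even though the answer itself is missing from this post. I felt I should give back.

The challenge here is the inconsistent behavior of BeautifulSoup.find when searching with and without text.

Note: if you have BeautifulSoup installed, you can test this locally via: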

curl https://Gist.githubusercontent.com/RichardBronosky/4060082/raw/test.py | python

Code: https://Gist.github.com/4060082

# Taken from https://Gist.github.com/4060082
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen
from pprint import pprint
import re

soup = BeautifulSoup(urlopen('https://Gist.githubusercontent.com/RichardBronosky/4060082/raw/test.html').read())

# I'm going to assume that Peter knew that re.compile is meant to cache a computation result for a performance benefit. However, I'm going to do that explicitly here to be very clear.
pattern = re.compile('Fixed text')

# Peter's suggestion here returns a list of what appear to be strings
columns = soup.findAll('td', text=pattern, attrs={'class' : 'pos'})
# ...but it is actually a BeautifulSoup.NavigableString
print type(columns[0])
#>> <class 'BeautifulSoup.NavigableString'>

# you can reach the tag using one of the convenience attributes seen here
pprint(columns[0].__dict__)
#>> {'next': <br />,
#>>  'nextSibling': <br />,
#>>  'parent': <td class="pos">
#>>  "Fixed text:"
#>>  <br />
#>>  <strong>text I am looking for</strong>
#>>  </td>,
#>>  'previous': <td class="pos">
#>>  "Fixed text:"
#>>  <br />
#>>  <strong>text I am looking for</strong>
#>>  </td>,
#>>  'previousSibling': None}

# I feel that 'parent' is safer to use than 'previous' based on http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names
# So, if you want to find the 'text' in the 'strong' element...
pprint([t.parent.find('strong').text for t in soup.findAll('td', text=pattern, attrs={'class' : 'pos'})])
#>> [u'text I am looking for']

# Here is what we have learned:
print soup.find('strong')
#>> <strong>some value</strong>
print soup.find('strong', text='some value')
#>> u'some value'
print soup.find('strong', text='some value').parent
#>> <strong>some value</strong>
print soup.find('strong', text='some value') == soup.find('strong')
#>> False
print soup.find('strong', text='some value') == soup.find('strong').text
#>> True
print soup.find('strong', text='some value').parent == soup.find('strong')
#>> True

It is almost certainly too late to help the OP, but I hope they will accept this as the answer, since it resolves all the quandaries around finding by text.
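The outputs in this answer come from BeautifulSoup 3 on Python 2. For comparison, here is a small sketch of how the same searches behave on bs4 with Python 3, as far as I understand it: a text-only search still gives you a NavigableString, but combining a tag name with `string=` gives you the tag itself. The one-line sample markup is an assumption made up for the demonstration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Minimal assumed markup mirroring one cell from the question.
soup = BeautifulSoup(
    '<td class="pos">Fixed text:<br><strong>some value</strong></td>',
    "html.parser",
)

# Searching by text alone returns the text node.
text_only = soup.find(string="some value")

# Searching by tag name plus text returns the tag whose .string matches.
tag_and_text = soup.find("strong", string="some value")

# The <td> has three children, so its .string is None and this finds nothing.
cell = soup.find("td", string="Fixed text:")

print(type(text_only).__name__, type(tag_and_text).__name__, cell)
```

So on bs4 the `.parent` hop is only needed when you searched without a tag name, and matching a cell by its label is easier through the text node than through the `td` itself.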

QHarr · Answer

With bs4 4.7.1+ you can use the :contains pseudo-class to specify the td that contains your search string:

from bs4 import BeautifulSoup as bs

html = '''
<tr>
<td class="pos">
 "Some text:"
 <br>
 <strong>some value</strong>
</td>
</tr>
<tr>
<td class="pos">
 "Fixed text:"
 <br>
 <strong>text I am looking for</strong>
</td>
</tr>
<tr>
<td class="pos">
 "Some other text:"
 <br>
 <strong>some other value</strong>
</td>
</tr>'''

soup = bs(html, 'lxml')
print(soup.select_one('td:contains("Fixed text:")'))
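If you want the value rather than the whole cell, the selector can be extended to the child `strong`. The sketch below makes two assumptions that are not in the original answer: it uses the built-in `html.parser` so lxml is not required, and it uses the `:-soup-contains` spelling, which to my knowledge replaced `:contains` in Soup Sieve 2.1+ (the older spelling is deprecated there). On older installs, swap the pseudo-class name back.

```python
from bs4 import BeautifulSoup

# Two rows from the question's markup (assumed sample data).
html = '''
<tr><td class="pos">
 "Some text:"
 <br>
 <strong>some value</strong>
</td></tr>
<tr><td class="pos">
 "Fixed text:"
 <br>
 <strong>text I am looking for</strong>
</td></tr>
'''

soup = BeautifulSoup(html, "html.parser")

# Pick the <strong> that is a direct child of the td.pos containing the label.
strong = soup.select_one('td.pos:-soup-contains("Fixed text:") > strong')

# select_one returns None when nothing matches, so guard before reading text.
value = strong.get_text(strip=True) if strong is not None else None
print(value)
```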

Prasad Giri · Answer

Here is a solution for finding anchor tags whose text contains a particular keyword:

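from bs4 import BeautifulSoup
from urllib.request import urlopen,Request
from urllib.parse import urljoin,urlparse

# assumes `soup` (a parsed page) and `keyword` (a string) are already defined
rawLinks=soup.findAll('a',href=True)
for link in rawLinks:
    innercontent=link.text
    # case-insensitive substring match on the link text
    if keyword.lower() in innercontent.lower():
        print(link)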
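The snippet above relies on `soup` and `keyword` existing already. A self-contained version might look like the sketch below; the function name `links_with_keyword` and the sample markup are hypothetical, chosen only to make it runnable without fetching a page.

```python
from bs4 import BeautifulSoup


def links_with_keyword(html, keyword):
    """Return <a href=...> tags whose visible text contains `keyword` (case-insensitive)."""
    soup = BeautifulSoup(html, "html.parser")
    needle = keyword.lower()
    # href=True skips anchors that have no href attribute at all.
    return [a for a in soup.find_all("a", href=True) if needle in a.get_text().lower()]


# Assumed sample markup: one match, one non-match, one anchor without href.
sample = (
    '<a href="/docs">Read the Docs</a>'
    '<a href="/blog">Blog</a>'
    '<a>Docs without href</a>'
)

hrefs = [link["href"] for link in links_with_keyword(sample, "docs")]
print(hrefs)
```

Returning the tags instead of printing them leaves it to the caller to read `href`, the text, or anything else from each match.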

alek vertysh · Answer

result = soup.find('strong', text='text I am looking for').text