BeautifulSoup - Search by text inside a tag

Question

Observe the following problem:

    import re
    from bs4 import BeautifulSoup as BS

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
    Edit
    </a>
    """)

    # This returns the <a> element
    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*")
    )

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
    </a>
    """)

    # This returns None
    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*")
    )

For some reason, BeautifulSoup does not match the text when the <i> tag is present as well. Finding the tag and printing its text gives:

    >>> a2 = soup.find(
    ...     'a',
    ...     href="/customer-menu/1/accounts/1/update"
    ... )
    >>> print(repr(a2.text))
    '\n Edit\n'

Right. According to the docs, soup uses the regular expression's match function, not its search function. So I need to provide the DOTALL flag:

    pattern = re.compile('.*Edit.*')
    pattern.match('\n Edit\n')  # Returns None

    pattern = re.compile('.*Edit.*', flags=re.DOTALL)
    pattern.match('\n Edit\n')  # Returns MatchObject
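For anyone who wants to check the regex behaviour on its own, here is a small sketch using only the standard re module. The sample string is the same shape as the link text printed above; the variable names are just for illustration.

```python
import re

text = "\n Edit\n"  # same shape as the link text shown above

plain = re.compile(".*Edit.*")
dotall = re.compile(".*Edit.*", flags=re.DOTALL)

# match() is anchored at position 0; without DOTALL, "." cannot
# consume the leading newline, so there is no match.
match_plain = plain.match(text)

# search() looks for a match starting anywhere in the string,
# so it succeeds even without DOTALL.
search_plain = plain.search(text)

# With DOTALL, "." also matches newlines, so match() succeeds
# and covers the whole string.
match_dotall = dotall.match(text)

print(match_plain, search_plain, match_dotall)
```

So the DOTALL flag only matters for an anchored match(); with search() a bare "Edit" pattern would behave the same way.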

Alright, looks good. Let's try it with soup:

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
    </a>
    """)

    soup.find(
        'a',
        href="/customer-menu/1/accounts/1/update",
        text=re.compile(".*Edit.*", flags=re.DOTALL)
    )  # Still returns None... Why?!

Edit

My solution, based on geckon's answer: I implemented these helpers:

    import re

    MATCH_ALL = r'.*'


    def like(string):
        """
        Return a compiled regular expression that matches the given
        string with any prefix and postfix, e.g. if string = "hello",
        the returned regex matches r".*hello.*"
        """
        string_ = string
        if not isinstance(string_, str):
            string_ = str(string_)
        regex = MATCH_ALL + re.escape(string_) + MATCH_ALL
        return re.compile(regex, flags=re.DOTALL)


    def find_by_text(soup, text, tag, **kwargs):
        """
        Find the tag in soup that matches all provided kwargs, and contains the
        text.

        If no match is found, return None.
        If more than one match is found, raise ValueError.
        """
        elements = soup.find_all(tag, **kwargs)
        matches = []
        for element in elements:
            if element.find(text=like(text)):
                matches.append(element)
        if len(matches) > 1:
            # Tags have to be converted to str before they can be joined
            raise ValueError(
                "Too many matches:\n" + "\n".join(str(m) for m in matches)
            )
        elif len(matches) == 0:
            return None
        else:
            return matches[0]

Now, when I want to find the element above, I just run find_by_text(soup, 'Edit', 'a', href='/customer-menu/1/accounts/1/update').

geckon · Accepted Answer

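The problem is that your <a> tag with the <i> tag inside doesn't have the string attribute you expect it to have. But first, let's take a look at what the text="" argument of find() does.

NOTE: The text argument is an old name; since BeautifulSoup 4.4.0 it is called string.

From the docs: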

Although string is for finding strings, you can combine it with arguments that find tags: Beautiful Soup will find all tags whose .string matches your value for string. This code finds the <a> tags whose .string is "Elsie":

    soup.find_all("a", string="Elsie")
    # [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
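To see how that rule plays out when a regular expression is passed as string, here is a rough sketch. The two-link markup and the /plain and /icon hrefs are made up for the illustration, and it assumes BeautifulSoup 4.4.0 or later so that the string keyword is available.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one link with only text, one with an icon plus text.
html = (
    '<p>'
    '<a href="/plain">Edit</a> '
    '<a href="/icon"><i class="fa fa-edit"></i> Edit</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# The string filter is compared against each candidate tag's .string,
# so only tags whose .string is a real string can match.
hits = soup.find_all("a", string=re.compile("Edit"))
hrefs = [a["href"] for a in hits]

print(hrefs)
```

Only the plain link should come back here, even though both links visibly contain "Edit".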

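Now let's have a look at what Tag's string attribute is (from the docs again):

If a tag has only one child, and that child is a NavigableString, the child is made available as .string:

    title_tag.string
    # u'The Dormouse's story'

(...)

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None:

    print(soup.html.string)
    # None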

This is exactly your case. Your <a> tag contains some text and an <i> tag, so its .string is None. Therefore find gets None when it tries to compare the string, and the match fails.
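A quick way to confirm this, as a minimal sketch: build both variants of the link from the question and compare .string with get_text(). The variable names are mine, and html.parser is picked only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

HREF = "/customer-menu/1/accounts/1/update"

# Link with a single text child
simple = BeautifulSoup(
    '<a href="%s"> Edit </a>' % HREF, "html.parser"
).a

# Link with an <i> child followed by text
mixed = BeautifulSoup(
    '<a href="%s"><i class="fa fa-edit"></i> Edit </a>' % HREF,
    "html.parser",
).a

simple_string = simple.string  # the only child, a NavigableString
mixed_string = mixed.string    # None, because <a> has two children
mixed_text = mixed.get_text()  # all nested strings joined together

print(repr(simple_string), repr(mixed_string), repr(mixed_text))
```

The text is still there in the second case; it just is not reachable through .string, which is what the text/string filter looks at.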

How to solve this?

Maybe there is a better solution, but I would probably go with something like this:

    import re
    from bs4 import BeautifulSoup as BS

    soup = BS("""
    <a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
    </a>
    """)

    links = soup.find_all('a', href="/customer-menu/1/accounts/1/update")

    thelink = None  # stays None if no link contains "Edit"
    for link in links:
        # find(text=...) looks at the strings nested inside the link
        if link.find(text=re.compile("Edit")):
            thelink = link
            break

    print(thelink)

I assume there are not too many links pointing to /customer-menu/1/accounts/1/update, so it should be fast enough.

styvane · Answer

You can pass .find a function that returns True if the a tag's text contains "Edit":

    In [51]: def Edit_in_text(tag):
       ....:     return tag.name == 'a' and 'Edit' in tag.text
       ....:

    In [52]: soup.find(Edit_in_text, href="/customer-menu/1/accounts/1/update")
    Out[52]:
    <a href="/customer-menu/1/accounts/1/update">
    <i class="fa fa-edit"></i> Edit
    </a>

EDIT:

You can use the .get_text() method instead of text in your function, which gives the same result:

    def Edit_in_text(tag):
        return tag.name == 'a' and 'Edit' in tag.get_text()

Amr · Answer

In one line, using a lambda:

    soup.find(lambda tag: tag.name == "a" and "Edit" in tag.text)
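The one-liner above does not check the href, so on a page with several "Edit" links it returns whichever comes first. A possible variation that also pins the href, as a sketch only: the Delete link and its href are invented so there is something to filter out.

```python
from bs4 import BeautifulSoup

# The second link is from the question; the first is a made-up neighbour.
html = """
<a href="/customer-menu/1/accounts/1/delete"><i class="fa fa-trash"></i> Delete</a>
<a href="/customer-menu/1/accounts/1/update"><i class="fa fa-edit"></i> Edit</a>
"""
soup = BeautifulSoup(html, "html.parser")

target = "/customer-menu/1/accounts/1/update"

# The lambda runs once per tag; tag.get() returns None when the
# attribute is missing, so tags without an href are skipped safely.
link = soup.find(
    lambda tag: tag.name == "a"
    and tag.get("href") == target
    and "Edit" in tag.get_text()
)

print(link)
```

Passing href as a keyword argument next to the function, as in styvane's answer, should work just as well; putting it inside the lambda only keeps all the conditions in one place.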