BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

Question

I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are nested tags inside, but I don't care about them; I just want the inner text.

For example, for:

    <p>Red</p>
    <p><i>Blue</i></p>
    <p>Yellow</p>
    <p>Light <b>green</b></p>

How can I extract:

    Red
    Blue
    Yellow
    Light green

Neither .string nor .contents[0] does what I need. Nor does .extract(), because I don't want to have to specify the inner tags in advance; I want to deal with any that may occur.

Is there a "just get the visible HTML" type of method in BeautifulSoup?

----UPDATE-----

As advised, I tried:

    soup = BeautifulSoup(open("test.html"))
    p_tags = soup.findAll('p', text=True)
    for i, p_tag in enumerate(p_tags):
        print str(i) + p_tag

But that doesn't help. It prints out:

    0Red
    1
    2Blue
    3
    4Yellow
    5
    6Light
    7green
    8

taleinat · Accepted Answer

Short answer: soup.findAll(text=True)

This has already been answered, here on StackOverflow and in the BeautifulSoup documentation.
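For readers on Beautiful Soup 4, the same idea can be sketched as follows. This is an illustration rather than part of the original answer: it assumes the bs4 package with the built-in "html.parser", and it uses find_all(string=True), the Beautiful Soup 4 spelling of findAll(text=True). The sample HTML is taken from the question.

```python
from bs4 import BeautifulSoup

# Sample HTML from the question
txt = "<p>Red</p>\n<p><i>Blue</i></p>\n<p>Yellow</p>\n<p>Light <b>green</b></p>\n"
soup = BeautifulSoup(txt, "html.parser")

# string=True matches every text node inside a tag, however deeply it is nested,
# so joining the matches gives the paragraph's text without any inner tags.
lines = ["".join(p.find_all(string=True)) for p in soup.find_all("p")]

for line in lines:
    print(line)
```

Searching inside each <p> separately, instead of calling it once on the whole soup, avoids picking up the whitespace between paragraphs.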

UPDATE:

To clarify, a working piece of code:

    >>> txt = """\
    ... <p>Red</p>
    ... <p><i>Blue</i></p>
    ... <p>Yellow</p>
    ... <p>Light <b>green</b></p>
    ... """
    >>> import BeautifulSoup
    >>> BeautifulSoup.__version__
    '3.0.7a'
    >>> soup = BeautifulSoup.BeautifulSoup(txt)
    >>> for node in soup.findAll('p'):
    ...     print ''.join(node.findAll(text=True))
    ...
    Red
    Blue
    Yellow
    Light green

Jaymon · Answer

The accepted answer is great, but it is 6 years old now, so here is the current Beautiful Soup 4 version of that answer:

    >>> txt = """\
    ... <p>Red</p>
    ... <p><i>Blue</i></p>
    ... <p>Yellow</p>
    ... <p>Light <b>green</b></p>
    ... """
    >>> from bs4 import BeautifulSoup, __version__
    >>> __version__
    '4.5.1'
    >>> soup = BeautifulSoup(txt, "html.parser")
    >>> print("".join(soup.strings))
    Red
    Blue
    Yellow
    Light green

Codemaker · Answer

Data scraped from a website normally contains tags. To drop those tags and show only the text content, you can use the text attribute.

For example,

    # Python 2 with BeautifulSoup 3
    from BeautifulSoup import BeautifulSoup
    import urllib2

    url = urllib2.urlopen("https://www.python.org")
    content = url.read()
    soup = BeautifulSoup(content)
    title = soup.findAll("title")
    paragraphs = soup.findAll("p")
    print paragraphs[1]       # Second paragraph with tags
    print paragraphs[1].text  # Second paragraph without tags

In this example, I collect all the paragraphs from the Python site and display the second one with and without tags.
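To see how the text attribute differs from the .string attribute mentioned in the question, here is a minimal sketch. It makes two assumptions that are not in the answer above: it uses Beautiful Soup 4 (the bs4 package) with the built-in "html.parser", and it parses an inline snippet instead of a live page.

```python
from bs4 import BeautifulSoup

# One paragraph from the question's sample, mixing plain text and a nested tag
soup = BeautifulSoup("<p>Light <b>green</b></p>", "html.parser")
p = soup.find("p")

string_value = p.string  # None, because the tag has more than one child node
text_value = p.text      # every nested string, concatenated

print(repr(string_value))
print(repr(text_value))
```

This is why .string is not enough for the question's use case: it only returns a value when the tag holds a single string (directly or through a single chain of nested tags).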

erddev · Answer

I stumbled upon this very same problem and wanted to share the 2019 version of this solution. Maybe it helps somebody out.

    # importing the modules
    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    # setting up your BeautifulSoup object
    webpage = urlopen("https://insertyourwebpage.com")
    soup = BeautifulSoup(webpage.read(), features="lxml")
    p_tags = soup.find_all('p')

    for each in p_tags:
        print(str(each.get_text()))

Notice that we go through the result list one element at a time and call get_text() on each element. get_text() strips the tags, so only the text is printed.
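The same loop can be tried without network access. The sketch below is only an illustration: the inline HTML is an assumed stand-in for a downloaded page, and "html.parser" is used so that lxml does not have to be installed. It also shows the optional separator and strip arguments of get_text(), which help when the markup contains stray whitespace.

```python
from bs4 import BeautifulSoup

# Assumed sample input with stray whitespace, standing in for a downloaded page
html = "<p>Red</p><p>  Light <b>green</b>\n</p>"
soup = BeautifulSoup(html, "html.parser")
p_tags = soup.find_all("p")

# Plain get_text() keeps whitespace exactly as it appears in the markup
raw = [p.get_text() for p in p_tags]

# With a separator and strip=True, each text fragment is stripped of
# surrounding whitespace and the fragments are joined with one space
clean = [p.get_text(" ", strip=True) for p in p_tags]

print(raw)
print(clean)
```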

Also:

In bs4, it is better to use the updated find_all() than the older findAll().
urllib2 was replaced by urllib.request and urllib.error in Python 3.

For the sample HTML from the question, the output should now be:

Red
Blue
Yellow
Light green

I hope this helps somebody looking for an updated solution.

toyotasupra · Answer

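First, convert the HTML to a string using str. Then use the following code in your program:

    import re

    x = str(soup.find_all('p'))
    content = str(re.sub("<.*?>", "", x))

This is called a regex. This one removes every HTML tag, that is, everything from a < up to the next >, and leaves the text between the tags in place.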
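To make the effect of that pattern concrete, here is a small standalone sketch that needs only the standard library. The input string is an assumption: it is roughly what str(soup.find_all('p')) produces for the question's sample HTML, namely a list-style representation with brackets and commas.

```python
import re

# Assumed input: approximately the string form of the list of <p> tags
x = "[<p>Red</p>, <p><i>Blue</i></p>, <p>Yellow</p>, <p>Light <b>green</b></p>]"

# "<.*?>" matches a "<", then as few characters as possible, then a ">",
# so each tag is removed on its own and the text between tags survives.
content = re.sub(r"<.*?>", "", x)

print(content)
```

The brackets and commas of the list representation stay in the result, and a regex like this can misfire on markup that contains < or > inside attribute values or comments, so the parser-based approaches in the other answers are generally more robust.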