Remove a tag using BeautifulSoup but keep its contents

Question

Currently I have code that does something like this:

    soup = BeautifulSoup(value)
    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.extract()
    soup.renderContents()

The problem is that I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep its contents when calling soup.renderContents()?

Jesse Dhillon · Accepted Answer

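The strategy I used is to replace a tag with its contents if they are of type NavigableString; if they aren't, recurse into them and replace their contents with NavigableString, and so on. Try this: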

    from BeautifulSoup import BeautifulSoup, NavigableString

    def strip_tags(html, invalid_tags):
        soup = BeautifulSoup(html)

        for tag in soup.findAll(True):
            if tag.name in invalid_tags:
                s = ""

                for c in tag.contents:
                    if not isinstance(c, NavigableString):
                        c = strip_tags(unicode(c), invalid_tags)
                    s += unicode(c)

                tag.replaceWith(s)

        return soup

    html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
    invalid_tags = ['b', 'i', 'u']
    print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

slacy · Answer

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So you could do something like this:

    html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
    invalid_tags = ['b', 'i', 'u']
    soup = BeautifulSoup(html)
    for tag in invalid_tags:
        for match in soup.findAll(tag):
            match.replaceWithChildren()
    print soup

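It looks like it behaves the way you want, and the code is fairly straightforward (although it makes one pass through the DOM per tag name, which could easily be optimized).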
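In BeautifulSoup 4 the same operation is spelled unwrap(), and find_all() accepts a list of tag names, so the loop can run as a single search. A minimal sketch, assuming the bs4 package and the built-in "html.parser" backend; the helper name unwrap_tags is made up for illustration and does not come from the answer:

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, invalid_tags):
    """Remove every tag named in invalid_tags but keep what it contains."""
    # Assumption: "html.parser" is used so no <html>/<body> wrapper is added.
    soup = BeautifulSoup(html, "html.parser")
    # A list of names matches any of them, so one search covers all tags.
    for match in soup.find_all(invalid_tags):
        match.unwrap()  # replace the tag with its own children
    return str(soup)


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(unwrap_tags(html, ["b", "i", "u"]))
# <p>Good, bad, and ugly</p>
```

The list of matches is collected before any tag is unwrapped, so nested invalid tags such as the b inside the i are still handled after their parent has been removed.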

corford · Answer

Others have already mentioned this in the comments, but I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.

    import bleach

    html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
    clean = bleach.clean(html, tags=[], strip=True)
    print clean  # Should print: "Bad Ugly Evil()"

Etienne · Answer

I have a simpler solution, but I don't know whether it has a drawback.

UPDATE: there is a drawback; see Jesse Dhillon's comment. Another solution is to use Mozilla's Bleach instead of BeautifulSoup.

    from BeautifulSoup import BeautifulSoup

    VALID_TAGS = ['div', 'p']

    value = '<div><p>Hello <b>there</b> my friend!</p></div>'

    soup = BeautifulSoup(value)

    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.replaceWith(tag.renderContents())

    print soup.renderContents()

This prints <div><p>Hello there my friend!</p></div>, as desired.

jimmy · Answer

You can use soup.text.

.text removes all tags and concatenates all of the text.
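Because every tag is dropped, including the ones you might want to keep, this only fits when plain text is the goal. A small sketch, assuming bs4 with the "html.parser" backend (the answer does not say which version it targets):

```python
from bs4 import BeautifulSoup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
soup = BeautifulSoup(html, "html.parser")

# soup.text is a property that returns the same string as get_text().
plain = soup.get_text()
print(plain)
# Good, bad, and ugly
```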

Alex Martelli · Answer

You'll presumably have to move the tag's children so that they become children of the tag's parent before you remove the tag.

If so, inserting the contents in the right place is tricky, but something like this should work:

    from BeautifulSoup import BeautifulSoup

    VALID_TAGS = 'div', 'p'

    value = '<div><p>Hello <b>there</b> my friend!</p></div>'

    soup = BeautifulSoup(value)

    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            for i, x in enumerate(tag.parent.contents):
                if x == tag: break
            else:
                print "Can't find", tag, "in", tag.parent
                continue
            for r in reversed(tag.contents):
                tag.parent.insert(i, r)
            tag.extract()
    print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div>, as desired.
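The same idea of moving the children up before removing the tag can be sketched for BeautifulSoup 4, where insert_before() takes care of finding the right position. This assumes the bs4 package and the "html.parser" backend; hoist_children is a hypothetical helper name, not part of the answer:

```python
from bs4 import BeautifulSoup

VALID_TAGS = ("div", "p")


def hoist_children(html, valid_tags=VALID_TAGS):
    """Remove every tag not in valid_tags by moving its children up."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        if tag.name in valid_tags:
            continue
        # Copy the child list first: moving a child mutates tag.contents.
        for child in list(tag.contents):
            tag.insert_before(child)  # child now sits just before the tag
        tag.extract()  # the tag is empty at this point, so drop it
    return str(soup)


value = "<div><p>Hello <b>there</b> my friend!</p></div>"
print(hoist_children(value))
# <div><p>Hello there my friend!</p></div>
```

Inserting each child immediately before the tag, in order, preserves the original sequence without tracking an index by hand.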

Bishwas Mishra · Answer

Use unwrap.

unwrap removes a single occurrence of a tag (the one it is called on) and keeps its contents.

Example:

    >>> soup = BeautifulSoup('Hi. This is a <nobr> nobr </nobr>')
    >>> soup
    <html><body><p>Hi. This is a <nobr> nobr </nobr></p></body></html>
    >>> soup.nobr.unwrap()
    <nobr></nobr>
    >>> soup
    <html><body><p>Hi. This is a  nobr </p></body></html>

Olof Sjöbergh · Answer

None of the proposed answers seemed to work with BeautifulSoup for me. Here's a version that works with BeautifulSoup 3.2.1. It also inserts a space when joining content from different tags, instead of running the words together.

    import re
    from BeautifulSoup import BeautifulSoup

    def strip_tags(html, whitelist=[]):
        """
        Strip all HTML tags except for a list of whitelisted tags.
        """
        soup = BeautifulSoup(html)

        for tag in soup.findAll(True):
            if tag.name not in whitelist:
                tag.append(' ')
                tag.replaceWithChildren()

        result = unicode(soup)

        # Clean up any repeated spaces and spaces like this: '<a>test </a> '
        result = re.sub(' +', ' ', result)
        result = re.sub(r' (<[^>]*> )', r'\1', result)
        return result.strip()

Example:

    strip_tags('<h2><a><span>test</span></a> testing</h2><p>again</p>', ['a'])
    # result: u'<a>test</a> testing again'
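For readers on Python 3, the same approach might look like the sketch below. It assumes the bs4 package (where replaceWithChildren() is called unwrap()) and the "html.parser" backend, neither of which the answer itself uses:

```python
import re

from bs4 import BeautifulSoup


def strip_tags(html, whitelist=()):
    """Strip all tags except those in whitelist, spacing the joined text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.append(" ")  # keeps neighbouring words from running together
            tag.unwrap()     # bs4 name for replaceWithChildren()
    result = str(soup)
    # Collapse runs of spaces into one.
    result = re.sub(" +", " ", result)
    # Remove a space that directly precedes a tag which is itself
    # followed by a space, e.g. 'test </a> ' becomes 'test</a> '.
    result = re.sub(r" (<[^>]*> )", r"\1", result)
    return result.strip()


print(strip_tags("<h2><a><span>test</span></a> testing</h2><p>again</p>", ["a"]))
# <a>test</a> testing again
```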

robus gauli · Answer

This is a better solution for stripping tags while keeping their content, with no hassle or boilerplate code.

    for p_tags in div_tags.find_all("p"):
        print(p_tags.get_text())

That's it. You are rid of all the br, i, and b tags inside the parent tags and get clean text.

Dom DaFonte · Answer

A Python 3 friendly version of this function:

    from bs4 import BeautifulSoup, NavigableString

    invalidTags = ['br', 'b', 'font']

    def stripTags(html, invalid_tags):
        soup = BeautifulSoup(html, "lxml")
        for tag in soup.findAll(True):
            if tag.name in invalid_tags:
                s = ""
                for c in tag.contents:
                    if not isinstance(c, NavigableString):
                        c = stripTags(str(c), invalid_tags)
                    s += str(c)
                tag.replaceWith(s)
        return soup

Tommz · Answer

This is an old question, but it is worth mentioning better ways to do it. First of all, BeautifulSoup 3.* is no longer being developed, so you should use BeautifulSoup 4.*, known as bs4, instead.

Also, lxml has just the function you need: the Cleaner class has a remove_tags attribute, which you can set to the tags that should be removed while their content is pulled up into the parent tag.