Beautifulsoup 4: Remove comment tag and its content

Question

The page I am scraping contains the HTML below. How can I remove the comment tag, along with its content, using bs4?

<div class="foo"> cat dog sheep goat <!-- <p>NewPP limit report Preprocessor node count: 478/300000 Post‐expand include size: 4852/2097152 bytes Template argument size: 870/2097152 bytes Expensive parser function count: 2/100 ExtLoops count: 6/100 </p> --> </div>

alecxe · Accepted Answer

You can use extract() (the solution is based on this answer):

PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
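To make the return value concrete, here is a minimal sketch, assuming bs4 is installed and using the built-in "html.parser". The sample markup and the variable names are illustrative only and are not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment

# Illustrative markup, not taken from the question.
html = '<div class="foo"> cat dog <!-- hidden note --> </div>'
soup = BeautifulSoup(html, "html.parser")

# Locate the first Comment node in the tree.
comment = soup.find(string=lambda s: isinstance(s, Comment))

# extract() detaches the node from the tree and hands it back.
removed = comment.extract()

print(removed is comment)  # the same object is returned
print(removed.parent)      # the detached node no longer has a parent
print(soup)                # the remaining tree, without the comment
```

Because the removed node is returned, it can be inspected or logged before being discarded.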

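    from bs4 import BeautifulSoup, Comment

    data = """<div class="foo"> cat dog sheep goat <!-- <p>test</p> --> </div>"""
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(data, "html.parser")

    div = soup.find('div', class_='foo')
    # Calling a tag is shorthand for find_all(); match only Comment strings.
    for element in div(string=lambda text: isinstance(text, Comment)):
        element.extract()

    print(soup.prettify())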

As a result, you get the div without the comment:

<div class="foo"> cat dog sheep goat </div>
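If the same cleanup is needed in several places, the loop can be wrapped in a small helper. The sketch below is one possible shape, not part of the original answer: the function name strip_comments and the sample markup are hypothetical, and it assumes bs4 4.4.0 or later for the string= argument.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(root):
    """Remove every comment beneath `root` in place and return how many were removed."""
    # find_all() searches recursively and returns a list, so extracting
    # while looping over the result is safe.
    comments = root.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        comment.extract()
    return len(comments)


# Illustrative markup with comments at different depths.
html = """<div class="foo"> cat <!-- one --> <span>dog<!-- two --></span></div>
<p>sheep<!-- three --></p>"""
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", class_="foo")
removed_in_div = strip_comments(div)  # only comments inside the div
removed_rest = strip_comments(soup)   # whatever is left in the document

print(removed_in_div, removed_rest)
print(soup)
```

Passing a single tag limits the cleanup to that subtree, while passing the soup object cleans the whole document.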

roippi · Answer

Usually you do not need to modify the bs4 parse tree. If all you want is the div's text, you can read it directly:

    soup.body.div.text
    Out[18]: '\ncat dog sheep goat\n\n'

bs4 keeps the comment separate from the text. However, if you really do need to modify the parse tree:

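    from bs4 import Comment

    # Copy the children into a list first: extracting nodes while walking
    # the live .children iterator can skip siblings.
    for child in list(soup.body.div.children):
        if isinstance(child, Comment):
            child.extract()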
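The point that the text can be read without touching the tree can be checked with a short sketch. It assumes bs4 with the built-in "html.parser" and uses soup.div rather than soup.body.div, since this illustrative fragment has no body element.

```python
from bs4 import BeautifulSoup, Comment

data = """<div class="foo"> cat dog sheep goat <!-- <p>test</p> --> </div>"""
soup = BeautifulSoup(data, "html.parser")

# .text gathers the ordinary strings and leaves comments out.
text = soup.div.text
words = text.split()
print(words)

# The parse tree itself is unchanged: the comment node is still present.
still_there = soup.find(string=lambda s: isinstance(s, Comment))
print(still_there is not None)
```

This approach suits cases where only the visible text matters; extract() is the better fit when the cleaned markup itself has to be written back out.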

Vanjith · Answer

If you are looking for a BeautifulSoup version 3 solution, here is one adapted from this answer and from the BS3 documentation on comments:

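    import re
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2 only

    soup = BeautifulSoup("""Hello! <!--I've got to be Nice to get what I want.-->""")
    # Find the comment by its text, then take its class to get BS3's Comment type.
    # The pattern must actually occur in the comment, otherwise find() returns None.
    comment = soup.find(text=re.compile("Nice"))
    Comment = comment.__class__
    for element in soup(text=lambda text: isinstance(text, Comment)):
        element.extract()
    print soup.prettify()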