Stop BeautifulSoup from automatically adding html, head, and body tags

Question

When I use BeautifulSoup with html5lib, it automatically adds the html, head, and body tags:

    BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>

Is there an option I can set to turn this behavior off?

unutbu · Accepted Answer

    In [35]: import bs4 as bs
    In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
    Out[36]: <h1>FOO</h1>

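This parses the HTML with Python's built-in HTML parser. Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn't even bother to add an <html> tag.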
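To see the difference the docs describe, here is a minimal sketch that runs the same fragment through each parser. The helper name `parse_with` is made up for illustration, and the sketch assumes lxml and html5lib may or may not be installed, so it catches bs4's `FeatureNotFound` and skips a missing parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound

MARKUP = "<h1>FOO</h1>"


def parse_with(parser, markup=MARKUP):
    """Return the serialized tree, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is missing.
        return None


# Parse the same fragment with each parser and print what comes back.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, output in results.items():
    print(f"{name:12} {output if output is not None else '(not installed)'}")
```

Only html.parser should return the fragment unchanged. lxml should wrap it in <html><body>, and html5lib should also add an empty <head>.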

Alternatively, you could use the html5lib parser and just select the element after <body>:

    In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
    In [62]: soup.body.next
    Out[62]: <h1>FOO</h1>
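Note that `soup.body.next` returns only the first node inside <body>. If the fragment has several top-level elements, one possible approach is to serialize everything inside <body> with `decode_contents()`. The helper name `inner_html` below is hypothetical, and the fallback for parsers that add no <body> is an assumption added so the function works with any parser.

```python
from bs4 import BeautifulSoup


def inner_html(markup, parser="html5lib"):
    """Parse `markup` and return it without the <html>/<head>/<body> wrapper."""
    soup = BeautifulSoup(markup, parser)
    # html5lib and lxml wrap fragments in <body>; html.parser does not,
    # so fall back to the soup itself when there is no <body>.
    container = soup.body if soup.body is not None else soup
    # decode_contents() serializes the children only, not the container tag.
    return container.decode_contents()
```

For example, `inner_html('<h1>FOO</h1><p>bar</p>')` should give back both elements as one string, with html5lib's repairs applied but without the wrapper tags.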

Martijn Pieters · Answer

Your only option is to not use html5lib to parse the data.

This is a feature of the html5lib library: it repairs incomplete HTML, for example by adding back required elements that are missing.

userlond · Answer

Yet another solution:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p><p>Hi!</p>', 'lxml')

    # content handling example (just for example)
    # replace Google with StackOverflow
    for a in soup.find_all('a'):
        a['href'] = 'http://stackoverflow.com/'
        a.string = 'StackOverflow'

    # Python 3: serialize only the direct children of <body>
    print(''.join(str(i) for i in soup.html.body.find_all(recursive=False)))

theeastcoastwest · Answer

This aspect of BeautifulSoup has always annoyed me.

Here's how I deal with it:

    from bs4 import BeautifulSoup

    # Parse the initial html-formatted string (`html` is your input string)
    soup = BeautifulSoup(html, 'lxml')

    # Do stuff here

    # Extract a string repr of the parsed html object, without the <html> or <body> tags
    html = "".join([str(x) for x in soup.body.children])

A quick breakdown:

    # Iterator object of all tags within the <body> tag (your html before parsing)
    soup.body.children

    # Turn each element into a string object, rather than a BS4.Tag object
    # Note: inclusive of html tags
    str(x)

    # Get a List of all html nodes as string objects
    [str(x) for x in soup.body.children]

    # Join all the string objects together to recreate your original html
    "".join()

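I still don't love this, but it gets the job done. I always run into this when I use BS4 to filter certain elements or attributes out of an HTML document and then need the whole thing back as a string repr rather than a BS4 parsed object.

Hopefully, the next time I Google this, I'll find my answer here.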
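The filter-then-serialize workflow described above can be sketched as a single function. This is only an illustration: the name `clean_fragment` and its parameters are hypothetical, and it swaps in "html.parser" instead of the 'lxml' parser used in the answer, so no wrapper tags are added and `str(soup)` already returns the bare fragment.

```python
from bs4 import BeautifulSoup


def clean_fragment(html, drop_tags=("script",), drop_attrs=("style",)):
    """Remove the given elements and attributes, then return the HTML as a string."""
    # html.parser adds no <html>/<body> wrapper around a fragment.
    soup = BeautifulSoup(html, "html.parser")

    # Detach every element whose name is in drop_tags.
    for tag in soup.find_all(list(drop_tags)):
        tag.extract()

    # Strip unwanted attributes from every remaining element.
    for tag in soup.find_all(True):
        for attr in drop_attrs:
            tag.attrs.pop(attr, None)

    return str(soup)


print(clean_fragment('<p style="color:red">Hi<script>x()</script></p>'))
```

The trade-off is that html.parser is less forgiving of badly broken markup than lxml or html5lib, so the join over `soup.body.children` is still the way to go if you need one of those parsers.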

Jaylin · Answer

If you want the output to look nicer, try this:

    BeautifulSoup([content you want to analyze]).prettify()
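As a concrete version of the line above, here is a small sketch. It assumes "html.parser" as the parser so that no wrapper tags appear in the pretty-printed output, and the sample markup is made up.

```python
from bs4 import BeautifulSoup

markup = "<div><p>Hi</p></div>"
soup = BeautifulSoup(markup, "html.parser")

# prettify() returns a string with each tag and text node on its own line,
# indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```

Keep in mind that prettify() inserts whitespace, so it is best suited to reading the document rather than to producing output that must match the input byte for byte.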