BeautifulSoup innerhtml?

Question

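Let's say I have a page with a div. I can easily get that div with soup.find().

Now that I have the result, I'd like to print the WHOLE innerhtml of that div. In other words, I need a string containing ALL of the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?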

ChrisD · Answer

TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML property might look something like this:

    def innerHTML(element):
        """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
        return element.encode_contents()
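To make the difference between the two methods concrete, here is a minimal sketch. The sample markup, the `id="main"` attribute, and the variable names are my own assumptions for illustration; only `find()`, `decode_contents()`, and `encode_contents()` come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for this example.
html = '<div id="main"><p>Hello <b>world</b></p><span>café</span></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="main")

# Inner HTML as a Python (Unicode) string.
inner_text = div.decode_contents()

# Inner HTML as a UTF-8 encoded bytestring.
inner_bytes = div.encode_contents()

print(inner_text)   # <p>Hello <b>world</b></p><span>café</span>
print(inner_bytes)  # b'<p>Hello <b>world</b></p><span>caf\xc3\xa9</span>'
```

If you are going to print the result or do further string processing in Python 3, `decode_contents()` is usually the more convenient of the two.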

These functions aren't currently in the online documentation, so I'll quote the current function definitions and docstrings from the code.

`encode_contents` - since 4.0.4

    def encode_contents(
        self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
        formatter="minimal"):
        """Renders the contents of this tag as a bytestring.

        :param indent_level: Each line of the rendering will be
           indented this many spaces.

        :param encoding: The bytestring will be in this encoding.

        :param formatter: The output formatter responsible for converting
           entities to Unicode characters.
        """

See also the documentation on formatters. Unless you want to process the text manually in some way, you will most likely use either formatter="minimal" (the default) or formatter="html" (for HTML entities).
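As a rough illustration of the two formatters mentioned above, the sketch below renders the same element both ways. The sample text and variable names are assumptions of mine, not part of the quoted source.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: non-ASCII letters plus an escaped ampersand.
soup = BeautifulSoup("<div><p>caf\u00e9 &amp; cr\u00e8me</p></div>", "html.parser")
div = soup.find("div")

# "minimal" only escapes what is needed to keep the markup valid (&, <, >).
minimal = div.decode_contents(formatter="minimal")

# "html" additionally converts characters to named HTML entities where possible.
as_entities = div.decode_contents(formatter="html")

print(minimal)      # <p>café &amp; crème</p>
print(as_entities)  # <p>caf&eacute; &amp; cr&egrave;me</p>
```

The same `formatter` argument is accepted by `encode_contents()`.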

encode_contents returns an encoded bytestring. If you want a Python Unicode string, use decode_contents instead.

`decode_contents` - since 4.0.1

decode_contents does the same thing as encode_contents, but returns a Python Unicode string instead of an encoded bytestring.

    def decode_contents(self, indent_level=None,
                        eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                        formatter="minimal"):
        """Renders the contents of this tag as a Unicode string.

        :param indent_level: Each line of the rendering will be
           indented this many spaces.

        :param eventual_encoding: The tag is destined to be
           encoded into this encoding. This method is _not_
           responsible for performing that encoding. This information
           is passed in so that it can be substituted in if the
           document contains a <META> tag that mentions the document's
           encoding.

        :param formatter: The output formatter responsible for converting
           entities to Unicode characters.
        """

BeautifulSoup 3

BeautifulSoup 3 doesn't have the functions above; instead it has renderContents.

    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        """Renders the contents of this tag as a string in the given
        encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.

peewhy · Answer

One option is to use something like this:

    innerhtml = "".join([str(x) for x in div_element.contents])
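Here is a small sketch comparing this join-based approach with `decode_contents()`. The sample markup and variable names are assumptions for illustration. As far as I can tell, the two agree when the children are all tags, but `str()` on a bare text node returns the raw text without re-escaping it, so the results can differ when the div directly contains characters such as `&`.

```python
from bs4 import BeautifulSoup

# Case 1: every child is a tag, so both approaches give the same string.
simple = BeautifulSoup("<div><p>One</p><p>Two</p></div>", "html.parser").find("div")
joined_simple = "".join(str(child) for child in simple.contents)

# Case 2: the div has a direct text child containing an escaped ampersand.
escaped = BeautifulSoup("<div>Tom &amp; Jerry <b>x</b></div>", "html.parser").find("div")
joined_escaped = "".join(str(child) for child in escaped.contents)

print(joined_simple)              # <p>One</p><p>Two</p>
print(simple.decode_contents())   # <p>One</p><p>Two</p>
print(joined_escaped)             # Tom & Jerry <b>x</b>
print(escaped.decode_contents())  # Tom &amp; Jerry <b>x</b>
```

If you need output that is still valid HTML, `decode_contents()` looks like the safer choice.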

Pikamander2 · Answer

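If you only need the text (no HTML tags), you can use .text. Note that select() returns a list of matches, so use select_one() to get a single element:

    soup.select_one("div").text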

Michael Litvin · Answer

How about just unicode(x)? Seems to work for me.

Edit: This will give you the outer HTML, not the inner.
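To show the outer/inner distinction from the edit above, here is a minimal sketch. It assumes Python 3, where `str(x)` plays the role of `unicode(x)`; the sample markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="box"><p>Hi</p></div>', "html.parser")
tag = soup.find("div")

# str(tag) includes the element's own opening and closing tags (outer HTML).
outer = str(tag)

# decode_contents() renders only what is inside the element (inner HTML).
inner = tag.decode_contents()

print(outer)  # <div class="box"><p>Hi</p></div>
print(inner)  # <p>Hi</p>
```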

Yahyaa · Answer

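Well, you can also use .get_text() if you want just the TEXT. As with .text, call it on a single element (select_one()) rather than on the list returned by select():

    soup.select_one("div").get_text()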
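A short sketch of the text-only route follows, with made-up markup and variable names. It uses `select_one()` for the first match and a list comprehension when you want the text of every match, since `select()` returns a list of elements rather than a single element.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two divs.
soup = BeautifulSoup("<div>Hello <b>world</b></div><div>Second</div>", "html.parser")

# Text of the first matching div, with all tags stripped.
first_text = soup.select_one("div").get_text()

# Text of every matching div.
all_texts = [div.get_text() for div in soup.select("div")]

print(first_text)  # Hello world
print(all_texts)   # ['Hello world', 'Second']
```

Keep in mind that this discards the markup, so it does not answer the original innerHTML question. It is only useful when the tags are not needed.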
