How do I unescape HTML entities in a string in Python 3.1?

Question

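I have looked all over and only found solutions for Python 2.6 and earlier, but nothing on how to do this in Python 3.X. (I am working on a Win7 box.)

I need to be able to do this in 3.1, preferably without external libraries. Currently I have httplib2 installed and access to curl from the command prompt (that is how I fetch the source code of pages). Unfortunately, curl does not decode HTML entities as far as I know; I could not find a command for that in its documentation.

Yes, I have tried to get Beautiful Soup to work, many times, without success in 3.X. If you could provide EXPLICIT instructions on how to get it working in Python 3 in an MS Windows environment, I would be very grateful.

So, to be clear, I need to turn a string like this: "Suzy &amp; John" into a string like this: "Suzy & John".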

unutbu · Accepted Answer

You can use the function html.unescape.

In Python 3.4+ (thanks to J.F. Sebastian for the update):

    import html
    html.unescape('Suzy &amp; John')  # 'Suzy & John'
    html.unescape('&quot;')           # '"'

In Python 3.3 or older:

    import html.parser
    # HTMLParser.unescape was deprecated in 3.4 and removed in 3.9
    html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python 2:

    import HTMLParser
    HTMLParser.HTMLParser().unescape('Suzy &amp; John')
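If the same script has to run on several of these versions, the three variants above can be folded into one import chain. This is only a sketch; the name `unescape_html` is made up for illustration and is not part of any library.

```python
# Pick whichever unescape implementation the running interpreter offers.
try:
    from html import unescape as _unescape      # Python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser      # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser       # Python 2
    _unescape = HTMLParser().unescape


def unescape_html(text):
    """Hypothetical wrapper: decode HTML entities in text."""
    return _unescape(text)


print(unescape_html('Suzy &amp; John'))
```

On 3.4 and later the first import succeeds, so the removed `HTMLParser.unescape` method is never touched.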

Greg Hewgill · Answer

You can use xml.sax.saxutils.unescape for this purpose. This module is included in the Python standard library and is portable between Python 2.x and Python 3.x.

    >>> import xml.sax.saxutils as saxutils
    >>> saxutils.unescape("Suzy &amp; John")
    'Suzy & John'

Derrick Petzold · Answer

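Apparently I don't have a high enough reputation to do anything but post this. unutbu's answer does not unescape quotation marks. The only thing I found that did was this function:

    # Python 2 only: htmlentitydefs and unichr do not exist in Python 3
    import re
    from htmlentitydefs import name2codepoint as n2cp

    def decodeHtmlentities(string):
        def substitute_entity(match):
            ent = match.group(2)
            if match.group(1) == "#":
                # numeric reference such as &#38;
                return unichr(int(ent))
            else:
                # named reference such as &amp;
                cp = n2cp.get(ent)
                if cp:
                    return unichr(cp)
                else:
                    # unknown name: leave the text unchanged
                    return match.group()
        entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
        return entity_re.subn(substitute_entity, string)[0]

I got it from this page.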
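For the 3.x interpreter the question asks about, the function above needs two renames: `htmlentitydefs` became `html.entities` and `unichr` became `chr`. Below is a possible port, offered as a sketch rather than a tested drop-in. The name `decode_html_entities` and the choice to leave malformed numeric references untouched are my own assumptions.

```python
import re
from html.entities import name2codepoint

# Same pattern as the original: optional '#', then digits or a short name.
_ENTITY_RE = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")


def decode_html_entities(text):
    def substitute(match):
        body = match.group(2)
        if match.group(1) == "#":
            try:
                return chr(int(body))       # decimal reference, e.g. &#38;
            except (ValueError, OverflowError):
                return match.group(0)       # e.g. &#x26; is left as-is
        codepoint = name2codepoint.get(body)
        if codepoint is None:
            return match.group(0)           # unknown name: keep the text
        return chr(codepoint)

    return _ENTITY_RE.sub(substitute, text)


print(decode_html_entities('&quot;Suzy &amp; John&quot;'))
```

Like the original, this does not decode hexadecimal references such as `&#x26;`.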

YOU · Answer

Python 3.x also has html.entities.
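To give an idea of what that module offers: it is a set of lookup tables rather than a decoder. This short sketch only reads two of its dictionaries, `name2codepoint` and `codepoint2name`.

```python
from html.entities import codepoint2name, name2codepoint

# Entity name -> Unicode code point
amp_codepoint = name2codepoint["amp"]
quote_char = chr(name2codepoint["quot"])

# Unicode code point -> entity name
amp_name = codepoint2name[38]

print(amp_codepoint, quote_char, amp_name)
```

You still need your own loop or regular expression to apply these tables to a whole string.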

Simanas · Answer

In my case I have an HTML string that was escaped with the AS3 escape function. After an hour of googling I had not found anything useful, so I wrote this recursive function to serve my needs. Here it is:

    # Python 2 only: str.decode('hex') and str.decode('unicode_escape')
    # are not available on Python 3 strings
    def unescape(string):
        index = string.find("%")
        if index == -1:
            return string
        else:
            # if it is an escaped unicode character, do different decoding
            if string[index+1:index+2] == 'u':
                replace_with = ("\\" + string[index+1:index+6]).decode('unicode_escape')
                string = string.replace(string[index:index+6], replace_with)
            else:
                replace_with = string[index+1:index+3].decode('hex')
                string = string.replace(string[index:index+3], replace_with)
            return unescape(string)

Edit-1: Added functionality to handle Unicode characters.
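On Python 3 the same idea can be written without recursion. This is a rough sketch, assuming the input uses only the `%XX` and `%uXXXX` forms handled above; the name `unescape_percent` is hypothetical.

```python
import re

# %uXXXX (four hex digits) or %XX (two hex digits), tried in that order.
_ESCAPE_RE = re.compile(r"%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})")


def unescape_percent(text):
    def decode(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    # One pass over the string, so decoded output is never re-scanned.
    return _ESCAPE_RE.sub(decode, text)


print(unescape_percent("Suzy%20%26%20John"))
```

A `%` that is not followed by valid hex digits is left alone, so a string like "100%" passes through unchanged. A decoded `%` is also never decoded a second time, which the recursive version does not guarantee.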

TheJacobTaylor · Answer

I'm not sure whether this is a built-in library, but it looks like what you need and it supports 3.1.

From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape

xml.sax.saxutils.unescape(data, entities={}): Unescape '&amp;', '&lt;', and '&gt;' in a string of data.
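The optional `entities` argument in that signature is a dictionary of extra replacements, which is one way to cover the quotation marks that the default call leaves alone. This is a small sketch; `EXTRA_ENTITIES` and `unescape_with_quotes` are names I made up for illustration.

```python
from xml.sax.saxutils import unescape

# Extra entities to decode on top of the built-in &amp;, &lt; and &gt;.
EXTRA_ENTITIES = {"&quot;": '"', "&apos;": "'"}


def unescape_with_quotes(text):
    return unescape(text, EXTRA_ENTITIES)


print(unescape_with_quotes("&quot;Suzy &amp; John&quot;"))
```

Anything not listed in the dictionary (for example `&nbsp;`) stays escaped, so this suits XML-style input better than arbitrary HTML.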