Convert HTML entities to Unicode and vice versa

Question

Possible duplicates:

Convert XML/HTML entities into a Unicode string in Python

HTML entity codes to text

How do you convert HTML entities to Unicode, and vice versa, in Python?

hekevintran · Accepted Answer

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
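The code above targets Python 2 and BeautifulSoup 3. As a rough sketch of the same round trip on Python 3, assuming only the standard-library html module is available, something like the following should behave similarly. The snake_case function names are illustrative renamings, not part of the original answer, and quote=False is chosen here only to mirror the default behaviour of cgi.escape.

```python
import html


def html_entities_to_unicode(text):
    """Convert HTML entities to Unicode, e.g. '&amp;' becomes '&'."""
    return html.unescape(text)


def unicode_to_html_entities(text):
    """Escape &, < and >, then replace non-ASCII characters with numeric references."""
    # quote=False mirrors cgi.escape's default of leaving quotes untouched.
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
```

The trailing .decode("ascii") is there because encode() returns bytes on Python 3; drop it if bytes are what you want to write out.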

Isaac · Answer

As for the "vice versa" part (which I needed myself; that led me to this question, which did not help, and then to another site that had the answer):

u'some string'.encode('ascii', 'xmlcharrefreplace')

This returns a plain string in which every non-ASCII character has been replaced by an XML (HTML) numeric character reference.
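For readers on Python 3, here is a small sketch of the same call. The sample string is made up for illustration. Note that on Python 3 the result of encode() is a bytes object, and that this error handler only touches non-ASCII characters: markup characters such as & and < pass through unchanged.

```python
sample = "Curé «Grâce» €"

# Non-ASCII characters become decimal numeric character references.
as_bytes = sample.encode("ascii", "xmlcharrefreplace")

# Decode again if a str is needed rather than bytes.
as_text = as_bytes.decode("ascii")

# ASCII markup characters are NOT escaped by this handler.
untouched = "a & b < c".encode("ascii", "xmlcharrefreplace")

print(as_text)
print(untouched)
```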

scharfmn · Answer

Update for Python 2.7 and BeautifulSoup4

Unescape: Unicode HTML to Unicode with HTMLParser (Python 2.7 standard library):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape: Unicode HTML to Unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape: Unicode to Unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
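If bs4 is not available, named entities can be approximated with the standard library alone. The following is only a sketch on Python 3 using the html.entities.codepoint2name table; the function name escape_with_named_entities is hypothetical, and this is not a claim about how EntitySubstitution works internally. Characters without a named entity in that table (including the apostrophe) are left as they are.

```python
from html.entities import codepoint2name


def escape_with_named_entities(text):
    """Replace each character that has an HTML 4 named entity with '&name;'."""
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            pieces.append("&%s;" % name)
        else:
            # No named entity known for this character: keep it unchanged.
            pieces.append(ch)
    return "".join(pieces)


unescaped = "Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood"
escaped = escape_with_named_entities(unescaped)
print(escaped)
```

Because the table also contains &, <, > and the double quote, those are escaped as well, so the function should only be applied to text that has not been escaped already.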

AXO · Answer

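As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote encoding is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, however, the function does not escape single quotes ("'"). Because of these issues, the function has been deprecated since version 3.2.

The recommendation is to use html.escape(s) instead of cgi.escape(s). (New in version 3.2.)

html.unescape(s) was also introduced, in version 3.4.

So in Python 3.4 you can:

Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
Use html.unescape(text) to convert HTML entities back to their plain-text representation.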
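The two calls can be sketched together as a round trip. This is a minimal illustration on Python 3.4 or later with a made-up sample string; it also shows that html.escape, unlike the old cgi.escape (which was removed entirely in Python 3.8), escapes both kinds of quotes by default.

```python
import html

text = "\"Curé\" & 'Grâce' <€>"

# Step 1: escape markup characters (quotes included, since quote=True is the default).
escaped = html.escape(text)

# Step 2: turn the remaining non-ASCII characters into numeric references.
ascii_html = escaped.encode("ascii", "xmlcharrefreplace").decode()

# Reverse direction: entities and numeric references back to plain text.
restored = html.unescape(ascii_html)

print(ascii_html)
print(restored == text)
```

Pass quote=False to html.escape if the text goes into element content only and quotes should stay readable.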

brucekaushik · Answer

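If someone like me is wondering why some entity numbers (codes), such as the ones for the trademark symbol and the euro symbol, are not encoded properly, the reason lies in ISO-8859-1 (Windows-1252).

Also note that the default character set for HTML5 is UTF-8, whereas for HTML4 it was ISO-8859-1.

So we have to work around this somehow (find and replace those first).

Reference (starting point) from Mozilla's documentation: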

https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings
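One way to sketch the "find and replace those first" workaround on Python 3 is shown here. It assumes the problematic codes are the decimal references in the 128 to 159 range, which Windows-1252 uses for printable characters such as the euro sign (128) and the trademark sign (153). The function name fix_cp1252_refs and the sample string are hypothetical.

```python
import html
import re

# Candidate decimal references; the exact 128..159 range is checked in repl().
_CP1252_REF = re.compile(r"&#(1[2-5][0-9]);")


def fix_cp1252_refs(text):
    """Replace &#128;..&#159; with the character Windows-1252 assigns to that byte."""
    def repl(match):
        code = int(match.group(1))
        if not 128 <= code <= 159:
            return match.group(0)
        try:
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes in this range are undefined in cp1252: leave them alone.
            return match.group(0)
    return _CP1252_REF.sub(repl, text)


sample = "Acme&#153; costs 5&#128; &#169; 2014"
fixed = fix_cp1252_refs(sample)

# Any remaining, well-defined references can then be unescaped as usual.
plain = html.unescape(fixed)
print(plain)
```

On Python 3.4 and later, html.unescape on its own already maps these references the way the HTML5 parsing rules prescribe, so the pre-pass is mainly useful with older or stricter decoders.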

Stephen Ellwood · Answer

I used the following function to write Unicode text ripped from an xls file into an HTML file while preserving the special characters found in the xls file:

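import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is empty it is replaced with a non-breaking space
        . non-ascii characters are translated to xml
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # Pure ASCII data is written as-is (markup characters are not escaped).
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # Otherwise escape markup and emit numeric character references.
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))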
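The function above writes pure-ASCII data without escaping it, so a cell containing "a < b" ends up in the file as raw markup. A variant that always escapes might look like the following sketch on Python 3. The name write_cell and the sample cells are hypothetical, and an in-memory io.BytesIO buffer stands in for a file opened in binary mode.

```python
import html
import io


def write_cell(f, dat):
    """Write dat to binary file f as HTML-safe ASCII."""
    if not dat:
        # Empty cells become a non-breaking space, as in the original function.
        f.write(b"&nbsp;")
        return
    # Always escape markup, then emit numeric references for non-ASCII characters.
    f.write(html.escape(dat).encode("ascii", "xmlcharrefreplace"))


buf = io.BytesIO()
for cell in ["Curé", "", "a < b"]:
    write_cell(buf, cell)
    buf.write(b"\n")

result = buf.getvalue()
print(result)
```

Whether always escaping is the right choice depends on whether the cells are expected to contain intentional HTML; the original function lets such markup through when it is plain ASCII.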

Hope this helps somebody.