Escaping special HTML characters in Python

Question

I have a string in which special characters such as ' or " or & (...) can appear. In the string:

string = """ Hello "XYZ" this 'is' a test & so on """

How can I automatically escape every special character, so that I get this:

string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "

kennytm · Accepted Answer

In Python 3.2 and later, you can use the _html.escape_ function, e.g.

    >>> string = """ Hello "XYZ" this 'is' a test & so on """
    >>> import html
    >>> html.escape(string)
    ' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
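As a small sketch of how the `quote` argument changes the result: `html.escape` escapes both kinds of quotes by default, and `quote=False` limits it to `&`, `<`, and `>`. The round trip uses `html.unescape`, which assumes Python 3.4 or later. The variable names are only illustrative.

```python
import html

raw = """ Hello "XYZ" this 'is' a test & so on """

# Default (quote=True): &, <, >, " and ' are all escaped.
escaped = html.escape(raw)

# quote=False: only &, < and > are escaped; quotes are left alone.
body_only = html.escape(raw, quote=False)

# html.unescape reverses the escaping (Python 3.4+).
restored = html.unescape(escaped)

print(escaped)
print(body_only)
print(restored == raw)  # True
```

Escaping quotes matters when the text ends up inside an attribute value; for text placed between tags, `quote=False` is usually enough.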

For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:

The cgi module that comes with Python has an escape() function:

    import cgi

    s = cgi.escape("""& < >""")  # s = "&amp; &lt; &gt;"

However, it doesn't escape characters beyond _&_, _<_, and _>_. If it is called as cgi.escape(string_to_escape, quote=True), it also escapes _"_.

Here is a small snippet that also escapes quotes and apostrophes:

    html_escape_table = {
        "&": "&amp;",
        '"': "&quot;",
        "'": "&apos;",
        ">": "&gt;",
        "<": "&lt;",
    }

    def html_escape(text):
        """Produce entities within text."""
        return "".join(html_escape_table.get(c, c) for c in text)

You can also use escape() from _xml.sax.saxutils_ to escape HTML. This function should run faster than the character-by-character lookup above. The unescape() function of the same module can be passed the same kind of arguments to decode a string.

    from xml.sax.saxutils import escape, unescape

    # escape() and unescape() take care of &, < and >.
    html_escape_table = {
        '"': "&quot;",
        "'": "&apos;",
    }
    html_unescape_table = {v: k for k, v in html_escape_table.items()}

    def html_escape(text):
        return escape(text, html_escape_table)

    def html_unescape(text):
        return unescape(text, html_unescape_table)
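A short usage sketch for the saxutils approach, repeating the two helpers so the block runs on its own. The sample string is just an illustration; the point is that escaping and then unescaping gives back the original text.

```python
from xml.sax.saxutils import escape, unescape

# Extra replacements on top of the &, < and > that saxutils handles itself.
html_escape_table = {'"': "&quot;", "'": "&apos;"}
html_unescape_table = {v: k for k, v in html_escape_table.items()}


def html_escape(text):
    return escape(text, html_escape_table)


def html_unescape(text):
    return unescape(text, html_unescape_table)


sample = 'Hello "XYZ" this \'is\' a test & so on <b>'
encoded = html_escape(sample)
decoded = html_unescape(encoded)

print(encoded)
# Hello &quot;XYZ&quot; this &apos;is&apos; a test &amp; so on &lt;b&gt;
print(decoded == sample)  # True
```

escape() replaces `&` first and unescape() restores it last, which is what keeps the round trip from double-encoding or double-decoding.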

Robert Christie · Answer

The cgi.escape function converts special characters to valid HTML entities:

    import cgi

    original_string = 'Hello "XYZ" this \'is\' a test & so on '
    # Passing True as the second argument also escapes double quotes.
    escaped_string = cgi.escape(original_string, True)
    print(original_string)
    print(escaped_string)

resulting in

    Hello "XYZ" this 'is' a test & so on
    Hello &quot;XYZ&quot; this 'is' a test &amp; so on

The optional second parameter of cgi.escape escapes double quotes. By default, they are not escaped.
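Note that cgi.escape was deprecated in Python 3.2 and removed in Python 3.8, so the snippet above only runs on older interpreters. One possible compatibility sketch is to prefer html.escape and fall back to cgi.escape; the wrapper name `escape_html` is hypothetical.

```python
try:
    from html import escape as _escape  # Python 3.2 and later
except ImportError:
    from cgi import escape as _escape   # older interpreters, e.g. Python 2


def escape_html(text, quote=True):
    """Escape &, < and >, and double quotes as well when quote is true.

    Single quotes are handled differently by the two backends:
    html.escape turns them into &#x27; when quote is true,
    while cgi.escape always leaves them untouched.
    """
    return _escape(text, quote)


print(escape_html('a & "b"'))               # a &amp; &quot;b&quot;
print(escape_html('a & "b"', quote=False))  # a &amp; "b"
```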

Ned Batchelder · Answer

A simple string function will do it:

    def escape(t):
        """HTML-escape the text in `t`."""
        return (t
            .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
            .replace("'", "&#39;").replace('"', "&quot;")
            )

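Other answers in this thread have minor problems: the cgi.escape method for some reason ignores single quotes, and you need to explicitly ask it to escape double quotes. The linked wiki page handles all five characters, but uses the XML entity &apos;, which isn't an HTML entity (it is not defined in HTML 4).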

This function handles all five characters every time, using HTML-standard entities.
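One detail worth spelling out is that the order of the replace calls matters: `&` has to be handled first, otherwise the ampersands produced by the later replacements get escaped a second time. A minimal sketch, where `escape_wrong_order` is a hypothetical counter-example and not part of the answer:

```python
def escape(t):
    """HTML-escape the text in `t`, replacing & before anything else."""
    return (t
            .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
            .replace("'", "&#39;").replace('"', "&quot;"))


def escape_wrong_order(t):
    """Counter-example: & is replaced after <, so &lt; is mangled."""
    return t.replace("<", "&lt;").replace("&", "&amp;")


print(escape("a < b"))              # a &lt; b
print(escape_wrong_order("a < b"))  # a &amp;lt; b
```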

Brōtsyorfuzthrāx · Answer

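The other answers here will help with the characters you listed and a few others. However, if you also want to convert everything else to entity names, you'll have to do something else. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you. You'll want to do something like the following, which uses html.entities.entitydefs, which is just a dictionary. (The following code is written for Python 3.x, but there is a partial attempt at making it compatible with 2.x, to give you an idea):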

    # -*- coding: utf-8 -*-

    import sys

    if sys.version_info[0] > 2:
        from html.entities import entitydefs
    else:
        from htmlentitydefs import entitydefs

    # This is your string variable containing the stuff you want to convert.
    text = ";\"áèïøæỳ"
    # $ஸ$ is just something random the user isn't likely to have in the
    # document. We're converting it so it doesn't convert the semi-colons
    # in the entity names into entity names.
    text = text.replace(";", "$ஸ$")
    text = text.replace("$ஸ$", "&semi;")  # Converting semi-colons to entity names

    if sys.version_info[0] > 2:  # Using appropriate code for each Python version.
        for k, v in entitydefs.items():
            if k not in {"semi", "amp"}:
                text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.
    else:
        for k, v in entitydefs.iteritems():
            if k not in {"semi", "amp"}:
                text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.

    # The above code doesn't cover every single entity name, although I believe
    # it covers everything in the Latin-1 character set. So, I'm manually doing
    # some common ones I like hereafter:
    text = text.replace("ŷ", "&ycirc;")
    text = text.replace("Ŷ", "&Ycirc;")
    text = text.replace("ŵ", "&wcirc;")
    text = text.replace("Ŵ", "&Wcirc;")
    text = text.replace("ỳ", "&#7923;")
    text = text.replace("Ỳ", "&#7922;")
    text = text.replace("ẃ", "&wacute;")
    text = text.replace("Ẃ", "&Wacute;")
    text = text.replace("ẁ", "&#7809;")
    text = text.replace("Ẁ", "&#7808;")

    print(text)
    # Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
    # The Python 2.x version outputs the wrong stuff. So, clearly you'll have
    # to adjust the code somehow for it.
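As a possible alternative to the chain of replace calls, here is a single-pass sketch for Python 3 built on html.entities.codepoint2name, which maps code points to HTML 4 entity names. The function name `to_named_entities` is hypothetical, and falling back to numeric references for characters without a name is an assumption about the desired behaviour. Unlike the code above, it leaves `;` as it is and converts `&` to &amp;.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace characters with named entities where HTML 4 defines one.

    Other non-ASCII characters become decimal numeric references;
    plain ASCII characters without an entity name pass through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        name = codepoint2name.get(cp)
        if name is not None:
            out.append("&" + name + ";")
        elif cp > 127:
            out.append("&#%d;" % cp)
        else:
            out.append(ch)
    return "".join(out)


# Same sample as above, written with escapes: ;"áèïøæỳ
sample = ';"\u00e1\u00e8\u00ef\u00f8\u00e6\u1ef3'
print(to_named_entities(sample))
# ;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
```

Because each character is visited once, text that has already been emitted is never rescanned, so no placeholder for the semicolon is needed.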