How do I .decode('string-escape') in Python 3?

Question

I have some escaped strings that need to be unescaped, and I'd like to do this in Python.

For example, in Python 2.7 I can do this:

>>> "\\123omething special".decode('string-escape')
'Something special'
>>>

How do I do it in Python 3? This doesn't work:

>>> b"\\123omething special".decode('string-escape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: string-escape
>>>

My goal is to be able to take a string like this:

s\000u\000p\000p\000o\000r\000t\000@\000p\000s\000i\000l\000o\000c\000.\000c\000o\000m\000

And turn it into:

"support@psiloc.com"

After the conversion, I'll probe the resulting string to see whether it is encoded in UTF-8 or UTF-16.

Martijn Pieters · Accepted Answer

You'll have to use unicode_escape instead:

>>> b"\\123omething special".decode('unicode_escape')

If you start with a str object instead (the equivalent of Python 2.7 unicode), you'll need to encode it to bytes first, then decode with unicode_escape.

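If you need bytes as the end result, you'll have to encode again to a suitable encoding (for example .encode('latin1') if you need to preserve literal byte values; the first 256 Unicode code points map one-to-one to bytes).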
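The str-to-bytes round trip described above can be sketched roughly as follows. The helper name `unescape_str` is hypothetical, and the sketch assumes the input contains only code points up to U+00FF and no escapes (such as `\u20ac`) that expand beyond that range; otherwise one of the latin1 encode steps raises `UnicodeEncodeError`.

```python
def unescape_str(text: str) -> bytes:
    """Hypothetical helper: interpret backslash escapes in a str and return raw bytes."""
    # Step 1: str -> bytes. latin1 maps code points 0-255 one-to-one to byte values.
    raw = text.encode('latin1')
    # Step 2: let unicode_escape interpret the backslash escapes.
    unescaped = raw.decode('unicode_escape')
    # Step 3: back to bytes, preserving literal byte values.
    return unescaped.encode('latin1')


print(unescape_str(r"\123omething special"))  # b'Something special'
```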

Your example is actually UTF-16 data with escapes. Decode from unicode_escape, encode back to latin1 to preserve the bytes, then decode from utf-16-le (UTF-16 little endian without a BOM):

>>> value = b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000'
>>> value.decode('unicode_escape').encode('latin1')  # convert to bytes
b's\x00u\x00p\x00p\x00o\x00r\x00t\x00@\x00p\x00s\x00i\x00l\x00o\x00c\x00.\x00c\x00o\x00m\x00'
>>> _.decode('utf-16-le')  # decode from UTF-16-LE
'support@psiloc.com'
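For use outside the interactive prompt, the same three steps could be wrapped in a small function. This is only a sketch: the name `decode_escaped_utf16le` is made up here, and it assumes the escaped payload really is BOM-less little-endian UTF-16.

```python
def decode_escaped_utf16le(escaped: bytes) -> str:
    """Hypothetical helper: escaped bytes -> raw bytes -> text decoded as UTF-16-LE."""
    # Interpret the backslash escapes, then recover the literal byte values via latin1.
    raw = escaped.decode('unicode_escape').encode('latin1')
    # Assumption: the raw bytes are UTF-16 little endian without a BOM.
    return raw.decode('utf-16-le')


value = b's\\000u\\000p\\000'
print(decode_escaped_utf16le(value))  # sup
```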

Nathaniel J. Smith · Answer

The old "string-escape" codec maps bytestrings to bytestrings. There has been a lot of debate about what to do with such codecs, so it is currently not available through the standard encode/decode interfaces.

However, the code is still there in the C API (as PyBytes_En/DecodeEscape), and it is still exposed to Python through the undocumented codecs.escape_encode and codecs.escape_decode.

>>> import codecs
>>> codecs.escape_decode(b"ab\\xff")
(b'ab\xff', 6)
>>> codecs.escape_encode(b"ab\xff")
(b'ab\\xff', 3)

These functions return the transformed bytes object plus a number indicating how many bytes were processed. You can simply ignore the latter.

>>> value = b's\\000u\\000p\\000p\\000o\\000r\\000t\\000@\\000p\\000s\\000i\\000l\\000o\\000c\\000.\\000c\\000o\\000m\\000'
>>> codecs.escape_decode(value)[0]
b's\x00u\x00p\x00p\x00o\x00r\x00t\x00@\x00p\x00s\x00i\x00l\x00o\x00c\x00.\x00c\x00o\x00m\x00'
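Since the question also mentions probing for UTF-8 versus UTF-16 after unescaping, here is one way the two steps might be combined. Both helper names are hypothetical, and the probe is a deliberately naive heuristic (an assumption, not a reliable detector): it treats even-length data containing NUL bytes as UTF-16-LE and everything else as UTF-8.

```python
import codecs


def unescape_bytes(escaped: bytes) -> bytes:
    """Hypothetical wrapper around the undocumented codecs.escape_decode."""
    data, _consumed = codecs.escape_decode(escaped)  # the consumed count is ignored
    return data


def probe_and_decode(raw: bytes) -> str:
    """Naive heuristic (assumption): NUL bytes in even-length data suggest UTF-16-LE."""
    if len(raw) % 2 == 0 and b'\x00' in raw:
        return raw.decode('utf-16-le')
    return raw.decode('utf-8')


escaped = b's\\000u\\000p\\000'
print(probe_and_decode(unescape_bytes(escaped)))  # sup
```

Real-world data can defeat this check (for example UTF-16 text with no ASCII-range characters), so treat it as a starting point only.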

malthe · Answer

You can't use unicode_escape on byte strings (or rather, you can, but it doesn't always return the same result as string_escape does on Python 2).
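One concrete way the two can diverge, as a small illustration: unicode_escape also interprets `\u` escapes, which have no meaning in a byte string and which Python 2's string_escape would leave untouched. The helper name `via_unicode_escape` is made up for this sketch.

```python
def via_unicode_escape(raw: bytes) -> bytes:
    """Hypothetical helper: the unicode_escape + latin1 round trip."""
    return raw.decode('unicode_escape').encode('latin1')


# Plain octal escapes behave as expected.
print(via_unicode_escape(b'\\123'))  # b'S'

# A \u escape is interpreted as a code point instead of being left alone.
print(via_unicode_escape(b'\\u00e9'))  # b'\xe9'

# Code points above U+00FF cannot be encoded back to latin1 at all.
try:
    via_unicode_escape(b'\\u20ac')
except UnicodeEncodeError as exc:
    print('cannot round-trip:', exc.reason)
```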

This function implements string_escape using a regular expression and custom replacement logic:

import re

def unescape(text):
    # Match a backslash followed by: another backslash, 1-3 octal digits,
    # an x escape, a known one-character escape, any other character, or end of input.
    regex = re.compile(b'\\\\(\\\\|[0-7]{1,3}|x.[0-9a-fA-F]?|[\'"abfnrt]|.|$)')

    def replace(m):
        b = m.group(1)
        if len(b) == 0:
            raise ValueError("Invalid character escape: '\\'.")
        i = b[0]
        if i == 120:  # 'x': hexadecimal escape
            v = int(b[1:], 16)
        elif 48 <= i <= 55:  # '0'-'7': octal escape
            v = int(b, 8)
        elif i == 34: return b'"'
        elif i == 39: return b"'"
        elif i == 92: return b'\\'
        elif i == 97: return b'\a'
        elif i == 98: return b'\b'
        elif i == 102: return b'\f'
        elif i == 110: return b'\n'
        elif i == 114: return b'\r'
        elif i == 116: return b'\t'
        else:
            s = b.decode('ascii')
            raise UnicodeDecodeError(
                'stringescape', text, m.start(), m.end(), "Invalid escape: %r" % s
            )
        return bytes((v, ))

    # Return the unescaped bytes (the flattened original assigned the result but never returned it).
    return regex.sub(replace, text)

guettli · Answer

At least in my case, these were equivalent:

Py2: my_input.decode('string_escape')
Py3: bytes(my_input.decode('unicode_escape'), 'latin1')

convertutils.py:

def string_escape(my_bytes):
    return bytes(my_bytes.decode('unicode_escape'), 'latin1')