Convert bytes to a string?

Question

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array back:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n'

Does anybody know how to convert the bytes value back to a string? I mean, using the "batteries" instead of doing it manually. And I'd like it to work with Python 3.

Aaron Maenpaa · Accepted Answer

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8")
'abcde'
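
To see why the encoding matters, here is a minimal sketch with a non-ASCII character. The sample value and the variable names are made up for illustration, and UTF-8 is assumed only because the sample was encoded with it.

```python
# Sample data (an assumption for illustration): "café" encoded as UTF-8.
raw = "café".encode("utf-8")   # the é takes two bytes in UTF-8

# Decode with the encoding the data is actually in.
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))    # bytes 5
print(type(text).__name__, len(text))  # str 4
```

The byte count and the character count differ, which is one reason the two types are not interchangeable.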

Sisso · Answer

I think this way is easy:

>>> byte_values = [112, 52, 52]  # renamed so the built-in bytes is not shadowed
>>> "".join(map(chr, byte_values))
'p44'

dF. · Answer

You need to decode the byte string and turn it into a character (Unicode) string:

b'hello'.decode(encoding)

or, on Python 3, equivalently:

str(b'hello', encoding)
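
A small sketch of the two spellings side by side, plus one pitfall: calling str() on bytes without an encoding does not decode anything, it returns the printable representation instead. The variable names are only for illustration.

```python
data = b'hello'

decoded_method = data.decode('utf-8')  # bytes.decode(encoding)
decoded_ctor = str(data, 'utf-8')      # str(bytes, encoding)
not_decoded = str(data)                # no encoding given: repr-like text

print(decoded_method == decoded_ctor)  # True
print(not_decoded)                     # b'hello' (the b and the quotes are part of the string)
```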

anatoly techtonik · Answer

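If you don't know the encoding, then to read binary input into a string in a way that works on both Python 3 and Python 2, use the old MS-DOS cp437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

# stream is any iterable of byte lines, e.g. a file opened in 'rb' mode
lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

Because the encoding is unknown, expect non-English symbols to be translated to cp437 characters (English characters are not translated, because they match in most single-byte encodings and in UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte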
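
As a rough check of the cp437 idea (Python 3 only), the sketch below pushes every possible byte value through cp437 and back, and contrasts that with strict UTF-8. The variable names are illustrative.

```python
all_bytes = bytes(range(256))

# cp437 assigns a character to every byte value, so decoding never fails
# and encoding the result gives the original bytes back.
as_text = all_bytes.decode('cp437')
round_tripped = as_text.encode('cp437')

# Strict UTF-8 rejects the same input.
try:
    all_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(round_tripped == all_bytes)  # True
print(utf8_ok)                     # False
```

The text you get is only guaranteed to be reversible, not meaningful: bytes above 0x7f show up as whatever cp437 happens to assign to them.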

The same applies to latin-1, which was popular (the default) for Python 2. See the missing points in Codepage Layout: that is where Python chokes with the infamous "ordinal not in range".

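UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for turning arbitrary binary data into strings without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.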
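
A minimal version of that [binary] -> [str] -> [binary] test might look like the sketch below. It only checks correctness for one input covering every byte value, and says nothing about performance.

```python
original = bytes(range(256))  # includes many bytes that are invalid UTF-8

# Undecodable bytes become lone surrogates U+DC80..U+DCFF instead of raising.
text = original.decode('utf-8', errors='surrogateescape')

# Encoding with the same handler turns those surrogates back into the bytes.
restored = text.encode('utf-8', errors='surrogateescape')

print(restored == original)  # True
```

Note that such a string contains lone surrogates, so encoding it without the surrogateescape handler (for example when printing it) raises UnicodeEncodeError.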

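UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility of slash-escaping all unknown bytes with the backslashreplace error handler. That works only on Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)

# stream is any iterable of byte lines, e.g. a file opened in 'rb' mode
lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See https://docs.python.org/3/howto/unicode.html#python-s-unicode-support for details.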
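
For a single value, the effect of backslashreplace when decoding (Python 3.5 or newer) looks like this; the sample bytes are just an illustration.

```python
raw = b'\x80abc'  # 0x80 is not valid as the first byte of a UTF-8 sequence

escaped = raw.decode('utf-8', 'backslashreplace')

print(escaped)       # \x80abc
print(len(escaped))  # 7: a backslash, 'x', '8', '0', then 'abc'
```

The result is readable but not reversible in general, since an escaped byte cannot be told apart from a literal backslash sequence that was already in the data.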

UPDATE 20170119: I decided to implement a slash-escaping decode that works on both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

# --- preparation

import codecs

def slashescape(err):
    """codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    # err.object[err.start:err.end] may span several bytes; bytearray()
    # yields integers on both Python 2 and Python 3
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

lmiguelvargasf · Answer

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

mcherm · Answer

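I think you actually want this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it does matter is the reason Python moved to using two different types for binary and text data: it can't convert between them magically, because it doesn't know the encoding unless you tell it. The only way you would know is to read the Windows documentation (or read it here).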
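
To make "it will make a difference" concrete, here is a short sketch that decodes the same four bytes three ways. The sample value is made up; only standard codecs are used.

```python
raw = b'caf\xe9'  # "café" as written by a windows-1252 program

as_cp1252 = raw.decode('windows-1252')  # the intended text
as_cp437 = raw.decode('cp437')          # same bytes, different character

try:
    raw.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc.reason

print(as_cp1252)   # café
print(as_cp437)    # cafΘ
print(utf8_error is not None)  # True: the bytes are not valid UTF-8
```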

ContextSwitch · Answer

Set universal_newlines to True:

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
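
A quick way to compare both modes without depending on ls is to run a child Python process. Using sys.executable as the command is my substitution for the example; the flag itself is the one from the answer.

```python
import sys
from subprocess import Popen, PIPE

# A child Python stands in for `ls -l` so the sketch runs on any platform.
child = [sys.executable, '-c', 'print("hello")']

raw_output = Popen(child, stdout=PIPE).communicate()[0]
text_output = Popen(child, stdout=PIPE, universal_newlines=True).communicate()[0]

print(type(raw_output).__name__)   # bytes
print(type(text_output).__name__)  # str
print(repr(text_output))           # 'hello\n'
```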

serv-inc · Answer

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has a standard argument:

codecs.decode(obj, encoding='utf-8', errors='strict')

jfs · Answer

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.
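
The sketch below spells out that mojibake with explicit code points and shows that, in this particular case, the damage can be undone by reversing the two steps. Reversal is not guaranteed in general; it works here because every byte involved has a cp1252 mapping.

```python
original = '\u2014'  # EM DASH, encoded in UTF-8 as the bytes e2 80 94

# Wrong decoder: each byte is read as one cp1252 character.
mojibake = original.encode('utf-8').decode('cp1252')

# Undo the mistake: back to the raw bytes, then decode them correctly.
repaired = mojibake.encode('cp1252').decode('utf-8')

print([hex(ord(ch)) for ch in mojibake])  # ['0xe2', '0x20ac', '0x201d']
print(repaired == original)               # True
```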

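In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, which is why the chardet module exists: it can guess the character encoding. A single Python script may use multiple character encodings in different places.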
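
If pulling in chardet is not an option, a much cruder kind of guessing is to try a short list of candidate encodings in order. The helper below is a hypothetical sketch, not part of any library; the name decode_with_fallbacks and the default candidate list are my assumptions, and a "successful" decode can still be the wrong guess.

```python
def decode_with_fallbacks(raw, candidates=('utf-8', 'cp1252')):
    """Hypothetical helper: return (text, encoding_used) for the first
    candidate encoding that decodes raw without errors."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: never raises, but undecodable bytes become U+FFFD.
    return raw.decode(candidates[0], errors='replace'), candidates[0]

print(decode_with_fallbacks(b'caf\xc3\xa9'))  # ('café', 'utf-8')
print(decode_with_fallbacks(b'caf\xe9'))      # ('café', 'cp1252')
```

Putting UTF-8 first is deliberate: it is strict enough that text in another encoding rarely passes as valid UTF-8, whereas most single-byte codecs accept almost anything.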

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().
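
A round trip through both functions can be sketched like this. The file name is invented, and the example assumes a POSIX system, where os.fsdecode() uses the surrogateescape handler; on Windows the same bytes may be rejected.

```python
import os

# Assumption: running on a POSIX system (Linux, macOS, ...).
raw_name = b'report_\xff.txt'  # made-up name with a byte that is not valid UTF-8

name = os.fsdecode(raw_name)   # str, with the odd byte kept as an escape
restored = os.fsencode(name)   # back to the exact original bytes

print(type(name).__name__)     # str
print(restored == raw_name)    # True
```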

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode bytes, e.g., it can be cp1252 on Windows.

To decode the byte stream on-the-fly, io.TextIOWrapper() could be used.
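
One possible way to do that, as a sketch: wrap the pipe of a running child process and iterate over decoded lines as they arrive. The child command (a small Python one-liner via sys.executable) and the choice of UTF-8 are assumptions for the example.

```python
import io
import subprocess
import sys

# A child Python stands in for a real command here.
child = [sys.executable, '-c', 'print("first"); print("second")']

lines = []
with subprocess.Popen(child, stdout=subprocess.PIPE) as proc:
    # The wrapper decodes incrementally, so lines are available as soon as
    # the child writes them rather than after it exits.
    reader = io.TextIOWrapper(proc.stdout, encoding='utf-8')
    for line in reader:
        lines.append(line.rstrip('\n'))

print(lines)  # ['first', 'second']
```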

Different commands may use different character encodings for their output, e.g., the dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):

output = subprocess.check_output('dir', shell=True, encoding='cp437')

The filenames may differ from os.listdir() (which uses the Windows Unicode API), e.g., '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".

wim · Answer

Since this question is actually asking about subprocess output, you have a more direct approach available, because Popen accepts an encoding keyword (in Python 3.6+):

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer for other users is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

Broper · Answer

If trying decode() gives you the following:

AttributeError: 'str' object has no attribute 'decode'

you can also specify the encoding directly in a cast:

>>> my_byte_str
b'Hello World'
>>> str(my_byte_str, 'utf-8')
'Hello World'
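
That AttributeError usually means the value is already a str. A small helper that accepts either type avoids the problem; ensure_text is a hypothetical name for this sketch, and UTF-8 is only an assumed default.

```python
def ensure_text(value, encoding='utf-8'):
    """Hypothetical helper: decode bytes, pass str through unchanged."""
    if isinstance(value, bytes):
        return str(value, encoding)
    return value

print(ensure_text(b'Hello World'))  # Hello World
print(ensure_text('Hello World'))   # Hello World
```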

eafloresf · Answer

I made a function to clean a list:

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]
    return lista

bers · Answer

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
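
The doubling can be reproduced on any platform, without real files, by forcing Windows-style newline translation on an in-memory stream. In this sketch newline="\r\n" stands in for the default text mode on Windows, and write_like_windows is a made-up helper name.

```python
import io

def write_like_windows(text):
    """Hypothetical helper: encode text the way a Windows text-mode file would."""
    raw = io.BytesIO()
    # newline="\r\n" turns every "\n" written into "\r\n".
    writer = io.TextIOWrapper(raw, encoding="utf-8", newline="\r\n")
    writer.write(text)
    writer.flush()
    data = raw.getvalue()
    writer.detach()  # keep the wrapper from closing raw when it is collected
    return data

received = b"line1\r\nline2\r\n"  # bytes as they arrive from a Windows system

doubled = write_like_windows(received.decode("utf-8"))
fixed = write_like_windows(received.decode("utf-8").replace("\r\n", "\n"))

print(doubled)  # b'line1\r\r\nline2\r\r\n'
print(fixed)    # b'line1\r\nline2\r\n'
```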

Inconnu · Answer

For Python 3, this is a much safer and more Pythonic approach to convert from bytes to string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):  # check whether it is bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar 3 07:03 file2

Zhichang Yu · Answer

From http://docs.python.org/3/library/sys.html,

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

Leonardo Filipe · Answer

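def toString(string):
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # AttributeError: already a str (no .decode in Python 3)
        # ValueError: bytes that are not valid UTF-8
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))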

Boris · Answer

For your specific case of "run a shell command and get its output as text instead of bytes", on Python 3.7 and newer you should use subprocess.run and pass in text=True (as well as capture_output=True to capture the output):

command_result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
command_result.stdout  # is a `str` containing your program's stdout

text used to be called universal_newlines, and was changed (well, aliased) in Python 3.7. If you want to support Python versions before 3.7, pass in universal_newlines=True instead of text=True.
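
Here is a self-contained variant you can run anywhere (Python 3.7+). It swaps ls for a child Python started via sys.executable, which is my substitution, and shows the result with and without text=True.

```python
import subprocess
import sys

# A child Python stands in for `ls -l` so the sketch is portable.
child = [sys.executable, '-c', 'print("hello")']

as_text = subprocess.run(child, capture_output=True, text=True)
as_bytes = subprocess.run(child, capture_output=True)

print(repr(as_text.stdout))            # 'hello\n'
print(type(as_bytes.stdout).__name__)  # bytes
```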

HCLivess · Answer

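If you want to convert any bytes, not just a string that was converted to bytes:

import base64
import json

with open("bytesfile", "rb") as infile:
    # note: b85encode returns ASCII-only bytes; call .decode("ascii") for a str
    str1 = base64.b85encode(infile.read())

with open("bytesfile", "rb") as infile:
    str2 = json.dumps(list(infile.read()))

This is not very efficient, however. It will turn a 2 MB picture into 9 MB.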
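
To get a feel for the overhead, the sketch below applies both representations to 1024 in-memory bytes instead of a file and checks that each one can be reversed. The sample data is arbitrary.

```python
import base64
import json

data = bytes(range(256)) * 4  # 1024 arbitrary bytes, not valid text

# Base85 writes every 4 input bytes as 5 ASCII characters.
b85_text = base64.b85encode(data).decode('ascii')

# A JSON list spells out each byte as a decimal number plus separators.
json_text = json.dumps(list(data))

print(len(data), len(b85_text), len(json_text))

# Both forms are plain str and can be turned back into the original bytes.
print(base64.b85decode(b85_text) == data)   # True
print(bytes(json.loads(json_text)) == data)  # True
```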