python encoding utf-8

Question

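I am writing some scripts in Python. I build a string that I save to a file. This string holds a lot of data taken from the directory tree and file names of a directory. According to convmv, my whole directory tree is in UTF-8.

I want to keep everything in UTF-8 because I will save it to MySQL afterwards. For now, in MySQL, which is in UTF-8, I have problems with some characters (such as é or è, I am French).

I want Python to always treat strings as UTF-8. I read some information on the internet and did it like this.

My script begins with this: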

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    def createIndex():
        import codecs
        toUtf8=codecs.getencoder('UTF8')
        #lot of operations & building indexSTR the string who matter
        findex=open('config/index/music_vibration_'+date+'.index','a')
        findex.write(codecs.BOM_UTF8)
        findex.write(toUtf8(indexSTR)) #this bugs!

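And when I run it, this is the result: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2171: ordinal not in range(128)

Edit: I can see that the accents are written correctly in my file. After creating the file, I read it and write it to MySQL. I do not understand why, but I get an encoding problem there. My MySQL database is in utf8, or so it seems: the SQL query SHOW variables LIKE 'char%' returns only utf8 or binary.

My function looks like this: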

    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    def saveIndex(index,date):
        import MySQLdb as mdb
        import codecs

        sql = mdb.connect('localhost','admin','*******','music_vibration')
        sql.charset="utf8"
        findex=open('config/index/'+index,'r')
        lines=findex.readlines()
        for line in lines:
            if line.find('#artiste') != -1:
                artiste=line.split('[:::]')
                artiste=artiste[1].replace('\n','')

                c=sql.cursor()
                c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom="'+artiste+'"')
                nbr=c.fetchone()
                if nbr[0]==0:
                    c=sql.cursor()
                    iArt+=1
                    c.execute('INSERT INTO artistes(nom,status,path) VALUES("'+artiste+'",99,"'+artiste+'/")'.encode('utf8')

And the artists that display correctly in the file are written badly to the database. What is the problem?

Martijn Pieters · Accepted Answer

You don't need to encode data that is already encoded. When you try to do that, Python will first try to decode it to unicode before it can encode it back to UTF-8. That is what is failing here:

    >>> data = u'\u00c3'  # Unicode data
    >>> data = data.encode('utf8')  # encoded to UTF-8
    >>> data
    '\xc3\x83'
    >>> data.encode('utf8')  # Try to *re*-encode it
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Just write your data directly to the file; there is no need to encode already-encoded data.
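To make the hidden step visible, here is a minimal sketch that spells out what Python 2 does implicitly when `.encode()` is called on a byte string: it decodes with the default ASCII codec first, and that decode is where the error comes from. The helper name `reencode_like_python2` is made up for illustration, and the explicit `.decode('ascii')` stands in for the implicit conversion so that the snippet also runs on Python 3.

```python
# -*- coding: utf-8 -*-
text = u'\u00e9'               # a Unicode string holding the character 'é'
encoded = text.encode('utf8')  # the two bytes C3 A9

def reencode_like_python2(raw_bytes):
    """Hypothetical helper: decode as ASCII first, then encode to UTF-8,
    which is the sequence Python 2 runs for str.encode('utf8')."""
    return raw_bytes.decode('ascii').encode('utf8')

try:
    reencode_like_python2(encoded)
    error_message = None
except UnicodeDecodeError as exc:
    # The failure is a *decode* error, even though the call asked to encode.
    error_message = str(exc)

print(error_message)
```

The bytes in `encoded` are already what belongs in the file, so they can be written as they are.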

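If you build up unicode values instead, you would indeed have to encode those to make them writable to a file. In that case you'd want to use codecs.open(), which returns a file object that encodes unicode values to UTF-8 for you.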
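As a rough sketch of writing Unicode values through an encoding-aware file object, the snippet below uses `io.open()` as a stand-in for `codecs.open()`; for this purpose the two behave alike, and `io.open()` is available on Python 2.6+ and Python 3. The index line and the temporary path are invented for the example; the question writes under `config/index/` instead.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical index line, held as a Unicode value rather than encoded bytes.
index_str = u'#artiste[:::]H\u00e9l\u00e8ne\n'

# Hypothetical scratch location for the demo file.
path = os.path.join(tempfile.mkdtemp(), 'demo.index')

# The file object encodes to UTF-8 on write; there is no manual .encode() call.
with io.open(path, 'a', encoding='utf8') as findex:
    findex.write(index_str)

# Reading the raw bytes back shows how the text was stored on disk.
with io.open(path, 'rb') as raw:
    raw_bytes = raw.read()

print(repr(raw_bytes))
```

With `codecs.open(path, 'a', 'utf8')` the calls inside the `with` block would look the same.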

You also really don't want to write out the UTF-8 BOM, unless you have to support Microsoft tools that cannot read UTF-8 otherwise (such as MS Notepad).
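One way to see why a BOM can get in the way: a reader that decodes with plain `utf-8` keeps the BOM as the character U+FEFF inside the text. The sketch below assumes a script that appends to the same file on every run and writes a BOM each time, which is how the question's `'a'` mode combined with `codecs.BOM_UTF8` would behave; the sample line is made up.

```python
# -*- coding: utf-8 -*-
import codecs

line = u'caf\u00e9\n'  # hypothetical line of text

plain = line.encode('utf-8')         # UTF-8 bytes, no BOM
with_bom = line.encode('utf-8-sig')  # same bytes, prefixed with the 3-byte BOM

# Simulate two runs that each append a BOM followed by the data.
appended_twice = with_bom + with_bom

# Decoding as plain UTF-8 leaves every BOM in the text as U+FEFF.
decoded = appended_twice.decode('utf-8')
stray_bom_count = decoded.count(u'\ufeff')

print(stray_bom_count)
```

If a BOM really is required, it belongs only at the very start of the file.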

For your MySQL insert problem, you need to do two things:

1. Add charset='utf8' to your MySQLdb.connect() call.
2. Use unicode objects, not str objects, when querying or inserting, but use SQL parameters so the MySQL connector can do the right thing for you:
```
artiste = artiste.decode('utf8')  # it is already UTF8, decode to unicode

c.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))

# ...

c.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
```

It may actually work better if you used codecs.open() to decode the contents automatically instead:

    import codecs
    import MySQLdb as mdb  # same alias as in the question

    sql = mdb.connect('localhost','admin','*******','music_vibration', charset='utf8')
    artists_inserted = 0  # counter for newly inserted artists

    # codecs.open() decodes each line from UTF-8 to unicode while reading
    with codecs.open('config/index/'+index, 'r', 'utf8') as findex:
        for line in findex:
            if u'#artiste' not in line:
                continue

            artiste=line.split(u'[:::]')[1].strip()

            cursor = sql.cursor()
            cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=%s', (artiste,))
            if not cursor.fetchone()[0]:
                cursor = sql.cursor()
                cursor.execute('INSERT INTO artistes(nom,status,path) VALUES(%s, 99, %s)', (artiste, artiste + u'/'))
                artists_inserted += 1
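The parameterized queries in this answer can be tried without a MySQL server. The sketch below swaps in the standard library's `sqlite3` module purely as a stand-in: it uses `?` placeholders where MySQLdb uses `%s`, and the table layout, the helper `insert_if_missing` and the artist names are all assumptions made for the demonstration.

```python
# -*- coding: utf-8 -*-
import sqlite3

# In-memory database standing in for the MySQL server from the question.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE artistes (id INTEGER PRIMARY KEY, nom TEXT, status INTEGER, path TEXT)'
)

def insert_if_missing(conn, artiste):
    """Insert the artist unless a row with that name exists; return True if inserted."""
    cursor = conn.cursor()
    # The driver binds the value itself; no quoting or encoding by hand.
    cursor.execute('SELECT COUNT(id) AS nbr FROM artistes WHERE nom=?', (artiste,))
    if cursor.fetchone()[0]:
        return False
    cursor.execute(
        'INSERT INTO artistes(nom,status,path) VALUES(?, 99, ?)',
        (artiste, artiste + u'/'),
    )
    return True

# Hypothetical names: one with accents, one with a double quote that would
# break a query built by string concatenation, and one duplicate.
names = [u'H\u00e9l\u00e8ne', u'The "Quoted" Band', u'H\u00e9l\u00e8ne']
artists_inserted = sum(insert_if_missing(conn, name) for name in names)
stored = [row[0] for row in conn.execute('SELECT nom FROM artistes ORDER BY id')]

print(artists_inserted)
```

Passing values as parameters also keeps quotes in artist names from changing the meaning of the SQL statement, which the concatenated query in the question does not guard against.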

You may want to brush up on Unicode, UTF-8, and encodings. I can recommend the following articles:

Ev Haus · Answer

Unfortunately, the string.encode() method is not always reliable. Check out this thread for more information: What is a simple way to convert some string (UTF-8 or otherwise) to a plain ASCII string in Python
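This answer does not say what the unreliability looks like. One plausible reading, given the topic of the linked thread, is that encoding to ASCII either raises an exception or silently loses characters. The sketch below illustrates that reading under the assumption that dropping accents is acceptable; the sample name is made up.

```python
# -*- coding: utf-8 -*-
import unicodedata

name = u'H\u00e9l\u00e8ne'  # hypothetical name with two accented letters

# Strict encoding to ASCII fails on any non-ASCII character.
try:
    name.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Error handlers avoid the exception but lose information.
ignored = name.encode('ascii', 'ignore')    # accented letters are dropped
replaced = name.encode('ascii', 'replace')  # accented letters become '?'

# Decomposing first (NFKD) splits each accented letter into a base letter
# plus a combining mark, so dropping non-ASCII keeps the base letters.
stripped = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore')

print(strict_failed, ignored, replaced, stripped)
```

Note that this is a lossy conversion and is unrelated to storing the data as UTF-8, which is what the question asks for.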