Python: UnicodeDecodeError: 'utf8' codec can't decode byte

Question

I am converting RTF files to Python strings. For some of the texts I get the following error:

Traceback (most recent call last):
  File "11.08.py", line 47, in <module>
    X = vectorizer.fit_transform(texts)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 462: invalid start byte

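I have already tried:

Copying and pasting the text of the file into a new file
Saving the rtf file as a txt file
Opening the txt file in Notepad++, choosing "Convert to UTF-8", and setting the encoding to UTF-8
Opening the file in Microsoft Word and saving it as a new file

Nothing works. Any ideas?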

It is probably not related, but here is the code in case you are wondering:

f = open(dir+location, "r")
doc = Rtf15Reader.read(f)
t = PlaintextWriter.write(doc).getvalue()
texts.append(t)
f.close()

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X = vectorizer.fit_transform(texts)

Andreas Mueller · Accepted Answer

As I said on the mailing list, the easiest fix is probably to use the charset_error option and set it to ignore. If the file is actually utf-16, you can also set the charset to utf-16 in the Vectorizer. See the docs.
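
To illustrate what the ignore setting changes, here is a minimal sketch using only the standard library (Python 3 syntax). The sample bytes are made up for illustration. Judging by the `doc.decode(self.charset, self.charset_error)` line in the traceback, the vectorizer appears to hand that option to `decode` as its error handler, so the behaviour should match. In later scikit-learn releases these options are, I believe, named `encoding` and `decode_error`.

```python
# Hypothetical sample: 0x92 is a lone byte that is not valid UTF-8.
raw = b"It\x92s a test"

try:
    raw.decode("utf-8")  # the default error handler is "strict"
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

ignored = raw.decode("utf-8", errors="ignore")    # drops the undecodable byte
replaced = raw.decode("utf-8", errors="replace")  # inserts U+FFFD in its place

print(strict_error)
print(ignored)
print(replaced)
```

Note that ignore silently discards characters, so it is best suited to cases where losing the occasional punctuation mark does not matter.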

Jose Luis Martín Romera · Answer

This will solve your problem:

import codecs
f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()

From that moment on, txt is in Unicode format and can be used anywhere in your code.

If you want to generate a UTF-8 file after processing:

f.write(txt.encode('utf-8'))
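
A minimal, self-contained sketch of this read-then-write round trip might look like the following. The file names and sample text are hypothetical, and it assumes the input really is valid UTF-8; a file containing a stray byte such as 0x92 would still raise the same error on read.

```python
import io
import os
import tempfile

# Hypothetical file names, created in a temporary directory for the demo.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "output.txt")

# Create a small UTF-8 input file so there is something to read back.
with io.open(src_path, "w", encoding="utf-8") as f:
    f.write(u"caf\u00e9 \u2019quoted\u2019")

# Reading with an explicit encoding yields Unicode text.
with io.open(src_path, "r", encoding="utf-8") as f:
    txt = f.read()

# A handle opened for writing with encoding= encodes automatically,
# so no manual .encode() call is needed here.
with io.open(dst_path, "w", encoding="utf-8") as f:
    f.write(txt)

# Read the raw bytes back to confirm what ended up on disk.
with io.open(dst_path, "rb") as f:
    written_bytes = f.read()
```

The output file is opened separately in write mode, since a handle opened with 'r' cannot be written to.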

Piyush S. Wanare · Answer

You can dump the rows of your CSV file to a JSON file without any encoding error as follows:

json.dump(row,jsonfile, encoding="ISO-8859-1")
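
The `encoding` argument above is a Python 2 feature; as far as I know, `json.dump` in Python 3 does not accept it. A rough Python 3 sketch of the same idea is to decode the CSV while reading it, so that the rows already contain text. The CSV content below is made up for illustration.

```python
import csv
import io
import json

# Hypothetical CSV content encoded as ISO-8859-1
# (0xE9 is "é" and 0xE8 is "è" in that encoding).
csv_bytes = b"name,city\nRen\xe9,Li\xe8ge\n"

# Decode while reading, so every row holds text rather than raw bytes.
text_stream = io.TextIOWrapper(
    io.BytesIO(csv_bytes), encoding="ISO-8859-1", newline=""
)
rows = list(csv.DictReader(text_stream))

# ensure_ascii=False keeps the accented characters readable in the output.
json_text = json.dumps(rows, ensure_ascii=False)
print(json_text)
```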

Shalini Baranwal · Answer

Use this line:

vectorizer = TfidfVectorizer(encoding='latin-1',sublinear_tf=True, max_df=0.5, stop_words='english')

encoding='latin-1' worked for me.
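
One likely reason latin-1 works is that it maps every possible byte to a code point, so decoding can never fail. The sketch below (Python 3 syntax, with a made-up sample string) compares it with Windows-1252, where byte 0x92 is the right single quotation mark that Word often produces. Whether the original files are actually Windows-1252 is an assumption on my part.

```python
# Hypothetical sample containing the byte reported in the traceback.
raw = b"It\x92s"

as_latin1 = raw.decode("latin-1")  # never fails: each byte maps to one code point
as_cp1252 = raw.decode("cp1252")   # Windows-1252 maps 0x92 to U+2019

print(repr(as_latin1))
print(repr(as_cp1252))
```

With latin-1 the byte becomes the control character U+0092 rather than an apostrophe, so if the text came from Windows tools, passing encoding='cp1252' to the vectorizer may give cleaner tokens.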