UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

Question

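I am currently trying to get a Python 3 program to manipulate a text file filled with information, working through the Spyder IDE/GUI. However, when it tries to read the file, I get the following error: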

      File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
        parser(f)
      File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
        data = infile.read()
      File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

The code of the program is as follows:

    import os
    os.getcwd()
    import glob
    import re
    import sqlite3
    import csv

    def parser(file):

        # Open a TXT file. Store all articles in a list. Each article is an item
        # of the list. Split articles based on the location of such string as
        # 'Document PRN0000020080617e46h00461'

        articles = []
        with open(file, 'r') as infile:
            data = infile.read()
        start = re.search(r'\n HD\n', data).start()
        for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
            end = m.end()
            a = data[start:end].strip()
            a = '\n ' + a
            articles.append(a)
            start = end

        # In each article, find all used Intelligence Indexing field codes. Extract
        # content of each used field code, and write to a CSV file.

        # All field codes (order matters)
        fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
                  'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

        for a in articles:
            used = [f for f in fields if re.search(r'\n ' + f + r'\n', a)]
            unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n ' + f + r'\n', a)]
            fields_pos = []
            for f in used:
                f_m = re.search(r'\n ' + f + r'\n', a)
                f_pos = [f, f_m.start(), f_m.end()]
                fields_pos.append(f_pos)

            obs = []
            n = len(used)
            for i in range(0, n):
                used_f = fields_pos[i][0]
                start = fields_pos[i][2]
                if i < n - 1:
                    end = fields_pos[i + 1][1]
                else:
                    end = len(a)
                content = a[start:end].strip()
                obs.append(content)
            for f in unused:
                obs.insert(f[0], '')
            obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
            # print(obs)
            cur.execute('''INSERT INTO articles
                           (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                           co, ina, ns, re, ipc, ipd, pub, an)
                           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                                   ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

    # Write to SQLITE
    conn = sqlite3.connect('factiva.db')
    with conn:
        cur = conn.cursor()
        cur.execute('DROP TABLE IF EXISTS articles')
        # Mirror all field codes except changing 'IN' to 'INC' because it is an invalid name
        cur.execute('''CREATE TABLE articles
                       (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                       et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                       td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                       ipd text, pub text, an text)''')
        for f in glob.glob('*.txt'):
            print(f)
            parser(f)

    # Write to CSV to feed Stata
    with open('factiva.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        with conn:
            cur = conn.cursor()
            cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
            colname = [desc[0] for desc in cur.description]
            writer.writerow(colname)
            for obs in cur.fetchall():
                writer.writerow(obs)

Giacomo Catenazzi · Answer

As you can see from https://en.wikipedia.org/wiki/Windows-1252, the code 0x9D is not defined in CP1252.
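This can be checked quickly in the interpreter. The sketch below also shows one plausible origin of the byte, under an assumption the question does not confirm: that the file is really UTF-8 and the 0x9d belongs to a right curly quote. The variable names are only illustrative.

```python
# Byte 0x9d has no character assigned in Python's cp1252 codec,
# so decoding it raises the same error as in the traceback.
try:
    b"\x9d".decode("cp1252")
    undefined_in_cp1252 = False
except UnicodeDecodeError:
    undefined_in_cp1252 = True

# Assumption: the source file is UTF-8. In UTF-8, the right double
# quotation mark U+201D is encoded as the three bytes E2 80 9D,
# so its last byte is exactly the 0x9d that cp1252 rejects.
curly_quote_bytes = "\u201d".encode("utf-8")

print(undefined_in_cp1252)
print(curly_quote_bytes)
```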

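The "error" is, for example, in your open call: you do not specify an encoding, so Python falls back to the system's default encoding, which on Windows is typically a legacy code page such as cp1252. In general, when reading a file that may not have been created on the same machine, it is much better to specify the encoding explicitly.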

I also recommend passing an encoding to the open call that writes the CSV. It is much better to be explicit.

I don't know the original file's format, but adding , encoding='utf-8' to open is usually a good choice (it is the default on Linux and macOS).
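As a minimal sketch of this advice, the snippet below passes an explicit encoding both when reading the text and when writing the CSV. It assumes the data is UTF-8, which the question does not confirm. The helper names read_text and write_rows and the sample file are hypothetical, not part of the original program.

```python
import csv
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Read a whole text file with an explicit encoding (assumed UTF-8)."""
    with open(path, "r", encoding=encoding) as infile:
        return infile.read()

def write_rows(path, rows, encoding="utf-8"):
    """Write rows to a CSV file with an explicit encoding."""
    with open(path, "w", newline="", encoding=encoding) as csvfile:
        csv.writer(csvfile).writerows(rows)

# Demo on a throwaway file that contains curly quotes stored as UTF-8.
with tempfile.TemporaryDirectory() as tmpdir:
    txt_path = os.path.join(tmpdir, "sample.txt")
    with open(txt_path, "wb") as f:
        f.write("He said \u201chello\u201d\n".encode("utf-8"))

    text = read_text(txt_path)

    csv_path = os.path.join(tmpdir, "sample.csv")
    write_rows(csv_path, [["hd"], [text.strip()]])
    round_trip = read_text(csv_path)

print(text)
```

In the program from the question, the equivalent change would be adding encoding='utf-8' to the open(file, 'r') call in parser and to the open('factiva.csv', 'w', newline='') call, provided the files really are UTF-8.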

Romano · Answer

The above did not work for me. Try this instead: , errors='ignore'. It worked wonders!
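It may help to see what errors='ignore' does to the data: undecodable bytes are dropped rather than decoded. The sketch below uses bytes.decode, which accepts the same errors values as open, and again assumes (unconfirmed) that the bytes are a UTF-8 curly quote being read as cp1252.

```python
# Assumption: the file contains a UTF-8 right curly quote (E2 80 9D).
raw = "\u201d".encode("utf-8")

# errors='ignore' silently drops the byte cp1252 cannot map (0x9d);
# the two remaining bytes decode to unrelated characters.
ignored = raw.decode("cp1252", errors="ignore")

# errors='replace' keeps a visible marker (U+FFFD) where the byte was.
replaced = raw.decode("cp1252", errors="replace")

# Decoding with the encoding the bytes were written in recovers the quote.
correct = raw.decode("utf-8")

print(repr(ignored), repr(replaced), repr(correct))
```

So the error goes away, but the text may no longer match what was in the file. If the actual encoding can be identified, passing it to open preserves the content.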

AnksG · Answer

To fix this, I changed the file's format from .csv to .xlsx.