Replace non-ASCII characters with a single space

Question

I need to replace all non-ASCII (\x00-\x7F) characters with a space. I'm surprised that this is not dead easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters:

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i) < 128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in its encoded form (i.e. the – character is replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)

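How can I replace all non-ASCII characters with a single space?

Of the myriad of similar SO questions, none address character replacement as opposed to stripping, and none additionally address all non-ASCII characters rather than one specific character.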

Martijn Pieters · Accepted Answer

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
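To make the difference between the two suggestions concrete, here is a minimal side-by-side sketch. The function names replace_each and replace_runs are made up for this illustration and do not appear in the answer; the sketch assumes Python 3, where text is a Unicode str.

```python
import re

def replace_each(text):
    # One space for every non-ASCII character.
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def replace_runs(text):
    # One space for every run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC马克def'
print(repr(replace_each(sample)))  # 'ABC  def' (two spaces)
print(repr(replace_runs(sample)))  # 'ABC def'  (one space)
```

Which one counts as "a single space" depends on whether the spacing should track the number of characters removed or just mark that something was there.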

Alvaro Fuentes · Answer

To get the closest ASCII representation of your original string, I recommend the unidecode module:

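from unidecode import unidecode

def remove_non_ascii(text):
    # Python 2: decode the UTF-8 byte string before transliterating.
    # In Python 3, str is already Unicode and can be passed to unidecode() directly.
    return unidecode(unicode(text, encoding="utf-8"))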

Then you can use it on a string:

remove_non_ascii("Ceñía")
Cenia

Mark Tolonen · Answer

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

But note that you will still have a problem if your string contains decomposed Unicode characters (a separate base character and combining accent mark, for example):

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
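One possible way to reduce this problem, not part of the answer above, is to normalize the text to NFC before replacing, so that a base letter and its combining mark are composed into a single code point where a precomposed form exists. The helper name replace_non_ascii_nfc is hypothetical, and this is only a sketch: combinations with no precomposed form will still leave the base letter behind.

```python
import re
import unicodedata

def replace_non_ascii_nfc(text):
    # Compose base letters with combining marks where Unicode defines
    # a precomposed character, e.g. 'n' + U+0303 becomes 'ñ'.
    composed = unicodedata.normalize('NFC', text)
    # Then replace every remaining non-ASCII code point with one space.
    return re.sub(r'[^\x00-\x7f]', ' ', composed)

decomposed = unicodedata.normalize('NFD', 'mañana')  # 7 code points
print(repr(replace_non_ascii_nfc(decomposed)))       # 'ma ana'
```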

AXO · Answer

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""

import re
from timeit import timeit

# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000

print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results:

0.7208260721400134
0.009975979187503592
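If a space is required rather than '?', one possible variation on the same encode/decode idea is a custom error handler registered with codecs.register_error. The handler name 'space_replace' and the function names below are invented for this sketch, and it has not been benchmarked against the timings above.

```python
import codecs

def _space_replace(exc):
    # Only encoding errors are handled here; anything else is re-raised.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    # exc.start:exc.end covers the unencodable characters; emit one
    # space per character and resume encoding at exc.end.
    return (' ' * (exc.end - exc.start), exc.end)

# 'space_replace' is a made-up handler name for this example.
codecs.register_error('space_replace', _space_replace)

def replace_non_ascii(text):
    return text.encode('ascii', 'space_replace').decode('ascii')

print(repr(replace_non_ascii('ABC马克def')))  # 'ABC  def'
```

Unlike encoding with 'replace' and then swapping '?' for a space afterwards, this leaves any genuine question marks in the text untouched.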

parsecer · Answer

What about this one?

def replace_trash(unicode_string):
    chars = list(unicode_string)
    for i, ch in enumerate(chars):
        try:
            ch.encode("ascii")
        except UnicodeEncodeError:
            # Non-ASCII character: replace it with a single space.
            chars[i] = " "
    return "".join(chars)

Kasrâmvd · Answer

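As a native and efficient approach, you don't need to use ord or loop over the characters. Just encode with ascii and ignore the errors.

The following will just remove the non-ASCII characters: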

new_string = old_string.encode('ascii',errors='ignore')

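Now if you want to replace the deleted characters, just do the following. Note that new_string is a bytes object, and that this appends one space per removed character at the end rather than at the positions where the characters were removed: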

final_string = new_string + b' ' * (len(old_string) - len(new_string))