Pythonでファイルからスペース以外の特殊文字を削除するにはどうすればよいですか？

Question

テキストの膨大なコーパス（行ごと）があり、特殊文字を削除したいが、文字列のスペースと構造は維持したい。

hello? there A-Z-R_T(,**), world, welcome to python. this **should? the next line#followed- by@ an#other %million^ %%like $this.

する必要があります

hello there A Z R T world welcome to python this should be the next line followed by another million like this

Chiheb Nexus · Accepted Answer

このパターンはregexでも使用できます。

import re a = '''hello? there A-Z-R_T(,**), world, welcome to python. this **should? the next line#followed- by@ an#other %million^ %%like $this.''' for k in a.split("
"): print(re.sub(r"[^a-zA-Z0-9]+", ' ', k)) # Or: # final = " ".join(re.findall(r"[a-zA-Z0-9]+", k)) # print(final)

出力：

hello there A Z R T world welcome to python this should the next line followed by an other million like this

編集：

それ以外の場合は、最終行をlistに保存できます。

final = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in a.split("
")] print(final)

出力：

['hello there A Z R T world welcome to python ', 'this should the next line followed by an other million like this ']

Eliethesaiyan · Answer

Nfnニールの答えは素晴らしいと思います...しかし、単純な正規表現を追加して、すべての単語がない文字を削除しますが、下線は単語の一部と見なされます

print re.sub(r'\W+', ' ', string) >>> hello there A Z R_T world welcome to python

ssp4all · Answer

よりエレガントな解決策は

print(re.sub(r"\W+|_", " ", string))

>>> hello there A Z R T world welcome to python this should the next line followed by another million like this

ここで、reはpythonのregexモジュールです

re.subは、パターンをスペースで置き換えます。つまり、" "

r''は入力文字列を未加工として扱います(with )

\Wすべての非単語、つまりアンダースコアを除くすべての特殊文字*＆^％$などの場合_

+は、*と同様に、0から無制限に一致します（1対複数）。

|は論理OR

_はアンダースコアを表します

wwii · Answer

特殊文字をNoneにマッピングする辞書を作成する

d = {c:None for c in special_characters}

辞書を使用して変換テーブルを作成します。テキスト全体を変数に読み込み、テキスト全体で str.translate を使用します。