Removing Unicode characters from text files - sed, other bash / shell methods

Question

How can I remove Unicode characters from a bunch of text files in the terminal? I tried this, but it didn't work:

sed 'g/\u'U+200E'//' -i *.txt

I need to remove these Unicode characters from the text files:

U+0091 - sort of weird "control" space
U+0092 - same sort of weird "control" space
A0 - non-space break
U+200E - left to right mark

Michał Šrajer · Accepted Answer

If you only want to remove specific characters and you have Python available, you can do this:

# Note: print u"..." is Python 2 syntax, so "python" must be a Python 2 interpreter here.
CHARS=$(python -c 'print u"\u0091\u0092\u00a0\u200E".encode("utf8")')
sed 's/['"$CHARS"']//g' < /tmp/utf8_input.txt > /tmp/ascii_output.txt

kev · Answer

To strip all non-ASCII characters from file.txt, use either of these:

$ iconv -c -f utf-8 -t ascii file.txt
$ strings file.txt
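
Note that iconv -c writes the converted text to standard output and leaves file.txt itself unchanged. Below is a minimal sketch, assuming iconv and a POSIX shell, of saving an ASCII-only copy and then checking that no byte above 0x7F remains. The helper names to_ascii and has_non_ascii are hypothetical and not part of the answer.

```shell
# Hypothetical helper: write an ASCII-only copy of file $1 to file $2.
# -c tells iconv to drop characters that have no ASCII equivalent.
to_ascii() {
    iconv -c -f utf-8 -t ascii "$1" > "$2"
}

# Hypothetical helper: succeed if file $1 contains at least one non-ASCII byte.
has_non_ascii() {
    # Delete every ASCII byte (octal 000-177); anything left over is non-ASCII.
    [ -n "$(LC_ALL=C tr -d '\000-\177' < "$1")" ]
}
```

Some iconv builds appear to exit with a non-zero status when -c drops characters, so it may be safer to check the output file with has_non_ascii than to rely on the exit code of to_ascii.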

choroba · Answer

If the files are UTF-8 encoded, you can use the following regular expression with sed:

sed 's/\xc2\x91\|\xc2\x92\|\xc2\xa0\|\xe2\x80\x8e//g'
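
The \xHH escapes and the \| alternation are GNU sed extensions. Below is a sketch of the same byte-level idea that should also work with other POSIX sed implementations and that edits several files in place, as the question asks. The function name strip_chars and the .tmp suffix are hypothetical choices, and the sketch assumes the files are UTF-8 encoded.

```shell
# Hypothetical helper: remove U+0091, U+0092, U+00A0 and U+200E from each file given.
strip_chars() {
    # UTF-8 byte sequences of the four characters, built with octal escapes
    # (printf octal escapes are portable; \x escapes are not).
    c1=$(printf '\302\221')      # U+0091
    c2=$(printf '\302\222')      # U+0092
    c3=$(printf '\302\240')      # U+00A0
    c4=$(printf '\342\200\216')  # U+200E
    for f in "$@"; do
        # LC_ALL=C makes sed match raw bytes regardless of the user's locale.
        # The original is replaced only if sed succeeds.
        LC_ALL=C sed -e "s/$c1//g" -e "s/$c2//g" -e "s/$c3//g" -e "s/$c4//g" \
            "$f" > "$f.tmp" && mv -f "$f.tmp" "$f"
    done
}
```

It could be called as strip_chars *.txt. Other non-ASCII characters are left alone, because each pattern matches only the complete byte sequence of one target character.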

Michał Šrajer · Answer

Use iconv:

iconv -f utf8 -t ascii//TRANSLIT < /tmp/utf8_input.txt > /tmp/ascii_output.txt

This converts characters such as "Š" to "S" (the most similar-looking character).

ma11hew28 · Answer

To convert Swift files from UTF-8 to ASCII:

for file in *.swift
do
    iconv -f utf-8 -t ascii "$file" > "$file".tmp
    mv -f "$file".tmp "$file"
done
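
One caveat with a loop of this shape: without -c, iconv stops with an error at the first character it cannot represent in ASCII, and the mv then replaces the original with a partial file. Below is a more defensive sketch that replaces a file only when the conversion succeeded. The function name convert_in_place is hypothetical, and the behavior described assumes an iconv that exits non-zero on unconvertible input.

```shell
# Hypothetical helper: convert each given file from UTF-8 to ASCII in place,
# leaving the original untouched when iconv reports an error.
convert_in_place() {
    for file in "$@"; do
        if iconv -f utf-8 -t ascii "$file" > "$file.tmp"; then
            # Conversion succeeded: replace the original.
            mv -f "$file.tmp" "$file"
        else
            # Conversion failed: discard the partial output and keep the original.
            rm -f "$file.tmp"
            echo "skipped (not convertible to ASCII): $file" >&2
        fi
    done
}
```

It could be called as convert_in_place *.swift. Files reported as skipped still contain non-ASCII characters and need iconv -c, //TRANSLIT, or a targeted sed instead.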