Effective way to find any file's encoding

Question

Yes, this is a very frequently asked question, but the topic is still vague to me and I don't know much about it.

I would like a very precise way to find a file's encoding, as precise as Notepad++.

2Toad · Accepted Answer

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM).

    using System.IO;
    using System.Text;

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM).
    /// Defaults to ASCII when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    public static Encoding GetEncoding(string filename)
    {
        // Read the BOM
        var bom = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(bom, 0, 4);
        }

        // Analyze the BOM
        // (Encoding.UTF7 is marked obsolete on .NET 5 and later)
        if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
        if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
        // UTF-32LE must be tested before UTF-16LE: both BOMs start with FF FE
        if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
        if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE
        return Encoding.ASCII;
    }

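As a side note, you may want to change the last line of this method to return Encoding.Default instead, so that the encoding of the OS's current ANSI code page is returned by default. (This holds on .NET Framework; on .NET Core and .NET 5+, Encoding.Default is always UTF-8.)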
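The byte patterns compared in a BOM check like this do not have to be memorized: .NET exposes them through Encoding.GetPreamble(). The sketch below (the class and method names BomTable and BomHex are made up for illustration) formats a preamble as hex. It also shows why the order of the tests matters: the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM, so the four-byte pattern has to be tested first.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class BomTable
{
    // Formats an encoding's BOM (preamble) as dash-separated hex bytes.
    public static string BomHex(Encoding encoding) =>
        BitConverter.ToString(encoding.GetPreamble());

    // Returns one line per encoding, e.g. for printing to the console.
    public static string[] Describe() => new[]
    {
        "UTF-8:     " + BomHex(Encoding.UTF8),
        "UTF-16 LE: " + BomHex(Encoding.Unicode),
        "UTF-16 BE: " + BomHex(Encoding.BigEndianUnicode),
        "UTF-32 LE: " + BomHex(Encoding.UTF32),
        // UTF32Encoding(bigEndian: true, byteOrderMark: true)
        "UTF-32 BE: " + BomHex(new UTF32Encoding(true, true)),
    };
}
```

Calling BomTable.BomHex(Encoding.UTF32) yields FF-FE-00-00, while BomTable.BomHex(Encoding.Unicode) yields FF-FE.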

Simon Mourier · Answer

The following code works fine for me, using the StreamReader class:

    using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
    {
        reader.Peek(); // you need this!
        var encoding = reader.CurrentEncoding;
    }

The trick is to use the Peek call; otherwise .NET has not done anything yet (it hasn't read the preamble, the BOM). Of course, if you use any other ReadXXX call before checking the encoding, that works too.

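If the file has no BOM, the defaultEncodingIfNoBom encoding is used. There are also StreamReader constructor overloads without this argument (in that case the encoding defaults to UTF-8 before any read), but I recommend defining what you consider the default encoding in your context.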

I have tested this successfully with files with a BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.
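For reuse, the Peek trick can be wrapped in a small helper. This is only a sketch: the names BomSniffer and DetectByBom are invented here, and the fallback encoding is whatever the caller considers the right default.

```csharp
using System.IO;
using System.Text;

// Hypothetical wrapper for illustration; names are not from the answer.
public static class BomSniffer
{
    // Returns the encoding StreamReader settles on after its first read:
    // the BOM's encoding if a BOM is present, otherwise `fallback`.
    public static Encoding DetectByBom(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            reader.Peek(); // forces the reader to look at the first bytes
            return reader.CurrentEncoding;
        }
    }
}
```

A quick way to try it is to write a file with File.WriteAllText(path, "hello", Encoding.Unicode), which emits a UTF-16 LE BOM, and then call BomSniffer.DetectByBom(path, Encoding.ASCII).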

CodesInChaos · Answer

I'd try the following steps:

1) Check if there is a byte order mark

2) Check if the file is valid UTF8

3) Use the local "ANSI" code page (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in code pages other than UTF8 are not valid UTF8.
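Step 2 can be sketched with a UTF8Encoding that is configured to throw on invalid bytes. This is a minimal illustration, assuming the whole file fits in memory as a byte array; the names Utf8Check and IsValidUtf8 are hypothetical.

```csharp
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class Utf8Check
{
    // Returns true when the bytes decode as UTF-8 without any invalid sequence.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes decoding fail instead of
        // silently substituting U+FFFD for bad sequences.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

For example, "café" saved as Latin-1 ends in the single byte 0xE9, which is not a complete UTF-8 sequence, so the check rejects it; the same word saved as UTF-8 ends in 0xC3 0xA9 and passes. Pure ASCII input always passes, which is harmless because ASCII decodes identically either way.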

Alexei Agüero Alba · Answer

Check this out:

UDE

It's a port of the Mozilla Universal Charset Detector, and you can use it like this:

    public static void Main(String[] args)
    {
        string filename = args[0];
        using (FileStream fs = File.OpenRead(filename))
        {
            Ude.CharsetDetector cdet = new Ude.CharsetDetector();
            cdet.Feed(fs);
            cdet.DataEnd();
            if (cdet.Charset != null)
            {
                Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);
            }
            else
            {
                Console.WriteLine("Detection failed.");
            }
        }
    }

Berthier Lemieux · Answer

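Here are the implementation details for the steps proposed by @CodesInChaos:

1) Check if there is a byte order mark

2) Check if the file is valid UTF8

3) Use the local "ANSI" code page (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in code pages other than UTF8 are not valid UTF8. https://stackoverflow.com/a/4522251/867248 explains the tactic in more detail.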

    using System;
    using System.IO;
    using System.Text;

    // Using encoding from BOM or UTF8 if no BOM found,
    // check if the file is valid, by reading all lines
    // If decoding fails, use the local "ANSI" codepage
    public string DetectFileEncoding(Stream fileStream)
    {
        // Strict UTF-8: throws instead of substituting U+FFFD on invalid bytes
        var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
        using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
               detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
        {
            string detectedEncoding;
            try
            {
                while (!reader.EndOfStream)
                {
                    reader.ReadLine();
                }
                detectedEncoding = reader.CurrentEncoding.BodyName;
            }
            catch (DecoderFallbackException)
            {
                // Failed to decode the file using the BOM/UTF8.
                // Assume it's local ANSI (hard-coded to ISO-8859-1 here)
                detectedEncoding = "ISO-8859-1";
            }
            // Rewind the stream
            fileStream.Seek(0, SeekOrigin.Begin);
            return detectedEncoding;
        }
    }

    [Test]
    public void Test1()
    {
        Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
        var detectedEncoding = DetectFileEncoding(fs);

        using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
        {
            // Consume your file
            var line = reader.ReadLine();
            ...

Enzojz · Answer

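The following is my PowerShell code to determine whether some cpp, h, or ml files are encoded in ISO-8859-1 (Latin-1) or in UTF-8 without a BOM; if neither, the file is assumed to be GB18030. I am Chinese and work in France, and MSVC saves files as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my system and my colleagues.

The method is simple. If all characters are within x00-x7E, then ASCII, UTF-8, and Latin-1 are all the same. If a non-ASCII file that is not UTF-8 is read as UTF-8, the replacement character (U+FFFD) shows up, so in that case try reading it as Latin-1. In Latin-1, the range \x7F-\xAF is treated as empty, whereas GB uses the full range x00-xFF, so if anything falls in that range, the file is not Latin-1.

The code is written in PowerShell, but it uses .NET, so it is easy to translate into C# or F#.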

    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
    foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
        # Read as UTF-8; invalid bytes decode to the replacement character U+FFFD
        $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i.FullName, [Text.Encoding]::UTF8)
        $contentUTF = $openUTF.ReadToEnd()
        [regex]$regex = '\uFFFD'
        $c=$regex.Matches($contentUTF).count
        $openUTF.Close()
        if ($c -ne 0) {
            # Not valid UTF-8: try Latin-1 next
            $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i.FullName, [Text.Encoding]::GetEncoding('ISO-8859-1'))
            $contentLatin1 = $openLatin1.ReadToEnd()
            $openLatin1.Close()
            [regex]$regex = '[\x7F-\xAF]'
            $c=$regex.Matches($contentLatin1).count
            if ($c -eq 0) {
                # Treated as Latin-1: rewrite as UTF-8 without BOM
                [System.IO.File]::WriteAllLines($i.FullName, $contentLatin1, $Utf8NoBomEncoding)
                $i.FullName
            }
            else {
                # Otherwise assume GB18030 and rewrite as UTF-8 without BOM
                $openGB = New-Object System.IO.StreamReader -ArgumentList ($i.FullName, [Text.Encoding]::GetEncoding('GB18030'))
                $contentGB = $openGB.ReadToEnd()
                $openGB.Close()
                [System.IO.File]::WriteAllLines($i.FullName, $contentGB, $Utf8NoBomEncoding)
                $i.FullName
            }
        }
    }
    Write-Host -NoNewLine 'Press any key to continue...';
    $null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
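As a rough illustration of how this heuristic might look in C#, here is a sketch that classifies a byte array without rewriting any file. It is not a line-by-line port: it uses strict UTF-8 decoding instead of counting U+FFFD characters, and it returns an encoding name rather than an Encoding object (on .NET Core and later, GB18030 is, as far as I know, only available after registering the code pages encoding provider). The names SourceEncodingGuess and Guess are invented for this example.

```csharp
using System.Text;

// Hypothetical C# take on the heuristic above; names are illustrative only.
public static class SourceEncodingGuess
{
    // Returns "utf-8", "iso-8859-1" or "gb18030" as a label for the guess.
    public static string Guess(byte[] bytes)
    {
        // Step 1: strict UTF-8 decoding. Pure ASCII also passes here.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return "utf-8";
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: fall through to the byte-range test.
        }

        // Step 2: Latin-1 maps each byte to the code point of the same value,
        // so the regex [\x7F-\xAF] on the decoded text is a byte-range test.
        foreach (byte b in bytes)
        {
            if (b >= 0x7F && b <= 0xAF)
            {
                return "gb18030";
            }
        }
        return "iso-8859-1";
    }
}
```

Like the script, this is a heuristic for a known mix of Latin-1 and GB files, not a general detector: a GB-encoded file whose bytes all happen to fall outside 0x7F-0xAF would be labeled Latin-1.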

Sedrick · Answer

For C#, see here:

https://msdn.Microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

    string path = @"path\to\your\file.ext";

    using (StreamReader sr = new StreamReader(path, true))
    {
        while (sr.Peek() >= 0)
        {
            Console.Write((char)sr.Read());
        }

        // Test for the encoding after reading, or at least
        // after the first read.
        Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
        Console.ReadLine();
        Console.WriteLine();
    }

Pacurar Stefan · Answer

.NET is not very helpful here, but you can try the following algorithm:

1) Try to find the encoding by BOM (byte order mark); very likely it will not be found
2) Try parsing the file with different encodings

Here is the call:

    var encoding = FileHelper.GetEncoding(filePath);
    if (encoding == null)
        throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");

Here is the code:

    using System.IO;
    using System.Text;

    public class FileHelper
    {
        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if none is found, by trying to parse it with different encodings.
        /// Returns null when no encoding can be determined.
        /// </summary>
        /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding or null.</returns>
        public static Encoding GetEncoding(string filename)
        {
            var encodingByBOM = GetEncodingByBOM(filename);
            if (encodingByBOM != null)
                return encodingByBOM;

            // BOM not found :(, so try to parse characters into several encodings
            var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
            if (encodingByParsingUTF8 != null)
                return encodingByParsingUTF8;

            // Note: every byte decodes as ISO-8859-1, so this check succeeds
            // for any readable file and the UTF-7 check below is not reached.
            var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
            if (encodingByParsingLatin1 != null)
                return encodingByParsingLatin1;

            var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
            if (encodingByParsingUTF7 != null)
                return encodingByParsingUTF7;

            return null; // no encoding found
        }

        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM)
        /// </summary>
        /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding, or null if no BOM is found.</returns>
        private static Encoding GetEncodingByBOM(string filename)
        {
            // Read the BOM
            var byteOrderMark = new byte[4];
            using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
            {
                file.Read(byteOrderMark, 0, 4);
            }

            // Analyze the BOM
            if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
            if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
            // UTF-32LE must be tested before UTF-16LE: both BOMs start with FF FE
            if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe && byteOrderMark[2] == 0 && byteOrderMark[3] == 0) return Encoding.UTF32; //UTF-32LE
            if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
            if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
            if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE

            return null; // no BOM found
        }

        private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
        {
            // Same encoding, but throwing on invalid bytes instead of substituting
            var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

            try
            {
                using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
                {
                    while (!textReader.EndOfStream)
                    {
                        textReader.ReadLine(); // in order to increment the stream position
                    }

                    // all text parsed ok
                    return textReader.CurrentEncoding;
                }
            }
            catch (DecoderFallbackException)
            {
                // The file is not valid in this encoding
            }

            return null;
        }
    }

raushan · Answer

This may be useful:

    string path = @"address/to/the/file.extension";

    using (StreamReader sr = new StreamReader(path))
    {
        sr.Peek(); // read once so that a BOM, if present, is detected
        Console.WriteLine(sr.CurrentEncoding);
    }