Java: UTF-8 encoding is not applied on URLConnection

Question

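I am trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47

As you can see in the text, it contains characters like these: /ælˈdʒɪəriə/.

When I fetch the page source, the text contains sequences such as &#250; (for ú) instead of the characters themselves.

So far I have tried the following code: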

    urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
    urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");

What am I doing wrong?

My entire code:

    URL url = null;
    URLConnection urlConn = null;
    DataInputStream input = null;
    try {
        url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
    } catch (MalformedURLException e) {
        e.printStackTrace();
    }
    try {
        urlConn = url.openConnection();
    } catch (IOException e) {
        e.printStackTrace();
    }
    urlConn.setRequestProperty("Accept-Charset", "UTF-8");
    urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
    urlConn.setDoInput(true);
    urlConn.setUseCaches(false);
    StringBuffer strBseznam = new StringBuffer();
    if (strBseznam.length() > 0)
        strBseznam.deleteCharAt(strBseznam.length() - 1);
    try {
        input = new DataInputStream(urlConn.getInputStream());
    } catch (IOException e) {
        e.printStackTrace();
    }
    String str = "";
    StringBuffer strB = new StringBuffer();
    strB.setLength(0);
    try {
        while (null != ((str = input.readLine()))) {
            strB.append(str);
        }
        input.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

Joop Eggen · Accepted Answer

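The HTML page is in UTF-8 and could contain Arabic characters and the like. However, the characters above code point 127 (that is, outside ASCII) are still encoded in the page as numeric entities such as &#250; (ú). Request headers such as Accept-Charset will not help, and loading the page as UTF-8 is entirely right.

You have to decode the entities yourself. Something like: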

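    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Replaces decimal numeric entities such as &#250; with the characters they denote.
    String decodeNumericEntities(String s) {
        StringBuffer sb = new StringBuffer();
        Matcher m = Pattern.compile("&#(\\d+);").matcher(s);
        while (m.find()) {
            int uc = Integer.parseInt(m.group(1));
            m.appendReplacement(sb, ""); // copy the text before the entity, drop the entity itself
            sb.appendCodePoint(uc);      // append the decoded character
        }
        m.appendTail(sb);                // copy whatever follows the last entity
        return sb.toString();
    }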

By the way, those entities could stem from processed HTML forms, so the cause may lie on the editing side of the web app.
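
The decoder above only matches decimal entities. As a minimal sketch, assuming the page might also contain hexadecimal references such as &#xFA;, the same approach can be extended as follows. The names EntityDecoder and decode are illustrative and do not come from any library; named entities such as &amp; are deliberately left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper (hypothetical name): decodes decimal (&#250;) and
// hexadecimal (&#xFA;) numeric character references in a string.
class EntityDecoder {
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|(\\d+));");

    static String decode(String s) {
        StringBuffer sb = new StringBuffer();
        Matcher m = NUMERIC_ENTITY.matcher(s);
        while (m.find()) {
            int codePoint;
            try {
                codePoint = (m.group(1) != null)
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                codePoint = -1; // too many digits to be a code point
            }
            // Copy the text before the entity and drop the entity itself.
            m.appendReplacement(sb, "");
            if (Character.isValidCodePoint(codePoint)) {
                sb.appendCodePoint(codePoint);
            } else {
                sb.append(m.group()); // keep an invalid reference unchanged
            }
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // The entity-encoded form of the transcription from the question.
        System.out.println(decode("/&#230;l&#712;d&#658;&#618;&#601;ri&#601;/"));
    }
}
```

For anything beyond numeric references (named entities, malformed markup), an HTML parsing library is probably a better fit than a regular expression.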

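After seeing the code in the question:

I have replaced the DataInputStream with a (Buffered)Reader, which is meant for text. InputStreams read binary data (bytes); Readers read text (Strings). An InputStreamReader takes an InputStream and an encoding as parameters and yields a Reader.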

    try {
        BufferedReader input = new BufferedReader(
                new InputStreamReader(urlConn.getInputStream(), "UTF-8"));
        StringBuilder strB = new StringBuilder();
        String str;
        while (null != (str = input.readLine())) {
            strB.append(str).append("\n"); // readLine strips the line terminator, so add it back
        }
        input.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

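To make the byte/character distinction concrete, here is a small self-contained sketch. A ByteArrayInputStream stands in for urlConn.getInputStream() (an assumption, so that the example runs without a network connection), and the names ReaderDemo and readUtf8 are illustrative.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Reads the whole stream as UTF-8 text, ending every line with '\n'.
    static String readUtf8(InputStream in) {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the stream) even on error.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // U+00FA (u with acute) is one character but two bytes (C3 BA) in UTF-8.
        byte[] bytes = "caf\u00fa\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);  // 6 bytes
        String text = readUtf8(new ByteArrayInputStream(bytes));
        System.out.println(text.length()); // 5 characters
    }
}
```

Passing the charset explicitly matters because the InputStreamReader constructor without a charset argument falls back to the platform default, which is not necessarily UTF-8.
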
limlim · Answer

Try adding a user agent to your URLConnection as well:

urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36");

This solved my decoding problem like a charm.

Hoons · Answer

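I think the problem is in how you read from the stream. DataInputStream.readLine does not convert bytes to characters properly. You could call readUTF on the DataInputStream instead of readLine, but readUTF only understands the length-prefixed format written by DataOutputStream.writeUTF, so it is not suited to an HTTP response. What I would do is create an InputStreamReader, set the encoding on it, and then read line by line from a BufferedReader (this goes inside your existing try/catch):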

    Charset charset = Charset.forName("UTF-8");
    InputStreamReader stream = new InputStreamReader(urlConn.getInputStream(), charset);
    BufferedReader reader = new BufferedReader(stream);
    StringBuffer responseBuffer = new StringBuffer();
    String read = "";
    while ((read = reader.readLine()) != null) {
        responseBuffer.append(read); // note: readLine drops the line terminators
    }
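
To illustrate why reading through an InputStreamReader with an explicit encoding matters, the following sketch compares the deprecated DataInputStream.readLine with a BufferedReader on the same UTF-8 bytes. The byte array is an in-memory stand-in for the HTTP response (an assumption, to keep the example runnable offline), and the class and method names are illustrative.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadLineComparison {
    // Deprecated approach: every byte is widened to one char, so a
    // multi-byte UTF-8 sequence comes out as several wrong characters.
    @SuppressWarnings("deprecation")
    static String viaDataInputStream(byte[] utf8) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(utf8))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reader-based approach: bytes are decoded as UTF-8 before being
    // handed out as characters.
    static String viaReader(byte[] utf8) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // U+00FA (u with acute) is encoded as the two bytes C3 BA.
        byte[] utf8 = "\u00fa".getBytes(StandardCharsets.UTF_8);
        System.out.println(viaDataInputStream(utf8).length()); // 2 (U+00C3, U+00BA)
        System.out.println(viaReader(utf8).length());          // 1 (U+00FA)
    }
}
```

Note that this only addresses byte-to-character decoding. If the server sends literal entity text such as &#250;, both approaches return that text unchanged and it still has to be decoded separately.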