Facebook JSON badly encoded

Question

I downloaded my Facebook Messenger data (in your Facebook account, go to settings, then to Your Facebook Information, then to Download Your Information, and create a file with at least the Messages box checked).

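However, there is a small problem with the encoding. I'm not sure, but it looks like Facebook used a bad encoding for this data. When I open it in a text editor I see something like this: Rados\u00c5\u0082aw. When I try to open it with Python (UTF-8) I get RadosÅ\x82aw. However, I should get: Radosław.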

My Python script:

    text = open(os.path.join(subdir, file), encoding='utf-8')
    conversations.append(json.load(text))

I tried a few of the most common encodings. Example data:

{ "sender_name": "Rados\u00c5\u0082aw", "timestamp": 1524558089, "content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD", "type": "Generic" }

Martijn Pieters · Accepted Answer

I can indeed confirm that the Facebook download data is incorrectly encoded; it is a case of Mojibake. The original data is UTF-8 encoded but was decoded as Latin-1 instead. I'll make sure to file a bug report.
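The mechanism can be reproduced in a few lines. This is only a minimal sketch of how the damage presumably arises, since the actual export pipeline at Facebook is not known; the variable names are illustrative.

```python
import json

original = "Radosław"

# Encode correctly as UTF-8, then wrongly decode those bytes as Latin-1.
mojibake = original.encode("utf-8").decode("latin-1")

# Serialising with ASCII-only escapes yields the text seen in the download.
escaped = json.dumps(mojibake)
print(escaped)

# Reversing the wrong step recovers the original text.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)
```

Each non-ASCII character became two or more Latin-1 characters, one per UTF-8 byte, which is why every escape in the file has the form \u00hh.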

In the meantime, you can repair the damage in two ways:

Decode the data as JSON, then re-encode every string as Latin-1 and decode it again as UTF-8:
```
>>> import json
>>> data = r'"Rados\u00c5\u0082aw"'
>>> json.loads(data).encode('latin1').decode('utf8')
'Radosław'
```
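The first approach is shown on a single string; a real export holds nested objects and lists. One way to apply it to a whole decoded document is a small recursive walker. This is a sketch, and `fix_mojibake` is a hypothetical helper name, not part of the `json` module.

```python
import json

def fix_mojibake(value):
    """Recursively repair every string in decoded JSON data."""
    if isinstance(value, str):
        return value.encode("latin-1").decode("utf-8")
    if isinstance(value, list):
        return [fix_mojibake(item) for item in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans and None pass through unchanged

raw = (
    r'{"sender_name": "Rados\u00c5\u0082aw", "timestamp": 1524558089, '
    r'"content": "No to trzeba ostatnie treningi zrobi\u00c4\u0087 xD", '
    r'"type": "Generic"}'
)
message = fix_mojibake(json.loads(raw))
print(message["sender_name"])
```

The walker assumes every string in the file is damaged in the same way; a string that cannot be round-tripped would raise a `UnicodeError`.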

Load the data as binary, replace every \u00hh sequence with the byte that the last two hex digits represent, decode the result as UTF-8, and then decode it as JSON:

    import json
    import os
    import re
    from functools import partial

    # Replace each \u00hh escape with the single byte that hh represents
    fix_mojibake_escapes = partial(
        re.compile(rb'\\u00([\da-f]{2})').sub,
        lambda m: bytes.fromhex(m.group(1).decode()))

    with open(os.path.join(subdir, file), 'rb') as binary_data:
        repaired = fix_mojibake_escapes(binary_data.read())
    data = json.loads(repaired.decode('utf8'))

From your sample data this produces:

{'content': 'No to trzeba ostatnie treningi zrobić xD', 'sender_name': 'Radosław', 'timestamp': 1524558089, 'type': 'Generic'}
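The byte-level approach can also be tried without a file. The following sketch runs the substitution on an in-memory copy of part of the sample message; the `sample` literal and the `escape` name are illustrative.

```python
import json
import re

# A literal backslash, "u00", then two lowercase hex digits.
escape = re.compile(rb'\\u00([\da-f]{2})')

sample = rb'{"sender_name": "Rados\u00c5\u0082aw", "timestamp": 1524558089}'

# Turn each escape into the raw byte it stands for, then decode as UTF-8.
repaired = escape.sub(lambda m: bytes.fromhex(m.group(1).decode()), sample)
data = json.loads(repaired.decode('utf8'))
print(data)
```

One thing to watch: the pattern rewrites every \u00hh escape, so if an export ever wrote a quote or a control character in that form (for example \u0022), the substitution would leave a raw character that makes the JSON invalid. The sample data does not show this, but it may be worth checking on a full export.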

Geekmoss · Answer

My solution for parsing objects is to use the object_hook callback of load/loads:

    import json

    def parse_obj(dct):
        # Called by json for every decoded object; repairs each string value
        for key in dct:
            dct[key] = dct[key].encode('latin_1').decode('utf-8')
        return dct

    # Raw string, so the \u escapes reach the JSON parser as they appear in the file
    data = r'{"msg": "Ahoj sv\u00c4\u009bte"}'

    # String
    json.loads(data)
    # Out: {'msg': 'Ahoj svÄ\x9bte'}
    json.loads(data, object_hook=parse_obj)
    # Out: {'msg': 'Ahoj světe'}

    # File
    with open('/path/to/file.json') as f:
        json.load(f, object_hook=parse_obj)
    # Out: {'msg': 'Ahoj světe'}

Update:

The solution above does not work for lists that contain strings. Here is an updated solution:

    import json

    def parse_obj(obj):
        for key in obj:
            if isinstance(obj[key], str):
                obj[key] = obj[key].encode('latin_1').decode('utf-8')
            elif isinstance(obj[key], list):
                # Repair string items; leave other list items untouched
                obj[key] = list(map(
                    lambda x: x if type(x) != str else x.encode('latin_1').decode('utf-8'),
                    obj[key]))
        return obj
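If some strings in a file might not be damaged (an assumption, since the thread only shows damaged ones), the round trip can raise a `UnicodeError`. A more defensive variant of the hook could fall back to the original text in that case. This is a sketch; `repair` and `parse_obj_safe` are hypothetical names.

```python
import json

def repair(text):
    """Undo the Latin-1/UTF-8 mix-up, or return text unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

def parse_obj_safe(obj):
    # json calls this hook for every decoded object, so nested dicts are covered.
    for key, value in obj.items():
        if isinstance(value, str):
            obj[key] = repair(value)
        elif isinstance(value, list):
            obj[key] = [repair(x) if isinstance(x, str) else x for x in value]
    return obj

data = r'{"msg": "Ahoj sv\u00c4\u009bte", "tags": ["zrobi\u00c4\u0087", 1], "ok": "caf\u00e9"}'
result = json.loads(data, object_hook=parse_obj_safe)
print(result)
```

The trade-off is that a string which happens to be valid in both readings is still converted, so this only helps with strings whose round trip fails outright.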

Ondrej Sotolar · Answer

Based on @Martijn Pieters' solution, I wrote something similar in Java.

    public String getMessengerJson(Path path) throws IOException {
        String badlyEncoded = Files.readString(path, StandardCharsets.UTF_8);
        String unescaped = unescapeMessenger(badlyEncoded);
        byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
        String fixed = new String(bytes, StandardCharsets.UTF_8);
        return fixed;
    }

The unescape method is inspired by org.apache.commons.lang.StringEscapeUtils.

    private String unescapeMessenger(String str) {
        if (str == null) {
            return null;
        }
        try {
            StringWriter writer = new StringWriter(str.length());
            unescapeMessenger(writer, str);
            return writer.toString();
        } catch (IOException ioe) {
            // this should never ever happen while writing to a StringWriter
            throw new UnhandledException(ioe);
        }
    }

    private void unescapeMessenger(Writer out, String str) throws IOException {
        if (out == null) {
            throw new IllegalArgumentException("The Writer must not be null");
        }
        if (str == null) {
            return;
        }
        int sz = str.length();
        StrBuilder unicode = new StrBuilder(4);
        boolean hadSlash = false;
        boolean inUnicode = false;
        for (int i = 0; i < sz; i++) {
            char ch = str.charAt(i);
            if (inUnicode) {
                unicode.append(ch);
                if (unicode.length() == 4) {
                    // unicode now contains the four hex digits
                    // which represents our unicode character
                    try {
                        int value = Integer.parseInt(unicode.toString(), 16);
                        out.write((char) value);
                        unicode.setLength(0);
                        inUnicode = false;
                        hadSlash = false;
                    } catch (NumberFormatException nfe) {
                        throw new NestableRuntimeException("Unable to parse unicode value: " + unicode, nfe);
                    }
                }
                continue;
            }
            if (hadSlash) {
                hadSlash = false;
                if (ch == 'u') {
                    inUnicode = true;
                } else {
                    // not a unicode escape: keep the backslash and the character as they were
                    out.write("\\");
                    out.write(ch);
                }
                continue;
            } else if (ch == '\\') {
                hadSlash = true;
                continue;
            }
            out.write(ch);
        }
        if (hadSlash) {
            // then we're in the weird case of a \ at the end of the
            // string, let's output it anyway.
            out.write('\\');
        }
    }