Facebook JSON badly encoded

I can indeed confirm that the Facebook download data is incorrectly encoded; it’s a case of mojibake. The original data is UTF-8 encoded but was then decoded as if it were Latin-1. I’ll make sure to file a bug report.

What this means is that any non-ASCII character in the string data was encoded twice: first to UTF-8, and then those UTF-8 bytes were encoded again by treating them as Latin-1 data (an encoding that maps exactly 256 characters to the 256 possible byte values), using the \uHHHH JSON escape notation (a literal backslash, a literal lowercase letter u, followed by 4 hex digits, 0-9 and a-f). Because the second step only ever encodes byte values in the range 0-255, the result is a series of \u00HH sequences (a literal backslash, a literal lowercase letter u, two zero digits and two hex digits).

E.g. the Unicode character U+0142 LATIN SMALL LETTER L WITH STROKE in the name Radosław was encoded to the UTF-8 byte values C5 and 82 (in hex notation), and then encoded again to \u00c5\u0082.
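You can reproduce that double encoding in a Python session (a minimal sketch; the variable names are mine):

```python
import json

utf8_bytes = 'ł'.encode('utf8')         # U+0142 becomes the two bytes C5 82
mojibake = utf8_bytes.decode('latin1')  # misread as Latin-1: U+00C5 and U+0082
escaped = json.dumps(mojibake)          # JSON-escaped as "\u00c5\u0082"
print(escaped)
```

Each Latin-1-misread byte becomes its own \u00HH escape, which is exactly the pattern seen in the download.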

You can repair the damage in two ways:

  1. Decode the data as JSON, then re-encode any string values as Latin-1 binary data, and then decode again as UTF-8:

     >>> import json
     >>> data = r'"Rados\u00c5\u0082aw"'
     >>> json.loads(data).encode('latin1').decode('utf8')
     'Radosław'
    

    This would require a full traversal of your data structure to find all those strings, of course.

  2. Load the whole JSON document as binary data, replace all \u00hh JSON sequences with the byte the last two hex digits represent, then decode as JSON:

     import json
     import os
     import re
     from functools import partial
    
     fix_mojibake_escapes = partial(
         re.compile(rb'\\u00([\da-f]{2})').sub,
         lambda m: bytes.fromhex(m[1].decode()),
     )
    
     with open(os.path.join(subdir, file), 'rb') as binary_data:
         repaired = fix_mojibake_escapes(binary_data.read())
     data = json.loads(repaired)
    

    (If you are using Python 3.5 or older, you’ll have to decode the repaired bytes object from UTF-8, so use json.loads(repaired.decode())).

    From your sample data this produces:

     {'content': 'No to trzeba ostatnie treningi zrobić xD',
      'sender_name': 'Radosław',
      'timestamp': 1524558089,
      'type': 'Generic'}
    

    The regular expression matches against all \u00HH sequences in the binary data and replaces those with the bytes they represent, so that the data can be decoded correctly as UTF-8. The second decoding is taken care of by the json.loads() function when given binary data.
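For the first approach, the traversal over the decoded data structure can be sketched like this (the `repair` helper name is my own, not part of any library):

```python
import json

def repair(obj):
    """Recursively re-encode every string as Latin-1, then decode it as UTF-8."""
    if isinstance(obj, str):
        return obj.encode('latin1').decode('utf8')
    if isinstance(obj, list):
        return [repair(value) for value in obj]
    if isinstance(obj, dict):
        return {repair(key): repair(value) for key, value in obj.items()}
    return obj  # numbers, booleans and None pass through unchanged

data = repair(json.loads(r'{"sender_name": "Rados\u00c5\u0082aw"}'))
print(data)  # {'sender_name': 'Radosław'}
```

Note that this assumes every string in the document suffered the same double encoding; a string containing characters outside the Latin-1 range would raise a UnicodeEncodeError, which is why the regex approach over the raw bytes is more robust.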
