PHP DOMDocument loadHTML not encoding UTF-8 correctly

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly. If your string doesn’t contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:
$profile = "<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>";
$dom = new DOMDocument();
$dom->loadHTML('<?xml …

u'\ufeff' in Python string

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding. Without it, the BOM is included in the read result:
>>> f = open('file', mode="r")
>>> f.read()
'\ufefftest'
Giving the correct encoding, the BOM is omitted in …
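The excerpt is cut off before it names the encoding, so the following is only a sketch under an assumption: that the encoding meant is Python's standard utf-8-sig codec, which drops a leading UTF-8 BOM on read. The helper names read_text and demo_bom and the temporary file are hypothetical, introduced only for illustration.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file using the given encoding."""
    with open(path, mode="r", encoding=encoding) as f:
        return f.read()


def demo_bom():
    """Write a BOM-prefixed file, then read it back with two codecs."""
    # A small file that starts with the UTF-8 BOM (bytes EF BB BF).
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\xef\xbb\xbftest")
        # Plain utf-8 keeps the BOM as the character U+FEFF.
        with_bom = read_text(path, "utf-8")
        # utf-8-sig (assumed here) strips a leading BOM if one is present.
        without_bom = read_text(path, "utf-8-sig")
    finally:
        os.remove(path)
    return with_bom, without_bom


with_bom, without_bom = demo_bom()
print(repr(with_bom))     # '\ufefftest'
print(repr(without_bom))  # 'test'
```

utf-8-sig also reads files that have no BOM, so it can be a reasonable choice when the input may or may not carry one.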

UTF-8 byte[] to String

Look at the constructor for String:
String str = new String(bytes, StandardCharsets.UTF_8);
And if you’re feeling lazy, you can use the Apache Commons IO library to convert the InputStream to a String directly:
String str = IOUtils.toString(inputStream, StandardCharsets.UTF_8);

UTF-8: General? Bin? Unicode?

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct. Here is the difference: For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is …

“Incorrect string value” when trying to insert UTF-8 into MySQL via JDBC?

MySQL’s utf8 permits only the Unicode characters that can be represented with 3 bytes in UTF-8. Here you have a character that needs 4 bytes: \xF0\x90\x8D\x83 (U+10343 GOTHIC LETTER SAUIL). If you have MySQL 5.5 or later you can change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy …
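To check whether a string contains characters that a 3-byte utf8 column would reject, one option is to look at the UTF-8 length of each character. The sketch below is in Python rather than Java/JDBC, and the helper name four_byte_chars is hypothetical; it only illustrates the byte counts described above.

```python
def four_byte_chars(text):
    """Return the characters in text that need 4 bytes in UTF-8."""
    return [ch for ch in text if len(ch.encode("utf-8")) == 4]


# U+10343 GOTHIC LETTER SAUIL, the character from the error message.
sauil = "\U00010343"
encoded = sauil.encode("utf-8")
print(encoded)       # b'\xf0\x90\x8d\x83'
print(len(encoded))  # 4

# Mixed sample: 1-byte 'a', 2-byte 'é', 3-byte 'イ', 4-byte U+10343.
sample = "a\u00e9\u30a4" + sauil
offenders = four_byte_chars(sample)
print(len(offenders))  # 1
```

Any character this helper reports lies outside the Basic Multilingual Plane and would need a utf8mb4 column.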

error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Python tries to convert a byte array (a bytes object that it assumes holds UTF-8-encoded text) to a Unicode string (str). This conversion is a decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in UTF-8-encoded strings (namely the 0xff at position 0). Since …
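A small way to reproduce the error: 0xff can never start a UTF-8 sequence, and one common source of a leading 0xff is the UTF-16 little-endian BOM (FF FE). The sketch below assumes that cause purely for illustration; the truncated excerpt does not say what the asker's data actually was, and the helper name try_decode is hypothetical.

```python
import codecs


def try_decode(data):
    """Try UTF-8 decoding; return (text, None) or (None, error)."""
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as err:
        return None, err


# Bytes of a UTF-16 LE text with BOM: FF FE followed by 't', 'e', 's', 't'.
data = codecs.BOM_UTF16_LE + "test".encode("utf-16-le")

text, err = try_decode(data)
print(text)                         # None
print(err.start)                    # 0
print(hex(err.object[err.start]))   # 0xff

# Decoding with the codec that matches the data works (assumed UTF-16 here).
recovered = data.decode("utf-16")
print(recovered)                    # test
```

The error object's start and object attributes show which byte was rejected, which can help when guessing the real encoding of the input.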
