utf-8 – Page 37 – Tarik Billa

PHP DOMDocument loadHTML not encoding UTF-8 correctly

October 8, 2022 by Tarik

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly. If your string doesn’t contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8: $profile = "<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>"; $dom = new DOMDocument(); $dom->loadHTML('<?xml …

u’\ufeff’ in Python string

October 4, 2022 by Tarik

I ran into this on Python 3 and found this question (and solution). When opening a file, Python 3 supports the encoding keyword to automatically handle the encoding. Without it, the BOM is included in the read result: >>> f = open('file', mode="r") >>> f.read() '\ufefftest' Giving the correct encoding, the BOM is omitted in …
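A minimal sketch of the difference, assuming the file is UTF-8 with a leading BOM. The temporary file and the variable names are made up for illustration. The "utf-8-sig" codec is the standard-library codec that drops a leading BOM on read:

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file that starts with a byte order mark (EF BB BF).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b"test")
    bom_path = tmp.name

# Plain "utf-8" decodes the BOM as the character U+FEFF and keeps it.
with open(bom_path, mode="r", encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" recognises a leading BOM and drops it.
with open(bom_path, mode="r", encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(bom_path)

print(repr(with_bom))     # '\ufefftest'
print(repr(without_bom))  # 'test'
```

If the file has no BOM, "utf-8-sig" reads it the same way "utf-8" does, so it is a reasonable default when the presence of a BOM is unknown.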

UTF-8 byte[] to String

October 4, 2022 by Tarik

Look at the constructor for String: String str = new String(bytes, StandardCharsets.UTF_8); And if you’re feeling lazy, you can use the Apache Commons IO library to convert the InputStream to a String directly: String str = IOUtils.toString(inputStream, StandardCharsets.UTF_8);

How to use UTF-8 in resource properties with ResourceBundle

October 1, 2022 by Tarik

UTF-8: General? Bin? Unicode?

October 1, 2022 by Tarik

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct. Here is the difference: For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is …

How do I check if a string is unicode or ascii?

September 30, 2022 by Tarik

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes. In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this: def whatisthis(s): if isinstance(s, str): print "ordinary string" elif isinstance(s, unicode): print "unicode …
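For Python 3, a rough counterpart might look like the sketch below. The function name describe and its labels are hypothetical, not from the original answer. It relies on str.isascii(), which is available from Python 3.7:

```python
def describe(value):
    """Return a short label for the kind of text-like value passed in."""
    if isinstance(value, str):
        # str.isascii() is True when every code point is below 128.
        if value.isascii():
            return "ASCII-only str"
        return "str with non-ASCII characters"
    if isinstance(value, (bytes, bytearray)):
        # Raw bytes carry no encoding information of their own.
        return "raw bytes"
    return "neither str nor bytes"


print(describe("hello"))
print(describe("h\u00e9llo"))
print(describe(b"hello"))
```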

“Incorrect string value” when trying to insert UTF-8 into MySQL via JDBC?

September 29, 2022 by Tarik

MySQL’s utf8 permits only the Unicode characters that can be represented with 3 bytes in UTF-8. Here you have a character that needs 4 bytes: \xF0\x90\x8D\x83 (U+10343 GOTHIC LETTER SAUIL). If you have MySQL 5.5 or later you can change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy …
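To see which characters in a string would trip the 3-byte limit, one option is to look for code points above U+FFFF, since those are the ones that take 4 bytes in UTF-8. This is only an illustrative sketch in Python; the helper name four_byte_characters is made up here:

```python
def four_byte_characters(text):
    """Return the characters in text that need 4 bytes in UTF-8."""
    # Code points above U+FFFF lie outside the Basic Multilingual Plane
    # and are the ones encoded as 4-byte sequences in UTF-8.
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "abc\U00010343"
offenders = four_byte_characters(sample)

print([f"U+{ord(ch):04X}" for ch in offenders])  # ['U+10343']
print("\U00010343".encode("utf-8"))              # b'\xf0\x90\x8d\x83'
```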

error UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xff in position 0: invalid start byte

September 28, 2022 by Tarik

Python tries to convert a byte array (a bytes object that it assumes to be a UTF-8-encoded string) to a Unicode string (str). This process is, of course, a decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence that is not allowed in UTF-8-encoded strings (namely this 0xff at position 0). Since …
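One common way to end up with 0xff at position 0 is a file that is really UTF-16 with a little-endian BOM (FF FE). That is an assumption for this sketch, not something the error message itself proves, and the sample data is made up:

```python
import codecs

# Hypothetical input: "test" as UTF-16 little-endian with a leading BOM.
raw = codecs.BOM_UTF16_LE + "test".encode("utf-16-le")

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # 0xff can never start a UTF-8 sequence, so decoding fails at once.
    error = exc

# Decoding with the encoding the bytes were actually written in works,
# and the "utf-16" codec consumes the BOM.
decoded = raw.decode("utf-16")

print(error.start, error.reason)  # 0 invalid start byte
print(decoded)                    # test
```

If the bytes are not UTF-16 either, the right fix is to find out which encoding produced them rather than to guess.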

Detect encoding and make everything UTF-8

September 27, 2022 by Tarik

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output. I made a function that addresses all these issues. It’s called Encoding::toUTF8(). You don’t need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix …
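The garbling from double encoding can be reproduced in a few lines. The sketch below is in Python rather than PHP, and latin1_to_utf8 is a hypothetical stand-in that mimics what utf8_encode() does (treat the input as ISO-8859-1 and re-encode it as UTF-8). It is not the Encoding::toUTF8() function mentioned above:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Interpret data as ISO-8859-1 and re-encode the result as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


# "é" already encoded as UTF-8 is the two bytes C3 A9.
already_utf8 = "\u00e9".encode("utf-8")

# Converting again treats each of those bytes as a separate Latin-1 character.
double_encoded = latin1_to_utf8(already_utf8)

print(already_utf8)                    # b'\xc3\xa9'
print(double_encoded)                  # b'\xc3\x83\xc2\xa9'
print(double_encoded.decode("utf-8"))  # Ã©
```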

Using PowerShell to write a file in UTF-8 without the BOM

September 27, 2022 by Tarik

Using .NET’s UTF8Encoding class and passing $False to the constructor seems to work: $MyRawString = Get-Content -Raw $MyPath; $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False; [System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)
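For comparison, here is the same idea sketched in Python, where the plain "utf-8" codec writes no BOM and "utf-8-sig" writes one. The temporary file and the sample text are made up for illustration:

```python
import codecs
import os
import tempfile

text = "na\u00efve caf\u00e9\n"

# Create a scratch file to write to.
fd, out_path = tempfile.mkstemp()
os.close(fd)

# Plain "utf-8" writes the encoded text only, with no byte order mark.
with open(out_path, "w", encoding="utf-8", newline="") as f:
    f.write(text)
with open(out_path, "rb") as f:
    plain_bytes = f.read()

# "utf-8-sig" prepends the BOM (EF BB BF) when writing.
with open(out_path, "w", encoding="utf-8-sig", newline="") as f:
    f.write(text)
with open(out_path, "rb") as f:
    sig_bytes = f.read()

os.remove(out_path)

print(plain_bytes.startswith(codecs.BOM_UTF8))  # False
print(sig_bytes.startswith(codecs.BOM_UTF8))    # True
```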