utf-8 – Page 36 – Tarik Billa

How to remove all non printable characters in a string?

October 18, 2022 by Tarik

7 bit ASCII? If your Tardis just landed in 1963, and you just want the 7 bit printable ASCII chars, you can rip out everything from 0-31 and 127-255 with this:
    $string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string);
It matches anything in range 0-31, 127-255 and removes it. 8 bit extended ASCII? You fell into a Hot … Read more
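For comparison, the same byte-range filter can be sketched in Python. This is only an illustration, not the answer's own code: the name `strip_to_printable_ascii` is hypothetical, and the sketch assumes the input is a byte string, so the pattern is applied byte by byte.

```python
import re

# Bytes 0-31 and 127-255, i.e. everything outside printable 7-bit ASCII.
_NON_PRINTABLE = re.compile(rb"[\x00-\x1F\x7F-\xFF]")


def strip_to_printable_ascii(data: bytes) -> bytes:
    """Remove every byte outside the range 32-126 (hypothetical helper)."""
    return _NON_PRINTABLE.sub(b"", data)


# "é" is two bytes in UTF-8 (C3 A9); both fall in 127-255 and are removed,
# as is the trailing newline (0x0A).
print(strip_to_printable_ascii("héllo\n".encode("utf-8")))  # b'hllo'
```

Note that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so this filter drops non-ASCII characters entirely rather than leaving partial sequences behind.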

Using Javascript’s atob to decode base64 doesn’t properly decode utf-8 strings

October 18, 2022 by Tarik

The Unicode Problem
Though JavaScript (ECMAScript) has matured, the fragility of Base64, ASCII, and Unicode encoding has caused a lot of headaches (much of it is in this question’s history). Consider the following example:
    const ok = "a";
    console.log(ok.codePointAt(0).toString(16)); // 61: occupies < 1 byte
    const notOK = "✓";
    console.log(notOK.codePointAt(0).toString(16)); // 2713: occupies > 1 … Read more
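The underlying issue can be illustrated in Python, where the byte and text steps are explicit. This is a rough sketch rather than the answer's solution; the helper names `text_to_b64` and `b64_to_text` are hypothetical. It shows that Base64 carries bytes, and that those bytes have to be decoded as UTF-8 instead of being read one character per byte.

```python
import base64


def text_to_b64(text: str) -> str:
    """Encode text as UTF-8 bytes, then as Base64 (hypothetical helper)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def b64_to_text(encoded: str) -> str:
    """Decode Base64 to bytes, then interpret the bytes as UTF-8."""
    return base64.b64decode(encoded).decode("utf-8")


check_mark = "\u2713"  # the same character as in the example above

print(len("a".encode("utf-8")))         # 1
print(len(check_mark.encode("utf-8")))  # 3
print(text_to_b64(check_mark))          # 4pyT

# Reading the decoded bytes one character per byte gives three wrong characters.
per_byte = base64.b64decode("4pyT").decode("latin-1")
print(len(per_byte))                       # 3
print(b64_to_text("4pyT") == check_mark)   # True
```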

How do I determine file encoding in OS X?

October 16, 2022 by Tarik

Using the -I (that’s a capital i) option on the file command seems to show the file encoding:
    file -I {filename}
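If the result is needed inside a script, the command can be wrapped as in the sketch below. Both function names are hypothetical, and the parser assumes the output looks like `notes.txt: text/plain; charset=utf-8`; the exact format may differ between versions of `file`, so treat this as a starting point only.

```python
import subprocess


def parse_charset(file_output: str) -> str:
    """Return the text after 'charset=' or '' if it is absent (assumed format)."""
    _, separator, charset = file_output.strip().rpartition("charset=")
    return charset if separator else ""


def detect_charset(path: str) -> str:
    """Run `file -I` on path and extract the reported charset.

    Assumes macOS, where -I prints the MIME type and charset.
    """
    result = subprocess.run(
        ["file", "-I", path], capture_output=True, text=True, check=True
    )
    return parse_charset(result.stdout)


print(parse_charset("notes.txt: text/plain; charset=utf-8\n"))  # utf-8
```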

Encode String to UTF-8

October 14, 2022 by Tarik

How about using:
    ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(myString);

How can I output a UTF-8 CSV in PHP that Excel will read properly?

October 13, 2022 by Tarik

I have the same (or similar) problem. In my case, if I add a BOM to the output, it works:
    header('Content-Encoding: UTF-8');
    header('Content-type: text/csv; charset=UTF-8');
    header('Content-Disposition: attachment; filename=Customers_Export.csv');
    echo "\xEF\xBB\xBF"; // UTF-8 BOM
I believe this is a pretty ugly hack, but it worked for me, at least for Excel 2007 Windows. Not sure it’ll … Read more
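The same BOM trick can be sketched in Python, where the `utf-8-sig` codec prepends the three BOM bytes automatically. The function name `rows_to_excel_csv` is hypothetical, and whether a given Excel version honours the BOM is an assumption carried over from the answer, not something this sketch verifies.

```python
import csv
import io


def rows_to_excel_csv(rows) -> bytes:
    """Serialize rows as CSV and prefix the UTF-8 BOM (hypothetical helper)."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    # "utf-8-sig" writes EF BB BF before the encoded text.
    return buffer.getvalue().encode("utf-8-sig")


payload = rows_to_excel_csv([["name", "city"], ["Zoë", "Kraków"]])
print(payload[:3])  # b'\xef\xbb\xbf'
```

The returned bytes can then be sent as the response body or written to a file opened in binary mode.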

How to convert a string to utf-8 in Python

October 10, 2022 by Tarik

In Python 2
    >>> plain_string = "Hi!"
    >>> unicode_string = u"Hi!"
    >>> type(plain_string), type(unicode_string)
    (<type 'str'>, <type 'unicode'>)
    ^ This is the difference between a byte string (plain_string) and a unicode string.
    >>> s = "Hello!"
    >>> u = unicode(s, "utf-8")
    ^ Converting to unicode and specifying the encoding.
In Python 3 All strings are … Read more
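For Python 3, a minimal sketch of the round trip looks like this. It is not taken from the truncated answer; it only relies on the standard `str.encode` and `bytes.decode` methods, and the variable names are illustrative.

```python
text = "Héllo"                     # str: a sequence of Unicode code points
encoded = text.encode("utf-8")     # bytes: the UTF-8 representation
decoded = encoded.decode("utf-8")  # back to str

print(type(text).__name__, type(encoded).__name__)  # str bytes
print(encoded)                                      # b'H\xc3\xa9llo'
print(decoded == text)                              # True
```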

Do I really need to encode ‘&’ as ‘&amp;’?

October 10, 2022 by Tarik

Yes. Just as the error said, in HTML, attributes are #PCDATA meaning they’re parsed. This means you can use character entities in the attributes. Using & by itself is wrong and if not for lenient browsers and the fact that this is HTML not XHTML, would break the parsing. Just escape it as &amp; and … Read more
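When attribute values are generated by a program, the escaping can be delegated to a library. The sketch below uses Python's standard `html.escape`; the helper name `href_attr` and the example URL are hypothetical.

```python
from html import escape, unescape


def href_attr(url: str) -> str:
    """Build an href attribute with &, <, > and quotes escaped (hypothetical helper)."""
    return 'href="' + escape(url, quote=True) + '"'


attribute = href_attr("search?q=utf-8&page=36")
print(attribute)  # href="search?q=utf-8&amp;page=36"

# A parser turns the entity back into a plain ampersand.
print(unescape("search?q=utf-8&amp;page=36"))  # search?q=utf-8&page=36
```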

Write to UTF-8 file in Python

October 10, 2022 by Tarik

I believe the problem is that codecs.BOM_UTF8 is a byte string, not a Unicode string. I suspect the file handler is trying to guess what you really mean based on “I’m meant to be writing Unicode as UTF-8-encoded text, but you’ve given me a byte string!” Try writing the Unicode string for the byte order … Read more
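One way to avoid mixing byte strings and text is to write the byte order mark as the character U+FEFF and let the codec produce the bytes. The sketch below assumes Python 3 and uses a hypothetical helper name, `write_utf8_with_bom`; it is an illustration, not the code from the question.

```python
import codecs
import os
import tempfile


def write_utf8_with_bom(path: str, text: str) -> None:
    """Write text as UTF-8, preceded by a BOM (hypothetical helper)."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        # U+FEFF is the BOM as a character; the UTF-8 codec emits EF BB BF for it.
        handle.write("\ufeff")
        handle.write(text)


path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
write_utf8_with_bom(path, "héllo")
with open(path, "rb") as handle:
    raw = handle.read()

print(raw.startswith(codecs.BOM_UTF8))  # True
```

Opening the file with `encoding="utf-8-sig"` achieves the same result without writing U+FEFF by hand.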

What’s the difference between Unicode and UTF-8? [duplicate]

October 10, 2022 by Tarik

As Rasmus states in his article “The difference between UTF-8 and Unicode?”: If asked the question, “What is the difference between UTF-8 and Unicode?”, would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand … Read more
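A small illustration of the distinction, under the usual reading that Unicode assigns a number (code point) to each character while UTF-8 is one of several ways to turn those numbers into bytes. The variable names are illustrative and not taken from the article.

```python
euro = "\u20ac"  # EURO SIGN

# The Unicode code point is a number that is independent of any encoding.
print(hex(ord(euro)))                  # 0x20ac

# The same code point serialized by three different encodings.
print(euro.encode("utf-8").hex())      # e282ac
print(euro.encode("utf-16-be").hex())  # 20ac
print(euro.encode("utf-32-be").hex())  # 000020ac
```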

HTML encoding issues – “Â” character showing up instead of “&nbsp;”

October 9, 2022 by Tarik

“Somewhere in that mess, the non-breaking spaces from the HTML template (the &nbsp;s) are encoding as ISO-8859-1 so that they show up incorrectly as an ‘Â’ character”
That’d be encoding to UTF-8 then, not ISO-8859-1. The non-breaking space character is byte 0xA0 in ISO-8859-1; when encoded to UTF-8 it’d be 0xC2,0xA0, which, if you (incorrectly) … Read more
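The byte values mentioned above can be checked with a short Python sketch. The variable names are illustrative; the sketch only reproduces the mismatch of encoding as UTF-8 and reading back as ISO-8859-1.

```python
nbsp = "\u00a0"  # NO-BREAK SPACE

# One byte in ISO-8859-1, two bytes in UTF-8.
print(nbsp.encode("iso-8859-1").hex())  # a0
utf8_bytes = nbsp.encode("utf-8")
print(utf8_bytes.hex())                 # c2a0

# Reading the UTF-8 bytes as ISO-8859-1 yields "Â" followed by a no-break space.
misread = utf8_bytes.decode("iso-8859-1")
print(misread == "\u00c2\u00a0")        # True
```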