utf-8 – Page 2 – Tarik Billa

UTF-8 Continuation bytes

April 3, 2024 by Tarik

A continuation byte in UTF-8 is any byte whose top two bits are 10. Continuation bytes are the bytes that follow the lead byte in a multi-byte sequence. The following table may help:

    Unicode code points  Encoding  Binary value
    -------------------  --------  ------------
    U+000000-U+00007f    0xxxxxxx  0xxxxxxx
    U+000080-U+0007ff    110yyyxx  00000yyy xxxxxxxx
                         10xxxxxx
    U+000800-U+00ffff    1110yyyy  yyyyyyyy xxxxxxxx
                         10yyyyxx
                         10xxxxxx
    U+010000-U+10ffff    11110zzz  000zzzzz yyyyyyyy … Read more
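The "top two bits are 10" rule can be checked with a single bit mask. The following is a minimal Python sketch, not taken from the original answer; the function names `is_continuation_byte` and `count_code_points` are made up for illustration.

```python
def is_continuation_byte(b: int) -> bool:
    """Return True if the byte value b (0..255) has the bit pattern 10xxxxxx."""
    # 0xC0 keeps only the top two bits; 0x80 is binary 10000000.
    return (b & 0xC0) == 0x80


def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    # Every code point contributes exactly one non-continuation (lead) byte.
    return sum(1 for b in data if not is_continuation_byte(b))


encoded = "héllo €".encode("utf-8")
# Show the continuation bytes of the sample string in binary.
print([f"{b:08b}" for b in encoded if is_continuation_byte(b)])
print(count_code_points(encoded))
```

Counting only the bytes that are not continuation bytes gives the number of code points, assuming the input is valid UTF-8.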

Remove diacritics using Go

March 11, 2024 by Tarik

You can use the libraries described in Text normalization in Go. Here’s an application of those libraries:

    // Example derived from: http://blog.golang.org/normalization
    package main

    import (
        "fmt"
        "unicode"

        "golang.org/x/text/transform"
        "golang.org/x/text/unicode/norm"
    )

    func isMn(r rune) bool {
        return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
    }

    func main() {
        t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
        result, _, _ … Read more
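The Go excerpt is cut off, so for comparison here is the same decompose, strip, recompose idea sketched with Python's standard `unicodedata` module. This is an analogous illustration, not the original answer's code; `remove_diacritics` is a hypothetical helper name.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Strip nonspacing marks (category Mn) via NFD, then recompose with NFC."""
    # NFD splits characters such as "é" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, mirroring the isMn filter in the Go excerpt.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # NFC recomposes whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(remove_diacritics("žůžo crème brûlée"))
```

Letters that have no canonical decomposition (for example "ø") are left unchanged by this approach.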

Write UTF-8 files from R

January 9, 2024 by Tarik

Convert git repository file encoding

January 9, 2024 by Tarik

You can do this with git filter-branch. The idea is that you have to change the encoding of the files in every commit, rewriting each commit as you go. First, write a script that changes the encoding of every file in the repository. It could look like this:

    #!/bin/sh
    find . -type f -print |
    … Read more
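Since the shell script above is truncated, here is a rough Python sketch of the "change the encoding of every file" step only. It assumes the files are Latin-1 and should become UTF-8, which the excerpt does not state; `reencode_file` and `reencode_tree` are hypothetical names. It treats every file as text and is not idempotent, so a real script would need to skip binary files and files that were already converted.

```python
import os
import tempfile


def reencode_file(path, src="latin-1", dst="utf-8"):
    """Rewrite one file in place, decoding with src and encoding with dst."""
    with open(path, "rb") as f:
        raw = f.read()
    text = raw.decode(src)  # assumption: the file really is in the src encoding
    with open(path, "wb") as f:
        f.write(text.encode(dst))


def reencode_tree(root, src="latin-1", dst="utf-8"):
    """Re-encode every regular file under root, leaving the .git directory alone."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git so repository metadata is never touched.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            reencode_file(os.path.join(dirpath, name), src, dst)


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    sample = os.path.join(root, "note.txt")
    with open(sample, "wb") as f:
        f.write("café".encode("latin-1"))
    reencode_tree(root)
    with open(sample, "rb") as f:
        converted = f.read()
print(converted)
```

A script like this could be run once per commit by git filter-branch, for instance through its `--tree-filter` option; how the original answer invokes its script is not visible in the excerpt.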

Utf8_general_ci or utf8mb4 or…?

January 7, 2024 by Tarik

MySQL’s utf32 and utf8mb4 (as well as standard UTF-8) can directly store any character specified by Unicode; the former is fixed size at 4 bytes per character whereas the latter is between 1 and 4 bytes per character. utf8mb3 and the original utf8 can only store the first 65,536 codepoints, which will cover CJVK (Chinese, … Read more
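The size claims above can be checked outside MySQL. The following Python sketch only models them: it measures encoded lengths and tests whether every code point falls within the first 65,536. The helper names `utf8_len` and `fits_in_utf8mb3` are made up for illustration and are not MySQL functions.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a string needs in UTF-8 (1 to 4 per code point)."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every code point is within the first 65,536 (U+0000..U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


for ch in ["A", "é", "€", "😀"]:
    # utf-32-be is used so that no byte order mark is added to the length.
    utf32_len = len(ch.encode("utf-32-be"))
    print(
        f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s) in UTF-8, "
        f"{utf32_len} in UTF-32, fits utf8mb3: {fits_in_utf8mb3(ch)}"
    )
```

The emoji is the only sample character above U+FFFF, so it is the one that needs four UTF-8 bytes and would not fit in a utf8mb3 column.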

Fetching UTF-8 text from MySQL in R returns “????”

January 7, 2024 by Tarik

How to get UTF-8 in Node.js?

January 6, 2024 by Tarik

Hook into your response generator or create a middleware that does the following:

    res.setHeader("Content-Type", "application/json; charset=utf-8");

Otherwise the browser displays the content in its favorite encoding. If this doesn’t help, your DB is probably in the wrong encoding. For older Node.js versions use:

    res.header("Content-Type", "application/json; charset=utf-8");

How to get ncurses to output astral plane unicode characters

January 3, 2024 by Tarik

It’s not exactly that ncurses is broken. More like, glibc is broken. Or whatever implementation of libc you are using; I’m just assuming that it is glibc. Unlike simple console output (i.e., printf), ncurses needs to know how wide every character is when it is printed because it needs to maintain its own model of … Read more

PDO::exec() or PDO::query()?

January 3, 2024 by Tarik

Django dumpdata UTF-8 (Unicode)

January 3, 2024 by Tarik

After struggling with similar issues, I’ve just found that the XML format handles UTF-8 properly:

    manage.py dumpdata --format=xml > output.xml

I had to transfer data from Django 0.96 to Django 1.3. After numerous tries with dump/load data, I’ve finally succeeded using XML. No side effects for now. Hope this will help someone, as I’ve landed at … Read more