UTF-8 Continuation bytes

A continuation byte in UTF-8 is any byte where the top two bits are 10. They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
U+000000-U+00007f    0xxxxxxx  0xxxxxxx

U+000080-U+0007ff    110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

U+000800-U+00ffff    1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

U+010000-U+10ffff    11110zzz  000zzzzz yyyyyyyy …
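The bit test described above can be sketched in Go as follows. This is a minimal illustration, not code from the original answer: the helper names isContinuation and countContinuation are made up for this example. Masking with 0xC0 keeps only the top two bits, and comparing against 0x80 checks that they are 10.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isContinuation reports whether b has the bit pattern 10xxxxxx.
// (Hypothetical helper name, introduced for this sketch.)
func isContinuation(b byte) bool {
	return b&0xC0 == 0x80
}

// countContinuation returns how many bytes of s are continuation bytes.
func countContinuation(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		if isContinuation(s[i]) {
			n++
		}
	}
	return n
}

func main() {
	// One 1-byte, one 2-byte and one 3-byte sequence.
	s := "a\u00e9\u20ac"
	for i := 0; i < len(s); i++ {
		fmt.Printf("%08b continuation=%t\n", s[i], isContinuation(s[i]))
	}
	// In valid UTF-8, every byte that is not a continuation byte starts
	// a code point, so the two counts below agree.
	fmt.Println(len(s)-countContinuation(s) == utf8.RuneCountInString(s))
}
```

The standard library's utf8.RuneStart performs the inverse check: it reports whether a byte could be the first byte of an encoded rune.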

Remove diacritics using Go

You can use the libraries described in Text normalization in Go. Here’s an application of those libraries:

    // Example derived from: http://blog.golang.org/normalization
    package main

    import (
        "fmt"
        "unicode"

        "golang.org/x/text/transform"
        "golang.org/x/text/unicode/norm"
    )

    func isMn(r rune) bool {
        return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
    }

    func main() {
        t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
        result, _, _ …

Character code of the unknown-character symbol, e.g. square or question mark in a rhombus

Unicode has two symbols for unknown characters: β–‘ (WHITE SQUARE, U+25A1) – Replaces a missing or unsupported Unicode character. οΏ½ (REPLACEMENT CHARACTER, U+FFFD) – Replaces an invalid or unrecognizable character. Indicates a Unicode error. Sources Quora – What symbol is the square box shown for non-representable Unicode characters? FileFormat.Info – Unicode Character ‘WHITE SQUARE’ (U+25A1) … Read more
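As an illustration of how U+FFFD shows up in practice, here is a small Go sketch. It is not part of the original answer, and the helper name sanitize is an assumption for this example. Go's standard library exposes the replacement character as utf8.RuneError, produces it when decoding malformed UTF-8, and can substitute it for invalid input with strings.ToValidUTF8.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize replaces invalid UTF-8 in s with U+FFFD REPLACEMENT CHARACTER.
// (Hypothetical helper name, introduced for this sketch.)
func sanitize(s string) string {
	return strings.ToValidUTF8(s, string(utf8.RuneError))
}

func main() {
	bad := "ab\xffcd" // the byte 0xff never appears in valid UTF-8

	// Ranging over a string decodes it rune by rune; an invalid byte
	// is reported as U+FFFD.
	for i, r := range bad {
		fmt.Printf("byte offset %d: %U\n", i, r)
	}

	// %+q escapes non-ASCII output so the replacement is easy to see.
	fmt.Printf("%+q\n", sanitize(bad))
}
```

Note that this covers only the "invalid data" case. Whether a valid but unsupported character is drawn as a square is up to the font and rendering system, not the decoder.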

Why does UTF-32 exist when only 21 bits are necessary to encode every character?

Computers are generally much better at dealing with data on 4-byte boundaries. The benefits in terms of reduced memory consumption are relatively small compared with the pain of working on 3-byte boundaries. (I speculate there was also a reluctance to have a limit that was “only what we can currently imagine being useful” when …
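The 21-bit figure and the convenience of 4-byte units can both be checked with a short Go sketch. This is an illustration under stated assumptions, not part of the original answer: the helper name bitsNeeded is made up here. Go's rune type is an alias for int32, so a []rune is effectively a UTF-32 view of a string.

```go
package main

import (
	"fmt"
	"math/bits"
	"unicode"
)

// bitsNeeded returns the number of bits required to represent the
// code point r. (Hypothetical helper name, introduced for this sketch.)
func bitsNeeded(r rune) int {
	return bits.Len32(uint32(r))
}

func main() {
	// unicode.MaxRune is U+10FFFF, the highest valid code point.
	fmt.Printf("%U needs %d bits\n", unicode.MaxRune, bitsNeeded(unicode.MaxRune))

	// Each element of a []rune is a 32-bit value, so indexing by code
	// point is a plain array access with no decoding step.
	s := []rune("h\u00e9llo \U0001F600")
	fmt.Println(len(s))     // number of code points
	fmt.Printf("%U\n", s[6]) // the seventh code point, fetched directly
}
```

Packing code points into 3 bytes would save a quarter of the memory, but every access would need shifts and masks instead of a single aligned load.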

Unicode mirror character?

We’ve talked about attacks using the RLO (U+202E RIGHT TO LEFT OVERRIDE) character in the past, which shifts the ‘visual’ display of a string from the position it’s placed inside that string. So for example:

document[U+202E]fdp.exe

visually looks like

documentexe.pdf

I talked about these and other attacks of this sort here http://www.casaba.com/products/UCAPI/. In fact we’re …
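One way to guard against the file name trick shown above is to flag any string that contains bidirectional control characters. The Go sketch below is an assumption-laden illustration, not something taken from the original post: the helper name hasBidiControl is made up, and it relies on the standard library's unicode.Bidi_Control table, which includes U+202E.

```go
package main

import (
	"fmt"
	"unicode"
)

// hasBidiControl reports whether s contains any character with the
// Unicode Bidi_Control property, such as U+202E RIGHT-TO-LEFT OVERRIDE.
// (Hypothetical helper name, introduced for this sketch.)
func hasBidiControl(s string) bool {
	for _, r := range s {
		if unicode.Is(unicode.Bidi_Control, r) {
			return true
		}
	}
	return false
}

func main() {
	name := "document\u202efdp.exe"
	// %+q prints the string with non-ASCII characters escaped, which
	// reveals the hidden override character.
	fmt.Printf("%+q suspicious=%t\n", name, hasBidiControl(name))
	fmt.Printf("%+q suspicious=%t\n", "document.pdf", hasBidiControl("document.pdf"))
}
```

Rejecting such names outright may be too strict for legitimate right-to-left text, so treating a match as a warning rather than an error is a reasonable design choice.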