Why does Java char use UTF-16?

Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical: Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that …

Slice a string containing Unicode chars

Possible solutions to codepoint slicing

"I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?"

If you know the exact byte indices, you can slice a string:

    let text = "Hello привет";
    println!("{}", &text[2..10]);

This prints "llo пр". So the problem is to …
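The difference between byte indices and code point indices can be sketched in Python, used here only for illustration (the excerpt itself is about Rust). Python strings index by code point, so the byte-index slice above has to be reproduced on the encoded bytes. The helper name byte_offset is hypothetical, not taken from the original answer.

```python
text = "Hello привет"

# Python str slicing counts code points, not bytes.
by_code_point = text[2:8]

# Slicing the UTF-8 bytes mirrors the byte-index slice 2..10:
# "llo " is 4 single-byte characters, then two 2-byte Cyrillic letters.
by_byte = text.encode("utf-8")[2:10].decode("utf-8")


def byte_offset(s: str, char_index: int) -> int:
    """Return the UTF-8 byte offset at which code point char_index starts.

    Hypothetical helper: it encodes the prefix and measures its length.
    """
    return len(s[:char_index].encode("utf-8"))


print(by_code_point)
print(by_byte)
print(byte_offset(text, 2), byte_offset(text, 8))
```

Converting code point indices to byte offsets this way costs time proportional to the length of the prefix, which is the same walk the chars() iterator would perform.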

Python and BeautifulSoup encoding issues [duplicate]

In your case, the page contains invalid UTF-8 data, which confuses BeautifulSoup and makes it think that the page uses windows-1252. You can use this trick: soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8', 'ignore')). This discards any invalid bytes from the page source, and BeautifulSoup will then guess the encoding correctly. You can replace 'ignore' with 'replace' …
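A minimal sketch of what the two error handlers do, using only the standard bytes.decode call. The sample bytes are made up for illustration, and the BeautifulSoup call is left out so the snippet runs without the library installed.

```python
# Hypothetical page content: valid UTF-8 ("café") plus one stray 0xFF byte,
# which can never appear in well-formed UTF-8.
raw = b"caf\xc3\xa9 \xff ok"

# "ignore" silently drops the bytes that cannot be decoded.
ignored = raw.decode("utf-8", "ignore")

# "replace" substitutes U+FFFD (the replacement character) instead,
# so the damage stays visible in the output.
replaced = raw.decode("utf-8", "replace")

print(repr(ignored))
print(repr(replaced))
```

Either result is a str that can then be passed to the parser in place of the raw bytes.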

UTF-8 Continuation bytes

A continuation byte in UTF-8 is any byte where the top two bits are 10. They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points   Encoding (bytes)              Binary value
-------------------   ---------------------------   --------------------------
U+000000-U+00007f     0xxxxxxx                      0xxxxxxx
U+000080-U+0007ff     110yyyxx 10xxxxxx             00000yyy xxxxxxxx
U+000800-U+00ffff     1110yyyy 10yyyyxx 10xxxxxx    yyyyyyyy xxxxxxxx
U+010000-U+10ffff     11110zzz …                    000zzzzz yyyyyyyy …
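The "top two bits are 10" rule translates directly into a bit mask. The sketch below assumes the input is already valid UTF-8, and the function names are illustrative only.

```python
def is_continuation(byte: int) -> bool:
    """True if the top two bits of the byte are 10."""
    return (byte & 0b1100_0000) == 0b1000_0000


def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting non-continuation bytes."""
    return sum(1 for b in data if not is_continuation(b))


# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes in UTF-8.
encoded = "aé€".encode("utf-8")
flags = [is_continuation(b) for b in encoded]

print(flags)
print(count_code_points(encoded))
```

Because every code point has exactly one lead byte, skipping continuation bytes is also how a decoder can resynchronise after landing in the middle of a sequence.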

Remove diacritics using Go

You can use the libraries described in Text normalization in Go. Here's an application of those libraries:

    // Example derived from: http://blog.golang.org/normalization
    package main

    import (
        "fmt"
        "unicode"

        "golang.org/x/text/transform"
        "golang.org/x/text/unicode/norm"
    )

    func isMn(r rune) bool {
        return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
    }

    func main() {
        t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
        result, _, _ …
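The same three-step pipeline (decompose with NFD, drop nonspacing marks, recompose with NFC) can be sketched in Python with the standard unicodedata module. This is a rough equivalent for illustration, not a translation of the Go code above, and the function name remove_diacritics is an assumption.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Strip combining marks (category Mn) after canonical decomposition."""
    # NFD splits e.g. "é" into "e" followed by U+0301 COMBINING ACUTE ACCENT.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every nonspacing mark, keeping the base letters.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever remains into canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(remove_diacritics("žůžo"))
print(remove_diacritics("résumé"))
```

Letters such as "ø" or "ł" have no canonical decomposition, so this approach leaves them unchanged.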