unicode – Tarik Billa

Pandas – Writing an excel file containing unicode – IllegalCharacterError

April 12, 2024 by Tarik

The same problem happened to me. I solved it as follows. First, install the Python package xlsxwriter: pip install xlsxwriter. Second, replace the default engine 'openpyxl' with 'xlsxwriter': df.to_excel("test.xlsx", engine="xlsxwriter")

How can I properly display German characters in HTML?

April 11, 2024 by Tarik

It seems you need some basic explanations about something that unfortunately even most programmers don’t understand properly. Files like your HTML page are saved and transmitted over the Internet as a sequence of bytes, but you want them displayed as characters. In order to translate bytes into characters, you need a set of rules called … Read more
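To make the bytes-versus-characters point concrete, here is a minimal Python sketch. The sample word and the two encodings are illustrative assumptions, not taken from the original answer. It shows that the same bytes produce different text depending on the rules used to decode them.

```python
# The German word "Grüße" stored as UTF-8 bytes (sample text is an assumption).
raw = "Grüße".encode("utf-8")

# Decoding with the matching encoding recovers the original characters.
as_utf8 = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 treats every byte as one character,
# so each two-byte UTF-8 sequence turns into two wrong characters.
as_latin1 = raw.decode("iso-8859-1")

print(len(raw))        # number of bytes, larger than the number of characters
print(as_utf8)
print(repr(as_latin1))
```

This is why a page has to declare its encoding: the browser cannot tell from the bytes alone which set of rules was intended.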

Why does Java char use UTF-16?

April 10, 2024 by Tarik

Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical: Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that … Read more
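The practical consequence of the move from UCS-2 to UTF-16 can be sketched in Python, used here only for illustration (the helper name utf16_units is hypothetical). Characters outside the Basic Multilingual Plane need two 16-bit code units, so a single 16-bit value can no longer hold every character.

```python
def utf16_units(text):
    """Return how many 16-bit code units UTF-16 needs to store text."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u00e9"          # é, inside the Basic Multilingual Plane
astral_char = "\U0001F600"   # an emoji, outside the BMP

print(utf16_units(bmp_char))     # 1: fits in a single 16-bit unit
print(utf16_units(astral_char))  # 2: stored as a surrogate pair
```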

Slice a string containing Unicode chars

April 9, 2024 by Tarik

Possible solutions to codepoint slicing: I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way? If you know the exact byte indices, you can slice a string: let text = "Hello привет"; println!("{}", &text[2..10]); This prints "llo пр". So the problem is to … Read more
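The difference between byte indices and code point indices can be sketched in Python, shown here as an illustration rather than as the Rust solution from the answer. Slicing the UTF-8 bytes with the range 2..10 reproduces the output quoted above, while the same range applied to code points selects more text.

```python
text = "Hello привет"

# Byte-based view, comparable to slicing a Rust &str by byte indices.
# Each Cyrillic letter occupies two bytes in UTF-8.
by_bytes = text.encode("utf-8")[2:10].decode("utf-8")

# Python str indices count code points, so the same range covers more letters.
by_code_points = text[2:10]

print(by_bytes)        # llo пр
print(by_code_points)  # llo прив
```

Note that a byte range that cuts a multi-byte character in half would fail to decode, which is the same hazard the Rust slice guards against by panicking.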

Is there any reason to prefer UTF-16 over UTF-8?

April 9, 2024 by Tarik

East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East Asian language characters) than in UTF-8 (typically 3 bytes are required). Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there’s a lot of markup) it’s much of a … Read more
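The size trade-off is easy to measure. The following Python sketch uses two sample strings chosen purely for illustration (they are assumptions, not data from the answer) and compares their encoded lengths.

```python
def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

japanese = "こんにちは世界"   # 7 characters
english = "Hello, world"     # 12 characters

print(sizes(japanese))  # (21, 14): 3 bytes vs 2 bytes per character
print(sizes(english))   # (12, 24): 1 byte vs 2 bytes per character
```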

Python and BeautifulSoup encoding issues [duplicate]

April 9, 2024 by Tarik

In your case, the page contains invalid UTF-8 data, which confuses BeautifulSoup and makes it think that the page uses windows-1252. You can use this trick: soup = BeautifulSoup.BeautifulSoup(content.decode('utf-8', 'ignore')). This discards any invalid symbols from the page source, so BeautifulSoup guesses the encoding correctly. You can replace 'ignore' by 'replace' … Read more
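The effect of the two error handlers can be seen without BeautifulSoup. This is a minimal sketch using only the standard library; the byte string below is a made-up stand-in for a page with one invalid byte.

```python
# Hypothetical page content: valid UTF-8 text with one stray invalid byte (0xff).
content = "Caf\u00e9 ".encode("utf-8") + b"\xff" + "menu".encode("utf-8")

# 'ignore' silently drops bytes that cannot be decoded.
ignored = content.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD (the replacement character) for them.
replaced = content.decode("utf-8", "replace")

print(ignored)         # Café menu
print(repr(replaced))  # the bad byte shows up as '\ufffd'
```

Using 'replace' keeps a visible marker where data was lost, which can make broken input easier to notice later.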

Remove zero width space unicode character from Python string

April 8, 2024 by Tarik

You can encode it into ASCII and ignore errors: u'\u200cHealth & Fitness'.encode('ascii', 'ignore') Output: 'Health & Fitness'
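One caveat with the ASCII approach is that it also removes every other non-ASCII character, and in Python 3 it returns bytes rather than str. The sketch below shows both that approach and an alternative that is not from the original answer: deleting only a chosen set of zero-width characters with str.translate. The set of characters and the helper name strip_zero_width are assumptions for illustration.

```python
# Zero-width characters to delete; this particular set is an assumption.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))

def strip_zero_width(text):
    """Delete only the characters listed in ZERO_WIDTH, keep everything else."""
    return text.translate(ZERO_WIDTH)

# The approach from the answer, decoded back to str for Python 3.
ascii_only = "\u200cHealth & Fitness".encode("ascii", "ignore").decode("ascii")

# The alternative keeps legitimate non-ASCII letters such as é.
kept = strip_zero_width("\u200bCaf\u00e9 & Fitness")

print(ascii_only)  # Health & Fitness
print(kept)        # Café & Fitness
```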

UTF-8 Continuation bytes

April 3, 2024 by Tarik

A continuation byte in UTF-8 is any byte where the top two bits are 10. They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding                    Binary value
-------------------  --------------------------  -----------------
U+000000-U+00007f    0xxxxxxx                    0xxxxxxx
U+000080-U+0007ff    110yyyxx 10xxxxxx           00000yyy xxxxxxxx
U+000800-U+00ffff    1110yyyy 10yyyyxx 10xxxxxx  yyyyyyyy xxxxxxxx
U+010000-U+10ffff    11110zzz …                  000zzzzz yyyyyyyy … Read more
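The "top two bits are 10" rule translates directly into a bit mask. Here is a minimal Python sketch; the function name is_continuation and the sample string are assumptions chosen for illustration.

```python
def is_continuation(byte):
    """True if the top two bits of the byte are 10, i.e. 0b10xxxxxx."""
    return (byte & 0b1100_0000) == 0b1000_0000

# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes in UTF-8.
encoded = "a\u00e9\u20ac".encode("utf-8")
flags = [is_continuation(b) for b in encoded]

# Every byte that is not a continuation byte starts a new code point.
code_points = sum(1 for b in encoded if not is_continuation(b))

print(flags)        # [False, False, True, False, True, True]
print(code_points)  # 3
```

Counting the non-continuation bytes is a common way to get the number of code points in valid UTF-8 without decoding it.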

Remove diacritics using Go

March 11, 2024 by Tarik

You can use the libraries described in Text normalization in Go. Here’s an application of those libraries: // Example derived from: http://blog.golang.org/normalization package main import ( "fmt" "unicode" "golang.org/x/text/transform" "golang.org/x/text/unicode/norm" ) func isMn(r rune) bool { return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks } func main() { t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC) result, _, _ … Read more
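The same three-step idea (decompose with NFD, drop nonspacing marks, recompose with NFC) can be sketched in Python with the standard unicodedata module. This is an illustration of the technique in another language, not a completion of the truncated Go code above; the function name remove_diacritics is hypothetical.

```python
import unicodedata

def remove_diacritics(text):
    """Strip combining marks (category Mn) after canonical decomposition."""
    # NFD splits characters like "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the nonspacing marks, keeping the base letters.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # NFC recomposes anything that can still be composed.
    return unicodedata.normalize("NFC", stripped)

print(remove_diacritics("žůžo résumé"))  # zuzo resume
```

Letters that are not a base letter plus a combining mark, such as "ø" or "ß", have no canonical decomposition and pass through unchanged.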

Write UTF-8 files from R

January 9, 2024 by Tarik