Why does Java char use UTF-16?

Java used UCS-2 before transitioning to UTF-16 in 2004/2005. The reason for the original choice of UCS-2 is mainly historical: Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that … Read more
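The practical consequence is easy to see in code. Below is a minimal, self-contained sketch (my own illustration, not part of the linked answer; the class name is hypothetical) showing that a BMP character fits in one char, while a supplementary code point such as U+1F600 is stored in a String as a surrogate pair of two chars:

```java
public class CharWidthDemo {
    public static void main(String[] args) {
        // A BMP character fits in a single 16-bit char.
        char a = 'A'; // U+0041
        System.out.println((int) a); // 65

        // A supplementary character such as U+1F600 (GRINNING FACE)
        // does not fit in one char; inside a String it is stored as
        // a surrogate pair of two UTF-16 code units.
        String emoji = "\uD83D\uDE00"; // U+1F600 written as its surrogate pair
        System.out.println(emoji.length());                             // 2 code units
        System.out.println(emoji.codePointCount(0, emoji.length()));    // 1 code point
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
    }
}
```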

UTF8 vs. UTF16 vs. char* vs. what? Someone explain this mess to me!

Check out Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). EDIT 20140523: Also watch Characters, Symbols and the Unicode Miracle by Tom Scott on YouTube: it’s just under ten minutes, and a wonderful explanation of the brilliant ‘hack’ that is UTF-8.
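If it helps to see the size difference concretely, here is a small sketch in Java (my own illustration, not from either resource; the class name is hypothetical) that encodes the same strings both ways; UTF-8 spends 1–4 bytes per code point, UTF-16 spends 2 or 4:

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        // The same text costs a different number of bytes per encoding.
        String[] samples = { "A", "é", "€", "\uD83D\uDE00" }; // U+0041, U+00E9, U+20AC, U+1F600
        for (String s : samples) {
            int utf8  = s.getBytes(StandardCharsets.UTF_8).length;
            int utf16 = s.getBytes(StandardCharsets.UTF_16BE).length; // BE variant, so no BOM is prepended
            System.out.printf("%s -> UTF-8: %d bytes, UTF-16: %d bytes%n", s, utf8, utf16);
        }
    }
}
```

ASCII stays at one byte in UTF-8 (part of the ‘hack’: every ASCII file is already valid UTF-8), while UTF-16 pays two bytes for it.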

JavaScript strings outside of the BMP

Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can. But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, … Read more
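Java strings use the same UTF-16 code-unit model, so the point can be demonstrated there (a sketch of my own, not from the answer; the JS equivalents of charAt and length are str.charCodeAt and str.length):

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        // U+1D54A (MATHEMATICAL DOUBLE-STRUCK CAPITAL S) lies outside the BMP.
        String s = "\uD835\uDD4A";

        // Indexed operations see two separate surrogate code units,
        // just as JavaScript's str.length and str.charCodeAt would.
        System.out.println(s.length());        // 2
        System.out.println((int) s.charAt(0)); // 55349 (0xD835, high surrogate)

        // Only the codePoint-aware methods treat it as one character.
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d54a
        System.out.println(s.codePointCount(0, s.length()));       // 1
    }
}
```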

Why does the Java char primitive take up 2 bytes of memory?

When Java was originally designed, it was anticipated that any Unicode character would fit in 2 bytes (16 bits), so char and Character were designed accordingly. In fact, a Unicode character can now require up to 4 bytes. Thus, UTF-16, the internal Java encoding, requires that supplementary characters use 2 code units. Characters in the Basic … Read more
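A short sketch of that accounting (mine, not the answer’s; the class name is illustrative), using Character.charCount to ask how many code units a code point needs:

```java
public class CharCountDemo {
    public static void main(String[] args) {
        int bmp = 0x0041;            // 'A', in the Basic Multilingual Plane
        int supplementary = 0x1F600; // GRINNING FACE, outside the BMP

        // charCount reports the number of chars (UTF-16 code units)
        // needed: 1 for BMP code points, 2 for supplementary ones.
        System.out.println(Character.charCount(bmp));           // 1
        System.out.println(Character.charCount(supplementary)); // 2

        // Building a String from the raw code point yields the surrogate pair.
        String s = new StringBuilder().appendCodePoint(supplementary).toString();
        System.out.println(s.length()); // 2 chars for 1 character
    }
}
```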
