To represent text in computers, you have to solve two problems: first, you have to map symbols to numbers; then, you have to represent a sequence of those numbers as bytes.
A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109,384 symbols, which is way more than 2^16 (65,536).
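For instance, here is a minimal sketch (the class name is mine; codePointAt is standard java.lang.String API) showing the numbers behind two symbols:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // ASCII and Unicode agree on the first 128 code points:
        System.out.println("A".codePointAt(0));      // 65

        // The euro sign lies outside ASCII's 128 symbols:
        System.out.println("\u20AC".codePointAt(0)); // 8364 (U+20AC)
    }
}
```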
Furthermore, ASCII specifies that a sequence of numbers is represented with one byte per number, while Unicode specifies several encodings, such as UTF-8, UTF-16, and UTF-32.
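To see the difference, here is a small sketch (class name mine; the charsets used are the ones shipped with the JDK) that encodes the same symbol under three encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String euro = "\u20AC"; // the euro sign, code point U+20AC

        // One symbol, three different byte sequences:
        System.out.println(euro.getBytes(StandardCharsets.UTF_8).length);      // 3 bytes
        System.out.println(euro.getBytes(StandardCharsets.UTF_16BE).length);   // 2 bytes
        System.out.println(euro.getBytes(Charset.forName("UTF-32BE")).length); // 4 bytes
    }
}
```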
When you use an encoding whose units have fewer bits than are needed to represent all possible code points (such as UTF-16, whose units are 16 bits wide), you need a workaround.
Thus, surrogates are 16-bit values, used in pairs, that encode symbols that do not fit into a single two-byte value.
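As an illustration (class name mine; Character.toChars and Character.isHighSurrogate are standard java.lang API), the musical symbol 𝄞 at U+1D11E becomes a surrogate pair in UTF-16:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E does not fit into 16 bits, so UTF-16
        // encodes it as a pair of surrogate values.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                  // 2 (two 16-bit units)
        System.out.printf("%04X%n", (int) clef.charAt(0));  // D834 (high surrogate)
        System.out.printf("%04X%n", (int) clef.charAt(1));  // DD1E (low surrogate)
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
    }
}
```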
Java uses UTF-16 internally to represent text.
In particular, a char (character) is an unsigned two-byte value that holds a single UTF-16 code unit.
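One consequence, sketched below (class name mine; the methods shown are standard java.lang.String API), is that length() counts chars rather than symbols, so supplementary symbols need the code-point-based methods:

```java
public class CharVsCodePoint {
    public static void main(String[] args) {
        // "A" followed by the musical symbol U+1D11E
        String s = "A" + new String(Character.toChars(0x1D11E));

        System.out.println(s.length());                      // 3 chars
        System.out.println(s.codePointCount(0, s.length())); // 2 symbols

        // Iterate by code point instead of by char:
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // U+0041
        // U+1D11E
    }
}
```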
If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2