To represent text in computers, you have to solve two problems: first, you have to map symbols to numbers; then, you have to represent a sequence of those numbers as bytes.
A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode currently defines 109,384 symbols, which is way more than 2^16 (65,536).
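For instance, here is a minimal sketch (the class name is mine; codePointAt is standard java.lang.String API) showing the numbers behind two symbols:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // ASCII and Unicode agree on the first 128 code points:
        System.out.println("A".codePointAt(0));      // 65

        // The euro sign lies outside ASCII's 128 symbols:
        System.out.println("\u20AC".codePointAt(0)); // 8364 (U+20AC)
    }
}
```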
Furthermore, ASCII specifies that a sequence of numbers is represented with one byte per number, while Unicode specifies several encodings, such as UTF-8, UTF-16, and UTF-32.
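To see the difference, here is a small sketch (class name mine; the charsets used are the ones shipped with the JDK) that encodes the same symbol under three encodings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String euro = "\u20AC"; // the euro sign, code point U+20AC

        // One symbol, three different byte sequences:
        System.out.println(euro.getBytes(StandardCharsets.UTF_8).length);      // 3 bytes
        System.out.println(euro.getBytes(StandardCharsets.UTF_16BE).length);   // 2 bytes
        System.out.println(euro.getBytes(Charset.forName("UTF-32BE")).length); // 4 bytes
    }
}
```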
When you use an encoding whose units have fewer bits than are needed to represent all possible code points (such as UTF-16, whose units are 16 bits wide), you need a workaround.
Thus, surrogates are 16-bit values, used in pairs, that encode symbols that do not fit into a single two-byte value.
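As an illustration (class name mine; Character.toChars and Character.isHighSurrogate are standard java.lang API), the musical symbol 𝄞 at U+1D11E becomes a surrogate pair in UTF-16:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E does not fit into 16 bits, so UTF-16
        // encodes it as a pair of surrogate values.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                  // 2 (two 16-bit units)
        System.out.printf("%04X%n", (int) clef.charAt(0));  // D834 (high surrogate)
        System.out.printf("%04X%n", (int) clef.charAt(1));  // DD1E (low surrogate)
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
    }
}
```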
Java uses UTF-16 internally to represent text.
In particular, a char (character) is an unsigned two-byte value that holds a single UTF-16 code unit.
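One consequence, sketched below (class name mine; the methods shown are standard java.lang.String API), is that length() counts chars rather than symbols, so supplementary symbols need the code-point-based methods:

```java
public class CharVsCodePoint {
    public static void main(String[] args) {
        // "A" followed by the musical symbol U+1D11E
        String s = "A" + new String(Character.toChars(0x1D11E));

        System.out.println(s.length());                      // 3 chars
        System.out.println(s.codePointCount(0, s.length())); // 2 symbols

        // Iterate by code point instead of by char:
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // U+0041
        // U+1D11E
    }
}
```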
If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2