What is the difference between Unicode code points and Unicode scalars?

Question

First let’s look at definitions D9, D10 and D10a, Section 3.4, Characters and Encoding:

D9 Unicode codespace:
A range of integers from 0 to 10FFFF₁₆.

D10 Code point:
Any value in the Unicode codespace.

• A code point is also known as a code position.

…

D10a Code point type:
Any of the seven fundamental classes of code points in the standard:
Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

[emphasis added]

Okay, so code points are integers in a certain range. They are divided into categories called “code point types”.

Now let’s look at definition D76, Section 3.9, Unicode Encoding Forms:

D76 Unicode scalar value:
Any Unicode code point except high-surrogate and low-surrogate code points.

• As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆, inclusive.
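
To make D9, D10 and D76 concrete, here is a minimal Python sketch of the three sets as range checks. The function names are my own, chosen for illustration; they are not taken from the standard or from any library. The surrogate range D800₁₆ to DFFF₁₆ is the gap between the two scalar ranges quoted above.

```python
# Hypothetical helper names, for illustration only.
CODESPACE_MAX = 0x10FFFF  # D9: the codespace is 0..10FFFF (hex)


def is_code_point(n: int) -> bool:
    """D10: any integer in the Unicode codespace."""
    return 0 <= n <= CODESPACE_MAX


def is_surrogate(n: int) -> bool:
    """High-surrogates (D800..DBFF) and low-surrogates (DC00..DFFF)."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n: int) -> bool:
    """D76: any code point that is not a surrogate."""
    return is_code_point(n) and not is_surrogate(n)


# Size of the two scalar ranges: 0..D7FF and E000..10FFFF, inclusive.
SCALAR_COUNT = (0xD7FF - 0x0000 + 1) + (0x10FFFF - 0xE000 + 1)

if __name__ == "__main__":
    print(is_code_point(0xD800), is_scalar_value(0xD800))  # True False
    print(SCALAR_COUNT)                                    # 1112064
```

Every scalar value is a code point, but the 2,048 surrogate code points are not scalar values.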

Surrogates are defined and explained in Section 3.8, just before D76. The gist is that surrogates are divided into two categories: high-surrogates and low-surrogates. They are used only by UTF-16, so that it can represent all scalar values. (There are 1,112,064 scalar values, but a single 16-bit code unit can take only 2¹⁶ = 65,536 distinct values, which is far fewer.)
UTF-8 doesn’t have this problem; it is a variable-length encoding (each scalar value is encoded as 1 to 4 bytes), so it can encode all scalars without using surrogates.
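
As a rough illustration of the difference, the sketch below builds the UTF-16 code units for a scalar value by hand and compares UTF-8 byte lengths using Python's built-in codecs. The helper name `utf16_code_units` is hypothetical; the arithmetic is the usual surrogate-pair mapping (subtract 10000₁₆, then split the remaining 20 bits into two 10-bit halves).

```python
def utf16_code_units(scalar: int) -> list:
    """Return the UTF-16 code units (as integers) for one scalar value."""
    if not (0 <= scalar <= 0x10FFFF) or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if scalar < 0x10000:
        # Fits in a single 16-bit code unit; no surrogates needed.
        return [scalar]
    offset = scalar - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high-surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low-surrogate
    return [high, low]


def utf8_length(scalar: int) -> int:
    """Number of bytes Python's UTF-8 codec uses for one scalar value."""
    return len(chr(scalar).encode("utf-8"))


if __name__ == "__main__":
    # U+1F600 lies above FFFF, so UTF-16 needs a surrogate pair.
    print([hex(u) for u in utf16_code_units(0x1F600)])  # ['0xd83d', '0xde00']
    # UTF-8 uses 1 to 4 bytes and never uses surrogates.
    print([utf8_length(s) for s in (0x41, 0xE9, 0x20AC, 0x1F600)])  # [1, 2, 3, 4]
```

Python's own codec agrees with this definition: `chr(0xD800).encode("utf-8")` raises `UnicodeEncodeError`, because a lone surrogate is a code point but not a scalar value.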

Summary: a code point is either a scalar or a surrogate. A code point is merely a number in the most abstract sense; how that number is encoded into binary form is a separate issue. UTF-16 uses surrogate pairs because it can’t directly represent all possible scalars. UTF-8 doesn’t use surrogate pairs.

In the future, you might find consulting the Unicode glossary helpful. It contains many of the frequently used definitions, as well as links to the definitions in the Unicode specification.
