Unicode is a standard that assigns numeric codes (code points) to the characters used in written communication. Or, as the Unicode Consortium puts it:
"The standard for digital representation of the characters used in writing all of the world’s languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium."
There are many common, yet easily avoided, programming errors committed by developers who don’t bother to educate themselves about Unicode and its encodings.
- First, go to the source for authoritative, detailed information and implementation guidelines.
- As mentioned by others, Joel Spolsky has a good list of these errors.
- I also like Elliotte Rusty Harold’s Ten Commandments of Unicode.
- Developers should also watch out for canonical representation attacks.
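To make the last point concrete, here is a minimal Python sketch of two common defenses: decoding UTF-8 strictly, so that non-shortest ("overlong") byte sequences are rejected, and normalizing text to one canonical form before comparing or validating it. The helper names `canonical` and `safe_decode` are hypothetical, and NFC is just one reasonable choice of normalization form; treat this as an illustration, not a complete defense.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return the NFC form so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


def safe_decode(raw: bytes) -> str:
    """Strictly decode UTF-8, then canonicalize before any validation."""
    return canonical(raw.decode("utf-8", errors="strict"))


composed = "caf\u00e9"      # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                        # False
print(canonical(composed) == canonical(decomposed))  # True

try:
    safe_decode(b"\xc0\xaf")  # overlong (non-shortest) encoding of "/"
except UnicodeDecodeError:
    print("rejected overlong sequence")
```

The important design choice is the order: decode and canonicalize first, and only then run checks such as path filtering or username comparison.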
Some of the key concepts you should be aware of are:
- Glyphs: concrete graphics used to represent written characters.
- Composition: combining a sequence of characters (for example, a base letter and an accent mark) so that they form a single glyph.
- Encoding: converting Unicode code points to a stream of bytes.
- Collation: locale-sensitive comparison of Unicode strings.
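The concepts above can be seen in a few lines of Python. This is only a sketch using the standard library; the sample strings are my own, and the last step deliberately shows what happens *without* collation, since true locale-aware ordering needs something like `locale.strxfrm` or an ICU binding, both of which depend on what is installed on the system.

```python
import unicodedata

text = "e\u0301"  # "e" + COMBINING ACUTE ACCENT, usually drawn as one glyph

# Composition: NFC merges the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", text)
print(len(text), len(composed))      # 2 1

# Encoding: the same code point maps to different byte streams.
print(composed.encode("utf-8"))      # b'\xc3\xa9'
print(composed.encode("utf-16-le"))  # b'\xe9\x00'

# Collation: plain sorting compares code points, not locale rules,
# so U+00C4 ("Ä") lands after "z" instead of near "a".
words = ["zebra", "\u00c4pfel", "apple"]
print(sorted(words))                 # ['apple', 'zebra', 'Äpfel']
```

A German reader would expect "Äpfel" to sort next to "apple", which is exactly the gap that locale-sensitive collation fills.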