liuxuhelloworld's notebook

Before Unicode, there were many different character encoding schemes. This caused two problems. First, a particular code value corresponds to different letters in the different encoding schemes. Second, the encodings for languages with large character sets have variable length: some common characters are encoded as single byte, others require two or more bytes. Unicode was invented to overcome the limitations of traditional character encoding schemes.

A code point is a code value that is associated with a character in an encoding scheme. In the Unicode standard, code points are written in hexadecimal and prefixed with U+, such as U+0041 for the code point of the Latin letter A.

Unicode has code points that are grouped into 17 code planes. The first code plane, called the basic multilingual plane, consists of the Unicode characters with code points U+0000 to U+FFFF. Sixteen additional planes, with code points U+10000 to U+10FFFF, hold the supplementary characters.

The UTF-16 encoding represents all Unicode code points in a variable-length code. The characters in the basic multilingual plane are represented as 16-bit values, called code units. The supplementary characters are encoded as consecutive pairs of code units. Each of the values in such an encoding pair falls into a range of 2048 unused values of the basic multilingual plane, called the surrogate area (U+D800 to U+DBFF for the first code unit, U+DC00 to U+DFFF for the second code unit).