Thursday, 1 March 2007

Think with ASCII-ANSI, UTF-8 or Unicode?

This post aims to give programmers the "bottom line" on basic character encodings, namely ANSI, Unicode, UTF-8 and so on. It is based mainly on the article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky, plus some other resources.

As a Chinese Windows user living in Germany who has to switch between English, German and Chinese (sometimes spiced up with a little Japanese or French) every day, I was confused by encoding problems for a long time. Though the topic looks difficult at first sight, it is worth 10 or 15 minutes of your tea time, because every user who works in a language different from yours will appreciate it instead of being frustrated by ?? ??? ????s in your software's interface (which I have suffered several times, even from Hotmail...).

(1) ASCII : Enough for pure English speakers
ASCII was designed around plain English letters, digits, punctuation and control characters. It uses 7 bits, which gives 128 code points (2^7, numbered 0 to 127). Since a byte has 8 bits, the values 128 to 255 were left free, and these "empty seats" were at first filled with different characters (accented letters, drawing symbols and so on) by different vendors, which made the interchange of information between systems very difficult. Eventually this was standardized into the so-called ANSI code pages. The result is
a. Different languages can use different code pages, which are all identical below 128;
b. Different code pages assign different characters to the values from 128 upwards.
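
Points a and b can be tried out in a few lines of Python. This is only a sketch: the three code pages (cp1252, cp1251, cp437) are my own picks from the codecs bundled with Python, and the helper name decode_with_pages is made up for illustration.

```python
# Example single-byte code pages shipped with Python (chosen for illustration).
PAGES = ["cp1252", "cp1251", "cp437"]

def decode_with_pages(raw: bytes) -> dict:
    """Decode the same bytes under each example code page."""
    return {page: raw.decode(page) for page in PAGES}

# Bytes below 128 read the same under every page.
low = decode_with_pages(b"David")
print(low)

# A byte from the upper half (here 0xE4) becomes a different character per page.
high = decode_with_pages(b"\xe4")
print(high)
```

The same byte 0xE4 comes out as a Latin, a Cyrillic or a Greek letter depending on the page, which is exactly why the page has to be known in advance.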

ASCII is still widely used today, not only in computers but also in many electronic devices. It is not so difficult for Western programmers and users to pick an appropriate code page to deliver their information. However, a single-byte page is almost useless in Asia when facing something like 这家中国餐馆真贵!(This Chinese restaurant is really expensive!). Take Chinese as an example: more than six thousand characters are in frequent use, on top of about 100 thousand that are used less often.

(2) Unicode: Suitable for almost every user

Unicode was designed to create a SINGLE CHARACTER SET that can hold every writing system on earth. Every letter (including Chinese or Japanese characters) has a number assigned by the Unicode Consortium, written in the pattern U+0041 (the letter A in ASCII-ANSI), which is called a code point. U+ means "Unicode" and the digits are hexadecimal (十六进制 in Chinese). If you are curious about any letter in your own language or anywhere else, you can look up its code point on the Unicode web site.
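
If you prefer not to look things up by hand, Python's built-in ord() and the standard unicodedata module can do it for you. A small sketch; the helper name code_point is my own invention:

```python
import unicodedata

def code_point(ch: str) -> str:
    """Format a single character's code point in the U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# One Latin letter and one Chinese character as examples.
for ch in "A这":
    print(ch, code_point(ch), unicodedata.name(ch))

# chr() goes the other way: from the number back to the character.
print(chr(0x41))
```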

There is more than one way to encode Unicode, differing in how many bits each character consumes. Obviously, the more bits a character may use, the more characters the scheme can express. Widely used are UCS-2 (a fixed 2 bytes per character, so at most 65536 characters), UTF-16 (16-bit units; characters beyond the first 65536 take two units, a "surrogate pair"), UTF-32 (a fixed 32 bits per character; 32 bits could number 4294967296 values, although Unicode only defines code points up to U+10FFFF), UTF-7 and UTF-8 (which we will talk about later).
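
To get a feeling for how many bytes a character consumes in each scheme, here is a rough Python sketch. The three sample characters and the helper name encoded_sizes are my own choices; the codec names are the ones Python ships with.

```python
# Sample characters: ASCII letter, a Chinese character, and a musical
# symbol (U+1D11E) that lies beyond the first 65536 code points.
SAMPLES = ["A", "这", "\U0001D11E"]
CODECS = ["utf-8", "utf-16-be", "utf-32-be"]

def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes the character needs under each codec."""
    return {codec: len(ch.encode(codec)) for codec in CODECS}

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Note how UTF-32 is always 4 bytes, while UTF-16 jumps from 2 to 4 bytes for the last sample.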

For example, my name "David" can be expressed as the code points
U+0044 U+0061 U+0076 U+0069 U+0064
Naturally you should not see these code points when you open an email. So how are they stored in memory and then read back for display?

With the convention of two bytes per code point, the string "David" could be stored as
00 44 00 61 00 76 00 69 00 64
or
44 00 61 00 76 00 69 00 64 00
depending on whether an implementor wants to store the code points in big-endian or little-endian order (for endianness you can refer to Wikipedia, it is a rather interesting story). To tell the two apart, a "Unicode Byte Order Mark" was introduced: a special two-byte mark at the start of the string that identifies the byte order of what follows. The mark is FE FF for big-endian (high byte first) and FF FE for little-endian (low byte first). It tells a program in which order the following string is stored.
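
The two byte orders and the mark can be reproduced in Python. This is just a sketch using the standard codecs module; note that bytes.hex() with a separator needs Python 3.8 or newer.

```python
import codecs

name = "David"

# Explicit byte orders: these codecs do not write a byte order mark.
big = name.encode("utf-16-be")
little = name.encode("utf-16-le")
print(big.hex(" "))     # high byte first
print(little.hex(" "))  # low byte first

# The two possible marks, as defined in the standard library.
print(codecs.BOM_UTF16_BE.hex(" "), codecs.BOM_UTF16_LE.hex(" "))

# With a mark in front, the generic "utf-16" decoder picks the right order itself.
from_big = (codecs.BOM_UTF16_BE + big).decode("utf-16")
from_little = (codecs.BOM_UTF16_LE + little).decode("utf-16")
print(from_big, from_little)
```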

Then how about UTF-8? How does it manage to store a string of Unicode code points in memory using only 8-bit bytes? The trick is that in UTF-8 every code point from 0 to 127 is stored in a single byte, and only code points 128 and above are stored using more: 2 to 4 bytes in today's standard (the original design allowed up to 6).

The result of such "saving"? "David" will be stored as "44 61 76 69 64" (all the leading 00 bytes are gone), exactly the same as in ASCII!
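
A quick check in Python, as a sketch; the German and Chinese sample characters are my own additions to show the multi-byte case:

```python
name = "David"

# Pure ASCII text gives byte-for-byte the same result in UTF-8 and ASCII.
utf8_bytes = name.encode("utf-8")
ascii_bytes = name.encode("ascii")
print(utf8_bytes.hex(" "))  # bytes.hex with a separator needs Python 3.8+

# Characters at 128 and above need more than one byte each.
umlaut_len = len("ä".encode("utf-8"))
chinese_len = len("这".encode("utf-8"))
print(umlaut_len, chinese_len)
```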

(3) How about something like "Windows-1252" or "ISO-8859-1" and many others?

They are all traditional encodings which can only store some code points and turn all other code points into question marks (if you have ever tried to install the Chinese version of a program on a computer running German Windows, you have probably been puzzled by them). ISO 8859-1 is also called Latin 1 and can represent the basic Latin letters plus the accented letters of Western European languages. Windows-1252 is Microsoft's code page for Western European languages and is very similar to it.
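
The question marks are easy to reproduce. In this Python sketch the sample strings are my own, and errors="replace" is the standard option that substitutes "?" for anything the target encoding cannot store:

```python
german = "Grüße"
chinese = "这家中国餐馆真贵!"

# Windows-1252 has seats for the German letters...
german_bytes = german.encode("cp1252")
print(german_bytes)

# ...but none for Chinese: each character is replaced by a question mark.
lossy = chinese.encode("latin-1", errors="replace")
print(lossy)

# Without errors="replace" Python refuses instead of losing data silently.
try:
    chinese.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```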

In other words, UTF-7, UTF-8, UTF-16 and UTF-32 are all-round soldiers, while these traditional encodings specialize in only some characters.

(4) Therefore it is not hard to understand why we programmers should first ask "which encoding does it use?" when we are confronted with a string, even before we ask what the string contains. Without the encoding information we cannot even "read" the string, in the same way that we cannot hear the music before we put the CD into the player and let the player work out which format it uses.
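
What happens when the encoding is guessed wrongly can be sketched in Python. The sample word is my own; the bytes are the same in both cases, only the assumed encoding differs:

```python
original = "Grüße"
raw = original.encode("utf-8")  # what actually travels over the wire

# Reading the bytes with the right encoding gives the text back.
right = raw.decode("utf-8")

# Reading the very same bytes as Windows-1252 gives garbage instead.
wrong = raw.decode("cp1252")

print(right)
print(wrong)
```

No error is raised in the second case, which is what makes such bugs so annoying: the program happily shows nonsense.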

Take email as an example. You are expected to have a line in the header that specifies which encoding the mail uses, in the form
Content-Type: text/plain; charset="UTF-8"
or in an HTML file
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
(here we benefit from the fact that almost every encoding in common use does the same thing with characters between 32 and 127, so we don't have to specify the encoding that this meta-information itself uses...)
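
On the receiving side, a program reads that header first and only then decodes the body. A minimal sketch with Python's standard email package; the body bytes here are a made-up stand-in for what a real mail would carry:

```python
from email import message_from_string

# Header block only; a real mail would carry more header lines.
header = 'Content-Type: text/plain; charset="UTF-8"\n\n'
msg = message_from_string(header)

# The email package returns the charset parameter in lower case.
charset = msg.get_content_charset()
print(msg.get_content_type(), charset)

# Hypothetical raw body bytes, decoded with the announced charset.
body = "Grüße aus 中国".encode("utf-8")
text = body.decode(charset)
print(text)
```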

What is introduced above is only the basics of encoding systems. It is not meant to tell you (or me) everything about encodings. Still, it is at least a good habit to take care of all the potential software users or web page visitors who will probably get a headache when they see nothing but annoying "???"s.


(may be subject to further additions and revisions)