Encoding of Chinese Characters

In computer text applications, the GB encoding scheme is most often used for simplified Chinese characters, while Big5 is most often used for traditional characters. Although neither encoding has an explicit connection with a specific character set, the lack of a one-to-one mapping between the simplified and traditional sets established a de facto linkage.

Because simplified Chinese merged many traditional characters into one, and because the initial version of the GB encoding scheme, known as GB2312-80, contains only one code point for each simplified character, GB2312 cannot represent the larger set of traditional characters. It is theoretically possible to use Big5 codes to map to the smaller set of simplified character glyphs, although there is little market for such a product. Newer and alternative forms of GB do support traditional characters. In particular, mainland authorities have established GB 18030 as the official encoding standard for use in all mainland software publications. GB 18030 contains all East Asian characters included in Unicode 3.0. It therefore covers the simplified and traditional characters found in GB and Big5, as well as all characters found in Japanese and Korean encodings.
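
The coverage difference can be checked with a small sketch using Python's built-in codecs. The helper name can_encode and the sample characters 汉 (simplified) and 漢 (traditional) are chosen here for illustration only; a single character pair does not prove anything about the full character sets.

```python
def can_encode(text, codec):
    """Return True if every character in text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

simplified = "汉"   # simplified form
traditional = "漢"  # traditional form of the same character

# Print which of the two sample characters each encoding can represent.
for codec in ("gb2312", "big5", "gb18030"):
    print(codec,
          "simplified:", can_encode(simplified, codec),
          "traditional:", can_encode(traditional, codec))
```

With these two samples, gb2312 accepts only the simplified form, big5 accepts only the traditional form, and gb18030 accepts both.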

Unicode deals with the issue of simplified and traditional characters as part of the project of Han unification by including code points for each. This was rendered necessary by the fact that the linkage between simplified characters and traditional characters is not one-to-one. While this means that a Unicode system can display both simplified and traditional characters, it also means that different localization files are needed for each type.
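
As a rough illustration of why the linkage is not one-to-one, the sketch below prints the Unicode code points of one simplified character that corresponds to two different traditional characters. The table SIMPLIFIED_TO_TRADITIONAL is a tiny hand-written example made up for this post, not a complete or authoritative mapping.

```python
# Hand-written illustrative table, not a complete mapping.
SIMPLIFIED_TO_TRADITIONAL = {
    "发": ["發", "髮"],  # "to emit" and "hair" share one simplified form
    "汉": ["漢"],        # a simple one-to-one case
}

def code_points(text):
    """Format each character of text as a U+XXXX code point string."""
    return [f"U+{ord(ch):04X}" for ch in text]

for simp, trads in SIMPLIFIED_TO_TRADITIONAL.items():
    print(simp, code_points(simp), "->", trads, code_points("".join(trads)))
```

Each character has its own code point, so converting simplified text to traditional text needs context or a dictionary rather than a plain character-by-character table.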

The Chinese characters used in modern Japanese have also undergone simplification, but generally to a lesser extent than simplified Chinese. The Japanese language reforms also reduced the number of Chinese characters in daily use, so a number of complex characters came to be written phonetically instead. Reconciling these different character sets in Unicode became part of the controversial process of Han unification. Not surprisingly, some of the Chinese characters used in Japan are neither 'traditional' nor 'simplified', and these characters cannot be found in traditional or simplified Chinese dictionaries.

In conclusion, GB 18030 is the better choice when text may contain both Simplified Chinese and Traditional Chinese. In Google's language settings, zh-Hans/lang_zh_Hans is for Simplified Chinese and zh-Hant/lang_zh_Hant is for Traditional Chinese.
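
A minimal sketch of moving existing data to GB 18030, assuming the source encoding of each input is already known. The function name transcode and the sample strings are made up for this example.

```python
def transcode(data: bytes, source: str, target: str = "gb18030") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)

# Sample inputs in the two legacy encodings.
big5_bytes = "漢字".encode("big5")      # traditional text stored as Big5
gb2312_bytes = "汉字".encode("gb2312")  # simplified text stored as GB2312

# Both can be carried in a single GB 18030 byte stream.
as_gb18030 = transcode(big5_bytes, "big5")
print(as_gb18030.decode("gb18030"))

# GB 18030 is backward compatible with GB2312, so these bytes do not change.
print(transcode(gb2312_bytes, "gb2312") == gb2312_bytes)
```

The design choice is to go through Python's str type (Unicode) as the intermediate form, which avoids writing any direct Big5 to GB table by hand.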

http://en.wikipedia.org/wiki/Simplified_Chinese_characters
http://www.google.com/coop/docs/cse/resultsxml.html#chineseSearch
