Chinese character encoding

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Artoria2e5 (talk | contribs) at 04:11, 4 December 2016. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computing, Chinese character encodings can be used to represent text written in the CJK languages — Chinese, Japanese, Korean — and (rarely) obsolete Vietnamese, all of which use Chinese characters. Several general-purpose character encodings accommodate Chinese characters, and some of them were developed specifically for Chinese.

In addition to Unicode (with its set of CJK Unified Ideographs), local encoding systems exist. The two primary "legacy" local systems are the Chinese Guobiao ("national standard") system, used in Mainland China and Singapore, and the (mainly) Taiwanese Big5 system, used in Taiwan, Hong Kong and Macau. Guobiao text is usually displayed using simplified characters and Big5 text using traditional characters. There is, however, no mandated connection between the encoding system and the font used to display the characters; font and encoding are usually tied together for practical reasons.

The issue of which encoding to use can also have political implications, as GB is the official standard of the People's Republic of China and Big5 is a de facto standard of Taiwan.

In contrast to the situation with Japanese, there has been relatively little overt opposition to Unicode, which solves many of the issues involved with GB and Big5. Unicode is widely regarded as politically neutral, has good support for both simplified and traditional characters, and can be easily converted to and from GB and Big5. Furthermore, Unicode is not limited to Chinese, since it also encodes many other scripts.
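As a rough illustration of conversion through Unicode, the sketch below re-encodes GBK bytes as Big5 by decoding to a Unicode string first. The helper name `transcode` is hypothetical, and the codec names are those of Python's standard library; the sample text uses characters that exist in both encodings, so nothing is lost.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes by decoding to Unicode text, then encoding again."""
    text = data.decode(source)   # legacy bytes -> Unicode string
    return text.encode(target)   # Unicode string -> other legacy bytes


# "中文" is present in both GBK and Big5, so the round trip is lossless.
gbk_bytes = "中文".encode("gbk")
big5_bytes = transcode(gbk_bytes, "gbk", "big5")

print(gbk_bytes.hex(), big5_bytes.hex())
```

Text that contains characters missing from the target encoding would raise `UnicodeEncodeError` here rather than convert silently.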

Guobiao

The Guobiao (GB) line of character encodings starts with the Simplified Chinese charset GB 2312, published in 1980. Two encoding schemes existed for GB 2312: the commonly used 8-bit EUC-CN encoding, which takes one or two bytes per character, and a 7-bit encoding called HZ[1], intended for Usenet posts.[2]: 94  A traditional-character variant called GB/T 12345 was published in 1990.
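The difference between the two schemes can be seen with a small sketch. It assumes Python's standard `gb2312` codec (which implements EUC-CN) and its `hz` codec; the variable names are illustrative only.

```python
text = "中文"

# EUC-CN: each GB 2312 character becomes two bytes with the high bit set.
euc_cn = text.encode("gb2312")

# HZ: the same character codes with the high bit cleared, wrapped in
# escape sequences so that the whole stream stays 7-bit clean.
hz = text.encode("hz")

# Clearing the high bit of the EUC-CN bytes yields the HZ payload.
seven_bit = bytes(b & 0x7F for b in euc_cn)

print(euc_cn.hex(), hz, seven_bit)
```

ASCII characters take a single byte in EUC-CN, so mixed text such as `"A中"` occupies three bytes.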

In 1993, the EUC-CN form was extended into GBK to include all CJK ideographs of Unicode 1.1, abandoning the ISO 2022 model. As a result, GBK includes Traditional Chinese characters in addition to the simplified ones in GB 2312.[3] GBK gained popularity through the widespread Code page 936 implementation in Microsoft Windows 95.
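A quick way to see this coverage difference is to try encoding a simplified character and its traditional counterpart. This is a minimal sketch using Python's standard `gb2312` and `gbk` codecs; `can_encode` is a hypothetical helper, not part of any library.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified, traditional = "国", "國"

for codec in ("gb2312", "gbk"):
    # Report whether each form of the character is representable.
    print(codec, can_encode(simplified, codec), can_encode(traditional, codec))
```

Because GBK extends EUC-CN, characters already in GB 2312 keep the same byte values under both codecs.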

In 2000, GB 18030 was published as GBK's successor. The new encoding adds a four-byte form, in effect a Unicode transformation format (UTF), that covers every Unicode code point not already encoded in one or two bytes.[4] A 2005 revision of GB 18030 added reference glyphs for scripts used by ethnic minorities in China, as well as glyphs from CJK Unified Ideographs Extension B, following updates to Unicode.
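The variable width of GB 18030 can be sketched as follows, assuming Python's standard `gb18030` codec. The sample characters are chosen for illustration: an ASCII letter, a common ideograph, a Tibetan letter (U+0F40) and the first Extension B ideograph (U+20000); `gb18030_width` is a hypothetical helper.

```python
def gb18030_width(ch: str) -> int:
    """Number of bytes GB 18030 uses for a single character."""
    return len(ch.encode("gb18030"))


samples = {
    "ASCII letter": "A",
    "common ideograph": "中",
    "Tibetan letter": "\u0f40",
    "Extension B ideograph": "\U00020000",
}

for label, ch in samples.items():
    # One byte for ASCII, two for the GBK repertoire, four otherwise.
    print(label, gb18030_width(ch))
```

The last two samples cannot be encoded in GBK at all, which is the gap the four-byte form is meant to close.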

Adobe-GB1 is the corresponding PostScript charset for GB encodings.

Big5

The Big5 family of character encodings starts with the initial definition by the consortium of five companies in Taiwan that developed it.[5] It is a double-byte character set (DBCS) somewhat similar to Shift JIS, and is often combined with a single-byte character set (SBCS) such as ASCII. Quite a few vendor and official extensions exist, of which ETEN, HKSCS (Hong Kong) and Big5-2003 (as a part of CNS 11643 by Taiwan) are the best known.[6] Adobe-CNS1 is the PostScript charset corresponding to the Big5 family of encodings.
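The mix of single-byte ASCII and double-byte Big5 can be sketched as a simple scanner. This assumes Python's standard `big5` codec and treats any byte of 0x81 or above as a lead byte; `split_big5` is a hypothetical helper and does no validation of malformed input. As in Shift JIS, the second byte of a Big5 character may fall in the ASCII range, so a stream has to be scanned from the start rather than byte by byte.

```python
def split_big5(data: bytes) -> list:
    """Split Big5 + ASCII bytes into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        # A byte >= 0x81 starts a two-byte character; anything else is ASCII.
        width = 2 if data[i] >= 0x81 else 1
        units.append(data[i:i + width])
        i += width
    return units


encoded = "A許B".encode("big5")
units = split_big5(encoded)

# The trail byte of "許" is 0x5C, the ASCII backslash.
print(units)
```

Code that searches Big5 bytes for ASCII characters such as the backslash without tracking lead bytes can therefore match the second half of a Chinese character by mistake.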

Conversion

Before GBK, which includes both traditional and simplified characters, conversion between the mainland Chinese and Taiwanese charsets was complicated by the need to transcribe text between the two variants of Chinese, because each charset covers many of the other's characters only in its own variant. Conversion between traditional and simplified Chinese is itself problematic, because simplification merged some groups of two or more distinct traditional characters into a single simplified form. Traditional-to-simplified (many-to-one) conversion is therefore technically simple. The opposite conversion is harder, because converting to GB 2312 discards information: mapping a simplified character back to one of several traditional candidates (one-to-many) will inevitably pick the wrong character in some contexts. Simplified-to-traditional conversion therefore often needs usage context or common-phrase lists to resolve ambiguities. This is less of a problem with newer standards such as GBK, GB 18030 and Unicode, which have separate code points for simplified and traditional characters.
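The asymmetry between the two directions can be sketched with toy tables. Everything below is an illustrative assumption: the three dictionaries hold only a handful of entries, and a real converter would need far larger mapping and phrase lists. The example uses 发, which stands for both 發 (as in 發展, "develop") and 髮 (as in 頭髮, "hair").

```python
# Many-to-one: each traditional character has exactly one simplified form.
T2S = {"發": "发", "髮": "发", "頭": "头"}

# One-to-many: a simplified character may have several candidates.
S2T_CANDIDATES = {"发": ["發", "髮"], "头": ["頭"]}

# Toy phrase list used to pick the right candidate from context.
S2T_PHRASES = {"头发": "頭髮", "发展": "發展"}


def to_simplified(text: str) -> str:
    """Character-by-character lookup is enough in this direction."""
    return "".join(T2S.get(ch, ch) for ch in text)


def to_traditional(text: str) -> str:
    """Prefer the longest known phrase, else fall back to the first candidate."""
    phrases = sorted(S2T_PHRASES, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        match = next((p for p in phrases if text.startswith(p, i)), None)
        if match is not None:
            out.append(S2T_PHRASES[match])
            i += len(match)
        else:
            ch = text[i]
            out.append(S2T_CANDIDATES.get(ch, [ch])[0])  # may be wrong
            i += 1
    return "".join(out)


print(to_simplified("頭髮"), to_traditional("头发"), to_traditional("发展"))
```

Outside the phrase list, the fallback simply takes the first candidate, which is exactly the kind of guess that produces wrong characters in some usages.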

Another issue is that many encoding systems lack some characters. Although the missing characters are often literary and rare in ordinary text, this becomes a problem because personal names often contain them. Examples include the Taiwanese politician Wang Jian-Hsuan, the second character of whose given name is missing from some character sets, and PRC Premier Zhu Rongji, whose Rong character is not in GB 2312. The newest GB standard, GB 18030, has the complete character repertoire of Unicode 4.0, including the Unihan extensions in the Supplementary Ideographic Plane.[2]: 105 
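Coverage gaps of this kind can be probed by attempting to encode a character under several codecs. The sketch below assumes Python's standard codec names and uses the Rong character from Zhu Rongji's name (镕, U+9555); `codecs_covering` is a hypothetical helper.

```python
def codecs_covering(ch: str, codecs=("gb2312", "gbk", "gb18030", "utf-8")) -> list:
    """Return the subset of codecs that can represent the character."""
    covered = []
    for name in codecs:
        try:
            ch.encode(name)
        except UnicodeEncodeError:
            continue  # character is missing from this encoding
        covered.append(name)
    return covered


rong = "\u9555"  # 镕, as in Zhu Rongji
print(codecs_covering(rong))
```

A check like this could be run over a list of names before choosing a legacy encoding for a data set.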

References

  1. ^ RFC 1843
  2. ^ a b Lunde, Ken (December 2008). CJKV Information Processing. O'Reilly Media, Inc. ISBN 978-0-596-51447-1. Retrieved 11 September 2016.
  3. ^ "GB18030-2000 - The New Chinese National Standard - GB 18030". 2012-08-25. Archived from the original on 2012-08-25. Retrieved 2016-10-13.
  4. ^ Authoritative mapping table between GB18030-2000 and Unicode. ICU – International Components for Unicode. 2001-02-21. Accessed 2016-10-13.
  5. ^ "[chinese mac] Character Sets". chinesemac.org. Retrieved 2016-10-13.
  6. ^ "Big5 Variants in Mozilla: Mozilla 系列與 Big5 中文字碼". moztw.org. Retrieved 2016-10-13.
