CJK Unified Ideographs

From Wikipedia, the free encyclopedia
CJK ideograph 次 in Simplified and Traditional Chinese, Japanese Kanji and Korean Hanja.

The Chinese, Japanese and Korean (CJK) scripts share a common background. In a process called Han unification, the characters shared among them were identified and named "CJK Unified Ideographs". As of version 6.1, Unicode defines a total of 74,617 CJK Unified Ideographs.[1]

The terms ideographs or ideograms may be misleading, since the Chinese script is not strictly a picture writing system.

Historically, the Vietnamese writing system Chữ Nôm uses Chinese ideographs too, so sometimes the abbreviation "CJKV" is used. Since the 16th century, an extended Latin alphabet has been used in Vietnam.

CJK Unified Ideographs blocks

CJK Unified Ideographs

The basic block named CJK Unified Ideographs (4E00-9FFF) contains 20,941 basic Chinese characters: not only those used in the Chinese writing system, but also the Kanji used in the Japanese writing system and the Hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others appear in only one or two of the three. Chinese characters were also used in the Vietnamese Chữ Nôm script (now obsolete). The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system, the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical sequence.

The block is the result of Han unification[2], which was somewhat controversial in the Far East.[3] Since Chinese, Japanese and Korean characters were coded in the same location, the appearance of a selected glyph could depend on the particular font being used. However, the source separation rule states that characters encoded separately in an earlier character set would remain separate in the new Unicode encoding.[4]

Using variation selectors[5] it is possible to specify certain variant CJK ideograms within Unicode. The Adobe-Japan1 character set proposal, which actually calls for 14,658 ideographic variation sequences,[5] is an extreme example of the use of variation selectors.[6]
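As a rough illustration of how such a variation sequence is formed, the sketch below appends a variation selector to a base ideograph. It assumes the supplementary variation selectors U+E0100 to U+E01EF, which are the ones used for ideographic variation sequences. The helper name `make_ivs` is hypothetical, and the code does not check whether the resulting sequence is actually registered in any collection.

```python
# Sketch: build an ideographic variation sequence (base character + selector).
# Assumption: selectors come from the range U+E0100..U+E01EF (VS17..VS256).
VS17 = 0xE0100
VS256 = 0xE01EF


def make_ivs(base: str, index: int = 0) -> str:
    """Return `base` followed by the index-th supplementary variation selector."""
    if len(base) != 1:
        raise ValueError("base must be a single character")
    selector = VS17 + index
    if not VS17 <= selector <= VS256:
        raise ValueError("selector index out of range")
    return base + chr(selector)


sequence = make_ivs("次")
# Show the code points that make up the sequence.
print([f"U+{ord(ch):04X}" for ch in sequence])
```

Whether a font renders a different glyph for the sequence depends on the font and on the sequence being one that the font supports.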

Charts

4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.

Sources

The code points in this block are assigned under the source separation rule.

China
Code  Standard                                Character count  Note
G0    GB 2312-80                              6763
G1    GB 12345-90                             2352
G3    GB 7589-87 unsimplified form            7237
G5    GB 7590-87 unsimplified form            7039
G7    Modern Chinese general character chart  642
G8    GB 8565-89                              290

Taiwan
Code  Standard                 Character count  Note
T1    CNS 11643-1986 plane 1   5401+9
T2    CNS 11643-1986 plane 2   7650
TE    CNS 11643-1986 plane 14  6319+239+10      239 from CCIII, 10 from XCCS

Japan
Code  Standard       Character count  Note
J0    JIS X 0208-90  6335+1
J1    JIS X 0212-90  5801

South Korea
Code  Standard      Character count  Note
K0    KS C 5601-87  4888             includes 268 duplicates
K1    KS C 5657-91  2856
Others

In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to the code points U+9FA6 through U+9FBB.
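The size of that range can be checked with simple arithmetic. The following sketch counts the code points from U+9FA6 to U+9FBB inclusive and compares the result with the two character counts given above; the variable names are illustrative only.

```python
# Count the code points in the inclusive range U+9FA6..U+9FBB.
FIRST, LAST = 0x9FA6, 0x9FBB
range_size = LAST - FIRST + 1

HKSCS_2004_COUNT = 14  # figure taken from the text above
GB_18030_COUNT = 8     # figure taken from the text above

# Both numbers should agree if the range is fully used by these additions.
print(range_size, HKSCS_2004_COUNT + GB_18030_COUNT)
```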

CJK Unified Ideographs Extension A

Extension A, in block 3400-4DBF, comprises 6,582 characters that were added in Unicode 3.0. Among the extension blocks, it contains the characters that are more commonly used.

Charts

3400-4DBF.

Sources

China
Code  Standard
GE    GB 16500-95
GS    Singapore CJK ideographs

Taiwan
Code  Standard                 Note
T3    CNS 11643-1992 plane 3
T4    CNS 11643-1992 plane 4
T5    CNS 11643-1992 plane 5
T6    CNS 11643-1992 plane 6
T7    CNS 11643-1992 plane 7
TF    CNS 11643-1992 plane 15

Japan
Code  Standard                                                    Note
JA    Unified Japanese IT Vendors Contemporary Ideographs, 1993

South Korea
Code  Standard            Note
K2    PKS C 5700-1:1994
K3    PKS C 5700-2:1994

Vietnam
Code  Standard        Note
V0    TCVN 5773:1993
V1    TCVN 6056:1995

CJK Unified Ideographs Extension B

Extension B, in block 20000-2A6DF, comprises 42,711 characters that were added in Unicode 3.1 (2001). These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Chữ Nôm characters that were historically used for writing the Vietnamese language.
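Extension B is the first of these blocks to lie outside the Basic Multilingual Plane. As a minimal sketch, the plane of a code point can be read from the bits above the low 16; the function name `plane` and the small name table are illustrative assumptions, not part of any standard API.

```python
def plane(code_point: int) -> int:
    """Return the Unicode plane number of a code point (0 = BMP, 2 = SIP)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane holds 0x10000 code points, so the plane is the value above bit 15.
    return code_point >> 16


PLANE_NAMES = {0: "BMP", 2: "SIP"}

# First code points of the basic block and Extension A, then both ends of Extension B.
for cp in (0x4E00, 0x3400, 0x20000, 0x2A6DF):
    print(f"U+{cp:04X}", PLANE_NAMES.get(plane(cp), "other"))
```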

Charts

20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF.

Sources

CJK Unified Ideographs Extension C

Extension C, in block 2A700-2B73F, comprises 4,149 characters that were added in Unicode 5.2 (2009).

Charts

2A700-2B73F.

Sources

China
Japan
  • Japanese KOKUJI Collection
South Korea
  • Korean IRG Hanja Character Set 5th Edition: 2001
North Korea
  • KPS 10721:2003
Vietnam
  • Từ điển chữ Nôm (喃字典), Nguyễn Quang Hồng, 2006
  • Từ điển chữ Nôm Tày, Hoàng Triều Ân, 2003
  • Bảng tra chữ Nôm miền Nam, Vũ Văn Kính, 1994
Other
  • Unicode UTC
  • ABC Chinese-English Dictionary, John DeFrancis (德范克), et al., eds., 2nd edition. (1998) Honolulu: University of Hawaii Press
  • The Church of Jesus Christ of Latter-day Saints Hong Kong division
  • Mathews' Chinese-English Dictionary, Robert H. Mathews (1975) Cambridge; Harvard University Press
  • Guangyun
  • Chinese bird system index (中国鸟类系统检索), Zheng Zhuoxin (郑作新), et al. (2000), Beijing, 科学出版社 (www.sciencep.com)
  • Annotated Shuowen Jiezi, Duan Yucai

CJK Unified Ideographs Extension D

Extension D, in block 2B740-2B81F, comprises 222 characters that were added in Unicode 6.0 (2010).

Charts

2B740-2B81F.

CJK Unified Ideographs Extension E (projected)

The CJK Unified Ideographs Extension E block was earlier provisionally named Extension D.

CJK-E was originally intended to include another 16,000+ characters not present in CJK-C. However, in May 2007, the Republic of China (Taiwan) withdrew 6,545 characters used in personal names that were deemed no longer in use,[7] and so the extension has been reduced to approximately 10,000 characters.


CJK Compatibility Ideographs

Of the four CJK compatibility ideograph blocks in Unicode, only the CJK Compatibility Ideographs block (F900-FAFF) contains characters that belong to the CJK Unified Ideographs: twelve of its characters are classified as unified ideographs. The other compatibility ideographs are not related to CJK unification.

Charts

F900-FAFF.

Known issues

Disunification of U+4039

Until Unicode 5.0, the character U+4039 (䀹) was a unification of two different glyphs (one with the jiā 夾 phonetic and one with the shǎn 㚒 phonetic). However, the two were lexically distinct characters that should not have been unified; they have different pronunciations and different meanings.

The proposal to disunify U+4039[8] was accepted, and the new character was encoded at U+9FC3 in Unicode 5.1.

Unified ideographs outside of the blocks

The CJK Compatibility Ideographs block (F900-FAFF) is not part of the "unified ideographs" list, but includes twelve characters that are in fact classified and named as unified ideographs: FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21, FA23, FA24, FA27, FA28 and FA29.
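A range test for unified ideographs therefore needs both the block ranges and this list of exceptions. The sketch below combines the five block ranges given in this article with the twelve code points listed above. The function name `in_unified_range` is hypothetical, and the test checks range membership only: it does not know which code points inside a block are actually assigned in a given Unicode version.

```python
# Block ranges as given in this article (inclusive, hexadecimal).
UNIFIED_BLOCKS = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
]

# The twelve unified ideographs inside the CJK Compatibility Ideographs block.
COMPAT_UNIFIED = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}


def in_unified_range(ch: str) -> bool:
    """Return True if `ch` falls in a unified block or is one of the twelve exceptions."""
    cp = ord(ch)
    if cp in COMPAT_UNIFIED:
        return True
    return any(first <= cp <= last for first, last in UNIFIED_BLOCKS)


for sample in ("次", chr(0xFA0E), chr(0xF900), "A"):
    print(f"U+{ord(sample):04X}", in_unified_range(sample))
```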

Unifiable variants and exact duplicates in Extension B

In CJK Unified Ideographs Extension B, hundreds of glyph variants[9] were encoded, as well as six exact duplicates[10]:

  • U+34A8 㒨 = U+20457 𠑗
  • U+3DB7 㶷 = U+2420E 𤈎
  • U+8641 虁 = U+27144 𧅄
  • U+204F2 𠓲 = U+23515 𣔕
  • U+249BC 𤦼 = U+249E9 𤧩
  • U+24BD2 𤯒 = U+2A415 𪐕
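An application that wants to treat each duplicate pair as one character could fold one member of the pair onto the other before comparing text. The sketch below is one possible approach, not a standard normalization: keeping the lower code point of each pair is an arbitrary assumption made here, and the names `DUPLICATE_PAIRS` and `fold_duplicates` are hypothetical.

```python
# The six duplicate pairs listed above, as (code point kept, code point folded).
# Assumption: the lower code point of each pair is the one kept.
DUPLICATE_PAIRS = [
    (0x34A8, 0x20457),
    (0x3DB7, 0x2420E),
    (0x8641, 0x27144),
    (0x204F2, 0x23515),
    (0x249BC, 0x249E9),
    (0x24BD2, 0x2A415),
]

# str.translate accepts a mapping from code point to replacement code point.
FOLD_TABLE = {folded: kept for kept, folded in DUPLICATE_PAIRS}


def fold_duplicates(text: str) -> str:
    """Replace each folded duplicate with the code point kept for its pair."""
    return text.translate(FOLD_TABLE)


folded = fold_duplicates(chr(0x20457) + chr(0x2A415))
print([f"U+{ord(ch):04X}" for ch in folded])
```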

Other CJK Ideographs in Unicode, not Unified

Apart from the five blocks of "Unified Ideographs", Unicode has about a dozen more blocks with non-unified CJK characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some of these characters have (decomposable) counterparts in other blocks, their usage can differ.

Four blocks (one of which is labelled "Unified Ideographs") of compatibility characters are included for compatibility with legacy text-handling systems and other legacy character sets. They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Their use is therefore discouraged.

Usually, compatibility characters are those that would not have been encoded except for compatibility and round-trip convertibility with other standards. However, the number of CJK ideographs in non-Unicode standards is too large to fit into Unicode's CJK Compatibility Ideographs blocks. Instead, compatibility code points are assigned when the affected characters have been approved by the Unicode Consortium but have not yet been assigned code points within the CJK Unified Ideographs blocks.

Unicode version history

CJK Unified Ideographs additions per Unicode version
(BMP = Basic Multilingual Plane; SIP = Supplementary Ideographic Plane)

Version  Plane  Characters added  Total characters  Addition
1.0      BMP    20,902            20,914            CJK Unified Ideographs
         BMP    12                                  CJK Compatibility Ideographs
3.0      BMP    6,582             27,496            CJK Unified Ideographs Extension A
3.1      SIP    42,711            70,207            CJK Unified Ideographs Extension B
4.1      BMP    22                70,229            CJK Unified Ideographs: Ideographs from HKSCS-2004 and GB 18030-2000 not in ISO 10646
5.1      BMP    8                 70,237            CJK Unified Ideographs: Ideographs from Adobe Japan and disunification of U+4039
5.2      SIP    4,149             74,394            CJK Unified Ideographs Extension C
         BMP    8                                   8 other characters from ARIB #47, #95, #93 and HKSCS
6.0      SIP    222               74,616            CJK Unified Ideographs Extension D
6.1      BMP    1                 74,617            1 character corresponding to Adobe-Japan1-6 CID+20156
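The running totals in the table above can be re-derived from the per-version additions. The sketch below sums the "characters added" figures, grouping the two rows that belong to versions 1.0 and 5.2; the variable names are illustrative.

```python
from itertools import accumulate

# Characters added per Unicode version, taken from the table above.
ADDITIONS = [
    ("1.0", 20_902 + 12),  # basic block + twelve in CJK Compatibility Ideographs
    ("3.0", 6_582),        # Extension A
    ("3.1", 42_711),       # Extension B
    ("4.1", 22),
    ("5.1", 8),
    ("5.2", 4_149 + 8),    # Extension C + 8 other characters
    ("6.0", 222),          # Extension D
    ("6.1", 1),
]

versions = [version for version, _ in ADDITIONS]
running = accumulate(count for _, count in ADDITIONS)
totals = dict(zip(versions, running))

for version, total in totals.items():
    print(version, f"{total:,}")
```

The final figure agrees with the total of 74,617 stated for Unicode 6.1.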
