CJK Unified Ideographs

From Wikipedia, the free encyclopedia
CJKV ideograph in traditional and simplified Chinese, Korean, Vietnamese and Japanese.

The Chinese, Japanese and Korean (CJK) scripts share a common background. In a process called Han unification, the characters shared among them were identified and named "CJK Unified Ideographs". As of version 8.0, Unicode defines a total of 80,388 CJK Unified Ideographs.[1]

The terms "ideograph" and "ideogram" may be misleading, since the Chinese script is not strictly a picture writing system.

Historically, Vietnam also used Chinese ideographs, so the abbreviation "CJKV" is sometimes used. In Vietnam, this system was replaced by the Latin-based Vietnamese alphabet in the 1920s.

CJK Unified Ideographs blocks

CJK Unified Ideographs

The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,950 basic Chinese characters in the range U+4E00 through U+9FD5. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters were also used in Vietnam's Nôm script (now obsolete). The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical order.
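The range arithmetic above can be checked with a short sketch. The constant and function names here are illustrative only, and the assigned range is the Unicode 8.0 one quoted in the text, so later Unicode versions would need different bounds.

```python
# Bounds taken from the paragraph above (Unicode 8.0).
BASIC_FIRST = 0x4E00          # first code point of the block
BASIC_LAST_ASSIGNED = 0x9FD5  # last assigned code point in Unicode 8.0
BLOCK_LAST = 0x9FFF           # end of the block, including unassigned code points


def in_basic_block(ch: str) -> bool:
    """Return True if ch is an assigned character of the basic block (8.0 range)."""
    return BASIC_FIRST <= ord(ch) <= BASIC_LAST_ASSIGNED


assigned = BASIC_LAST_ASSIGNED - BASIC_FIRST + 1  # inclusive range
unassigned = BLOCK_LAST - BASIC_LAST_ASSIGNED     # code points left at the end

print(assigned)              # 20950
print(unassigned)            # 42
print(in_basic_block("中"))   # True  (U+4E2D)
print(in_basic_block("A"))   # False
```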

The block is the result of Han unification,[2] which was somewhat controversial in the Far East.[3] Since Chinese, Japanese and Korean characters were coded in the same location, the appearance of a selected glyph could depend on the particular font being used. However, the source separation rule states that characters encoded separately in an earlier character set would remain separate in the new Unicode encoding.[4]

Using variation selectors, it is possible to specify certain variant CJK ideographs within Unicode. The Adobe-Japan1 character set proposal, which calls for 14,658 ideographic variation sequences, is an extreme example of the use of variation selectors.[5]
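As a rough illustration, an ideographic variation sequence is a base ideograph followed by one selector from the Variation Selectors Supplement (U+E0100 through U+E01EF). The helper names below are hypothetical, and U+845B is used only as an example base character. The sketch does not check whether a sequence is actually registered in any collection.

```python
# Variation Selectors Supplement: VS17..VS256 occupy U+E0100..U+E01EF.
IVS_FIRST, IVS_LAST = 0xE0100, 0xE01EF


def make_ivs(base: str, selector_index: int) -> str:
    """Append the selector_index-th supplementary selector (0-based) to base."""
    selector = IVS_FIRST + selector_index
    if not IVS_FIRST <= selector <= IVS_LAST:
        raise ValueError("selector index out of range")
    return base + chr(selector)


def split_ivs(seq: str):
    """Return (base, selector code point) for a base + selector pair, else None."""
    if len(seq) == 2 and IVS_FIRST <= ord(seq[1]) <= IVS_LAST:
        return seq[0], ord(seq[1])
    return None


seq = make_ivs("\u845b", 0)  # example base U+845B plus the first selector
print([f"U+{ord(c):04X}" for c in seq])  # ['U+845B', 'U+E0100']
```

Whether the two code points render as a distinct glyph depends on the font; text processing sees an ordinary two-code-point sequence.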

Charts

4E00-62FF, 6300-77FF, 7800-8CFF, 8D00-9FFF.

Sources

Note: Most characters appear in multiple sources, so the sum of the individual character counts (87,360) far exceeds the number of encoded characters (20,950).
The following statistics are based on the Unicode 8.0 code chart. All figures are quadruple-verified.

Country Code Standard Character count Total
China G0 GB 2312-80 6763 20913
G1 GB 12345-90 2202
G3 GB 7589-87 traditional form 4834
G5 GB 7590-87 traditional form 2841
G7 Modern Chinese general character chart 42
G8 GB8565-88 290
G9 GB18030-2000 8
GE GB16500-95 3779
GH GB/T 15564-1995 59
GHZ Hanyu Da Zidian 1
GK GB 12052-89 89
GKX Kangxi Dictionary 2
GFC Modern Chinese Standard Dictionary (现代汉语规范词典) 2
GGFZ General Chinese Standard Dictionary (通用规范汉字字典) 1
Hong Kong H Hong Kong Supplementary Character Set 2292 15353
HB0 Computer Chinese Glyph and Character Code Mapping Table, Technical Report C-26 (電腦用中文字型與字碼對照表, 技術通報C-26) 10
HB1 Big-5, Level 1 5401
HB2 Big-5, Level 2 7650
Japan J0 JIS X 0208-1990 6356 12563
J1 JIS X 0212-1990 3058
J3 JIS X 0213-2004 Level 3 1132
J3A JIS X 0213-2004 Level 3 addendum 9
J4 JIS X 0213-2004 Level 4 2005
JARIB ARIB STD-B24 3
South Korea K0 KS C 5601-87 (now KS X 1001:2004) 4620 15391
K1 KS C 5657-91 (now KS X 1002:2004) 2856
K2 PKS C 5700-1:1994 7911
K4 PKS 5700-3:1998 4
Taiwan T1 CNS 11643-1992 plane 1 5413 18370
T2 CNS 11643-1992 plane 2 7650
T3 CNS 11643-1992 plane 3 4144
T4 CNS 11643-1992 plane 4 894
T5 CNS 11643-1992 plane 5 63
T6 CNS 11643-1992 plane 6 31
T7 CNS 11643-1992 plane 7 16
TC CNS 11643-1992 plane 12 1
TF CNS 11643-1992 plane 15 158
Vietnam V0 TCVN 5773-1993 593 4757
V1 TCVN 6056-1995 3310
V2 VHN 01-1998 763
V3 VHN 02-1998 91
Unicode UTC UTC UTC sources 13 13
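The totals in the table above can be cross-checked with a few lines of arithmetic. This is only a sketch: the figures are copied from the table, and the dictionary keys are informal labels rather than official source names.

```python
# Per-region totals copied from the "Total" column of the table above.
region_totals = {
    "China": 20913,
    "Hong Kong": 15353,
    "Japan": 12563,
    "South Korea": 15391,
    "Taiwan": 18370,
    "Vietnam": 4757,
    "UTC": 13,
}

# Per-source counts for the China rows (G0, G1, G3, ..., GGFZ), in table order.
china_rows = [6763, 2202, 4834, 2841, 42, 290, 8, 3779, 59, 1, 89, 2, 2, 1]

encoded = 20950  # characters actually encoded in the block
source_references = sum(region_totals.values())

print(sum(china_rows))                        # 20913
print(source_references)                      # 87360
print(round(source_references / encoded, 2))  # 4.17
```

The last figure shows that a character in this block is listed under roughly four regional totals on average.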

In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned code points between U+9FA6 and U+9FBB.

CJK Unified Ideographs Extension A

The block named CJK Unified Ideographs Extension A (3400–4DBF) contains 6,582 additional characters in the range U+3400 through U+4DB5 that were added in Unicode 3.0 (1999).

Charts

3400-4DBF.

Sources

Note: Most characters appear in more than one source, so the sum of the individual character counts (15,565) far exceeds the number of encoded characters (6,582).
The following statistics are based on the Unicode 8.0 code chart. All figures are quadruple-verified.

Country Code Standard Character count Total
China GKX Kangxi Zidian 1890 6192
GHZ Hanyu Da Zidian 339
G3 GB 7589-87 traditional form 2391
G5 GB 7590-87 traditional form 1226
G7 Modern Chinese general character chart 120
GS Singapore Chinese characters 226
Hong Kong H Hong Kong Supplementary Character Set 572 572
South Korea K3 PKS C 5700-2:1994 1834 1836
K4 PKS 5700-3:1998 2
Taiwan T3 CNS 11643-1992 plane 3 2178 5906
T4 CNS 11643-1992 plane 4 2917
T5 CNS 11643-1992 plane 5 395
T6 CNS 11643-1992 plane 6 197
T7 CNS 11643-1992 plane 7 133
TF CNS 11643-1992 plane 15 86
Japan J3 JIS X 0213-2004 Level 3 19 738
JA Japanese IT Vendors Contemporary Ideographs, 1993 574
J4 JIS X 0213-2004 Level 4 145
Vietnam V0 TCVN 5773-1993 138 308
V2 VHN 01-1998 151
V3 VHN 02-1998 19
UTC UTC UTC sources 13 13

CJK Unified Ideographs Extension B

The block named CJK Unified Ideographs Extension B (20000–2A6DF) contains 42,711 characters in the range U+20000 through U+2A6D6 that were added in Unicode 3.1 (2001). These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Nôm characters that were formerly used to write Vietnamese.
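Because Extension B lies above U+FFFF, its characters take a surrogate pair in UTF-16 and four bytes in UTF-8. The sketch below illustrates this for U+20000, the first code point of the block; the function name is illustrative.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


first = "\U00020000"  # first character of Extension B

high, low = to_surrogates(ord(first))
print(f"{high:04X} {low:04X}")          # D840 DC00
print(first.encode("utf-16-be").hex())  # d840dc00
print(first.encode("utf-8").hex())      # f0a08080
print(len(first))                       # 1 (Python counts code points)
```

Software that counts UTF-16 code units rather than code points will report a length of 2 for such a character.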

Charts

20000-215FF, 21600-230FF, 23100-245FF, 24600-260FF, 26100-275FF, 27600-290FF, 29100-2A6DF.

Sources

Note: Many characters appear in more than one source, so the sum of the individual character counts (67,157) far exceeds the number of encoded characters (42,711).
The following statistics are based on the Unicode 8.0 code chart. All figures are quadruple-verified.

Country Code Standard Character count Total
China G3 GB 7589-87 traditional form 1 30525
G9 GB18030-2000 6
G4K Siku Quanshu 522
GBK Encyclopedia of China 86
GCH Cihai 247
GCY Ciyuan 66
GFZ Founder Press System 65
GHC Hanyu Da Cidian 553
GHZ Hanyu Da Zidian 10510
GKX Kangxi Dictionary 18469
Hong Kong H Hong Kong Supplementary Character Set 1702 1702
Japan J3 JIS X 0213-2004 Level 3 25 303
J3A JIS X 0213-2004 Level 3 addendum 1
J4 JIS X 0213-2004 Level 4 277
South Korea K4 PKS 5700-3:1998 166 166
Taiwan T3 CNS 11643-1992 plane 3 25 30178
T4 CNS 11643-1992 plane 4 3408
T5 CNS 11643-1992 plane 5 8111
T6 CNS 11643-1992 plane 6 5934
T7 CNS 11643-1992 plane 7 6299
TF CNS 11643-1992 plane 15 6401
Vietnam V0 TCVN 5773-1993 1515 4231
V2 VHN 01-1998 2290
V3 VHN 02-1998 425
V4 Dictionary on Nom (Từ điển chữ Nôm); Dictionary on Nom of Tay ethnic (Từ điển chữ Nôm Tày); Lookup Table for Nom in the South (Bảng tra chữ Nôm miền Nam) 1
Unicode UTC UTC / UCI UTC sources (UTC 48 + UCI 4) 52 52

CJK Unified Ideographs Extension C

The block named CJK Unified Ideographs Extension C (2A700–2B73F) contains 4,149 characters in the range U+2A700 through U+2B734 that were added in Unicode 5.2 (2009).

Charts

2A700-2B73F.

Sources

Note: Some characters appear in more than one source, so the sum of the individual character counts (4,531) exceeds the number of encoded characters (4,149).
The following statistics are based on the Unicode 8.0 code chart. All figures are quadruple-verified.

Country Code Standard Character count Total
China GBK Encyclopedia of China 74 1119
GCH Cihai 264
GCY Ciyuan 1
GCYY Chinese Academy of Surveying and Mapping ideographs 55
GFZ Founder Press System 1
GGH Old Chinese Dictionary (古代汉语词典) 50
GHC Hanyu Da Cidian 14
GHZ Hanyu Da Zidian 1
GJZ Commercial Press ideographs 61
GKX Kangxi Dictionary 6
GXC Xiandai Hanyu Cidian 25
GZFY Dictionary of Chinese Dialects (汉语方言大辞典) 202
GZJW Collections of Bronze Inscriptions from Yin and Zhou Dynasties (殷周金文集成引得) 365
Hong Kong H Hong Kong Supplementary Character Set 1 1
Japan JK Japanese Kokuji Collection 367 367
North Korea KP1 KPS 10721-2000 8 8
South Korea K5 Korean IRG Hanja Character Set 404 404
Macau MAC Macao Information System Character Set (澳門資訊系統字集) 16 16
Taiwan TC CNS 11643-1992 plane 12 634 1750
TD CNS 11643-1992 plane 13 766
TE CNS 11643-1992 plane 14 350
Vietnam V1 TCVN 6056:1995 1 785
V4 Dictionary on Nom (Từ điển chữ Nôm); Dictionary on Nom of Tay ethnic (Từ điển chữ Nôm Tày); Lookup Table for Nom in the South (Bảng tra chữ Nôm miền Nam) 784
Unicode UTC UTC / UCI UTC sources (UTC 80 + UCI 1) 81 81

CJK Unified Ideographs Extension D

The block named CJK Unified Ideographs Extension D (2B740–2B81F) contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0 (2010).

Charts

2B740–2B81F.

Sources

Note: Four characters appear in more than one source, so the sum of the individual character counts is 226.
The following statistics are based on the Unicode 8.0 code chart. All figures are quadruple-verified.

Country Code Standard Character count Total
China GCH Cihai 1 76
GIDC ID System of the Ministry of Public Security of China 32
GXC Xiandai Hanyu Cidian 4
GZH Zhonghua Zihai 39
Japan JH Hanyo-Denshi Program (汎用電子情報交換環境整備プログラム) 107 107
Taiwan TB CNS 11643-1992 plane 11 24 24
Unicode UTC UTC UTC sources 19 19

CJK Unified Ideographs Extension E

The block named CJK Unified Ideographs Extension E (2B820–2CEAF) contains 5,762 characters in the range U+2B820 through U+2CEA1 that were added in Unicode 8.0 (2015).
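The assigned ranges quoted for the six blocks can be gathered into a small lookup table. The sketch below uses only the Unicode 8.0 ranges given in this article; the table and function names are illustrative, and the ranges would need extending for later Unicode versions.

```python
# (block name, first assigned code point, last assigned code point), Unicode 8.0.
BLOCKS = [
    ("CJK Unified Ideographs", 0x4E00, 0x9FD5),
    ("CJK Unified Ideographs Extension A", 0x3400, 0x4DB5),
    ("CJK Unified Ideographs Extension B", 0x20000, 0x2A6D6),
    ("CJK Unified Ideographs Extension C", 0x2A700, 0x2B734),
    ("CJK Unified Ideographs Extension D", 0x2B740, 0x2B81D),
    ("CJK Unified Ideographs Extension E", 0x2B820, 0x2CEA1),
]


def block_of(ch: str):
    """Return the name of the unified-ideograph block containing ch, or None."""
    cp = ord(ch)
    for name, first, last in BLOCKS:
        if first <= cp <= last:
            return name
    return None


# Number of assigned characters per block (ranges are inclusive).
counts = {name: last - first + 1 for name, first, last in BLOCKS}

print(block_of("\u3400"))     # CJK Unified Ideographs Extension A
print(block_of("\U0002B820")) # CJK Unified Ideographs Extension E
print(sum(counts.values()))   # 80376
```

Adding the twelve unified ideographs that sit in the CJK Compatibility Ideographs block to the 80,376 characters counted here gives the overall total of 80,388.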

Charts

2B820–2CEAF.

Sources

Note: Some characters appear in more than one source, so the sum of the individual character counts (5,790) exceeds the number of encoded characters (5,762) by 28.
The following statistics are based on the Unicode 8.0 code chart. All figures are quadruple-verified.

Country Code Standard Character count Total
China GBK Encyclopedia of China 15 2815
GCH Cihai 112
GCY Ciyuan 3
GCYY Chinese Academy of Surveying and Mapping ideographs 98
GDZ Geology Press ideographs 1
GGH Old Chinese Dictionary (古代汉语词典) 176
GHC Hanyu Da Cidian 7
GIDC ID System of the Ministry of Public Security of China 36
GJZ Commercial Press ideographs 147
GKX Kangxi Dictionary 22
GRM People's Daily ideographs 3
GWZ Hanyu Da Cidian Press ideographs 12
GXC Xiandai Hanyu Cidian 57
GXH Xinhua Zidian 4
GZFY Hanyu Fangyan Dacidian (汉语方言大辞典, Dictionary of Chinese Dialects) 712
GZJW Collections of Bronze Inscriptions from Yin and Zhou Dynasties (殷周金文集成引得) 1410
Taiwan TC CNS 11643-1992 plane 12 323 1257
TD CNS 11643-1992 plane 13 595
TE CNS 11643-1992 plane 14 339
Japan JK Japanese Kokuji Collection 415 415
Macau MAC Macao Information System Character Set (澳門資訊系統字集) 48 48
Vietnam V4 Dictionary on Nom (Từ điển chữ Nôm); Dictionary on Nom of Tay ethnic (Từ điển chữ Nôm Tày); Lookup Table for Nom in the South (Bảng tra chữ Nôm miền Nam) 1028 1028
UTC UTC UTC sources 227 227

UTC Sources

The Ideographic Rapporteur Group (IRG) bears formal responsibility for developing extensions to the encoded repertoires of unified CJK ideographs. The Unicode Consortium participates in this group as a liaison member of ISO. Characters submitted by the Unicode Technical Committee bear the prefix "UTC". All CJK Unified Ideographs in ISO/IEC 10646 are required to have at least one source identifier. Changes to IRG source information, however, can leave a given ideograph without any source. In such cases, the ideograph is included in the U-source database to guarantee that it has at least one source; such ideographs are indicated by the source prefix "UCI" instead of "UTC".[6]
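The distinction between the two prefixes can be sketched as a small parser. It assumes that U-source references are written as a prefix, a hyphen and a decimal index (for example "UTC-00001"); that textual format is an assumption made for illustration and is not stated in this article, and the function name is hypothetical.

```python
import re

# Assumed reference format: "UTC-" or "UCI-" followed by decimal digits.
U_SOURCE = re.compile(r"(UTC|UCI)-(\d+)")


def parse_u_source(ref: str) -> tuple[str, int]:
    """Split an assumed U-source reference into (prefix, numeric index)."""
    match = U_SOURCE.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a U-source reference: {ref!r}")
    return match.group(1), int(match.group(2))


prefix, index = parse_u_source("UCI-00004")
# "UCI" marks an ideograph kept in the database only so that it has a source.
kept_to_guarantee_source = prefix == "UCI"

print(parse_u_source("UTC-00001"))  # ('UTC', 1)
print(kept_to_guarantee_source)     # True
```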

The UTC sources consist of the following:

  • ABC Chinese-English Dictionary by John DeFrancis
  • The Adobe-CNS1 glyph collection
  • The Adobe-Japan1 glyph collection
  • A Complete Checklist of Species and Subspecies of Chinese Birds (中国鸟类系统检索)
  • The Great Nom Dictionary (Đại Tự Điển Chữ Nôm)
  • Annotations to Shuowen Jiezi (annotated by Duan Yucai)
  • GB18030-2000
  • Required Character List Supplied by The Church of Jesus Christ of Latter-day Saints (Hong Kong)
  • New Commercial Dictionary (商务新词典), Hong Kong
  • Defect reports filed against the Unicode Standard or other direct communication with the Unicode editorial committee
  • Unicode Technical Committee (UTC) documents
  • Modern Chinese Dictionary (现代汉语词典), by Chinese Academy of Social Sciences, Linguistics Research Institute, Dictionary Editorial Office
  • Working Group (WG2) documents
  • Wenlin (文林) http://www.wenlin.com/

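CJK Compatibility Ideographs

There are four Unicode blocks whose names include the phrase "CJK Compatibility":

  • CJK Compatibility (3300–33FF)
  • CJK Compatibility Forms (FE30–FE4F)
  • CJK Compatibility Ideographs (F900–FAFF)
  • CJK Compatibility Ideographs Supplement (2F800–2FA1F)

Of these, only the CJK Compatibility Ideographs block contains CJK Unified Ideographs: twelve of its characters are classified as unified ideographs. None of the other characters in these blocks relate to CJK unification. See "Unified ideographs outside of the blocks" below.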

Known issues

Disunification of U+4039

The character U+4039 (䀹) was a unification of two different characters (one with jiā 夾 phonetic and one with shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings.

The proposal to disunify U+4039[7] was accepted, and the new character was encoded at U+9FC3 in Unicode 5.1.

Unified ideographs outside of the blocks

The CJK Compatibility Ideographs block (F900-FAFF) is not part of the "unified ideographs" list, but includes twelve characters that are in fact classified and named as unified ideographs: FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21, FA23, FA24, FA27, FA28 and FA29.
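One practical consequence can be checked with Python's standard unicodedata module. The sketch relies on a property not stated above: true compatibility ideographs such as U+F900 carry a canonical decomposition to a unified ideograph, whereas the twelve characters listed here have none and so survive normalization unchanged. The helper name is illustrative.

```python
import unicodedata

# The twelve unified ideographs located in the CJK Compatibility Ideographs block.
UNIFIED_IN_COMPAT_BLOCK = [
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
]


def survives_nfc(cp: int) -> bool:
    """Return True if the character is left unchanged by NFC normalization."""
    ch = chr(cp)
    return unicodedata.normalize("NFC", ch) == ch


print(all(survives_nfc(cp) for cp in UNIFIED_IN_COMPAT_BLOCK))  # True
print(survives_nfc(0xF900))                   # False
print(unicodedata.decomposition("\uf900"))    # 8C48
```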

Unifiable variants and exact duplicates in Extension B

In CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded.[8] In addition to the deliberate encoding of close glyph variants, six exact duplicates (where the same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a de facto disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake:[9]

  • U+34A8 㒨 = U+20457 𠑗 : U+20457 is the same as the China-source glyph for U+34A8, but it is significantly different from the Taiwan-source glyph for U+34A8
  • U+3DB7 㶷 = U+2420E 𤈎 : same glyph shapes
  • U+8641 虁 = U+27144 𧅄 : U+27144 is the same as the Korean-source glyph for U+8641, but it is significantly different from the China-, Taiwan- and Japan-source glyphs for U+8641
  • U+204F2 𠓲 = U+23515 𣔕 : same glyph shapes, but ordered under different radicals
  • U+249BC 𤦼 = U+249E9 𤧩 : same glyph shapes
  • U+24BD2 𤯒 = U+2A415 𪐕 : same glyph shapes, but ordered under different radicals
  • U+26842 𦡂 = U+26866 𦡦 : same glyph shapes
  • U+FA23 﨣 = U+27EAF 𧺯 : same glyph shapes (U+FA23 﨣 is a unified CJK ideograph, despite its name "CJK COMPATIBILITY IDEOGRAPH-FA23.")
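An application that wants to treat these pairs as equal, for example when searching text, could fold one member of each pair onto the other. The sketch below is a hypothetical policy, not something the Unicode Standard prescribes: it maps the right-hand code point of each pair in the list above to the left-hand one. For the two semi-duplicates this discards a glyph distinction.

```python
# (left-hand code point, right-hand code point) for each pair in the list above.
DUPLICATE_PAIRS = [
    (0x34A8, 0x20457),   # semi-duplicate
    (0x3DB7, 0x2420E),
    (0x8641, 0x27144),   # semi-duplicate
    (0x204F2, 0x23515),
    (0x249BC, 0x249E9),
    (0x24BD2, 0x2A415),
    (0x26842, 0x26866),
    (0xFA23, 0x27EAF),
]

# Translation table: right-hand character -> left-hand character.
FOLD = {chr(dup): chr(keep) for keep, dup in DUPLICATE_PAIRS}


def fold_duplicates(text: str) -> str:
    """Replace each right-hand duplicate with its left-hand counterpart."""
    return "".join(FOLD.get(ch, ch) for ch in text)


print(fold_duplicates("\U0002420E") == "\u3db7")  # True
print(fold_duplicates("abc"))                     # abc
```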

Other CJK Ideographs in Unicode, not Unified

Apart from the six blocks of "Unified Ideographs" described above, Unicode has about a dozen more blocks containing non-unified CJK characters. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some of these characters have (decomposable) counterparts in other blocks, their usage can differ.

Four blocks of compatibility characters (one of which contains twelve characters classified as unified ideographs) are included for compatibility with legacy text-handling systems and other legacy character sets. They include forms of characters for vertical text layout and rich-text characters that Unicode recommends handling through other means. Their use is therefore discouraged.

Usually, compatibility characters are those that would not have been encoded except for compatibility and round-trip convertibility with other standards. However, the number of CJK ideographs in non-Unicode standards is too large to fit into Unicode's CJK Compatibility Ideographs blocks. Instead, code points are assigned when the affected characters have been approved by the Unicode Consortium but have not yet been assigned code points within the CJK Unified Ideographs blocks.

Unicode version history

CJK Unified Ideographs additions per Unicode version
Unicode version Addition Plane Characters added Total Characters
1.0 (1991) CJK Unified Ideographs Basic Multilingual Plane (BMP) 20,902 20,914
CJK Compatibility Ideographs BMP 12
3.0 (1999) CJK Unified Ideographs Extension A BMP 6,582 27,496
3.1 (2001) CJK Unified Ideographs Extension B Supplementary Ideographic Plane (SIP) 42,711 70,207
4.1 (2005) CJK Unified Ideographs: Ideographs from HKSCS-2004 and GB 18030-2000 not in ISO 10646 BMP 22 70,229
5.1 (2008) CJK Unified Ideographs: Ideographs from Adobe Japan and disunification of U+4039 BMP 8 70,237
5.2 (2009) CJK Unified Ideographs Extension C SIP 4,149 74,394
8 other characters from ARIB #47, #95, #93 and HKSCS BMP 8
6.0 (2010) CJK Unified Ideographs Extension D SIP 222 74,616
6.1 (2012) 1 character corresponding to Adobe-Japan1-6 CID+20156 BMP 1 74,617
8.0 (2015) CJK Unified Ideographs Extension E SIP 5,762 80,388
9 other characters BMP 9
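The running totals in the table above follow from adding up the per-version additions, which the sketch below reproduces. The figures are copied from the table, with the two rows of versions 1.0, 5.2 and 8.0 added together; the variable names are illustrative.

```python
from itertools import accumulate

# (Unicode version, characters added in that version), copied from the table.
ADDITIONS = [
    ("1.0", 20902 + 12),  # basic block + unified ideographs in the compatibility block
    ("3.0", 6582),        # Extension A
    ("3.1", 42711),       # Extension B
    ("4.1", 22),
    ("5.1", 8),
    ("5.2", 4149 + 8),    # Extension C + 8 other characters
    ("6.0", 222),         # Extension D
    ("6.1", 1),
    ("8.0", 5762 + 9),    # Extension E + 9 other characters
]

versions = [version for version, _ in ADDITIONS]
running_totals = dict(zip(versions, accumulate(n for _, n in ADDITIONS)))

print(running_totals["3.1"])  # 70207
print(running_totals["5.2"])  # 74394
print(running_totals["8.0"])  # 80388
```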

Notes

See also