GB 2312

GB 2312
MIME / IANA	GB_2312-80 (GB2312 for usual EUC form)
Alias(es)	iso-ir-58, chinese, csISO58GB231280
Language(s)	Simplified Chinese, English, Russian; partial support: Greek, Japanese
Standard	GB/T 2312-1980
Classification	ISO-2022-compatible DBCS, CJK encoding
Extensions	ISO-IR-165
Encoding formats	EUC-CN (GB2312), HZ-GB-2312
Preceded by	Chinese telegraph code
Succeeded by	GBK, GB 18030
Other related encoding(s)	JIS X 0208, KS X 1001

GB/T 2312-1980 is a key official character set of the People's Republic of China, used for Simplified Chinese characters. GB2312 is the registered internet name for EUC-CN, which is its usual encoded form. GB refers to the Guobiao standards (国家标准). GB/T 2312-1980 has been superseded by GBK and GB18030, which include additional characters, but GB/T 2312 remains in widespread use as a subset of those encodings.

According to a 2017 National Standard Bulletin of the People's Republic of China, GB 2312-1980 is no longer a mandatory standard, and its designation has been changed to GB/T 2312-1980.^[1]

While GB/T 2312 covers over 99.99% of contemporary Chinese text usage,^[2] historical texts and many names remain out of scope. The standard includes 6,763 Chinese characters on two levels (the first arranged by reading, the second by radical and then number of strokes), along with symbols and punctuation, Japanese kana, the Greek and Cyrillic alphabets, Zhuyin, and a double-byte set of Pinyin letters with tone marks. Counting these symbols, GB/T 2312-1980 contains 7,445 characters in total.

As of June 2020, GB2312 is the most popular Chinese encoding: 13.6% of web pages served from China and territories declare it,^[3] as do 0.4% of all web pages globally, down from 3.5% in January 2010.^[4] However, all major web browsers decode documents labelled e.g. "GB2312" or "GB 2312" (though not all do so for "GB_2312") as if they were labelled "gbk",^[5] which is a superset encoding. GB 2312 and GBK have a combined share of 16.7% (0.6% globally).

There is an analogous character set known as GB/T 12345, closely related to GB/T 2312 but with traditional character forms replacing the simplified forms, plus 62 supplemental characters.^[6]^[7] GB-encoded fonts often come in pairs, one with the GB/T 2312 (simplified) character set and the other with the GB/T 12345 (traditional) character set.

Characters

Characters in GB/T 2312 are arranged in a 94x94 grid (as in ISO 2022), and the two-byte code point of each character is expressed in the kuten (or quwei) form, which specifies a row (ku or qu) and the position of the character within the row (cell, ten or wei).

The rows (numbered from 1 to 94) contain characters as follows:

01–09: punctuation and other special characters; also Hiragana, Katakana, Greek, Cyrillic, Pinyin, Bopomofo
16–55: the first level of Chinese characters, arranged according to Pinyin (3,755 characters)
56–87: the second level of Chinese characters, arranged according to radical and strokes (3,008 characters)
88–89: further Chinese characters (103 characters); defined only for GB/T 12345, not GB/T 2312

The rows 10–15 and 90–94 are unassigned.
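The row layout above can be expressed as a small lookup. The following is a minimal sketch; the function name `classify_row` and the region labels are illustrative and do not come from the standard.

```python
def classify_row(row: int) -> str:
    """Return the GB/T 2312 region that a row (qu) number falls into."""
    if not 1 <= row <= 94:
        raise ValueError(f"row must be in 1..94, got {row}")
    if row <= 9:
        return "symbols"            # punctuation, kana, Greek, Cyrillic, ...
    if 16 <= row <= 55:
        return "level 1 hanzi"      # arranged by Pinyin
    if 56 <= row <= 87:
        return "level 2 hanzi"      # arranged by radical and strokes
    if 88 <= row <= 89:
        return "GB/T 12345 only"    # not defined in GB/T 2312
    return "unassigned"             # rows 10-15 and 90-94
```

For instance, `classify_row(45)` falls in the first level of Chinese characters, while `classify_row(12)` is unassigned.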

GB/T 2312-1980 contains 682 symbols and 6,763 Chinese characters.

Encodings of GB/T 2312

EUC-CN

EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB/T 2312, as it maintains compatibility with ASCII. Two bytes are used to represent every character not found in ASCII. The first byte lies in the range 0xA1–0xF7 (161–247), and the second byte in the range 0xA1–0xFE (161–254). Since both ranges lie beyond ASCII, it is possible, as in UTF-8, to check whether a byte is part of a multi-byte construct when using EUC-CN, but not whether that byte is the first or the last.
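Because the first-byte and second-byte ranges overlap, a high byte taken in isolation is ambiguous, and a decoder has to scan forward from a known character boundary. A minimal sketch of such a scan follows; the name `split_euc_cn` is hypothetical, and the function checks only the byte ranges given above, not whether a cell is actually assigned.

```python
def split_euc_cn(data: bytes) -> list:
    """Split EUC-CN bytes into 1-byte ASCII units and 2-byte GB/T 2312 units."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte: stands alone.
            units.append(data[i:i + 1])
            i += 1
        elif (0xA1 <= b <= 0xF7 and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Valid first byte followed by a valid second byte.
            units.append(data[i:i + 2])
            i += 2
        else:
            raise ValueError(f"invalid EUC-CN byte 0x{b:02X} at offset {i}")
    return units
```

Starting the scan in the middle of a two-byte character would pair the bytes incorrectly, which is why the scan must begin at the start of the data or at an ASCII byte.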

Compared to UTF-8, GB2312 (whether native or encoded in EUC-CN) is more storage efficient: while UTF-8 uses three bytes^[a] per CJK ideograph, GB2312 only uses two. However, GB2312 does not cover as many ideographs as Unicode does.
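The size difference can be checked with Python's built-in `gb2312` and `utf-8` codecs. This is an illustrative sketch; the helper name `encoded_sizes` is made up for the example.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (EUC-CN size, UTF-8 size) in bytes for the given text."""
    return len(text.encode("gb2312")), len(text.encode("utf-8"))


# One ideograph takes two bytes in EUC-CN and three in UTF-8;
# ASCII characters take one byte in both encodings.
sizes_hanzi = encoded_sizes("外")
sizes_ascii = encoded_sizes("A")
```

Text containing a character outside GB/T 2312 would make the `gb2312` codec raise `UnicodeEncodeError`, which reflects the coverage limitation noted above.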

To map the kuten code points to bytes, add 160 (0xA0) to the row number (ku, the 1000s and 100s place) of the code point to form the high byte, and add 160 to the column number (ten, the 10s and 1s place) of the code point to form the low byte.

For example, for the GB/T 2312 code point 4566 ("外",^[8] meaning "foreign"), the high byte comes from the row number, 45: 45+160=205=0xCD, and the low byte comes from the column, 66: 66+160=226=0xE2. So, the full encoding is 0xCDE2.^[9]
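The mapping between kuten code points and EUC-CN bytes can be sketched in a few lines. The function names below are illustrative; the kuten value is taken as a four-digit decimal integer, as in the example above.

```python
def kuten_to_euc_cn(kuten: int) -> bytes:
    """Convert a four-digit kuten code point (e.g. 4566) to EUC-CN bytes."""
    row, cell = divmod(kuten, 100)
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError(f"row and cell must both be in 1..94: {kuten}")
    # Add 0xA0 (160) to the row and to the cell to get the two bytes.
    return bytes([row + 0xA0, cell + 0xA0])


def euc_cn_to_kuten(pair: bytes) -> int:
    """Convert a two-byte EUC-CN sequence back to its kuten code point."""
    high, low = pair
    return (high - 0xA0) * 100 + (low - 0xA0)
```

With these definitions, `kuten_to_euc_cn(4566)` yields the bytes 0xCD 0xE2, and `euc_cn_to_kuten(b"\xcd\xe2")` returns 4566.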

HZ

HZ is another encoding of GB 2312 that is used mostly for Usenet postings.
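As an illustration, Python's standard library includes an `hz` codec, which produces a 7-bit form in which runs of GB 2312 characters are bracketed by the escape sequences `~{` and `~}`. The sketch below assumes that codec; the helper name `to_hz` is hypothetical.

```python
def to_hz(text: str) -> bytes:
    """Encode text as HZ using Python's built-in 'hz' codec."""
    return text.encode("hz")


encoded = to_hz("外")
# Every byte of the HZ form is 7-bit, which suits ASCII-only transports.
is_seven_bit = all(b < 0x80 for b in encoded)
round_trip = encoded.decode("hz")
```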

Two implementations of GB/T 2312

There are two implementations of GB/T 2312, which differ in a few code points.

EUC-CN	GBK/GB18030 subset	GB2312.TXT	Character name^[10]^: 3
A1A4	U+00B7 · MIDDLE DOT	U+30FB ・ KATAKANA MIDDLE DOT	间隔点; 'separator dot'
A1AA	U+2014 — EM DASH	U+2015 ― HORIZONTAL BAR	破折号; 'em dash'

The GBK/GB18030 subset is compatible with both GBK and GB18030. GB2312.TXT is the formerly official implementation from ftp.unicode.org,^[11] which has been obsolete since August 2011^[12] and was missing as of September 2016. Further vendor mappings have also existed.^[11]

As of 2015, Microsoft .NET Framework uses the subset, while ICU,^[13] iconv-1.14,^[14] php-5.6, ActivePerl-5.20, Java 1.7 and Python 3.4^[15] use GB2312.TXT. Ruby 2.2 is compatible with both implementations; it internally converts the conflicting characters to the subset. The W3C's technical recommendation specifies that a GBK encoding be inferred for streams labelled gb2312, which in turn uses a GB18030 decoder.^[16]
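Converting between the two implementations amounts to remapping the two code points listed in the table above. The following is a minimal sketch of such a normalization, not the method used by any particular library; the names `GB2312_TXT_TO_SUBSET` and `to_gbk_subset` are made up for the example.

```python
# The two characters that differ, keyed by their GB2312.TXT mapping.
GB2312_TXT_TO_SUBSET = {
    "\u30fb": "\u00b7",  # A1A4: KATAKANA MIDDLE DOT -> MIDDLE DOT
    "\u2015": "\u2014",  # A1AA: HORIZONTAL BAR -> EM DASH
}
_TABLE = str.maketrans(GB2312_TXT_TO_SUBSET)


def to_gbk_subset(text: str) -> str:
    """Remap text decoded with GB2312.TXT to the GBK/GB18030 subset mapping."""
    return text.translate(_TABLE)
```

This is only appropriate for text known to have been decoded from GB 2312 bytes with the GB2312.TXT mapping; applied to arbitrary Unicode text, it would also rewrite a U+30FB or U+2015 that came from another source.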

References

[1] "2017年第7号中国国家标准公告 (China National Standard Bulletin 2017 No.7)". Standardization Administration of the People's Republic of China. Retrieved 3 July 2018.
[2] Hannas, William C. (1997). Asia's Orthographic Dilemma. University of Hawai‘i Press. p. 264. the set provides for better than 99.99 percent of all usage. Nevertheless, the designers found it necessary to add 14,276 "special usage" characters to cover contingencies!
[3] "Distribution of Character Encodings among websites that use China and territories". w3techs.com. Retrieved 2020-06-01.
[4] "Historical trends in the usage of character encodings, June 2020". w3techs.com. Retrieved 2020-06-01.
[5] "Encoding: Summarized test results". www.w3.org. Retrieved 2019-11-15.
[6] "GB/T 12345" (PDF).
[7] GB12345-80 to Unicode table. Unicode Consortium. 1993-12-06. Archived from the original on 2004-06-17.
[8] https://archive.org/details/GB2312-1980/page/n17
[9] https://web.archive.org/web/20160303230643/http://cs.nyu.edu/~yusuke/tools/unicode_to_gb2312_or_gbk_table.html
[10] "GB 2312-1980: Information technology—Chinese ideogram coded character set for information interchange (basic set)". Retrieved 2 October 2016.
[11] Haible, Bruno. "GB2312 (Conversion Tables)". Retrieved 29 September 2016.
[12] "Readme – MAPPINGS/OBSOLETE/EASTASIA". 9 August 2001. Retrieved 29 September 2016.
[13] "java-EUC_CN-1.3_P.ucm". Retrieved 29 September 2016. [permanent dead link]
[14] "libiconv:lib/gb2312.h". GNU Savannah. Retrieved 29 September 2016.
[15] "Issue 24036". Python Bug Tracker.
[16] "Encoding § Names and labels". W3C. Retrieved 29 September 2016.

Notes

[a] Only for ideographs covered by GB/T 2312, all of which fall into the Unicode BMP.

