Talk:GB 18030
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
Now we have an understandable explanation in English
Now we have an understandable explanation in English from Sun. Anyone fancy expanding this article? Plugwash 17:43, 22 January 2006 (UTC)
One thing to beware of in using the Sun article - it has an error where it mentions Unicode 2.1 instead of 1.1 as basis for GBK. I checked elsewhere at length, and the Wikipedia GBK page is actually correct. BTW I did the recent updates to GBK, GB2312 and GB18030. -- Richard Donkin 06:39, 27 January 2006 (UTC)
GB 18030-2005?
According to http://www.sac.gov.cn/, the 2005 version of this standard was released on November 8, 2005, effective May 1, 2006. --Skyfiler 22:30, 23 February 2006 (UTC)
- Got any more detail? Is it just official incorporation of stuff introduced by Unicode since the original standard? Plugwash 17:18, 2 February 2006 (UTC)
- No more tech details on the website, and searches yield no results :( --Skyfiler 22:30, 23 February 2006 (UTC)
Unicode code points
>which is easily sufficient to cover Unicode's 1,114,112 (17*65536) code points.
This counted the 2,048 surrogate code points, which don't need to be encoded; e.g. in UTF-16 you can't encode U+D800, as this isn't a valid real code point.
Where does this figure come from? I understand that UTF-16 can encode all code points, but it only covers 1,112,064:
The BMP contributes 0x10000 - (0xE000 - 0xD800) code points, i.e. not counting the surrogates: 0xF800 (63,488). The rest have 20 bits in UTF-16 four-byte sequences (0x10000 is subtracted): 2^20 = 1,048,576 (0x100000). 0xF800 + 0x100000 = 0x10F800 (1,112,064).
- Sorry, I counted the overall range of code points and didn't account for the ones that are permanently reserved for uses other than encoding characters (e.g. the surrogates you mentioned). Plugwash 17:26, 14 July 2006 (UTC)
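For what it's worth, the arithmetic in this thread can be checked with a short Python sketch. The constants come from the Unicode standard and the published GB 18030 four-byte byte ranges, not from the article itself:

```python
# Unicode code space: 17 planes of 65,536 code points each.
TOTAL_CODE_POINTS = 17 * 0x10000              # 1,114,112
SURROGATES = 0xE000 - 0xD800                  # 2,048 code points reserved for UTF-16

# Scalar values (what UTF-8/UTF-16/UTF-32 can actually encode).
scalar_values = TOTAL_CODE_POINTS - SURROGATES

# The same total, derived the way the comment above derives it:
bmp_scalars = 0x10000 - SURROGATES            # 0xF800 = 63,488
supplementary = 2 ** 20                       # 20 bits in a surrogate pair
assert scalar_values == bmp_scalars + supplementary == 0x10F800  # 1,112,064

# GB 18030 four-byte sequences: bytes 1 and 3 range over 0x81-0xFE (126 values),
# bytes 2 and 4 over 0x30-0x39 (10 values), so the space is easily sufficient.
four_byte_space = 126 * 10 * 126 * 10         # 1,587,600
assert four_byte_space > TOTAL_CODE_POINTS
```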
The need for a new mapping table
[edit]It appears that the mapping table, while probably based on more or less official mappings, refers to an older version of Unicode, and, taking the section further up on the page into account, perhaps also of GB18030, and lacks several mappings that have now become available. Notably, it maps several characters present in GB18030 to characters in Unicode's Private Use Area, although according to Kenneth Whistler of Unicode, Inc, these characters were already mapped in Unicode 4.1. The same appears true of the mapping table included with my copy of Ubuntu Linux, which may or may not be the same table.
It would appear that no up-to-date table is available in the public domain, however, so this may be the most up-to-date table that's available. At any rate, I think we should keep our eyes open in case a more recent table surfaces. Rōnin 20:07, 21 February 2007 (UTC)
- All the sources I can find written in English online seem to say that GB18030 is supposed to be a 1:1 mapping of all parts of Unicode, including the private use area and all code points that are currently unassigned. Do you have an authoritative source that says (either directly or through tables that can be read in conjunction with that big XML file) that GB18030 values which map to private-use Unicode code points are used in the GB18030 standard for something other than private use? (Note: it seems that most, but not all, of the BMP private use area is mapped either to codes that our GBK article calls out as private use or to 4-byte codes.) Plugwash 04:00, 16 March 2007 (UTC)
- I have a few, actually. http://www.unicode.org/faq/han_cjk.html#23 says this:
- A. That used to be true, as of Unicode 4.0. There were in fact a small number of characters in GB 18030 that had not made it into Unicode (and ISO/IEC 10646). However, to avoid having to map characters to the PUA for support of GB18030, the missing characters were added as of Unicode 4.1, so of course, they are in Unicode 5.0 and later versions.
- You can find the characters in question in Annex C (p. 92) of GB 18030-2000. All now have regular Unicode characters. These can be found in the ranges: U+31C0..U+31CF (for CJK strokes) and U+9FA6..U+9FBB (for various CJK characters and components).
- I am aware that a new version of GB 18030 has been released which shows reference glyphs for a wider range of characters (including supplementary ones, I believe) and updates the mappings from Unicode 3.0 to Unicode 4.0 or higher - which changes some of them from PUA code points to assigned code points.
- Some assigned characters are mapped from 2-byte parts of GBK and GB 18030 to the Private-Use Area in the BMP (U+E000..U+F8FF). A small portion of these mappings have changed between GBK and GB 18030, and GB 18030 maps them instead to Unicode characters that were introduced in Unicode 3.0.
- It seems to me like the table in the article maps a lot of codes in the range E000-F8FF, which could mean that it actually maps a lot of now assigned code points to the Private Use Area. Though as seen in the post from the mailing list quoted above, the people at the ICU project are aware of it.
- The article also makes it clear now that the up-to-date version of the encoding is called GB18030-2005, while the mapping table is for the earlier version GB18030-2000. ICU does not provide an updated table, and the only one I've seen so far is one that allegedly originated from a blog. Something tells me GB18030-2005 isn't generating massive amounts of interest. Rōnin (talk) 21:48, 19 August 2008 (UTC)
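As an aside, the round-trip behaviour being discussed here can be spot-checked with Python's built-in gb18030 codec. Which mapping-table revision a given Python build implements is an assumption, so treat this as a sketch, not as evidence about the standard itself:

```python
# Round-trip a few code points through GB 18030: an ASCII letter, a BMP CJK
# ideograph, a Private Use Area code point, and a supplementary-plane character.
samples = ["A", "\u4e2d", "\ue000", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("gb18030")
    assert encoded.decode("gb18030") == ch   # lossless in both directions

# Characters outside the BMP always take four bytes in GB 18030.
assert len("\U0001F600".encode("gb18030")) == 4
```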
I have just created a table at GB 18030#PUA (moved from my user page), with some generally acceptable references. --Artoria2e5 emits crap 06:49, 11 September 2016 (UTC)
Section on compatibility?
Can we see a section or subsection on compatibility with other formats, or put the information in the lede? It sounds like the code page is compatible with Unicode, but I can't tell for sure whether the wording leaves room for other information I'm not seeing to contradict it in some situations. It would be much less confusing if it were stated exactly where all the relevant information is. ᛭ LokiClock (talk) 12:29, 14 March 2011 (UTC)
Chinese sites moving towards UTF8
Not sure about the OS side, but Chinese sites (Sina and Taobao, to name a few) have recently been moving towards UTF-8, usually from GB 2312. --207.38.206.45 (talk) 00:25, 26 December 2015 (UTC)
Confusion in History section about character mapping changes from GB18030-2000 to GB18030-2005?
I'm not an expert, but digging on the web, and the box on the right side of the page in this article, suggest that the text in the History section uses the value 1E37 when it should be 1E3F. A GB18030 expert should check this and fix it. — Preceding unsigned comment added by 72.179.1.38 (talk) 05:22, 3 June 2018 (UTC) Oops... I swear it said 1E37 a few minutes ago in the History section. Now I only see 1E3F, so it appears to have been corrected.
Is this encoding good for Japanese, or even all East-Asian languages?
To answer my own question: I know it will work with all of them, since it covers all of Unicode, and no, it's not actually commonly used in Japan. But is it as space-efficient as, or more so than, the other encodings used in Asia? It seems better than UTF-16, at least for those languages, in terms of efficiency. UTF-8 isn't bad either (assuming e.g. some mixed-in ASCII, as in HTML, but the same goes for this format). comp.arch (talk) 15:17, 9 September 2021 (UTC)
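One rough way to approach the space-efficiency question is to compare encoded lengths directly; a Python sketch (the sample text is arbitrary, and the outcome shifts with the ASCII/CJK mix):

```python
# Compare encoded byte lengths across GB 18030, UTF-8, and UTF-16 (no BOM)
# for a pure-CJK string and an HTML-flavoured mixed string.
for text in ["中文编码", "<p>中文</p>"]:
    sizes = {codec: len(text.encode(codec))
             for codec in ("gb18030", "utf-8", "utf-16-le")}
    print(text, sizes)

# BMP CJK ideographs: 2 bytes in GB 18030 and UTF-16, 3 bytes in UTF-8.
# ASCII: 1 byte in GB 18030 and UTF-8, 2 bytes in UTF-16.
assert len("中".encode("gb18030")) == 2
assert len("中".encode("utf-8")) == 3
assert len("中".encode("utf-16-le")) == 2
```

So for mostly-CJK text, GB 18030 matches UTF-16 and beats UTF-8, while for markup-heavy mixed text it behaves more like UTF-8.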
“the PRC decided to mandate support of certain code points outside the BMP”
What exactly is this supposed to tell us? Full support for UTF-8, for example, mandates supporting all (valid) code points outside the BMP, so the PRC deciding to mandate support for certain ones is no more “catastrophic” than UTF-8 support (and UTF-8 had already existed for several years before the publication of GB 18030). So in what respect was this a “move of historic significance”? Was there something that did not let developers “get away” with supporting only BMP code points in the PRC, while not fully supporting UTF-8 was fine? --2A02:8108:50BF:C694:C133:96B0:222E:BF9A (talk) 21:15, 18 June 2022 (UTC)
- There was a time when most widely adopted Unicode implementations only actually implemented the Basic Multilingual Plane. Back then, UTF-16-like formats were the most common Unicode transformation formats to actually be widely implemented. The 16-bit format that was standard before Unicode expanded beyond the BMP was simpler and needed less special-casing for systems to deal with it. That earlier, "parent" format is properly called UCS-2 (two bytes per code unit), and it was distinct from what is now known as UTF-16: UCS-2 did NOT include the surrogate pair hack, which really is something of a hack, and not a particularly elegant one, but one which is now a required part of current UTF-16.
- The surrogates provide 10 bits each, and the use of a high and low surrogate together as a pair was what gave Unicode the additional 16 planes of 65,536 code points each, i.e. a total of 1,048,576 extra code points in addition to the code points of the BMP, at the low price of just 2,048 previously-unassigned BMP code points, which are now utterly permabanned for anything other than UTF-16 surrogate pairs. Basically, they stole 2,048 code points from the BMP and used those to provide about an extra million possible characters.
- Btw., if they HAD allowed all those 10+10 bits to be freely used in both the high and low surrogate positions, they could have gotten 64 extra planes instead of just 16 out of that hack (2,048 × 2,048 possible pairs instead of 1,024 × 1,024). Instead they restricted the high surrogate to one 1,024-value range (0xD800–0xDBFF) and the low surrogate to a disjoint 1,024-value range (0xDC00–0xDFFF), for self-synchronisation: any 16-bit unit identifies itself as a high surrogate, a low surrogate, or an ordinary BMP character. (If the names seem the wrong way around, note that the high surrogate has the numerically lower range; "high" refers to it carrying the high-order bits of the code point.)
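The pairing arithmetic described in the comments above can be sketched in Python (U+10400 is just an arbitrary supplementary-plane example):

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                      # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(hi: int, lo: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

hi, lo = to_surrogates(0x10400)
assert (hi, lo) == (0xD801, 0xDC00)
assert from_surrogates(hi, lo) == 0x10400

# Cross-check against Python's own UTF-16 encoder (little-endian, no BOM).
assert "\U00010400".encode("utf-16-le") == bytes([0x01, 0xD8, 0x00, 0xDC])
```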
- And if this all seems complicated, that's precisely why real-world devs didn't like the expansion of Unicode past the BMP, at least not in the way UTF-16 did it, and much of this played out before UTF-8 had caught on. So basically you're back in the past, before UTF-8 was on anyone's radar, and the Unicode people want to expand their character set past the BMP, and the way they suggest doing it is this surrogate pair hack and UTF-16. Developers don't like it. Just because Unicode says, go support this fancy surrogate pair hack for so much more room for activities and extra characters, doesn't mean the new UTF-16 standard will see ubiquitous real-world adoption. So the UTF-16 standard gets published and updated, but devs only support the subset that they like, which is the BMP: no surrogate pairs, so no extra 16 planes. There's always a delay between standard publication and adoption, but standards can also fail and just not really be adopted, or only adopted partially. The real world might not actually do exactly what the standard says. Meanwhile, people have adopted UCS-2, their surrogate-pair-free subset of UTF-16. They've just changed horses from code pages (and maybe ISO/IEC 2022 with its own shift-in/shift-out hackishness) to UCS-2, and they're not instantly changing to UTF-8. Adoption of UTF-8 is quick, because it's good and backwards-compatible with ASCII, but it's not instantaneous; there's inertia. And for those who have already adopted UCS-2, there's possibly even more inertia, but on the UCS-2/UTF-16 side people still don't really like to support anything past the BMP, because surrogates are such a pain, and 65,536 characters ought to be enough for everybody.
- Enter China.
- Note, this was a time when America had not yet started another Cold War, and was not pumping out anti-China rhetoric and policies 24/7. This was a time when even America's notorious regime-changers sort of believed in being nice to China, so capitalism would creep in, and obviously then they'd exploit it. The point is, America's political and business elite were all looking to do much MORE business with China, not less. So when China said, for any systems you want to sell to China, you have to support all these other characters too, that are beyond the BMP, suddenly people had a reason to want to do that. And that really was a pivotal, seminal, historical moment in the history of tech. In a mirror universe where China doesn't mandate a standard implying the requirement for a greater character set, the whole 16 extra Unicode planes thing may have failed. Unicode possibly still may have expanded in some other way, only much later. But in this timeline, companies then and there wanted to get at that big huge billion-plus consumer market, so they had to implement support for characters beyond the BMP. So ultimately people either bit the bullet and implemented UTF-16, surrogate pairs and all, or they possibly jumped straight to UTF-8, but the point is, they were motivated to jump.
- I wouldn't get too hung up on the question of "full support" for X, though. Support for planes past the BMP, as opposed to only the BMP, makes a big difference in UTF-16-like implementations because of surrogate pairs. It does not make as huge a difference whether you're supporting this or that plane in UTF-8. But if you want to support that many characters beyond what fits in the BMP, you need a solution that can do it one way or another: either UTF-16 including surrogate pairs, or UTF-8, or literally GB 18030.
- What I don't actually know is the extent to which these events caused American companies to directly implement GB 18030 (the actual standard, not just the size of its character set) and how that now compares to UTF-8 in terms of market share in China. Maybe you know, and maybe you can teach me. ReadOnlyAccount (talk) 11:52, 15 October 2023 (UTC)
Continued relevance of GB 18030
In light of this recent update, it might be interesting to know, and clarify in the article, to what extent GB 18030 continues to be used in parallel with, or in preference to, UTF-8. Earlier on there was some suggestion that Chinese websites were moving to UTF-8, but what about most Chinese hosts (PCs, tablets, smartphones, etc.)? Do most Chinese users use UTF-8 or GB 18030? If the former, then what exactly is the continued relevance of GB 18030? If the latter, then how does most of China using GB 18030 work in practice, given that most of the world seems to be moving to UTF-8? ReadOnlyAccount (talk) 07:54, 15 October 2023 (UTC)