now we have an understandable explanation in english
Now we have an understandable explanation in English from Sun. Anyone fancy expanding this article? Plugwash 17:43, 22 January 2006 (UTC)
One thing to beware of in using the Sun article: it has an error where it mentions Unicode 2.1 instead of 1.1 as the basis for GBK. I checked elsewhere at length, and the Wikipedia GBK page is actually correct. BTW, I did the recent updates to GBK, GB2312 and GB18030. -- Richard Donkin 06:39, 27 January 2006 (UTC)
- Got any more detail? Is it just the official incorporation of material introduced by Unicode since the original standard? Plugwash 17:18, 2 February 2006 (UTC)
- No more technical details on the website, and searches yield no results :( --Skyfiler 22:30, 23 February 2006 (UTC)
Unicode code points
>which is easily sufficient to cover Unicode's 1,114,112 (17*65536) code points.
Where does this figure come from? It counts the 2048 surrogate code points, which don't need to be encoded; e.g. in UTF-16 you can't encode U+D800, as it isn't a valid character code point. I understand that UTF-16 can encode all code points, but it only covers 1,112,064:
BMP: 0x10000 - (0xE000 - 0xD800), i.e. not counting the surrogate code points = 0xF800 (63,488)
The rest have 20 bits in UTF-16 four-byte sequences (0x10000 is subtracted) = 2^20 = 1,048,576 (0x100000)
Total: 0xF800 + 0x100000 = 0x10F800 (1,112,064)
- Sorry, I counted the overall range of code points and didn't account for the ones that are permanently reserved for uses other than encoding characters (e.g. the surrogates you mentioned). Plugwash 17:26, 14 July 2006 (UTC)
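For anyone who wants to check the arithmetic in this thread, here is a minimal Python sketch. The constant names and the helper `is_encodable_utf16` are made up for illustration; the only library behaviour relied on is that Python's built-in UTF-16 codec refuses lone surrogates.

```python
# Compare the full code point range with the number of encodable scalar values.
TOTAL_CODE_POINTS = 17 * 0x10000            # 17 planes, U+0000..U+10FFFF
SURROGATES = 0xE000 - 0xD800                # U+D800..U+DFFF, 2048 code points
BMP_SCALARS = 0x10000 - SURROGATES          # BMP without the surrogates
SUPPLEMENTARY = 2 ** 20                     # 16 planes reached via surrogate pairs
SCALAR_VALUES = BMP_SCALARS + SUPPLEMENTARY


def is_encodable_utf16(code_point: int) -> bool:
    """Return True if Python's UTF-16 codec accepts this code point."""
    try:
        chr(code_point).encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    print(f"all code points: {TOTAL_CODE_POINTS:,}")
    print(f"surrogates:      {SURROGATES:,}")
    print(f"scalar values:   {SCALAR_VALUES:,} (0x{SCALAR_VALUES:X})")
    print("U+D800 encodable:", is_encodable_utf16(0xD800))
    print("U+E000 encodable:", is_encodable_utf16(0xE000))
```

The difference between the two totals is exactly the 2048 surrogates, which is why both figures turn up in descriptions of Unicode's size.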
The need for a new mapping table
It appears that the mapping table, while probably based on more or less official mappings, refers to an older version of Unicode, and, taking the section further up on the page into account, perhaps also of GB18030, and lacks several mappings that have now become available. Notably, it maps several characters present in GB18030 to characters in Unicode's Private Use Area, although according to Kenneth Whistler of Unicode, Inc, these characters were already mapped in Unicode 4.1. The same appears true of the mapping table included with my copy of Ubuntu Linux, which may or may not be the same table.
It would appear that no up-to-date table is available in the public domain, however, so this may be the most up-to-date table that's available. At any rate, I think we should keep our eyes open in case a more recent table surfaces. Rōnin 20:07, 21 February 2007 (UTC)
- All the sources I can find written in English online seem to say that GB18030 is supposed to be a 1:1 mapping of all parts of Unicode, including the Private Use Area and all code points that are currently unassigned. Do you have an authoritative source that says (either directly or through tables that can be read in conjunction with that big XML file) that GB18030 values that map to private-use Unicode code points are used in the GB18030 standard for something other than private use? (Note: it seems that most but not all of the BMP Private Use Area is mapped either to codes that our GBK article calls out as private use or to 4-byte codes.) Plugwash 04:00, 16 March 2007 (UTC)
- I have a few, actually. http://www.unicode.org/faq/han_cjk.html#23 says this:
- A. That used to be true, as of Unicode 4.0. There were in fact a small number of characters in GB 18030 that had not made it into Unicode (and ISO/IEC 10646). However, to avoid having to map characters to the PUA for support of GB18030, the missing characters were added as of Unicode 4.1, so of course, they are in Unicode 5.0 and later versions.
- You can find the characters in question in Annex C (p. 92) of GB 18030-2000. All now have regular Unicode characters. These can be found in the ranges: U+31C0..U+31CF (for CJK strokes) and U+9FA6..U+9FBB (for various CJK characters and components).
- I am aware that a new version of GB 18030 has been released which shows reference glyphs for a wider range of characters (including supplementary ones, I believe) and updates the mappings from Unicode 3.0 to Unicode 4.0 or higher - which changes some of them from PUA code points to assigned code points.
- Some assigned characters are mapped from 2-byte parts of GBK and GB 18030 to the Private-Use Area in the BMP (U+E000..U+F8FF). A small portion of these mappings have changed between GBK and GB 18030, and GB 18030 maps them instead to Unicode characters that were introduced in Unicode 3.0.
- It seems to me like the table in the article maps a lot of codes in the range E000-F8FF, which could mean that it actually maps a lot of now assigned code points to the Private Use Area. Though as seen in the post from the mailing list quoted above, the people at the ICU project are aware of it.
- Rōnin 04:44, 17 March 2007 (UTC)
- The article also makes it clear now that the up-to-date version of the encoding is called GB18030-2005, while the mapping table is for the earlier version GB18030-2000. ICU does not provide an updated table, and the only one I've seen so far is one that allegedly originated from a blog. Something tells me GB18030-2005 isn't generating massive amounts of interest. Rōnin (talk) 21:48, 19 August 2008 (UTC)
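One way to see how a particular runtime treats the ranges quoted above (U+31C0..U+31CF and U+9FA6..U+9FBB) is to ask its own GB18030 codec. The sketch below uses Python's built-in `gb18030` codec; the helper names are invented for illustration, and which edition of the mapping table the codec follows is not assumed here, so the output may differ between runtimes.

```python
# Inspect how the local gb18030 codec encodes the ranges named in the Unicode FAQ.
RANGES = [(0x31C0, 0x31CF), (0x9FA6, 0x9FBB)]   # CJK strokes; CJK characters/components
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF            # BMP Private Use Area


def encoded_lengths(first: int, last: int) -> dict:
    """Map each code point in [first, last] to its GB18030 byte length."""
    return {cp: len(chr(cp).encode("gb18030")) for cp in range(first, last + 1)}


def decodes_to_pua(data: bytes) -> bool:
    """Return True if decoding these GB18030 bytes yields any BMP PUA code point."""
    return any(PUA_FIRST <= ord(ch) <= PUA_LAST for ch in data.decode("gb18030"))


if __name__ == "__main__":
    for first, last in RANGES:
        for cp, length in encoded_lengths(first, last).items():
            raw = chr(cp).encode("gb18030")
            print(f"U+{cp:04X} -> {raw.hex()} ({length} bytes)")
```

`decodes_to_pua` can be applied to any two-byte sequence taken from a mapping table to check whether the local codec still sends it to the Private Use Area. No specific byte sequences are given here, since they would have to come from the standard itself.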
Section on compatibility?
Can we see a section or subsection on compatibility with other formats, or put the information in the lede? It sounds like the code page is compatible with Unicode, but I don't know for sure if what I'm reading is worded such that other information I'm not seeing would contradict it in some situations. It would be much less confusing if it were stated exactly where all the relevant information will be. ᛭ LokiClock (talk) 12:29, 14 March 2011 (UTC)