Talk:Comparison of Unicode encodings


CJK characters use 3 bytes in UTF-8?

This article states that "...There are a few, fairly rarely used codes that UTF-8 requires three bytes whereas UTF-16 requires only two..."; but it seems to me that most CJK characters take 3 bytes in UTF-8 but 2 bytes in UTF-16? (talk) 08:32, 25 February 2008 (UTC)

I think you are right. And western people generally don't care about that... -- (talk) 16:20, 13 December 2011 (UTC)

This does sound poorly worded. However, it should also be made clear that real CJK text in actual use on computers usually contains so much ASCII (numbers, spaces, newlines, XML markup, quoted English, etc.) that it is *still* shorter in UTF-8 than in UTF-16. In addition, it should be pointed out that most CJK characters are entire words equivalent to 3-7 characters in English, and thus they already have a huge compression advantage.
Some alphabetic languages from India do have a 3-byte UTF-8 encoding of all their letters. Since their words consist of multiple characters they can end up bigger in UTF-8 than UTF-16, and there have been complaints about this. Any comment about length should mention these languages, where it actually is a problem. Spitzak (talk) 21:13, 13 December 2011 (UTC)
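
The size claims in this thread are easy to check. The following is a minimal sketch in Python; the sample strings and the helper name `encoded_sizes` are illustrative assumptions, not taken from the article, and real documents will mix scripts and markup in different proportions.

```python
# Compare encoded sizes of a few sample strings in UTF-8 and UTF-16.
# The samples are illustrative assumptions chosen for this sketch.
samples = {
    "ASCII": "hello",
    "CJK": "\u6f22\u5b57",                                  # two CJK ideographs
    "Devanagari": "\u0928\u092e\u0938\u094d\u0924\u0947",   # six code points
    "CJK in markup": "<p>\u6f22\u5b57</p>",                 # same ideographs inside a tag
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM in either."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

In this sketch the bare CJK and Devanagari samples are smaller in UTF-16, while the CJK sample wrapped in markup is smaller in UTF-8, which is the effect described above.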

Requested move

This article appears to be a sub-page of Unicode, which is ok; but it should have an encyclopedic name that reflects its importance (that of an article on Unicode encodings, rather than some evaluative comparison). —donhalcon 16:26, 7 March 2006 (UTC)

It should be moved to Unicode encodings. Once that's done, the opening sentences should be redone to inform readers on the basic who/what/why. --Apantomimehorse 10:11, 10 July 2006 (UTC)


hex 110000, the grand total of 17 Planes, obviously takes 21 bits, which comfortably fit into 3 bytes (24 bits). So why would anyone want to encode 21 bits in 32 bits? the fourth byte is entirely redundant. What, then, is the rationale behind having UTF-32 instead of "UTF-24"? Just a superstitious fear of odd numbers of bytes? dab () 12:47, 6 July 2006 (UTC)

It's more than superstitious fear of odd numbers of bytes - it is a fact that most computer architectures can process data in units equal to their word size more quickly. Most modern computers use either a 32-bit or 64-bit word. On the other hand, modern computers are fast enough that the speed difference is irrelevant. It is also true that most computer languages provide easy ways to refer to those multiples. (For example, in C on a 32-bit machine, you can treat UTF-32 in the machine's native byte order as an array of integers.) --LeBleu 23:01, 7 July 2006 (UTC)
Why not ask why we don't have UTF-21, since the last three bits in UTF-24 would be entirely redundant? Same issue, basically, but on a different scale (the hypothetical UTF-21, if actually stored as 21-bit sequences, would be much slower to process without noticeable size gain). Word sizes tend to be powers of two, so if data can be presented as (half)-word sized at little extra cost, this will be done unless there are overriding reasons of space economy. And if you want space economy, you should use UTF-16 anyway, since the extra processing power you must pay for characters outside the BMP is (usually) not significant enough to warrant using twice as much storage.
Nothing actually prohibits you from layering another encoding over UTF-32 that stores the values in three bytes, as long as you supply the redundant byte to anything that advertises itself as processing UTF-32. This is unlikely to be of much advantage, though. 11:36, 10 July 2006 (UTC)
so the fourth byte is really redundant, and hangs around in memory for faster processing speed. I imagine that all UTF-32 files will have to be compressed as soon as they are stored anywhere; the question then is, which is more of a waste of processing power, compressing and uncompressing the files, or adding a zero byte at read-time before further processing? UTF-8 is only economical if the overwhelming majority of characters are in the low planes. Assume (for argument's sake) a text with characters evenly distributed in the 17 planes: UTF-8 would be out of the window, but 'UTF-24' might have an advantage over UTF-32 (obviously "UTF-21" would be even more economical, but that would really mean a lot of bit-shifting). dab () 18:14, 29 July 2006 (UTC)
To answer your direct question: adding an extra byte after every 3 will take far, far less processing than implementing something like deflate. Having said that, I can't see many situations where you would do it.
Text with characters evenly distributed among the planes is going to be very very rare. Only 4 planes have ever had any allocations at all (BMP, SMP, SIP and SSP), only two of those contain character ranges for complete scripts (the other two are rare CJK ideographs and special control codes) and most texts will be highly concentrated on a few small ranges.
If you are concerned with storage space and you are dealing with a lot of non-BMP characters in your text (say, an archive of Tolkien's Tengwar and Cirth manuscripts), then you will have to choose between possibilities such as a custom encoding, compressing encodings like SCSU and BOCU, and general-purpose compression algorithms like deflate. With most systems, however, even if individual documents are non-BMP, the overwhelming majority of characters in the system as a whole are in the BMP.
A final point: if heavy use is made of HTML, XML, or similar markup languages for formatting, the ASCII characters of the markup can easily far outnumber the characters of the actual document text. Plugwash 23:06, 29 July 2006 (UTC)
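
To make the "add a byte after every 3" idea concrete, here is a minimal Python sketch of a hypothetical 3-byte packing. "UTF-24" is not a standard encoding; the function names `pack_utf24` and `utf24_to_utf32le` and the little-endian layout are assumptions made only for this illustration.

```python
def pack_utf24(text):
    """Pack each code point into 3 little-endian bytes (hypothetical 'UTF-24')."""
    return b"".join(ord(ch).to_bytes(3, "little") for ch in text)


def utf24_to_utf32le(data):
    """Expand 3-byte units to UTF-32LE by appending a zero byte to each unit."""
    if len(data) % 3:
        raise ValueError("length is not a multiple of 3")
    return b"".join(data[i:i + 3] + b"\x00" for i in range(0, len(data), 3))


text = "A\U0001D11E"  # 'A' plus MUSICAL SYMBOL G CLEF, which lies outside the BMP
packed = pack_utf24(text)

# Round trip: padding each unit back to 4 bytes gives ordinary UTF-32LE.
assert utf24_to_utf32le(packed).decode("utf-32-le") == text
print(len(packed), "bytes packed vs", len(text.encode("utf-32-le")), "bytes in UTF-32")
```

The expansion step is a single pass with no state, which is why it is so much cheaper than a general-purpose compressor.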
My 2¢ regarding the possibility of UTF-24. I don't think the reasons given so far hold much water; they tend to simply justify the situation as it exists now, rather than recognizing Unicode as an evolving standard. Before considering the need for UTF-24, it's important to consider the way Unicode assigns code points. Since Unicode tries to assign all graphemes from actively used languages in the BMP, UTF-16 surrogate pairs are seldom needed for most documents except in certain esoteric circumstances. That means that in most cases character count equals bytes ÷ 2. It also means most documents have the bulk of their content expressed without surrogate pairs in UTF-16, or with only 1-3 bytes per character in UTF-8 (since UTF-8 only turns to 4 bytes for characters outside the BMP). Therefore, due to the way Unicode assigns graphemes to code points, the bulk of most documents will be characters from the BMP, drawing only occasionally on non-BMP characters.
UTF-32 has the advantage of always making it very quick and easy to count characters (bytes ÷ 4). However, UTF-32 also leads to much larger memory use (2 to 4 times as much, depending on the text processed). The nice thing about UTF-24 is that it would provide the same fixed-width benefit while using a quarter less memory than UTF-32 (at least for text outside the BMP). In many ways I think that UTF-24 offers little over UTF-32 for internal processing of text. However, for storage of text outside the BMP (largely academic documents in ancient scripts or using seldom-used characters), UTF-24 could provide a valuable space-saving transformation format. Especially for files containing music characters, ancient writing, and perhaps even academic CJK writing, UTF-24 could conserve disk space (though, as Plugwash said, things like music manuscripts will likely have a large proportion of Latin-1/ASCII block characters, where UTF-8 might conserve as much disk space as a hypothetical UTF-24).
One last thing about fixed-width encoding (which was among the original goals of Unicode that didn't really take hold). The development of Unicode has shown that raw character count is likely less important than grapheme cluster count (and this affects characterAtIndex as well). While this may be often acknowledged, I am not aware of many (any?) implementations whose string count methods/functions count grapheme clusters rather than characters. In fact, I think some implementations return UTF-16 string counts as bytes ÷ 2, so the implementation actually ignores the surrogate problem entirely. With these complications, an implementation really needs to count characters in a different way than it counts bytes to return an accurate grapheme cluster count, treating a surrogate pair as a single grapheme and leaving combining characters out of the count. So UTF-24 doesn't really help with this all that much either (though it does eliminate the surrogate pair issue).
In the end I think UTF-24 could become useful for special-case academic documents. UTF-32 might become more popular for internal processing (as in, not for file formats): since we already see memory usage go from 32 bits to 64 bits for things like pointers and longs, it's not such a stretch to see Unicode implementations go from 16 to 32 bits too. Much of this likely depends on what else gets assigned outside the BMP. --Preceding unsigned comment added by Indexheavy (talkcontribs) 03:59, 4 November 2008 (UTC)
In my experience, the string length functions in languages typically give you a length in code units (bytes for UTF-8, 16-bit words for UTF-16). This is generally what you want, since afaict the most common uses of string length are either to iterate over the string or to do some sanity checking of the size. If you need the grapheme cluster count, code point count, or console position count, you will have to use special functions to get them, but afaict most applications don't need any of them. Plugwash (talk) 18:53, 7 November 2008 (UTC)
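
The different notions of "length" discussed in this thread can be compared side by side. The sketch below is in Python; the function name `length_report` is an assumption for illustration, and the "non_combining" figure only skips characters with a non-zero combining class, so it is a rough stand-in for a grapheme cluster count rather than a full implementation of Unicode text segmentation.

```python
import unicodedata


def length_report(text):
    """Return several different 'lengths' of the same string."""
    return {
        # Code units: bytes for UTF-8, 16-bit words for UTF-16.
        "utf8_code_units": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Python 3 strings are sequences of code points.
        "code_points": len(text),
        # Rough approximation only: ignore characters with a combining class.
        "non_combining": sum(1 for ch in text if unicodedata.combining(ch) == 0),
    }


# 'e' + COMBINING ACUTE ACCENT + MUSICAL SYMBOL G CLEF (outside the BMP)
sample = "e\u0301\U0001D11E"
for name, value in length_report(sample).items():
    print(f"{name}: {value}")
```

For this sample the four counts all differ, which illustrates why a single "string length" function cannot serve every purpose.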

UTF-7,5 ?

See this page [1] which describes the encoding. Olivier Mengué |  23:19, 22 May 2007 (UTC)


So what is the most popular encoding??? —Preceding unsigned comment added by (talk) 07:52, 15 February 2008 (UTC)

UTF-8 is popular for Latin-based text, while UTF-16 is popular for Asian text. And everyone hates UTF-32 ;-) (talk) 18:32, 27 March 2008 (UTC)
Not really. While UTF-8 is more compact than UTF-16 for most alphabetic scripts and UTF-16 is smaller than UTF-8 for CJK scripts, the decision is often based on considerations other than size (legacy encodings are also commonly used, but we will focus on Unicode encodings here).
In the Unix and web worlds UTF-8 dominates because it is possible to use it with existing ASCII-based software with little to no modification. In the Windows NT, .NET, and Java worlds UTF-16 is used because when those APIs were designed Unicode was 16-bit fixed width, and UTF-16 was the easiest way to retrofit Unicode support. There are one or two things that use UTF-32 (I think Python uses it under certain compile options, and some C compilers make wchar_t 32 bits), but mostly it is regarded as a very wasteful encoding (and the advantage of being fixed width turns out to be mostly an illusion once you implement support for combining characters). Plugwash (talk) 21:43, 4 April 2008 (UTC)
Why not take those considerations into account in an internet/popularity section? -- Preceding unsigned comment added by (talk) 20:13, 26 June 2012 (UTC)

Mac OS Reference

This seems to be a bit out of date. I just searched the reference library and cannot come up with anything in the current version of Mac OS regarding UTF-16. Since the cited material is two revisions old (10.3 vs. the current 10.5) AND since Mac OS understands UTF-8, the fact that a previous version used UTF-16 for INTERNAL system files is irrelevant. I suggest this be removed. Lloydsargent (talk) 14:08, 24 April 2008 (UTC)

UTF-8 with BOM!!!

"A UTF-8 file that contains only ASCII characters is identical to an ASCII file"—Only with the strictest (too strict) reading is this true. A UTF-8 file could have a BOM, which then would not "[contain] only ASCII characters." Can someone re-word this without making it require such a strict reading yet still be simple? —Preceding unsigned comment added by (talk) 18:07, 3 September 2009 (UTC)

If it has a BOM then it does not consist only of ASCII characters. The BOM must not be required for the file to be handled as UTF-8. This sort of short-sightedness is stopping I18N from being implemented, as too much software cannot handle garbage bytes at the start of the file but would have no problem with these bytes inside the file (such as in a quoted string constant). Spitzak (talk) 21:36, 9 November 2009 (UTC)
If the file doesn't start with a BOM, the encoding cannot be sniffed in a way that is reliable and not too wasteful. I wish we could just always assume UTF-8 by default, but there are too many files in crazy legacy encodings out there. Software that cannot handle (or even tolerate) Unicode will have to go eventually.-- (talk) 23:26, 20 May 2011 (UTC)
The encoding *can* be sniffed quite well if you assume UTF-8 first, and assume legacy encodings only if it fails the UTF-8 test. For any legacy encoding that uses bytes with the high bit set, the chances of it forming valid UTF-8 are minuscule, like 2% for a 3-character file, and rapidly dropping as the file gets longer. (The only legacy encoding that does not use the high bit set that is at all popular today is ASCII, which is already identical to UTF-8.) One reason I really dislike the BOM in UTF-8 files is that it discourages programmers from using this method to determine encoding. The result is that it *discourages* I18N, rather than helping it. Spitzak (talk) 03:09, 21 May 2011 (UTC)
STD 63 = RFC 3629 agrees with you, but UTF-8 processors still must be ready to accept (and ignore) a signature (formerly known as BOM, but it clearly is no BOM in UTF-8). Others, notably the XML and Unicode standards, don't agree with you, and a Wikipedia talk page isn't the place to change standards anyway. Sniffing is not a good option: UTF-8 as well as windows-1252 can be plain ASCII for the first megabytes, and end with a line containing ™ - some W3C pages do this (of course UTF-8 without signature, but still an example of why sniffing is not easy). – (talk) 21:14, 10 June 2011 (UTC)
Your example of a file that is plain ASCII for the first megabytes works perfectly with the assumption that it is UTF-8. It will be drawn correctly for all those megabytes. Eventually it will hit that character that is invalid UTF-8, and it can then decide that the encoding should be something other than UTF-8. Sorry, you have not given an example that does anything other than prove the assumption of UTF-8 is correct. I agree that software should ignore BOM bytes (not just at the start, but embedded in the file), or treat them as whitespace. Spitzak (talk) 01:27, 11 June 2011 (UTC)
Older charsets will go away, but not this year. The IETF policy on charsets predicted 40 years, still more than 25 years until UTF-8 will have gained world dominance. At least all new Internet protocols must support UTF-8. The HTML5 folks will in essence decree that "Latin-1" (ISO 8859-1) is an alias for windows-1252, and the next HTTP RFC also addresses the problem, so for now you can at least assume that Latin-1 is almost always windows-1252. Billions of old Web pages will stay as is until their servers are shut down. – (talk) 21:25, 10 June 2011 (UTC)
I believe about 90% of the reason older character sets are not going away is due to people not assuming UTF-8 in their detectors, and that the existence of the UTF-8 BOM is the primary reason these incorrect detectors are being written (since programmers then think they should check for it). Therefore the UTF-8 BOM is the primary reason we still have legacy character sets. Web pages in windows-1252 would display correctly if browsers displayed invalid UTF-8 by translating the individual bytes to the windows-1252 characters, and there would be NO reason for any charset and the default could still be UTF-8. Spitzak (talk) 01:27, 11 June 2011 (UTC)
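
The "assume UTF-8 first, fall back only on failure" approach argued for in this thread can be sketched in a few lines of Python. This is one possible reading of the idea under stated assumptions, not a description of how any browser or library behaves: the function name `sniff_decode` is hypothetical, the fallback to windows-1252 is an assumption, and the check is applied to the whole input rather than byte by byte.

```python
import codecs


def sniff_decode(data, fallback="windows-1252"):
    """Return (text, encoding_used), trying UTF-8 before the legacy fallback."""
    # Tolerate and drop a leading UTF-8 signature (BOM), but never require one.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy encoding (an assumption of this sketch).
        return data.decode(fallback, errors="replace"), fallback


print(sniff_decode(b"plain ASCII"))                 # valid UTF-8
print(sniff_decode("\u2122".encode("utf-8")))       # trade mark sign encoded as UTF-8
print(sniff_decode(b"caf\xe9"))                     # windows-1252 bytes, invalid as UTF-8
print(sniff_decode(codecs.BOM_UTF8 + b"with BOM"))  # signature is dropped
```

A pure-ASCII input decodes identically under both encodings, so the long ASCII prefix in the example from this thread causes no harm; the decision only matters once a byte with the high bit set appears.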

Code points are not bytes, but just the opposite

Spitzak, your summary in [2] is wrong. UTF-8 has 8-bit bytes (i.e. the coded message alphabet has ≤ 256 symbols; more exactly, 243), but its code points are just the Unicode ones. Surrogates are indeed code points; they are not characters. Please recall the terminology. The word "character" is inappropriate. Incnis Mrsi (talk) 19:47, 23 March 2010 (UTC)

I believe you are right. I was confusing this with the units used to make the encoding. Spitzak (talk) 01:00, 24 March 2010 (UTC)
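
The figure of 243 distinct byte values mentioned above can be checked by brute force. The following Python sketch encodes every code point except the surrogates (which cannot appear in well-formed UTF-8) and collects the byte values that occur; the function name `utf8_byte_values` is an assumption for illustration.

```python
def utf8_byte_values():
    """Collect every byte value that occurs in the UTF-8 form of any code point."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points are not encodable in well-formed UTF-8
        used.update(chr(cp).encode("utf-8"))
    return used


used = utf8_byte_values()
unused = sorted(set(range(256)) - used)
print(len(used), "byte values are used")
print("never used:", " ".join(f"{b:02X}" for b in unused))
```

The distinction drawn in the thread shows up here: the loop variable ranges over code points, while the collected set contains code units (bytes).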

Historical: UTF-5 and UTF-6

I propose to delete this section. "UTF-5" and "UTF-6" are unimplemented vaporware; they were early entries in an IDNA competition which Punycode ultimately won. Doug Ewell 20:14, 20 September 2010 (UTC) —Preceding unsigned comment added by DougEwell (talkcontribs)

IMO nothing is wrong with mentioning encodings in obsolete Internet Drafts, after all Martin Dürst is one of the top ten i18n developers I could name, and helped to develop UTF-5 before they decided to use PunyCode for their IDNA purposes. IIRC I added this section, and refrained from adding my "original research" UTF-4 ;-) – (talk) 20:35, 10 June 2011 (UTC)

0x10FFFF limit in UTF-16

Found this in the UTF-16 RFC: "Characters with values greater than 0x10FFFF cannot be encoded in UTF-16." I'm wondering if this is a specific limit of UTF-16? Can characters above 0x10FFFF be encoded in UTF-8, for example? (Are there any such characters?) (talk) 10:08, 30 April 2012 (UTC)

Yes, this is a limit of UTF-16. UTF-8, and initially Unicode, were designed to encode up to 0x7FFFFFFF. The Unicode definition was changed to match the limits of UTF-16 by basically saying any encoding of a value greater than 0x10FFFF, no matter how obvious it is, is invalid. No characters are or will ever be assigned codes above 0x10FFFF (personally I feel this is bogus; people believed 128, 2048, and 65536 were limits to character set sizes and these were all broken). Spitzak (talk) 18:10, 30 April 2012 (UTC)
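
Where the two limits come from can be shown with a little arithmetic. The Python sketch below uses the standard surrogate pair formula for UTF-16; the helper names `to_surrogates` and `from_surrogates` are assumptions for illustration, and the 31-bit figure refers to the original six-byte UTF-8 design rather than to UTF-8 as currently standardized.

```python
def to_surrogates(cp):
    """Split a code point above the BMP into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not representable as a surrogate pair")
    v = cp - 0x10000  # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Largest pair: both surrogates carry 10 payload bits, so the ceiling is
# 0x10000 + 2**20 - 1.
max_utf16 = from_surrogates(0xDBFF, 0xDFFF)

# Original UTF-8 design: a 6-byte sequence carries 1 + 5 * 6 = 31 payload bits.
max_original_utf8 = (1 << (1 + 5 * 6)) - 1

print(hex(max_utf16), hex(max_original_utf8))
```

The ceiling follows directly from the size of the two surrogate ranges, which is why it is a property of UTF-16 rather than of UTF-8's bit layout.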