
This is not an encoding

This is not an encoding, but simply Unicode itself. If a character's Unicode code point is 42, then a 32-bit integer holding 42 is not "UTF-32"; it is just that character's code point. UTF-32 is merely a synonym for Unicode: UTF-32('A') is 65, and Unicode('A') is 65. Same thing.

When we have a string of ASCII characters encoded as 8-bit values, do we call that ATF-1? No, it is just an ASCII string.

Another problem is that if this is an encoding, what is the byte order? Is the character 'A' stored as 00 00 00 41 or is it stored as 41 00 00 00? If this is an encoding, we should be able to answer such a basic question. Encoding means settling all issues of bitwise representation for the purposes of transport. (talk) 22:49, 29 October 2011 (UTC)
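
To make the question concrete: the two candidate byte orders correspond to the UTF-32BE and UTF-32LE serializations that the Unicode standard does in fact define. A minimal C sketch (hypothetical helper names, not from any particular library) of writing U+0041 both ways:

 #include <stdint.h>
 
 /* Hypothetical helpers: serialize one code point as UTF-32BE vs UTF-32LE.
    For cp = 0x41 ('A') the output is 00 00 00 41 (BE) or 41 00 00 00 (LE). */
 static void utf32be(uint32_t cp, unsigned char out[4]) {
     out[0] = (unsigned char)(cp >> 24);
     out[1] = (unsigned char)(cp >> 16);
     out[2] = (unsigned char)(cp >> 8);
     out[3] = (unsigned char)cp;
 }
 
 static void utf32le(uint32_t cp, unsigned char out[4]) {
     out[0] = (unsigned char)cp;
     out[1] = (unsigned char)(cp >> 8);
     out[2] = (unsigned char)(cp >> 16);
     out[3] = (unsigned char)(cp >> 24);
 }

The standard resolves the ambiguity with the UTF-32BE/UTF-32LE encoding schemes and an optional byte order mark, so the question does have a defined answer.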

Why 4 bytes? Why not 3?

Why is there no 3-byte encoding? 2^24 is 16,777,216, much more than is needed to represent the 1,114,112 character codes of Unicode. Is this because of word boundaries? I understand there are trade-offs, but wouldn't someone somewhere have use for a simple-to-process encoding that didn't waste a whole byte for each character? --Apantomimehorse 06:47, 9 September 2006 (UTC)

Truth is, if they had planned things properly from the beginning I doubt this encoding would exist. If you are going to the trouble of supporting supplementary characters you will probably want other advanced text features too, which will nullify most of the advantages of a fixed-width encoding.
In any case you'd be pretty mad to use UTF-32 or UTF-24 for storage or transfer purposes, and if you want to use a 3-byte encoding internally in your app or app framework there's nothing to stop you (though I strongly suspect it will perform far worse than either a well-written UTF-16 or UTF-32 system). Plugwash 00:39, 11 September 2006 (UTC)
The reason for a 4-byte and not a 3-byte encoding should be simple: a 32-bit number is a native unit for today's dominant 32-bit and 64-bit processors. For example, reading memory in units of 24 bits would be much more expensive than reading the larger 32-bit chunk for this reason. -- Sverdrup (talk) 18:58, 8 November 2011 (UTC)


Is it just the way I'm reading this article, or does it stink of a total lack of NPOV? It almost reads like a case for everybody forgetting about UTF-32. UTF-32 space-inefficient? Not if you're Japanese. The whole reason character handling is in the state it's in is because people didn't care about the needs of other people. It was pretty clear a long time ago that a solution to i18n was needed, and that something not unlike UTF-32 was needed.

"Also whilst a fixed number of bytes per code point may seem convenient at first it isn't really that much use. It makes truncation slightly easier but not significantly so compared to UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases since even with a “fixed width” font there may be more than one code point per character position (combining marks) or indeed more than one character position per code point (for example CJK ideographs). Combining marks also mean editors cannot treat one code point as being the same as one unit for editing."

Well, no. If you're talking about drawing glyphs, sure, but UTF-32 has no pros or cons compared to other charsets in that context. It makes i18n string handling easier by an order of magnitude, though. Put simply, all you do is divide everything by four. Try counting the length of a string in UTF-8 or UTF-16: it's just about impossible to do in a stable way. Look at the whole "Bush hid the facts" bug in Notepad, the *perfect* example of an issue that would never have occurred with UTF-32. --Streaky 03:35, 30 November 2006 (UTC)
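
For reference, counting code points in UTF-8 is a single O(n) pass rather than the O(1) division UTF-32 allows; a minimal C sketch, assuming well-formed input:

 #include <stddef.h>
 
 /* Count code points in a NUL-terminated UTF-8 string. Continuation bytes
    have the form 10xxxxxx, so count every byte that is not one of those.
    Assumes the input is valid UTF-8. */
 static size_t utf8_codepoint_count(const char *s) {
     size_t n = 0;
     for (; *s; s++)
         if (((unsigned char)*s & 0xC0) != 0x80) /* not a continuation byte */
             n++;
     return n;
 }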

Inefficient is definitely true; in the best case it's no better than either UTF-8 or UTF-16, and in the common cases (yes, that includes Chinese and Japanese) it is far worse.
What *IS* the code point count useful for? Most of the time what matters is either size in memory, grapheme cluster count or console position count.
As for the <name> hid the facts "bug" you mentioned, it doesn't look like a charset issue to me (and is almost certainly not related to either UTF-8 or UTF-16). To me it looks like a deliberate easter egg, but unless someone can translate … Plugwash 12:52, 30 November 2006 (UTC)
The "bush hid the facts" issue is a technical side effect, neither a bug nor an easter egg. Since a text file doesn't carry information about its encoding, you have to guess. Especially for short strings it sometimes comes out wrong, treating a text with encoding X like one with encoding Y. More details. -- 17:21, 27 June 2007 (UTC)
"Bush hid the facts" is in fact *caused* by the use of a non-byte encoding (UCS-2), rather than an ASCII-compatible encoding such as UTF-8. Use of UTF-32 would result in similar bugs. So in fact it is an argument *against* using UTF-32.Spitzak (talk) 22:18, 21 April 2010 (UTC)
Not sure what the article means by "more than one character position per code point (for example CJK ideographs)", won't these be one (CJK) character per code point as well? Regarding whether this article is NPOV, since most commonly used CJK characters are in the BMP, which can be represented with only 2 bytes, always using 4 bytes to represent these is wasteful even if you are Japanese. Raphanid 22:59, 29 June 2007 (UTC)
Raphanid: In traditional fixed-width CJK fonts an ideograph is 2 character positions wide (that is, twice the width of a Latin-alphabet letter). Do you have a source for the hid the facts thing being a misdetection (that MSDN page isn't one)? Given the length and pure-English nature of the message it seems pretty unlikely. Plugwash 17:42, 30 June 2007 (UTC)

UTF-32 not used?

For these reasons UTF-32 is little used in practice with UTF-8 and UTF-16 being the normal ways of encoding Unicode text

I disagree with this statement. wchar_t in Unix/Linux C applications is in UTF-32 format. This makes it pretty often used. Vid512 17:28, 19 March 2007 (UTC)

Also, UTF-32 is used as the internal format for strings in the Python programming language, the C-based reference implementation at any rate. (Actually it uses UCS-4, as Python does not enforce the restriction against lone high or low surrogates being encoded.) As there are no referenced facts to support this statement and there are clear uses of UTF-32/UCS-4 in systems today, I'm removing the claim that it's not used. - Dmeranda 04:44, 5 August 2007 (UTC)
I noticed this and also disagreed. But when I read that Dmeranda removed it, well, it was still there! Checking, it appears he removed another statement instead. I've restored that and removed the above claim. mdf (talk) 15:38, 27 November 2007 (UTC)
CPython can be compiled to use either UCS-4 or UCS-2 internally. It defaults to UCS-2, but many Linux distributions compile it to use UCS-4. Agthorr (talk) 15:44, 9 May 2010 (UTC)

Removing cleanup and dubious tags

Well, first, the edit comment is a bit inaccurate. I quickly looked through the history and thought I saw the cleanup message there, but on further examination it seems I was wrong. So it should instead say that the cleanup tag is wrong in that it was most certainly not there since September 2007, and that the article is in fine shape (though not the best it could be), so the tag shouldn't be there.

As for the dubious tags, the claim that it's more space efficient is well justified by the following sentence, which notes that non-BMP characters are rare. This is by design; the BMP is intended to contain pretty much every character in major (and most minor) modern languages, as the standard notes [1]. The BMP takes 2 bytes in UTF-16 and 1 to 3 in UTF-8, so for text consisting of BMP characters, UTF-32 obviously takes more space. For a real world example, with a file consisting of large amounts of Japanese and ASCII text (all in the BMP), it is 10MB with UTF-8, 14MB with UTF-16, and 28MB with UTF-32.
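
To make the arithmetic concrete, a minimal C sketch (assuming valid Unicode scalar values) of the bytes each form spends per code point; every BMP code point costs exactly 2 bytes in UTF-16 and at most 3 in UTF-8, versus a constant 4 in UTF-32, which is where the roughly 2x and 2.8x file-size ratios above come from:

 #include <stdint.h>
 
 /* Encoded size in bytes of one code point cp in each form.
    Assumes cp is a valid Unicode scalar value (no surrogates). */
 static unsigned utf8_len(uint32_t cp)  { return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4; }
 static unsigned utf16_len(uint32_t cp) { return cp < 0x10000 ? 2 : 4; } /* BMP: one unit; else surrogate pair */
 static unsigned utf32_len(uint32_t cp) { (void)cp; return 4; }          /* always one full word */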

For the claim that it's rarely used: Unixy systems use UTF-8, Windows uses UTF-16, and programming languages mostly use one or the other (though I know Python can use UCS-4 if you compile it that way). Another message on this page talks about wchar_t, but that's implementation-specific, and the Unicode standard even advises against it for code that's supposed to be portable, for this reason [2]. In my experience it doesn't seem to be used nearly as much as the others, though I admit my experience in this area isn't vast. Regardless, a completely implementation-specific data type in a single language hardly changes matters.

For those reasons I've removed those tags. The article might do with a few citations, but there's no dubious information in it, and though it could be improved, it's written well enough that it does not require a cleanup. (talk) 07:15, 19 November 2009 (UTC)

Character vs. code point

The History section reads: "UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly code value in the code space of integers between 0 and hexadecimal 7FFFFFFF." Shouldn't this be "each encoded code point"? A 32-bit value doesn't necessarily represent one character; some characters are composed of several values. Tigrisek (talk) 19:45, 16 January 2011 (UTC)

I agree; this fix was already done for the UTF-8/16 pages. Spitzak (talk) 03:41, 18 January 2011 (UTC)
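
A small illustration of why "one 32-bit value" is not "one character", even in UCS-4: a combining sequence such as U+0065 followed by U+0301 renders as the single character "é". A minimal C sketch, assuming a platform with 32-bit wchar_t (as on typical Linux systems):

 #include <stdio.h>
 #include <wchar.h>
 
 int main(void) {
     /* 'e' + combining acute accent: two code points, one displayed character */
     wchar_t s[] = { 0x0065, 0x0301, 0 };
     printf("code points: %zu\n", wcslen(s)); /* prints 2, not 1 */
     return 0;
 }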


I'm a seasoned software developer, and I believe that it's damn convenient that the Nth character of a string can be found by indexing to position [N-1] in an array (or [N], if 1-based). (talk) 22:18, 29 October 2011 (UTC)

As a "seasoned software developer" I would be interested in you locating actual code you have written where it looked at character N in a string without first looking at characters 0..N-1. Using the return value from another function that looked at characters 0..N-1 does not count. The N must be generated without ever looking at the string. Any other use can be rewritten to use code unit offsets or iterator objects and is not an argument for fixed-sized code points.Spitzak (talk) 22:20, 31 October 2011 (UTC)

Checking the end of a string not useful??

However there are few (if any) useful algorithms that examine the n'th code point without first examining the preceding n-1 code points, therefore a somewhat more difficult code replacement will almost always negate this advantage

This is bull. I'm a senior software developer, and I can't even count how often I need to examine the last few characters at the *end* of a string (examples: check if a path is a directory or a file; check a file's extension).

EDIT: Seeing how others have already had almost exactly the same complaint, I'll now remove / rephrase the mentioned section.

This statement should be removed entirely. — Preceding unsigned comment added by (talk) 09:43, 22 April 2012 (UTC)

All you are saying is that an offset (such as the length) should be in code units, not "characters". You can find the end of a UTF-8 or UTF-16 string instantly if the length is in code units, negating any advantage of UTF-32. Anyway, there is more text about this below, so this is OK. You should be warned, however, that thinking offsets must be in "characters" does not match claiming you are some kind of expert. Spitzak (talk) 15:50, 23 April 2012 (UTC)
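
As an illustration, a suffix check like the file-extension example above operates directly on code units and never decodes or counts characters; a minimal C sketch over UTF-8 bytes:

 #include <string.h>
 
 /* Does the byte string s end with the byte string suffix? Works on UTF-8
    directly, because UTF-8 comparisons are plain byte comparisons. */
 static int has_suffix(const char *s, const char *suffix) {
     size_t ls = strlen(s), lx = strlen(suffix);
     return ls >= lx && memcmp(s + ls - lx, suffix, lx) == 0;
 }

For example, has_suffix("photo.jpeg", ".jpeg") returns 1 even when the name before the dot contains multi-byte sequences.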