Talk:UTF-32


This is not an encoding

This is not an encoding, but simply Unicode itself. If a character's Unicode code is 42, then the 32 bit integer which holds 42 is not "UTF-32". It's just the code of that character. UTF-32 is merely a synonym for Unicode. UTF-32('A') is 65, and Unicode('A') is 65. Same thing.

When we have a string of ASCII characters encoded as 8 bit values, do we call that ATF-1? No, it is just an ASCII string.

Another problem is that if this is an encoding, what is the byte order? Is the character 'A' stored as 00 00 00 41 or is it stored as 41 00 00 00? If this is an encoding, we should be able to answer such a basic question. Encoding means settling all issues of bitwise representation for the purposes of transport.

24.85.131.247 (talk) 22:49, 29 October 2011 (UTC)
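
On the byte-order question, a minimal Python sketch may help. It assumes the standard library codec names "utf-32-be", "utf-32-le" and "utf-32", and only illustrates that both layouts mentioned above exist as distinct encoding schemes, with an optional byte order mark to tell them apart:

```python
# Encode the single code point U+0041 ("A") under each UTF-32 encoding scheme.
big_endian = "A".encode("utf-32-be")      # most significant byte first
little_endian = "A".encode("utf-32-le")   # least significant byte first

# The unmarked "utf-32" codec prepends a byte order mark (BOM) in native order,
# so a decoder can work out which of the two layouts follows.
with_bom = "A".encode("utf-32")

print(big_endian.hex(" "))
print(little_endian.hex(" "))
print(len(with_bom))  # 4-byte BOM plus one 4-byte code unit
```

The integer 65 is the code point; the four-byte sequences are two different serializations of it.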

Why 4 byte? Why not 3?

Why is there no 3-byte encoding? 2^24 is 16,777,216, much more than is needed to represent the 1,114,112 character codes of Unicode. Is this because of word boundaries? I understand there are tradeoffs, but wouldn't someone somewhere have a use for a simple-to-process encoding that didn't waste a whole byte for each character? --Apantomimehorse 06:47, 9 September 2006 (UTC)

Truth is, if they had planned things properly from the beginning I doubt this encoding would exist. If you are going to the trouble of supporting supplementary characters you will probably want other advanced text features too, which will nullify most of the advantages of a fixed-width encoding.
In any case you'd be pretty mad to use UTF-32 or UTF-24 for storage or transfer purposes, and if you want to use a 3-byte encoding internally in your app or app framework there's nothing to stop you (though I strongly suspect it will perform far worse than either a well-written UTF-16 or UTF-32 system). Plugwash 00:39, 11 September 2006 (UTC)
The reason for a 4-byte and not a 3-byte encoding should be simple: a 32-bit number is a native unit for today's dominant 32-bit and 64-bit processors. For example, reading memory in units of 24 bits would be much more expensive than reading the larger chunk of 32 bits. -- Sverdrup (talk) 18:58, 8 November 2011 (UTC)
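
The arithmetic in this thread can be checked with a short Python sketch. The functions pack24 and unpack24 are hypothetical names for a made-up 3-byte big-endian packing; no such format is standardized, and the sketch says nothing about performance:

```python
# Size of the Unicode code space versus what 24 bits can hold.
CODE_SPACE = 0x10FFFF + 1      # code points U+0000 through U+10FFFF
CAPACITY_24 = 2 ** 24

def pack24(text):
    """Hypothetical 3-byte packing: each code point as 3 big-endian bytes."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)

def unpack24(data):
    """Inverse of pack24: read the data back three bytes at a time."""
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )

print(CODE_SPACE, CAPACITY_24)
print(pack24("A\U0001F600").hex(" "))
```

Every code point fits in three bytes, so the choice of four is about alignment with native word sizes rather than range.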

NPOV?

Is it just the way I'm reading this article, or does it stink of a total lack of NPOV? It almost reads like a case for everybody forgetting about UTF-32. UTF-32 space inefficient? Not if you're Japanese. The whole reason character handling is in the state it's in is because people didn't care about the needs of other people. It was pretty clear a long time ago that a solution to i18n was needed and that it would have to be something remarkably like UTF-32.

"Also whilst a fixed number of bytes per code point may seem convenient at first it isn't really that much use. It makes truncation slightly easier but not significantly so compared to UTF-8 and UTF-16. It does not make calculating the displayed width of a string any easier except in very limited cases since even with a “fixed width” font there may be more than one code point per character position (combining marks) or indeed more than one character position per code point (for example CJK ideographs). Combining marks also mean editors cannot treat one code point as being the same as one unit for editing."

Well, no. If you're talking about drawing glyphs, sure, but it has absolutely no pros or cons compared to other charsets in that context. It makes i18n string handling easier by an order of magnitude though. Put simply, all you do is divide everything by four. Try counting the length of a string in UTF-8 or UTF-16. It's just about impossible to do in a stable way. Look at the whole "Bush hid the facts" bug in Notepad: the *perfect* example of an issue that would never have occurred with UTF-32. http://www.evilshroud.com/bushhidthefacts/ --Streaky 03:35, 30 November 2006 (UTC)

Inefficient is definitely true: in the best case it's no better than either UTF-8 or UTF-16, and in the common cases (yes, that includes Chinese and Japanese) it is far worse.
What *IS* the code point count useful for? Most of the time what matters is either size in memory, grapheme cluster count or console position count.
As for the <name> hid the facts "bug" you mentioned, it doesn't look like a charset issue to me (and is almost certainly not related to either UTF-8 or UTF-16). To me it looks like a deliberate easter egg but unless someone can translate Plugwash 12:52, 30 November 2006 (UTC)
The "bush hid the facts" issue is a technical side effect, neither a bug nor an easter egg. Since a text file doesn't carry information about its encoding, you have to guess. Especially for short strings it sometimes comes out wrong, treating a text with encoding X like one with encoding Y. More details. --193.99.145.162 17:21, 27 June 2007 (UTC)
"Bush hid the facts" is in fact *caused* by the use of a non-byte encoding (UCS-2), rather than an ASCII-compatible encoding such as UTF-8. Use of UTF-32 would result in similar bugs. So in fact it is an argument *against* using UTF-32. Spitzak (talk) 22:18, 21 April 2010 (UTC)
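
The misdetection described in the two comments above can be reproduced in a few lines of Python. This is only a sketch of the effect, assuming the ASCII bytes get interpreted as little-endian 16-bit code units; it does not model how Notepad actually guesses an encoding:

```python
raw = b"Bush hid the facts"          # 18 ASCII bytes, as saved on disk
misread = raw.decode("utf-16-le")    # wrong guess: each byte pair becomes one code unit

def is_cjk_unified(ch):
    """True if ch lies in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

print(len(raw), len(misread))
print(all(is_cjk_unified(ch) for ch in misread))
print(misread.encode("utf-16-le") == raw)   # the bytes themselves never changed
```

The same bytes are valid under both interpretations, which is why a guess based on content alone can come out wrong for short strings.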
Not sure what the article means by "more than one character position per code point (for example CJK ideographs)", won't these be one (CJK) character per code point as well? Regarding whether this article is NPOV, since most commonly used CJK characters are in the BMP, which can be represented with only 2 bytes, always using 4 bytes to represent these is wasteful even if you are Japanese. Raphanid 22:59, 29 June 2007 (UTC)
Raphanid: In traditional fixed-width CJK fonts an ideograph is 2 character positions wide (that is, twice the width of a Latin alphabet letter).
193.99.145.162: Do you have a source for the hid the facts thing being a misdetection (that MSDN page isn't one)? Given the length and pure English nature of the message it seems pretty unlikely. Plugwash 17:42, 30 June 2007 (UTC)
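
The point about character positions versus code points can be sketched with Python's unicodedata module. The helper display_columns is a hypothetical name and a deliberate simplification (it ignores control characters, ambiguous-width characters and emoji sequences), so treat it as an illustration rather than a terminal-accurate width function:

```python
import unicodedata

def display_columns(text):
    """Rough column count in a fixed-width font (hypothetical helper)."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                      # combining marks occupy no column of their own
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2                  # wide/fullwidth: two character positions
        else:
            columns += 1
    return columns

for sample in ("abc", "日本語", "e\u0301"):
    print(len(sample), display_columns(sample))
```

In all three samples the number of UTF-32 code units equals len(sample), yet the column count differs from it in two of them.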

UTF-32 not used?

For these reasons UTF-32 is little used in practice with UTF-8 and UTF-16 being the normal ways of encoding Unicode text

I disagree with this statement. wchar_t in Unix/Linux C applications is in UTF-32 format. This makes it pretty often used. Vid512 17:28, 19 March 2007 (UTC)

Also UTF-32 is used as the internal format for strings in the Python programming language (the C-based reference implementation at any rate). (Actually it uses UCS-4, as Python does not forbid lone high or low surrogates from being encoded.) As there are no referenced facts to support this statement and there are clear uses of UTF-32/UCS-4 in systems today, I'm removing the claim that it's not used. - Dmeranda 04:44, 5 August 2007 (UTC)
I noticed this and also disagreed. But when I read that Dmeranda removed it, well, it was still there! Checking, it appears he removed another statement instead. I've restored that and removed the above claim. mdf (talk) 15:38, 27 November 2007 (UTC)
CPython can be compiled to use either UCS-4 or UCS-2 internally. It defaults to UCS-2, but many Linux distributions compile it to use UCS-4. Agthorr (talk) 15:44, 9 May 2010 (UTC)

Removing cleanup and dubious tags

Well first, the edit comment is a bit inaccurate. I quickly looked through the history and I thought I saw the cleanup message there but upon further examination it seems I was wrong. So it should instead say that the cleanup tag is wrong in that it was most certainly not there since September 2007, and the article is in fine shape (though not the best it could be), so it shouldn't be there.

As for the dubious tags, the claim that it's more space efficient is well justified by the following sentence, which notes that non-BMP characters are rare. This is by design; the BMP is intended to contain pretty much every character in major (and most minor) modern languages, as the standard notes [1]. The BMP takes 2 bytes in UTF-16 and 1 to 3 in UTF-8, so for text consisting of BMP characters, UTF-32 obviously takes more space. For a real world example, with a file consisting of large amounts of Japanese and ASCII text (all in the BMP), it is 10MB with UTF-8, 14MB with UTF-16, and 28MB with UTF-32.

For the claim that it's rarely used, Unixy systems use UTF-8, Windows uses UTF-16, various programming languages mostly use either (though I know Python can use UCS-4 if you compile it so). Another message on this page talks about wchar_t, but that's implementation-specific, and the Unicode standard even advises against it for code that's supposed to be portable for this reason [2]. In my experience it doesn't seem to be used nearly as much as others, though I admit my experience in this area isn't quite vast. Regardless, a completely implementation-specific data type in a single language hardly changes matters.

For those reasons I've removed those tags. The article might do with a few citations, but there's no dubious information in it, and though it could be improved, it's written well enough that it does not require a cleanup. 24.76.174.152 (talk) 07:15, 19 November 2009 (UTC)
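
The space comparison above can be illustrated on a small scale. This Python sketch uses a made-up sample string of mixed Japanese and ASCII text (all in the BMP); the absolute numbers are specific to this sample and are not the file sizes quoted in the comment:

```python
# Hypothetical sample: 8 Japanese characters followed by 10 ASCII characters.
sample = "日本語のテキスト and ASCII"

# Use the BOM-less codecs so only the text itself is measured.
sizes = {
    codec: len(sample.encode(codec))
    for codec in ("utf-8", "utf-16-le", "utf-32-le")
}

for codec, size in sizes.items():
    print(f"{codec}: {size} bytes for {len(sample)} code points")
```

UTF-32 always costs four bytes per code point, so for BMP-only text it is exactly twice the UTF-16 size.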

Character vs. code point

The History section reads: "UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly code value in the code space of integers between 0 and hexadecimal 7FFFFFFF." Shouldn't this be "each encoded code point"? A 32-bit value doesn't necessarily represent one character; some characters are composed of several values. Tigrisek (talk) 19:45, 16 January 2011 (UTC)

I agree, this fix was already done for the UTF-8/16 pages. Spitzak (talk) 03:41, 18 January 2011 (UTC)
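
A small Python sketch of the distinction, using the letter "é" as an assumed example: the same user-perceived character can be one code point or two, and UTF-32 spends one 32-bit value on each code point, not on each character.

```python
import unicodedata

composed = "\u00e9"                                   # "é" as a single code point
decomposed = unicodedata.normalize("NFD", composed)   # "e" followed by U+0301 COMBINING ACUTE ACCENT

for label, text in (("composed", composed), ("decomposed", decomposed)):
    code_points = [f"U+{ord(ch):04X}" for ch in text]
    utf32_bytes = len(text.encode("utf-32-le"))
    print(label, code_points, utf32_bytes)
```

Both strings render as one character, but the decomposed form occupies two 32-bit values.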

NPOV

I'm a seasoned software developer, and I believe that it's damn convenient that the Nth character of a string can be found by indexing to position [N-1] in an array (or [N], if 1-based). 24.85.131.247 (talk) 22:18, 29 October 2011 (UTC)

As a "seasoned software developer" I would be interested in you locating actual code you have written where it looked at character N in a string without first looking at characters 0..N-1. Using the return value from another function that looked at characters 0..N-1 does not count. The N must be generated without ever looking at the string. Any other use can be rewritten to use code unit offsets or iterator objects and is not an argument for fixed-sized code points.Spitzak (talk) 22:20, 31 October 2011 (UTC)
For instance: Take a look at the -c switch in the "cut (Unix)" article. — Preceding unsigned comment added by 62.159.14.9 (talk) 10:14, 8 February 2016 (UTC)
Sorry, wrong. The cut command reads UTF-8, and therefore has scanned all the "characters" before n and can count while doing so. And in fact unless the writers are complete idiots, this is how it would be written. Note the -n switch ("don't split multibyte characters"); this is a good indication that cut does not convert to UTF-32 at any point. I would also like to see an actual script that uses the -c switch and would fail to produce the desired result if -b and -n were used instead. Spitzak (talk) 02:03, 9 February 2016 (UTC)
Don't know about that guy, but I am currently writing a soft real-time appliance (running on BareMetalOS) in assembly. Fixed-width encoding is very useful to me, even though I don't necessarily consider myself a strictly "novice" programmer as mentioned in the article. Sometimes space just needs to be bounded but can be arbitrary in principle, as long as you can do your stuff in a low fixed number of processor cycles. There are real use cases for this stuff, even for people who are not novices (not that I'm an expert on anything either). That's why it finds use. That wording in the article just seems unnecessary. If enough people agree, I would personally be willing to come up with a complete rewrite of most sections to discuss and see if it might improve the article. Does Wikipedia have a way of proposing large rewrites on the talk page without actually changing the article immediately? Like a pull request? --79.230.175.7 (talk) 19:42, 28 May 2016 (UTC)

Checking the end of a string not useful??

However there are few (if any) useful algorithms that examine the n'th code point without first examining the preceding n-1 code points, therefore a somewhat more difficult code replacement will almost always negate this advantage

This is bull. I'm a senior software developer, and I can't even count how often I need to examine the last few characters at the *end* of a string (examples: Check if a path is a directory or a file; check a file's extension)

EDIT: Seeing how others have already had almost exactly the same complaint, I'll now remove / rephrase the mentioned section.

This statement should be removed entirely. — Preceding unsigned comment added by 82.139.196.68 (talk) 09:43, 22 April 2012 (UTC)

All you are saying is that an offset (such as the length) should be in code units, not "characters". You can find the end of a UTF-8 or UTF-16 string instantly if the length is in code units, negating any advantage of UTF-32. Anyway there is more text about this below, so this is ok. You should be warned however that thinking offsets must be in "characters" does not match claiming you are some kind of expert. Spitzak (talk) 15:50, 23 April 2012 (UTC)
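
The two examples raised above (trailing slash, file extension) can be sketched directly on UTF-8 bytes in Python, using only code-unit offsets. The function names is_directory_path and extension are hypothetical, and the sketch assumes "/" as the separator and treats a leading dot as part of the name rather than as an extension:

```python
def is_directory_path(path_utf8: bytes) -> bool:
    """Check the last code unit only; no decoding or character counting needed."""
    return path_utf8.endswith(b"/")

def extension(path_utf8: bytes) -> bytes:
    """Return the extension (including the dot) of the final path component."""
    name = path_utf8.rsplit(b"/", 1)[-1]
    dot = name.rfind(b".")
    # ASCII bytes such as "." and "/" never occur inside a multi-byte UTF-8
    # sequence, so searching the raw bytes cannot split a character.
    return name[dot:] if dot > 0 else b""

path = "données/日本語.txt".encode("utf-8")
print(is_directory_path(path), extension(path))
```

Both checks work from the end of the byte string without ever knowing how many characters precede it.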

Citation needed

There are two citation needed tags for statements saying something is rare to non-existent. In both cases, I don't see how anyone would find a reference. For one, you need a definition of rare to actually know, but also you don't know how many documents people have written and stored on their own computers. Unless someone decides to do a random survey of all documents, it isn't likely we will ever know. I think the tags should be removed. Gah4 (talk) 21:25, 5 October 2015 (UTC)