
Talk:UTF-8



Modified UTF-8

In this section it is stated:

"Modified UTF-8 uses the 2-byte overlong encoding of U+0000 (the NUL character), 11000000 10000000 (hex C0 80), rather than 00000000 (hex 00). This allows the byte 00 to be used as a string terminator."

It does not explain why a single byte 00000000 cannot be used as a string terminator. Given that all other bytes commencing with 0 are treated as ASCII, why should 00 be any different?

FreeFlow99 (talk) 11:53, 8 September 2015 (UTC)[reply]

If you use a single null byte (00) to indicate the end of a string, then whenever you encounter 00 you have to decide whether it indicates an actual character (U+0000) in the string or the end of the string itself. The point of the modified scheme is that 00 would never appear in character data, even if the data included U+0000. This has two advantages: (1) 00 does not code characters, so it is freed up for local use as the traditional metacharacter to indicate unambiguously the end of a string, and (2) 00 never appears in encoded data, so that software that chokes on 00 or ignores 00 can still handle data that includes U+0000. The downside, of course, is that you have to avoid software that chokes on the overlong sequence. -- Elphion (talk) 13:34, 8 September 2015 (UTC)[reply]
Thanks for your reply. However, I'm not sure I completely understand the merits. If we are talking about a text-only file, 00x should never be in a string (paper tape being obsolete), [unless it is used to pad a fixed-length field, in which case I understand the point]. What you say would be a problem for data files, because bytes aren't just used to store characters but can store values such as integers, reals etc., and these can conflict with control characters; but this problem is not limited to 00x, it applies to all control codes. The only reason I can see to treat NUL separately is that 00x is very common. To avoid ambiguity completely we would need to limit control-code usage to text-only files, and for data files use a structure, e.g. one where each component has an id and a size followed by the actual data, or use some different system with a special bit that is out of reach of any possible data values (conceptually a 'ninth' bit) that flags the 'byte' as a control character. Or have I missed something? FreeFlow99 (talk) 14:54, 8 September 2015 (UTC)[reply]
Your statement "if we are talking about a text only file, 00x should never be in a string" is the misunderstanding. U+0000 is a perfectly valid Unicode character. Unicode does not define its meaning beyond its correspondence with ASCII NUL, and applications use it in a variety of ways. It is in fact common in real data. You should therefore expect character data to contain it, and be prepared to handle it (and preserve it as data). Modified UTF-8 is one way of doing that without giving up the notion of 00 as string terminator. The alternative pretty much requires that string data be accompanied by its length (in bytes or characters) -- which is not a bad idea. -- Elphion (talk) 16:33, 8 September 2015 (UTC)[reply]
Why should UTF-8 encoded data use U+0000 as valid data? UTF-8 is meant to encode Unicode text, and there U+0000 is an ASCII NUL, not a character. If you want to encode binary data in e.g. XML files, it would be better to use base64, not UTF-8, which has a worse size ratio between input and output, and should also not be used to encode U+DC00, U+FFFE, etc.--BIL (talk) 22:09, 5 November 2015 (UTC)[reply]
The simple answer: show me a Unicode Consortium reference saying that U+0000 is not a character. It's a valid codepoint; you don't get to dictate how others use it. It is widely used in database fields and as a separator, in what is otherwise text data, not binary data. -- Elphion (talk) 23:27, 5 November 2015 (UTC)[reply]
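For concreteness, here is a minimal C sketch of the C0 80 scheme discussed above (the helper name is invented for illustration, and only U+0000 and the ASCII range are handled):

    #include <stddef.h>

    /* Emit one code point in Modified UTF-8: U+0000 becomes the overlong
       pair C0 80 instead of a single 00 byte, so 00 never appears inside
       encoded data and stays free to act as the string terminator. */
    static size_t mutf8_put(unsigned char *out, unsigned int cp)
    {
        if (cp == 0x0000) {             /* the Modified UTF-8 special case */
            out[0] = 0xC0;
            out[1] = 0x80;
            return 2;
        }
        if (cp < 0x80) {                /* ordinary one-byte ASCII */
            out[0] = (unsigned char)cp;
            return 1;
        }
        return 0;                       /* longer sequences omitted from this sketch */
    }

A matching decoder maps C0 80 back to U+0000; a strict UTF-8 decoder rejects it as overlong, which is the downside Elphion mentions.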

Comment from lead

Moved the following material to Talk (it was originally in the lead of the article and had been commented out since August 2014):

UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.{{cn}}

Possible citations; I'm sure you could find lots more, but these are good samples showing that the low-level API is UTF-8:

https://developer.apple.com/library/mac/qa/qa1173/_index.html

https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text

http://wayland.freedesktop.org/docs/html/protocol-spec-interface-wl_shell_surface.html (look for set_title)

End moved material -- Elphion (talk) 12:35, 10 September 2015 (UTC)[reply]

Misleading caption

Graph indicates that UTF-8 (light blue) exceeded other main encodings of text on the Web, that by 2010 it was nearing 50% prevalent, and up to 85% by August 2015.[2]

While the 85% statement may be true, the graph doesn't indicate any such thing. NotYourFathersOldsmobile (talk) 22:13, 10 September 2015 (UTC)[reply]

Yes, attempts to update the graph to a newer version produced by the same engineer at Google have been reverted because he has not made clear what the copyright is on the new graph. Spitzak (talk) 23:26, 10 September 2015 (UTC)[reply]

Where's the source referred to in #WTF-8?

Section WTF-8 says "The source code samples above work this way, for instance." – I can't find these samples, where are they? --Unhammer (talk) 06:25, 24 September 2015 (UTC)[reply]

It's gone as discussed above, so I removed the sentence. Someone should probably add a note about WTF-8 being jocularly used to mean a kind of encoding error before Rust people hijacked it. — mwgamera (talk) 23:09, 27 October 2015 (UTC)[reply]

Huge number of incorrect edits!

112.134.187.248 made a whole lot of misinformed edits. With any luck he will look here. Sorry, but it was almost impossible to fix without a complete revert, though he was correct exactly once, when he used "will" instead of "could"; I will try to keep that.

1. An Indic code point cannot take 6 bytes. He is confusing this with CESU-8, I think, though even that encodes these as 3 bytes, as they are still in the BMP.

2. You cannot pass ASCII to a function expecting UTF-16. Declaring the arguments to be byte arrays does not help, and I doubt any sane programmer would do that either. Therefore it is true that UTF-16 requires new APIs.

3. Many attempts to claim there is some real-world chance of Asian script being larger in UTF-8 than UTF-16, despite obvious measurements of real documents that show this does not happen. Sorry, you are wrong. Find a REAL document online that is larger. Stripping all the markup and newlines does not count.

4. Deletion of the fact that invalid byte sequences can be stored in UTF-8, by somehow pretending that they will magically not happen. Sorry, try actually programming before you believe such utter rubbish. Especially odd because some other edits indicate that he things invalid sequences will happen and are somehow a disadvantage of UTF-8.

5. Belief that markup tags using non-ASCII can remove the fact that it makes UTF-8 smaller. Sorry, but markup contains far more slashes and angle brackets and spaces and quotes and lots and lots of ASCII-only tags, so this will just not happen, no matter how much you wish it could.

6. Claim that invalid sequences of 4 bytes are somehow a problem, while ignoring invalid sequences in UTF-16 and all other invalid sequences in UTF-8. This is despite earlier edits where he basically pretends invalid sequences magically don't happen. Sorry you can't have it both ways.

7. Complete misunderstanding of why multibyte sequences in UTF-16 cause more problems than in UTF-8: because they are RARE. Believe me, NOBODY screws up UTF-8 by "confusing it with ASCII" because they will locate their mistake the very first time a non-ASCII character is used. That is the point of this advantage.

Spitzak (talk) 17:34, 19 October 2015 (UTC)[reply]

Includes response to Spitzak (talk)'s UTF-8 propaganda.
  • 1: It was not an "Indic codepoint" but an Indic character. Characters are not codepoints; see TSCII. The letter "பு" takes 6 bytes in UTF-8 but only one in TSCII. Nice straw man. "Sorry, try actually programming before you believe such utter rubbish.": Same goes for you; a character is not necessarily a codepoint. Try reading Unicode TRs and standards before you even attempt to program using encodings.
  • 2: Well, UTF-8 requires new APIs too, to pass NUL as an argument.
  • 3: Why don't you cite a reliable source? Anecdotes are WP:NOR. And SGML can still use non-ASCII for markup; there is no restriction on this.
  • 4: Try to learn some English first; "Thinks". I programmed, too.
  • 5: "Will just not happen" is normative, not positive. It is your opinion, and Wikipedia is not Facebook.
  • 6: I can understand this. Sorry, I apologize.
  • 7: Normative (opinion). If there is an error, I say we would spot it even with Latin in UTF-16, if there are too many NULs or too few. It depends on the programmer.
Still, UTF-16 can recover using surrogates, because there is a particular order for encoding surrogates and they are paired; if one surrogate is missing, the other is just discarded, and if both are there in the right places, the pair can be used as a starting point for recovery.
Invalid filenames: UTF-8 encoding says that any code point encoded in more than 4 bytes is illegal, and therefore it should be discarded; bugs are not to be considered features.
112.134.196.135 (talk) 06:32, 20 October 2015 (UTC)[reply]
"Deletion of the fact that invalid byte sequences can be stored in UTF-8": Invalid sequences can happen ANYWHERE; it is not an advantage for UTF-8.
Interspersed comments do not work at Wikipedia (consider what the page would look like if anyone wanted to reply to one of your points). Therefore I have extracted your responses and posted them above. I have not looked at the issue yet, but the tone of the above responses is what would be expected from a belligerent rather than someone wanting to help build the encyclopedia. Johnuniq (talk) 06:51, 20 October 2015 (UTC)[reply]
Thanks for editing, Johnuniq. The edits by Spitzak include a lot of normative content; haven't you noticed his tone? I am trying to build a good encyclopedia, with less opinion-based information and good neutrality.
1: The number of bytes that a character takes in any Unicode format is unbounded, though I think Unicode permits implementations to limit the number of combining characters that can be stacked on one base character. That TSCII encodes as one codepoint what Unicode does as two is not really relevant to this page.
2: If you want to include NUL in a string, you need an API that will let you do it. If you don't care, you can use NUL-terminated strings. There's absolutely no difference between ASCII and UTF-8 in this regard, and many people use NUL terminated UTF-8 strings.
3: Cites are important where there's a real dispute, but this is citable, I believe, and in practice it is true. I don't see why SGML, which is a generic framework for markup rather than a markup language itself, is relevant here; of course a markup language could in theory use non-ASCII characters, but I've never seen it done, and HTML, TeX and other common forms of ASCII markup are probably a million times more common than any non-ASCII markup.
5: Bah. It's measurable. You can demand cites, but don't demean other posts by dismissing facts as their opinion.--Prosfilaes (talk) 15:38, 20 October 2015 (UTC)[reply]
1. It is still true that TSCII is a multi-byte (or single-byte, depending on how you view it) encoding, and still deals with this problem. Even though the 2CP problem is due to Unicode itself, it still applies, because TSCII is a multi-byte encoding for the same characters; it is just more efficient for a certain script.
2: Yes, they do use null-terminated strings; I can understand that these are used in argv[]. But still, null-terminated strings house a lot of other bugs; UTF-8 does not expose them first, but UTF-16 forces programmers to care about NUL problems.
3: Yes, SGML places no restrictions on this; HTML does; this is just what popular document tools do. However, when there is more Chinese content than markup tags, it is space-advantageous for UTF-16. I am not saying that this does not happen, but where the author says 'often', that needs a citation on 'how often'; the citation provided was anecdotal. For example, consider a non-fancy document, or a document that properly separates documents, scripts and stylesheets. 'Sometimes' is a better word.
5. WP:NOR still applies.
On point 7: People who treat UTF-16 as UCS-2 are similar to people who treat UTF-8 as ASCII. Which editors happen to be the majority? Editors who treat UTF-8 as ASCII, I can say from anecdotes. Right now, hardly any editors have these problems; blaming UTF-16 for UCS-2 is like blaming UTF-8 for Latin; the blame game is based on obsolete encoding assumptions. Who says "NOBODY"? Weasel words?
How exactly is saving the illegal characters an advantage over UTF-16? If it is a stream, both UTF-8 and UTF-16 can carry them. Both standards say to discard invalid code points; you may claim that anything beyond 4 bytes after transformation can still be ignored, but invalid surrogates can be ignored in UTF-16 as well. I am the same person, again. 112.134.200.102 (talk) 18:39, 20 October 2015 (UTC)[reply]
1: The 2CP problem? Sounds like OR to me. Sure as heck doesn't come up in a web search.
2: NUL-terminated strings are just fine for many problems; they do not "house a lot of bugs".
3: So your OR wins over a citation? SGML is irrelevant here; there are many forms of markup, and none of the popular ones use characters outside of ASCII.--Prosfilaes (talk) 02:05, 21 October 2015 (UTC)[reply]
1. 2CP problem: the two-code-points-per-character problem. It is not OR; it is the problem mentioned above, and it is very relevant to 'specific multi-byte encodings' or a 'specific single-byte encoding'. It was already mentioned here, but someone said it's irrelevant; TSCII is still an encoding, well capable of encoding those "two code points" in a single byte.
3: Likewise, UTF-16 is fine for me using non-NUL-terminated strings. What's the problem here? Null-terminated strings still house a lot of bugs.[1][2] If programmers' ignorance is not to be cited, then remove the thing about UCS-2, and all implementation-related bugs/faults as well. Edit: Sorry, someone else removed it, or I removed it; it is no longer there; this point is done. The point was that programmers' ignorance in mistaking UTF-16 for UCS-2 is not a weakness of UTF-16 per se, any more than UTF-8 getting mistaken for some Latin-xxxx is a weakness of UTF-8, especially when we have specific BOMs.
3: My claim was not OR; the author's claim was the OR. It is up to the claimant to prove something. I am not saying that most popular markups do not use ASCII; it is up to the author to prove (or at least cite from a reliable source) that the size of the markup is larger than the size of the other-language text, if he is to use the word "often" here. "Sometimes" is a better word. Moreover, I can still say that usually a web page is generated on request; sites do not store HTML files, just the text that will be inserted into the page, in a database, with very little markup. It is not about whether it is my OR or not; it is that an anecdote should not be given as a citation to show how often this happens, even if you think SGML is irrelevant. 112.134.234.218 (talk) 05:27, 21 October 2015 (UTC)[reply]
I removed the one about UTF-8 encoding invalid strings; a UTF-16 processor that can still process invalid sequences can likewise ignore them. It is quite the same as UTF-16; you either have this advantage both ways or neither way. Moreover, Unix filenames/paths are encoding-unaware; they still treat the file paths/names as ASCII; only a lone NUL is a problem, whereas a NUL is not a problem with length-prefixed filenames or other length-aware methods, or methods like a double NUL. If what he means by an invalid filename is an invalid character sequence, any encoding-unaware stream processor can handle it (either one that does not discard anything after 4 bytes, or after the second surrogate in order), subject to internal limitations like NUL or NULNUL or other ones caused by pre-encoding the stream length.
Removed the sentence blaming UTF-16 for inconsistent behaviour with invalid strings while not explaining the consistency of invalid sequences in invalid UTF-8 streams (the over-4-byte problem). If there is a standard behaviour that makes UTF-8 more consistent with invalid sequences than UTF-16, please mention the standard. 112.134.188.165 (talk) 16:11, 21 October 2015 (UTC)[reply]
Unix filenames do not treat the name like ASCII; they treat it as an arbitrary byte string that does not include 00 or 2F. Perfect for storing UTF-8 (which was designed for it), impossible for storing UTF-16. Your links to complaints about NUL-terminated strings are unconvincing and irrelevant here.--Prosfilaes (talk) 08:17, 25 October 2015 (UTC)[reply]
Prosfilaes: Yes, UNIX just treats them as arbitrary NUL-terminated strings without 00/2F, and Windows handles it differently. These are based on internal limitations. My point is that the 'possibility to store invalid strings' exists in both of them; that's not a UNIX-only bug/feature. He responded as if encoding a broken UTF-8 string were a UTF-8-only feature, while you can have broken streams pretty much in any place where (1) it does not violate internal limitations and (2) it is encoding-unaware. I do not have any problem with NUL-terminated filenames per se, but I don't see why a broken bytestream should be considered a feature in one and not a feature in the other. 175.157.233.124 (talk) 15:43, 25 October 2015 (UTC)[reply]
Cleaned up the article due to WP:WEASEL ("... suggested that...").
Cleaned up the article, again WP:WEASEL ("...it is considered...") by whom? Weasels? 175.157.125.211 (talk) 07:48, 22 October 2015 (UTC)[reply]
112.134.190.56 (talk) 21:12, 21 October 2015 (UTC)[reply]
Cleaned up: "If any partial character is removed the corruption is always recognizable." Recognizable how? UTF-8 is not an integrity-check mechanism; if one byte is missing or altered, it will be processed AS-IS. However, the next byte is processable, and it is already listed there.175.157.94.167 (talk) 09:18, 22 October 2015 (UTC)[reply]
The point you mentioned, '4. Deletion of the fact that invalid byte sequences can be stored in UTF-8, by somehow pretending that they will magically not happen. Sorry, try actually programming before you believe such utter rubbish. Especially odd because some other edits indicate that he things invalid sequences will happen and are somehow a disadvantage of UTF-8.'
I did not delete it. Please be respectful to others; do not accuse others falsely. Invalid sequences can be stored in anything, especially in encoding-unaware or buggy implementations. I can see where this is coming from: currently I am in Sri Lanka; I am not them, and they are probably from somewhere else.
Spitzak: See your edit here: https://en.wikipedia.org/w/index.php?title=UTF-16&type=revision&diff=686526776&oldid=686499664 . Javascript is NOT Java, and the content you removed says nothing implying a failure to support UTF-16; it is just that the distinction was made so that surrogate pairs are counted separately from code points. In fact, it was I who added the word 'Javascript' to that article in the first place.
I removed the implication that counting code units somehow makes it "not support UTF-16". Whether you meant it or not, the wording sounded like "Javascript tries but counts surrogate pairs as 2 and this is wrong". Also, a paragraph at the end points out that *ALL* the languages using UTF-16 work this way, so Javascript is not special. Spitzak (talk) 06:13, 2 November 2015 (UTC)[reply]
Please refrain from using phrases like "any sane programmer", "NOBODY screws" which are examples of No true Scotsman fallacy.175.157.94.167 (talk) 10:07, 22 October 2015 (UTC)[reply]
For edits like 1, please see WP:AGF.
Prosfilaes: I clarified your point in the article; if you think it is too much for an encyclopedia, please feel free to remove it. Thanks :) 175.157.233.124 (talk) 15:54, 25 October 2015 (UTC)[reply]
Prosfilaes: The article on the (Indic-specific) TSCII exists on Wikipedia; that's why I brought it up; it works as I described (it is in the table in the article) and it is capable of encoding in one byte what Unicode needs two code points for in one letter. The standard is here: https://www.iana.org/assignments/charset-reg/TSCII . I am sorry if it is considered OR or rare.
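The byte counts being disputed can be checked directly: ப is U+0BAA and the vowel sign ு is U+0BC1, each a three-byte sequence in UTF-8, so the pair takes six bytes, while the TSCII table linked above assigns the combined letter a single byte. A small C sketch of the UTF-8 side (illustrative only):

    #include <stdio.h>

    int main(void)
    {
        /* U+0BAA TAMIL LETTER PA followed by U+0BC1 TAMIL VOWEL SIGN U,
           written out as explicit UTF-8 bytes: 3 + 3 = 6 bytes in total. */
        const unsigned char pu[] = { 0xE0, 0xAE, 0xAA,   /* U+0BAA */
                                     0xE0, 0xAF, 0x81 }; /* U+0BC1 */
        printf("UTF-8 length of U+0BAA U+0BC1: %zu bytes\n", sizeof pu);
        return 0;
    }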

UTF-8 does not require the arguments to be NUL-terminated. They can be passed with a length in code units. And UTF-16 can be NUL-terminated as well (with a 16-bit zero character), or it can also be passed with a length in code units. Please stop confusing the issue with NUL termination. The reason an 8-bit API can be used for both ASCII and UTF-8 is that sizeof(*pointer) is the same for both arguments. For many languages, including C, you cannot pass an ASCII array and a UTF-16 array as the same argument, because sizeof(*pointer) is 1 for ASCII and 2 for UTF-16. This is why all structures and APIs have to be duplicated.
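To make the sizeof(*pointer) point concrete, here is a minimal C sketch (function names are invented; char16_t assumes C11): the existing 8-bit interface serves ASCII and UTF-8 alike, while UTF-16 data needs a parallel interface built on 16-bit code units.

    #include <string.h>
    #include <uchar.h>

    /* One legacy 8-bit interface works for both ASCII and UTF-8, because
       both are arrays of 8-bit code units containing no 00 byte. */
    size_t legacy_length(const char *s)
    {
        return strlen(s);
    }

    /* UTF-16 text is an array of 16-bit code units; it cannot be passed to
       legacy_length() (wrong element size, and ASCII characters carry a 00
       byte in one half of each unit), so a duplicate API is required. */
    size_t utf16_length(const char16_t *s)
    {
        size_t n = 0;
        while (s[n] != 0)
            n++;
        return n;
    }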

Everybody seems to be completely unable to comprehend how vital the ability to handle invalid sequences is. The problem is that many systems (the most obvious are Unix and Windows filenames) do not prevent invalid sequences. This means that unless your software can handle invalid sequences and pass them to the underlying system, you cannot access some possible files! For example, you cannot write a program that will rename files with invalid names to have valid names. Far more important is the ability to defer error detection to much later, when it can be handled cleanly. Therefore you need a method to take a fully arbitrary sequence of code units and store it in your internal encoding. I hope it is obvious that the easiest way to do this with invalid UTF-8 is to keep it as an array of code units. What is less obvious is that using standard encodings you can translate invalid UTF-16 to "invalid" UTF-8 by using the obvious 3-byte encoding of any erroneous surrogate half. This means that UTF-8 can hold both invalid Unix and invalid Windows filenames. The opposite is not true unless you diverge in serious ways from how the vast majority of UTF-8/UTF-16 translators work, by defining some really odd sequences as valid/invalid.
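The "obvious 3-byte encoding of any erroneous surrogate half" can be spelled out; a minimal C sketch (the function name is invented) that applies the ordinary three-byte UTF-8 pattern to a 16-bit code unit even when it is a lone surrogate in the D800-DFFF range:

    /* Encode a 16-bit code unit in the range 0800-FFFF with the ordinary
       three-byte UTF-8 pattern. Applied to a lone surrogate (D800-DFFF)
       this produces bytes that strict UTF-8 rejects, but it preserves
       invalid UTF-16 losslessly. */
    static void encode_unit_3byte(unsigned int u, unsigned char out[3])
    {
        out[0] = (unsigned char)(0xE0 | (u >> 12));          /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));  /* 10xxxxxx */
        out[2] = (unsigned char)(0x80 | (u & 0x3F));         /* 10xxxxxx */
    }

For example, a lone 0xD800 comes out as ED A0 80; whether a decoder accepts those bytes back is exactly the compatibility trade-off being argued about.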

The idea that SGML can be shorter if you just use enough tags that contain Chinese letters is ridiculous. I guess if the *only* markup was <tag/> (i.e. no arguments and as short as possible, with only 3 ASCII characters per tag), and if every single tag used contained only Chinese letters, and they averaged more than 3 characters per tag, then the markup would be shorter in UTF-16. I hope you can see how totally stupid any such suggestion is.

I and others mentioned an easy way for you to refute the size argument: point to an actual document online that is shorter in UTF-16 versus UTF-8. Should be easy, go do it. Claiming "OR" for others' inability to show any examples is really low. Spitzak (talk) 06:13, 2 November 2015 (UTC)[reply]

Spitzak:

For god's sake, I never said that UTF-8 requires NUL termination; I just said that it is possible to pass a NUL-less UTF-8 string around with traditional NUL-terminated char arrays in legacy code. In that case, any ASCII-compatible encoding can do NUL termination. Please keep in mind that you are editing the English Wikipedia. Single-byte NUL terminators are a UNIX-specific internal limitation, just as other systems have their own.
"I hope you can see how totally stupid any such suggestion is.": Whatever it is, regardless of that, the problem is that it is an un-cited non-neutral claim, which is against the policies of a general Wikipedia article[3]. Don't claim "most" or the frequency if you cannot cite it anyway, from a large number of samples. The current citation is currently good enough only as an Existential quantifier.
"Claiming "OR" for others inability to show any examples is really low.": Wikipedia is not Facebook or any other general social media to shout opinions or anecdotal evidence without verifiability. The burden of proof lies on the claimant, not someone else to disprove; so, it needed citation anyway from a large study. I am not AGAINST this, but it needed citation.
As a side note: if phrases like "is ridiculous" are acceptable here and to you, see how many of your edits have been ridiculous because you misunderstood the English there; I just pointed them out nicely. Remember, it does not take some extreme exceptional skill to become not nice. "I removed the implication that counting code units somehow makes it "not support UTF-16". Whether you meant it or not, the wording sounded like "Javascript tries but counts surrogate pairs as 2 and this is wrong". Also, a paragraph at the end points out that *ALL* the languages using UTF-16 work this way, so JavaScript is not special.": I don't see how it implied that JavaScript does not support UTF-16. Supporting UTF-16 means supporting UTF-16, regardless of whether it is done the wrong way or the right way. UTF-16 is an encoding standard. You could have chosen to remove just the part that offended (while it did not, anyway). If you are unsure and had a problem with repetition on that page, there is also Wikipedia's way of doing that using templates; it looks like this: [needs copy edit]. There is an entire essay regarding this, worth reading: WP:TEARDOWN. The word 'JavaScript' wasn't even there until I added it, so there is no purpose in arguing whether it is special or not. JavaScript is not Java and deserves a separate mention in that section. "An indic code point cannot take 6 bytes. He is confusing things with CESU-8 I think...": It was not a codepoint, it was a character. And you should see how TSCII works to know how those 6 bytes are simplified into one. Now, I did not call them ridiculous. 112.135.97.96 (talk) 14:47, 5 November 2015 (UTC)[reply]
Surprisingly, WP:WEASEL there, too: "... leads some people to claim that UTF-16 is not supported". Who? Can't anyone just remove them instead of accusing me? 112.135.97.96 (talk) 15:10, 5 November 2015 (UTC)[reply]
"Everybody seems to be completely unable to comprehend how vital the ...": This is what we mean by (not-fully)-encoding-unaware. Filenames work at a very low level; it is just that Windows uses wchar and UNIX uses just bytes. This is just limited by internal limitations of both - a UNIX filename cannot contain a 00. Why do you keep saying that 'the obvious way'? UTF-8 is not any more obvious than UTF-16. "standard encodings you can translate invalid UTF-16 to "invalid" UTF-8 by ": It isn't standard if it is invalid. I kind of get your point (like encoding higher codepoints), but this is too much OR and needs sources. Remember that standard UTF-32 can encode even more codepoints than the 6-byte UTF-8 can do (32 bits vs 31 bits), twice as much. Good luck encoding 0xFFFFFFFE in a six-byte UTF8 stream. 112.134.149.92 (talk) 06:39, 6 November 2015 (UTC)[reply]
Spitzak: https://en.wikipedia.org/w/index.php?title=UTF-16&type=revision&diff=688641878&oldid=688640790
"This simplifies searches a great deal": this is just not encyclopedic. Simplifies searches according to whom? Wikipedia strives to be a high-quality encyclopedia with reliable references. 112.135.97.96 (talk) 15:10, 5 November 2015 (UTC)[reply]
Spitzak: I am kind of satisfied with your last edit on this page, though, and with the current status of it. Thanks. 112.135.97.96 (talk) 15:53, 5 November 2015 (UTC)[reply]

175.157.213.232 (talk) 11:56, 6 November 2015 (UTC) 175.157.213.232 (talk) 12:00, 6 November 2015 (UTC)[reply]