This is the talk page for discussing improvements to the UTF-8 article.
This is not a forum for general discussion of the article's subject.
Archives: Index, 1, 2, 3, 4
The table recently added to describe UTF-1 is interesting, but it should go in UTF-1, not here. Here it suffices to say what the first paragraph says, together with the link to UTF-1. -- Elphion (talk) 15:24, 23 July 2016 (UTC)
- I formatted the table as much as possible in the style of the others (which closely resemble the table in the FSS-UTF specification), specifically so that they could be considered together (for better appreciation of the history). It's not as if it's taking up that much space relative to the huge table further down in the article, and why would there not be a table to illustrate the text (or should that also have to be moved?)... I had indeed considered what you are pointing out, but that even seemed less preferable than (redundantly) having the table in both places. Perhaps let's wait for others' opinions about this? — RFST (talk) 07:26, 24 July 2016 (UTC)
I think that the part about UTF-1 is essential context, especially given the story about how UTF-8 was invented by one superhuman genius (in a diner, on a placemat...). How can you appreciate what problem was being solved, what the respective contributions were, without a juxtaposition of the various stages? If you don't care about what UTF-1 looked like, then you also don't really care about the history of UTF-8. Without it, you might as well delete the whole section. — RFST (talk) 14:25, 1 March 2017 (UTC)
- It might not be out of place to briefly mention some aspects in which UTF-1 was lacking, in order to indicate the problems which UTF-8 was attempting to solve, but any detailed explication or analysis of UTF-1 would be out of place. AnonMoos (talk) 04:27, 2 March 2017 (UTC)
- I believe Ken Thompson invented UTF-8. It was called UTF-2 back then. But UTF-1 introduced two very important concepts that Thompson had nothing to do with: multi-byte encoding and backward compatibility with ASCII. Off topic: the C programming language is actually just B with types, and B was created by Ken Thompson. Still we consider C to be Dennis Ritchie's invention, so obviously this is how things work. Coderet (talk) 13:40, 1 August 2017 (UTC)
Page is misleadingly organized
A page about UTF-8 should start by giving the definition of UTF-8, not by exploring the whole history of how it developed. History should come later. The current organization will cause many people to misunderstand UTF-8.
- I agree that the description should be before the history, and that the history can be placed after Derivatives.--BIL (talk) 20:04, 22 February 2017 (UTC)
Hello fellow Wikipedians,
I have just modified 2 external links on UTF-8. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FAQ for additional information. I made the following changes:
- Added archive https://web.archive.org/web/20140723192908/http://bsittler.livejournal.com/10381.html to http://bsittler.livejournal.com/10381.html
- Corrected formatting/usage for http://plan9.bell-labs.com/sys/doc/utf.pdf
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
You may set the |checked= parameter on this template to true or failed to let other editors know you reviewed the change. If you find any errors, please use the tools below to fix them or call an editor by setting |needhelp= to your help request.
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
If you are unable to use these tools, you may set |needhelp=<your help request> on this template to request help from an experienced user. Please include details about your problem, to help other editors.
Why do so many email addresses have "UTF-8" in them?
For example: "=?utf-8?Q?FAIR?= <email@example.com>"
- =?utf-8?Q?FAIR?= is not the email address itself, but a display name connected to it. UTF-8 enables names that use letters outside the English or Western European alphabets. If the name is shown as =?utf-8?Q?FAIR?=, then the receiving email application couldn't decode the name format. There is always a period when some software does not support the standards on the net. The actual email address is firstname.lastname@example.org, and as far as I know UTF-8 or anything outside ASCII is not permitted in it (according to the article email address, efforts are being made to allow such addresses, but they will be incompatible unless every router and receiver supports them). Your email will at least reach its destination.--BIL (talk) 18:37, 21 April 2017 (UTC)
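The =?utf-8?Q?…?= wrapper seen above is a MIME encoded-word, and its "Q" payload can be unpacked mechanically. A minimal C sketch of just the Q-decoding step (decode_q is an illustrative name; charset conversion and full encoded-word parsing are out of scope):

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: decode the Q-encoded payload of an encoded-word such as
 * "=?utf-8?Q?FAIR?=" (the text between the last two '?' marks).
 * Rules: '_' stands for a space, "=XX" is a hex-escaped byte,
 * everything else is copied through.  Caller frees the result. */
static char *decode_q(const char *in)
{
    char *out = malloc(strlen(in) + 1);  /* decoded text never grows */
    char *p = out;
    if (!out)
        return NULL;
    while (*in) {
        if (*in == '_') {                /* '_' encodes a space */
            *p++ = ' ';
            in++;
        } else if (*in == '=' && isxdigit((unsigned char)in[1])
                              && isxdigit((unsigned char)in[2])) {
            char hex[3] = { in[1], in[2], 0 };
            *p++ = (char)strtol(hex, NULL, 16);  /* "=C3" -> byte 0xC3 */
            in += 3;
        } else {
            *p++ = *in++;                /* ordinary byte copied as-is */
        }
    }
    *p = 0;
    return out;
}
```

So "=?utf-8?Q?FAIR?=" decodes to the plain name "FAIR"; a name with accented letters would decode to UTF-8 bytes that the mail client then interprets using the declared charset.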
I am trying to fix text that seems to be confused about how the backward compatibility with ASCII works. I thought it was useful to mention extended ASCII, but that seems to have produced much more confusion, including a whole introduction of the fact that UTF-8 is very easy to detect and that invalid sequences can be substituted with other encodings, something I have been trying to point out over and over, but which really is not a major feature of UTF-8 and does not belong here.
The underlying statement is that a program that looks at text, copying it to an output, and that treats only some ASCII characters specially, will "work" with UTF-8. This is because it will see some odd sequences of bytes with the high bit set, and maybe "think" they are a bunch of letters in ISO-8859-1 or whatever, but since it does not treat any of those letters specially, it will copy them unchanged to the output, thus preserving the UTF-8 encoded characters. The simplest example is printf, which looks for a '%' in the format string and copies every other byte to the output without change (it will also copy the bytes unchanged from a "%s" string, so that string can contain UTF-8). Printf works perfectly with UTF-8 because of this and does not need any awareness of encoding added to it. Another obvious example is slashes in filenames.
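The pass-through point can be demonstrated concretely (format_demo is an illustrative wrapper, not an existing API):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the point above: a formatter that only treats '%' specially
 * copies every other byte through untouched, so the UTF-8 sequence for
 * "café" ("caf" followed by the bytes 0xC3 0xA9) survives even though
 * snprintf knows nothing about encodings. */
static void format_demo(char *buf, size_t n)
{
    const char *name = "caf\xc3\xa9";   /* "café" as raw UTF-8 bytes */
    snprintf(buf, n, "name=%s", name);  /* bytes pass through unchanged */
}
```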
My initial change was to revert some text that implied that a program that "ignores" bytes with the high bit set would work. The exact meaning of "ignore" is hard to say but I was worried that this implied that the bytes were thrown away, which is completely useless as all non-ASCII text would vanish without a trace.
Code that actually thinks the text is ISO-8859-1 and does something special with those values will screw up with UTF-8. For instance, a function that capitalizes all the text and thinks it knows how to capitalize ISO-8859-1 will completely destroy the UTF-8. However, there are a LOT of functions that don't do this; for instance, a lot of "capitalization" only works on the ASCII letters and thus works with UTF-8 (where "work" means "it did not destroy the data").
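The safe-capitalization case described above can be sketched as follows (ascii_upper is an illustrative name): because every lead and continuation byte of a multi-byte UTF-8 sequence has the high bit set, a routine that touches only a-z leaves them alone, while a routine that "upcased" bytes 0xE0-0xFE as if they were ISO-8859-1 letters would corrupt them.

```c
#include <assert.h>
#include <string.h>

/* ASCII-only uppercasing: bytes outside 'a'..'z' (including all
 * high-bit UTF-8 bytes) are left untouched, so multi-byte sequences
 * pass through intact. */
static void ascii_upper(char *s)
{
    for (; *s; s++)
        if (*s >= 'a' && *s <= 'z')
            *s -= 'a' - 'A';
}
```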
- First, your notion that you are some kind of Unicode god and that everybody else is a mere mortal and confused is a little aggravating to say the least. But thanks for at least posting something on the Talk page before reverting willy-nilly like you usually do.
- The original text that you reverted did not state that code which ignored 8-bit characters would work. It said that an ASCII processor which ignored 8-bit characters would see all and only the ASCII characters in a UTF-8 byte stream, in the correct order. This is true. The ASCII content in a UTF-8 stream can be separated from the non-ASCII content without any decoding. The bytes with the high bit set are a separate non-ASCII stream, so to speak. There aren't any 7-bit characters hidden inside the multi-byte 8-bit sequences, because overlong sequences are not allowed. Whether it is acceptable to separate the 7-bit stream from the 8-bit is of course dependent on the situation. When you ignore some of the input, you are, you know, ignoring some of the input. Passing the 8-bit characters through uninterpreted, which you think is "the point", may also work depending on the purpose of the processing and whether there is some further processing that knows what to do. It won't be useful in all cases, such as if the processing is the end of the line, for the purpose of textual display or page rendering for example. That is where fallback/replacement comes in. Anyway, when I unreverted the original, I removed this whole point not because it is false, but because it is apparently confusing, and there are better and more succinct ways to get at the point. As for fallback and autodetection, you commented it out because it is "important" but not "salient". Come on. I'm changing the intro sentence to read "important" features rather than "salient", so you'll be happy. Now we have a list of important features, including fallback and autodetection, rather than "salient" features. We have one thing in the list which is merely "important" but doesn't rise to the level of "salient". But I think we'll live. Sheesh. Person54 (talk) 19:24, 6 June 2017 (UTC)
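The separation claim above really does need no decoding at all, just a high-bit test per byte. A minimal sketch (extract_ascii is an illustrative name, not an existing API):

```c
#include <assert.h>
#include <string.h>

/* In UTF-8, every byte of a multi-byte sequence has the high bit set,
 * so the ASCII characters of a valid UTF-8 stream can be picked out,
 * in order, by a plain byte test with no decoding. */
static void extract_ascii(const char *in, char *out)
{
    for (; *in; in++)
        if (!(*in & 0x80))   /* high bit clear => an ASCII character */
            *out++ = *in;
    *out = 0;
}
```

Note this relies on overlong sequences being forbidden, as the post says: no ASCII character can hide inside a multi-byte sequence.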
- I also added your point about printf, but I am not sure this is actually correct. One of the features of printf format strings is a field width indication, e.g. "%8s", meaning a string field which occupies 8 character width units on the display or page. It isn't clear that printf would substitute UTF-8 strings correctly into such width-specified fields without decoding the UTF-8, so as to know how many characters they represent and how many spaces are needed for padding. So the printf example may not be a good example of processing which can treat the UTF-8 input as if it were ASCII. Person54 (talk) 20:00, 6 June 2017 (UTC)
- New one looks good, though still pretty lengthy. I do believe the fact that there are many invalid sequences which thus allow UTF-8 to be mixed with legacy encodings without the need for a BOM or other indicator is important, so I guess it is good that it is here, though I really doubt that was considered when UTF-8 was originally designed.
- "characters" is pretty ill-defined in Unicode and can have many different answers, so number of code units is certainly a better definition of the string width as it has a definite meaning, and also because it is the value of offsets returned by searching UTF-8. I suppose printf could be made "UTF-8 aware" by truncating the printed string at the start of the code point that exceeds this limit but I am not very certain this is an improvement. Spitzak (talk) 16:14, 7 June 2017 (UTC)
- A better example of a way to make printf UTF-8 aware is that snprintf truncates at a given number of code units. It may make sense for it to truncate at the start of a code point. Since it returns the number of code units printed this is less of a problem than having %ns do unexpected actions. Spitzak (talk) 16:22, 7 June 2017 (UTC)
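The width-versus-characters question from the posts above can be checked directly. In common C libraries such as glibc, both the field width of "%6s" and the precision of "%.4s" count bytes (code units), not code points, so "café" (5 bytes, 4 code points) pads with one space and a byte-counted precision can cut a code point in half (width_demo is an illustrative wrapper; behavior sketched here is what typical implementations do):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* "caf\xc3\xa9" is "café": 5 bytes but only 4 code points.  On typical
 * implementations the %6s width pads by byte count (one leading space),
 * and the %.4s precision truncates after 4 bytes, splitting the é. */
static void width_demo(char *padded, char *truncated, size_t n)
{
    snprintf(padded, n, "%6s", "caf\xc3\xa9");
    snprintf(truncated, n, "%.4s", "caf\xc3\xa9");
}
```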
- I fudged the printf example so that what is in the article is correct, if a little vague; but it still might not be the best example of the point. I agree it is sort of long-winded. Other folks have been swooping in and making my contributions more succinct, and these are improvements, which I was happy to see.
- As for what was originally intended, UTF-8 was designed literally on the back of a napkin in a diner by Ken Thompson and then coded up by him in a couple of days. So it wouldn't be surprising if he didn't think of everything, and UTF-8 turned out even better than he imagined. On the other hand, we're talking about Ken Thompson, so maybe we should not underestimate what he foresaw. Person54 (talk) 16:32, 7 June 2017 (UTC)