Talk:UTF-8

"UTF-8 should be the default choice"

"It was also suggested that UTF-8 should be the default choice of encoding for all Unicode-compliant software."

http://www.theregister.co.uk/2013/10/04/verity_stob_unicode/ is a secondary source, published in a well-known technology magazine, that comments on the UTF-8 Everywhere publication, and there is no indication that it is affiliated with the authors of the latter. Please explain how it does not count as a valid source for the above claim. Thanks. 82.80.119.226 (talk) 17:00, 15 June 2015 (UTC)[reply]

It's a valid source for the above claim. Who cares about the claim? Lots of things have been suggested; for the most part, the suggestions are not interesting enough to note in Wikipedia.--Prosfilaes (talk) 17:30, 15 June 2015 (UTC)[reply]
Notability is not about being a "suggestion", a person or event; it is about importance. The manifesto clearly gained enough pagerank and endorsements by many respected groups. It is the subject of an ongoing revolution and must definitely be referenced by Wikipedia to make Wikipedia better. In fact, I believe it even deserves an entry. I would say link the manifesto, but not TheRegister. 46.117.0.120 (talk) 15:36, 16 June 2015 (UTC)[reply]
What manifesto? "It was also suggested that UTF-8 should be the default choice of encoding for all Unicode-compliant software." says not a thing about any manifesto.--Prosfilaes (talk) 23:36, 16 June 2015 (UTC)[reply]
IMHO, the TheRegister article clearly refers to The Manifesto at http://www.utf8everywhere.org/. Exactly this masterpiece, in my opinion, should be referenced in the article. The only question is where and how. — Preceding unsigned comment added by 46.117.0.120 (talk) 00:47, 17 June 2015 (UTC)[reply]
The problem of referencing http://utf8everywhere.org/ directly is that it is a primary source which is not eligible to be referenced directly on its own. 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)[reply]
So add it not as a reference but as an external link. I am sure you guys understand more about the proper way to link it from Wikipedia than I do. There has to be a solution. If none is suggested, let's revert to the original proposed "it was suggested" formulation. — Preceding unsigned comment added by 46.117.0.120 (talk) 18:02, 17 June 2015 (UTC)[reply]
Many people have attempted to have Wikipedia reference http://www.utf8everywhere.org/ or the proposed boost extensions to make it possible to open files on Windows using UTF-8, but they always get reverted. There appears to be a contingent out there who do not want this opinion to ever be visible and who come down fast and hard against anybody who attempts it. The fact that such an opinion exists and is shared by a large and rapidly growing portion of the programming community is actually pretty important when discussing Unicode encodings, and it should be mentioned in Wikipedia articles, but obviously somebody does not want that and is using the excuse that it is "opinion" to delete it every time. That said, I would prefer a link to utf8everywhere; the Reg article is pretty awful. Spitzak (talk) 18:42, 15 June 2015 (UTC)[reply]
I don't have a strong opinion about UTF8Everywhere. I have a strong opinion about "It was also suggested that ..." without saying who suggested it or why we should care that they suggested it.--Prosfilaes (talk) 23:36, 16 June 2015 (UTC)[reply]
Does "It is recommended by parts of the programming community that ..." sound better? This is a factual information, not opinion based, and is definitely relevant to the subject. Your argument of "why we should care that..." is faulty because it can be applied to any other piece of information virtually everywhere on Wikipedia. Why should we care about IMC recommendations? W3C recommendations? How community recommendations have less weight in this respect? 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)[reply]
"It is recommended by parts of the programming community that ..." Unix is being unreasonable for using more than 7 bits per character. (The Unix-Haters Handbook). That everyone should use UTF-16. That everyone should use ISO-2022. W3C recommendations are community recommendations, and from a known part of the community. I don't know why I should take UTF8Everywhere any more seriously then DEC64.--Prosfilaes (talk) 15:56, 17 June 2015 (UTC)[reply]
The Unix-Haters Handbook, UTF-16, ISO-2022, DEC64 -- if you have an indication that these are notable enough opinions (have secondary sources, etc...) then you are more than welcome to include these opinions in the respective articles. 82.80.119.226 (talk) 16:47, 17 June 2015 (UTC)[reply]
No, Wikipedia is NPOV, not positive POV. Even if we often do, we shouldn't have cites to pro-UTF-8 pages on UTF-8 and pro-UTF-16 page on UTF-16. To the extent that public support for UTF-8 over UTF-16 or vice versa is documented on either page, it should be consistent between pages.--Prosfilaes (talk) 19:26, 21 June 2015 (UTC)[reply]
I agree that it's even more important to link The Manifesto from the UTF-16 page. But, we are now discussing the UTF-8 page. 31.168.83.233 (talk) 14:21, 22 June 2015 (UTC)[reply]
I agree about 'suggested' and we can agree that there's no need for a holy war and no doubt about the manifesto's notability. So, Prosfilaes, what formulation would you suggest? And where to link it from? I would leave the decision to you. Also, whether the manifesto is good or bad or right or wrong is surely irrelevant. I personally like it very much. 46.117.0.120 (talk) 14:02, 17 June 2015 (UTC)[reply]

The "manifesto" is not a bad piece, but it's one group's opinion and not a very compelling case for "UTF-8 everywhere". Each of the three principal encodings has strengths and weaknesses; one should feel free to use the one that best meets the purpose at hand. Since all three are supported by well-vetted open source libraries, one can safely choose at will. The manifesto's main point is that encoding UTF-16 is tricky; but it's no more tricky than UTF-8, and I've seen home-rolled code making mistakes with either. This really is a tempest in a tea-cup. -- Elphion (talk) 01:53, 17 June 2015 (UTC)[reply]

Whether it's compelling or not is not a criterion for including the opinion. As for the manifesto's main point, my interpretation is that using one encoding everywhere simplifies everything, and it then gives a bunch of reasons why this encoding must be UTF-8. It also argues that UTF-16 is a historical mistake. 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)[reply]
What is the criterion for the inclusion of the opinion?--Prosfilaes (talk) 15:56, 17 June 2015 (UTC)[reply]
Notability. Just like anything else on Wikipedia. If you think that the one reference provided isn't enough, then say that, I'll find more. 82.80.119.226 (talk) 16:47, 17 June 2015 (UTC)[reply]
The danger of the "free to choose" approach is addressed in FAQ#4 of the manifesto itself. This is also the reason for 'by default'. — Preceding unsigned comment added by 46.117.0.120 (talk) 00:47, 17 June 2015 (UTC)[reply]
You are repeating exactly what this manifesto argues against. The whole point of the manifesto is that the encodings do NOT have "equal strengths and weaknesses". It states that the encodings are NOT equal, and that any "weakness" of UTF-8 is shared by all the other encodings (i.e. variable-sized "characters" exist in all 3, even UTF-32), thus weakness(UTF-8) is less than weakness(others). Now you can claim the article is wrong, but that is an OPINION. You can't take your opinion and pretend it is a "fact" and then use it to suppress a document with a different opinion. The article is notable and is being referenced a lot. The only thing that would make sense is to link both it and a second article that has some argument about either a strength of non-UTF-8 that does not exist in UTF-8 or a weakness of UTF-8 that does not exist in non-UTF-8. Spitzak (talk) 17:18, 17 June 2015 (UTC)[reply]
I would like more references in this article.
I agree with Elphion that "You should always use UTF-8 everywhere" is not a neutral point of view. However, my understanding of the WP:YESPOV policy is that the proper way to deal with such opinions is to include all verifiable points of view which have sufficient weight.
I hear that something "is being referenced a lot". Are any of those places that reference it usable as a WP:SOURCE? If so, I think we should use those places as references in this article.
Are there places that say that maybe UTF-8 shouldn't always be used everywhere? Are any of those places usable as a WP:SOURCE? If so, I think that opposing opinion(s) should also be mentioned, using those places as references, in this article. --DavidCary (talk) 02:52, 18 June 2015 (UTC)[reply]
I have found only one. This: CONSIDERING UTF-16 BE HARMFUL BE CONSIDERED HARMFUL — Preceding unsigned comment added by 46.117.0.120 (talk) 04:38, 18 June 2015 (UTC)[reply]
unborked link: http://www.siao2.com/2012/04/27/10298345.aspx (SHOULD CONSIDERING UTF-16 BE HARMFUL BE CONSIDERED HARMFUL) -- Elphion (talk) 16:30, 18 June 2015 (UTC))[reply]
According to WP:SOURCE the place I referenced is usable as a source. My formulation is fine according to WP:YESPOV, and if Prosfilaes, or anyone else, thinks that this makes the section biased in any way, they should add a mention of a WP:SOURCE-conforming contradicting opinion, not revert it. As such, can some experienced Wikipedian do that so I would not engage in edit wars with Prosfilaes? Obviously changing the formulation is also an option, but the only alternatives I can currently think of are:
  • It was also suggested that...
  • It is recommended by parts of the programming community that...
  • The UTF-8 Everywhere manifesto published by Pavel Radzivilovsky et al. and supported by parts of the programming community, recommends that...
82.80.119.226 (talk) 10:48, 21 June 2015 (UTC)[reply]
I oppose this. There's simply no evidence this manifesto is notable. Actual use, declarations of intent by notable organizations, maybe something with a massive swell of public support, but not a manifesto by two people with a Register article mention.--Prosfilaes (talk) 19:26, 21 June 2015 (UTC)[reply]
I believe this http://www.reddit.com/r/programming/comments/sy5j0/the_utf8everywhere_manifesto/ and this https://news.ycombinator.com/item?id=3906253 is enough evidence for notability. There's more. — Preceding unsigned comment added by 31.168.83.233 (talk) 14:11, 22 June 2015 (UTC)[reply]
Neither of those sites offer any evidence for notability for anything.--Prosfilaes (talk) 22:55, 22 June 2015 (UTC)[reply]
I believe it does. The huge public debate IS notability. — Preceding unsigned comment added by 46.117.0.120 (talk) 15:40, 23 June 2015 (UTC)[reply]
You are just making up rules, and I doubt you have the right to do so. According to WP:SOURCE it is notable enough. If you think I got something wrong, please cite the specific text.
Actual use? People have already added these comments to the article:
UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.
Possible Citations, I'm sure you could find lots more, but these are good samples showing that the low-level api is in UTF-8: [1] [2] [3] (look for set_title)
There are more: Boost.Locale, GDAL, SQLite (its narrow char interface is UTF-8 even on Windows, and the wide char isn't supported by the VFS), Go, Rust, D, Erlang, Ruby, and the list goes on...
82.80.119.226 (talk) 15:39, 22 June 2015 (UTC)[reply]
WP:WEASEL says that we don't say that "It is recommended by parts of the programming community that..." I don't specifically have cites for why we don't say "Don Smith says that UTF-8 is unnecessary and breaks half the programs on his system", but we don't, because everyone has an opinion.
The rest of that massively misses the point. I'm not arguing about whether UTF-8 is good. Yes, it's frequently used (though that first sentence is problematic; given the rise of Java-driven Androids and .NET and Windows, is it really increasing? What does iOS use?), and yes, we should list examples of major cases where it's being used. The question at hand is non-notable opinions about it.--Prosfilaes (talk) 22:55, 22 June 2015 (UTC)[reply]

The manifesto's FAQ #4 is a good example of what the authors get wrong. The question asks whether any encoding is allowable internally, and the authors respond (reasonably) that they have nothing against this. But then they argue that std::string is used for all kinds of different encodings, and we should just all agree on UTF-8. First, this doesn't answer the question, which in context is talking about UTF-8 vs UTF-16: nobody would use std::string (a sequence of bytes) for UTF-16 -- you would use std::wstring instead. Second, std::string has a precise meaning: it is a sequence of bytes, period. It knows nothing of Unicode, or of the UTF-8 encoding, or how to advance a character, or how to recognize illegal encodings, etc., etc. If you are processing a UTF-8 string, you should use a class that is UTF-8 aware -- one specifically designed for UTF-8 that does know about such things. Passing your UTF-8 string to the outside as std::string on the assumption that the interface will treat it as Unicode is just asking for trouble, and asking the world to adopt that assumption (despite the definition of std::string) is naive and will never come to pass. It will only foster religious wars. If you need to pass a UTF-8 string as std::string to the outside (to the OS or another library), test and document the interface between your UTF-8 string and the outside world for clarity. The manifesto's approach only muddies the waters.

You are utterly misunderstanding the manifesto, spouting out exactly what it is arguing against as though they are "facts", when in fact the whole paper is about how your "facts" are wrong. YES they are saying "use std::string ALWAYS" to store UTF-8 and non-UTF-8 and "maybe UTF-8" and also random streams of bytes. No, you DO NOT need a "UTF-8 aware container", contrary to your claim. Yes, you CAN "pass your UTF-8 string to the outside as a std::string" and it WILL work! Holy crap, the "hello world" program does that! Come on, please at least try to understand what is going on and what the argument is. I don't blame you, something about Unicode turns otherwise good programmers into absolute morons, so you are not alone. This manifesto is the best attempt by some people who actually "get it" to try to convince the boneheaded people who just cannot see how vastly simple this should be.Spitzak (talk) 20:41, 18 June 2015 (UTC)[reply]
Naturally, I disagree with your analysis of my understanding of the manifesto. But if you wish to continue this discussion, I suggest we move it to my talk page, as it contributes little to the question at hand. -- Elphion (talk) 03:07, 19 June 2015 (UTC)[reply]
Not every piece of code dealing with strings is actually involved in processing and validation of text. A file copy program which receives Unicode file names and passes them to file IO routines, would do just fine with a simple byte buffer. If you design a library that accepts strings, the simple, standard and lightweight std::string would do just fine. On the contrary, it would be a mistake to reinvent a new string class and force everyone through your peculiar interface. Of course, if one needs more than just passing strings around, he should then use appropriate text processing tools.31.168.83.233 (talk) 15:03, 22 June 2015 (UTC)[reply]
Yes, that's what I meant above by "processing" a string. If you're just passing something along, there's no need to use specialized classes. But if you are actually tinkering with the UTF internals of a string, rolling your own code to do that on the fly is a likely source of errors and incompatibility. -- Elphion (talk) 20:00, 22 June 2015 (UTC)[reply]
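A minimal C++ sketch of the distinction drawn above (illustrative only; the function names are made up for this example): a routine that merely passes a UTF-8 string along can treat it as an opaque std::string, while anything that touches the encoding's internals, such as counting code points, has to be UTF-8 aware.

// Illustrative sketch, not code from the article or the manifesto.
#include <cstddef>
#include <cstdio>
#include <string>

// Pass-through: a copy/rename-style routine never inspects the encoding;
// std::string as a plain byte container is sufficient.
void log_filename(const std::string& utf8_name) {
    std::fwrite(utf8_name.data(), 1, utf8_name.size(), stdout);
    std::fputc('\n', stdout);
}

// "Tinkering with the internals": counting code points requires knowing that
// UTF-8 continuation bytes match the bit pattern 10xxxxxx (valid UTF-8 assumed).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8)
        if ((c & 0xC0) != 0x80)  // count every byte that is not a continuation byte
            ++n;
    return n;
}

int main() {
    const std::string name = "r\xC3\xA9sum\xC3\xA9.txt";        // "résumé.txt", 12 bytes
    log_filename(name);                                          // bytes pass through untouched
    std::printf("%zu code points\n", count_code_points(name));   // prints "10 code points"
    return 0;
}

Note that log_filename works regardless of whether the bytes are valid UTF-8, which is exactly why the pass-through case needs no special string class.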
In such case I suggest you remove the "FAQ #4" comment and the following argument. It is no longer relevant and the authors are doing okay. — Preceding unsigned comment added by 46.117.0.120 (talk) 22:51, 23 June 2015 (UTC)[reply]

The argument that UTF-8 is the "least weak" of all the encodings is silly. The differences in "weakness" are minuscule and mostly in the eyes of the beholder. As a professional programmer dealing with Unicode, you should know all of these and be prepared to deal with them. The important question instead is which encoding best suits your purposes as a programmer for the task at hand. As long as your end result is a valid Unicode string (in whatever encoding) and you communicate this to whatever interface you're sending it to, nobody should have cause for complaint. The interface may need to alter the string to conform to the expectations on the other side (Windows, e.g.). It is precisely the interfaces where expectations on both sides need to be spelled out. Leaving them to convention is the wrong approach.

I would say that all three major encodings being international standards endorsed by multiple actors is sufficient warrant for using them.

-- Elphion (talk) 18:24, 18 June 2015 (UTC)[reply]

Let's remove the sample code

I don't think it belongs in the article. sverdrup (talk) 12:11, 29 June 2015 (UTC)[reply]

Agreed. Mr. Swordfish (talk) 20:35, 9 July 2015 (UTC)[reply]
One response is not a "discussion". That code is what 50% of the visitors to this page are looking for. Instead I would prefer if the many, many paragraphs that describe in excruciating detail over and over and over again how UTF-8 is encoded were deleted, reduced to that obvious 7-line table that is at the start, and this code was included. Also I note nobody seems to want to delete the code in the UTF-16 article. Spitzak (talk) 18:02, 13 July 2015 (UTC)[reply]
I didn't think it needed much discussion since it's a clear violation of Wikipedia policy. Wikipedia is not a how-to guide. Sample code is inappropriate. Unless you can provide some sort of reasoning why we should make an exception here, it should go.
I seriously doubt that anywhere near 50% of visitors are here to view C code. As for UTF-16, I've never had the occasion to look at that page, but if it includes sample code it should be removed too. Other stuff exists is not necessarily a valid argument. Mr. Swordfish (talk) 20:58, 13 July 2015 (UTC)[reply]
Code can be a very useful tool for explaining computer science matters, so I don't see it a clear violation of policy. Heapsort is a good example. I'm not sure C is the best language for code-based explanation, and I'm not sure that that code was of use to most of the visitors of the page.--Prosfilaes (talk) 23:26, 14 July 2015 (UTC)[reply]
The Heapsort article uses Pseudocode, which is an excellent way of providing an informal high-level description of the operating principle of a computer program or other algorithm. Since pseudocode is intended to be read by humans it is appropriate for an encyclopedic article. Actual code in a specific language - not so much, especially in this case since we are not describing an algorithm or program. Mr. Swordfish (talk) 14:19, 15 July 2015 (UTC)[reply]
Actual code in a specific language is intended to be read by humans, as well. Clear code in an actual language can be clearer than low-level pseudocode, since the meaning is set in stone. In this case... we don't seem to be doing a good job of communicating how UTF-8 works for some people. I don't particularly believe the C code was helping, but I'm not sure deleting it helped make the subject more clear.--Prosfilaes (talk) 22:18, 15 July 2015 (UTC)[reply]
Fair enough. I now see that there are a lot of wikipedia pages with example code (https://en.wikipedia.org/wiki/Category:Articles_with_example_code) and I'm not about to go on a crusade to remove them all. So I guess the question is whether that code example improves the article or not. I don't think it does, and don't really understand the point of including it in the article. Mr. Swordfish (talk) 15:49, 16 July 2015 (UTC)[reply]
If anything you say is true, then the excruciatingly detailed tables that repeat over and over and over again where the bits go (when the first 7-line table is by far the clearest) should be removed as well. They are simply attempts to describe the same program in English. Spitzak (talk) 00:39, 15 July 2015 (UTC)[reply]
I believe the code example helps A LOT and should be restored. The remaining text is nearly useless (except for the 7-line table that some people keep trying to obfuscate). The code actually describes how it works and makes it clear which byte patterns are allowed and which are disallowed; it also shows what to do with invalid byte sequences (turn each byte into a code). Spitzak (talk) 17:32, 27 July 2015 (UTC)[reply]
I suppose we'll just have to disagree on its usefulness - I do not think it adds to the article in any meaningful way. Perhaps other editors can weigh in.
If we are going to restore it, we need to ask where it came from and how we know it is correct. I don't see any sources cited - either it was copied from somewhere ( with possible copyright implications) or an editor wrote it him or herself (original research). So, first we need to reach consensus that it improves the article, and then we'll need to attribute it to a reliable source. Mr. Swordfish (talk) 12:59, 29 July 2015 (UTC)[reply]
I wrote it. Some of it is based on earlier code I wrote for fltk. The code is in the public domain.Spitzak (talk) 17:52, 29 July 2015 (UTC)[reply]
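For readers who arrive here looking for the kind of example being discussed, the following is a minimal decoder sketch; it is neither the removed article code nor the fltk code, just an illustration of the behaviour described above, where every invalid or truncated byte is returned as its own code so that no input can make the decoder fail.

// Illustrative sketch only, not the removed article code.
#include <cstdint>
#include <cstdio>

// Decode one code point starting at p (requires p < end); *len receives the
// number of bytes consumed.  Any invalid byte is consumed alone and returned
// as its own value.
static uint32_t decode_utf8(const unsigned char* p, const unsigned char* end, int* len) {
    unsigned char c = p[0];
    int n;        // expected length of the sequence
    uint32_t cp;  // accumulated code point
    *len = 1;
    if (c < 0x80)      return c;                    // ASCII
    if (c < 0xC2)      return c;                    // stray continuation or overlong lead
    if (c < 0xE0)      { n = 2; cp = c & 0x1F; }
    else if (c < 0xF0) { n = 3; cp = c & 0x0F; }
    else if (c < 0xF5) { n = 4; cp = c & 0x07; }
    else               return c;                    // F5..FF never appear in UTF-8
    if (end - p < n)   return c;                    // truncated at end of buffer
    for (int i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return c;        // missing continuation byte
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    // Overlong forms, surrogates, and values past U+10FFFF are invalid too.
    if ((n == 3 && cp < 0x800) || (n == 4 && cp < 0x10000) ||
        (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return c;
    *len = n;
    return cp;
}

int main() {
    const unsigned char s[] = "\xE2\x82\xAC \xC0\x80 A";   // euro sign, overlong NUL, 'A'
    const unsigned char* p = s;
    const unsigned char* end = s + sizeof(s) - 1;           // exclude the terminating NUL
    while (p < end) {
        int len;
        std::printf("U+%04X\n", (unsigned)decode_utf8(p, end, &len));
        p += len;
    }
    return 0;
}

Run on the sample input it prints U+20AC, U+0020, U+00C0, U+0080, U+0020, U+0041; the overlong C0 80 pair is reported as two separate codes rather than as U+0000.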

Modified UTF-8

In this section it is stated:

"Modified UTF-8 uses the 2-byte overlong encoding of U+0000 (the NUL character), 11000000 10000000 (hex C0 80), rather than 00000000 (hex 00). This allows the byte 00 to be used as a string terminator."

It does not explain why a single byte 00000000 cannot be used as a string terminator. Given that all other bytes beginning with 0 are treated as ASCII, why should 00 be any different?

FreeFlow99 (talk) 11:53, 8 September 2015 (UTC)[reply]

If you use a single null byte (00) to indicate the end of a string, then whenever you encounter 00 you have to decide whether it indicates an actual character (U+0000) in the string or the end of the string itself. The point of the modified scheme is that 00 would never appear in character data, even if the data included U+0000. This has two advantages: (1) 00 does not code characters, so it is freed up for local use as the traditional metacharacter to indicate unambiguously the end of a string, and (2) 00 never appears in encoded data, so that software that chokes on 00 or ignores 00 can still handle data that includes U+0000. The downside, of course, is that you have to avoid software that chokes on the overlong sequence. -- Elphion (talk) 13:34, 8 September 2015 (UTC)[reply]
Thanks for your reply. However I'm not sure I completely understand the merits. If we are talking about a text only file, 00x should never be in a string (paper tape being obsolete), [unless it is used to pad a fixed length field, in which case I understand the point]. What you say would be a problem for data files because bytes aren't just used to store characters but can store values such as integers, reals etc, and these can conflict with control characters; but this problem is not limited to 00x, it applies to all control codes. The only reason I can see to treat NUL separately is that 00x is very common. To avoid ambiguity completely we would need to limit control code usages to text only files, and for data files use a structure eg where each component has an id and a size followed by the actual data, or use some different system with a special bit that is out of reach of any possible data values (conceptually a 'ninth' bit) that flags the 'byte' as a control character. Or have I missed something? FreeFlow99 (talk) 14:54, 8 September 2015 (UTC)[reply]
Your statement "if we are talking about a text only file, 00x should never be in a string" is the misunderstanding. U+0000 is a perfectly valid Unicode character. Unicode does not define its meaning beyond its correspondence with ASCII NUL, and applications use it in a variety of ways. It is in fact common in real data. You should therefore expect character data to contain it, and be prepared to handle it (and preserve it as data). Modified UTF-8 is one way of doing that without giving up the notion of 00 as string terminator. The alternative pretty much requires that string data be accompanied by its length (in bytes or characters) -- which is not a bad idea. -- Elphion (talk) 16:33, 8 September 2015 (UTC)[reply]

Comment from lead

Move the following material to Talk (originally in the lead of the article and since August 2014 commented out)

UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.{{cn}}

Possible Citations, I'm sure you could find lots more, but these are good samples showing that the low-level api is in UTF-8:

https://developer.apple.com/library/mac/qa/qa1173/_index.html

https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text

http://wayland.freedesktop.org/docs/html/protocol-spec-interface-wl_shell_surface.html (look for set_title)

End moved material -- Elphion (talk) 12:35, 10 September 2015 (UTC)[reply]

Misleading caption

Graph indicates that UTF-8 (light blue) exceeded other main encodings of text on the Web, that by 2010 it was nearing 50% prevalent, and up to 85% by August 2015.[2]

While the 85% statement may be true, the graph doesn't indicate any such thing. NotYourFathersOldsmobile (talk) 22:13, 10 September 2015 (UTC)[reply]

Yes, attempts to update the graph to a newer version produced by the same engineer at Google have been reverted because he has not made clear what the copyright is on the new graph.Spitzak (talk) 23:26, 10 September 2015 (UTC)[reply]

where's the source referred to in #WTF-8?

Section WTF-8 says "The source code samples above work this way, for instance." – I can't find these samples, where are they? --Unhammer (talk) 06:25, 24 September 2015 (UTC)[reply]

Huge number of incorrect edits!

112.134.187.248 made a whole lot of mis-informed edits. With any luck he will look here. Sorry, but it was almost impossible to fix without a complete revert, though he was correct exactly once when he used "will" instead of "could"; I will try to keep that.

1. An Indic code point cannot take 6 bytes. He is confusing things with CESU-8, I think, though even that encodes these as 3 bytes each, as they are still in the BMP.

2. You cannot pass ASCII to a function expecting UTF-16. Declaring the arguments to be byte arrays does not help, and I doubt any sane programmer would do that either. Therefore it is true that UTF-16 requires new APIs.

3. Many attempts to claim there is some real-world chance of Asian script being larger in UTF-8 than UTF-16, despite obvious measurements of real documents that show this does not happen. Sorry you are wrong. Find a REAL document on-line that is larger. Stripping all the markup and newlines does not count.

4. Deletion of the fact that invalid byte sequences can be stored in UTF-8, by somehow pretending that they will magically not happen. Sorry, try actually programming before you believe such utter rubbish. Especially odd because some other edits indicate that he things invalid sequences will happen and are somehow a disadvantage of UTF-8.

5. Belief that markup tags using non-ASCII can remove the fact that it makes UTF-8 smaller. Sorry but markup contains far more slashes and angle brackets and spaces and quotes and lots and lots of ASCII-only tags so this will just not happen, no matter how much you wish it could.

6. Claim that invalid sequences of 4 bytes are somehow a problem, while ignoring invalid sequences in UTF-16 and all other invalid sequences in UTF-8. This is despite earlier edits where he basically pretends invalid sequences magically don't happen. Sorry you can't have it both ways.

7. Complete misunderstanding of why multibyte sequences in UTF-16 cause more problems than in UTF-8: because they are RARE. Believe me, NOBODY screws up UTF-8 by "confusing it with ASCII" because they will locate their mistake the very first time a non-ASCII character is used. That is the point of this advantage.

Spitzak (talk) 17:34, 19 October 2015 (UTC)[reply]
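As a concrete illustration of point 7 (a hypothetical example, not taken from the article): a naive "one code unit = one character" count over UTF-16 is correct for all BMP text and only goes wrong on the rare supplementary characters, whereas the same assumption applied to UTF-8 goes wrong on the very first non-ASCII character, so the mistake is found immediately.

// Hypothetical illustration of point 7; both "counts" are just code-unit counts.
#include <cstdio>
#include <string>

int main() {
    // "café": 4 code points.
    const std::u16string utf16 = u"caf\u00E9";   // 4 UTF-16 code units
    const std::string    utf8  = "caf\xC3\xA9";  // 5 UTF-8 bytes

    std::printf("naive UTF-16 count: %zu (correct)\n", utf16.size());  // 4
    std::printf("naive UTF-8 count:  %zu (wrong)\n",  utf8.size());    // 5

    // U+1F600 (a supplementary-plane emoji): the first input that exposes
    // the UCS-2 assumption.
    const std::u16string emoji = u"\U0001F600";
    std::printf("naive UTF-16 count: %zu (wrong only here)\n", emoji.size());  // 2
    return 0;
}

Both counts come straight from size(), i.e. from code units; only the UTF-16 version can stay accidentally correct for a long time.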

Includes response to Spitzak (talk)'s UTF-8 propaganda.
  • 1: It was not an "Indic code point" but an Indic character. Characters are not code points, and see TSCII: the letter "பு" takes 6 bytes in UTF-8 but only one in TSCII. Nice straw-man. "Sorry, try actually programming before you believe such utter rubbish": same goes for you; a character is not necessarily a code point. Try reading Unicode TRs and standards before you even attempt to program using encodings.
  • 2: Well, UTF-8 requires new APIs too, to pass NUL as an argument.
  • 3: Why don't you cite a reliable source? Anecdotes are WP:NOR. And SGML can still use non-ASCII for markup; there is no restriction on this.
  • 4: Try to learn some English first; "Thinks". I programmed, too.
  • 5: "Will just not happen" is normative, not positive. It is your opinion, and Wikipedia is not Facebook.
  • 6: I can understand this. Sorry, I apologize.
  • 7: Normative (opinion). If there is an error, I say that we spot it even with Latin, in UTF-16, if there are too many NULs or too few. It depends on the programmer.
Still, UTF-16 can recover using surrogate code points because surrogates are encoded in a particular order and are paired; if one surrogate is missing, the other is just discarded, and if both are in the right places, they can be used as the starting point for recovery.
Invalid filenames: UTF-8 says that any sequence of more than 4 bytes for a code point is illegal, and therefore it should be discarded; bugs are not to be considered features.
112.134.196.135 (talk) 06:32, 20 October 2015 (UTC)[reply]
"Deletion of the fact that invalid byte sequences can be stored in UTF-8": Invalid sequences can happen ANYWHERE; it is not an advantage for UTF-8.
Interspersed comments do not work at Wikipedia (consider what the page would look like if anyone wanted to reply to one of your points). Therefore I have extracted your responses and posted them above. I have not looked at the issue yet, but the tone of the above responses is what would be expected from a belligerent rather than someone wanting to help build the encyclopedia. Johnuniq (talk) 06:51, 20 October 2015 (UTC)[reply]
Thanks for editing, Johnuniq. The edits by Spitzak include a lot of normative content; haven't you noticed his tone? I am trying to build a good encyclopedia, with less opinion-based information and good neutrality.
1: The number of bytes that a character takes in any Unicode format is unbounded, though I think Unicode permits some limit on the number of combining characters to be stacked on one base character. That TSCII encodes as one codepoint what Unicode does as two is not really relevant to this page.
2: If you want to include NUL in a string, you need an API that will let you do it. If you don't care, you can use NUL-terminated strings. There's absolutely no difference between ASCII and UTF-8 in this regard, and many people use NUL terminated UTF-8 strings.
3: Cites are important where there's a real dispute, but this is citable, I believe, and in practice it is true. I don't see why SGML, which is a generic form of markup instead of markup itself, is relevant here; of course a markup language in theory could use non-ASCII characters, but I've never seen it done, and HTML, TeX and other common forms of ASCII markup are probably a million times more common than any non-ASCII markup.--Prosfilaes (talk) 15:38, 20 October 2015 (UTC)[reply]
5: Bah. It's measurable. You can demand cites, but don't demean other posts by dismissing facts as their opinion.--Prosfilaes (talk) 15:38, 20 October 2015 (UTC)[reply]
1. It is still true that TSCII is a multi-byte (or single-byte, depending on how you view it) encoding, and it still deals with this problem. Even though the 2CP problem is due to Unicode itself, it still applies because TSCII is a multi-byte encoding that encodes the same characters, just more efficiently for a certain script.
2: Yes, they do use null-terminated strings; I can understand that these are used in argv[]. But still, null-terminated strings house a lot of other bugs; UTF-8 does not expose them first, whereas UTF-16 forces programmers to care about NUL problems.
3: Yes, SGML places no restrictions on this; HTML does; this is just what popular document tools do. However, when the Chinese content outweighs the markup tags, it is space-advantageous for UTF-16. I am not saying that this does not happen, but the author's 'often' needs a citation on 'how often'; the citation provided was anecdotal. For example, in a non-fancy document, or one that properly separates content, scripts and stylesheets. 'Sometimes' is a better word.
5. WP:NOR still applies.
To the point 7: People who treat UTF-16 as UCS-2 are similar to people who treat UTF-8 as ASCII. Which editors happen to be the majority? Editors who treat UTF-8 as ASCII, I can say from anecdotes. Right now, hardly any editors have these problems. Blaming UTF-16 for UCS-2 is like blaming UTF-8 for Latin; the blame game is based on obsolete encoding assumptions. Who says "NOBODY"? Weasel?
How exactly is saving the illegal characters an advantage over UTF-16? If it is a stream, both UTF-8 and UTF-16 can carry them. Both standards say to discard invalid code points; you may claim that anything beyond 4 bytes after transformation can still be ignored, but any invalid surrogates can be ignored in UTF-16 as well. I am the same person, again. 112.134.200.102 (talk) 18:39, 20 October 2015 (UTC)
1: The 2CP problem? Sounds like OR to me. Sure as heck doesn't come up in a web search.
2: NUL terminated strings are just fine for many problems; they do not "house a lot of bugs".
3: So your OR wins over a citation? SGML is irrelevant here; there are many forms of markup, and none of the popular ones use characters outside of ASCII.--Prosfilaes (talk) 02:05, 21 October 2015 (UTC)[reply]
1. 2CP problem: the two-code-points-per-character problem. It is not OR; it is a problem mentioned above, and it is very relevant to specific multi-byte (or single-byte) encodings. It was already mentioned here, but someone said it's irrelevant; TSCII is still an encoding, well capable of encoding those "two code points" in a single byte.
2: UTF-16 is likewise fine for me when using non-NUL-terminated strings. What's the problem here? Null-terminated strings still house a lot of bugs.[1][2] If programmers' ignorance is not to be cited, then remove the thing about UCS-2, and all implementation-related bugs/faults as well. Edit: Sorry, someone else removed it, or I removed it; it is no longer there; this point is done. The point was that programmers' ignorance in mistaking UTF-16 for UCS-2 is not a weakness of UTF-16 per se, any more than UTF-8 getting mistaken for some Latin-xxxx is a weakness of UTF-8, especially when we have specific BOMs.
3: My claim was not OR; the author's claim was the OR. It is up to the claimant to prove something. I am not saying that most popular markups do not use ASCII; it is up to the author to prove (or at least cite from a reliable source) that the size of markup is larger than the size of the other-language text, for him to use the word "often" here. "Sometimes" is a better word. Moreover, I can still say that usually a web page is generated on request; the HTML files are not stored there, just text that will be inserted, in a database, with very little markup. It is not about whether it is my OR or not; it is about not giving an anecdote as a citation to show how often this happens, even if you think SGML is irrelevant. 112.134.234.218 (talk) 05:27, 21 October 2015 (UTC)
I removed the one about UTF-8 encoding invalid strings; a UTF-16 processor that can still process invalid sequences can likewise ignore them. It is quite the same as UTF-16; you either have this advantage both ways or neither way. Moreover, Unix filenames/paths are encoding-unaware; they still treat the file paths/names as ASCII, and only the single NUL is a problem; a NUL is not a problem in length-prefixed filenames or other length-aware methods, or methods like double NUL. If what he means by an invalid filename is an invalid character sequence, any encoding-unaware stream processor can handle it (either one that does not discard anything after 4 bytes or after the second ordered surrogate), subject to internal limitations like NUL or NULNUL or others caused by pre-encoding the stream length.
Removed the sentence blaming UTF-16 for inconsistent behaviour with invalid strings without explaining the consistency of invalid sequences in invalid UTF-8 streams (the over-4-byte problem). If there is a standard behaviour that makes UTF-8 more consistent with invalid sequences than UTF-16, please mention the standard. 112.134.188.165 (talk) 16:11, 21 October 2015 (UTC)
Unix filenames do not treat the name like ASCII; they treat it as an arbitrary byte string that does not include 00 or 2F. Perfect for storing UTF-8 (which was designed for it), impossible for storing UTF-16. Your links to complaints about NUL-terminated strings are unconvincing and irrelevant here.--Prosfilaes (talk) 08:17, 25 October 2015 (UTC)[reply]
Prosfilaes: Yes, UNIX just handles them as arbitrary NUL-terminated strings without 00/2F, and Windows handles it differently. These are based on internal limitations. My point is that the 'possibility to store invalid strings' exists in both of them; that's not a UNIX-only bug/feature. He responded as if encoding a broken UTF-8 string were a UTF-8-only feature, while you can have broken streams pretty much anywhere that (1) does not violate internal limitations and (2) is encoding-unaware. I do not have any problem with NUL-terminated filenames per se, but I don't see why a broken byte stream should be considered a feature in one and not in the other. 175.157.233.124 (talk) 15:43, 25 October 2015 (UTC)
Cleaned up the article due to WP:WEASEL ("... suggested that...").
Cleaned up the article, again WP:WEASEL ("...it is considered...") by whom? Weasels?175.157.125.211 (talk) 07:48, 22 October 2015 (UTC)[reply]
112.134.190.56 (talk) 21:12, 21 October 2015 (UTC)[reply]
Cleaned up: "If any partial character is removed the corruption is always recognizable." Recognizable how? UTF-8 is not an integrity-check mechanism; if one byte is missing or altered, it will be processed AS-IS. However, the next byte is processable, and it is already listed there.175.157.94.167 (talk) 09:18, 22 October 2015 (UTC)[reply]
The point you mentioned, '4. Deletion of the fact that invalid byte sequences can be stored in UTF-8, by somehow pretending that they will magically not happen. Sorry, try actually programming before you believe such utter rubbish. Especially odd because some other edits indicate that he things invalid sequences will happen and are somehow a disadvantage of UTF-8.'
I did not delete it. Please, be respectful to others; do not accuse others falsely. Invalid sequences can be stored in anything, especially in encoding-unaware or buggy implementations. I could see where this is coming from, currently, I am from Sri Lanka; am not them, who are probably from somewhere else.
Spitzak: See your edit here: https://en.wikipedia.org/w/index.php?title=UTF-16&type=revision&diff=686526776&oldid=686499664 . JavaScript is NOT Java, and the content you removed says nothing implying failure to support UTF-16; it is just that the distinction was made so that surrogate pairs are counted separately from code points. In fact, it was I who first added the word 'JavaScript' to that article.
Please refrain from using phrases like "any sane programmer", "NOBODY screws" which are examples of No true Scotsman fallacy.175.157.94.167 (talk) 10:07, 22 October 2015 (UTC)[reply]
For edits like 1, please see WP:AGF.
Prosfilaes: I clarified your point in the article; if you think it is too much for an encyclopedia, please feel free to remove it. Thanks :) 175.157.233.124 (talk) 15:54, 25 October 2015 (UTC)[reply]
Prosfilaes: The article on the (Indic-specific) TSCII exists in Wikipedia; that's why I stated it. It works as I mentioned (it is in the table in that article) and it is capable of encoding in one byte what Unicode encodes as two code points for one letter. The standard is here: https://www.iana.org/assignments/charset-reg/TSCII . I am sorry if it is considered OR or rare.