Talk:UTF-8
Double error correction

[[:File:UnicodeGrow2010.png|thumb|360px|Graph indicating that UTF-8 (light blue) exceeded all other encodings of text on the Web in 2007, and that by 2010 it was nearing 50%.<ref name="MarkDavis2010"/> Given that some ASCII (red) pages represent UTF-8 as entities, it is more than half.<ref name="html4_w3c"/>]]

The legend says "This may include pages containing only ASCII but marked as UTF-8. It may also include pages in CP1252 and other encodings mistakenly marked as UTF-8, these are relying on the browser rendering the bytes in errors as the original character set"... but that is not the original idea: we cannot count something "mistakenly marked as UTF-8", even if such pages exist. The point is that there are a lot of ASCII pages with symbols that web browsers map to UTF-8.

PubMed Central, for example, has 3.1 MILLION articles in ASCII that use real UTF-8 through entity encoding. Not one of them is a mistake.

The old text (see thumb here) has a note, <ref name="html4_w3c">, which reads: { { "HTML 4.01 Specification, Section 5 - HTML Document Representation", W3C Recommendation, 24 December 1999. It asserts: "Occasional characters that fall outside this encoding may still be represented by character references. These always refer to the document character set, not the character encoding. (...) Character references are a character encoding-independent (...).". See also Unicode and HTML/Numeric character references.} }

This old text also contained some confusion (!), so I corrected it to: "Many ASCII (red) pages also have some ISO 10646 symbols represented by entities,[ref] which are in the UTF-8 repertoire. That set of pages may be counted as UTF-8 pages."

--Krauss (talk) 22:45, 23 August 2014 (UTC)[reply]

I reverted this as you seem to have failed to understand it.
First, an entity IS NOT UTF-8! Entities contain only ASCII characters such as '&', digits, and ';'. They can be correctly inserted into files that are NOT UTF-8 encoded and are tagged with other encodings.
Marking an ASCII file as UTF-8 is not a mistake. An ASCII file is valid UTF-8. However, since it does not contain any multi-byte characters, it is a bit misleading to say these files are actually "using" UTF-8.
Marking CP1252 as UTF-8 is very common, especially when files are concatenated, and browsers recognize this due to encoding errors. This graph also shows these mis-identified files as UTF-8, but they are not really UTF-8.
Spitzak (talk) 23:58, 23 August 2014 (UTC)[reply]
Sorry about my initial confused text. Now we have another problem here, about the interpretation of W3C standards and statistics.
1. RFC 2047 (MIME Content-Transfer-Encoding), as interpreted for the charset or encoding attributes of HTTP (the content-type header with charset) and HTML4 (meta http-equiv), says what must be interpreted as an "ASCII page" and what as a "UTF-8 page". Your assertion that "an ASCII file is valid UTF-8" is a distortion of these considerations.
2. The W3C standard HTML 4.01 (1999) says that you can add some special symbols (ISO 10646, as expressed by the standard) to an ASCII page by entities. Since before 2007, what all web browsers do when encountering special symbols is replace the entity by a UTF-8 character (rendering the entity as its standard UTF-8 glyph).
3. Statistics: this kind of statistical report must first follow the technical standard's options and variations. These options have concrete consequences that can be relevant to counting web pages. User mistakes may make a good statistical hypothesis, but you must first prove that they exist and that they are relevant... In this case, you must prove that the "user mistake" matters more than the technical standard's options. In an encyclopedia, we do not show an unproven hypothesis, nor an irrelevant one.
--Krauss (talk) 10:23, 24 August 2014 (UTC)[reply]
An ASCII file is valid UTF-8. That's an irrefutable fact. To speak of "its standard UTF-8 glyph" is a category error; UTF-8 doesn't have glyphs, as it's merely a mapping from bytes to Unicode code points.--Prosfilaes (talk) 21:23, 24 August 2014 (UTC)[reply]
To elaborate on the second point above: Krauss is conflating "Unicode" and "UTF-8". They are not the same. A numerical character entity in HTML (e.g., &#355; or &#x0163;) is a way of representing a Unicode code point using only characters in the printable ASCII range. A browser finding such an entity will use the code point number from the entity to determine the Unicode character and will use its font repertoire to attempt to represent the character as a glyph. But this process does not involve UTF-8 encoding -- which is a different way of representing Unicode code points in the HTML byte stream. The ASCII characters of the entity might themselves be encoded in some other scheme: the entity in the stream might be ASCII characters or single-byte UTF-8 characters, or even UTF-16 characters, taking 2 bytes each. But the browser will decode them as ASCII characters first and then, keying on the "&#...;" syntax, use them to determine the code point number in a way that does not involve UTF-8. -- Elphion (talk) 21:58, 24 August 2014 (UTC)[reply]
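
A minimal C++ sketch of this point (illustrative only; the entity and values are just an example): decoding a numeric character reference touches nothing but ASCII characters and integer arithmetic, so UTF-8 never enters into it.

#include <cstdio>
#include <cstdlib>

// Decode the numeric character reference "&#x0163;". Every byte of the
// entity is plain ASCII; the code point comes from parsing hex digits,
// not from UTF-8 decoding.
int main() {
    const char* entity = "&#x0163;";
    long cp = std::strtol(entity + 3, nullptr, 16);     // parse the digits after "&#x"
    std::printf("entity %s -> U+%04lX\n", entity, cp);  // prints U+0163
    // The UTF-8 encoding of U+0163 is the unrelated byte pair 0xC5 0xA3;
    // the browser never sees those bytes when it decodes the entity.
    return 0;
}
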
I agree the problem is that Krauss is confusing "Unicode" with "UTF-8". Sorry I did not figure that out earlier. Spitzak (talk) 23:28, 25 August 2014 (UTC)[reply]
Our job as Wikipedia editors is not to interpret the standards, nor to determine what is and isn't appropriate to count as UTF-8 "usage". That job belongs to the people who write the various publications that we cite as references in our articles. Mark Davis's original post on the Official Google Blog, from whence this graph came and which we (now) correctly cite as its source, doesn't equivocate about the graph's content or meaning. Neither did his previous post on the topic. Davis is clearly a reliable source, even though the posts are on a blog, and we should not be second-guessing his claims. That job belongs to others (or to us, in other venues), and when counter-results are published, we should consider using them. RossPatterson (talk) 11:13, 25 August 2014 (UTC)[reply]
Thanks for finding the original source. [1] clearly states that the graph is not just a count of the encoding id from the html header, but actually examines the text, and thus detects ASCII-only (I would assume this also detects UTF-8 when marked with other encodings, and other encodings like CP1252 even if marked as UTF-8): "We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example... Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8)." The caption needs to be fixed up. Spitzak (talk) 23:28, 25 August 2014 (UTC)[reply]
Krauss nicely points out below Erik van der Poel's methodology at the bottom of Karl Dubost's W3C blog post, which makes it explicit that the UTF-8 counts do not include ASCII: "Some documents come with charset labels declaring iso-8859-1, windows-1252 or even utf-8 when the byte values themselves are never greater than 127. Such documents are pure US-ASCII (if no ISO 2022 escape sequences are encountered).". RossPatterson (talk) 17:24, 27 August 2014 (UTC)[reply]

Wow, a lot of discussion! So many intricate nuances of interpretation; sorry, I imagined something simpler when I started...

  • "Unicod" vs "UTF8": Mark Davis use "Unicod (UTF8...)" in the legend, and later, in the text, express "As you can see, Unicode has (...)". So, for his public, "Unicode" and "UTF8" are near the same thing (only very specialized public fells pain with it). Here, in our discussion, is difficult to understand what the technical-level we must to use.
  • About Mark Davis methodology, etc. no citation, only a vague "Every January, we look at the percentage of the webpages in our index that are in different encodings"...
    But, SEE HERE similar discussion, by those who did the job (the data have been compiled by Erik van der Poel)
  • Trying an answer about glyph discussion: the Wikipedia glyph article is a little bit confuse (let's review!); see W3C use of the term. In a not-so-technical-jargon, or even in the W3C's "loose sense", we can say that there are a set of "standard glyphs/symbols" that are represented in a subset of "UTF-8-like symbols", and are not in ASCII neither CP1252 "symbols"... Regular people see that "ASCII≠CP1252" and "UTF8≠CP1252"... So, even regular people see that "ASCII≠UTF8" in the context of the illustration, and that HTML-entities are maped to something that is a subset of UTF8-like symbols.

Mark Davis does not say anything about HTML entities or about "user mistakes", so, suggestion: let's remove it from the article's text.
--Krauss (talk) 03:33, 26 August 2014 (UTC)[reply]

Neither W3C page you point to says anything about UTF-8, and I don't have a clue where you're getting "UTF-8-like symbols" from. Unicode is the map from code points to symbols and all the associated material; UTF-8 is merely a mapping from bytes to code points. The fact that it can be confusing to some does not make it something we should conflate.--Prosfilaes (talk) 06:00, 27 August 2014 (UTC)[reply]
My text only says "W3C use of the term" (the term "glyph", not the term "UTF-8"), and there (at the linked page) is a table with a "Glyph" column, with images showing the typical symbols. This W3C use of the term "glyph" as a typical symbol conflicts with Wikipedia's thumb illustration with the text "various glyphs representing the typical symbol". Perhaps the W3C is wrong, but since 2010 we have needed Refimprove (Wikipedia's glyph definition "needs additional citations for verification").
About my bolded suggestion, "let's remove it": OK? Do we need to wait or vote, or can we do it now? --Krauss (talk) 12:38, 27 August 2014 (UTC)[reply]
I'm confused. Is Krauss questioning Mark Davis's reliability as a reference for this article? It seems to me that the graphs he presents are entirely appropriate to this article, especially after reading Erik van der Poel's methodology, as described in his 2008-05-08 post at the bottom of Karl Dubost's W3C blog post, which is designed to recognize UTF-8 specifically, not just Unicode in general. RossPatterson (talk) 17:16, 27 August 2014 (UTC)[reply]
Sorry for my English; I regarded Mark Davis and the W3C as reliable sources (!). I think Mark Davis and the W3C write some articles for the general public and other articles for specialized technical people... We cannot confront a "specialized text" with a "loose text", even from the same author: this confrontation will obviously generate some "false evidence of contradiction" (see, e.g., the "Unicode" vs "UTF-8" and "glyph" vs "symbol" debates about correct use of the terms). About Erik van der Poel's explanations, well, that is another discussion, where I agree with your first paragraph about it, "Our job as Wikipedia editors (...)". Now I only want to check the suggestion ("let's remove it from the article's text" above). --Krauss (talk) 11:13, 28 August 2014 (UTC)[reply]

It appears this discussion is moot - the graph image has been proposed for deletion 2 days from now. RossPatterson (talk) 03:41, 30 August 2014 (UTC)[reply]

Thanks, fixed. --Krauss (talk) 17:25, 30 August 2014 (UTC)[reply]

Backward compatibility:

Re: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.

I'm a non-guru struggling with W3C's strong push to UTF-8 in a world of ISO-8859-1 and windows-1252 text editors, but either I have misunderstood this completely or else it is wrong? Seven-bit is the same in ASCII or UTF-8, sure; but in 8-bit extended ASCII (whether "extended" to ISO-8859-1, windows-1252 or whatever), a byte with the MSB "on" is one byte in extended ASCII, two bytes in UTF-8. A parser expecting "8-bit extended ASCII" will treat each of the UTF-8 bytes as a character. Result, misery. Or have I missed something? Wyresider (talk) 19:18, 5 December 2014 (UTC)[reply]

No, it is not a problem unless your software decides to take two things that it thinks are "characters" and insert another byte in between them. In 99.999999% of cases, when reading the bytes in, the bytes with the high bit set will be output unchanged, still in order, and thus the UTF-8 is preserved. You might as well ask how programs handle English text when they don't have any concept of correct spelling and each word is a bunch of bytes that they look at individually. How do the words get read and written when the program does not understand them? It is pretty obvious how it works, and this is why UTF-8 works too. Spitzak (talk) 19:52, 5 December 2014 (UTC)[reply]
Wyresider -- This has been discussed in the article talk archives. Most of the time, if a program doesn't mess with what it doesn't understand, or treats sequences of high-bit-set characters as unanalyzable units, then simple filter etc. programs will often pass non-ASCII UTF-8 characters through unaltered. It's a design feature of UTF-8, intended to lighten the programming load of the transition from single-byte encodings to UTF-8 -- though certainly not an absolute guarantee of backward compatibility... AnonMoos (talk) 14:51, 7 December 2014 (UTC)[reply]
In a world of ISO-8859-1 and Windows-1252 text editors? What world is that? I live in a world where the most-spoken language is Chinese, which clears a billion users alone, and the text editors that come with any remotely recent version of Linux, Windows or Macs, or any version of Android or iOS, support UTF-8 (or at least Unicode). There's no magic button that makes UTF-8 work invariably with systems expecting 8-bit extended ASCII (or Windows-1252 with systems expecting 8-bit extended ASCII to not use C1 control codes 80-9F), but UTF-8 works better than, say, Big5 (which uses sub-128 values as part of multibyte characters) or ISO-2022-JP (which can use escape sequences to define sub-128 values to mean a character set other than ASCII).--Prosfilaes (talk) 13:45, 8 December 2014 (UTC)[reply]
Wikipedia Talk pages are not a forum, but to be crystal clear, ASCII bytes have a high bit of zero and are UTF-8-clean, and anything that has a high bit of one isn't ASCII and will almost certainly have some bytes that will be treated differently in a UTF-8 context. A parser expecting data encoded in Windows codepage 1252 or in ISO 8859-1 isn't parsing ASCII, and won't understand UTF-8 correctly. RossPatterson (talk) 00:09, 9 December 2014 (UTC)[reply]
There are many parsers that don't expect UTF-8 but work perfectly with it. An example is the printf "parser". The only sequence of bytes it will alter starts with an ASCII '%' and contains only ASCII (such as "%0.4f"). All other byte sequences are output unchanged; therefore all multi-byte UTF-8 characters are preserved. Another example is filenames: on Unix, for instance, the only bytes that mean anything are NUL and '/'; all other bytes are considered part of the filename and are not altered. Therefore all UTF-8 multi-byte characters can be parts of filenames. Spitzak (talk) 02:24, 9 December 2014 (UTC)[reply]
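
A minimal C++ sketch of this pass-through argument (illustrative only; the file name is just an example): printf interprets only ASCII '%' sequences, so the high-bit-set bytes of a multi-byte UTF-8 character are copied to the output unchanged and in order.

#include <cstdio>

int main() {
    // "naïve.txt" in UTF-8; the ï is the two bytes 0xC3 0xAF.
    const char* filename = "na\xC3\xAFve.txt";
    // printf scans only for '%' (plain ASCII) and passes every other
    // byte through untouched, so the multi-byte character survives.
    std::printf("copying %s\n", filename);
    return 0;
}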

Many errors

I'm not an expert here, but I am an engineer and I do recognize when I read something that's illogical.

There are 2 tables:

https://en.wikipedia.org/wiki/UTF-8#Description

https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

They cannot both be correct. If the one following #Description is correct, then the one following #Codepage_layout must be wrong.

Elaborating on the table that follows #Description:

1-byte scope: 0xxxxxxx = 7 bits = 128 code points.

2-byte scope: 110xxxxx 10xxxxxx = 11 bits = 2048 additional code points.

3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 additional code points.

4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 additional code points.

The article says: "The next 1,920 characters need two bytes to encode". As shown in the #Description table, 11 bits add 2048 code points, not 1920 code points. The mistake is in thinking that the 1-byte scope and the 2-byte scope overlap so that the 128 code points in the 1-byte scope must be deducted from the 2048 code points in the 2-byte scope. That's wrong. The two scopes do not overlap. They are completely independent of one another.

The text following the #Codepage_layout table says: "Orange cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add." This implies there's a scope that looks like this:

2-byte scope: 01111111 10xxxxxx = 6 bits = 64 additional code points.

While that's possible, it conflicts with the #Description table. These discrepancies seem pretty serious to me. So serious that they cast doubt on the entire article.

MarkFilipak (talk) 03:13, 23 February 2015 (UTC)[reply]

11000001 10000000 encodes the same value as 01000000. So, yes, they do overlap.--Prosfilaes (talk) 03:24, 23 February 2015 (UTC)[reply]
Huh? The scopes don't overlap. Perhaps you mean that they map to the same glyph? Are you sure? I don't know, because I've not studied the subject. If this is a logical issue for me, it's probably a logical issue for others. Perhaps a section addressing this issue is appropriate, eh?
Also, what about the "Orange cells" text and the 2-byte scope I've added to be consistent? That scope conflicts with the other table. Do you have a comment about that? Thank you. --MarkFilipak (talk) 03:53, 23 February 2015 (UTC)[reply]
Perhaps this is what's needed. What do you think?
1-byte scope: 0xxxxxxx = 7 bits = 128 code points.
2-byte scope: 1100000x 10xxxxxx = 7 bits = 128 alias code points that map to the same points as 0xxxxxxx.
2-byte scope: 1100001x 10xxxxxx = 7 bits = 128 additional code points.
2-byte scope: 110001xx 10xxxxxx = 8 bits = 256 additional code points.
2-byte scope: 11001xxx 10xxxxxx = 9 bits = 512 additional code points.
2-byte scope: 1101xxxx 10xxxxxx = 10 bits = 1024 additional code points.
3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 additional code points.
4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 additional code points.
--MarkFilipak (talk) 05:54, 23 February 2015 (UTC)[reply]
For the first question, it seems you don't understand "code points" the way the article means. "Code points" here refers to Unicode code points. The Unicode code points are better described in the Plane (Unicode) article. In the UTF-8 encoding, the Unicode code points (as binary numbers) are directly mapped to the x's in this table:
1-byte scope: 0xxxxxxx = 7 bits = 128 possible values.
2-byte scope: 110xxxxx 10xxxxxx = 11 bits = 2048 possible values.
3-byte scope: 1110xxxx 10xxxxxx 10xxxxxx = 16 bits = 65536 possible values.
4-byte scope: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx = 21 bits = 2097152 possible values.
That means that with the 2-byte scheme you could encode the first 2048 code points, but you are not allowed to encode the first 128 code points, as described in the Overlong encodings section. Similarly, it would be possible to encode all of the first 65536 code points in the 3-byte scheme, but you are only allowed to use the 3-byte scheme from the 2049th code point. And the 4-byte scheme is used from the 65537th to the 1114112th (the last one) code point.
For your second question, continuation bytes (starting with 10) are only allowed after start bytes (starting with 11), not after "ASCII bytes" (starting with 0). The "ASCII bytes" are only used in the 1-byte scope. Boivie (talk) 10:58, 23 February 2015 (UTC)[reply]
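
A quick arithmetic sketch of those ranges in C++ (illustrative only; it just reproduces the counts above): each length adds the code points its payload bits can hold, minus the ones already covered by shorter, overlong-forbidden forms.

#include <cstdio>

int main() {
    std::printf("1 byte : %d (U+0000..U+007F)\n", 1 << 7);                   // 128
    std::printf("2 bytes: %d (2^11 - 2^7)\n", (1 << 11) - (1 << 7));         // 1920
    std::printf("3 bytes: %d (2^16 - 2^11)\n", (1 << 16) - (1 << 11));       // 63488
    std::printf("4 bytes: %d (U+10000..U+10FFFF)\n", 0x110000 - (1 << 16));  // 1048576
    return 0;
}
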
Thanks for your reply. What you wrote is inconsistent with what Prosfilaes wrote, to wit: "11000001 10000000 encodes the same value as 01000000." Applying the principle of Overlong encodings, it seems to me that "11000001 10000000" is an overlong encoding (i.e., the encoding is required to be "01000000"), therefore, unless I misunderstand the principle of overlong encoding, what Prosfilaes wrote about "11000001 10000000" is wrong. I'll let you two work it out between you.
It occurs to me that the "1100000x 10xxxxxx" scope could therefore be documented as follows:
2-byte scope: 1100000x 10xxxxxx = 7 bits = 128 illegal codings (see: Overlong encodings).
Should it be so documented? Would that be helpful?
Look, I don't want to be a pest, but this article seems inconsistent and to lack comprehensiveness. I have notions, not facts, so I can't "correct" the article. I invite all contributors who do have the facts to consider what I've written. I will continue to contribute critical commentary if encouraged to do so, but my lack of primary knowledge prohibits me from making direct edits on the source document. Regards --MarkFilipak (talk) 15:12, 23 February 2015 (UTC)[reply]
I see nothing wrong with Prosfilaes' comment. Using the 2-byte scheme, 11000001 10000000 would decode to code point 1000000 (binary), even if it would be the wrong way to encode it. I am also not sure it would be helpful to include illegal byte sequences in the first table under the Description header. It is clearly stated in the table from which code point to which code point each scheme should be used. The purpose of the table seems to be to show how to encode each code point, not to show how not to encode something. Boivie (talk) 17:16, 23 February 2015 (UTC)[reply]
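
A small C++ sketch of that decoding (illustrative only; a real decoder would reject the sequence): mechanically applying the 2-byte formula to 11000001 10000000 yields the same value as the 1-byte form 01000000, which is exactly why the longer form is ruled out as overlong.

#include <cstdio>

int main() {
    unsigned char b1 = 0xC1, b2 = 0x80;                // 11000001 10000000
    unsigned cp = ((b1 & 0x1Fu) << 6) | (b2 & 0x3Fu);  // 5 payload bits + 6 payload bits
    // Prints U+0040 ('@'), reachable with a single byte, so this
    // 2-byte sequence is an invalid overlong encoding.
    std::printf("decodes to U+%04X; overlong: %s\n", cp, cp < 0x80 ? "yes" : "no");
    return 0;
}
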
1, Boivie, you wrote, "I see nothing wrong with Prosfilaes' comment." Assuming that you agree that "11000001 10000000" is overlong, and that overlong encodings "are not valid UTF-8 representations of the code point", then you must agree that "11000001 10000000" is invalid. How can what Prosfilaes wrote, "11000001 10000000 encodes the same value as 01000000," be correct if it's invalid? If it's invalid, then "11000001 10000000" doesn't encode any Unicode code point. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)[reply]
2, Regarding whether invalid encodings should be shown as invalid so as to clarify the issue in readers' minds, I ask: What's wrong with that? I assume you'd like to make the article as easily understood as possible. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)[reply]
3, Regarding the Orange cells quote: "Orange cells with a large dot are continuation bytes. The hexadecimal number shown after a "+" plus sign is the value of the 6 bits they add", it is vague and misleading because,
3.1, those cells don't add 6 bits, they cause a whole 2nd byte to be added, and
3.2, they don't actually add because the 1st byte ("0xxxxxxx") doesn't survive the addition -- it's completely replaced.
Describing the transition from
this: 00000000, 00000001, 00000010, ... 01111111, to
this: 11000010 10000000, 11000010 10000001, 11000010 10000010, ... 11000010 10111111,
as resulting from "the 6 bits they add" is (lacking the detail I supply in the 2 preceding sentences) going to confuse or mislead almost all readers. It misled me. Now that I understand the process architecture, I can interpret "the 6 bits they add" as sort of a metaphorical statement, but there is a better way.
My experience as an engineer and documentarian is to simply show the mapping from inputs to outputs (encodings to Unicode code points in this case) and trust that readers will see the patterns. Trying to explain the processes without showing the process is not the best way. I can supply a table that explicitly shows the mappings which you guys can approve or reject, but I need reassurance up front that what I produce will be considered. If not open to such consideration, I'll bid you adieu and move on to other aspects of my life. Comment? --MarkFilipak (talk) 20:09, 23 February 2015 (UTC)[reply]
¢ U+00A2 is encoded as C2 A2, and if you look in square C2 you find 0080, and in square A2 you find +22. If, in hexadecimal, you add the continuation byte's +22 to the start byte's 0080, you get 00A2, which is the code point we started with. So the start byte gives the first bits, and the continuation byte gives the last six bits of the code point.
I have no idea why a transition from the 1-byte scheme to the 2-byte scheme would be at all relevant in that figure. Boivie (talk) 21:02, 23 February 2015 (UTC)[reply]
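
The same table arithmetic as a small C++ sketch (illustrative only), following the cell values described above:

#include <cstdio>

int main() {
    unsigned char start = 0xC2, cont = 0xA2;
    unsigned base = (start & 0x1Fu) << 6;  // 0x0080, the value shown in square C2
    unsigned add  = cont & 0x3Fu;          // 0x22, the "+22" shown in square A2
    std::printf("U+%04X\n", base + add);   // prints U+00A2, the cent sign
    return 0;
}
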
Thank you for the explanation. --MarkFilipak (talk) 21:14, 23 February 2015 (UTC)[reply]
"Orange cells with a large dot are continuation bytes..." "White cells are the start bytes for a sequence of multiple bytes". Duh! You mean that the Orange cells aren't part of a "sequence of multiple bytes"? This article is awful and you guys just don't get it. I'm not going to waste my time arguing. I'm outta here. Bye. --MarkFilipak (talk) 21:25, 23 February 2015 (UTC)[reply]
The tables are correct. The "scopes" as you call them do overlap. Every one of the lengths can encode a range of code points starting at zero, so 100% of the smaller length is overlapped. However, the UTF-8 definition further states that when there is this overlap, only the shorter version is valid. The longer version is called an "overlong encoding" and that sequence of bytes must be considered an error. So the 1-byte sequences can encode 2^7 code points, or 128. The 2-byte sequences have 11 bits and thus appear to encode 2^11 code points, or 2048, but exactly 128 of these are overlong because a 1-byte version exists, leaving 2048-128 = 1920, just as the article says. In addition, two of the lead bytes for 2-byte sequences can *only* start an overlong encoding, so those bytes (C0, C1) can never appear in valid UTF-8 and are therefore colored red in the byte table. Spitzak (talk) 20:02, 23 February 2015 (UTC)[reply]
Thank you for the explanation. --MarkFilipak (talk) 20:13, 23 February 2015 (UTC)[reply]

Backward compatibility II

It rates a mention that UTF-8 byte sequences never use a zero byte except for the ASCII NUL. This means that processors that expect NUL-terminated character strings (e.g., the C and C++ string libraries) can cope with UTF-8.

Paul Murray (talk) 03:56, 23 May 2015 (UTC)[reply]
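
A minimal C++ sketch of this point (illustrative only; the strings are just examples): UTF-8 yields a 0x00 byte only for U+0000 itself, so strlen sees the whole string, while a UTF-16 encoding of the same text is full of zero bytes that truncate it.

#include <cstdio>
#include <cstring>

int main() {
    const char utf8[] = "caf\xC3\xA9";  // "café" in UTF-8: no zero bytes
    std::printf("strlen(utf8)  = %zu\n", std::strlen(utf8));     // 5
    const char utf16le[] = {'c', 0, 'a', 0, 'f', 0, '\xE9', 0, 0, 0};  // "café" in UTF-16LE
    std::printf("strlen(utf16) = %zu\n", std::strlen(utf16le));  // 1: stops at the first 0x00
    return 0;
}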

ASCII bytes always represent themselves. I don't think we need to belabor that for \0.--Prosfilaes (talk) 15:08, 23 May 2015 (UTC)[reply]
Backward compatibility with ASCII is mentioned in the very first paragraph. Spitzak (talk) 05:11, 31 May 2015 (UTC)[reply]
Should this UTF-8 article explicitly mention that UTF-8 is compatible with software and hardware designed to expect null-terminated strings terminated by a single zero byte, while UTF-16 is incompatible? Yes, I agree with Paul Murray. --DavidCary (talk) 02:23, 18 June 2015 (UTC)[reply]

"UTF-8 should be the default choice"

"It was also suggested that UTF-8 should be the default choice of encoding for all Unicode-compliant software."

http://www.theregister.co.uk/2013/10/04/verity_stob_unicode/ is a secondary source, published in a well-known technology magazine, that comments on the UTF-8 Everywhere publication, and there is no indication that it is affiliated with the authors of the latter. Please explain how it does not count as a valid source for the above claim. Thanks. 82.80.119.226 (talk) 17:00, 15 June 2015 (UTC)[reply]

It's a valid source for the above claim. Who cares about the claim? Lots of things have been suggested; for the most part, the suggestions are not interesting enough to note in Wikipedia.--Prosfilaes (talk) 17:30, 15 June 2015 (UTC)[reply]
Notability is not about being a "suggestion", a person, or an event; it is about importance. The manifesto has clearly gained enough pagerank and endorsements by many respected groups. It is the subject of an ongoing revolution and should definitely be referenced by Wikipedia to make Wikipedia better. In fact, I believe it even deserves an entry. I would say link the manifesto, but not The Register. 46.117.0.120 (talk) 15:36, 16 June 2015 (UTC)[reply]
What manifesto? "It was also suggested that UTF-8 should be the default choice of encoding for all Unicode-compliant software." says not a thing about any manifesto.--Prosfilaes (talk) 23:36, 16 June 2015 (UTC)[reply]
IMHO, the TheRegister article clearly refers to The Manifesto at http://www.utf8everywhere.org/. Exactly this masterpiece, in my opinion, should be referenced in the article. The only question is where and how. — Preceding unsigned comment added by 46.117.0.120 (talk) 00:47, 17 June 2015 (UTC)[reply]
The problem of referencing http://utf8everywhere.org/ directly is that it is a primary source which is not eligible to be referenced directly on its own. 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)[reply]
So add it not as a reference but as an external link. I am sure you guys understand more about the proper way to link it from Wikipedia than I do. There has to be a solution. If none is suggested, let's revert to the original proposed "it was suggested" formulation. — Preceding unsigned comment added by 46.117.0.120 (talk) 18:02, 17 June 2015 (UTC)[reply]
Many people have attempted to have Wikipedia reference http://www.utf8everywhere.org/ or the proposed Boost extensions to make it possible to open files on Windows using UTF-8, but they always get reverted; there appears to be a contingent out there who do not want this opinion to ever be visible and who come down fast and hard against anybody who attempts it. The fact that such an opinion exists and is shared by a large and rapidly growing portion of the programming community is actually pretty important when discussing Unicode encodings, and it should be mentioned in Wikipedia articles, but obviously somebody does not want that and is using the excuse that it is "opinion" to delete it every time. That said, I would prefer a link to utf8everywhere; the Reg article is pretty awful. Spitzak (talk) 18:42, 15 June 2015 (UTC)[reply]
I don't have a strong opinion about UTF8Everywhere. I have a strong opinion about "It was also suggested that ..." without saying who suggested it or why we should care that they suggested it.--Prosfilaes (talk) 23:36, 16 June 2015 (UTC)[reply]
Does "It is recommended by parts of the programming community that ..." sound better? This is a factual information, not opinion based, and is definitely relevant to the subject. Your argument of "why we should care that..." is faulty because it can be applied to any other piece of information virtually everywhere on Wikipedia. Why should we care about IMC recommendations? W3C recommendations? How community recommendations have less weight in this respect? 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)[reply]
"It is recommended by parts of the programming community that ..." Unix is being unreasonable for using more than 7 bits per character. (The Unix-Haters Handbook). That everyone should use UTF-16. That everyone should use ISO-2022. W3C recommendations are community recommendations, and from a known part of the community. I don't know why I should take UTF8Everywhere any more seriously then DEC64.--Prosfilaes (talk) 15:56, 17 June 2015 (UTC)[reply]
The Unix-Haters Handbook, UTF-16, ISO-2022, DEC64 -- if you have an indication that these are notable enough opinions (have secondary sources, etc...) then you are more than welcome to include these opinions in the respective articles. 82.80.119.226 (talk) 16:47, 17 June 2015 (UTC)[reply]
No, Wikipedia is NPOV, not positive POV. Even if we often do, we shouldn't have cites to pro-UTF-8 pages on UTF-8 and pro-UTF-16 pages on UTF-16. To the extent that public support for UTF-8 over UTF-16 or vice versa is documented on either page, it should be consistent between pages.--Prosfilaes (talk) 19:26, 21 June 2015 (UTC)[reply]
I agree that it's even more important to link The Manifesto from the UTF-16 page. But, we are now discussing the UTF-8 page. 31.168.83.233 (talk) 14:21, 22 June 2015 (UTC)[reply]
I agree about 'suggested', and we can agree that there's no need for a holy war and no doubt about the manifesto's notability. So, Prosfilaes, what formulation would you suggest? And where should it be linked from? I would leave the decision to you. Also, whether the manifesto is good or bad, right or wrong, is surely irrelevant. I personally like it very much. 46.117.0.120 (talk) 14:02, 17 June 2015 (UTC)[reply]

The "manifesto" is not a bad piece, but it's one group's opinion and not a very compelling case for "UTF-8 everywhere". Each of the three principal encodings has strengths and weaknesses; one should feel free to use the one that best meets the purpose at hand. Since all three are supported by well-vetted open source libraries, one can safely choose at will. The manifesto's main point is that encoding UTF-16 is tricky; but it's no more tricky than UTF-8, and I've seen home-rolled code making mistakes with either. This really is a tempest in a tea-cup. -- Elphion (talk) 01:53, 17 June 2015 (UTC)[reply]

Whether it's compelling or not is not a criterion for the inclusion of the opinion. And as for the manifesto's main point, my interpretation is that using one encoding everywhere simplifies everything, and it then gives a bunch of reasons why this encoding must be UTF-8. It also argues that UTF-16 is a historical mistake. 82.80.119.226 (talk) 10:26, 17 June 2015 (UTC)[reply]
What is the criterion for the inclusion of the opinion?--Prosfilaes (talk) 15:56, 17 June 2015 (UTC)[reply]
Notability. Just like anything else on Wikipedia. If you think that the one reference provided isn't enough, then say that, I'll find more. 82.80.119.226 (talk) 16:47, 17 June 2015 (UTC)[reply]
The danger of "free to choose" approach is addressed in FAQ#4 of the manifesto itself. This is also the reason for 'by default'. — Preceding unsigned comment added by 46.117.0.120 (talk) 00:47, 17 June 2015 (UTC)[reply]
You are repeating exactly what this manifesto argues against. The whole point of the manifesto is that the encodings do NOT have "equal strengths and weaknesses". It states that the encodings are NOT equal, and that any "weakness" of UTF-8 is shared by all the other encodings (i.e., variable-sized "characters" exist in all three, even UTF-32); thus weakness(UTF-8) is less than weakness(others). Now you can claim the article is wrong, but that is an OPINION. You can't take your opinion, pretend it is a "fact", and then use it to suppress a document with a different opinion. The article is notable and is being referenced a lot. The only thing that would make sense is to link both it and a second article that argues for either a strength of non-UTF-8 that does not exist in UTF-8 or a weakness of UTF-8 that does not exist in non-UTF-8. Spitzak (talk) 17:18, 17 June 2015 (UTC)[reply]
I would like more references in this article.
I agree with Elphion that "You should always use UTF-8 everywhere" is not a neutral point of view. However, my understanding of the WP:YESPOV policy is that the proper way to deal with such opinions is to include all verifiable points of view which have sufficient weight.
I hear that something "is being referenced a lot". Are any of those places that reference it usable as a WP:SOURCE? If so, I think we should use those places as references in this article.
Are there places that say that maybe UTF-8 shouldn't always be used everywhere? Are any of those places usable as a WP:SOURCE? If so, I think that opposing opinion(s) should also be mentioned, using those places as references, in this article. --DavidCary (talk) 02:52, 18 June 2015 (UTC)[reply]
I have found only one. This: CONSIDERING UTF-16 BE HARMFUL BE CONSIDERED HARMFUL — Preceding unsigned comment added by 46.117.0.120 (talk) 04:38, 18 June 2015 (UTC)[reply]
unborked link: http://www.siao2.com/2012/04/27/10298345.aspx (SHOULD CONSIDERING UTF-16 BE HARMFUL BE CONSIDERED HARMFUL) -- Elphion (talk) 16:30, 18 June 2015 (UTC)[reply]
According to WP:SOURCE, the place I referenced is usable as a source. My formulation is fine according to WP:YESPOV, and if Prosfilaes, or anyone else, thinks that this makes the section biased in any way, they should add a mention of a WP:SOURCE-conforming contradicting opinion, not revert it. As such, can some experienced Wikipedian do that, so that I don't engage in edit wars with Prosfilaes? Obviously changing the formulation is also an option, but the only alternatives I can currently think of are:
  • It was also suggested that...
  • It is recommended by parts of the programming community that...
  • The UTF-8 Everywhere manifesto, published by Pavel Radzivilovsky et al. and supported by parts of the programming community, recommends that...
82.80.119.226 (talk) 10:48, 21 June 2015 (UTC)[reply]
I oppose this. There's simply no evidence this manifesto is notable. Actual use, declarations of intent by notable organizations, maybe something with a massive swell of public support, but not a manifesto by two people with a Register article mention.--Prosfilaes (talk) 19:26, 21 June 2015 (UTC)[reply]
I believe this http://www.reddit.com/r/programming/comments/sy5j0/the_utf8everywhere_manifesto/ and this https://news.ycombinator.com/item?id=3906253 are enough evidence of notability. There's more. — Preceding unsigned comment added by 31.168.83.233 (talk) 14:11, 22 June 2015 (UTC)[reply]

The manifesto's FAQ #4 is a good example of what the authors get wrong. The question asks whether any encoding is allowable internally, and the authors respond (reasonably) that they have nothing against this. But then they argue that std::string is used for all kinds of different encodings, and we should just all agree on UTF-8. First, this doesn't answer the question, which in context is talking about UTF-8 vs UTF-16: nobody would use std::string (a sequence of bytes) for UTF-16 -- you would use std::wstring instead. Second, std::string has a precise meaning: it is a sequence of bytes, period. It knows nothing of Unicode, or of the UTF-8 encoding, or how to advance a character, or how to recognize illegal encodings, etc., etc. If you are processing a UTF-8 string, you should use a class that is UTF-8 aware -- one specifically designed for UTF-8 that does know about such things. Passing your UTF-8 string to the outside as std::string on the assumption that the interface will treat it as Unicode is just asking for trouble, and asking the world to adopt that assumption (despite the definition of std::string) is naive and will never come to pass. It will only foster religious wars. If you need to pass a UTF-8 string as std::string to the outside (to the OS or another library), test and document the interface between your UTF-8 string and the outside world for clarity. The manifesto's approach only muddies the waters.

You are utterly misunderstanding the manifesto, spouting out exactly what it is arguing against as though they were "facts", when in fact the whole paper is about how your "facts" are wrong. YES, they are saying "use std::string ALWAYS" to store UTF-8 and non-UTF-8 and "maybe UTF-8" and also random streams of bytes. No, you DO NOT need a "UTF-8 aware container", contrary to your claim. Yes, you CAN "pass your UTF-8 string to the outside as a std::string" and it WILL work! Holy crap, the "hello world" program does that! Come on, please at least try to understand what is going on and what the argument is. I don't blame you; something about Unicode turns otherwise good programmers into absolute morons, so you are not alone. This manifesto is the best attempt by some people who actually "get it" to convince the boneheaded people who just cannot see how vastly simple this should be. Spitzak (talk) 20:41, 18 June 2015 (UTC)[reply]
Naturally, I disagree with your analysis of my understanding of the manifesto. But if you wish to continue this discussion, I suggest we move it to my talk page, as it contributes little to the question at hand. -- Elphion (talk) 03:07, 19 June 2015 (UTC)[reply]
Not every piece of code dealing with strings is actually involved in processing and validation of text. A file copy program which receives Unicode file names and passes them to file IO routines would do just fine with a simple byte buffer. If you design a library that accepts strings, the simple, standard and lightweight std::string would do just fine. By contrast, it would be a mistake to reinvent a new string class and force everyone through your peculiar interface. Of course, if one needs more than just passing strings around, one should then use appropriate text processing tools. 31.168.83.233 (talk) 15:03, 22 June 2015 (UTC)[reply]
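
A minimal C++ sketch of that byte-buffer view (a hypothetical helper, illustrative only): the copy never inspects the bytes of the names or of the data, so any UTF-8 content passes through untouched.

#include <fstream>
#include <string>

// Copies a file without ever interpreting the bytes of the names or the
// contents; UTF-8 (valid or not) survives because nothing is altered.
void copy_file(const std::string& from, const std::string& to) {
    std::ifstream in(from, std::ios::binary);
    std::ofstream out(to, std::ios::binary);
    out << in.rdbuf();
}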

The argument that UTF-8 is the "least weak" of all the encodings is silly. The differences in "weakness" are minuscule and mostly in the eyes of the beholder. As a professional programmer dealing with Unicode, you should know all of these and be prepared to deal with them. The important question instead is which encoding best suits your purposes as a programmer for the task at hand. As long as your end result is a valid Unicode string (in whatever encoding) and you communicate this to whatever interface you're sending it to, nobody should have cause for complaint. The interface may need to alter the string to conform to the expectations on the other side (Windows, e.g.). It is precisely the interfaces where expectations on both sides need to be spelled out. Leaving them to convention is the wrong approach.

I would say that all three major encodings being international standards endorsed by multiple actors is sufficient warrant for using them.

-- Elphion (talk) 18:24, 18 June 2015 (UTC)[reply]