Talk:Byte order mark

WikiProject Computing / Software (Rated C-class, Low-importance)
This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as Low-importance on the project's importance scale.
This article is supported by WikiProject Software (marked as Mid-importance).

Misc

Detailed discussion of BOM does not add to understanding of endianness, and BOM can be taken as a separate concept, so I've moved it back to its own article.

It really was messy in the endianness article, especially as BOM has its own category links, external links, and the like.

--Pengo 00:52, 27 Oct 2004 (UTC)

edits by Cherlin

Some of these edits seem rather dodgy to me.

used-->misused : you claim that using the BOM to mark text as being in a UTF format is misuse, yet http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section, "Byte Order Mark (BOM)" heading) states that the byte sequence may be used to indicate both byte order and character set.

"contrary to its definition" : you claim that use of the BOM on UTF-8 is contrary to its definition, yet http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section, "Byte Order Mark (BOM)" heading) says otherwise.

FF FE 00 00-->00 00 FF FE (already reverted) : encoding the code point FEFF in little-endian UTF-32 would give FF FE 00 00, as was in the original, not 00 00 FF FE as your edit states. Furthermore, the table that was there before your edit exactly corresponds to the information given in http://www.unicode.org/unicode/uni2book/ch13.pdf ("specials" section, "Byte Order Mark (BOM)" heading).

Unless I see good justification for these edits, I will be reverting the two that I have not already reverted. Plugwash 16:13, 24 Dec 2004 (UTC)

It is now two days since you made the edits and you have not responded. Furthermore, I find you to be a very new contributor who has got into trouble elsewhere and made few other edits. I am therefore reverting the rest of the edits you made to this page. Plugwash 02:23, 27 Dec 2004 (UTC)
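
For anyone who wants to check the byte sequences discussed above without a hex editor, here is a minimal Python sketch. It only encodes the single code point U+FEFF in each encoding form; the variable names are illustrative and not taken from the article.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the code point used as the byte order mark

# Encode the single code point in each encoding form and keep the raw bytes.
# The "-be"/"-le" codec names do not prepend a BOM of their own.
encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"]
bom_bytes = {name: BOM.encode(name) for name in encodings}

for name, raw in bom_bytes.items():
    print(f"{name:10} {raw.hex(' ').upper()}")

# Little-endian UTF-32 puts the low byte first: FF FE 00 00.
assert bom_bytes["utf-32-le"] == codecs.BOM_UTF32_LE == b"\xff\xfe\x00\x00"
```

The standard library's `codecs.BOM_*` constants hold the same sequences, so they can serve as a cross-check.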

Concerning UTF-16 big endian vs little endian

I have noticed that the Python interpreter reverses the byte order of UTF-16 big endian and little endian as compared to what is actually in the Unicode standard when given invalid input. When Python's codecs module is used to read UTF-8 text in from a file and write UTF-16 text out to another file, and the original UTF-8 file begins with the non-character U+FFFE (encoded as EF BF BE), the non-character is accepted as if it were the byte order mark U+FEFF and the resulting UTF-16 file has the opposite byte order of what was requested. I observed this on multiple platforms and Python versions.

The point is, if you are having trouble with the byte order of UTF-16 text, check your libraries/tools for problems, and verify everything using hexadecimal viewers. You may find incorrect assumptions are being made in your tools or libraries.

Canistota (talk) 14:47, 12 March 2009 (UTC)
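
A small Python 3 sketch of why such a file can look byte-swapped. This is only one possible explanation and does not reproduce the exact codecs-module scenario reported above: it shows that the noncharacter U+FFFE, written as little-endian UTF-16, produces the same two bytes as a big-endian BOM.

```python
import codecs

raw_utf8 = b"\xef\xbf\xbe"        # UTF-8 bytes of the noncharacter U+FFFE
text = raw_utf8.decode("utf-8")   # Python 3 decodes this without raising
assert text == "\ufffe"

# Encoded as little-endian UTF-16 (the "-le" codec adds no BOM itself),
# U+FFFE becomes FE FF, which is byte-for-byte a big-endian BOM.
as_le = text.encode("utf-16-le")
looks_like_be_bom = as_le == codecs.BOM_UTF16_BE

print(as_le.hex(" ").upper(), looks_like_be_bom)  # FE FF True
```

A hex viewer on the output file would therefore show FE FF at the start even though the rest of the text is little-endian, which is the kind of mismatch worth checking for.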

Byte Order Mark in UTF-8

Does anyone know why Windows software likes to put a BOM at the front of UTF-8 files? Isn't it true that the order is unambiguous, and thus it does nothing for any endianness problems? Is it simply a way of flagging a file as containing UTF-8 instead of ASCII? -R. S. Shaw 23:38, 5 Jun 2005 (UTC)

Yeah, it's simply used to mark the file as being UTF-8 rather than the system's legacy encoding. Plugwash 00:25, 6 Jun 2005 (UTC)
Whenever you save a file as UTF-8 in Windows Notepad, the UTF-8 BOM is prepended to it. You can use a different editor (a non–Unicode-aware editor or a hex editor) to remove the BOM. If the file contains one or more legal UTF-8 sequences, and only legal UTF-8 sequences, then removing the BOM will have no effect on the file—it’ll still be UTF-8. If the file contains only ASCII and you remove the BOM, Notepad will flag it as ANSI (8-bit codepage mode). If the file contains a BOM and you insert an illegal sequence into it (like a single FF byte in the middle of the text, or C2 E4, etc), then the file will stay intact, but if it hasn’t got a BOM and you insert such a sequence, it’ll revert to ANSI, and legal UTF-8 sequences too will be viewed in Notepad according to the current Windows ANSI codepage semantics (for example CF 80 as Ï€ instead of π if you’re on a US WinXP). --Shlomital 22:33, 2005 Jun 11 (UTC)
On Czech WinXP it works the same. Notepad marks it with BOM for easier recognition of the encoding, but does not require it. It is an unexpectedly tolerant approach.
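
The CF 80 example above can be reproduced in a few lines of Python. This is a sketch of the byte-level effect only, assuming the "ANSI" code page is Windows-1252 as on a US system; it says nothing about how Notepad itself decides.

```python
import codecs

pi_utf8 = "π".encode("utf-8")        # the two bytes CF 80
mojibake = pi_utf8.decode("cp1252")  # the same bytes read as Windows-1252

with_bom = codecs.BOM_UTF8 + pi_utf8
# The "utf-8-sig" codec drops a leading BOM if one is present and
# otherwise behaves like plain "utf-8", so both inputs decode the same.
decoded_with_bom = with_bom.decode("utf-8-sig")
decoded_without_bom = pi_utf8.decode("utf-8-sig")

print(mojibake)  # Ï€
```

Reading with `utf-8-sig` is one tolerant approach of the kind described above: the BOM is accepted but not required.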

Why was the byte sequence EF BB BF chosen to be the mark?

Is there a reason? Or did someone just pick it by chance? —Preceding unsigned comment added by 117.104.188.16 (talk) 10:03, 25 January 2011 (UTC)

That is U+FEFF (the value of the BOM character) in UTF-8 encoding. It is what a translator from UTF-16 to UTF-8 that was completely unaware of the BOM would produce by translating the BOM character. Spitzak (talk) 19:37, 25 January 2011 (UTC)
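
To make the answer above concrete, here is a short Python sketch that applies the three-byte UTF-8 bit layout to U+FEFF by hand. The function name is made up for illustration, and the sketch deliberately ignores the surrogate range, which a real encoder would reject.

```python
def encode_3byte_utf8(code_point: int) -> bytes:
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte range is handled in this sketch")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])

bom_utf8 = encode_3byte_utf8(0xFEFF)
print(bom_utf8.hex(" ").upper())  # EF BB BF

# Cross-check against the built-in encoder.
assert bom_utf8 == "\ufeff".encode("utf-8")
```

So EF BB BF was not picked separately; it falls out of encoding the existing BOM code point.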

Why is this a problem?

as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages

All those tools are free software or have free software equivalents and it must be relatively easy to make them ignore the mark. Shinobu (talk) 10:18, 20 November 2007 (UTC)

True, though I could see that doing more harm than good. Imagine you wrote your script on your desktop and it ran fine, but when you put it on the production server an invisible character stopped it from running. Plugwash (talk) 10:22, 20 November 2007 (UTC)
That assumes that the "free software" is of varied quality, not following a standard. That may be true. However the context for the quote was biased to support this situation. Tedickey (talk) 11:18, 20 November 2007 (UTC)
"All those tools are free software or have free software equivalents" — no, not proprietary Unixes, and yes they are still around. -- intgr [talk] 11:27, 20 November 2007 (UTC)
The de-facto standard is for tools (including such core OS components as the binary loader) to recognise a script by the first two bytes of a file being "#!". If some versions of some tools start ignoring a preceding BOM but others don't (free software DOES NOT mean you can force your changes on your distro maker or server host) then IMO there is likely to be far more confusion than if scripts with a BOM universally fail (which afaict is the status quo). Plugwash (talk) 12:57, 20 November 2007 (UTC)
uh - no. No one's presented any evidence of scripts which would be ambiguous if someone provided a loader which handles BOM. Tedickey (talk) 13:10, 20 November 2007 (UTC)
I think the real question for Unix shell scripts is, what is the native character encoding that /bin/sh supports? Can you have a shell variable "$STRAßE"? An environment variable of the same name? What about Chinese? My bet is that the Unix shells only support ASCII text, in which case a byte order mark is inappropriate. After all, the kernel is looking for the bytes 2321, not the characters "#!". Canistota (talk) 23:28, 12 March 2009 (UTC)
Shell scripts support non-ASCII characters just fine (for instance in string literals - variable names may be more optimistic). The encoding is LC_CTYPE. But this is irrelevant to the recognition of the #! sequence, which is not performed by the shell in any case. Ewx (talk) 08:59, 13 March 2009 (UTC)
Python and Perl also support UTF-8 encoding well, including with a BOM, although the shebang does not. — Preceding unsigned comment added by 84.97.14.22 (talk) 16:28, 21 July 2012 (UTC)
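
The byte-level point in this thread (the loader checks the first two bytes, and a BOM pushes "#!" to offset 3) can be sketched as follows. The function name and the returned labels are hypothetical; this models the check described above, not any particular kernel's code.

```python
import codecs

def classify_script_start(data: bytes) -> str:
    """Report how a loader that checks only the first two bytes would see the data."""
    if data.startswith(b"#!"):
        return "shebang"
    if data.startswith(codecs.BOM_UTF8 + b"#!"):
        # The "#!" is present, but at offset 3 instead of offset 0.
        return "shebang hidden behind a UTF-8 BOM"
    return "no shebang"

plain = b"#!/bin/sh\necho hi\n"
with_bom = codecs.BOM_UTF8 + plain

print(classify_script_start(plain))     # shebang
print(classify_script_start(with_bom))  # shebang hidden behind a UTF-8 BOM
```

A pre-deployment check along these lines would catch the desktop-versus-server surprise mentioned above before the script is shipped.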

"All those tools are free software or have free software equivalents and it must be relatively easy to make them ignore the mark." – In addition to what User:Plugwash writes above, I do not believe you can convince even a large minority of Unix users that placing a piece of crippled, limited character-encoding metadata into general files is a good idea. Although I only read about it just now, BOM for UTF-8 strikes me as an unusually stupid idea. The section on BOM in RFC 3629 illustrates some reasons why; it is full of heuristics and language that you rarely see in RFCs ("without a good reason", "only when really necessary", "an attempt at diminishing this uncertainty").

Should I interpret the article as if Windows Notepad is the only widely spread software which actually creates UTF-8 BOMs? It would make sense; Microsoft do not care about plain text editing – they are more into "one application, one proprietary file format" – and they have historically not cared about the usefulness of Notepad.

JöG (talk) 09:13, 29 March 2008 (UTC)

OK, now I see the article says "Quite a lot of Windows software (including Windows Notepad)". But it would be interesting to know if popular, serious text editors on Windows (emacs, vim, UltraEdit and popular Windows-specific editors) do this by default. JöG (talk) 09:18, 29 March 2008 (UTC)

You named two ports to Windows and one native. That's a rather small and unrepresentative example. There are many Windows editors. Btw, the comment regarding interprocess communication is unnecessary, since it adds no factual information. Take a look at Windows PowerShell, which has to be doing this transparently. Tedickey (talk) 11:01, 29 March 2008 (UTC)
* UltraEdit: If you select "UTF8" when you save, it adds the BOM without giving you a choice in the matter.
* Vim for Windows: It doesn't give the option to save as UTF8 and does not add a BOM, but when it opens a BOM'ed file it retains the BOM when saving. -- leuce (talk) 20:58, 30 March 2009 (UTC)
Vim's a port (and it doesn't recognize some of the Windows text formats). By the way, there are probably hundreds of applications to discuss in this manner. Tedickey (talk) 21:05, 30 March 2009 (UTC)
I agree -- I merely tested these two because they were mentioned by someone previously, plus the three mentioned below. -- leuce (talk) 14:44, 31 March 2009 (UTC)

In response to JöG's post, here are some Windows programs and whether they add a BOM to UTF8 or not.

  • Akelpad: Gives user a choice, but BOM is suggested by default.
  • MS Word XP: Adds BOM, gives no option not to add BOM. If you open a BOM'ed UTF-8 file in MS Word, it autodetects the encoding as UTF-8; if you open a non-BOM'ed file in MS Word, it makes a guess based on the characters it contains, but if all characters are present in the ANSI scheme, it will save such a file as ANSI, not UTF-8.
  • OpenOffice.org 3.0: Adds BOM, gives no option not to add BOM.

--leuce (talk) 13:11, 29 March 2009 (UTC)

Too technical!

OK, I understand everything in the article, since I'm a unicodopath, but the intro should say:

  • Unicode is a computer encoding of all languages characters (in principle),
  • The byte order mark is designed so that a computer that reads it can guess (with a reasonable probability) that the text data is probably Unicode, and
  • Guess what kind of Unicode encoding, since there are many - the article already says that, I just wanted to stress that it shall.

The intro is a bit too technical for being an intro. The current text qualifies as a technical description intended for me and you, not any outsider. The missing nouns that should be in the intro are: computer, data coding, natural languages. L8R. Said: Rursus 10:15, 25 April 2008 (UTC)

I think this set of recommendations is met or eliminated in the current article's text. The explanation that Unicode intends to capture all human languages belongs in the Unicode article (and it's there, and there's a link over there in the first sentence here). The notion that the BOM has the purpose of identifying Unicode (rather than some other encoding entirely) is not, so far as I can see, justified by the primary references, and is significantly undermined by the fact that BOM is in all contexts optional. The "which Unicode encoding" part is, as acknowledged, already captured. Jackrepenning (talk) 22:56, 6 August 2010 (UTC)

How to remove it

There should be a section in this page discussing how to remove it. The only reason 99% of people would ever come to this page is because they are trying to remove this ugly little thing from a web page they are developing. The 1% of people who come because they are interested in it may be getting what they want but not the rest of us. —Preceding unsigned comment added by Tjayrush (talkcontribs) 16:44, 6 February 2009 (UTC)


There is a nice, easy-to-use piece of software called bomstrip that makes removing this thing quick work on Linux. I didn't want to edit the page directly but perhaps an interested party can. —Preceding unsigned comment added by Tjayrush (talkcontribs) 18:08, 6 February 2009 (UTC)

Added remove script to Unwanted BOMs section. In Linux:

  1. Search for files containing a BOM by running this command: grep -rl $'\xEF\xBB\xBF'
  2. For each file from the search results above, run:

  a. vi <filename from search result>
  b. from inside vi type the command (including the ":" sign)    :set nobomb  
  c. save and exit  :wq  — Preceding unsigned comment added by Drormik (talkcontribs) 13:45, 10 August 2012 (UTC) 
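
For those who prefer a script over editing each file by hand, here is a minimal Python sketch of the same removal. The function names are made up for illustration, and the sketch assumes the file is small enough to read into memory and that only a leading UTF-8 BOM should be removed.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other bytes are left untouched."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def strip_utf8_bom_in_file(path: str) -> bool:
    """Rewrite the file at path without its BOM. Returns True if it was changed."""
    with open(path, "rb") as f:
        data = f.read()
    stripped = strip_utf8_bom(data)
    if stripped == data:
        return False  # no BOM found, file left as it was
    with open(path, "wb") as f:
        f.write(stripped)
    return True
```

Working on raw bytes avoids any re-encoding of the rest of the file, so only the three BOM bytes change.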

Whether Unicode standards recommends UTF-8 BOM or not

The text is "Use of a BOM is neither required nor recommended for UTF-8" (and this already appears in the cite!). That seems like a pretty clear "not recommended" to me - "neither fish nor fowl" means "not fish and not fowl", it doesn't mean "not fish and not specifically fowl". Ewx (talk) 07:47, 31 March 2009 (UTC)

And then it goes on to say that applications still must expect that it'll happen. May as well address the complete sentence, rather than construe a (reasonably) carefully worded comment into a completely negative recommendation. Tedickey (talk) 10:07, 31 March 2009 (UTC)
The Wikipedia text already points out that it may be encountered, and a recommendation not to *use* it doesn't contradict that at all. Ewx (talk) 13:54, 31 March 2009 (UTC)
But that is the point -- the Unicode standard does not contain a recommendation not to use it. -- leuce (talk) 14:37, 31 March 2009 (UTC)
Yes it does! The text in the standard is "Use of a BOM is neither required NOR RECOMMENDED for UTF-8" (emphasis mine). That is not an absence of a recommendation to use it and it is certainly not an absence of a recommendation not to use it; it is a straightforward and clear recommendation not to use it. Ewx (talk) 07:49, 1 April 2009 (UTC)
Indeed (the emphasis is yours). Use the complete sentence, or find another source which supports your viewpoint. Tedickey (talk) 10:56, 1 April 2009 (UTC)
This is completely ridiculous. The text is right there. It says it's not recommended. Ewx (talk) 08:07, 2 April 2009 (UTC)
Well, I suspect this is a sticky point. I have searched chapters 2 and 16 of the Unicode standard for references to BOM, byte order mark and UTF-8, and in my opinion the reference under discussion here is the only instance in the standard that speaks even remotely negatively about a UTF-8 BOM. In all other cases where the UTF-8 BOM is mentioned or discussed, it is mentioned as a matter of course in an informational, neutral tone, without making any value judgements or any indication that the UTF-8 BOM is deprecated. My personal take on this reference is that people who want to implement the Unicode standard might wonder why the Unicode standard keeps making reference to the UTF-8 BOM (also in chapter 16) as if it were a valid construct, and they might get the impression that the Unicode consortium actually recommends using a UTF-8 BOM even though it is not required. -- leuce (talk) 15:34, 1 April 2009 (UTC)
Having read comments by some of the people involved (in the topic itself...), my impression is that the statement is a compromise between two viewpoints, neither of which dominated in writing the source we're discussing. Tedickey (talk) 16:31, 1 April 2009 (UTC)
The phrase "X does not recommend Y" can have two meanings. It can mean that X recommends many things, but that Y is not one of the things that X recommends. Or, it can mean that X makes a recommendation *against* Y. The Unicode article does not recommend against a BOM... it simply does not make a recommendation in favour of it. My gripe is that the wiki article before I edited did create the impression that the Unicode standard recommends against the use of a BOM. Even if one quotes directly from the Unicode standard, if quoted in a different context it can certainly give a slightly different impression of what the standard intends to say. -- leuce (talk) 14:37, 31 March 2009 (UTC)
Agree. And (for instance), if you consult some of the secondary sources, it's easy to come up with one that is wholly in favor of one or another viewpoint. (Some are completely absurd, but I see those reflected on this page ;-) Tedickey (talk) 10:06, 2 April 2009 (UTC)
A recent flurry of edits has opened this can of worms again, and the text has grown decidedly text-booky and verbose. I’ve reverted to the state pre-edits. Firstly, we cannot interpret the Unicode standard for it. The text comes straight from the source. The reader is going to have to decide for “himself” what that means. There is no other authoritative source and therefore we are not allowed to interpret it for the reader. The cited mailing list thread is not authoritative; it is just one of hundreds of discussions all over the Web on the topic, each coming to its own conclusions. Secondly, it makes no sense to prognosticate at length over the reliability or unreliability of the UTF-8 BOM as a signal for UTF-8 encoding. Go find some reliable reference if you feel something definitive needs to be said about it. The article is fine as it is, particularly since these observations about the unreliability of the UTF-8 BOM apply equally well to the UTF-16 BOMs. A file of unknown provenance can never, with 100% confidence, be stated to be in any encoding whatever, or even to be text even though it might be the collected works of Shakespeare in 7-bit ASCII. The best you can state completely confidently is that the content is not in some particular encoding due to a violation of the encoding’s standards. Strebe (talk) 19:57, 14 July 2012 (UTC)
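
The asymmetry described in the last two sentences above (an encoding can be ruled out with certainty, but never confirmed) can be sketched in Python. The function name is hypothetical; a True result means only "no violation found", not "this is UTF-8".

```python
def could_be_utf8(data: bytes) -> bool:
    """False means the bytes violate UTF-8; True only means no violation was found."""
    try:
        data.decode("utf-8")  # strict decoding raises on any ill-formed sequence
    except UnicodeDecodeError:
        return False
    return True

print(could_be_utf8(b"To be, or not to be"))  # True, but it is also valid ASCII, Latin-1, ...
print(could_be_utf8(b"\xcf\x80"))             # True, though it could equally be two Windows-1252 characters
print(could_be_utf8(b"ab\xffcd"))             # False: FF never occurs in UTF-8
print(could_be_utf8(b"\xc2\xe4"))             # False: C2 must be followed by a continuation byte
```

The two False cases use the illegal sequences mentioned earlier on this page (a lone FF byte, and C2 E4).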

May I make this edit?

Current:

While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may nonetheless be encountered, and it is explicitly allowed by the Unicode standard[1], the Unicode standard does not specifically recommend its usage[2]. It only identifies a file as UTF-8 and does not state anything about byte order.[3]

When I read these two sentences, it almost sounds as if the Unicode standard identifies a file as UTF-8 :-) That second sentence doesn't really fit anymore. Besides, it repeats what has been explained elsewhere. I suggest we remove it or move it somewhere else in the article. -- leuce (talk) 14:55, 31 March 2009 (UTC)

Article needs an example

This article should include an example of a byte-order-mark. DMahalko (talk) 23:56, 15 June 2009 (UTC)

The article currently presents the definition and its encodings, including how it looks when rendered naively in various ways. David Spector (talk) 14:21, 28 March 2013 (UTC)

Why the dash in byte-order?

The Unicode specification reads "byte order mark", not "byte-order mark". Why was this article's name changed? On the face of it, this article title is wrong. Strebe (talk) 04:01, 28 July 2009 (UTC)

Proper English would dictate the use of the hyphen. See http://en.wikipedia.org/wiki/English_compound#Hyphenated_compound_adjectives - Blueguy 65.0.223.146 (talk) 00:25, 7 August 2009 (UTC)
This article is about something that has a name. The name, by the body that coined it, is "byte order mark". It is not encyclopædic to "correct" established terminology; that is editorializing. This article's title is wrong. Strebe (talk) 09:05, 8 August 2009 (UTC)
Wikipedia rules say to name articles as the thing is called on the street and in life, not as it's called in the dictionary or how it should be called; Strebe is right. 88.148.214.15 (talk) 20:35, 12 October 2009 (UTC)
The following discussion is an archived discussion of a requested move. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the move request was moved.  Skomorokh, barbarian  11:07, 27 October 2009 (UTC)


Byte-order mark → Byte order mark — Cannot move back to old name without administrator intervention. Strebe (talk) 09:57, 18 October 2009 (UTC)

The above discussion is preserved as an archive of a requested move. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

Which text editors add a BOM to the beginning of text files?

"Some text editing software in a UTF-8 environment on MS Windows adds a BOM to the beginning of text files." Which ones? Tisane (talk) 02:57, 24 February 2010 (UTC)

Probably a long list (Visual Studio .NET for instance) Tedickey (talk) 09:20, 24 February 2010 (UTC)

The BOM will make a batch file not executable on Windows…

I removed this completely misleading remark of October 28:[4]. First, it is not impossible to remove the BOM even in Windows, so the conclusion about s.n. "ANSI" has no grounds. Second, user BIL correctly stated that the native encoding for .bat is CP437 but forgot to mention that non-Western Windows localisations actually use a different OEMCP (see below a sample with code page 866); in any case this matter is quite off-topic and irrelevant. And, most important, .BATs starting with the BOM do execute:

T:\>test.bat

T:\>я╗┐echo ╨╗╤П╨╗╤П╨╗╤П╨╗╤П╨╗╤П╨╗╤П
'я╗┐echo' is not recognized as an internal or external command,
operable program or batch file.

T:\>ver

Microsoft Windows XP [Version 5.1.2600]

T:\>

The test.bat file contains:

echo ляляляляляля
ver/* UTF-8 */

Incnis Mrsi (talk) 18:15, 6 February 2011 (UTC)
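
The garbled first line in the console output above can be reproduced with a short Python sketch: it takes the first line of test.bat as UTF-8 with a BOM and reads the bytes back as code page 866. It only models the byte interpretation, not cmd.exe itself.

```python
import codecs

first_line = "echo ляляляляляля"
raw = codecs.BOM_UTF8 + first_line.encode("utf-8")  # bytes as saved by the editor

# Read the same bytes the way a console using OEM code page 866 would.
shown = raw.decode("cp866")
print(shown)  # я╗┐echo ╨╗╤П╨╗╤П╨╗╤П╨╗╤П╨╗╤П╨╗╤П

# The three BOM bytes become ordinary characters glued to the command name.
command_name = shown.split(" ")[0]
assert command_name == "я╗┐echo"
```

This matches the error message: the interpreter looks for a command called 'я╗┐echo' rather than 'echo'.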

Huh? Your example shows EXACTLY the problem: the BOM is not removed but is considered part of the "echo" command and therefore the .bat file fails to work.Spitzak (talk) 19:22, 6 February 2011 (UTC)

I see "the problem", but such .BAT files do execute, contrary to the statement quoted in the topic. As there is an error with the first line, there is, obviously, an easy workaround: skip the first line, say, leave it empty. This is all WP:OR, just like the deleted speculations. So I see no reason to keep BIL’s controversial OR in Wikipedia. Incnis Mrsi (talk) 20:36, 6 February 2011 (UTC)
If you leave the first line in the bat file empty, and save it as UTF-8, there will still be a BOM there, which will cause an error message, but the bat file will be executable. What I wanted to describe is that the Windows command prompt and bat files do not recognise BOM or Unicode. There might be a workaround, but still.--BIL (talk) 21:34, 6 February 2011 (UTC)
I agree that you have an extremely literal interpretation of the word "execute". Yes for almost any text file, the program the text file is for will start running, will open the file, and will actually read bytes from it, and only fail when it fails to interpret the line as the user of the text editor intended. By this criterion ALL programs "work with a BOM". However that is a pretty useless definition. Spitzak (talk) 21:23, 7 February 2011 (UTC)

Please, Spitzak and Strebe: do tango it into a good text here at Talk. If you cannot solve it here, it will not be a good text in the article page for sure. I really would like to read the good article on this. -DePiep (talk) 20:40, 8 February 2011 (UTC)

The rationale of this edit is wrong: Without the BOM it would NOT be "the wrong encoding"
The character encoding is declared as part of the text file contents only if there is a BOM and only within Unicode environments. If there is not a BOM, or if the environment is not Unicode, then the character encoding is determined externally. You cannot claim that a file sent to the DOS command line is UTF-8, since, by definition, the file is DOS 437. It does not matter how the file was constructed or what its history was or whether it contains a BOM; when you sent it to the DOS command line, you implicitly declared that it was Code page 437, which is not a Unicode environment. If that is not what you intend, then you simply sent the command line the wrong file. Strebe (talk) 00:04, 9 February 2011 (UTC)
I shortened the text and wrote that batch files do not support Unicode and therefore not the BOM. Note that echo does not support Unicode, for example writing echo From Genève to Zürich in a batch file gives From Gen├¿ve to Z├╝rich, and Unicode file names do not work either.--BIL (talk) 10:34, 9 February 2011 (UTC)
Saying "the text has an encoding" shows that you completely do not understand why the BOM is not being recommended by some. Without the BOM, a UTF-8 file containing only ASCII letters is identical to a ASCII file. So it is simultaneously in UTF-8 encoding and also in ASCII encoding and DOS 437 encoding and ISO-8859-1 encoding and CP1252 encoding. The entire design of UTF-8 was to allow this, to eliminate the need to identify and transmit encodings. However it is defeated by the addition of the BOM which makes it no longer in these encodings, for a completely invisible letter that the programs now have to decode depending and add to their input syntax just so they can skip it! And don't print that bull about "batch files are DOS 437", if that was the problem the batch file would produce "this is in the wrong encoding error", not complain about the inability to find a command that happens to be equal to the first ASCII command with the three bytes of the BOM added to the start. In reality, batch files are streams of bytes and the byte values that happen to match the ASCII space and CR and LF and a few other values have some meaning. This is not an "encoding" at all.Spitzak (talk) 20:31, 9 February 2011 (UTC)
You might consider calming down and perhaps finding some soothing hobbies. You have no idea what I understand and do not understand, and I really am not interested in these sorts of petty pissing matches or discussing who’s stupid. I’m interested in improving Wikipedia. I can’t imagine anyone else is interested in such flaming, either.
We agree that a BOM is not recommended— after all, we must agree because that is what Unicode states. I have no disagreement with the first half of your diatribe. You might reconsider your rant about DOS 437, on the other hand. It is not the job of text processing systems in non-Unicode environments to recognize Unicode conventions. The Unicode Consortium recognizes this and takes pains to make sure no one thinks they’re imposing Unicode on everyone, especially systems that existed before Unicode. Batch files existed long before Unicode. It cannot be batch processing systems’ responsibility to declare that the encoding is wrong because they don’t even know that it’s “wrong”. It’s NOT wrong; by using the file as a batch command, you have imposed DOS 437 semantics onto the file. Therefore your assertion that batch file processing ought to produce a “This is the wrong encoding” is nonsense. What you are calling a BOM is not a BOM in a batch file; it is a sequence of three characters: the “intersection” glyph from set theory, and two box-corner symbols. Just because Unicode came along does not deprive DOS 437 (or any other encoding) of its upper ASCII register, which you seem to be arguing for by claiming it’s “not an encoding”.
The important thing here is that the declaration of the encoding system is not part of the file’s content; it is externally imposed. A BOM has specific meaning within the Unicode environment. It does not outside of it. Batch files are outside the Unicode environment. It really is that simple. Strebe (talk) 01:00, 10 February 2011 (UTC)
It is obviously a waste of time trying to explain this. Basically though: if a program takes some bytes in a buffer and puts them on a device that interprets them according to encoding X, then that buffer is in encoding X. It does not matter if that program does not understand encoding X or that it was written decades before encoding X existed. The bytes are in that encoding because they are interpreted as though they are in that encoding. Anyway I am going to delete the Windows batch file comment because adding "it is in DOS 437" makes the argument completely nonsensical. Spitzak (talk) 19:36, 10 February 2011 (UTC)
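
Whatever one calls it, the byte-level behaviour both sides describe in this thread can be checked with a short Python sketch. It reads UTF-8 bytes as code page 437 and takes no position on which encoding the file "is in".

```python
text = "From Genève to Zürich"
utf8_bytes = text.encode("utf-8")

# The same bytes interpreted under DOS code page 437.
shown = utf8_bytes.decode("cp437")
print(shown)  # From Gen├¿ve to Z├╝rich

# The three UTF-8 BOM bytes under the same interpretation:
# the intersection sign followed by two box-drawing corners.
bom_shown = b"\xef\xbb\xbf".decode("cp437")
print(bom_shown)  # ∩╗┐

# A pure-ASCII text has the same bytes in UTF-8, ASCII and CP437.
ascii_text = "echo hello"
same_bytes = (
    ascii_text.encode("utf-8")
    == ascii_text.encode("ascii")
    == ascii_text.encode("cp437")
)
```

The last comparison is the property that a prepended BOM breaks: with the three extra bytes the file is no longer byte-identical to its ASCII form.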

Dubious claim in "Representations of byte order marks by encoding" section

The GB-18030 section has the following claim: "[132] and [149] are unmapped ISO-8859-1 characters". But my understanding is that these characters aren't unmapped even in ISO-8859, but are C1 control characters; 0-31 is the C0 control area and 128-159 the C1 control area. This is why the mapping by Windows-X of higher Unicode characters to the latter range can cause problems.

I think this section needs to be edited by a knowledgeable person. — 93.97.40.177 (talk) 07:00, 16 June 2011 (UTC)

You are confusing the character values produced after decoding with the bytes that are in the encoding. 132 is a value of one of the bytes in the GB18030 encoding of the BOM. It and 3 other bytes decode into the value 0xFEFF.Spitzak (talk) 19:09, 16 June 2011 (UTC)
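
The distinction drawn above between encoded byte values and decoded character values can be seen directly in Python, assuming its built-in gb18030 codec:

```python
# Encode the BOM code point with the gb18030 codec and look at the raw bytes.
bom_gb18030 = "\ufeff".encode("gb18030")
byte_values = list(bom_gb18030)

print(bom_gb18030.hex(" ").upper())  # 84 31 95 33
print(byte_values)                   # [132, 49, 149, 51]

# Decoding the four bytes together yields the single code point U+FEFF;
# 132 and 149 are byte values inside the encoding, not characters.
round_trip = bom_gb18030.decode("gb18030")
assert round_trip == "\ufeff"
```

The values 132 and 149 only look like C1 controls if the four bytes are misread one at a time as ISO-8859-1.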

From which version do text editors recognize (or not recognize) UTF-8 without a BOM at the beginning of text files?

From which version do text editors recognize (or fail to recognize) UTF-8 without a BOM at the beginning of text files? Once all text editors recognize UTF-8 without a BOM, the BOM will not be necessary anymore... — Preceding unsigned comment added by 86.75.236.140 (talk) 10:09, 30 June 2012 (UTC)

«One reason the UTF-8 BOM is not recommended is that pieces of software without Unicode support may accept UTF-8 bytes at certain points inside a text but not at the start of a text.»

«One reason the UTF-8 BOM is not recommended is that pieces of software without Unicode support may accept UTF-8 bytes at certain points inside a text but not at the start of a text.» The formulation of this sentence looks strange and illogical from my point of view: if a piece of software does not support UTF-8, the presence of a BOM helps to indicate that this software is not compatible with UTF-8. — Preceding unsigned comment added by 86.75.236.140 (talk) 10:12, 30 June 2012 (UTC)

The rest of the paragraph explains it. Strebe (talk) 18:59, 30 June 2012 (UTC)
I understood the sentence just now: the intent is to mean to not use BOM for backward compatibility with legacy software which accept 8 bits regardless encoding.
It seams to me very specific, althouth I understand such a specific case can be considered by Wikipedia. I assume in 2012, there are very few software which have this issue.
I am not sure the case of a compiler is a good example. To be verifiable, a name and a version of assumed incompatible compiler (for instance PHP 5) should be given as example/reference. For the two compilers I searched for, I understand that this issue is solved and a BOM can be used:
  • A seven years old compiler (Visual C++ 2005)[Notes 1].
  • Another compiler example with gcc fortran five years ago, which considered it as a corrected bug [Notes 2]
So I would prefer a sentence which states BOM is for fully unicode compatible software and for old software (from the XXth century ;-) ), BOM should be avoided. Althought the Unicode position might say the same in a more neutral way[Notes 3] might be better.
Above all, the explanation should be simplified to be easily understandable.
To be more neutral, Wikipedia should not focus only on the POSIX position but should also consider the Unicode and Microsoft ones.
  1. ^ MSDN states that Visual Studio 2005 requires the BOM for code with identifiers, macros, literals and comments in Unicode [1]
  2. ^ The GCC Bugzilla database states that the compiler not handling the BOM was a bug, which has been corrected [2]
  3. ^ The Unicode FAQ on the BOM includes the question « Q: How I should deal with BOMs?», which is answered by distinguishing four cases [3]
The article reports Unicode’s guidelines. As stated, the Unicode standard permits the BOM but does not require or recommend it. The sentence that starts, “One reason the UTF-8 BOM is not recommended” does not imply that the Unicode standard recommends against using a BOM. It merely means that the Unicode standard does not recommend using a BOM for UTF-8 and gives an example of why Unicode’s recommendation was formulated the way it is. The Unicode caution may become less and less relevant over time, but the original reasons, one of which appears as that example, are immutable historical fact. By the way, I think your optimism about widespread Unicode compatibility is misplaced: Many, many third-party applications have no concept of a UTF-8 BOM, and some truly ancient code continues to be used and relied on now and into the indefinite future because no one will make the investment to overhaul it. But our opinions do not matter here. The article is supposed to be about verifiable facts. Strebe (talk) 20:14, 1 July 2012 (UTC)
Your explanation here might be clearer than the article, as it is concise and gives a historical rationale. Now I understand the “not recommended”, whose meaning was not obvious, as a recommendation of caution mainly addressed to the user/data provider. For me it would be clearer to say that users are not advised to store text with a BOM, that text written with a BOM might not be compatible with old/legacy programs limited to reading only ASCII, and that for compatibility it is preferable that programs which read text files be improved to handle the BOM correctly when present. In particular, I believe I have read debates that justified not fixing incompatibility bugs on obscure grounds such as the BOM being “not recommended”. At the time I did not understand!
In my opinion: Now, Unicode is everywhere, from the Internet to Linux distributions. Incompatible software will be corrected, or be used less and less until it is abandoned, even if that takes years, as happens in Microsoft Windows, where the DOS box contains legacy software such as DIR, which gives file sizes in CP850 encoding! But our opinions do not matter here.

w3c + existing software to strip BOM

Note:

According to the W3C, “For compatibility with deployed content, the byte order mark (also known as BOM) is considered more authoritative than anything else.” ( http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html#decode-and-encode ) I did not read anything about this in the article.
I also note that there is software to help developers deal with the BOM, which the article does not cover, such as:
  • tmp = tmp.replace("\uFEFF", "");
  • Apache BOMInputStream [5] — Preceding unsigned comment added by 77.199.89.101 (talk) 12:34, 2 July 2012 (UTC)
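For comparison with the two items above, here is a minimal Python sketch of the same idea. It is not taken from the article or from either tool; the helper names `strip_bom` and `strip_bom_str` are made up for illustration, and it relies only on the standard `utf-8-sig` codec, which skips one leading UTF-8 BOM if it is present.

```python
import codecs


def strip_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if present."""
    # 'utf-8-sig' behaves like 'utf-8' but skips an initial EF BB BF.
    return data.decode("utf-8-sig")


def strip_bom_str(text: str) -> str:
    """Remove a leading U+FEFF from an already decoded string."""
    return text[1:] if text.startswith("\ufeff") else text


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(strip_bom(with_bom))
print(strip_bom_str("\ufeffhello"))
```

Unlike the `replace` call quoted above, which removes every U+FEFF in the string, this sketch only touches a BOM at the very start.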
These look like good additions to the article. Feel free to add them. I think “the… BOM… is considered more authoritative than anything else” is obvious (i.e., if it’s not more authoritative in some instance, then someone has wrecked the byte stream somehow!) or irrelevant (because the byte stream is wrecked and therefore has no coherent encoding), but perhaps there are situations I am not thinking of that W3C has. At least it is a verifiable statement, which the article needs more of. Strebe (talk) 16:41, 2 July 2012 (UTC)

Bush hid the facts

I understand that the BOM is also a means of avoiding the Bush hid the facts bug. — Preceding unsigned comment added by 77.199.96.98 (talk) 18:55, 9 July 2012 (UTC)

Requiring a BOM would eliminate this bug. The Bush hid the facts bug occurred when an ASCII file without newline looked like UTF-16 without BOM. The BOM is not required even for UTF-16 for reasons written in the article.--BIL (talk) 09:02, 28 May 2014 (UTC)
Requiring a BOM in UTF-8 would actually encourage such bugs, not fix them, by encouraging software to check for strange encodings first. Pattern recognition should check first for the patterns that are *least* likely to occur by chance (this would be UTF-8 first, and ASCII if there are no bytes with the high bit set). UTF-16 could be recognized (and its endianness determined) by looking for a large number of null bytes that are not in pairs, but this pattern is a bit more likely in random data than the UTF-8 or ASCII patterns, so it should be checked third. Requiring a BOM in 16-bit text has the same problems as requiring it in UTF-8, though it would fix this particular example. Spitzak (talk) 02:17, 29 May 2014 (UTC)
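The misreading discussed in this section can be reproduced with a short Python sketch. It only illustrates why the bytes are ambiguous; it is not how any particular editor's detector works, and the helper `looks_like_cjk` is a made-up name for this example.

```python
phrase = b"Bush hid the facts"  # 18 ASCII bytes, no newline

# Reading the same bytes as little-endian UTF-16 pairs them up into
# nine 16-bit code units instead of eighteen 8-bit characters.
as_utf16 = phrase.decode("utf-16-le")


def looks_like_cjk(text: str) -> bool:
    """True if every character lies in the CJK Unified Ideographs block."""
    return all(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)


print(len(as_utf16), looks_like_cjk(as_utf16))
print(as_utf16)
```

Every byte pair happens to land in the CJK Unified Ideographs block, so a BOM-less UTF-16 guess looks plausible to a heuristic even though the file is plain ASCII.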

UTF-8 BOM recommendation

User:Karl432 alerted me on my Talk page that the statement clarifying the Unicode Standard's neutrality with regard to use of a UTF-8 BOM was made by a senior Unicode Consortium member (Technical Vice President, Emeritus). Presumably this is reliable “enough” to cite.

Karl432 also elaborates in his edits, which I deleted: “It is to be noted that the presence of the byte sequence representing an UTF-8 encoded BOM at the start of a text stream or file can be interpreted as a hint that a text stream or file might be encoded as UTF-8, but not as a proof, as such a byte sequence may have other unrelating meanings unless such can be excluded by other knowledge of the context.” I deleted this because (besides being verbose and unencyclopedic verbiage) the same comment applies to any BOM, not just UTF-8. Strebe (talk) 23:53, 15 July 2012 (UTC)

RFC 5198

Where a BOM is used in files, RFC 5198 (an RFC concerning protocols) states that Net-Unicode forbids BOM usage. — Preceding unsigned comment added by 86.75.160.141 (talk) 20:46, 20 November 2012 (UTC)

How do you interpret RFC 5198 that way? The injunction against BOMs has nothing to do with files. It has to do with transmission of text strings. Strebe (talk) 21:15, 20 November 2012 (UTC)
Okay; this article is about Byte_order_mark and not only about files 86.76.39.126 (talk) 22:39, 24 November 2012 (UTC).

Difficulty of detecting UTF-8 without BOM

It is not "trivial" to detect if a file is encoded in UTF-8. Easier than other encodings, yes, but it requires reading through a whole file, looking for characters that distinctly look like UTF-8-encoded characters, and finding enough of them to make a determination that the file is indeed UTF-8. It depends on the definition of "trivial," but I doubt that it meets it. Moreover, if a file contains just one UTF-8 character, the algorithm may fail. Furthermore, what if the file is corrupt and has some invalid characters? The algorithm must not be too quick to bail. Finally, if using the popular ICU library, the detector is for whatever reason very slow. — Preceding unsigned comment added by 173.169.194.3 (talk) 20:49, 27 May 2014 (UTC)

You are not seeing the solution. You don't look for UTF-8 encoded characters, you look for sequences of bytes that are *not* UTF-8 encoded characters. The *vast* majority of sequences that contain a byte with the high bit set are not valid UTF-8 and it is easy to detect them. For instance a lone byte with the high bit set is not UTF-8. There is no need to read the entire file, and certainly no need to see if the UTF-8 characters make any sense. Checking even the first byte with the high bit set is enough to establish this with such a high degree of certainty that it is very difficult to contrive an example of even one actual word in any language that will fail (I think there is a known German word whose capitalized ISO-8859-1 form produces a valid UTF-8 byte stream, but this is the only example anybody has come up with).
You are right that errors in the encoding would cause a strict version of this to say it is not UTF-8. However I recommend that encoding detection be done on-the-fly: at each byte with the high bit set, it checks to see if it is UTF-8. If it is, it uses it as UTF-8. Otherwise it can do a legacy conversion based on local pattern matching. This will fix multiple encodings pasted together, which no encoding-detection or BOM scheme will handle. Spitzak (talk) 23:43, 27 May 2014 (UTC)
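The on-the-fly approach described in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `decode_mixed` is invented here, and ISO-8859-1 (`latin-1`) stands in for the "legacy conversion", which in practice would be chosen some other way.

```python
def decode_mixed(data: bytes, fallback: str = "latin-1") -> str:
    """Decode byte by byte: use UTF-8 where a sequence is valid,
    otherwise treat the single byte as the (assumed) legacy encoding."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # plain ASCII byte
            out.append(chr(b))
            i += 1
            continue
        # Sequence length implied by the lead byte (0 = cannot start one).
        if 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0
        decoded = None
        if n and i + n <= len(data):
            try:
                # Strict decoding rejects overlong forms and surrogates.
                decoded = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                decoded = None
        if decoded is not None:
            out.append(decoded)
            i += n
        else:
            out.append(data[i:i + 1].decode(fallback))
            i += 1
    return "".join(out)


mixed = "é".encode("utf-8") + b" and " + "é".encode("latin-1")
print(decode_mixed(mixed))
```

The example input pastes a UTF-8 "é" and an ISO-8859-1 "é" into one byte string, which is the concatenation case the comment mentions.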
That library is slow because it is written incorrectly. It is trying to pattern-match a vast number of legacy encodings before it ever gets around to UTF-8. The extremely fast and reliable not-UTF-8 test should be run first. However due to the historical addition of UTF-8 after other encodings, they tend not to be written this way. Also the use of the BOM has the perverse effect of making people write incorrect encoding detectors, as the lack of the BOM triggers legacy detection rather than just causing it to check the next high-bit byte for valid encoding. Spitzak (talk) 23:43, 27 May 2014 (UTC)
The text of the article is now incorrect. While Spitzak is correct that it is normally easy to detect that a file is not UTF-8, he fallaciously claims that it is easy to detect that it is UTF-8. That’s nonsense, especially if the file is short. A correct UTF-8 file could also be a correct file in any number of legacy encodings in the general case. Strebe (talk) 03:51, 28 May 2014 (UTC)
You are failing to understand. A random sequence of bytes is *extremely* unlikely to be valid UTF-8. This means that if you encounter a string that is valid UTF-8, it is extremely likely it *is* UTF-8, since the odds of encountering a string in an alternative encoding that happens to be valid UTF-8 are very low. For a more specific example, if an ISO-8859-1 string were to be misinterpreted as UTF-8 the only 8-bit characters it could contain are *pairs*, where the first character is an upper-case accented letter (the range 0xC2..0xDF), and the second character is a punctuation mark or C1 control (the range 0x80..0xBF). It is nearly impossible to make a string in any language that makes sense and actually contains such a sequence. I recommend you try to figure one out (try the JP multi-byte encodings and UTF-16 and others, too) and come back here if you can actually find a readable counter-example of more than one word, then you can say this does not work. Spitzak (talk) 02:07, 29 May 2014 (UTC)
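To make the ISO-8859-1 example in the comment above concrete, here is a small Python sketch. The sample strings are invented for illustration, and `valid_as_utf8` is a hypothetical helper built on Python's strict UTF-8 decoder; the byte ranges printed at the end are the ones quoted in the comment (printable part only for the second range).

```python
def valid_as_utf8(data: bytes) -> bool:
    """True if the bytes pass strict UTF-8 validation."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Ordinary words encoded as ISO-8859-1, plus one contrived pair ("Ã©").
samples = ["café", "naïve façade", "München", "Ã©"]
for s in samples:
    print(repr(s), valid_as_utf8(s.encode("latin-1")))

# The ISO-8859-1 characters occupying the two ranges named above.
lead = bytes(range(0xC2, 0xE0)).decode("latin-1")
trail = bytes(range(0xA0, 0xC0)).decode("latin-1")  # 0x80..0x9F are C1 controls
print(lead)
print(trail)
```

Only the contrived pair passes validation, because its two bytes (0xC3, 0xA9) happen to form a valid two-byte sequence. The sketch covers only the samples shown and says nothing about how often such pairs occur in real files.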
I’m not interested in your proclamations of people not understanding. Knock off that idiocy. You have no idea what goes on in other people’s heads. Your logic is fallacious. Text files represent a lot more than just real language. They represent all sorts of non-linguistic information. A valid UTF-8 file is also a valid file in many other encodings. “Valid” doesn’t mean human-readable. Strebe (talk) 03:39, 29 May 2014 (UTC)
Let us assume that if we find a file containing valid UTF-8 data (including more than 5 characters outside the 0-0x7F range), then it is intended to be UTF-8 even if there is no BOM. The only example I've seen where valid UTF-8 was intended to be Latin-1 was an example of how text can be misinterpreted if UTF-8 data is thought to be Latin-1. So even though valid UTF-8 data can legally be in another encoding (every file is a legal ISO-8859-x file), in reality that is unlikely, so going without a BOM and testing for UTF-8 first would do no harm in practice. --BIL (talk) 16:58, 29 May 2014 (UTC)
You’re just repeating Spitzak, so I guess I’ll repeat myself. You can tell with 100% certainty that a file is not UTF-8 fairly easily in almost all cases because the file violates UTF-8 syntax. You cannot ever tell with 100% certainty that a file is UTF-8 because there is no file that violates all other encodings but adheres to UTF-8. You can increase your confidence the larger the file is and the more non-ASCII bytes are in it. But that heuristic yields low confidence in files that are dominated by ASCII while having just a few bytes above 0x7f. For Web pages, sure, generally you can detect with strong confidence (not 100%, but strong). If the document is a human language, sure, generally it’s pretty clear (not 100%, but strong). Otherwise, no, sometimes it’s not. The article needs to quit talking in certainties and superlatives expressing this imaginary certainty, and it needs to quit failing to distinguish between syntactical certainty and mere circumstantial evidence. Strebe (talk) 05:20, 30 May 2014 (UTC)
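The distinction drawn in the comment above, between a syntactic "not UTF-8" verdict and merely circumstantial "probably UTF-8" evidence, might be expressed as a three-way result. This is a sketch only; the function name `classify` and the wording of the returned labels are invented here.

```python
def classify(data: bytes) -> str:
    """Separate the certain verdict from the circumstantial ones."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Violates UTF-8 syntax: this verdict is certain.
        return "not UTF-8"
    non_ascii = sum(1 for ch in text if ord(ch) > 0x7F)
    if non_ascii == 0:
        # Pure 7-bit data gives no evidence either way.
        return "ASCII only"
    # Valid syntax is evidence, not proof; the count hints at its weight.
    return f"valid UTF-8, {non_ascii} non-ASCII character(s)"


print(classify(b"caf\xe9"))
print(classify(b"plain text"))
print(classify("We don't know what \u2614~ means".encode("utf-8")))
```

How many non-ASCII characters are needed before the third result is trusted is exactly the point under dispute in this thread, so the sketch reports the count rather than picking a threshold.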
You are still failing to understand. This is a statistical logic problem that often confuses people. Let's vastly overestimate the chances of a random sequence of bytes being valid UTF-8 as 1/1,000,000. This means that if you take the set of all possible strings in this other encoding, 1/1,000,000 of that set will be valid UTF-8. Now pretend there are 500,000 different strings actually being used in the world in this non-UTF-8 encoding. This predicts that there is 1/2 of a string that will be confused, i.e. quite possibly none. Now if you actually extend this to real-world numbers, with actual odds of UTF-8 and string lengths of about 100 characters, the chances are astronomically small. In fact they are so small that I see no need to examine more than the first one or two 8-bit bytes and use that to assume the rest.
You are correct that if there are only 7-bit bytes in there, it may be one of the ancient 7-bit non-ASCII encodings and not UTF-8/ASCII. This problem, though, exists whether or not you consider UTF-8 first. Spitzak (talk) 20:59, 30 May 2014 (UTC)
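The arithmetic in the comment above, and the underlying claim about random bytes, can be checked with a rough Python sketch. The big assumption, labeled in the code as well, is that the bytes are uniformly random; real legacy-encoded files are not random, which is the point contested elsewhere in this thread. All names here are invented for illustration.

```python
import random


def expected_collisions(strings_in_use: int, one_in: int) -> float:
    """Expected number of strings mistaken for UTF-8, given odds of 1/one_in."""
    return strings_in_use / one_in


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def estimate_random_hit_rate(length: int, trials: int, seed: int = 0) -> float:
    """Fraction of uniformly random byte strings (an assumption, not a model
    of real text) that contain a high-bit byte and still validate as UTF-8."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        data = bytes(rng.getrandbits(8) for _ in range(length))
        if max(data) >= 0x80 and is_valid_utf8(data):
            hits += 1
    return hits / trials


# The made-up figures from the comment: 500,000 strings, odds of 1 in 1,000,000.
print(expected_collisions(500_000, 1_000_000))
print(estimate_random_hit_rate(length=16, trials=20_000))
```

The second figure depends on the seed and on the chosen length, so it should be read as an order-of-magnitude estimate only.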
1. This conversation does not belong here. It’s WP:OR, and what’s in the article is WP:OR. That is one reason I am finished with this discussion. Because it’s WP:OR, I’m going to delete large swaths of what’s in the article unless it gets cleaned up in a way that meets the consensus of the editors of this article and some semblance of Wikipedia policy. If someone thinks they have something to say on the topic in the article, it had either better not be controversial, or it had better be cited by referring to a WP:RELIABLE source.
2. Spitzak’s proclamations that people don’t understand are a violation of WP:CIV, and his insistence on doing this has made it pointless to engage in productive discourse. That is another reason I am finished with this discussion.
3. Spitzak wishes to talk statistics, which means he’s already agreeing with me that the “is UTF-8” check can only be statistical. He invents some number while ignoring the huge volume of files in existence that are mostly 7-bit ASCII with just a few 8-bit bytes in them. As a simple example, the UTF-8 string, “We don't know what ☔~ means” may just as well be “We don't know what笘梅 means” in Shift-JIS or something else in the myriad other encodings out there. Files of mostly English text with just a few symbols in them are not rare. Spitzak appears to wish for them not to exist. They exist. So. Given that we are at an impasse, that is a third reason I am finished. After waiting a reasonable time for the article to get cleaned up or cited, I’ll simply do it myself, removing all the WP:OR. The article will be much shorter. Strebe (talk) 04:23, 31 May 2014 (UTC)
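The Shift-JIS example in point 3 can be tried directly with Python's standard codecs. This sketch assumes that the "~" in the example is the ASCII tilde (U+007E); the comment above gives the Shift-JIS reading of the same bytes as 笘梅.

```python
# Assumption: "~" is the ASCII tilde, so the fragment is four bytes long.
raw = "\u2614~".encode("utf-8")  # U+2614 UMBRELLA WITH RAIN DROPS, then "~"

print(raw.hex(" "))
print(raw.decode("utf-8"))      # read as UTF-8: two characters
print(raw.decode("shift_jis"))  # read as Shift-JIS: two double-byte characters
```

Both decodes succeed without error, which is why validity alone cannot settle which encoding was intended for so short a fragment.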
I don't see the problem. It is in general hard to determine the encoding if it is not explicitly given. Why should we then worry that on rare occasions something is detected to be UTF-8 when it is not, when detection of other encodings does not always work either? I admit that there are advantages of having a BOM on UTF-8 but there should be the option of being without, for example to edit source code for compilers and interpreters (such as Unix shell scripts) that don't understand UTF-8. Always adding or requiring a BOM would be sub-optimization (optimizing one aspect of a problem which causes problems for other aspects). --BIL (talk) 09:28, 31 May 2014 (UTC)
I don’t advocate a BOM. I advocate that the article be correct and WP:VERIFIABLE. Strebe (talk) 03:55, 1 June 2014 (UTC)
I'm sorry, but I am going to continue to state that you don't understand. You blatantly state an irrelevant fact: "Files of mostly English text with just a few symbols in them are not rare." The fact that you say this shows that you are not comprehending the problem or solution. The real fact is: files containing symbols ONLY IN PAIRS AND ONLY IN A 2.6% SUBSET OF ALL POSSIBLE PAIRS are rare! Please understand this and don't screw up Wikipedia with your incorrect assumptions, as they are as much original research as anything else. Spitzak (talk) 00:18, 1 June 2014 (UTC)

Some interesting (and mostly correct) comments here: [[6]]. They actually underestimated the chance that a 3-byte sequence will collide with UTF-8, although they also did a poor job of extrapolating this to actual files (which are longer than 3 bytes and are in encodings that have their own patterns that make UTF-8 collisions less likely). I tried to add a comment to fix this.

The main problem is that there are bad detectors that basically say "if the start is not the UTF-8 BOM, then NEVER consider UTF-8 again", or defer it until after a lot of other much less reliable pattern detectors are tried.

A correct detector should run the UTF-8 validation first: if it passes and contains an 8-bit byte, it is so highly likely that it is UTF-8 that the answer should be considered done. Note that this makes BOM detection redundant (since the BOM would only trigger this same test), so BOM detection should not be done (also because it fails to detect the common mistake of concatenation, where a BOM ends up in front of non-UTF-8 text).

These bad detectors are probably the biggest impediment to implementation of Unicode, because they encourage programs to "default" to non-Unicode. Spitzak (talk) 19:16, 2 June 2014 (UTC)
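The difference between the two kinds of detector described in this comment can be sketched in Python as follows. Both functions are hypothetical illustrations written for this thread, not code from any real detector, and the sample inputs are invented.

```python
import codecs


def bom_only(data: bytes) -> bool:
    """The criticized approach: call it UTF-8 only if it starts with a BOM."""
    return data.startswith(codecs.BOM_UTF8)


def validation_first(data: bytes) -> bool:
    """Validate the whole input as UTF-8 and require at least one
    non-ASCII character; the BOM gets no special treatment."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return any(ord(ch) > 0x7F for ch in text)


cases = {
    "UTF-8 without BOM": "naïve".encode("utf-8"),
    "BOM glued onto Latin-1": codecs.BOM_UTF8 + "naïve".encode("latin-1"),
}
for label, data in cases.items():
    print(label, bom_only(data), validation_first(data))
```

Since a BOM is itself a non-ASCII character, a BOM followed by valid UTF-8 still passes `validation_first`, so the separate BOM check adds nothing in that case. Whether validation alone is trustworthy for very short inputs is the separate question debated earlier in this section.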