Byte order mark
Revision as of 17:43, 13 May 2018
The byte order mark (BOM) is a Unicode character, U+FEFF ZERO WIDTH NO-BREAK SPACE (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text:[1]
- What byte order, or endianness, the text stream is stored in;
- The fact that the text stream is Unicode, to a high level of confidence;
- Which of several Unicode encodings that text stream is encoded as.
BOM use is optional, and, if used, should appear at the start of the text stream.
Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. Because the BOM itself is encoded in the same scheme as the rest of the document, but has a known value, the consumer of the text can examine these first few bytes to determine the encoding. The BOM thus gives the producer of the text a way to describe the text endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it is free to process the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in text processing within a closed environment.
The byte sequence of the BOM character differs per Unicode encoding, and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can serve to indicate the text is Unicode and to identify the encoding scheme used, even for UTF-8, which has no endianness. This generalized use of the BOM character is called a Unicode signature,[2] and its use has been extended to Unicode-based encoding schemes that aren't part of the Unicode standard, such as UTF-7 (see table below).
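As a rough illustration of the signature idea, the following Python sketch checks the start of a byte string against the BOM constants in the standard `codecs` module. The function name `sniff_bom` and the returned encoding labels are choices made for this example only, and it covers just UTF-8, UTF-16 and UTF-32.

```python
import codecs

# Longer signatures are tested first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so UTF-32 must come before UTF-16.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no signature matches."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Because of the test order, a stream starting with FF FE 00 00 is reported as UTF-32 LE, although the same bytes could also be a UTF-16 LE BOM followed by a NUL character. A consumer that cares about that case would need extra context to decide.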
Usage
If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (which inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favor of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be used only as a BOM.
UTF-8
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence 0xEF,0xBB,0xBF. A text editor or web browser misinterpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
The Unicode Standard permits the BOM in UTF-8,[3] but does not require or recommend its use.[4] Byte order has no meaning in UTF-8,[5] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted from another stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[6][7] The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[8]
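The UTF-8 byte sequence and its mis-rendering can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are arbitrary. It also shows that Python's `utf-8` codec leaves a leading BOM in the decoded text, while the separate `utf-8-sig` codec strips it.

```python
# Encode U+FEFF as UTF-8 and look at the resulting bytes.
bom_bytes = "\ufeff".encode("utf-8")

# What a viewer sees if it wrongly assumes CP1252 for those bytes.
as_cp1252 = bom_bytes.decode("cp1252")

# A small UTF-8 stream that starts with the signature.
data = bom_bytes + "hello".encode("utf-8")

kept = data.decode("utf-8")          # BOM survives as U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is removed if present

print(bom_bytes.hex(" "), repr(as_cp1252), repr(kept), repr(stripped))
```

Choosing between the two codecs is a design decision: keeping the BOM preserves round-tripping, while stripping it avoids an invisible character at the start of the text.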
Not using a BOM allows text to be backwards-compatible with some software that is not Unicode-aware. Examples include programming languages that permit non-ASCII bytes in string literals but not at the start of the file.
If there is no BOM or other indication of the encoding, heuristic analysis is often able to reliably determine whether UTF-8 is in use due to the large number of byte sequences that are invalid in UTF-8. (When text is known to not be UTF-8, determining which legacy encoding can be difficult and uncertain. Several free libraries are available to ease the task, such as Mozilla Universal Charset Detector[9] and International Components for Unicode.[10])
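One very simple form of such a heuristic is to attempt a strict UTF-8 decode and treat any failure as evidence that the text is in some other encoding. The sketch below assumes the whole input is available in memory; the helper name `looks_like_utf8` is invented for this example and is far cruder than the detectors mentioned above.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII input also passes this check, since ASCII is valid UTF-8, so a positive result does not rule out an ASCII-compatible legacy encoding for such text.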
Microsoft compilers[11] and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Google Docs also adds a BOM when converting a document to a plain text file for download.
UTF-16
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in the text.
- If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1.
- If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.
In simple terms, U+FEFF has FE as its most significant byte and FF as its least significant byte, so the position of the FF byte in the FF/FE pair shows where the least significant byte of each 16-bit unit is stored. In little-endian order the least significant byte comes first in each pair of bytes, so the BOM at the start of a little-endian file looks like this:
Offset from beginning of file: 00 01 02
Byte value for little-endian:  FF FE ...
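The two UTF-16 byte orders, and the effect of reading with the wrong one, can be checked with Python's built-in codecs. This is a minimal sketch with arbitrary variable names.

```python
bom = "\ufeff"

be = bom.encode("utf-16-be")  # big-endian: most significant byte first
le = bom.encode("utf-16-le")  # little-endian: least significant byte first

# Reading big-endian bytes as little-endian swaps each pair of bytes,
# which turns U+FEFF into the noncharacter U+FFFE.
swapped = be.decode("utf-16-le")

# How the raw bytes look to a viewer that assumes ISO-8859-1.
be_as_latin1 = be.decode("iso-8859-1")
le_as_latin1 = le.decode("iso-8859-1")

print(be.hex(" "), le.hex(" "), hex(ord(swapped)), be_as_latin1, le_as_latin1)
```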
Programs expecting UTF-8 may display the BOM bytes as stray characters or as error indicators, depending on how they handle UTF-8 encoding errors (neither 0xFF nor 0xFE is valid anywhere in UTF-8). In either case they will probably display the rest of the file as garbage, although UTF-16 text that contains only ASCII characters will remain fairly readable.
For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".
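Python's codecs follow the same distinction, which makes for a convenient illustration: the generic `utf-16` codec consumes a leading BOM to pick the byte order, whereas `utf-16-le` treats it as an ordinary U+FEFF character. The sample bytes below are made up for this example.

```python
# "hi" in little-endian UTF-16, preceded by a BOM.
data = b"\xff\xfeh\x00i\x00"

generic = data.decode("utf-16")      # BOM is used to detect order, then dropped
explicit = data.decode("utf-16-le")  # byte order is fixed; U+FEFF stays in the text

# Encoding with an explicit byte order does not emit a BOM.
no_bom = "hi".encode("utf-16-le")

print(repr(generic), repr(explicit), no_bom)
```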
Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. When those same files are accessible on the Internet, on the other hand, no such presumption can be made. Searching for 16-bit characters in the ASCII range or just the space character (U+0020) is a method of determining the UTF-16 byte order.
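The ASCII-range heuristic mentioned above can be sketched as follows. Characters below U+0100 have a zero high byte, which sits at even offsets in big-endian data and at odd offsets in little-endian data. The function name `guess_utf16_byte_order` and the simple majority rule are assumptions of this example, not a standardized algorithm.

```python
def guess_utf16_byte_order(data: bytes):
    """Guess 'big' or 'little' from where zero bytes cluster; None if unclear."""
    zeros_at_even = sum(1 for b in data[0::2] if b == 0)
    zeros_at_odd = sum(1 for b in data[1::2] if b == 0)
    if zeros_at_even > zeros_at_odd:
        return "big"      # 00 xx pattern: high byte first
    if zeros_at_odd > zeros_at_even:
        return "little"   # xx 00 pattern: low byte first
    return None
```

The guess is only as good as the assumption that the text contains a fair share of ASCII-range characters; text made up mostly of other scripts may give no usable signal.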
The W3C/WHATWG encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" is to be interpreted as little-endian "to deal with deployed content".[12] However, if a byte-order mark is present, then that BOM is to be treated as "more authoritative than anything else".[13]
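The precedence described here can be approximated in a few lines of Python. This is a heavily simplified sketch, not the Encoding Standard's actual algorithm: the function name `decode_with_label` is hypothetical, only UTF-8 and UTF-16 BOMs are checked, and any other label is passed straight to Python's codec lookup rather than to the standard's label table.

```python
import codecs

_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def decode_with_label(data: bytes, label: str) -> str:
    """Decode data, letting a leading BOM override the declared label."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # The BOM wins over the label and is removed from the output.
            return data[len(bom):].decode(encoding, errors="replace")
    # No BOM: "utf-16" and "utf-16le" are both read as little-endian.
    if label.lower() in ("utf-16", "utf-16le"):
        label = "utf-16-le"
    return data.decode(label, errors="replace")
```

For example, input labelled "utf-16le" that begins with FE FF is decoded as big-endian under this sketch, because the BOM takes precedence over the label.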
UTF-32
Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.
Byte order marks by encoding
This table illustrates how the BOM character is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that interprets each byte in a legacy encoding (here CP1252, with symbols standing in for the C0 control characters):
Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters |
---|---|---|---|
UTF-8[a] | EF BB BF | 239 187 191 | ï»¿ |
UTF-16 (BE) | FE FF | 254 255 | þÿ |
UTF-16 (LE) | FF FE | 255 254 | ÿþ |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ␀␀þÿ (␀ refers to the ASCII null character) |
UTF-32 (LE) | FF FE 00 00 [b] | 255 254 0 0 | ÿþ␀␀ (␀ refers to the ASCII null character) |
UTF-7[a] | 2B 2F 76 38, 2B 2F 76 39, 2B 2F 76 2B, 2B 2F 76 2F [c], 2B 2F 76 38 2D [d] | 43 47 118 56, 43 47 118 57, 43 47 118 43, 43 47 118 47, 43 47 118 56 45 | +/v8 +/v9 +/v+ +/v/ +/v8- |
UTF-1[a] | F7 64 4C | 247 100 76 | ÷dL |
UTF-EBCDIC[a] | DD 73 66 73 | 221 115 102 115 | Ýsfs |
SCSU[a] | 0E FE FF [e] | 14 254 255 | ␎þÿ (␎ represents the ASCII "shift out" character) |
BOCU-1[a] | FB EE 28 | 251 238 40 | ûî( |
GB-18030[a] | 84 31 95 33 | 132 49 149 51 | „1•3 |
- ^ a b c d e f g This is not literally a "byte order" mark, since the byte is also the code unit in these encodings and there is no byte order to resolve. The sequence can be used to indicate the encoding of the text which it is preceding, however.[5][14]
- ^ This is the same byte pattern as a UTF-16LE file starting with a BOM and a NUL.
- ^ In UTF-7, the fourth byte of the BOM, before encoding as base64, is 001111xx in binary. The final two bits, xx, are not specifically part of the BOM, but contain the first two bits of the first encoded character following the BOM. All four possible byte combinations are shown in the table, as well as a fifth which is used for an empty string.
- ^ If no following character is encoded, 38 is used for the fourth byte and the following byte is 2D.
- ^ SCSU allows other encodings of U+FEFF; the shown form is the signature recommended in UTR #6.[15]
References
- ^ a b "FAQ - UTF-8, UTF-16, UTF-32 & BOM". Unicode.org. Retrieved 2017-01-28.
- ^ "The Unicode® Standard Version 9.0" (PDF). The Unicode Consortium.
- ^ "The Unicode Standard 5.0, Chapter 2:General Structure" (PDF). p. 36. Retrieved 2009-03-29.
Table 2-4. The Seven Unicode Encoding Schemes
- ^ "The Unicode Standard 5.0, Chapter 2:General Structure" (PDF). p. 36. Retrieved 2008-11-30.
Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature
- ^ a b "FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?". Unicode.org. Retrieved 2009-01-04.
- ^ "Re: pre-HTML5 and the BOM from Asmus Freytag on 2012-07-13 (Unicode Mail List Archive)". Unicode.org. Retrieved 2012-07-14.
- ^ "Bug ID: JDK-6378911 UTF-8 decoder handling of byte-order mark has changed". Bugs.sun.com. Retrieved 2017-01-28.
- ^ Yergeau, Francois (2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. RFC 3629. Retrieved May 15, 2014.
- ^ Shanjian Li. "A composite approach to language/encoding detection". Archive.mozilla.org. Retrieved 2017-01-28.
- ^ "ICU - International Components for Unicode". Site.icu-project.org. Retrieved 2017-01-28.
- ^ Alf P. Steinbach (2011). "Unicode part 1: Windows console i/o approaches". Retrieved 24 March 2012.
However, since the C++ source code was encoded as UTF-8 without BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI.
- ^ "UTF-16LE". Encoding Standard. WHATWG.
- ^ "Decode". Encoding Standard. WHATWG.
- ^ "RFC 3629 - UTF-8, a transformation format of ISO 10646". Tools.ietf.org. 2003-11-08. Retrieved 2017-01-28.
- ^ Markus Scherer. "UTS #6: Compression Scheme for Unicode". Unicode.org. Retrieved 2017-01-28.