UTF-8: Difference between revisions

From Wikipedia, the free encyclopedia
m →‎Advantages and disadvantages: spelling and context
Removed inappropriate word. These encodings have not been bequeathed in a will.
Line 176: Line 176:
Also due to the design of the byte sequences, if a sequence of bytes supposed to represent text validates as UTF-8 then it is fairly safe to assume it is UTF-8. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 3.1% for a 2 byte sequence, 2.0% for a 3 byte sequence and even lower for longer sequences.
Also due to the design of the byte sequences, if a sequence of bytes supposed to represent text validates as UTF-8 then it is fairly safe to assume it is UTF-8. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 3.1% for a 2 byte sequence, 2.0% for a 3 byte sequence and even lower for longer sequences.


While natural languages encoded in traditional encodings are far from random byte sequences, they are also unlikely to produce byte sequences that would pass a UTF-8 validity test and then be misinterpreted (obviously pure ASCII text would pass a UTF-8 validity test, but provided the legacy encodings under consideration are also ASCII-based, this is not a problem). For example, for [[ISO-8859-1]] text to be misrecognized as UTF-8, the '''only''' non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol.
While natural languages encoded in traditional encodings are far from random byte sequences, they are also unlikely to produce byte sequences that would pass a UTF-8 validity test and then be misinterpreted (obviously pure ASCII text would pass a UTF-8 validity test, but provided the encodings under consideration are also ASCII-based, this is not a problem). For example, for [[ISO-8859-1]] text to be misrecognized as UTF-8, the '''only''' non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol.


The bit patterns can be used to identify UTF-8 characters. If the byte's first hex code begins with 0–7, it is an ASCII character. If it begins with C or D, it is an 11 bit character (expressed in two bytes). If it begins with E, it is 16 bit (expressed in 3 bytes), and if it begins with F, it is 21 bits (expressed in 4 bytes). 8 through B cannot be first hex codes, but all ''following'' bytes must begin with a hex code between 8 through B. Thus, at a glance, it can be seen that "0xA9" is not a valid UTF-8 character, but that "0x54" or "0xE3 0xB4 0xB1" ''are'' valid UTF-8 characters.
The bit patterns can be used to identify UTF-8 characters. If the byte's first hex code begins with 0–7, it is an ASCII character. If it begins with C or D, it is an 11 bit character (expressed in two bytes). If it begins with E, it is 16 bit (expressed in 3 bytes), and if it begins with F, it is 21 bits (expressed in 4 bytes). 8 through B cannot be first hex codes, but all ''following'' bytes must begin with a hex code between 8 through B. Thus, at a glance, it can be seen that "0xA9" is not a valid UTF-8 character, but that "0x54" or "0xE3 0xB4 0xB1" ''are'' valid UTF-8 characters.
Line 186: Line 186:
# Insert a replacement character (usually '?' or '�' (U+FFFD)).
# Insert a replacement character (usually '?' or '�' (U+FFFD)).
# Ignore the bytes.
# Ignore the bytes.
# Interpret each byte according to a legacy encoding (often [[ISO/IEC 8859-1|ISO-8859-1]] or [[CP1252]]).
# Interpret each byte according to another encoding (often [[ISO/IEC 8859-1|ISO-8859-1]] or [[CP1252]]).
# Not notice and decode as if the bytes were some similar bit of UTF-8.
# Not notice and decode as if the bytes were some similar bit of UTF-8.
# Stop decoding and report an error (possibly giving the caller the option to continue).
# Stop decoding and report an error (possibly giving the caller the option to continue).
Line 208: Line 208:


=== General ===
=== General ===
'''Advantages'''
====Advantages====
* UTF-8 is a [[superset]] of ASCII. Since a plain ASCII string is also a valid UTF-8 string, no conversion needs to be done for existing ASCII text. Software designed for traditional extended ASCII character sets can generally be used with UTF-8 with few or no changes.
* UTF-8 is a [[superset]] of ASCII. Since a plain ASCII string is also a valid UTF-8 string, no conversion needs to be done for existing ASCII text. Software designed for traditional extended ASCII character sets can generally be used with UTF-8 with few or no changes.
* Sorting of UTF-8 strings using standard byte-oriented sorting routines will produce the same results as sorting them based on Unicode code points. (This has limited usefulness, though, since it is unlikely to represent the culturally acceptable sort order of any particular language or locale.)
* Sorting of UTF-8 strings using standard byte-oriented sorting routines will produce the same results as sorting them based on Unicode code points. (This has limited usefulness, though, since it is unlikely to represent the culturally acceptable sort order of any particular language or locale.)
Line 215: Line 215:
* UTF-8 strings can be fairly reliably recognized as such by a simple algorithm. That is, the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length. For instance, the octet values C0, C1, F5 to FF never appear. For better reliability, regular expressions can be used to take into account illegal overlong and surrogate values (see the [http://www.w3.org/International/questions/qa-forms-utf-8 W3 FAQ: Multilingual Forms] for a Perl regular expression to validate a UTF-8 string).
* UTF-8 strings can be fairly reliably recognized as such by a simple algorithm. That is, the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length. For instance, the octet values C0, C1, F5 to FF never appear. For better reliability, regular expressions can be used to take into account illegal overlong and surrogate values (see the [http://www.w3.org/International/questions/qa-forms-utf-8 W3 FAQ: Multilingual Forms] for a Perl regular expression to validate a UTF-8 string).


'''Disadvantages'''
====Disadvantages====
* A badly-written (and not compliant with current versions of the standard) UTF-8 [[parser]] could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.
* A badly-written (and not compliant with current versions of the standard) UTF-8 [[parser]] could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.


=== Compared to single-byte legacy encodings ===
=== Compared to single-byte encodings ===
'''Advantages'''
====Advantages====
* UTF-8 can encode any [[Unicode]] character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time.
* UTF-8 can encode any [[Unicode]] character, avoiding the need to figure out and set a "[[code page]]" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time.


'''Disadvantages'''
====Disadvantages====
* UTF-8 encoded text is larger than the appropriate [[legacy encoding]] for everything except [[diacritic]]-free, Latin-alphabet text.
* UTF-8 encoded text is larger than the appropriate single-byte encoding for everything except [[diacritic]]-free, Latin-alphabet text.
* Legacy encodings using a single byte per character make string cutting and joining easy even with simple-minded APIs.
* Single byte per character encodings make string cutting and joining easy even with simple-minded APIs.


=== Compared to multi-byte legacy encodings ===
=== Compared to multi-byte encodings ===
'''Advantages'''
====Advantages====
* UTF-8 can encode any [[Unicode]] character. In most cases, [[legacy encoding]]s can be converted to Unicode and back with no loss and — as UTF-8 is an encoding of Unicode — this applies to it too.
* UTF-8 can encode any [[Unicode]] character. In most cases, multi-byte encodings can be converted to Unicode and back with no loss and — as UTF-8 is an encoding of Unicode — this applies to it too.
* Character boundaries are easily found from anywhere in an octet stream (scanning either forwards or backwards). This implies that if a stream of bytes is scanned starting in the middle of a multi-byte sequence, only the information represented by the partial sequence is lost and decoding can begin correctly on the next character. Similarly, if a number of bytes are corrupted or dropped, then correct decoding can resume on the next character boundary. Many legacy multi-byte encodings are much harder to resynchronise.
* Character boundaries are easily found from anywhere in an octet stream (scanning either forwards or backwards). This implies that if a stream of bytes is scanned starting in the middle of a multi-byte sequence, only the information represented by the partial sequence is lost and decoding can begin correctly on the next character. Similarly, if a number of bytes are corrupted or dropped, then correct decoding can resume on the next character boundary. Many multi-byte encodings are much harder to resynchronise.
* A byte sequence for one character never occurs as part of a longer sequence for another character as it did in older variable-length encodings like [[Shift-JIS]] (see the previous section on this). For instance, US-ASCII octet values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems or other software (e.g., the [[printf]]() function in C libraries) that parse based on US-ASCII values but are transparent to other values.
* A byte sequence for one character never occurs as part of a longer sequence for another character as it did in older variable-length encodings like [[Shift-JIS]] (see the previous section on this). For instance, US-ASCII octet values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems or other software (e.g., the [[printf]]() function in C libraries) that parse based on US-ASCII values but are transparent to other values.
* The first byte of a multi-byte sequence is enough to determine the length of the multi-byte sequence. This makes it extremely simple to extract a sub-string from a given string without elaborate parsing. This was often not the case in legacy multi-byte encodings.
* The first byte of a multi-byte sequence is enough to determine the length of the multi-byte sequence. This makes it extremely simple to extract a sub-string from a given string without elaborate parsing. This was often not the case in multi-byte encodings.
* Efficient to encode using simple bit operations. UTF-8 does not require slower mathematical operations such as multiplication or division (unlike the obsolete [[UTF-1]] encoding).
* Efficient to encode using simple bit operations. UTF-8 does not require slower mathematical operations such as multiplication or division (unlike the obsolete [[UTF-1]] encoding).


'''Disadvantages'''
====Disadvantages====
* UTF-8 encoded text is generally larger than the appropriate legacy encoding for everything except [[diacritic]]-free, Latin-alphabet text. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate legacy encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their legacy encodings yet take three bytes per character in UTF-8.
* UTF-8 encoded text is generally larger than the appropriate multi-byte encoding for everything except [[diacritic]]-free, Latin-alphabet text. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.


=== Compared to UTF-7 ===
=== Compared to UTF-7 ===
'''Advantages'''
====Advantages====
* UTF-8 uses significantly fewer bytes per character for all non-ASCII characters.
* UTF-8 uses significantly fewer bytes per character for all non-ASCII characters.
* UTF-8 encodes "+" as itself whereas [[UTF-7]] encodes it as "+-".
* UTF-8 encodes "+" as itself whereas [[UTF-7]] encodes it as "+-".


'''Disadvantages'''
====Disadvantages====
* UTF-8 requires the transmission system to be eight-bit clean. In the case of e-mail this means it has to be further encoded using [[quoted printable]] or [[base64]] in some cases. This extra stage of encoding carries a significant size penalty. However, this disadvantage is not so important an issue any more because most [[mail transfer agent]]s in modern use are eight-bit clean and support 8BITMIME SMTP extension as specified in RFC 1869.
* UTF-8 requires the transmission system to be eight-bit clean. In the case of e-mail this means it has to be further encoded using [[quoted printable]] or [[base64]] in some cases. This extra stage of encoding carries a significant size penalty. However, this disadvantage is not so important an issue any more because most [[mail transfer agent]]s in modern use are eight-bit clean and support 8BITMIME SMTP extension as specified in RFC 1869.


=== Compared to UTF-16 ===
=== Compared to UTF-16 ===
'''Advantages'''
====Advantages====
* Byte values of 0 (The ASCII NUL character) do not appear in the encoding unless U+0000 (the Unicode NUL character) is represented. This means that legacy C library string functions (such as strcpy()) that use a [[null terminator]] will not incorrectly truncate strings.
* Byte values of 0 (The ASCII NUL character) do not appear in the encoding unless U+0000 (the Unicode NUL character) is represented. This means that [[standard C library]] string functions (such as strcpy()) that use a [[null terminator]] will not incorrectly truncate strings.
* Since ASCII characters can be represented in a single byte, text consisting of mostly diacritic-free Latin letters will be around half the size in UTF-8 than it would be in [[UTF-16]]. Text in many other alphabets will be slightly smaller in UTF-8 than it would be in UTF-16 because of the presence of spaces.
* Since ASCII characters can be represented in a single byte, text consisting of mostly diacritic-free Latin letters will be around half the size in UTF-8 than it would be in [[UTF-16]]. Text in many other alphabets will be slightly smaller in UTF-8 than it would be in UTF-16 because of the presence of spaces.
* Most existing [[computer program]]s (including [[operating system]]s) were not written with Unicode in mind. Using UTF-16 with them while maintaining compatibility with existing programs requires every system API, library function, and structure that takes a string to be duplicated. UTF-8 only requires APIs that specially treat bytes with the high bit set to be duplicated (this is close to none on both Unix and Windows, and such APIs are already locale-dependent if legacy encodings are used).
* Most existing [[computer program]]s (including [[operating system]]s) were not written with Unicode in mind. Using UTF-16 with them while maintaining compatibility with existing programs requires every system API, library function, and structure that takes a string to be duplicated. UTF-8 only requires APIs that specially treat bytes with the high bit set to be duplicated (this is close to none on both Unix and Windows, and such APIs are already locale-dependent if other single-byte or multi-byte encodings are used).
* In UTF-8, characters outside the basic multilingual plane are not a special case. UTF-16 is often mistaken to be constant-length, leading to code that works for most text but suddenly fails for non-[[Basic Multilingual Plane|BMP]] characters. Retrofitting code tends to be hard, so it's better to implement support for the entire range of Unicode from the start.
* In UTF-8, characters outside the basic multilingual plane are not a special case. UTF-16 is often mistaken to be constant-length, leading to code that works for most text but suddenly fails for non-[[Basic Multilingual Plane|BMP]] characters. Retrofitting code tends to be hard, so it's better to implement support for the entire range of Unicode from the start.
* UTF-8 uses a byte as its atomic unit while UTF-16 uses a 16-bit word which is generally represented by a pair of bytes. This representation raises a couple of potential problems of its own.
* UTF-8 uses a byte as its atomic unit while UTF-16 uses a 16-bit word which is generally represented by a pair of bytes. This representation raises a couple of potential problems of its own.
Line 255: Line 255:
** If an odd number of bytes are removed from the beginning of UTF-16-encoded text, the result will be either invalid UTF-16 or completely meaningless text. In UTF-8, if part of a multi-byte character is removed, only that character is affected and not the rest of the text.
** If an odd number of bytes are removed from the beginning of UTF-16-encoded text, the result will be either invalid UTF-16 or completely meaningless text. In UTF-8, if part of a multi-byte character is removed, only that character is affected and not the rest of the text.


'''Disadvantages'''
====Disadvantages====
* Characters above U+0800 in the [[Basic Multilingual Plane|BMP]] use three bytes in UTF-8, but only two in UTF-16. As a result, text in [for example] Chinese, Japanese or Hindi takes up more space when represented in UTF-8. However, this disadvantage is partly offset by the fact that characters below U+0080 (Latin letters, numbers and punctuation marks, space, carriage return and line feed) that frequently appear in those text take only one byte in UTF-8 while they take two bytes in UTF-16.
* Characters above U+0800 in the [[Basic Multilingual Plane|BMP]] use three bytes in UTF-8, but only two in UTF-16. As a result, text in [for example] Chinese, Japanese or Hindi takes up more space when represented in UTF-8. However, this disadvantage is partly offset by the fact that characters below U+0080 (Latin letters, numbers and punctuation marks, space, carriage return and line feed) that frequently appear in those text take only one byte in UTF-8 while they take two bytes in UTF-16.



Revision as of 01:20, 12 September 2007

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.

UTF-8 encodes each character in one to four octets (8-bit bytes):

  1. One byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F).
  2. Two bytes are needed for Latin letters with diacritics and for characters from Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF).
  3. Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use).
  4. Four bytes are needed for characters in the other planes of Unicode, which are rarely used in practice.
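
As a rough check of the four ranges above, the following sketch encodes one sample character from each range with Python's built-in UTF-8 codec and reports the byte count. The sample characters and the helper name `utf8_length` are illustrative choices, not part of any standard.

```python
# One sample character from each of the four ranges listed above.
# The characters themselves are arbitrary examples.
samples = {
    "\u0041": "U+0041, US-ASCII",
    "\u00e9": "U+00E9, Latin letter with diacritic",
    "\u20ac": "U+20AC, elsewhere in the Basic Multilingual Plane",
    "\U0001d11e": "U+1D11E, outside the Basic Multilingual Plane",
}


def utf8_length(ch):
    """Return how many bytes Python's UTF-8 codec emits for one character."""
    return len(ch.encode("utf-8"))


for ch, note in samples.items():
    print(f"{note}: {utf8_length(ch)} byte(s)")
```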

Four bytes may seem like a lot for one character (code point). However, code points outside the Basic Multilingual Plane are generally very rare. Furthermore, UTF-16 (the main alternative to UTF-8) also needs four bytes for these code points. Whether UTF-8 or UTF-16 is more efficient depends on the range of code points being used. However, the differences between different encoding schemes can become negligible with the use of traditional compression systems like DEFLATE. For short items of text where traditional algorithms do not perform well and size is important, the Standard Compression Scheme for Unicode could be considered instead.
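
The dependence on the range of code points can be illustrated with a small sketch that measures the same text in both encodings. The helper name `encoded_sizes` and the sample strings are assumptions made for illustration; `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes for the given text."""
    # "utf-16-le" avoids the 2-byte byte order mark that "utf-16" would add.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


ascii_text = "hello"                 # ASCII only
cjk_text = "\u65e5\u672c\u8a9e"      # three BMP characters above U+0800
non_bmp_text = "\U0001d11e"          # one character outside the BMP

for label, text in [("ASCII", ascii_text), ("CJK", cjk_text), ("non-BMP", non_bmp_text)]:
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{label}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

ASCII text is half the size in UTF-8, the three-byte BMP characters are larger in UTF-8, and characters outside the BMP take four bytes in both.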

The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data with UTF-8 as at least one supported encoding.[1] The Internet Mail Consortium (IMC) recommends[2] that all email programs be able to display and create mail using UTF-8.

History

By early 1992 a search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF that provided a byte-stream encoding of its 32-bit characters. This encoding was not satisfactory on performance grounds, but did introduce the notion that bytes in the ASCII range of 0–127 represent themselves in UTF, thereby providing backward compatibility.

In July 1992 the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only 8-bit characters, i.e. those where the high bit was set.

In August 1992 this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Laboratories then made a crucial modification to the encoding, to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string in order to find character boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. Over the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.

UTF-8 was first officially presented at the USENIX conference in San Diego, January 25–29, 1993.

Microsoft's 1996 specification for CAB (MS Cabinet) explicitly allows UTF-8 encoded strings throughout (although this predates the formal standardisation of UTF-8), but the encoder never actually implemented this.

Description

There are several current, slightly different[citation needed] definitions of UTF-8 in various standards documents:

  • RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element
  • The Unicode Standard, Version 4.0, §3.9–§3.10 (2003)
  • ISO/IEC 10646-1:2000 Annex D (2000)

They supersede the definitions given in the following obsolete works:

  • ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)
  • The Unicode Standard, Version 2.0, Appendix A (1996)
  • RFC 2044 (1996)
  • RFC 2279 (1998)
  • The Unicode Standard, Version 3.0, §2.3 (2000) plus Corrigendum #1: UTF-8 Shortest Form (2000)
  • Unicode Standard Annex #27: Unicode 3.1 (2001)

They are all the same in their general mechanics with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.

The bits of a Unicode character are divided into several groups which are then divided among the lower bit positions inside the UTF-8 bytes. A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII. In other cases, up to four bytes are required. The most significant bit of these bytes is 1, to prevent confusion with 7-bit ASCII characters and therefore keep standard byte-oriented string processing safe.

Code range (hexadecimal), with scalar value (binary), UTF-8 form (binary / hexadecimal) and notes:

000000–00007F (128 codes)
  • Scalar value (binary): 00000000 00000000 0zzzzzzz (seven z)
  • UTF-8 (binary / hexadecimal): 0zzzzzzz (00-7F) (seven z)
  • Notes: ASCII equivalence range; byte begins with zero.

000080–0007FF (1920 codes)
  • Scalar value (binary): 00000000 00000yyy yyzzzzzz (three y; two y, six z)
  • UTF-8 (binary / hexadecimal): 110yyyyy (C2-DF) 10zzzzzz (80-BF) (five y; six z)
  • Notes: first byte begins with 110, the following byte begins with 10.

000800–00D7FF and 00E000–00FFFF (61440 codes)
  • Scalar value (binary): 00000000 xxxxyyyy yyzzzzzz (four x, four y; two y, six z)
  • UTF-8 (binary / hexadecimal): 1110xxxx (E0-EF) 10yyyyyy 10zzzzzz (four x; six y; six z)
  • Notes: first byte begins with 1110, the following 2 bytes begin with 10.

010000–10FFFF (1048576 codes)
  • Scalar value (binary): 000wwwxx xxxxyyyy yyzzzzzz (three w, two x; four x, four y; two y, six z)
  • UTF-8 (binary / hexadecimal): 11110www (F0-F4) 10xxxxxx 10yyyyyy 10zzzzzz (three w; six x; six y; six z)
  • Notes: first byte begins with 11110, the following 3 bytes begin with 10.

For example, the character aleph (א), which is Unicode U+05D0, is encoded into UTF-8 in this way:

  • It falls into the range of U+0080 to U+07FF. The table shows it will be encoded using two bytes, 110yyyyy 10zzzzzz.
  • Hexadecimal 0x05D0 is equivalent to binary 101-1101-0000.
  • The eleven bits are put in their order into the positions marked by "y"-s and "z"-s: 11010111 10010000.
  • The final result is the two bytes, more conveniently expressed as the two hexadecimal bytes 0xD7 0x90. That is the encoding of the character aleph (א) in UTF-8.
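
The steps above can be sketched as a small encoder that places the bits of a code point into the byte patterns from the table using only shifts and masks. The function name `encode_utf8` is hypothetical, and the sketch is checked against Python's built-in codec rather than offered as a reference implementation.

```python
def encode_utf8(cp):
    """Encode one Unicode scalar value as UTF-8 using shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:                      # 0zzzzzzz
        return bytes([cp])
    if cp < 0x800:                     # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                     # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


aleph = encode_utf8(0x05D0)
print(aleph.hex(" "))  # the two bytes worked out above for aleph
```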

Width by first byte:

Binary Hexadecimal Decimal Width
00000000-01111111 0-7F 0-127 1 byte
11000010-11011111 C2-DF 194-223 2 bytes
11100000-11101111 E0-EF 224-239 3 bytes
11110000-11110100 F0-F4 240-244 4 bytes

So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.
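
The width table translates directly into a lookup on the first byte. The sketch below uses a hypothetical helper name, `sequence_length`, and treats every byte outside the four listed ranges as an error, which is one possible policy among several.

```python
def sequence_length(lead):
    """Return the length in bytes of the sequence that starts with `lead`."""
    if 0x00 <= lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # 80-BF are continuation bytes; C0, C1 and F5-FF never appear at all.
    raise ValueError(f"byte {lead:#04x} cannot start a UTF-8 sequence")


for lead in (0x54, 0xD7, 0xE3, 0xF0):
    print(f"{lead:#04x}: {sequence_length(lead)} byte(s)")
```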

By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the universal character set). However, in November 2003 RFC 3629 restricted UTF-8 to the range covered by the formal Unicode definition, U+0000 to U+10FFFF. With these restrictions, the following byte values never appear in a legal UTF-8 sequence:

Codes (binary) Codes (hexadecimal) Notes
1100000x C0, C1 Overlong encoding: lead byte of a 2-byte sequence, but code point <= 127
11110101, 1111011x F5, F6, F7 Restricted by RFC 3629: lead byte of 4-byte sequence for codepoint above 10FFFF
111110xx, 1111110x F8, F9, FA, FB, FC, FD Restricted by RFC 3629: lead byte of a sequence 5 or 6 bytes long
1111111x FE, FF Invalid: lead byte of a sequence 7 or 8 bytes long

While the two categories labeled "Restricted by RFC" above were technically allowed by earlier UTF-8 specifications, no characters were ever assigned to the code points they represent, so they should never have appeared in UTF-8-encoded text.
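
The list of byte values that never appear can be checked empirically. The sketch below encodes every Unicode scalar value (U+0000 to U+10FFFF, excluding the surrogate range) with Python's built-in codec and collects the byte values that occur. It verifies that codec's output rather than the standard itself, and the function name is illustrative.

```python
def bytes_used_by_utf8():
    """Return the set of byte values occurring in the UTF-8 form of any scalar value."""
    # Surrogates (D800-DFFF) are not scalar values and cannot be encoded.
    scalars = (cp for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)
    encoded = "".join(map(chr, scalars)).encode("utf-8")
    return set(encoded)


never_used = sorted(set(range(256)) - bytes_used_by_utf8())
print(" ".join(f"{b:02X}" for b in never_used))
```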

The design of the algorithm has some similarities with Huffman coding.

UTF-8 derivations

Windows

Although not part of the standard, many Windows programs (including Windows Notepad) use the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.
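
A short sketch of this signature, using Python's standard `codecs` module: it shows the three bytes, how they look when misread as ISO-8859-1, and how the `utf-8-sig` codec strips them on decoding. The sample payload `"hello"` is arbitrary.

```python
import codecs

# U+FEFF encoded as UTF-8 gives the three-byte signature EF BB BF.
bom = "\ufeff".encode("utf-8")
print(bom.hex(" "))

# Misread as ISO-8859-1, the same three bytes become three visible characters.
as_latin1 = bom.decode("iso-8859-1")

# A file that starts with the signature, as written by some Windows programs.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

kept = data.decode("utf-8")          # plain UTF-8 keeps U+FEFF as a character
stripped = data.decode("utf-8-sig")  # "utf-8-sig" removes a leading signature
```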

Java

In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter.

However, Java also supports a non-standard variant of UTF-8 called modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constants in class files. There are two differences between modified and standard UTF-8. The first difference is that the null character (U+0000) is encoded with two bytes instead of one, specifically as 0xc0 0x80. This ensures that there are no embedded nulls in the encoded string, to address the concern that if the encoded string is processed in a language such as C, a null byte would signal the end of the string, and cause premature truncation.

The second difference is in the way characters outside the Basic Multilingual Plane are encoded. In standard UTF-8 these characters are encoded using the four-byte format above. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence as in CESU-8, taking up 6 bytes in total. Each Java character represents a 16 bit value. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change.

Because modified UTF-8 is not UTF-8, one needs to be very careful to avoid mislabelling data in modified UTF-8 as UTF-8 when interchanging information over the Internet.
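
The two differences can be sketched in a few lines. This is an illustrative encoder written from the description above, not Java's own implementation. The function name `encode_modified_utf8` is hypothetical, and Python's `surrogatepass` error handler is used only to emit the three-byte form of each surrogate.

```python
def encode_modified_utf8(text):
    """Encode text following the two modified UTF-8 rules described above."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Difference 1: U+0000 becomes the two bytes C0 80.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Difference 2: split into a UTF-16 surrogate pair, then encode
            # each surrogate separately as a three-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


standard = "\U00010400".encode("utf-8")         # 4 bytes in standard UTF-8
modified = encode_modified_utf8("\U00010400")   # 6 bytes in modified UTF-8
print(standard.hex(" "), "vs", modified.hex(" "))
```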

Mac OS X

The Mac OS X Operating System uses canonically decomposed Unicode, encoded using UTF-8 for file names in the filesystem. This is sometimes referred to as UTF-8-MAC. In canonically decomposed Unicode, the use of precomposed characters is forbidden and combining diacritics must be used to replace them.

This makes sorting far simpler but can be confusing for software built around the assumption that precomposed characters are the norm and combining diacritics only used to form unusual combinations. This is an example of the NFD variant of Unicode normalization—most other platforms, including Windows and Linux, use the NFC form of Unicode normalization, which is also used by W3C standards, so NFD data must typically be converted to NFC for use on other platforms or the Web.

This is discussed in Apple Q&A 1173.
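
The difference between the two normalization forms is visible at the byte level. The sketch below uses Python's standard `unicodedata` module on a single sample character (U+00E9). It illustrates NFD and NFC in general and does not reproduce any file system's exact behaviour.

```python
import unicodedata

precomposed = "\u00e9"  # e with acute accent as one precomposed character

# NFD splits it into the base letter plus a combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", precomposed)

# NFC recombines the pair into the single precomposed character.
recomposed = unicodedata.normalize("NFC", decomposed)

print(precomposed.encode("utf-8").hex(" "))  # NFC form in UTF-8
print(decomposed.encode("utf-8").hex(" "))   # NFD form in UTF-8
```

The two byte strings differ even though they display identically, which is why NFD file names may need converting before being compared with NFC data.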

Rationale behind UTF-8's design

As a consequence of the design of UTF-8, the following properties of multi-byte sequences hold:

  • The most significant bit of a single-byte character is always 0.
  • The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
  • The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.

UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although this property adds redundancy to UTF-8–encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently. This also means that if one or more complete bytes are lost due to error or corruption, one can resynchronize at the beginning of the next character and thus limit the damage.
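
Resynchronization follows from the rule that continuation bytes always start with the bits 10. The sketch below skips such bytes to reach the next character boundary. The helper name `next_boundary` and the sample text are assumptions for illustration.

```python
def next_boundary(data, pos):
    """Return the first index >= pos that is not a continuation byte (10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "a\u05d0b".encode("utf-8")  # 61 D7 90 62

# Starting in the middle of the two-byte sequence, decoding resumes at "b".
resume_at = next_boundary(data, 2)
tail = data[resume_at:].decode("utf-8")

# Byte-wise substring search finds the character without decoding anything.
found_at = data.find("\u05d0".encode("utf-8"))
```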

Also due to the design of the byte sequences, if a sequence of bytes supposed to represent text validates as UTF-8 then it is fairly safe to assume it is UTF-8. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 3.1% for a 2 byte sequence, 2.0% for a 3 byte sequence and even lower for longer sequences.

While natural languages encoded in traditional encodings are far from random byte sequences, they are also unlikely to produce byte sequences that would pass a UTF-8 validity test and then be misinterpreted (obviously pure ASCII text would pass a UTF-8 validity test, but provided the encodings under consideration are also ASCII-based, this is not a problem). For example, for ISO-8859-1 text to be misrecognized as UTF-8, the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol.
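
A minimal detection heuristic along these lines simply attempts a strict decode. The function name `looks_like_utf8` is hypothetical. The last sample shows the kind of ISO-8859-1 text the paragraph describes, an accented letter followed by a symbol, which does pass the test.

```python
def looks_like_utf8(data):
    """Return True if the bytes form valid UTF-8 according to Python's strict decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_text = "caf\u00e9".encode("utf-8")          # valid UTF-8
latin1_text = "caf\u00e9".encode("iso-8859-1")   # E9 with no continuation byte
ambiguous = "\u00c3\u00a9".encode("iso-8859-1")  # bytes C3 A9: also valid UTF-8

for label, sample in [("UTF-8", utf8_text), ("ISO-8859-1", latin1_text), ("ambiguous", ambiguous)]:
    print(label, looks_like_utf8(sample))
```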

The bit patterns can be used to identify UTF-8 characters. If the first hex digit of a byte is 0–7, the byte is an ASCII character. If it is C or D, the byte starts an 11-bit character (expressed in two bytes). If it is E, the byte starts a 16-bit character (expressed in three bytes), and if it is F, it starts a 21-bit character (expressed in four bytes). A first byte cannot have 8 through B as its first hex digit, but every following byte of the sequence must. Thus, at a glance, it can be seen that "0xA9" is not a valid UTF-8 character, but that "0x54" and "0xE3 0xB4 0xB1" are.
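This classification can be written as a short Python function. It is a sketch rather than a full validator, and the name sequence_length is hypothetical. It is slightly stricter than the hex-digit rule: it also rejects C0, C1 and F5 to FF, which never appear in valid UTF-8.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length announced by a lead byte, or 0 if it cannot start a character."""
    if first_byte < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2  # 110xxxxx (C0 and C1 would only produce overlong forms)
    if 0xE0 <= first_byte <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= first_byte <= 0xF4:
        return 4  # 11110xxx (F5 and above would exceed U+10FFFF)
    return 0      # 80 to BF are continuation bytes; the rest are never valid


lengths = {hex(b): sequence_length(b) for b in (0x54, 0xA9, 0xE3, 0xC0)}

# The three-byte example from the text decodes to a single character.
example = bytes([0xE3, 0xB4, 0xB1]).decode("utf-8")
```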

Overlong forms, invalid input, and security considerations

The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

  1. Insert a replacement character (usually '?' or '�' (U+FFFD)).
  2. Ignore the bytes.
  3. Interpret each byte according to another encoding (often ISO-8859-1 or CP1252).
  4. Not notice and decode as if the bytes were some similar bit of UTF-8.
  5. Stop decoding and report an error (possibly giving the caller the option to continue).

It is possible for a decoder to behave in different ways for different types of invalid input.
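Several of these behaviours can be observed with Python's built-in decoder, which selects a policy through its errors argument. The sketch below uses an arbitrary sample input; the ISO-8859-1 fallback shown reinterprets the whole string rather than individual bytes, which is a simplification of option 3.

```python
# 0xE9 is "é" in ISO-8859-1 but an incomplete three-byte sequence in UTF-8.
bad = b"caf\xe9 ok"

# Option 1: insert U+FFFD for each invalid sequence.
replaced = bad.decode("utf-8", errors="replace")

# Option 2: silently drop the invalid bytes.
ignored = bad.decode("utf-8", errors="ignore")

# Option 5: stop and report an error (the default, errors="strict").
try:
    bad.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start

# Option 3 (simplified): fall back to another encoding for the whole input.
fallback = bad.decode("iso-8859-1")
```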

RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." [3] The Unicode Standard requires a Unicode-compliant decoder to "…treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded, but older specifications for UTF-8 only gave a warning and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high-profile products including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8, and it is generally much simpler to handle overlong forms before any input validation is done.
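The following Python sketch shows why overlong forms matter. The function naive_decode_two_byte is a deliberately unsafe, hypothetical decoder that extracts the payload bits without checking that the shortest form was used; it is compared with Python's built-in strict decoder.

```python
def naive_decode_two_byte(data: bytes) -> str:
    """Decode one two-byte sequence WITHOUT rejecting overlong forms (unsafe)."""
    lead, trail = data
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


# An overlong two-byte encoding of "/" (U+002F), which should be the single byte 2F.
overlong_slash = b"\xc0\xaf"

# A byte-level filter looking for 0x2F does not see a slash here...
filter_sees_slash = b"/" in overlong_slash

# ...but the naive decoder produces one anyway.
naive_result = naive_decode_two_byte(overlong_slash)

# A compliant decoder refuses the sequence.
try:
    overlong_slash.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False
```

A check performed on the raw bytes passes, yet the decoded text contains the character the check was meant to exclude.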

Another common problem is decoders that do not check that the trailing bytes are really trailing bytes. This will cause more characters to be lost than necessary if some bytes are lost or corrupted.

To maintain security in the case of invalid input, there are a few options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, returns either an error or text that the application knows to be harmless. A third possibility is to not decode the UTF-8 at all. This is quite practical if the system treats only some ASCII characters (like slash and NUL) specially and treats all other bytes as identifiers or other data, but it requires care to avoid passing invalid UTF-8 to other code (such as third-party libraries or an operating system) that cannot safely handle it.

Advantages and disadvantages

String length

In general, it is not possible to determine from the number of code points in a Unicode string how much space it needs to be displayed, or where on a screen the cursor should be placed in a text buffer after displaying a string; combining characters, double width characters, proportional fonts, non-printing characters and right-to-left characters all contribute to this.

So while the number of octets in a UTF-8 string is related in a more complex way to the number of code points than for UTF-32, it is very rare to encounter a situation where this makes a difference in practice.

General

Advantages

  • UTF-8 is a superset of ASCII. Since a plain ASCII string is also a valid UTF-8 string, no conversion needs to be done for existing ASCII text. Software designed for traditional extended ASCII character sets can generally be used with UTF-8 with few or no changes.
  • Sorting of UTF-8 strings using standard byte-oriented sorting routines will produce the same results as sorting them based on Unicode code points. (This has limited usefulness, though, since it is unlikely to represent the culturally acceptable sort order of any particular language or locale.)
  • UTF-8 and UTF-16 are the standard encodings for XML documents. All other encodings must be specified explicitly either externally or through a text declaration. [1]
  • Any byte oriented string search algorithm can be used with UTF-8 data (as long as one ensures that the inputs only consist of complete UTF-8 characters). Care must be taken with regular expressions and other constructs that count characters, however.
  • UTF-8 strings can be fairly reliably recognized as such by a simple algorithm. That is, the probability that a string of characters in any other encoding appears as valid UTF-8 is low, diminishing with increasing string length. For instance, the octet values C0, C1, F5 to FF never appear. For better reliability, regular expressions can be used to take into account illegal overlong and surrogate values (see the W3 FAQ: Multilingual Forms for a Perl regular expression to validate a UTF-8 string).
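The sorting property listed above can be checked with a short Python sketch. The word list is an arbitrary sample; Python compares str values by code point, so the built-in sort serves as the reference order.

```python
words = ["zebra", "Ärger", "apple", "日本", "éclair", "\U0001F600"]

# Reference order: Python compares strings code point by code point.
by_code_point = sorted(words)

# Byte-oriented order: compare the raw UTF-8 encodings instead.
by_utf8_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
```

Both orders place "zebra" before "Ärger", which shows why code point order is rarely the order a reader of any particular language would expect.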

Disadvantages

  • A badly-written (and not compliant with current versions of the standard) UTF-8 parser could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.

Compared to single-byte encodings

Advantages

  • UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple languages at the same time.

Disadvantages

  • UTF-8 encoded text is larger than the appropriate single-byte encoding for everything except diacritic-free, Latin-alphabet text.
  • Single byte per character encodings make string cutting and joining easy even with simple-minded APIs.

Compared to multi-byte encodings

Advantages

  • UTF-8 can encode any Unicode character. In most cases, multi-byte encodings can be converted to Unicode and back with no loss and — as UTF-8 is an encoding of Unicode — this applies to it too.
  • Character boundaries are easily found from anywhere in an octet stream (scanning either forwards or backwards). This implies that if a stream of bytes is scanned starting in the middle of a multi-byte sequence, only the information represented by the partial sequence is lost and decoding can begin correctly on the next character. Similarly, if a number of bytes are corrupted or dropped, then correct decoding can resume on the next character boundary. Many multi-byte encodings are much harder to resynchronise.
  • A byte sequence for one character never occurs as part of a longer sequence for another character as it did in older variable-length encodings like Shift-JIS (see the previous section on this). For instance, US-ASCII octet values do not appear otherwise in a UTF-8 encoded character stream. This provides compatibility with file systems or other software (e.g., the printf() function in C libraries) that parse based on US-ASCII values but are transparent to other values.
  • The first byte of a multi-byte sequence is enough to determine the length of the multi-byte sequence. This makes it extremely simple to extract a sub-string from a given string without elaborate parsing. This was often not the case in multi-byte encodings.
  • Efficient to encode using simple bit operations. UTF-8 does not require slower mathematical operations such as multiplication or division (unlike the obsolete UTF-1 encoding).
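As an illustration of the last point, a single code point can be encoded with nothing but comparisons, shifts and masks. The Python sketch below uses a hypothetical function name and follows the bit layout described earlier in the article.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using only shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])                          # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against Python's built-in encoder for one sample of each length.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
matches = all(encode_code_point(cp) == chr(cp).encode("utf-8") for cp in samples)
```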

Disadvantages

  • UTF-8 encoded text is generally larger than the appropriate multi-byte encoding for everything except diacritic-free, Latin-alphabet text. Latin letters with diacritics and characters from other alphabetic scripts typically take one byte per character in the appropriate multi-byte encoding but take two in UTF-8. East Asian scripts generally have two bytes per character in their multi-byte encodings yet take three bytes per character in UTF-8.

Compared to UTF-7

Advantages

  • UTF-8 uses significantly fewer bytes per character for all non-ASCII characters.
  • UTF-8 encodes "+" as itself whereas UTF-7 encodes it as "+-".
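Both points can be observed with the UTF-7 and UTF-8 codecs that ship with Python. This is a small sketch with an arbitrary sample word.

```python
# "+" is an ordinary character in UTF-8 but the shift character in UTF-7.
plus_utf8 = "+".encode("utf-8")
plus_utf7 = "+".encode("utf-7")

# A word with one non-ASCII letter, encoded both ways.
word = "naïve"
size_utf8 = len(word.encode("utf-8"))
size_utf7 = len(word.encode("utf-7"))
```

In UTF-7 the single letter "ï" has to be wrapped in a base64 section, so the word comes out longer than its UTF-8 form.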

Disadvantages

  • UTF-8 requires the transmission system to be eight-bit clean. In the case of e-mail this means that in some cases it has to be further encoded using quoted-printable or base64. This extra stage of encoding carries a significant size penalty. However, this disadvantage matters less than it once did, because most mail transfer agents in modern use are eight-bit clean and support the 8BITMIME SMTP extension (RFC 1652, which builds on the SMTP service extension framework of RFC 1869).

Compared to UTF-16

Advantages

  • Byte values of 0 (the ASCII NUL character) do not appear in the encoding unless U+0000 (the Unicode NUL character) is represented. This means that standard C library string functions (such as strcpy()) that use a null terminator will not incorrectly truncate strings.
  • Since ASCII characters can be represented in a single byte, text consisting of mostly diacritic-free Latin letters will be around half the size in UTF-8 that it would be in UTF-16. Text in many other alphabets will be slightly smaller in UTF-8 than it would be in UTF-16 because of the presence of spaces.
  • Most existing computer programs (including operating systems) were not written with Unicode in mind. Using UTF-16 with them while maintaining compatibility with existing programs requires every system API, library function, and structure that takes a string to be duplicated. UTF-8 only requires APIs that specially treat bytes with the high bit set to be duplicated (this is close to none on both Unix and Windows, and such APIs are already locale-dependent if other single-byte or multi-byte encodings are used).
  • In UTF-8, characters outside the basic multilingual plane are not a special case. UTF-16 is often mistaken to be constant-length, leading to code that works for most text but suddenly fails for non-BMP characters. Retrofitting code tends to be hard, so it's better to implement support for the entire range of Unicode from the start.
  • UTF-8 uses a byte as its atomic unit while UTF-16 uses a 16-bit word which is generally represented by a pair of bytes. This representation raises a couple of potential problems of its own.
    • When representing a word in UTF-16 as two bytes, the order of those two bytes becomes an issue. A variety of mechanisms can be used to deal with this issue (for example, the Byte Order Mark), but they still present an added complication for software and protocol design.
    • If an odd number of bytes are removed from the beginning of UTF-16-encoded text, the result will be either invalid UTF-16 or completely meaningless text. In UTF-8, if part of a multi-byte character is removed, only that character is affected and not the rest of the text.
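Two of the points above, the absence of NUL bytes and the effect of losing an odd number of bytes, can be illustrated with a Python sketch. The helper c_strlen is a hypothetical stand-in for a C routine that stops at the first zero byte, and little-endian UTF-16 without a byte order mark is assumed.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C's strlen(): count the bytes before the first NUL."""
    idx = buf.find(b"\x00")
    return len(buf) if idx == -1 else idx


text = "Hello"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # assumed byte order, no byte order mark

# A NUL-terminated string routine sees all of the UTF-8 but only one byte of the UTF-16.
visible_utf8 = c_strlen(utf8)
visible_utf16 = c_strlen(utf16)

# Dropping a single leading byte misaligns every following UTF-16 code unit.
shifted = utf16[1:].decode("utf-16-le", errors="replace")
```

After the one-byte shift none of the original letters survive, whereas in UTF-8 only a character that was actually cut would be affected.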

Disadvantages

  • Characters at or above U+0800 in the BMP use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi takes up more space when represented in UTF-8. However, this disadvantage is partly offset by the fact that characters below U+0080 (Latin letters, numbers and punctuation marks, space, carriage return and line feed) that frequently appear in such texts take only one byte in UTF-8 while they take two bytes in UTF-16.
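The size trade-off can be measured directly. The Python sketch below uses three arbitrary sample strings and counts bytes in UTF-8 and in UTF-16 without a byte order mark.

```python
samples = {
    "english": "The quick brown fox",
    "chinese": "中文文本",
    "markup": "<p>中文</p>",   # Chinese text wrapped in ASCII markup
}

# Map each sample to (UTF-8 size, UTF-16 size) in bytes.
sizes = {
    name: (len(s.encode("utf-8")), len(s.encode("utf-16-le")))
    for name, s in samples.items()
}
```

Pure Chinese text is larger in UTF-8, but a modest amount of ASCII markup around it is enough to reverse the result.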

Notes

  1. ^ RFC 2277 section 3.1
  2. ^ IMC.
  3. ^ Yergeau, F. (2003), "UTF-8, a transformation format of ISO 10646", RFC 3629, IETF

See also

External links