UTF-32

From Wikipedia, the free encyclopedia
Jump to: navigation, search

UTF-32 (or UCS-4) stands for Unicode Transformation Format 32 bits. It is a protocol to encode Unicode code points that uses exactly 32 bits per Unicode code point. This makes UTF-32 a fixed-length encoding, in contrast to all other Unicode transformation formats which are variable-length encodings. The UTF-32 form of a code point is a direct representation of that code point's numerical value.[1]

The main advantage of UTF-32, versus variable-length encodings, is that the Unicode code points are directly indexable. Examining the n'th code point is a constant time operation.[2] In contrast, a variable-length code requires sequential access to find the n'th code point. This makes UTF-32 a simple replacement in code that uses integers to index characters out of strings, as was commonly done for ASCII.

The main disadvantage of UTF-32 is that it is space inefficient, using four bytes per code point. Non-BMP characters are so rare in most texts[citation needed], they may as well be considered non-existent for sizing issues, making UTF-32 up to twice the size of UTF-16 and up to four times the size of UTF-8.

History[edit]

The original ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 32-bit friendly code value in the code space of integers between 0 and hexadecimal 7FFFFFFF.

Because only 17 planes are actually in use, all current code points are between 0 and 0x10FFFF. UTF-32 is a subset of UCS-4 that uses only this range. Since the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the BMP or the first 14 supplementary planes, UTF-32 will be able to represent all Unicode characters.

Use[edit]

The main use of UTF-32 is in internal APIs where the data is single code points or glyphs, rather than strings of characters. For instance in modern text rendering it is common that the last step is to build a list of structures each containing x,y position, attributes, and a single UTF-32 character identifying the glyph to draw. Often non-Unicode information is stored in the "unused" 11 bits of each word.

On Unix systems, UTF-32 strings are sometimes used for storage, due to the type wchar_t being defined as 32-bits. Python versions up to 3.2 can be compiled to use them instead of UTF-16; from version 3.3 onward, UTF-16 support is dropped, and a system is used whereby strings are stored in UTF-32 but with leading zero bytes optimized away where unnecessary.[3] Seed7 and Lasso encodes all characters and strings with UTF-32. Use of UTF-32 strings on Windows (where wchar_t is 16 bits) is almost non-existent.

Non-use in HTML5[edit]

HTML5 states that "authors should not use UTF-32, as the encoding detection algorithms described in this specification intentionally do not distinguish it from UTF-16."[4]

See also[edit]

References[edit]

External links[edit]