UTF-32 (or UCS-4) stands for Unicode Transformation Format, 32 bits. It is an encoding of Unicode code points that uses exactly 32 bits per code point. This makes UTF-32 a fixed-length encoding, in contrast to all other Unicode transformation formats, which are variable-length encodings. The UTF-32 form of a code point is a direct representation of that code point's numerical value.
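As a minimal sketch of the "direct representation" described above, the following Python encodes a single code point as one big-endian UTF-32 code unit by writing its numerical value into four bytes. The function name `utf32_be_from_code_point` is hypothetical and chosen only for illustration. The sketch checks only the numeric range, and it compares its result against Python's built-in `utf-32-be` codec.

```python
def utf32_be_from_code_point(cp: int) -> bytes:
    """Encode one code point as a single 4-byte big-endian UTF-32 code unit."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    # The code unit is simply the code point's value, zero-padded to 32 bits.
    return cp.to_bytes(4, "big")


# The hand-rolled result matches Python's built-in codec (which adds no BOM
# when the byte order is given explicitly).
assert utf32_be_from_code_point(ord("A")) == "A".encode("utf-32-be")
assert utf32_be_from_code_point(0x1F600) == "\U0001F600".encode("utf-32-be")
```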
The main advantage of UTF-32 over variable-length encodings is that Unicode code points are directly indexable: examining the Nth code point is a constant-time operation. In contrast, a variable-length encoding requires sequential access to find the Nth code point. This makes UTF-32 a simple replacement in code that uses integers to index characters out of strings, as was commonly done for ASCII.
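The difference in access pattern can be sketched as follows. The helper names are hypothetical, both functions assume well-formed input, and the UTF-32 buffer is assumed to be little-endian with no byte order mark. The UTF-32 version computes the byte offset directly, while the UTF-8 version has to walk over every preceding sequence.

```python
import struct


def utf8_sequence_length(lead_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if lead_byte >= 0xF0:
        return 4
    if lead_byte >= 0xE0:
        return 3
    return 2


def nth_code_point_utf32(buf: bytes, n: int) -> int:
    """Constant time: the n-th code unit lives at byte offset 4 * n."""
    (cp,) = struct.unpack_from("<I", buf, 4 * n)
    return cp


def nth_code_point_utf8(buf: bytes, n: int) -> int:
    """Linear time: skip n sequences one by one, then decode the next one."""
    offset = 0
    for _ in range(n):
        offset += utf8_sequence_length(buf[offset])
    length = utf8_sequence_length(buf[offset])
    return ord(buf[offset:offset + length].decode("utf-8"))
```

For example, with `text = "a\u00e9\u20ac\U0001F600z"`, both functions return `ord(text[i])` for every index `i` when given `text.encode("utf-32-le")` and `text.encode("utf-8")` respectively.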
The main disadvantage of UTF-32 is that it is space-inefficient, using four bytes per code point. Non-BMP characters are so rare in most texts (with the exception of emoji, mathematical alphanumeric symbols, and certain CJK ideographs) that they can be ignored when estimating size, making UTF-32 up to twice the size of UTF-16 and up to four times the size of UTF-8.
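The size ratios above can be checked with a small sketch that encodes the same text three ways using Python's built-in codecs. The function name `encoded_sizes` is hypothetical, and the explicit little-endian codecs are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of text in UTF-8, UTF-16 and UTF-32."""
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-le"), ("UTF-32", "utf-32-le"))
    return {name: len(text.encode(codec)) for name, codec in codecs}


# ASCII text: UTF-32 is four times UTF-8 and twice UTF-16.
print(encoded_sizes("hello"))        # {'UTF-8': 5, 'UTF-16': 10, 'UTF-32': 20}
# BMP CJK text: UTF-32 is twice UTF-16, but UTF-8 already needs 3 bytes each.
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # {'UTF-8': 9, 'UTF-16': 6, 'UTF-32': 12}
# A non-BMP character takes four bytes in all three encodings.
print(encoded_sizes("\U0001F600"))   # {'UTF-8': 4, 'UTF-16': 4, 'UTF-32': 4}
```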
The original ISO 10646 standard defines a 31-bit encoding form called UCS-4, in which each encoded character in the Universal Character Set (UCS) is represented by a 31-bit code value in the code space of integers between 0x00000000 and 0x7FFFFFFF.
UTF-32 is UCS-4 restricted to the range 0x000000 to 0x10FFFF, to match the limit on the code space imposed by UTF-16. The only codes above 0x10FFFF that were assigned meaning in UCS are additional Private Use Areas in the ranges 0x00E00000 to 0x00FFFFFF and 0x60000000 to 0x7FFFFFFF. Since the Principles and Procedures document of JTC1/SC2/WG2 states that all future assignments of characters will be constrained to the Unicode range, UTF-32 will be able to represent all UCS characters, and UTF-32 and UCS-4 are essentially identical.
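The two code spaces described above differ only in their upper bound, which the following sketch expresses as a pair of range checks. The constant and function names are hypothetical, and the checks cover only the numeric ranges given in the text, not any other well-formedness rule.

```python
UCS4_MAX = 0x7FFFFFFF   # 31-bit code space of the original ISO 10646 UCS-4
UTF32_MAX = 0x10FFFF    # upper limit shared with UTF-16


def in_ucs4_range(value: int) -> bool:
    """True if value fits the original 31-bit UCS-4 code space."""
    return 0 <= value <= UCS4_MAX


def in_utf32_range(value: int) -> bool:
    """True if value fits the restricted UTF-32 code space."""
    return 0 <= value <= UTF32_MAX


# A value from the former upper Private Use Area is valid UCS-4 but not UTF-32.
assert in_ucs4_range(0x60000000) and not in_utf32_range(0x60000000)
```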
Though a fixed number of bytes per code point appears convenient, it is less useful than it seems. It makes truncation easier, but not significantly so compared to UTF-8 and UTF-16 (both of which can search backwards for the point to truncate by looking at 2–4 code units at most).
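The backward search mentioned above can be sketched for UTF-8 as follows, next to the UTF-32 equivalent. Both function names are hypothetical, the input is assumed to be well-formed, and `max_bytes` is assumed to be non-negative. In UTF-8, continuation bytes have the bit pattern `10xxxxxx`, so stepping back over at most three of them always lands on the start of a sequence.

```python
def truncate_utf8(buf: bytes, max_bytes: int) -> bytes:
    """Cut buf to at most max_bytes without splitting a UTF-8 sequence."""
    if len(buf) <= max_bytes:
        return buf
    cut = max_bytes
    # If the byte at the cut point is a continuation byte, the cut would land
    # inside a sequence, so move back to that sequence's lead byte.
    while cut > 0 and (buf[cut] & 0xC0) == 0x80:
        cut -= 1
    return buf[:cut]


def truncate_utf32(buf: bytes, max_bytes: int) -> bytes:
    """Cut buf to at most max_bytes, rounded down to a whole code unit."""
    return buf[: (max_bytes // 4) * 4]
```

With `"a\u20acb".encode("utf-8")` (five bytes), a limit of 2 or 3 bytes yields `b"a"`, because the three-byte euro sign would not fit in the remaining space.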
Code very rarely needs to find the Nth code point without first examining code points 0 to N–1, so an integer index that is incremented by 1 for each character can be replaced with an integer offset, measured in code units and incremented by the number of code units as each character is examined. This removes the speed advantage that novice programmers may believe UTF-32 has.
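A minimal sketch of the offset-based loop described above, here for little-endian UTF-16 and assuming well-formed input. The generator name `iter_code_points_utf16` is hypothetical. The offset counts 16-bit code units and advances by 1 or 2 depending on whether a surrogate pair was consumed.

```python
import struct


def iter_code_points_utf16(buf: bytes):
    """Yield (offset, code_point) pairs; offset is measured in 16-bit code units."""
    units = struct.unpack(f"<{len(buf) // 2}H", buf)
    offset = 0
    while offset < len(units):
        unit = units[offset]
        if 0xD800 <= unit <= 0xDBFF and offset + 1 < len(units):
            # High surrogate: combine it with the following low surrogate.
            low = units[offset + 1]
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            width = 2
        else:
            code_point = unit
            width = 1
        yield offset, code_point
        # Advance by the number of code units just consumed, not by 1.
        offset += width
```

For `"a\U0001F600b".encode("utf-16-le")` this yields the offsets 0, 1 and 3, since the emoji occupies two code units.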
UTF-32 does not make calculating the displayed width of a string easier, since even with a “fixed width” font there may be more than one code point per character position (combining marks) or more than one character position per code point (for example CJK ideographs). Editors that limit themselves to left-to-right languages and precomposed characters can take advantage of fixed-sized code units, but such editors are unlikely to support non-BMP characters and thus can work equally well with 16-bit UTF-16 encoding.
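The mismatch between code points and character positions can be illustrated with a rough column counter built on Python's `unicodedata` module. The function name `display_columns` is hypothetical, and the rule used here (combining marks take no column, East Asian wide and fullwidth characters take two) is a simplifying assumption; real terminals and fonts may differ.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Approximate the number of fixed-width columns text occupies."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            # A combining mark shares the cell of the preceding character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2  # wide or fullwidth: two character positions
        else:
            columns += 1
    return columns


# Two code points, one column: "e" followed by COMBINING ACUTE ACCENT.
assert display_columns("e\u0301") == 1
# Two code points, four columns: two CJK ideographs.
assert display_columns("\u65e5\u672c") == 4
```

In both cases the code point count, which is what a UTF-32 length gives directly, differs from the column count.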
The main use of UTF-32 is in internal APIs where the data is single code points or glyphs, rather than strings of characters. For instance, in modern text rendering it is common that the last step is to build a list of structures, each containing an x,y position, attributes, and a single UTF-32 character identifying the glyph to draw. Often non-Unicode information is stored in the "unused" 11 bits of each word.
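Because code points need at most 21 bits, the upper 11 bits of a 32-bit word are free. One way such packing might look is sketched below; the layout (flags in the high bits, code point in the low 21 bits) and the names `pack_glyph_word` and `unpack_glyph_word` are assumptions for illustration, not a description of any particular renderer.

```python
CODE_POINT_BITS = 21
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1   # 0x1FFFFF
FLAG_BITS = 32 - CODE_POINT_BITS               # the 11 "unused" bits


def pack_glyph_word(code_point: int, flags: int) -> int:
    """Store flags in the top 11 bits and the code point in the low 21 bits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if not 0 <= flags < (1 << FLAG_BITS):
        raise ValueError("flags do not fit in 11 bits")
    return (flags << CODE_POINT_BITS) | code_point


def unpack_glyph_word(word: int) -> tuple:
    """Return (code_point, flags) from a packed 32-bit word."""
    return word & CODE_POINT_MASK, word >> CODE_POINT_BITS


word = pack_glyph_word(0x1F600, 0b101)
assert unpack_glyph_word(word) == (0x1F600, 0b101)
```

A word packed this way is no longer valid UTF-32 whenever the flags are non-zero, so it has to be masked before being passed to code that expects plain code points.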
On Unix systems, UTF-32 strings are sometimes used for storage, because the type wchar_t is defined as 32 bits. Python versions up to 3.2 can be compiled to use them instead of UTF-16; from version 3.3 onward, UTF-16 support is dropped, and a system is used whereby strings are stored in UTF-32 but with leading zero bytes optimized away where unnecessary. Seed7 and Lasso encode all characters and strings with UTF-32. Use of UTF-32 strings on Windows (where wchar_t is 16 bits) is almost non-existent.
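The platform difference in wchar_t can be observed from Python through the standard `ctypes` module, as in this small sketch. The function name `wchar_bits` is hypothetical, and the result depends on the platform and C compiler the interpreter was built with, so no particular output is assumed here.

```python
import ctypes


def wchar_bits() -> int:
    """Return the width in bits of the platform's C wchar_t type."""
    return 8 * ctypes.sizeof(ctypes.c_wchar)


# Typically 32 on Unix-like systems and 16 on Windows, per the text above.
print(wchar_bits())
```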
- SIL, Mapping code points to Unicode encoding forms, § 1: UTF-32
- Löwis, Martin. "PEP 393 -- Flexible String Representation". python.org. Python. Retrieved 26 October 2014.
- The Unicode Standard 5.0.0, chapter 3 – formally defines UTF-32 in § 3.10, D99-D101
- Unicode Standard Annex #19 – formally defined UTF-32 for Unicode 3.x (March 2001; last updated March 2002)
- Registration of new charsets: UTF-32, UTF-32BE, UTF-32LE – announcement of UTF-32 being added to the IANA charset registry (April 2002)