UTF-1

From Wikipedia, the free encyclopedia

UTF-1 is a method of transforming ISO 10646/Unicode code points into a stream of bytes. Its design does not provide self-synchronization, which makes substring searching and error recovery difficult. It reuses the printable ASCII byte values as trailing bytes of multi-byte sequences, making it unsuited for some uses (for instance, Unix filenames cannot contain the byte value used for the forward slash). UTF-1 is also slow to encode and decode because it requires division and multiplication by a number that is not a power of 2. Because of these issues, it did not gain acceptance and was quickly replaced by UTF-8.

Design

Like UTF-8, UTF-1 is a multi-byte encoding: a single code point is encoded in one, two, three, or five bytes. All code points from U+0000 to U+009F, which includes the entire ASCII range, are encoded as a single byte.

UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: the bytes 0x00–0x20 and 0x7F–0x9F always stand for the corresponding code point. This design, with 66 protected byte values, was intended to be compatible with ISO 2022.
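The count of 66 protected values can be checked directly. The following is a minimal sketch; the names `protected` and `trail_bytes` are illustrative and do not come from any specification.

```python
# Byte values UTF-1 reserves for single-byte use: C0 controls plus space
# (0x00-0x20), and DEL plus the C1 controls (0x7F-0x9F).
protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))

# Every other byte value remains available inside a multi-byte sequence.
trail_bytes = [b for b in range(256) if b not in protected]

print(len(protected))    # 66
print(len(trail_bytes))  # 190
```

The remaining 190 values are 0x21–0x7E and 0xA0–0xFF, so printable ASCII bytes such as 0x2F (forward slash) can appear inside a multi-byte sequence.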

UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, plus a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6 bits; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
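The modulo-190 arithmetic also fixes where each sequence length begins. The sketch below assumes the lead-byte ranges that can be read off the comparison table in this section (A1–F5 for two-byte sequences starting at U+0100, F6–FB for three-byte sequences, FC and above for five-byte sequences); the variable names are illustrative.

```python
BASE = 256 - 66  # 190 usable trailing-byte values

first_two_byte = 0x100  # A1 21 encodes U+0100
# Lead bytes A1..F5, each followed by one trailing byte.
first_three_byte = first_two_byte + (0xF5 - 0xA1 + 1) * BASE
# Lead bytes F6..FB, each followed by two trailing bytes.
first_five_byte = first_three_byte + (0xFB - 0xF6 + 1) * BASE ** 2

print(hex(first_three_byte))  # 0x4016
print(hex(first_five_byte))   # 0x38e2e

# Two five-byte lead bytes (FC and FD) already reach past the 31-bit limit.
print(first_five_byte + 2 * BASE ** 4 > 0x7FFFFFFF)  # True
```

The two computed boundaries match the points in the table where the UTF-1 column grows from two to three bytes (U+4016) and from three to five bytes (U+38E2E).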

Code point  UTF-8              UTF-1
U+007F      7F                 7F
U+0080      C2 80              80
U+009F      C2 9F              9F
U+00A0      C2 A0              A0 A0
U+00BF      C2 BF              A0 BF
U+00C0      C3 80              A0 C0
U+00FF      C3 BF              A0 FF
U+0100      C4 80              A1 21
U+015D      C5 9D              A1 7E
U+015E      C5 9E              A1 A0
U+01BD      C6 BD              A1 FF
U+01BE      C6 BE              A2 21
U+07FF      DF BF              AA 72
U+0800      E0 A0 80           AA 73
U+0FFF      E0 BF BF           B5 48
U+1000      E1 80 80           B5 49
U+4015      E4 80 95           F5 FF
U+4016      E4 80 96           F6 21 21
U+D7FF      ED 9F BF           F7 2F C3
U+E000      EE 80 80           F7 3A 79
U+F8FF      EF A3 BF           F7 5C 3C
U+FDD0      EF B7 90           F7 62 BA
U+FDEF      EF B7 AF           F7 62 D9
U+FEFF      EF BB BF           F7 64 4C
U+FFFD      EF BF BD           F7 65 AD
U+FFFE      EF BF BE           F7 65 AE
U+FFFF      EF BF BF           F7 65 AF
U+10000     F0 90 80 80        F7 65 B0
U+38E2D     F0 B8 B8 AD        FB FF FF
U+38E2E     F0 B8 B8 AE        FC 21 21 21 21
U+FFFFF     F3 BF BF BF        FC 21 37 B2 7A
U+100000    F4 80 80 80        FC 21 37 B2 7B
U+10FFFF    F4 8F BF BF        FC 21 39 6E 6C
U+7FFFFFFF  FD BF BF BF BF BF  FD BD 2B B9 40

Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
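The UTF-1 column of the table can be reproduced with a short encoder. This is a sketch rather than a reference implementation: the thresholds (0xA0, 0x100, 0x4016, 0x38E2E), the lead bytes, and the digit-to-byte mapping are inferred from the table rows above, and the function names `utf1_encode` and `_trail` are illustrative.

```python
def _trail(z: int) -> int:
    """Map a base-190 digit (0..189) to a byte outside the protected ranges."""
    # Digits 0..93 land on 0x21..0x7E; digits 94..189 land on 0xA0..0xFF.
    return z + 0x21 if z < 0x5E else z + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) as UTF-1 (illustrative sketch)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0xA0:                      # one byte, value unchanged
        return bytes([cp])
    if cp < 0x100:                     # two bytes with the fixed lead byte A0
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two bytes, lead bytes A1..F5
        y = cp - 0x100
        return bytes([0xA1 + y // 190, _trail(y % 190)])
    if cp < 0x38E2E:                   # three bytes, lead bytes F6..FB
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2,
                      _trail(y // 190 % 190),
                      _trail(y % 190)])
    y = cp - 0x38E2E                   # five bytes, lead bytes FC and above
    return bytes([0xFC + y // 190**4,
                  _trail(y // 190**3 % 190),
                  _trail(y // 190**2 % 190),
                  _trail(y // 190 % 190),
                  _trail(y % 190)])


for cp in (0x9F, 0xA0, 0x15E, 0x4016, 0xD7FF, 0x10FFFF, 0x7FFFFFFF):
    print(f"U+{cp:04X}", utf1_encode(cp).hex(" ").upper())
```

The loop prints the same byte sequences as the corresponding table rows. The repeated division and remainder by 190 is the non-power-of-2 arithmetic that makes UTF-1 slower than UTF-8's bit shifts, and the result for U+D7FF (F7 2F C3) contains the byte 0x2F, the ASCII forward slash.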
