UTF-1


UTF-1 is a method of transforming ISO 10646/Unicode into a stream of bytes. Its design makes it impossible to resynchronise if decoding starts in the middle of a character (which, among other things, makes truncation hard), and simple byte-oriented search routines cannot be used reliably with it. UTF-1 is also fairly slow, because encoding and decoding require division by a number that is not a power of 2. Because of these issues, UTF-1 never gained wide acceptance and has been replaced by UTF-8.

Design

UTF-1 is a multi-byte encoding like UTF-8; a single Unicode code point can be encoded in one, two, three, or five octets. While the ASCII range is encoded as one octet, as in UTF-8, the ASCII octets 0x21 - 0x7E (decimal 33 - 126) are also used in UTF-1 multi-byte encodings; therefore UTF-1 is unsuited for many Internet protocols, including MIME.
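The overlap with printable ASCII can be seen directly from the byte values. The following is a minimal Python sketch, using the UTF-1 bytes A1 21 for U+0100 as listed in the comparison table in this article; the variable names are illustrative only.

```python
# UTF-1 bytes for U+0100 (A1 21), copied from the comparison table.
utf1_bytes = bytes.fromhex("A1 21")
# UTF-8 bytes for the same code point (C4 80), produced by Python itself.
utf8_bytes = "\u0100".encode("utf-8")

needle = b"!"  # ASCII exclamation mark, octet 0x21

# A naive byte search finds "!" inside the UTF-1 form of U+0100,
# although the text contains no exclamation mark at all.
found_in_utf1 = needle in utf1_bytes
found_in_utf8 = needle in utf8_bytes

print(found_in_utf1, found_in_utf8)  # prints: True False
```

The false match arises because the trailing octet 0x21 is indistinguishable from a stand-alone ASCII character, which is the property that makes simple byte-oriented searching unreliable.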

UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: any octet in the range 0x00–0x20 or 0x7F–0x9F stands for the corresponding code point in ISO-8859-1 (U+0000–0020 and U+007F–009F, respectively). This design, with 66 protected octets (33 in each range), was an attempt at ISO 2022 compatibility.

The UTF-1 encoding scheme uses "modulo 190" arithmetic (256 - 66 = 190); it was designed to encode the complete 31 bits of the original Universal Character Set (UCS-4). For comparison, UTF-8 protects all 128 ASCII octets, and needs two bits in trailing bytes of multi-byte encodings for this purpose, resulting in "modulo 64" arithmetic (8 - 2 = 6, 2^6 = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 - 13 = 243).
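The "modulo 190" arithmetic can be sketched as a small encoder. The range boundaries (0xA0, 0x100, 0x4016, 0x38E2E), the lead-byte bases (0xA1, 0xF6, 0xFC) and the digit mapping below are not given in this text; they are assumptions chosen to agree with the byte values in the comparison table in this article. The function names `digit_to_octet` and `utf1_encode` are hypothetical. Treat this as an illustration rather than a reference implementation.

```python
def digit_to_octet(z: int) -> int:
    """Map a base-190 digit (0..189) to one of the 190 unprotected octets."""
    if not 0 <= z < 190:
        raise ValueError("digit out of range")
    # Digits 0..93 map to 0x21..0x7E, digits 94..189 map to 0xA0..0xFF,
    # so the 66 protected octets are never produced.
    return z + 0x21 if z < 0x5E else z + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) as UTF-1 (assumed layout)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0xA0:                      # one octet, value unchanged
        return bytes([cp])
    if cp < 0x100:                     # two octets: 0xA0 prefix, then the value
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two octets: lead digit plus one base-190 digit
        y = cp - 0x100
        return bytes([0xA1 + y // 190, digit_to_octet(y % 190)])
    if cp < 0x38E2E:                   # three octets
        y = cp - 0x4016
        return bytes([
            0xF6 + y // 190**2,
            digit_to_octet(y // 190 % 190),
            digit_to_octet(y % 190),
        ])
    y = cp - 0x38E2E                   # five octets
    return bytes([
        0xFC + y // 190**4,
        digit_to_octet(y // 190**3 % 190),
        digit_to_octet(y // 190**2 % 190),
        digit_to_octet(y // 190 % 190),
        digit_to_octet(y % 190),
    ])


print(utf1_encode(0x10FFFF).hex(" ").upper())  # prints: FC 21 39 6E 6C
```

Each trailing octet requires a division and remainder by 190, whereas UTF-8 needs only shifts and masks because 64 is a power of 2.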

Code point  UTF-16BE     UTF-16LE     UTF-8        UTF-1
U+007F      00 7F        7F 00        7F           7F
U+0080      00 80        80 00        C2 80        80
U+009F      00 9F        9F 00        C2 9F        9F
U+00A0      00 A0        A0 00        C2 A0        A0 A0
U+00BF      00 BF        BF 00        C2 BF        A0 BF
U+00C0      00 C0        C0 00        C3 80        A0 C0
U+00FF      00 FF        FF 00        C3 BF        A0 FF
U+0100      01 00        00 01        C4 80        A1 21
U+015D      01 5D        5D 01        C5 9D        A1 7E
U+015E      01 5E        5E 01        C5 9E        A1 A0
U+01BD      01 BD        BD 01        C6 BD        A1 FF
U+01BE      01 BE        BE 01        C6 BE        A2 21
U+07FF      07 FF        FF 07        DF BF        AA 72
U+0800      08 00        00 08        E0 A0 80     AA 73
U+0FFF      0F FF        FF 0F        E0 BF BF     B5 48
U+1000      10 00        00 10        E1 80 80     B5 49
U+4015      40 15        15 40        E4 80 95     F5 FF
U+4016      40 16        16 40        E4 80 96     F6 21 21
U+D7FF      D7 FF        FF D7        ED 9F BF     F7 2F C3
U+E000      E0 00        00 E0        EE 80 80     F7 3A 79
U+F8FF      F8 FF        FF F8        EF A3 BF     F7 5C 3C
U+FDD0      FD D0        D0 FD        EF B7 90     F7 62 BA
U+FDEF      FD EF        EF FD        EF B7 AF     F7 62 D9
U+FEFF      FE FF        FF FE        EF BB BF     F7 64 4C
U+FFFD      FF FD        FD FF        EF BF BD     F7 65 AD
U+FFFE      FF FE        FE FF        EF BF BE     F7 65 AE
U+FFFF      FF FF        FF FF        EF BF BF     F7 65 AF
U+10000     D8 00 DC 00  00 D8 00 DC  F0 90 80 80  F7 65 B0
U+38E2D     D8 A3 DE 2D  A3 D8 2D DE  F0 B8 B8 AD  FB FF FF
U+38E2E     D8 A3 DE 2E  A3 D8 2E DE  F0 B8 B8 AE  FC 21 21 21 21
U+FFFFF     DB BF DF FF  BF DB FF DF  F3 BF BF BF  FC 21 37 B2 7A
U+100000    DB C0 DC 00  C0 DB 00 DC  F4 80 80 80  FC 21 37 B2 7B
U+10FFFF    DB FF DF FF  FF DB FF DF  F4 8F BF BF  FC 21 39 6E 6C
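The UTF-1 column above can be cross-checked with a small decoder. The sketch below assumes that the lead octet alone determines the sequence length (below 0xA0: one octet; 0xA0–0xF5: two; 0xF6–0xFB: three; 0xFC–0xFD: five) and that trailing octets are base-190 digits. These rules are inferred from the table rather than stated in this text, and the names `octet_to_digit` and `utf1_decode_one` are hypothetical.

```python
def octet_to_digit(octet: int) -> int:
    """Map an unprotected octet back to its base-190 digit (0..189)."""
    if 0x21 <= octet <= 0x7E:
        return octet - 0x21
    if 0xA0 <= octet <= 0xFF:
        return octet - 0x42
    raise ValueError(f"octet {octet:#04x} is protected and cannot be a digit")


def utf1_decode_one(data: bytes) -> int:
    """Decode exactly one UTF-1 sequence into a code point (assumed layout)."""
    if not data:
        raise ValueError("empty input")
    lead, trail = data[0], data[1:]
    if lead == 0xA0:
        # Special two-octet form for U+00A0..U+00FF: second octet is the value.
        if len(trail) != 1 or trail[0] < 0xA0:
            raise ValueError("malformed A0 sequence")
        return trail[0]
    if lead < 0xA0:
        need, offset, acc = 0, 0, lead
    elif lead <= 0xF5:
        need, offset, acc = 1, 0x100, lead - 0xA1
    elif lead <= 0xFB:
        need, offset, acc = 2, 0x4016, lead - 0xF6
    elif lead <= 0xFD:
        need, offset, acc = 4, 0x38E2E, lead - 0xFC
    else:
        raise ValueError("0xFE and 0xFF are not valid lead octets here")
    if len(trail) != need:
        raise ValueError("wrong number of trailing octets")
    for octet in trail:
        acc = acc * 190 + octet_to_digit(octet)  # accumulate base-190 digits
    return acc + offset


# A few rows from the table: (code point, UTF-1 bytes in hex).
rows = [(0x015E, "A1 A0"), (0x10000, "F7 65 B0"), (0x10FFFF, "FC 21 39 6E 6C")]
for cp, hex_bytes in rows:
    assert utf1_decode_one(bytes.fromhex(hex_bytes)) == cp
```

The decoder has to begin at a lead octet. A trailing octet such as 0x21 or 0xA0 is also a valid start of a sequence, so a decoder that begins mid-character has no way to detect the error and resynchronise.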

References

  • ISO IR 178 (PDF, 256 KB, the retired UTF-1 specification)