UTF-EBCDIC

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Spitzak at 15:44, 29 May 2020.

UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.

To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first, creating what the specification calls an I8 sequence. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte and therefore later mapped to corresponding EBCDIC control codes. To achieve this, UTF-8-Mod uses 101XXXXX instead of 10XXXXXX as the format for trailing bytes in a multi-byte sequence. As each trailing byte can hold only 5 bits rather than 6, the UTF-8-Mod encoding of a code point above U+009F is never shorter than its UTF-8 encoding and is often longer.
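The bit layout described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name encode_i8 is made up for this example, and the lead-byte patterns (110xxxxx, 1110xxxx, 11110xxx, 111110xx) are assumed to follow those of UTF-8, which agrees with the start bytes listed in the code page layout.

```python
def encode_i8(cp: int) -> bytes:
    """Encode one Unicode code point as a UTF-8-Mod (I8) byte sequence."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    if cp <= 0x9F:
        # ASCII and the C1 control codes are stored as a single byte.
        return bytes([cp])
    # (largest code point, lead-byte marker) for the 2-, 3-, 4- and 5-byte forms.
    forms = [(0x3FF, 0xC0), (0x3FFF, 0xE0), (0x3FFFF, 0xF0), (0x10FFFF, 0xF8)]
    for trailing, (limit, marker) in enumerate(forms, start=1):
        if cp <= limit:
            lead = marker | (cp >> (5 * trailing))
            # Every trailing byte is 101xxxxx: 0xA0 plus 5 payload bits.
            tail = [0xA0 | ((cp >> (5 * i)) & 0x1F)
                    for i in range(trailing - 1, -1, -1)]
            return bytes([lead, *tail])
    raise AssertionError("unreachable: range was checked above")


if __name__ == "__main__":
    print(encode_i8(0x41).hex())    # 41
    print(encode_i8(0x85).hex())    # 85 (a C1 control stays one byte)
    print(encode_i8(0xA0).hex())    # c5a0
    print(encode_i8(0x4E2D).hex())  # f0b3b1ad
```

U+4E2D illustrates the size penalty: it needs four bytes here but only three in UTF-8.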

The UTF-8-Mod transformation leaves the data in an ASCII-based format (for example, U+0041 "A" is still encoded as 01000001), so each byte is fed through a reversible (one-to-one) lookup table to produce the final UTF-EBCDIC encoding. For example, 01000001 in this table maps to 11000001; thus the UTF-EBCDIC encoding of U+0041 (Unicode's "A") is 0xC1 (EBCDIC's "A").
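The byte-for-byte lookup step can be sketched as follows. The table here is an assumption in the sense that it was transcribed from the code page layout in this article rather than taken from the technical report itself; the names LAYOUT, i8_to_utf_ebcdic and utf_ebcdic_to_i8 are invented for the example, and the six byte values the layout leaves undefined are simply left out of the mapping.

```python
# UTF-EBCDIC byte -> I8 byte, one text row per 16 byte values (0x00-0x0F first).
# Transcribed from the code page layout in this article; "--" marks a byte
# the layout leaves undefined.
LAYOUT = """
00 01 02 03 9C 09 86 7F 97 8D 8E 0B 0C 0D 0E 0F
10 11 12 13 9D 0A 08 87 18 19 92 8F 1C 1D 1E 1F
80 81 82 83 84 85 17 1B 88 89 8A 8B 8C 05 06 07
90 91 16 93 94 95 96 04 98 99 9A 9B 14 15 9E 1A
20 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 2E 3C 28 2B 7C
26 AA AB AC AD AE AF B0 B1 B2 21 24 2A 29 3B 5E
2D 2F B3 B4 B5 B6 B7 B8 B9 BA BB 2C 25 5F 3E 3F
BC BD BE BF C0 C1 C2 C3 C4 60 3A 23 40 27 3D 22
C5 61 62 63 64 65 66 67 68 69 C6 C7 C8 C9 CA CB
CC 6A 6B 6C 6D 6E 6F 70 71 72 CD CE CF D0 D1 D2
D3 7E 73 74 75 76 77 78 79 7A D4 D5 D6 5B D7 D8
D9 DA DB DC DD DE DF E0 E1 E2 E3 E4 E5 5D E6 E7
7B 41 42 43 44 45 46 47 48 49 E8 E9 EA EB EC ED
7D 4A 4B 4C 4D 4E 4F 50 51 52 EE EF F0 F1 F2 F3
5C F4 53 54 55 56 57 58 59 5A F5 F6 F7 F8 F9 --
30 31 32 33 34 35 36 37 38 39 -- -- -- -- -- 9F
"""

cells = LAYOUT.split()
assert len(cells) == 256

EBCDIC_TO_I8 = {pos: int(c, 16) for pos, c in enumerate(cells) if c != "--"}
# The table is one-to-one, so it can be inverted without losing entries.
I8_TO_EBCDIC = {i8: pos for pos, i8 in EBCDIC_TO_I8.items()}


def i8_to_utf_ebcdic(data: bytes) -> bytes:
    """Map every byte of an I8 sequence through the lookup table."""
    return bytes(I8_TO_EBCDIC[b] for b in data)


def utf_ebcdic_to_i8(data: bytes) -> bytes:
    """Reverse mapping, back to the ASCII-based I8 form."""
    return bytes(EBCDIC_TO_I8[b] for b in data)


if __name__ == "__main__":
    print(i8_to_utf_ebcdic(b"A").hex())      # c1
    print(i8_to_utf_ebcdic(b"Hello").hex())  # c885939396
```

Because the mapping works on single bytes and is reversible, multi-byte I8 sequences keep their length after the lookup.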

This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.

Codepage layout

There are 160 characters with single-byte encodings in UTF-EBCDIC (compared to 128 in UTF-8). The single-byte portion resembles IBM-1047 rather than IBM-37, as the location of the square brackets shows: [ and ] are at hex AD and BD respectively, whereas CCSID 37 has them at hex BA and BB.

UTF-EBCDIC

    _0         _1         _2         _3         _4         _5         _6         _7         _8         _9         _A         _B         _C         _D         _E         _F
0_  NUL 0000   SOH 0001   STX 0002   ETX 0003   ST 009C    HT 0009    SSA 0086   DEL 007F   EPA 0097   RI 008D    SS2 008E   VT 000B    FF 000C    CR 000D    SO 000E    SI 000F
1_  DLE 0010   DC1 0011   DC2 0012   DC3 0013   OSC 009D   LF 000A    BS 0008    ESA 0087   CAN 0018   EM 0019    PU2 0092   SS3 008F   FS 001C    GS 001D    RS 001E    US 001F
2_  PAD 0080   HOP 0081   BPH 0082   NBH 0083   IND 0084   NEL 0085   ETB 0017   ESC 001B   HTS 0088   HTJ 0089   VTS 008A   PLD 008B   PLU 008C   ENQ 0005   ACK 0006   BEL 0007
3_  DCS 0090   PU1 0091   SYN 0016   STS 0093   CCH 0094   MW 0095    SPA 0096   EOT 0004   SOS 0098   SGCI 0099  SCI 009A   CSI 009B   DC4 0014   NAK 0015   PM 009E    SUB 001A
4_  SP 0020    +00        +01        +02        +03        +04        +05        +06        +07        +08        +09        . 002E     < 003C     ( 0028     + 002B     | 007C
5_  & 0026     +0A        +0B        +0C        +0D        +0E        +0F        +10        +11        +12        ! 0021     $ 0024     * 002A     ) 0029     ; 003B     ^ 005E
6_  - 002D     / 002F     +13        +14        +15        +16        +17        +18        +19        +1A        +1B        , 002C     % 0025     _ 005F     > 003E     ? 003F
7_  +1C        +1D        +1E        +1F        2!0000     2!0020     2!0040     2!0060     2!0080     ` 0060     : 003A     # 0023     @ 0040     ' 0027     = 003D     " 0022
8_  2:00A0     a 0061     b 0062     c 0063     d 0064     e 0065     f 0066     g 0067     h 0068     i 0069     2:00C0     2:00E0     2:0100     2:0120     2:0140     2:0160
9_  2:0180     j 006A     k 006B     l 006C     m 006D     n 006E     o 006F     p 0070     q 0071     r 0072     2:01A0     2:01C0     2:01E0     2:0200     2:0220     2:0240
A_  2:0260     ~ 007E     s 0073     t 0074     u 0075     v 0076     w 0077     x 0078     y 0079     z 007A     2:0280     2:02A0     2:02C0     [ 005B     2:02E0     2:0300
B_  2:0320     2:0340     2:0360     2:0380     2:03A0     2:03C0     2:03E0     3!0000     3:0400     3:0800     3:0C00     3:1000     3:1400     ] 005D     3:1800     3:1C00
C_  { 007B     A 0041     B 0042     C 0043     D 0044     E 0045     F 0046     G 0047     H 0048     I 0049     3:2000     3:2400     3:2800     3:2C00     3:3000     3:3400
D_  } 007D     J 004A     K 004B     L 004C     M 004D     N 004E     O 004F     P 0050     Q 0051     R 0052     3:3800     3:3C00     4:4000     4:8000     4:10000    4:18000
E_  \ 005C     4:20000    S 0053     T 0054     U 0055     V 0056     W 0057     X 0058     Y 0059     Z 005A     4:28000    4:30000    4:38000    5:40000    5:100000   --
F_  0 0030     1 0031     2 0032     3 0033     4 0034     5 0035     6 0036     7 0037     8 0038     9 0039     --         --         --         --         --         APC 009F

   Cells containing a character or control-code abbreviation followed by a hexadecimal number are single-byte encodings; the number is the Unicode code point. Cells shown as -- are undefined.

   Cells of the form n:XXXX are the start bytes for a sequence of n bytes. The hexadecimal number XXXX is the lowest code point encoded using that start byte. This value can be greater than the value which would be obtained by following the start byte with continuation bytes which are all 0x41 (decimal 65), if this would result in an invalid overlong form.

   Cells of the form +XX are continuation bytes. The hexadecimal number shown after the plus sign is the value of the 5 bits they add.

   Cells of the form n!XXXX are start bytes (for a sequence of n bytes) which can never appear in properly encoded UTF-EBCDIC text, because any possible continuation would result in an invalid overlong form. For example, 0x76 is marked this way because even 0x76 0x73 (which maps to the UTF-8-Mod sequence 0xC2 0xBF) would merely be an overlong encoding of U+005F (properly encoded as UTF-8-Mod 0x5F, UTF-EBCDIC 0x6D).
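The start-byte rules above can be checked with a short calculation on the UTF-8-Mod (I8) lead byte, that is, the value before the final lookup table is applied. The sketch below is illustrative only: the function name start_byte_info is hypothetical, and the lead-byte ranges (0xC0 to 0xDF, 0xE0 to 0xEF, 0xF0 to 0xF7, 0xF8 to 0xF9) are assumed from the number of start bytes of each length in the table.

```python
def start_byte_info(lead: int):
    """Return (sequence length, lowest valid code point) for an I8 start byte,
    or None if every sequence beginning with it would be overlong."""
    # (first lead, last lead, sequence length, smallest code point of that length)
    ranges = [
        (0xC0, 0xDF, 2, 0xA0),
        (0xE0, 0xEF, 3, 0x400),
        (0xF0, 0xF7, 4, 0x4000),
        (0xF8, 0xF9, 5, 0x40000),
    ]
    for first, last, length, minimum in ranges:
        if first <= lead <= last:
            shift = 5 * (length - 1)          # 5 payload bits per continuation byte
            lowest = (lead - first) << shift  # all continuation bits zero
            highest = lowest | ((1 << shift) - 1)
            if highest < minimum:
                return None                   # shorter forms cover the whole range
            return length, max(lowest, minimum)
    raise ValueError(f"{lead:#x} is not an I8 start byte")


if __name__ == "__main__":
    print(start_byte_info(0xC2))  # None
    print(start_byte_info(0xC5))  # (2, 160)
    print(start_byte_info(0xF0))  # (4, 16384)
```

I8 0xC2 corresponds to UTF-EBCDIC 0x76, the example given above: its highest reachable code point is U+005F, which already has a one-byte form. I8 0xF0 shows the other case, where the lowest valid code point (U+4000) is higher than the value obtained with all-zero continuation bits.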

Oracle UTFE

Oracle UTFE is an Oracle database character set for Unicode 3.0. It is similar to the CESU-8 variant of UTF-8 in that supplementary characters are encoded as two 4-byte characters (one for each surrogate) rather than as a single 4- or 5-byte character. It is used only on EBCDIC platforms.[1]

Advantages:

  • It is the only Unicode character set for EBCDIC platforms.
  • The length of SQL CHAR types can be specified in number of characters.
  • The binary order of SQL CHAR columns is the same as the binary order of SQL NCHAR columns if the data consists of the same supplementary characters. Consequently, these columns sort the same for identical strings.[1]

Disadvantages:

  • Supplementary characters occupy eight bytes (two 4-byte surrogate encodings) instead of a single 4- or 5-byte sequence. Consequently, supplementary characters need to be converted.
  • UTFE is not a Unicode standard encoding. Clients requiring UTF-8 encoding must convert data on retrieval and storage.[1]
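The size difference can be illustrated by counting bytes. The sketch below takes the description at the start of this section at face value (each surrogate of a pair encoded separately, as in CESU-8) and uses the UTF-EBCDIC length ranges from the code page layout. It is not based on Oracle's implementation, and the function names are made up for the example.

```python
def utf_ebcdic_length(cp: int) -> int:
    """Bytes UTF-EBCDIC needs for one code point (or one lone surrogate)."""
    for limit, length in [(0x9F, 1), (0x3FF, 2), (0x3FFF, 3), (0x3FFFF, 4)]:
        if cp <= limit:
            return length
    return 5


def surrogate_pair(cp: int):
    """Split a supplementary code point into its UTF-16 surrogates."""
    offset = cp - 0x10000
    return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)


def utfe_style_length(cp: int) -> int:
    """Bytes needed when supplementary characters are stored as two
    separately encoded surrogates (assumed behaviour, CESU-8 style)."""
    if cp < 0x10000:
        return utf_ebcdic_length(cp)
    high, low = surrogate_pair(cp)
    return utf_ebcdic_length(high) + utf_ebcdic_length(low)


if __name__ == "__main__":
    for cp in (0x41, 0x4E2D, 0x1F600, 0x10FFFF):
        print(f"U+{cp:04X}: {utf_ebcdic_length(cp)} vs {utfe_style_length(cp)}")
```

Under these assumptions every supplementary character takes 8 bytes in the surrogate form, compared with 4 bytes (up to U+3FFFF) or 5 bytes (above that) in standard UTF-EBCDIC.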

References

  1. Baird, Cathy; Chiba, Dan; Chu, Winson; Fan, Jessica; Ho, Claire; Law, Simon; Lee, Geoff; Linsley, Peter; Matsuda, Keni; Oscroft, Tamzin; Takeda, Shige; Tanaka, Linus; Tozawa, Makoto; Trute, Barry; Tsujimoto, Mayumi; Wu, Ying; Yau, Michael; Yu, Tim; Wang, Chao; Wong, Simon; Zhang, Weiran; Zheng, Lei; Zhu, Yan; Moore, Valarie (2002) [1996]. "Appendix A: Locale Data". Oracle9i Database Globalization Support Guide (PDF) (Release 2 (9.2) ed.). Oracle Corporation. Oracle A96529-01. Archived (PDF) from the original on 2017-02-14. Retrieved 2017-02-14.
