Code page

In computing, a code page is a table of values that describes the character set used for encoding a particular set of glyphs, usually combined with a number of control characters. The term "code page" originated from IBM's EBCDIC-based mainframe systems,[1] but many vendors use this term, including Microsoft, SAP,[2] and Oracle Corporation.[3] Vendors often allocate their own code page number to a character encoding, even if it is better known by another name (for example, the UTF-8 character encoding has code page number 1208 at IBM, 65001 at Microsoft, and 4110 at SAP). The multitude of code page assignments leads many vendors to recommend Unicode.

The code page numbering system

IBM introduced the concept of systematically assigning a small, but globally unique, 16-bit number to each character encoding that a computer system or collection of computer systems might encounter. The IBM origin of the numbering scheme is reflected in the fact that the smallest (first) numbers are assigned to variations of IBM's EBCDIC encoding and slightly larger numbers refer to variations of IBM's extended ASCII encoding as used in its PC hardware.

With the release of PC DOS version 3.3 (and the near-identical MS-DOS 3.3), IBM introduced the code page numbering system to regular PC users, as the code page numbers (and the phrase "code page") were used in new commands that allowed the character encoding used by all parts of the OS to be set in a systematic way.[4]

After IBM and Microsoft ceased to cooperate in the 1990s, the two companies maintained their lists of assigned code page numbers independently of each other, resulting in some conflicting assignments. At least one third-party vendor (Oracle) also has its own different list of numeric assignments.[3] IBM's current assignments are listed in its CCSID repository, while Microsoft's assignments are documented on MSDN.[5] Additionally, a list of the names and approximate IANA (Internet Assigned Numbers Authority) abbreviations for the installed code pages on any given Windows machine can be found in the Registry on that machine (this information is used by Microsoft programs such as Internet Explorer).

Most well-known code pages, excluding those for the CJK languages and Vietnamese, fit all their code-points into eight bits and do not involve anything more than mapping each code-point to a single character; furthermore, techniques such as combining characters, complex scripts, etc., are not involved.
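The one-byte-to-one-character mapping described above can be illustrated with a minimal sketch. It assumes Python's built-in "cp437" and "cp1252" codecs as stand-ins for the corresponding code pages; the variable names are illustrative only.

```python
# Decode the same byte under two single-byte code pages.
raw = bytes([0xE9])

as_cp437 = raw.decode("cp437")    # original IBM PC code page
as_cp1252 = raw.decode("cp1252")  # Windows Western European code page

# The byte is identical; only the table used to interpret it differs.
print(repr(as_cp437), repr(as_cp1252))

# In a single-byte code page, 256 bytes decode to 256 characters:
# no multi-byte sequences and no combining characters are involved.
all_bytes = bytes(range(256))
table_437 = all_bytes.decode("cp437")
print(len(table_437))
```

The same byte value yields a different character under each table, which is why the code page must be known before bytes can be interpreted as text.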

The text mode of standard (VGA-compatible) PC graphics hardware is built around using an 8-bit code page, though it is possible to use two at once with some color depth sacrifice, and up to eight may be stored in the display adaptor for easy switching.[6] There was a selection of third-party code page fonts that could be loaded into such hardware. However, it is now commonplace for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this hardware limitation entirely. Even so, the system of referring to character encodings by a code page number remains applicable, as an efficient alternative to string identifiers such as those specified by the IETF and IANA for use in various protocols such as e-mail and web pages.

Relationship to ASCII

The vast majority of code pages in current use are supersets of ASCII, a 7-bit code representing 128 control codes and printable characters. In the distant past, 8-bit implementations of the ASCII code set the top bit to zero or used it as a parity bit in network data transmissions. When the top bit was made available for representing character data, a total of 256 characters and control codes could be represented. Most vendors (including IBM) used this extended range to encode characters used by various languages and graphical elements that allowed the imitation of primitive graphics on text-only output devices. No formal standard existed for these ‘extended character sets’ and vendors referred to the variants as code pages, as IBM had always done for variants of EBCDIC encodings.
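Whether a given code page is a superset of ASCII can be checked mechanically. The following is a rough sketch, assuming Python's built-in codecs represent the code pages faithfully; the helper name is_ascii_superset is hypothetical and not part of any library.

```python
ascii_bytes = bytes(range(128))
reference = ascii_bytes.decode("ascii")


def is_ascii_superset(codec_name: str) -> bool:
    """Return True if bytes 0x00-0x7F decode exactly as they do in ASCII."""
    try:
        return ascii_bytes.decode(codec_name) == reference
    except UnicodeDecodeError:
        return False


# cp037 is an EBCDIC code page, so it is expected to differ from ASCII.
for name in ("cp437", "cp850", "cp1252", "latin-1", "cp037"):
    print(name, is_ascii_superset(name))

# In EBCDIC the letter "A" is not stored as 0x41.
ebcdic_a = "A".encode("cp037")
print(ebcdic_a.hex())
```

Only the lower 128 positions are compared; the upper half is where the code pages diverge from one another.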

Relationship to Unicode

Unicode is an effort to include all characters from previous code pages into a single character enumeration that can be used with a number of encoding schemes. In the process, duplicate characters are eliminated and new variants are introduced, like fullwidth ASCII. Consistent use of any single Unicode encoding would theoretically eliminate the need to keep track of different code pages or character encodings. That need persists, however, because multiple encodings of Unicode exist and because compatibility must be maintained with existing documents and systems that use the older encodings. In practice the various Unicode character set encodings have simply been assigned their own code page numbers, and all the other code pages have been technically redefined as encodings for various subsets of Unicode.

Code pages

EBCDIC-based code pages

  • 37/1140 - USA/Canada - CECP
  • 256 - International #1
  • 259 - Symbols, Set 7
  • 273/1141 - Germany F.R./Austria - CECP
  • 274 - Old Belgium Code Page
  • 275 - Brazil - CECP
  • 276 - Canada (French) - 94
  • 277 - Denmark, Norway - CECP
  • 278 - Finland, Sweden - CECP
  • 280 - Italy - CECP
  • 281 - Japan (Latin) - CECP
  • 282 - Portugal - CECP
  • 284 - Spain/Latin America - CECP
  • 285/1146 - United Kingdom - CECP
  • 500/1148 - International ECECP
  • 892 - EBCDIC, OCR A
  • 893 - EBCDIC, OCR B
  • 1047 - Latin 1/Open System
  • 1137 - Devanagari EBCDIC
  • 1140 - USA, Canada, etc. ECECP
  • 1141 - Austria, Germany ECECP
  • 1142 - Denmark, Norway ECECP
  • 1143 - Finland, Sweden ECECP
  • 1144 - Italy ECECP
  • 1145 - Spain, Latin America (Spanish) ECECP
  • 1146 - UK ECECP
  • 1147 - France ECECP with euro
  • 1149 - Icelandic ECECP with euro
  • 1153 - EBCDIC Latin 2 Multilingual with euro
  • 1154 - EBCDIC Cyrillic, Multilingual with euro
  • 1155 - EBCDIC Turkey with euro
  • 1156 - EBCDIC Baltic Multi with euro
  • 1157 - EBCDIC Estonia with euro
  • 1158 - EBCDIC Cyrillic, Ukraine with euro
  • 1159 - T-Chinese EBCDIC

ISO/IEC 646-related code pages

Other 7-bit code pages:

  • 895 – 7-bit Japan Latin
  • 896 – 7-bit Japan Katakana Extended
  • 1101 – 7-bit British NRC Set
  • 1102 – 7-bit Dutch NRC Set
  • 1103 – 7-bit Finnish NRC Set
  • 1104 – 7-bit French NRC Set
  • 1105 – 7-bit Norwegian/Danish NRC Set
  • 1106 – 7-bit Swedish NRC Set
  • 1107 – 7-bit Norwegian/Danish NRC Alternate

ISO/IEC 8859-related code pages

ISO/IEC 8859 related 8-bit code pages:

IBM PC / DOS (OEM) code pages

These code pages were originally embedded directly in the text mode hardware of the graphic adapters used with the IBM PC and its clones, including the original MDA and CGA adapters whose character sets could only be changed by physically replacing a ROM chip that contained the font. The interface of those adapters (emulated by all later adapters such as VGA) was typically limited to single byte character sets with only 256 characters in each font/encoding (although VGA added partial support for slightly larger character sets). Since the original IBM PC code page (number 437) was not really designed for international use, several partially compatible country or region specific variants emerged. Microsoft refers to these as the OEM code pages because they were defined by the OEMs who licensed MS-DOS for distribution with their hardware, not by Microsoft or a standards organization. Examples include:

When dealing with older hardware, protocols and file formats, it is often necessary to support these code pages, but newer encoding systems, in particular Unicode, are encouraged for new designs.

Code page 819 is identical to Latin-1 (ISO/IEC 8859-1) and, with slightly modified commands, permits MS-DOS machines to use that encoding. It was used with IBM AS/400 minicomputers.

Windows (ANSI) code pages

Microsoft defined a number of code pages known as the ANSI code pages (as the first one, 1252, was based on an apocryphal ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes used in ISO 8859-1. Some of the others are based in part on other parts of ISO 8859 but often rearranged to make them closer to 1252.
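The difference in the 0x80-0x9F range can be inspected directly. This is a minimal sketch that assumes Python's "cp1252" and "latin-1" codecs stand in for code page 1252 and ISO 8859-1; note that Python's cp1252 codec treats a few positions in this range as undefined, which the code handles explicitly. The function name compare_c1_range is illustrative only.

```python
def compare_c1_range():
    """Yield (byte, latin-1 char, cp1252 char or None) for 0x80-0x9F."""
    for value in range(0x80, 0xA0):
        raw = bytes([value])
        latin1_char = raw.decode("latin-1")  # a C1 control code
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252_char = None  # position left undefined by Python's codec
        yield value, latin1_char, cp1252_char


rows = list(compare_c1_range())
for value, latin1_char, cp1252_char in rows:
    print(f"0x{value:02X}", repr(latin1_char), repr(cp1252_char))

# Outside 0x80-0x9F the two code pages agree.
upper = bytes(range(0xA0, 0x100))
same_above_a0 = upper.decode("cp1252") == upper.decode("latin-1")
```

For example, byte 0x80 is a control code in ISO 8859-1 but the euro sign in code page 1252.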

Microsoft recommends new applications use UTF-8 or UCS-2/UTF-16 instead of these code pages.[18]

DBCS code pages

These code pages represent DBCS character encodings for various CJK languages. In Microsoft operating systems, these are used as both the "OEM" and "ANSI" code page for the applicable locale.

Apple related code pages

Various other Microsoft code pages

The following code page numbers are specific to Microsoft Windows. IBM may use different numbers for these code pages.

ISO/IEC 10646 / Unicode code pages

Unicode encodings are generally recommended for modern applications.

  • 1200 – UTF-16LE Unicode (little-endian)
  • 1201 – UTF-16BE Unicode (big-endian)
  • 1400 – ISO 10646 UCS-BMP (Based on Unicode 6.0)
  • 1401 – ISO 10646 UCS-SMP (Based on Unicode 6.0)
  • 1402 – ISO 10646 UCS-SIP (Based on Unicode 6.0)
  • 1414 – ISO 10646 UCS-SSP (Based on Unicode 4.0)
  • 1445 – IBM AFP PUA No. 1
  • 1446 – ISO 10646 UCS-PUP15 (Based on Unicode 4.0)
  • 1447 – ISO 10646 UCS-PUP16 (Based on Unicode 4.0)
  • 1448 – UCS-BMP (Generic UDC)
  • 1449 – IBM default PUA
  • 65000 – UTF-7 Unicode
  • 65001 – UTF-8 Unicode
  • 65520 – Empty Unicode Plane
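The way one character is stored differently under several of the Unicode code pages listed above can be sketched as follows. The lookup table and the function encode_for_code_page are hypothetical helpers written for this illustration; they map four of the code page numbers above to the names of Python's built-in codecs.

```python
# Hypothetical lookup table: code page number -> Python codec name.
UNICODE_CODE_PAGES = {
    1200: "utf-16-le",
    1201: "utf-16-be",
    65000: "utf-7",
    65001: "utf-8",
}


def encode_for_code_page(text: str, code_page: int) -> bytes:
    """Encode text with the Unicode encoding that a code page number names."""
    return text.encode(UNICODE_CODE_PAGES[code_page])


sample = "é"  # U+00E9
for number in sorted(UNICODE_CODE_PAGES):
    print(number, encode_for_code_page(sample, number).hex())
```

All four byte sequences represent the same code point, which is why the number identifying the encoding still has to accompany the data.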

Miscellaneous

List of code page assignments

List of known code page assignments (incomplete):

ID | Names | Description | Origin | Platform | DOS | OS/2 | Windows | Mac | Else | Encoding | Comment
0 | N/A | Reserved | IBM, Microsoft | N/A | 3.3+ | 1.0+ | ? | ? | ? | | Internal OS use[14]
437 | CP437, IBM437 | PC US | IBM[19] | IBM PC | 3.3+ | 1.0+ | Yes | ? | Yes | 8-bit SBCS |
57344 - 61439 | N/A | Private use derivations | IBM | N/A | N/A | N/A | N/A | N/A | N/A | various | Private use code page derivations (E000h-EFFFh)
65280 - 65533 | N/A | Private use definitions | IBM | N/A | N/A | N/A | N/A | N/A | N/A | various | Private use code page definitions (FF00h-FFFDh)
65534 | N/A | Reserved | IBM, Microsoft | N/A | ? | ? | ? | ? | ? | various | Internal OS use (FFFEh)
65535 | N/A | Reserved | IBM, Microsoft | N/A | 3.3+ | 1.0+ | ? | ? | ? | various | Internal OS use (FFFFh)[14]

Criticism

Many older character encodings (unlike Unicode) suffer from several problems. Some code page vendors insufficiently document the meaning of all code point values, which decreases the reliability of handling textual data consistently across various computer systems. Some vendors add proprietary extensions to some code pages to add or change certain code point values; for example, byte 0x5C in Shift JIS can represent either a backslash or a yen currency symbol depending on the platform. Finally, in order to support several languages in a program that does not use Unicode, the code page used for each string/document needs to be stored.
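The last point, that the code page must travel with each string, can be sketched as follows. The TaggedText class is a hypothetical illustration rather than an established API, and it assumes Python codec names are used as the code page label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedText:
    """Raw bytes paired with the Python codec name of their code page."""
    data: bytes
    codec: str

    def text(self) -> str:
        return self.data.decode(self.codec)


greeting = TaggedText("Grüße".encode("cp1252"), "cp1252")
price = TaggedText("Цена".encode("cp866"), "cp866")

# With the label stored alongside the bytes, each string decodes correctly.
print(greeting.text(), price.text())

# Without the label, a wrong guess decodes without error but yields
# different characters.
misread = greeting.data.decode("cp437")
print(misread)
```

Because single-byte code pages accept almost any byte sequence, a wrong guess usually produces no error, only incorrect text.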

Applications may also mislabel text in Windows-1252 as ISO-8859-1. Fortunately, the only difference between these code pages is that the code point values used by ISO-8859-1 for control characters are instead used as additional printable characters in Windows-1252. Since control characters have no function in HTML, web browsers tend to use Windows-1252 rather than ISO-8859-1. In HTML5, treating ISO-8859-1 as Windows-1252 is even codified as standard.

Due to Unicode's extensive documentation, vast repertoire of characters and character stability policy, the problems listed above are rarely a concern for Unicode. UTF-8 has since overtaken both Windows-1252 and ISO-8859-1 in popularity on the Internet.[20][21]

Private code pages

When, early in the history of personal computers, users didn't find their character encoding requirements met, private or local code pages were created using Terminate and Stay Resident utilities or by re-programming BIOS EPROMs. In some cases, unofficial code page numbers were invented (e.g., CP895).

When more diverse character set support became available, most of those code pages fell into disuse, with some exceptions such as the Kamenický or KEYBCS2 encoding for the Czech and Slovak alphabets. Another such character set is the Iran System encoding standard, which was created by the Iran System corporation for Persian language support. This standard was used in Iran in DOS-based programs and became obsolete after the introduction of Microsoft code page 1256. However, some Windows and DOS programs using this encoding are still in use, and some Windows fonts with this encoding exist.

In order to overcome such problems, the IBM Character Data Representation Architecture level 2 specifically reserves ranges of code page IDs for user-definable and private-use assignments. Whenever such code page IDs are used, the user must not assume that the same functionality and appearance can be reproduced in another system configuration or on another device or system unless the user takes care of this specifically. The code page range 57344-61439 (E000h-EFFFh) is officially reserved for user-definable code pages (or actually CCSIDs in the context of IBM CDRA), whereas the range 65280-65533 (FF00h-FFFDh) is reserved for any user-definable "private use" assignments. For example, a non-registered custom variant of code page 437 (1B5h) or 28591 (6FAFh) could become 57781 (E1B5h) or 61359 (EFAFh), respectively, in order to avoid potential conflicts with other assignments and to preserve whatever internal numerical logic exists in the assignments of the original code pages. An unregistered private code page not based on an existing code page, a device-specific code page like a printer font, which just needs a logical handle to become addressable for the system, a frequently changing download font, or a code page number with a symbolic meaning in the local environment could have an assignment in the private range like 65280 (FF00h).

The code page IDs 0, 65534 (FFFEh) and 65535 (FFFFh) are reserved for internal use by operating systems such as DOS and must not be assigned to any specific code pages.
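The ranges described above can be expressed as a small classifier. This is a sketch only: the function names are hypothetical, and the derivation rule (keep the low 12 bits and force the top hexadecimal digit to E) is an assumption inferred from the two examples in the text, not a rule quoted from IBM CDRA.

```python
RESERVED_IDS = {0, 0xFFFE, 0xFFFF}


def classify_code_page_id(cp_id: int) -> str:
    """Classify a 16-bit code page ID according to the ranges in the text."""
    if not 0 <= cp_id <= 0xFFFF:
        raise ValueError("code page IDs are 16-bit numbers")
    if cp_id in RESERVED_IDS:
        return "reserved"
    if 0xE000 <= cp_id <= 0xEFFF:
        return "user-definable"
    if 0xFF00 <= cp_id <= 0xFFFD:
        return "private use"
    return "regular"


def derive_private_id(base_id: int) -> int:
    """Assumed rule: keep the low 12 bits, set the top hex digit to E."""
    return 0xE000 | (base_id & 0x0FFF)


for base in (437, 28591):
    derived = derive_private_id(base)
    print(base, derived, hex(derived), classify_code_page_id(derived))
```

Under this assumed rule, two registered IDs that share their low 12 bits would map to the same derived ID, so the result is a convention for choosing a number rather than a guarantee of uniqueness.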

References

  1. IBM i Globalization - EBCDIC Code Pages
  2. "Code Page". sap.com.
  3. "Glossary". oracle.com.
  4. The MS-DOS Encyclopaedia, Microsoft press (1988, ISBN 1-55615-049-0, ISBN 978-1-55615-049-4)
  5. "Code Page Identifiers". microsoft.com. Microsoft.
  6. "VGA/SVGA Video Programming--VGA Text Mode Operation". osdever.net.
  7. "Code Page Identifiers". Microsoft Developer Network. Microsoft. 2014. Archived from the original on 2016-06-19. Retrieved 2016-06-19.
  8. "Web Encodings - Internet Explorer - Encodings". WHATWG Wiki. 2012-10-23. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
  9. Foller, Antonin (2014) [2011]. "Western European (IA5) encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
  10. Foller, Antonin (2014) [2011]. "German (IA5) encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
  11. Foller, Antonin (2014) [2011]. "Swedish (IA5) encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
  12. Foller, Antonin (2014) [2011]. "Norwegian (IA5) encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
  13. Foller, Antonin (2014) [2011]. "US-ASCII encoding - Windows charsets". WUtils.com - Online web utility and help. Motobit Software. Archived from the original on 2016-06-20. Retrieved 2016-06-20.
  14. Paul, Matthias (2002-09-05), Technical info on undocumented DOS country info for LCASE, ARAMODE and CCTORC records, FreeDOS development list fd-dev at Topica, archived from the original on 2016-05-26, retrieved 2016-05-26
  15. Brown, Ralf (2002-12-29). "The x86 Interrupt List". Retrieved 2011-10-14.
  16. Paul, Matthias (1997-07-30). NWDOS-TIPs — Tips & Tricks rund um Novell DOS 7, mit Blick auf undokumentierte Details, Bugs und Workarounds. MPDOSTIP (e-book) (in German) (edition 3, release 157 ed.). Archived from the original on 2016-05-22. Retrieved 2012-01-11. NWDOSTIP.TXT is a comprehensive work on Novell DOS 7 and OpenDOS 7.01, including the description of many undocumented features and internals. It is part of the author's yet larger MPDOSTIP.ZIP collection maintained up to 2001 and distributed on many sites at the time. The provided link points to a HTML-converted older version of the NWDOSTIP.TXT file.
  17. Paul, Matthias (2001-04-09). NWDOS-TIPs — Tips & Tricks rund um Novell DOS 7, mit Blick auf undokumentierte Details, Bugs und Workarounds. MPDOSTIP (e-book) (in German) (edition 3, release 183 ed.).
  18. "Code Pages". microsoft.com. Microsoft.
  19. IBM. "SBCS code page information document - CPGID 00437". Retrieved 2014-07-04.
  20. "Usage Statistics of Character Encodings for Websites, (updated daily)". w3techs.com. Retrieved 6 August 2015.
  21. "UTF-8 Usage Statistics". trends.builtwith.com. Retrieved 28 March 2011.