Template talk:Character encodings

WikiProject Computing (Rated Template-class)
This template is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This template does not require a rating on the project's quality scale.
 

Encoding vs. TES

HZ is a Transfer Encoding Syntax (TES; see UTR17) for GB2312, not a character encoding proper, nor is it a national standard. If it is kept in this template at all, it should go in the misc section.

Similarly, UTF-7 is also a TES, not a UTF (despite the name). So I was thinking of removing UTF-7 from this template. It's included in the "Table Unicode" template, and I think that is enough.

/keka (talk) 08:40, 21 July 2009 (UTC)
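
For illustration, here is a minimal sketch of the point above, assuming Python's built-in "gb2312" (EUC-CN), "hz" and "utf-7" codecs as stand-ins for the encodings under discussion. It suggests that HZ carries the same GB2312 code points as EUC-CN inside 7-bit shift sequences, and that UTF-7 output is likewise restricted to 7-bit bytes. The variable names are only for this example.

```python
# Illustrative only: Python's built-in codecs stand in for the encodings
# discussed above; variable names are hypothetical.
text = "中文"

euc = text.encode("gb2312")  # EUC-CN form: each GB2312 byte has the high bit set
hz = text.encode("hz")       # HZ form: same code points wrapped in ~{ ... ~}

# Strip the two-byte HZ shift sequences and compare with the EUC-CN bytes
# after clearing the high bit: the underlying code points are identical.
hz_payload = hz[2:-2]
same_code_points = hz_payload == bytes(b & 0x7F for b in euc)

# UTF-7 re-expresses the text using 7-bit-safe bytes only.
utf7 = "é".encode("utf-7")
utf7_is_7bit = all(b < 0x80 for b in utf7)

print(hz, same_code_points)
print(utf7, utf7_is_7bit)
```

In both cases the repertoire and code points are unchanged; only the byte-level wrapping differs, which is what the TES label describes.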

Grouping

I've tried to group certain encodings in a "logical" way. For instance, even though the GOST standard is/was a national standard, it covers 4-, 5-, and 6-bit character encodings, which are not used in modern computers, so it is among the "misc" items. Likewise, HKSCS is next to Big5 and CP950 since they are so closely related, and so on.

keka (talk) 08:59, 25 July 2009 (UTC)

The Big5-HKSCS encoding is not really supported by Windows. Windows 950 should not be considered HKSCS compatible by default. Windows Vista only supports the Unicode characters of Big5-HKSCS. Microsoft HKSCS —Preceding unsigned comment added by 69.110.13.196 (talk) 04:57, 26 July 2009 (UTC)
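
As a rough way to see the gap between the two repertoires, the sketch below compares Python's "cp950" and "big5hkscs" codec tables. This is an assumption-laden stand-in: Python's tables are not Windows itself and may differ from what any given Windows version does. The helper `encodable` and the list `hkscs_only` are hypothetical names for this example.

```python
def encodable(ch: str, codec: str) -> bool:
    """Return True if `ch` can be encoded with the named Python codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Scan the CJK Unified Ideographs block for characters that Python's
# "big5hkscs" table covers but its "cp950" table does not.
hkscs_only = [
    chr(cp)
    for cp in range(0x4E00, 0xA000)
    if encodable(chr(cp), "big5hkscs") and not encodable(chr(cp), "cp950")
]

print(len(hkscs_only), "ideographs are in big5hkscs but not in cp950")
```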

newline

UTF-8, read that article please. Newline is not a "single character" (like horizontal tabulation, backspace, etc.); it is a set of encoding problems related to line separation. Incnis Mrsi (talk) 09:06, 15 March 2010 (UTC)
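
A small sketch of why newline is hard to treat as one character: the same three lines can be separated by several different character sequences, depending on the platform or standard. The sample labels are only for this example; `str.splitlines()` is Python's built-in method, which recognises all of these boundaries.

```python
# The same three "lines" separated by different newline conventions.
samples = {
    "LF (Unix)": "a\nb\nc",
    "CR LF (DOS/Windows)": "a\r\nb\r\nc",
    "CR (classic Mac OS)": "a\rb\rc",
    "NEL (U+0085)": "a\u0085b\u0085c",
    "LINE SEPARATOR (U+2028)": "a\u2028b\u2028c",
}

for name, text in samples.items():
    # Show the UTF-8 bytes of each variant and the lines Python recovers.
    print(f"{name:26} {text.encode('utf-8')!r:28} -> {text.splitlines()}")
```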

Missing codepages

I notice that a few code pages are missing, namely the following:

Code page 708 (Arabic ASMO);
Code page 851 (Greek III);
Code page 853 (Latin III);
Code page 868 (IBM Persian);
Code page 934 (MS-DOS Korean);
Code page 938 (MS-DOS Taiwanese);
Code page 999 (Yugoslavian ASCII-7).

I have the Korean edition of MS-DOS 6.2, which uses code page 934. Both it and code page 938 are also referenced in the MS-DOS 6.22 COUNTRY.TXT file.

MS-DOS code page 999 seems to be the code page version of the Yugoslavian ASCII-7 character set, commonly used, especially in Croatia and Slovenia, before the advent of code page 852. One notable user of it is the software of the Slovenian programming company SAOP.

Code page 708 is referenced in Windows. As for 851, 853, and 868, I've seen specifications of them on Google. - 94.140.73.150 (talk) 16:15, 22 August 2010 (UTC)

1259, 1260, 1262-1269

What are these Windows Codepages? What is CP0028?

Proposed changes

The design of this template is getting more and more complete, but a few things could be done to make it clearer. Here are some suggestions:

  1. Make a clear distinction between what are “Character encoding methods”, “Character sets” and “Code pages”.
  2. The term “Code page” is used mainly by IBM and Microsoft; very few other manufacturers or organizations use it. The so-called “Miscellaneous code pages” are not code pages. Perhaps a better name would be “Miscellaneous character sets”.
  3. EUC, ISO/IEC 2022 and HZ are not character sets. They are encoding methods (schemes) which are used to encode character sets, namely JIS, KSX, GB and CNS character sets.
  4. The same goes for all UTF, which are encoding schemes to encode the ISO 10646 character set.
  5. The left column is already arranged according to several platforms. That could be expanded and some character sets included in the “Platform specific” section could be moved to the “right” place:
    1. Adobe: Adobe Standard, Adobe Latin 1, Adobe Symbols, etc.
    2. DEC: DEC Multinational, DEC Turkish, DEC Greek, DEC Cyrillic, DEC Hebrew, DEC/8/ASMO, DEC Technical, DEC Kanji, DEC Korean, DEC Hanzi, DEC Hanyu, etc.
    3. Data General: Data General International, Data General Turkish, Data General Arabic, Data General Kana, Data General Symbols, etc.
    4. Hewlett-Packard: HP Roman-8, HP Turkish-8, HP East-8, HP Greek-8, HP Cyrillic-8, HP Hebrew-8, HP Arabic-8, HP Thai-8, HP Japan-15, HP Korea-15, HP PRC-15, HP ROC-15, HP Math-8, etc.
    5. LaTeX: T1 (Cork Encoding), T2A, T2B, T2C, T3, T4, T5, etc.
    6. ISO: ISO is not a platform in itself, but some platforms (for instance, UNIX) are designed to work following the ISO standards. Also, many character sets, not specific to any platform, are designed following the ISO standards. For the sake of convenience, perhaps we could consider ISO as a “platform”.
  6. “Acorn” is not a character set but rather a manufacturer (as are IBM or Apple). Perhaps, a better name would be “RISC OS character set”.
  7. Is it worthwhile to have an entry called “National standards”? Of course, some Governments or some Official National Bodies have defined their national standards. But, after that, the manufacturers or organizations have implemented them or some variations of them. And in some cases it was the opposite, some Governments or some Official National Bodies have adopted existing standards as their national standard. But that list, as it is, is a mixed bag and rather incomplete. Here is what I have found out so far:
Country 7-bit standard 8-bit standard Multibyte standard 16-bit standard Notes
Arab countries ASMO 449 ASMO 708
Armenia AST 34.005:1997 AST 34.002:1997 Commonly called ArmSCII
AST 34.002:1997 defines two variants: ArmSCII-8 for ISO environment; ArmSCII-8a for DOS and Macintosh environment.
Bangladesh BSD 1520:1995
BSD 1520:2000
BSD 1520:2011
BSD 1520:1995 was not approved;
BSD 1520:2011 is the same as the Bengali (Unicode block) but assigned to the upper part of an 8-bit character set;
commonly called BSCII.
Brazil NBR-9614:1986
NBR-9614:1991
Commonly called BraSCII.
Canada CSA Z243.4 - 1985 alt.1-1
CSA Z243.4 - 1985 alt.1-2
ISO 646-CA.
China GB 1988 - 1980 GB 2312-80
GB 18030-2000
GB 18030-2005
GB 1988 - 1980 = ISO 646-CN.
Croatia HRN I.B1.013:1988
Cuba NC 99-10 - 1981 ISO 646-CU.
Czechoslovakia ČSN 36 91 03 Nearly identical to ISO Latin-2.
Denmark DS 2089-1974 Not an official part of ISO 646 series.
Estonia EVS 8:1993 EVS 8:1993 has defined 3 “tables”:
table 3.1 for ISO environment;
table 3.2 for EBCDIC;
table 3.3 for DOS.
Finland SFS 4017 ISO 646-FI;
identical to Swedish Standard SEN 850200 b.
France NF Z 62-010 - 1973
NF Z 62-010 - 1982
ISO 646-FR.
Georgia SSP 18.1:1998 Commonly known as Geostd8;
the more popular GeoSCII is not the national standard.
Federal Republic of Germany DIN 66003 ISO 646-DE.
Greece ELOT 927 ELOT 928
Hungary MSZ 7795-3 ISO 646-HU.
India IS 13194:1991 IS 13194:1991 IS 13194:1991 defines several character sets:
EA-ISCII for 7-bit environment
ISCII for ISO environment
PC-ISCII for DOS
International ISO 646-1973 IRV ISO 10646
Iran ISIRI 2900 ISIRI 3342 ISIRI 2900 is glyph-based;
ISIRI 3342 is character-based.
Ireland IS 433 - 1996 Not an official part of ISO 646 series.
Israel SI 960 SI 1311:1988
SI 1311:1998
SI 1311:2002
The International Register number went on changing (IR 138 >> IR 198 >> IR 234) as the Standards Institute of Israel went on updating the character set, but ISO kept the name as ISO 8859-8.
Italy UNI 0204 - 1970 ISO 646-IT.
Japan JIS C 6220-1969
JIS C 6220-1976
JIS C 6226-1978
JIS C 6226-1983
JIS X 0208:1990
JIS X 0212:1990
JIS X 0213:2000
JIS X 0213:2004
JIS C 6220 (Roman version, not Katakana version) = ISO 646-JP.
Kazakhstan ST RK 920:91
ST RK 1048:2002
ST RK 920:91 is for DOS;
ST RK 1048:2002 is for Windows.
North Korea KPS 9566-97
South Korea KS C 5636
KS X 1003 - 1989
KS C 5601-1987
KS C 5601-1992
KS C 5636 is not an official part of ISO 646 series.
Latvia RST 1040-90
LVS 8-92
RST 1040-90 is commonly known as Code Page 866-Latvian.
Lithuania RST 1093-89
RST 1095-89
LST 1282:1993
LST 1283:1993
LST 1284:1993
LST 1590-1
LST 1590-2
LST 1590-3
Malta ?1 MSA ISO 8859-3?2 1 There is a character set commonly referred to as ISO 646-MT (not an official part of the ISO 646 series), but I don’t know if it has been defined as a Maltese official standard;
2 The MSA has included all the ISO 8859 series among their standards; however, I haven’t seen any document saying specifically that MSA ISO 8859-3 is the national standard.
Norway NS 4551-1
NS 4551-2
ISO 646-NO.
Poland BN-74/3101-01 PN-T-42118:1993 BN-74/3101-01 is not an official part of ISO 646 series.
Romania SR 14111:1998
Soviet Union GOST 13052-74 GOST 19768-74
GOST 19768-87
GOST 13052-74 is commonly known as KOI-7;
GOST 19768-74 is commonly known as KOI-8;
check whether they were superseded as Russian standards
Sri Lanka SLS 1134:1990
SLS 1134:1996
SLS 1134:2004
SLS 1134:1990 was not approved;
SLS 1134:2004 is the same as the Sinhala (Unicode block) but assigned to the upper part of an 8-bit character set;
commonly called SlaSCII.
Sweden SEN 850200 b
SEN 850200 c
ISO 646-SE.
SEN 850200 b is identical to Finnish Standard SFS 4017.
Taiwan CNS 5205-1996 CNS 11643-1992 CNS 5205-1996 is not an official part of the ISO 646 series;
the more popular Big5 is not the national standard.
Thailand TIS 620-2529
TIS 620-2533
Turkey TS-5881:1988
United States ANSI X3.4 - 1968 Commonly called ASCII;
ISO 646-US.
United Kingdom BSI 4730 ISO 646-GB.
Vietnam TCVN 5712-1:1993
TCVN 5712-2:1993
TCVN 5712-3:1993
TCVN 6056:1995 TCVN 6909:2001 TCVN 5712 is also referred to as VSCII;
the more popular VISCII is not the national standard
TCVN 6056 is for the Chữ Nôm script.
Yugoslavia JUS I.B1.002
JUS I.B1.003
JUS I.B1.004
JUS I.B1.013 In Croatia, JUS I.B1.013 was superseded by the HRN I.B1.013:1988 standard;
check whether these standards were also followed in the other countries of former Yugoslavia;
JUS I.B1.002 = ISO 646-YU.
As can be seen, putting all the national standards in the template can be cumbersome. Perhaps it would be better if each article about a character set carried a clear statement such as “It is the national standard of (country), called (name or code).”

I would like to hear some feedback before making any changes.

Code Page Guy (talk) 16:39, 4 March 2017 (UTC)
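
The distinction drawn in points 3 and 4 of the proposal above, between a character set and the schemes used to encode it, can be sketched as follows. This is a minimal illustration assuming Python's built-in codecs; the dictionary names are hypothetical. One character is serialised three ways from the JIS X 0208 character set and three ways from the ISO 10646 character set.

```python
ch = "あ"  # JIS X 0208 row 4, cell 2 (0x2422); UCS code point U+3042

# One character set (JIS X 0208), three encoding schemes.
jis_schemes = {name: ch.encode(name) for name in ("euc_jp", "iso2022_jp", "shift_jis")}

# One character set (ISO 10646), three encoding schemes.
ucs_schemes = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in {**jis_schemes, **ucs_schemes}.items():
    # Print each byte sequence in hex so the differences are visible.
    print(f"{name:12} {data.hex(' ')}")
```

All six byte sequences differ, yet each decodes back to the same character, which is why EUC, ISO/IEC 2022, HZ and the UTFs are better described as encoding schemes than as character sets.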