Template talk:Character encodings
WikiProject Computing (Rated Template-class)
Encoding vs. TES
HZ is a TES (Transfer Encoding Syntax, see UTR17) of GB2312, not a character encoding proper. Nor is it a national standard. If it is kept in this template at all, it should be in the misc section.
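To illustrate the point, here is a minimal Python sketch. It assumes the standard library's `hz` and `gb2312` codecs are fair models of HZ and of the usual 8-bit form of GB2312; the sample string and the helper name `hz_payload_matches_gb2312` are illustrative only. It checks that the HZ output is 7-bit and carries the same GB2312 byte values with the high bit cleared, i.e. HZ re-wraps an existing coded character set rather than defining a new one.

```python
SAMPLE = "中文"  # arbitrary sample text, chosen only for illustration


def hz_payload_matches_gb2312(text: str) -> bool:
    """True if the HZ form is pure 7-bit and contains the GB2312 bytes with bit 8 cleared."""
    gb = text.encode("gb2312")                # 8-bit form of GB2312
    hz = text.encode("hz")                    # 7-bit transfer syntax with escape sequences
    seven_bit = bytes(b & 0x7F for b in gb)   # same code values, high bit stripped
    return seven_bit in hz and all(b < 0x80 for b in hz)


if __name__ == "__main__":
    print(SAMPLE.encode("gb2312").hex())
    print(SAMPLE.encode("hz"))
    print(hz_payload_matches_gb2312(SAMPLE))
```

The check only covers a single run of GB2312 characters; mixed ASCII and Chinese text would need the escape sequences parsed properly.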
Similarly, UTF-7 is also a TES, not a UTF (despite the name). So I was thinking of removing UTF-7 from this template. It's included in the "Table Unicode" template, and I think that is enough.
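A rough way to see this, sketched in Python under the assumption that the built-in `utf-7` codec is representative: for a string made up entirely of non-ASCII characters, the UTF-7 output is a base64 run wrapping the UTF-16 (big-endian) code units. The helper name `utf7_payload` and the sample string are hypothetical, and the helper deliberately handles only that single-run case.

```python
import base64

SAMPLE = "é€"  # arbitrary all-non-ASCII sample, for illustration only


def utf7_payload(text: str) -> bytes:
    """Decode the base64 run of the UTF-7 form of an all-non-ASCII string."""
    utf7 = text.encode("utf-7")                   # ASCII-only transfer form
    body = utf7[1:].rstrip(b"-")                  # drop the leading '+' and any closing '-'
    padded = body + b"=" * (-len(body) % 4)       # restore the base64 padding UTF-7 omits
    return base64.b64decode(padded)


if __name__ == "__main__":
    print(SAMPLE.encode("utf-7"))
    print(utf7_payload(SAMPLE) == SAMPLE.encode("utf-16-be"))
```

So UTF-7 behaves as a 7-bit wrapper around UTF-16 data, much as HZ wraps GB2312.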
I've tried to group certain encodings in a "logical" way. For instance, even if the GOST standard is/was a national standard, it's for 4, 5, and 6-bit character encodings. Not something used in modern computers. So it's amongst "misc" items. Likewise, HKSCS is near Big5 and CP950 since they are so closely related. Etc.
The Big5-HKSCS encoding is not really supported by Windows. Windows 950 should not be considered HKSCS compatible by default. Windows Vista only supports the Unicode characters of Big5-HKSCS. Microsoft HKSCS —Preceding unsigned comment added by 22.214.171.124 (talk) 04:57, 26 July 2009 (UTC)
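As a rough check of the claim that code page 950 is not HKSCS compatible, the following Python sketch compares the standard library's `cp950` and `big5hkscs` codecs. Treating Python's `cp950` codec as a stand-in for what Windows actually does is an assumption, and the helper names are made up for illustration. It collects the BMP characters that `big5hkscs` can encode but `cp950` cannot.

```python
def encodable(ch: str, codec: str) -> bool:
    """True if the single character can be encoded with the named codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def hkscs_only_chars() -> list:
    """BMP characters encodable in big5hkscs but not in cp950."""
    return [
        chr(cp)
        for cp in range(0x80, 0x10000)
        if encodable(chr(cp), "big5hkscs") and not encodable(chr(cp), "cp950")
    ]


if __name__ == "__main__":
    missing = hkscs_only_chars()
    print(len(missing), "characters are in big5hkscs but not cp950")
```

This only compares the two Python mapping tables; it says nothing about fonts or about how a given Windows version treats the private use area.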
UTF-8, read that article please. It is not a "single character" (like horizontal tabulation, backspace etc.), it is a piece of encoding troubles related to line separation. Incnis Mrsi (talk) 09:06, 15 March 2010 (UTC)
I notice that a few code pages are missing, namely the following:
- Code page 708 (Arabic ASMO);
- Code page 851 (Greek III);
- Code page 853 (Latin III);
- Code page 868 (IBM Persian);
- Code page 934 (MS-DOS Korean);
- Code page 938 (MS-DOS Taiwanese);
- Code page 999 (Yugoslavian ASCII-7).
I have the Korean edition of MS-DOS 6.2, which uses code page 934. It and code page 938 are also referenced in the MS-DOS 6.22 COUNTRY.TXT file.
MS-DOS code page 999 seems to be the code page version of the Yugoslavian ASCII-7 character set, commonly used especially in Croatia and Slovenia before the advent of code page 852. One notable user of it is the software of the Slovenian programming company SAOP.
1259, 1260, 1262-1269
What are these Windows Codepages? What is CP0028?
The design of this template is getting more and more complete, but a few things could be done to make it clearer. Here are some suggestions:
- Make a clear distinction between what are “Character encoding methods”, “Character sets” and “Code pages”.
- The term “code page” is used mainly by IBM and Microsoft; very few other manufacturers or organizations use it. The so-called “Miscellaneous code pages” are not code pages. Perhaps a better name would be “Miscellaneous character sets”.
- EUC, ISO/IEC 2022 and HZ are not character sets. They are encoding methods (schemes) which are used to encode character sets, namely JIS, KSX, GB and CNS character sets.
- The same goes for all UTF, which are encoding schemes to encode the ISO 10646 character set.
- The left column is already arranged according to platform. That could be expanded, and some character sets included in the “Platform specific” section could be moved to the “right” place:
- Adobe: Adobe Standard, Adobe Latin 1, Adobe Symbols, etc.
- DEC: DEC Multinational, DEC Turkish, DEC Greek, DEC Cyrillic, DEC Hebrew, DEC/8/ASMO, DEC Technical, DEC Kanji, DEC Korean, DEC Hanzi, DEC Hanyu, etc.
- Data General: Data General International, Data General Turkish, Data General Arabic, Data General Kana, Data General Symbols, etc.
- Hewlett-Packard: HP Roman-8, HP Turkish-8, HP East-8, HP Greek-8, HP Cyrillic-8, HP Hebrew-8, HP Arabic-8, HP Thai-8, HP Japan-15, HP Korea-15, HP PRC-15, HP ROC-15, HP Math-8, etc.
- LaTeX: T1 (Cork Encoding), T2A, T2B, T2C, T3, T4, T5, etc.
- ISO: ISO is not a platform in itself, but some platforms (for instance, UNIX) are designed to follow the ISO standards. Also, many character sets that are not specific to any platform are designed following the ISO standards. For the sake of convenience, perhaps we could consider ISO a “platform”.
- “Acorn” is not a character set but rather a manufacturer (as are IBM or Apple). Perhaps, a better name would be “RISC OS character set”.
- Is it worthwhile to have an entry called “National standards”? Of course, some Governments or some Official National Bodies have defined their national standards. But, after that, the manufacturers or organizations have implemented them or some variations of them. And in some cases it was the opposite, some Governments or some Official National Bodies have adopted existing standards as their national standard. But that list, as it is, is a mixed bag and rather incomplete. Here is what I have found out so far:
Country | Standards (7-bit, 8-bit, multibyte, 16-bit) | Notes
Arab countries | ASMO 449; ASMO 708 |
Armenia | AST 34.005:1997; AST 34.002:1997 | Commonly called ArmSCII; AST 34.002:1997 defines two variants: ArmSCII-8 for ISO environment; ArmSCII-8a for DOS and Macintosh environment.
Bangladesh | BSD 1520:1995 | BSD 1520:1995 was not approved; BSD 1520:2011 is the same as the Bengali (Unicode block) but assigned to the upper part of an 8-bit character set; commonly called BSCII.
Brazil | | Commonly called BraSCII.
Canada | CSA Z243.4 1985 alt.11; CSA Z243.4 1985 alt.12 | ISO 646-CA.
China | GB 1988 - 1980; GB 2312-80 | GB 1988 - 1980 = ISO 646-CN.
Croatia | HRN I.B1.013:1988 |
Cuba | NC 99-10 - 1981 | ISO 646-CU.
Czechoslovakia | ČSN 36 91 03 | Nearly identical to ISO Latin-2.
Denmark | DS 2089-1974 | Not an official part of ISO 646 series.
Estonia | EVS 8:1993 | EVS 8:1993 has defined 3 “tables”: table 3.1 for ISO environment; table 3.2 for EBCDIC; table 3.3 for DOS.
Finland | SFS 4017 | ISO 646-FI; identical to Swedish Standard SEN 850200 b.
France | NF Z 62-010 - 1973; NF Z 62-010 - 1982 | ISO 646-FR.
Georgia | SSP 18.1:1998 | Commonly known as Geostd8; the more popular GeoSCII is not the national standard.
Federal Republic of Germany | DIN 66003 | ISO 646-DE.
Greece | ELOT 927; ELOT 928 |
Hungary | MSZ 77953 | ISO 646-HU.
India | IS 13194:1991 (7-bit and 8-bit) | IS 13194:1991 defines several character sets: EA-ISCII for 7-bit environment; ISCII for ISO environment; PC-ISCII for DOS.
International | ISO 646-1973 IRV; ISO 10646 |
Iran | ISIRI 2900; ISIRI 3342 | ISIRI 2900 is glyph-based; ISIRI 3342 is character-based.
Ireland | IS 433 - 1996 | Not an official part of ISO 646 series.
Israel | SI 960; SI 1311:1988 | The International Register number went on changing (IR 138 >> IR 198 >> IR 234) as the Standards Institute of Israel went on updating the character set, but ISO kept the name as ISO 8859-8.
Italy | UNI 0204 - 1970 | ISO 646-IT.
Japan | JIS C 6220-1969; JIS C 6220-1976; JIS C 6226-1978; JIS C 6226-1983; JIS X 0208:1990; JIS X 0212:1990; JIS X 0213:2000; JIS X 0213:2004 | JIS C 6220 (Roman version, not Katakana version) = ISO 646-JP.
Kazakhstan | ST RK 920:91; ST RK 1048:2002 | ST RK 920:91 is for DOS; ST RK 1048:2002 is for Windows.
North Korea | KPS 9566-97 |
South Korea | KS C 5636; KS X 1003 - 1989; KS C 5601-1992 | KS C 5636 is not an official part of ISO 646 series.
Latvia | RST 1040-90 | RST 1040-90 is commonly known as Code Page 866-Latvian.
Lithuania | RST 1093-89 |
Malta | ? [1]; MSA ISO 8859-3? [2] | [1] There is a character set commonly referred as ISO 646-MT (not an official part of the ISO 646 series), but I don’t know if it has been defined as a Maltese official standard; [2] The MSA has included all the ISO 8859 series among their standards; however, I haven’t seen any document saying specifically that MSA ISO 8859-3 is the national standard.
Norway | NS 4551-1 | ISO 646-NO.
Poland | BN-74/3101-01; PN-T-42118:1993 | BN-74/3101-01 is not an official part of ISO 646 series.
Romania | SR 14111:1998 |
Soviet Union | GOST 13052-74; GOST 19768-74 | GOST 13052-74 is commonly known as KOI-7; GOST 19768-74 is commonly known as KOI-8; check if they superseded as Russian standards.
Sri Lanka | SLS 1134:1990 | SLS 1134:1990 was not approved; SLS 1134:2004 is the same as the Sinhala (Unicode block) but assigned to the upper part of an 8-bit character set; commonly called SlaSCII.
Sweden | SEN 850200 b; SEN 850200 c | SEN 850200 b is identical to Finnish Standard SFS 4017.
Taiwan | CNS 5205-1996; CNS 11643-1992 | CNS 5205-1996 is not an official part of the ISO 646 series; the more popular Big5 is not the national standard.
Thailand | TIS 620-2529 |
Turkey | TS-5881:1988 |
United States | ANSI X3.4 - 1968 | Commonly called ASCII.
United Kingdom | BSI 4730 | ISO 646-GB.
Vietnam | TCVN 5712-1:1993; TCVN 6056:1995; TCVN 6909:2001 | TCVN 5712 is also referred as VSCII; the more popular VISCII is not the national standard; TCVN 6056 is for the Chữ Nôm script.
Yugoslavia | JUS I.B1.013 | In Croatia, JUS I.B1.013 was superseded as the HRN I.B1.013:1988 standard; check if these standards were not followed in the other countries of former Yugoslavia; JUS I.B1.002 = ISO 646-YU.
- As can be seen, putting all the national standards in the template can be cumbersome. Perhaps it would be better if, in each article about a character set, we put the clear statement “It is the national standard of (country), called (name or code).”
I would like to hear some feedback before making some changes.
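The distinction drawn in the suggestions above, between a character set and the schemes used to encode it, can be sketched in Python. This assumes the standard library codecs named below behave as expected; the sample characters and the helper name `encode_all` are illustrative only. One ISO 10646 code point is serialized three ways by the UTFs, and one JIS X 0208 character is carried three ways by EUC-JP, ISO-2022-JP and Shift_JIS.

```python
UCS_SCHEMES = ("utf-8", "utf-16-be", "utf-32-be")     # schemes for ISO 10646
JIS_SCHEMES = ("euc_jp", "iso2022_jp", "shift_jis")   # schemes carrying JIS X 0208


def encode_all(ch: str, schemes) -> dict:
    """Encode one character under each named scheme and return the byte strings."""
    return {name: ch.encode(name) for name in schemes}


if __name__ == "__main__":
    # U+20AC EURO SIGN: one code point, three serializations.
    for name, data in encode_all("€", UCS_SCHEMES).items():
        print(f"{name:10} {data.hex()}")
    # HIRAGANA LETTER A: one JIS X 0208 character, three byte sequences.
    for name, data in encode_all("あ", JIS_SCHEMES).items():
        print(f"{name:10} {data.hex()}")
```

In both halves the character set stays fixed and only the encoding scheme changes, which is the argument for listing EUC, ISO/IEC 2022, HZ and the UTFs separately from the character sets themselves.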
Please update the Apple 1 link to point to Apple_I#External_links.
There's a draft of the Apple III character set at Draft:Apple III character set but it will never survive by itself. Consider merging all old Apple sets into one article, source it well, and write up some of the history about them, otherwise it will all just get deleted and you might as well remove them from the infobox now.
The Amstrad link should probably point to Amstrad CP/M Plus character set.
The Apple Sabine link should be removed and that article should be deleted.
The only reference to Elwro Junior is here: List of ZX Spectrum clones#Elwro_800_Junior Currently the link points to an article about Polish spelling. I'm actually not sure if the Elwro Junior has its own character set; it may just be the same as the ZX Spectrum's character set.
The Mattel Aquarius character set article will not survive on its own; I recommend merging it into the Aquarius article.
The Minitel character set article has been deleted. Either remove it from the infobox, or put the character set in the Minitel article.
The Sega SC-3000 character set article should probably be deleted. Games at the time tended to use sprites and tiles and the meaning / appearance of a given code would be determined by whatever was in sprite ROM.
Semi-protected edit request on 9 July 2020
This edit request has been answered.
The leading word "IBM" and the trailing word "emulations" should not be in this list. These terms don't make any sense next to the words Apple, Adobe, etc. Following are the lines to change - I just removed IBM and emulations from each:
- Not done: please provide reliable sources that support the change you want to be made. Eggishorn (talk) (contrib) 17:00, 9 July 2020 (UTC)
I don't know of sources, I'm sorry, for the things to be changed are plain: the term "IBM" doesn't precede Apple - why would it. The term "emulations" doesn't follow Apple, why would it? Are you aware of the character sets used in those machines? They aren't emulations of any IBM anything. The terms are unfortunately free of meaning. I didn't know this would be an unusual request. Sorry to have bothered you. — Preceding unsigned comment added by 126.96.36.199 (talk) 17:05, 9 July 2020 (UTC)
- The phrase "IBM Apple Macintosh emulations" means emulations of Apple Macintosh, as used by IBM; it does not mean emulations of IBM.
- The Apple encodings are listed by their actual names under the MacOS code pages ("scripts") heading already. The IBM Apple Macintosh emulations heading lists the code page numbers assigned by IBM to the Apple encodings, e.g. Mac OS Roman is numbered 1275 by IBM. These numbers are only used by IBM or by things associated with IBM (e.g. software running under IBM products, or possibly ICU, which started off as an IBM project): for example, Microsoft assigns the same encoding (Mac OS Roman) the completely different code page number 10000 (I'm not entirely sure why these are not also listed).
- -- HarJIT (talk) 17:57, 9 July 2020 (UTC)
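As a small illustration of one encoding carrying several vendor-specific labels, here is a Python sketch. The two code page numbers are the ones quoted in the comment above; the dictionary and the helper name `same_codec` are made up for this example, and the label strings are Python's own codec names, not IBM's or Microsoft's.

```python
import codecs

# Numbers as quoted in the discussion above; the dict itself is illustrative only.
MAC_OS_ROMAN_NUMBERS = {"IBM": 1275, "Microsoft": 10000}


def same_codec(*labels: str) -> bool:
    """True if every label resolves to the same Python codec."""
    return len({codecs.lookup(label).name for label in labels}) == 1


if __name__ == "__main__":
    # Python has its own set of names for Mac OS Roman.
    print(same_codec("mac_roman", "macintosh", "macroman"))
    # Byte 0x8E is 'é' in Mac OS Roman, whatever number a vendor gives the table.
    print(b"\x8e".decode("mac_roman"))
    for vendor, number in MAC_OS_ROMAN_NUMBERS.items():
        print(f"{vendor} calls this encoding code page {number}")
```

The byte-to-character table is the same in each case; only the identifier differs by vendor.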