From Wikipedia, the free encyclopedia
ASCII is a former featured article. Please see the links under Article milestones below for its original nomination page (for older articles, check the nomination archive) and why it was removed.
Article milestones
Date Process Result
January 19, 2004 Refreshing brilliant prose Kept
December 30, 2005 Featured article review Kept
May 10, 2008 Featured article review Demoted
Current status: Former featured article

Character set vs. Character encoding?

There is a definite difference between (finite) character sets and character encoding.

There is a sentence that currently reads:

Although ISO-8859-1 (Latin 1), its variant Windows-1252 (often mislabeled as ISO-8859-1), and the original 7-bit ASCII were the most common character encodings until the late 2000s, nowadays UTF-8 is becoming more common

However, UTF-8 is not a character set; it is a Transformation Format that can represent both the UCS-2 and UCS-4 sets. Throughout the article, moreover, the ASCII character set is referred to as a character encoding, which doesn't quite sound right to me.

Should there be a consensus on the terminology to use? It may help with a lot of the confusion that goes with the subject.

ratarsed (talk) 09:46, 6 November 2012 (UTC)

As of 6.2/6.3, the Unicode standard says, in Chapter 1, "The Unicode Standard contains 1,114,112 code points, most of which are available for encoding of characters.", so there's only one character set - the first 65,536 constitute the Basic Multilingual Plane, but that's just a subset of Unicode.
For encodings, it says "Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems." shortly before that in Chapter 1.
In Chapter 2, it says, in 2.4 "Code Points and Characters":
On a computer, abstract characters are encoded internally as numbers. To create a complete character encoding, it is necessary to define the list of all characters to be encoded and to establish systematic rules for how the numbers represent the characters.
The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.
In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF₁₆, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.
and, in 2.5 "Encoding Forms":
Computers handle numbers not simply as abstract mathematical objects, but as combinations of fixed-size units like bytes and 32-bit words. A character encoding model must take this fact into account when determining how to associate numbers with the characters.
Actual implementations in computer systems represent integers in specific code units of particular size—usually 8-bit (= byte), 16-bit, or 32-bit. In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. The “UTF” is a carryover from earlier terminology meaning Unicode (or UCS) Transformation Format. Each of these three encoding forms is an equally legitimate mechanism for representing Unicode characters; each has advantages in different environments.
All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; they are thus fully interoperable for implementations that may choose different encoding forms for various reasons. Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data.
It also says:
The Unicode Consortium fully endorses the use of any of the three Unicode encoding forms as a conformant way of implementing the Unicode Standard. It is important not to fall into the trap of trying to distinguish “UTF-8 versus Unicode,” for example. UTF-8, UTF-16, and UTF-32 are all equally valid and conformant ways of implementing the encoded characters of the Unicode Standard.
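To make the quoted passage concrete, here is a minimal Python sketch that serializes the same four code points in each of the three encoding forms and converts them back. The sample string, the `CODECS` table, and the choice of big-endian byte order for UTF-16 and UTF-32 are assumptions made for illustration; they are not taken from the Unicode Standard text quoted above.

```python
# Map each encoding form to a Python codec name.
# Big-endian byte order is an arbitrary choice for this sketch.
CODECS = {"UTF-8": "utf-8", "UTF-16": "utf-16-be", "UTF-32": "utf-32-be"}

# Four code points: U+0041, U+00E9, U+20AC, U+1F600 (the last is outside the BMP).
sample = "A\u00e9\u20ac\U0001F600"

# Serialize the same code points in every encoding form.
forms = {name: sample.encode(codec) for name, codec in CODECS.items()}

for name, data in forms.items():
    print(f"{name:6} {len(data):2d} bytes  {data.hex(' ')}")

# Every form decodes back to the identical sequence of code points,
# so converting between forms loses nothing.
round_trips = {name: forms[name].decode(codec) for name, codec in CODECS.items()}
```

The byte counts differ from form to form, but every entry in `round_trips` equals `sample`, which is the sense in which the forms are interchangeable.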
My personal inclination is to refer to "UTF-8-encoded Unicode", e.g. a file can be "ASCII text", meaning that it is a sequence of ASCII code points (so if it's a sequence of octets, none of those octets have the 8th bit set), or it could be "ISO 8859-1 text", meaning that it's a sequence of ISO 8859-1 code points (so the octets with the 8th bit not set are ASCII characters), or it could be "UTF-8-encoded Unicode text", or it could be "UTF-16-encoded text" (which means it's either big-endian or little-endian, indicated either by a byte-order mark or some out-of-band indication such as "the person who sent it to me told me it's big-endian" or "this is Windows, so it's little-endian"), or it could be "UTF-32-encoded text".
A separate characteristic of those files is the subset of the character set they contain; an "ISO 8859-1 text" file could contain the ASCII subset (so if it's a stream of octets, no octet in the file has the 8th bit set), in which case it's also an "ASCII text" file. A Unicode file, regardless of encoding, could contain the Basic Multilingual Plane subset. Guy Harris (talk) 18:45, 22 April 2014 (UTC)
(Note, of course, that "UTF-8-encoded Unicode" is redundant; if it's encoding something other than Unicode, it's not UTF-8. The redundant phrase, however, may serve to remind people that "UTF-8" encodes the entirety of Unicode, so, unless somebody explicitly says "but this text must use only characters in the Basic Multilingual Plane" or "but this text must use only characters in ISO 8859-1" or even "but this text must use only ASCII characters", code processing that text must be prepared to see arbitrary Unicode characters. It may also serve to remind people that "is this UTF-8 or is this Unicode?" is an ill-formed question, perhaps based on the long-out-of-date assumption that UCS-2 is Unicode and that unless each character is represented by two octets it's "not really Unicode".) Guy Harris (talk) 19:38, 22 April 2014 (UTC)
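The distinction between the encoding of a file and the subset of characters it happens to contain can be sketched in Python as follows. The function name `narrowest_repertoire` is hypothetical, and treating "ISO 8859-1" as simply the code points below U+0100 is a simplification assumed for this sketch.

```python
def narrowest_repertoire(text: str) -> str:
    """Name the smallest of four nested repertoires that holds every character."""
    # The highest code point decides which nested subset the text fits in.
    highest = max(map(ord, text), default=0)
    if highest < 0x80:
        return "ASCII"
    if highest < 0x100:
        return "ISO 8859-1"  # simplification: first 256 code points
    if highest < 0x10000:
        return "Basic Multilingual Plane"
    return "full Unicode"


# Text limited to the ASCII subset produces identical octets whether it is
# labelled ASCII, ISO 8859-1, or UTF-8.
ascii_only = "plain text"
same_octets = (
    ascii_only.encode("ascii")
    == ascii_only.encode("latin-1")
    == ascii_only.encode("utf-8")
)
```

The classification depends only on the decoded characters, so it gives the same answer whichever encoding form the text arrived in.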

Delete vs Backspace

As video terminals began to replace printing ones, the value of the "rubout" character was lost. DEC systems, for example, interpreted "Delete" to mean "remove the character before the cursor" and this interpretation also became common in Unix systems.

That would be because the DEC terminals were connected to Unix systems which were configured to understand the terminals. Some Unix systems used the 7f (DEL) character as the "interrupt" key, while others used the 03 (ETX) character for that purpose.

Most other systems used "Backspace" for that meaning and used "Delete" to mean "remove the character at the cursor". That latter interpretation is the most common now.

This seems to conflate the DEL character with the Delete key, which when used to mean "delete character at (under or forward of) the cursor" typically sends a sequence like 1b-5b-33-7e (ESC [ 3 ~), rather than the single character 7f (DEL).

Yes, there are local applications such as Unix Xterm, or remote connectors such as PuTTY, SSH, and Telnet, but their choice between DEL and BS depends on the target service and/or local preference settings, so they don't sway the argument of which is "most common". Furthermore, some target services follow the EMACS tradition and use character 04 (EOT) for "delete character under cursor".
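The byte values mentioned in this comment can be laid side by side in a small Python sketch. The table `KEY_MEANINGS` and the function `describe` are hypothetical names, and the meanings listed are only the conventions described above; real terminals and services are configurable and may differ.

```python
# Conventional meanings of the byte sequences discussed above.
# These are assumptions for illustration; actual behaviour is configurable.
KEY_MEANINGS = {
    b"\x7f": "DEL character (often: erase the previous character)",
    b"\x08": "BS character (often: erase the previous character)",
    b"\x1b[3~": "Delete key escape sequence (erase the character at the cursor)",
    b"\x04": "EOT, Ctrl-D (Emacs-style: erase the character at the cursor)",
}


def describe(sequence: bytes) -> str:
    """Look up what a received byte sequence conventionally means."""
    return KEY_MEANINGS.get(sequence, "not covered by this sketch")


# The Delete key sequence is four octets, not the single DEL octet.
delete_key_hex = b"\x1b[3~".hex(" ")
```

Here `delete_key_hex` spells out the 1b-5b-33-7e sequence quoted above, which shows why the Delete key and the DEL character should not be conflated.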

But most applications now do not receive a stream of characters at all; rather they receive events from the local windowing system (either directly, or from the browser within which they run).

Other changes have also occurred: the Return key has been renamed Enter or just ↲ on most keyboards, and it is treated as End-Of-Record or Next-Line rather than as a return on the same line.

Martin Kealey (talk) 01:56, 10 July 2014 (UTC)

More accurately:
The DEC terminals - and, before DEC made terminals, the non-DEC terminals such as the Teletype Model 33 - were connected to various DEC operating systems, which interpreted DEL as "delete previous character", and that eventually got adopted by UNIX systems as well. Originally, UNIX systems imitated Multics systems, and used # for "delete previous character" and @ for "delete previous line". The Multics systems at least had the excuse that they had to support non-ASCII terminals with wired-in local echo, such as the IBM 2741 and IBM 1050, so they couldn't do DEC-style tricks when echoing DEL. UNIX didn't have that problem, but they went with it anyway, and used DEL as the interrupt character.
The BSD folk decided that was bogus and implemented a more DEC-style tty interface, with the erase and kill characters echoing DEC-style (print the deleted characters between slashes/backslashes on printing terminals, erase them with backspace and space on display terminals), and with DEC-style choices of DEL for erase, ^U for line kill, and ^C for interrupt; that ended up becoming the most common tty interface on UN*Xes as well. The characters were settable on UN*X, so sometimes BS rather than DEL was used.
So it was DEC's operating systems, not UNIX, that gave us "DEL as what you type to delete the previous character".
And the notion that you type Return at the end of the line, even if it sends CR, is also a DECism; no DEC OS I remember required that you type both CR and LF at the end of a line. At least with the older OSes, the CR would be echoed as CR LF, and would show up as CR LF as input. (RSX-11 and VMS were a bit weird here, in that they treated FORTRAN line format as the proper text format; I think that typing CR ended the line, but echoed only as CR, and the next line output to the terminal would begin with LF and end with CR as sent to the terminal, because it would normally have SP as the initial FORTRAN control character. But I digress....)
UNIX followed in the Multics "LF by itself, with no CR, at the end of a line" tradition; typing CR would end the line, cause CR LF to be echoed, and cause just an LF to appear in the input stream. Guy Harris (talk) 02:30, 10 July 2014 (UTC)
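The input handling described in these comments can be approximated by a short Python sketch of a display-terminal line editor: DEL erases the previous character, ^U kills the line, and a typed CR ends the line, is echoed as CR LF, and reaches the program as LF. The function `cook_line` and its constants are hypothetical, and the sketch assumes one octet per character; it is an illustration, not the code of any actual tty driver.

```python
DEL, CTRL_U, CR, LF = 0x7F, 0x15, 0x0D, 0x0A


def cook_line(raw: bytes) -> tuple[bytes, bytes]:
    """Return (line delivered to the program, octets echoed to the display)."""
    line = bytearray()
    echo = bytearray()
    for octet in raw:
        if octet == DEL:
            # Erase: drop the last character and rub it out on screen
            # with backspace, space, backspace.
            if line:
                line.pop()
                echo += b"\b \b"
        elif octet == CTRL_U:
            # Line kill: rub out everything typed so far.
            echo += b"\b \b" * len(line)
            line.clear()
        elif octet == CR:
            # Return ends the line: echo CR LF, deliver a lone LF.
            line.append(LF)
            echo += b"\r\n"
            break
        else:
            line.append(octet)
            echo.append(octet)
    return bytes(line), bytes(echo)
```

For example, `cook_line(b"helo\x7flo\r")` delivers `b"hello\n"` to the program while the terminal receives the rub-out sequence and a final CR LF.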
I've made some edits to clarify that this is a software interpretation of input characters, and to get rid of the video terminal stuff entirely, as well as to ask for citations about the BS-vs-DEL claims. Guy Harris (talk) 03:10, 10 July 2014 (UTC)

Second representation of the printable character list

I've added a previously removed second representation of the ASCII characters that supports easy copy-pasting. I wasn't aware that it had already been on the page. My edit got reverted. However, I think it should still be there.

In Wikipedia, there are lots of examples where information is displayed multiple times, even when we don't count efforts to help disabled people, like "spoken wikipedia" or descriptions below images: take AES as an example. The text perfectly describes the steps but the images display a second representation of those steps (and a third is in the image description, but as previously described we don't count it).

We can also see the redundancy in the principle of the lead paragraph: except for the X in "is a X", most information gets repeated in the article below. It gives the reader a concise definition of the topic that can be retrieved without having to read the whole article. The second representation of the ASCII list fulfills this second purpose: the reader doesn't have to read every single character to get a full list of ASCII characters.

And, redundancy is still present in the list itself:

Binary Oct Dec Hex Glyph
010 0000 040 32 20 (space)

The list gives us four representations of the character's number.

I think there are different use cases linked to the two representations. The first (currently the only) representation helps readers with various conversions between a character and its ASCII code, hence the multiple number formats. The second (disputed) representation helps readers who want to act on the whole set of characters: I've used it for password generation, and others might want to use it in a program if they write in a language that doesn't link numbers and characters as easily as C does.

What do you think? — Preceding unsigned comment added by Muelleum (talk | contribs) 21:04, 12 August 2014 (UTC)

Hello there! Hm, so the main purpose would be to make the whole ASCII set of characters easily available for copying and pasting? Maybe some kind of a compromise could be to provide it in form of a note after "There are 95 printable characters in total", using the {{Efn}} template? — Dsimic (talk | contribs) 21:54, 12 August 2014 (UTC)
I'm OK with that. Muelleum (talk) 23:24, 12 August 2014 (UTC)
Looking good, having a scrollable box was the only solution for long lines in reference tooltips. — Dsimic (talk | contribs) 23:32, 12 August 2014 (UTC)
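For readers who want the whole printable set in a program rather than by copy and paste, both representations discussed in this thread can be generated. The following Python sketch is one way to do it; the names `PRINTABLE` and `table_row` are hypothetical, and the row layout simply imitates the example row quoted above.

```python
# The 95 printable ASCII characters: space (0x20) through tilde (0x7E).
PRINTABLE = "".join(chr(code) for code in range(0x20, 0x7F))


def table_row(char: str) -> str:
    """Format one character as: binary, octal, decimal, hex, glyph."""
    code = ord(char)
    bits = f"{code:07b}"  # 7-bit binary, split 3 + 4 as in the article table
    glyph = "(space)" if char == " " else char
    return f"{bits[:3]} {bits[3:]} {code:03o} {code} {code:02X} {glyph}"


# One row per printable character.
rows = [table_row(char) for char in PRINTABLE]
```

With this sketch, `table_row(" ")` reproduces the row `010 0000 040 32 20 (space)` shown earlier, and `PRINTABLE` is the copyable string.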

A three-part article on ASCII

Sometime in the 1980s, I read an article in three successive issues of a personal (micro) computer magazine by the "inventor of ASCII", whoever that was, that went over all the non-alphabetic codes and was very enlightening. I've never been able to find it again. If anybody knows, please send me a message. Thanks. deisenbe (talk) 16:08, 14 August 2014 (UTC)