Talk:ASCII

From Wikipedia, the free encyclopedia
Former featured article ASCII is a former featured article. Please see the links under Article milestones below for its original nomination page (for older articles, check the nomination archive) and why it was removed.
Article milestones
Date Process Result
January 19, 2004 Refreshing brilliant prose Kept
December 30, 2005 Featured article review Kept
May 10, 2008 Featured article review Demoted
Current status: Former featured article
          This article is of interest to the following WikiProjects:
WikiProject Computing (Rated C-class, High-importance)
WikiProject icon This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.
 
WikiProject Writing systems (Rated C-class, Mid-importance)
WikiProject icon This article falls within the scope of WikiProject Writing systems, a WikiProject interested in improving the encyclopaedic coverage and content of articles relating to writing systems on Wikipedia. If you would like to help out, you are welcome to drop by the project page and/or leave a query at the project’s talk page.
C-Class article C  This article has been rated as C-Class on the project's quality scale.
 Mid  This article has been rated as Mid-importance on the project's importance scale.
 
Wikipedia Version 1.0 Editorial Team / v0.5 (Rated C-class)
WikiProject icon This article has been reviewed by the Version 1.0 Editorial Team.
C-Class article C  This article has been rated as C-Class on the quality scale.
 ???  This article has not yet received a rating on the importance scale.
 
This article is within of subsequent release version of Engineering, applied sciences, and technology.
This article has been selected for Version 0.5 and subsequent release versions of Wikipedia.
WikiProject Typography (Rated C-class, Mid-importance)
WikiProject icon This article is within the scope of WikiProject Typography, a collaborative effort to improve the coverage of articles related to Typography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
C-Class article C  This article has been rated as C-Class on the quality scale.
 Mid  This article has been rated as Mid-importance on the importance scale.
 

Character set vs. Character encoding?

There is a definite difference between (finite) character sets and character encoding.

There is a sentence that currently reads:

Although ISO-8859-1 (Latin 1), its variant Windows-1252 (often mislabeled as ISO-8859-1), and the original 7-bit ASCII were the most common character encodings until the late 2000s, nowadays UTF-8 is becoming more common

UTF-8, however, is not a character set at all; it is a transformation format that can represent both the UCS-2 and UCS-4 sets. Throughout the article, though, the ASCII character set is referred to as a character encoding, which doesn't quite sound right to me.

Should there be a consensus on the terminology to use? It may help with a lot of the confusion that goes with the subject.

ratarsed (talk) 09:46, 6 November 2012 (UTC)

As of 6.2/6.3, the Unicode standard says, in Chapter 1, "The Unicode Standard contains 1,114,112 code points, most of which are available for encoding of characters.", so there's only one character set - the first 65,536 constitute the Basic Multilingual Plane, but that's just a subset of Unicode.
For encodings, it says "Unicode characters are represented in one of three encoding forms: a 32-bit form (UTF-32), a 16-bit form (UTF-16), and an 8-bit form (UTF-8). The 8-bit, byte-oriented form, UTF-8, has been designed for ease of use with existing ASCII-based systems." shortly before that in Chapter 1.
In Chapter 2, it says, in 2.4 "Code Points and Characters":
On a computer, abstract characters are encoded internally as numbers. To create a complete character encoding, it is necessary to define the list of all characters to be encoded and to establish systematic rules for how the numbers represent the characters.
The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.
In the Unicode Standard, the codespace consists of the integers from 0 to 10FFFF₁₆, comprising 1,114,112 code points available for assigning the repertoire of abstract characters.
and, in 2.5 "Encoding Forms":
Computers handle numbers not simply as abstract mathematical objects, but as combinations of fixed-size units like bytes and 32-bit words. A character encoding model must take this fact into account when determining how to associate numbers with the characters.
Actual implementations in computer systems represent integers in specific code units of particular size—usually 8-bit (= byte), 16-bit, or 32-bit. In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. The “UTF” is a carryover from earlier terminology meaning Unicode (or UCS) Transformation Format. Each of these three encoding forms is an equally legitimate mechanism for representing Unicode characters; each has advantages in different environments.
All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; they are thus fully interoperable for implementations that may choose different encoding forms for various reasons. Each of the three Unicode encoding forms can be efficiently transformed into either of the other two without any loss of data.
It also says:
The Unicode Consortium fully endorses the use of any of the three Unicode encoding forms as a conformant way of implementing the Unicode Standard. It is important not to fall into the trap of trying to distinguish “UTF-8 versus Unicode,” for example. UTF-8, UTF-16, and UTF-32 are all equally valid and conformant ways of implementing the encoded characters of the Unicode Standard.
My personal inclination is to refer to "UTF-8-encoded Unicode", e.g. a file can be "ASCII text", meaning that it is a sequence of ASCII code points (so if it's a sequence of octets, none of those octets have the 8th bit set), or it could be "ISO 8859-1 text", meaning that it's a sequence of ISO 8859-1 code points (so the octets with the 8th bit not set are ASCII characters), or it could be "UTF-8-encoded Unicode text", or it could be "UTF-16-encoded text" (which means it's either big-endian or little-endian, indicated either by a byte-order mark or some out-of-band indication such as "the person who sent it to me told me it's big-endian" or "this is Windows, so it's little-endian"), or it could be "UTF-32-encoded text".
A separate characteristic of those files is the subset of the character set they contain; an "ISO 8859-1 text" file could contain the ASCII subset (so if it's a stream of octets, no octet in the file has the 8th bit set), in which case it's also an "ASCII text" file. A Unicode file, regardless of encoding, could contain the Basic Multilingual Plane subset. Guy Harris (talk) 18:45, 22 April 2014 (UTC)
(Note, of course, that "UTF-8-encoded Unicode" is redundant; if it's encoding something other than Unicode, it's not UTF-8. The redundant phrase, however, may serve to remind people that "UTF-8" encodes the entirety of Unicode, so, unless somebody explicitly says "but this text must use only characters in the Basic Multilingual Plane" or "but this text must use only characters in ISO 8859-1" or even "but this text must use only ASCII characters", code processing that text must be prepared to see arbitrary Unicode characters. It may also serve to remind people that "is this UTF-8 or is this Unicode?" is an ill-formed question, perhaps based on the long-out-of-date assumption that UCS-2 is Unicode and that unless each character is represented by two octets it's "not really Unicode".) Guy Harris (talk) 19:38, 22 April 2014 (UTC)
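To make the distinction in this thread concrete, here is a minimal Python sketch. It assumes nothing beyond the standard codecs; the helper names `encoding_forms` and `is_ascii_subset` are made up for illustration and do not come from the article or the Unicode Standard. It shows one code point per character, three different byte-level encoding forms for that same code point, and the "no octet has the 8th bit set" test for the ASCII subset.

```python
def encoding_forms(text: str) -> dict:
    """Return the code points of `text` and its bytes in each Unicode encoding form.

    Hypothetical helper for illustration; big-endian variants are chosen so
    that no byte-order mark is emitted.
    """
    return {
        "code points": [f"U+{ord(ch):04X}" for ch in text],
        "utf-8": text.encode("utf-8"),
        "utf-16": text.encode("utf-16-be"),
        "utf-32": text.encode("utf-32-be"),
    }


def is_ascii_subset(octets: bytes) -> bool:
    """True if no octet has the 8th bit set, i.e. the data is also ASCII text."""
    return all(octet < 0x80 for octet in octets)


if __name__ == "__main__":
    # An ASCII letter, a Latin-1 letter, and a character outside the BMP.
    for sample in ("A", "\u00e9", "\U0001F600"):
        print(encoding_forms(sample))

    # ISO 8859-1 text that happens to stay within the ASCII subset.
    print(is_ascii_subset("plain text".encode("latin-1")))
    # ISO 8859-1 text that uses a character outside ASCII.
    print(is_ascii_subset("caf\u00e9".encode("latin-1")))
```

The same abstract character always has the same code point; only the sequence of code units differs between the encoding forms, and each form decodes back to the same string.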

Delete vs Backspace

As video terminals began to replace printing ones, the value of the "rubout" character was lost. DEC systems, for example, interpreted "Delete" to mean "remove the character before the cursor" and this interpretation also became common in Unix systems.

That would be because the DEC terminals were connected to Unix systems that were configured to understand the terminals. Some Unix systems used the 7f (DEL) character as the "interrupt" key, while others used the 03 (ETX) character for that purpose.

Most other systems used "Backspace" for that meaning and used "Delete" to mean "remove the character at the cursor". That latter interpretation is the most common now.

This seems to conflate the DEL character with the Delete key, which when used to mean "delete character at (under or forward of) the cursor" typically sends a sequence like 1b-5b-33-7e (ESC [ 3 ~), rather than the single character 7f (DEL).

Yes there are local applications such as Unix Xterm, or remote connectors such as PuTTY, SSH, & Telnet, but their choice between DEL & BS depends on the target service and/or local preference settings, so they don't sway the argument of which is "most common". Furthermore, some target services follow the EMACS tradition and use character 04 (EOT) for "delete character under cursor".

But most applications now do not receive a stream of characters at all; rather they receive events from the local windowing system (either directly, or from the browser within which they run).

Other changes have also occurred: the Return key has been renamed Enter or just ↲ on most keyboards, and it is treated as End-Of-Record or Next-Line rather than as a return on the same line.

Martin Kealey (talk) 01:56, 10 July 2014 (UTC)
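The distinction drawn above between the DEL character and the Delete key can be sketched in a few lines of Python. This is only an illustration of the byte values quoted in the comment; the names `MEANINGS` and `describe_input` are invented here, and real terminals and emulators are configurable, so the mapping is an assumption rather than a fixed rule.

```python
# Byte values as quoted in the discussion above.
BS = b"\x08"               # Backspace character
DEL = b"\x7f"              # DEL character (the old "rubout")
EOT = b"\x04"              # Ctrl-D, "delete character under cursor" in the Emacs tradition
DELETE_KEY = b"\x1b[3~"    # ESC [ 3 ~, typically sent by the Delete key

# Hypothetical table for illustration only; actual behaviour depends on
# the terminal emulator and on the service at the other end.
MEANINGS = {
    BS: "single character BS (08)",
    DEL: "single character DEL (7f)",
    EOT: "single character EOT (04)",
    DELETE_KEY: "Delete key escape sequence (1b 5b 33 7e)",
}


def describe_input(seq: bytes) -> str:
    """Name the input sequence, or report it as ordinary input."""
    return MEANINGS.get(seq, "ordinary input")


if __name__ == "__main__":
    for seq in (DEL, DELETE_KEY, b"a"):
        print(seq.hex(" "), "->", describe_input(seq))
```

The point of the sketch is that the Delete key and the DEL character arrive as different byte sequences, so software that receives a character stream can tell them apart.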

More accurately:
The DEC terminals - and, before DEC made terminals, the non-DEC terminals such as the Teletype Model 33 - were connected to various DEC operating systems, which interpreted DEL as "delete previous character", and that eventually got adopted by UNIX systems as well. Originally, UNIX systems imitated Multics systems, and used # for "delete previous character" and @ for "delete previous line"; the Multics systems at least had the excuse that they had to support non-ASCII terminals with wired-in local echo, such as the IBM 2741 and IBM 1050, so they couldn't do DEC-style tricks when echoing DEL; UNIX didn't have that problem, but they went with it anyway, and used DEL as the interrupt character.
The BSD folk decided that was bogus and implemented a more DEC-style tty interface, with the erase and kill characters echoing DEC-style (print the deleted characters between slashes/backslashes on printing terminals, erase them with backspace and space on display terminals), and with DEC-style choices of DEL for erase, ^U for line kill, and ^C for interrupt; that ended up becoming the most common tty interface on UN*Xes as well. The characters were settable on UN*X, so sometimes BS rather than DEL was used.
So it was DEC's operating systems, not UNIX, that gave us "DEL as what you type to delete the previous character".
And the notion that you type Return at the end of the line, even if it sends CR, is also a DECism; no DEC OS I remember required that you type both CR and LF at the end of a line. At least with the older OSes, the CR would be echoed as CR LF, and would show up as CR LF as input. (RSX-11 and VMS were a bit weird here, in that they treated FORTRAN line format as the proper text format; I think that typing CR ended the line, but echoed only as CR, and the next line output to the terminal would begin with LF and end with CR as sent to the terminal, because it would normally have SP as the initial FORTRAN control character. But I digress....)
UNIX followed in the Multics "LF by itself, with no CR, at the end of a line" tradition; typing CR would end the line, cause CR LF to be echoed, and cause just an LF to appear in the input stream. Guy Harris (talk) 02:30, 10 July 2014 (UTC)
I've made some edits to clarify that this is a software interpretation of input characters, and to get rid of the video terminal stuff entirely, as well as to ask for citations about the BS-vs-DEL claims. Guy Harris (talk) 03:10, 10 July 2014 (UTC)
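As a rough illustration of the display-terminal behaviour described in this thread, the following Python sketch processes typed bytes with DEL as erase and ^U as line kill, echoes an erase as backspace, space, backspace, and turns a typed CR into an echoed CR LF with a bare LF in the input stream. The function name `cook` and its return shape are assumptions made for this example; it is not the code of any actual tty driver, and it ignores ^C, printing-terminal echo, and every other detail of a real line discipline.

```python
DEL = 0x7F      # erase character (DEC/BSD-style choice)
CTRL_U = 0x15   # line-kill character
CR = 0x0D
LF = 0x0A


def cook(typed: bytes, erase: int = DEL, kill: int = CTRL_U):
    """Return (completed input lines, bytes echoed to a display terminal).

    Hypothetical, simplified model: the erase and kill characters are
    settable, as they were on UN*X, so BS (0x08) can be passed as `erase`.
    """
    line = bytearray()
    echo = bytearray()
    lines = []
    for byte in typed:
        if byte == erase:
            if line:                      # nothing to erase on an empty line
                line.pop()
                echo += b"\b \b"          # back up, overwrite with space, back up
        elif byte == kill:
            echo += b"\b \b" * len(line)  # wipe every character typed so far
            line.clear()
        elif byte == CR:
            echo += b"\r\n"               # CR is echoed as CR LF
            line.append(LF)               # but only LF reaches the program
            lines.append(bytes(line))
            line.clear()
        else:
            line.append(byte)
            echo.append(byte)
    return lines, bytes(echo)


if __name__ == "__main__":
    print(cook(b"helo\x7flo\r"))
```

In the usage line, the user types "helo", erases the "o" with DEL, types "lo", and presses Return; the program sees the single line "hello" terminated by LF.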

Second representation of the printable character list

I've added a previously removed second representation of the ASCII characters that supports easy copy-pasting. I wasn't aware that it had already been on the page. My edit got reverted. However, I think it should still be there.

In Wikipedia, there are lots of examples where information is displayed multiple times, even when we don't count efforts to help disabled people, like "spoken wikipedia" or descriptions below images: take AES as an example. The text perfectly describes the steps but the images display a second representation of those steps (and a third is in the image description, but as previously described we don't count it).

We can also see redundancy in the principle of the lead paragraph: except for the X in "is a X", most of its information gets repeated in the article below. The lead gives the reader a concise definition of the topic that can be retrieved without having to read the whole article. The second representation of the ASCII list fulfills a similar purpose: the reader doesn't have to read every single character to get a full list of ASCII characters.

And, redundancy is still present in the list itself:

Binary    Oct  Dec  Hex  Glyph
010 0000  040  32   20   (space)

The list gives us four representations of the character's number.

I think there are different use cases linked to the two representations. The first (currently the only) representation helps readers with various conversions between a character and its ASCII code, hence the multiple number formats. The second (disputed) representation helps readers who want to act on the whole set of characters: I've used it for password generation, and others might want to use it in a program if they write in a language that doesn't link numbers and characters as easily as C does.

What do you think? — Preceding unsigned comment added by ‎Muelleum (talkcontribs) 21:04, 12 August 2014 (UTC)

Hello there! Hm, so the main purpose would be to make the whole ASCII set of characters easily available for copying and pasting? Maybe some kind of a compromise could be to provide it in form of a note after "There are 95 printable characters in total", using the {{Efn}} template? — Dsimic (talk | contribs) 21:54, 12 August 2014 (UTC)
I'm OK with that. Muelleum (talk) 23:24, 12 August 2014 (UTC)
Looking good, having a scrollable box was the only solution for long lines in reference tooltips. — Dsimic (talk | contribs) 23:32, 12 August 2014 (UTC)
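For readers who want the whole printable set in a program rather than by copy-pasting, a short Python sketch can generate it from the code range. It assumes the usual definition of the 95 printable characters as codes 0x20 (space) through 0x7E (tilde); the constant name `PRINTABLE_ASCII` is made up for this example.

```python
import string

# The 95 printable ASCII characters: space (0x20) through tilde (0x7E).
# 0x7F (DEL) is a control character, so the range stops just before it.
PRINTABLE_ASCII = "".join(chr(code) for code in range(0x20, 0x7F))

if __name__ == "__main__":
    print(len(PRINTABLE_ASCII))
    print(PRINTABLE_ASCII)

    # Cross-check against the standard library's character classes.
    stdlib_set = set(
        " " + string.digits + string.ascii_letters + string.punctuation
    )
    print(set(PRINTABLE_ASCII) == stdlib_set)
```

Note that Python's `string.printable` is not the same set: it also contains tab, newline, and other whitespace control characters.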

A three-part article on ASCII

Sometime in the 1980s, I read an article in three successive issues of a personal (micro) computer magazine by the "inventor of ASCII", whoever that was, that went over all the non-alphabetic codes and was very enlightening. I've never been able to find it again. If anybody knows, please send me a message. Thanks. deisenbe (talk) 16:08, 14 August 2014 (UTC)

You are probably looking for "Inside ASCII". This was originally published as:
  • Bemer, R. W. (May 1978). "Inside ASCII - Part I". Interface Age (Portland, OR: Dilithium Press) 3 (5): 96–102. 
  • Bemer, R. W. (June 1978). "Inside ASCII - Part II". Interface Age (Portland, OR: Dilithium Press) 3 (6): 64–74. 
  • Bemer, R. W. (July 1978). "Inside ASCII - Part III". Interface Age (Portland, OR: Dilithium Press) 3 (7): 80–87. 
Unfortunately it is almost impossible to find copies of Interface Age anymore. I am not sure how available this is either but it was also republished as:
  • Bemer, R. W. (1980). "Inside ASCII". General Purpose Software. Best of Interface Age 2. Portland, OR: Dilithium Press. Chapter 1. ISBN 0-918398-37-1. 
Bob Bemer wrote many other things on ASCII as well. Perhaps you can find one of these:
I hope that was helpful. 50.126.125.240 (talk) 00:56, 3 January 2016 (UTC)
That must be it, though the magazine doesn't ring a bell. Thank you. I met the author, and had a discussion of what he called the "Data-Link Escape" code. This must have been at the big microcomputer expo in Los Angeles in the spring of 1980. deisenbe (talk) 21:06, 3 January 2016 (UTC)

In the Order section: Numbers are sorted naïvely as strings?

The article in the Order section talks about "ASCIIbetical order". The following quote appears in the text:

"Numbers are sorted naïvely as strings; for example, "10" precedes "2""

Is naïvely the right word? Natively maybe?

I am researching one proposed TerSCII table and am curious as to why ASCIIbetical order is what it is. A lot of thinking went into how the ASCII table was built. It would be nice to carry over the lessons learned from ASCII to TerSCII.

2606:6000:6042:9600:25F2:8029:9F5C:52C9 (talk) 19:46, 4 June 2015 (UTC) Wilx

No, "naïvely" is what is intended - if you just, well, *naïvely* assume that all strings should be sorted the same way, you end up with "10" being less than "2", as "1" is < "2". ("Naïvely", as in "showing a lack of experience, wisdom, or judgement", i.e. not realizing that if you sort numbers as strings they won't come out in numerical order.)
That whole section doesn't really explain what it's talking about; what it's really discussing is sorting strings by simply comparing individual characters' code values, without paying any attention to getting numbers sorted by numerical value, words sorted without regard to case, etc. That's less a characteristic of ASCII than of simple (naïve) string comparison operations. You could have EBCDICibetical order as well, for example. Guy Harris (talk) 22:47, 4 June 2015 (UTC)
In particular, you are extremely unlikely to find a character encoding scheme that would magically make naïve string comparison sort strings the way humans would want them sorted, if you're going to compare ternary strings by comparing the numerical values of the characters in the string from beginning to end. Having the encodings for upper-case and lower-case letters be adjacent, so that 'A' < 'a' < 'B' < 'b' < 'C' < 'c' etc., and putting accented letters in the appropriate places might help, but that's not all there is to sorting words, and that won't fix the problem of sorting numbers, either. Localization of sorting may end up being about as painful in ternaryland as in binaryland.... Guy Harris (talk) 22:57, 4 June 2015 (UTC)
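The difference between naïve code-value comparison and a number-aware comparison can be shown with a small Python sketch. Python's built-in `sorted` compares strings by code point, which is the "ASCIIbetical" behaviour for ASCII text; the `natural_key` function below is a hypothetical helper written for this example and is only one of many possible ways to get numbers and case handled more the way a human expects.

```python
import re


def natural_key(text: str) -> list:
    """Sort key that compares digit runs by numeric value and letters without case.

    Hypothetical helper for illustration; it handles only ASCII digits and
    makes no attempt at locale-aware collation.
    """
    key = []
    for chunk in re.findall(r"[0-9]+|[^0-9]+", text):
        if chunk[0] in "0123456789":
            key.append((0, int(chunk), ""))           # numeric chunk
        else:
            key.append((1, 0, chunk.casefold()))      # text chunk, case-insensitive
    return key


if __name__ == "__main__":
    items = ["2", "10", "a", "B"]
    # Naive comparison by character code: "1" (49) < "2" (50) < "B" (66) < "a" (97).
    print(sorted(items))
    # Number-aware, case-insensitive comparison.
    print(sorted(items, key=natural_key))
    print(sorted(["file10", "file2"], key=natural_key))
```

The key function does the work that the encoding cannot: no assignment of code values to characters would make the first call produce the second call's order in general.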

Control-Z as end-of-file

First, for TOPS-10: the use of Control-Z as end-of-file existed, but only from the Teletype. Control-Z on paper tape, mag tape, or disk files was just another character. In other words, this was specific to the terminal device driver. I don't remember whether there was a standard escape mechanism for the various control characters, other than the program using a raw mode.

Second, also for TOPS-10, disk files had a count of words, not a count of characters and not a count of records. So plain text files could have 0 to 4 NULs at the end to finish the last word. The input routine ignored all NULs coming in, which was also needed because sequence-numbered files were word-aligned for every line (see SOS and PIP).

Third, as far as CP/M goes, the original use of Control-Z was as a filler character, since the OS only kept a count of (128-byte) records. So a character file would have 0 to 127 SUB characters at the end to fill out the last record. (Why they used SUB instead of NUL like DEC baffles me.)

Common usage then merged TOPS-10's TTY end-of-file convention with the filler idea, so that Control-Z became an explicit 'end' character.

I have some TOPS-10 and CP/M manuals. I can do some better research if desired.
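The two padding schemes described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not a reconstruction of either operating system's code: the function names are invented, the TOPS-10 model assumes five 7-bit characters per word (which is what "0 to 4 NULs" implies), and the CP/M reader follows the later convention of treating the first Control-Z as the end of the text.

```python
NUL = b"\x00"
SUB = b"\x1a"            # Control-Z
CPM_RECORD_SIZE = 128    # CP/M counted files in 128-byte records
CHARS_PER_WORD = 5       # assumed: five 7-bit characters per TOPS-10 word


def pad(data: bytes, unit: int, filler: bytes) -> bytes:
    """Pad `data` with `filler` up to the next multiple of `unit` bytes."""
    remainder = len(data) % unit
    if remainder == 0:
        return data
    return data + filler * (unit - remainder)


def cpm_write(text: bytes) -> bytes:
    """Fill out the last record with 0 to 127 SUB characters."""
    return pad(text, CPM_RECORD_SIZE, SUB)


def cpm_read(stored: bytes) -> bytes:
    """Treat the first SUB as an explicit end-of-text marker."""
    end = stored.find(SUB)
    return stored if end == -1 else stored[:end]


def tops10_write(text: bytes) -> bytes:
    """Fill out the last word with 0 to 4 NULs."""
    return pad(text, CHARS_PER_WORD, NUL)


def tops10_read(stored: bytes) -> bytes:
    """Ignore every NUL on input, wherever it appears."""
    return stored.replace(NUL, b"")


if __name__ == "__main__":
    stored = cpm_write(b"hello\r\n")
    print(len(stored), cpm_read(stored))
    print(tops10_write(b"hello\r\n"), tops10_read(tops10_write(b"hello\r\n")))
```

One consequence visible in the sketch is that a CP/M text file read this way cannot contain a literal Control-Z, whereas the NUL-stripping reader cannot preserve literal NULs.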