Talk:C0 and C1 control codes
- 1 does anyone here have access to ISO/IEC-6429
- 2 Is "String Terminator" abbreviated "SI"?
- 3 C1 not derived from/used in ISO/IEC 8859-n
- 4 CUA stuff
- 5 RFC 1345
- 6 Backspace
- 7 This article is not about all control characters
- 8 unclear lines
- 9 C1 (ISO 8859 and Unicode)
- 10 Octal
- 11 7F
- 12 Restructuration
- 13 Various standards
- 14 ^X links
- 15 Purpose
- 16 sources discussing smtp rather than ISO 10646
- 17 Missing information
- 18 merge vs deletion
does anyone here have access to ISO/IEC-6429
and if so can they check the codes in the C1 table (particularly the 3 not identified by Unicode) against it? Plugwash 02:34, 23 January 2006 (UTC)
- ECMA-48, the European version of this standard, is available online. --Random832 23:32, 1 July 2007 (UTC)
- Supposedly ECMA-48 is identical (and is available for free). The ISO (and ANSI) documents all cost money. Tedickey (talk) 10:23, 10 March 2008 (UTC)
Is "String Terminator" abbreviated "SI"?
Control code 0x9C is listed as:
0x9C SI ST String Terminator
However, SI is the abbreviation for:
0x0F SI Shift In
Is the SI in String Terminator supposed to be ST?
220.127.116.11 21:34, 4 May 2007 (UTC)
C1 not derived from/used in ISO/IEC 8859-n
The C1 codes were included in the ISO-8859-n series of encodings [...].
I think this is wrong if ISO-8859-n means ISO/IEC 8859. I only have access to draft versions of ISO/IEC 8859, but they explicitly say that the use of the C1 code points "is outside the scope of ISO/IEC 8859; it is specified in other International Standards, for example ISO/IEC 6429" (see here). --Abdull 08:10, 8 June 2007 (UTC)
- There is a subtle but important difference between ISO/IEC 8859-1 and the IANA charset ISO-8859-1. One is an incomplete standard without control codes; the other adds them in to make a usable standard. Plugwash 21:42, 1 July 2007 (UTC)
CUA stuff
A few of the entries describe the use of a control key as a shortcut in many Windows programs and CUA X11 programs. For example: "In many programs, a keyboard input of Ctrl-Y is a "redo" command to undo the last Ctrl-Z undo command."
That's true, but the fact that Microsoft, when porting their Office software from the Mac to their own OS, used control keystrokes as a substitute for the missing command key has nothing to do with the meaning of any control character as a C0 control code.
Even if I'm completely wrong, I can't imagine how the undo/redo meanings of ^Z/^Y could be relevant while the clipboard meanings of ^X/^C/^V, the file command meanings of ^N/^O/^S, the select-all meaning of ^A, the find-related meanings of ^F/^G/^R, etc. are not. --18.104.22.168 07:36, 24 September 2007 (UTC)
RFC 1345
Do we really need to include the RFC 1345 acronyms? Aside from some limited usage in a UNIX utility, I haven't come across any evidence that they saw use elsewhere. Caerwine Caer’s whines 22:32, 16 June 2008 (UTC)
- I'd tend to agree - though deciding whether to remove them would take some investigation Tedickey (talk) 00:43, 17 June 2008 (UTC)
Backspace
The comments about backspace, and its linked topic, do not mention its use for underlining and bold. The comment in the table is rather crowded, but rather than a blanket "deprecated", the point should be made that while composition of characters is not generally supported in terminals, underline/bold generally are. Tedickey (talk) 12:19, 19 June 2008 (UTC)
I think the description of Backspace is incorrect. This character does not have different uses for input and output (just like the CR or ESC characters, for example): it always moves the cursor leftwards, so the phrase "To provide disambiguation between the two potential uses of backspace" makes no sense.
A more precise description could be one in the same style of CR or ESC characters, for example:
Move the cursor one position leftwards. The Backspace key on a keyboard sends this character, which is usually used to delete the character to the left of the cursor; to do that, the three-character sequence BS SPACE BS (0x08 0x20 0x08) is used. In early computer technology, where a character once printed could not be erased, the backspace was sometimes used to generate combinations of two characters, like à, which could be produced using the three-character sequence a BS ` (0x61 0x08 0x60). The same method was used to print underlined or overstruck characters by combining _ or - with any character, and it was the standard method in the APL programming language for creating new operators by combining two existing operators, like / BS -. Aacini (talk) 05:35, 2 November 2008 (UTC)
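To illustrate the two behaviours described above, here is a minimal Python sketch. It is only a toy model, not taken from any standard or terminal implementation: the function name `render` and the `hard_copy` flag are hypothetical, and it assumes that a printing device accumulates ink in a cell while a video display replaces the cell contents.

```python
from __future__ import annotations


def render(data: bytes, hard_copy: bool) -> list[str]:
    """Toy model of one output line; each list item is one character cell.

    hard_copy=True  : printing device, glyphs struck in a cell accumulate.
    hard_copy=False : video display, a new glyph replaces the old one.
    """
    cells: list[str] = []
    pos = 0
    for byte in data:
        if byte == 0x08:              # BS only moves left; it erases nothing
            pos = max(pos - 1, 0)
            continue
        while len(cells) <= pos:      # grow the line as needed
            cells.append("")
        ch = chr(byte)
        if hard_copy:
            if ch != " ":             # a space leaves no ink on paper
                cells[pos] += ch
        else:
            cells[pos] = ch           # screen cell is overwritten
        pos += 1
    return cells


# a BS ` on paper: both glyphs end up in the same cell
print(render(b"a\x08`", hard_copy=True))
# typing "ab" then BS SPACE BS on a screen: the "b" is blanked out
print(render(b"ab\x08 \x08", hard_copy=False))
```

In this model BS itself never deletes anything; the apparent deletion on a screen comes from the SPACE that overwrites the cell.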
This article is not about all control characters
Just a friendly reminder. This article is not about every possible usage of a control character, nor even about usage on every system where 00–1F (hex) are control characters. This is about a specific set of control characters, the C0 and C1 sets as defined by ISO/IEC 2022. Some of those meanings are generalized, so while instances where an application or system further defines their usage are relevant, a use which is totally unrelated to the character as defined in ISO/IEC 2022 belongs in either a separate article or in control character. Caerwine Caer’s whines 02:58, 12 July 2008 (UTC)
unclear lines
The section C1 (ISO 8859 and Unicode) will become clearer if "if being used in an environment where 8-bit characters are not supported or where these octets are being used instead to add additional graphics characters" is removed. Also, I have moved a '+' outside the parentheses in a table column label. —Preceding unsigned comment added by 22.214.171.124 (talk) 08:46, 12 January 2010 (UTC)
- The sentence could be broken up, but removing it would lose the hint for why 7-bit controls are useful. (Sending 2 bytes instead of 1 is not necessarily a good thing). Tedickey (talk) 09:33, 12 January 2010 (UTC)
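For readers following this point, a short Python sketch of the 7-bit alternative: in ECMA-48 each C1 control (0x80 to 0x9F) can also be represented as ESC followed by a byte in the range 0x40 to 0x5F. The function names below are hypothetical and only illustrate the mapping.

```python
ESC = 0x1B


def c1_to_7bit(code: int) -> bytes:
    """Return the two-byte ESC Fe form of a C1 control code."""
    if not 0x80 <= code <= 0x9F:
        raise ValueError("not a C1 control code")
    return bytes([ESC, code - 0x40])


def c1_from_7bit(seq: bytes) -> int:
    """Return the single 8-bit C1 code for a two-byte ESC Fe sequence."""
    if len(seq) != 2 or seq[0] != ESC or not 0x40 <= seq[1] <= 0x5F:
        raise ValueError("not an ESC Fe sequence")
    return seq[1] + 0x40


print(c1_to_7bit(0x9B))        # CSI as ESC [
print(hex(c1_from_7bit(b"\x1b\\")))  # ESC \ is ST
```

The 7-bit form costs two bytes per control, which is the trade-off mentioned above, but it works where the octets 0x80 to 0x9F are unavailable or carry graphic characters.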
C1 (ISO 8859 and Unicode)
I renamed the heading "C1 (ISO 8859 and Unicode)" as "C1 set" since C1 is not defined in either ISO 8859 or Unicode. C0 and C1 can be used in ISO 8859 or Unicode text, but they don't define C0 or C1. — Preceding unsigned comment added by 126.96.36.199 (talk) 10:06, 27 September 2011 (UTC)
- So what are «C0 Controls and Basic Latin» and «C1 Controls and Latin-1 Supplement» in the Unicode standard?
- ECMA-35 and ECMA-48 define the use of C0/C1 for ISO-8859-1. Without a document such as that for Unicode (or UTF-8), all the documents that you have mentioned do is to show pictures of the codes that are mapped from ISO-8859-1; the C0/C1 behavior has not been specified. A reliable source on the matter would not leave leeway for guessing what might be meant TEDickey (talk) 08:16, 19 July 2012 (UTC)
- I just want to say that the Unicode standard
- recognizes those values as control characters,
- gives their range and aliases,
- as characters, implicitly assigns them a byte sequence depending on the UTF in use.
- Maybe you just want to say that Unicode does not specify the exact behavior of each control character.
- Additionally, a link can be established to Unicode control characters.
- The Unicode Standard, Version 6.1, page 23, describes the basic type "control" as «Usage defined by protocols or standards outside the Unicode Standard» and classifies these characters as category Cc, with status "abstract character".
- And they add «Control Codes. Sixty-five code points (U+0000..U+001F and U+007F..U+009F) are defined specifically as control codes, for compatibility with the C0 and C1 control codes of the ISO/IEC 2022 framework. A few of these control codes are given specific interpretations by the Unicode Standard. (See Section 16.1, Control Codes.)»
- §16.1 is on page 544 for C0.
- On page 545, additional semantics are clarified for at least eleven of them under «Specification of Control Code Semantics». — Preceding unsigned comment added by 188.8.131.52 (talk) 11:18, 19 July 2012 (UTC)
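The count and ranges quoted above can be checked against the Unicode Character Database that ships with Python. This is only a sketch: `control_code_points` is a hypothetical helper name, and the result reflects whichever Unicode version the local `unicodedata` module bundles, not specifically version 6.1.

```python
import unicodedata


def control_code_points() -> list[int]:
    """Return every code point whose general category is Cc (control)."""
    return [
        cp
        for cp in range(0x110000)
        if unicodedata.category(chr(cp)) == "Cc"
    ]


cc = control_code_points()
print(len(cc))                      # number of Cc code points
print(hex(cc[0]), hex(cc[-1]))      # first and last of them
```

Note that this confirms only the classification; it says nothing about the behavior of each control code, which is the point under dispute.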
- But that's the point: the paragraph as written states that Unicode "provides" these codes, but it is in a context (and no clarification is made there) to point out that Unicode provides no definition of their behavior. The C1 codes without being translated would be illegal in UTF-8 encoding (because the values in 128-159 are continuation bytes). Without clarification, the paragraph is misleading. The word "provides" is inappropriate in this context - "assigns" would be more idiomatic, and corresponds to the sources you indicate TEDickey (talk) 22:32, 19 July 2012 (UTC)
- C1 is not illegal in UTF-8. U+0085 (NEL / Next Line) is encoded as C2 85 in UTF-8. I found this document which suggests that:
“NEL is the only C1 character recognized by Unicode”
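Both statements in this exchange can be checked with a few lines of Python. This is a sketch using only the standard codecs; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Report whether the byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


nel = "\u0085"                       # NEL, a C1 control code point
print(nel.encode("utf-8"))           # two bytes: C2 85
print(nel.encode("latin-1"))         # one byte: 85 (ISO-8859-1 style)
print(is_valid_utf8(b"\xc2\x85"))    # the encoded code point is valid
print(is_valid_utf8(b"\x85"))        # a lone 0x85 is a stray continuation byte
```

So the code point U+0085 is representable in UTF-8, while the raw 8-bit C1 byte on its own is not well-formed UTF-8.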
- Screen isn't a terminal emulator; nor is mosh - they're applications which use terminals and rely upon those to provide a lot of the functionality associated with a terminal emulator. TEDickey (talk) 21:31, 11 October 2012 (UTC)
- Yes, Mosh does do terminal emulation. See here: "... the opportunity to build a clean UTF-8 terminal emulator from scratch ...". Mosh significantly reinterprets control characters and escape sequences, before sending them to the final terminal emulator. -Hirsutism (talk) 22:36, 11 October 2012 (UTC)
- I'm aware of the opinion of its developer(s), but since it relies on the terminal (and ncurses) for the functionality, it's like screen - a translator which isn't a complete terminal emulator. You're not likely to find an authoritative source which agrees with that opinion. TEDickey (talk) 22:56, 11 October 2012 (UTC)
- We're getting stuck in a side-tangent here. The precise definition of "terminal emulator" isn't important for this Wikipedia page. What matters here is: Putty + Mosh recognize NEL (encoded as C2 85) as a newline character. Even this empirical evidence is a side-tangent... the main discussion is about whether the Unicode spec fully recognizes NEL (or other C1 characters). --Hirsutism (talk) 15:28, 12 October 2012 (UTC)
- Sure. But your suggested source isn't what one might term authoritative, due to several simple errors. For example, on the paragraph following the one you're interested in, he states
Since VT100 (that uses C1 extensively)...
which is incorrect. Scanning quickly, I see other errors. If you're simply stating that you can find someone agreeing with your point, that's easily done of course (google is your friend). TEDickey (talk) 23:03, 12 October 2012 (UTC)
Octal
- Octal is wonderful, but hasn't its time passed? An extra column would be quite confusing, so why add it? There are probably lots of people who really have no interest in octal, so I think a good reason for adding it would be needed. Johnuniq (talk) 09:10, 29 October 2011 (UTC)
- I object too. Of course, octal is derived from hex (or decimal), so it would just be a dependent (derivable) addition. Of course one can add: so is decimal - all right. Only, decimal is used directly nowadays (e.g. when entering by keyboard). Someone else could argue: hey, let's add UTF-8, UTF-16, and such. So I do object. -DePiep (talk) 22:14, 30 October 2011 (UTC)
The 'C' column includes many missing entries. In the C language it is common to use octal escape sequences to express and enter these missing entries. Why not fill out the missing entries in the C column in octal, such as '\003'? That solves the OP's problem, completes the column, and provides a reference for programmers wishing to use the control codes under discussion. — Preceding unsigned comment added by 184.108.40.206 (talk) 00:20, 5 February 2015 (UTC)
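As a rough illustration of what such a column would contain, the following Python sketch generates three-digit octal escapes. The helper name `c_octal_escape` is hypothetical; Python string literals happen to accept the same `\ooo` form as C, which is what the last line relies on.

```python
def c_octal_escape(code: int) -> str:
    """Return a three-digit octal escape such as \\003 for a byte value."""
    if not 0 <= code <= 0xFF:
        raise ValueError("octal escapes of this form cover one byte only")
    return "\\{:03o}".format(code)


for code in (0x03, 0x1B, 0x7F):
    print(hex(code), c_octal_escape(code))

# The escape denotes the same character as the numeric code.
print("\003" == chr(3))
```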
7F
- In Unix, it's sometimes referred to as "Ctrl-?" or "^?"... AnonMoos (talk) 05:25, 15 June 2012 (UTC)
Neither - ECMA-35 / ISO-2022 make SPACE and DELETE special cases (not control characters, and not a member of C0/C1). The positions used for those in the 128-255 range are printable characters, by the way. TEDickey (talk) 23:55, 1 August 2012 (UTC)
Restructuration
I suggest restructuring this article as follows:
- (why control codes)
- (main dates)
- Main standards interoperability issues
- utf-8, windows-1252, etc.
- Main protocols and applications
- terminal, file text, unix, videotext, etc
- Code assignations
- C0 set
- C1 set
- Example of sequence using control code — Preceding unsigned comment added by 220.127.116.11 (talk) 17:25, 19 July 2012 (UTC)
^X links
These links are all circular, or point to articles about usage of shortcut combinations on Windows, which has nothing to do with control codes. I recommend reverting their addition. Spitzak (talk) 05:20, 21 September 2013 (UTC)
- I partially agree with your observation, but not with your conclusion.
- I deliberately put the links in because semantically there is a difference between a control character given in notation ^X (specifies a key combination with Ctrl, not a specific function - associated functions are operating system and application specific), a control character given in notation \x (specific formatting to some programming languages), named control characters distinguished by function (Linefeed, Tabulator, Bell, Null) or named control characters distinguished by code (NUL, ETX, etc.) in specific standards like ASCII etc.
- While not being circular, at present some of the links have the same target (which often does not reflect above semantics correctly), but this is a problem of sub-optimal target linking in redirects rather than a problem of adding local links to the terms as is. We will have to retarget some redirects and restructure some articles to create semantically more correct link targets, but this won't happen overnight. However, we will create awareness for this "unevenness" only by starting to incorporate the links - over time, this will create a momentum which will help to shift the targets to be more semantically correct. If we don't add the links, neither the semantic differences nor the structure will become apparent to most users, so changes in this area would happen only randomly and without a clear direction rather than systematically following some overall structure.
- --Matthiaspaul (talk) 11:12, 21 September 2013 (UTC)
- The ^X notation actually indicates the character whose value is that of ASCII 'X' XORed with 0x40. Although the two often coincide, it is not a symbol for the key sequence. For instance, ^@ means a character that is more likely produced by typing Ctrl+Space. In any case, I think links leading to discussion of Windows shortcuts are wrong: these shortcuts are processed directly from keyboard input, and at no point is a C0/C1 control code ever used. Spitzak (talk) 01:52, 29 May 2014 (UTC)
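The XOR relationship described above can be sketched in Python as follows. The function names are hypothetical, and the accepted range (? through _, i.e. 0x3F to 0x5F) is an assumption chosen so that the notation covers the C0 codes plus DEL.

```python
def caret_to_code(symbol: str) -> int:
    """Map the character after '^' to a control code by XOR with 0x40."""
    if len(symbol) != 1 or not 0x3F <= ord(symbol) <= 0x5F:
        raise ValueError("expected one character in the range ? to _")
    return ord(symbol) ^ 0x40


def code_to_caret(code: int) -> str:
    """Map a C0 control code (or DEL, 0x7F) to caret notation."""
    if not (0x00 <= code <= 0x1F or code == 0x7F):
        raise ValueError("not a C0 control code or DEL")
    return "^" + chr(code ^ 0x40)


print(hex(caret_to_code("X")))   # ^X
print(hex(caret_to_code("@")))   # ^@ is NUL
print(code_to_caret(0x7F))       # DEL is written ^?
```

The mapping is purely arithmetic on character values, which is why it is independent of whatever key combination a given keyboard or operating system uses to produce the character.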
Purpose
What this article doesn't really make clear is why C0 and C1 are in Unicode. The use of U+2400 ... U+243F is immediately obvious, and I guess it makes some sense to reserve NUL, TAB, CR and LF.
But what are you supposed to do when you encounter SI? Obviously you aren't meant to switch to a different character set, because if people wanted to encode a character not in Unicode they'd use a PUA character. Maybe it's part of a quoted string of bytes to send to some machine for which SI does make sense? No, because then you'd use the visual representation ␏.
If you find BEL, are you supposed to sound a bell? Of course not. A Unicode text is just that, text, not a string of instructions to do something. Even when displayed, it tends to be scrollable and no bell moment exists. And you wouldn't want to allow text to ring bells anyway. Again, for quoted bytes there's the visual representation.
What about SOH? Again, meaningless in text unless quoted. Most of these control codes are useless as part of text. Insofar as they make sense at all, it's as formatting, which isn't within the Unicode scope, but within things like HTML and CSS, or whatever format your word processor uses. The only reason it makes sense to reserve NUL, TAB, CR and LF is the sheer ubiquity of simple file formats (we call them text files, but they do contain formatting in addition to text) and memory representations of strings that need these.
- They're in Unicode to preserve compatibility with ASCII etc. character sets. AnonMoos (talk) 03:36, 7 February 2015 (UTC)
- C1 comes from ISO 6429 (aka ECMA-48) and ISO 2022 (aka ECMA-35). It is not so much for compatibility (since the Unicode standard merely lists the names without attempting to describe functionality) as because ISO 10646 grew out of the standardization work for the older encodings. Because Unicode does not describe functionality, it does not standardize C0/C1; it merely makes a few assumptions, relying upon those other documents as the relevant standards. TEDickey (talk) 12:05, 7 February 2015 (UTC)
sources discussing smtp rather than ISO 10646
The given sources are discussing SMTP rather than ISO 10646 as such:
The following is a draft for an RFC updating SMTP to
allow and encourage use of ISO 10646 (now DIS, of course).
- If you read this paragraph:
- In Internet messages, the dynamic compaction method (compaction method 5) is used, the initial state being G=32, P=32, R=32, with each octet specifying a value of C. (Translated into normal English, that sentence means: "The text is in 8-bit Latin-1 until we get to the first HOP, if any!") Transitions to other character sets, represented by rows and, in some cases, planes, is done with a sequence that begins with the HOP ("High Octet Preset") code (decimal 129). The SGCI ("Single Graphic Character Introducer") is not used (i.e. we use "level 1" of method 5).
- It's pretty clear to me it is discussing how the ISO 10646 draft is applied to SMTP. It's not introducing HOP or SGCI itself, it is pulling them from the draft. It would be great if someone could find old ISO 10646 drafts and we could quote them instead, but even in the absence of copies of those old drafts, I don't think there is any other plausible interpretation of this paragraph. SJK (talk) 12:23, 9 April 2015 (UTC)
Without the said draft, you cannot distinguish the interpretation which you wish to make from an equally plausible one that refers to some ISO-2022 feature which is commented upon as not being in ISO 10646. As such, your commentary in the topic amounts to original research. As I said, you need a supplementary source to provide the information rather than interpreting TEDickey (talk) 00:43, 10 April 2015 (UTC)
Please see Ken Whistler, Formal Name Aliases for Control Characters, L2/11-281, Unicode Consortium, July 20, 2011, which explains the situation much better than my previous reference did:
Notes Regarding Omissions
I have deliberately omitted three control code names and their abbreviations which occur in one (obsolete) RFC, but which are an artifact of early unapproved drafts of 10646. To wit:
0080 PADDING CHARACTER (PAD)
0081 HIGH OCTET PRESET (HOP)
0099 SINGLE GRAPHIC CHARACTER INTRODUCER (SGC)
Those 3 were proposed (on spec) in early drafts of 10646, for what became a failed architectural direction for 10646. They would be completely forgotten now except for the persistent (and pernicious) RFC that lists them without indicating their failed status. Nobody has ever implemented them, so they are nothing more than character encoding curiosities.
Missing information
These control codes had names in Unicode 1.0 but these names were later removed. The article should explain when and why.
10646-1 forbids the use of C1 controls, requiring an ESC Fe sequence instead. The article should detail when and why this came about and whether or not it is still in force in Unicode. — Preceding unsigned comment added by 18.104.22.168 (talk) 03:22, 6 September 2015 (UTC)
- That (ESC Fe) was made obsolete a long time ago, and removed. See this for example. TEDickey (talk) 12:55, 6 September 2015 (UTC)
merge vs deletion
While it's interesting that Unicode has a subset of C0/C1 codes, deleting most of the content of this topic to replace it by a redirect to a summary paragraph should have some discussion involving the editors who've been maintaining the page. TEDickey (talk) 08:28, 4 August 2016 (UTC)