|WikiProject Computing||(Rated C-class, Mid-importance)|
incomplete! working on it.
- 1 Machine Description
- 2 How systems programming was affected by 6600 Architecture
- 3 How PPU were actually used
- 4 6600 and Cyber
- 5 'Population Count' ("Count Ones") Instruction
- 6 Comment on Pop Count
- 7 Comment on Operating Systems
- 8 Google News Archive Search on 6600
- 9 Branch timings
- 10 Scoreboarding
- 11 Register naming
- 12 Delivery of first 6600
- 13 Overly long intro?
- 14 PP and memory timing
- 15 PP or PPU
In the article, the two paragraphs beginning "The machine as a whole" are very confused and probably inaccurate. Thronton's book should be consulted for an accurate description of the all-important barrel architecture. The sentence beginning "Programs were written, with some difficulty" is just false.
How systems programming was affected by 6600 Architecture
Programmers tended to treat PPU programming as if each PPU was a completely independent computer, and as if the other PPUs did not exist. That was a direct result of the barrel architecture.
As for interaction with the CPU, this did not exist; central memory was read or written from the PPU as if it were some kind of simple device without an explicit channel number. The execution of the CPU was not affected by PPU programs, with the solitary exception of the PPU monitor.
Altho the article correctly observes that I/O was the province of the PPUs, they were remarkably ill-suited to the task. For example, they possessed no interrupt architecture. Therefore, PPU programmers had to code delay loops which repeatedly checked device statuses. The resultant code was very painful to write due to the numerous loops and exit conditions. To write I/O interfaces casually ignoring device status was to invite PPU hangs which would gradually degrade the entire operating system.
- I find this description to be quite inaccurate. Interaction between PP and CP of course did exist -- via shared memory data structures. Indeed, the circular buffer approach used is very efficient and handles concurrency correctly without any need for interlocks. As for "remarkably ill-suited to the task" -- not at all. WIth enough PPUs that each I/O device had its own, lack of interrupts is not a real issue. The only thing you might argue is that this approach was unfamiliar to programmers. Perhaps less so early on, because interrupts were not universally used in the 1960s (consider the 1620 -- synchronous I/O on a single processor machine). The proof is in the pudding. PLATO was a 1000 terminal, 600 user timesharing system, highly interactive, very high I/O rates, running on just two Cyber 73 class machines (combined power about the same as a single 6600). I submit that such a project would have failed miserably if it were even close to true that PPs are ill-suited to doing I/O. In fact, they did very well, provided you have good programmers working on the job. Finally, re "each PPU an independent computer" -- naturally, since that is in fact what they are, except for the shared access to the I/O channels. The OS would manage that access. Also, there were a few cases where multiple PP programs would cooperate on I/O -- high speed disk I/O for example (at PLATO) or long-block tape I/O (standard OS feature). Paul Koning (talk) 16:01, 9 November 2009 (UTC)
- the contributor above confuses (as many do) the meaning of "CP" -- Central Processor. The average PP program did not interact with the central processor, because that was the province of the monitor. Instead, the PPs interacted with "CM" -- Central Memory, and it is this to which the contributor refers. —Preceding unsigned comment added by Dmausner (talk • contribs) 21:29, 27 March 2010 (UTC)
How PPU were actually used
I removed the sentence beginning with "For instance, a program might use..." because the suggested interaction of CPU and PPU programming never occurred in actual practice.
The PPU were reserved for the exclusive use by the operating system and primarily for I/O operations, inasmuch as the PPU could access the data channels and the CPU could not. Because of this, PPU were not employed to assist in CPU computations or logic, even though the "virtual" concurrency of execution in CPU and PPU might offer the possibility.
PPU programming language was easily acquired because the same assembler could interpret both CPU and PPU mnemonics; however, PPU programs could access the entire machine's memory and the data channels. Hence amateurs were forbidden to load code into PPU by a number of preventative measures, not the least of which was the absence of an operating system command or process for seducing the system to do that from a user's job stream.
The easest way to load a custom PPU program was to go to the bootstrap panel and tell it to load from tape to PP0 on reset. Push the reset button and your code loaded. Other than that it was almost imposible to get a PP program loaded. The monitor was in charge of telling each PPU what to load next.
At Michigan State University (MSU) SCOPE was "enhanced" into "Scope/Hustler" as a pun on the way that user jobs were "hustled" in and out of control points (jobs ready to run). In order to improve performance parts of the PPU Monitor function were moved to the CPU. The monitor then forced a Context jump on the tick at which point the CPU would run the control points to find the highest priority job that was ready for the CPU and the context jump into that user job.
The other major improvement (Not sure if this was in SCOPE as I was writing for SCOPE/HUSTLER) was that the RA+1 calls (system calls) were improved by having the user programs execute an intentional illigal operation. When this happened the user job did a context switch to the OS which processed the RA+1 call and then dropped into the scheduling process.
This meant that the CPU was only interupted when the user job wanted a function from the OS OR when the PPU Monitor wanted to make sure that the CPU got some cycles. — Preceding unsigned comment added by 184.108.40.206 (talk) 19:39, 1 August 2012 (UTC)
Some universities possessed PPU emulators which either ran in the CPU or in a PPU (in place of the actual PPU program). These programs could in principle make PPU software development easier and safer since they protected the operating system from channel hangs and memory over-writes.
6600 and Cyber
I used a CDC 6400 and a Cyber 70/74. From what I knew at the time, the Cyber 70 was basically the same as the 6600. If I'm correct, the 6600 was one of the last second generation computers and the Cyber 70 was 3rd generation. Are these things correct? -
Also, in the second paragraph under Central Processor, "CU" is used one time. Is that supposed to be CP? I assume it is, since someone changed it in other places, so I'm going to take the liberty to change the remaining CU -> CP. Someone correct this if I'm wrong.
-Bubba73 02:56, 6 Jun 2005 (UTC)
Yes, it should be CP, which is how it was referred to in the 6600 documentation. Geoff97 07:45, 6 Jun 2005 (UTC)
'Population Count' ("Count Ones") Instruction
[CONTROL DATA 6400/6500/6600 COMPUTER SYSTEMS Reference Manual]:
47 CXi Xk Count the number of "1's" in Xk to Xi (15 Bits)
"This instruction counts the number of "1 's" in operand register Xk and stores the count in the lower order 6 bits of operand register Xi. Bits 6 through 59 are cleared to zero."
Grishman (1974) gave a silly reason for this opcode, as if storage were so tight that packing individual BITS might be good practice. I have never been able to come up with any commercial application that would find this useful. Where it would be useful is in ciphers and cryptology at the NSA level.
As the article notes, "Most of these [50 CDC-6600 ever built] went to various nuclear bomb-related labs." It seems to me that this opcode wouldn't have found its way into the germanium, I mean silicon, without some Federal involvement in original design. Comments? Tribune 16:30, 12 July 2005 (UTC) What is source for "50 CDC 6600s.. built"? I believe the number is closer to 200 (and almost as many 6400s). Also, although some certainly went to bomb-makers, far more ultimately went to universities and engineering shops. Capek 01:35, 18 August 2006 (UTC)
[COMPASS]: CXi Xk
- Xi <- 60 bit value equal to number of ON bits in Xk
Grishman notes [p. 115] that the instruction is "rarely used." Then he says, "This instruction 'is of use' when binary data, such as yes-no responses to questionnaires, are stored one datum per bit rather than one datum per word (as they would be in a FORTRAN type LOGICAL array) so that sixty times as much information can be stored in a given block of memory. A count ones instruction may then be used to determine the total number of yes responses (1 bits) to a word." I find this unconvincing: one using any CDC architecture to pack/unpack 6-bit characters -- let alone individual bits -- has code so long and slow that an efficient way of counting 'on' bits will be the least of one's problems. Tribune 06:18, 16 July 2005 (UTC)
- The PLATO system used the population count instruction for fuzzy matching. Answers would be run through a transformation that used various bits to code various aspects of the words, and the expected answers were coded the same way. The coding was done in such a way that a near match would have most of the bits in the encoded form equal. So fuzzy matching requires simply popcount(a XOR b) < n -- a total of 4 instructions. Paul Koning (talk) 15:55, 9 November 2009 (UTC)
Comment on Pop Count
In the early 1970's I was told that pop count was added to the instruction set at the specific request of one particular US government customer who had been given a demonstration of the earliest 6600. Once the custom circuit modules were designed, it made no sense to omit them from successor machines.
The circuits for pop count were trivial.... literally a bunch of adders in a tree. But there was a indeed a desire on the part of NSA to have the instruction for doing set intersection and counting the size of the result. The instruction has other uses as well, including looking for a physical disk position that has at least n blocks available. The instruction was NOT, however, particularly slow, despite being implemented in the divide unit: it was 8 minor cycles, slightly more than a conditional branch (nominally 6, if in stack) and less than a multiply (10). More philosophically, there's also the observation that it's a function which is easy and cheap to do in hardware, and very slow to do in software. Since it's useful, put it in. Capek 01:35, 18 August 2006 (UTC)
The pop count was implemented in the 6600 floating divide unit, which was notoriously the slowest of the functional units. Pop count was the slowest single instruction. This was unrelated to the iterative divide strategy, however. Thornton's book shows that the count was produced by summing the bit count through several static logic adders. I think he wrote that they summed 4- (or 5-) bit chunks of the 60-bit word, in parallel.
The MACE operating system at one time contained an idle loop, used when the CPU was not needed, comprising four pop-count instructions in a row followed by a branch back to them. I have no idea why the author saw any benefit to this; the SCOPE operating system used to hang the CPU on a program stop when it had nothing to do. M
An additional use for pop count was in setting a register to zero in parallel with other functional units to gain overall speed. I believe that this was used in some nuclear physics programs. Geoff97 19:02, 19 August 2006 (UTC)
A use for the Count! In '67 I used the Count instruction on the CDC6600 at NYU's Courant Institute (Atomic Energy Commission was the sponsor of this installation). It was a frivolous use, but incredibly efficient: I used it to run a computer-dating service for my 11th grade prom.
There were several hundred boys and girls participating, and I wanted to do a round-robin match, comparing every possible heterosexual pair. Participants filled out a form with 60 multiple-choice questions, each with four possible answers. Thus the answer to question n could be coded in only two bits, which I spanned across the n-th position in two words--two words which therefore contained the answers to all 60 questions. The Boolean instructions in the 6600 operated on entire words, but calculated each bit independently (costing no time, because the CPU ran the 60 bits in parallel). Thus I could count the matching answers between a boy and a girl by using the function: Count((B1 XOR G1)NOR(B2 XOR G2)). I recall that several (socially) successful matches resulted. In the forty years since, I don't think I've written a program that comes close to that one in efficiency. Of course, I haven't written in machine language since then either.
Thanks for the opportunity to brag about this; at the time, obviously, I couldn't talk about what I was doing. My access to the machine was to work on a new OS called SOAPSUDS (Schwartz's Own Athene Processor Serial Uniprocessor Debugging Simulator), which allowed the CDC6600 to emulate a similar machine with 20 parallel CPUs. Our expectation at the time was that parallel processing would be the basis of the next leap in hardware performance, and we wanted to get a head start on programming for such an architecture. Instead, of course, Moore's law asserted itself, and hardware went in another direction entirely.
--Brian67 —Preceding unsigned comment added by 220.127.116.11 (talk) 22:29, 8 July 2008 (UTC)
- While population count is in the divide unit, it isn't even close to accurate that it is "the slowest instruction". Actually, it takes only 4 cycles, if I remember right, which ties it with floating multiply and makes it only one cycle slower than the fastest instructions. The other instructions in the divide unit (the actual divide instructions) are indeed slowest by far, but that doesn't affect CX. Paul Koning (talk) 01:58, 11 December 2013 (UTC)
Comment on Operating Systems
It could not be said fairly that NOS showed performance many times better than SCOPE. For one thing, both systems were so spare that even on a bad day 99% of the available CPU time was consumed by user programs. The two systems had radically different disk file systems, each with disadvantages. In practice they behaved in a pretty similar manner, and to prefer one over the other was a matter of style over substance, if a customer were even able to appreciate such a difference!
(The following discussion will be moved into an article on the CDC operating systems after a resonable period of time.)
The file systems differed in the representation of disk extent allocation in memory. The original COS/SCOPE 1 method, preserved in MACE, KRONOS, and NOS, divided a disk surface into conceptual tracks. (On the very first disk and drum devices, these corresponded to physical tracks.) The track reservations were stored as 12-bit bytes in central memory words (the TRT) for ease of use by the 12-bit PPU. A nonzero reservation contained a pointer to the next allocated track in a file.
This design produced a limit of 4095 tracks per device. If the physical device had more physical tracks, you had a choice of allocating more sectors per virtual track, or splitting the physical device into more logical devices with their own reservation tables. if you didn't mind creating huge sector allocations, you could also combine multiple physical devices into one logical device.
This very simple (with limitations) design was also predicated on the idea that each PPU desiring to perform disk I/O contained all of the channel and track changing logic in the PPU-resident code (the address region between 100 and 1077 octal). This made disk error detection and correction extremely difficult as drives became more complex in the 1980s. It also meant that a busy CPU could generate many competing requests from many PPUs for the same disk channels. In such cases PPUs had to wait for their turn, as mediated by the system monitor program.
The later SCOPE file system method, in SCOPE 2 & 3 and NOS/BE, also divided the disk surface into allocation zones with some number of sectors per reservation block. The reservations were stored as a bit vector in central memory words (the RBR), the offset of the bit indicating which block was reserved. In order to represent the sequence of blocks in a file, a separate table was created in a variable-sized region in upper central memory (the RBT). This table contained 12-bit bytes in a pair of CPU words. Each byte contained the bit offset number of the allocated block. One byte in each pair of words contained a pointer to the next pair of words, forming a linked list.
Calculating the bit offsets required some arithmetic best performed in the CPU. As such, SCOPE designers centralized disk access to one PPU program which remained active when any disk requests were present. This program exclusively communicated with the CPU disk allocation code via system requests. This placed all the error detection and diagnostic code in the full memory of one PPU. It also could evaluate all disk requests to optimize head movement.
A SCOPE PPU program wishing to perform disk I/O entered system requests instead of performing its own channel operations. The complex communication process included routing the request from the monitor to the unified disk processor, which performed the channel operations, then sent the data to the PPU over the same channel, or to central memory. Clearly the SCOPE design had many moving parts.
In comparing the COS and SCOPE file systems, these points may be made. The SCOPE design required significantly less storage, at a time when storage was the greatest cost of any computer. The SCOPE RBR was about 1/12 the size of the COS TRT. The SCOPE design cleverly realized that not every RBR sequence is in use; the RBT could be created in memory when the file was opened. Moreover, the RBT design permitted a file to flow from one device to another. This was the basis of the SCOPE permanent files design. Finally, an RBR could represent significantly more than 4095 tracks.
The SCOPE storage advantage was offset by the system complexity required to maintain the variable field length of the RBT. It should have been represented as a control point like any other process, but this would have removed some of the storage efficiency. Furthermore, the unified disk processor PPU program was never very effective at head motion optimization, as the system loads evolved toward time sharing.
The COS file system suffered from too much simplicity. In addition to the storage disadvantage above, its design did not permit ad-hoc device overflow to other physical or logical devices. The TRT design induced extremely wasteful sector allocations on large-capacity drives. As a defence, KRONOS introduced the twin concepts of direct and indirect permanent files, in which the user was expected to know whether his file was large (therefore, direct—stored in place) or small (therefore, indirect—copied to a small-allocation zone at some performance cost).
In the end, comparing system performance with an apples-to-apples batch job stream, without add-on subsystems running, KRONOS and SCOPE naked operating system performance was roughly equal.
However, "without add-on subsystems" is a major qualification, since those subsystems were always running in practice. That's grist for another mill.
Google News Archive Search on 6600
Someone with the right enthusiam should include info from these few news articles from 1960s, found with the new Google News Archive Search. I'd just add these to the external links but I don't think they'd get the attention they deserve. The Google News Archive is an amazing insight into the past. :) The first article, from 1961, describes a "huge, advanced-design 6600 computer" to be under development and one, from 1968, how "Four years ago, Minneapolis-based Control Data Corp. brought out its model 6600 computer, the largest machine of its type in the world." I really hope articles about old events, items and people will start getting external links to actual news stories now that we have this new Google service! (Oh, and I'm not paid for endorsing Google like this, I personally just think it's a Good Thing(tm) ;) --ZeroOne 23:59, 14 September 2006 (UTC)
In the section on the CP, at the point where it says the stack is flushed by an unconditional branch, it then says unconditional branches are faster than conditional branches. As I recall, this was the reverse. Many times used =X0 X0 for "unconditional" branches just to avoid the stack flush. Should this not read "it was sometimes faster (and would never be slower) to use a conditional jump," rather than "it was sometimes faster (and would never be slower) than a conditional jump"?
Poochner 18:21, 22 March 2007 (UTC)
The unconditional branch instruction, JP, was always 'out of stack'. In other words, the instruction stack was always invalidated. The conditional branch instructions only caused stack invalidation when their destination was not in the stack. (One can think of the 6600 instruction stack as an instruction cache with a single 'line'. So any out-of-stack branch would invalidate the cache.) So yes, an conditional branch was generally preferred - even for unconditional branches. The EQ instruction was almost always used - comparing B0 to B0.
Note that on the CDC 7600, words in the stack did not need to represent contiguous memory locations. IIRC, the JP instruction did not invalidate the stack on the 7600, but Return Jump still did. Last resort, a monitor eXchange Jump would certainly do it...
--Wws 04:29, 25 March 2007 (UTC)
- Coding standard was that unconditional branches were always "EQ label" (which is short for "EQ B0, B0, label" -- i.e., branch if B0 is equal to B0, which is of course always true). The "JP" instruction was used only if EQ was the wrong thing to use, usually because the branch target was a computed value or a table entry (in which case the JP instruction with a register for destination address would be used) or, rarely, if the action of invalidating the instruction stack was actually required (self-modifying code, such as overlay loading). Paul Koning (talk) 16:06, 9 November 2009 (UTC)
Since the CDC 6600 was the first computer with a scoreboard for dynamic rescheduling, this should definitely be included in the article. It's mentioned in the CDC 6000 series article, but not here. -- PyroPi (talk) 02:54, 11 May 2010 (UTC)
The B-registers are named in the article "scratchpad" registers. As I can remember, at the CDC 6400 they were named "index" registers. Is there a different naming on the CDC 6600? Thanks for the clarification. CeeGee 07:02, 9 July 2013 (UTC)
- "Index" register is correct. For example, they are described as such in the Compass 3 manual (document no. 60492600), section 2.5, page 2-8. Paul Koning (talk) 02:03, 11 December 2013 (UTC)
- Actually, there is conflict in the manuals. The Compass manual calls them index registers, the 6600 reference manual calls them increment registers. The latter term seems better, after all they go with the "increment" unit, and they are used for things other than indexing. Paul Koning (talk) 16:57, 11 December 2013 (UTC)
Delivery of first 6600
I have seen various versions of the claim that the first 6600 was delivered to CERN a year before the delivery to Lawrence Livermore Lab in California. This is simply not true! The first 6600 was delivered to Livermore Lab. One year before that we were still laying out the artwork for the printed circuit boards and there was no 6600. As I remember it the system we shipped to CERN was serial 4 or 5.
Overly long intro?
- You are right. I think the paragraph on the 7600 obviously should not be in the intro, so I moved it. I think the second paragraph needs to be moved out of the intro too. Bubba73 You talkin' to me? 00:42, 6 January 2015 (UTC)
- I moved the second para into a new "Models" subsect of the "Description" sect. I also changed the intro text from referring to "a 6600" to "the first 6600" (because it sounded awkward), but someone needs to verify that the CERN model was indeed the first CDC 6600. — Loadmaster (talk) 17:51, 6 January 2015 (UTC)
PP and memory timing
The discussion of the barrel says that it was partly for cost (sounds right -- not to mention space) and partly because of the CP memory timing which is 10 PP cycles. That doesn't sound right. All memory has the same timing (it's after all the same modules), 1000 ns full cycle. So a CM cycle is one major cycle. But in any case, the CM to PP interaction is isolated in the "pyramid" so the two timings aren't directly tied together. The relevant memory timing may instead be that of PP memory. Given the logic used, execution steps would run at small multiples of 100 ns, but with memory running at 1000 ns that speed would be wasted. 1 microsecond cycle time is fast enough for the intended uses of the PPs, so by having 10 PPs in a barrel, the effective PP execution cycle becomes 1000 ns which perfectly matches the PP memory cycle time. The key question (by Wikipedia rules) is not really whether this makes sense, but whether we can find a source to cite for it. Paul Koning (talk) 15:01, 7 January 2015 (UTC)