The 801 was an experimental central processing unit (CPU) design developed by IBM during the 1970s. It is considered to be the first modern RISC design, relying on processor registers for all computations and eliminating the many variant addressing modes found in CISC designs. Originally developed as the processor for a telephone switch, it was later used as the basis for a minicomputer and a number of products for their mainframe line. The initial design was a 24-bit processor; that was soon replaced by 32-bit implementations of the same concepts and the original 24-bit 801 was used only into the early 1980s.
The 801 was extremely influential in the computer market. Armed with huge amounts of performance data, IBM demonstrated that the simple design could easily outperform even the most powerful classic CPU designs, while producing machine code that was only marginally larger than heavily optimized CISC code. Applying the same techniques to existing processors like the System/370 generally doubled the performance of those systems as well. This demonstrated the value of the RISC concept, and all of IBM's future systems were based on the principles developed during the 801 project.
In 1974, IBM began examining the possibility of constructing a telephone switch to handle 300 calls per second. They calculated that each call would require 20,000 instructions to complete, and once timing overhead and other considerations were added, such a machine would require performance of about 12 MIPS. This would require a significant advance in performance; their current top-of-the-line machine, the IBM System/370 Model 168 of late 1972, offered about 3 MIPS.
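The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the factor-of-two overhead is inferred from the two quoted numbers rather than stated in the source.

```python
# Back-of-the-envelope check of the telephone switch's performance target.
CALLS_PER_SECOND = 300
INSTRUCTIONS_PER_CALL = 20_000

# Raw instruction rate before any overhead, in millions of instructions per second.
raw_mips = CALLS_PER_SECOND * INSTRUCTIONS_PER_CALL / 1_000_000

TARGET_MIPS = 12      # figure quoted in the text, after overhead
MODEL_168_MIPS = 3    # approximate figure quoted for the System/370 Model 168

# Inferred, not sourced: the overhead factor implied by the two figures.
implied_overhead_factor = TARGET_MIPS / raw_mips

# How much faster than the Model 168 the new machine needed to be.
speedup_needed = TARGET_MIPS / MODEL_168_MIPS

print(raw_mips, implied_overhead_factor, speedup_needed)
```

Under these numbers the raw workload is 6 MIPS, so the 12 MIPS target implies roughly a doubling for overhead and a machine about four times faster than the Model 168.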
The group working on this project at the Thomas J. Watson Research Center, including John Cocke, designed a processor for this purpose. To reach the required performance, they considered the sort of operations such a machine required and removed any that were not appropriate. This led, for instance, to the removal of the floating-point unit, which would not be needed in this application. More critically, they also removed many of the instructions that worked on data in main memory and left only those instructions that worked on the internal processor registers, as these were much faster to use (effectively zero-time) and the simple code in a telephone switch could be written to use only these types of instructions. The result of this work was a conceptual design for a simplified processor with the required performance.
The telephone switch project was canceled in 1975, but the team had made considerable progress on the concept and in October IBM decided to continue it as a general-purpose design. With no obvious project to attach it to, the team decided to call it the "801" after the building they worked in. For the general-purpose role, the team began to consider real-world programs that would be run on a typical minicomputer. IBM had collected enormous amounts of statistical data on the performance of real-world workloads on their machines, and this data demonstrated that over half the time in a typical program was spent performing only five instructions: load value from memory, store value to memory, branch, compare fixed-point numbers, and add fixed-point numbers. This suggested that the same simplified design would work just as well for a general-purpose minicomputer as for a special-purpose switch.
This conclusion flew in the face of contemporary processor design, which was based on the concept of using microcode. IBM had been among the first to make widespread use of this technique as part of their famous System/360 series. The 360s, and 370s, came in a variety of performance levels that all ran the same machine language code. On the high-end machines, many of these instructions were implemented directly in hardware, while low-end machines could instead simulate those instructions using a sequence of other instructions. This allowed a single application binary interface to run across the entire line, and allowed the users to feel confident that if more performance was ever needed they could move up to a faster machine without any other changes.
Microcode allowed a simple processor to offer many instructions, which had been used by the designers to implement a wide variety of addressing modes. For instance, an instruction like ADD might have a dozen versions: one that adds two numbers in internal registers, one that adds a register to a value in memory, one that adds two values from memory, and so on. This allowed the programmer to select the exact variation needed for any particular task. The processor would read that instruction and use microcode to break it into a series of internal instructions. For instance, adding two numbers in memory might be implemented by loading those two numbers into registers, adding them, and then storing the result back to memory.
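The way microcode expands one visible instruction into internal steps can be sketched as a lookup table. This is a toy model only: the operand kinds and step names below are hypothetical and do not correspond to actual System/360 or System/370 microcode.

```python
# Hypothetical microcode table: each variant of ADD maps to a list of internal steps.
# The step names are illustrative, not real IBM micro-operations.
MICROCODE = {
    ("ADD", "reg", "reg"): ["alu_add"],
    ("ADD", "reg", "mem"): ["fetch_operand_b", "alu_add"],
    ("ADD", "mem", "mem"): [
        "fetch_operand_a",
        "fetch_operand_b",
        "alu_add",
        "store_result",
    ],
}


def expand(opcode, dest_kind, src_kind):
    """Return the internal steps for one addressing-mode variant of an instruction."""
    return MICROCODE[(opcode, dest_kind, src_kind)]


# The memory-to-memory variant expands into load, load, add, store.
steps = expand("ADD", "mem", "mem")
print(steps)
```

In this sketch the register-to-register variant needs a single internal step, while the memory-to-memory variant needs four, which is the load, load, add, store sequence described above.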
The team noticed a side-effect of this concept: when faced with the plethora of possible versions of a given instruction, compiler authors would almost always pick a single version, typically the one that was implemented in hardware on the low-end machines. That ensured that the machine code generated by the compiler would run as fast as possible on the entire lineup. Although other versions might run even faster on a machine that implemented them in hardware, the complexity of knowing which one to pick across an ever-changing list of machines made this extremely unattractive, and compiler authors largely ignored these possibilities.
As a result, the majority of the instructions available in the instruction set were never used in compiled programs. And it was here that the team made the key realization of the 801 project:
Imposing microcode between a computer and its users imposes an expensive overhead in performing the most frequently executed instructions.
Microcode takes a non-zero time to examine the instruction before it is performed. The same underlying processor with the microcode removed would eliminate this overhead and run those instructions faster. And since microcode essentially ran small subroutines dedicated to a particular hardware implementation, it was ultimately performing the same basic task that the compiler was, implementing higher-level instructions as a sequence of machine-specific instructions. Simply removing the microcode and implementing that in the compiler could result in a faster machine.
One concern was that programs written for such a machine would take up more memory; some tasks that could be accomplished with a single instruction on the 370 would have to be expressed as multiple instructions on the 801. For instance, adding two numbers from memory would require two load-to-register instructions, a register-to-register add, and then a store-to-memory. This could potentially slow the system overall if it had to spend more time reading instructions from memory than it formerly took to decode them. As they continued work on the design and improved their compilers, they found that overall program length continued to fall, eventually becoming roughly the same length as those written for the 370.
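The four-instruction sequence mentioned above can be run on a toy load/store machine. This is a minimal sketch under stated assumptions: the mnemonics LOAD, STORE and ADD, the tuple encoding, and the named memory cells are all hypothetical, and none of it reflects the real 801 instruction set.

```python
def run(program, memory, num_registers=16):
    """Execute a list of (op, *args) tuples on a toy load/store machine."""
    regs = [0] * num_registers
    for op, *args in program:
        if op == "LOAD":        # LOAD rd, addr : register <- memory
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "STORE":     # STORE rs, addr : memory <- register
            rs, addr = args
            memory[addr] = regs[rs]
        elif op == "ADD":       # ADD rd, rs : rd <- rd + rs (two-operand form)
            rd, rs = args
            regs[rd] = regs[rd] + regs[rs]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs


# Adding two values held in memory takes four instructions here,
# where a memory-to-memory ADD would need only one.
memory = {"x": 2, "y": 3, "sum": 0}
program = [
    ("LOAD", 0, "x"),
    ("LOAD", 1, "y"),
    ("ADD", 0, 1),
    ("STORE", 0, "sum"),
]
run(program, memory)
print(memory["sum"])
```

Only LOAD and STORE touch memory in this model; arithmetic works purely on registers, which is the restriction that raised the code-size concern.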
The initially proposed architecture was a machine with sixteen 24-bit registers and without virtual memory. It used a two-operand format in the instruction, so that instructions were generally of the form A = A + B. The resulting CPU was operational by the summer of 1980 and was implemented using Motorola MECL-10K discrete component technology on large wire-wrapped custom boards. The CPU was clocked at 66 ns cycles (approximately 15.15 MHz) and could execute approximately 15 MIPS.
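The clock figures quoted above are consistent with one another, as a short calculation shows. The instructions-per-cycle value is an inference from the quoted numbers, not a figure from the source.

```python
CYCLE_TIME_NS = 66    # cycle time quoted in the text
QUOTED_MIPS = 15      # approximate throughput quoted in the text

# Frequency in MHz is 1000 divided by the cycle time in nanoseconds.
clock_mhz = 1_000 / CYCLE_TIME_NS

# Inferred: average instructions completed per clock cycle.
instructions_per_cycle = QUOTED_MIPS / clock_mhz

print(round(clock_mhz, 2), round(instructions_per_cycle, 2))
```

On these numbers the machine completed close to one instruction per clock cycle.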
The 801 architecture was used in a variety of IBM devices, including channel controllers for their S/370 mainframes (such as the IBM 3090), various networking devices, and eventually the IBM 9370 mainframe core itself. The original version of the 801 architecture was the basis for the architecture of the IBM ROMP microprocessor used in the IBM RT PC workstation computer and several experimental computers from IBM Research.
Having been originally designed for a limited-function system, the 801 design lacked a number of features seen on larger machines. Notable among these was the lack of hardware support for virtual memory, which was not needed for the controller role and had been implemented in software on early 801 systems that needed it. For more widespread use, hardware support was a must-have feature. Additionally, by the 1980s the computer world as a whole was moving towards 32-bit systems, and there was a desire to do the same with the 801.
Moving to a 32-bit format had another significant advantage. In practice, it was found that the two-operand format was difficult to use in typical math code. Ideally, both input operands would remain in registers where they could be re-used in subsequent operations, but because the output of the operation overwrote one of them, one of the values often had to be re-loaded from memory. By moving to a 32-bit format, the extra bits in the instruction words allowed an additional register to be specified, so that the output of such operations could be directed to a separate register. The larger instruction word also allowed the number of registers to be increased from sixteen to thirty-two, a change that had clearly been suggested by examination of 801 code. Although the instruction word grew from 24 to 32 bits, programs did not grow by the corresponding 33%, because these two changes avoided many loads and stores.
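The trade-off described above can be illustrated with a toy count. The instruction sequences below are hypothetical: they compute a + b and a - b from two values in memory, assume every instruction occupies exactly one word, and use made-up mnemonics rather than real 801 encodings.

```python
# Two-operand form (24-bit words): ADD overwrites r0, so 'a' must be re-loaded.
two_operand = [
    "LOAD r0, a",
    "LOAD r1, b",
    "ADD  r0, r1",      # r0 <- a + b, original 'a' is lost
    "LOAD r2, a",       # re-load 'a' from memory
    "SUB  r2, r1",      # r2 <- a - b
]

# Three-operand form (32-bit words): results go to separate registers.
three_operand = [
    "LOAD r0, a",
    "LOAD r1, b",
    "ADD  r2, r0, r1",  # r2 <- a + b, inputs preserved
    "SUB  r3, r0, r1",  # r3 <- a - b
]

bits_24 = len(two_operand) * 24
bits_32 = len(three_operand) * 32

naive_growth = 32 / 24 - 1            # growth if the instruction count were unchanged
actual_growth = bits_32 / bits_24 - 1  # growth for this particular toy sequence

print(bits_24, bits_32, round(naive_growth, 3), round(actual_growth, 3))
```

In this example the wider format saves one instruction, so the code grows by about 7% rather than 33%. The exact saving depends entirely on the code being compiled; the sketch only shows the mechanism.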
Other desirable additions included instructions for working with string data encoded in "packed" format, with several ASCII characters in a single memory word, and additions for working with binary-coded decimal, including an adder that could carry across four-bit decimal numbers.
When the new version of the 801 was run as a simulator on the 370, the team was surprised to find that code compiled to the 801 and run in the simulator would often run faster than the same source code compiled directly to 370 machine code using the 370's PL/1 compiler. When they ported their experimental "PL.8" language back to the 370 and compiled applications using it, those applications also ran faster than existing PL/1 code, by as much as three times. This was because the compiler made RISC-like decisions about how to map code onto internal registers, optimizing out as many memory accesses as possible, since these would be expensive on the 801. The result was much better register re-use and thus higher performance even on a CISC processor.
In the early 1980s the lessons learned on the 801 were put back into the new "America" design. This was a three-chip processor set including an instruction processor that fetches and decodes instructions, a fixed-point processor that shares duty with the instruction processor, and a floating-point processor for those systems that require it. Designed by the 801 team, the final design was sent to IBM's Austin office which developed it into the IBM RS/6000 system. The RS/6000 running at 25 MHz was one of the fastest machines of its era. It outperformed other RISC machines by two to three times on common tests, and trivially outperformed older CISC systems.
After the RS/6000, the company turned its attention to a version of the 801 concepts that could be efficiently fabricated at various scales. The result was the IBM POWER instruction set architecture and the PowerPC offshoot. For his work on the 801, John Cocke was awarded the Turing Award in 1987, the National Medal of Technology in 1991, and the National Medal of Science in 1994.