The ILLIAC IV was one of the first attempts to build a massively parallel computer. One of a series of research machines (the ILLIACs from the University of Illinois), the ILLIAC IV design featured fairly high parallelism with up to 256 processors, used to allow the machine to work on large data sets in what would later be known as vector processing. After several delays and redesigns, the computer was delivered to NASA's Ames Research Center outside of San Francisco in 1971. After thorough testing and four years of NASA use, ILLIAC IV was connected to the ARPANet for distributed use in November 1975, becoming the first network-available supercomputer, beating Cray's Cray-1 by nearly 12 months.
By the early 1960s computer designs were approaching the point of diminishing returns. At the time, computer design focused on adding as many instructions as possible to the machine's CPU, a concept known as "orthogonality", which made programs smaller and more efficient in use of memory. It also made the computers themselves fantastically complex, and in an era when many CPUs were hand-wired from individual transistors, the cost of additional orthogonality was often very high. Adding instructions could potentially slow the machine down; maximum speed was defined by the signal timing in the hardware, which was in turn a function of the overall size of the machine. The state of the art hardware design techniques of the time used individual transistors to build up logic circuits, so any increase in logic processing meant a larger machine. CPU speeds appeared to be reaching a plateau.
Several solutions to these problems were explored in the 1960s. One, then known as overlap but today known as an instruction pipeline, allows a single CPU to work on small parts of several instructions at a time. Normally the CPU would fetch an instruction from memory, "decode" it, run the instruction and then write the results back to memory. While the machine is working on any one stage, say decoding, the other portions of the CPU are not being used. Pipelining allows the CPU to start the load and decode stages (for instance) on the "next" instruction while still working on the last one and writing it out. Pipelining was a major feature of Seymour Cray's groundbreaking design, the CDC 7600, which outperformed almost all other machines by about ten times when it was introduced.
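The benefit of overlap can be illustrated with a simple cycle count. The sketch below is an idealized model, not a description of any particular machine: it assumes every stage takes exactly one clock cycle and that no instruction ever stalls. The function names and the four-stage split (fetch, decode, execute, write back) are illustrative assumptions.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """Idealized pipeline: once full, one instruction completes per cycle.

    The first instruction needs n_stages cycles to drain through the pipe;
    each later instruction finishes one cycle after its predecessor.
    Assumes one cycle per stage and no stalls (a simplification).
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    stages = 4  # fetch, decode, execute, write back (assumed split)
    for n in (1, 10, 100):
        print(n, cycles_unpipelined(n, stages), cycles_pipelined(n, stages))
```

Under these assumptions, 100 instructions take 400 cycles without overlap and 103 with it, so the speedup approaches the number of stages as the instruction stream grows.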
Another solution to the problem was parallel computing; building a computer out of a number of general purpose CPUs. The "computer" as a whole would have to be able to keep all of the CPUs busy, asking each one to work on a small part of the problem and then collecting up the results at the end into a single "answer". Not all tasks can be handled in this fashion, and extracting performance from multiple processors remains a problem even today, yet the concept has the advantage of having no theoretical limit to speed – if you need more performance, simply add more CPUs. General purpose CPUs were very expensive, however, so any "massively parallel" design would either be too expensive to be worth it, or have to use a much simpler CPU design.
Westinghouse explored the latter solution in a project known as Solomon. Since the highest performing computers were being used primarily for math processing in science and engineering, they decided to focus their CPU design on math alone. They designed a system in which the instruction stream was fetched and decoded by a single CPU, the "control unit" or CU. The CU was attached to an array of processors built to handle floating point math only, the "processing elements", or PEs. Since much of the complexity of a CPU is due to the instruction fetching and decoding process, Solomon's PEs ended up being much simpler than the CU, so many of them could be built without driving up the price. Modern microprocessor designs are quite similar to this layout in general terms, with a single instruction decoder feeding a number of subunits dedicated to processing certain types of data. Where Solomon differed from modern designs was in the number of subunits: a modern CPU might have three or four integer units and a similar number of floating point units, whereas Solomon had 256 PEs, all dedicated to floating point.
Solomon would read instructions from memory, decode them, and then hand them off to the PEs for processing. Each PE had its own memory for holding operands and results, the PE Memory module, or PEM. The CU could access the entire memory via a dedicated memory bus, whereas the PEs could only access their own PEM. Although there are problems, known as embarrassingly parallel, that can be handled by entirely independent units, these problems are generally rare. To allow results from one PE to be used as inputs in another, a separate network connected each PE to its eight closest neighbors. Similar arrangements were common on massively parallel machines in the 1980s.
Unlike modern designs, Solomon's PEs could only run a single instruction at a time, and every PE had to be running the same instruction. That means the system was only useful when working on data sets that had "wide" arrays that could be spread out over the PEs. These sorts of problems are not uncommon in scientific processing, and are very common today when working with multimedia data. The concept of applying a single instruction to a large number of data elements at once is now common to most microprocessor designs, where it is referred to as SIMD, for "Single Instruction, Multiple Data". In Solomon, the CU would normally load up the PEMs with data, scatter the instructions across the PEMs, and then start feeding the instructions to the PEs, one at every clock cycle.
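The lockstep behavior described here can be sketched in a few lines. This is a toy model under stated assumptions, not Solomon's actual instruction set: each PEM is represented as a small Python list, an "instruction" is a hypothetical function applied to one PEM, and the inner loop stands in for what the hardware would do on all PEs simultaneously.

```python
from typing import Callable, List

PEMemory = List[float]                      # one PE's private memory (toy model)
Instruction = Callable[[PEMemory], None]    # hypothetical instruction type


def run_lockstep(pems: List[PEMemory], program: List[Instruction]) -> None:
    """Feed one instruction per 'clock cycle' to every PE.

    Every PE executes the same instruction, each on its own PEM.
    The inner loop is sequential here but models simultaneous execution.
    """
    for instruction in program:
        for pem in pems:
            instruction(pem)


def add_0_1_into_2(pem: PEMemory) -> None:
    """Illustrative instruction: word 2 = word 0 + word 1."""
    pem[2] = pem[0] + pem[1]


def double_2(pem: PEMemory) -> None:
    """Illustrative instruction: word 2 = word 2 * 2."""
    pem[2] = pem[2] * 2


if __name__ == "__main__":
    # The control unit first loads each PEM with its slice of the data.
    pems = [[1.0, 10.0, 0.0], [2.0, 20.0, 0.0], [3.0, 30.0, 0.0], [4.0, 40.0, 0.0]]
    run_lockstep(pems, [add_0_1_into_2, double_2])
    print([pem[2] for pem in pems])  # [22.0, 44.0, 66.0, 88.0]
```

Two instructions were issued, yet four results were produced per instruction; with a wider array of PEs the same two-instruction program would cover proportionally more data.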
Under a contract from the US Air Force's RADC research arm, Westinghouse built a breadboard prototype machine in 1964, but the RADC contract ended and the company decided not to follow it up on its own.
ILLIAC IV 
When Solomon ended, the principal investigator, Daniel Slotnick, joined the Illiac design team at the University of Illinois at Urbana-Champaign. Illinois had been designing and building large computers for the U.S. Department of Defense and the Defense Advanced Research Projects Agency (DARPA) since 1949. In 1964 the University signed a contract with DARPA to fund the effort, which became known as ILLIAC IV, since it was the fourth computer designed and created at the University. Development started in 1965, and a first-pass design was completed in 1966.
In many ways the machine was treated as an experimental design, so it included the most advanced features then available. The logic circuits were based on ECL integrated circuits (ICs), whereas many machines of the era still relied on individual transistors or low-speed ICs. Texas Instruments was contracted for the ECL-based ICs. Each PE was given 2,048 words of 240 ns thin-film memory (later replaced with semiconductor memory) for storing results. Burroughs supplied the specialized disk drives, which featured a separate stationary head for every track, offered transfer speeds of up to 500 Mbit/s, and stored about 80 MB per 36" disk. Burroughs also provided a B6500 mainframe to act as a front-end controller. Connected to the B6500 was a laser optical recording medium, a write-once system that stored up to 1 Tbit on a plastic disk covered with a thin metal film.
The ILLIAC had a 64-bit word design. The CU had sixty-four 64-bit registers and another four 64-bit accumulators. The PEs had only six 64-bit registers, each with a special purpose. One of these, RGR, was used for communicating data to neighboring PEs, moving one "hop" per clock cycle. Another, RGD, indicated whether or not that PE was currently active. The PEs had instruction formats for 64-, 32- and 8-bit data, and could be placed into a 32-bit mode that made it appear that there were 128 PEs. The PEs were powerful floating point and integer processors that could operate in normalized or unnormalized, rounded or truncated, and short or long arithmetic modes.
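The roles of the routing register and the activity flag can be sketched as follows. This is a toy model built on assumptions the text does not state: the PEs are arranged here as a one-dimensional ring with wraparound, each register is a single Python integer, and the function names are hypothetical. Only the two behaviors named above (one hop per clock cycle, and inactive PEs sitting out an instruction) are taken from the description.

```python
from typing import List


def route_one_hop(rgr: List[int]) -> List[int]:
    """One clock cycle: every PE passes its RGR value to the next PE.

    Ring wraparound is an assumption of this sketch.
    """
    return [rgr[-1]] + rgr[:-1]


def route(rgr: List[int], hops: int) -> List[int]:
    """Moving data a distance of `hops` PEs takes `hops` clock cycles."""
    for _ in range(hops):
        rgr = route_one_hop(rgr)
    return rgr


def masked_add(acc: List[int], operand: List[int], active: List[bool]) -> List[int]:
    """All PEs receive the same ADD, but only active ones change state.

    `active` models the per-PE flag the text associates with RGD.
    """
    return [a + b if on else a for a, b, on in zip(acc, operand, active)]


if __name__ == "__main__":
    print(route([10, 20, 30, 40], 1))   # [40, 10, 20, 30]
    print(route([10, 20, 30, 40], 2))   # [30, 40, 10, 20]
    print(masked_add([1, 2, 3, 4], [10, 10, 10, 10], [True, False, True, False]))
    # [11, 2, 13, 4]
```

The masking step is what lets a machine with a single instruction stream handle data-dependent branches: PEs for which a condition is false are simply switched off while the others proceed.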
The design goal called for a computer with the ability to process 1 billion floating point operations per second, or in today's terminology, 1 GFLOPS. To do this the basic design would require 256 PEs running on a 13 MHz clock, driven by four CPUs. Originally they intended to house all 256 PEs in a single large mainframe, but the project quickly ran behind schedule. Instead, a modification was made to divide the ALUs into quadrants of 64 with a single CU each, housed in separate cabinets. Eventually it became clear that only one quadrant would become available in any realistic timeframe, reducing performance from 1 GFLOPS to about 200 MFLOPS.
Sample work at the University was primarily aimed at ways to efficiently fill the PEs with data, thus conducting the first "stress test" in computer development. In order to make this as easy as possible, several new computer languages were created; IVTRAN and TRANQUIL were parallelized versions of FORTRAN, and Glypnir was a similar conversion of ALGOL. Generally these languages provided support for loading arrays of data "across" the PEs to be executed in parallel, and some even supported the unwinding of loops into array operations.
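The idea of unwinding a loop into array operations can be sketched in modern terms. The code below is Python rather than IVTRAN, TRANQUIL or Glypnir, whose syntax is not reproduced here; the function names and the default width of 64 lanes (one quadrant's worth of PEs) are assumptions made for illustration.

```python
from typing import List, Tuple

N_PES = 64  # lanes available at once; one quadrant, assumed for this sketch


def scalar_loop(a: List[float], b: List[float]) -> List[float]:
    """The original loop: one element per iteration."""
    c = [0.0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def array_form(a: List[float], b: List[float], width: int = N_PES) -> Tuple[List[float], int]:
    """The same computation as a sequence of array operations.

    Each pass over a slice stands for a single instruction applied
    to up to `width` elements at once. Returns the result and the
    number of array steps issued.
    """
    c: List[float] = []
    steps = 0
    for start in range(0, len(a), width):
        c.extend(x + y for x, y in zip(a[start:start + width], b[start:start + width]))
        steps += 1
    return c, steps


if __name__ == "__main__":
    a = [float(i) for i in range(256)]
    b = [1.0] * 256
    result, steps = array_form(a, b)
    print(result == scalar_loop(a, b), steps)  # True 4
```

For 256 elements the scalar loop runs 256 iterations, while the array form issues 4 steps; a compiler that can prove the iterations are independent may make this transformation automatically.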
ILLIAC moves 
When the computer was being built at the Burroughs Corporation's Great Valley Lab in the late 1960s, it was met with hostility by protesters who were suspicious of the University's tie with the Department of Defense (through ARPA), and felt that the University had sold out to a conspiracy to develop nuclear weapons. The fear was unfounded, but government paranoia was running rampant in the time following the massacre at Kent State University. The protests reached a boiling point on 9 May 1970, in a day of "Illiaction". The Director of the Project decided the University could not guarantee the safety of the machine. It was then decided that the machine would be delivered to the NASA Ames Research Center, rather than to Illinois. The work was picked up by NASA, then still cash-flush in the post-Apollo years and interested in almost anything "high tech". They formed a new Advanced Computing division, and the machine was delivered to NASA Ames.
Originally Texas Instruments made a commitment to build the Processing Elements (PEs) out of large scale integrated (LSI) circuits. Several years into the project, TI backed out and said that they could not produce the LSI chips at the contracted price. This required a complete redesign using medium scale integrated circuits, in place of LSI. This increased the size of the chips on the CUs from about 1 in square to about 6 by 10 inches. The resulting system grew in size to accommodate the larger CUs, leaving too little room for the full machine, and the system was scaled back to only a single quadrant.
The machine was 10' high, 8' deep and 50' long. The power supplies were so large that a single-tongue forklift had to be designed to remove and reinstall them. The power supply bus bars spanned distances greater than three feet and were octopus-like in design. Made of thick copper, the bus bars were coated in epoxy that often cracked, resulting in shorts and an array of other issues. ILLIAC IV was designed by Burroughs Corporation and built in quadrants in Great Valley, PA, from 1967 through 1972. It had a traditional one-address accumulator architecture, rather than the revolutionary stack architecture pioneered by Burroughs in the 5500/6500 machines. ILLIAC IV was in fact designed to be a "back end processor" to a B6700. The cost overruns caused by the loss of the LSI chips and by other design errors at Burroughs (for example, the control unit was built with positive logic and the PEs with negative logic) made the project untenable.
Starting in 1970, the machine became the subject of student demonstrations at Illinois. The first claim was that the project had been secretly created on campus. When this claim proved to be false, the focus shifted to the role of universities in secret military research. Slotnick was not in favor of running classified programs on the machine. ARPA wanted the machine room encased in copper to prevent off-site snooping of classified data. Slotnick refused to do that. He went further and insisted that all research performed on ILLIAC IV would be published. If the machine had been installed in Urbana this would have been the case. However, two things caused the machine to be delivered to NASA Ames. One was that Slotnick was concerned that the physical presence of the machine on campus might attract violence on the part of student radicals. This and the requirement to do secret research with the machine led ARPA to move the machine to NASA Ames Research Center, where it was installed in a secure environment. The machine was never delivered to Illinois; it arrived at Ames in 1972. Rumor has it that simulations run on the machine made the nuclear test ban treaties possible. In 1972, when the first (and only) quadrant was operational at NASA, it was 13 times faster than any other machine operating at the time. The Control Unit and a few PEs may be seen today at the Computer History Museum in California.
By the time of delivery in 1971, the original $8 million estimated from the first design in 1966 had risen to $31 million. Burroughs, unfamiliar with parallel test processes, could never get the computer to reach its estimated 1 GFLOPS; the best they could muster was 250 MFLOPS, with peaks of 150. NASA also decided to replace the B6500 with a PDP-10, a machine in common use at Ames, but this required the development of new compilers and support software. When the ILLIAC was finally turned on in 1971, NASA's changes proved incompatible with the original design, causing intermittent failures. Efforts to improve reliability allowed it to run its first complete program in 1972, and go into full operation in 1975. Due to NASA regulations that the computer, prone to overheating, could not be operational without observation, the machine was operated only Monday to Friday and had up to 40 hours of planned maintenance a week.
Nevertheless the ILLIAC was increasingly used over the next few years, and Ames added their own FORTRAN version, CFD. On problems that could be parallelized the machine was still the fastest in the world, outperforming the CDC 7600 by two to six times, and it is generally credited as the fastest machine in the world until 1981. For NASA the machine was "perfect", as its performance was tuned for programs running the same operation on lots of data, which is exactly what computational fluid dynamics is all about. The machine was eventually decommissioned in 1982, and NASA's advanced computing division ended with it.
Burroughs was able to use the basic design for only one commercial system, the Parallel Element Processing Ensemble, or PEPE. PEPE was designed to allow high-accuracy tracking of 288 incoming ICBM warheads, each one assigned to a modified PE. Burroughs built only one PEPE system, although a follow-on design was built by Bell Labs.
Although the ILLIAC travails ended in uninspiring results, attempts to understand the reasons for the difficulties of the ILLIAC IV architecture pushed forward research in parallel computing. ILLIAC IV was a member of the class of parallel computers referred to as SIMD (Single Instruction stream, Multiple Data stream), essentially an array processor. During the 1980s, with the falling processor cost created by Moore's Law, a number of companies created MIMD (Multiple Instruction, Multiple Data) designs to build even more parallel machines, with compilers that could make better use of the parallelism. The Thinking Machines CM-1 and CM-2 are excellent examples of the MIMD concept.
Most supercomputers of the era took another approach to higher performance, using a single very high speed vector processor. Similar to the ILLIAC in concept, these processor designs loaded up many data elements into a single custom processor instead of a large number of specialized ones. The classic example of this design is the Cray-1, which had performance similar to the ILLIAC. There was more than a little "backlash" against the ILLIAC design as a result, and for some time the supercomputer market looked on massively parallel designs with disdain, even when they were successful. As Seymour Cray famously quipped, "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" Keeping with the ploughing analogy, however, consider what you would want behind your tractor (the control unit): one mouldboard, or 64 mouldboards (the PEs)? The answer is obvious from looking at any modern plough or computer.
Moore's Law overtook the specialized SIMD ILLIAC approach, making the MIMD approach preferred for almost all scientific computing. The vector processor approach evolved into the pipeline architecture used in most processors today. Hence, in the end, it was a synthesis of MIMD and pipelining that came to be used for many "supercomputers". Nevertheless, an echo of this SIMD architecture lives on in modern GPU designs, which are themselves now being incorporated into modern supercomputing systems.