Very long instruction word
Very long instruction word or VLIW refers to a processor architecture designed to take advantage of instruction level parallelism (ILP). Whereas conventional processors mostly only allow programs that specify instructions to be executed one after another, a VLIW processor allows programs that can explicitly specify instructions to be executed at the same time (i.e. in parallel). This type of processor architecture is intended to allow higher performance without the inherent complexity of some other approaches.
|This section needs additional citations for verification. (March 2014)|
Traditional approaches to improving performance in processor architectures include breaking up instructions into sub-steps so that instructions can be executed partially at the same time (known as pipelining), dispatching individual instructions to be executed completely independently in different parts of the processor (superscalar architectures), and even executing instructions in an order different from the program (out-of-order execution). These approaches all involve increased hardware complexity (higher cost, larger circuits, higher power consumption) because the processor must intrinsically make all of the decisions internally for these approaches to work. The VLIW approach, by contrast, depends on the programs themselves providing all the decisions regarding which instructions are to be executed simultaneously and how conflicts are to be resolved. As a practical matter this means that the compiler (software used to create the final programs) becomes much more complex, but the hardware is simpler than many other approaches to parallelism.
The acronym VLIW may also refer to variable-length instruction word, a criteria in instruction set design to allow for a more flexible layout of the instruction set and higher code density (depending on the instructions to be used). For example, this approach makes it possible to load an immediate value of the size of a machine word into a processor register, which would not be feasible if each instruction was limited to the size of machine word. The flexibility comes at an additional cost for instruction decoding.
|This section does not cite any references or sources. (March 2014)|
A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by executing different sub-steps of sequential instructions simultaneously (this is pipelining), or even executing multiple instructions entirely simultaneously as in superscalar architectures. Further improvement can be achieved by executing instructions in an order different from the order they appear in the program; this is called out-of-order execution.
These three techniques all come at the cost of increased hardware complexity. Before executing any operations in parallel the processor must verify that the instructions do not have interdependencies. For example, if a first instruction's result is used as a second instruction's input then they cannot execute at the same time and the second instruction can't be executed before the first. Modern out-of-order processors have increased the hardware resources which do the scheduling of instructions and determining of interdependencies.
The VLIW approach, on the other hand, executes operations in parallel based on a fixed schedule determined when programs are compiled. Since determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, the processor does not need the scheduling hardware that the three techniques described above require. As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.
In superscalar designs, the number of execution units is invisible to the instruction set. Each instruction encodes only one operation. For most superscalar designs, the instruction width is 32 bits or fewer.
In contrast, one VLIW instruction encodes multiple operations; specifically, one instruction encodes at least one operation for each execution unit of the device. For example, if a VLIW device has five execution units, then a VLIW instruction for that device would have five operation fields, each field specifying what operation should be done on that corresponding execution unit. To accommodate these operation fields, VLIW instructions are usually at least 64 bits wide, and on some architectures are much wider.
For example, the following is an instruction for the SHARC. In one cycle, it does a floating-point multiply, a floating-point add, and two autoincrement loads. All of this fits into a single 48-bit instruction:
f12=f0*f4, f8=f8+f12, f0=dm(i0,m3), f4=pm(i8,m9);
Since the earliest days of computer architecture, some CPUs have added several additional arithmetic logic units (ALUs) to run in parallel. Superscalar CPUs use hardware to decide which operations can run in parallel. VLIW CPUs use software (the compiler) to decide which operations can run in parallel. Because the complexity of instruction scheduling is pushed off onto the compiler, the hardware's complexity can be substantially reduced.
A similar problem occurs when the result of a parallelisable instruction is used as input for a branch. Most modern CPUs "guess" which branch will be taken even before the calculation is complete, so that they can load up the instructions for the branch, or (in some architectures) even start to compute them speculatively. If the CPU guesses wrong, all of these instructions and their context need to be "flushed" and the correct ones loaded, which is time-consuming.
This has led to increasingly complex instruction-dispatch logic that attempts to guess correctly, and the simplicity of the original RISC designs has been eroded. VLIW lacks this logic, and therefore lacks its power consumption, possible design defects and other negative features.
In a VLIW, the compiler uses heuristics or profile information to guess the direction of a branch. This allows it to move and preschedule operations speculatively before the branch is taken, favoring the most likely path it expects through the branch. If the branch goes the unexpected way, the compiler has already generated compensatory code to discard speculative results to preserve program semantics.
The term VLIW, and the concept of VLIW architecture itself, were invented by Josh Fisher in his research group at Yale University in the early 1980s. His original development of trace scheduling as a compilation technique for VLIW was developed when he was a graduate student at New York University. Prior to VLIW, the notion of prescheduling functional units and instruction-level parallelism in software was well established in the practice of developing horizontal microcode.
Fisher's innovations were around developing a compiler that could target horizontal microcode from programs written in an ordinary programming language. He realized that to get good performance and target a wide-issue[clarification needed] machine, it would be necessary to find parallelism beyond that generally within a basic block. He also developed region scheduling techniques to identify parallelism beyond basic blocks. Trace scheduling is such a technique, and involves scheduling the most likely path of basic blocks first, inserting compensation code to deal with speculative motions, scheduling the second most likely trace, and so on, until the schedule is complete.
Fisher's second innovation was the notion that the target CPU architecture should be designed to be a reasonable target for a compiler — that the compiler and the architecture for a VLIW processor must be co-designed. This was partly inspired by the difficulty Fisher observed at Yale of compiling for architectures like Floating Point Systems' FPS164, which had a complex instruction set architecture (CISC) that separated instruction initiation from the instructions that saved the result, requiring very complicated scheduling algorithms. Fisher developed a set of principles characterizing a proper VLIW design, such as self-draining pipelines, wide multi-port register files, and memory architectures. These principles made it easier for compilers to write fast code.
The first VLIW compiler was described in a Ph.D. thesis by John Ellis, supervised by Fisher. The compiler was christened Bulldog, after Yale's mascot.  John Ruttenberg also developed certain important algorithms for scheduling.
Fisher left Yale in 1984 to found a startup company, Multiflow, along with co-founders John O'Donnell and John Ruttenberg. Multiflow produced the TRACE series of VLIW minisupercomputers, shipping their first machines in 1987. Multiflow's VLIW could issue 28 operations in parallel per instruction. The TRACE system was implemented in an MSI/LSI/VLSI mix packaged in cabinets, a technology that fell out of favor when it became more cost-effective to integrate all of the components of a processor (excluding memory) on a single chip.
Multiflow was too early to catch the following wave, when chip architectures began to allow multiple-issue CPUs [clarification needed]. The major semiconductor companies recognized the value of Multiflow technology in this context, so the compiler and architecture were subsequently licensed to most of these companies.
One of the licensees of the Multiflow technology is Hewlett-Packard, which Josh Fisher joined after Multiflow's demise. Bob Rau, founder of Cydrome, also joined HP after Cydrome failed. These two would lead computer architecture research within Hewlett-Packard during the 1990s.
In addition to the above systems, at around the same period (i.e. 1989-1990), Intel implemented VLIW in the Intel i860, their first 64bit microprocessor; the i860 was also the first processor to implement VLIW on a single chip. This processor could operate in both simple RISC mode and VLIW mode:
In the early 1990s, Intel introduced the i860 RISC microprocessor. This simple chip had two modes of operation: a scalar mode and a VLIW mode. In the VLIW mode, the processor always fetched two instructions and assumed that one was an integer instruction and the other floating-point.
The i860's VLIW mode was used extensively in embedded DSP applications since the application execution and datasets were simple, well ordered and predictable, allowing the designer to take full advantage of the parallel execution advantages that VLIW lent itself to; in VLIW mode the i860 was able to maintain floating-point performance in the range of 20-40 double-precision MFLOPS (an extremely high figure for its time and for a processor operating at 25-50Mhz).
In the 1990s, Hewlett-Packard researched this problem as a side effect of ongoing work on their PA-RISC processor family. They found that the CPU could be greatly simplified by removing the complex dispatch logic from the CPU and placing it into the compiler. Compilers of the day were much more complex than those from the 1980s, so the added complexity in the compiler was considered to be a small cost.
VLIW CPUs are usually constructed of multiple RISC-like functional units that operate independently. Contemporary VLIWs typically have four to eight main functional units. Compilers generate initial instruction sequences for the VLIW CPU in roughly the same manner that they do for traditional CPUs, generating a sequence of RISC-like instructions. The compiler analyzes this code for dependence relationships and resource requirements. It then schedules the instructions according to those constraints. In this process, independent instructions can be scheduled in parallel. Because VLIWs typically represent instructions scheduled in parallel with a longer instruction word that incorporates the individual instructions, this results in a much longer opcode (thus the term "very long") to specify what executes on a given cycle.
Examples of contemporary VLIW CPUs include the TriMedia media processors by NXP (formerly Philips Semiconductors), the SHARC DSP by Analog Devices, the C6000 DSP family by Texas Instruments, the STMicroelectronics ST200 family based on the Lx architecture (designed in Josh Fisher's HP lab by Paolo Faraboschi), and the MPPA MANYCORE family by KALRAY. These contemporary VLIW CPUs are primarily successful as embedded media processors for consumer electronic devices.
VLIW features have also been added to configurable processor cores for SoC designs. For example, Tensilica's Xtensa LX2 processor incorporates a technology dubbed FLIX (Flexible Length Instruction eXtensions) that allows multi-operation instructions. The Xtensa C/C++ compiler can freely intermix 32- or 64-bit FLIX instructions with the Xtensa processor's single-operation RISC instructions, which are 16 or 24 bits wide. By packing multiple operations into a wide 32- or 64-bit instruction word and allowing these multi-operation instructions to be intermixed with shorter RISC instructions, FLIX technology allows SoC designers to realize VLIW's performance advantages while eliminating the code bloat of early VLIW architectures. The Infineon Carmel DSP is another VLIW processor core intended for SoC; it uses a similar code density improvement technique called "configurable long instruction word" (CLIW). 
Outside embedded processing markets, Intel's Itanium IA-64 EPIC and Elbrus 2000 appear as the only examples of a widely used VLIW CPU architectures. However, EPIC architecture is sometimes distinguished from a pure VLIW architecture, since EPIC advocates full instruction predication, rotating register files, and a very long instruction word that can encode non-parallel instruction groups. VLIWs also gained significant consumer penetration in the GPU market, though both Nvidia and AMD have since moved to RISC architectures in order to improve performance on non-graphics workloads.
|This section does not cite any references or sources. (March 2014)|
When silicon technology allowed for wider implementations (with more execution units) to be built, the compiled programs for the earlier generation would not run on the wider implementations, as the encoding of the binary instructions depended on the number of execution units of the machine.
Transmeta addresses this issue by including a binary-to-binary software compiler layer (termed code morphing) in their Crusoe implementation of the x86 architecture. Basically, this mechanism is advertised to recompile, optimize, and translate x86 opcodes at runtime into the CPU's internal machine code. Thus, the Transmeta chip is internally a VLIW processor, effectively decoupled from the x86 CISC instruction set that it executes.
Intel's Itanium architecture (among others) solved the backward-compatibility problem with a more general mechanism. Within each of the multiple-opcode instructions, a bit field is allocated to denote dependency on the previous VLIW instruction within the program instruction stream. These bits are set at compile time, thus relieving the hardware from calculating this dependency information. Having this dependency information encoded into the instruction stream allows wider implementations to issue multiple non-dependent VLIW instructions in parallel per cycle, while narrower implementations would issue a smaller number of VLIW instructions per cycle.
Another perceived deficiency of VLIW architectures is the code bloat that occurs when not all of the execution units have useful work to do and thus have to execute NOPs. This occurs when there are dependencies in the code and the functional pipelines must be allowed to drain before subsequent operations can proceed.
Since the number of transistors on a chip has grown, the perceived disadvantages of the VLIW have diminished in importance. The VLIW architecture is growing in popularity, particularly in the embedded market, where it is possible to customize a processor for an application in an embedded system-on-a-chip. Embedded VLIW products are available from several vendors, including the FR-V from Fujitsu, the BSP15/16 from Pixelworks, the ST231 from STMicroelectronics, the TriMedia from NXP, the CEVA-X DSP from CEVA, the Jazz DSP from Improv Systems, and Silicon Hive. The Texas Instruments TMS320 DSP line has evolved, in its C6xxx family, to look more like a VLIW, in contrast to the earlier C5xxx family.
- Explicitly parallel instruction computing (EPIC)
- Transport triggered architecture (TTA)
- Elbrus processors
- Wai-Kai Chen (2000). "Memory, Microprocessor, and ASIC". books.google.com. CRC Press. pp. 11–14, 11–15. Retrieved 2014-08-19.
- Heidi Pan; Krste Asanovic (2001). "Heads and Tails: A Variable-Length Instruction Format Supporting Parallel Fetch and Decode". scale.eecs.berkeley.edu. Retrieved 2014-08-19.
- "CONTROL DATA 6400/6500/6600 COMPUTER SYSTEMS Reference Manual". 1969-02-21. Retrieved 7 November 2013.
- Fisher, Joseph A. (1983). "Proceedings of the 10th annual international symposium on Computer architecture" (PDF). International Symposium on Computer Architecture. New York, NY, USA: ACM. pp. 140–150. doi:10.1145/800046.801649. ISBN 0-89791-101-6. Retrieved 2009-04-27.
- "ACM 1985 Doctoral Dissertation Award". ACM. Retrieved 2007-10-15.
For his dissertation Bulldog: A Compiler for VLIW Architecture.
- "An Introduction To Very-Long Instruction Word (VLIW) Computer Architecture". Philips Semiconductors. Archived from the original on 2011-09-29.
- "EEMBC Publishes Benchmark Scores for Infineon Technologies' Carmel DSP Core and TriCore TC11IB Microcontroller"
- Paper That Introduced VLIWs
- Book on the history of Multiflow Computer, VLIW pioneering company
- ISCA "Best Papers" Retrospective On Paper That Introduced VLIWs
- VLIW and Embedded Processing
- FR500 VLIW-architecture High-performance Embedded Microprocessor