History of compiler construction
|History of computing|
|Timeline of computing|
In computing, a compiler is a computer program that transforms source code written in a programming language or computer language (the source language), into another computer language (the target language, often having a binary form known as object code or Machine code). The most common reason for wanting to transform source code is to create an executable program.
Any program written in a high level programming language must be translated to object code before it can be executed, so all programmers using such a language use a compiler or an interpreter. Thus, compilers are very important to programmers. Any improvement to a compiler leads to a large number of improved executable programs.
Compilers are large and complex programs, but systematic analysis and research by computer scientists has led to a clearer understanding of compiler construction and a large body of theory has been developed around them. Research into compiler construction has led to tools that make it much easier to create compilers, so that today computer science students can create their own small language and develop a simple compiler for it in a few weeks.
First compilers 
Software for early computers was primarily written in assembly language. It is usually more productive for a programmer to use a high level language, and programs written in a high level language are able to be reused on different kinds of computers. Even so, it took a while for compilers to become established, because they generated code that did not perform as well as hand-written assembler, they were daunting development projects in their own right, and the very limited memory capacity of early computers created many technical problems for practical compiler implementations.
The first compiler was written by Grace Hopper, in 1952, for the A-0 System language. The term compiler was coined by Hopper. The FORTRAN team led by John W. Backus at IBM is generally credited as having introduced the first complete compiler, in 1957. The first FORTRAN compiler took 18 person-years to create.
The first ALGOL 58 compiler was completed by the end of 1958 by Friedrich L. Bauer, Hermann Bottenbruch, Heinz Rutishauser, and Klaus Samelson for the Z22 computer. Bauer et al. had been working on compiler technology for the Sequentielle Formelübersetzung (i.e. Sequential Formula Translation) in the previous years.
By 1960, an extended Fortran compiler, ALTAC, was available on the Philco 2000, so it is probable that a Fortran program was compiled for both IBM and Philco computer architectures in mid-1960. The first known demonstrated cross-platform high level language was COBOL. In a demonstration in December 1960, a COBOL program was compiled and executed on both the UNIVAC II and the RCA 501.
The COBOL compiler for the UNIVAC II was probably the first to be written in a high level language, namely FLOW-MATIC, by a team led by Grace Hopper.
Self-hosting compilers 
Like any other software, there are benefits from implementing a compiler in a high-level language. In particular, a compiler can be self-hosted – that is, written in the programming language it compiles. Building a self-hosting compiler is a bootstrapping problem, ie. the first such compiler for a language must be either hand written machine code or compiled by a compiler written in another language, or compiled by running the compiler in an interpreter.
The Navy Electronics Laboratory International ALGOL Compiler or NELIAC was a dialect and compiler implementation of the ALGOL 58 programming language developed by the Naval Electronics Laboratory in 1958.
NELIAC was the brainchild of Harry Huskey — then Chairman of the ACM and a well known computer scientist, and supported by Maury Halstead, the head of the computational center at NEL. The earliest version was implemented on the prototype USQ-17 computer (called the Countess) at the laboratory. It was the world's first self-compiling compiler. This means that the compiler was first coded in simplified form in assembly language (the bootstrap), and then re-written in its own language, compiled by this bootstrap compiler, and re-compiled by itself, making the bootstrap obsolete.
The first self-hosting compiler (excluding assemblers) was written for Lisp by Tim Hart and Mike Levin at MIT in 1962. They wrote a Lisp compiler in Lisp, testing it inside an existing Lisp interpreter. Once they had improved the compiler to the point where it could compile its own source code, it was self-hosting.
- The compiler as it exists on the standard compiler tape is a machine language program that was obtained by having the S-expression definition of the compiler work on itself through the interpreter. (AI Memo 39)
This technique is only possible when an interpreter already exists for the very same language that is to be compiled. It borrows directly from the notion of running a program on itself as input, which is also used in various proofs in theoretical computer science, such as the proof that the halting problem is undecidable.
Context-free grammars and parsers 
A parser is an important component of a compiler. It parses the source code of a computer programming language to create some form of internal representation. Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them. Parsers can be written by hand or generated by a parser generator. A context-free grammar provides a simple and precise mechanism for describing the block structure of a program – that is, how programming language constructs are built from smaller blocks. The formalism of context-free grammars was developed in the mid-1950s by Noam Chomsky.
Block structure was introduced into computer programming languages by the ALGOL project (1957–1960), which, as a consequence, also featured a context-free grammar to describe the resulting ALGOL syntax.
Context-free grammars are simple enough to allow the construction of efficient parsing algorithms which, for a given string, determine whether and how it can be generated from the grammar. If a programming language designer is willing to work within some limited subsets of context-free grammars, more efficient parsers are possible.
LR parsing 
The LR parser (left to right) was invented by Donald Knuth in 1965 in a paper, "On the Translation of Languages from Left to Right". An LR parser is a parser that reads input from Left to right (as it would appear if visually displayed) and produces a Rightmost derivation. The term LR(k) parser is also used, where k refers to the number of unconsumed lookahead input symbols that are used in making parsing decisions.
Knuth proved that LR(k) grammars can be parsed with an execution time essentially proportional to the length of the program, and that every LR(k) grammar for k > 1 can be mechanically transformed into an LR(1) grammar for the same language. In other words, it is only necessary to have one symbol lookahead to parse any deterministic context-free grammar(DCFG).
Korenjak (1969) was the first to show parsers for programming languages could be produced using these techniques. Frank DeRemer devised the more practical Simple LR (SLR) and Look-ahead LR (LALR) techniques, published in his PhD dissertation at MIT in 1969. This was an important breakthrough, because LR(k) translators, as defined by Donald Knuth, were much too large for implementation on computer systems in the 1960s and 1970s.
In practice, LALR offers a good solution, because LALR(1) parsers are more powerful (that is, they can parse more complex grammars) than SLR(1) and LL(1) grammars. LR(1) grammars are more powerful again than LALR(1); however, an LR(1) grammar requires a canonical LR parser which would be extremely large in size and is not considered practical. The syntax of many programming languages are defined by grammars that can be parsed with an LALR(1) parser, and for this reason LALR parsers are often used by compilers to perform syntax analysis of source code.
A recursive ascent parser implements an LALR parser using mutually-recursive functions rather than tables. Thus, the parser is directly encoded in the host language similar to recursive descent. Direct encoding usually yields a parser which is faster than its table-driven equivalent for the same reason that compilation is faster than interpretation. It is also (in principle) possible to hand edit a recursive ascent parser, whereas a tabular implementation is nigh unreadable to the average human.
Recursive ascent was first described by Thomas Pennello in his article "Very fast LR parsing" in 1986. The technique was later expounded upon by G.H. Roberts in 1988 as well as in an article by Leermakers, Augusteijn, Kruseman Aretz in 1992 in the journal Theoretical Computer Science.
LL parsing 
An LL parser parses the input from Left to right, and constructs a Leftmost derivation of the sentence (hence LL, as opposed to LR). The class of grammars which are parsable in this way is known as the LL grammars. LL grammars are an even more restricted class of context-free grammars than LR grammars. Nevertheless, they are of great interest to compiler writers, because such a parser is simple and efficient to implement.
LL(k) grammars can be parsed by a recursive descent parser which is usually coded by hand.
The design of ALGOL sparked investigation of recursive descent, since the ALGOL language itself is recursive. The concept of recursive descent parsing was discussed in the January 1961 issue of CACM in separate papers by A.A. Grau and Edgar T. "Ned" Irons.   Richard Waychoff independently conceived of and used recursive descent in the Burroughs ALGOL compiler in March 1961.
LR parsing can handle a larger range of languages than LL parsing, and is also better at error reporting, i.e. it detects syntactic errors when the input does not conform to the grammar as soon as possible.
Earley parser 
Grammar description languages 
John Backus proposed "metalinguistic formulas" to describe the syntax of the new programming language IAL, known today as ALGOL 58 (1959). Backus's work was based on the Post canonical system devised by Emil Post.
Further development of ALGOL led to ALGOL 60; in its report (1963), Peter Naur named Backus's notation Backus Normal Form (BNF), and simplified it to minimize the character set used. However, Donald Knuth argued that BNF should rather be read as Backus–Naur Form, and that has become the commonly accepted usage.
Niklaus Wirth defined Extended Backus–Naur Form (EBNF), a refined version of BNF, in the early 1970s for PL/0. Augmented Backus–Naur Form (ABNF) is another variant. Both EBNF and ABNF are widely used to specify the grammar of programming languages, as the inputs to parser generators, and in other fields such as defining communication protocols.
Parser generators 
A parser generator generates the lexical-analyser portion of a compiler. It is a program that takes a description of a formal grammar of a specific programming language and produces a parser for that language. That parser can be used in a compiler for that specific language. The parser detects and identifies the reserved words and symbols of the specific language from a stream of text and returns these as tokens to the code which implements the syntactic validation and translation into object code. This second part of the compiler can also be created by a compiler-compiler using a formal rules-of-precedence syntax-description as input.
The first compiler-compiler to use that name was written by Tony Brooker in 1960 and was used to create compilers for the Atlas computer at the University of Manchester, including the Atlas Autocode compiler. However it was rather different from modern compiler-compilers, and today would probably be described as being somewhere between a highly customisable generic compiler and an extensible-syntax language. The name 'compiler-compiler' was far more appropriate for Brooker's system than it is for most modern compiler-compilers, which are more accurately described as parser generators. It is almost certain that the "Compiler Compiler" name has entered common use due to Yacc rather than Brooker's work being remembered.
In the early 1960s, Robert McClure at Texas Instruments invented a compiler-compiler called TMG, the name taken from "transmogrification". In the following years TMG was ported to several UNIVAC and IBM mainframe computers.
Another early example of a parser generator is META II, first released in 1962 by Val Schorre. META II accepted grammars and code generation rules, and was able to compile itself and other languages. META II was the first documented version of a metacompiler. It also translated to one of the earliest instances of a virtual machine.
The Multics project, a joint venture between MIT and Bell Labs, was one of the first to develop an operating system in a high level language. PL/I was chosen as the language, but an external supplier could not supply a working compiler. The Multics team developed their own subset dialect of PL/I known as Early PL/I (EPL) as their implementation language in 1964. TMG was ported to GE-600 series and used to develop EPL by Douglas McIlroy, Robert Morris, and others.
Not long after Ken Thompson wrote the first version of Unix for the PDP-7 in 1969, Doug McIlroy created the new system's first higher-level language: an implementation of McClure's TMG. TMG was also the compiler definition tool used by Ken Thompson to write the compiler for the B language on his PDP-7 in 1970. B was the immediate ancestor of C.
An early LALR parser generator was called "TWS", created by Frank DeRemer and Tom Pennello.
XPL is a dialect of the PL/I programming language, developed in 1967, used for the development of compilers for computer languages. It was designed and implemented by a team with William McKeeman, James J. Horning, and David B. Wortman at Stanford University and the University of California, Santa Cruz. It was first announced at the 1968 Fall Joint Computer Conference in San Francisco.
It is the name of both the programming language and the LALR parser generator system (or TWS: translator writing system) based on the language.
XPL featured a relatively simple bottom-up compiler system dubbed MSP (mixed strategy precedence) by its authors. It was bootstrapped through Burroughs Algol onto the IBM System/360 computer. Subsequent implementations of XPL featured an SLR(1) parser.
yacc is a parser generator (loosely, compiler-compiler), not to be confused with lex, which is a lexical analyzer frequently used as a first stage by yacc. yacc was developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates an LALR(1) compiler based on a grammar written in a notation similar to Backus-Naur Form.
Johnson worked on yacc in the early 1970s at Bell Labs. He was familiar with TMG and its influence can be seen in yacc and the design of the C programming language. Because Yacc was the default compiler generator on most Unix systems, it was widely distributed and used. Developments like GNU Bison are still in use.
The compiler generated by yacc requires a lexical analyzer. Lexical analyzer generators, such as lex or flex are widely available. The IEEE POSIX P1003.2 standard defines the functionality and requirements for both Lex and Yacc.
Cross compilation 
A cross compiler runs in one environment but produces object code for another. Cross compilers are used for embedded development, where the target computer has limited capabilities.
The ALGOL 68C compiler generated ZCODE output, that could then be either compiled into the local machine code by a ZCODE translator or run interpreted. ZCODE is a register-based intermediate language. This ability to interpret or compile ZCODE encouraged the porting of ALGOL 68C to numerous different computer platforms.
Optimizing compilers 
Compiler optimization is the process of improving the quality of object code without changing the results it produces.
The developers of the first FORTRAN compiler aimed to generate code that was better than the average hand-coded assembler, so that customers would actually use their product. In one of the first real compilers, they often succeeded.
Later compilers, like IBM's Fortran IV compiler, placed more priority on good diagnostics and executing more quickly, at the expense of object code optimization. It wasn't until the IBM 360 series that IBM provided two separate compilers: a fast executing code checker, and a slower optimizing one.
Frances E. Allen, working alone and jointly with John Cocke, introduced many of the concepts for optimization. Allen's 1966 paper, Program Optimization, introduced the use of graph data structures to encode program content for optimization. Her 1970 papers, Control Flow Analysis and A Basis for Program Optimization established intervals as the context for efficient and effective data flow analysis and optimization. Her 1971 paper with Cocke, A Catalogue of Optimizing Transformations, provided the first description and systematization of optimizing transformations. Her 1973 and 1974 papers on interprocedural data flow analysis extended the analysis to whole programs. Her 1976 paper with Cocke describes one of the two main analysis strategies used in optimizing compilers today.
Allen developed and implemented her methods as part of compilers for the IBM 7030 Stretch-Harvest and the experimental Advanced Computing System. This work established the feasibility and structure of modern machine- and language-independent optimizers. She went on to establish and lead the PTRAN project on the automatic parallel execution of FORTRAN programs. Her PTRAN team developed new parallelism detection schemes and created the concept of the program dependence graph, the primary structuring method used by most parallelizing compilers.
Programming Languages and their Compilers by John Cocke and Jacob T. Schwartz, published early in 1970, devoted more than 200 pages to optimization algorithms. It included many of the now familiar techniques such as redundant code elimination and strength reduction.
Peephole Optimization 
Peephole optimization is a very simple but effective optimization technique. It was invented by William McKeeman and published in 1965 in CACM. It was used in the XPL compiler that McKeeman helped develop.
Capex COBOL Optimizer 
Capex Corporation developed the "COBOL Optimizer" in the mid 1970s for COBOL. This type of optimizer depended, in this case, upon knowledge of 'weaknesses' in the standard IBM COBOL compiler, and actually replaced (or patched) sections of the object code with more efficient code. The replacement code might replace a linear table lookup with a binary search for example or sometimes simply replace a relatively 'slow' instruction with a known faster one that was otherwise functionally equivalent within its context. This technique is now known as "Strength reduction". For example on the IBM/360 hardware the CLI instruction was, depending on the particular model, between twice and 5 times as fast as a CLC instruction for single byte comparisons.
Modern compilers typically provide optimization options, so programmers can choose whether or not to execute an optimization pass.
When a compiler is given a syntactically incorrect program, a good, clear error message is helpful. From the perspective of the compiler writer, it is often difficult to achieve.
The WATFIV Fortran compiler was developed at the University of Waterloo, Canada in the late 1960s. It was designed to give better error messages than IBM's Fortran compilers of the time. In addition, WATFIV was far more usable, because it combined compiling, linking and execution into one step, whereas IBM's compilers had three separate components to run.
PL/C was a computer programming language developed at Cornell University in the early 1970s. While PL/C was a subset of IBM's PL/I language, it was designed with the specific goal of being used for teaching programming. The two researchers and academic teachers who designed PL/C were Richard W. Conway and Thomas R. Wilcox. They submitted the famous article "Design and implementation of a diagnostic compiler for PL/I" published in the Communications of ACM in March 1973.
PL/C eliminated some of the more complex features of PL/I, and added extensive debugging and error recovery facilities. The PL/C compiler had the unusual capability of never failing to compile any program, through the use of extensive automatic correction of many syntax errors and by converting any remaining syntax errors to output statements.
Notable compilers 
- Amsterdam Compiler Kit
- Berkeley Pascal, created by Ken Thompson in 1975.
- GNU Compiler Collection, formerly the GNU C Compiler. First created by Richard Stallman in 1987, GCC is the major compiler used to build Linux.
- Turbo Pascal, created by Anders Hejlsberg, first released in 1983.
- WATFOR, created at the University of Waterloo. One of the first popular educational compilers, although now largely obsolete.
See also 
- History of programming languages
- Lex (and Flex lexical analyser), the token parser commonly used in conjunction with yacc (and Bison).
- BNF, a metasyntax used to express context-free grammar: that is, a formal way to describe formal languages.
-  The World's First COBOL Compilers
- Backus et al. "The FORTRAN automatic coding system", Proc. AFIPS 1957 Western Joint Comuter Conf., Spartan Books, Baltimore 188–198
-  Rosen, Saul. ALTAC, FORTRAN, and compatibility. Proceedings of the 1961 16th ACM national meeting
- T. Hart and M. Levin "The New Compiler", AIM-39 CSAIL Digital Archive – Artificial Intelligence Laboratory Series
- Tim Hart and Mike Levin. "AI Memo 39-The new compiler". Retrieved 2008-05-23.
- Chomsky, Noam (Sep 1956). "Three models for the description of language". Information Theory, IEEE Transactions 2 (3): 113–124. doi:10.1109/TIT.1956.1056813. Retrieved 2007-06-18.
- Knuth, Donald. "On the Translation of Languages from Left to Right". Retrieved 29 May 2011.
- Korenjak, A. “A Practical Method for Constructing LR(k) Processors,” Communications of the ACM, Vol. 12, No. 11, 1969
- DeRemer, F. Practical Translators for LR(k) Languages. PhD dissertation, MIT, 1969.
- DeRemer, F. “Simple LR(k) Grammars,” Communications of the ACM, Vol. 14, No. 7, 1971.
- Thomas J Pennello (1986). "Very fast LR parsing". ACM SIGPLAN Notices 21 (7).
- G.H. Roberts (1988). "Recursive ascent: an LR analog to recursive descent".
- Leermakers, Augusteijn, Kruseman Aretz (1992). "A functional LR parser".
- A.A. Grau, "Recursive processes and ALGOL translation", Commun. ACM, 4, No. 1, pp. 10–15. Jan. 1961
- Edgar T. Irons, "A syntax-directed compiler for ALGOL 60", Commun. ACM, 4, No. 1, Jan. 1961, pp. 51–55.
- P. M. Lewis, R. E. Stearns, "Syntax directed transduction," focs, pp.21–35, 7th Annual Symposium on Switching and Automata Theory (SWAT 1966), 1966
- Lewis, P. and Stearns, R. “Syntax-Directed Transduction,” Journal of the ACM, Vol. 15, No. 3, 1968.
- "The PL/0 compiler/interpreter".
- J. Earley, "An efficient context-free parsing algorithm", Communications of the Association for Computing Machinery, 13:2:94-102, 1970.
- Backus, J.W. (1959). "The syntax and semantics of the proposed international algebraic language of the Zurich ACM-GAMM Conference". Proceedings of the International Conference on Information Processing. UNESCO. pp. 125–132.
- Farrell, James A. (August 1995). "Compiler Basics". Extended Backus Naur Form. Retrieved 11 May 2011.
- Donald E. Knuth, "Backus Normal Form vs. Backus Naur Form", Commun. ACM, 7(12):735–736, 1964.
- R. M. McClure, TMG—A Syntax Directed Compiler Proc. 20th ACM National Conf. (1965), pp. 262–274.
- ACM – Paper on META II
-  Dennis M. Ritchie. The Development of the C Language
- McKeeman, William Marshall; Horning, James J.; and Wortman, David B., A Compiler Generator (1971), ISBN 978-0-13-155077-3.
- Computer Science Department, University of Toronto, "The XPL Programming Language"
- Johnson, S.C., “Yacc – Yet Another Compiler Compiler”, Computing Science Technical Report 32, AT&T Bell Labs, 1975
- F.E. Allen. Program optimization. In Mark I. Halpern and Christopher J. Shaw, editors, Annual Review in Automatic Programming, volume 5, pages 239–307. Pergamon Press, New York, 1969.
- Frances E. Allen and John Cocke. Graph theoretic constructs for program control flow analysis. Technical Report IBM Res. Rep. RC 3923, IBM T.J. Watson Research Center, Yorktown Heights, NY, 1972.
- Frances E. Allen. Control flow analysis. ACM SIGPLAN Notices, 5(7):1–19, July 1970.
- Frances E. Allen. A basis for program optimization. In Proc. IFIP Congress 71, pages 385–390. North-Holland, 1972.
- Frances E. Allen and John Cocke. A catalogue of optimizing transformations. In R. Rustin, editor, Design and Optimization of Compilers, pages 1–30. Prentice-Hall, 1971.
- Frances E. Allen. Interprocedural data flow analysis. In Proc. IFIP Congress 74, pages 398–402. North-Holland, 1974.
- Frances E. Allen. A method for determining program data relationships. In Andrei Ershov and Valery A. Nepomniaschy, editors, Proc. International Symposium on Theoretical Programming, Novosibirsk, USSR, August 1972, volume 5 of Lecture Notes in Computer Science, pages 299–308. Springer-Verlag, 1974.
- Frances E. Allen and John Cocke. A program data flow analysis procedure. Communications of the ACM, 19(3):137–147, March 1976.
- Vivek Sarkar. The PTRAN Parallel Programming System. In Parallel Functional Programming Languages and Compilers, edited by B. Szymanski, ACM Press Frontier Series, pages 309–391, 1991.
- John Cocke and Jacob T. Schwartz, Programming Languages and their Compilers. Courant Institute of Mathematical Sciences, New York University, April 1970.
- MCKEEMAN, W.M. Peephole optimization. Commun. ACM 8, 7 (July 1965), 443–444
- CACM March 1973 pp 169–179.
Further reading 
- Backus, John, et al., "The FORTRAN Automatic Coding System", Proceedings of the Western Joint Computer Conference, Los Angeles, California, February 1957. Describes the design and implementation of the first FORTRAN compiler by the IBM team.
- Knuth, D. E., RUNCIBLE-algebraic translation on a limited computer, Communications of the ACM, Vol. 2, p. 18, (Nov. 1959).
- Irons, Edgar T., A syntax directed compiler for ALGOL 60, Communications of the ACM, Vol. 4, p. 51. (Jan. 1961)
- Dijkstra, Edsger W. (1961). "ALGOL 60 Translation: An ALGOL 60 Translator for the X1 and Making a Translator for ALGOL 60 (Technical report 35). Amsterdam: Mathematisch Centrum. http://www.cs.utexas.edu/users/EWD/MCReps/MR35.PDF.
- Conway, Melvin E., Design of a separable transition-diagram compiler, Communications of the ACM, Volume 6, Issue 7 (July 1963)
- Floyd, R. W., Syntactic analysis and operator precedence, Journal of the ACM, Vol. 10, p. 316. (July 1963).
- Cheatham, T. E., and Sattley, K., Syntax directed compilation, SJCC p. 31. (1964).
- Randell, Brian; Russell, Lawford John, ALGOL 60 Implementation: The Translation and Use of ALGOL 60 Programs on a Computer, Academic Press, 1964
- Knuth, D. E., On the translation of languages from left to right., Information and Control, Vol. 8, p. 607. (1965).
- Cocke, John; Schwartz, Jacob T., Programming Languages and their Compilers: Preliminary Notes, Courant Institute of Mathematical Sciences technical report, New York University, 1969.
- Bauer, Friedrich L.; Eickel, Jürgen (Eds.), Compiler Construction, An Advanced Course, 2nd ed. Lecture Notes in Computer Science 21, Springer 1976, ISBN 3-540-07542-9
- Gries, David, Compiler Construction for Digital Computers, New York : Wiley, 1971. ISBN 0-471-32776-X