# LL grammar

The C grammar[1] is not LL(1): The bottom part shows a parser that has digested the tokens "int v ;main(){" and is about to choose a rule to derive the nonterminal "Stmt". Looking only at the first lookahead token "v", it cannot decide which of the two alternatives for "Stmt" to choose, since two input continuations are possible. They can be discriminated by peeking at the second lookahead token (yellow background).

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right and constructs a Leftmost derivation of the sentence (hence LL, in contrast to an LR parser, which constructs a rightmost derivation). A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

LL parsers are table-based parsers, similar to LR parsers. LL grammars can alternatively be characterized as precisely those that can be parsed by a predictive parser – a recursive descent parser without backtracking – and these can be readily written by hand. This article is about the formal properties of LL grammars; for parsing, see LL parser or recursive descent parser.
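To make the notion of a predictive parser concrete, the following is a minimal sketch of a recursive descent parser without backtracking. The toy grammar, the token spelling (`"id"`, `"+"`, `"("`, `")"`), the end marker `"$"`, and the names `PredictiveParser` and `accepts` are all assumptions chosen for illustration; they do not come from the cited sources.

```python
class PredictiveParser:
    """Recursive descent without backtracking for the toy grammar
         E  -> T E'
         E' -> '+' T E' | ε
         T  -> 'id' | '(' E ')'
    Every decision is made by inspecting one lookahead token only.
    """

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" is an assumed end marker
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def parse(self):
        self.expr()
        self.expect("$")

    def expr(self):            # E -> T E'
        self.term()
        self.expr_rest()

    def expr_rest(self):       # E' -> '+' T E' | ε
        if self.peek() == "+":
            self.expect("+")
            self.term()
            self.expr_rest()
        # any other lookahead selects the ε alternative

    def term(self):            # T -> 'id' | '(' E ')'
        if self.peek() == "id":
            self.expect("id")
        elif self.peek() == "(":
            self.expect("(")
            self.expr()
            self.expect(")")
        else:
            raise SyntaxError(f"unexpected token {self.peek()!r}")


def accepts(tokens):
    """Return True if the token list is a sentence of the toy grammar."""
    try:
        PredictiveParser(tokens).parse()
        return True
    except SyntaxError:
        return False
```

Each nonterminal becomes one function, and the `if` on `peek()` plays the role that a parse-table lookup plays in a table-based LL(1) parser.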

## Formal definition

Given a natural number ${\displaystyle k\geq 0}$, a context-free grammar ${\displaystyle G=(V,\Sigma ,R,S)}$ is an LL(k) grammar if

• for each terminal symbol string ${\displaystyle w\in \Sigma ^{*}}$ of length up to ${\displaystyle k}$ symbols,
• for each nonterminal symbol ${\displaystyle A\in V}$, and
• for each terminal symbol string ${\displaystyle w_{1}\in \Sigma ^{*}}$,

there is at most one production rule ${\displaystyle r\in R}$ such that for some terminal symbol strings ${\displaystyle w_{2},w_{3}\in \Sigma ^{*}}$,

• the string ${\displaystyle w_{1}Aw_{3}}$ can be derived from the start symbol ${\displaystyle S}$,
• ${\displaystyle w_{2}}$ can be derived from ${\displaystyle A}$ after first applying rule ${\displaystyle r}$, and
• the first ${\displaystyle k}$ symbols of ${\displaystyle w}$ and of ${\displaystyle w_{2}w_{3}}$ agree.[2]

Informally, when a parser has derived ${\displaystyle w_{1}Aw_{3}}$, with ${\displaystyle A}$ its leftmost nonterminal and ${\displaystyle w_{1}}$ already consumed from the input, then by looking at that ${\displaystyle w_{1}}$ and peeking at the next ${\displaystyle k}$ symbols ${\displaystyle w}$ of the current input, the parser can identify with certainty the production rule ${\displaystyle r}$ for ${\displaystyle A}$.

When rule identification is possible even without considering the past input ${\displaystyle w_{1}}$, then the grammar is called a strong LL(k) grammar.[3] In the formal definition of a strong LL(k) grammar, the universal quantifier for ${\displaystyle w_{1}}$ is omitted, and ${\displaystyle w_{1}}$ is added to the "for some" quantifier for ${\displaystyle w_{2},w_{3}}$. For every LL(k) grammar, a structurally equivalent strong LL(k) grammar can be constructed.[4]
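For k = 1 the LL and strong LL conditions coincide, so the usual FIRST/FOLLOW test can be used to check a grammar. The sketch below is one possible way to do this, assuming a reduced grammar (no useless symbols) given as a dictionary from nonterminals to lists of right-hand sides (tuples of symbols, with the empty tuple standing for ε). The names `first_of`, `ll1_conflicts`, `EPS` and `END` are hypothetical helpers introduced here.

```python
EPS = "ε"   # marker for "can derive the empty string"
END = "$"   # assumed end-of-input marker


def first_of(seq, first, nonterminals):
    """FIRST set of a symbol sequence, using the current FIRST table."""
    out = set()
    for sym in seq:
        if sym not in nonterminals:      # terminal: it starts the string
            out.add(sym)
            return out
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)                         # every symbol was nullable
    return out


def ll1_conflicts(grammar, start):
    """Return a list of (nonterminal, lookahead, rhs1, rhs2) conflicts.

    An empty list means the FIRST/FOLLOW test for LL(1) succeeds.
    """
    nts = set(grammar)
    first = {a: set() for a in nts}
    follow = {a: set() for a in nts}
    follow[start].add(END)

    changed = True
    while changed:                       # joint fixpoint for FIRST and FOLLOW
        changed = False
        for a, alts in grammar.items():
            for rhs in alts:
                new = first_of(rhs, first, nts)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in nts:
                        tail = first_of(rhs[i + 1:], first, nts)
                        add = tail - {EPS}
                        if EPS in tail:
                            add |= follow[a]
                        if not add <= follow[sym]:
                            follow[sym] |= add
                            changed = True

    conflicts = []
    for a, alts in grammar.items():
        chosen = {}                      # lookahead token -> rule it selects
        for rhs in alts:
            f = first_of(rhs, first, nts)
            predict = (f - {EPS}) | (follow[a] if EPS in f else set())
            for tok in predict:
                if tok in chosen:
                    conflicts.append((a, tok, chosen[tok], rhs))
                else:
                    chosen[tok] = rhs
    return conflicts
```

For example, `ll1_conflicts({"S": [("a", "b"), ("a", "c")]}, "S")` reports a conflict on lookahead `"a"`, since both rules for `S` begin with the same terminal.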

An alternative, but equivalent, formal definition is the following: ${\displaystyle G=(V,\Sigma ,R,S)}$ is an LL(k) grammar if, for arbitrary derivations

${\displaystyle {\begin{array}{ccccccc}S&\Rightarrow ^{L}&w_{1}A\chi &\Rightarrow &w_{1}\nu \chi &\Rightarrow ^{*}&w_{1}w_{2}w_{3}\\S&\Rightarrow ^{L}&w_{1}A\chi &\Rightarrow &w_{1}\omega \chi &\Rightarrow ^{*}&w_{1}w'_{2}w'_{3},\\\end{array}}}$

when the first ${\displaystyle k}$ symbols of ${\displaystyle w_{2}w_{3}}$ agree with those of ${\displaystyle w'_{2}w'_{3}}$, then ${\displaystyle \nu =\omega }$.[5][6]
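As a small worked instance of this definition (the grammar is an illustrative assumption, not taken from the cited sources), consider the grammar with the two rules ${\displaystyle S\rightarrow ab}$ and ${\displaystyle S\rightarrow ac}$. Reading ${\displaystyle \Rightarrow ^{L}}$ as allowing zero derivation steps, and taking ${\displaystyle w_{1}=\chi =\varepsilon }$, ${\displaystyle A=S}$, ${\displaystyle \nu =ab}$ and ${\displaystyle \omega =ac}$, one obtains the pair of derivations

${\displaystyle {\begin{array}{ccccccc}S&\Rightarrow ^{L}&S&\Rightarrow &ab&\Rightarrow ^{*}&ab\\S&\Rightarrow ^{L}&S&\Rightarrow &ac&\Rightarrow ^{*}&ac.\\\end{array}}}$

For ${\displaystyle k=1}$, the first symbol of ${\displaystyle w_{2}w_{3}=ab}$ agrees with the first symbol of ${\displaystyle w'_{2}w'_{3}=ac}$, yet ${\displaystyle \nu \neq \omega }$, so the grammar is not LL(1). For ${\displaystyle k=2}$, the first two symbols ${\displaystyle ab}$ and ${\displaystyle ac}$ differ, so the premise of the condition is never met by two different rules, and the grammar is LL(2).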

## Relation to other grammar classes

Allowing ε-rules increases the expressive power of a grammar: for every ε-free LL(k+1) grammar, there exists an LL(k) grammar with ε-rules that generates the same language.[7] Commonly, ε-free LL(k) grammars are used for LL(k) parsers.

The class of LL(k) languages forms a strictly increasing sequence of sets: LL(0) ⊊ LL(1) ⊊ LL(2) ⊊ ….[8] Since these are all DCFLs, a corollary is that for any fixed k, there are DCFLs that cannot be recognized by an LL(k) parser.

A generalization, called an LL(*) parser, is not restricted to a finite number k of tokens of lookahead, but can make parsing decisions by recognizing whether the following tokens belong to a regular language (for example, by use of a deterministic finite automaton). Accordingly, there are the set of LL(*) grammars and the set of LL(*) languages.[9] It appears to be still unclear where the latter set is located in the Chomsky hierarchy.
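The idea of deciding by regular lookahead can be sketched as follows. The rule `decl -> modifier* 'class' ID | modifier* 'interface' ID`, the modifier names, and the function `predict_decl` are hypothetical and chosen only for illustration. Python's `re` module (a backtracking engine, used here as a stand-in for a deterministic finite automaton) scans an unbounded run of modifiers before committing to an alternative; no fixed k separates the two alternatives of this rule as written.

```python
import re

# One regular lookahead language per alternative of the hypothetical rule
#   decl -> modifier* 'class' ID | modifier* 'interface' ID
# Tokens are joined with single spaces before matching.
CLASS_AHEAD = re.compile(r"(?:(?:public|static|final) )*class\b")
IFACE_AHEAD = re.compile(r"(?:(?:public|static|final) )*interface\b")


def predict_decl(tokens, pos=0):
    """Pick an alternative for `decl` by scanning arbitrarily far ahead."""
    rest = " ".join(tokens[pos:])
    if CLASS_AHEAD.match(rest):
        return "class_decl"
    if IFACE_AHEAD.match(rest):
        return "interface_decl"
    raise SyntaxError("no alternative of decl matches the lookahead")
```

A call such as `predict_decl(["public", "static", "class", "C"])` selects the first alternative regardless of how many modifiers precede the keyword.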

Every LL(k) grammar is also an LR(k) grammar. It is also decidable whether a given LR(k) grammar is also an LL(m) grammar for some m.[10] An ε-free LL(1) grammar is also an SLR(1) grammar. An LL(1) grammar with symbols that have both empty and non-empty derivations is also an LALR(1) grammar. An LL(1) grammar with symbols that have only the empty derivation may or may not be LALR(1).[11]

LL grammars cannot have rules containing left recursion.[12] Each LL(k) grammar that is ε-free can be transformed into an equivalent LL(k) grammar in Greibach normal form (which by definition does not have rules with left recursion).[13]
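Since left recursion rules a grammar out of every LL(k) class, a quick check for it is often useful. The following is a minimal sketch, assuming a reduced grammar in the same dictionary form as elsewhere (nonterminal mapped to a list of right-hand-side tuples, empty tuple for ε); the function name `left_recursive_nonterminals` is hypothetical. It also finds indirect left recursion, including recursion hidden behind nullable nonterminals.

```python
def left_recursive_nonterminals(grammar):
    """Return the set of nonterminals A with a derivation A =>+ A α."""
    nts = set(grammar)

    # 1. Nonterminals that can derive the empty string.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for a, alts in grammar.items():
            if a not in nullable and any(
                all(s in nullable for s in rhs) for rhs in alts
            ):
                nullable.add(a)
                changed = True

    # 2. reach[a]: nonterminals that can appear leftmost after one step.
    reach = {a: set() for a in nts}
    for a, alts in grammar.items():
        for rhs in alts:
            for s in rhs:
                if s not in nts:
                    break              # a terminal blocks the left edge
                reach[a].add(s)
                if s not in nullable:
                    break              # a non-nullable symbol blocks it too

    # 3. Transitive closure of the leftmost-reachability relation.
    changed = True
    while changed:
        changed = False
        for a in nts:
            new = set().union(*(reach[b] for b in reach[a])) - reach[a]
            if new:
                reach[a] |= new
                changed = True

    return {a for a in nts if a in reach[a]}
```

For the grammar `{"E": [("E", "+", "T"), ("T",)], "T": [("id",)]}` the function returns `{"E"}`; rewriting the rules as `E -> T E'`, `E' -> + T E' | ε` removes the left recursion.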

### Simple deterministic languages

A context-free grammar is called simple deterministic,[14] or just simple,[15] if

• it is in Greibach normal form (i.e. each rule has the form ${\displaystyle Z\rightarrow aY_{1}\ldots Y_{n},n\geq 0}$), and
• different right hand sides for the same nonterminal ${\displaystyle Z}$ always start with different terminals ${\displaystyle a}$.

A set of strings is called a simple deterministic, or just simple, language, if it has a simple deterministic grammar.
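The two conditions of this definition translate directly into a syntactic check. The sketch below assumes the grammar is given as a dictionary from nonterminals to lists of right-hand-side tuples, with every symbol that is not a key treated as a terminal; the function name `is_simple_deterministic` is a hypothetical helper.

```python
def is_simple_deterministic(grammar):
    """Check the two conditions of a simple deterministic grammar."""
    nts = set(grammar)
    for alts in grammar.values():
        leading = set()                  # first terminals seen for this nonterminal
        for rhs in alts:
            # Greibach normal form: one terminal, then only nonterminals.
            if not rhs or rhs[0] in nts:
                return False
            if any(sym not in nts for sym in rhs[1:]):
                return False
            # Alternatives of one nonterminal must start with distinct terminals.
            if rhs[0] in leading:
                return False
            leading.add(rhs[0])
    return True
```

For instance, `{"S": [("a", "S", "B"), ("b",)], "B": [("b",)]}` passes the check, while `{"S": [("a", "S"), ("a",)]}` fails because both alternatives for `S` start with `a`.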

The class of languages having an ε-free LL(1) grammar in Greibach normal form equals the class of simple deterministic languages.[16] This language class includes the regular sets not containing ε.[15] Equivalence is decidable for it, while inclusion is not.[14]

## Applications

LL grammars, particularly LL(1) grammars, are of great practical interest, as they are easy to parse, either by LL parsers or by recursive descent parsers, and many computer languages are designed to be LL(1) for this reason. Languages based on grammars with a high value of k have traditionally been considered difficult to parse, although this is less true now given the availability and widespread use of parser generators supporting LL(k) grammars for arbitrary k.

## Notes

1. ^ Brian W. Kernighan and Dennis M. Ritchie (Apr 1988). The C Programming Language. Prentice Hall Software Series (2nd ed.). Englewood Cliffs/NJ: Prentice Hall. ISBN 978-0131103627. Appendix A.13 "Grammar", p.193 ff. The top image part shows a simplified excerpt in an EBNF-like notation.
2. ^ Rosenkrantz & Stearns (1970, p. 227). Def.1. The authors do not consider the case k=0.
3. ^ Rosenkrantz & Stearns (1970, p. 235) Def.2
4. ^ Rosenkrantz & Stearns (1970, p. 235) Theorem 2
5. ^ where "${\displaystyle \Rightarrow ^{L}}$" denotes derivability by leftmost derivations, and ${\displaystyle w_{1},w_{2},w_{3},w'_{2},w'_{3}\in \Sigma ^{*}}$, ${\displaystyle A\in V}$, and ${\displaystyle \chi ,\nu ,\omega \in (\Sigma \cup V)^{*}}$
6. ^ Waite & Goos (1984, p. 123) Def. 5.22
7. ^ Rosenkrantz & Stearns (1970, p. 242)
8. ^ Rosenkrantz & Stearns (1970, pp. 246–247): Using "${\displaystyle +}$" to denote "or", the string set ${\displaystyle \{a^{n}(b^{k}d+b+cc)^{n}:n\geq 1\}}$ has an ${\displaystyle LL(k+1)}$, but no ε-free ${\displaystyle LL(k)}$ grammar, for each ${\displaystyle k\geq 1}$.
9. ^ Parr & Fisher (2011)
10. ^ Rosenkrantz & Stearns (1970, pp. 254–255)
11. ^ Beaty (1982)
12. ^ Rosenkrantz & Stearns (1970, p. 241) Lemma 5
13. ^ Rosenkrantz & Stearns (1970, p. 242) Theorem 4
14. ^ a b Korenjak & Hopcroft (1966)
15. ^ a b Hopcroft & Ullman (1979, p. 229) Exercise 9.3
16. ^ Rosenkrantz & Stearns (1970, p. 243)