# Chomsky–Schützenberger theorem

In formal language theory, the Chomsky–Schützenberger theorem is either of two different theorems derived by Noam Chomsky and Marcel-Paul Schützenberger.

One of the two theorems is a statement about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

The other theorem, which bears the same name (Hotz & Kretschmer 1989), is a statement about representing a given context-free language in terms of two simpler languages. These two simpler languages, namely a regular language and a Dyck language, are combined by means of an intersection and a homomorphism.

## The theorem about counting words

### Statement of the theorem

In order to state the theorem, a few notions from algebra and formal language theory are needed.

A power series over $\mathbb{N}$ is an infinite series of the form

$f = f(x) = \sum_{k=0}^\infty a_k x^k = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots$

with coefficients $a_k$ in $\mathbb{N}$. The product of two formal power series $f(x) = \sum_{k=0}^\infty a_k x^k$ and $g(x) = \sum_{k=0}^\infty b_k x^k$ is defined in the expected way, as the convolution of the coefficient sequences $(a_k)$ and $(b_k)$:

$f(x)\cdot g(x) = \sum_{k=0}^\infty \left(\sum_{i=0}^k a_i b_{k-i}\right) x^k.$
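
As an illustration, the convolution formula can be sketched in Python for truncated power series, represented here as lists of their first coefficients. The function name `multiply` and the list representation are assumptions made for this example; they are not notation from the cited sources.

```python
def multiply(f, g, n_terms):
    """Return the first n_terms coefficients of the product f(x) * g(x).

    f and g are coefficient lists: f[k] is the coefficient of x**k.
    Coefficients beyond the end of a list are treated as zero.
    """
    product = [0] * n_terms
    for k in range(n_terms):
        # The coefficient of x**k is the sum of f[i] * g[k - i] for i = 0..k.
        for i in range(k + 1):
            if i < len(f) and k - i < len(g):
                product[k] += f[i] * g[k - i]
    return product


# (1 + x)^2 = 1 + 2x + x^2
print(multiply([1, 1], [1, 1], 4))  # prints [1, 2, 1, 0]
```

Because only finitely many terms contribute to each coefficient, the product of two power series over $\mathbb{N}$ is again a power series over $\mathbb{N}$.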

In particular, we write $f^2 = f(x)\cdot f(x)$, $f^3 = f(x)\cdot f(x)\cdot f(x)$, and so on. In analogy to algebraic numbers, a power series $f(x)$ is called algebraic over $\mathbb{Q}(x)$ if there exist finitely many polynomials $p_0(x), p_1(x), p_2(x), \ldots , p_n(x)$ with rational coefficients, not all zero, such that

$p_0(x) + p_1(x) \cdot f + p_2(x)\cdot f^2 + \cdots + p_n(x)\cdot f^n = 0.$
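
Two standard examples, which are not taken from the sources cited here, may help to fix ideas. The geometric series $f(x) = \sum_{k=0}^\infty 2^k x^k$ satisfies

$-1 + (1-2x)\cdot f = 0,$

so it is algebraic with $n = 1$, $p_0(x) = -1$ and $p_1(x) = 1-2x$. Similarly, the series $c(x) = \sum_{k=0}^\infty C_k x^k$ of the Catalan numbers $C_k = \frac{1}{k+1}\binom{2k}{k}$ satisfies the quadratic equation

$1 - c + x\cdot c^2 = 0,$

so it is algebraic with $n = 2$, $p_0(x) = 1$, $p_1(x) = -1$ and $p_2(x) = x$.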

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree or, equivalently, exactly one leftmost derivation. With these notions in place, the theorem can be stated as follows.

Chomsky–Schützenberger theorem. If $L$ is a context-free language admitting an unambiguous context-free grammar, and $a_k := | L \ \cap \Sigma^k |$ is the number of words of length $k$ in $L$, then $G(x)=\sum_{k = 0}^\infty a_k x^k$ is a power series over $\mathbb{N}$ that is algebraic over $\mathbb{Q}(x)$.

Proofs of this theorem are given by Kuich & Salomaa (1985), and by Panholzer (2005).

### Usage of the theorem

#### Asymptotic estimates

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by Gruber, Lee & Shallit (2012): The unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

S → M | U
M → 0M1M | ε
U → 0S | 0M1U.

To obtain an algebraic representation of the power series G(x) associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by 'x', each occurrence of 'ε' by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+'. Concatenation on the right-hand side of each rule corresponds to multiplication in the resulting equations. This yields the following system of equations:

S = M + U
M = M²x² + 1
U = Sx + MUx²

In this system of equations, S, M, and U are functions of x, so one could also write S(x), M(x), and U(x). Eliminating M and U and solving the system for S results in a single algebraic equation:

x(2x−1)S² + (2x−1)S + 1 = 0.
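
The derivation can be checked numerically. The following Python sketch solves the system S = M + U, M = M²x² + 1, U = Sx + MUx² by fixed-point iteration on truncated power series, and then tests whether the resulting coefficients of S satisfy the quadratic equation up to the truncation order. The helper names (`mul`, `add`, `shift`, `solve_system`, `quadratic_residual`) and the iteration scheme are choices made for this illustration; they are not taken from Gruber, Lee & Shallit (2012).

```python
def mul(f, g):
    """Product of two truncated power series of equal length n (terms below x**n)."""
    n = len(f)
    h = [0] * n
    for i in range(n):
        for j in range(n - i):
            h[i + j] += f[i] * g[j]
    return h


def add(f, g):
    """Sum of two truncated power series of equal length."""
    return [a + b for a, b in zip(f, g)]


def shift(f, k):
    """Multiply by x**k, dropping terms of degree >= len(f)."""
    return ([0] * k + f)[:len(f)]


def solve_system(n):
    """First n coefficients of S, M and U, obtained by iterating the equations."""
    one = [1] + [0] * (n - 1)
    S = M = U = [0] * n
    # Each round fixes at least one more coefficient, so n + 1 rounds suffice.
    for _ in range(n + 1):
        M = add(shift(mul(M, M), 2), one)           # M = M^2 x^2 + 1
        U = add(shift(S, 1), shift(mul(M, U), 2))   # U = S x + M U x^2
        S = add(M, U)                               # S = M + U
    return S, M, U


def quadratic_residual(S):
    """Coefficients of x(2x-1)S^2 + (2x-1)S + 1, truncated like S."""
    n = len(S)
    one = [1] + [0] * (n - 1)
    x = ([0, 1] + [0] * n)[:n]
    two_x_minus_1 = ([-1, 2] + [0] * n)[:n]
    quadratic_term = mul(mul(x, two_x_minus_1), mul(S, S))
    linear_term = mul(two_x_minus_1, S)
    return add(add(quadratic_term, linear_term), one)


S, M, U = solve_system(8)
print(S)                      # prints [1, 1, 2, 3, 6, 10, 20, 35]
print(quadratic_residual(S))  # prints [0, 0, 0, 0, 0, 0, 0, 0]
```

A residual consisting only of zeros indicates agreement up to the chosen truncation order; it is a sanity check and does not replace the algebraic elimination.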

This quadratic equation has two solutions for S, one of which is the algebraic power series G(x). By applying methods from complex analysis to this equation, the number $a_n$ of words of length n generated by G can be estimated as n grows large. In this case, one obtains $a_n \in O((2+\epsilon)^n)$ but $a_n \notin O((2-\epsilon)^n)$ for each $\epsilon>0$. See Gruber, Lee & Shallit (2012) for a detailed exposition.
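
The growth rate can also be observed empirically. The sketch below extracts a recurrence for the coefficients $a_n$ by comparing the coefficients of $x^n$ in the quadratic equation above, and then evaluates $a_n^{1/n}$ for increasing n. The function names are hypothetical, and the computation only illustrates the asymptotic statement; it is not the complex-analytic argument of Gruber, Lee & Shallit (2012).

```python
import math


def word_counts(n_terms):
    """Coefficients a_0, ..., a_{n_terms - 1} of the power-series solution S.

    Comparing coefficients of x**n in x(2x-1)S^2 + (2x-1)S + 1 = 0 gives
    a_0 = 1 and a_n = 2*a_{n-1} + 2*c_{n-2} - c_{n-1} for n >= 1,
    where c_k denotes the coefficient of x**k in S^2.
    """
    a = [1]

    def c(k):
        # Coefficient of x**k in S^2; only uses a_0, ..., a_k.
        return sum(a[i] * a[k - i] for i in range(k + 1)) if k >= 0 else 0

    for n in range(1, n_terms):
        a.append(2 * a[n - 1] + 2 * c(n - 2) - c(n - 1))
    return a


def nth_root(value, n):
    """n-th root of a (possibly very large) positive integer, as a float."""
    return math.exp(math.log(value) / n)


a = word_counts(201)
for n in (10, 50, 100, 200):
    print(n, round(nth_root(a[n], n), 4))
```

The printed roots stay below 2 and move closer to it as n increases, which is consistent with an exponential growth rate of exactly 2.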

#### Inherent ambiguity

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language $L_G$ over the alphabet $\{a,b\}$ consists of the words $a^{n_1}ba^{n_2}b\cdots a^{n_p}b$ with $p\ge 1$, $n_i \ge 0$ for $i \in \{1,2,\ldots,p\}$, and $n_j \neq j$ for some $j \in \{1,2,\ldots,p\}$.

It is comparatively easy to show that the language $L_G$ is context-free (Berstel & Boasson 1990). The harder part is to show that there does not exist an unambiguous grammar that generates $L_G$. This can be proved as follows: every nonempty word ending in $b$ has the shape $a^{n_1}b\cdots a^{n_p}b$, and the only such words outside $L_G$ are $ab\,a^2b\cdots a^pb$ with $p \ge 1$, which have length $p(p+3)/2$. Hence, if $g_k$ denotes the number of words of length $k$ in $L_G$, the associated power series is $G(x) = \sum_{k=0}^\infty g_k x^k = \frac{1-x}{1-2x}- \sum_{k \ge 1} x^{k(k+1)/2-1}$, where the summand for $k=1$ cancels the constant term of the first fraction. Using methods from complex analysis, one can prove that this function is not algebraic over $\mathbb{Q}(x)$. By the Chomsky–Schützenberger theorem, one can conclude that $L_G$ does not admit an unambiguous context-free grammar. See Berstel & Boasson (1990) for a detailed account.
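
The coefficients of this series can be cross-checked by brute force for small lengths. The following Python sketch assumes the convention that the exponents $n_i$ may be zero, so that every nonempty word ending in $b$ has the required shape; under that assumption the number of words of length $k \ge 1$ should be $2^{k-1}$, minus one whenever $k = p(p+3)/2$ for some $p \ge 1$. The function names are hypothetical and chosen for this illustration only.

```python
from itertools import product


def in_goldstine(word):
    """Membership test for the Goldstine language over {a, b}.

    Assumes word = a^{n_1} b ... a^{n_p} b with p >= 1 and n_i >= 0;
    the word belongs to the language if n_j != j for some j.
    """
    if not word.endswith("b"):
        return False
    # Runs of a's in front of each b, in order: lengths n_1, ..., n_p.
    blocks = word[:-1].split("b")
    return any(len(block) != j for j, block in enumerate(blocks, start=1))


def brute_force_count(k):
    """Number of words of length k in the language, by enumeration."""
    return sum(in_goldstine("".join(w)) for w in product("ab", repeat=k))


def closed_form_count(k):
    """Coefficient of x**k predicted by the generating function."""
    if k == 0:
        return 0
    excluded = any(p * (p + 3) // 2 == k for p in range(1, k + 1))
    return 2 ** (k - 1) - (1 if excluded else 0)


for k in range(11):
    print(k, brute_force_count(k), closed_form_count(k))
```

The two columns agree for every listed length. The sparse correction at the lengths 2, 5, 9, ... is what prevents the series from being algebraic, although the sketch itself does not prove this.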

## The theorem about representing context-free languages

A few notions from formal language theory are also needed to state the other theorem bearing this name. A context-free language is regular if it can be described by a regular expression or, equivalently, if it is accepted by a finite automaton. A homomorphism is based on a function $h$ which maps symbols from an alphabet $\Gamma$ to words over another alphabet $\Sigma$; if the domain of this function is extended to words over $\Gamma$ in the natural way, by letting $h(xy)=h(x)h(y)$ for all words $x$ and $y$, this yields a homomorphism $h:\Gamma^*\to \Sigma^*$. A matched alphabet $T \cup \overline T$ is an alphabet consisting of two disjoint sets of equal size; it is convenient to think of it as a set of parenthesis types, where $T$ contains the opening parenthesis symbols and $\overline T$ contains the closing parenthesis symbols. For a matched alphabet $T \cup \overline T$, the Dyck language $D_T$ is given by

$D_T = \{\,w \in (T \cup \overline T)^* \mid w \text{ is a correctly nested sequence of parentheses} \,\}$
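
Membership in a Dyck language can be decided with a stack. The following Python sketch is one possible way to do this; the function name `is_dyck` and the encoding of the matched alphabet as a dictionary from opening to closing symbols are assumptions made for this example.

```python
def is_dyck(word, pairs):
    """Check whether word is a correctly nested sequence of parentheses.

    pairs maps each opening symbol in T to its closing symbol in the
    matched copy of T, for example {"(": ")", "[": "]"}.
    """
    closing = set(pairs.values())
    stack = []  # closing symbols that are still expected, innermost last
    for symbol in word:
        if symbol in pairs:
            stack.append(pairs[symbol])
        elif symbol in closing:
            if not stack or stack.pop() != symbol:
                return False
        else:
            return False  # symbol is not part of the matched alphabet
    return not stack


brackets = {"(": ")", "[": "]"}
print(is_dyck("([])[]", brackets))  # prints True
print(is_dyck("([)]", brackets))    # prints False
```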


Chomsky–Schützenberger theorem. A language $L$ over the alphabet $\Sigma$ is context-free if and only if there exist
• a matched alphabet $T \cup \overline T$,
• a regular language $R$ over $T \cup \overline T$,
• and a homomorphism $h : (T \cup \overline T)^* \to \Sigma^*$
such that $L = h(D_T \cap R)$.
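
As a small illustration, which is a textbook-style example and not taken from the cited proofs, the context-free language $\{a^n b^n \mid n \ge 0\}$ can be represented with a single parenthesis pair: take $T = \{(\}$, $\overline T = \{)\}$, let $R$ be the regular language described by the regular expression `(*)*` (all opening symbols before all closing symbols), and let $h$ map `(` to $a$ and `)` to $b$. The Python sketch below enumerates $h(D_T \cap R)$ up to a length bound; all names in it are hypothetical.

```python
import re
from itertools import product

OPEN, CLOSE = "(", ")"

# Regular language R: any number of opening symbols, then closing symbols.
R = re.compile(r"\(*\)*")

# Homomorphism h, given by its values on single symbols.
H = {OPEN: "a", CLOSE: "b"}


def is_dyck(word):
    """Correct nesting for a single parenthesis pair."""
    depth = 0
    for symbol in word:
        depth += 1 if symbol == OPEN else -1
        if depth < 0:
            return False
    return depth == 0


def h(word):
    """Extend H to words symbol by symbol, so that h(xy) = h(x)h(y)."""
    return "".join(H[symbol] for symbol in word)


def image_up_to(max_len):
    """All words h(w) with w in the intersection of D_T and R, len(w) <= max_len."""
    result = set()
    for n in range(max_len + 1):
        for letters in product(OPEN + CLOSE, repeat=n):
            w = "".join(letters)
            if is_dyck(w) and R.fullmatch(w):
                result.add(h(w))
    return result


print(sorted(image_up_to(6), key=len))  # prints ['', 'ab', 'aabb', 'aaabbb']
```

In this example the Dyck language enforces the matching of counts and the regular language enforces the order of the symbols.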

Proofs of this theorem are given by Hotz & Kretschmer (1989) and Autebert, Berstel & Boasson (1997).

## Bibliography

Autebert, Jean-Michel; Berstel, Jean; Boasson, Luc (1997). "Context-Free Languages and Push-Down Automata". In G. Rozenberg and A. Salomaa, eds., Handbook of Formal Languages, Vol. 1: Word, Language, Grammar (pp. 111–174). Berlin: Springer-Verlag. ISBN 3-540-60420-0.
Berstel, Jean; Boasson, Luc (1990). "Context-free languages". In van Leeuwen, Jan, ed., Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics (pp. 59–102). Elsevier and MIT Press. ISBN 0-444-88074-7.
Chomsky, Noam; Schützenberger, Marcel-Paul (1963). "The Algebraic Theory of Context-Free Languages". In P. Braffort and D. Hirschberg, eds., Computer Programming and Formal Systems (pp. 118–161). Amsterdam: North-Holland.
Gruber, Hermann; Lee, Jonathan; Shallit, Jeffrey (2012). "Enumerating regular expressions and their languages". arXiv:1204.4982 [cs.FL].
Hotz, G.; Kretschmer, T. (1989). "The power of the Greibach normal form". Elektronische Informationsverarbeitung und Kybernetik 25 (10): 507–512.
Kuich, Werner; Salomaa, Arto (1985). Semirings, Automata, Languages. Berlin: Springer-Verlag. ISBN 978-3-642-69961-0.
Panholzer, Alois (2005). "Gröbner Bases and the Defining Polynomial of a Context-free Grammar Generating Function". Journal of Automata, Languages and Combinatorics 10: 79–97.