= Chomsky–Schützenberger enumeration theorem =

In formal language theory, the Chomsky–Schützenberger enumeration theorem is a theorem derived by Noam Chomsky and Marcel-Paul Schützenberger about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

==Statement==
In order to state the theorem, a few notions from algebra and formal language theory are needed.

Let $\mathbb{N}$ denote the set of nonnegative integers. A power series over $\mathbb{N}$ is an infinite series of the form
$f = f(x) = \sum_{k=0}^\infty a_k x^k = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots$
with coefficients $a_k$ in $\mathbb{N}$. The product of two formal power series $f = \sum_{k} a_k x^k$ and $g = \sum_{k} b_k x^k$ is defined in the expected way, as the convolution of the coefficient sequences $(a_k)$ and $(b_k)$:

$f(x)\cdot g(x) = \sum_{k=0}^\infty \left(\sum_{i=0}^k a_i b_{k-i}\right) x^k.$
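
The convolution product can be illustrated on truncated series. The following is a minimal sketch, assuming a series is represented by a Python list of its first coefficients; the function name `multiply` and the truncation parameter `order` are illustrative choices, not part of the theorem.

```python
def multiply(a, b, order):
    """Cauchy product of two coefficient lists, truncated to x^0 .. x^(order-1)."""
    product = [0] * order
    for i, a_i in enumerate(a[:order]):
        # Only pairs with i + j < order contribute to the truncated result.
        for j, b_j in enumerate(b[:order - i]):
            product[i + j] += a_i * b_j
    return product

# f = 1 + x + x^2 + x^3 + x^4 (every coefficient equal to 1)
f = [1, 1, 1, 1, 1]
print(multiply(f, f, 5))  # [1, 2, 3, 4, 5]
```

Each output coefficient is exactly the inner sum $\sum_{i=0}^k a_i b_{k-i}$ from the definition.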

In particular, we write $f^2 = f(x)\cdot f(x)$, $f^3 = f(x)\cdot f(x)\cdot f(x)$, and so on. In analogy to algebraic numbers, a power series $f(x)$ is called algebraic over $\mathbb{Q}(x)$ if there exist polynomials $p_0(x), p_1(x), p_2(x), \ldots , p_n(x)$, not all zero and each with rational coefficients, such that

$p_0(x) + p_1(x) \cdot f + p_2(x)\cdot f^2 + \cdots + p_n(x)\cdot f^n = 0.$
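
As an illustration (a standard example, chosen here and not part of the statement), the generating function of the Catalan numbers $C_k = \frac{1}{k+1}\binom{2k}{k}$ is algebraic with $n = 2$, $p_0(x) = 1$, $p_1(x) = -1$ and $p_2(x) = x$:

$f(x) = \sum_{k=0}^\infty C_k x^k = 1 + x + 2x^2 + 5x^3 + \cdots, \qquad 1 - f + x\cdot f^2 = 0.$

Up to order $x^3$ this can be checked by hand: $x\cdot f^2 = x + 2x^2 + 5x^3 + \cdots$, which cancels every term of $1 - f$.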

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree
or, equivalently, only one leftmost derivation.
With these notions in place, the theorem can be stated as follows.

Chomsky–Schützenberger theorem. If $L$ is a context-free language over an alphabet $\Sigma$ admitting an unambiguous context-free grammar, and $a_k := | L \cap \Sigma^k |$ is the number of words of length $k$ in $L$, then $G(x)=\sum_{k = 0}^\infty a_k x^k$ is a power series over $\mathbb{N}$ that is algebraic over $\mathbb{Q}(x)$.

Proofs of this theorem are given by , and by .

==Usage==

===Asymptotic estimates===

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by : the unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

S → M | U
M → 0M1M | ε
U → 0S | 0M1U.

To obtain an algebraic representation of the power series $G(x)$ associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by x, each occurrence of ε by the integer 1, each occurrence of '→' by '=', and each occurrence of '|' by '+'. Concatenation on the right-hand side of each rule corresponds to multiplication in the equations thus obtained. This yields the following system of equations:

S = M + U
M = M²x² + 1
U = Sx + MUx²
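
The system determines the coefficients of S, M and U order by order, so they can be computed by iterating the equations on truncated series. The following is a minimal sketch under that assumption; the helper names (`mul`, `shift`, `add`, `solve`) and the list representation are illustrative, and iterating until nothing changes is just one simple way to reach the truncated solution.

```python
def mul(a, b, n):
    """Product of two coefficient lists, truncated to n terms."""
    out = [0] * n
    for i, a_i in enumerate(a[:n]):
        for j, b_j in enumerate(b[:n - i]):
            out[i + j] += a_i * b_j
    return out

def shift(a, k, n):
    """Multiply a series by x^k, truncated to n terms."""
    return ([0] * k + a)[:n]

def add(*series):
    """Coefficientwise sum of series of equal length."""
    return [sum(column) for column in zip(*series)]

def solve(n):
    """First n coefficients of S for S = M + U, M = M^2 x^2 + 1, U = S x + M U x^2."""
    one = [1] + [0] * (n - 1)
    S = M = U = [0] * n
    while True:
        M_new = add(shift(mul(M, M, n), 2, n), one)
        U_new = add(shift(S, 1, n), shift(mul(M_new, U, n), 2, n))
        S_new = add(M_new, U_new)
        if (S_new, M_new, U_new) == (S, M, U):
            return S  # the truncated equations now hold exactly
        S, M, U = S_new, M_new, U_new

print(solve(8))  # [1, 1, 2, 3, 6, 10, 20, 35]
```

The k-th entry is the number of words of length k generated by the grammar; for instance, the two words of length 2 are 00 and 01.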

In this system of equations, S, M, and U are functions of x, so one could also write $S(x)$, $M(x)$, and $U(x)$. Eliminating M and U and solving for S yields a single algebraic equation:

$x(2x-1)S^2 + (2x-1)S +1 = 0$.
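
The quadratic equation can be checked numerically on a truncated series. The sketch below assumes that the coefficients of $S(x)$ are the binomial coefficients $\binom{k}{\lfloor k/2 \rfloor}$, which agrees with the first terms $1, 1, 2, 3, 6, 10$ obtained from the system; the function name `residual` is an illustrative choice.

```python
from math import comb

def residual(n):
    """Coefficients of x(2x-1)S^2 + (2x-1)S + 1, truncated to n terms."""
    S = [comb(k, k // 2) for k in range(n)]  # assumed coefficients of S(x)
    # Truncated square of S via the convolution product.
    S2 = [sum(S[i] * S[k - i] for i in range(k + 1)) for k in range(n)]

    def times_2x_minus_1(a):
        shifted = [0] + a[:-1]  # coefficients of x * a(x)
        return [2 * s - c for s, c in zip(shifted, a)]

    quadratic = [0] + times_2x_minus_1(S2)[:-1]  # x * (2x - 1) * S^2
    linear = times_2x_minus_1(S)                 # (2x - 1) * S
    one = [1] + [0] * (n - 1)
    return [q + l + o for q, l, o in zip(quadratic, linear, one)]

print(residual(10))  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

A vanishing residual up to a given order is a consistency check, not a proof, but it is a quick way to catch sign errors when eliminating variables by hand.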

This quadratic equation has two solutions for S, one of which is the algebraic power series $G(x)$. By applying methods from complex analysis to this equation, the number $a_n$ of words of length n generated by G can be estimated, as n grows large. In this case, one obtains
$a_n \in O((2+\epsilon)^n)$ but $a_n \notin O((2-\epsilon)^n)$ for each $\epsilon>0$.
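
For this particular example the estimate can be made explicit. The following is a sketch of the computation, not taken from the cited source. The discriminant of the quadratic is $(2x-1)^2 - 4x(2x-1) = 1-4x^2$, and the branch that is a power series with $S(0)=1$ is

$S(x) = \frac{1}{2x}\left(\sqrt{\frac{1+2x}{1-2x}} - 1\right), \qquad a_n = \binom{n}{\lfloor n/2 \rfloor} \sim 2^n \sqrt{\frac{2}{\pi n}}.$

The singularity of $S(x)$ nearest to the origin with the strongest growth is at $x = 1/2$, which gives the exponential growth rate $2$.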

The following example is from :
$\left\{\begin{array}{l}
S \rightarrow XY \\
T \rightarrow aT \mid TbT \mid YcY \\
Y \rightarrow YaY \mid cY \mid abTaYYa \mid X \\
X \rightarrow a \mid b \mid c
\end{array}\right. \Rightarrow \left\{\begin{array}{l}
s(z)=x(z) y(z) \\
t(z)=z t(z)+z t(z)^2+z y(z)^2 \\
y(z)=z y(z)^2+z y(z)+z^4 t(z) y(z)^2+x(z) \\
x(z)=3 z
\end{array}\right.$
which simplifies to
$s(z)^8-27\left(z^3-z^2\right) s(z)^5+\ldots+59049 z^{10}=0$
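
The symbol-by-symbol translation from rules to equations used in both examples is mechanical. The following is a minimal sketch, assuming single-character symbols, a grammar given as a dictionary from nonterminals to lists of right-hand sides, and the empty string standing for ε; the names `rule_to_term` and `grammar_to_system` are hypothetical.

```python
def rule_to_term(symbols, nonterminals, var="z"):
    """Translate one right-hand side into a monomial string."""
    terminals = sum(1 for s in symbols if s not in nonterminals)
    factors = []
    if terminals == 1:
        factors.append(var)
    elif terminals > 1:
        factors.append(f"{var}^{terminals}")
    # Each nonterminal becomes its generating function; order is irrelevant
    # because the series commute.
    factors += [f"{s.lower()}({var})" for s in symbols if s in nonterminals]
    return "*".join(factors) if factors else "1"  # the empty word contributes 1

def grammar_to_system(grammar, var="z"):
    """Map each nonterminal to the right-hand side of its equation."""
    nonterminals = set(grammar)
    return {
        f"{head.lower()}({var})": " + ".join(
            rule_to_term(rhs, nonterminals, var) for rhs in alternatives
        )
        for head, alternatives in grammar.items()
    }

grammar = {
    "S": ["XY"],
    "T": ["aT", "TbT", "YcY"],
    "Y": ["YaY", "cY", "abTaYYa", "X"],
    "X": ["a", "b", "c"],
}
for lhs, rhs in grammar_to_system(grammar).items():
    print(lhs, "=", rhs)
```

The printed equations agree with the system above once repeated factors are collected into powers and `z + z + z` is written as $3z$.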

===Inherent ambiguity===

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous.
For example, the Goldstine language $L_G$ over the alphabet $\{a,b\}$ consists of the words
$a^{n_1}ba^{n_2}b\cdots a^{n_p}b$
with $p\ge 1$, $n_i\ge 0$ for $i \in \{1,2,\ldots,p\}$, and $n_j \neq j$ for some $j \in \{1,2,\ldots,p\}$.

It is comparatively easy to show that the language $L_G$ is context-free. The harder part is to show that no unambiguous grammar generates $L_G$. This can be proved as follows:
If $g_k$ denotes the number of words of length $k$ in $L_G$, then the associated power series is
$G(x) = \sum_{k=0}^\infty g_k x^k = \frac{1-x}{1-2x}- \sum_{k \ge 1} x^{k(k+1)/2-1}$.
Here $\frac{1-x}{1-2x}$ counts the words of $(a^*b)^*$, and the sum removes the empty word together with the words $aba^2b\cdots a^pb$, whose length is $p(p+3)/2 = k(k+1)/2-1$ for $k=p+1$.
Using methods from complex analysis, one can prove that this function is not algebraic over $\mathbb{Q}(x)$. By the Chomsky–Schützenberger theorem, one can conclude that $L_G$ does not admit an unambiguous context-free grammar.
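
A brute-force count gives a quick sanity check on a generating function of this kind. The sketch below rests on two assumptions: exponents $n_i \ge 0$ are allowed, so that the ambient set is $(a^*b)^+$, and the series to compare against is $\frac{1-x}{1-2x} - \sum_{k\ge 1} x^{k(k+1)/2-1}$. All function names are illustrative.

```python
from itertools import product

def in_goldstine(word):
    """Membership test, assuming exponents n_i >= 0 are allowed."""
    if not word or word[-1] != "b":
        return False
    blocks = word[:-1].split("b")  # the run of a's in front of each b
    return any(len(block) != j for j, block in enumerate(blocks, start=1))

def brute_force_counts(max_len):
    """Number of words of each length 0 .. max_len-1, by enumeration."""
    return [
        sum(in_goldstine("".join(w)) for w in product("ab", repeat=k))
        for k in range(max_len)
    ]

def series_counts(max_len):
    """Coefficients of (1-x)/(1-2x) - sum_{k>=1} x^(k(k+1)/2 - 1), for max_len >= 1."""
    # (1 - x)/(1 - 2x) = 1 + x + 2x^2 + 4x^3 + ...
    counts = [1] + [2 ** (k - 1) for k in range(1, max_len)]
    k = 1
    while k * (k + 1) // 2 - 1 < max_len:
        counts[k * (k + 1) // 2 - 1] -= 1  # remove one excluded word
        k += 1
    return counts

print(brute_force_counts(10))  # [0, 1, 1, 4, 8, 15, 32, 64, 128, 255]
print(series_counts(10) == brute_force_counts(10))  # True
```

The enumeration is exponential in the word length, so it is only useful for small lengths; it says nothing about algebraicity, which requires the analytic argument.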
