Chomsky–Schützenberger theorem

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In formal language theory, the Chomsky–Schützenberger theorem is either of two different theorems derived by Noam Chomsky and Marcel-Paul Schützenberger.

One of the two theorems is a statement about the number of words of a given length generated by an unambiguous context-free grammar. The theorem provides an unexpected link between the theory of formal languages and abstract algebra.

The other theorem, which bears the same name (Hotz & Kretschmer 1989), is a statement about representing a given context-free language in terms of two simpler languages. These two simpler languages, namely a regular language and a Dyck language, are combined by means of an intersection and a homomorphism.

The theorem about counting words[edit]

Statement of the theorem[edit]

In order to state the theorem, a few notions from algebra and formal language theory are needed.

A power series over \mathbb{N} is an infinite series of the form

f = f(x) = \sum_{k=0}^\infty a_k x^k = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots

with coefficients a_k in \mathbb{N}. The multiplication of two formal power series f and g is defined in the expected way as the convolution of the sequences a_n and b_n:

f(x)\cdot g(x) = \sum_{k=0}^\infty \left(\sum_{i=0}^k a_i b_{k-i}\right) x^k.

In particular, we write f^2 = f(x)\cdot f(x), f^3 = f(x)\cdot f(x)\cdot f(x), and so on. In analogy to algebraic numbers, a power series f(x) is called algebraic over \mathbb{Q}(x), if there exists a finite set of polynomials p_0(x), p_1(x), p_2(x), \ldots , p_n(x) each with rational coefficients such that

p_0(x) + p_1(x) \cdot f + p_2(x)\cdot f^2 + \cdots + p_n(x)\cdot f^n = 0.

A context-free grammar is said to be unambiguous if every string generated by the grammar admits a unique parse tree, or, equivalently, only one leftmost derivation. Having established the necessary notions, the theorem is stated as follows.

Chomsky–Schützenberger theorem. If L is a context-free language admitting an unambiguous context-free grammar, and a_k := | L \ \cap \Sigma^k | is the number of words of length k in L, then G(x)=\sum_{k = 0}^\infty a_k x^k is a power series over \mathbb{N} that is algebraic over \mathbb{Q}(x).

Proofs of this theorem are given by Kuich & Salomaa (1985), and by Panholzer (2005).

Usage of the theorem[edit]

Asymptotic estimates[edit]

The theorem can be used in analytic combinatorics to estimate the number of words of length n generated by a given unambiguous context-free grammar, as n grows large. The following example is given by Gruber, Lee & Shallit (2012): The unambiguous context-free grammar G over the alphabet {0,1} has start symbol S and the following rules

S → M | U
M → 0M1M | ε
U → S | 0M1U.

To obtain an algebraic representation of the power series G(x) associated with a given context-free grammar G, one transforms the grammar into a system of equations. This is achieved by replacing each occurrence of a terminal symbol by 'x', each occurrence of 'ε' by the integer '1', each occurrence of '→' by '=', and each occurrence of '|' by '+', respectively. The operation of concatenation at the right-hand-side of each rule corresponds to the multiplication operation in the equations thus obtained. This yields the following system of equations:

S = MU
M = M²x² + 1
U = Sx + MUx²

In this system of equations, S, M, and U are functions of x, so one could also write S(x), M(x), and U(x). The equation system can be resolved after S, resulting in a single algebraic equation:

x(2x-1)S^2 + (2x-1)S +1 = 0.

This quadratic equation has two solutions for S, one of which is the algebraic power series G(x). By applying methods from complex analysis to this equation, the number a_n of words of length n generated by G can be estimated, as n grows large. In this case, one obtains a_n \in O(2+\epsilon)^n but a_n \notin O(2-\epsilon)^n for each \epsilon>0. See (Gruber, Lee & Shallit 2012) for a detailed exposition.

Inherent Ambiguity[edit]

In classical formal language theory, the theorem can be used to prove that certain context-free languages are inherently ambiguous. For example, the Goldstine language L_G over the alphabet \{a,b\} consists of the words a^{n_1}ba^{n_2}b\cdots a^{n_p}b with p\ge 1, n_i>0 for i \in \{1,2,\ldots,p\}, and n_j \neq j for some j \in \{1,2,\ldots,p\}.

It is comparably easy to show that the language L_G is context-free (Berstel & Boasson 1990). The harder part is to show that there does not exist an unambiguous grammar that generates L_G. This can be proved as follows: If g_k denotes the number of words of length k in L_G, then for the associated power series holds G(x) = \sum_{k=0}^\infty g_k x^k = \frac{1-x}{1-2x}- \frac1x \sum_{k \ge 1} x^{k(k+1)/2-1} . Using methods from complex analysis, one can prove that this function is not algebraic over \mathbb{Q}(x). By the Chomsky-Schützenberger theorem, one can conclude that L_G does not admit an unambiguous context-free grammar. See (Berstel & Boasson 1990) for detailed account.

The theorem about representing context-free languages[edit]

Also for the other theorem bearing this name, a few notions from formal language theory are in order. A context-free language is regular, if can be described by a regular expression, or, equivalently, if it is accepted by a finite automaton. A homomorphism is based on a function h which maps symbols from an alphabet \Gamma to words over another alphabet \Sigma; If the domain of this function is extended to words over \Gamma in the natural way, by letting h(xy)=h(x)h(y) for all words x and y, this yields a homomorphism h:\Gamma^*\to \Sigma^*. A matched alphabet T \cup \overline T is an alphabet with two equal-sized sets; it is convenient to think of it as a set of parentheses types, where T contains the opening parenthesis symbols, whereas the symbols in \overline T contains the closing parenthesis symbols. For a matched alphabet T \cup \overline T, the Dyck language D_T is given by

D_T = \{\,w \in (T \cup \overline T)^* \mid w \text{ is a correctly nested sequence of parentheses} \,\}

words that are well-nested parentheses over T \cup \overline T.

Chomsky–Schützenberger theorem. A language L over the alphabet \Sigma is context-free if and only if there exists
  • a matched alphabet T \cup \overline T
  • a regular language R over T \cup \overline T,
  • and a homomorphism h : (T \cup \overline T)^* \to \Sigma^*
such that L = h(D_T \cap R).

Proofs of this theorem are given by Hotz & Kretschmer (1989) and Autebert, Berstel & Boasson (1997).


Autebert, Jean-Michel; Berstel, Jean; Boasson, Luc (1997). "Context-Free Languages and Push-Down Automata". In G. Rozenberg and A. Salomaa, eds., Handbook of Formal Languages, Vol. 1: Word, Language, Grammar (pp. 111–174). Berlin: Springer-Verlag. ISBN 3-540-60420-0. 
Berstel, Jean; Boasson, Luc (1990). "Context-free languages". In van Leeuwen, Jan. Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics. Elsevier and MIT press. pp. 59–102. ISBN 0-444-88074-7. 
Chomsky, Noam; Schützenberger, Marcel-Paul (1963). "The Algebraic Theory of Context-Free Languages". In P. Braffort and D. Hirschberg, eds., Computer Programming and Formal Systems (pp. 118–161). Amsterdam: North-Holland. 
Flajolet, Philippe; Sedgewick, Robert (2009). Analytic Combinatorics. Cambridge: Cambridge University Press. ISBN 978-0-521-89806-5. 
Gruber, Hermann; Lee, Jonathan; Shallit, Jeffrey (2012). "Enumerating regular expressions and their languages". arXiv:1204.4982 [cs.FL].
Hotz, G.; Kretschmer, T. (1989). "The power of the Greibach normal form". Elektronische Informationsverarbeitung und Kybernetik 25 (10): 507–512. 
Kuich, Werner; Salomaa, Arto (1985). Semirings, Automata, Languages. Berlin: Springer-Verlag. ISBN 978-3-642-69961-0. 
Panholzer, Alois (2005). "Gröbner Bases and the Defining Polynomial of a Context-free Grammar Generating Function". Journal of Automata, Languages and Combinatorics 10: 79–97.