Probabilistic context-free grammar

A stochastic context-free grammar (SCFG; also probabilistic context-free grammar, PCFG) is a context-free grammar in which each production is augmented with a probability. The probability of a derivation (parse) is then the product of the probabilities of the productions used in that derivation; thus some derivations are more consistent with the stochastic grammar than others. SCFGs extend context-free grammars in the same way that hidden Markov models extend regular grammars. SCFGs have applications in areas as diverse as natural language processing and the study of RNA molecules. SCFGs are a specialized form of weighted context-free grammars.

Techniques

A variant of the CYK algorithm finds the Viterbi parse of a sequence for a given SCFG. The Viterbi parse is the most likely derivation (parse) of the sequence by the given SCFG.
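
As an illustration, here is a minimal Python sketch of this variant of CYK (not taken from a reference implementation); it assumes a PCFG already in Chomsky normal form with start symbol "S", and the rule encoding and names are illustrative:

import math
from collections import defaultdict

def viterbi_parse_score(words, lexical, binary):
    # Probabilistic CYK: return the log-probability of the single most
    # likely derivation of `words` under a PCFG in Chomsky normal form.
    #   lexical: dict mapping word -> list of (A, p) for rules A -> word
    #   binary:  list of (A, B, C, p) for rules A -> B C
    n = len(words)
    best = defaultdict(dict)      # best[(i, j)][A] = best log-prob of words[i:j]
    for i, w in enumerate(words):
        for a, p in lexical.get(w, []):
            best[(i, i + 1)][a] = math.log(p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # split point
                for a, b, c, p in binary:
                    if b in best[(i, k)] and c in best[(k, j)]:
                        s = math.log(p) + best[(i, k)][b] + best[(k, j)][c]
                        if s > best[(i, j)].get(a, float("-inf")):
                            best[(i, j)][a] = s
    return best[(0, n)].get("S")  # None if the grammar cannot derive `words`

Replacing the maximization over split points and rules with a summation turns this sketch into the inside algorithm described below.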

The Inside and Outside algorithms are analogues of the Forward algorithm and Backward algorithm, and can be used to compute the total probability of all derivations that are consistent with a given sequence, based on some SCFG. This is equivalent to the probability of the SCFG generating the sequence, and is intuitively a measure of how consistent the sequence is with the given grammar.
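
Continuing the sketch above (same illustrative rule encoding), the inside pass differs from probabilistic CYK only in summing probabilities where CYK maximizes:

from collections import defaultdict

def inside_probability(words, lexical, binary):
    # Inside algorithm: total probability that a PCFG in Chomsky normal
    # form generates `words`, summed over all derivations.
    n = len(words)
    inside = defaultdict(lambda: defaultdict(float))  # inside[(i, j)][A]
    for i, w in enumerate(words):
        for a, p in lexical.get(w, []):
            inside[(i, i + 1)][a] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for a, b, c, p in binary:
                    inside[(i, j)][a] += p * inside[(i, k)][b] * inside[(k, j)][c]
    return inside[(0, n)]["S"]    # 0.0 if the grammar cannot derive `words`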

The Inside/Outside algorithms can also be used to compute the probabilities that a given production will be used in a random derivation of a sequence. This is used as part of an Expectation-maximization algorithm to learn the maximum likelihood probabilities for an SCFG based on a set of training sequences that the SCFG should model. The algorithm is analogous to that used by hidden Markov models.
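
Concretely, the re-estimation step of this EM procedure (the inside-outside algorithm) normalizes expected rule counts per non-terminal. Writing E[count(A → β)] for the expected number of times production A → β is used over the training sequences (computed from the inside and outside probabilities), the updated probability is

P(A → β) = E[count(A → β)] / Σγ E[count(A → γ)]

where the sum in the denominator runs over all right-hand sides γ of rules for A.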

Applications

Natural language processing

Context-free grammars were originally conceived in an attempt to model natural languages, i.e. those normally spoken by humans.[citation needed] Some research has extended this idea with SCFGs.

Here is a tiny example of a two-rule PCFG. Each rule is preceded by a probability that reflects the relative frequency with which it occurs.

0.7 VP → V NP
0.3 VP → V NP NP

Given this grammar, we can now say that the expected number of NPs produced while deriving VPs is 0.7 × 1 + 0.3 × 2 = 1.3.
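
This expectation is easy to check mechanically; for example, in Python:

# Each entry: (rule probability, number of NPs on the right-hand side).
vp_rules = [(0.7, 1), (0.3, 2)]
expected_nps = sum(p * n for p, n in vp_rules)
print(expected_nps)   # 1.3 (up to floating-point rounding)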

In particular, some speech recognition systems use SCFGs to improve their probability estimate and thereby their performance.[citation needed]

Recently, probabilistic CFGs have played a role in explaining the Accessibility Hierarchy, which seeks to explain why certain structures are more difficult to understand than others, e.g. those with relative clauses like "they had forgotten that the box which Pat brought with apples in was lost".

If there is a probabilistic account of which constructions are more likely, then one can compute an information-theoretic measure (entropy) for the constructions. If the cognitive apparatus for syntax is based on information-theoretic considerations, then it may well employ something similar to a PCFG.[1]
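
As a toy illustration of such a measure (ours, not from the cited work), the entropy of the rule choice for VP in the two-rule grammar above can be computed as follows:

import math

probs = [0.7, 0.3]                               # VP rule probabilities
entropy = -sum(p * math.log2(p) for p in probs)  # -(0.7 log2 0.7 + 0.3 log2 0.3)
print(f"{entropy:.3f} bits")                     # about 0.881 bits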

RNA

Context-free grammars are adept at modeling the secondary structure of RNA.[2][3] Secondary structure involves nucleotides within a single-stranded RNA molecule that are complementary to each other, and therefore base pair. This base pairing is biologically important to the proper function of the RNA molecule. Much of this base pairing can be represented in a context-free grammar (the major exception being pseudoknots).

For example, consider the following grammar, where a, c, g, u represent nucleotides and S is the start symbol (and only non-terminal):

S → aSu | cSg | gSc | uSa

This simple CFG represents an RNA molecule consisting entirely of two wholly complementary regions, in which only canonical complementary pairs are allowed (i.e. A-U and C-G). Note that for a derivation to terminate, the grammar also needs a base-case production such as S → ε (the empty string).
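
Because the grammar pairs the first remaining base with the last at every step, membership can be checked with a simple two-pointer scan. A minimal sketch (assuming the terminating production S → ε noted above):

# The four allowed outer pairs, one per production S -> aSu | cSg | gSc | uSa.
PAIRS = {("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")}

def derivable(seq):
    # True if `seq` is generated by the grammar, plus the base case S -> ε.
    if len(seq) % 2 != 0:
        return False                  # every production adds exactly two bases
    i, j = 0, len(seq) - 1
    while i < j:
        if (seq[i], seq[j]) not in PAIRS:
            return False
        i, j = i + 1, j - 1
    return True

print(derivable("acgu"))   # True:  a pairs with u, c pairs with g
print(derivable("acga"))   # False: the outermost pair a-a is not allowed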

By attaching probabilities to more sophisticated CFGs, it is possible to model bases or base pairings that are more or less consistent with an expected RNA molecule pattern. SCFGs are used to model the patterns in RNA gene families in the Rfam Database, and to search genome sequences for likely additional members of these families. SCFGs have also been used to find RNA genes using comparative genomics. In this work, homologs of a potential RNA gene in two related organisms were inspected using SCFG techniques to see whether their secondary structure is conserved. If it is, the sequence is likely to be an RNA gene, and the secondary structure is presumed to be conserved because of the functional needs of that RNA gene. It has been shown that SCFGs can predict the secondary structure of an RNA molecule with accuracy similar to that of existing techniques: such SCFGs are used, for example, by the Stemloc program.

Comparison with Generative Grammar

With the publication of Gold's theorem in 1967,[4] it was claimed that grammars for natural languages governed by deterministic rules could not be learned from positive instances alone. This was part of the argument from the poverty of the stimulus, presented in 1980[5] and implicit in Chomsky's early work from the 1950s onward. Among other arguments, this led to the nativist view that a form of grammar (in certain versions, including a complete conceptual lexicon) is hardwired from birth. This view is largely limited to government and binding (GB) and Minimalist Program (MP) theories.

However, Gold's learnability result can easily be circumvented if we assume either that the learner learns a near-perfect approximation to the correct language, or that the learner learns from typical input rather than from arbitrarily devious input. In fact, it has been proved that simply receiving input from a speaker who produces positive instances randomly, rather than according to a pre-specified devious plan, leads to identifiability in the limit with probability 1.[6][7]

A problem faced by any formal syntax is that often more than one production rule may apply to a structure, resulting in a conflict. The greater the coverage, the more frequent such conflicts become, and all grammarians (starting with Pāṇini) have spent considerable effort devising a prioritization for the rules, which usually turns out to be defeasible. Another difficulty is overgeneration, where unlicensed structures are also generated. Probabilistic grammars circumvent these problems by using the frequency of various productions to order them, resulting in a "most likely" (winner-take-all) interpretation, which is by definition defeasible given additional data. As usage patterns are altered by diachronic shifts, these probabilistic rules can be re-learned, thus updating the grammar.

One may construct a probabilistic grammar from a traditional formal syntax by assigning each production a probability taken from some distribution, to be eventually estimated from usage data. On most samples of broad language, probabilistic grammars that tune these probabilities from data typically outperform hand-crafted grammars (although some rule-based grammars are now approaching the accuracy of PCFGs).

Recently, probabilistic grammars appear to have gained some cognitive plausibility. It is well known that there are degrees of difficulty in accessing different syntactic structures (e.g. the Accessibility Hierarchy for relative clauses). Probabilistic versions of Minimalist Grammars have been used to compute information-theoretic entropy values which appear to correlate well with psycholinguistic data on understandability and production difficulty.[8]

Converting a context-free grammar to Chomsky normal form and Greibach normal form

Algorithms such as CYK require the input grammar to be in Chomsky normal form (CNF). A context-free grammar can be converted to CNF by a sequence of language-preserving transformations, each of which yields a new grammar:

1. Reduce the grammar (remove useless symbols).

2. Eliminate epsilon productions.

3. Eliminate cycles and unit rules.

4. Convert the remaining rules to CNF.

Conversion to Greibach normal form (GNF) follows the same first three steps, then additionally:

4. Eliminate left recursion.

5. Convert the remaining rules to GNF.

Software such as the CNF/GNF application performs both conversions automatically and displays the intermediate grammar produced at each step; the final CNF step is sketched below.
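
A minimal Python sketch of that final CNF step (an illustration under our own conventions: lowercase strings are terminals, uppercase strings are non-terminals, and steps 1-3 are assumed to have been applied already):

import itertools

def to_cnf_final_step(rules):
    # Convert productions (lhs, [symbols]) that are already free of epsilon,
    # unit, and useless rules so that every rule is A -> B C or A -> a.
    fresh = (f"X{i}" for i in itertools.count())
    out, term_proxy = [], {}

    def lift(sym):
        # Give a terminal occurring in a long rule its own non-terminal.
        if sym not in term_proxy:
            term_proxy[sym] = next(fresh)
            out.append((term_proxy[sym], [sym]))
        return term_proxy[sym]

    for lhs, rhs in rules:
        if len(rhs) == 1:             # A -> a is already in CNF
            out.append((lhs, rhs))
            continue
        rhs = [s if s.isupper() else lift(s) for s in rhs]
        while len(rhs) > 2:           # binarize long right-hand sides
            new = next(fresh)
            out.append((lhs, [rhs[0], new]))
            lhs, rhs = new, rhs[1:]
        out.append((lhs, rhs))
    return out

# Example: S -> aSu and S -> a become
# X0 -> a, X1 -> u, S -> X0 X2, X2 -> S X1, S -> a (fresh names illustrative).
print(to_cnf_final_step([("S", ["a", "S", "u"]), ("S", ["a"])]))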

See also

  • Statistical parsing

References

  1. ^ John Hale (2006). "Uncertainty About the Rest of the Sentence". Cognitive Science. 30. Dept Linguistics, Michigan State University: 643–672. doi:10.1207/s15516709cog0000_64.
  2. ^ Durbin, Eddy, Krogh, Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998. This bioinformatics textbook includes an accessible introduction to the use of SCFGs for modelling RNA, as well as the history of this application to 1998.
  3. ^ Sean R. Eddy and Richard Durbin (1994), "RNA sequence analysis using covariance models", Nucleic Acids Research, 22 (11): 2079–88. [1]
  4. ^ Gold, E. (1967). "Language identification in the limit". Information and Control 10, 447–474.
  5. ^ Chomsky, N. (1980). Rules and Representations. Oxford: Basil Blackwell.
  6. ^ Clark, A. (2001). Unsupervised Language Acquisition: Theory and Practice. PhD thesis.
  7. ^ Horning, J. J. (1969). A Study of Grammatical Inference. Ph.D. thesis, Computer Science Department, Stanford University.
  8. ^ John Hale (2006). "Uncertainty About the Rest of the Sentence". Cognitive Science. 30. Dept Linguistics, Michigan State University: 643–672. doi:10.1207/s15516709cog0000_64.

Further reading

  • Elena Rivas and Sean R. Eddy (2001), "Noncoding RNA gene detection using comparative sequence analysis", BMC Bioinformatics, 2 (1): 8. [2]