Talk:Parsing expression grammar

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computer science (Rated B-class, Mid-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
B-Class article B  This article has been rated as B-Class on the project's quality scale.
 Mid  This article has been rated as Mid-importance on the project's importance scale.
 

Incorrect example[edit]

I'm not convinced that the grammar listed for  \{ a^n b^n c^n : n > 0 \} is correct.

A ← (a A b)?
B ← (b B c)?

A matches any number of a's followed by the same number of b's. B matches any number of b's followed by the same number of c's.

S ← &(A !b) a* B !c

&(A !b) ensures that we have any number of a's followed by the exactly the same number of b's. a* consumes all the a's, and then B !c matches any number of b's followed by the exactly the same number of c's.

In other words, A and B overlap: they match the same string of b's. They are more symmetrical that the rule suggests.

The problem is that the A and B rules also match the empty string, so this grammar accepts for n=0, which per the prose, it shouldn't. (Also, since we know there should be at least one a, we can use a+ instead of a*).

I'm going to change the A and B rules not to match the null string, the S rule to use a+, and use math markup for the equation (and n >= 1 rather than n > 0, both for consistency with Context-sensitive grammar).

Malcolm rowe 13:06, 26 September 2005 (UTC)

(Removing my own earlier comment.) QTJ 05:23, 11 October 2006 (UTC)

See also REBOL?[edit]

What does REBOL have to do with parsing expression grammars? Using them or providing a mechanism to use them doesn't count.

I agree, finding a given language in the See also section is surprising. If PEG is central/essential or at least present as standard feature in this language, explain it after this entry. Otherwise, move it to the External links (or remove it!).
I suggest also that links in the PEG parser generators section are removed from the External links one (duplicates)
PhiLho 08:53, 3 January 2007 (UTC)

Please clarify N and Σ in the examples[edit]

The definition of a parsing expression grammar says it consists of

a finite set N of nonterminal symbols
a finite set Σ of terminal symbols that is disjoint from N
a finite set P of parsing rules

Please provide clearly and explicitly in all of the examples, what all three of these are.

S <- if C then S else S / if C then S

"if" "then" and "else" are in the finite set of terminal symbols? If so, then say so. And so forth for the rest of the examples.

thank you. --LAM 66.28.244.66 17:04, 15 May 2006 (UTC)


^^ ++! Additionally it does not help to say "if/then/else" would be standard C-style, because in C you don't have "then", this has been confusing.

Another problem that should IMHO be mentioned in the article manifests itself here: The order in which one specifies the choices is very important. E.g. the following rule would not be correct:

S <- if C then S / if C then S else S

--92.202.58.200 (talk) 18:10, 16 December 2008 (UTC)

First example is incorrect.[edit]

The first example (PEG for mathematical formulas that use the basic four operations) is (was, now) incorrect. It allows expressions such as 3.2 + ..2..3..000

Only a single period should be allowed. —Preceding unsigned comment added by HappyDog (talkcontribs)

Thanks, that's a good point. I think the best way is just to take out the decimal point, and only allow the example to recognize non-negative integers. If it's supposed to cover real numbers, it needs a negative sign too, and some needlessly complex parsing rules. I've changed the example to just use non-negative integers. rspeer / ɹəədsɹ 23:37, 25 May 2006 (UTC)
A good solution. Thanks for the prompt reply. --HappyDog 17:07, 26 May 2006 (UTC)
I think you forgot / Product in the last line of the example. Am I right, or am I missing something? I.e. I understand it that Expr was meant to be either Sum or Product. 62.219.142.171 (talk) —Preceding undated comment added 13:44, 25 July 2011 (UTC).
Nevermind, please, was looking at the wrong place, the comment above is wrong. — Preceding unsigned comment added by 62.219.142.171 (talk) 13:48, 25 July 2011 (UTC)
Same commenter still, I realized what was causing the nuisance. Wouldn't it be better if the snippet looked like this:

Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Expr    ← Product (('+' / '-') Product)*

Because the `Sum' in the existing example isn't really a sum (it contains both sum and product), but later assigning it to `Expr' is a redundancy. Does it make sense? — Preceding unsigned comment added by 62.219.142.171 (talk) 15:20, 25 July 2011 (UTC)

Suggest Merge of Packrat Parser Into This[edit]

I suggest that the packrat parser article stub be merged into this article. Since I am a parsing technology author myself, I recuse myself from the discussion beyond this suggestion, to avoid conflicts-of-interest. My rational is that packrat parsing is not widely enough adopted to get a full stub. Packrat parser could then be a redirect to this page, instead.

-- QTJ 00:40, 19 October 2006 (UTC)

I second QTJ's suggestion.

--Junkblocker 00:39, 20 January 2007 (UTC)

I propose that the two are separated again. PEGs are a formalism whilst the packrat parser is an algorithm. I found the merge confusing when reading here. Are the two formally equivalent? If not, they deserve separate articles. Even if they are, I think this is a good idea. We would not combine, e.g. the article on regular expressions with that on finite state automata. The definition of the packrat parser given here is thin. It should be expanded, but probably not within the PEG page.

Your thoughts? --winterstein (talk) 07:39, 11 May 2009 (UTC)

I think it's better to keep them together for the moment. Longer articles, possibly spanning several closely related topics, are generally favoured here because they make for better reading. It's not even clear that we will ever have enough information here for a decent-sized article on the combined topic. If you find the article confusing as it is, you might want to give packrat its own section and expand it. But a split would mean that each article must be summarised at the other; and for such small articles this would be a bit funny. --Hans Adler (talk) 08:23, 11 May 2009 (UTC)

Mistake in (* comments *) example?[edit]

For (* comments *) the article claims:

C ← "(*" N* "*)"
NC / (!"(*" Z)
Zany single character

but wouldn't N* eat everything including the closing comment bracket? Shouldn't this be:

C ← "(*" N* "*)"
NC / (!"*)" Z)
Zany single character

to avoid eating the closing comment backet? The only reason to not allow N ← (* is to generate an error if unbalanced. 59.112.63.188 18:09, 15 April 2007 (UTC)

You're right that the closing bracket needs to be avoided within N. So does the opening bracket, because you need to be able to balance comment brackets to nest them. I'll change the example. rspeer / ɹəədsɹ 19:59, 15 April 2007 (UTC)

Different interpretation[edit]

We say this:

Parsing expression grammars look similar to regular expressions or context-free grammars (CFG) in Backus-Naur form (BNF) notation, but have a different interpretation.

It's not clear to me how they have a different interpretation from regular expressions. Seems to me that's what they are, with the possible exception of the & and ! lookahead functionality. Can someone explain how they are different? --Doradus 03:33, 13 August 2007 (UTC)

The most important difference is that they are recursive. Regular expressions can't refer to themselves or other regular expressions. rspeer / ɹəədsɹ 08:07, 25 October 2007 (UTC)

Problems with the current definition[edit]

Under disadvantages it is mentioned that PEG's cannot handle left recursive rules. I believe this statement is correct, but it is not implied by the current definition. ie. The current definition of a peg is deficient.

—Preceding unsigned comment added by 202.37.96.11 (talk) 22:54, 24 October 2007 (UTC)

I think the statement that left recursion is not supported by packrat parsers should be removed, as a paper *mentioned in the article* provides a way to parse left recursive PEGs. --Tetha (talk) 12:06, 10 December 2008 (UTC) Tetha

The definition of a peg doesn't mention grouping of expressions (e), but the example uses it. Which is right?

I suspect the answer is the definition needs to be extended to match the example.

—Preceding unsigned comment added by 202.37.96.11 (talk) 22:47, 24 October 2007 (UTC)

The present definition defines only the syntax, except that there is a remark on how rewriting differs from rewriting for CFGs. An exact definition of how PEG rules are used should be supplied. Rp (talk) 15:54, 29 January 2009 (UTC)
The comment from 202.37.96.11 is correct, and the response to it by Rp is a non sequitur. How odd that such a basic mistake in the definition has lasted so long. Also, the a^n b^n c^n example is silly and overly complex: S ← &A a+ B is sufficient (this matches a prefix, not the whole input, but the a^n B^n example does the same, and the !(a/b/c) is ridiculous -- any symbol other than a, b, or c could occur). -- 98.108.225.240 (talk) 03:15, 6 February 2011 (UTC)

Hard to see PEGs as new[edit]

Note that ordered choice interpretation of arbitrary context-free grammars is not new. One example of this is the TXL programming language (http://www.txl.ca/), which has existed since the mid 80s. Add that regular expressions can be de-sugared to context free grammars and it is hard to see PEGs as a new kind of formalism. —Preceding unsigned comment added by Thurston51 (talkcontribs) 23:40, 27 November 2007 (UTC)

PEGs are not the same as CFGs + choice interpretation - they specify an analytic grammar rather than a generative grammar. As for being a 'new formalism', the original paper states that they are basically a refinement of TS/TDPL and gTS/GTDPL from the '70s. They are, however, a good tool for specifying parsers. Mrdardie (talk) 04:35, 1 October 2008 (UTC)
Analytical grammars don't appear to be a particular formalism or even a particular subset of the context-free grammars, they just seem to be context-free grammars being used in a particular way, and even that isn't completely clear. The article on it appears to confuse the semantics of grammars with parsing algorithms for grammars. The same is happening in this article on PEGs. E.g. it isn't a property of regular expressions that "regular expressions behave by backtracking", that is just a particular way of parsing. The semantics of PEGs are described by sketching some elements of a possible implementation (e.g. "each nonterminal represents a parsing function"). If the semantics are described in the standard way, by defining the language described by a nonterminal in terms of the languages described by the elements in its parsing expression, it becomes easier to compare them with CFGs. For CFGs, regular expressions on the right hand side of a rule are just syntactic sugar. As far as I can see, if predicates are restricted to fixed strings, PEGs are just a notational variant of LL(k) grammars, and most of what is said about them here is traditionally phrased in terms of top-down parsing for CFGs. So the real difference is that predicates may contain nonterminals; I don't know if there is any theory on the consequences. Rp (talk) 11:47, 3 April 2009 (UTC)

PEGS vs CFGs[edit]

How do PEGs compare to CFGs in expressive power? Rp (talk) 10:42, 30 November 2007 (UTC)

The author states in the original paper that he believes PEGs are less expressive than CFGs because PEGs can be implemented in linear time by a memoizing compiler, but that 'a proof has been surprisingly elusive'. Not sure if the problem has been addressed since. Mrdardie (talk) 04:43, 1 October 2008 (UTC)

This is wrong. As the wikipedia-article here demonstrates, PEGs are able to describe certain context sensitive languages, so CFG cannot be a superset of PEG. However, I am still wondering if PEG are more powerful than CFG or of PEG and CFG cannot be compared using subset relations --Tetha (talk) 15:39, 11 December 2008 (UTC)

Imprecise wording of an example[edit]

I think, the following is imprecise:

The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b.

The "a" in "a b" will not be matched, though the "a" is not "the first in an arbitrarily-long sequence of a's followed by a b", because the " arbitrarily-long sequence of a's" is empty in this case, so it does not have the first element. I'd propose to change it to "the first in a non-empty sequence of a's followed by b". I won't do the change myself, because I'm a newbie. --Mikon (talk) 09:30, 26 January 2009 (UTC)

In general, feel free to change things even while being a newbie... but your explanation doesn't make sense to me. The sequence of a's in "a b" isn't empty: it has one a. You can't have a first a in an empty sequence. rspεεr (talk) 11:00, 26 January 2009 (UTC)

Number of alternatives in PEG rules versus CFGs[edit]

I read this months ago and it was bad enough I had to correct it:

Unlike in a context-free grammar or other generative grammars, in a parsing expression grammar there must be exactly one rule in the grammar having a given nonterminal on its left-hand side. [...]

Now, somebody felt it prudent to put this text back in. This is wrong. This difference is a purely syntactic issue that offers no meaningful insight into the differences between PEGs and CFGs. It's just that instead of writing alternate possibilities as, say

A -> ()
A -> (A)
A -> A A

You write this grammar like this instead:

A -> () | (A) | A A 

To further illustrate my argument, I would point to the first page of Shriram Krishnamurthi's book, Programming Languages: Application and Interpretation, which is electronically available under the Creative Commons, and I quote as follows:

The first insignificant attribute is the syntax. Syntaxes are highly sensitive topics, but in the end, they don’t tell us very much about a program’s behavior. For instance, consider the following four fragments:

   1. a [25]
   2. (vector-ref a 25)
   3. a [25]
   4. a [25]

Which two are most alike? The first and second, obviously! Why? Because the first is in Java and the second is in Scheme, both of which signal an error if the vector associated with a has fewer than 25 entries; the third, in C, blithely ignores the vector’s size, leading to unspecified behavior, even though its syntax is exactly the same as that of the Java code. The fourth, in ML or Haskell, is an application of a to the list containing just one element, 25: that is, it’s not an array dereference at all, it’s a function appliction!

http://www.cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04-26/

Obscuranym (talk) 04:01, 21 April 2009 (UTC)

The problem is that two distinct rules A -> B, A -> C are not the same as the ordered choice A -> B/C. The following grammer is ambiguous:
 S -> A
 S -> a
 A -> a
It is, hence, an important part of the definition that only one rule per non-terminal exists. I would, however, say that there is at most one rule per non-terminal, or?
Mattias Ulbrich 141.3.27.22 (talk) 11:00, 26 February 2010 (UTC)
I disagree. The difference is that the alternatives in a PEG are ordered. They don't form a single rule any more or less than the alternatives in a CFG do. Rp (talk) 22:41, 26 February 2010 (UTC)
Yes, CFGs and PEGs are semantically distinct; in fact I added the notion of "ordered choice" to the article. However, your argument is based solely on notation (i.e. syntax). Rp is exactly right: there is a trivial and natural isomorphism between the two notations, but not the meanings.
To clarify the relationship to the quote from PLAI, you are, in effect, arguing that the reason Java fragment in (1) and the Scheme fragment in (2) differ is that they look different. Although PEGs and CFGs are very different beasts, it's not because they look different: after all, the Java and Scheme fragments look different, even though then denote almost the same thing. Obscuranym (talk) 04:09, 18 March 2010 (UTC)
Mattias is right, the other two are wrong. In a PEG, A ← x / y is defined as attempting x and, if it fails, backing up and then attempting y. No such rule exists for A ← x ; A ← y ... of course, one could implement a PEG system that treats the latter as if it were the former, but this article is about the definition of a PEG as it exists in the literature, which some editors seem unfamiliar with and uncaring about. -- 98.108.225.240 (talk) 03:28, 6 February 2011 (UTC)
You don't appear to understand our objection, which is that Matthias and you confuse two entirely different things:
  • whether to regard A ← x / y (for PEGs) or A ← x | y (for CFGs) as one or two rules; this is purely a metasyntactic detail, irrelevant for most purposes, and it has nothing to do with what is actually relevant, namely,
  • the difference in interpretation (ordered with no backtracking for PEGs vs. nondeterministic choice for CFGs). Rp (talk) 21:43, 8 February 2011 (UTC)

Expressivity Example Incorrect?[edit]

The example given for a language that cannot be recognized by PEGs is: S ← 'x' S 'x' | 'x'. Is this not recognizable through the following PEG? S ← 'x' ('x' 'x')* —Preceding unsigned comment added by 69.251.13.250 (talk) 11:21, 29 December 2009 (UTC)

True, it recognizes the same language, but generates different parse trees -- you're exploiting the fact that the language has a particularly simple definition "odd number of x'es". There are more complicated (and hence harder to understand in an encyclopedia entry) examples where PEGs not only cannot produce the same parse tree but cannot even recognize the same language. AshtonBenson (talk) 22:16, 2 January 2010 (UTC)
I think one such esample would be S<- 'x' S 'x' | 'x' | 'q'. Haven't thought too hard about it though. AshtonBenson (talk) 22:17, 2 January 2010 (UTC)
I think S ← 'q' / 'x' S 'x' / 'x' ('x' 'x')* recognizes the same language. — Preceding unsigned comment added by 190.31.46.132 (talk) 15:16, 4 September 2011 (UTC)

Original Paper Citation[edit]

I think this article is remiss in not citing the original paper that introduced PEGs. The full citation is Bryan Ford, "Parsing Expression Grammars: A Recognition Based Syntactic Foundation", Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2004. The online home is: http://portal.acm.org/citation.cfm?id=964001.964011

Since this recent paper introduced the concept of PEGs, I think both that it should be cited and referenced as the source of the idea, but the discussion of the concept itself should probably be derived from this work. 128.173.236.168 (talk) 15:43, 17 March 2010 (UTC)

I agree. I have added the reference. Rp (talk) 17:03, 30 July 2010 (UTC)

PEG vs. RegExp[edit]

"Compared to regular expressions, PEGs are strictly more powerful" - Unlike PEG, regexes have back-references, so how PEG can be "strictly more powerful"? Enerjazzer (talk) 12:58, 7 July 2010 (UTC)

Regular expressions don't have back references. Sure, Perl added back references and still call them regular expressions, but the article uses the term in the more generally accepted sense. Obscuranym (talk) 12:59, 9 July 2010 (UTC)
The article on regexes should make this clear. Back references in regular expression patterns (in actual tools) predate Perl, they are an obvious idea given their use in substitution replacements. Rp (talk) 20:47, 9 July 2010 (UTC)
OK, changed the text to make this clear. Enerjazzer (talk) 12:50, 10 July 2010 (UTC)

"Thus ordered choice is not commutative, unlike unordered choice as in context-free grammars and regular expressions." I deleted "regular expressions" from that sentences because the choice operator | in regexes performs ordered choice.[1] --88.73.20.200 (talk) 02:04, 23 March 2012 (UTC)

LL(1)?[edit]

Isn’t this just non-strict LL(1)? It works like LL(1), but conflicts get ignored, we say “first rule has precedence”. Otherwise I would like to have an example for great stuff Packrat can parse. --Chricho (talk) 12:48, 11 September 2010 (UTC)

I haven't taken the time to study the exact relationship between PEGs and the various classes of grammars, but it'd be something more akin to LL(∞). Obscuranym (talk) 21:36, 13 September 2010 (UTC)
The ANDing of alternatives is different. Rp (talk) 12:43, 14 September 2010 (UTC)
Okay, & and ! are not possible to simulate with LL(k) or any CFL-parser. However, the actual difference between a parser calling itself Packrat-parser and a parser calling itself LL(1) is very small, the LL(1) parsers often provide a roll-back-feature (LL(*), not LL(∞)) which is more than enough for real-life programming languages and not slower than Packrat’s & and !. Packrat certainly does not run in O(n), it depends on the grammar, choose this poor, pathologic example: S ← a (&S) S b | b, it will run in O(2^n), that is not a real-life-example, but in real life even LL(*)-backtracking does not have O(2^n). --Chricho (talk) 00:00, 15 September 2010 (UTC)

Example for CFL\L(PEG)[edit]

I think the anti-LL(1) example works for Packrat, too:

 S ← A|B
 A ← a A b | ε
 B ← a B b b | ε

or did I get anything wrong with the & operator and it is possible to express this language with a PEG? --Chricho (talk) 00:11, 15 September 2010 (UTC)

I wish someone would answer this. Rp (talk) 21:51, 8 February 2011 (UTC)

A question[edit]

Does:

 S ← &(a b) a b | &(a c) a c

accept the input ac?

 S ← a b | a c

would certainly not accept it. Or wait: “The choice operator e1 / e2 first invokes e1, and if e1 succeeds, returns its result immediately. Otherwise, if e1 fails, then the choice operator backtracks to the original input position at which it invoked e1, but then calls e2 instead, returning e2's result.” – that cannot be true. I think it makes the decision based on the first-set, it will not backtrack. --Chricho (talk) 00:17, 15 September 2010 (UTC)

Instead of talking about what you think, refer to the literature. -- 98.108.225.240 (talk) 03:33, 6 February 2011 (UTC)

use of 'memorization'[edit]

The linked topic doesn't have anything relevant. Either that's an invented term, inapt (caching comes to mind as an alternative), or the linked topic needs some sort of disambiguation to support the link TEDickey (talk) 11:36, 23 November 2010 (UTC)

The correct term is memoization, as in, creating memos. Whoever made this change probably thought they were correcting a typo. Msnicki (talk) 17:33, 23 November 2010 (UTC)

Is PEG Implemented in Any Programming Language?[edit]

I couldn't find one, but there are mentions of real parsers in the article. Would anyone post a link to a program / library where they are used? Thanks 109.64.130.34 (talk) 17:18, 25 August 2013 (UTC)

You mean something like this? http://www.inf.puc-rio.br/~roberto/lpeg/ Hans Adler 19:20, 25 August 2013 (UTC)
Yup, it would be good to add it. Not sure in what way, but the motivation for it is that PEG is, well, at least some times, mentioned in the context of programming as an alternative to regular expressions (I happen to believe that it would be a great alternative). Being able to try it "live" (I don't know Lua, but it's an easy to pick up language with interactive interpreter) should help by providing as many examples as one wishes. 109.64.130.34 (talk) 23:18, 25 August 2013 (UTC)
I only took a glance, but I think that library has a misleading name: it doesn't support nonterminals, so it cannot express certain uses of recursion in PEGs. What remains is in essence assertions, which Perl regular expressions have been providing for ages (albeit with limitations on the assertion expression).
Now what to do about it? Assertions in regular expressions deserve mention in Wikipedia, but do they deserve their own article? Neither Regular expression nor Assertion (computing) mentions them, and they are indeed a big step towards PEGs, so they can be described here - as long as they are not described as being the same thing as PEGs. Rp (talk) 10:32, 26 August 2013 (UTC)
I've found this: https://github.com/nikodemus/esrap through further searching. I know Common Lisp enough to tell that is does implement all that is mentioned in the article, but it does also things on top of that, like, for example, invoking arbitrary Lisp function on parsed text. So, I don't know. I'd also imagine there must be some code in Haskell that uses the parser from a paper linked here to create some usable tools, but I don't know Haskell well enough to tell. 109.64.130.34 (talk) 17:18, 26 August 2013 (UTC)
This one looks more strict implementation: http://common-lisp.net/project/cl-peg/ 109.65.104.130 (talk) 18:51, 11 September 2013 (UTC)
I think I finally found a good example (no or very basic programming skill is required, available for use online with the requirement of JavaScript being enabled) http://pegjs.majda.cz/ (this is a JavaScript library that implements PEG, it also has a "try online" feature). 109.65.104.130 (talk) 22:46, 29 September 2013 (UTC)
These maybe? Comparison of parser generators (see the section on Parsing Expression Grammars) Bruce IV (talk) 15:56, 5 October 2013 (UTC)

Disadvantages[edit]

The statement under the heading "Subtle grammatical errors" is pushing the definition of "disadvantage" a little. It's a pretty ridiculous leap of logic to say that PEGs are flawed because it takes effort to translate ambiguities from a lesser formalism. Those ambiguities are a flaw in the original grammar, not in the PEG translation. In fact, I'm going to remove this silly and bizarre statement, since it's completely flawed reasoning and adds nothing insightful to the article anyway. — Preceding unsigned comment added by 82.9.176.129 (talk) 14:46, 24 September 2013 (UTC)

Agreed. However, I think it's as much of an error to assume that ambiguities in grammars are always a mistake than to assume the opposite. Rp (talk) 06:55, 25 September 2013 (UTC)