Talk:Parsing expression grammar
|WikiProject Computer science||(Rated B-class, Mid-importance)|
- 1 Incorrect example
- 2 See also REBOL?
- 3 Please clarify N and Σ in the examples
- 4 First example is incorrect.
- 5 Suggest Merge of Packrat Parser Into This
- 6 Mistake in (* comments *) example?
- 7 Different interpretation
- 8 Problems with the current definition
- 9 Hard to see PEGs as new
- 10 PEGS vs CFGs
- 11 Imprecise wording of an example
- 12 Number of alternatives in PEG rules versus CFGs
- 13 Expressivity Example Incorrect?
- 14 Original Paper Citation
- 15 PEG vs. RegExp
- 16 LL(1)?
- 17 Example for CFL\L(PEG)
- 18 A question
- 19 use of 'memorization'
- 20 Is PEG Implemented in Any Programming Language?
- 21 Disadvantages
I'm not convinced that the grammar listed for is correct.
- A ← (a A b)?
- B ← (b B c)?
A matches any number of a's followed by the same number of b's. B matches any number of b's followed by the same number of c's.
- S ← &(A !b) a* B !c
&(A !b) ensures that we have any number of a's followed by the exactly the same number of b's. a* consumes all the a's, and then B !c matches any number of b's followed by the exactly the same number of c's.
In other words, A and B overlap: they match the same string of b's. They are more symmetrical that the rule suggests.
The problem is that the A and B rules also match the empty string, so this grammar accepts for n=0, which per the prose, it shouldn't. (Also, since we know there should be at least one a, we can use a+ instead of a*).
I'm going to change the A and B rules not to match the null string, the S rule to use a+, and use math markup for the equation (and n >= 1 rather than n > 0, both for consistency with Context-sensitive grammar).
Malcolm rowe 13:06, 26 September 2005 (UTC)
(Removing my own earlier comment.) QTJ 05:23, 11 October 2006 (UTC)
See also REBOL?
What does REBOL have to do with parsing expression grammars? Using them or providing a mechanism to use them doesn't count.
- I agree, finding a given language in the See also section is surprising. If PEG is central/essential or at least present as standard feature in this language, explain it after this entry. Otherwise, move it to the External links (or remove it!).
- I suggest also that links in the PEG parser generators section are removed from the External links one (duplicates)
- PhiLho 08:53, 3 January 2007 (UTC)
Please clarify N and Σ in the examples
The definition of a parsing expression grammar says it consists of
a finite set N of nonterminal symbols a finite set Σ of terminal symbols that is disjoint from N a finite set P of parsing rules
Please provide clearly and explicitly in all of the examples, what all three of these are.
S <- if C then S else S / if C then S
"if" "then" and "else" are in the finite set of terminal symbols? If so, then say so. And so forth for the rest of the examples.
thank you. --LAM 22.214.171.124 17:04, 15 May 2006 (UTC)
^^ ++! Additionally it does not help to say "if/then/else" would be standard C-style, because in C you don't have "then", this has been confusing.
Another problem that should IMHO be mentioned in the article manifests itself here: The order in which one specifies the choices is very important. E.g. the following rule would not be correct:
S <- if C then S / if C then S else S
First example is incorrect.
The first example (PEG for mathematical formulas that use the basic four operations) is (was, now) incorrect. It allows expressions such as 3.2 + ..2..3..000
- Thanks, that's a good point. I think the best way is just to take out the decimal point, and only allow the example to recognize non-negative integers. If it's supposed to cover real numbers, it needs a negative sign too, and some needlessly complex parsing rules. I've changed the example to just use non-negative integers. rspeer / ɹəədsɹ 23:37, 25 May 2006 (UTC)
- A good solution. Thanks for the prompt reply. --HappyDog 17:07, 26 May 2006 (UTC)
- I think you forgot / Product in the last line of the example. Am I right, or am I missing something? I.e. I understand it that Expr was meant to be either Sum or Product. 126.96.36.199 (talk) —Preceding undated comment added 13:44, 25 July 2011 (UTC).
- Nevermind, please, was looking at the wrong place, the comment above is wrong. — Preceding unsigned comment added by 188.8.131.52 (talk) 13:48, 25 July 2011 (UTC)
- Same commenter still, I realized what was causing the nuisance. Wouldn't it be better if the snippet looked like this:
Value ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Expr ← Product (('+' / '-') Product)*
Because the `Sum' in the existing example isn't really a sum (it contains both sum and product), but later assigning it to `Expr' is a redundancy. Does it make sense? — Preceding unsigned comment added by 184.108.40.206 (talk) 15:20, 25 July 2011 (UTC)
Suggest Merge of Packrat Parser Into This
I suggest that the packrat parser article stub be merged into this article. Since I am a parsing technology author myself, I recuse myself from the discussion beyond this suggestion, to avoid conflicts-of-interest. My rational is that packrat parsing is not widely enough adopted to get a full stub. Packrat parser could then be a redirect to this page, instead.
-- QTJ 00:40, 19 October 2006 (UTC)
I second QTJ's suggestion.
--Junkblocker 00:39, 20 January 2007 (UTC)
I propose that the two are separated again. PEGs are a formalism whilst the packrat parser is an algorithm. I found the merge confusing when reading here. Are the two formally equivalent? If not, they deserve separate articles. Even if they are, I think this is a good idea. We would not combine, e.g. the article on regular expressions with that on finite state automata. The definition of the packrat parser given here is thin. It should be expanded, but probably not within the PEG page.
- I think it's better to keep them together for the moment. Longer articles, possibly spanning several closely related topics, are generally favoured here because they make for better reading. It's not even clear that we will ever have enough information here for a decent-sized article on the combined topic. If you find the article confusing as it is, you might want to give packrat its own section and expand it. But a split would mean that each article must be summarised at the other; and for such small articles this would be a bit funny. --Hans Adler (talk) 08:23, 11 May 2009 (UTC)
Mistake in (* comments *) example?
For (* comments *) the article claims:
- C ← "(*" N* "*)"
- N ← C / (!"(*" Z)
- Z ← any single character
but wouldn't N* eat everything including the closing comment bracket? Shouldn't this be:
- C ← "(*" N* "*)"
- N ← C / (!"*)" Z)
- Z ← any single character
to avoid eating the closing comment backet? The only reason to not allow N ← (* is to generate an error if unbalanced. 220.127.116.11 18:09, 15 April 2007 (UTC)
- You're right that the closing bracket needs to be avoided within N. So does the opening bracket, because you need to be able to balance comment brackets to nest them. I'll change the example. rspeer / ɹəədsɹ 19:59, 15 April 2007 (UTC)
We say this:
Parsing expression grammars look similar to regular expressions or context-free grammars (CFG) in Backus-Naur form (BNF) notation, but have a different interpretation.
It's not clear to me how they have a different interpretation from regular expressions. Seems to me that's what they are, with the possible exception of the & and ! lookahead functionality. Can someone explain how they are different? --Doradus 03:33, 13 August 2007 (UTC)
- The most important difference is that they are recursive. Regular expressions can't refer to themselves or other regular expressions. rspeer / ɹəədsɹ 08:07, 25 October 2007 (UTC)
Problems with the current definition
Under disadvantages it is mentioned that PEG's cannot handle left recursive rules. I believe this statement is correct, but it is not implied by the current definition. ie. The current definition of a peg is deficient.
I think the statement that left recursion is not supported by packrat parsers should be removed, as a paper *mentioned in the article* provides a way to parse left recursive PEGs. --Tetha (talk) 12:06, 10 December 2008 (UTC) Tetha
The definition of a peg doesn't mention grouping of expressions (e), but the example uses it. Which is right?
I suspect the answer is the definition needs to be extended to match the example.
- The present definition defines only the syntax, except that there is a remark on how rewriting differs from rewriting for CFGs. An exact definition of how PEG rules are used should be supplied. Rp (talk) 15:54, 29 January 2009 (UTC)
- The comment from 18.104.22.168 is correct, and the response to it by Rp is a non sequitur. How odd that such a basic mistake in the definition has lasted so long. Also, the a^n b^n c^n example is silly and overly complex: S ← &A a+ B is sufficient (this matches a prefix, not the whole input, but the a^n B^n example does the same, and the !(a/b/c) is ridiculous -- any symbol other than a, b, or c could occur). -- 22.214.171.124 (talk) 03:15, 6 February 2011 (UTC)
Hard to see PEGs as new
Note that ordered choice interpretation of arbitrary context-free grammars is not new. One example of this is the TXL programming language (http://www.txl.ca/), which has existed since the mid 80s. Add that regular expressions can be de-sugared to context free grammars and it is hard to see PEGs as a new kind of formalism. —Preceding unsigned comment added by Thurston51 (talk • contribs) 23:40, 27 November 2007 (UTC)
- PEGs are not the same as CFGs + choice interpretation - they specify an analytic grammar rather than a generative grammar. As for being a 'new formalism', the original paper states that they are basically a refinement of TS/TDPL and gTS/GTDPL from the '70s. They are, however, a good tool for specifying parsers. Mrdardie (talk) 04:35, 1 October 2008 (UTC)
- Analytical grammars don't appear to be a particular formalism or even a particular subset of the context-free grammars, they just seem to be context-free grammars being used in a particular way, and even that isn't completely clear. The article on it appears to confuse the semantics of grammars with parsing algorithms for grammars. The same is happening in this article on PEGs. E.g. it isn't a property of regular expressions that "regular expressions behave by backtracking", that is just a particular way of parsing. The semantics of PEGs are described by sketching some elements of a possible implementation (e.g. "each nonterminal represents a parsing function"). If the semantics are described in the standard way, by defining the language described by a nonterminal in terms of the languages described by the elements in its parsing expression, it becomes easier to compare them with CFGs. For CFGs, regular expressions on the right hand side of a rule are just syntactic sugar. As far as I can see, if predicates are restricted to fixed strings, PEGs are just a notational variant of LL(k) grammars, and most of what is said about them here is traditionally phrased in terms of top-down parsing for CFGs. So the real difference is that predicates may contain nonterminals; I don't know if there is any theory on the consequences. Rp (talk) 11:47, 3 April 2009 (UTC)
PEGS vs CFGs
- The author states in the original paper that he believes PEGs are less expressive than CFGs because PEGs can be implemented in linear time by a memoizing compiler, but that 'a proof has been surprisingly elusive'. Not sure if the problem has been addressed since. Mrdardie (talk) 04:43, 1 October 2008 (UTC)
This is wrong. As the wikipedia-article here demonstrates, PEGs are able to describe certain context sensitive languages, so CFG cannot be a superset of PEG. However, I am still wondering if PEG are more powerful than CFG or of PEG and CFG cannot be compared using subset relations --Tetha (talk) 15:39, 11 December 2008 (UTC)
Imprecise wording of an example
I think, the following is imprecise:
The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b.
The "a" in "a b" will not be matched, though the "a" is not "the first in an arbitrarily-long sequence of a's followed by a b", because the " arbitrarily-long sequence of a's" is empty in this case, so it does not have the first element. I'd propose to change it to "the first in a non-empty sequence of a's followed by b". I won't do the change myself, because I'm a newbie. --Mikon (talk) 09:30, 26 January 2009 (UTC)
- In general, feel free to change things even while being a newbie... but your explanation doesn't make sense to me. The sequence of a's in "a b" isn't empty: it has one a. You can't have a first a in an empty sequence. rspεεr (talk) 11:00, 26 January 2009 (UTC)
Number of alternatives in PEG rules versus CFGs
I read this months ago and it was bad enough I had to correct it:
Now, somebody felt it prudent to put this text back in. This is wrong. This difference is a purely syntactic issue that offers no meaningful insight into the differences between PEGs and CFGs. It's just that instead of writing alternate possibilities as, say
A -> () A -> (A) A -> A A
You write this grammar like this instead:
A -> () | (A) | A A
To further illustrate my argument, I would point to the first page of Shriram Krishnamurthi's book, Programming Languages: Application and Interpretation, which is electronically available under the Creative Commons, and I quote as follows:
The first insignificant attribute is the syntax. Syntaxes are highly sensitive topics, but in the end, they don’t tell us very much about a program’s behavior. For instance, consider the following four fragments:1. a  2. (vector-ref a 25) 3. a  4. a 
Which two are most alike? The first and second, obviously! Why? Because the first is in Java and the second is in Scheme, both of which signal an error if the vector associated with a has fewer than 25 entries; the third, in C, blithely ignores the vector’s size, leading to unspecified behavior, even though its syntax is exactly the same as that of the Java code. The fourth, in ML or Haskell, is an application of a to the list containing just one element, 25: that is, it’s not an array dereference at all, it’s a function appliction!
- The problem is that two distinct rules
A -> B,
A -> Care not the same as the ordered choice
A -> B/C. The following grammer is ambiguous:
S -> A S -> a A -> a
- It is, hence, an important part of the definition that only one rule per non-terminal exists. I would, however, say that there is at most one rule per non-terminal, or?
- Yes, CFGs and PEGs are semantically distinct; in fact I added the notion of "ordered choice" to the article. However, your argument is based solely on notation (i.e. syntax). Rp is exactly right: there is a trivial and natural isomorphism between the two notations, but not the meanings.
- To clarify the relationship to the quote from PLAI, you are, in effect, arguing that the reason Java fragment in (1) and the Scheme fragment in (2) differ is that they look different. Although PEGs and CFGs are very different beasts, it's not because they look different: after all, the Java and Scheme fragments look different, even though then denote almost the same thing. Obscuranym (talk) 04:09, 18 March 2010 (UTC)
- Mattias is right, the other two are wrong. In a PEG, A ← x / y is defined as attempting x and, if it fails, backing up and then attempting y. No such rule exists for A ← x ; A ← y ... of course, one could implement a PEG system that treats the latter as if it were the former, but this article is about the definition of a PEG as it exists in the literature, which some editors seem unfamiliar with and uncaring about. -- 126.96.36.199 (talk) 03:28, 6 February 2011 (UTC)
- You don't appear to understand our objection, which is that Matthias and you confuse two entirely different things:
- whether to regard A ← x / y (for PEGs) or A ← x | y (for CFGs) as one or two rules; this is purely a metasyntactic detail, irrelevant for most purposes, and it has nothing to do with what is actually relevant, namely,
- the difference in interpretation (ordered with no backtracking for PEGs vs. nondeterministic choice for CFGs). Rp (talk) 21:43, 8 February 2011 (UTC)
- You don't appear to understand our objection, which is that Matthias and you confuse two entirely different things:
Expressivity Example Incorrect?
The example given for a language that cannot be recognized by PEGs is: S ← 'x' S 'x' | 'x'. Is this not recognizable through the following PEG? S ← 'x' ('x' 'x')* —Preceding unsigned comment added by 188.8.131.52 (talk) 11:21, 29 December 2009 (UTC)
- True, it recognizes the same language, but generates different parse trees -- you're exploiting the fact that the language has a particularly simple definition "odd number of x'es". There are more complicated (and hence harder to understand in an encyclopedia entry) examples where PEGs not only cannot produce the same parse tree but cannot even recognize the same language. AshtonBenson (talk) 22:16, 2 January 2010 (UTC)
Original Paper Citation
I think this article is remiss in not citing the original paper that introduced PEGs. The full citation is Bryan Ford, "Parsing Expression Grammars: A Recognition Based Syntactic Foundation", Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2004. The online home is: http://portal.acm.org/citation.cfm?id=964001.964011
Since this recent paper introduced the concept of PEGs, I think both that it should be cited and referenced as the source of the idea, but the discussion of the concept itself should probably be derived from this work. 184.108.40.206 (talk) 15:43, 17 March 2010 (UTC)
PEG vs. RegExp
- Regular expressions don't have back references. Sure, Perl added back references and still call them regular expressions, but the article uses the term in the more generally accepted sense. Obscuranym (talk) 12:59, 9 July 2010 (UTC)
- The article on regexes should make this clear. Back references in regular expression patterns (in actual tools) predate Perl, they are an obvious idea given their use in substitution replacements. Rp (talk) 20:47, 9 July 2010 (UTC)
"Thus ordered choice is not commutative, unlike unordered choice as in context-free grammars and regular expressions." I deleted "regular expressions" from that sentences because the choice operator | in regexes performs ordered choice. --220.127.116.11 (talk) 02:04, 23 March 2012 (UTC)
Isn’t this just non-strict LL(1)? It works like LL(1), but conflicts get ignored, we say “first rule has precedence”. Otherwise I would like to have an example for great stuff Packrat can parse. --Chricho (talk) 12:48, 11 September 2010 (UTC)
- I haven't taken the time to study the exact relationship between PEGs and the various classes of grammars, but it'd be something more akin to LL(∞). Obscuranym (talk) 21:36, 13 September 2010 (UTC)
- Okay, & and ! are not possible to simulate with LL(k) or any CFL-parser. However, the actual difference between a parser calling itself Packrat-parser and a parser calling itself LL(1) is very small, the LL(1) parsers often provide a roll-back-feature (LL(*), not LL(∞)) which is more than enough for real-life programming languages and not slower than Packrat’s & and !. Packrat certainly does not run in O(n), it depends on the grammar, choose this poor, pathologic example: S ← a (&S) S b | b, it will run in O(2^n), that is not a real-life-example, but in real life even LL(*)-backtracking does not have O(2^n). --Chricho (talk) 00:00, 15 September 2010 (UTC)
Example for CFL\L(PEG)
I think the anti-LL(1) example works for Packrat, too:
S ← A|B A ← a A b | ε B ← a B b b | ε
S ← &(a b) a b | &(a c) a c
accept the input ac?
S ← a b | a c
would certainly not accept it. Or wait: “The choice operator e1 / e2 first invokes e1, and if e1 succeeds, returns its result immediately. Otherwise, if e1 fails, then the choice operator backtracks to the original input position at which it invoked e1, but then calls e2 instead, returning e2's result.” – that cannot be true. I think it makes the decision based on the first-set, it will not backtrack. --Chricho (talk) 00:17, 15 September 2010 (UTC)
- Instead of talking about what you think, refer to the literature. -- 18.104.22.168 (talk) 03:33, 6 February 2011 (UTC)
use of 'memorization'
The linked topic doesn't have anything relevant. Either that's an invented term, inapt (caching comes to mind as an alternative), or the linked topic needs some sort of disambiguation to support the link TEDickey (talk) 11:36, 23 November 2010 (UTC)
- The correct term is memoization, as in, creating memos. Whoever made this change probably thought they were correcting a typo. Msnicki (talk) 17:33, 23 November 2010 (UTC)
Is PEG Implemented in Any Programming Language?
I couldn't find one, but there are mentions of real parsers in the article. Would anyone post a link to a program / library where they are used? Thanks 22.214.171.124 (talk) 17:18, 25 August 2013 (UTC)
- You mean something like this? http://www.inf.puc-rio.br/~roberto/lpeg/ Hans Adler 19:20, 25 August 2013 (UTC)
- Yup, it would be good to add it. Not sure in what way, but the motivation for it is that PEG is, well, at least some times, mentioned in the context of programming as an alternative to regular expressions (I happen to believe that it would be a great alternative). Being able to try it "live" (I don't know Lua, but it's an easy to pick up language with interactive interpreter) should help by providing as many examples as one wishes. 126.96.36.199 (talk) 23:18, 25 August 2013 (UTC)
- I only took a glance, but I think that library has a misleading name: it doesn't support nonterminals, so it cannot express certain uses of recursion in PEGs. What remains is in essence assertions, which Perl regular expressions have been providing for ages (albeit with limitations on the assertion expression).
- Now what to do about it? Assertions in regular expressions deserve mention in Wikipedia, but do they deserve their own article? Neither Regular expression nor Assertion (computing) mentions them, and they are indeed a big step towards PEGs, so they can be described here - as long as they are not described as being the same thing as PEGs. Rp (talk) 10:32, 26 August 2013 (UTC)
- I've found this: https://github.com/nikodemus/esrap through further searching. I know Common Lisp enough to tell that is does implement all that is mentioned in the article, but it does also things on top of that, like, for example, invoking arbitrary Lisp function on parsed text. So, I don't know. I'd also imagine there must be some code in Haskell that uses the parser from a paper linked here to create some usable tools, but I don't know Haskell well enough to tell. 188.8.131.52 (talk) 17:18, 26 August 2013 (UTC)
- This one looks more strict implementation: http://common-lisp.net/project/cl-peg/ 184.108.40.206 (talk) 18:51, 11 September 2013 (UTC)
- These maybe? Comparison of parser generators (see the section on Parsing Expression Grammars) Bruce IV (talk) 15:56, 5 October 2013 (UTC)
The statement under the heading "Subtle grammatical errors" is pushing the definition of "disadvantage" a little. It's a pretty ridiculous leap of logic to say that PEGs are flawed because it takes effort to translate ambiguities from a lesser formalism. Those ambiguities are a flaw in the original grammar, not in the PEG translation. In fact, I'm going to remove this silly and bizarre statement, since it's completely flawed reasoning and adds nothing insightful to the article anyway. — Preceding unsigned comment added by 220.127.116.11 (talk) 14:46, 24 September 2013 (UTC)