Talk:Parsing expression grammar

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.Computer scienceWikipedia:WikiProject Computer scienceTemplate:WikiProject Computer scienceComputer science articles

Mid

This article has been rated as Mid-importance on the project's importance scale.

Things you can help WikiProject Computer science with:

Here are some tasks awaiting attention:

Article requests :
- Requested articles/Applied arts and sciences/Computer science, computing, and Internet
Cleanup :
- Computer science articles needing attention
- Computer science articles needing expert attention
Copyedit :
- Computing
Expand :
- Computer science
Infobox :
- Computer science articles without infoboxes
Maintain :
- Timeline of computing 2020–present
Photo :
- Find pictures for the biographies of computer scientists (see List of computer scientists)
- Computing articles needing images
Stubs :
- Computer science stubs
Unreferenced :
- WikiProject Computer science/Unreferenced BLPs
Project-related :
- Tag all relevant articles in Category:Computer science and sub-categories with {{WikiProject Computer science}}

Linguistics Low‑importance

	Linguistics portal This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.LinguisticsWikipedia:WikiProject LinguisticsTemplate:WikiProject LinguisticsLinguistics articles
Low	This article has been rated as Low-importance on the project's importance scale.

Incorrect example[edit]

I'm not convinced that the grammar listed for $\{a^{n}b^{n}c^{n}:n>0\}$ is correct.

A ← (a A b)?

B ← (b B c)?

A matches any number of a's followed by the same number of b's. B matches any number of b's followed by the same number of c's.

S ← &(A !b) a* B !c

&(A !b) ensures that we have any number of a's followed by the exactly the same number of b's. a* consumes all the a's, and then B !c matches any number of b's followed by the exactly the same number of c's.

In other words, A and B overlap: they match the same string of b's. They are more symmetrical that the rule suggests.

The problem is that the A and B rules also match the empty string, so this grammar accepts for n=0, which per the prose, it shouldn't. (Also, since we know there should be at least one a, we can use a+ instead of a*).

I'm going to change the A and B rules not to match the null string, the S rule to use a+, and use math markup for the equation (and n >= 1 rather than n > 0, both for consistency with Context-sensitive grammar).

Malcolm rowe 13:06, 26 September 2005 (UTC)[reply]

The solution given by Malcolm rowe above is correct, but as of this edit, the example in the article is broken. The rule as written:

S ← &A 'a'+ B
A ← 'a' A? 'b'
B ← 'b' B? 'c'

actually matches $\{a^{n}b^{m}c^{m}:m\geq n>0\}$ .

I'm changing this to:

S ← (&A !'b') 'a'+ B !'c'
A ← 'a' A? 'b'
B ← 'b' B? 'c'

Mike Segal (talk) 01:31, 15 June 2015 (UTC)[reply]

(Removing my own earlier comment.) QTJ 05:23, 11 October 2006 (UTC)[reply]

Please clarify N and Σ in the examples[edit]

The definition of a parsing expression grammar says it consists of

a finite set N of nonterminal symbols
a finite set Σ of terminal symbols that is disjoint from N
a finite set P of parsing rules

Please provide clearly and explicitly in all of the examples, what all three of these are.

S <- if C then S else S / if C then S

"if" "then" and "else" are in the finite set of terminal symbols? If so, then say so. And so forth for the rest of the examples.

thank you. --LAM 66.28.244.66 17:04, 15 May 2006 (UTC)[reply]

^^ ++! Additionally it does not help to say "if/then/else" would be standard C-style, because in C you don't have "then", this has been confusing.

Another problem that should IMHO be mentioned in the article manifests itself here: The order in which one specifies the choices is very important. E.g. the following rule would not be correct:

S <- if C then S / if C then S else S

--92.202.58.200 (talk) 18:10, 16 December 2008 (UTC)[reply]

First example is incorrect.[edit]

The first example (PEG for mathematical formulas that use the basic four operations) is (was, now) incorrect. It allows expressions such as 3.2 + ..2..3..000

Only a single period should be allowed. —Preceding unsigned comment added by HappyDog (talk • contribs)

Thanks, that's a good point. I think the best way is just to take out the decimal point, and only allow the example to recognize non-negative integers. If it's supposed to cover real numbers, it needs a negative sign too, and some needlessly complex parsing rules. I've changed the example to just use non-negative integers. rspeer / ɹəədsɹ 23:37, 25 May 2006 (UTC)[reply]

A good solution. Thanks for the prompt reply. --HappyDog 17:07, 26 May 2006 (UTC)[reply]

I think you forgot / Product in the last line of the example. Am I right, or am I missing something? I.e. I understand it that Expr was meant to be either Sum or Product. 62.219.142.171 (talk) —Preceding undated comment added 13:44, 25 July 2011 (UTC).[reply]

Nevermind, please, was looking at the wrong place, the comment above is wrong. — Preceding unsigned comment added by 62.219.142.171 (talk) 13:48, 25 July 2011 (UTC)[reply]

Same commenter still, I realized what was causing the nuisance. Wouldn't it be better if the snippet looked like this:

Value ← [0-9]+ / '(' Expr ')' Product ← Value (('*' / '/') Value)* Expr ← Product (('+' / '-') Product)*

Because the `Sum' in the existing example isn't really a sum (it contains both sum and product), but later assigning it to `Expr' is a redundancy. Does it make sense? — Preceding unsigned comment added by 62.219.142.171 (talk) 15:20, 25 July 2011 (UTC)[reply]

Suggest Merge of Packrat Parser Into This[edit]

I suggest that the packrat parser article stub be merged into this article. Since I am a parsing technology author myself, I recuse myself from the discussion beyond this suggestion, to avoid conflicts-of-interest. My rational is that packrat parsing is not widely enough adopted to get a full stub. Packrat parser could then be a redirect to this page, instead.

-- QTJ 00:40, 19 October 2006 (UTC)[reply]

I second QTJ's suggestion.

--Junkblocker 00:39, 20 January 2007 (UTC)[reply]

I propose that the two are separated again. PEGs are a formalism whilst the packrat parser is an algorithm. I found the merge confusing when reading here. Are the two formally equivalent? If not, they deserve separate articles. Even if they are, I think this is a good idea. We would not combine, e.g. the article on regular expressions with that on finite state automata. The definition of the packrat parser given here is thin. It should be expanded, but probably not within the PEG page.

Your thoughts? --winterstein (talk) 07:39, 11 May 2009 (UTC)[reply]

I think it's better to keep them together for the moment. Longer articles, possibly spanning several closely related topics, are generally favoured here because they make for better reading. It's not even clear that we will ever have enough information here for a decent-sized article on the combined topic. If you find the article confusing as it is, you might want to give packrat its own section and expand it. But a split would mean that each article must be summarised at the other; and for such small articles this would be a bit funny. --Hans Adler (talk) 08:23, 11 May 2009 (UTC)[reply]

Mistake in (* comments *) example?[edit]

For (* comments *) the article claims:

C ← "(*" N* "*)"

N ← C / (!"(*" Z)

Z ← any single character

but wouldn't N* eat everything including the closing comment bracket? Shouldn't this be:

C ← "(*" N* "*)"

N ← C / (!"*)" Z)

Z ← any single character

to avoid eating the closing comment backet? The only reason to not allow N ← (* is to generate an error if unbalanced. 59.112.63.188 18:09, 15 April 2007 (UTC)[reply]

You're right that the closing bracket needs to be avoided within N. So does the opening bracket, because you need to be able to balance comment brackets to nest them. I'll change the example. rspeer / ɹəədsɹ 19:59, 15 April 2007 (UTC)[reply]

Different interpretation[edit]

We say this:

Parsing expression grammars look similar to regular expressions or context-free grammars (CFG) in Backus-Naur form (BNF) notation, but have a different interpretation.

It's not clear to me how they have a different interpretation from regular expressions. Seems to me that's what they are, with the possible exception of the & and ! lookahead functionality. Can someone explain how they are different? --Doradus 03:33, 13 August 2007 (UTC)[reply]

The most important difference is that they are recursive. Regular expressions can't refer to themselves or other regular expressions. rspeer / ɹəədsɹ 08:07, 25 October 2007 (UTC)[reply]

Problems with the current definition[edit]

Under disadvantages it is mentioned that PEG's cannot handle left recursive rules. I believe this statement is correct, but it is not implied by the current definition. ie. The current definition of a peg is deficient.

—Preceding unsigned comment added by 202.37.96.11 (talk) 22:54, 24 October 2007 (UTC)[reply]

Not having left recursion is not a disadvantage. Left recursion is replaced by the loop opetator.

expr = term (('+'/'-') tetm)*;

One programs a parser and therefore I think that parsing expression grammar could mean a language of exprrssions in which a parser is codes. The expressions are named recursive functions. If we consider a rule to be a named parsing function in a metalanguage program then it makes no sense to call itself when the input has not been advanced. What would happen in a c++ function if the first thing it does is call itself. For an example look at the CWIC example in metacompiler. Steamerandy (talk) 18:57, 24 November 2015 (UTC)[reply]

I think the statement that left recursion is not supported by packrat parsers should be removed, as a paper *mentioned in the article* provides a way to parse left recursive PEGs. --Tetha (talk) 12:06, 10 December 2008 (UTC) Tetha[reply]

The definition of a peg doesn't mention grouping of expressions (e), but the example uses it. Which is right?

I suspect the answer is the definition needs to be extended to match the example.

—Preceding unsigned comment added by 202.37.96.11 (talk) 22:47, 24 October 2007 (UTC)[reply]

The present definition defines only the syntax, except that there is a remark on how rewriting differs from rewriting for CFGs. An exact definition of how PEG rules are used should be supplied. Rp (talk) 15:54, 29 January 2009 (UTC)[reply]

The comment from 202.37.96.11 is correct, and the response to it by Rp is a non sequitur. How odd that such a basic mistake in the definition has lasted so long. Also, the a^n b^n c^n example is silly and overly complex: S ← &A a+ B is sufficient (this matches a prefix, not the whole input, but the a^n B^n example does the same, and the !(a/b/c) is ridiculous -- any symbol other than a, b, or c could occur). -- 98.108.225.240 (talk) 03:15, 6 February 2011 (UTC)[reply]

Hard to see PEGs as new[edit]

Note that ordered choice interpretation of arbitrary context-free grammars is not new. One example of this is the TXL programming language (http://www.txl.ca/), which has existed since the mid 80s. Add that regular expressions can be de-sugared to context free grammars and it is hard to see PEGs as a new kind of formalism. —Preceding unsigned comment added by Thurston51 (talk • contribs) 23:40, 27 November 2007 (UTC)[reply]

PEGs are not the same as CFGs + choice interpretation - they specify an analytic grammar rather than a generative grammar. As for being a 'new formalism', the original paper states that they are basically a refinement of TS/TDPL and gTS/GTDPL from the '70s. They are, however, a good tool for specifying parsers. Mrdardie (talk) 04:35, 1 October 2008 (UTC)[reply]

Analytical grammars don't appear to be a particular formalism or even a particular subset of the context-free grammars, they just seem to be context-free grammars being used in a particular way, and even that isn't completely clear. The article on it appears to confuse the semantics of grammars with parsing algorithms for grammars. The same is happening in this article on PEGs. E.g. it isn't a property of regular expressions that "regular expressions behave by backtracking", that is just a particular way of parsing. The semantics of PEGs are described by sketching some elements of a possible implementation (e.g. "each nonterminal represents a parsing function"). If the semantics are described in the standard way, by defining the language described by a nonterminal in terms of the languages described by the elements in its parsing expression, it becomes easier to compare them with CFGs. For CFGs, regular expressions on the right hand side of a rule are just syntactic sugar. As far as I can see, if predicates are restricted to fixed strings, PEGs are just a notational variant of LL(k) grammars, and most of what is said about them here is traditionally phrased in terms of top-down parsing for CFGs. So the real difference is that predicates may contain nonterminals; I don't know if there is any theory on the consequences. Rp (talk) 11:47, 3 April 2009 (UTC)[reply]

A grammar is analytical if, given the grammar and a string, you can decide whether that string is in the language recognised by that grammar — a grammar could be strictly context-sensitive (i.e., not context-free) and still analytical. A grammar is generative if it allows you to effectively enumerate the strings in the language described by the grammar. For parsing, you're mainly interested in grammars being analytical. CFGs are trivially generative, but proving them analytical is more work. Conversely PEGs are trivially analytical, but not generative: it is algorithmically undecidable whether the language of a general PEG is nonempty (because you can encode undecidable decision problems as collections of PEG rules — see Ford's paper). 88.129.117.158 (talk) 12:53, 25 December 2023 (UTC)[reply]

PEGS vs CFGs[edit]

How do PEGs compare to CFGs in expressive power? Rp (talk) 10:42, 30 November 2007 (UTC)[reply]

The author states in the original paper that he believes PEGs are less expressive than CFGs because PEGs can be implemented in linear time by a memoizing compiler, but that 'a proof has been surprisingly elusive'. Not sure if the problem has been addressed since. Mrdardie (talk) 04:43, 1 October 2008 (UTC)[reply]

This is wrong. As the wikipedia-article here demonstrates, PEGs are able to describe certain context sensitive languages, so CFG cannot be a superset of PEG. However, I am still wondering if PEG are more powerful than CFG or of PEG and CFG cannot be compared using subset relations --Tetha (talk) 15:39, 11 December 2008 (UTC)[reply]

Imprecise wording of an example[edit]

I think, the following is imprecise:

The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b.

The "a" in "a b" will not be matched, though the "a" is not "the first in an arbitrarily-long sequence of a's followed by a b", because the " arbitrarily-long sequence of a's" is empty in this case, so it does not have the first element. I'd propose to change it to "the first in a non-empty sequence of a's followed by b". I won't do the change myself, because I'm a newbie. --Mikon (talk) 09:30, 26 January 2009 (UTC)[reply]

In general, feel free to change things even while being a newbie... but your explanation doesn't make sense to me. The sequence of a's in "a b" isn't empty: it has one a. You can't have a first a in an empty sequence. rspεεr (talk) 11:00, 26 January 2009 (UTC)[reply]

Number of alternatives in PEG rules versus CFGs[edit]

I read this months ago and it was bad enough I had to correct it:

Unlike in a context-free grammar or other generative grammars, in a parsing expression grammar there must be exactly one rule in the grammar having a given nonterminal on its left-hand side. [...]

Now, somebody felt it prudent to put this text back in. This is wrong. This difference is a purely syntactic issue that offers no meaningful insight into the differences between PEGs and CFGs. It's just that instead of writing alternate possibilities as, say

A -> ()
A -> (A)
A -> A A

You write this grammar like this instead:

A -> () | (A) | A A

To further illustrate my argument, I would point to the first page of Shriram Krishnamurthi's book, Programming Languages: Application and Interpretation, which is electronically available under the Creative Commons, and I quote as follows:

The first insignificant attribute is the syntax. Syntaxes are highly sensitive topics, but in the end, they don’t tell us very much about a program’s behavior. For instance, consider the following four fragments:
   1. a [25]
   2. (vector-ref a 25)
   3. a [25]
   4. a [25]
Which two are most alike? The first and second, obviously! Why? Because the first is in Java and the second is in Scheme, both of which signal an error if the vector associated with a has fewer than 25 entries; the third, in C, blithely ignores the vector’s size, leading to unspecified behavior, even though its syntax is exactly the same as that of the Java code. The fourth, in ML or Haskell, is an application of a to the list containing just one element, 25: that is, it’s not an array dereference at all, it’s a function appliction!

http://www.cs.brown.edu/~sk/Publications/Books/ProgLangs/2007-04-26/

Obscuranym (talk) 04:01, 21 April 2009 (UTC)[reply]

The problem is that two distinct rules A -> B, A -> C are not the same as the ordered choice A -> B/C. The following grammer is ambiguous:

 S -> A
 S -> a
 A -> a

It is, hence, an important part of the definition that only one rule per non-terminal exists. I would, however, say that there is at most one rule per non-terminal, or?

Mattias Ulbrich 141.3.27.22 (talk) 11:00, 26 February 2010 (UTC)[reply]

I disagree. The difference is that the alternatives in a PEG are ordered. They don't form a single rule any more or less than the alternatives in a CFG do. Rp (talk) 22:41, 26 February 2010 (UTC)[reply]

Yes, CFGs and PEGs are semantically distinct; in fact I added the notion of "ordered choice" to the article. However, your argument is based solely on notation (i.e. syntax). Rp is exactly right: there is a trivial and natural isomorphism between the two notations, but not the meanings.

To clarify the relationship to the quote from PLAI, you are, in effect, arguing that the reason Java fragment in (1) and the Scheme fragment in (2) differ is that they look different. Although PEGs and CFGs are very different beasts, it's not because they look different: after all, the Java and Scheme fragments look different, even though then denote almost the same thing. Obscuranym (talk) 04:09, 18 March 2010 (UTC)[reply]

Mattias is right, the other two are wrong. In a PEG, A ← x / y is defined as attempting x and, if it fails, backing up and then attempting y. No such rule exists for A ← x ; A ← y ... of course, one could implement a PEG system that treats the latter as if it were the former, but this article is about the definition of a PEG as it exists in the literature, which some editors seem unfamiliar with and uncaring about. -- 98.108.225.240 (talk) 03:28, 6 February 2011 (UTC)[reply]

You don't appear to understand our objection, which is that Matthias and you confuse two entirely different things:

whether to regard A ← x / y (for PEGs) or A ← x | y (for CFGs) as one or two rules; this is purely a metasyntactic detail, irrelevant for most purposes, and it has nothing to do with what is actually relevant, namely,
the difference in interpretation (ordered with no backtracking for PEGs vs. nondeterministic choice for CFGs). Rp (talk) 21:43, 8 February 2011 (UTC)[reply]

It is not obvious, but ordered choice is very different from unordered choice when you get into the theoretical details. Best way to understand this is probably to work through Section 4 of Ford's POPL'04 paper: how to eliminate lookahead constraints. You can rewrite any PEG with lookahead constraints into an equivalent (but larger) PEG not using lookahead constraints, by exploiting the ordered aspect of / in sneaky ways. 88.129.117.158 (talk) 13:15, 25 December 2023 (UTC)[reply]

Expressivity Example Incorrect?[edit]

The example given for a language that cannot be recognized by PEGs is: S ← 'x' S 'x' | 'x'. Is this not recognizable through the following PEG? S ← 'x' ('x' 'x')* —Preceding unsigned comment added by 69.251.13.250 (talk) 11:21, 29 December 2009 (UTC)[reply]

True, it recognizes the same language, but generates different parse trees -- you're exploiting the fact that the language has a particularly simple definition "odd number of x'es". There are more complicated (and hence harder to understand in an encyclopedia entry) examples where PEGs not only cannot produce the same parse tree but cannot even recognize the same language. AshtonBenson (talk) 22:16, 2 January 2010 (UTC)[reply]

I think one such esample would be S<- 'x' S 'x' | 'x' | 'q'. Haven't thought too hard about it though. AshtonBenson (talk) 22:17, 2 January 2010 (UTC)[reply]

I think S ← 'q' / 'x' S 'x' / 'x' ('x' 'x')* recognizes the same language. — Preceding unsigned comment added by 190.31.46.132 (talk) 15:16, 4 September 2011 (UTC)[reply]

It can be written as:

  S ← 'x' ( S 'x' | ε)

Steamerandy (talk) 13:13, 30 October 2015 (UTC)[reply]

Original Paper Citation[edit]

I think this article is remiss in not citing the original paper that introduced PEGs. The full citation is Bryan Ford, "Parsing Expression Grammars: A Recognition Based Syntactic Foundation", Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2004. The online home is: http://portal.acm.org/citation.cfm?id=964001.964011

Since this recent paper introduced the concept of PEGs, I think both that it should be cited and referenced as the source of the idea, but the discussion of the concept itself should probably be derived from this work. 128.173.236.168 (talk) 15:43, 17 March 2010 (UTC)[reply]

I agree. I have added the reference. Rp (talk) 17:03, 30 July 2010 (UTC)[reply]

PEG vs. RegExp[edit]

"Compared to regular expressions, PEGs are strictly more powerful" - Unlike PEG, regexes have back-references, so how PEG can be "strictly more powerful"? Enerjazzer (talk) 12:58, 7 July 2010 (UTC)[reply]

Regular expressions don't have back references. Sure, Perl added back references and still call them regular expressions, but the article uses the term in the more generally accepted sense. Obscuranym (talk) 12:59, 9 July 2010 (UTC)[reply]

The article on regexes should make this clear. Back references in regular expression patterns (in actual tools) predate Perl, they are an obvious idea given their use in substitution replacements. Rp (talk) 20:47, 9 July 2010 (UTC)[reply]

OK, changed the text to make this clear. Enerjazzer (talk) 12:50, 10 July 2010 (UTC)[reply]

"Thus ordered choice is not commutative, unlike unordered choice as in context-free grammars and regular expressions." I deleted "regular expressions" from that sentences because the choice operator | in regexes performs ordered choice.[1] --88.73.20.200 (talk) 02:04, 23 March 2012 (UTC)[reply]

You've fundamentally confused the unordered nature of choice as alternation in the formal definition of regular expressions, with the implementation semantics of extended regular expressions in a specific programming language. Nevertheless, that section of the PEG article is better off without reference to unordered choice in regular expressions because CFGs are the primary point of comparison. AHMartin (talk) 16:51, 31 January 2017 (UTC)[reply]

LL(1)?[edit]

Isn’t this just non-strict LL(1)? It works like LL(1), but conflicts get ignored, we say “first rule has precedence”. Otherwise I would like to have an example for great stuff Packrat can parse. --Chricho (talk) 12:48, 11 September 2010 (UTC)[reply]

I haven't taken the time to study the exact relationship between PEGs and the various classes of grammars, but it'd be something more akin to LL(∞). Obscuranym (talk) 21:36, 13 September 2010 (UTC)[reply]

The ANDing of alternatives is different. Rp (talk) 12:43, 14 September 2010 (UTC)[reply]

Okay, & and ! are not possible to simulate with LL(k) or any CFL-parser. However, the actual difference between a parser calling itself Packrat-parser and a parser calling itself LL(1) is very small, the LL(1) parsers often provide a roll-back-feature (LL(*), not LL(∞)) which is more than enough for real-life programming languages and not slower than Packrat’s & and !. Packrat certainly does not run in O(n), it depends on the grammar, choose this poor, pathologic example: S ← a (&S) S b | b, it will run in O(2^n), that is not a real-life-example, but in real life even LL(*)-backtracking does not have O(2^n). --Chricho (talk) 00:00, 15 September 2010 (UTC)[reply]

Sorry, I probably missed something, but why shouldn't a packrat parser run in linear time on "S ← a (&S) S b | b"? Because the (&S) is ultimately superfluous, if it accepts, the parser will move on to S, which will reuse the result of (&S) due to lazyness. If it does not accept, neither would an S alone. 188.29.165.128 (talk) 07:28, 22 October 2014 (UTC)[reply]

Example for CFL\L(PEG)[edit]

I think the anti-LL(1) example works for Packrat, too:

 S ← A|B
 A ← a A b | ε
 B ← a B b b | ε

or did I get anything wrong with the & operator and it is possible to express this language with a PEG? --Chricho (talk) 00:11, 15 September 2010 (UTC)[reply]

I wish someone would answer this. Rp (talk) 21:51, 8 February 2011 (UTC)[reply]

I think (!Ab)A / B works with your definitions of A and B. --Avakar452 (talk) 18:15, 9 October 2016 (UTC)[reply]

A question[edit]

Does:

 S ← &(a b) a b | &(a c) a c

accept the input ac?

 S ← a b | a c

would certainly not accept it. Or wait: “The choice operator e1 / e2 first invokes e1, and if e1 succeeds, returns its result immediately. Otherwise, if e1 fails, then the choice operator backtracks to the original input position at which it invoked e1, but then calls e2 instead, returning e2's result.” – that cannot be true. I think it makes the decision based on the first-set, it will not backtrack. --Chricho (talk) 00:17, 15 September 2010 (UTC)[reply]

Instead of talking about what you think, refer to the literature. -- 98.108.225.240 (talk) 03:33, 6 February 2011 (UTC)[reply]

use of 'memorization'[edit]

The linked topic doesn't have anything relevant. Either that's an invented term, inapt (caching comes to mind as an alternative), or the linked topic needs some sort of disambiguation to support the link TEDickey (talk) 11:36, 23 November 2010 (UTC)[reply]

The correct term is memoization, as in, creating memos. Whoever made this change probably thought they were correcting a typo. Msnicki (talk) 17:33, 23 November 2010 (UTC)[reply]

Is PEG Implemented in Any Programming Language?[edit]

I couldn't find one, but there are mentions of real parsers in the article. Would anyone post a link to a program / library where they are used? Thanks 109.64.130.34 (talk) 17:18, 25 August 2013 (UTC)[reply]

You mean something like this? http://www.inf.puc-rio.br/~roberto/lpeg/ Hans Adler 19:20, 25 August 2013 (UTC)[reply]

Yup, it would be good to add it. Not sure in what way, but the motivation for it is that PEG is, well, at least some times, mentioned in the context of programming as an alternative to regular expressions (I happen to believe that it would be a great alternative). Being able to try it "live" (I don't know Lua, but it's an easy to pick up language with interactive interpreter) should help by providing as many examples as one wishes. 109.64.130.34 (talk) 23:18, 25 August 2013 (UTC)[reply]

I only took a glance, but I think that library has a misleading name: it doesn't support nonterminals, so it cannot express certain uses of recursion in PEGs. What remains is in essence assertions, which Perl regular expressions have been providing for ages (albeit with limitations on the assertion expression).

Now what to do about it? Assertions in regular expressions deserve mention in Wikipedia, but do they deserve their own article? Neither Regular expression nor Assertion (computing) mentions them, and they are indeed a big step towards PEGs, so they can be described here - as long as they are not described as being the same thing as PEGs. Rp (talk) 10:32, 26 August 2013 (UTC)[reply]

I've found this: https://github.com/nikodemus/esrap through further searching. I know Common Lisp enough to tell that is does implement all that is mentioned in the article, but it does also things on top of that, like, for example, invoking arbitrary Lisp function on parsed text. So, I don't know. I'd also imagine there must be some code in Haskell that uses the parser from a paper linked here to create some usable tools, but I don't know Haskell well enough to tell. 109.64.130.34 (talk) 17:18, 26 August 2013 (UTC)[reply]

This one looks more strict implementation: http://common-lisp.net/project/cl-peg/ 109.65.104.130 (talk) 18:51, 11 September 2013 (UTC)[reply]

I think I finally found a good example (no or very basic programming skill is required, available for use online with the requirement of JavaScript being enabled) http://pegjs.majda.cz/ (this is a JavaScript library that implements PEG, it also has a "try online" feature). 109.65.104.130 (talk) 22:46, 29 September 2013 (UTC)[reply]

These maybe? Comparison of parser generators (see the section on Parsing Expression Grammars) Bruce IV (talk) 15:56, 5 October 2013 (UTC)[reply]

Yes, D has a library for generating PEG parsers called Pegged. It is very powerful and supports all kinds of left-recursion without rewriting rules. It even supports an optional longest match alternator operator (breaking with PEG principles) thereby increasing its applicability to a wider set of languages. Bastiaan Veelo (talk) 19:05, 19 April 2019 (UTC)[reply]

CPython switched to using PEG so the interpreter not only implements PEG but also Python is now a prime example of a mainstream programming language using PEG. See PEP 617 – New PEG parser for CPython. Wikipedia PEG article only mentioned Python in "See also", linking to Wikipedia Python article which does not mention PEG at all so this "see also" made sense but a really well hidden one. I tried to fix this.--Vaclav.hanzl (talk) 09:46, 16 January 2023 (UTC)[reply]

Disadvantages[edit]

The statement under the heading "Subtle grammatical errors" is pushing the definition of "disadvantage" a little. It's a pretty ridiculous leap of logic to say that PEGs are flawed because it takes effort to translate ambiguities from a lesser formalism. Those ambiguities are a flaw in the original grammar, not in the PEG translation. In fact, I'm going to remove this silly and bizarre statement, since it's completely flawed reasoning and adds nothing insightful to the article anyway. — Preceding unsigned comment added by 82.9.176.129 (talk) 14:46, 24 September 2013 (UTC)[reply]

Agreed. However, I think it's as much of an error to assume that ambiguities in grammars are always a mistake than to assume the opposite. Rp (talk) 06:55, 25 September 2013 (UTC)[reply]

The claim that the running time of the Bellman-Ford algorithm of $\Theta (|V|*|E|)$ is quadratic (in $|V|$ ) is true only if the graph is sparse (i.e. $|E|=O(|V|)$ ). A dense graph in which $|E|=\Omega (|V|^{2})$ would yield the same asymptotic running time as the Floyd-Warshall algorithm (which is cubic, not quadratic). Newtsjm (talk) 06:39, 22 August 2015 (UTC)[reply]

Not in

|V|

, in “the size of the input”, i.e.

|E|+|V|

. At least that's how I understand the sentence you are referring to. Mm (talk) 10:52, 14 October 2016 (UTC)[reply]

Question on a specific parser[edit]

Here is the deal. This isn't a parser but a programming language for writing compilers that generates code directly as written. The current method is that one uses a parser generator that usually generates a table for an already written parser. But being a programming language the programmer is writing the parser in a grammar specifically designed for the purpose. This was published in an ACM paper in 1970. It is close to a GTDPL (Generalized Top Down Parsing Language). Probably in it's own category an ETDPL (Extended Top Down Parsing Language). Extended because it explicitly control backtracking and may back track any number of rules. It also provides general look ahead features.

After looking over the PEG topic here I was thinking it to be a PEG. Because CWIC^[1] was developed and an ACM paper published in 1970 doesn't mean that it was ever classified as other then a metacompiler. Most of the parser classification (types) were not named at that time. So what if PEG is not a new idea? But in a way I an not so sure that CWIC is a PEG. These meta compilers were said to have recursive functions. One of the problems I have with categorizing some parsers as recursive decent. Is that there may not be any recursion. It may be that I am just old school but in my day recursion is a function calling it's self. Such as the recursive definition of a factorial. To me allowing recursion and using recursion are not the same thing. For example using the grammar we define an arithmetic expression as:

expr = term $(('+'|'-') term);  // This rule looks like EBNF. I know. But in CWIC it would not produce any output of any kind except success or failure.

That specific rule is not recursive. At some point recursion would be used for a parenthesized expression '(' expr ')'. In this language that rule does not produce output or actions. If to the above rule we add control for building lists and tree structures, recursion can be used to control the form of a tree.

expr = term $(('+':ADD|'-':SUB) term!2);    // builds a binary tree of the form ADD[<something>,<something>] or SUB[<something>,<something>]

Given the expression a+b-c we get first a tree of the form ADD[a, b] and loop to make a SUB[ADD[a,b],c] A left handed tree. using looping. Redefining expr in a recursion form:

expr = term (('+':ADD|'-':SUB) expr!2|.EMPTY);

We get a right handed tree ADD[a,SUB[b,c]] because the subtraction tree was built by calling expr recursively and then the ADD tree is built upon return.

Again if we redefine expr to make a list and only mark substraction:

expr = +[ term $('+' term | '-' term:MNS!1) ]+;

We get a list: [a, b, MNS[c]]

In this parser recursion is controlled by the programmer. Recursion is necessary for nested language constructs such as parenthesized expression (logical or arithmetic), nested block structures etc. The language is compiled into executable code. Each rule is a function returning success or failure. It specifically controls backtracking using a different alternate operator. limited backtracking is automatic when parsing strings and tokens. Only the input stream is backtracked. Controlled backtracking resets the state, undoing trees, nodes and lists. And may be nested. Controlled backtracking may bail out of any number of nested rules.

These examples are based on a compiler compiler that was working in 1970. It is a top down analytic parser no question. But is it really recursive decent? The example below is from an updated version defined to be more like c or c++.

Below 'program' is top driver rule. A program is zero or more declarations.

 program =  $ declarations;

 declarations = '#' directive
		| comment
		| global         DECLAR(*1)                 
		|(id (grammar    PARSER(*1)
		    | sequencer  GENERATOR(*1)
		    | optimizer  ISO(*1)
		    | pseudo_op  PRODUCTION(*1) 
		    | emitter_op MACHOP(*1)))
		|| ERRORX("!Grammer error") garbol);

 grammar =      ( ":"  class	:CLASS   // parse character class rule
		| ".." token	:TOKEN   // parse token rule
		| "==" syntax	:BCKTRAK // parse a ankered grammar rule
		| "="  syntax	:GRAMMAR // this grammar can long fail.
		)";"!2			 // Combine name and rule tree
			$(-break -"/*" .any);

A declarations can be a directive which is recognized by the '#' character. Or a declarations could simply be a comment.

comment      =  "//" $(-.break .any) | "/*" $(-"*/" .any) "*/");  // Defines a c or c++ stile comment such as this comment.

Or declarations could be a global declaration. Getting to declarations that start with an identifier. An id may be followed by a grammar, sequencer, optimizer, pseudo_op, or an emitter_op. It is the grammar I am interested on discerning the classification of here. The double bar alternate is a backtrack alternative that catches an error, Calls ERRORX and garbol. garbol simply looks for a ; and returns when matched. This is a case were backtracking can bail any number of stacked rules. A garbol rule recovers from an error and skips to the end of a rule:

garbol = $ (-';' .any) ';';   // zero or more, match not ';' loop not matching ';' (skipping .any characters not a ';') and then match ';'. skips to end of rule in error.

Below are some character class and token making rule examples:

bin:         '0'|'1';                                   // Binary character class can be a 0 or 1.
oct:         bin | '2' | '3' | '4' | '5' | '6' | '7';   // the oct class references the bin class
dgt:         oct | '8' | '9';                           // previous defined classes can be referenced.
hex:         dgt | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'    // '<character>' is a single character string.
                 | 'a' | 'b' | 'c' | 'd' | 'e' | 'f';

chr_const  .. "''''" , "'" | ''' any ''' MAKSTR();        // match a single character string.
str_const  .. '"' $(~'"' | """""" ,'"') '"'  MAKSTR();    // matches a string of characters. Double "" are matched, single " is placed in string

The class defining rule is:

class       =  +[ character $('|' character) ]+ ';';      // id ':' class :CLASS!2 PARSER(*1); CLASS{<id>,[<character list>]]
character   =  $comment (chr_const | integer | id classcheck()) $comment;

MAKESTR[] is a function that intercedes the making of a symbol and creates a string object. The sequence "id classcheck()" checks that id is defined as a class. It can not be the current class as it's id is not yet defined as a class.

The '..' token defining rules and ':' character classes are definitely not recursive. TOKEN rules could only use string constants and character classes. Character classes are not functions. In the implementation character classes are used in generating a class map table that was indexed by the character code. Bits at the indexed location indicated the characters classes memberships. Optimizing testing class membership with a single instruction on many computers.

token        =   chr_pattern (maker:INTERCEPT!2 ('|' token:ALT!2 |--) |--);
chr_pattern  =  +[basic $ basic]+;
basic       == '$' basic :$!1 |'(' tokenxpr ')' | str_const | char_match | insert; 
tokenxpr     =  chr_pattern ('|' (+"--" | tokenxpr) :ALT!2 | -- );
char_match   =  ('+':RETAIN|'-':NEG|'~':NOT) (str_const | chr_const)!1 | char_test;
char_test   ==  +"any" | +"ANY" | (chr_const | id) ('*' :$!1 | --);
insert       =  ',':INSERT (chr_const | str_const)!1
maker       ==  procid '(' +[ (m_arg $(',' m_arg)) |-- ]+ ')' :CALL!2;
m_arg        =  str_const | integer | character | id;

The above is thesyntax of the token defining rule.

Character class rules and token rules can not be recursive. Character classes can reference only preexisting character classes.

Token rules can only reference character classes and match stings. and call interceding functions as the last thing of an outermost alternate change. The integer rule of the compiler compiler is an example that parses an integer.

bin:         '0'|'1';
oct:         bin | '2' | '3' | '4' | '5' | '6' | '7';
dgt:         oct | '8' | '9';
hex:         dgt | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
                 | 'a' | 'b' | 'c' | 'd' | 'e' | 'f';
integer    ..  ("0b"|"0B")           bin bin* MAKEBIN()  // binary integer
              |("0o"|"0O")           oct oct* MAKEOCT()  // octal integer
              |("0x"|"0X"|"0h"|"0H") hex hex* MAKEHEX()  // hex integer
              |                      dgt dgt* MAKEINT(); // decmal integer

I am looking for opinions for classifying the beast. It could produce textual or binary machine code as output. The list or tree constructs were passed by calling function from the parser rules. A complete compiler could be written in this language. Except for run-time libraries for programs produced by compilers written in this compiler compiler. The examples are a new syntax that more resembles current languages. CWIC used / and \ for alternates. It's character set was limited to punch card codes available. The token rules are extended to allow multiple forms of the same type token. Integer in different bases as above. CWIC would have used a separate rule for each base and an integer rule

There is an example of a simple interpreter written in CWIC over on the metacompiler talk page.

Hope this helps.

But even if this is ultimately decided to be a PEG language it shouldn't nullify Ford's work. Most of the compiler terminology of today was not invented when CWIC was developed.

--Steamerandy (talk) 09:59, 27 October 2014 (UTC)[reply]

References

^ Book, Erwin; Dewey Val Schorre; Steven J. Sherman (June 1970). "The CWIC/36O system, a compiler for writing and implementing compilers". ACM SIGPLAN Notices. 5 (6): 11–29. doi:10.1145/954344.954345.

I removed the claim that LR parsers require tokenization.[edit]

The claim has been flagged with [citation needed] since 2011, and is false. Many tools for generating LR parsers expect the input to be output from a tokenizer, but that isn't a requirement. It's just a matter of defining the alphabet of terminals to be bytes or Unicode codepoints or whatever instead of tokens. ~ LukeShu (talk) 03:57, 13 November 2015 (UTC)[reply]

I have put a modified version of this claim back in, because it is a valuable point of contrast between PEGs and CFGs, at least as far as common practice for their use is concerned. This modified version explains how a CFG can be character-level, but also points out why that is cumbersome: in a context-free formalism, the decision regarding whether to require whitespace between two tokens has to be taken at the level of the last nonterminal to encompass both! General CFG parsers can certainly do that, but it is not at all clear to me that an LR or LL parser with bounded lookahead can manage the same. 88.129.117.158 (talk) 13:17, 27 December 2023 (UTC)[reply]

Hyperlink for Packrat parser redirects to same section[edit]

Under "Implementing parsers from parsing expression grammars", it says "It is possible to obtain better performance for any parsing expression grammar by converting its recursive descent parser into a packrat parser". "packrat parser". redirects to that same section. If someone wants to fix that somehow, idk. — Preceding unsigned comment added by 70.131.44.130 (talk) 02:07, 7 March 2019 (UTC)[reply]

Bringing reverted citations back in[edit]

I am going to argue for reviving the citations that were removed in revision 812879723, see https://en.wikipedia.org/w/index.php?title=Parsing_expression_grammar&type=revision&diff=812879723&oldid=812750522. The reason for removal by User:Melcous was legitimate (linking to own website) but the article has since been published (https://doi.org/10.1145/2814251.2814265) and is available from an independent open access provider (https://arxiv.org/pdf/1509.02439v1.pdf). Both Medeiros' and Laurent's contributions improve on the existing citation of Warth et al., and both were instrumental for me in adding left-recursion support to Pegged. I am therefore considering to bring both citations back in, and maybe add a sentence about their individual contributions. Please comment if there are objections. Bastiaan Veelo (talk) 22:12, 19 April 2019 (UTC)[reply]

I have no objections to the published article being added by an independent editor, particularly now it has been published. Melcous (talk) 23:54, 19 April 2019 (UTC)[reply]

Incorrect grammar in "Indirect left recursion" section[edit]

The grammar given in the "Indirect left recursion" section is incorrect. It would parse "1+2*3+4" as ((1+2)*3)+4.

Boolean context-free grammars[edit]

Since this edit, the article has likended the lookahead constraints ! and & of PEGs to those of boolean grammars. That's a weird point of reference, considering that boolean grammars (i) appears to be a more recent development (main publication 2010, as compared to 2004 for PEG) and (ii) are more complicated to interpret (require solving a system of functional equations). If one wishes to compare the PEG lookaheads to anything, it would make more sense to reference those of advanced regular expressions — (?=…) and (?!…) as described for example in [2] — but of course any comparison with regular expressions opens up the Pandora's box of readers getting confused by not-at-all regular features available in major "regular expression" engines. Therefore I'll just undo that edit, moving the link instead to the "See also" section.

To elaborate on the regular expression option: that comparison would be apt, in that adding or removing lookaheads as features will not change the underlying class of languages — regular languages and parsing expression languages are still the same. On the PEG side Ford shows this in his paper, by the transformation to eliminate predicates. On the regular side it's probably easiest to do in terms of automata: take the cartesian product of the automaton for the lookahead expression and the automaton for the main expression (minus the lookahead), then modify set of accepting states so both have to accept (there are some additional fine details to deal with, which depend on whether you do it in terms of DFAs or NFAs, but mainly you runs multiple automata in parallel).

Moreover, I'm somewhat skeptical towards the original claim. The lookaheads in regular expressions and PEGs are notably not constrained to a particular subexpression — that is essential for how they are used in the grammar for $a^{n}b^{n}c^{n}$ — but boolean grammars rather seem like all patterns involved in the definition of a nonterminal are matched against the same range of input (though that article is still pretty much a stub, so I could be getting it wrong). Thus even if both kinds of grammar might have "lookahead constraints" that can be eliminated without changing the class of languages, it is likely that those constraints do not behave in the same way. 88.129.117.158 (talk) 15:48, 25 December 2023 (UTC)[reply]

Abstract versus concrete syntax[edit]

The start of the article introduces PEGs using abstract syntax, e.g. $e_{1}/e_{2}$ for ordered choice, but later the exposition switches to concrete syntax, e.g. E1 / E2 and 'x' E1 , without comment or explanation. That is likely confusing if you're not already familiar with these grammars, and should be clarified. 88.129.117.158 (talk) 14:50, 28 December 2023 (UTC)[reply]

Eliminating left-recursion[edit]

The article gives Aho–Sethi–Ullman (1986) as one source for the claim that left recursion can be eliminated. It seems dubious that the techniques in this book would suffice for PEGs; the CFG technique of repeatedly expanding an initial nonterminal works because if the initial item is not a nonterminal then it is a terminal and thus not zero length, but in a PEG there are zero-length items which don't just disappear. Consider

  S <- !(.'x') S / 'wx'

88.129.117.158 (talk) 12:30, 5 January 2024 (UTC)[reply]

Confirming that the techniques in Aho–Sethi–Ullman — relevant pages are 176–178, plus some exercises — are specifically for context-free grammars (and even if that could be lifted, for generative grammars). They rely heavily on order of choices being irrelevant. 37.197.59.228 (talk) 11:30, 13 January 2024 (UTC)[reply]

[CWIC-1] Book, Erwin; Dewey Val Schorre; Steven J. Sherman (June 1970). "The CWIC/36O system, a compiler for writing and implementing compilers". ACM SIGPLAN Notices. 5 (6): 11–29. doi:10.1145/954344.954345.

[1]