Talk:LR parser

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computer science (Rated High-importance)
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
 ???  This article has not yet received a rating on the project's quality scale.
 High  This article has been rated as High-importance on the project's importance scale.
 

Contents

[edit] Article length

It's a bit long now. Perhaps there should be a separate article about LR(0) parsers that deals with the LR(0) algorithm. -- Jan Hidders 10:01 Aug 18, 2002 (PDT)

[edit] This article is useless for a beginner

Two key notions used in the article "reduction" and "derivation" are not explained. Without these notions the article is useless for a beginner. —Preceding unsigned comment added by 82.240.122.196 (talk) 19:08, 25 April 2009 (UTC)

[edit] Parse position indicator

This article is one of the better explanations of LR parsing I have seen. My only complaint is the dual use of the "*" character as (1) the current position in a in the LR(0) item, and (2) as the multiplication operator. That leads to items like "E * * B" (an actual example from the text) which is confusing. Perhaps we could just use a period? --Doradus 22:35, 23 Aug 2003 (UTC)

(Scratch that; apparently Konqueror renders · as an asterisk. --Doradus 23:16, 23 Aug 2003 (UTC))

[edit] SLR and LALR overlap

Looks like the "note on SLR and LALR" overlaps a bit with the section on conflicts. Perhaps they could be combined. --Doradus 01:38, 26 Oct 2003 (UTC)

[edit] E and B nonterminals

It's a really minor thing, but could we maybe change the grammars which use nonterminals "E" and "B" to use some other letters? They often look awfully similar on my screen. Marnanel 05:53, 4 Feb 2004 (UTC)

[edit] Programming Languages not LR

I'm looking for a reference for the claim that Perl and C cannot be parsed by an LR parser. Does somebody know one? -- Jan Hidders 14:12, 21 Jul 2004 (UTC)

The one thing that I know of is the dangling else problem. Because of this problem, not only can C not be parsed by a (pure) LR parser, it is not even an unambiguous grammar. I'm sure a Google search for dangling else will produce a large number of references, and if you're lucky, some of them may even have references with regards to other reasons C cannot be parsed by an LR parser (if there are indeed other reasons; I would imagine there are, but I don't know). CyborgTosser (Only half the battle) 07:51, 17 Apr 2005 (UTC)
It's possible to create unambiguous grammar for languages with if-else construction, but requires introducing a new nonterminal symbol: all_instructions_but_if_with_else. However such grammar becomes unnatural and less readible and that's why most parsers use another ambiguousness handling techniques, like associativity declarations.
Typedefs are another issue with C. They make the language context-dependent. It's discussed a little bit here [1], but I think there's a better reference somewhere. --Werdnagreb 19:06, 5 May 2005 (UTC)
The article now refers to Perl and C++. C++ is probably a better example than C, as I know C parsers get around both of the points noted above with fairly inocuous kludges, whereas C++ is trickier. It might be worth mentioning that early versions of Fortran were even bigger offenders, although the problems go beyond difficulty of parsing. CyborgTosser (Only half the battle) 01:07, 8 May 2005 (UTC)
Perl is worse, says Perl Cannot Be Parsed: A Formal Proof, but I haven't read it. Rp (talk) 09:50, 10 February 2009 (UTC)

[edit] Minor improvements to the article

This is, indeed, a great explanation of the LR(0) algorithm. Nevertheless, I'd like to suggest a few changes that may make it better:

a) Change the term "begin state" to "start state", as it seems to be the most correct form to reference the, err, start state of a grammar.

b) the section "Conflicts in the constructed tables" needs some editing, as it seems. For example, it begins with the phrase "The automaton that is constructed is by the way it is constructed always a deterministic automaton", which, although correct, sounds confusing (at least to me). A better wording might be: "The steps presented so far always construct a deterministic automaton." or "The automaton contructed by looking for transitions between item sets will always be a deterministic one."

c) Besides (still in the section referred above), we find several references to the grammar not being LL(0). Because we're discussing LR(0), it looks like a typo. It would be better if an explanation of how the non-LL(0)-ness of the grammar afftects the LR(0) parser was given.

d) Finally -- and maybe this is just in my system, I guess that using • instead of · to represent the dot would make it more visble.

Great article. --Branco 13:39, 23 October 2005 (UTC)

Go for it. --Doradus 22:57, 13 July 2007 (UTC)

[edit] Discussion of LR(1) Construction Needed

Although LR(0) item set construction is important bacause it is the basis for the SLR(1) parser construction algorithm as well as most LALR(1) construction algorithms, it should be pointed out that very few parsers are based on LR(0) itself. Indeed, LALR(1) parsers such as those produced by Yacc and GNU bison seem to be the most commonly used LR(k)-based parsers. Discussing LR(1) item set constuction would not only describe the LR(k) methodology when lookahead is a component of an item (as it always is except for the trivial case of k = 0), it would provide the basis for understanding the LALR(1) construction method and why an LR(1) grammar may not be LALR(1) or why an LALR(1) grammar may not be SLR(1).

It might also be pointed out that although a grammar might be LR(k) for some k > 1, there will always be LR(1), LALR(1), and SLR(1) grammars for the same language. --Raaronson 16:58, 27 December 2005 (UTC)

[edit] Discussion of construction of grammars for LR parsers?

It might be worthwhile to have a section about what sort of grammars work best with LR parsers: yes, they can parse all LR grammars which is a strictly-defined set, but some work better than others. E.g.:

 list ::= item ',' list
 list ::= item '.' 

is an LR(1) grammar that would need to shift the entire input onto the stack before reducing it, whereas:

 list ::= list-items '.'
 list-items ::= item
 list-items ::= list-items ',' item

is a grammar for the same language that allows the parser to reduce after each item. I just added a note to left recursion that mentions this, although not in as much detail. The same reference [2] is probably useful here. JulesH 08:06, 4 August 2006 (UTC)

[edit] removed sentence

I removed from section 1.2:

In practice the parser will usually try to recover from a syntax error by ignoring unexpected symbols and/or inserting missing symbols, and continue parsing, but this is outside the scope of this article.

This paragraph is like taken from a tutorial. Well, I imagine it's only because of the "outside of scope" thing. Whatever. --euyyn 22:14, 4 September 2006 (UTC)

It's also not outside the scope of this article. If someone wants to describe common techniques for error recovery, that's perfectly on-topic. --Doradus 22:55, 13 July 2007 (UTC)

[edit] Web resource for LR(k) parser construction

These are what I found when searching on the Internet for my project on implementing a LR(1) yacc:

- Practical LR(k) Parser Construction [3]. By David R. Tribble, 12/12/2004. This is a long and complete introduction, although many places are labeled as "+INCOMPLETE".

- The Honalee LR(k) Algorithm [4]. By David R. Tribble, 4/27/2006.

BTW, in "E → E + B", maybe B can be changed to T so it looks different from E. "Item" is called "Configuration" in some early papers in the 1970s.

-- oxc 10:04, 7 December 2006 (UTC)

Yes, the Honalee algorithm is an LR(k) parsing algorithm I invented a while back. It creates LR parsing tables for what I called Merged LR (MLR) grammars, which lie somewhere between LALR and LR grammars in complexity. I have since discovered that it does not, however, handle all LR grammars. There are LR grammars that have mutually conflicting item sets that cannot be merged, but which my algorithm is not smart enough to detect. At some point when I have more time, I'd like to go back and complete the web pages. In the mean time, feel free to paraphrase portions of it for these WP articles. | David R. Tribble / Loadmaster (talk) 15:23, 10 February 2009 (UTC)

[edit] Closure of item sets

I am not convinced that the example is entirely consistent with the given definition of closure.

In particular, in construction of the item sets the rule "S → E •" is added to Item set 3, and two others are added to complete the closure. However this rule has empty follow set, so closure as defined (as I understand) will not add a rule such as "E → E • * B" to the item set.

It is entirely possible that I have misunderstood some part of this -- if so perhaps this is an indicator that the explanation needs clearing up a little (or that I should pay attention in lectures :p ), but a comment either way by someone more knowledgable than I would be appreciated. JoeKearney 19:52, 6 April 2007 (UTC)

With a bit more thought, I believe that what were trying to say is this:

If we're at the state of "S → E •" then we want (i.e. the closure includes) all rules to parse anything in the follow-set of E.

In particular it includes rules 1 and 2: "E → E • * B" and "E → E • + B".
I believe the definition of the closure doesn't fully account for this, but I'm not certain. If I have a spark of inspiration, or (more likely) if someone would care to point one way or the other I'll edit to make this more clear. JoeKearney 20:19, 6 April 2007 (UTC)

[edit] Parser is NOT a finite state machine

The article states that "The parser is a finite state machine". This statement is false, as the parser has a stack, which means it has a potentially infinite number of states. Perhaps the parser is a Pushdown Automaton, but i am not sure enough to change it. Someone who does know for sure, please change this. 62.131.189.89 07:55, 22 May 2007 (UTC)

I don't think your assertion is correct. A state within a state machine is a node in a directed graph that represents the progress of the finite state machine up to that point. Unless the machine has an infinite number of nodes, it is indeed a finite state machine. — Loadmaster 20:07, 11 July 2007 (UTC)
No, he's right. An LR parser requires a pushdown automaton (PDA), which is a FSM plus a stack. A PDA can't be modelled by an FSM. It would need an "infinite state machine". --Doradus 22:54, 13 July 2007 (UTC)
My take on this has always been a little different. It is true that a straightforward execution of an FSM cannot be used to recognize sentences of an LR language (except for the special case of a regular language). But the parser does use an FSM to recognize viable prefixes of right sentential forms. When a final state of this FSM has been reached, the last N symbols scanned are the handle of the right sentential form and are replaced by a new (non-terminal) symbol creating a new right sentential form (the next step in the derivation in reverse). This replacement transformation is determined by tables that have been generated for the grammar. Conceptually, the FSM is then re-executed and the newly transformed sentential form re-scanned from the beginning looking for the next handle. This scanning and transforming is repeated until the transformed input consists only of the goal symbol. If you take a closer look at what happens with two successive executions of the FSM, you realize that the restarted FSM will make the same state transitions as the prior execution until it encounters the non-terminal symbol that was just replaced for the handle detected during the prior execution of the FSM. As an efficiency, then, the entire re-scan can be avoided by remembering (by using a stack) what those state transitions had been. So an LR language can be recognized by repeated executions of an FSM with input transformation performed between the executions (which, admittedly, is an altogether different thing than a simple application of an FSM). The use of a stack can be viewed as an optimization. -- Raaronson (talk) 13:01, 13 March 2008 (UTC)
The process you describe is not a useful one. Transforming and re-parsing the entire input after each reduction step would render the parsing process uselessly inefficient, requiring unlimited memory and O(n^2) time. Parsing a CFG can't be done online with a DFA or NFA. --Doradus (talk) 23:29, 11 February 2010 (UTC)
The LR parsing algorithm is driven by a parser table, which is a data structure with finite states and state transitions. The parser uses a stack during its operation to keep track of the states it has passed through, so it is a push-down automata (PDA). I think "finite-state machine" (FSM) is a more general term that can refer to an NFA, DFA or PDA. Paul B Mann

[edit] Recursive ascent parsing

Prefaced the description of the stack machine with a remark about the alternative, functional, recursive ascent, view on LR parsing, and added a corresponding reference 81.5.0.80 18:45, 11 July 2007 (UTC).

[edit] Dangling else and typedef

I made a few corrections. A parser does not deal with a grammar. It reads computer languages. It's the parser generator that "deals" with a grammar.

About the "dangling else" problem ... A BNF grammar that contains the dangling else problem is ambiguous and NO parser can handle it, unless the ambiguity is resolved !!! An LR paser can and does handle the dangling else problem just fine, because the ambiguity (or conflict) gets resolved at parser generation time by chosing to shift the "else" onto the parse stack instead of making a reduction before accepting the "else". So, it is a misconception to say that an LR parser cannot handle the dangling else problem.

About C++ and typedef ... A simple or pure LR parser cannot handle the context sensitive problems with C++ and the typedef in C. However, a smart LR parser which uses the symbol table to classify <identifier>s as "typedef" or just <identifier> can successfully parse C++ and C typedef situations. So, again, one should not generalize that LR parsers cannot handle C++, because in practice it is possible. Theoretical computer scientists might disagree, but then changing definition of LR parser can decide the argument if you want to get mathematical. I just don't like to put limitations on LR parsing, besides what other techniques are you going to use -- recursive descent (what a nightmare)?

Paul B Mann —Preceding undated comment added 00:36, 10 September 2009 (UTC).

[edit] History

This article is sorely lacking a History section. Does anyone know enough to add one? --Doradus (talk) 16:46, 3 February 2010 (UTC)

I have the "Dragon Book" (Aho, Sethi, Ullman) on my bookshelf. I will consult it later to see if there is sufficient information there to provide a history. It's been years since I used it, so I don't remember right off the top of my head. Dead Horsey (talk) 01:38, 4 February 2010 (UTC)
I checked my Dragon Book, and there isn't enough information there to write a history without violating WP:NOR. Dead Horsey (talk) 23:35, 21 February 2010 (UTC)
See History of compiler construction Paul Foxworthy (talk) 12:47, 11 March 2012 (UTC)

I belive Knuth invented LR parsers. At least that information should be in the article and a reference to the original publication. 89.153.132.2 (talk) 23:22, 8 May 2011 (UTC)

Yep, he did, in 1965. The LALR parser article states that:
LR parsing was invented by Donald Knuth in 1965 in a paper, "On the Translation of Languages from Left to Right".
So I added it to this article. — Loadmaster (talk) 16:33, 9 May 2011 (UTC)

[edit] LR(k) == LR(1)?

Article says "Every LR(k) grammar for k > 1 can be mechanically transformed into an LR(1) grammar", then later it says "most programming languages can be expressed with LR(k) grammars, where k is a small constant (usually 1)." Surely, given the first quote, 'usually 1' makes no sense? —Preceding unsigned comment added by 80.0.167.213 (talk) 17:27, 3 August 2010 (UTC)


I addressed this issue and clarified and referenced this part of the article.

Espadrine (talk) 12:56, 29 May 2011 (UTC)

[edit] Arbitrary context-free-languages without penalty?

“LR parsing can be generalized as arbitrary context-free language parsing without a performance penalty, even for LR(k) grammars.” What? This sentence does not make sense, I thought, LR(k) ⊂ CFL, and in the following sentences a performance-penalty is mentioned. --Chricho (talk) 12:07, 11 September 2010 (UTC)

Personal tools
Namespaces

Variants
Actions
Navigation
Interaction
Toolbox
Print/export