Talk:Top-down parsing

From Wikipedia, the free encyclopedia
Jump to: navigation, search

The article is about LL parsers, but doesn't explain clearly what a top-down parser is. IMHO, it should be rewritten.

What is a top-down chart? IMHO, this article should answer this question. --67.83.133.218 01:24, 3 February 2006 (UTC)

what means "w.r.t."? —Preceding unsigned comment added by 217.84.40.79 (talk) 12:02, 18 June 2008 (UTC)

with respect to

No input is specified in the "Programming language application" section. How are the the productions being matched? —Preceding unsigned comment added by 63.199.81.214 (talk) 03:48, 26 November 2008 (UTC)

"While LR can be said to be better for languages it is also backwards to specify" - This sentance doesn't make sense. — Preceding unsigned comment added by 140.168.135.1 (talk) 00:07, 22 May 2014 (UTC)

And now for something completely different[edit]

There may be errors in the following. Not the bast at typing when thinking. One common problem is teh for the. Seams my hands have a race condition and the left beats the right. Will get back to this as time allows.

Top down parsers may have nothing to do with formal grammars. A top down parser may expresses a top down analysis approach. A rule based parsing program expressed by early metacompilers is an example. If we have a formal grammar of a language it must be rewritten in an analytical form. One may design a language in analytical from from scratch. It is easier to do an example then talk about. The parser is defined using rules. A rule is a function that returns success or failure. A rule is written using tests (returning success or failure) that operate on the input stream and is itself a test. A string "..." is a test. "=" would try and match the input stream against the string '='. Rules generate code directly. They may call other rules or functions. The function they may call are special function that also return success or failure. Function have parameters distinguishing them from grammar rules. Function in the metaprogram are a type of rule that analyzes the structures produced by the grammar rules. The programming model has three stacks and a token buffer. The three stacks are call stack, parse stack and node stack. The grammar rules are function and use the call stack to save return points in the grammar rule. The parse and node stacks are used in building tree structures that are passed to procedural function that operate on trees and lists structures. The token buffer holds matched by a token rule and is automatically entered into a symbol unless a function call intercedes and creates a different type of object. The language is an object programming language. The parser transforms the input character stream into objects that are then processed by functions having a lisp 2 like grammar with input parameter unparsing rules. This example is about the parsing programming. A top down parser not based on a formal grammar:

The analytical rule parser begins execution with a driving rule. A program defined by rules in this manner is defined by the driving rule:

We start at the top and write a rule defining what a program is in the language we intend to compile.

  program = $declaration;              // A program is a sequence of declaration.

The above driving rule is simply declaring a program to be a series of declaration. The $ operator is specifying that zero or more declaration are to be parsed. If we needed to require at least one declaration the rule would be written:

  program = declaration $declaration;  // A program is a sequence of one or more declaration.

A more complicated driving rule for the COBOL language might look like:

  program =  IDENTIFICATION_Division ENVIRONMENT_Division DATA_Division PROCEDURE_Division;

The rules above are an LR description of the entire program. They directly generate code. Rule (1) above in assembly is straight forward. A rule is a function that returns a condition of success or failure. The $, zero or more operator, is a programed loop that applies to the expression following it. The expression code is generated inclosed in a loop construct. Do <something> while successful. So that means we need a label for the start of the loop, generate code for the <something> and generate code to jmp back to the labeled start of the loop on success. On failure of the something the $ loop operator is successful. So in the program rule we have code to return success.

C++ procedure are used in the examples here. The actual code is IA86 assembly language encapsulated in a "__declspec(naked) void" function.
//   program = $declaration;
__declspec(naked) void program(){  _asm {
l1:     call    declaration
        je      l1
        cmp     al,al
        ret
}}

The implementation of the rule in this case is vary simple. The __declspec(naked) creates a c++ function having no preamble or exit code. The _asm { ... } is declaring the contained code to be assembly language. Above the entire function is assembly code. There is no additional code being generated by the C++ compiler in the above example.

When the rule "program" is called it is entered at the first assembly code line of the program function:

l1:     call    declaration

The "call" is an assembly instruction that calls a function. The l1: is a label used in addressing that line of code. A grammar rule returns success or failure in the processor status register zero flag. The __declspec(naked) allows programming outside the C++ context. We are not using C calling conventions writing in assembly. The void program() declaration is only providing a label for our rule. It also allows returning of success or failure in the processor flag register. There is no entry or exit code generated by the compiler that could effect the processor status. So on return from declaration,

        je      l1

tests the processor flag and if a equal condition is true it jumps to the l1 tag calling declaration repeatedly as long as declaration is successful. The $declaration is zero or more declaration. If declaration fails the status flag will indicate a NE condition. and the je l1 will instead proceed to the next instruction cmp al,al (comparing the contents of register al to itself) seting the equal condition as the $ zero or more operator always succeeds. The rule returns an eq condition to it's caller. The alternate rule, program = declaration $declaration; requiring at least one declaration with comments:

__declspec(naked) void program(){  _asm { // program is declared to be a naKed function so there is only the assembly code.
        call    declaration               // call the declaration rule
        je      l1                        // Test success
        ret                               // return failure
l1:     call    declaration               // calls declaration, another naked void function that parses a declaration.           
        je      l1                        // declaration will be called repeatedly until ne condition in the processor status
        cmp     al,al                     // comparing al to it self always sets the EQ condition in the status register.
        ret                               // returns success an EQ condition in the status register.
                                          //  comparing esp to 0 is a sure way of seting the failure condition.
}}                                        //  rules have no local variables values values returned.

The next step would be defining declaration. The example here very simple to illustrate the top analysis of the input. The language has three types of rules: Character class rules define characters that are used in higher level rules. They generally do not generate code directly. The most efficient implementation is character classes are bits in a character class map table indexed by the character code. The class map table is tested usually by a single machine instruction for a character membership in a class. Token making rules define the tokens of the language. white spacing is automatic when recognizing a token. The skipclass character class may be defined or the default used.

  program = $declaration;              // A program is a sequence of declaration.

// define character classes, dgt, let, and alphanum, used in defining tokens

dgt       : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';
let       : 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' |
            'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' |
            'u' | 'v' | 'w' | 'x' | 'y' | 'z' |
            'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' |
            'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' |
            'U' | 'V' | 'W' | 'X' | 'Y' | 'Z';
alphanum  : let | dgt;

// define token of the language

number    .. dgt $dgt MAKENUM();

// define the language syntax

id        .. let $alphanum;

declaration = "quit" EXIT() | (assignment || output || ERROR("invalid input") clear_input());
assignment  = id '=':store expression ';' !2 eval(*1);
output      = expression ':' eval(*1) print(*1);

expression = term $(('+':ADD|'-':SUB) term!2);
term       = factor $(('*':MPY|'/':DIV) factor!2);
factor     = '(' expr ')' | number | id ;

The above an analytical top down grammar of a simple language for an interactive interpreter. The above is a program for a top down parser written in an analytical parsing language. A single '|' is an alternant operator. 'A' or 'B' written as 'A' | 'B'. Backtracking is expressed using the "||" alternate operator. A token rule automaticly backtracks to the state upon token entry and return failure. In '=' rules backtracking is controlled by the "||" alternant operator. literal strings are also tokens and automaticly backtrack to the state upon entry to their matching routine. In the above the "||" is only used in the top driving rule declaration. Where is "quit" is matched exit is called and terminates the program. If "quit" is not matched the alternant (assignment || output || ERROR("invalid input") clear_input()) is tried. First assignment is called. where say an id is matched. If an '=' is not recognized assignment fails returning to declaration where the state is reset to the before assignment was called and the next alternate output is called. output is going to look for an expression. An expression can be as simple as an id or an arithmetic expression. If during matching an expression a failure occurs the state would be reset to when output was called in declaration and the next backtracking alternant tried. Where ERROR("invalid input") is called and the input cleared by a call to clear_input()). If expression succeeds eval is called. eval is a generator function having a different syntax. eval may be programed to fail is an id has not been assigned a value. Or it may just evaluate what numerics it can and return a partial evaluated tree and print programed to display an expression. In the SYNTAX above the ":" operator makes a tree node and the "!2" makes a 2 element tree combining the node and the top 2 parse stack entries. The language three stacks. A call stack, parse stack, and a node stack. As entries entries are matched they are placed on the parse stack. Matched literal strings "..." are not normally put on the parse stack. A '+' operator may be used to place a matched literal on the parse stack. Strings may be matched using the ' or " character. ' may be used only to match a single character string. 'a' is a valid 'ed string 'xyz' is not. A " character must be used to match a multi character string. Only ' strings are allowed in character class rules.

The above grammar program is a top down analytical definition of a language. It can parse many context sensitive constructs. Backtracking is controlled by the grammar programming. In the above backtracking is used to resolve the ambiguity of outputting an id and an assignment statements. The order of evaluation is important as the assignment is tried first as an expression in the output rule may be an id. And last the backtracking alternant is used to recover from an error. Of note is that eval can fail if an id has not been assigned a value. The language the above is written is a metacompiler metaprogramming language. There is a symbol table and symbol may have attributes. A value attribute may be assigned the evaluation of an expression. The pressensance of the value attribute can be tested by the eval function.

This doesn't exactly fit the description given in the article here on Top-down parsing.

--Steamerandy (talk) 18:00, 15 November 2014 (UTC)