Talk:Lexical analysis

From Wikipedia, the free encyclopedia
Jump to: navigation, search
WikiProject Computer science  
WikiProject icon This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
 ???  This article has not yet received a rating on the project's quality scale.
 ???  This article has not yet received a rating on the project's importance scale.


This article is not clear enough! It needs more examples!

Robert A.

The link to ocaml-ulex in 'Links' is broken.

Frank S.

Robert, this article is actually very poorly articulated. It seems that its authors do not have enough clarity of how to explain it, for it to be comprehensible. And the article should probably be marked that way - that is requires further clarity.Stevenmitchell (talk) 16:08, 25 January 2014 (UTC)

Types of tokens[edit]

Can someone explain to me what types of tokens there are? (Keywords, identifiers, literals? Or are there more) And do each of these types go into a symbol table of that type? i.e. an identifier table, keyword table and literal table? Or are they just stored into one uniform symbol table? —Dudboi 02:17, 6 November 2006 (UTC)

It really depends. For instance, for the example language (PL/0 - see the appropriate for example source code) in the article, here are the available token types:

multi-character operators: '+', '-', '*', '/', '=', '(', ')', ':=', '<', '<=', '<>', '>', '>=' language required punctuation: ',', ';', '.', literal numbers - specifically, integers identifiers: a-zA-Z {a-zA-Z0-9} keywords: "begin", "call", "const", "do", "end", "if", "odd", "procedure", "then", "var", "while"

Other languages might have more token types. For instance, in the C language, you would have string literals, character literals, floating point numbers, hexadecimal numbers, directives, and so on.

I've seen them all stored in a single table, and I've also seen them stored in multiple tables. I don't know if there is a standard, but from the books I've read, and the source code I've seen, multiple tables appears to be popular.

second task performed during lexical analysis is to make entry of tokens into a symbol table if it is there.Some other task performed during lexical analysis are: remove all comments,tab,blank spaces and machin characters. produce error massages occerrd in a source programs.

See the following links for simple approachable compiler sources:

See for more information on PL/0. 18:07, 13 November 2006 (UTC)

merger and clean up[edit]

I've ``merged`` token (parser) here. The page is a bit of a mess now though. The headings I've made should help sort that out. The examples should be improved so that they take up less space. The lex file should probably be moved to the flex page. --MarSch 15:37, 30 April 2007 (UTC)

done the move of the example. --MarSch 15:39, 30 April 2007 (UTC)


It would be nice to see what is involved in the next step, or at least see a link to the page describing the whole process of turning high level code into low level code

-Dusan B.

you mean compiling? --MarSch 10:53, 5 May 2007 (UTC)

Lexical Errors[edit]

There's a mention that scanning fails on an "invalid token". It doesn't seem particularly clear what constitutes a lexical error other than a string of garbage characters. Any ideas? -- (talk) 04:54, 27 November 2007 (UTC)

If the lexer finds an invalid token, it will report an error.
The comment needs some context. Generally speaking, a lexical analyzer may report an error. But that is usually (for instance in Lex programming tool) under control of the person who designs the rules for the analysis. The analyzer itelf may reject the rules because they're inconsistent. On the other hand, the parser is more likely to have built-in behavior -- yacc for instance fails on any mismatch, requires the developer to specify how to handle errors (updating the article to reflect these comments requires citing reliable sources of course - talk pages aren't a source) Tedickey (talk) 13:56, 27 November 2007 (UTC)

Lexer, Tokenizer, Scanner, Parser[edit]

There should be some more clearly explaned what exactly differs those modules, what are their particular jobs and how they're composed together (maybe some illustration?) —Preceding unsigned comment added by Sasq777 (talkcontribs) 16:06, 11 June 2008 (UTC)

I agree, the article does not make it clear which is what, what comes first etc. In fact, there seems to be a few contradictory statements. A block diagram would be ideal, but a clear explanation is urgently needed. Thanks. (talk) 01:03, 21 June 2008 (UTC)

Something like page 4 of this document, which incidentally can be reproduced easily here (see Preface). (talk) 08:50, 21 June 2008 (UTC)

In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given (more or less) formal grammar.

Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. The term parsing comes from Latin pars (ōrātiōnis), meaning part (of speech).[1][2] —Preceding unsigned comment added by (talk) 13:09, 25 September 2008 (UTC)

There are two different definitions of tokenization, one in the Tokens, the other in the Tokenization subsection. —Preceding unsigned comment added by (talk) 14:24, 28 March 2011 (UTC)

Regarding "Lexical analyzer generators" section[edit]

It seems to be a duplicate of List of parser generators. Probably can be removed? (talk) 02:57, 6 December 2009 (UTC)

It's not a duplicate (as the casual reader will observe). Tedickey (talk) 12:00, 6 December 2009 (UTC)

table-driven vs directly coded[edit]

I dont think the table-driven approach is the problem - see 'control table' article - flex appears to be inefficient and is not using a trivial hash function.—Preceding unsigned comment added by (talkcontribs)

Editor comments that lex/flex doesn't use hashing (that may be relevant). My understanding of the statement is that state-tables can grow very large when compared to hand-coded parsers. Agreeing that they're simpler to implement, there's more than one aspect of efficiency. Tedickey (talk) 20:37, 2 May 2010 (UTC)
It really depends what is meant by a "table-driven" approach. If you are simply talking about tables of "raw" specifications that have to be parsed/interpreted before execution v. embedded hand coded If statements, the original text may be correct.
If however you are talking about a really well designed "execution ready" control table - that is then used to input an already compacted and well designed "table of specifications" (v. verbose text based instructions), it can be much superior in algorithmic efficiency, maintainability and just about every other way. In other words, the phrase "table-driven" is maybe used rather ambiguously in this context, without reference to the much deeper "table-driven programming" approach. —Preceding unsigned comment added by (talk) 10:53, 7 May 2010 (UTC)

SICP book example[edit]

I am not familiar with the code examples in SICP, can someone please provide the section/example name/number and possibly a link to the example. Thanks, Ckoch786 (talk) 01:43, 7 February 2014 (UTC)

recursion vs regular expressions[edit]

A recent edit, using the sense that recursion is good, and popular means the same thing equated popular with support for recursive regular expressions. We need a reliable source of information discussing that aspect and which design features directly support recursion TEDickey (talk) 11:17, 1 January 2017 (UTC)

Many popular regular expression libraries support that which the article expressly says they "are not powerful enough to handle". The author of that statement is making an inaccurate opinion and it should be removed. I am taking no other position on the merits of regular expression versus parser capability other than to state the article is misleading in these facts I mentioned. I have provided the following references for the merit of revising the article. Hrmilo (talk) 21:30, 1 January 2017 (UTC) [1] [2]

The MSDN topic doesn't mention recursion; the other link mentions it without ever providing a definition (other than circular references), equates it to balancing groups, and doesn't relate that to any of the senses of recursion as used by others, much less explain why the page's author thought it was recursion. Implying that all regular expression parsers do this is misleading, particularly since it precedes a discussion of lex. If you want to expand on that, you might consider reading the POSIX definitions (e.g., regular expressions, and lex. (Keep in mind also that a single source doesn't warrant adding its terminology to this topic). TEDickey (talk) 14:59, 2 January 2017 (UTC)

  1. ^
  2. ^