Lexical analysis

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function that performs lexical analysis is called a lexical analyzer, lexer, tokenizer,^[1] or scanner. A lexer often exists as a single function which is called by a parser or another function, or can be combined with the parser in scannerless parsing.

Token

A token is a string of one or more characters that is significant as a group. The process of forming tokens from an input stream of characters is called tokenization.

Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include: regular expressions, specific sequences of characters known as a flag, specific separating characters called delimiters, and explicit definition by a dictionary. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages.

Tokens are often categorized by character content or by context within the data stream. Categories are defined by the rules of the lexer. Categories often involve grammar elements of the language used in the data stream. Programming languages often categorize tokens as identifiers, operators, grouping symbols, or by data type. Written languages commonly categorize tokens as nouns, verbs, adjectives, or punctuation. Categories are used for post-processing of the tokens either by the parser or by other functions in the program.

A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each "(" is matched with a ")".

Consider this expression in the C programming language:

sum = 3 + 2;

Tokenized and represented by the following table:


Lexeme	Token type
sum	Identifier
=	Assignment operator
3	Integer literal
+	Addition operator
2	Integer literal
;	End of statement

Tokens are chothoo defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it will report an error.

Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.

Lexical grammar

The specification of a programming language often includes a set of rules which defines the lexer. These rules usually consist of regular expressions, and they define the set of possible character sequences that are used to form individual tokens or lexemes. A Lexical program recognizes strings. For each kind of string found the lexical program takes an action.

In programming languages that delimit blocks with tokens (e.g., "{" and "}"), as opposed to off-side rule languages that delimit blocks with indentation, white space is also defined by a regular expression and influences the recognition of other tokens but does not itself contribute tokens. White space is said to be non-significant in such languages.

Scanner

The first stage, the scanner, is usually based on a finite-state machine (FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule, or longest match rule). In some languages, the lexeme creation rules are more complicated and may involve backtracking over previously read characters. For example, in C, a single 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.

Tokenization

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

Take, for example,

The quick brown fox jumps over the lazy dog

The string isn't implicitly segmented on spaces, as an English speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e. matching the string " " or regular expression /\s{1}/).

The tokens could be represented in XML,

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>

Or an s-expression,

 (sentence
   (word The)
   (word quick)
   (word brown) 
   (word fox)
   (word jumps)
   (word over) 
   (word the)
   (word lazy)
   (word dog))

A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

For example, in the source code of a computer program, the string

net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token stream:

NAME "net_worth_future" 
EQUALS 
OPEN_PARENTHESIS 
NAME "assets" 
MINUS 
NAME "liabilities" 
CLOSE_PARENTHESIS 
SEMICOLON

Though it is possible and sometimes necessary, due to licensing restrictions of existing parsers or if the list of tokens is small, to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production rule in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct a state table for a finite-state machine (which is plugged into template code for compilation and execution).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, a NAME token might be any English alphabetical character or an underscore, followed by any number of instances of any ASCII alphanumeric character or an underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." They are not capable of keeping count, and verifying that n is the same on both sides — unless you have a finite set of permissible values for n. It takes a full-fledged parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end. (see example^[2] in the SICP book).

The Lex programming tool and its compiler is designed to generate code for fast lexical analysers based on a formal description of the lexical syntax. It is not generally considered sufficient for applications with a complicated set of lexical rules and severe performance requirements; for instance, the GNU Compiler Collection (gcc) uses hand-written lexers.

Lexer generator

Lexical analysis can often be performed in a single pass if reading is done a character at a time. Single-pass lexers can be generated by tools such as flex.

The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach.^{[dubious – discuss]} With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c and Quex have proven (e.g. RE2C - A More Versatile Scanner Generator (1994)^[3]) to produce engines that are between two to three times faster than flex produced engines.^{[citation needed]} It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.

The simple utility of using a scanner generator should not be discounted, especially in the developmental phase, when a language specification might change daily. The ability to express lexical constructs as regular expressions facilitates the description of a lexical analyzer. Some tools offer the specification of pre- and post-conditions which are hard to program by hand. In that case, using a scanner generator may save a lot of development time, at least where extreme performance optimality is not a concern.

Lexical analyzer generators

ANTLR - Can generate lexical analyzers and parsers.
Flex - Alternative variant of the classic "lex" (C/C++).
JFlex - A rewrite of JLex.
Ragel - A state machine and lexical scanner generator with output support for C, C++, C#, Objective-C, D, Java, Go and Ruby source code.

The following lexical analysers can handle Unicode:

JavaCC - JavaCC generates lexical analyzers written in Java.
JLex - A lexical analyzer generator for Java.
Quex (or "Queχ") - A fast universal lexical analyzer generator for C and C++.

Notes

References

Compiling with C# and Java, Pat Terry, 2005, ISBN 032126360X
Algorithms + Data Structures = Programs, Niklaus Wirth, 1975, ISBN 0-13-022418-9
Compiler Construction, Niklaus Wirth, 1996, ISBN 0-201-40353-6
Sebesta, R. W. (2006). Concepts of programming languages (Seventh edition) pp. 177. Boston: Pearson/Addison-Wesley.

External links

Word Mention Segmentation Task analysis page
On the applicability of the longest-match rule in lexical analysis
The Art of Tokenization IBM developerWorks

[1] www.cs.man.ac.uk

[2] tpress.mit.edu

[3] teseerx.ist.psu.edu

[1]

[2]

[3]