Maximal munch: Difference between revisions

Content deleted Content added

Inline

Revision as of 21:25, 31 March 2010

In computer programming and computer science, "maximal munch" or "longest match" is the principle that when creating some construct, as much of the available input as possible should be consumed. It is similar, though not entirely identical, to the concept of a greedy algorithm.

Application

For instance, the lexical syntax of many programming languages requires that tokens be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used regular expressions such as [a-z]+ (one or more lower-case letters).^[1]

The term has also been used in the field of compilers to describe a process of "tiling" — determining how a structured tree representing a program in an intermediate language should be converted into linear machine code. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles," each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch."^[2]

Drawbacks

In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For example, in C++, the "angle bracket" characters < and > are used in the syntax for template specialization; but two > characters in a row are interpreted as the right-shift operator >>.^[3] Therefore, the following C++ code will produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens.

    // This is not valid syntax in standard C++.
    std::vector<std::vector<int>> incorrect;
    // Adding whitespace between the brackets yields the correct parse.
    std::vector<std::vector<int> > correct;

The upcoming C++0x standard is expected to amend the grammar so that a right-shift token is accepted as synonymous with a pair of right-angle brackets, as is done in Java, which complicates the grammar but allows the continued use of the maximal munch principle.

Alternatives

Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to utilize "follow restrictions," which instead of directly taking the longest match will put some restrictions on what characters can follow a valid match. For example, stipulating that strings matching [a-z]+ cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression.^[4] Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (e.g., the right-shift token in C++ or Java would not be matched in the context of a template expression, where it is syntactically invalid).^[5]

References

^ Aho et al., 168.
^ Page, 470.
^ Vandevoorde.
^ Van den Brand et al., 26.
^ Van Wyk et al., 63.

Bibliography

Aho, Alfred V.; Lam, Monica S.; Sethi, Ravi; Ullman, Jeffrey D. (2007). Compilers: Principles, Techniques & Tools (2nd ed.). Boston: Addison-Wesley. ISBN 978-0-321-48681-1. {{cite book}}: Check |isbn= value: checksum (help)
Page, Daniel (2009). Practical Introduction to Computer Architecture. London: Springer. ISBN 978-1-84882-255-9.
Van den Brand, Mark G.J.; Scheerder, Jeroen; Vinju, Jurgen J.; Visser, Eelco (2002). "Disambiguation Filters for Scannerless Generalized LR Parsers". Lecture Notes in Computer Science. 2304/2002. Berlin/Heidelberg: Springer: 21–44. doi:10.1007/3-540-45937-5_12. ISSN 0302-9743.
Vandevoorde, Daveed (14 January 2005). "Right Angle Brackets". Retrieved 31 March 2010.
Van Wyk, Eric; Schwerdfeger, August (2007). "Context-Aware Scanning for Parsing Extensible Languages". GPCE '07: Proceedings of the 6th international conference on Generative programming and component engineering. New York: ACM: 63–72. doi:http://doi.acm.org/10.1145/1289971.1289983. {{cite journal}}: Check |doi= value (help); External link in |doi= (help)

[1] Aho et al., 168.

[2] Page, 470.

[3] Vandevoorde.

[4] Van den Brand et al., 26.

[5] Van Wyk et al., 63.

[1]

[2]

[3]

[4]

[5]

@@ Line 1: / Line 1: @@
+In [[computer programming]] and [[computer science]], '''"maximal munch"''' or '''"longest match"''' is the principle that when creating some construct, as much of the available [[input]] as possible should be consumed. It is similar, though not entirely identical, to the concept of a [[greedy algorithm]].
-{{Unreferenced|date=October 2009}}
+==Application==
-In [[computer programming]], the '''"maximal munch"''' or '''"longest match"''' principle is the rule that as much of the [[input]] as possible should be processed when creating some construct.
-For instance, the [[lexical grammar]] for many [[programming languages]] requires that the next [[Token (parser)|token]] be built from the maximum number of characters from the input stream (eg, in the [[C (programming language)|C programming language]], the three-character sequence <code><<=</code> is the single token '<tt>&lt;&lt;=</tt>', and not the three tokens '<tt>&lt;</tt>', '<tt>&lt;</tt>' and '<tt>=</tt>', or the two tokens '<tt>&lt;&lt;</tt>' and '<tt>=</tt>').
+For instance, the [[lexical grammar|lexical syntax]] of many [[programming languages]] requires that [[Token (parser)|tokens]] be built from the maximum possible number of characters from the input stream. This is done to resolve the problem of inherent ambiguity in commonly used [[regular expression]]s such as <code>[a-z]+</code> (one or more lower-case letters).<ref>Aho ''et al.'', 168.</ref>
+The term has also been used in the field of [[compilers]] to describe a process of "tiling" — determining how a structured tree representing a program in an [[intermediate language]] should be converted into linear [[Machine code|machine code]]. An entire subtree might be converted into just one machine instruction, and the problem is how to split the tree into non-overlapping "tiles," each representing one machine instruction. An effective strategy is simply to make a tile of the largest subtree possible at any given point, which is called "maximal munch."<ref>Page, 470.</ref>
-In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For example, in [[C++]], the "angle bracket" characters <code>&lt;</code> and <code>&gt;</code> are used in the syntax for [[template (programming)|template specialization]]; but two <code>&gt;</code> characters in a row are interpreted as the [[bit shift|right-shift]] operator <code>&gt;&gt;</code>. Therefore, the following C++ code will produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens.
+==Drawbacks==
+In some situations, "maximal munch" leads to undesirable or unintuitive outcomes. For example, in [[C++]], the "angle bracket" characters <code>&lt;</code> and <code>&gt;</code> are used in the syntax for [[template (programming)|template specialization]]; but two <code>&gt;</code> characters in a row are interpreted as the [[bit shift|right-shift]] operator <code>&gt;&gt;</code>.<ref>Vandevoorde.</ref> Therefore, the following C++ code will produce a parse error, because the right-shift operator token is encountered instead of two right-angle-bracket tokens.
 <source lang="Cpp">
-    std::vector<std::vector<int>> incorrect;   // This is not valid syntax in standard C++.
+    // This is not valid syntax in standard C++.
-    std::vector<std::vector<int> > correct;    // Adding whitespace between the brackets yields the correct parse.
+    std::vector<std::vector<int>> incorrect;
+    // Adding whitespace between the brackets yields the correct parse.
+    std::vector<std::vector<int> > correct;
 </source>
-The upcoming [[C++0x]] standard is expected to amend the grammar so that a right-shift token is accepted as synonymous with a pair of right-angle-brackets, which complicates the grammar but preserves the maximal munch rule.
+The upcoming [[C++0x]] standard is expected to [[C++0x#Angle bracket|amend the grammar]] so that a right-shift token is accepted as synonymous with a pair of right-angle brackets, as is done in [[Java (programming language)|Java]], which complicates the grammar but allows the continued use of the maximal munch principle.
+==Alternatives==
+Programming languages researchers have also responded by replacing or supplementing the principle of maximal munch with other lexical disambiguation tactics. One approach is to utilize "follow restrictions," which instead of directly taking the longest match will put some restrictions on what characters can ''follow'' a valid match. For example, stipulating that strings matching <code>[a-z]+</code> cannot be followed by an alphabetic character achieves the same effect as maximal munch with that regular expression.<ref>Van den Brand ''et al.'', 26.</ref> Another approach is to keep the principle of maximal munch but make it subordinate to some other principle, such as context (''e.g.'', the right-shift token in C++ or Java would not be matched in the context of a template expression, where it is syntactically invalid).<ref>Van Wyk ''et al.'', 63.</ref>
+==References==
+{{reflist}}
+==Bibliography==
-Another area where the term is used is [[compilers]], here the pattern that matches the largest number of intermediate code instructions is preferred to those matching fewer instructions because it appears to result in higher quality machine code being generated.
+* {{Cite book
-== External links ==
+   |last=Aho
-* Article "[http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1757.html Right Angle Brackets] by [[Daveed Vandevoorde]]
+   |first=Alfred V.
-* Article "[http://www.codeproject.com/KB/recipes/maximummunch.aspx Maximum Munch Principle]" by [[Zeeshan Amjad]]
+   |last2=Lam
-* Article "[http://www.cis.nctu.edu.tw/~wuuyang/papers/applicability2002.ps On the applicability of the longest-match rule in lexical analysis]
+   |first2=Monica S.
-* [http://dinosaur.compilertools.net/lex/index.html lex manual]
+   |last3=Sethi
+   |first3=Ravi
+   |last4=Ullman
+   |first4=Jeffrey D.
+   |title=[[Compilers: Principles, Techniques, and Tools#Second edition|Compilers: Principles, Techniques & Tools]]
+   |edition=2nd
+   |place=Boston
+   |publisher=Addison-Wesley
+   |year=2007
+   |isbn=978-0-321-48681-1}}
+* {{Cite book
+   |last=Page
+   |first=Daniel
+   |title=Practical Introduction to Computer Architecture
+   |publisher=Springer
+   |location=London
+   |year=2009
+   |isbn=978-1-84882-255-9
+   |url=http://www.springerlink.com/content/j10237532887k344/}}
+* {{Cite journal
+   |last=Van den Brand
+   |first=Mark G.J.
+   |last2=Scheerder
+   |first2=Jeroen
+   |last3=Vinju
+   |first3=Jurgen J.
+   |last4=Visser
+   |first4=Eelco
+   |title=Disambiguation Filters for Scannerless Generalized LR Parsers
+   |journal=Lecture Notes in Computer Science
+   |volume=2304/2002
+   |pages=21-44
+   |publisher=Springer
+   |location=Berlin/Heidelberg
+   |date=2002
+   |url=http://www.springerlink.com/content/03359k0cerupftfh/
+   |issn=0302-9743
+   |doi=10.1007/3-540-45937-5_12}}
+* {{Cite web
+   |last=Vandevoorde
+   |first=Daveed
+   |title=Right Angle Brackets
+   |date=14 January 2005
+   |accessdate=31 March 2010
+   |url=http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1757.html}}
+* {{Cite journal
+   |last=Van Wyk
+   |first=Eric
+   |last2=Schwerdfeger
+   |first2=August
+   |title=Context-Aware Scanning for Parsing Extensible Languages
+   |journal=GPCE '07: Proceedings of the 6th international conference on Generative programming and component engineering
+   |date=2007
+   |pages=63-72
+   |publisher=ACM
+   |location=New York
+   |url=http://portal.acm.org/citation.cfm?id=1289983&dl=GUIDE,
+   |doi=http://doi.acm.org/10.1145/1289971.1289983}}
 [[Category:Compilers]]
+[[Category:Computer science]]