Matching wildcards

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

In computer science, an algorithm for matching wildcards (also known as globbing) is useful in comparing text strings that may contain wildcard syntax.[1] Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell[2] or Microsoft Windows command-line[3] or text editor or file manager, as well as the interfaces for some search engines[4] and databases.[5] Wildcard matching is a subset of the problem of processing of regular expressions and pattern matching in general.[6]

History[edit]

Early algorithms for matching wildcards often relied on recursion, but the technique was criticized on grounds of performance[7] and reliability[8] considerations. Non-recursive algorithms for matching wildcards have gained favor in light of these considerations.

Among both recursive and non-recursive algorithms, strategies for performing the pattern matching operation vary widely, as evidenced among the variety of example algorithms referenced below. Test case development and performance optimization techniques have been demonstrably brought to bear on certain algorithms, particularly those developed by critics of the recursive algorithms.

Recursive algorithms[edit]

The recursion generally happens on matching * when there is more suffix to match against. This is a form of backtracking, also done by some regular expression matchers.

The general form of these algorithms are the same. On recursion the algorithm slices the input into substrings, and considers a match to have happened when ONE of the substrings return a positive match. For dowild("*X", "abcX"), it would call dowild("X", "abcX"), dowild("X", "bcX"), dowild("X", "cX") and dowild("X", "X"). They usually differ by less-important things like support for features and by more important factors such as minor but highly effective optimizations. Some of them include:

  • The ABORT signal against over-recursion (Lars Mathiesen 1991). While it is correct to naively recurse by the entire rest of the strings (pattern and text) on * and making sure that ONE of the substrings return a positive match, the running time becomes exponential for rejecting a match with many * in the text. Lars Mathiesen changes the return to three classes, match, no-match, and ABORT (no match possible at all for asterisk recursion.) The ABORT value is returned when the text is consumed too early or when another asterisk match has failed, guaranteeing a linear performance with respect to the number of asterisks. (The overall complexity is additionally quadratic to the number of characters left to match.)[9] Git/Rsync's wildmatch ABORT also covers invalid inputs.[16] The new INN uwildmat does the same.[17]
  • Asterisk advancement in recursion. This wildmatch tweak is relatively more minor. It applies to when the recursion wants to match "*X" on "abcX": when an asterisk is followed by a literal like "X", it is obvious that only the last comparison with equal lengths would have a chance of producing a match.[16] This is seen earlier in uwildmat in 2000[17] and more implicitly in van Rossum's fnmatch for FNM_PATHNAME.

Martin Richter's algorithm is an exception to this pattern, although the overall operation is equivalent. On * it recurses into increasing either of the indexes, following the dynamic programming formulation of the problem. The "ABORT" technique is applicable to it as well.[11]

The recursive algorithms are in general easier to reason about, and with the ABORT modification they perform acceptably in terms of worst-case complexity. On strings without * they take linear-to-string-size time to match since there is a fixed one-to-one relation.

Non-recursive algorithms[edit]

The following are developed by critics of the recursive algorithms:

The following is not:

  • Jack Handy's algorithm[21]

In addition, the problem of wildcard matching can be converted into regular expression matching using a naive text-replacement approach. Although non-recursive regular expression matchers such as Thompson's construction are less used in practice due to lack of backreference support, wildcard matching in general does not come with a similarly rich set of features. (In fact, many of the algorithms above only has support for ? and *.) The Russ Cox implementation of Thompson NFA can be trivially modified for such.[22] Gustavo Navarro's BDM-based nrgrep algorithm provides a more steamlined implementation with emphasis on efficient suffixes.[23] See also regular expression § implementations.

Dynamic programming[edit]

Wildcard matching is sometimes used as a problem in competitive programming, with Leetcode question 44 being an example. Programmers in such a field are familiar to solving this problem by converting recursion into dynamic programming.

This class of solutions generally take O(mn) space and O(mn) time, with one dimension of the table storing pattern index and the other storing text index. They are usually not used in general-purpose code due to the need for allocations. It can be done top-down or bottom-up. The problem is described as the following in DP (one-based string indices and zero-based matrix indices):

[24][25]

The order of accessing the string does not matter. This paradigm can be adapted to match [...] and regular expression versions of * and ?.

As with the problem for Levenshtein distance, it is possible to store the table in linear space. Other implementation tricks like preprocessing for common literal prefixes and suffixes are also applicable. With the use of bit arrays and stack-based memory allocation (alloca/malloca) it is possible to reduce the allocation overhead.

See also[edit]

References[edit]

  1. ^ "Wildcard characters". ScienceDirect. 2018.
  2. ^ Quigley, Ellie (2005). "UNIX Shell Programming QuickStart". InformIT.com.
  3. ^ "MS-DOS and Windows Wildcard Characters". Microsoft Developer Network Library.
  4. ^ "Apache Lucene - Query Parser Syntax". Apache Lucene 2.9.4 Documentation. 2006.
  5. ^ "SQL Wildcards". W3Schools. 2018.
  6. ^ Goyvaerts, Jan (2018). "Welcome to Regular-Expressions.info". RegularExpressions.info.
  7. ^ Cantatore, Alessandro (2003). "Wildcard matching algorithms".
  8. ^ a b Krauss, Kirk (2008). "Matching Wildcards: An Algorithm". Dr. Dobb's Journal. Cite journal requires |journal= (help)
  9. ^ a b Salz, Rich (1991). "wildmat.c". GitHub.
  10. ^ Filip (2014). "Compare strings with wildcard". Stack Overflow.
  11. ^ a b Deadlock (2015). "Wildcard Matching Recursive Algorithm C++". Stack Overflow.
  12. ^ Murugesan, Vignesh (2014). "WildCard Matching algorithm".
  13. ^ van Rossum, Guido (20 November 2019). "freebsd/lib/libc/gen/fnmatch.c". Retrieved 21 November 2019.
  14. ^ "fnmatch.c". opensource.apple.com. 1999.
  15. ^ "fnmatch_internal.c". Beren Minor's Mirrors. 21 November 2019.
  16. ^ a b "git/git: wildmatch.c". GitHub.
  17. ^ a b "uwildmat.c in trunk/lib – INN". inn.eyrie.org. Retrieved 27 November 2019.
  18. ^ Krauss, Kirk (2018). "Matching Wildcards: An Improved Algorithm for Big Data". Develop for Performance.
  19. ^ Cantatore, Alessandro (2003). "Wildcard matching algorithms".
  20. ^ Siler (2013). "Recursive solutions for glob pattern matching". Stack Overflow.
  21. ^ Handy, Jack (2005). "Wildcard string compare (globbing}". Code Project.
  22. ^ Cox, Ross. "Regular Expression Matching Can Be Simple And Fast".
  23. ^ Navarro, Gonzalo (10 November 2001). "NR‐grep: a fast and flexible pattern‐matching tool" (PDF). Software: Practice and Experience. 31 (13): 1265–1312. doi:10.1002/spe.411.
  24. ^ "Super easy c++ DP solution". Leetcode. Retrieved 27 November 2019.
  25. ^ "Python AC Easy top-down recursion with memoization". Leetcode. Retrieved 27 November 2019.

External links[edit]