Jump to content

Draft:String sorting algorithms

From Wikipedia, the free encyclopedia

In computer science, string sorting algorithms are a special case of sorting algorithms, where the input is an array of strings with characters chosen from an alphabet Σ.

An example for sorting strings

Unlike traditional sorting algorithms that deal with atomic keys, string sorting encounters unique challenges. Sorting strings using conventional atomic sorting algorithms, which treat keys as indivisible objects, is inefficient because comparing entire strings can be costly and must be performed numerous times. Efficient string sorting algorithms, in contrast, inspect most characters of the input only once during the entire sorting process and they examine only those characters that are necessary to establish the correct ordering. Another challenge is that strings are represented as arrays of pointers. This representation results in indirect access to string characters, leading to cache faults during the access, even when scanning an array of strings. This is in contrast to sorting atomic keys, where scanning is notably cache efficient. The efficiency of string sorting algorithms depens upon multiple factors, including the size of the dataset (), the distinguishing prefix size of (), which is the minimal number of characters that need to be examined to sort the strings, the number of subproblems (σ), into which the algorithm breaks down the problem, and the underlying hardware. This indicates that no singular algorithm is universally optimal.

Sequential Methods

[edit]

Multikey quicksort

[edit]
An example for Mulitkey Quicksort for sorting strings

Developed by Bentley and Sedgewick in 1997, this algorithm is an adaptation of traditional quicksort, tailored for string sorting.[1] It uses the character with a common prefix of length h as a splitter, organizing the strings into three distinct arrays based on their th character's relation to . The algorithm recurses until the termination condition is met: if termination with . With Insertion Sort as a base case sorter for constant input sizes, multikey quicksort has a complexity of .

Most significant digit (MSD) radix sort

[edit]
An example for MSD-Radix Sort for sorting strings

Most significant digit (MSD) radix sort is especially efficient for sorting large datasets, particularly when the alphabet size is small[2]. The algorithm initiates sorting by examining the (h+1)-th character of each string with as the common prefix, subsequently dividing the dataset into σ distinct subproblems. Each subproblem is then recursively sorted with the common prefix length . This strategy, which is a natural approach to string sorting, has been subject to numerous refinements and improvements across various studies in the literature [3] [4] [5]. The time complexity is O(D) plus the time required for sorting the base cases. For example, with multikey quicksort as the base case sorter MSD radix sort has a complexity of O(D + n log σ).

Burstsort

[edit]

Burstsort uses a trie-based structure with containers at the leaves for sorting the strings.[6][7] Upon reaching a predefined threshold, these containers "burst", redistributing the strings into new containers based on their next character. These new containers are then attached to the appropriate child nodes of the trie. The sorting process involves traversing the trie and individually sorting the small containers. Key factors influencing the runtime efficiency of Burstsort include the trie implementation, the design of the containers, the burst threshold, and the chosen base algorithm for sorting the containers. Sinha and Zoble used an array for each trie node and unordered dynamic arrays of string pointers for the leaf containers, with a bursting threshold set at 8192.[8] With this configuration and multikey quicksort for sorting the leaves, burstsort has a complexity of O(D + n log σ).

LCP-mergesort

[edit]

LCP-mergesort is an adaptation of the traditional merge sort algorithm, which stores and reuses the longest common prefixes (LCPs) of consecutive strings in the sorted subproblems [9]. This strategy enhances the efficiency of string comparisons. In the conventional method the strings and must be compared character-by-character. However, with the LCP information for and relative to another string of similar or smaller size allows the preliminary use of the LCP. If the LCP between and is shorter than that between and , it follows that precedes in lexicographical order due to and sharing a shorter common prefix than and . This also applies symmetrically. LCP-Mergesort has a worst-case time complexity of .

Insertion sort

[edit]

Insertion sort is frequently used as the base case sorter for small sets of strings.[10] The algorithm stores an ordered array and inserts the unsorted items into their appropriate positions through linear scanning. This method treats strings as atomic units, necessitating full string comparisons during the linear scan to ensure the correct order. It has a worst-case time complexity of . So it is particularly good for small and , due to the cache-efficient manner in which strings are scanned.

Parallel methods

[edit]

The exploration of parallel string sorting algorithms remains limited, yet it is the only way to get performance out of Moore's Law.[11] The scalability of an algorithm in a parallel computing environment depends on various factors, similar to those affecting sequential methods. Many of the algorithms discussed in the sequential context can be adapted for parallel execution.

  1. ^ BENTLEY, Jon L.; SEDGEWICK, Robert. Fast algorithms for sorting and searching strings. In: Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms. 1997. S. 360-369.
  2. ^ KÄRKKÄINEN, Juha; RANTALA, Tommi. Engineering radix sort for strings. In: International Symposium on String Processing and Information Retrieval. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. S. 3-14.
  3. ^ KÄRKKÄINEN, Juha; RANTALA, Tommi. Engineering radix sort for strings. In: International Symposium on String Processing and Information Retrieval. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008. S. 3-14.
  4. ^ NG, Waihong; KAKEHI, Katsuhiko. Cache efficient radix sort for string sorting. IEICE transactions on fundamentals of electronics, communications and computer sciences, 2007, 90. Jg., Nr. 2, S. 457-466.
  5. ^ ANDERSSON, Arne; NILSSON, Stefan. Implementing radixsort. Journal of Experimental Algorithmics (JEA), 1998, 3. Jg., S. 7-es.
  6. ^ SINHA, Ranjan; ZOBEL, Justin. Efficient trie-based sorting of large sets of strings. In: ACSC. 2003. S. 11-18.
  7. ^ HEINZ, Steffen; ZOBEL, Justin; WILLIAMS, Hugh E. Burst tries: a fast, efficient data structure for string keys. ACM Transactions on Information Systems (TOIS), 2002, 20. Jg., Nr. 2, S. 192-223.
  8. ^ SINHA, Ranjan; ZOBEL, Justin. Cache-conscious sorting of large sets of strings with dynamic tries. Journal of Experimental Algorithmics (JEA), 2004, 9. Jg., S. 1.5-es.
  9. ^ NG, Waihong; KAKEHI, Katsuhiko. Merging string sequences by longest common prefixes. IPSJ Digital Courier, 2008, 4. Jg., S. 69-78.
  10. ^ MCCLELLAN, Michael T.; MINKER, Jack. The art of computer programming, vol. 3: sorting and searching. 1974.
  11. ^ MOORE, Gordon E. Cramming more components onto integrated circuits. Proceedings of the IEEE, 1998, 86. Jg., Nr. 1, S. 82-85.