External sorting

External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory, usually a hard disk drive. Thus, external sorting algorithms are external memory algorithms and thus applicable in the external memory model of computation.

External sorting algorithms generally fall into two types, distribution sorting, which resembles quicksort, and external merge sort, which resembles merge sort. The latter typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file.

Model

External sorting algorithms can be analyzed in the external memory model. In this model, a cache or internal memory of size $M$ and an unbounded external memory are divided into blocks of size $B$ , and the running time of an algorithm is determined by the number of memory transfers between internal and external memory. Like their cache-oblivious counterparts, asymptotically optimal external sorting algorithms achieve a running time (in Big O notation) of $O\left({\tfrac {N}{B}}\log _{\tfrac {M}{B}}{\tfrac {N}{B}}\right)$ .

External merge sort

One example of external sorting is the external merge sort algorithm, which is a K-way merge algorithm. It sorts chunks that each fit in RAM, then merges the sorted chunks together.^[1]^[2]

The algorithm first sorts $M$ items at a time and puts the sorted lists back into external memory. It then recursively does a ${\tfrac {M}{B}}$ -way merge on those sorted lists. To do this merge, $B$ elements from each sorted list are loaded into internal memory, and the minimum is repeatedly outputted.

For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
Write the sorted data to disk.
Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
Read the first 10 MB (= 100MB / (9 chunks + 1)) of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.

Historically, instead of a sort, sometimes a replacement-selection algorithm^[3] was used to perform the initial distribution, to produce on average half as many output chunks of double the length.

Additional passes

The previous example is a two-pass sort: first sort, then merge. The sort ends with a single k-way merge, rather than a series of two-way merge passes as in a typical in-memory merge sort. This is because each merge pass reads and writes every value from and to disk.

The limitation to single-pass merging is that as the number of chunks increases, memory will be divided into more buffers, so each buffer is smaller. This causes many smaller reads rather than fewer larger ones. Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single merge pass isn't efficient: the disk seeks to fill the input buffers with data from each of the 500 chunks (we read 100MB / 501 ~ 200KB from each chunk at a time) take up most of the sort time. Using two merge passes solves the problem. Then the sorting process might look like this:

Run the initial chunk-sorting pass as before.
Run a first merge pass combining 25 chunks at a time, resulting in 20 larger sorted chunks.
Run a second merge pass to merge the 20 larger sorted chunks.

Like in-memory sorts, efficient external sorts require O(n log n) time: linear increases in data size require logarithmic increases in the number of passes, and each pass takes a linear number of reads and writes. Using the large memory sizes provided by modern computers the logarithmic factor grows very slowly. Under reasonable assumptions at least 500 GB of data can be sorted using 1 GB of main memory before a third pass becomes advantageous, and many times that much data can be sorted before a fourth pass becomes useful.^[4] Low-seek-time media like solid-state drives (SSDs) also increase the amount that can be sorted before additional passes improve performance.

Main memory size is important. Doubling memory dedicated to sorting halves the number of chunks and the number of reads per chunk, reducing the number of seeks required by about three-quarters. The ratio of RAM to disk storage on servers often makes it convenient to do huge sorts on a cluster of machines^[5] rather than on one machine with multiple passes.

External distribution sort

External distribution sort is analogous to quicksort. The algorithm finds approximately ${\tfrac {M}{B}}$ pivots and uses them to divide the $N$ elements into approximately equally sized subarrays, each of whose elements are all smaller than the next, and then recurse until the sizes of the subarrays are less than the block size. When the subarrays are less than the block size, sorting can be done quickly because all reads and writes are done in the cache, and in the external memory model requires $O(1)$ operations.

However, finding exactly ${\tfrac {M}{B}}$ pivots would not be fast enough to make the external distribution sort asymptotically optimal. Instead, we find slightly less pivots. To find these pivots, the algorithm splits the $N$ input elements into ${\tfrac {N}{M}}$ chunks, and takes every ${\sqrt {\tfrac {M}{16B}}}$ elements, and recursively uses the median of medians algorithm to find ${\sqrt {\tfrac {M}{B}}}$ pivots.^[6]

There is a duality, or fundamental similarity, between merge- and distribution-based algorithms.^[7]

Performance

The Sort Benchmark, created by computer scientist Jim Gray, compares external sorting algorithms implemented using finely tuned hardware and software. Winning implementations use several techniques:

Using parallelism
- Multiple disk drives can be used in parallel in order to improve sequential read and write speed. This can be a very cost-efficient improvement: a Sort Benchmark winner in the cost-centric Penny Sort category uses six hard drives in an otherwise midrange machine.^[8]
- Sorting software can use multiple threads, to speed up the process on modern multicore computers.
- Software can use asynchronous I/O so that one run of data can be sorted or merged while other runs are being read from or written to disk.
- Multiple machines connected by fast network links can each sort part of a huge dataset in parallel.^[9]
Increasing hardware speed
- Using more RAM for sorting can reduce the number of disk seeks and avoid the need for more passes.
- Fast external memory like solid-state drives can speed sorts, either if the data is small enough to fit entirely on SSDs or, more rarely, to accelerate sorting SSD-sized chunks in a three-pass sort.
- Many other factors can affect hardware's maximum sorting speed: CPU speed and number of cores, RAM access latency, input/output bandwidth, disk read/write speed, disk seek time, and others. "Balancing" the hardware to minimize bottlenecks is an important part of designing an efficient sorting system.
- Cost-efficiency as well as absolute speed can be critical, especially in cluster environments where lower node costs allow purchasing more nodes.
Increasing software speed
- Some Sort Benchmark entrants use a variation on radix sort for the first phase of sorting: they separate data into one of many "bins" based on the beginning of its value. Sort Benchmark data is random and especially well-suited to this optimization.
- Compacting the input, intermediate files, and output can reduce time spent on I/O, but is not allowed in the Sort Benchmark.
- Because the Sort Benchmark sorts long (100-byte) records using short (10-byte) keys, sorting software sometimes rearranges the keys separately from the values to reduce memory I/O volume.

References

^ Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998, ISBN 0-201-89685-0, Section 5.4: External Sorting, pp.248–379.
^ Ellis Horowitz and Sartaj Sahni, Fundamentals of Data Structures, H. Freeman & Co., ISBN 0-7167-8042-9.
^ Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998, ISBN 0-201-89685-0, Section 5.4: External Sorting, pp.254–ff.
^ Assume a single disk with 200 MB/s transfer, 20 ms seek time, 1 GB of buffers, 500 GB to sort. The merging phase will have 500 buffers of 2M each, need to do 250K seeks and read then write 500 GB. It will spend 5,000 sec seeking and 5,000 sec transferring. Doing two passes as described above would nearly eliminate the seek time but add an additional 5,000 sec reading and writing, so this is approximately the break-even point between a two-pass and three-pass sort.
^ Chris Nyberg, Mehul Shah, Sort Benchmark Home Page (links to examples of parallel sorts)
^ Aggarwal, Alok; Vitter, Jeffrey (1988). "The input/output complexity of sorting and related problems". Communications of the ACM. 31 (9): 1116–1127. doi:10.1145/48529.48535.
^ J. S. Vitter, Algorithms and Data Structures for External Memory, Series on Foundations and Trends in Theoretical Computer Science, now Publishers, Hanover, MA, 2008, ISBN 978-1-60198-106-6.
^ Nikolas Askitis, OzSort 2.0: Sorting up to 252GB for a Penny
^ Rasmussen et al., TritonSort

External links

[1] Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998, ISBN 0-201-89685-0, Section 5.4: External Sorting, pp.248–379.

[2] Ellis Horowitz and Sartaj Sahni, Fundamentals of Data Structures, H. Freeman & Co., ISBN 0-7167-8042-9.

[3] Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998, ISBN 0-201-89685-0, Section 5.4: External Sorting, pp.254–ff.

[4] Assume a single disk with 200 MB/s transfer, 20 ms seek time, 1 GB of buffers, 500 GB to sort. The merging phase will have 500 buffers of 2M each, need to do 250K seeks and read then write 500 GB. It will spend 5,000 sec seeking and 5,000 sec transferring. Doing two passes as described above would nearly eliminate the seek time but add an additional 5,000 sec reading and writing, so this is approximately the break-even point between a two-pass and three-pass sort.

[5] Chris Nyberg, Mehul Shah, Sort Benchmark Home Page (links to examples of parallel sorts)

[Aggarwal88-6] Aggarwal, Alok; Vitter, Jeffrey (1988). "The input/output complexity of sorting and related problems". Communications of the ACM. 31 (9): 1116–1127. doi:10.1145/48529.48535.

[7] J. S. Vitter, Algorithms and Data Structures for External Memory, Series on Foundations and Trends in Theoretical Computer Science, now Publishers, Hanover, MA, 2008, ISBN 978-1-60198-106-6.

[8] Nikolas Askitis, OzSort 2.0: Sorting up to 252GB for a Penny

[9] Rasmussen et al., TritonSort

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]