External sorting

From Wikipedia, the free encyclopedia

External sorting is a class of sorting algorithms that can handle massive amounts of data. It is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and must instead reside in slower external memory (usually a hard drive). The typical external sorting algorithm uses a sort-merge strategy consisting of two phases: the sorting phase and the merging phase. In the sorting phase, chunks of data small enough to fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles. In the merging phase, the sorted subfiles are merged during one or more passes.

Carefully implemented, external sorting can be done in-place (with no additional disk space required).

External mergesort

One example of external sorting is the external mergesort algorithm.[1][2] For instance, to sort 900 megabytes of data using only 100 megabytes of RAM:

  1. Read 100 MB of the data into main memory and sort it by some conventional method (usually quicksort).
  2. Write the sorted data to disk.
  3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (nine chunks in this example), which now need to be merged into a single output file.
  4. Read the first 10 MB of each sorted chunk into input buffers in main memory (90 MB total) and allocate the remaining 10 MB for an output buffer.
  5. Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer is full, write it to the final sorted file and empty it. Whenever any of the 9 input buffers becomes empty, fill it with the next 10 MB of its associated 100 MB sorted chunk; if that chunk has no more data, mark the buffer as exhausted and do not use it for merging.
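
The five steps above can be sketched in Python as follows. This is a minimal illustration rather than a tuned implementation, and it rests on several assumptions not found in the text: records are lines of a text file, the chunk size is given as a number of lines (lines_per_run) instead of megabytes, the internal sort is Python's built-in sort rather than quicksort, and the explicit 10 MB buffers are replaced by Python's ordinary file buffering. The function names are hypothetical.

```python
import heapq
import os
import tempfile
from itertools import islice


def write_sorted_runs(input_path, lines_per_run, tmp_dir):
    """Sorting phase (steps 1-3): split the input into sorted runs on disk."""
    run_paths = []
    with open(input_path) as src:
        while True:
            # Read one chunk that is assumed to fit in main memory.
            chunk = list(islice(src, lines_per_run))
            if not chunk:
                break
            # Ensure every record ends with a newline (the last line may not).
            chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
            chunk.sort()  # internal sort; Python uses Timsort, not quicksort
            fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    return run_paths


def merge_runs(run_paths, output_path):
    """Merging phase (steps 4-5): single-pass K-way merge of the sorted runs."""
    files = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as out:
            # heapq.merge consumes its inputs lazily, so only one pending
            # line per run is held by the merge at any time. File buffering
            # stands in for the explicit input and output buffers.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()


def external_sort(input_path, output_path, lines_per_run):
    """Sort the lines of input_path into output_path using temporary runs."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        runs = write_sorted_runs(input_path, lines_per_run, tmp_dir)
        merge_runs(runs, output_path)
```

Calling external_sort("big.txt", "sorted.txt", lines_per_run=1_000_000) would sort a file whose name and chunk size are, again, only placeholders. Lines are compared as raw strings, so the ordering is plain lexicographic order on the text of each line.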

This algorithm can be generalized by assuming that the amount of data to be sorted exceeds the available memory by a factor of K. Then K chunks of data need to be sorted and a K-way merge has to be completed. If X is the amount of main memory available, there will be K input buffers and 1 output buffer of size X/(K+1) each. Depending on various factors (such as the speed of the hard drive and the value of K), better performance can be achieved if the output buffer is made larger (for example, twice as large as one input buffer).
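
The buffer arithmetic in this generalization can be checked with a small helper. The function name and the output_weight parameter are assumptions made for illustration: output_weight expresses how many input-buffer-sized units the output buffer receives, so 1 reproduces the X/(K+1) split and 2 gives an output buffer twice as large as one input buffer. Sizes are in whatever unit the caller uses (megabytes below), and integer division rounds buffer sizes down.

```python
def merge_buffer_plan(data_size, memory_size, output_weight=1):
    """Return (K, input_buffer_size, output_buffer_size) for a one-pass merge."""
    # K = ceil(data_size / memory_size): number of sorted chunks to merge.
    k = -(-data_size // memory_size)
    # Memory is split into K input units plus output_weight output units.
    unit = memory_size // (k + output_weight)
    return k, unit, unit * output_weight


print(merge_buffer_plan(900, 100))     # (9, 10, 10)
print(merge_buffer_plan(900, 100, 2))  # (9, 9, 18)
```

The first call matches the 900 MB example: nine 10 MB input buffers and one 10 MB output buffer. The second shows that doubling the output buffer shrinks each input buffer to 9 MB, leaving 1 MB unused because of rounding.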

In the example, a single-pass merge was used. If the ratio of data to available main memory is particularly large, a multi-pass merge is preferable. For example, merge the first half of the sorted chunks, then the other half; the problem is then reduced to merging just two sorted chunks. The exact number of passes depends on the above-mentioned ratio as well as the physical characteristics of the hard drive (transfer rate and seek time). As a rule of thumb, it is inadvisable to perform more than a 20-to-30-way merge.[citation needed]
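
To make the trade-off concrete, the following sketch counts how many merge passes are needed when each individual merge is limited to a fixed number of inputs. The function name and the max_fan_in parameter are hypothetical, and the model is simplified: it assumes every pass merges groups of up to max_fan_in runs and ignores drive characteristics entirely.

```python
def merge_passes(num_runs, max_fan_in):
    """Count merge passes when each merge combines at most max_fan_in runs."""
    if max_fan_in < 2:
        raise ValueError("max_fan_in must be at least 2")
    passes = 0
    while num_runs > 1:
        # After one pass, every group of up to max_fan_in runs becomes one run.
        num_runs = -(-num_runs // max_fan_in)  # ceiling division
        passes += 1
    return passes


print(merge_passes(9, 9))      # 1
print(merge_passes(9, 5))      # 2
print(merge_passes(1000, 20))  # 3
```

With nine runs, a 9-way merge finishes in one pass, while limiting the fan-in to five gives the two-pass scheme described above (two partial merges, then a final 2-way merge). Each extra pass reads and writes all of the data once more, which is the cost weighed against larger per-run buffers.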

Multiple disk drives can be used in parallel in order to further improve bandwidth and reduce sorting time. The powerful notion of duality between merging and distribution can be exploited to obtain state-of-the-art sorting algorithms.[3]

References

  1. Donald Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998, ISBN 0-201-89685-0, Section 5.4: External Sorting, pp. 248–379.
  2. Ellis Horowitz and Sartaj Sahni, Fundamentals of Data Structures, H. Freeman & Co., ISBN 0-716-78042-9.
  3. J. S. Vitter, Algorithms and Data Structures for External Memory, Series on Foundations and Trends in Theoretical Computer Science, now Publishers, Hanover, MA, 2008, ISBN 978-1-60198-106-6.