||This article needs attention from an expert in Computer science. (April 2009)|
Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems. Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Samplesort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing m -1 elements from the result. These elements (called splitters) then divide the sample into m equal-sized buckets. Samplesort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W D Frazer and A C McKellar. In recent years, the algorithm has been adapted to implement randomized sorting on parallel computers.
Samplesort can be broken down into 3 parts
- Find splitters, values that break up the data into buckets, by sampling the data.
- Use the sorted splitters to define buckets and place data in appropriate buckets.
- Sort each of the buckets
Find the splitters.
Send to buckets.
for reading all node for broadcasting for binary search for all keys to send keys to bucket
where c is complexity of underlying sequential sorting method
Sampling the Data
The data may be sampled through different methods. Some methods include:
- Pick evenly spaced samples.
- Pick randomly selected samples.
The oversampling ratio determines how many data elements to pull as samples. The goal is to get a good representation of the distribution of the data. If the data values are widely distributed, in that there are not many duplicate values, then a small sampling ratio is needed. In other cases where there are many duplicates in the distribution, a larger oversampling ratio will be necessary.
Selecting the Splitters
The ideal is to pick splitters that separate the data into j buckets of size n/j, where n is the number of elements to be sorted. This is to achieve an even distribution among the buckets, this way no one bucket takes longer than others to be sorted. This can be accomplished by selecting splitters in the sample by stepping through the sorted sample using a/j. Where sample size is a, and bucket size is j such that the values are a/j, 2a/j, ... (j - 1)a/j.
Uses in Parallel Systems
Samplesort is often used in parallel systems. This is done by splitting the sorting for each processor, where the number buckets is equal to the number of processor. Sample sort is efficient in parallel systems because each processor receives approximately the same bucket size. Since the buckets are sorted concurrently, the processors will complete the sorting at approximately the same time, thus not having a processor wait for others.
- "Samplesort using the Standard Template Adaptive Parallel Library" (PDF). Texas A&M University.
- "Bucket and Samplesort".
Frazer and McKellar's samplesort and derivatives:
- Frazer and McKellar's original paper
Adapted for use on parallel computers: