Samplesort

From Wikipedia, the free encyclopedia
Jump to: navigation, search

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems.[1] Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Sample sort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing m -1 elements from the result. These elements (called splitters) then divide the sample into m equal-sized buckets.[2] Sample sort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W D Frazer and A C McKellar. In recent years, the algorithm has been adapted to implement randomized sorting on parallel computers.

Algorithm[edit]

Sample sort can be broken down into 3 parts

  1. Find splitters, values that break up the data into buckets, by sampling the data.
  2. Use the sorted splitters to define buckets and place data in appropriate buckets.
  3. Sort each of the buckets

Complexity[edit]

Find the splitters.

  O(n/P + log(P))

Send to buckets.

  P for reading all node
  log(P) for broadcasting
  n/P * log(P) for binary search for all keys
  n/P to send keys to bucket

Sort Buckets

  O(n/P)

Sampling the Data[edit]

The data may be sampled through different methods. Some methods include:

  1. Pick evenly spaced samples.
  2. Pick randomly selected samples.

Over Sampling[edit]

The over sampling ratio determines how many data elements to pull as samples. The goal is to get a good representation of the distribution of the data. If the data values are widely distributed, in that there are not many duplicate values, then a small sampling ratio is needed. In other cases where there are many duplicates in the distribution, a larger over sampling ratio will be necessary.

Selecting the Splitters[edit]

The ideal is to pick splitters that separate the data into j buckets of size n/j, where n is the number of elements to be sorted. This is to achieve an even distribution among the buckets, this way no one bucket takes longer than others to be sorted. This can be accomplished by selecting splitters in the sample by stepping through the sorted sample using a/j. Where sample size is a, and bucket size is j such that the values are a/j, 2a/j, ... (j - 1)a/j.

Uses in Parallel Systems[edit]

Sample sort is often used in parallel systems. This is done by splitting the sorting for each processor, where the number buckets is equal to the number of processor. Sample sort is efficient in parallel systems because each processor receives approximately the same bucket size. Since the buckets are sorted concurrently, the processors will complete the sorting at approximately the same time, thus not having a processor wait for others.

See also[edit]

References[edit]

External links[edit]

Frazer and McKellar's samplesort and derivatives:

Adapted for use on parallel computers: