# Samplesort

Samplesort is a sorting algorithm that is a divide and conquer algorithm often used in parallel processing systems.[1] Conventional divide and conquer sorting algorithms partitions the array into sub-intervals or buckets. The buckets are then sorted individually and then concatenated together. However, if the array is non-uniformly distributed, the performance of these sorting algorithms can be significantly throttled. Sample sort addresses this issue by selecting a sample of size s from the n-element sequence, and determining the range of the buckets by sorting the sample and choosing m -1 elements from the result. These elements (called splitters) then divide the sample into m equal-sized buckets.[2] Sample sort is described in the 1970 paper, "Samplesort: A Sampling Approach to Minimal Storage Tree Sorting", by W D Frazer and A C McKellar. In recent years, the algorithm has been adapted to implement randomized sorting on parallel computers.

## Algorithm

Sample sort can be broken down into 3 parts

1. Find splitters, values that break up the data into buckets, by sampling the data.
2. Use the sorted splitters to define buckets and place data in appropriate buckets.
3. Sort each of the buckets

### Complexity

Find the splitters.

```  $O(n/P + log(P))$
```

Send to buckets.

```  $P$ for reading all node
$log(P)$ for broadcasting
$n/P * log(P)$ for binary search for all keys
$n/P$ to send keys to bucket
```

Sort Buckets

```  $O(n/P)$
```

## Sampling the Data

The data may be sampled through different methods. Some methods include:

1. Pick evenly spaced samples.
2. Pick randomly selected samples.

### Over Sampling

The over sampling ratio determines how many data elements to pull as samples. The goal is to get a good representation of the distribution of the data. If the data values are widely distributed, in that there are not many duplicate values, then a small sampling ratio is needed. In other cases where there are many duplicates in the distribution, a larger over sampling ratio will be necessary.

## Selecting the Splitters

The ideal is to pick splitters that separate the data into j buckets of size n/j, where n is the number of elements to be sorted. This is to achieve an even distribution among the buckets, this way no one bucket takes longer than others to be sorted. This can be accomplished by selecting splitters in the sample by stepping through the sorted sample using a/j. Where sample size is a, and bucket size is j such that the values are a/j, 2a/j, ... (j - 1)a/j.

## Uses in Parallel Systems

Sample sort is often used in parallel systems. This is done by splitting the sorting for each processor, where the number buckets is equal to the number of processor. Sample sort is efficient in parallel systems because each processor receives approximately the same bucket size. Since the buckets are sorted concurrently, the processors will complete the sorting at approximately the same time, thus not having a processor wait for others.