User:AbuSH1993/sandbox

From Wikipedia, the free encyclopedia

Sequential Quicksort

Algorithm

The native quicksort algorithm is also called sequential quicksort. It uses the divide and conquer method to do the sorting. Given an array A[0, ..., N-1] with N elements, A can be sorted by sequential quicksort with the following three steps.

  1. Select a value c, called the pivot.
  2. Partition the array A into two subarrays: the left subarray consists of all A[i] < c and the right subarray consists of all A[i] ≥ c.
  3. Apply steps 1 and 2 recursively to the left and right subarrays until the length of the subarray is less than 2.

Pseudocode

algorithm Sequential_Quicksort(A, length)
    if length ≤ 1 then return
    select a pivot c
    partition A so that A[0, ..., k-1] holds the elements < c, A[k] = c, and A[k+1, ..., length-1] holds the elements ≥ c
    Sequential_Quicksort(A[0, ..., k-1], k)
    Sequential_Quicksort(A[k+1, ..., length-1], length-1-k)
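
The steps above can be sketched in Python (a minimal out-of-place version; the in-place partition of the pseudocode is replaced by list comprehensions for clarity, and the pivot choice is arbitrary):

```python
# Sequential quicksort following the pseudocode above: pick a pivot,
# partition into elements < c, = c and > c, then recurse on both sides.
def sequential_quicksort(a):
    if len(a) < 2:
        return a
    c = a[len(a) // 2]                     # pivot (any choice works)
    left = [x for x in a if x < c]
    middle = [x for x in a if x == c]
    right = [x for x in a if x > c]
    return sequential_quicksort(left) + middle + sequential_quicksort(right)
```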

For more details, see Quicksort.

Parallel quicksort with distributed memory

Algorithm

This is an example of parallel quicksort with distributed memory by Peter Sanders. There are three PEs, and 4 is chosen as the pivot. All elements smaller than 4 are sent to the left subarray, and the rest are sent to the right subarray. According to the partition method, the three PEs are partitioned into two parts: PE 1 as the first part; PE 2 and PE 3 as the second part. The algorithm is then executed recursively on these two parts. Finally, all elements are sorted in increasing order.

Peter Sanders has in his work[1] compared the performance of parallel quicksort with grid sort, shear sort, bitonic sort, radix sort and sample sort. Quicksort is generally the fastest sorting algorithm in most circumstances, so it is worthwhile to parallelize it efficiently. The sequential quicksort algorithm is simple to implement and efficient as a sequential sorting algorithm, and it can be parallelized due to its recursive partitioning and the independence of the two partitioned subarrays.

In this section, all computations are assumed to be executed on a distributed memory MIMD machine with p processors. The input consists of n elements, with at most n/p elements per processor. The processors are numbered PE_0, ..., PE_(p-1). After the parallel quicksort, all elements on each PE_i will be locally sorted. All elements will also be globally sorted, meaning that every element on PE_i is smaller than or equal to any element on PE_j when i < j. The following pseudocode describes how parallel quicksort works with distributed memory.

algorithm ParQSort_Dist(s: local part of the input array, i: index of the first processor, j: index of the last processor)
    length := j-i+1
    if length = 1 then // only one processor left: sort locally and return
         QuickSort(s)
         return s
    c := pickPivot(s, i, j) // select a pivot
    s_l := all elements of s less than or equal to c
    s_r := all elements of s larger than c
    n_l := |s_l| // the number of elements in the left array
    n_r := |s_r| // the number of elements in the right array
    // partition the processors
    choose K in {1, ..., length-1} so that K/length ≈ n_l/(n_l+n_r)
    // redistribute the elements
    send the elements of s_l to PE_i, ..., PE_(i+K-1) (each processor is assigned at most ⌈n_l/K⌉ elements)
    send the elements of s_r to PE_(i+K), ..., PE_j (each processor is assigned at most ⌈n_r/(length-K)⌉ elements)
    if this processor's index is smaller than i+K then
         ParQSort_Dist(s, i, i+K-1)
    else
         ParQSort_Dist(s, i+K, j)
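
The redistribution step can be illustrated with a toy Python simulation in which each PE is a plain list and "sending" elements is list slicing (all names and the proportional choice of K are our illustration, not Sanders' implementation):

```python
# Toy simulation of ParQSort_Dist: a "PE" is a list; communication is slicing.
def par_qsort_dist(pes):
    if len(pes) == 1:                        # one PE left: sort locally
        pes[0].sort()
        return pes
    data = [x for pe in pes for x in pe]     # gather (stands in for communication)
    c = data[len(data) // 2]                 # pick a pivot
    s_l = [x for x in data if x <= c]
    s_r = [x for x in data if x > c]
    # partition the PEs in proportion to the sizes of s_l and s_r
    k = max(1, min(len(pes) - 1, round(len(pes) * len(s_l) / len(data))))
    # redistribute: spread s_l over the first k PEs, s_r over the rest
    left = [s_l[i::k] for i in range(k)]
    right = [s_r[i::len(pes) - k] for i in range(len(pes) - k)]
    return par_qsort_dist(left) + par_qsort_dist(right)
```

After the call, every PE is locally sorted and the PEs are globally ordered, so concatenating them yields the sorted sequence.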

A roundoff error may occur when the processors are divided, since some processors can get more data than others, which causes imbalance. Stopping the recursion at the right point and then using another sorting algorithm, e.g. merge splitting sort, instead of quicksort for processor groups of size between 2 and 4 makes the algorithm more efficient and less imbalanced.

Optimization

Local sorting

For large p, the imbalance grows with the depth of the recursion and increases the computation time for local sorting. Initial local sorting before any communication (broadcast, reduction, prefix sum) can solve this issue. This measure also causes extra computation for merging data during data redistribution in order to keep the local sequences sorted.

If p is a power of 2, explicit indexing techniques can help improve the efficiency, e.g. row- and column-major order, snake-like order, Hilbert curve, and shuffle order.
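
For illustration, two such index mappings for a 2-D processor grid might look as follows (function names are ours; the snake-like order simply reverses every other row):

```python
# Index mappings for a grid of processors with `cols` columns.
def row_major(row, col, cols):
    return row * cols + col

def snake_like(row, col, cols):
    # even rows run left-to-right, odd rows right-to-left
    return row * cols + (col if row % 2 == 0 else cols - 1 - col)
```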

Pivot selection

Pivot selection has a great impact on the quicksort algorithm. Choosing an appropriate pivot keeps the partitions balanced and can reduce the number of times elements are compared with the pivot.

The simplest pivot selection method is the local median of a single processor PE_i (PE_i is the ith processor). Another pivot selection method using medians is the median of the local medians of all processors PE_0, PE_1, ..., PE_(p-1). For small inputs, the local median is appropriate, but for large inputs the global median is much more efficient.
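
Both strategies can be sketched in Python (helper names are ours, and `statistics.median_low` stands in for a proper distributed median computation):

```python
import statistics

# Local median of one processor's data.
def local_median_pivot(local_data):
    return statistics.median_low(local_data)

# Median of the local medians of all processors.
def median_of_medians_pivot(all_pe_data):
    return statistics.median_low(statistics.median_low(pe) for pe in all_pe_data)
```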

In the basic one-pivot case, the input array is partitioned into two subarrays. With k-1 pivots, however, the input array is split into k parts. If k = p, the partition method works the same way as Samplesort. Generally, k is much smaller than p. When k increases, the corresponding recursion depth decreases. This trick saves cost on recursion, but it also causes more computation for data transfer: with larger k, the data have to travel along longer paths in smaller packets. According to the work of Peter Sanders and Thomas Hansch[1], the best performance is achieved in combination with the shuffle indexing referred to before.
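
A sequential sketch of the k-way partition with k-1 sorted pivots (our illustration of the multi-pivot idea, not the distributed implementation):

```python
import bisect

# Element x goes to part bisect_right(pivots, x), so part i holds the
# elements x with pivots[i-1] <= x < pivots[i].
def multi_pivot_partition(data, pivots):      # pivots must be sorted
    parts = [[] for _ in range(len(pivots) + 1)]
    for x in data:
        parts[bisect.bisect_right(pivots, x)].append(x)
    return parts
```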

Parallel quicksort with shared memory

All parallel quicksort operations in this part are based on the assumption of a shared memory machine with cache coherence.

3+1 phase algorithm

Philippas Tsigas and Yi Zhang describe the 3+1 phase algorithm, a parallel quicksort algorithm, in the paper "A Simple, Fast Parallel Implementation of Quicksort and Its Performance Evaluation on SUN Enterprise 10000"[2]. The algorithm consists of four phases: parallel partition of the data, sequential partition of the data, process partition, and sequential sorting with helping.

In the first phase, the neutralize function is used to swap elements of two blocks in each processor in parallel. If all elements in the left block are smaller than or equal to the pivot, the left block is called neutralized. Similarly, the right block is called neutralized if all elements in it are larger than or equal to the pivot.[2]

Phase 1: Parallel partition of the data

This is the first phase of the 3+1 phase algorithm. The algorithm deals with the data in consecutive blocks of size B, where B is chosen based on the cache size, since two blocks should fit into one cache. Given n elements as input and p processors, the number of blocks is n/B. After this phase, all neutralized leftblocks are placed at the left end of the array and all neutralized rightblocks at the right end, and these blocks are correctly placed with respect to the pivot. However, the non-neutralized blocks can be anywhere in the array, and the blocks between the two ends need further operations. This phase can be summarized in the following steps.

  1. Pick a pivot at one processor.
  2. For each processor, assign two blocks: a leftblock from the left end and a rightblock from the right end of the array.
  3. Apply the neutralize function with the leftblock, the rightblock and the pivot. Return which side, or both sides, is neutralized (if the elements in the leftblock are all smaller than or equal to the pivot, the leftblock is neutralized; if the elements in the rightblock are all larger than or equal to the pivot, the rightblock is neutralized).
  4. For each processor, if the leftblock is neutralized, it gets a new block from the left side of the array; if the rightblock is neutralized, it gets a new block from the right side of the array.
  5. Keep executing steps 2~4 until the left or right side has no fresh blocks left.
  6. Update the remaining blocks in the shared memory and the number of leftblocks and rightblocks.
SIDE neutralize(Data *leftblock, Data *rightblock, Data pivot)
{
    int i = 0, j = 0;
    do {
         for (; i < BlockSize; i++)        /* resume scan for a misplaced element */
              if (leftblock[i] > pivot)
                   break;
         for (; j < BlockSize; j++)
              if (rightblock[j] < pivot)
                   break;
         if (i == BlockSize || j == BlockSize)
              break;
         SWAP(leftblock[i], rightblock[j]);
         i++; j++;
    } while (i < BlockSize && j < BlockSize);

    if (i == BlockSize && j == BlockSize)
         return BOTH;
    if (i == BlockSize)
         return LEFT;
    else
         return RIGHT;
}
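
A runnable Python translation of the neutralize function above, operating on two equally sized blocks (the string constants stand in for the SIDE values):

```python
LEFT, RIGHT, BOTH = "LEFT", "RIGHT", "BOTH"

# Swap misplaced elements between the two blocks until at least one
# block is neutralized with respect to the pivot.
def neutralize(leftblock, rightblock, pivot):
    i = j = 0
    n = len(leftblock)                    # both blocks have size BlockSize
    while i < n and j < n:
        while i < n and leftblock[i] <= pivot:
            i += 1                        # skip correctly placed elements
        while j < n and rightblock[j] >= pivot:
            j += 1
        if i == n or j == n:
            break
        leftblock[i], rightblock[j] = rightblock[j], leftblock[i]
        i += 1
        j += 1
    if i == n and j == n:
        return BOTH
    return LEFT if i == n else RIGHT
```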

Phase 2: Sequential partition of the data

Since each processor may be left holding blocks that could not be neutralized, there will be remaining data from phase one. In phase two, such remaining data is handled sequentially. After phase two, the input array will be partitioned into two subarrays with respect to the pivot.[2]

In phase two, the goal is to place the remaining data from phase one correctly in the array of blocks by comparing it with the pivot selected in phase one. The steps of this phase are:

  1. Sort the remaining blocks.
  2. On one processor, pick two blocks from the left and right end and apply the neutralize function with these two blocks and the pivot.
  3. Swap the remaining blocks so that all of them lie together between the neutralized leftblocks and rightblocks.
  4. Partition the blocks in between using the sequential quicksort partition.

After phase 2, the input array is partitioned into two subarrays: the left subarray contains all elements smaller than or equal to the pivot, and all elements larger than the pivot are in the right subarray.

Phase 3: Process partition

Assign each processor to one of the subarrays from phase two and apply the same parallel partition method to each subarray. Recursively execute phases 1~3 until only one processor is left per subarray.

Phase 4: Sequential sorting with helping

Each processor runs sequential quicksort locally. Lock-free stacks are used in this phase as a helping scheme: after finishing its own work, a processor can help with other processors' work, since they all have access to the shared stacks.

Time complexity analysis

Sequential quicksort has a worst-case time complexity of O(n²) and a best- and average-case time complexity of O(n log n). Since the 3+1 phase algorithm also has a divide-and-conquer structure, it performs the same number of comparison and swapping operations as the sequential one, but the time for the sorting phase is greatly reduced: in phase 1 the n elements are partitioned by p processors in parallel, and phases 2 and 3 only touch the blocks left over from phase 1 (at most two per processor), so the partition work is dominated by the parallel phase.

Parallel partition using fetch-and-add

The work by Clyde P. Kruskal[3] has shown that the parallel partition operation can be done using fetch-and-add. We introduce a parallel partition technique using fetch-and-add from the work of P. Heidelberger[4]. First, to understand parallel partition, we need to introduce a partition function G. The partition function G allocates the elements of the source array A to the corresponding target arrays: G(A[i]) = j means that the element A[i] will be assigned to target array j. In the case of quicksort, there are only two target arrays: the left subarray s_l and the right subarray s_r. A partition function executed by fetch-and-add can then be considered as:

    G(A[i]) = s_l if A[i] < c,   G(A[i]) = s_r if A[i] ≥ c

G(A[i]) = s_l means that A[i] is assigned to the left subarray s_l. By G(A[i]) = s_r, A[i] is partitioned to the right subarray s_r, which corresponds to the basic idea of quicksort.

Dual copy partition algorithm

This is the simplest parallel partition algorithm using fetch-and-add. It is called "dual copy" because each element is copied via a local variable t into a second array b rather than partitioned in place. The algorithm is described in the following pseudocode; do_parallel means that the loop iterations are executed simultaneously. The shared counters L and R are initialized to the first and last index of b, and c is the pivot.

algorithm Dual_copy(s, n)
    // shared counters: L := 1, R := n
    do_parallel i = 1 to n
         t := s[i]   // copy s[i] into a local variable
         if (t < c) then
              b[fetch_and_add(L, 1)] := t   // put t into the left part of b
         else
              b[fetch_and_add(R, -1)] := t  // put t into the right part of b
    end do_parallel
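
A sketch of the dual copy partition in Python, simulating fetch_and_add with a lock (a real machine would use a hardware atomic; the class and function names are ours):

```python
import threading

class Counter:
    """Shared counter with a lock-based fetch_and_add."""
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()
    def fetch_and_add(self, delta):
        with self.lock:
            old = self.value
            self.value += delta
            return old

def dual_copy(s, c):
    n = len(s)
    b = [None] * n
    L, R = Counter(0), Counter(n - 1)        # 0-indexed ends of b
    def worker(chunk):
        for t in chunk:                      # t plays the local variable's role
            if t < c:
                b[L.fetch_and_add(1)] = t    # left part of b
            else:
                b[R.fetch_and_add(-1)] = t   # right part of b
    threads = [threading.Thread(target=worker, args=(s[i::4],)) for i in range(4)]
    for th in threads: th.start()
    for th in threads: th.join()
    return b, L.value                        # partitioned array and split point
```

The order of elements within each half is nondeterministic, but every element lands on the correct side of the split point.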

In-place partition algorithm

The in-place partition algorithm is an extension of the dual copy partition algorithm. It uses swapping to do the parallel partition. Two indices L and R point to the left and right ends of the input array: if the input array s has n elements, L is initialized as 0 and R as n-1. By comparing the elements s[L] and s[R] with the pivot c, s[L] and s[R] are swapped if s[L] > c and meanwhile s[R] ≤ c. At the end of this parallel swapping, L should be equal to R. However, a clean-up process needs to be executed, since some s[L] may not find a matching s[R] to swap with. The number of fetch-and-adds in the clean-up process is at most 7P (P is the number of processors).

Parallel partition algorithm with a reduced number of fetch-and-adds

In the dual copy partition algorithm and the in-place partition algorithm, fetch_and_add(L, 1) and fetch_and_add(R, -1) are used to do the partitioning and swapping. However, this takes one fetch_and_add operation per element. To reduce the number of fetch_and_adds, fetch_and_add(L, K) and fetch_and_add(R, -K) are introduced in this algorithm instead of fetch_and_add(L, 1) and fetch_and_add(R, -1). Similar to the in-place partition algorithm, this algorithm also needs clean-up operations, but with much fewer fetch_and_adds.
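
The idea can be sketched sequentially in Python: each fetch_and_add reserves K output slots at once, so the shared counters are updated once per K elements instead of once per element (the buffering scheme and all names are our illustration):

```python
# Chunked partition: buffer up to K elements locally, then claim K output
# slots with a single counter update (standing in for fetch_and_add).
def partition_chunked(s, c, K):
    n = len(s)
    b = [None] * n
    L, R = 0, n                   # shared counters (single-threaded here)
    fa_calls = 0                  # how many fetch_and_add-style updates occur
    left_buf, right_buf = [], []
    def flush_left():
        nonlocal L, fa_calls
        base = L; L += len(left_buf); fa_calls += 1   # one fetch_and_add(L, K)
        b[base:base + len(left_buf)] = left_buf
        left_buf.clear()
    def flush_right():
        nonlocal R, fa_calls
        R -= len(right_buf); fa_calls += 1            # one fetch_and_add(R, -K)
        b[R:R + len(right_buf)] = right_buf
        right_buf.clear()
    for t in s:
        (left_buf if t < c else right_buf).append(t)
        if len(left_buf) == K: flush_left()
        if len(right_buf) == K: flush_right()
    if left_buf: flush_left()
    if right_buf: flush_right()
    return b, L, fa_calls
```

With K = 2 on eight elements, the counters are touched four times instead of eight; larger K reduces contention further at the cost of larger local buffers.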

References


  1. ^ a b c Sanders, Peter (1997). "Efficient massively parallel quicksort". International Symposium on Solving Irregularly Structured Problems in Parallel: 13–24. doi:10.1007/3-540-63138-0_2.
  2. ^ a b c d Tsigas, Philippas; Zhang, Yi (2003). "A Simple, Fast Parallel Implementation of Quicksort And Its Performance Evaluation on SUN Enterprise 10000". Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2003. Proceedings.: 372–381. doi:10.1109/EMPDP.2003.1183613.
  3. ^ a b Kruskal, C. P. (1982). "Algorithms for replace-add based paracomputers". IEEE: 219–223.
  4. ^ a b Heidelberger, P. (1990). "Parallel Quicksort using fetch-and-add". IEEE Computer Soc.Press: 133–138. doi:10.1109/12.46289.