We first briefly assess the merits and limitations of the two existing
cache-optimized quicksort algorithms, paying particular attention to
their cache locality. We then present two new quicksort alternatives
that further improve memory performance. Experimental results are
reported in the next section.
In the same paper [4], LaMarca and Ladner present two quicksort
algorithms for cache optimization. The first, called memory-tuned
quicksort, is a modification of the base quicksort [9].
Instead of setting small subarrays aside to be sorted at the end,
memory-tuned quicksort sorts these subarrays as soon as they are first
encountered, in order to reuse the data elements while they are still
in the cache.
The second algorithm is called multiquicksort. This algorithm applies
a single pass to divide the full data set into multiple subarrays,
with the hope that each subarray will be smaller than the cache capacity.
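As a concrete illustration, the immediate handling of small subarrays that distinguishes memory-tuned quicksort from the base algorithm might look like the following sketch; the threshold value, pivot choice, and identifiers are our own assumptions, not taken from [4]:

```python
# Sketch of memory-tuned quicksort: small subarrays are insertion-sorted
# as soon as partitioning produces them, while their elements are likely
# still cache-resident, instead of being left for one final pass.

THRESHOLD = 16  # assumed cutoff; the paper does not fix a value here

def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place by straight insertion."""
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def memory_tuned_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    while lo < hi:
        if hi - lo + 1 <= THRESHOLD:
            insertion_sort(a, lo, hi)   # sort now, reusing cached lines
            return
        # classic bidirectional partition around the middle element
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # recurse on the smaller half, iterate on the larger
        if j - lo < hi - i:
            memory_tuned_quicksort(a, lo, j)
            lo = i
        else:
            memory_tuned_quicksort(a, i, hi)
            hi = j
```

Sorting each small subarray immediately means its elements are touched while the partitioning pass has them in the cache; the deferred final insertion-sort pass of the base quicksort would revisit them after eviction.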
The performance gains of these two algorithms in the experiments
reported in [4] are modest. We implemented the two algorithms
on simulated machines and on various high-end workstations and
obtained consistent results. We also found that the performance
of quicksort and its cache-optimized alternatives is very sensitive
to the type of data set being used. These algorithms were not
efficient on unbalanced data sets.
In practice, the quicksort algorithms exploit cache locality well on
balanced data. A challenge is to make the quicksort perform well
on unbalanced data sets. We present two cache-optimized quicksort
alternatives that work well on both balanced and unbalanced data sets.
Flashsort [6] is extremely fast for sorting balanced data sets.
The maximum and minimum values are first identified in the data set to
identify the data range. The data range is then evenly divided into classes
to form subarrays. The algorithm consists of three steps:
"classification" to determine the size of each class,
"permutation" to move each element into its class by using a single
temporary variable to hold the replaced element, and
"straight insertion" to sort the elements in each class using
Sedgewick's insertion sort [9].
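The three steps might be sketched as follows. The class count and all identifiers are our own assumptions; for brevity, the permutation below uses per-class pointers with swaps rather than the single-temporary-variable cycle scheme described above:

```python
# Sketch of flashsort's three steps on a list of numbers.
# The class count m is an assumed parameter, not taken from [6].

def flashsort(a, m=None):
    n = len(a)
    if n < 2:
        return
    if m is None:
        m = max(2, n // 10)                  # assumed number of classes
    amin, amax = min(a), max(a)
    if amin == amax:
        return                               # all elements equal
    scale = (m - 1) / (amax - amin)
    classify = lambda x: min(m - 1, int(scale * (x - amin)))

    # Step 1: classification -- count the size of each class.
    count = [0] * m
    for x in a:
        count[classify(x)] += 1
    start = [0] * m                          # first index of each class
    s = 0
    for k in range(m):
        start[k] = s
        s += count[k]

    # Step 2: permutation -- move each element into its class region.
    nxt = start[:]                           # next unfilled slot per class
    for k in range(m):
        end = start[k] + count[k]
        while nxt[k] < end:
            c = classify(a[nxt[k]])
            if c == k:
                nxt[k] += 1                  # already in place
            else:
                # send it to its own class; take back the displaced element
                a[nxt[k]], a[nxt[c]] = a[nxt[c]], a[nxt[k]]
                nxt[c] += 1

    # Step 3: straight insertion within each (now small) class.
    for k in range(m):
        lo, hi = start[k], start[k] + count[k]
        for i in range(lo + 1, hi):
            key, j = a[i], i - 1
            while j >= lo and a[j] > key:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = key
```

Because the classification function is monotonic, every element of class k is no larger than any element of class k+1, so sorting within each class yields a fully sorted array.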
This algorithm works very well on balanced data sets because the sizes
of the subarrays after the first two steps are similar and are small
enough to fit in the cache.
This makes flashsort highly effective (O(N)). However, when the
data set is not balanced, the sizes of the generated subarrays are
disproportionate, causing ineffective use of the cache
and making flashsort as slow as insertion sort (O(N²)) in the worst case.
In comparison with the pivoting process of quicksort,
the classification step of flashsort is more likely to generate
balanced subarrays, which favors better cache utilization.
On the other hand, quicksort outperforms insertion
sort on unbalanced subarrays. By combining the advantages of
flashsort and quicksort, we present a new quicksort alternative,
flash quicksort, where
the first two steps are the same as in flashsort
and the last step uses quicksort to sort the elements in each class.
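The change is confined to the third step. A sketch of that step, with hypothetical class boundaries as produced by the earlier classification and permutation steps, could look like this:

```python
# Sketch of step 3 of flash quicksort: each contiguous class, delimited
# by assumed boundary indices, is sorted with quicksort instead of
# straight insertion, so an oversized class from an unbalanced data set
# costs O(s log s) rather than O(s^2).

def quicksort(a, lo, hi):
    """Sort a[lo..hi] in place with a classic bidirectional partition."""
    while lo < hi:
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        if j - lo < hi - i:
            quicksort(a, lo, j)
            lo = i
        else:
            quicksort(a, i, hi)
            hi = j

def sort_classes(a, bounds):
    """bounds[k] is the start index of class k; bounds[-1] == len(a)."""
    for k in range(len(bounds) - 1):
        quicksort(a, bounds[k], bounds[k + 1] - 1)
```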
To further improve overall performance, we employ another cache
optimization to improve temporal locality in flash quicksort.
We call this alternative inplaced flash quicksort.
In this algorithm, the first and third steps are the same as in
flash quicksort. In the second step, an additional array is used
as a buffer to hold the permuted elements.
In the original flashsort, a single temporary variable holds each
replaced element. A cache line normally holds more than one element,
so funneling every displaced element through this single variable
minimizes the chance of data reuse.
Using the additional array, we attempt to reuse the elements of
a cache line before they are replaced and to reduce the instruction
count for copying data elements.
Although this approach increases the required memory space, it
improves both cache and overall performance.
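A sketch of this buffered permutation, with all identifiers our own assumptions: the source array is read in a single sequential pass, so every element sharing a cache line with the current one is consumed before the line is evicted, and each element is written directly to its class position in the auxiliary buffer.

```python
# Sketch of the buffered permutation used in inplaced flash quicksort:
# instead of cycling elements through one temporary variable, a second
# array receives each element at its class position during one
# sequential scan of the source. The class count m is a parameter.

def buffered_permutation(a, m):
    n = len(a)
    amin, amax = min(a), max(a)
    if amin == amax:
        return a[:]                        # one class; nothing to move
    scale = (m - 1) / (amax - amin)
    classify = lambda x: min(m - 1, int(scale * (x - amin)))

    # classification: count class sizes, then compute class start indices
    count = [0] * m
    for x in a:
        count[classify(x)] += 1
    nxt = [0] * m                          # next free slot in each class
    s = 0
    for k in range(m):
        nxt[k] = s
        s += count[k]

    # permutation: one sequential read of a, scattered writes into out
    out = [None] * n
    for x in a:
        c = classify(x)
        out[nxt[c]] = x
        nxt[c] += 1
    return out
```

The extra array doubles the memory footprint of this step, but the sequential read pattern lets each fetched cache line of the source be used in full, which is the temporal-locality gain described above.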
Figure 6 shows the instruction counts
(left figure) and the L1 misses (right figure) of
memory-tuned quicksort, flashsort,
flash quicksort, and inplaced flash quicksort,
on the Unbalanced data set on the simulated Pentium III memory
architecture, which has a higher memory latency and a larger L2 cache
(512 KBytes) than the Pentium II.
Figure 6:
Simulation comparisons of the instruction counts (left figure)
and the L1 misses (right figure) of the
quicksort algorithms on the Unbalanced data set
on the simulated Pentium III.
(The instruction count curve of the flashsort was too high to be presented
in the left figure).
The instruction count curve of flashsort was too high to be presented
in the left figure of Figure 6.
The same figure shows that the instruction count of memory-tuned quicksort
also increases rapidly as the data set size grows.
In contrast, the instruction counts of
flash quicksort and inplaced flash quicksort change little as the data
set size increases.
The simulation also shows that the number of L1 misses increases much
more rapidly with data set size in memory-tuned quicksort and flashsort
than in flash quicksort and inplaced flash quicksort.
The simulation results are consistent with our algorithm analysis,
and show the effectiveness of
our new quicksort alternatives on unbalanced data sets.