High Performance Comparison-Based Sorting Algorithm on Many-Core GPUs
Xiaochun Ye, Dongrui Fan, Wei Lin, Nan Yuan, and Paolo Ienne
Key Laboratory of Computer System and Architecture, ICT, CAS, China

Outline
GPU computation model
Our sorting algorithm: a new bitonic-based merge sort, named Warpsort
Experiment results
Conclusion

GPU computation model
Massively multi-threaded, data-parallel many-core architecture
Important features:
SIMT execution model: avoid branch divergence
Warp-based scheduling: implicit hardware synchronization among threads within a warp
Access pattern: coalesced vs. non-coalesced

Why merge sort?
Similar case to external sorting: limited shared memory on chip vs. limited main memory
Sequential memory access: easy to meet the coalescing requirement

Why bitonic-based merge sort?
Massively fine-grained parallelism
Because of its relatively high complexity, a bitonic network is not good at sorting large arrays; it is only used to sort small subsequences in our implementation
Again, the coalesced memory access requirement

Problems in a naïve bitonic network implementation
Block-based bitonic network: one element per thread
Some problems:
In each stage, n elements produce only n/2 compare-and-swap operations, which form both ascending pairs and descending pairs
Between stages: synchronization
[Figure labels: block, thread]
Too many branch divergences and synchronization operations
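To make the per-stage problem concrete, here is a small CPU-side Python sketch (an illustration, not the authors' CUDA code) that enumerates the compare-and-swap pairs of one stage of a textbook bitonic network. The function name `stage_pairs` and the parameters `k` (size of the runs being built) and `j` (distance to the partner element) are assumptions taken from the usual textbook formulation, not from the slides.

```python
def stage_pairs(n, k, j):
    """List (i, partner, ascending) for one stage of a textbook bitonic network.

    n: number of elements (power of two)
    k: size of the sorted runs being built in this phase
    j: distance between the two elements of each pair
    """
    pairs = []
    for i in range(n):
        partner = i ^ j
        if partner > i:  # only the lower index of each pair does the work
            ascending = (i & k) == 0  # direction alternates between runs of size k
            pairs.append((i, partner, ascending))
    return pairs


pairs = stage_pairs(8, 2, 1)
num_up = sum(1 for _, _, up in pairs if up)
print(len(pairs), "compare-and-swap operations for 8 elements")
print(num_up, "ascending,", len(pairs) - num_up, "descending")
```

With one element per thread, only half of the threads have a pair to process in a given stage, and threads handling ascending and descending pairs take different paths if the swap is written as a branch.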

What do we use?
Warp-based bitonic network
Each bitonic network is assigned to an independent warp instead of a block: barrier-free, avoiding synchronization between stages
Threads in a warp perform 32 distinct compare-and-swap operations with the same order: avoids branch divergence, and needs at least 128 elements per warp
And further, a complete comparison-based sorting algorithm: GPU-Warpsort

Overview of GPU-Warpsort
Divide the input sequence into small tiles, each followed by a warp-based bitonic sort
Merge, until the parallelism is insufficient.
Split into small subsequences
Merge, and form the output

Step 1: barrier-free bitonic sort
Divide the input array into equal-sized tiles
Each tile is sorted by a warp-based bitonic network
128+ elements per tile to avoid branch divergence
No need for __syncthreads()
Ascending pairs + descending pairs
Use max() and min() to replace if-swap pairs
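The min/max idea in Step 1 can be sketched on the CPU as follows. This is a minimal Python illustration under stated assumptions, not the authors' kernel: the name `bitonic_sort_tile` is hypothetical, the network is the textbook bitonic sort, and the sequential inner loop stands in for the threads of one warp working in lockstep.

```python
def bitonic_sort_tile(tile):
    """Sort one tile with a textbook bitonic network (length must be a power of two)."""
    n = len(tile)
    assert n > 0 and n & (n - 1) == 0, "tile length must be a power of two"
    a = list(tile)
    k = 2
    while k <= n:            # k: size of the sorted runs being built
        j = k // 2
        while j >= 1:        # j: distance between paired elements
            for i in range(n):   # stands in for the threads of a warp
                partner = i ^ j
                if partner > i:
                    lo = min(a[i], a[partner])
                    hi = max(a[i], a[partner])
                    # min()/max() replace the data-dependent if-swap;
                    # only the output positions depend on the direction
                    if (i & k) == 0:
                        a[i], a[partner] = lo, hi   # ascending pair
                    else:
                        a[i], a[partner] = hi, lo   # descending pair
            j //= 2
        k *= 2
    return a


print(bitonic_sort_tile([7, 3, 6, 0, 5, 1, 4, 2]))
```

The remaining direction test depends only on the indices `i` and `k`, not on the data, so pairs can be assigned to threads in such a way that all threads of a warp handle the same direction at the same time.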

Step 2: bitonic-based merge sort
t-element merge sort:
Allocate a t-element buffer in shared memory
Load the t/2 smallest elements from sequences A and B, respectively
Merge
Output the lower t/2 elements
Load the next t/2 smallest elements from A or B
t = 8 in this example
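The buffered merge steps above can be sketched in Python as follows. This is a CPU-side illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, both inputs are assumed to be sorted with lengths that are positive multiples of t/2, and the rule for choosing between A and B (refill from the sequence whose most recently loaded chunk has the smaller maximum) is an assumption, since the slide only says "from A or B".

```python
def bitonic_merge(run_a, run_b):
    """Merge two sorted runs of equal power-of-two length with a bitonic network."""
    buf = list(run_a) + list(reversed(run_b))  # ascending then descending = bitonic
    t = len(buf)
    j = t // 2
    while j >= 1:
        for i in range(t):
            p = i ^ j
            if p > i:
                buf[i], buf[p] = min(buf[i], buf[p]), max(buf[i], buf[p])
        j //= 2
    return buf


def buffered_merge(a, b, t=8):
    """Merge sorted lists a and b through a t-element buffer."""
    half = t // 2
    assert t >= 2 and t & (t - 1) == 0, "t must be a power of two"
    assert len(a) >= half and len(a) % half == 0
    assert len(b) >= half and len(b) % half == 0
    ia = ib = half                      # read positions in a and b
    buf = bitonic_merge(a[:half], b[:half])
    a_max, b_max = a[half - 1], b[half - 1]
    out = []
    while True:
        out.extend(buf[:half])          # output the lower t/2 elements
        kept = buf[half:]               # upper t/2 elements stay in the buffer
        a_left, b_left = ia < len(a), ib < len(b)
        if not a_left and not b_left:
            out.extend(kept)
            return out
        # Assumed rule: refill from the sequence whose last loaded chunk
        # has the smaller maximum, or from whichever one still has data.
        if a_left and (not b_left or a_max <= b_max):
            chunk = a[ia:ia + half]
            ia += half
            a_max = chunk[-1]
        else:
            chunk = b[ib:ib + half]
            ib += half
            b_max = chunk[-1]
        buf = bitonic_merge(kept, chunk)


print(buffered_merge([1, 3, 5, 7, 9, 11, 13, 15], [2, 4, 6, 8, 10, 12, 14, 16]))
```

Because each refill reads t/2 consecutive elements from one sequence, the global memory traffic stays sequential, which is what makes the coalescing requirement easy to meet.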

Step 3: split into small tiles
Problem of merge sort: the number of pairs decreases geometrically, which does not fit this massively parallel platform
Method: divide the large sequences into independent small tiles which satisfy:

Step 3: split into small tiles (cont.)
How to get the splitters?
Sample the input sequence randomly
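One way to picture the splitter step is the Python sketch below. It is an illustration under assumptions, not the authors' method in detail: the names `choose_splitters` and `split_sorted`, the `oversample` factor, and the use of binary search to cut each sorted sequence are all hypothetical choices. The property it aims for is that every element of tile i is no larger than every element of tile i+1, which is what a later concatenation of merged tiles would need.

```python
import bisect
import random


def choose_splitters(data, num_splitters, oversample=8, seed=0):
    """Pick splitters by sorting a random sample and taking evenly spaced entries."""
    rng = random.Random(seed)
    sample = sorted(rng.choice(data) for _ in range(num_splitters * oversample))
    step = num_splitters + 1
    return [sample[(i + 1) * len(sample) // step] for i in range(num_splitters)]


def split_sorted(seq, splitters):
    """Cut one sorted sequence into len(splitters) + 1 tiles at the splitter values."""
    cuts = [0] + [bisect.bisect_right(seq, s) for s in splitters] + [len(seq)]
    return [seq[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]


data = list(range(40))
random.Random(1).shuffle(data)
splitters = choose_splitters(data, 3)
# Two sorted sequences, as they might look after the earlier merge step
seq_a, seq_b = sorted(data[:20]), sorted(data[20:])
print(splitters)
print(split_sorted(seq_a, splitters))
print(split_sorted(seq_b, splitters))
```

Tiles with the same index across the sequences can then be merged independently of all other tiles, which restores the parallelism lost as the number of merge pairs shrinks.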

Step 4: final merge sort
Subsequences (0,i), (1,i), …, (l-1,i) are merged into S_i
Then S_0, S_1, …, S_l are assembled into a totally sorted array
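The final step can be sketched in Python as follows. This is a CPU-side illustration, not the authors' kernel: `final_merge` is a hypothetical name, `tiles[x][i]` is assumed to hold subsequence (x, i), and `heapq.merge` from the standard library stands in for the GPU merge of each group.

```python
import heapq


def final_merge(tiles):
    """tiles[x][i] is subsequence (x, i): tile i of sorted sequence x."""
    num_groups = len(tiles[0])
    assert all(len(row) == num_groups for row in tiles)
    out = []
    for i in range(num_groups):
        # Merge (0,i), (1,i), ..., (l-1,i) into S_i; each group is independent
        s_i = list(heapq.merge(*(row[i] for row in tiles)))
        out.extend(s_i)  # concatenating S_0, S_1, ... gives the sorted array
    return out


tiles = [
    [[1, 2], [5], [9, 10]],    # sequence 0 cut into three tiles
    [[0, 3], [4, 6, 7], [8]],  # sequence 1 cut at the same splitters
]
print(final_merge(tiles))
```

The concatenation is only correct if every element of the tiles with index i is no larger than every element of the tiles with index i+1, which is the job of the splitting step.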

Experimental setup
Host: AMD Opteron 880 @ 2.4 GHz, 2 GB RAM
GPU: 9800GTX+, 512 MB
Input sequence:
Key-only and key-value configurations, 32-bit keys and values
Sequence size: from 1M to 16M elements
Distributions: Zero, Sorted, Uniform, Bucket, and Gaussian

Performance comparison
Mergesort: fastest comparison-based sorting algorithm on GPU (Satish, IPDPS’09); implementations already compared by Satish are not included
Quicksort: Cederman, ESA’08
Radixsort: fastest sorting algorithm on GPU (Satish, IPDPS’09)
Warpsort: our implementation

Performance results
Key-only: 70% higher performance than quicksort
Key-value: 20%+ higher performance than mergesort, 30%+ for large sequences (>4M)

Results under different distributions
Uniform, Bucket, and Gaussian distributions get almost the same performance
Zero distribution is the fastest
Does not excel on the Sorted distribution: load imbalance

Conclusion
We present an efficient comparison-based sorting algorithm for many-core GPUs
Carefully map the tasks to the GPU architecture: use a warp-based bitonic network to eliminate barriers
Provide sufficient homogeneous parallel operations for each thread: avoid thread idling or thread divergence
Totally coalesced global memory accesses when fetching and storing the sequence elements
The results demonstrate up to 30% higher performance compared with previous optimized comparison-based algorithms
Thanks