Slides: tildeweb.au.dk/au121/slides/alenex04.pdf
Engineering a Cache-Oblivious Sorting Algorithm
Gerth Stølting Brodal, University of Aarhus
Rolf Fagerberg, University of Aarhus
Kristoffer Vinther, Systematic Software Engineering
ALENEX 2004, New Orleans, LA, January 10, 2004
Motivation
• Memory hierarchy has become a fact of life
• Accessing non-local storage may take a very long time
• Good locality is important to achieving high performance

| Level | Latency | Relative to CPU |
|---|---|---|
| Register | 0.5 ns | 1 |
| L1 cache | 0.5 ns | 1-2 |
| L2 cache | 3 ns | 2-7 |
| DRAM | 150 ns | 80-200 |
| TLB | 500+ ns | 200-2000 |
| Disk | 10 ms | 10^7 |

Cache-Oblivious Model
Frigo, Leiserson, Prokop, Ramachandran, FOCS’99
• Program in the RAM model
• Analyze in the I/O model for arbitrary B and M
• Optimal off-line cache replacement strategy
• Optimal on arbitrary level ⇒ optimal on all levels
• Portability
[Figure: I/O model with a CPU, a cache of size M, and memory, with I/O between cache and memory in blocks of size B]
[Figure: memory hierarchy CPU, L1, L2, RAM, Disk, with increasing access time and space]
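
To make "program in the RAM model, analyze in the I/O model" concrete, the following minimal sketch counts block transfers for an access trace. It is an illustration only: LRU replacement is an assumption standing in for the optimal off-line strategy named on the slide, and the function name `count_ios` and the values of `N`, `M`, and `B` are hypothetical.

```python
from collections import OrderedDict

def count_ios(addresses, M, B):
    """Count block transfers for an access trace, given a cache of M words
    and blocks of B words. LRU is an assumed stand-in for optimal replacement."""
    cache = OrderedDict()      # block id -> None, least recently used first
    capacity = M // B          # number of blocks that fit in the cache
    ios = 0
    for a in addresses:
        block = a // B
        if block in cache:
            cache.move_to_end(block)       # hit: mark as most recently used
        else:
            ios += 1                       # miss: one block transfer
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return ios

N, M, B = 1024, 64, 8                      # illustrative sizes, in words
# Sequential scan: consecutive addresses share blocks.
scan = count_ios(range(N), M, B)
# Strided access: every access lands in a different block than the previous one.
strided = count_ios([(i % (N // B)) * B + i // (N // B) for i in range(N)], M, B)
print(scan, strided)
```

The code never mentions `M` or `B` in the access pattern itself; only the analysis does. With these sizes, the scan costs N/B transfers, while the strided order misses on every access.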
Sorting: I/O Bounds
• QuickSort, Binary MergeSort (cache-oblivious): $O\left(\frac{N}{B} \cdot \log_2 \frac{N}{M}\right)$
• $\Theta\left(\frac{M}{B}\right)$-way MergeSort (cache-aware): $O(\mathrm{Sort}_{M,B}(N))$, Aggarwal and Vitter 1988
• (Lazy) Funnelsort (cache-oblivious, requires $M \ge B^{1+\varepsilon}$): $O(\mathrm{Sort}_{M,B}(N))$, Frigo, Leiserson, Prokop and Ramachandran 1999; Brodal and Fagerberg 2002

$$\mathrm{Sort}_{M,B}(N) = \frac{N}{B} \cdot \log_{M/B} \frac{N}{B}$$

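
To get a feel for the gap between the two bounds, the sketch below evaluates both formulas, ignoring constant factors. The function names and the sample values of N, M, and B are assumptions chosen for illustration, not figures from the talk.

```python
import math

def sort_bound(N, M, B):
    """Sort_{M,B}(N) = (N/B) * log_{M/B}(N/B), constant factors ignored."""
    return (N / B) * math.log(N / B, M / B)

def binary_mergesort_bound(N, M, B):
    """(N/B) * log_2(N/M), the bound listed for QuickSort and binary MergeSort."""
    return (N / B) * math.log2(N / M)

# Illustrative sizes in elements: input 2^30, memory 2^20, block 2^10.
N, M, B = 2**30, 2**20, 2**10
optimal = sort_bound(N, M, B)
binary = binary_mergesort_bound(N, M, B)
print(f"Sort bound: {optimal:.3g} I/Os, binary merge bound: {binary:.3g} I/Os")
print(f"ratio: {binary / optimal:.2f}")
```

The ratio grows with the base of the logarithm, M/B, which is why multiway merging matters when blocks are small relative to memory.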
Lazy Funnelsort
Divide input in $N^{1/3}$ segments of size $N^{2/3}$
Recursively Funnelsort each segment
Merge sorted segments by a lazy $N^{1/3}$-merger
[Figure: merger size k at successive recursion levels: $N^{1/3}$, $N^{2/9}$, $N^{4/27}$, ..., 2]
Brodal and Fagerberg 2002
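
The recursion above can be sketched in a few lines. This is only an outline of the divide step under stated assumptions: `heapq.merge` stands in for the lazy $N^{1/3}$-merger (it is not cache-oblivious), and the function name and the `cutoff` for small inputs are hypothetical.

```python
import heapq

def funnelsort(a, cutoff=8):
    """Outline of the Funnelsort recursion: about N^(1/3) segments of size N^(2/3)."""
    n = len(a)
    if n <= cutoff:
        return sorted(a)                        # base case for small inputs
    seg = max(1, round(n ** (2 / 3)))           # segment size about N^(2/3)
    runs = [funnelsort(a[i:i + seg], cutoff)    # recursively sort each segment
            for i in range(0, n, seg)]
    return list(heapq.merge(*runs))             # stand-in for the lazy k-merger

data = [(i * 7919) % 1000 for i in range(1000)]
assert funnelsort(data) == sorted(data)
```

For 1000 elements the top level uses segments of 100 elements, so the final merge is a 10-way merge, matching the $N^{1/3}$ fan-in.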
Lazy k-merger
[Figure: k-merger with top merger $M_{top}$, bottom mergers $M_1, \dots, M_{\sqrt{k}}$, and buffers $B_1, \dots, B_{\sqrt{k}}$ between them]
Buffer size $\alpha \cdot \sqrt{k^{d}}$
Brodal and Fagerberg 2002
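
The "lazy" part can be illustrated with a binary merger tree in which a node refills a child's buffer only when that buffer runs empty. The sketch below is a simplified illustration, not the authors' implementation: every buffer has the same fixed `capacity` (an assumption, instead of the recursive $\alpha \cdot \sqrt{k^{d}}$ sizing), nodes are ordinary Python objects with no van Emde Boas layout, and all names are hypothetical.

```python
from collections import deque

class Node:
    """Binary merger node with an output buffer; a leaf holds one sorted run."""
    def __init__(self, left=None, right=None, run=(), capacity=4):
        self.left, self.right = left, right
        self.buf = deque(run)
        self.capacity = capacity
        self.exhausted = left is None       # a leaf is never refilled

def fill(v):
    """Merge from the children into v.buf until it is full or the inputs run dry."""
    while len(v.buf) < v.capacity:
        for child in (v.left, v.right):
            if not child.buf and not child.exhausted:
                fill(child)                 # lazy: refill only an empty child buffer
        l, r = v.left.buf, v.right.buf
        if not l and not r:
            v.exhausted = True              # both inputs are used up
            return
        src = l if l and (not r or l[0] <= r[0]) else r
        v.buf.append(src.popleft())

def build(runs, capacity):
    """Build a balanced binary merger tree over the sorted runs."""
    if len(runs) == 1:
        return Node(run=runs[0])
    mid = len(runs) // 2
    return Node(build(runs[:mid], capacity), build(runs[mid:], capacity),
                capacity=capacity)

def k_merge(runs, capacity=4):
    """Merge k sorted runs by repeatedly filling and draining the root buffer."""
    if not runs:
        return []
    if len(runs) == 1:
        return list(runs[0])
    root, out = build(runs, capacity), []
    while not root.exhausted:
        fill(root)
        out.extend(root.buf)
        root.buf.clear()
    return out

print(k_merge([[1, 4, 9], [2, 3, 10], [0, 5], [6, 7, 8]]))
```

Each call to `fill` works on one node until its buffer is full, so work on a subtree happens in bursts rather than one element at a time.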
Engineering Lazy Funnelsort
• Recursive implementation beats iterative implementation
• 4- or 5-way merge beats 2-way merge
• Standard memory allocator beats hand-coded allocator
• Pointer based van Emde Boas layout beats implicit layouts
• Nodes and buffers stored separately beats one layout
• Straightforward beats hand-coded branch elimination in core loop
• d = 2.5 and α = 16
• Reuse merger data structures
• For k-mergers of height < 2 switch to Quicksort
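
As a small worked example of the tuned parameters, the sketch below evaluates the buffer size formula $\alpha \cdot \sqrt{k^{d}}$ with d = 2.5 and α = 16 for a few merger sizes. Rounding up to a whole number of elements is an assumption, and the function name is hypothetical.

```python
import math

ALPHA, D = 16, 2.5    # parameter values reported on the slide

def buffer_size(k, alpha=ALPHA, d=D):
    """Buffer size alpha * sqrt(k^d) = alpha * k^(d/2); rounding up is assumed."""
    return math.ceil(alpha * k ** (d / 2))

for k in (4, 16, 64, 256):
    print(f"k = {k:3d}: buffer of about {buffer_size(k)} elements")
```

Because the exponent d/2 is above 1, the buffer size grows faster than the merger's fan-in k.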
Evaluating Funnelsort
• 2-way and 4-way Funnelsort
• Quicksort
– STL GCC
– STL Intel C++
– Sedgewick
– Bentley & McIlroy with pivot tuning
• TPIE - optimized for external memory
• Rmerge - optimized for registers
• Mergesort by LaMarca and Ladner
Hardware Specifications

| | Pentium 4 | Pentium III | MIPS 10000 | AMD Athlon | Itanium 2 |
|---|---|---|---|---|---|
| Architecture type | Modern CISC | Classic CISC | RISC | Modern CISC | EPIC |
| Operating system | Linux v. 2.4.18 | Linux v. 2.4.18 | IRIX v. 6.5 | Linux 2.4.18 | Linux 2.4.18 |
| Clock rate | 2400 MHz | 800 MHz | 175 MHz | 1333 MHz | 1137 MHz |
| Address space | 32 bit | 32 bit | 64 bit | 32 bit | 64 bit |
| Pipeline stages | 20 | 12 | 6 | 10 | 8 |
| L1 data cache size | 8 KB | 16 KB | 32 KB | 128 KB | 32 KB |
| L1 line size | 128 B | 32 B | 32 B | 64 B | 64 B |
| L1 associativity | 4-way | 4-way | 2-way | 2-way | 4-way |
| L2 cache size | 512 KB | 256 KB | 1024 KB | 256 KB | 256 KB |
| L2 line size | 128 B | 32 B | 32 B | 64 B | 128 B |
| L2 associativity | 8-way | 4-way | 2-way | 8-way | 8-way |
| TLB entries | 128 | 64 | 64 | 40 | 128 |
| TLB associativity | full | 4-way | 64-way | 4-way | full |
| TLB miss handling | hardware | hardware | software | hardware | ? |
| RAM size | 512 MB | 256 MB | 128 MB | 512 MB | 3072 MB |

Comparison of Quicksort Implementations
[Figure: Uniform pairs, Pentium 4. Walltime/(n log n), 6e-09 to 2.2e-08, versus log n, 12 to 26. Series: GCC, Dink, Mix, Sedge.]
[Figure: Uniform pairs, Pentium III. Walltime/(n log n), 1.5e-08 to 6e-08, versus log n, 12 to 24. Series: GCC, Dink, Mix, Sedge.]
[Figure: Uniform pairs, AMD Athlon. Walltime/(n log n), 1e-08 to 4e-08, versus log n, 12 to 26. Series: GCC, Dink, Mix, Sedge.]
[Figure: Uniform pairs, Itanium 2. Walltime/(n log n), 5e-09 to 3e-08, versus log n, 12 to 28. Series: GCC, Dink, Mix, Sedge.]
[Figure: Uniform pairs, MIPS 10000. Walltime/(n log n), 5e-08 to 3.5e-07, versus log n, 13 to 21. Series: GCC, Dink, Mix, Sedge.]
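
The plots report wall time divided by n log n, which flattens the expected growth of a comparison sort so that cache effects show up as bumps in the curve. A minimal sketch of that measurement follows. It is not the authors' benchmark harness: reading "uniform pairs" as pairs of uniformly random numbers, using base-2 logarithms, and the function name are all assumptions.

```python
import math
import random
import time

def normalized_walltime(sort_fn, n, seed=0):
    """Return wall time of sort_fn on n random pairs, divided by n * log2(n)."""
    rng = random.Random(seed)
    # Assumed reading of "uniform pairs": pairs of uniformly random floats.
    data = [(rng.random(), rng.random()) for _ in range(n)]
    start = time.perf_counter()
    sort_fn(data)
    elapsed = time.perf_counter() - start
    return elapsed / (n * math.log2(n))

for log_n in (12, 14, 16):
    print(log_n, normalized_walltime(sorted, 1 << log_n))
```

A real benchmark would repeat each measurement and control for memory allocation; this only shows how the plotted quantity is formed.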
Results for Inputs in RAM
[Figure: Uniform pairs, Pentium 4. Walltime/(n log n), 6e-09 to 1.8e-08, versus log n, 12 to 26. Series: Funnelsort2, Funnelsort4, Mix, msort-c, msort-m, Rmerge, GCC, TPIE.]
[Figure: Uniform pairs, Pentium III. Walltime/(n log n), 2e-08 to 6.5e-08, versus log n, 12 to 24. Series: Funnelsort2, Funnelsort4, Mix, msort-c, msort-m, Rmerge, GCC, TPIE.]
[Figure: Uniform pairs, AMD Athlon. Walltime/(n log n), 1e-08 to 3e-08, versus log n, 12 to 24. Series: Funnelsort2, Funnelsort4, Mix, msort-c, msort-m, Rmerge, GCC, TPIE.]
[Figure: Uniform pairs, Itanium 2. Walltime/(n log n), 8e-09 to 2.8e-08, versus log n, 12 to 28. Series: Funnelsort2, GCC, msort-c, msort-m.]
[Figure: Uniform pairs, MIPS 10000. Walltime/(n log n), 6e-08 to 2.2e-07, versus log n, 13 to 22. Series: Funnelsort2, Funnelsort4, Mix, msort-c, msort-m, Rmerge, GCC.]
Results for Inputs on Disk
[Figure: Uniform pairs, Pentium 4. Walltime/(n log n), 0 to 4e-07, versus log n, 21 to 28. Series: Funnelsort2, msort-c, msort-m, Rmerge, GCC, TPIE.]
[Figure: Uniform pairs, Pentium III. Walltime/(n log n), 0 to 6e-07, versus log n, 21 to 28. Series: Funnelsort2, msort-c, msort-m, Rmerge, GCC, TPIE.]
[Figure: Uniform pairs, MIPS 10000. Walltime/(n log n), 0 to 3e-06, versus log n, 18 to 28. Series: Funnelsort2, msort-c, msort-m, Rmerge, GCC.]
Conclusion
• Very high performing generic sorting algorithm
• Performance remains robust across wide range of input sizes
• Across several different processor and operating system architectures
• On several different data types
• On several different input distributions
• Overhead involved in being cache-oblivious can be small enough for the nice theoretical properties to actually transfer into practical advantages