library generators and program optimization david padua university of illinois at urbana-champaign
TRANSCRIPT
![Page 1: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/1.jpg)
Library Generators and Program Optimization
David Padua
University of Illinois at Urbana-Champaign
![Page 2: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/2.jpg)
Libraries and Productivity
• Libraries help productivity.
• But not always.
– Not all algorithms implemented.
– Not all data structures.
• In any case, much effort goes into highly-tuned libraries.
• Automatic generation of libraries would
– Reduce cost of developing libraries
– For a fixed cost, enable a wider range of implementations and thus make libraries more usable.
![Page 3: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/3.jpg)
An illustration, based on MATLAB, of the effect of libraries on performance
![Page 4: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/4.jpg)
Compilers vs. Libraries in Sorting
~2X
~2X
![Page 5: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/5.jpg)
Compilers versus libraries in DFT
![Page 6: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/6.jpg)
Compilers vs. Libraries in Matrix-Matrix Multiplication (MMM)
![Page 7: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/7.jpg)
Library Generators
• Automatically generate highly efficient libraries for a class of platforms.
• No need to manually tune the library to the architectural characteristics of a new machine.
![Page 8: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/8.jpg)
Library Generators (Cont.)
• Examples:
– In linear algebra: ATLAS, PhiPAC
– In signal processing: FFTW, SPIRAL
• Library generators usually handle a fixed set of algorithms.
• Exception: SPIRAL accepts formulas and rewriting rules as input.
![Page 9: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/9.jpg)
Library Generators (Cont.)
• At installation time, LGs apply empirical optimization.
– That is, search for the best version in a set of different implementations
– Number of versions astronomical. Heuristics are needed.
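Empirical optimization as described on this slide amounts to timing candidate implementations on the target machine and keeping the fastest. The following is a minimal sketch in C, not taken from any of the generators named above: the three summation kernels (`sum_u1`, `sum_u2`, `sum_u4`) and the selector `pick_fastest` are hypothetical names, and the standard `clock()` timer stands in for a real measurement harness.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical candidate kernels: same result, different unrolling. */
typedef double (*sum_fn)(const double *x, size_t n);

double sum_u1(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

double sum_u2(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) { s0 += x[i]; s1 += x[i + 1]; }
    for (; i < n; i++) s0 += x[i];            /* clean-up iteration */
    return s0 + s1;
}

double sum_u4(const double *x, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i]; s1 += x[i + 1]; s2 += x[i + 2]; s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];            /* clean-up iterations */
    return (s0 + s1) + (s2 + s3);
}

/* Time every candidate `reps` times and return the index of the fastest. */
size_t pick_fastest(const sum_fn *cands, size_t ncands,
                    const double *x, size_t n, int reps) {
    assert(ncands > 0);
    size_t best = 0;
    clock_t best_time = 0;
    for (size_t c = 0; c < ncands; c++) {
        volatile double sink = 0.0;           /* keeps the calls from being optimized away */
        clock_t start = clock();
        for (int r = 0; r < reps; r++) sink += cands[c](x, n);
        clock_t elapsed = clock() - start;
        if (c == 0 || elapsed < best_time) { best = c; best_time = elapsed; }
    }
    return best;
}
```

With only three candidates an exhaustive sweep is feasible; the slide's point is that real generators face far too many versions for this, so they prune the candidate set with heuristics before timing anything.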
![Page 10: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/10.jpg)
Library Generators (Cont.)
• LGs must output C code for portability.
• Uneven quality of compilers =>
– Need for source-to-source optimizers
– Or incorporate in search space variations introduced by optimizing compilers.
![Page 11: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/11.jpg)
Library Generators (Cont.)
[Diagram: Algorithm description → Generator → C function → Source-to-source optimizer → C function → Native compiler → Object code → Execution → performance; the output of the process is the Final C function]
![Page 12: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/12.jpg)
Important research issues
• Reduction of the search space with minimal impact on performance
• Adaptation to the input data (not needed for dense linear algebra)
• More flexible generators
– algorithms
– data structures
– classes of target machines
• Tools to build library generators.
![Page 13: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/13.jpg)
Library generators and compilers
• LGs are a good yardstick for compilers
• Library generators use compilers.
• Compilers could use library generator techniques to optimize libraries in context.
• Search strategies could help design better compilers
– Optimization strategy: Most important open problem in compilers.
![Page 14: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/14.jpg)
Organization of a library generation system
[Diagram: a High Level Specification in a Domain Specific Language (DSL), for example a signal processing formula or a linear algebra algorithm in functional language notation, feeds a program generator with parameterizations for signal processing, linear algebra, and sorting. The generator emits X code with search directives, which goes to the Backend compiler to produce an Executable that is Run. Further boxes: Selection Strategy, Reflective optimization.]
![Page 15: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/15.jpg)
Three library generation projects
1. Spiral and the impact of compilers
2. ATLAS and analytical model
3. Sorting and adapting to the input
![Page 16: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/16.jpg)
![Page 17: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/17.jpg)
Spiral: A code generator for digital signal processing transforms
Joint work with:
Jose Moura (CMU),
Markus Pueschel (CMU),
Manuela Veloso (CMU),
Jeremy Johnson (Drexel)
![Page 18: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/18.jpg)
SPIRAL
• The approach:
– Mathematical formulation of signal processing algorithms
– Automatically generate algorithm versions
– A generalization of the well-known FFTW
– Use compiler techniques to translate formulas into implementations
– Adapt to the target platform by searching for the optimal version
![Page 19: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/19.jpg)
![Page 20: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/20.jpg)
Fast DSP Algorithms As Matrix Factorizations
• Computing y = F4 x is carried out as:
t1 = A4 x ( permutation )
t2 = A3 t1 ( two F2’s )
t3 = A2 t2 ( diagonal scaling )
y = A1 t3 ( two F2’s )
• The cost is reduced because A1, A2, A3 and A4 are structured sparse matrices.
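The four sparse steps can be written out directly. The sketch below is illustrative rather than SPIRAL output: the `cplx` struct and all function names are hypothetical, and the choice w = -i for the primitive 4th root of unity is an assumed sign convention. `f4_factored` applies the permutation, the two F2 butterflies, the diagonal scaling, and the second pair of butterflies; `f4_dense` is the plain 4 x 4 matrix-vector product used for comparison.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;

cplx c_add(cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }
cplx c_sub(cplx a, cplx b) { cplx r = { a.re - b.re, a.im - b.im }; return r; }

/* Multiply by w^k, where w = -i is the assumed primitive 4th root of unity. */
cplx c_mul_wpow(cplx z, int k) {
    cplx r;
    switch (k & 3) {
    case 0:  r.re =  z.re; r.im =  z.im; break;   /* w^0 =  1 */
    case 1:  r.re =  z.im; r.im = -z.re; break;   /* w^1 = -i */
    case 2:  r.re = -z.re; r.im = -z.im; break;   /* w^2 = -1 */
    default: r.re = -z.im; r.im =  z.re; break;   /* w^3 =  i */
    }
    return r;
}

/* y = A1 (A2 (A3 (A4 x))): four structured sparse steps. */
void f4_factored(const cplx x[4], cplx y[4]) {
    cplx t1[4], t2[4], t3[4];
    /* t1 = A4 x: stride permutation */
    t1[0] = x[0]; t1[1] = x[2]; t1[2] = x[1]; t1[3] = x[3];
    /* t2 = A3 t1: two F2's on adjacent pairs */
    t2[0] = c_add(t1[0], t1[1]); t2[1] = c_sub(t1[0], t1[1]);
    t2[2] = c_add(t1[2], t1[3]); t2[3] = c_sub(t1[2], t1[3]);
    /* t3 = A2 t2: diagonal scaling by (1, 1, 1, w) */
    t3[0] = t2[0]; t3[1] = t2[1]; t3[2] = t2[2];
    t3[3] = c_mul_wpow(t2[3], 1);
    /* y = A1 t3: two F2's on elements two apart */
    y[0] = c_add(t3[0], t3[2]); y[1] = c_add(t3[1], t3[3]);
    y[2] = c_sub(t3[0], t3[2]); y[3] = c_sub(t3[1], t3[3]);
}

/* Reference: dense product y[j] = sum over k of w^(j*k) * x[k]. */
void f4_dense(const cplx x[4], cplx y[4]) {
    for (int j = 0; j < 4; j++) {
        cplx acc = { 0.0, 0.0 };
        for (int k = 0; k < 4; k++) acc = c_add(acc, c_mul_wpow(x[k], j * k));
        y[j] = acc;
    }
}

/* Self-check; exact comparison is only safe for small integer-valued inputs. */
void f4_selfcheck(const cplx x[4]) {
    cplx yf[4], yd[4];
    f4_factored(x, yf);
    f4_dense(x, yd);
    for (int j = 0; j < 4; j++) assert(yf[j].re == yd[j].re && yf[j].im == yd[j].im);
}
```

The factored version uses 8 complex additions and one trivial scaling, against 12 additions and 16 multiplications by powers of w in the dense loop, which is the cost reduction the slide attributes to the sparse structure.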
![Page 21: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/21.jpg)
General Tensor Product Formulation
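Theorem

$$F_{rs} = (F_r \otimes I_s)\, T^{rs}_s\, (I_r \otimes F_s)\, L^{rs}_r$$

where $T^{rs}_s$ is a diagonal matrix and $L^{rs}_r$ is a stride permutation.

Example

$$F_4 = (F_2 \otimes I_2)\, T^4_2\, (I_2 \otimes F_2)\, L^4_2 =
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & \omega_4 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

Here $\omega_4$ denotes the primitive 4th root of unity used to define $F_4$.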
![Page 22: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/22.jpg)
Factorization Trees
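[Figure: three factorization trees for F8. Two apply rule R1 at the root (F8 : R1) and again at the F4 child (F4 : R1), ending in F2 leaves; the third applies rule R2 at the root (F8 : R2) with three F2 leaves.]
• Different computation order
• Different data access pattern
• Different performance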
![Page 23: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/23.jpg)
The SPIRAL System
[Diagram: DSP Transform → Formula Generator → SPL Program → SPL Compiler → C/FORTRAN Programs → Performance Evaluation on the Target machine → Search Engine, which steers the Formula Generator; the output is a DSP Library]
![Page 24: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/24.jpg)
SPL Compiler
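[Diagram: stages of the SPL compiler. Inputs: SPL Formula and Template Definition. Parsing produces an Abstract Syntax Tree and a Template Table; Intermediate Code Generation produces I-Code; Intermediate Code Restructuring and Optimization each transform I-Code into I-Code; Target Code Generation emits FORTRAN or C.]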
![Page 25: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/25.jpg)
Optimizations
Formula Generator
* High-level scheduling
* Loop transformation
SPL Compiler
* High-level optimizations
- Constant folding
- Copy propagation
- CSE
- Dead code elimination
C/Fortran Compiler
* Low-level optimizations
- Instruction scheduling
- Register allocation
![Page 26: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/26.jpg)
Basic Optimizations (FFT, N=2^5, SPARC, f77 -fast -O5)
![Page 27: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/27.jpg)
Basic Optimizations (FFT, N=2^5, MIPS, f77 -O3)
![Page 28: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/28.jpg)
Basic Optimizations (FFT, N=2^5, PII, g77 -O6 -malign-double)
![Page 29: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/29.jpg)
Overall performance
![Page 30: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/30.jpg)
An analytical model for ATLAS
Joint work with Keshav Pingali (Cornell), Gerald DeJong, and Maria Garzaran
![Page 31: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/31.jpg)
ATLAS
• ATLAS = Automatically Tuned Linear Algebra Software, developed by R. Clint Whaley, Antoine Petitet and Jack Dongarra, at the University of Tennessee.
• ATLAS uses empirical search to automatically generate highly-tuned Basic Linear Algebra Subprograms (BLAS).
– Use search to adapt to the target machine
![Page 32: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/32.jpg)
ATLAS Infrastructure
[Diagram: Detect Hardware Parameters passes L1Size, NR, MulAdd, and Latency to the ATLAS Search Engine (MMSearch); the search engine passes NB, MU, NU, KU, xFetch, MulAdd, and Latency to the ATLAS MM Code Generator (MMCase), which emits MiniMMM Source; Compile, Execute, Measure returns MFLOPS to the search engine]
![Page 33: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/33.jpg)
Detecting Machine Parameters
• Micro-benchmarks
– L1Size: L1 Data Cache size
• Similar to Hennessy-Patterson book
– NR: Number of registers
• Use several FP temporaries repeatedly
– MulAdd: Fused Multiply Add (FMA)
• “c+=a*b” as opposed to “c+=t; t=a*b”
– Latency: Latency of FP Multiplication
• Needed for scheduling multiplies and adds in the absence of FMA
![Page 34: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/34.jpg)
Compiler View
• ATLAS Code Generation
• Focus on MMM (as part of BLAS-3)
– Very good reuse: O(N^2) data, O(N^3) computation
– No “real” dependencies (only input / reuse ones)
[Diagram: the ATLAS infrastructure shown earlier, with the ATLAS MM Code Generator (MMCase) highlighted]
![Page 35: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/35.jpg)
Adaptations/Optimizations
• Cache-level blocking (tiling)
– ATLAS blocks only for L1 cache
• Register-level blocking
– Highest level of memory hierarchy
– Important to hold array values in registers
• Software pipelining
– Unroll and schedule operations
• Versioning
– Dynamically decide which way to compute
![Page 36: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/36.jpg)
Cache-level blocking (tiling)
• Tiling in ATLAS
– Only square tiles (NBxNBxNB)
– Working set of tile fits in L1
– Tiles are usually copied to contiguous storage
– Special “clean-up” code generated for boundaries
• Mini-MMM
    for (int j = 0; j < NB; j++)
      for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++)
          C[i][j] += A[i][k] * B[k][j];
• NB: Optimization parameter
[Figure: matrices A, B, and C with dimensions M, N, and K, and NB x NB tiles marked]
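To show where the mini-MMM sits inside the full multiplication, here is a small sketch of a tiled MMM in C. It is not ATLAS code: `NB` is fixed at 4 purely for illustration, the function names are hypothetical, matrices are assumed square and row-major with a size that is a multiple of `NB`, and the tile copy and clean-up code mentioned on the slide are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NB 4   /* illustrative tile size; ATLAS would pick it by search */

/* Mini-MMM: C_tile += A_tile * B_tile for NB x NB tiles that live inside
   n x n row-major matrices (so consecutive rows are n elements apart). */
void mini_mmm(size_t n, const double *a, const double *b, double *c) {
    for (size_t j = 0; j < NB; j++)
        for (size_t i = 0; i < NB; i++)
            for (size_t k = 0; k < NB; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Tiled MMM: C += A * B for n x n row-major matrices, n a multiple of NB. */
void tiled_mmm(size_t n, const double *a, const double *b, double *c) {
    assert(n % NB == 0);   /* this sketch has no clean-up code for boundaries */
    for (size_t jj = 0; jj < n; jj += NB)
        for (size_t ii = 0; ii < n; ii += NB)
            for (size_t kk = 0; kk < n; kk += NB)
                mini_mmm(n, a + ii * n + kk, b + kk * n + jj, c + ii * n + jj);
}
```

Because the tiles are addressed in place, this version does not benefit from the contiguous copies the slide describes; adding them would change only how the three tile pointers are prepared before each `mini_mmm` call.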
![Page 37: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/37.jpg)
Register-level blocking
• Micro-MMM
– MUx1 elements of A
– 1xNU elements of B
– MUxNU sub-matrix of C
– MU*NU + MU + NU ≤ NR
• Mini-MMM revised
    for (int j = 0; j < NB; j += NU)
      for (int i = 0; i < NB; i += MU)
        load C[i..i+MU-1, j..j+NU-1] into registers
        for (int k = 0; k < NB; k++)
          load A[i..i+MU-1,k] into registers
          load B[k,j..j+NU-1] into registers
          multiply A’s and B’s and add to C’s
        store C[i..i+MU-1, j..j+NU-1]
• Unroll K loop KU times
• MU, NU, KU: optimization parameters
[Figure: NB x NB tiles of A, B, and C with an MU x NU block of C marked, built from an MU-element column of A and an NU-element row of B]
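The register-blocked loop nest becomes concrete once MU and NU are fixed. The sketch below assumes MU = NU = 2, chosen only for illustration, so it needs MU*NU + MU + NU = 8 scalar temporaries. The function name is hypothetical, and whether the temporaries actually end up in registers is up to the native compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Mini-MMM with a 2 x 2 register tile (MU = NU = 2, an illustrative choice).
   Computes C += A * B for nb x nb row-major matrices, nb a multiple of 2. */
void mini_mmm_2x2(size_t nb, const double *a, const double *b, double *c) {
    assert(nb % 2 == 0);
    for (size_t j = 0; j < nb; j += 2) {
        for (size_t i = 0; i < nb; i += 2) {
            /* load the MU x NU block of C into scalar temporaries */
            double c00 = c[i * nb + j],       c01 = c[i * nb + j + 1];
            double c10 = c[(i + 1) * nb + j], c11 = c[(i + 1) * nb + j + 1];
            for (size_t k = 0; k < nb; k++) {
                double a0 = a[i * nb + k], a1 = a[(i + 1) * nb + k];  /* MU x 1 of A */
                double b0 = b[k * nb + j], b1 = b[k * nb + j + 1];    /* 1 x NU of B */
                /* micro-MMM: MU*NU multiply-adds on the temporaries */
                c00 += a0 * b0; c01 += a0 * b1;
                c10 += a1 * b0; c11 += a1 * b1;
            }
            /* store the block of C back to memory */
            c[i * nb + j] = c00;       c[i * nb + j + 1] = c01;
            c[(i + 1) * nb + j] = c10; c[(i + 1) * nb + j + 1] = c11;
        }
    }
}
```

Each iteration of the k loop performs four multiply-adds for four loads, whereas the unblocked mini-MMM needs two loads per multiply-add; larger MU and NU improve that ratio until the register constraint is hit.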
![Page 38: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/38.jpg)
Scheduling
• FMA Present?
• Schedule Computation
– Using Latency
• Schedule Memory Operations
– Using FFetch, IFetch, NFetch
• Mini-MMM revised
    for (int j = 0; j < NB; j += NU)
      for (int i = 0; i < NB; i += MU)
        load C[i..i+MU-1, j..j+NU-1] into registers
        for (int k = 0; k < NB; k += KU)
          load A[i..i+MU-1,k] into registers
          load B[k,j..j+NU-1] into registers
          multiply A’s and B’s and add to C’s
          ...                                        (KU times)
          load A[i..i+MU-1,k+KU-1] into registers
          load B[k+KU-1,j..j+NU-1] into registers
          multiply A’s and B’s and add to C’s
        store C[i..i+MU-1, j..j+NU-1]
• Latency, xFetch: optimization parameters
[Figure: schedule of the MU*NU multiplies (M1 … M_{MU*NU}) and MU*NU adds (A1 … A_{MU*NU}), with the adds skewed behind the multiplies by Latency = 2, and of the MU+NU loads (L1 … L_{MU+NU}): an initial group of IFetch loads followed by groups of NFetch loads interleaved with blocks of computation]
![Page 39: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/39.jpg)
Searching for Optimization Parameters
• ATLAS Search Engine
• Multi-dimensional search problem
– Optimization parameters are independent variables
– MFLOPS is the dependent variable
– Function is implicit but can be repeatedly evaluated
[Diagram: the ATLAS infrastructure shown earlier, with the ATLAS Search Engine (MMSearch) highlighted]
![Page 40: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/40.jpg)
Search Strategy
• Orthogonal Range Search
– Optimize along one dimension at a time, using reference values for not-yet-optimized parameters
– Not guaranteed to find optimal point
– Input
• Order in which dimensions are optimized
– NB, MU & NU, KU, xFetch, Latency
• Interval in which search is done in each dimension
– For NB it is $16 \le NB \le \min(\sqrt{L1Size},\ 80)$, step 4
• Reference values for not-yet-optimized dimensions
– Reference values for KU during NB search are 1 and NB
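The one-dimension-at-a-time strategy can be sketched in a few lines of C. Everything here is a simplification under stated assumptions: `orthogonal_search` and `toy_mflops` are hypothetical names, the objective is a synthetic stand-in for "generate, compile, run, and report MFLOPS", and only two parameters (NB and KU) are searched.

```c
#include <assert.h>

/* Objective to maximize; params[d] is the value of dimension d. */
typedef double (*objective_fn)(const int *params);

/* Optimize one dimension at a time, in index order.  On entry params[]
   holds the reference values; on exit it holds the chosen values.
   Returns the best score seen.  Not guaranteed to find the global optimum. */
double orthogonal_search(objective_fn f, int ndims,
                         const int *lo, const int *hi, const int *step,
                         int *params) {
    assert(ndims > 0);
    double best = f(params);
    for (int d = 0; d < ndims; d++) {
        assert(step[d] > 0 && lo[d] <= hi[d]);
        int best_v = params[d];
        for (int v = lo[d]; v <= hi[d]; v += step[d]) {
            params[d] = v;
            double score = f(params);
            if (score > best) { best = score; best_v = v; }
        }
        params[d] = best_v;   /* freeze this dimension before moving on */
    }
    return best;
}

/* Synthetic stand-in for a measured MFLOPS value (peak at NB = 44, KU = 4). */
double toy_mflops(const int *p) {
    double dnb = p[0] - 44, dku = p[1] - 4;
    return 1000.0 - dnb * dnb - 10.0 * dku * dku;
}
```

Called with `params = {16, 1}`, NB in 16..80 with step 4, and KU in 1..8 with step 1, the search settles on NB = 44 and KU = 4. It succeeds here because the toy objective is separable; when parameters interact, freezing dimensions in a fixed order can miss the optimum, which is the caveat stated on the slide.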
![Page 41: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/41.jpg)
Modeling for Optimization Parameters
• Our Modeling Engine
• Optimization parameters
– NB: Hierarchy of Models (later)
– MU, NU: $MU \cdot NU + MU + NU + Latency \le NR$
– KU: maximize subject to L1 Instruction Cache
– Latency, MulAdd: from hardware parameters
– xFetch: set to 2
[Diagram: Detect Hardware Parameters passes L1Size, L1I$Size, NR, MulAdd, and Latency to the Model, which takes the place of the ATLAS Search Engine (MMSearch) and passes NB, MU, NU, KU, xFetch, MulAdd, and Latency to the ATLAS MM Code Generator (MMCase), which emits MiniMMM Source]
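Reading the register constraint on this slide as MU*NU + MU + NU + Latency ≤ NR, one plausible way to turn it into concrete values is a brute-force scan over all feasible pairs. The sketch below is an assumption, not the model's actual procedure: the function name is hypothetical, and both the objective (maximize MU*NU) and the tie-break (prefer the most nearly square tile) were chosen here for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Pick (MU, NU) with the largest MU*NU such that
   MU*NU + MU + NU + Latency <= NR.
   Assumptions of this sketch: maximize MU*NU, break ties toward the most
   nearly square tile, and fall back to (1, 1) if nothing is feasible. */
void choose_register_tile(int nr, int latency, int *mu, int *nu) {
    assert(nr > 0 && latency >= 0);
    int best_mu = 1, best_nu = 1;
    for (int m = 1; m <= nr; m++) {
        for (int n = 1; n <= nr; n++) {
            if (m * n + m + n + latency > nr) continue;   /* infeasible */
            int area = m * n, best_area = best_mu * best_nu;
            if (area > best_area ||
                (area == best_area && abs(m - n) < abs(best_mu - best_nu))) {
                best_mu = m;
                best_nu = n;
            }
        }
    }
    *mu = best_mu;
    *nu = best_nu;
}
```

For example, with NR = 32 and Latency = 4 the scan returns MU = 3 and NU = 6, which uses 18 + 3 + 6 + 4 = 31 of the 32 registers.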
![Page 42: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/42.jpg)
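Modeling for Tile Size (NB)
• Models of increasing complexity
– $3 \cdot NB^2 \le C$
• Whole work-set fits in L1
– $NB^2 + NB + 1 \le C$
• Fully Associative
• Optimal Replacement
• Line Size: 1 word
– $\left\lceil \frac{NB^2}{B} \right\rceil + \left\lceil \frac{NB}{B} \right\rceil + 1 \le \frac{C}{B}$ or $\left\lceil \frac{NB^2}{B} \right\rceil + NB + 1 \le \frac{C}{B}$
• Line Size > 1 word
– $\left\lceil \frac{NB^2}{B} \right\rceil + 2\left\lceil \frac{NB}{B} \right\rceil + \left( \left\lceil \frac{NB}{B} \right\rceil + 1 \right) \le \frac{C}{B}$ or $\left\lceil \frac{NB^2}{B} \right\rceil + 3\,NB + 1 \le \frac{C}{B}$
• LRU Replacement
[Figure: matrices A, B, and C with dimensions M (I), N (J), and K and NB x NB tiles marked, shown for two traversal orders]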
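The two simplest models on this slide, 3*NB^2 ≤ C and NB^2 + NB + 1 ≤ C, can each be solved for the largest admissible NB with a short loop. In the sketch below the function names are hypothetical and C is assumed to be the L1 capacity measured in matrix elements (for instance, a 32 KB cache holding 8-byte doubles gives C = 4096).

```c
#include <assert.h>

/* Largest NB with 3*NB^2 <= C: all three tiles fit in L1. */
int nb_whole_working_set(long cache_elems) {
    assert(cache_elems >= 3);
    long nb = 1;
    while (3 * (nb + 1) * (nb + 1) <= cache_elems) nb++;
    return (int)nb;
}

/* Largest NB with NB^2 + NB + 1 <= C: fully associative cache,
   optimal replacement, one-word lines. */
int nb_fully_associative(long cache_elems) {
    assert(cache_elems >= 3);
    long nb = 1;
    while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= cache_elems) nb++;
    return (int)nb;
}
```

With C = 4096 the first model gives NB = 36 and the second NB = 63, which shows how much tile size the more conservative model leaves unused.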
![Page 43: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/43.jpg)
Experiments
• Architectures:
– SGI R12000, 270MHz
– Sun UltraSPARC III, 900MHz
– Intel Pentium III, 550MHz
• Measure
– Mini-MMM performance
– Complete MMM performance
– Sensitivity to variations on parameters
![Page 44: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/44.jpg)
MiniMMM Performance
• SGI
– ATLAS: 457 MFLOPS
– Model: 453 MFLOPS
– Difference: 1%
• Sun
– ATLAS: 1287 MFLOPS
– Model: 1052 MFLOPS
– Difference: 20%
• Intel
– ATLAS: 394 MFLOPS
– Model: 384 MFLOPS
– Difference: 2%
![Page 45: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/45.jpg)
MMM Performance
[Figure: three plots, one each for SGI, Sun, and Intel, comparing four series (BLAS, COMPILER, ATLAS, MODEL); the horizontal axes run from 0 to 5000, and the vertical axes run from 0 to 600 on two of the plots and from 0 to 1800 on the third]
![Page 46: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/46.jpg)
Sensitivity to NB and Latency on Sun
• Tile Size (NB)
• MU & NU, KU, Latency, xFetch for all architectures
• Latency
[Figure: two plots of MFLOPS (0 to 1600) on Sun, one against Tile Size (20 to 140) and one against Latency (1 to 13); markers B: Best, A: ATLAS, M: Model]
![Page 47: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/47.jpg)
Sensitivity to NB on SGI
[Figure: MFLOPS (0 to 600) against Tile Size (20 to 820) on SGI; markers B: Best, A: ATLAS, M: Model; annotations: 3*NB^2 ≤ C and NB^2 + NB + 1 ≤ C]
![Page 48: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/48.jpg)
Sorting
Joint work with Maria Garzaran and Xiaoming Li
![Page 49: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/49.jpg)
ESSL on Power3
![Page 50: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/50.jpg)
ESSL on Power4
![Page 51: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/51.jpg)
Motivation
• No universally best sorting algorithm
• Can we automatically GENERATE and tune sorting algorithms for each platform ?
• Performance of sorting depends not only on the platform but also on the input characteristics.
![Page 52: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/52.jpg)
A first strategy: Algorithm Selection
• Select the best algorithm from Quicksort, Multiway Merge Sort and CC-radix.
• Relevant input characteristics: number of keys, entropy vector.
![Page 53: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/53.jpg)
Algorithm Selection
![Page 54: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/54.jpg)
A better Solution
• We can use different algorithms for different partitions
• Build Composite Sorting algorithms
– Identify primitives from the sorting algorithms
– Design a general method to select an appropriate sorting primitive at runtime
– Design a mechanism to combine the primitives and the selection methods to generate the composite sorting algorithm
![Page 55: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/55.jpg)
Sorting Primitives
• Divide-by-Value
– A step in Quicksort
– Select one or multiple pivots and sort the input array around these pivots
– Parameter: number of pivots
• Divide-by-Position (DP)
– Divide input into same-size sub-partitions
– Use heap to merge the multiple sorted sub-partitions
– Parameters: size of sub-partitions, fan-out and size of the heap
![Page 56: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/56.jpg)
Sorting Primitives
• Divide-by-Radix (DR)
– Non-comparison based sorting algorithm
– Parameter: radix (r bits)
![Page 57: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/57.jpg)
Selection Primitives
• Branch-by-Size
• Branch-by-Entropy
– Parameter: number of branches, threshold vector of the branches
![Page 58: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/58.jpg)
Leaf Primitives
• When the size of a partition is small, we stick to one algorithm to sort the partition fully.
• Two methods are used in the cleanup operation
– Quicksort
– CC-Radix
![Page 59: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/59.jpg)
Composite Sorting Algorithms
• Composite sorting algorithms are built with these primitives.
• Algorithms are represented as trees.
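A very small composite algorithm can illustrate how the primitives combine into a tree. The sketch below is a hand-written example, not output of the generator described in these slides: it uses a Branch-by-Size node with a single threshold, a Divide-by-Value node with one pivot, and a leaf for small partitions. `LEAF_THRESHOLD`, the function names, and the use of the C library's `qsort` as a stand-in for the leaf sort are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LEAF_THRESHOLD 16   /* hypothetical; a generator would tune this */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Divide-by-Value with one pivot: partition v[0..n) around the middle
   element and return the pivot's final index. */
size_t divide_by_value(int *v, size_t n) {
    assert(n >= 2);
    swap_int(&v[n / 2], &v[n - 1]);          /* park the pivot at the end */
    int pivot = v[n - 1];
    size_t store = 0;
    for (size_t i = 0; i + 1 < n; i++)
        if (v[i] < pivot) swap_int(&v[i], &v[store++]);
    swap_int(&v[store], &v[n - 1]);          /* pivot into its final place */
    return store;
}

/* Composite sort: Branch-by-Size selects a leaf or a further division. */
void composite_sort(int *v, size_t n) {
    if (n < 2) return;
    if (n <= LEAF_THRESHOLD) {               /* small partition: leaf primitive */
        qsort(v, n, sizeof v[0], cmp_int);   /* stand-in for the leaf sort */
        return;
    }
    size_t p = divide_by_value(v, n);        /* large partition: divide, recurse */
    composite_sort(v, p);
    composite_sort(v + p + 1, n - p - 1);
}
```

In tree form this is a Branch-by-Size root whose "small" child is a leaf and whose "large" child is a Divide-by-Value node that leads back to the root for each sub-partition. A generated algorithm would also choose among Divide-by-Position, Divide-by-Radix, and Branch-by-Entropy nodes and tune their parameters.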
![Page 60: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/60.jpg)
Performance of Classifier Sorting
• Power3
![Page 61: Library Generators and Program Optimization David Padua University of Illinois at Urbana-Champaign](https://reader036.vdocument.in/reader036/viewer/2022062322/5697c0071a28abf838cc5d87/html5/thumbnails/61.jpg)
Power4