using cache models and empirical search in automatic tuning of applications apan qasem ken kennedy...
TRANSCRIPT
Using Cache Models and Empirical Search in Automatic Tuning of
Applications
Using Cache Models and Empirical Search in Automatic Tuning of
ApplicationsApan Qasem Ken Kennedy John Mellor-Crummey
Rice UniversityHouston, TX
Apan Qasem Ken Kennedy John Mellor-CrummeyRice University
Houston, TX
Autotuning Workshop 2005 Rice University 2
OutlineOutline
• Overview of Framework– Fine grain control of transformations– Loop-level feedback– Search strategy
• Experimental Results• Empirical Tuning of Loop Fusion• Future Work
A Framework for AutotuningA Framework for AutotuningA Framework for AutotuningA Framework for Autotuning
BinaryF77 Source
Vendor Compiler
LoopTool
hpcview
bloop hpcrun
ParameterizedSearch Engine
AnalyticModeler
AnnotatedSource
TransformedSource
Search Space
Architectural Specs
Search State
Profile Information
Program Structure
Simulated Annealing
Random
Direct Search
Itanium2
Alpha
SGI
Pentium 4
Initial Parameters
Next Iteratio
n
Parameters
Feedback
Autotuning Workshop 2005 Rice University 4
Fine Grain Control of Fine Grain Control of TransformationsTransformations
• Not enough control over transformations in most commercial compilers
• Example sweep3d• 68 loops in the sweep routine • Using the same unroll factor for all 4 loops
results in a 2% speedup • Using 4 different unroll factors for 4 loops
results in a 8% speedup
Autotuning Workshop 2005 Rice University 5
Fine Grain Control of Fine Grain Control of TransformationsTransformations
• LoopTool: A Source to source transformer– Performs transformations
such as loop fusion, tiling, unroll-and-jam
– Applies transformations from source code annotation
– Provides loop level control of transformations
cdir$ unroll 4do j = 1, N cdir$ block 16 do i = 1, M cdir$ block 16
do k = 1, L S1(k, i, j)
enddo enddo enddo
Autotuning Workshop 2005 Rice University 6
FeedbackFeedback
• Feedback beyond whole program execution time– Need feedback at loop level
granularity• Can potentially optimize independent
code regions concurrently
– Need other performance metrics:• Cache misses • TLB misses• Pipeline stalls
Autotuning Workshop 2005 Rice University 7
FeedbackFeedback
• HPCToolKit– hpcrun - profiles executions using
statistical sampling of hardware performance counters
– bloop - retrieves loop structure from binaries
– hpcview - correlates program structure information with sample-based performance profiles
Autotuning Workshop 2005 Rice University 8
Search SpaceSearch Space
• Optimization search space is very large and complex– Example mm
• 2-level blocking (range 1-100) • 1-unrolled loop (range 1-20)• 100 x 100 x 20 = 200,000 search points
– Characteristics of the search space is hard to predict
– Need a search technique that can explore the search space efficiently
Autotuning Workshop 2005 Rice University 9
Search StrategySearch Strategy
• Pattern-based direct search method– The search is derivative-free
• Works well for a non-continuous search space
– Provides approximate solutions at each stage of the calculation• Can stop the search at any point when
constrained by tuning time
– Provides flexibility
Autotuning Workshop 2005 Rice University 10
Performance ImprovementPerformance Improvement
1
1.2
1.4
1.6
1.8
2
2.2
2.4
Speedup
advect3d lud mm vpenta mgrid swim GeometricMean
Benchmarks
DS 30
DS 60
DS 90
DS 120
Autotuning Workshop 2005 Rice University 11
Performance Improvement Performance Improvement ComparisonComparison
80
85
90
95
100
105
0 30 60 90 120 150
Number of Iterations
Fraction of Best Speedup
Exhaustive Search
Direct Search
Random Search
Autotuning Workshop 2005 Rice University 12
Tuning Time ComparisonTuning Time Comparison
0
2
4
6
8
10
12
advect3d lud mm vpenta mgrid swim
Benchmarks
Fraction of Tuning Time
DS 30
DS 60
DS 90
DS 120
Autotuning Workshop 2005 Rice University 13
Empirical Tuning of Loop Empirical Tuning of Loop FusionFusion
• Important transformation when looking at whole applications
• Poses new problems– Multi-loop reuse analysis– Interaction with tiling– What parameters to tune?
• Trying all combinations is perhaps too much
Autotuning Workshop 2005 Rice University 14
Profitability Model for Fusion Profitability Model for Fusion
• Hierarchical reuse analysis to estimate cost at different memory levels
• A probabilistic model to account for conflict misses– Effective Cache Size
• Resource constraints– Register Pressure– Instruction Cache Capacity
• Integrate the model into a pair-wise greedy fusion algorithm
Autotuning Workshop 2005 Rice University 15
Tuning Parameters for Tuning Parameters for FusionFusion
• Use search to tune the following parameters– Effective Cache Size– Register Pressure– Instruction Cache capacity
• Preliminary Experiments – Works reasonably well
Autotuning Workshop 2005 Rice University 16
Speedup from Loop FusionSpeedup from Loop Fusion
0.7
0.8
0.9
1
1.1
1.2
1.3
1.4
1.5
Speedup over no-fuse
advect3d erlebacher liv18 mgrid
Benchmarks
ccfm
simple
mips-pro
nofuse
Autotuning Workshop 2005 Rice University 17
SummarySummary
• Autotuning framework able to improve performance on a range of architectures
• Direct search able to find good tile sizes and unroll factors by exploring only a small fraction of the search space
• Combining static models with search
works well for loop fusion
Autotuning Workshop 2005 Rice University 18
Future WorkFuture Work
• Refine cache models to capture interactions between transformations
• Consider data allocation strategies for tuning
Extra Slides Begin HereExtra Slides Begin Here
Autotuning Workshop 2005 Rice University 20
PlatformsPlatforms
Processor L1 L2 L3 Compiler
Alpha 21264A @ 667MHz
8K 8M - Compaq Fortran V5.5-1877
Itanium2 @ 900 MHz
16K 256K 1.5M Intel Fortran Compiler 7.1
SGI Origin R10K @ 195 MHz
32K 1M - MipsPro 7.3
Pentium 4 @ 2 GHz
8K 512K - Intel Fortran Compiler 7.1
Autotuning Workshop 2005 Rice University 21
Performance Improvement for Different Performance Improvement for Different ArchitecturesArchitectures
1
1.2
1.4
1.6
1.8
2
2.2
Speedup
Itanium2 Alpha SGI Pentium IV
Platforms
DS 30DS 60DS 90DS 120