using cache models and empirical search in automatic tuning of applications apan qasem ken kennedy...

21
Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor- Crummey Rice University Houston, TX

Upload: stephany-craig

Post on 20-Jan-2016

221 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Using Cache Models and Empirical Search in Automatic Tuning of

Applications

Using Cache Models and Empirical Search in Automatic Tuning of

ApplicationsApan Qasem Ken Kennedy John Mellor-Crummey

Rice UniversityHouston, TX

Apan Qasem Ken Kennedy John Mellor-CrummeyRice University

Houston, TX

Page 2: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 2

OutlineOutline

• Overview of Framework– Fine grain control of transformations– Loop-level feedback– Search strategy

• Experimental Results• Empirical Tuning of Loop Fusion• Future Work

Page 3: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

A Framework for AutotuningA Framework for AutotuningA Framework for AutotuningA Framework for Autotuning

BinaryF77 Source

Vendor Compiler

LoopTool

hpcview

bloop hpcrun

ParameterizedSearch Engine

AnalyticModeler

AnnotatedSource

TransformedSource

Search Space

Architectural Specs

Search State

Profile Information

Program Structure

Simulated Annealing

Random

Direct Search

Itanium2

Alpha

SGI

Pentium 4

Initial Parameters

Next Iteratio

n

Parameters

Feedback

Page 4: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 4

Fine Grain Control of Fine Grain Control of TransformationsTransformations

• Not enough control over transformations in most commercial compilers

• Example sweep3d• 68 loops in the sweep routine • Using the same unroll factor for all 4 loops

results in a 2% speedup • Using 4 different unroll factors for 4 loops

results in a 8% speedup

Page 5: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 5

Fine Grain Control of Fine Grain Control of TransformationsTransformations

• LoopTool: A Source to source transformer– Performs transformations

such as loop fusion, tiling, unroll-and-jam

– Applies transformations from source code annotation

– Provides loop level control of transformations

cdir$ unroll 4do j = 1, N cdir$ block 16 do i = 1, M cdir$ block 16

do k = 1, L S1(k, i, j)

enddo enddo enddo

Page 6: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 6

FeedbackFeedback

• Feedback beyond whole program execution time– Need feedback at loop level

granularity• Can potentially optimize independent

code regions concurrently

– Need other performance metrics:• Cache misses • TLB misses• Pipeline stalls

Page 7: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 7

FeedbackFeedback

• HPCToolKit– hpcrun - profiles executions using

statistical sampling of hardware performance counters

– bloop - retrieves loop structure from binaries

– hpcview - correlates program structure information with sample-based performance profiles

Page 8: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 8

Search SpaceSearch Space

• Optimization search space is very large and complex– Example mm

• 2-level blocking (range 1-100) • 1-unrolled loop (range 1-20)• 100 x 100 x 20 = 200,000 search points

– Characteristics of the search space is hard to predict

– Need a search technique that can explore the search space efficiently

Page 9: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 9

Search StrategySearch Strategy

• Pattern-based direct search method– The search is derivative-free

• Works well for a non-continuous search space

– Provides approximate solutions at each stage of the calculation• Can stop the search at any point when

constrained by tuning time

– Provides flexibility

Page 10: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 10

Performance ImprovementPerformance Improvement

1

1.2

1.4

1.6

1.8

2

2.2

2.4

Speedup

advect3d lud mm vpenta mgrid swim GeometricMean

Benchmarks

DS 30

DS 60

DS 90

DS 120

Page 11: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 11

Performance Improvement Performance Improvement ComparisonComparison

80

85

90

95

100

105

0 30 60 90 120 150

Number of Iterations

Fraction of Best Speedup

Exhaustive Search

Direct Search

Random Search

Page 12: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 12

Tuning Time ComparisonTuning Time Comparison

0

2

4

6

8

10

12

advect3d lud mm vpenta mgrid swim

Benchmarks

Fraction of Tuning Time

DS 30

DS 60

DS 90

DS 120

Page 13: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 13

Empirical Tuning of Loop Empirical Tuning of Loop FusionFusion

• Important transformation when looking at whole applications

• Poses new problems– Multi-loop reuse analysis– Interaction with tiling– What parameters to tune?

• Trying all combinations is perhaps too much

Page 14: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 14

Profitability Model for Fusion Profitability Model for Fusion

• Hierarchical reuse analysis to estimate cost at different memory levels

• A probabilistic model to account for conflict misses– Effective Cache Size

• Resource constraints– Register Pressure– Instruction Cache Capacity

• Integrate the model into a pair-wise greedy fusion algorithm

Page 15: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 15

Tuning Parameters for Tuning Parameters for FusionFusion

• Use search to tune the following parameters– Effective Cache Size– Register Pressure– Instruction Cache capacity

• Preliminary Experiments – Works reasonably well

Page 16: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 16

Speedup from Loop FusionSpeedup from Loop Fusion

0.7

0.8

0.9

1

1.1

1.2

1.3

1.4

1.5

Speedup over no-fuse

advect3d erlebacher liv18 mgrid

Benchmarks

ccfm

simple

mips-pro

nofuse

Page 17: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 17

SummarySummary

• Autotuning framework able to improve performance on a range of architectures

• Direct search able to find good tile sizes and unroll factors by exploring only a small fraction of the search space

• Combining static models with search

works well for loop fusion

Page 18: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 18

Future WorkFuture Work

• Refine cache models to capture interactions between transformations

• Consider data allocation strategies for tuning

Page 19: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Extra Slides Begin HereExtra Slides Begin Here

Page 20: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 20

PlatformsPlatforms

Processor L1 L2 L3 Compiler

Alpha 21264A @ 667MHz

8K 8M - Compaq Fortran V5.5-1877

Itanium2 @ 900 MHz

16K 256K 1.5M Intel Fortran Compiler 7.1

SGI Origin R10K @ 195 MHz

32K 1M - MipsPro 7.3

Pentium 4 @ 2 GHz

8K 512K - Intel Fortran Compiler 7.1

Page 21: Using Cache Models and Empirical Search in Automatic Tuning of Applications Apan Qasem Ken Kennedy John Mellor-Crummey Rice University Houston, TX Apan

Autotuning Workshop 2005 Rice University 21

Performance Improvement for Different Performance Improvement for Different ArchitecturesArchitectures

1

1.2

1.4

1.6

1.8

2

2.2

Speedup

Itanium2 Alpha SGI Pentium IV

Platforms

DS 30DS 60DS 90DS 120