Compiling for IA-64
Carol Thompson, Optimization Architect, Hewlett Packard

Upload: alanna-robnett

Post on 14-Dec-2015


Page 1: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Compiling for IA-64

Carol Thompson
Optimization Architect
Hewlett Packard

Page 2: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

History of ILP Compilers

• CISC era: no significant ILP
  – Compiler is merely a tool to enable use of high-level language, at some performance cost
• RISC era: advent of ILP
  – Compiler-influenced architecture
  – Instruction scheduling becomes important
• EPIC era: ILP as driving force
  – Compiler-specified ILP

Page 3: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Increasing Scope for ILP Compilation

• Early RISC Compilers
  – Basic block scope (delimited by branches & branch targets)
• Superscalar RISC and early VLIW Compilers
  – Trace scope (single entry, single path)
  – Superblocks & Hyperblocks (single entry, multiple path)
• EPIC Compilers
  – Composite regions: multiple entry, multiple path

[Figure labels: Basic Blocks, Traces, Superblock, Composite Regions]

Page 4: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Unbalanced and Unbiased Control Flow

• Most code is not well balanced
  – Many very small blocks
  – Some very large
  – Then and else clauses frequently unbalanced
    – Number of instructions
    – Path length
• Many branches are highly biased
  – But some are not
  – Compiler can obtain frequency information from profiling or derive it heuristically

[Figure: control flow graph with edge frequencies 60, 40, 55, 5, 0]

Page 5: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Basic Blocks

• Basic Blocks are simple
  – No issues with executing unnecessary instructions
  – No speculation or predication support required
• But, very limited ILP
  – Short blocks offer very little opportunity for parallelism
  – Long latency code is unable to take advantage of issue bandwidth in an earlier block

[Figure: control flow graph with edge frequencies 60, 40, 55, 5, 0]
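As a minimal sketch of the basic-block scope (blocks delimited by branches and branch targets), the following Python splits a linear instruction list into basic blocks. The tuple encoding of instructions, the opcode names, and the function name `split_basic_blocks` are assumptions made for illustration; they do not come from the slides.

```python
def split_basic_blocks(instrs):
    """Split a linear instruction list into basic blocks.

    Each instruction is a hypothetical (opcode, branch_target_index_or_None)
    tuple. A new block starts at index 0, at every branch target, and
    immediately after every branch.
    """
    if not instrs:
        return []
    leaders = {0}
    for i, (_, target) in enumerate(instrs):
        if target is not None:
            leaders.add(target)          # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)       # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]


# Hypothetical six-instruction fragment: an if/else that rejoins at index 5.
example = [
    ("cmp", None),    # 0
    ("br_if", 4),     # 1: conditional branch to the else arm
    ("add", None),    # 2: then arm
    ("br", 5),        # 3: jump over the else arm
    ("sub", None),    # 4: else arm
    ("ret", None),    # 5: join point
]
blocks = split_basic_blocks(example)
```

Even this tiny fragment breaks into four blocks of one or two instructions each, which is why a scheduler confined to basic blocks finds little parallelism.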

Page 6: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Traces

[Figure: control flow graph with edge frequencies 60, 40, 55, 5, 0]

• Traces allow scheduling of multiple blocks together
  – Increases available ILP
  – Long latency operations can be moved up, as long as they are on the same trace
• But, unbiased branches are a problem
  – Long latency code in slightly less frequent paths can’t move up
  – Issue bandwidth may go unused (not enough concurrent instructions to fill available execution units)
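The trace idea can be sketched as a greedy walk that always follows the most frequent outgoing edge. The control flow graph below is hypothetical: its edge weights loosely echo the 60/40 and 55/5 frequencies in the figure, but its shape and the name `select_trace` are assumptions, not the slide's actual example.

```python
def select_trace(cfg, entry):
    """Greedily grow one trace from `entry` by following the hottest edge.

    `cfg` maps a block name to a dict {successor: edge_frequency}.
    """
    trace, seen = [entry], {entry}
    block = entry
    while cfg.get(block):
        succ = max(cfg[block], key=cfg[block].get)
        if succ in seen:                 # stop at a back edge
            break
        trace.append(succ)
        seen.add(succ)
        block = succ
    return trace


# Hypothetical CFG: A branches 60/40, B branches 55/5, everything rejoins at F.
cfg = {
    "A": {"B": 60, "C": 40},
    "B": {"D": 55, "E": 5},
    "C": {"F": 40},
    "D": {"F": 55},
    "E": {"F": 5},
    "F": {},
}
trace = select_trace(cfg, "A")
off_trace = sorted(set(cfg) - set(trace))
```

Block C is reached on 40 of every 100 executions in this sketch, yet it falls off the trace, so any long-latency code in it cannot be hoisted. That is the unbiased-branch problem the slide describes.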

Page 7: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

[Figure: control flow graph with edge frequencies 60, 40, 55, 5, 0]

Superblocks and Hyperblocks

• Superblocks and Hyperblocks allow inclusion of multiple important paths
  – Long latency code may migrate up from multiple paths
  – Hyperblocks may be fully predicated
  – More effective utilization of issue bandwidth
• But, requires code duplication
• Wholesale predication may lengthen important paths

Page 8: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Composite Regions

• Allow rejoin from non-Region code
  – Wholesale code duplication is not required
  – Support full code motion across region
  – Allow all interesting paths to be scheduled concurrently
• Nested, less important Regions bear the burden of the rejoin
  – Compensation code, as needed

[Figure: control flow graph with edge frequencies 60, 40, 55, 5, 0]

Page 9: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Predication Approaches

• Full Predication of entire Region
  – Penalizes short paths

[Figure: control flow graph with edge frequencies 60, 40, 55, 5, 0]

Page 10: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

On-Demand Predication

• Predicate (and Speculate) as needed
  – reduce critical path(s)
  – fully utilize issue bandwidth
• Retain control flow to accommodate unbalanced paths

[Figure: control flow graph with edge frequencies 60, 40, 55, 5, 0]

Page 11: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Predicate Analysis

• Instruction scheduler requires knowledge of predicate relationships
  – For dependence analysis
  – For code motion
  – …
• Predicate Query System
  – Graphical representation of predicate relationships
  – Superset, subset, disjoint, …
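A toy version of such a query interface might look like the sketch below. It assumes the predicate relationships form a tree in which each compare splits its qualifying predicate into two mutually exclusive children; the slides only say the representation is graphical, so the tree restriction, the class name `PredicateTree`, and its methods are all illustrative assumptions.

```python
class PredicateTree:
    """Toy predicate-relation structure (assumption: a tree, not a general graph).

    Every child predicate implies its parent, and sibling predicates are
    mutually exclusive, as with the two targets of a single compare.
    """

    def __init__(self, root):
        self.parent = {root: None}

    def add_compare(self, qualifying, true_pred, false_pred):
        """Record a compare under `qualifying` that writes two predicates."""
        self.parent[true_pred] = qualifying
        self.parent[false_pred] = qualifying

    def _ancestors(self, p):
        chain = []
        while p is not None:
            chain.append(p)
            p = self.parent[p]
        return chain                      # p itself first, root last

    def is_subset(self, a, b):
        """True if `a` being true implies that `b` is true."""
        return b in self._ancestors(a)

    def is_superset(self, a, b):
        return self.is_subset(b, a)

    def is_disjoint(self, a, b):
        """True if `a` and `b` can never both be true."""
        return not self.is_subset(a, b) and not self.is_subset(b, a)


pqs = PredicateTree("p0")
pqs.add_compare("p0", "p1", "p2")         # p1 and p2 partition p0
pqs.add_compare("p1", "p3", "p4")         # nested compare under p1
```

With this structure the scheduler or register allocator can ask, for example, `pqs.is_disjoint("p3", "p2")` before deciding whether two predicated instructions can ever conflict.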

Page 12: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Predicate Computation

• Compute all predicates possibly needed
• Optimize
  – to share predicates where possible
  – to utilize parallel compares
  – to fully utilize dual-targets

Page 13: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Predication and Branch Counts

• Predication reduces branches
  – at both moderate and aggressive opt. levels

[Chart: Normalized Dynamic Branch Counts by benchmark; series: -O, -O w/pred, +O4 +P, +O4 +P w/pred]

Page 14: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Predication & Branch Prediction

• Comparable misprediction rate with predication
  – despite significantly fewer branches
  – increased mean time between mispredicted branches

[Chart: Normalized Mispredict Rates by benchmark; series: -O, -O w/pred, +O4 +P, +O4 +P w/pred]

Page 15: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Register Allocation

• Modeled as a graph-coloring problem.
  – Nodes in the graph represent live ranges of variables
  – Edges represent a temporal overlap of the live ranges
  – Nodes sharing an edge must be assigned different colors (registers)

    x = ...
    y = ...
      = ... x
    z = ...
      = ... y
      = ... z

[Figure: live ranges and interference graph for x, y, z]

Requires Two Colors
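The graph-coloring model can be sketched for the six-statement example on this slide. Live ranges are written as half-open intervals over statement numbers 1 to 6, which is a simplification, and the greedy coloring used here is only a heuristic stand-in for whatever allocator the compiler actually uses; the function names are assumptions.

```python
from itertools import combinations


def interference_graph(live_ranges):
    """Add an edge between variables whose [start, end) live ranges overlap."""
    graph = {v: set() for v in live_ranges}
    for a, b in combinations(live_ranges, 2):
        (s1, e1), (s2, e2) = live_ranges[a], live_ranges[b]
        if s1 < e2 and s2 < e1:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_color(graph):
    """Color highest-degree nodes first; not guaranteed optimal in general."""
    colors = {}
    for node in sorted(graph, key=lambda n: -len(graph[n])):
        used = {colors[n] for n in graph[node] if n in colors}
        colors[node] = next(c for c in range(len(graph)) if c not in used)
    return colors


# x is defined at statement 1 and last used at 3, y spans 2 to 5, z spans 4 to 6.
live_ranges = {"x": (1, 3), "y": (2, 5), "z": (4, 6)}
graph = interference_graph(live_ranges)
colors = greedy_color(graph)
registers_needed = len(set(colors.values()))
```

Here y overlaps both x and z, but x and z never overlap, so x and z can share a register and two colors suffice.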

Page 16: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Register Allocation
With Control Flow

    x = ...
    y = ...
    (branch)
      one path:    z = ...
                     = ... z
      other path:    = ... y
                   x = ...
    (join)
      = ... x

[Figure: live ranges and interference graph for x, y, z]

Requires Two Colors

Page 17: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Register Allocation
With Predication

    x = ...
    y = ...
    z = ...
      = ... y
    x = ...
      = ... z
      = ... x

[Figure: live ranges and interference graph for x, y, z]

Now Requires Three Colors

Page 18: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Predicate Analysis

[Figure: predicate graph with p0 above p1 and p2, shown next to the live ranges of x, y, z]

    x = ...
    y = ...
    z = ...
      = ... y
    x = ...
      = ... z
      = ... x

p1 and p2 are disjoint: if p1 is TRUE, p2 is FALSE, and vice versa

Page 19: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Register Allocation
With Predicate Analysis

    x = ...
    y = ...
    z = ...
      = ... y
    x = ...
      = ... z
      = ... x

[Figure: live ranges and interference graph for x, y, z]

Now Back to Two Colors
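The effect of predicate analysis on the interference graph can be sketched as follows. The slides do not say which predicate guards which statement, so the segment table below is an assumption: z is taken to be live only under p1, y is taken to be needed only under p2 after the compare, and x is treated as always live. The `DISJOINT` set stands in for the answer a predicate query system would give.

```python
from itertools import combinations

# Assumed result of predicate analysis: p1 and p2 are never both true.
DISJOINT = {frozenset(("p1", "p2"))}

# Live ranges as (start, end, predicate) segments over statements 1..7.
live = {
    "x": [(1, 7, "p0")],
    "y": [(2, 3, "p0"), (3, 4, "p2")],
    "z": [(3, 6, "p1")],
}


def interferes(segs_a, segs_b, use_predicates):
    """Two variables interfere if some pair of segments overlaps in time
    and, when predicate analysis is on, is not provably exclusive."""
    for s1, e1, pa in segs_a:
        for s2, e2, pb in segs_b:
            overlap = s1 < e2 and s2 < e1
            exclusive = use_predicates and frozenset((pa, pb)) in DISJOINT
            if overlap and not exclusive:
                return True
    return False


def colors_needed(live, use_predicates):
    graph = {v: set() for v in live}
    for a, b in combinations(live, 2):
        if interferes(live[a], live[b], use_predicates):
            graph[a].add(b)
            graph[b].add(a)
    colors = {}
    for node in sorted(graph, key=lambda n: -len(graph[n])):   # greedy heuristic
        used = {colors[n] for n in graph[node] if n in colors}
        colors[node] = next(c for c in range(len(graph)) if c not in used)
    return len(set(colors.values()))


without_analysis = colors_needed(live, use_predicates=False)
with_analysis = colors_needed(live, use_predicates=True)
```

Under these assumptions the predicate-blind allocator sees y and z overlap and needs three registers, while the predicate-aware one drops that edge and needs two, matching the progression on the slides.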

Page 20: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Effect of Predicate-Aware Register Allocation

• Reduces register requirements for individual procedures by 0% to 75%
  – Depends upon how aggressively predication is applied
• Average dynamic reduction in register stack allocation for gcc is 4.7%

Page 21: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Object-Oriented Code

• Challenges
  – Small procedures, many indirect (virtual) calls
    – Limits size of regions, scope for ILP
  – Exception handling
  – Bounds checking (Java)
    – Inherently serial: must check before executing load or store
• Solutions
  – Inlining
    – for non-virtual functions or provably unique virtual functions
    – Speculative inlining for most common variant
  – Liveness analysis of handlers
  – Architectural support for speculation ensures recoverability
  – Speculative execution
    – Guarantees correct exception behavior
  – Dynamic optimization (e.g. Java)
    – Make use of dynamic profile

Page 22: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Method Calls

• Barrier between execution streams
• Often, location of called method must be determined at runtime
  – Costly “identity check” on object must complete before method may begin
  – Even if the call nearly always goes to the same place
  – Little ILP

[Figure labels: Resolve target method; Possible target (x3); Call-dependent code]

Page 23: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Speculating Across Method Calls

• Compiler predicts target method
  – Profiling
  – Current state of class hierarchy
• Predicted method is inlined
  – Full or partial
• Speculative execution of called method begins while actual target is determined
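The shape of the transformed code can be sketched as guarded inlining. The classes and function names below are hypothetical, and Python cannot express the overlap itself: on IA-64 the compiler could start the inlined body speculatively while the identity check is still resolving, whereas this sketch only shows the guard and the fallback call.

```python
class Shape:
    def area(self):
        raise NotImplementedError


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rect(Shape):
    def __init__(self, w, h):
        self.w, self.h = w, h

    def area(self):
        return self.w * self.h


def total_area_dispatch(shapes):
    """Baseline: every call resolves its target at run time."""
    return sum(s.area() for s in shapes)


def total_area_speculative(shapes):
    """Guarded inlining, assuming a profile predicted Square as the dominant target."""
    total = 0
    for s in shapes:
        if type(s) is Square:             # the "identity check"
            total += s.side * s.side      # inlined body of Square.area
        else:
            total += s.area()             # call the other method if needed
    return total


shapes = [Square(2), Square(3), Rect(2, 5)]
```

Both versions compute the same result; the speculative one simply avoids the call barrier whenever the prediction holds.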

Page 24: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Speculation Across Method Calls

[Figure: two call diagrams. Labels in the first: Resolve target method; call method; Dominant called method; Other target method (x2). Labels in the second: Resolve target method; Dominant called method; call other method if needed; Other target method (x2).]

Page 25: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Bounds & Null Checks

• Checks inhibit code motion
• Null checks

    Source:
        x = y.foo;
    Expands to:
        if( y == null ) throw NullPointerException;
        x = y.foo;

• Bounds checks

    Source:
        x = a[i];
    Expands to:
        if( a == null ) throw NullPointerException;
        if( i < 0 || i >= a.length )
            throw ArrayIndexOutOfBoundsException;
        x = a[i];

Page 26: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Speculating Across Bounds Checks

• Bounds checks rarely fail

    Source:
        x = a[i];
    With speculation:
        ld.s t = a[i];
        if( a == null ) throw NullPointerException;
        if( i < 0 || i >= a.length )
            throw ArrayIndexOutOfBoundsException;
        chk.s t
        x = t;

• Long latency load can begin before checks
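The control flow of this transformation can be modeled in a few lines. This is a behavioral sketch only: `NAT`, `ld_s`, and `chk_s` are hypothetical Python stand-ins for the deferred-exception token and the `ld.s` / `chk.s` instructions, and Python exception types are substituted for the Java ones on the slide.

```python
NAT = object()   # stand-in for the deferred-exception token of a failed ld.s


def ld_s(array, index):
    """Speculative load: defer any fault by returning NAT instead of raising."""
    if array is None or not 0 <= index < len(array):
        return NAT
    return array[index]


def chk_s(value, recover):
    """Speculation check: run recovery code only if the load was deferred."""
    return recover() if value is NAT else value


def load_element(a, i):
    t = ld_s(a, i)                       # load hoisted above both checks
    if a is None:
        raise TypeError("null array")    # stands in for NullPointerException
    if i < 0 or i >= len(a):
        raise IndexError("index out of bounds")
    return chk_s(t, lambda: a[i])        # x = t, or redo the load on recovery
```

In this model the checks fire in their original order, so exception behavior is unchanged, while the load no longer has to wait for them. The recovery path in `chk_s` is kept because a real speculative load might be deferred for reasons other than the ones the explicit checks cover.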

Page 27: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Exception Handling

• Exception handling inhibits motion of subsequent code

    if( y.foo ) throw MyException;
    x = y.bar + z.baz;

Page 28: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Speculation in the Presence of Exception Handling

• Execution of subsequent instructions may begin before exception is resolved

    Source:
        if( y.foo ) throw MyException;
        x = y.bar + z.baz;
    With speculation:
        ld    t1 = y.foo
        ld.s  t2 = y.bar
        ld.s  t3 = z.baz
        add   x = t2 + t3
        if( t1 ) throw MyException;
        chk.s x

Page 29: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Dependence Graph for Instruction Scheduling

add t1 = 8,p

(p1) ld4 t3 = [log]

(p1) add t2 = 1,t2

mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]

cmp4.ge p1,p2=n,count

if( n < p->count ) {
    (*log)++;
    return p->x[n];
} else {
    return 0;
}

Page 30: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Dependence Graph with Predication & Speculation

add t1 = 8,p

(p1) ld4 t3 = [log]

(p1) add t2 = 1,t2

mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]

cmp4.ge p1,p2=n,count

chk.a t4

chk.a p

• During dependence graph construction, potentially control and data speculative edges and nodes are identified
• Check nodes are added where possibly needed (note that only data speculation checks are shown here)

Page 31: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Dependence Graph with Predication & Speculation

add t1 = 8,p

(p1) ld4 t3 = [log]

(p1) add t2 = 1,t2

(p2) mov out0 = 0

br.ret rp

(p1) ld4 out0 = [t4]

shladd t4 = n,4,t3

(p1) ld4 t3 = [p]

(p1) st4 [log] = t2

ld4 count = [t1]

cmp4.ge p1,p2=n,count

chk.a t4

chk.a p

• Speculative edges may be violated. Here the graph is redrawn to show the enhanced parallelism
• Note that the speculation of both writes to the out0 register would require insertion of a copy. The scheduler must consider this in its scheduling
• Nodes with sufficient slack (e.g. writes to out0) will not be speculated
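How breaking speculative edges shortens the critical path can be sketched on a simplified subset of the instructions in the graph. The sketch tracks only register read-after-write dependences, assumes unit latency, and models control speculation simply by ignoring the edges that flow through the qualifying predicate p1; the helper names and the def/use tables are assumptions, and a real scheduler would also insert the corresponding checks.

```python
def raw_edges(instrs, ignore=()):
    """Return read-after-write edges as (producer_index, consumer_index).

    `instrs` is a list of (text, defs, uses). Registers named in `ignore`
    are treated as speculated away, so they create no edges.
    """
    last_def, edges = {}, set()
    for i, (_, defs, uses) in enumerate(instrs):
        for reg in uses:
            if reg in last_def and reg not in ignore:
                edges.add((last_def[reg], i))
        for reg in defs:
            last_def[reg] = i
    return edges


def critical_path(n, edges):
    """Longest chain of dependent instructions, assuming unit latency."""
    depth = [1] * n
    for src, dst in sorted(edges, key=lambda e: e[1]):   # producers come first
        depth[dst] = max(depth[dst], depth[src] + 1)
    return max(depth)


# Simplified subset of the graph's instructions, with hand-written defs and uses.
instrs = [
    ("add t1 = 8,p",            ("t1",),      ("p",)),
    ("ld4 count = [t1]",        ("count",),   ("t1",)),
    ("cmp4.ge p1,p2 = n,count", ("p1", "p2"), ("n", "count")),
    ("(p1) ld4 t3 = [p]",       ("t3",),      ("p1", "p")),
    ("shladd t4 = n,4,t3",      ("t4",),      ("n", "t3")),
    ("(p1) ld4 out0 = [t4]",    ("out0",),    ("p1", "t4")),
]

strict = critical_path(len(instrs), raw_edges(instrs))
speculated = critical_path(len(instrs), raw_edges(instrs, ignore=("p1",)))
```

With every edge honored, the six instructions form a single chain. Once the loads no longer wait for the compare, the chain splits into two independent chains of three, which is the extra parallelism the redrawn graph is meant to show.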

Page 32: CompilingforIA-64 Carol Thompson Optimization Architect Hewlett Packard

Conclusions

• IA-64 compilers push the complexity of the compiler
  – However, the technology is a logical progression from today’s
  – Today’s RISC compilers
    – are more complex
    – are more reliable
    – and deliver more performance
    than those of the early days
  – Complexity trend is mirrored in both hardware and applications
  – Need a balance to maximize benefits from each