IA-64 Architecture Innovations

John Crawford, Architect & Intel Fellow, Intel Corporation
Jerry Huck, Manager & Lead Architect, Hewlett Packard Co.


Post on 06-Feb-2016

Page 1: IA-64 Architecture Innovations

IA-64 Architecture Innovations

John Crawford, Architect & Intel Fellow
Intel Corporation

Jerry Huck, Manager & Lead Architect
Hewlett Packard Co.

Page 2: IA-64 Architecture Innovations

Agenda

Architecture Principles
Predication & Speculation
Branch Architecture
Software Pipelining

Page 3: IA-64 Architecture Innovations

Traditional Architectures: Limited Parallelism

[Figure: Original Source Code -> Compiler -> Sequential Machine Code -> Hardware -> parallelized code -> multiple functional units]

Execution Units Available, Used Inefficiently

Today’s Processors often 60% Idle

Page 4: IA-64 Architecture Innovations

IA-64 Architecture: Explicit Parallelism

[Figure: Original Source Code -> Compiler -> Parallel Machine Code -> Hardware -> multiple functional units]

IA-64 Compiler Views Wider Scope
More efficient use of execution resources

Increases Parallel Execution

Page 5: IA-64 Architecture Innovations

IA-64 Principles

Explicitly parallel:
– Instruction level parallelism (ILP) in machine code
– Compiler schedules across a wider scope

Enhanced ILP:
– Predication, Speculation, Software pipelining, ...

Fully compatible:
– Across all IA-64 family members
– IA-32 in hardware and PA-RISC through instruction mapping
– Inherently scalable

Massively resourced:
– Many registers
– Many functional units

Page 6: IA-64 Architecture Innovations

Predication

[Figure: Traditional Architectures (cmp, then a branch around separate "then" and "else" blocks) vs. IA-64 (cmp, then instructions predicated on p1 and p2)]

Removes branches, converts to predicated execution
– Executes multiple paths simultaneously

Increases performance by exposing parallelism and reducing critical path
– Better utilization of wider machines
– Reduces mispredicted branches

Page 7: IA-64 Architecture Innovations

Predication Review

Two kinds of normal compares
– Regular
– Unconditional (nested IF’s)

Regular: p3 is set just once
    (p1) p3 = ...
    (p2) p3 = ...
    (p3) ...

Unconditional: p3 and p4 are AND’ed with p2
    p1,p2 <- ...
    (p2) p3,p4 <- cmp.unc ...
    (p3) ...      (p4) ...
    p2&p3         p2&p4

Opportunity for Even More Parallelism

Page 8: IA-64 Architecture Innovations

Introducing Parallel Compares

Three new types of compares:
– AND: both target predicates set FALSE if compare is false
– OR: both target predicates set TRUE if compare is true
– ANDOR: if true, sets one TRUE, sets other FALSE

[Figure: blocks A, B, C and D arranged serially and with A, B and C side by side]

Reduces Critical Path

Page 9: IA-64 Architecture Innovations

Eight Queens Example

Unconditional Compares

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    ld R2=[R1]
    ld.s R4=[R3]
    ld.s R6=[R5]
    P1,P2 <- cmp.unc(R2==true)
    (p1) chk.s R4
    (p1) P3,P4 <- cmp.unc(R4==true)
    (p3) chk.s R6
    (p3) P5,P6 <- cmp.unc(R6==true)
    (P5) br then
    else

[Figure: 8 queens control flow through predicates P1/P2, P3/P4 and P5/P6 to Then/Else; the instruction groups are labeled with cycles 1, 2, 4, 5, 6, 7]

Page 10: IA-64 Architecture Innovations

Eight Queens Example

Parallel Compares

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    p1 <- true
    ld R2=[R1]
    ld R4=[R3]
    ld R6=[R5]
    p1,p2 <- cmp.and(R2==true)
    p1,p2 <- cmp.and(R4==true)
    p1,p2 <- cmp.and(R6==true)
    (p1) br then
    else

[Figure: 8 queens control flow through predicates P1/P2, P3/P4 and P5/P6 to Then/Else, collapsing to a single test of P1; the instruction groups are labeled with cycles 1, 2, 4, 5]

Page 11: IA-64 Architecture Innovations

Eight Queens Example

Parallel Compares

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

[Figure: the parallel-compare code sequence with cycles 1, 2, 4, 5; the 8 queens control flow reduces to Then (P1=true) and Else (P1=false)]

Reduced from 7 cycles to 5

Page 12: IA-64 Architecture Innovations

Five Predicate Compare Types

(qp) p1,p2 <- cmp.relation
– if(qp) {p1 = relation; p2 = !relation};

(qp) p1,p2 <- cmp.relation.unc
– p1 = qp&relation; p2 = qp&!relation;

(qp) p1,p2 <- cmp.relation.and
– if(qp & (relation==FALSE)) { p1=0; p2=0; }

(qp) p1,p2 <- cmp.relation.or
– if(qp & (relation==TRUE)) { p1=1; p2=1; }

(qp) p1,p2 <- cmp.relation.or.andcm
– if(qp & (relation==TRUE)) { p1=1; p2=0; }

Tbit (Test Bit) Also Sets Predicates
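The five update rules above can be read as small functions on the pair (p1, p2). The following Python sketch is a model for illustration only, not IA-64 code: the names `cmp_update` and `queens_condition`, the string tags for the compare types, and the use of Python booleans for predicates are all assumptions. The initial value of p2 in `queens_condition` is also an assumption, since the Eight Queens slide only initializes p1.

```python
def cmp_update(kind, qp, relation, p1, p2):
    """Return the new (p1, p2) after one compare, following the rules above."""
    if kind == "normal":
        # Writes both targets only when the qualifying predicate is true.
        return (relation, not relation) if qp else (p1, p2)
    if kind == "unc":
        # Always writes both targets; a false qp clears them.
        return (qp and relation, qp and not relation)
    if kind == "and":
        return (False, False) if (qp and not relation) else (p1, p2)
    if kind == "or":
        return (True, True) if (qp and relation) else (p1, p2)
    if kind == "or.andcm":
        return (True, False) if (qp and relation) else (p1, p2)
    raise ValueError(f"unknown compare type: {kind}")


def queens_condition(b_j, a_ij, c_ij):
    """Evaluate the three-way && of the Eight Queens example with cmp.and."""
    p1, p2 = True, True  # "p1 <- true" on the slide; p2's start value is assumed
    for loaded in (b_j, a_ij, c_ij):
        p1, p2 = cmp_update("and", True, loaded is True, p1, p2)
    return p1


print(queens_condition(True, True, True))   # all three tests pass
print(queens_condition(True, False, True))  # one test fails, p1 is cleared
```

Because an AND-type compare can only clear its targets, the three compares give the same result in any order, which is what lets them be scheduled in the same cycle.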

Page 13: IA-64 Architecture Innovations

Predication Benefits

Reduces branches and mispredict penalties
– 50% fewer branches and 37% faster code*

Parallel compares further reduce critical paths

Greatly improves code with hard to predict branches
– Large server apps: capacity limited
– Sorting, data mining: large database apps
– Data compression

Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication
– Cmove: 39% more instructions, 23% slower performance*
– Instructions must all be speculative

* Source: S. Mahlke, 1995

Page 14: IA-64 Architecture Innovations

Speculation Review

Memory latency is a major performance bottleneck in today’s systems
– CPU to memory gap increasing

Allows elevation of load, even above a branch

Traditional Architectures
    instr 1
    instr 2
    . . .
    br
    (Barrier)
    Load
    use

IA-64
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use

Page 15: IA-64 Architecture Innovations

Hoisting Uses

The uses of speculative data can also be executed speculatively
– distinguishes speculation from simple prefetch

IA-64
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use

Enables Further Parallelism

Page 16: IA-64 Architecture Innovations

Introducing the NaT (“Not a Thing”)

NaT is the GR’s 65th bit that indicates:
– whether or not an exception has occurred
– branch to fixup code required

NaT set during ld.s, checked by chk.s

IA-64
    ld.s        ; Exception Detection
    instr 1
    instr 2
    br
    chk.s       ; Exception Delivery
    use

[Figure: arrow labeled "Propagate Exception" running from ld.s to chk.s]

Page 17: IA-64 Architecture Innovations

Propagation

All computation instructions propagate NaTs to reduce number of checks

Cmp propagates “false” when writing predicates

RISC architectures require more instructions for equivalent integrity
– e.g., non faulting load

    ld8.s r3 = (r9)
    ld8.s r4 = (r10)
    add r6 = r3, r4
    ld8.s r5 = (r6)
    p1,p2 = cmp(...)
    chk.s r5
    sub r7 = r5,r2

Allows single chk on result
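The propagation chain above can be mimicked in a few lines of Python. This is a hedged sketch rather than a description of the hardware: the `NAT` sentinel, the helper names (`ld_s`, `add`, `chk_s`, `speculative_chain`), and the dictionary standing in for memory are all assumptions, and a missing dictionary key stands in for any deferred fault.

```python
NAT = object()  # sentinel: stands for a register whose NaT bit is set


def ld_s(memory, addr):
    """Speculative load: a bad address yields NaT instead of raising a fault."""
    if addr is NAT or addr not in memory:
        return NAT
    return memory[addr]


def add(a, b):
    """Computation propagates NaT from either source operand."""
    return NAT if (a is NAT or b is NAT) else a + b


def chk_s(value):
    """Return True if the value is usable, False if recovery code must run."""
    return value is not NAT


def speculative_chain(memory, r9, r10):
    """Mirror of the slide's sequence: two loads, an add, a dependent load."""
    r3 = ld_s(memory, r9)
    r4 = ld_s(memory, r10)
    r6 = add(r3, r4)
    r5 = ld_s(memory, r6)
    return r5, chk_s(r5)  # one check on the final result covers the whole chain


memory = {0x100: 8, 0x108: 16, 24: 99}
print(speculative_chain(memory, 0x100, 0x108))     # every load succeeds
print(speculative_chain(memory, 0x100, 0x999)[1])  # second load defers a fault
```

The second call shows why a single check is enough in this model: the NaT produced by the failing load flows through the add and the dependent load, so it is still visible at the final `chk_s`.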

Page 18: IA-64 Architecture Innovations

Exception Deferral: More Than Skin Deep

Deferral allows the efficient delay of costly exceptions

OS controlled deferral by hardware of:
– Page faults
– Protection violations
– …

NaTs enable deferral with recovery

Efficiently support structured exception handling in C/C++

Home Block
    ld.s
    instr 1
    instr 2
    uses
    br
    chk.s

Recovery code
    ld
    uses
    br home

Complete Solution for Exception Management

Page 19: IA-64 Architecture Innovations

Control Speculation Summary

All loads have a speculative form that sets the NaT bit when deferring exceptions
Computational instructions propagate NaTs
OS controls deferral of faults but supported directly in HW: “no-fault speculation”
– Minimizes overhead of data that is not used
Chk more effective than non-faulting load

Page 20: IA-64 Architecture Innovations

Store Barrier

Traditional Architectures
    instr 1
    instr 2
    . . .
    Store(*)
    (Barrier)
    Load (*)
    use

Traditional architectures limited by the Store Barrier

Page 21: IA-64 Architecture Innovations

Introducing Data Speculation

Compiler can issue a load prior to a preceding, possibly-conflicting store

Traditional Architectures
    instr 1
    instr 2
    . . .
    st8
    (Barrier)
    ld8
    use

IA-64
    ld8.a
    instr 1
    instr 2
    st8
    ld.c
    use

Unique feature to IA-64

Page 22: IA-64 Architecture Innovations

Data Speculation

Uses can be hoisted

Load hoisted
    ld8.a
    instr 1
    instr 2
    st8
    ld.c
    use

Load and use hoisted
    ld8.a
    instr 1
    use
    instr 2
    st8
    chk.a

Recovery code
    ld8
    uses
    br home

Synergy with control speculation yields greater performance

Page 23: IA-64 Architecture Innovations

Advanced Load Address Table - ALAT

ld.a inserts entries.
Conflicting stores remove entries
– Also: ld.c.clr, chk.a.clr
Presence of entry indicates success
– chk.a branches when no entry is found

[Figure: the ALAT as a table of (reg #, Address) entries, with "ld.a reg# = ...", "st" and "chk.a reg# ?" shown accessing it]
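A toy model can make the ALAT bookkeeping concrete. The Python sketch below is an illustration under stated assumptions, not the hardware design: the class name `Alat`, its method names, and the exact-address match on stores are all invented for the example, and details such as access size, partial address tags, table capacity and the `.clr` variants are left out.

```python
class Alat:
    """Toy ALAT: maps a target register number to the address it was loaded from."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: read memory early and record (reg, addr)."""
        self.entries[reg] = addr
        return memory[addr]

    def st(self, memory, addr, value):
        """Store: write memory and drop every entry with a conflicting address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """True if the entry survived, i.e. the speculation succeeded."""
        return reg in self.entries

    def ld_c(self, memory, reg, addr, speculated):
        """Check load: keep the speculated value on a hit, reload on a miss."""
        return speculated if reg in self.entries else memory[addr]


memory = {0x10: 1, 0x20: 2}
alat = Alat()
r4 = alat.ld_a(memory, 4, 0x10)         # load hoisted above the stores below
alat.st(memory, 0x20, 7)                # unrelated store, entry survives
print(alat.chk_a(4))
alat.st(memory, 0x10, 5)                # conflicting store, entry removed
print(alat.ld_c(memory, 4, 0x10, r4))   # miss, so the value is reloaded
```

In this model `ld_c` repairs only the loaded value, while a failed `chk_a` is the signal to branch to recovery code when uses of the value were hoisted as well.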

Page 24: IA-64 Architecture Innovations

Architectural Support for Data Speculation

Instructions
– ld.a - advanced loads
– ld.c - check loads
– chk.a - advance load checks

Speculative Advanced loads - ld.sa - is an advanced load with deferral

ALAT - HW structure containing outstanding advanced loads

Page 25: IA-64 Architecture Innovations

Speculation Benefits

Reduces impact of memory latency
– Study demonstrates performance improvement of 79% when combined with predication*

Greatest improvement to code with many cache accesses
– Large databases
– Operating systems

Scheduling flexibility enables new levels of performance headroom

* August et al., 1998

Page 26: IA-64 Architecture Innovations

Agenda

Architecture Principles
Predication & Speculation
Branch Architecture
Software Pipelining

Page 27: IA-64 Architecture Innovations

Branch Instruction

[Figure: 128-bit bundle (bits 127 down to 0) holding instruction slots and a Template field; each instruction is 41 bits; the branch instruction fields shown are Branch, IP-Offset (21 bits) and QP]

Two basic branch formats
– Relative: IP := IP + Offset21
– Indirect: IP := BR[I]
– 8 branch registers for efficient branch execution
– Call/Return linking through branch registers

Loop branches with 64-bit loopcount register (LC)
– Enables perfect branch prediction of counted loops
– Traditional architectures always mispredict last iteration
– Incurs misprediction stall costing many cycles
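The bundle figures on this slide add up as 5 + 3 × 41 = 128 bits. The Python sketch below splits a 128-bit integer into those fields. The field positions (a 5-bit template in the low bits, followed by three 41-bit instruction slots) are taken from the published IA-64 bundle format rather than from the slide text, and the function names `split_bundle` and `join_bundle` are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS  # 128


def split_bundle(bundle):
    """Split a 128-bit bundle value into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


def join_bundle(template, slots):
    """Inverse of split_bundle, useful for checking the field arithmetic."""
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


example = join_bundle(0x10, [1, 2, 3])
print(split_bundle(example))
```

The round trip through `join_bundle` and `split_bundle` is only a check on the bit arithmetic; it says nothing about how individual instructions are encoded within a slot.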

Page 28: IA-64 Architecture Innovations

Branch Predicates

Conditional branches
    (p1) BR #label_A;

Unconditional branches
    (p0) BR #label_A;
    p0 is “always true”

[Figure: blocks A and B, with edges labeled P1=true and P1=false]

Compiler directed static prediction augments dynamic prediction
– Better predict highly correlated branches (always/never taken)
– Frees space in H/W predictor
– Can give hint for dynamic predictor

Page 29: IA-64 Architecture Innovations

Compare & Branch in Same Cycle

Queens Loop: Parallel Compares & Compare-branch

    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    p1 <- true
    ld R2=[R1]
    ld R4=[R3]
    ld R6=[R5]
    p1,p2 <- cmp.and(R2==true)
    p1,p2 <- cmp.and(R4==true)
    p1,p2 <- cmp.and(R6==true)
    (p1) br then
    else

[Figure: the instruction groups are labeled with cycles 1, 2, 4]

From 5 Cycles Down to 4

Page 30: IA-64 Architecture Innovations

Multi-way Branch

Multiway branches: more than 1 branch in a single cycle
Allows n-way branching

w/o Speculation (3 branch cycles)
    ld8 r6 = (ra)
    (p1) br exit1
    ld8 r7 = (rb)
    (p3) br exit2
    ld8 r8 = (rc)
    (p5) br exit3

Hoisting Loads (3 branch cycles)
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    chk r6, rec0
    (p1) br exit1
    chk r7, rec1
    (p3) br exit2
    chk r8, rec2
    (p5) br exit3

IA-64 (1 branch cycle)
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    {
    chk r6, rec0
    (p2) chk r7, rec1
    (p4) chk r8, rec2
    }{
    (p1) br exit1
    (p3) br exit2
    (p5) br exit3
    }

[Figure: predicate tree P1/P2, P3/P4, P5/P6]

Supports Aggressive Speculation

Page 31: IA-64 Architecture Innovations

Software Pipelining

Overlapping execution of different loop iterations

[Figure: sequential loop iterations vs. overlapped loop iterations]

More iterations in same amount of time

Page 32: IA-64 Architecture Innovations

Software Pipelining

IA-64 features that make this possible
– Full Predication
– Special branch handling features
– Register rotation: removes loop copy overhead
– Predicate rotation: removes prologue & epilogue

Traditional architectures use loop unrolling
– High overhead: extra code for loop body, prologue, and epilogue

Especially Useful for Integer Code With Small Number of Loop Iterations

Page 33: IA-64 Architecture Innovations

Basic Loop Example

for (i=0; i<n; i++) {
    *b++ = *a++;
}   /* MemCopy */

Basic Copy Loop (3 ops)

    // setup ra/rb/lc
.label loop
    {
    ld8 r35 = [ra],8
    }{
    st8 [rb],8 = r35
    br.cloop #loop   // check n!=0
    }

Simple non-overlapping iterations
– 2 cycles per iteration
– 3 operations in loop body

[Figure: execution cycles 1 to 8: ld1; st1 br.cloop; ld2; st2 br.cloop; ld3; st3 br.cloop; ld4; st4 br.cloop]

Page 34: IA-64 Architecture Innovations

Loop Support: Unrolling

Unrolled Copy Loop (10 ops)

    // Test for loop count 0,1
    ld8 r34 = [ra],8
.label loop
    ld8 r35 = [ra],8
    st8 [rb],8 = r34
    br.cle #e-exit
    ld8 r34 = [ra],8
    st8 [rb],8 = r35
    br.cloop #loop
    st8 [rb],8 = r34
    br #thru
.label e-exit
    st8 [rb],8 = r35
.label thru

[Figure: execution cycles 1 to 5 divided into Prologue, Main loop and Epilogue; ld1 to ld4 overlap with st1 to st4, with br.cle and br.cloop closing the main-loop cycles]

Overlapped iterations
– 1 cycle per word
– 1.6X performance improvement
– 3.3X code expansion

Incurs Code Expansion Penalties
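The 1.6X and 3.3X figures appear to follow from the four-word copy shown in the diagrams. The short Python calculation below reproduces them under that assumption: 4 words, 2 cycles per iteration for the basic loop, 5 cycles in total for the unrolled loop, and the op counts printed on the slides (3 and 10). The function names are hypothetical.

```python
def speedup(base_cycles, new_cycles):
    """How many times faster the new schedule is than the base schedule."""
    return base_cycles / new_cycles


def code_expansion(base_ops, new_ops):
    """How many times larger the new code is than the base code."""
    return new_ops / base_ops


WORDS = 4                      # assumed from the ld1..ld4 / st1..st4 diagram
basic_cycles = 2 * WORDS       # basic loop: 2 cycles per iteration
unrolled_cycles = WORDS + 1    # unrolled loop: 1 cycle per word plus 1 to fill

print(round(speedup(basic_cycles, unrolled_cycles), 1))  # performance ratio
print(round(code_expansion(3, 10), 1))                   # code size ratio
```

For longer copies the same formulas give a ratio that approaches 2X, so the 1.6X figure is best read as specific to the short example in the diagram.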

Page 35: IA-64 Architecture Innovations

Software Register Renaming: Traditional Architecture

[Figure: registers R32 to R35; ld1 r34]

Page 36: IA-64 Architecture Innovations

Software Register Renaming: Traditional Architecture

[Figure: registers R32 to R35; st1 r34, ld2 r35]

Page 37: IA-64 Architecture Innovations

Software Register Renaming: Traditional Architecture

[Figure: registers R32 to R35; st2 r35, ld3 r34]

Page 38: IA-64 Architecture Innovations

Software Register Renaming: Traditional Architecture

[Figure: registers R32 to R35; ld4 r35, st3 r34]

Page 39: IA-64 Architecture Innovations

Software Register Renaming: Traditional Architecture

[Figure: registers R32 to R35; st4 r35]

Page 40: IA-64 Architecture Innovations

Introducing Rotating Registers

GR 32-127, FR 32-127 can rotate
Separate Rotating Register Base for each: GRs, FRs
Loop branches decrement all register rotating bases (RRB)
Instructions contain a “virtual” register number
– RRB + virtual register number = physical register number.

[Figure: RRB=0; register slots labeled 32 to 36; the data to copy is the words "Palm Springs is Sunny"; ld1 R35 brings in "Palm"]

Page 41: IA-64 Architecture Innovations

Introducing Rotating Registers

[Figure: IA-64, RRB=0; register slots labeled 32 to 36; ld2 R34 and st1 R35 operating on the words "Palm" and "Springs"]

Page 42: IA-64 Architecture Innovations

Introducing Rotating Registers

[Figure: IA-64, RRB=-1; register slots now labeled 127, 32, 33, 34, 35; ld3 R34 and st2 R35 operating on the words "Springs" and "is"]

Page 43: IA-64 Architecture Innovations

Introducing Rotating Registers

[Figure: IA-64, RRB=-2; register slots now labeled 126, 127, 32, 33, 34; ld4 R34 and st3 R35 operating on the words "is" and "Sunny"]

Page 44: IA-64 Architecture Innovations

Introducing Rotating Registers

[Figure: IA-64, RRB=-3; register slots now labeled 125, 126, 127, 32, 33; st4 R35 stores "Sunny"]
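The rule "RRB + virtual register number = physical register number" needs a wrap-around to stay inside the rotating region, as the slot labels 127, 126 and 125 in the figures suggest. The Python sketch below models that mapping for the general registers. It assumes the whole GR 32-127 range rotates, as stated on the slide, and the names `ROT_FIRST`, `ROT_SIZE` and `physical` are invented for the example.

```python
ROT_FIRST = 32   # first rotating general register, per the slide
ROT_SIZE = 96    # GR32..GR127 inclusive


def physical(virtual, rrb):
    """Map a virtual register number to a physical one for a given RRB."""
    if virtual < ROT_FIRST:
        return virtual  # registers below the rotating region are not renamed
    return ROT_FIRST + (virtual - ROT_FIRST + rrb) % ROT_SIZE


# A value loaded as r34 while RRB=0 is read back as r35 once RRB becomes -1.
print(physical(34, 0), physical(35, -1))
# Virtual r32 wraps to the top of the rotating region after one rotation.
print(physical(32, -1))
```

This is why the copy loop can always load into R34 and store from R35: each decrement of RRB moves the most recently loaded word under the name the store uses, with no register-to-register copy.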

Page 45: IA-64 Architecture Innovations

Loop Support: Rotating Registers

Software Pipelined Copy Loop (5 ops)

    // setup ra/rb/lc/ec, check n > 2
    {
    ld8 r35 = [ra],8
    }
.label loop
    {
    ld8 r34 = [ra],8
    st8 [rb] = r35,8
    br.ctop #loop
    }{
    st8 [rb] = r35,8
    }

[Figure: execution cycles 1 to 5 divided into Prologue, Main loop and Epilogue; ld1 to ld4 overlap with st1 to st4, with br.ctop closing each main-loop cycle]

Modulo Scheduled Iterations
– 1 cycle per word
– 1.6X performance improvement
– additional upside for higher latency conditions
– 1.7X code expansion

Page 46: IA-64 Architecture Innovations

Introducing Rotating Predicate Registers

PR16-63 can rotate, with separate Rotating Register Base
Loop branches decrement all register rotating base (RRB)
Instructions contain a “virtual” predicate register number
– RRB + virtual register number = physical register number.

Code
    (p16) ld R34
    (p17) st R35

Initialize: RRB=0, LC=3, EC=2

[Figure: predicate register slots labeled 62, 63, 16, 17, 18, first all 0 and then with one predicate set to 1; (p16) ld1 R34 executes]
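The interaction of LC, EC and the rotating predicates can be traced with a small simulation. The Python sketch below is a simplified model under stated assumptions, not a specification of the loop branch: rotation is modeled by renaming variables instead of tracking RRB, one pass through the loop body counts as one cycle, and the function name `pipelined_copy` is hypothetical. The starting values follow the slide (LC = number of words minus 1, EC = 2 for the two stages, p16 set).

```python
def pipelined_copy(src):
    """Copy src with a two-stage (load, store) software pipeline.

    Returns (dst, cycles), counting one cycle per pass through the loop body.
    """
    n = len(src)
    if n == 0:
        return [], 0
    dst = []
    lc, ec = n - 1, 2        # loop count and epilogue count, as on the slide
    p16, p17 = True, False   # p16 guards the load stage, p17 the store stage
    r34 = r35 = None         # rotating data registers
    next_word = 0
    cycles = 0
    while True:
        cycles += 1
        if p17:              # (p17) st R35: store the word loaded one pass earlier
            dst.append(r35)
        if p16:              # (p16) ld R34: load the next word
            r34 = src[next_word]
            next_word += 1
        # Loop branch: keep starting new iterations while LC > 0,
        # then drain the pipeline while EC > 1.
        if lc > 0:
            lc -= 1
            new_p16, taken = True, True
        elif ec > 1:
            ec -= 1
            new_p16, taken = False, True
        else:
            ec = 0
            new_p16, taken = False, False
        # Rotation: what was written as r34/p16 is read as r35/p17 next pass.
        r35 = r34
        p17, p16 = p16, new_p16
        if not taken:
            return dst, cycles


print(pipelined_copy(["Palm", "Springs", "is", "Sunny"]))
```

In this model the first pass performs only the load and the last pass only the store, so the stage predicates take over the role of separate prologue and epilogue code.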

Page 47: IA-64 Architecture Innovations

Introducing Rotating Predicate Registers

Code
    (p16) ld R34
    (p17) st R35

Branch 1: RRB=-1, LC=2, EC=2

[Figure: predicate register snapshots before and after the first loop branch, with slot labels shifting from 62, 63, 16, 17, 18 to 61, 62, 63, 16, 17; instructions shown: (p17) st1 R35, (p16) ld2 R34]

Page 48: IA-64 Architecture Innovations

Introducing Rotating Predicate Registers

Code
    (p16) ld R34
    (p17) st R35

Branch 2: RRB=-2, LC=1, EC=2

[Figure: predicate register snapshots through the second loop branch, with slot labels shifting to 60, 61, 62, 63, 16; instructions shown: (p17) st2 R35, (p16) ld3 R34]

Page 49: IA-64 Architecture Innovations

[Figure, Branch 3 (same slide, next animation step): after the third loop branch: LC=0, EC=2, RRB=-3. Executed in this iteration: st2 and ld3; enabled next: st3 and ld4.]

Page 50: IA-64 Architecture Innovations

[Figure, Branch 4 (same slide, next animation step): LC is already 0, so the loop branch decrements EC instead: LC=0, EC=1, RRB=-4. Executed in this iteration: st3 and ld4; in the next iteration the (p16) ld is predicated off and only the remaining (p17) st runs.]

Page 51: IA-64 Architecture Innovations

[Figure, Fall Through (same slide, final animation step): the last loop branch decrements EC to 0 and is not taken: LC=0, EC=0, RRB=-5. Executed in this final iteration: st4 only; the (p16) ld is predicated off.]
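The LC/EC/RRB values shown across the Branch 1 to Fall Through steps can be reproduced with a small simulation. This is a sketch under stated assumptions: the initial values LC=3 and EC=2 are inferred from the figures (LC=2 after the first branch), p16 is assumed to be set to 1 before the loop, and the branch rule coded here is a simplified reading of the loop branch behavior, not a quotation of the architecture manual. The names `pr_phys` and `run_ctop_loop` are hypothetical.

```python
def pr_phys(virtual, rrb):
    """Physical predicate number for a rotating virtual number (PR16-63)."""
    return 16 + (virtual - 16 + rrb) % 48


def run_ctop_loop(lc, ec):
    """Trace a two-stage loop closed by a br.ctop-style loop branch."""
    pr = [0] * 64
    pr[16] = 1              # assumed setup: stage-1 predicate starts true
    rrb = 0
    loads = stores = 0
    trace = []
    while True:
        ops = []
        if pr[pr_phys(16, rrb)]:        # (p16) ld
            loads += 1
            ops.append(f"ld{loads}")
        if pr[pr_phys(17, rrb)]:        # (p17) st
            stores += 1
            ops.append(f"st{stores}")
        # Loop branch: count down LC first, then drain stages with EC.
        if lc > 0:
            lc, new_bit, taken, rotate = lc - 1, 1, True, True
        elif ec > 1:
            ec, new_bit, taken, rotate = ec - 1, 0, True, True
        elif ec == 1:
            ec, new_bit, taken, rotate = 0, 0, False, True
        else:
            new_bit, taken, rotate = 0, False, False
        pr[pr_phys(63, rrb)] = new_bit  # becomes p16 after rotation
        if rotate:
            rrb -= 1
        trace.append((ops, lc, ec, rrb))
        if not taken:
            return trace


for ops, lc, ec, rrb in run_ctop_loop(lc=3, ec=2):
    print(ops, f"LC={lc} EC={ec} RRB={rrb}")
```

Under these assumptions the trace visits (LC, EC, RRB) = (2,2,-1), (1,2,-2), (0,2,-3), (0,1,-4), (0,0,-5), the same sequence as the animation steps, with the load dropping out in the last iteration while the final store drains.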

Page 52: IA-64 Architecture Innovations

    // setup ra/rb/lc/ec, check n > 1
    .label loop
    {
      (p16) ld8 r34 = [ra],8
      (p17) st8 [rb] = r35,8
            br.ctop #loop
    }

Software Pipelined Copy Loop (main loop: 3 ops)

Execution cycle   1        2        3        4        5
load              ld1      ld2      ld3      ld4      (ld)
store             (st)     st1      st2      st3      st4
branch            br.ctop  br.ctop  br.ctop  br.ctop  br.ctop

(Ops in parentheses are predicated off in that cycle.)

Loop Support: Rotating Predicates

Software Pipelined MemCopy
– 1 cycle per word
– 1.6X performance improvement
– no code expansion

Efficient Loop, Efficient Code Size
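To see how the data moves through the copy loop, the sketch below models the loaded value travelling from r34 to r35 by register rotation alone. It is an illustration under assumptions, not the slide authors' code: an 8-register rotating region starting at r32, a single shared RRB for predicates and general registers, LC = n - 1 and EC = 2 as setup, and word indices standing in for the 8-byte post-increments of ra and rb. The function name `pipelined_copy` is hypothetical.

```python
def pr_phys(virtual, rrb):
    """Rotating predicate renaming over PR16-63."""
    return 16 + (virtual - 16 + rrb) % 48


def gr_phys(virtual, rrb):
    """Rotating general-register renaming over an assumed r32-r39 region."""
    return 32 + (virtual - 32 + rrb) % 8


def pipelined_copy(src):
    """Copy src word by word; returns (dst, loop iterations executed)."""
    n = len(src)
    dst = [None] * n
    if n == 0:
        return dst, 0
    pr = [0] * 64
    pr[16] = 1                      # assumed setup: enable the load stage
    gr = [None] * 40                # physical r0..r39
    lc, ec, rrb = n - 1, 2, 0       # assumed setup of LC and EC
    ra = rb = 0                     # word indices instead of byte addresses
    iterations = 0
    while True:
        iterations += 1
        p16 = pr[pr_phys(16, rrb)]
        p17 = pr[pr_phys(17, rrb)]
        if p17:                     # (p17) st8 [rb] = r35,8
            dst[rb] = gr[gr_phys(35, rrb)]
            rb += 1
        if p16:                     # (p16) ld8 r34 = [ra],8
            gr[gr_phys(34, rrb)] = src[ra]
            ra += 1
        # Loop branch, same simplified rule as a br.ctop-style branch.
        if lc > 0:
            lc, new_bit, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_bit, taken = ec - 1, 0, True
        else:
            ec, new_bit, taken = 0, 0, False
        pr[pr_phys(63, rrb)] = new_bit
        rrb -= 1
        if not taken:
            return dst, iterations


copied, iters = pipelined_copy([10, 20, 30, 40])
print(copied, iters)
```

In this model n words take n + 1 trips through a single three-op loop body, which is the sense in which the slide claims one cycle per word with no code expansion.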

Page 53: IA-64 Architecture Innovations

Software Pipelining Benefits

Loop pipelining maximizes performance; minimizes overhead
– Avoids code expansion of unrolling and code explosion of prologue and epilogue
– Smaller code means fewer cache misses
– Greater performance improvements in higher latency conditions
Reduced overhead allows S/W pipelining of small loops with unknown trip counts
– Typical of integer scalar codes
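For contrast, here is a sketch of the same two-stage copy pipelined by hand without rotating predicates, which is the prologue and epilogue expansion the first bullet refers to. It is a generic illustration, not code from the presentation; the function name `copy_with_prologue_epilogue` is hypothetical.

```python
def copy_with_prologue_epilogue(src):
    """Two-stage copy with explicit fill and drain code around the kernel."""
    n = len(src)
    dst = [None] * n
    if n == 0:
        return dst
    # Prologue: fill the pipeline with the first load (no store yet).
    tmp = src[0]
    # Kernel: each trip stores the previous word and loads the next one.
    for i in range(1, n):
        dst[i - 1] = tmp
        tmp = src[i]
    # Epilogue: drain the pipeline with the final store (no load).
    dst[n - 1] = tmp
    return dst


print(copy_with_prologue_epilogue([10, 20, 30, 40]))
```

The load and store each appear twice here, once in the kernel and once in the fill or drain code, and the duplication grows with the number of pipeline stages. With rotating predicates the kernel alone plays all three roles.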

Page 54: IA-64 Architecture Innovations

Reviewing What’s New:

Parallel compares
Tbit
NaT bits
Deferral
Hoisting uses
Propagation
Branch instructions
Static prediction
Advanced loads
ALAT
Loop branches
LC register
EC register
Multiway branch
Branch registers
Register rotation
Predicate rotation
RRB

Page 55: IA-64 Architecture Innovations

Summary

Speculation reduces memory latency impact
– IA-64 removes recovery from critical path
– Benefits applications with poor cache locality: server applications, OS
Predication removes branches
– Parallel compares increase parallelism
– Benefits complex control flow: large databases
S/W pipelining support with minimal overhead enables broad usage
– Performance for small integer loops with unknown trip counts as well as monster FP loops