ECE 4100/6100 Advanced Computer Architecture
Lecture 15: Static Scheduling Machines
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Static Scheduling
• Compiler performs instruction scheduling
• VLIW: Very Long Instruction Word
• An alternative to dynamically scheduled processors
• Pack multiple operations into one instruction
• Move scheduling to the compiler (software approach)
• Can reduce the complexity of a hardware-based instruction scheduler
• Examples: Cydrome, Multiflow, EPIC
Very Long Instruction Word (VLIW)
• Relies on the compiler
• Simple hardware
• Dependencies are explicitly represented in the instructions
• The instruction window is, supposedly, much larger than a hardware scheduling window
– How about loop boundaries?
– How about function boundaries?
– Interprocedural optimization is generally difficult
• Might lead to compatibility or performance issues if instruction latencies change
• EPIC/Itanium closely follows the VLIW philosophy; many embedded and DSP processors embrace VLIW
Intel Itanium ISA
• Itanium instruction “bundle” (VLIW)
– 128 bits each
– Contains three Itanium instructions (aka syllables)
– Template bits in each bundle specify dependencies both within a bundle and between sequential bundles
– A collection of independent bundles forms a “group” (delimited by stops)
• Each Itanium instruction
– Fixed length: 41 bits
– Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU)
– Contains at most three 7-bit register specifiers
– Contains a 6-bit field specifying one of the 64 one-bit qualifying predicate registers
[Figure: bundle layout, bits 127..0: three 41-bit instruction slots (bits 127-87, 86-46, 45-5) followed by the template (bits 4-0)]
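The bundle layout above can be exercised with a few shifts and masks. The following is a minimal sketch, not an actual decoder: the helper names are made up for illustration, and the position of the qualifying-predicate field (bits 5..0 of an instruction) is an assumption, since the slide only gives its width.

```python
SLOT_BITS = 41       # each instruction (syllable) is 41 bits
TEMPLATE_BITS = 5    # template occupies bits 4..0
SLOT_MASK = (1 << SLOT_BITS) - 1


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    bundle = template & 0x1F
    for i, instr in enumerate(slots):
        bundle |= (instr & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Return (template, [slot at bits 45..5, slot at 86..46, slot at 127..87])."""
    template = bundle & 0x1F
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def major_opcode(instr):
    """Left-most 4 bits (40..37) of a 41-bit instruction."""
    return (instr >> 37) & 0xF


def qualifying_predicate(instr):
    """Assumption: the 6-bit predicate field sits in bits 5..0."""
    return instr & 0x3F


# Hypothetical bundle: template 0x02, first slot has major opcode 8 and qp = p16.
example = encode_bundle(0x02, [(0x8 << 37) | 16, 1, 2])
```

Since 5 + 3 × 41 = 128, the three slots and the template exactly fill the bundle.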
Encoding Instruction Bundle
• Use “;;” as the “stop bit” in assembly code to separate dependent instructions
• Instructions between “;;” belong to the same “instruction group”
– RAW and WAW are not allowed in the same instruction group
– WAR is allowed except for one special case: writing p63 by a modulo-scheduled branch (e.g. br.ctop) after reading p63 (e.g. as a qualifying predicate) by a B-type instruction
• Each instruction slot can represent one (out of 5) functional unit types based on the encoding (e.g. slot 0 can be M-unit or B-unit)
• 12 basic templates are provided, each with 2 versions depending on the stop bit
– MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
– MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_
• Example: MI_I format, template encoded as “02”
{ .mii
  ld4 r28 = [r8]
  add r9 = 2, r1 ;;
  add r30 = 1, r9
}
Itanium Instruction Example
{ .mii
  add r1 = r2, r3
  sub r4 = r4, r5 ;;
  shr r1, r4, r1 ;;
}
{ .mmi
  ld8 r2, [r1] ;;
  st8 [r1] = r23
  tbit p1,p2 = r4, 5
}
{ .mbb
  ld8 r45 = [r55]
  (p3) br.call b1=func1
  (p4) br.cond Label1
}
{ .mfi
  st4 [r45] = r6
  fmac f1=f2,f3
  add r3=r3, 8 ;;
}
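Itanium Register Files
[Figure: three register files, each split into a lower “Static” region and an upper “Stacked (Rotating)” region.
General Purpose Registers: 0-31 static, 32-127 stacked (rotating); bits 63..0.
FP Registers: 0-31 static, 32-127 stacked (rotating); bits 81..0.
Predicate Registers: 0-15 static, 16-63 stacked (rotating); 1 bit each.]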
Register Stack Engine
• Avoids spills/fills during function call/return
• The callee uses the instruction alloc r1=ar.pfs, i, l, o, r upon entering a function
[Figure: general register file. Registers 0-31 are static; the current frame starts at r32 and holds (inputs), locals, and outputs; registers beyond the frame are illegal.
Current Frame Marker (CFM), 38 bits, with fields sof, sol, sor, rrb.gr, rrb.fr, rrb.pr:
– sof: size of frame
– sol: size of locals (sol = i+l)
– sor: size of rotating]
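Function Call Example
main()
{
    a=foo(i*i, b[i]);
}
int foo(int ii, int bb)
{
}
main: alloc r32=ar.pfs,0,12,2,0
foo: alloc r26=ar.pfs,2,5,0,0
[Figure: GPR view of the caller (main): frame r32-r45, with locals r32-r43 and outputs r44 (i*i) and r45 (b[i]). GPR view of the callee (foo): the same two values appear as r32 (i*i) and r33 (b[i]); the frame extends to r38.]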
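RSE: A Function Call
[Figure: frame state before and after a call.
Before the call: locals start at r32, outputs are r46-r52; CFM: sol=14, sof=21; PFS.pfm: xx.
After the call: the caller's outputs form the new frame r32-r38; CFM: sol=0, sof=7; PFS.pfm: sol=14, sof=21.]
pfm: Previous frame marker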
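RSE: Alloc
[Figure: frame state across call and alloc.
Before the call: locals start at r32, outputs are r46-r52; CFM: sol=14, sof=21; PFS.pfm: xx.
After the call: frame r32-r38 (outputs only); CFM: sol=0, sof=7; PFS.pfm: sol=14, sof=21.
After alloc r32=ar.pfs,7,9,3,0: inputs and locals start at r32, outputs are r48-r50; CFM: sol=16, sof=19; PFS.pfm: sol=14, sof=21.]
alloc copies PFM to GR (r32)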
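RSE: Return
[Figure: frame state across call, alloc, and return.
Before the call: locals start at r32, outputs are r46-r52; CFM: sol=14, sof=21; PFS.pfm: xx.
After the call: frame r32-r38 (outputs only); CFM: sol=0, sof=7; PFS.pfm: sol=14, sof=21.
After alloc: locals start at r32, outputs are r48-r50; CFM: sol=16, sof=19; PFS.pfm: sol=14, sof=21.
After return: the caller's frame is restored, locals start at r32, outputs are r46-r52; CFM: sol=14, sof=21; PFS.pfm: sol=14, sof=21.]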
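The call / alloc / return sequence in the RSE figures can be replayed with a small model of the frame markers. This is only a sketch under stated assumptions: the class and method names are hypothetical, a single saved pfm stands in for ar.pfs (real code saves it in a general register so calls can nest), the rotating-size argument of alloc is ignored, and sof = i + l + o is assumed (the slides only state sol = i + l).

```python
from dataclasses import dataclass

FIRST_STACKED = 32  # r0-r31 are static


@dataclass(frozen=True)
class FrameMarker:
    sol: int  # size of locals (inputs + locals)
    sof: int  # size of frame (inputs + locals + outputs)

    def outputs(self):
        """Register numbers of the output area as a (first, last) pair."""
        return FIRST_STACKED + self.sol, FIRST_STACKED + self.sof - 1


class RegisterStack:
    def __init__(self, sol, sof):
        self.cfm = FrameMarker(sol, sof)
        self.pfm = None  # "xx" in the figure: nothing saved yet

    def call(self):
        # The caller's outputs become the callee's whole frame, renamed to start at r32.
        self.pfm = self.cfm
        self.cfm = FrameMarker(0, self.cfm.sof - self.cfm.sol)

    def alloc(self, i, l, o):
        # sol = i + l is from the slides; sof = i + l + o is an assumption.
        self.cfm = FrameMarker(i + l, i + l + o)

    def ret(self):
        # Restore the caller's frame from the previous frame marker.
        self.cfm = self.pfm
```

Starting from `RegisterStack(sol=14, sof=21)` and applying `call()`, `alloc(7, 9, 3)`, and `ret()` walks through the same sol/sof values as the figures.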
Itanium Pipelines
• Performance improvement due to pipeline shortening: 4% to 6%
• The large integer register file causes an extra stage, WLD (Word Line Decode), in Itanium; the circuit was improved for Itanium 2
• Inter-group latency is enforced by a scoreboard
– Latency due to scheduling that failed to space instructions out
– Latency due to cache misses
[Figure: Itanium and Itanium 2 pipeline diagrams with the front-end and the circuit-improved stage marked; the dependency scoreboard stall is checked prior to EXE]
Itanium 2 Eight-stage Pipeline
Core: IPG, ROT, EXP, REN, REG, EXE, DET, WB
FP: FP1, FP2, FP3, FP4, WB
L2: L2N, L2I, L2A, L2M, L2D, L2C, L2W
IPG: IP Generate, L1I cache (6 inst) and TLB access
ROT: Instruction Rotate and Buffer (6 inst)
EXP: Expand, Port assignment and routing
REN: INT and FP register rename
REG: INT and FP register file read
EXE: ALU Execute, L1D Cache and TLB Access + L2 Cache Tag Access
DET: Exception Detect, Branch Correction
WB: Writeback, INT register update
FP1-WB: FP FMAC pipeline (2) + register write
L2N-L2I: L2 Queue Nominate/Issue (4) (speculatively issued with L1 request)
L2A-L2W: L2 Access, Rotate, Correct, Write (4)
Itanium 2 Microarchitecture
[Figure: block diagram.
Front end: L1 I-Cache & Fetch/Prefetch engine, I-TLB, Branch Prediction, IA-32 Decode & Control, 8-bundle Instruction Queue.
11 issue ports: B B B, M M M M, I I, F F.
Register stack engine / remapping; Branch & Predicate registers, 128 INT registers, 128 FP registers.
Execution: Branch Units, INT & MM Units, Floating Point Units; Scoreboard, Predicate, NaT, Exceptions.
Memory: Quad-port (INT) L1 PIPT Data Cache (WT), D-TLB, ALAT; PIPT Unified L2 Cache, Quad-Port (ECC); On-chip PIPT Unified L3 Cache, Single-ported (ECC); Bus Controller (ECC).]
Control Speculation (Speculative Load)
• Elevate loads above a branch
• Hides memory latency through control speculation at compile time
• Defer exceptions by setting NaT (a GR's 65th bit), which indicates:
– Whether or not an exception has occurred
– Whether a branch to fixup code is required
• NaT is set during ld.s and checked by chk.s
Conventional architectures (the branch is a barrier):
  instr 1
  instr 2
  . . .
  br
  Load
  use
Itanium:
  ld.s
  instr 1
  instr 2
  br
  chk.s
  use
Control Speculation (Hoist Uses)
• The uses of speculative data can also be executed speculatively
– Distinguishes speculation from simple prefetch
• The NaT bit propagates down the dependent instruction chain
IA-64:
  ld.s
  instr 1
  instr 2
  br
  chk.s
  use
Control Speculation (Recovery)
• All computation instructions propagate NaTs to their consumers to reduce the number of checks
• Cmp propagates “false” if NaT is set when writing predicates (“0” for both target predicates)
  ld8.s r3 = (r9)
  ld8.s r4 = (r10)
  add r6 = r3, r4
  ld8.s r5 = (r6)
  p1,p2 = cmp(...)
  chk.s r5, recv
  sub r7 = r5, r2
Allows a single chk on the result
Recovery code:
  ld8
  ld8
  add
  ld8
  br home
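NaT propagation along a dependent chain can be illustrated with a toy model in which each register carries a value and a NaT flag. This is a sketch only: the `Reg` type, the dictionary used as memory, and the idea that a missing key stands for a faulting address are assumptions made for illustration, not part of the slides.

```python
from typing import NamedTuple


class Reg(NamedTuple):
    value: int
    nat: bool = False  # models the 65th "NaT" bit of a general register


def ld_s(memory, addr):
    """Speculative load: on a fault, defer the exception by setting NaT."""
    if addr.nat or addr.value not in memory:
        return Reg(0, True)
    return Reg(memory[addr.value])


def add(a, b):
    """Computation instructions propagate NaT to their consumers."""
    return Reg(a.value + b.value, a.nat or b.nat)


def chk_s(reg):
    """True means: branch to the recovery code."""
    return reg.nat


def load_chain(memory, r9, r10):
    """Mirrors the slide's chain: two loads, an add, and a dependent load."""
    r3 = ld_s(memory, r9)
    r4 = ld_s(memory, r10)
    r6 = add(r3, r4)
    r5 = ld_s(memory, r6)
    return r5, chk_s(r5)  # a single check on the final result
```

If any load in the chain faults, the NaT flag reaches r5, so checking r5 alone is enough to decide whether the recovery code must re-execute the chain non-speculatively.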
Data Speculation (Advanced Loads)
• Compiler can hoist a load above a preceding, possibly-conflicting store
• ALAT (Advanced Load Address Table) is used to check every store address in between
• Can be done by a superscalar machine using store coloring
Conventional architectures (the store is a barrier):
  instr 1
  instr 2
  . . .
  st8
  ld8
  use
Itanium:
  ld8.a
  instr 1
  instr 2
  st8
  ld.c
  use
Data Speculation (load.a + chk.a)
• Compiler hoists a load and its subsequent consumers above a preceding, possibly-conflicting store
• Needs recovery code to patch up a mis-speculation
Only the load is hoisted (ld.c):
  ld8.a r3=
  instr 1
  instr 2
  st8
  ld.c
  add =r3,
The load and its use are hoisted (chk.a):
  ld8.a r3=
  instr 1
  add =r3,
  instr 2
  st8
  chk.a
L1:
Recovery code:
  ld8 r3=
  add =r3,
  br L1
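The role of the ALAT can be sketched as a table keyed by destination register: an advanced load inserts an entry, any store to the same address removes it, and the check reloads only if the entry is gone. The class below is a hypothetical simplification (full-address matching, unlimited capacity, no access sizes), not a description of the real hardware table.

```python
class ALAT:
    """Toy Advanced Load Address Table: register name -> speculatively loaded address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: perform the load early and remember its address."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Every store checks the table and invalidates conflicting entries."""
        memory[addr] = value
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def ld_c(self, memory, reg, addr, current):
        """Check load: keep the speculative value if the entry survived, else reload."""
        if reg in self.entries:
            return current
        return memory[addr]
```

With chk.a instead of ld.c, a missing entry would branch to recovery code that also re-executes the hoisted consumers; this sketch only covers the ld.c case.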
Parallel Compare Types
• Three new types of compares:
– and: both target predicates are set FALSE if the compare is false
– or: both target predicates are set TRUE if the compare is true
– DeMorgan: if true, sets one TRUE and the other FALSE
• Not to be confused with the “parallel compare” instructions pcmp1/pcmp2/pcmp4
• Reduces the critical path
[Figure: blocks A, B, C, D drawn as a serial chain, and again with A, B, C side by side ahead of D]
Eight Queen Example
Source: Crawford & Huck
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Unconditional Compares
  R1=&b[j]
  R3=&a[i+j]
  R5=&c[i-j+7]
  ld R2=[R1]
  ld.s R4=[R3]
  ld.s R6=[R5]
  p1,p2=cmp.unc(R2==true)
  (p1) chk.s R4
  (p1) p3,p4=cmp.unc(R4==true)
  (p3) chk.s R6
  (p3) p5,p6=cmp.unc(R6==true)
  (p5) br then
  else
[Figure: 8 queens control flow. Predicate pairs P1/P2, P3/P4, P5/P6 select between Then and Else; cycle labels 1, 2, 4, 5, 6, 7]
Eight Queen Example
Source: Crawford & Huck
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
Parallel Compares
  R1=&b[j]
  R3=&a[i+j]
  R5=&c[i-j+7]
  p1 <- true
  ld R2=[R1]
  ld R4=[R3]
  ld R6=[R5]
  p1,p2 <- cmp.and(R2==true)
  p1,p2 <- cmp.and(R4==true)
  p1,p2 <- cmp.and(R6==true)
  (p1) br then
  else
Reduced from 7 cycles to 5
[Figure: the original control flow with P1/P2, P3/P4, P5/P6 collapses to a single test: P1 = true goes to Then, P1 = false goes to Else; cycle labels 1, 2, 4, 5]
More Example of Parallel Compare
if (c1 && c2 && c3 && c4)
    r1 = r2 + r3;
else
    r4 = r5 - r6;
Itanium code:
  cmp.eq p1,p2 = r0,r0 ;;
  cmp.eq.and.orcm p1,p2 = c1,r0
  cmp.eq.and.orcm p1,p2 = c2,r0
  cmp.eq.and.orcm p1,p2 = c3,r0
  cmp.eq.and.orcm p1,p2 = c4,r0
  (p1) add r1=r2,r3
  (p2) sub r4=r5,r6
• Parallel cmp.crel.and or cmp.crel.or write the same value to both predicates
• Use cmp.crel.and.orcm or cmp.crel.or.andcm for writing complementary predicates
• Also called DeMorgan type (for complementary output)
[Figure: control-flow graph testing c1, c2, c3, c4 in sequence, each branching to else, with then reached only if all hold; cycle labels 0, 1, 2]
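The effect of the and, or, and and.orcm compare types on a predicate pair can be written out as small update functions. This is a sketch of the semantics as described on the slides, with Python booleans standing in for predicate registers and for the compare outcome; the function names are made up for illustration.

```python
def cmp_and(p1, p2, cond):
    """'and' type: both targets are cleared if the compare is false, else unchanged."""
    return (p1, p2) if cond else (False, False)


def cmp_or(p1, p2, cond):
    """'or' type: both targets are set if the compare is true, else unchanged."""
    return (True, True) if cond else (p1, p2)


def cmp_and_orcm(p1, p2, cond):
    """and.orcm: first target is ANDed with cond, second is ORed with its complement."""
    return (p1 and cond, p2 or not cond)


def if_all(conds):
    """Evaluate 'if (c1 && c2 && ...)' into a complementary (then, else) predicate pair."""
    p1, p2 = True, False  # initialization, as done by the first cmp.eq on r0,r0
    for cond in conds:    # each update is independent of the order in which it is applied
        p1, p2 = cmp_and_orcm(p1, p2, cond)
    return p1, p2
```

Because every compare can only clear p1 and only set p2, the four updates do not depend on each other's results, which is what allows them to issue in the same cycle.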
Multiway Branches
• Multiway branches: more than 1 branch in a single cycle
– Itanium allows multiple “consecutive” B instructions in the same instruction group
– Allows n-way branching per cycle (Itanium and Itanium 2 have 3 branch units)
– Ordering matters if branch predicates are not mutually exclusive
• E.g. the BBB template enables 3 branches in one bundle
w/o Speculation (3 branch cycles):
  ld8 r6 = (ra)
  (p1) br exit1
  ld8 r7 = (rb)
  (p3) br exit2
  ld8 r8 = (rc)
  (p5) br exit3
Hoisting Loads:
  ld8 r6 = (ra)
  ld8.s r7 = (rb)
  ld8.s r8 = (rc)
  (p1) br exit1
  chk r7, rec1
  (p3) br exit2
  chk r8, rec2
  (p5) br exit3
Multi-way Branches (1 branch cycle):
  ld8 r6 = (ra)
  ld8.s r7 = (rb)
  ld8.s r8 = (rc)
  (p2) chk r7, rec1
  (p4) chk r8, rec2
  (p1) br exit1
  (p3) br exit2
  (p5) br exit3
[Figure: predicate tree with pairs P1/P2, P3/P4, P5/P6]
Branch and Prefetch Hints
• Compiler provides hints for the branch predictor by
– Completers in branch instructions, e.g. br.call.sptk
  • 4 completer types for static and dynamic predictions: sptk, spnt, dptk, dpnt
– Explicit brp instructions
• Compiler provides hints for sequential instruction prefetching
– Use a completer in branch instructions, e.g. br.call.sptk.many
  • 2 completer types: many, few
  • Many and few are implementation-specific
• Compiler directs predictor allocation
– For managing branch predictor resources
– Use a completer in branch instructions, e.g. br.call.sptk.many.none
  • 2 completer types: none, clr
  • none: don’t deallocate; clr: deallocate branch info
Modulo Scheduling Support
• Will be discussed next
• Itanium features that support modulo scheduling (or software pipelining)
– Full predication
– Special branch handling features
  • br.ctop (for for-loops with a known loop count)
  • br.wtop (for while-loops)
– Register rotation: removes loop copy overhead
  • No modulo variable expansion, tighter code
– Predicate rotation/generation
  • Removes prologue & epilogue
List Scheduling
P = Mem[A++] + C1;
Q = P * C2;
Y = P * C3 + (P + Q) * (P * C3);
Mem[B++] = Y;
Latency:
Mem: 1 cycle
Adder: 2 cycles
Multiplier: 2 cycles
• Build the dependency graph
• Assign a priority of “0” to all operations having no successors
• Assign each remaining operation the sum of the priority and latency of its successor. If there is more than one successor, take the maximum.
• Schedule instructions based on priority
Schedule = {X1, A1, M1, A2, M2, M3, A3, X2}
[Figure: dependency graph. X1 (ld) feeds A1 (+ C1); A1 feeds M1 (x C2), M2 (x C3), and A2; M1 feeds A2; A2 and M2 feed M3; M2 and M3 feed A3; A3 feeds X2 (st). Priorities: X1 = 11, A1 = 9, M1 = 7, A2 = 5, M2 = 5, M3 = 3, A3 = 1, X2 = 0.]
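The priority rule and the resulting schedule for this example can be reproduced with a short script. It is a sketch under assumptions: the dependency edges are read off the four source statements, each unit is taken to be pipelined (one new operation per cycle), and the greedy cycle-by-cycle loop is one simple way to realize "schedule based on priority", not necessarily the exact procedure used in the lecture.

```python
from functools import lru_cache

LATENCY = {"MEM": 1, "ADDER": 2, "MULT": 2}
UNIT = {"X1": "MEM", "A1": "ADDER", "M1": "MULT", "M2": "MULT",
        "A2": "ADDER", "M3": "MULT", "A3": "ADDER", "X2": "MEM"}
# Successors of each operation in the dependency graph.
SUCCS = {"X1": ["A1"], "A1": ["M1", "M2", "A2"], "M1": ["A2"],
         "M2": ["M3", "A3"], "A2": ["M3"], "M3": ["A3"],
         "A3": ["X2"], "X2": []}


@lru_cache(maxsize=None)
def priority(op):
    """0 without successors, else max over successors of (priority + latency)."""
    return max((priority(s) + LATENCY[UNIT[s]] for s in SUCCS[op]), default=0)


def list_schedule():
    """Greedy scheduling: each cycle, every unit issues its highest-priority ready op."""
    preds = {op: [p for p in SUCCS if op in SUCCS[p]] for op in UNIT}
    start, cycle = {}, 0
    while len(start) < len(UNIT):
        ready = [op for op in UNIT if op not in start and
                 all(p in start and start[p] + LATENCY[UNIT[p]] <= cycle
                     for p in preds[op])]
        busy = set()
        for op in sorted(ready, key=priority, reverse=True):
            if UNIT[op] not in busy:
                start[op] = cycle
                busy.add(UNIT[op])
        cycle += 1
    return start
```

Calling `list_schedule()` yields the start cycle of every operation, which can be compared against the reservation table for this example.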
List Scheduling
• LS (a heuristic) provides a near-optimal schedule
• But there is no guarantee of optimality, especially in terms of throughput
Reservation Table
Time | MEM | ADDER | MULT
0 | X1 | |
1 | | A1 |
2 | | |
3 | | | M1
4 | | | M2
5 | | A2 |
6 | | |
7 | | | M3
8 | | |
9 | | A3 |
10 | | |
11 | X2 | |
[Figure: the same dependency graph, with priorities X1 = 11, A1 = 9, M1 = 7, A2 = 5, M2 = 5, M3 = 3, A3 = 1, X2 = 0]
Scheduling
• If I want to use the same schedule, what is the minimum initiation interval?
• In the example, do I need to wait for 12 cycles?
• If not, how do I avoid collisions?
Time | MEM | ADDER | MULT
0 | X1 | |
1 | | A1 |
2 | | |
3 | | | M1
4 | | | M2
5 | | A2 |
6 | | |
7 | | | M3
8 | | |
9 | | A3 |
10 | | |
11 | X2 | |
Modulo Scheduling [RauGlaeser’81]
• A.k.a. “polycyclic scheduling” or “software pipelining”
• Exploits ILP among loop iterations to maximize
– Machine utilization
– Throughput
• Uses a common schedule for the majority of iterations
• Overlaps execution of consecutive iterations
• Constant initiation rate: Initiation Interval (II)
• Minimum II (MII) generates an optimal schedule with maximum throughput
• Originally developed for polycyclic architecture (or horizontal architecture, later known as VLIW) at TRW/ESL
Modulo Scheduling: Resource Constraint
• The optimal schedule is constrained by the number of available resources
• Determine ResII (Resource minimal initiation interval)
– Successive iterations will be scheduled ResII cycles apart
• N(i) is the number of uses of resource i in a loop
• C(i) is the number of resources of type i
$$\mathrm{ResII} = \max\left( \left\lceil \frac{N(1)}{C(1)} \right\rceil, \left\lceil \frac{N(2)}{C(2)} \right\rceil, \left\lceil \frac{N(3)}{C(3)} \right\rceil, \ldots \right)$$
Resource II
• Assume 3 FUs
– 1 adder with 2-cycle latency
– 1 mult with 2-cycle latency
– 1 mem unit with 1-cycle latency
• Determine MII = Resource II
$$\mathrm{MII} = \mathrm{ResII} = \max\left( \frac{2}{1}, \frac{3}{1}, \frac{3}{1} \right) = 3$$
[Figure: dependency graph of the loop body with operations X1 (ld), A1, M1, M2, A2, M3, A3, X2 (st) and constants C1, C2, C3]
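The resource-constrained bound is a one-line computation once the usage counts are tabulated. A minimal sketch, assuming uses and unit counts are given as dictionaries keyed by resource name (the names are illustrative) and that fractional results are rounded up:

```python
def res_ii(uses, units):
    """ResII = max over resources of ceil(N(i) / C(i)), using integer arithmetic."""
    return max(-(-uses[r] // units[r]) for r in uses)


# Loop body of the running example: 2 memory ops, 3 adds, 3 multiplies, one unit of each.
example_ii = res_ii({"MEM": 2, "ADDER": 3, "MULT": 3},
                    {"MEM": 1, "ADDER": 1, "MULT": 1})
```

The most heavily used resource sets the bound: here the adder and the multiplier each need three slots per iteration, so a new iteration cannot start more often than every 3 cycles.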
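Modulo Reservation Table (MRT)
Original schedule, with each time taken modulo 3:
Time | Modulo | MEM | ADDER | MULT
0 | 0 | X1 | |
1 | 1 | | A1 |
2 | 2 | | |
3 | 0 | | | M1
4 | 1 | | | M2
5 | 2 | | A2 |
6 | 0 | | |
7 | 1 | | | M3
8 | 2 | | |
9 | 0 | | A3 |
10 | 1 | | |
11 | 2 | X2 | |
New Schedule for 1 iteration: times 0-14 (modulo 0, 1, 2 repeating), not yet filled in
MRT: rows for modulo 0, 1, 2 with columns MEM, ADDER, MULT, not yet filled in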
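Modulo Reservation Table (MRT)
Original schedule, with each time taken modulo 3:
Time | Modulo | MEM | ADDER | MULT
0 | 0 | X1 | |
1 | 1 | | A1 |
2 | 2 | | |
3 | 0 | | | M1
4 | 1 | | | M2
5 | 2 | | A2 |
6 | 0 | | |
7 | 1 | | | M3
8 | 2 | | |
9 | 0 | | A3 |
10 | 1 | | |
11 | 2 | X2 | |
New Schedule for 1 iteration:
Time | Modulo | MEM | ADDER | MULT
0 | 0 | X1 | |
1 | 1 | | A1 |
2 | 2 | | |
3 | 0 | | | M1
4 | 1 | | | M2
5 | 2 | | A2 |
6 | 0 | | |
7 | 1 | | |
8 | 2 | | | M3
9 | 0 | | |
10 | 1 | | |
11 | 2 | | |
12 | 0 | | A3 |
13 | 1 | | |
14 | 2 | X2 | |
MRT:
Modulo | MEM | ADDER | MULT
0 | X1 | A3 | M1
1 | | A1 | M2
2 | X2 | A2 | M3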
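One way to arrive at such a collision-free schedule is to place operations in list-schedule order, delaying each one until its slot in the modulo reservation table is free. The sketch below makes several assumptions that are not stated on the slides: operations are placed greedily without backtracking, there are no loop-carried dependences, and the dependency edges are those of the running example.

```python
LATENCY = {"MEM": 1, "ADDER": 2, "MULT": 2}
UNIT = {"X1": "MEM", "A1": "ADDER", "M1": "MULT", "M2": "MULT",
        "A2": "ADDER", "M3": "MULT", "A3": "ADDER", "X2": "MEM"}
PREDS = {"X1": [], "A1": ["X1"], "M1": ["A1"], "M2": ["A1"],
         "A2": ["A1", "M1"], "M3": ["A2", "M2"], "A3": ["M2", "M3"],
         "X2": ["A3"]}
ORDER = ["X1", "A1", "M1", "A2", "M2", "M3", "A3", "X2"]  # list-schedule order


def modulo_schedule(ii):
    """Return (start time per op, MRT) for initiation interval ii."""
    mrt = {}    # (unit, time mod ii) -> op occupying that slot in every iteration
    start = {}
    for op in ORDER:
        earliest = max((start[p] + LATENCY[UNIT[p]] for p in PREDS[op]), default=0)
        for t in range(earliest, earliest + ii):   # ii consecutive times cover all slots
            if (UNIT[op], t % ii) not in mrt:
                mrt[(UNIT[op], t % ii)] = op
                start[op] = t
                break
        else:
            raise ValueError(f"no free {UNIT[op]} slot for {op} at II={ii}")
    return start, mrt
```

With `ii=3` this greedy placement delays M3, A3, and X2 relative to the original list schedule, because their original times collide modulo 3 with operations that are already placed.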
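Modulo Scheduled Loop
Prolog: times 0-11. Kernel, steady state (MRT schedule): from time 12 on. Numbers in parentheses are iteration numbers.
Modulo | Time | MEM | ADDER | MULT
0 | 0 | X1 (1) | |
1 | 1 | | A1 (1) |
2 | 2 | | |
0 | 3 | X1 (2) | | M1 (1)
1 | 4 | | A1 (2) | M2 (1)
2 | 5 | | A2 (1) |
0 | 6 | X1 (3) | | M1 (2)
1 | 7 | | A1 (3) | M2 (2)
2 | 8 | | A2 (2) | M3 (1)
0 | 9 | X1 (4) | | M1 (3)
1 | 10 | | A1 (4) | M2 (3)
2 | 11 | | A2 (3) | M3 (2)
0 | 12 | X1 (5) | A3 (1) | M1 (4)
1 | 13 | | A1 (5) | M2 (4)
2 | 14 | X2 (1) | A2 (4) | M3 (3)
0 | 15 | X1 (6) | A3 (2) | M1 (5)
1 | 16 | | A1 (6) | M2 (5)
2 | 17 | X2 (2) | A2 (5) | M3 (4)
0 | 18 | X1 (7) | A3 (3) | M1 (6)
1 | 19 | | A1 (7) | M2 (6)
2 | 20 | X2 (3) | A2 (6) | M3 (5)
0 | 21 | X1 (8) | A3 (4) | M1 (7)
1 | 22 | | A1 (8) | M2 (7)
2 | 23 | X2 (4) | A2 (7) | M3 (6)
0 | 24 | X1 (9) | A3 (5) | M1 (8)
1 | 25 | | A1 (9) | M2 (8)
2 | 26 | X2 (5) | A2 (8) | M3 (7)
0 | 27 | X1 (10) | A3 (6) | M1 (9)
1 | 28 | | A1 (10) | M2 (9)
2 | 29 | X2 (6) | A2 (9) | M3 (8)
0 | 30 | X1 (11) | A3 (7) | M1 (10)
1 | 31 | | A1 (11) | M2 (10)
2 | 32 | X2 (7) | A2 (10) | M3 (9)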
Modulo Scheduled Loop
[Left: the prolog and kernel schedule for times 0-32, as on the previous slide]
Last kernel: T+6 to T+8. Epilog: from T+9 on.
Modulo | Time | MEM | ADDER | MULT
0 | T+0 | X1 (N-2) | A3 (N-6) | M1 (N-3)
1 | T+1 | | A1 (N-2) | M2 (N-3)
2 | T+2 | X2 (N-6) | A2 (N-3) | M3 (N-4)
0 | T+3 | X1 (N-1) | A3 (N-5) | M1 (N-2)
1 | T+4 | | A1 (N-1) | M2 (N-2)
2 | T+5 | X2 (N-5) | A2 (N-2) | M3 (N-3)
0 | T+6 | X1 (N) | A3 (N-4) | M1 (N-1)
1 | T+7 | | A1 (N) | M2 (N-1)
2 | T+8 | X2 (N-4) | A2 (N-1) | M3 (N-2)
0 | T+9 | | A3 (N-3) | M1 (N)
1 | T+10 | | | M2 (N)
2 | T+11 | X2 (N-3) | A2 (N) | M3 (N-1)
0 | T+12 | | A3 (N-2) |
1 | T+13 | | |
2 | T+14 | X2 (N-2) | | M3 (N)
0 | T+15 | | A3 (N-1) |
1 | T+16 | | |
2 | T+17 | X2 (N-1) | |
0 | T+18 | | A3 (N) |
1 | T+19 | | |
2 | T+20 | X2 (N) | |
Another Modulo Schedule Example
Given 2 adders (1-cycle) & 1 multiplier (2-cycle)
MII = max(3/2, 2/1) = 2
Modulo Reservation Table
Modulo | ADDER1 | ADDER2 | MULT
0 | A1 (3) | A2 (3) | M2 (2)
1 | A3 (1) | | M1 (3)
Multiplier is fully utilized
[Figure: dependency graph with inputs A, B, C, D, E, operations A1, A2, A3 (+) and M1, M2 (x), and output Z; node priorities 0, 1, 1, 3, 3; the resulting schedule is drawn as a prolog, a 5x kernel, and an epilog]
How to Perform Register Allocation?
• We are overlapping multiple iterations in one schedule.
– Example: iterations 1 to 5 are alive at the same time
• Registers from multiple iterations are alive during the same period of time
MRT
Modulo | MEM | ADDER | MULT
0 | X1 (5) | A3 (1) | M1 (4)
1 | | A1 (5) | M2 (4)
2 | X2 (1) | A2 (4) | M3 (3)
Modulo Variable Expansion
• Analyze the lifetime of each architectural register
• Unroll the loop to enable the modulo schedule
• r5 needs to stay alive for 8 cycles, i.e. ⌈8/3⌉ = 3 MII (so unroll 3 times)
Register lifetimes in cycles: r1 (1), r2 (4), r3 (2), r4 (3), r5 (8), r6 (4), r7 (2)
The cycle numbers assume WAR is allowed in the same cycle
Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | |
1 | 1 | | add r2, r1, $c1 |
2 | 2 | | |
0 | 3 | ld r11, (A)++ | | mul r3, r2, $c2
1 | 4 | | add r12, r11, $c1 | mul r5, r2, $c3
2 | 5 | | add r4, r2, r3 |
0 | 6 | X1 (3) | | mul r13, r12, $c2
1 | 7 | | A1 (3) | mul r15, r12, $c3
2 | 8 | | add r14, r12, r13 | mul r6, r4, r5
0 | 9 | X1 (4) | | M1 (3)
1 | 10 | | A1 (4) | M2 (3)
2 | 11 | | A2 (3) | mul r16, r14, r15
0 | 12 | X1 (5) | add r7, r5, r6 | M1 (4)
1 | 13 | | A1 (5) | M2 (4)
2 | 14 | st r7, (B)++ | A2 (4) | M3 (3)
Post MVE code
Kernel (unrolled 3 times)
Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | |
1 | 1 | | add r2, r1, $c1 |
2 | 2 | | |
0 | 3 | ld r11, (A)++ | | mul r3, r2, $c2
1 | 4 | | add r12, r11, $c1 | mul r5, r2, $c3
2 | 5 | | add r4, r2, r3 |
0 | 6 | ld r21, (A)++ | | mul r13, r12, $c2
1 | 7 | | add r22, r21, $c1 | mul r15, r12, $c3
2 | 8 | | add r14, r12, r13 | mul r6, r4, r5
0 | 9 | ld r1, (A)++ | | mul r23, r22, $c2
1 | 10 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 11 | | add r24, r22, r23 | mul r16, r14, r15
0 | 12 | ld r11, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
1 | 13 | | add r12, r11, $c1 | mul r5, r2, $c3
2 | 14 | st r7, (B)++ | add r4, r2, r3 | mul r26, r24, r25
0 | 15 | ld r21, (A)++ | add r17, r15, r16 | mul r13, r12, $c2
1 | 16 | | add r22, r21, $c1 | mul r15, r12, $c3
2 | 17 | st r17, (B)++ | add r14, r12, r13 | mul r6, r4, r5
0 | 18 | ld r1, (A)++ | add r27, r25, r26 | mul r23, r22, $c2
1 | 19 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 20 | st r27, (B)++ | add r24, r22, r23 | mul r16, r14, r15
0 | 21 | ld r11, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
1 | 22 | | add r12, r11, $c1 | mul r5, r2, $c3
2 | 23 | st r7, (B)++ | add r4, r2, r3 | mul r26, r24, r25
0 | 24 | ld r21, (A)++ | add r17, r15, r16 | mul r13, r12, $c2
1 | 25 | | add r22, r21, $c1 | mul r15, r12, $c3
2 | 26 | st r17, (B)++ | add r14, r12, r13 | mul r6, r4, r5
0 | 27 | ld r1, (A)++ | add r27, r25, r26 | mul r23, r22, $c2
1 | 28 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 29 | st r27, (B)++ | add r24, r22, r23 | mul r16, r14, r15
Register Allocation for MVE
• To save registers, not every register needs to be expanded
• Calculate the lifetime of each register to determine whether a new register is needed across iterations (the formula assumes WAR in the same instruction bundle is allowed)
• # of copies = (MII % ⌈lifetime/MII⌉ == 0) ? ⌈lifetime/MII⌉ : MII
– R1 is alive for 1 cycle: ⌈1/3⌉ = 1 MII (need 1 copy)
– R2 is alive for 4 cycles: ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
– R3 is alive for 2 cycles: ⌈2/3⌉ = 1 MII (need 1 copy)
– R4 is alive for 3 cycles: ⌈3/3⌉ = 1 MII (need 1 copy)
– R5 is alive for 8 cycles: ⌈8/3⌉ = 3 MII (need 3 copies)
– R6 is alive for 4 cycles: ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
– R7 is alive for 2 cycles: ⌈2/3⌉ = 1 MII (need 1 copy)
• 13 registers used, instead of 21 with the same unrolling degree
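The copy-count rule can be checked against the seven lifetimes of the example. A minimal sketch that follows the slide's formula literally; note that in this example the unrolling degree happens to equal MII = 3, and the dictionary of lifetimes is simply the list given for r1 to r7.

```python
LIFETIME = {"r1": 1, "r2": 4, "r3": 2, "r4": 3, "r5": 8, "r6": 4, "r7": 2}


def copies(lifetime, mii):
    """Copies of a register: ceil(lifetime / MII) if that divides MII, else MII."""
    span = -(-lifetime // mii)  # number of initiation intervals the value stays alive
    return span if mii % span == 0 else mii


def total_registers(lifetimes, mii):
    """Total register count after expanding each register by its own copy count."""
    return sum(copies(cycles, mii) for cycles in lifetimes.values())
```

A register whose span does not divide the unrolled kernel length evenly would fall out of step with the repeating kernel, which is why r2 and r6 are rounded up from 2 to 3 copies.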
MVE (reallocate registers)
Kernel (unrolled 3 times)
The cycle numbers assume WAR is allowed in the same cycle
Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | |
1 | 1 | | add r2, r1, $c1 |
2 | 2 | | |
0 | 3 | ld r1, (A)++ | | mul r3, r2, $c2
1 | 4 | | add r12, r1, $c1 | mul r5, r2, $c3
2 | 5 | | add r4, r2, r3 |
0 | 6 | ld r1, (A)++ | | mul r3, r12, $c2
1 | 7 | | add r22, r1, $c1 | mul r15, r12, $c3
2 | 8 | | add r4, r12, r3 | mul r6, r4, r5
0 | 9 | ld r1, (A)++ | | mul r3, r22, $c2
1 | 10 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 11 | | add r4, r22, r3 | mul r16, r4, r15
0 | 12 | ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
1 | 13 | | add r12, r1, $c1 | mul r5, r2, $c3
2 | 14 | st r7, (B)++ | add r4, r2, r3 | mul r26, r4, r25
0 | 15 | ld r1, (A)++ | add r7, r15, r16 | mul r3, r12, $c2
1 | 16 | | add r22, r1, $c1 | mul r15, r12, $c3
2 | 17 | st r7, (B)++ | add r4, r12, r3 | mul r6, r4, r5
0 | 18 | ld r1, (A)++ | add r7, r25, r26 | mul r3, r22, $c2
1 | 19 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 20 | st r7, (B)++ | add r4, r22, r3 | mul r16, r4, r15
0 | 21 | ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
1 | 22 | | add r12, r1, $c1 | mul r5, r2, $c3
2 | 23 | st r7, (B)++ | add r4, r2, r3 | mul r26, r4, r25
0 | 24 | ld r1, (A)++ | add r7, r15, r16 | mul r3, r12, $c2
1 | 25 | | add r22, r1, $c1 | mul r15, r12, $c3
2 | 26 | st r7, (B)++ | add r4, r12, r3 | mul r6, r4, r5
0 | 27 | ld r1, (A)++ | add r7, r25, r26 | mul r3, r22, $c2
1 | 28 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 29 | st r7, (B)++ | add r4, r22, r3 | mul r16, r4, r15
Final Modulo Schedule
Prolog Code (12 instruction bundles)
Kernel (9 instruction bundles):
MEM | ADDER | MULT
ld r11, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
 | add r12, r11, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r26, r24, r25
ld r21, (A)++ | add r17, r15, r16 | mul r13, r12, $c2
 | add r22, r21, $c1 | mul r15, r12, $c3
st r17, (B)++ | add r14, r12, r13 | mul r6, r4, r5
ld r1, (A)++ | add r27, r25, r26 | mul r23, r22, $c2
 | add r2, r1, $c1 | mul r25, r22, $c3
st r27, (B)++ | add r24, r22, r23 | mul r16, r14, r15
Epilog Code (12 instruction bundles)
**Branch instruction not shown
Final Modulo Schedule (Reallocate Registers)
Prolog Code (12 instruction bundles)
Kernel (9 instruction bundles):
MEM | ADDER | MULT
ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
 | add r12, r1, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r26, r4, r25
ld r1, (A)++ | add r7, r15, r16 | mul r3, r12, $c2
 | add r22, r1, $c1 | mul r15, r12, $c3
st r7, (B)++ | add r4, r12, r3 | mul r6, r4, r5
ld r1, (A)++ | add r7, r25, r26 | mul r3, r22, $c2
 | add r2, r1, $c1 | mul r25, r22, $c3
st r7, (B)++ | add r4, r22, r3 | mul r16, r4, r15
Epilog Code (12 instruction bundles)
**Branch instruction not shown
Issues with Modulo Variable Expansion
• Many architectural registers are needed
• Code size grows as more unrolling is needed
• Alternative solution: rotating register file
– A hardware technique
– Solves the problem without code duplication
– Similar to a register window plus renaming: keep old iteration values on the stack (Itanium calls the hardware the Register Stack Engine, or RSE)
Intention of Using Rotation Registers
• Use exactly the same schedule (below) for all of
– Kernel code
– Prolog code
– Epilog code
• The “registers” need to be re-allocated
• Registers “rotate” per iteration!!!
MEM | ADDER | MULT
ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
 | add r2, r1, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r6, r4, r5
**Branch instruction not shown
Idea of Rotation Register (Original Schedule)
ite
Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 mul r43, r42, $c2
4 mul r45, r42, $c3
5 add r44, r42, r43
2 6
7
8 mul r46, r44, r45
3 9
10
11
4 12 add r47, r45, r46
13
14 st r47, (B)++
In Intel Itanium, integer registers 32 – 127 are rotating registers
Original Code Schedule
ite
Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 mul r43, r42, $c2
4 mul r45, r42, $c3
5 add r44, r42, r43
2 6
7
8 mul r46, r44, r45
3 9
10
11
4 12 add r47, r45, r46
13
14 st r47, (B)++
In Intel Itanium, integer registers 32 – 127 are rotating registers
Assume HW Rotation Registers
ite
Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 mul r44, r43, $c2
4 mul r45, r43, $c3
5 add r52, r43, r44
2 6
7
8 mul r48, r53, r46
3 9
10
11
4 12 add r51, r48, r50
13
14 st r51, (B)++
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
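Rotation Registers in Itanium Processors
[Figure: three register files, each split into a lower “Static” region and an upper “Stacked (Rotating)” region.
General Purpose Registers: 0-31 static, 32-127 stacked (rotating); bits 63..0.
FP Registers: 0-31 static, 32-127 stacked (rotating); bits 81..0.
Predicate Registers: 0-15 static, 16-63 stacked (rotating); 1 bit each.]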
Register Rotation (Prolog i0)
ite
Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Register Rotation (Prolog i1)
ite
Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 ld r41, (A)++ mul r44, r43, $c2
4 add r42, r41, $c1 mul r45, r43, $c3
5 add r52, r43, r44
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Register Rotation (Prolog i2)
ite
Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 ld r41, (A)++ mul r44, r43, $c2
4 add r42, r41, $c1 mul r45, r43, $c3
5 add r52, r43, r44
2 6 ld r41, (A)++ mul r44, r43, $c2
7 add r42, r41, $c1 mul r45, r43, $c3
8 add r52, r43, r44 mul r48, r53, r46
3 9
10
11
4 12
13
14
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Register Rotation (Prolog i3)
ite
Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 ld r41, (A)++ mul r44, r43, $c2
4 add r42, r41, $c1 mul r45, r43, $c3
5 add r52, r43, r44
2 6 ld r41, (A)++ mul r44, r43, $c2
7 add r42, r41, $c1 mul r45, r43, $c3
8 add r52, r43, r44 mul r48, r53, r46
3 9 ld r41, (A)++ mul r44, r43, $c2
10 add r42, r41, $c1 mul r45, r43, $c3
11 add r52, r43, r44 mul r48, r53, r46
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Register Rotation (Kernel Steady State i4)
ite
Time Mem Adder Multiplier
0 0 ld r41, (A)++
1 add r42, r41, $c1
2
1 3 ld r41, (A)++ mul r44, r43, $c2
4 add r42, r41, $c1 mul r45, r43, $c3
5 add r52, r43, r44
2 6 ld r41, (A)++ mul r44, r43, $c2
7 add r42, r41, $c1 mul r45, r43, $c3
8 add r52, r43, r44 mul r48, r53, r46
3 9 ld r41, (A)++ mul r44, r43, $c2
10 add r42, r41, $c1 mul r45, r43, $c3
11 add r52, r43, r44 mul r48, r53, r46
4 12 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
13 add r42, r41, $c1 mul r45, r43, $c3
14 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Registers wrapped around if exceeding specified bound
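The renaming behind register rotation can be sketched with a rotating base that is decremented once per iteration, so a value written to r41 in one iteration is read as r42 in the next. This is an illustrative model only: the class name and the `rrb` attribute are chosen for the example, and the rotating region is assumed to span all of r32-r127 as stated on the slides.

```python
class RotatingRegisters:
    """Toy model of a rotating register region (default r32..r127)."""

    def __init__(self, first=32, size=96):
        self.first = first
        self.size = size
        self.rrb = 0                 # rotating register base
        self.phys = [0] * size       # physical storage

    def _index(self, reg):
        if not self.first <= reg < self.first + self.size:
            raise ValueError(f"r{reg} is not a rotating register")
        # Wrap around when the renamed number exceeds the region bound.
        return (reg - self.first + self.rrb) % self.size

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def read(self, reg):
        return self.phys[self._index(reg)]

    def rotate(self):
        """Called once per loop iteration: every value moves up by one register name."""
        self.rrb = (self.rrb - 1) % self.size
```

No data is copied on a rotation; only the base changes, which is why the same kernel code can be reused for every iteration without unrolling.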
Register Rotation (Kernel)
• Execute many iterations in the kernel …
Register Rotation (Kernel to Epilog, i<-4>)
ite
Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11
N-10
N-9
-2 N-8
N-7
N-6
-1 N-5
N-4
N-3
0 N-2
N-1
N
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Register Rotation (Kernel to Epilog, i<-3>)
ite
Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11 add r51, r48, r50 mul r44, r43, $c2
N-10 mul r45, r43, $c3
N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-2 N-8
N-7
N-6
-1 N-5
N-4
N-3
0 N-2
N-1
N
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Register Rotation (Kernel to Epilog, i<-2>)
ite
Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11 add r51, r48, r50 mul r44, r43, $c2
N-10 mul r45, r43, $c3
N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-2 N-8 add r51, r48, r50
N-7
N-6 st r51, (B)++ mul r48, r53, r46
-1 N-5
N-4
N-3
0 N-2
N-1
N
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Register Rotation (Kernel to Epilog, i<-1>)
ite
Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11 add r51, r48, r50 mul r44, r43, $c2
N-10 mul r45, r43, $c3
N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-2 N-8 add r51, r48, r50
N-7
N-6 st r51, (B)++ mul r48, r53, r46
-1 N-5 add r51, r48, r50
N-4
N-3 st r51, (B)++
0 N-2
N-1
N
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
Register Rotation (Kernel to Epilog, final ite)
ite
Time Mem Adder Multiplier
-4 N-14 ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
N-13 add r42, r41, $c1 mul r45, r43, $c3
N-12 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-3 N-11 add r51, r48, r50 mul r44, r43, $c2
N-10 mul r45, r43, $c3
N-9 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
-2 N-8 add r51, r48, r50
N-7
N-6 st r51, (B)++ mul r48, r53, r46
-1 N-5 add r51, r48, r50
N-4
N-3 st r51, (B)++
0 N-2 add r51, r48, r50
N-1
N st r51, (B)++
Assuming that registers are rotated per iteration automatically
In Intel Itanium, integer registers 32 – 127 are rotating registers
64
Modulo Schedule with Rotating Register Support
• No loop unrolling required (but requires careful register allocation)
• Tighter code, saving space
• However, there is still prolog and epilog code
• Can we use the same schedule for prolog/epilog?
– Use stage predicates to execute instructions conditionally
– Requires new ISA support (Itanium)
ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
add r42, r41, $c1 mul r45, r43, $c3
st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
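A minimal Python sketch of register rotation, as a toy model rather than Itanium's actual implementation. The class name and the `rrb` (rotating register base) field are illustrative assumptions. It follows the convention visible in the schedule: a value written to r42 in one iteration is read as r43 in the next.

```python
class RotatingRegisterFile:
    """Toy model of rotating registers r32..r127 (96 registers)."""
    FIRST, COUNT = 32, 96

    def __init__(self):
        self.phys = {}   # physical register number -> value
        self.rrb = 0     # rotating register base (assumed name)

    def _physical(self, logical):
        # Map a logical register name to a physical slot using the base.
        return self.FIRST + (logical - self.FIRST + self.rrb) % self.COUNT

    def write(self, logical, value):
        self.phys[self._physical(logical)] = value

    def read(self, logical):
        return self.phys.get(self._physical(logical))

    def rotate(self):
        # Decrementing the base makes every value appear one register higher.
        self.rrb = (self.rrb - 1) % self.COUNT


regs = RotatingRegisterFile()
regs.write(42, 7)            # iteration i writes r42
regs.rotate()                # the loop branch rotates the registers
value_next = regs.read(43)   # the same value is now named r43
print(value_next)
```

Because only the base changes, no data moves between registers; the renaming is what lets one copy of the kernel serve every in-flight iteration.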
65
Predicated Instruction Execution (Prolog i0)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3
4
5
2 6
7
8
3 9
10
11
4 12
13
14
Don’t execute shaded instructions
cc0: only issue ld
cc1: only issue add
cc2: no issue
66
Predicated Prolog (Prolog i1)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46
2 6
7
8
3 9
10
11
4 12
13
14
cc3: ld(i1) & mul(i0)
cc4: add(i1) & mul(i0)
cc5: add(i0)
Note that stage predicates also “rotate” per iteration
67
Predicated Prolog (Prolog i2)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46
2 6 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
8 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
3 9
10
11
4 12
13
14
cc6: ld(i2) & mul(i1)
cc7: add(i2) & mul(i1)
cc8: add(i1) & mul(i0)
68
Predicated Prolog (Prolog i3)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46
2 6 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
8 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
3 9 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
11 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
4 12
13
14
cc9: ld(i3) & mul(i2)
cc10: add(i3) & mul(i2)
cc11: add(i2) & mul(i1)
69
Predicated Kernel (i4)
ite Time Mem Adder Multiplier
0 0 (p16) ld r41, (A)++ add r51, r48, r50 mul r44, r43, $c2
1 (p16) add r42, r41, $c1 mul r45, r43, $c3
2 st r51, (B)++ add r52, r43, r44 mul r48, r53, r46
1 3 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
5 st r51, (B)++ (p17) add r52, r43, r44 mul r48, r53, r46
2 6 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
8 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
3 9 (p16) ld r41, (A)++ add r51, r48, r50 (p17) mul r44, r43, $c2
10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
11 st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
4 12 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
14 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
cc12: ld(i4) & add(i0) & mul(i3)
cc13: add(i4) & mul(i3)
cc14: st(i0) & add(i3) & mul(i2)
(p20) is used in iteration 4, not (p19) because of predicate rotation
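The effect of stage predicates while the pipeline fills can be sketched in Python. The table below groups the kernel operations by their guarding predicate as they appear in the predicated schedule. The function name and the assumption that a new iteration starts on every pass are illustrative.

```python
# Kernel operations grouped by the stage predicate that guards them.
GUARDED_OPS = {
    16: ["ld r41, (A)++", "add r42, r41, $c1"],
    17: ["mul r44, r43, $c2", "mul r45, r43, $c3", "add r52, r43, r44"],
    18: ["mul r48, r53, r46"],
    20: ["add r51, r48, r50", "st r51, (B)++"],
}


def issued_ops(kernel_pass):
    """Ops that execute in the given pass (0-based) while the pipeline fills.

    Assumes a new iteration starts on every pass, so p(16+s) is 1
    exactly when s <= kernel_pass.
    """
    ops = []
    for pred, guarded in sorted(GUARDED_OPS.items()):
        if pred - 16 <= kernel_pass:
            ops.extend(guarded)
    return ops


for k in range(5):
    print(k, issued_ops(k))
```

Pass 0 issues only the load and the first add. Pass 3 adds nothing new because no operation is guarded by p19. Pass 4 is the first pass in which every operation of the kernel executes.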
70
Register Rotation (Kernel)
• Execute many iterations in the kernel …
71
Predicated Epilog (i<-4>)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11
N-10
N-9
-2 N-8
N-7
N-6
-1 N-5
N-4
N-3
0 N-2
N-1
N
72
Predicated Epilog (i<-3>)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-2 N-8
N-7
N-6
-1 N-5
N-4
N-3
0 N-2
N-1
N
73
Predicated Epilog (i<-2>)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-1 N-5
N-4
N-3
0 N-2
N-1
N
74
Predicated Epilog (i<-1>)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-1 N-5 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-3 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
0 N-2
N-1
N
75
Predicated Epilog (final iteration)
ite Time Mem Adder Multiplier
-4 N-14 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-13 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-12 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-3 N-11 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-10 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-9 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-2 N-8 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-7 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-6 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
-1 N-5 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-4 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N-3 (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
0 N-2 (p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
N-1 (p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
N (p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
76
Final Modulo Schedule (Itanium-like)
• Before entering the loop, set p16 = 1 (p16 is the first rotating predicate register)
• When the modulo-scheduled loop branch (e.g. br.ctop) is encountered
– p63 is set to 1 by hardware in the prolog code (see next slide)
– All registers (rotating registers and rotating predicate registers) rotate as each stage (iteration) advances
• Only 3 Itanium instruction bundles (= 3 VLIWs) needed
– No prolog or epilog code
– No modulo variable expansion that stresses registers and blows up code size
mov ar.lc = 196 // loop count
mov ar.ec = 5 // epilog stages+1
mov pr.rot = 0x10000 // special inst: set pr[16]=1 and p[63:17]=0
L1top:
(p16) r41 = (A)++ (p20) r51 = r48 + r50 (p17) r44 = r43 * $c2
(p16) r42 = r41 + $c1 (p17) r45 = r43 * $c3
(p20) (B)++ = r51 (p17) r52 = r43 + r44 (p18) r48 = r53 * r46
br.ctop L1top
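A simplified Python model of how a counted loop branch such as br.ctop could drive the stage predicates with the loop count (LC) and epilog count (EC). The function name, the three-way decision rule, and the small LC/EC values are illustrative assumptions, not a transcription of the architecture manual or of the numbers on the slide.

```python
from collections import deque


def run_ctop_loop(lc, ec, num_stages):
    """Return, for each pass through the kernel, the list of active stages."""
    preds = deque([0] * 48)   # index 0 is p16, index 47 is p63
    preds[0] = 1              # like mov pr.rot = 0x10000: p16 = 1
    trace = []
    while True:
        # Stage s is guarded by predicate p(16+s).
        trace.append([s for s in range(num_stages) if preds[s]])
        if lc > 0:            # more source iterations to start
            lc -= 1
            preds[47] = 1     # p63 = 1 rotates into p16: start an iteration
            taken = True
        elif ec > 1:          # draining the pipeline
            ec -= 1
            preds[47] = 0     # p63 = 0 rotates into p16: start nothing
            taken = True
        else:                 # last pass: fall out of the loop
            ec = 0
            preds[47] = 0
            taken = False
        preds.rotate(1)       # p16 -> p17, ..., p63 -> p16
        if not taken:
            return trace


trace = run_ctop_loop(lc=3, ec=3, num_stages=3)
for n, stages in enumerate(trace):
    print(n, stages)
```

With these toy values the trace fills (stage 0 only, then stages 0 and 1), runs with all three stages active, and then drains (stages 1 and 2, then stage 2 only), all from one copy of the kernel.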
Counted Modulo-scheduled Loop
Stage 0 (Prolog)
[Figure: rotating predicate registers. Values shown: p20=0, p19=0, p18=0, p17=0, p16=1, p63=1, p62=0]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
After the first iteration: LC = 195, EC = 5
Counted Modulo-scheduled Loop
Stage 1 (Prolog)
[Figure: rotating predicate registers. Values shown: p20=0, p19=0, p18=0, p17=0, p16=1, p63=1, p62=0]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 2nd iteration
Counted Modulo-scheduled Loop
Stage 2 (Prolog)
[Figure: rotating predicate registers. Values shown: p20=0, p19=0, p18=0, p17=0, p16=1, p63=1, p62=1, p61=0]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 3rd iteration
Counted Modulo-scheduled Loop
Stage 3 (Prolog)
[Figure: rotating predicate registers. Values shown: p20=0, p19=0, p18=0, p17=0, p16=1, p63=1, p62=1, p61=1, p60=0]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 4th iteration
Counted Modulo-scheduled Loop
Stage 4 (Kernel)
[Figure: rotating predicate registers. Values shown: p20=0, p19=0, p18=0, p17=0, p16=1, p63=1, p62=1, p61=1, p60=1, p59=0]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 5th iteration
In the Kernel
• After another 191 iterations …
Counted Modulo-scheduled Loop
Stage 195 (Kernel)
[Figure: rotating predicate registers. Values shown: p20=1, p19=1, p18=1, p17=1, p16=1, p63=1, p62=1, p61=1, p60=1, p59=1, p58=1, p57=1, p56=1, p55=1]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 196th iteration: LC = 0, EC = 5
Counted Modulo-scheduled Loop
Stage 195 (Kernel)
[Figure: rotating predicate registers. Values shown: p20=1, p19=1, p18=1, p17=1, p16=1, p63=1, p62=1, p61=1, p60=0, p59=1, p58=1, p57=1, p56=1, p55=1]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
After the 196th iteration: EC = 4
Counted Modulo-scheduled Loop
Stage 196 (Epilog)
[Figure: rotating predicate registers. Values shown: p20=1, p19=1, p18=1, p17=1, p16=1, p63=1, p62=1, p61=1, p60=0, p59=1, p58=1, p57=1, p56=1, p55=1]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 197th iteration: EC = 4
Counted Modulo-scheduled Loop
Stage 197 (Epilog)
[Figure: rotating predicate registers. Values shown: p20=1, p19=1, p18=1, p17=1, p16=1, p63=1, p62=1, p61=1, p60=0, p59=0, p58=1, p57=1, p56=1, p55=1]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 198th iteration: EC = 3
Counted Modulo-scheduled Loop
Stage 198 (Epilog)
[Figure: rotating predicate registers. Values shown: p20=1, p19=1, p18=1, p17=1, p16=1, p63=1, p62=1, p61=1, p60=0, p59=0, p58=0, p57=1, p56=1, p55=1]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 199th iteration: EC = 2
Counted Modulo-scheduled Loop
Stage 199 (Epilog)
[Figure: rotating predicate registers. Values shown: p20=1, p19=1, p18=1, p17=1, p16=1, p63=1, p62=1, p61=1, p60=0, p59=0, p58=0, p57=0, p56=1, p55=1]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
Before the 200th iteration (last iteration): EC = 1
Counted Modulo-scheduled Loop
Stage 199 (Epilog)
[Figure: rotating predicate registers. Values shown: p20=1, p19=1, p18=1, p17=1, p16=1, p63=1, p62=1, p61=1, p60=0, p59=0, p58=0, p57=0, p56=0, p55=1]
Mem Adder Multiplier
(p16) ld r41, (A)++ (p20) add r51, r48, r50 (p17) mul r44, r43, $c2
(p16) add r42, r41, $c1 (p17) mul r45, r43, $c3
(p20) st r51, (B)++ (p17) add r52, r43, r44 (p18) mul r48, r53, r46
After the 200th iteration (last iteration): EC = 0
• The “br.ctop” instruction exits the loop
90
Modulo Scheduling Example
Loop {
P = A + B
Q = C + D
X = P x E
Y = P x Q
Z = X + Y
}
Step 1: Data flow graph
[Figure: data flow graph with adder nodes A1, A2, A3 and multiplier nodes M1, M2; inputs A, B, C, D, E; output Z]
91
Modulo Scheduling
Step 2: Generate a list schedule
[Figure: data flow graph with nodes A1, A2, M1, M2, A3, annotated with scheduling numbers (0, 1 and 3 are legible)]
Execution units:
2 Adders – 1 cycle latency
1 Multiplier – 2 cycle latency
92
Modulo Scheduling
Step 2: Generate a list schedule
[Figure: data flow graph with nodes A1, A2, M1, M2, A3, annotated with scheduling numbers (0, 1 and 3 are legible)]
Reservation Table
Time Adder1 Adder2 Mult
0 A1 A2 -
1 - - M1
2 - - M2
3 - - -
4 A3 - -
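A small Python sketch of cycle-by-cycle list scheduling for this example. It assumes the node-to-statement mapping A1: P=A+B, A2: Q=C+D, M1: X=PxE, M2: Y=PxQ, A3: Z=X+Y, and a pipelined multiplier (one new multiply may start each cycle). Both are assumptions chosen to be consistent with the schedule shown, and all names are illustrative.

```python
LATENCY = {"add": 1, "mul": 2}
UNITS = {"add": 2, "mul": 1}   # 2 adders, 1 multiplier (assumed pipelined)

# op name -> (kind, predecessors), listed in priority order
OPS = {
    "A1": ("add", []),
    "A2": ("add", []),
    "M1": ("mul", ["A1"]),
    "M2": ("mul", ["A1", "A2"]),
    "A3": ("add", ["M1", "M2"]),
}


def list_schedule(ops):
    """Greedy list scheduling: each cycle, issue every ready op that fits."""
    start = {}
    cycle = 0
    while len(start) < len(ops):
        free = dict(UNITS)     # issue slots still free in this cycle
        for name, (kind, preds) in ops.items():
            if name in start or free[kind] == 0:
                continue
            # Ready when every predecessor has issued and its result is available.
            ready = all(
                p in start and start[p] + LATENCY[ops[p][0]] <= cycle
                for p in preds
            )
            if ready:
                start[name] = cycle
                free[kind] -= 1
        cycle += 1
    return start


schedule = list_schedule(OPS)
print(schedule)
```

M2 waits until time 2 because the single multiplier accepts M1 at time 1, and A3 waits for M2's two-cycle latency.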
93
Modulo Scheduling
Generating Modulo Schedule:
1. Determine the MII:
$$\mathrm{MII} = \max\left(\frac{N}{C}\right)$$
N: Resource_demand
C: Resource_availability
MII = max[(3/2), (2/1)] = 2
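The resource bound can be checked with a few lines of Python. The function name and dictionary keys are illustrative. The ceiling is an assumption beyond the slide, added because an initiation interval is a whole number of cycles; it does not change the result here.

```python
import math


def resource_mii(demand, availability):
    """Resource-constrained MII: the most heavily loaded resource type decides."""
    return max(math.ceil(demand[r] / availability[r]) for r in demand)


# 3 additions share 2 adders; 2 multiplications share 1 multiplier.
mii = resource_mii(
    demand={"adder": 3, "multiplier": 2},
    availability={"adder": 2, "multiplier": 1},
)
print(mii)
```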
94
Modulo Scheduling
Mapping from list schedule to modulo schedule
List schedule
Time Adder1 Adder2 Mult
0 A1 A2 -
1 - - M1
2 - - M2
3 - - -
4 A3 - -
Modulo schedule for 1 iteration
Time Modulo 2 Adder1 Adder2 Mult
0 0 A1 A2 -
1 1 - - M1
2 0 - - M2
3 1 - - -
4 0 - - -
5 1 A3 - -
6 0 - - -
A3 moves from time 4 to time 5: at modulo slot 0 both adders are already reserved by A1 and A2.
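A minimal Python sketch of the mapping step: each operation keeps its list-schedule time unless its modulo slot is already full, in which case it slides to the next free slot. This is a simplified greedy placement and not a full iterative modulo scheduler. It does not re-check dependences after a delay, which happens to be safe for this example. All names are illustrative.

```python
II = 2
UNITS = {"add": 2, "mul": 1}

# (op, kind, list-schedule time), in list-schedule order
LIST_SCHEDULE = [
    ("A1", "add", 0),
    ("A2", "add", 0),
    ("M1", "mul", 1),
    ("M2", "mul", 2),
    ("A3", "add", 4),
]


def modulo_place(list_schedule, ii, units):
    """Place each op at the earliest time >= its list time with a free modulo slot."""
    used = {(kind, slot): 0 for kind in units for slot in range(ii)}
    placed = {}
    for op, kind, t in list_schedule:
        for delay in range(ii):
            slot = (t + delay) % ii
            if used[(kind, slot)] < units[kind]:
                used[(kind, slot)] += 1
                placed[op] = t + delay
                break
        else:
            raise ValueError(f"no free slot for {op}; increase II")
    return placed


placed = modulo_place(LIST_SCHEDULE, II, UNITS)
print(placed)
```

Only A3 is delayed: both adder entries at modulo slot 0 are taken by A1 and A2, so it lands at time 5 (modulo slot 1).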
95
Modulo Scheduling
Time Modulo 2 Adder1 Adder2 Mult
0 0 1:A1 1:A2 -
1 1 - - 1:M1
2 0 2:A1 2:A2 1:M2
3 1 - - 2:M1
4 0 - - 2:M2
5 1 1:A3 - -
6 0 - - -
7 1 2:A3 - -
8 0 - - -
Inserting iteration 2
96
Modulo Scheduling
Time Modulo 2 Adder1 Adder2 Mult
0 0 1:A1 1:A2 -
1 1 - - 1:M1
2 0 2:A1 2:A2 1:M2
3 1 - - 2:M1
4 0 3:A1 3:A2 2:M2
5 1 1:A3 - 3:M1
6 0 - - 3:M2
7 1 2:A3 - -
8 0 - - -
9 1 3:A3 - -
Inserting iteration 3
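The overlapped table can be reproduced by shifting the one-iteration modulo schedule by II = 2 cycles per iteration. A short Python sketch with illustrative names, taking the per-operation times from the tables above:

```python
II = 2
# Issue time of each op within one iteration of the modulo schedule.
ONE_ITERATION = {"A1": 0, "A2": 0, "M1": 1, "M2": 2, "A3": 5}


def overlapped(n_iterations):
    """Map each time step to the 'iteration:op' entries issued at that step."""
    table = {}
    for it in range(1, n_iterations + 1):
        offset = (it - 1) * II   # iteration k starts (k - 1) * II cycles later
        for op, t in ONE_ITERATION.items():
            table.setdefault(t + offset, []).append(f"{it}:{op}")
    return table


table = overlapped(3)
for t in sorted(table):
    print(t, t % II, table[t])
```

Times 4 and 5 already contain work from all three iterations, which is the steady-state pattern that becomes the kernel.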
97
Modulo Scheduled Loop
[Figure: overlapped schedule divided into prolog, 5x kernel, and epilog]
MRT
Modulo 2 Adder1 Adder2 Mult
0 3:A1 3:A2 2:M2
1 1:A3 - 3:M1