UPC
MICRO-35, Istanbul
Nov. 2002
Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor
Enric Gibert (1)
Jesús Sánchez (1,2)
Antonio González (1,2)
(1) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona
Motivation
Capacity-bound vs. communication-bound
Clustered microarchitectures
– Simpler + faster
– Power consumption
– Communications not homogeneous
Clustering in the embedded/DSP domain
Clustered Microarchitectures
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with its own register file (Reg. File) and functional units (FUs). Other labels: register-to-register communication buses, memory buses, L1 cache, L2 cache.]
GOAL: distribute the memory hierarchy!!!
Contributions
Distribution of data cache:
– Interleaved cache clustered VLIW processor
Hardware enhancement:
– Attraction Buffers
Effective instruction scheduling techniques
– Modulo scheduling
– Loop unrolling + smart assignment of latencies + padding
Talk Outline
– MultiVLIW
– Interleaved-cache clustered VLIW processor
– Instruction scheduling algorithms and techniques
– Hardware enhancement: Attraction Buffers
– Simulation framework
– Results
– Conclusions
MultiVLIW
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and its own cache module, connected by register-to-register communication buses, with the L2 cache below. A cache block is stored in a cache module as TAG+STATE+DATA; one such entry is shown in each of the four modules.]
Cache-Coherence Protocol!!!
Interleaved Cache
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with a register file, functional units and a cache module, connected by register-to-register communication buses, with the L2 cache below.]
Cache block in L2: TAG W0 W1 W2 W3 W4 W5 W6 W7
Subblocks, one per cache module:
– subblock 1: TAG W0 W4
– subblock 2: TAG W1 W5
– subblock 3: TAG W2 W6
– subblock 4: TAG W3 W7
Access types: local hit, remote hit, local miss, remote miss
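The word-interleaved mapping on this slide can be sketched in a few lines of C. This is a minimal illustration, assuming 4 clusters, a 4-byte interleaving factor (the value listed in the evaluation setup) and 0-based cluster numbers, so CLUSTER 1 is index 0. The names home_cluster and is_local_access are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CLUSTERS 4   /* N: clusters, one cache module each */
#define INTERLEAVE   4   /* I: interleaving factor in bytes (one word) */

/* Cluster (0-based) whose cache module holds the word containing addr. */
static unsigned home_cluster(uintptr_t addr) {
    return (unsigned)((addr / INTERLEAVE) % NUM_CLUSTERS);
}

/* An access is local when the issuing cluster is the word's home cluster. */
static int is_local_access(uintptr_t addr, unsigned issuing_cluster) {
    assert(issuing_cluster < NUM_CLUSTERS);
    return home_cluster(addr) == issuing_cluster;
}
```

With this mapping, words W0 and W4 of a 32-byte block (byte offsets 0 and 16) both land in the first module, matching subblock 1 in the figure.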
BASE Scheduling Algorithm
[Flowchart:
START → Sort nodes → Next node → Select possible clusters → HowMany?
– 0: II=II+1 and start again
– >0: Best profit in output edges → HowMany?
  • 1: Schedule it
  • >1: Least loaded → Schedule it
Schedule it:
– successful: Next node
– not successful: II=II+1 and start again]
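The cluster choice for one node in the flowchart can be sketched as follows. This is only an illustration under assumptions: the per-cluster feasibility flags, the "profit in output edges" values and the load counters are taken as given inputs, and the function name choose_cluster is hypothetical. How the talk's algorithm actually computes profit and load is not stated on the slide.

```c
#include <assert.h>

#define NUM_CLUSTERS 4

/* Pick a cluster for the current node.
 * feasible[c] != 0 means the node can be placed in cluster c at this II.
 * Among feasible clusters, the highest profit wins; ties go to the
 * least loaded cluster. Returns -1 when no cluster is feasible, in
 * which case the caller would increase II and restart. */
static int choose_cluster(const int feasible[NUM_CLUSTERS],
                          const int profit[NUM_CLUSTERS],
                          const int load[NUM_CLUSTERS]) {
    int best = -1;
    for (int c = 0; c < NUM_CLUSTERS; c++) {
        if (!feasible[c])
            continue;
        if (best < 0 ||
            profit[c] > profit[best] ||
            (profit[c] == profit[best] && load[c] < load[best]))
            best = c;
    }
    assert(best < NUM_CLUSTERS);
    return best;
}
```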
Scheduling Algorithm
For word-interleaved cache clustered processors
Scheduling steps:
1. Loop unrolling
2. Assignment of latencies to memory instructions
   – Longer latencies: less stall time + more compute time
   – Shorter latencies: more stall time + less compute time
3. Order instructions (DDG nodes)
4. Cluster assignment and scheduling
STEP 1: Loop Unrolling
Layout of a[] across the cache modules:
– CLUSTER 1: a[0] a[4]
– CLUSTER 2: a[1] a[5]
– CLUSTER 3: a[2] a[6]
– CLUSTER 4: a[3] a[7]
Original loop (25% local accesses):
  for (i=0; i<MAX; i++) {
    ld r3, a[i]
    r4 = OP(r3)
    st r4, b[i]
  }
Unrolled loop (100% local accesses):
  for (i=0; i<MAX; i+=4) {
    ld r31, a[i]    (stride 16 bytes)
    ld r32, a[i+1]  (stride 16 bytes)
    ld r33, a[i+2]  (stride 16 bytes)
    ld r34, a[i+3]  (stride 16 bytes)
    ...
  }
Selective unrolling:
• No unrolling
• UnrollxN
• OUF unrolling
Optimum Unrolling Factor (OUF): strides multiple of NxI
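One way to obtain an unrolling factor that makes a stride a multiple of NxI is sketched below. It is an assumption-based illustration for a single memory instruction with a known constant stride; the talk does not spell out how the OUF is computed for a whole loop, and the function names are hypothetical.

```c
#include <assert.h>

/* Greatest common divisor (Euclid's algorithm). */
static unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Smallest unroll factor U such that U * stride_bytes is a multiple
 * of N x I (number of clusters times interleaving factor). */
static unsigned unroll_factor_for_stride(unsigned stride_bytes,
                                         unsigned num_clusters,
                                         unsigned interleave_bytes) {
    unsigned nxi = num_clusters * interleave_bytes;
    assert(stride_bytes > 0 && nxi > 0);
    return nxi / gcd_u(nxi, stride_bytes);
}
```

For the loop on this slide (4-byte stride, N=4, I=4 bytes) the sketch yields a factor of 4, which gives each unrolled load the 16-byte stride shown above.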
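STEP 2: Latency Assignment
[Figure: data dependence graph with two recurrences; edges are marked as memory dependences or register-flow deps.
– REC1 (distance=1): n1 load, n2 load, n3 add, n4 store, n5 sub
– REC2 (distance=1): n6 load, n7 div, n8 add]
LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles
STEP 1
Load  Latency change  II   stall  B
n1    To LM           5    1      5
      To RH           10   3      3.3
      To LH           14   6.8    2.06
n2    To LM           5    0.25   20
      To RH           10   0.75   13.3
      To LH           14   2.95   4.75
STEP 2
Load  Latency change  II   stall  B
n1    To LM           5    1      5
      To RH           10   3      3.3
      To LH           14   6.8    2.06
n2    To LM           -    -      -
      To RH           5    0.5    10
      To LH           9    2.7    3.3
Latency annotations on the graph, in animation order:
– Non-memory instructions: L=1, L=1, L=1, L=8, L=1
– All loads at L=15: MII=33, MII=22
– One load lowered to L=10: MII=28, MII=22
– Same load lowered to L=5: MII=23, MII=22
– Final assignment: L=5, L=1, L=1, with MII=9 and MII=10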
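The B column in the tables on this slide is consistent with the ratio of the II change to the stall-time change (for example 5 / 0.25 = 20 for n2 "To LM"), and the sequence MII=33, 28, 23 matches picking the candidate with the largest B at each step. The sketch below illustrates that selection under those assumptions; the struct and function names are hypothetical, and treating a zero stall change as an unbounded benefit follows the "MAX" entry on the latency-assignment question slide.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* One candidate latency reduction for a load. */
struct candidate {
    const char *load;     /* e.g. "n2" */
    const char *target;   /* "LM", "RH" or "LH" */
    double ii_change;     /* value in the II column */
    double stall_change;  /* value in the stall column */
};

/* Benefit as a ratio: II change per unit of stall-time change. */
static double benefit(const struct candidate *c) {
    assert(c->ii_change >= 0.0 && c->stall_change >= 0.0);
    if (c->stall_change == 0.0)
        return INFINITY;  /* no stall penalty at all */
    return c->ii_change / c->stall_change;
}

/* Index of the candidate with the largest benefit, or -1 if n == 0. */
static int best_candidate(const struct candidate *cands, size_t n) {
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (best < 0 || benefit(&cands[i]) > benefit(&cands[best]))
            best = (int)i;
    return best;
}
```

Fed with the six STEP 1 rows, the sketch selects n2 "To LM"; fed with the five remaining STEP 2 rows, it selects n2 "To RH".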
Scheduling Restrictions
[Figure: CLUSTER 1 (cache module holding a[0], a[4]), CLUSTER 2, CLUSTER 3 and CLUSTER 4 (cache module holding a[3], a[7]), connected through memory buses to the NEXT MEMORY LEVEL.]
Schedule (one column per cluster):
cycle i     -                -   -   store to a[0]
cycle i+1   -                -   -   -
cycle i+2   -                -   -   -
cycle i+3   load from a[0]   -   -   -
NON-DETERMINISTIC BUS LATENCY!!!
STEPS 3 and 4
Step 3: Order instructions
Step 4: Cluster assignment and scheduling
– Non-memory instructions: same as BASE
  • Minimize register communications + maximize workload balance
– Memory instructions:
  • Memory instructions in the same chain go to the same cluster
  • IPBC (Interleaved Preferred Build Chains)
    – Average “preferred cluster” of the chain
    – Padding (to an NxI boundary) gives meaningful preferred cluster information
      » Stack frames
      » Dynamically allocated data
  • IBC (Interleaved Build Chains)
    – Minimize register communications of the 1st instruction of the chain
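Memory Dependent Chains
[Figure: the same data dependence graph (n1 load, n2 load, n3 add, n4 store, n5 sub; n6 load, n7 div, n8 add; distance=1 on both recurrences), with memory dependences and register-flow deps. marked.]
Preferred clusters annotated on the memory instructions: Preferred = 1, Preferred = 1, Preferred = 2, Preferred = 2
LH=1 cycle, RH=5 cycles, LM=10 cycles, RM=15 cycles
Assigned latencies: L=1, L=1, L=1, L=8, L=1, L=5, L=1, L=1
Instr.   IPBC        IBC
n1, n2   cluster 1   same as n4
n4       cluster 1   minimize register communications
n6       cluster 2   minimize register communications
order={n5, n4, n3, n2, n1, n8, n7, n6}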
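The IPBC idea of giving a whole memory dependent chain one "average" preferred cluster can be sketched as a simple vote. This is one possible reading, not the talk's definition: it assumes each memory instruction already has a single preferred cluster (0-based here) and takes the most frequent one, breaking ties toward the lowest index. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_CLUSTERS 4

/* preferred[i] is the preferred cluster (0-based) of the i-th memory
 * instruction in the chain. Returns the cluster preferred by the most
 * instructions; ties are resolved toward the lowest cluster index. */
static int chain_preferred_cluster(const int *preferred, size_t n) {
    int votes[NUM_CLUSTERS] = {0};
    int best = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        assert(preferred[i] >= 0 && preferred[i] < NUM_CLUSTERS);
        votes[preferred[i]]++;
    }
    for (int c = 1; c < NUM_CLUSTERS; c++)
        if (votes[c] > votes[best])
            best = c;
    return best;
}
```

If the chain n1, n2, n4 carries the preferences 1, 1, 2 (an assumed reading of the figure), the vote places the chain in cluster 1, as in the IPBC row, while n6 alone stays in cluster 2.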
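Attraction Buffers
Cost-effective mechanism to increase local accesses
[Figure: four clusters with word-interleaved cache modules (CLUSTER 1: a[0] a[4]; CLUSTER 2: a[1] a[5]; CLUSTER 3: a[2] a[6]; CLUSTER 4: a[3] a[7]). One cluster has an Attraction Buffer (ABuffer) that ends up holding a[3] and a[7].]
  ld r3, a[3]
  ld r3, a[7]
  ...
  (stride 16 bytes)
Local accesses = 0% without the Attraction Buffer
Local accesses = 50% with the Attraction Buffer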
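The 0% versus 50% figures on this slide can be reproduced with a small model. It rests on assumptions that the slide only suggests: a remote access copies the whole remote subblock (the words of one 32-byte block owned by one module) into the issuing cluster's Attraction Buffer, and later accesses to that subblock count as local. The buffer size, FIFO replacement and all names are hypothetical.

```c
#include <assert.h>

#define NUM_CLUSTERS 4    /* N */
#define INTERLEAVE   4    /* I, in bytes */
#define BLOCK_BYTES  32   /* cache block size from the evaluation setup */
#define AB_ENTRIES   4    /* hypothetical buffer size, not from the talk */

struct abuffer {
    int           valid[AB_ENTRIES];
    unsigned long block[AB_ENTRIES];  /* cache block number of the subblock */
    unsigned      home[AB_ENTRIES];   /* cluster whose module owns it */
    unsigned      next;               /* FIFO replacement position */
};

/* Returns 1 if the access is served locally (own module or buffer),
 * 0 if it is remote. A remote access fills the buffer when use_ab is set. */
static int access_is_local(struct abuffer *ab, unsigned cluster,
                           unsigned long addr, int use_ab) {
    unsigned home = (unsigned)((addr / INTERLEAVE) % NUM_CLUSTERS);
    unsigned long block = addr / BLOCK_BYTES;
    if (home == cluster)
        return 1;
    if (!use_ab)
        return 0;
    for (unsigned e = 0; e < AB_ENTRIES; e++)
        if (ab->valid[e] && ab->block[e] == block && ab->home[e] == home)
            return 1;
    ab->valid[ab->next] = 1;
    ab->block[ab->next] = block;
    ab->home[ab->next] = home;
    ab->next = (ab->next + 1) % AB_ENTRIES;
    return 0;
}

/* Percentage of local accesses for `count` loads with a fixed stride. */
static unsigned local_percent(unsigned cluster, unsigned long start,
                              unsigned long stride, unsigned count,
                              int use_ab) {
    struct abuffer ab = {{0}, {0}, {0}, 0};
    unsigned local = 0;
    assert(count > 0);
    for (unsigned i = 0; i < count; i++)
        local += (unsigned)access_is_local(&ab, cluster,
                                           start + i * stride, use_ab);
    return 100u * local / count;
}
```

Calling local_percent(0, 12, 16, 8, use_ab) models the first cluster loading a[3], a[7], a[11], ... with 4-byte elements: every second load hits the subblock fetched by the previous one when the buffer is enabled.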
Evaluation Framework
IMPACT C compiler
Mediabench benchmark suite
Benchmark   Profile      Execution
epicdec     test_image   titanic
epicenc     test_image   titanic
g721dec     clinton      S_16_44
g721enc     clinton      S_16_44
gsmdec      clinton      S_16_44
gsmenc      clinton      S_16_44
jpegdec     testimg      monalisa
jpegenc     testimg      monalisa
mpeg2dec    mei16v2      tek6
pegwitdec   pegwit       techrep
pegwitenc   pgptest      techrep
pgpdec      pgptext      techrep
pgpenc      pgptest      techrep
rasta       ex5_c1       ex5_c1
Evaluation Framework
                      Unified cache   MultiVLIW   Interleaved cache
# clusters            4
Functional units      1 FP / cluster + 1 integer / cluster + 1 memory / cluster
Register buses        4 buses running at ½ the core freq.
Cache configuration   8KB, 2-way set-associative, 32 byte blocks; L2 always hits
Cache latencies       Hit=5           Hit=1       Local Hit=1, Remote Hit=5
                      Miss=14         Miss=10     Local Miss=10, Remote Miss=15
Algorithm             BASE            IBC         IPBC + IBC
Interleaving factor   -               -           4 bytes
Local Accesses
[Figure: stacked bar chart of memory accesses (0% to 100%), broken down into local hits, remote hits, local misses, remote misses and combined, for epicdec, gsmdec, jpegenc, pgpenc, rasta and AMEAN. Bars per benchmark: Base, OUF, OUF+P, OUF+P+NC.]
OUF = Optimum UF, P = Padding, NC = No Chains
Why Remote Accesses?
Double precision accesses (mpeg2dec)
Unclear “preferred cluster” information
• Indirect accesses (e.g. a[b[i]]) (jpegdec, jpegenc, pegwitdec, pegwitenc)
• Different alignment (epicenc, jpegdec, jpegenc)
• Strides not multiple of NxI (selective unrolling, …)
Memory dependent chains (epicdec, pgpdec, pgpenc, rasta)
Example loop:
  for (k=0; k<MAX; k++) {
    for (i=k; i<MAX; i++)
      load a[i]
  }
Stall Time
[Figure: stacked bar chart of normalized stall time (0 to 1.2), broken down into remote hit, local misses, remote misses and combined, for epicdec, gsmdec, jpegdec, pgpenc, rasta and AMEAN. Bars per benchmark: IBC, IBC+AB, IPBC, IPBC+AB.]
Cycle Count Results
[Figure: stacked bar chart of the normalized number of cycles (0 to 1.6), split into compute time and stall time, for epicdec, gsmdec, jpegdec, pgpenc, rasta and AMEAN. Bars per benchmark: multiVLIW, IPBC+AB, IBC+AB, Unified.]
Conclusions
Interleaved cache clustered VLIW processor
Effective instruction scheduling techniques
– Smart assignment of latencies
– Loop unrolling + padding (27% increase in local hits)
Source of remote accesses and stall time
Attraction Buffers (stall time reduced by up to 34%)
Cycle count results:
– Compared with MultiVLIW: 7% slowdown, but simpler hardware
– Compared with unified cache: 11% speedup
Questions?
Question: Latency Assignment
MII(REC1)=20, MII(DDG)=10
Node  II   stall  B(ratio)  B(subtract)
n1    15   4      3.75      11
n2    10   5      2         5
n3    5    1      5         4
n4    5    1      5         4
n5    10   0      MAX       10
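Question: Padding
#include <stdlib.h>
#define MAX 1024  /* assumed value; MAX is not defined on the slide */
void foo(int *array, int *accum) {
  int i;
  *accum = 0;
  for (i=0; i<MAX; i++)
    *accum += array[i];
}
int main(void) {
  int *a, value;
  a = calloc(MAX, sizeof(int));  /* heap data, zeroed so the sum is well defined */
  if (a == NULL) return 1;
  foo(a, &value);
  free(a);
  return 0;
}
[Figure: layout across the cache modules.
– CLUSTER 1: a[0] a[4] ...
– CLUSTER 2: accum, a[1] a[5] ...
– CLUSTER 3: a[2] a[6] ...
– CLUSTER 4: a[3] a[7] ...]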
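Padding, as used in this example, places the start of a stack frame or heap object on an NxI boundary so that the compiler knows a[0] falls in the first cache module and can derive the preferred cluster of every a[i]. The helper below is a minimal sketch of that arithmetic, assuming byte addresses and NxI = 16 bytes (4 clusters, 4-byte interleaving); the function name is hypothetical and the talk does not show how the padding is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes to skip so that an object placed at addr starts on the next
 * N x I boundary (0 if addr is already aligned). */
static uintptr_t padding_bytes(uintptr_t addr, uintptr_t nxi) {
    assert(nxi > 0);
    return (nxi - addr % nxi) % nxi;
}
```

For instance, an object that would start at address 0x1004 needs 12 bytes of padding to begin at 0x1010, the next 16-byte boundary.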
Question: Coherence
Memory Dependent Chains
– Modified data
  • Present in only one Attraction Buffer
– Data present in multiple Attraction Buffers
  • Replicated in read-only manner
– Local scheduling technique
  • At end of loop, flush the Attraction Buffers’ contents
[Figure: CLUSTER 1 to CLUSTER 4, each with an ABuffer; a[2] appears in clusters 1, 2 and 4.]