a low energy set-associative i-cache with extended btb
DESCRIPTION
A Low Energy Set-Associative I-Cache with Extended BTB. K. Inoue, V. Moshnyaga, and K. Murakami. Introduction. Increase in cache size. Power consumed in on-chip caches. DEC 21164 CPU*. StrongARM SA-110 CPU*. Bipolar ECL CPU**. 50%. 25%. 43%. - PowerPoint PPT PresentationTRANSCRIPT
A Low Energy Set-Associative I-Cache with Extended BTB
A Low Energy Set-Associative I-Cache with Extended BTB
K. Inoue, V. Moshnyaga, and K. MurakamiK. Inoue, V. Moshnyaga, and K. Murakami
Introduction
DEC 21164 CPU* StrongARM SA-110 CPU* Bipolar ECL CPU**
25% 43% 50%* Kamble et. al., “Analytical energy Dissipation Models for Low Power Caches”, I S LPED’97** Joouppi et. al., “A 300-MHz 115-W 32-b Bipolar ECL Microprocessor” ,IEEE Journal of Solid-State Circuits’93
Increase in cache size
Power consumed in on-chip caches
Problem of Conventional Caches
Cy
cle
1C
yc
le 2
Phased CachePhased Cache
Low energy consumption but Low energy consumption but Slow accessSlow access
Low energy consumption but Low energy consumption but Slow accessSlow access
way0 way1 way2 way3
Conventional CacheConventional Cache
First access but First access but High energy consumptionHigh energy consumption
First access but First access but High energy consumptionHigh energy consumption
Tag Line
Cy
cle
1
Parallel search strategy produces unnecessary way activation!
Attempts to reduce cache-access energy without performance degradation
Reuses tag-check results to eliminate unnecessary way activation
Can achieve 62% of energy reduction with only 0.2% of performance degradation
History-Based Tag-Comparison I-CacheHistory-Based Tag-Comparison I-Cache
Our Proposal
TagCheck!
TagCheck!
TagCheck!
TagCheck!
TagCheck!
Cache is updated only when replacement occurs!A loaded data stays at the same location at least until the next cache-miss takes place
Conventional Tag-Check Scheme
Programs include a lot of loops!A number of instructions are executed repetitively
Ref. A Ref. A Ref. A Ref. A Ref. A
Inst. A is referenced N times
Miss! Miss!
Cache-miss intervalCache-miss interval
time
Cache is stable!
Completely the same tag-check result!Completely the same tag-check result!
Attempts to reuse tag-check results produced before during a cache-miss interval!
Attempts to reuse tag-check results produced before during a cache-miss interval!
History-Based Tag-Comparison (HBTC) Scheme
The target instruction has been referenced before, and
No cache miss has occurred since the previous reference.
TagCheck! Reuse!Reuse!
Ref. A Ref. AMiss! Miss!
Cache-miss intervalCache-miss interval
time
2. If a cache miss occurs, then we invalidate all the stored tag-check results
1. Execute an instruction A at time T• Perform tag check• Save the tag-check result
into extended BTB
way0 way1 way2 way3
[way2] is the Hit-way!
Index
Concept of the HBTC Cache
3. Execute the instruction A at time T+X
• Reuse the tag-check result to activate only the hit-way’s data sub-array
• Reuse the tag-check result to activate only the hit-way’s data sub-array
way0 way1 way2 way3
Index
[way2] is the Hit-way!
Conventional VS. Phased VS. HBTC
ConventionalConventional
way0 way1 way2 way3Tag Line
Cy
cle
1
way0 way1 way2 way3Tag Line
Cy
cle
1
PhasedPhased
Cy
cle
1C
yc
le 2
Cy
cle
1C
yc
le 2
Cy
cle
1C
yc
le 1
Cy
cle
1C
yc
le 1
HBTCHBTC
Cy
cle
1C
yc
le 1
Cac
he H
itC
ache
Hit
Cac
he M
iss
Cac
he M
iss
Reu
seR
euse
No
Reu
seN
o R
euse
HBTC SA I-$ ArchitecturePBAregPBAreg
Branch-Inst. Addr. Target Addr.
Branch-Inst. Addr. Target Addr.
BTB (Branch Target Buffer) NotTakenPC
Tag Check Result
WPvalid
I-Cache
Miss?
valid flag n of way pointers
Taken
Branch Inst. Addr. Pred.Result Address for
writing
WP Recode Reg.
WP Recode Reg.
WP TableWP Table
WP Reg.WP Reg.
Branch Prediction Result
Mode Controller
Mode Controller
Entry of the WP Table
Mode
HBTC I-$ OperationNormal Mode (NM): w/ Tag checksOmitting Mode (OM): w/o Tag checks (Reuse)Tracing Mode (TM): w/ Tag checks (tag-check results are preserved into the WPRreg, and a
re stored into the WP-table on the next BTB hit )
Normal Mode (NM): w/ Tag checksOmitting Mode (OM): w/o Tag checks (Reuse)Tracing Mode (TM): w/ Tag checks (tag-check results are preserved into the WPRreg, and a
re stored into the WP-table on the next BTB hit )
OM
NM TM
BTB HitGOtoNM
GOtoNM
GOtoNMI-Cache miss orBTB replacement orRAS access orBranch misprediction
Mode Transition
All WPs are invalidated!
Valid
InvalidPC and Pred.-result PBAreg
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Inst. Addr. A
Inst. Addr. B
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T NPC
From I-Cache
4-way I-Cache
WPreg0 1 2 3
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Taken
T N
From I-Cache
4-way I-Cache
WPreg
Inst. Addr. A
Inst. Addr. BPCA
0 1 2 3
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Taken
T N
From I-Cache
4-way I-Cache
WPreg
Inst. Addr. A
Inst. Addr. BPCA
NO valid WPs are detected!
0 1 2 3
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
From I-Cache
4-way I-Cache
WPregNO valid WPs are detected!
Inst. Addr. A
Inst. Addr. BPCA
A T
PC and Branch prediction result are saved!
0 1 2 3
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
4-way I-Cache
WPreg
Inst. Addr. A
Inst. Addr. BPC
A T
Conventional Accesses!
Tag-Comparison result is stored into the WPRreg!
0 1 2 3
1
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
4-way I-Cache
WPreg
Inst. Addr. A
Inst. Addr. BPC
A T
Conventional Accesses!
Tag-Comparison result is stored into the WPRreg!
0 1 2 3
3
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
4-way I-Cache
WPreg
Inst. Addr. A
Inst. Addr. BPC
A T
Conventional Accesses!
Tag-Comparison result is stored into the WPRreg!
0 1 2 3
0
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
From I-Cache
4-way I-Cache
WPreg
Inst. Addr. A
Inst. Addr. BPC
A T
The WPRreg is stored into the WP-Table entry pointed by the PBAreg!
BTB Hit!B
0 1 2 3
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
T NA
From I-Cache
4-way I-Cache
WPreg
Inst. Addr. A
Inst. Addr. BPC
Taken0 1 2 3
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
T N
From I-Cache
4-way I-Cache
WPregValid WPs are detected!
0 1 2 3
Inst. Addr. A
Inst. Addr. BPCA
Taken
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
From I-Cache
4-way I-Cache
WPreg
Tag-Comparison Reuse
0 1 2 31
Inst. Addr. A
Inst. Addr. BPC
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
From I-Cache
4-way I-Cache
WPreg
Tag-Comparison Reuse
0 1 2 33
Inst. Addr. A
Inst. Addr. BPC
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
From I-Cache
4-way I-Cache
WPreg
Tag-Comparison Reuse
0 1 2 30
Inst. Addr. A
Inst. Addr. BPC
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
From I-Cache
4-way I-Cache
WPreg
No valid WPs in the WPreg!
0 1 2 3?
Inst. Addr. A
Inst. Addr. BPC
HBTC I-$ Operation Example
OM
NM TM
BTB Hit
GOtoNM
GOtoNM
Mode TransitionValid
Invalid
Target Addr.
Target Addr.
Branch Target Buffer
PBAreg
WP Table
WPRreg
Mode Controller
Pred. (T or N)
T N
From I-Cache
4-way I-Cache
WPreg
Conventional Accesses!
0 1 2 3
Inst. Addr. A
Inst. Addr. BPC
Advantages and Disadvantages
☺Eliminate unnecessary energy consumption w/o performance degradation (during OM)!
way0 way1 way2 way3
Normal Mode (NM) / Tracing Mode (TM)way0 way1 way2 way3
Omitting Mode (OM)
☹BTB energy overhead due to WP-table read-accesses☹BTB access conflict for invalidating all WPs (ca
uses 1 stall cycle)☹BTB access conflict to record WP
s (causes 1 stall cycle)
Evaluation – Environment –• OOO simulation by SimpleScalar
16 KB 4-way I-cache (32 B line size)For others, default parameters were used
• Cache Energy Model based on [Kamble97](including the WP-table read-energy overhead)
• Assume that the BTB is accessed only when branch or jump instructions are executed (instructions are pre-decoded)
099.go, 124.m88ksim, 126.gcc, 129.compress,130.li, 132.ijpeg
102.swim
mpeg2encode, mpeg2decode, adpcm_enc, adpcm_dec
SPECint95
SPECfp95
Mediabench
Benchmark Programs
[Kamble97] M.B.Kamble and K.Ghose, ”Analytical Energy Dissipation Models For Low Power Caches,” ISLPED97
Evaluation – Energy and Performance –
62%62% 0.2%0.2%
099.go 126.gcc 130.li 102.swim adpcm(d) mpeg2(d) 124.m88ksim 129.comp. 132.ijpeg adpcm(e) mpeg2(e)
1.2
1.0
0.8
0.6
0.4
0.2
0.0
1.2
1.0
0.8
0.6
0.4
0.2
0.0
No
rmal
ized
En
erg
y (J
ou
le)
No
rmal
ized
En
erg
y (J
ou
le) N
orm
alized E
xe. Tim
e (cycle)N
orm
alized E
xe. Tim
e (cycle)
# of WPs = 4
62% of Ecache reduction with 0.2% of Exe. Time increase Even if in the worst case, about 20% of Ecache reduction
Evaluation – Effect of WP invalidation penalty –
0
0.5
1
1.5
2
2.5
3
1 2 4 8 16 32No
rm. E
xe
. Tim
e (
cyc
le)
WP Invalidation Penalty (cycle)
126.gcc126.gcc
099.go099.go
mpeg2(d)mpeg2(d) 132.ijpeg132.ijpeg
Cache Miss Cache Miss PenaltyPenalty
If the penalty is equal to or smaller than 4 clock cycles, the performance overhead is trivial.
The performance overhead grows after the penalty is more than 4 clock cycles.
Evaluation – Effect of The Number of WPs –
Increasing the number of WPs makes it possible to reuse many tag-check results
But, it produces BTB access energy overhead
1.0
0.8
0.6
0.4
0.2
0.0
No
rmal
ized
En
erg
y (J
ou
le)
Energy Overhead of BTB
1 2 4 8 16 32 1 2 4 8 16 32# of Way Pointer
w/ Pre-Decodingw/ Pre-Decoding
126.gcc
w/o Pre-Decodingw/o Pre-DecodingEnergy for Cache Access1.2
Evaluation – Effect of Cache Associativity –
Conv.: Ecache grows with the increase in assiciativity HBTC: Ecache is reduced with the increase in associativity (n<=
4), after that, It starts to increase (n>4)
0.E+001.E+062.E+063.E+064.E+065.E+066.E+067.E+068.E+06
En
erg
y (J
ou
le) Eothers
EtagEdata,blEdata,prectl
ConventionalConventional HBTCHBTC
Associativity
mpeg2decode
1 2 4 8 16 32 64 1 2 4 8 16 32 64
Conclusions
1. Recodes tag-check results generated by the I-cache into the extended BTB
2. Attempts to reuse them in order to eliminate unnecessary way activation
3. Achieves 62% of I-cache energy reduction with only 0.2% of performance degradation!
History-Based Tag-Comparison Instruction Cache
Future work• Analyze energy consumption based on real chip
design.
Buck Up Slides(History-based Tag-Comparison Cache)
099.go 126.gcc 130.li 102.swim adpcm(d) mpeg2(d) 124.m88ksim 129.comp. 132.ijpeg adpcm(e) mpeg2(e)
0.1
0.0
No
rmal
ized
Tag
-Co
mp
are
Co
un
t
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Interline Sequential approachHistory-Based Look-up CacheCombination of IS and HBL
Evaluation – Comparison with IS Approach –
0.E+00
5.E+06
1.E+07
2.E+07
2.E+07
3.E+07
3.E+07
En
erg
y (J
ou
le)
Eothers EtagEdata,blEdata,prectl
Conventional HBL Cache
1 2 4 8 16 32 64
Associativity
099.go
0.8um CMOS* **
*) M.B.Kamble and K.ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10 th Int. Conf. On VLSI Design**) S.J.E.Wilton and N.P.Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
1 2 4 8 16 32 64
Evaluation – Effects of Cache Associativity –
0.E+00
1.E+07
2.E+07
3.E+07
4.E+07
5.E+07
6.E+07
7.E+07
En
erg
y (J
ou
le)
Eothers EtagEdata,blEdata,prectl
Conventional HBL Cache
Associativity
126.gcc
0.8um CMOS* **
*) M.B.Kamble and K.ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10 th Int. Conf. On VLSI Design**) S.J.E.Wilton and N.P.Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
1 2 4 8 16 32 64 1 2 4 8 16 32 64
Evaluation – Effects of Cache Associativity –
0.E+00
1.E+07
2.E+07
3.E+07
4.E+07
5.E+07
6.E+07
En
erg
y (J
ou
le)
Eothers EtagEdata,blEdata,prectl
Conventional HBL Cache
Associativity
132.ijpeg
0.8um CMOS* **
*) M.B.Kamble and K.ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10 th Int. Conf. On VLSI Design**) S.J.E.Wilton and N.P.Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
1 2 4 8 16 32 64 1 2 4 8 16 32 64
Evaluation – Effects of Cache Associativity –
0.E+001.E+062.E+063.E+064.E+065.E+066.E+067.E+068.E+06
En
erg
y (J
ou
le)
Eothers EtagEdata,blEdata,prectlConventional HBL Cache
Associativity
mpeg2decode
0.8um CMOS* **
*) M.B.Kamble and K.ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10 th Int. Conf. On VLSI Design**) S.J.E.Wilton and N.P.Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5
1 2 4 8 16 32 64 1 2 4 8 16 32 64
Evaluation – Effects of Cache Associativity –
1.0
0.8
0.6
0.4
0.2
0.0
No
rmal
ized
En
erg
y (J
ou
le)
Energy Overhead at BTBEnergy for Cache Access
126.gcc126.gcc 132.ijpeg132.ijpeg
1 2 4 8 16 32 1 2 4 8 16 32
# of Way Pointer
w/ Pre-Decoding(BTB access occurs only at branch, or jump, executions)
Evaluation – Effects of # of WPs –
1.0
0.8
0.6
0.4
0.2
0.0
No
rmal
ized
En
erg
y (J
ou
le)
Energy Overhead at BTBEnergy for Cache Access
126.gcc126.gcc 132.ijpeg132.ijpeg
1 2 4 8 16 32 1 2 4 8 16 32
# of Way Pointer
w/o Pre-Decoding(BTB access occurs for all instructions)
Evaluation – Effects of # of WPs –
Evaluation – Effect of WP invalidation penalty –
0
0.5
1
1.5
2
2.5
3
1 2 4 8 16 32
No
rmal
ized
Exe
. T
ime
(c
ycle
)
WP Invalidation Penalty (cycle)
126.gcc126.gcc
099.go099.go
mpeg2(d)mpeg2(d) 132.ijpeg132.ijpeg
0%10%20%30%40%50%60%70%80%90%
100%
099.go 126.gcc 130.li 102.swim adpcm(d) mpeg2(d) 124.m88ksim 129.comp.132.ijpeg adpcm(e) mpeg2(e)
Cache Miss Cache Miss PenaltyPenalty
Bre
akd
ow
n o
f W
P invalid
ati
ons
BTB ReplacementCache MissBTB ReplacementCache Miss
If the penalty is equal to or smaller than 4 clock cycles, the performance overhead is trivial.
The performance overhead grows after the penalty is more than 4 clock cycles.