A Low Energy Set-Associative I-Cache with Extended BTB K. Inoue, V. Moshnyaga, and K. Murakami

Post on 31-Jan-2016


Page 1: A Low Energy Set-Associative I-Cache with Extended BTB

A Low Energy Set-Associative I-Cache with Extended BTB

K. Inoue, V. Moshnyaga, and K. Murakami

Page 2: A Low Energy Set-Associative I-Cache with Extended BTB

Introduction

Increase in cache size

Power consumed in on-chip caches:

DEC 21164 CPU*: 25%
StrongARM SA-110 CPU*: 43%
Bipolar ECL CPU**: 50%

* Kamble et al., “Analytical Energy Dissipation Models for Low Power Caches”, ISLPED’97
** Jouppi et al., “A 300-MHz 115-W 32-b Bipolar ECL Microprocessor”, IEEE Journal of Solid-State Circuits’93

Page 3: A Low Energy Set-Associative I-Cache with Extended BTB

Problem of Conventional Caches

Conventional Cache: fast access but high energy consumption
Phased Cache: low energy consumption but slow access

[Figure: tag and line sub-arrays of way0 to way3. The conventional cache completes an access in Cycle 1; the phased cache takes Cycle 1 and Cycle 2.]

Parallel search strategy produces unnecessary way activation!
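As a rough illustration of the trade-off on this slide, the following Python sketch counts how many tag and data sub-arrays each organization activates on a hit. It assumes the usual phased-cache organization (all tags read in the first cycle, only the hit way's line read in the second); the names `AccessCost`, `conventional`, and `phased` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class AccessCost:
    tag_arrays: int   # tag sub-arrays read per hit access
    data_arrays: int  # data (line) sub-arrays read per hit access
    cycles: int       # cycles needed per hit access


def conventional(ways: int) -> AccessCost:
    # All tags and all lines are read in parallel in a single cycle.
    return AccessCost(tag_arrays=ways, data_arrays=ways, cycles=1)


def phased(ways: int) -> AccessCost:
    # Cycle 1 reads all tags; cycle 2 reads only the hit way's line.
    return AccessCost(tag_arrays=ways, data_arrays=1, cycles=2)


if __name__ == "__main__":
    for name, cost in (("conventional", conventional(4)), ("phased", phased(4))):
        print(name, cost)
```

For a 4-way cache the conventional access reads three data sub-arrays whose contents are discarded, which is the unnecessary way activation the slide refers to.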

Page 4: A Low Energy Set-Associative I-Cache with Extended BTB

Our Proposal

History-Based Tag-Comparison I-Cache

• Attempts to reduce cache-access energy without performance degradation
• Reuses tag-check results to eliminate unnecessary way activation
• Can achieve a 62% energy reduction with only 0.2% performance degradation

Page 5: A Low Energy Set-Associative I-Cache with Extended BTB

Conventional Tag-Check Scheme

Programs include a lot of loops! A number of instructions are executed repetitively.

Cache is updated only when replacement occurs! Loaded data stays at the same location at least until the next cache miss takes place.

[Figure: timeline of a cache-miss interval between two misses. Inst. A is referenced N times, and a tag check is performed on every reference.]

Cache is stable! Completely the same tag-check result!

Page 6: A Low Energy Set-Associative I-Cache with Extended BTB

History-Based Tag-Comparison (HBTC) Scheme

Attempts to reuse tag-check results produced earlier during a cache-miss interval!

A tag-check result can be reused if:
• The target instruction has been referenced before, and
• No cache miss has occurred since the previous reference.

[Figure: timeline of a cache-miss interval between two misses. The first reference to A performs a tag check; later references to A reuse its result.]

Page 7: A Low Energy Set-Associative I-Cache with Extended BTB

Concept of the HBTC Cache

1. Execute an instruction A at time T
• Perform tag check
• Save the tag-check result into extended BTB

2. If a cache miss occurs, then we invalidate all the stored tag-check results

3. Execute the instruction A at time T+X
• Reuse the tag-check result to activate only the hit-way’s data sub-array

[Figure: 4-way cache (way0 to way3) accessed by Index. At time T, way2 is the hit way; at time T+X, only way2 is activated.]
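The three steps above can be sketched as a toy Python model. This is only an illustration under simplifying assumptions: it keeps one stored way pointer per cache line in a dictionary, whereas the design on the slides stores way pointers in the extended BTB; the class name `ToyHBTCCache`, the round-robin replacement policy, and the default geometry are all hypothetical.

```python
class ToyHBTCCache:
    """Toy set-associative I-cache that remembers the hit way per line."""

    def __init__(self, num_sets=128, ways=4, line_bytes=32):
        self.num_sets, self.ways, self.line_bytes = num_sets, ways, line_bytes
        # tags[set][way] holds the tag stored in that way (None if empty).
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.next_victim = [0] * num_sets  # round-robin replacement (assumption)
        # line address -> hit way; stands in for the extended BTB (assumption)
        self.way_pointers = {}
        self.tag_checks = 0  # accesses that needed a full tag check
        self.reuses = 0      # accesses served from a stored tag-check result

    def fetch(self, addr):
        line = addr // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        if line in self.way_pointers:
            # Step 3: reuse the stored result; only the hit way is activated.
            self.reuses += 1
            return self.way_pointers[line]
        # Step 1: conventional access with a tag check in every way.
        self.tag_checks += 1
        if tag in self.tags[index]:
            way = self.tags[index].index(tag)
        else:
            # Step 2: a miss replaces a line, so every stored result is dropped.
            way = self.next_victim[index]
            self.next_victim[index] = (way + 1) % self.ways
            self.tags[index][way] = tag
            self.way_pointers.clear()
        self.way_pointers[line] = way
        return way
```

Running a small loop body through `fetch` several times shows the intended effect: once the loop's lines are resident and no further miss occurs, nearly every access is served from a stored result instead of a tag check.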

Page 8: A Low Energy Set-Associative I-Cache with Extended BTB

Conventional vs. Phased vs. HBTC

[Figure: tag and line sub-arrays (way0 to way3) activated per cycle, shown for a cache hit and a cache miss. Conventional: Cycle 1. Phased: Cycle 1 and Cycle 2. HBTC: Cycle 1, shown with reuse and with no reuse.]

Page 9: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC SA I-$ Architecture

[Figure: block diagram. The BTB (Branch Target Buffer: branch-instruction address, target address) is extended with a WP Table; an entry of the WP Table holds a valid flag and n way pointers (WPs), with Taken and Not-Taken parts. Also shown: PC, PBAreg (branch-instruction address and prediction result, used as the address for writing), WP Record Reg. (WPRreg), WP Reg. (WPreg), I-Cache (tag-check result, miss signal), and Mode Controller (inputs include WP valid and the branch prediction result; output: mode).]

Page 10: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation

Normal Mode (NM): w/ tag checks
Omitting Mode (OM): w/o tag checks (reuse)
Tracing Mode (TM): w/ tag checks (tag-check results are preserved in the WPRreg, and are stored into the WP-table on the next BTB hit)

Mode Transition

• BTB hit, WPs valid: go to OM
• BTB hit, WPs invalid: go to TM (PC and pred. result are saved into the PBAreg)
• GOtoNM: I-cache miss, BTB replacement, RAS access, or branch misprediction

All WPs are invalidated!
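The mode transitions on this slide can be read as a small state machine. The Python sketch below is one possible reading, not the authors' controller: the event names are hypothetical labels for the conditions listed on the slide, and treating every other event as leaving the mode unchanged is an assumption.

```python
from enum import Enum


class Mode(Enum):
    NM = "normal"    # with tag checks
    OM = "omitting"  # without tag checks (reuse)
    TM = "tracing"   # with tag checks, results recorded in the WPRreg


# Conditions that force a return to Normal Mode, as listed on the slide.
GO_TO_NM_EVENTS = {
    "icache_miss",
    "btb_replacement",
    "ras_access",
    "branch_misprediction",
}


def next_mode(mode: Mode, event: str, wp_valid: bool = False) -> Mode:
    """Return the mode after `event`; `wp_valid` matters only on a BTB hit."""
    if event in GO_TO_NM_EVENTS:
        return Mode.NM
    if event == "btb_hit":
        # Valid way pointers allow reuse; otherwise start tracing.
        return Mode.OM if wp_valid else Mode.TM
    return mode  # any other event leaves the mode unchanged (assumption)
```

For example, `next_mode(Mode.NM, "btb_hit", wp_valid=False)` yields `Mode.TM`, and a later `"icache_miss"` returns the controller to `Mode.NM`.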

Page 11: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram. Initial state.]

Page 12: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

PC = A; prediction: Taken.

Page 13: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

PC = A; prediction: Taken.

NO valid WPs are detected!

Page 14: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

NO valid WPs are detected!

PC and branch prediction result are saved! (PBAreg = A, T)

Page 15: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg holding A, T, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

Conventional accesses!

Tag-comparison result is stored into the WPRreg! (way 1)

Page 16: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg holding A, T, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

Conventional accesses!

Tag-comparison result is stored into the WPRreg! (way 3)

Page 17: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg holding A, T, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

Conventional accesses!

Tag-comparison result is stored into the WPRreg! (way 0)

Page 18: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg holding A, T, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

BTB hit! (PC = B)

The WPRreg is stored into the WP-table entry pointed to by the PBAreg!

Page 19: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

PC = A; prediction: Taken.

Page 20: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

PC = A; prediction: Taken.

Valid WPs are detected!

Page 21: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

Tag-comparison reuse (way 1)

Page 22: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

Tag-comparison reuse (way 3)

Page 23: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

Tag-comparison reuse (way 0)

Page 24: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

No valid WPs in the WPreg!

Page 25: A Low Energy Set-Associative I-Cache with Extended BTB

HBTC I-$ Operation Example

[Figure: HBTC datapath (Branch Target Buffer with entries for Inst. Addr. A and Inst. Addr. B, WP Table, PBAreg, WPRreg, WPreg, Mode Controller, 4-way I-Cache with ways 0 to 3) and the NM/TM/OM mode-transition diagram.]

Conventional accesses!

Page 26: A Low Energy Set-Associative I-Cache with Extended BTB

Advantages and Disadvantages

☺ Eliminates unnecessary energy consumption w/o performance degradation (during OM)!

[Figure: ways activated (way0 to way3) in Normal Mode (NM) / Tracing Mode (TM) versus Omitting Mode (OM).]

☹ BTB energy overhead due to WP-table read accesses
☹ BTB access conflict for invalidating all WPs (causes 1 stall cycle)
☹ BTB access conflict to record WPs (causes 1 stall cycle)

Page 27: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Environment –

• OOO simulation by SimpleScalar: 16 KB 4-way I-cache (32 B line size). For others, default parameters were used.
• Cache energy model based on [Kamble97] (including the WP-table read-energy overhead)
• Assume that the BTB is accessed only when branch or jump instructions are executed (instructions are pre-decoded)

Benchmark Programs

SPECint95: 099.go, 124.m88ksim, 126.gcc, 129.compress, 130.li, 132.ijpeg
SPECfp95: 102.swim
Mediabench: mpeg2encode, mpeg2decode, adpcm_enc, adpcm_dec

[Kamble97] M. B. Kamble and K. Ghose, “Analytical Energy Dissipation Models for Low Power Caches,” ISLPED’97
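The evaluated I-cache configuration (16 KB, 4-way, 32 B lines) fixes the number of sets and the address split used for indexing. The short Python sketch below works through that arithmetic. The 32-bit address width is an assumption that the slides do not state, and the helper name `cache_geometry` is hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes, addr_bits=32):
    """Derive sets and address-field widths for a power-of-two cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits  # depends on addr_bits (assumed)
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


# 16 KB, 4-way, 32 B lines, as on the slide.
print(cache_geometry(16 * 1024, 4, 32))
# {'sets': 128, 'offset_bits': 5, 'index_bits': 7, 'tag_bits': 20}
```

With 128 sets, a conventional access reads four tags and four 32 B lines per fetch, while a reused tag-check result needs only one line read.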

Page 28: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Energy and Performance –

[Figure: normalized energy (Joule) and normalized execution time (cycle), each on an axis from 0.0 to 1.2, for 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), and mpeg2(d); # of WPs = 4. Callouts: 62% and 0.2%.]

62% Ecache reduction with a 0.2% execution-time increase. Even in the worst case, about 20% Ecache reduction.

Page 29: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effect of WP invalidation penalty –

[Figure: normalized execution time (cycle), axis 0 to 3, versus WP invalidation penalty (1, 2, 4, 8, 16, 32 cycles) for 126.gcc, 099.go, mpeg2(d), and 132.ijpeg, with the cache miss penalty marked.]

If the penalty is equal to or smaller than 4 clock cycles, the performance overhead is trivial.

The performance overhead grows once the penalty exceeds 4 clock cycles.

Page 30: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effect of the Number of WPs –

Increasing the number of WPs makes it possible to reuse more tag-check results.

But it produces BTB access energy overhead.

[Figure: normalized energy (Joule), axis 0.0 to 1.2, split into energy for cache access and energy overhead of BTB, versus # of way pointers (1, 2, 4, 8, 16, 32) for 126.gcc, w/ pre-decoding and w/o pre-decoding.]

Page 31: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effect of Cache Associativity –

Conv.: Ecache grows with the increase in associativity.
HBTC: Ecache is reduced with the increase in associativity (n <= 4); after that, it starts to increase (n > 4).

[Figure: energy (Joule), axis 0.E+00 to 8.E+06, broken down into Eothers, Etag, Edata,bl, and Edata,prectl, versus associativity (1, 2, 4, 8, 16, 32, 64) for the conventional and HBTC caches; mpeg2decode.]

Page 32: A Low Energy Set-Associative I-Cache with Extended BTB

Conclusions

History-Based Tag-Comparison Instruction Cache

1. Records tag-check results generated by the I-cache into the extended BTB

2. Attempts to reuse them in order to eliminate unnecessary way activation

3. Achieves a 62% I-cache energy reduction with only 0.2% performance degradation!

Future work
• Analyze energy consumption based on a real chip design.

Page 33: A Low Energy Set-Associative I-Cache with Extended BTB

Backup Slides (History-Based Tag-Comparison Cache)

Page 34: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Comparison with IS Approach –

[Figure: normalized tag-compare count, axis 0.0 to 0.8, for 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), and mpeg2(d), comparing the Interline Sequential (IS) approach, the History-Based Look-up (HBL) cache, and the combination of IS and HBL.]

Page 35: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effects of Cache Associativity –

[Figure: energy (Joule), axis 0.E+00 to 3.E+07, broken down into Eothers, Etag, Edata,bl, and Edata,prectl, versus associativity (1, 2, 4, 8, 16, 32, 64) for the conventional cache and the HBL cache; 099.go; 0.8um CMOS* **.]

*) M. B. Kamble and K. Ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. on VLSI Design
**) S. J. E. Wilton and N. P. Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5

Page 36: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effects of Cache Associativity –

[Figure: energy (Joule), axis 0.E+00 to 7.E+07, broken down into Eothers, Etag, Edata,bl, and Edata,prectl, versus associativity (1, 2, 4, 8, 16, 32, 64) for the conventional cache and the HBL cache; 126.gcc; 0.8um CMOS* **.]

*) M. B. Kamble and K. Ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. on VLSI Design
**) S. J. E. Wilton and N. P. Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5

Page 37: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effects of Cache Associativity –

[Figure: energy (Joule), axis 0.E+00 to 6.E+07, broken down into Eothers, Etag, Edata,bl, and Edata,prectl, versus associativity (1, 2, 4, 8, 16, 32, 64) for the conventional cache and the HBL cache; 132.ijpeg; 0.8um CMOS* **.]

*) M. B. Kamble and K. Ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. on VLSI Design
**) S. J. E. Wilton and N. P. Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5

Page 38: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effects of Cache Associativity –

[Figure: energy (Joule), axis 0.E+00 to 8.E+06, broken down into Eothers, Etag, Edata,bl, and Edata,prectl, versus associativity (1, 2, 4, 8, 16, 32, 64) for the conventional cache and the HBL cache; mpeg2decode; 0.8um CMOS* **.]

*) M. B. Kamble and K. Ghose, “Energy-Efficiency of VLSI Caches: A Comparative Study,” 10th Int. Conf. on VLSI Design
**) S. J. E. Wilton and N. P. Jouppi, “An Enhanced Access and Cycle Time Model for On-Chip Caches,” WRL Research Report 93/5

Page 39: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effects of # of WPs –

w/ Pre-Decoding (BTB access occurs only at branch, or jump, executions)

[Figure: normalized energy (Joule), axis 0.0 to 1.0, split into energy for cache access and energy overhead at BTB, versus # of way pointers (1, 2, 4, 8, 16, 32) for 126.gcc and 132.ijpeg.]

Page 40: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effects of # of WPs –

w/o Pre-Decoding (BTB access occurs for all instructions)

[Figure: normalized energy (Joule), axis 0.0 to 1.0, split into energy for cache access and energy overhead at BTB, versus # of way pointers (1, 2, 4, 8, 16, 32) for 126.gcc and 132.ijpeg.]

Page 41: A Low Energy Set-Associative I-Cache with Extended BTB

Evaluation – Effect of WP invalidation penalty –

[Figure: normalized execution time (cycle), axis 0 to 3, versus WP invalidation penalty (1, 2, 4, 8, 16, 32 cycles) for 126.gcc, 099.go, mpeg2(d), and 132.ijpeg, with the cache miss penalty marked.]

[Figure: breakdown of WP invalidations (0% to 100%) into BTB replacement and cache miss for 099.go, 124.m88ksim, 126.gcc, 129.comp., 130.li, 132.ijpeg, 102.swim, adpcm(e), adpcm(d), mpeg2(e), and mpeg2(d).]

If the penalty is equal to or smaller than 4 clock cycles, the performance overhead is trivial.

The performance overhead grows once the penalty exceeds 4 clock cycles.