L1 Data Cache Decomposition for Energy Efficiency
Michael Huang, Joe Renau, Seung-Moon Yoo, Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu/flexram
International Symposium on Low Power Electronics and Design, August 2001
Objective
Reduce L1 data cache energy consumption
No performance degradation
Partition the cache in multiple ways
Specialization for stack accesses
Outline
L1 D-Cache decomposition
Specialized Stack Cache
Pseudo Set-Associative Cache
Simulation Environment
Evaluation
Conclusions
L1 D-Cache Decomposition
A Specialized Stack Cache (SSC)
A Pseudo Set-Associative Cache (PSAC)
Selection
Selection is done in the decode stage for speed
– Based on instruction address and opcode
2 Kbit table to predict the PSAC way
[Figure: opcode and address select between PSAC and SSC]
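The selection step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' hardware: the slide says only that the choice uses the instruction address and opcode and that a 2 Kbit table predicts the PSAC way. The table organization (1024 entries of 2 bits, indexed by low-order instruction-address bits), the base-register test for stack accesses, and the names `Selector`, `select`, and `update` are all hypothetical.

```python
from dataclasses import dataclass, field

TABLE_BITS = 2048                        # 2 Kbit table, as stated on the slide
WAYS = 4                                 # assumption: 4-way PSAC
BITS_PER_ENTRY = 2                       # assumption: enough bits to name one of 4 ways
ENTRIES = TABLE_BITS // BITS_PER_ENTRY   # 1024 entries under these assumptions
STACK_BASE_REGS = {"sp", "fp"}           # assumption: how a stack access is recognized


@dataclass
class Selector:
    """Hypothetical decode-stage selector between the SSC and the PSAC."""
    way_table: list = field(default_factory=lambda: [0] * ENTRIES)

    def _index(self, pc: int) -> int:
        # Assumption: 4-byte instructions, table indexed by low-order PC bits.
        return (pc >> 2) % ENTRIES

    def select(self, pc: int, base_reg: str):
        """Return ("SSC", None) or ("PSAC", predicted_way)."""
        if base_reg in STACK_BASE_REGS:
            return ("SSC", None)
        return ("PSAC", self.way_table[self._index(pc)])

    def update(self, pc: int, actual_way: int) -> None:
        """Remember the way that actually hit for the next prediction."""
        self.way_table[self._index(pc)] = actual_way % WAYS
```

Because both inputs are known at decode, the target cache and the predicted way are available before the effective address is computed, which is what allows the selection to stay off the critical path.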
Stack Cache
Small, direct-mapped cache
Virtually tagged
Software optimizations:
– Very important to reduce stack cache size
– Avoid thrashing: allocate large structs in heap
– Easy to implement
SSC: Specialized Stack Cache
Pointers to reduce traffic:
– TOS (top of stack): reduces the number of write-backs
– SRB (safe-region-bottom): reduces unnecessary line-fills on write misses
– Region between TOS and SRB is “safe” (missing lines are uninitialized)
[Figure: positions of TOS and SRB as the stack grows; one region is annotated "Infrequent access"]
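The two pointer checks might look like the sketch below. It assumes a stack that grows toward lower addresses, a safe region covering addresses from TOS up to (but not including) SRB, and a 32-byte line; none of these details are given on the slide, and the function names are hypothetical. How the hardware updates SRB is not modeled.

```python
LINE_SIZE = 32  # bytes; assumed, the slides do not give the SSC line size


def needs_write_back(line_addr: int, dirty: bool, tos: int) -> bool:
    """Decide whether an evicted line must be written back.

    Assumption: the stack grows toward lower addresses, so a line lying
    entirely below TOS belongs to popped frames and can simply be dropped.
    """
    line_end = line_addr + LINE_SIZE
    return dirty and line_end > tos


def needs_line_fill(line_addr: int, is_write: bool, tos: int, srb: int) -> bool:
    """Decide whether a miss must fetch the line from the next level.

    A write miss that falls entirely inside the safe region [TOS, SRB)
    skips the fill, because missing lines there are uninitialized.
    """
    line_end = line_addr + LINE_SIZE
    in_safe_region = tos <= line_addr and line_end <= srb
    return not (is_write and in_safe_region)
```

For example, with `tos = 0x1000` and `srb = 0x1100`, a write miss to the line at `0x1020` allocates without a fill, while a read miss to the same line still fetches it.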
Pseudo Set-Associative Cache
Partition the cache in 4 ways
Evaluated activation policies: Sequential, FallBackReg, Phased Cache, FallBackPha, PredictPha
[Figure: tag and data arrays]
Sequential (Calder ‘96)
[Figure: access steps in cycle 1, cycle 2, and cycle 3]
Fallback-regular (Inoue ‘99)
[Figure: access steps in cycle 1 and cycle 2]
Phased Cache (Hasegawa ‘95)
[Figure: access steps in cycle 1 and cycle 2]
Fallback-phased (ours)
[Figure: access steps in cycle 1, cycle 2, and cycle 3]
Emphasis on energy reduction
Predictive Phased (ours)
[Figure: access steps in cycle 1 and cycle 2]
Emphasis on performance
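To compare the five activation policies side by side, the sketch below counts cycles, tag-array probes, and data-array probes for a single access. It is one plausible reading of the schemes, inferred from their names, the cited prior work, and the cycle counts on the slides; the exact hardware behavior, the probe order for Sequential, and the function name `access_cost` are assumptions.

```python
def access_cost(policy: str, predicted: int, hit_way, ways: int = 4):
    """Return (cycles, tag_probes, data_probes) for one cache access.

    predicted is the way chosen by the predictor; hit_way is the way that
    holds the line, or None on a miss. All behavior here is an assumed model.
    """
    correct = hit_way is not None and hit_way == predicted
    miss = hit_way is None
    if policy == "Sequential":
        # Probe one way (tag + data) per cycle, predicted way first.
        order = [predicted] + [w for w in range(ways) if w != predicted]
        n = ways if miss else order.index(hit_way) + 1
        return (n, n, n)
    if policy == "FallBackReg":
        # Predicted way first; on a misprediction, all other ways in parallel.
        return (1, 1, 1) if correct else (2, ways, ways)
    if policy == "Phased":
        # All tags in cycle 1, then only the matching data way in cycle 2.
        return (1, ways, 0) if miss else (2, ways, 1)
    if policy == "FallBackPha":
        # Predicted way first; on a misprediction, fall back to a phased access.
        if correct:
            return (1, 1, 1)
        return (2, ways, 1) if miss else (3, ways, 2)
    if policy == "PredictPha":
        # All tags plus the predicted data way in cycle 1; fix up in cycle 2.
        if correct or miss:
            return (1, ways, 1)
        return (2, ways, 2)
    raise ValueError(f"unknown policy: {policy}")
```

Under this model, FallBackPha pays up to three cycles on a misprediction but touches few data arrays, while PredictPha always reads every tag yet never needs more than two cycles, which is consistent with the "energy" and "performance" emphasis noted on the two slides.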
Simulation Environment
Baseline configuration:
– Processor: 1 GHz, R10000-like
– L1: 32 KB, 2-way
– L2: 512 KB, 8-way, phased cache
– Memory: 1 Rambus channel
Energy model: extended CACTI
Energy is for data memory hierarchy only
Applications
Multimedia:
– Mp3dec: MP3 decoder
– Mp3enc: MP3 encoder
SPECint:
– Gzip: data compression
– Crafty: chess game
– MCF: traffic model
Scientific:
– Bsom: data mining
– Blast: protein matching
– Treeadd: Olden tree search
Adding a Stack Cache
[Chart values, normalized to baseline]
Configuration   Delay   Energy   E*D
PLAIN 256B      1.01    0.83     0.84
SSC 256B        1.00    0.80     0.81
PLAIN 512B      0.99    0.78     0.77
SSC 512B        0.99    0.77     0.76
PLAIN 1KB       0.99    0.77     0.76
SSC 1KB         0.98    0.76     0.75
For the same size the Specialized Stack Cache is always better
Pseudo Set-Associative Cache
[Chart values, normalized to baseline]
Configuration        Delay   Energy   E*D
4-way Sequential     1.05    0.68     0.72
4-way FallBackReg    0.99    0.69     0.69
4-way Phased         1.05    0.74     0.78
4-way FallBackPha    1.01    0.67     0.68
4-way PredictPha     0.98    0.68     0.67
PredictPha has the best delay and energy-delay product
PSAC: 2-way vs. 4-way
[Chart values, normalized to baseline]
Configuration       Delay   Energy   E*D
2-way Sequential    0.99    0.78     0.77
2-way PredictPha    0.97    0.79     0.76
4-way PredictPha    0.98    0.68     0.67
For E*D, 4-way PSAC is better than 2-way
Pseudo Set-Associative + Specialized Stack Cache
[Chart values, normalized to baseline]
Configuration                 Delay   Energy   E*D
4-way PredictPha              0.98    0.68     0.67
4-way PredictPha + SSC256B    0.98    0.61     0.60
4-way PredictPha + SSC512B    0.97    0.58     0.56
4-way PredictPha + SSC1KB     0.96    0.57     0.55
Combining PSAC and SSC reduces E*D by 44% on average
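The 44% figure can be checked against the values on this slide with a short calculation. The sketch assumes the figure refers to the 4-way PredictPha + SSC512B configuration (delay 0.97, energy 0.58, E*D 0.56); the slide does not say which configuration is meant, and the helper names are hypothetical.

```python
def energy_delay(delay: float, energy: float) -> float:
    """Energy-delay product of two values normalized to the baseline."""
    return delay * energy


def reduction_pct(normalized: float) -> float:
    """Percentage reduction relative to a baseline of 1.0."""
    return (1.0 - normalized) * 100.0


# Values from this slide; which configuration the 44% refers to is an assumption.
ed = energy_delay(0.97, 0.58)   # about 0.5626, shown on the slide as 0.56
saving = reduction_pct(0.56)    # about 44
```

The product of the rounded delay and energy agrees with the reported E*D to two decimal places, and a normalized E*D of 0.56 corresponds to a 44% reduction.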
Area Constrained: small PSAC+SSC
[Chart values, normalized to baseline]
Configuration                      Delay   Energy   E*D
24KB 3-way PredictPha              0.98    0.74     0.72
24KB 3-way PredictPha + SSC512B    0.98    0.61     0.60
32KB 4-way PredictPha + SSC512B    0.97    0.58     0.56
SSC + small PSAC delivers a cost-effective E*D design
Energy Breakdown
[Figure: energy, normalized to baseline, broken down into SSC, L1, L2, and Mem for the Baseline, 4-way PSAC, SSC512B, and Comb configurations on BLAST, MCF, and MP3D]
Conclusions
Stack cache: important for energy efficiency
SW optimization required for stack caches
Effective Specialized Stack Cache extensions
Pseudo Set-Associative Cache:
– 4-way more effective than 2-way
– Predictive Phased PSAC has the lowest E*D
Effective to combine PSAC and SSC
– E*D reduced by 44% on average
Backup Slides
Cache Energy
[Figure: energy (pJ, axis 0 to 2000) vs. cache size (4K to 64K) for 1-way, 2-way, and 4-way caches]
Extended CACTI
New sense amplifier
– 15% bit-line swing for reads
Full bit-line swing for writes
Different energy for reads, writes, line-fills, and write-backs
Multiple optimization parameters
SSC Energy Overhead
Small energy consumption required to use TOS and SRB
Registers updated at function call and return
Registers checked on cache miss
Miss Rate
[Figure: miss rate (axis 0% to 12%) vs. cache size (4KB to 64KB) for BLAST, BSOM, CRAFTY, GZIP, MCF, MP3D, MP3E, and TREE]
Overview