Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Somayeh Sardashti and David A. Wood University of Wisconsin-Madison
The PowerPoint presentation is available at: http://www.cs.wisc.edu/multifacet/papers/micro13_dcc.pptx
Communication vs. Computation
[Figure (Keckler, Micro 2011): the energy gap between communication and computation is roughly 200X]
Improving cache utilization is critical for energy efficiency!
Compressed Cache: Compress and Compact Blocks
+ Higher effective cache size
+ Small area overhead
+ Higher system performance
+ Lower system energy

Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction
Decoupled Compressed Cache (DCC)
Saving system energy by improving LLC utilization through cache compression.
Key ideas:
- Decoupled super-blocks
- Non-contiguous sub-blocks
Decoupled Compressed Cache (DCC)
Saving system energy by improving LLC utilization through cache compression.
Outperforms a 2X LLC with only 1.08X the LLC area: 14% higher performance and 12% lower energy.
Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions
Compressed Caching
Compress cache blocks.
Compact compressed blocks to make room. Add more tags to increase effective capacity.
[Figure: tag array and data array of a compressed cache]
Compression
(1) Compression: how to compress blocks?
• There are different compression algorithms.
• Not the focus of this work.
• But which algorithm matters!
[Figure: a 64-byte block passes through the compressor and shrinks to 20 bytes]
Compression Potentials
A high compression ratio potentially yields a large normalized effective cache capacity.
[Figure: compression ratio vs. cycles to decompress for several compression algorithms, with ratios of 1.5, 2.8, and 3.9]
Compression Ratio = Original Size / Compressed Size
We use C-PACK+Z for the rest of the talk!
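The two formulas above are easy to make concrete. A minimal sketch in Python; the 8 MB LLC and block counts below are illustrative numbers, not results from the talk:

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Compression Ratio = Original Size / Compressed Size (slide formula)."""
    return original_size / compressed_size

def normalized_effective_capacity(valid_blocks: int, max_uncompressed_blocks: int) -> float:
    """Valid blocks held, relative to what an uncompressed cache can hold."""
    return valid_blocks / max_uncompressed_blocks

# A 64-byte block compressed to 20 bytes, as in the earlier figure:
ratio = compression_ratio(64, 20)  # 3.2

# If compression let an 8 MB LLC (131072 64-byte frames) hold 393216 valid
# blocks, its normalized effective capacity would be:
cap = normalized_effective_capacity(393216, 131072)  # 3.0
```

A ratio of 3.9 is thus an upper bound on effective capacity; the compaction scheme determines how much of it is actually realized.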
Compaction
(2) Compaction: how to store and find blocks?
• Critical to achieving the compression potential.
• This work focuses on compaction.
[Figure: Fixed-Size Compressed Cache (FixedC) [Kim, WMPI 2002; Yang, MICRO 2002]: each compressed block is stored in a fixed-size data slot, causing internal fragmentation!]
Compaction
(2) Compaction: how to store and find blocks?
[Figure: Variable-Size Compressed Cache (VSC) [Alameldeen, ISCA 2004]: each compressed block occupies a variable number of contiguous sub-blocks]
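The difference between the two compaction schemes can be sketched in a few lines. This is a simplified model, assuming FixedC allocates a half-block (32 B) per compressible block and VSC allocates 16 B sub-blocks; the block sizes are illustrative:

```python
import math

def fixedc_bytes(compressed: int) -> int:
    # FixedC (simplified): a compressed block occupies a fixed half-block
    # (32 B), or a full 64 B block if it does not fit in half.
    return 32 if compressed <= 32 else 64

def vsc_bytes(compressed: int, sub_block: int = 16) -> int:
    # VSC: allocate a variable number of contiguous sub-blocks.
    return math.ceil(compressed / sub_block) * sub_block

# Hypothetical compressed sizes of four blocks, in bytes:
sizes = [20, 33, 10, 64]
fixedc_waste = sum(fixedc_bytes(s) - s for s in sizes)  # bytes lost to padding
vsc_waste = sum(vsc_bytes(s) - s for s in sizes)
# VSC's finer allocation granularity leaves less internal fragmentation.
```

The finer the sub-block, the less fragmentation, but also the more metadata needed to locate each block, which is the tension the next slides quantify.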
Previous Compressed Caches
(Limit 1) Limited tags/metadata: high area overhead from adding 4X or more tags.
(Limit 2) Internal fragmentation: low cache capacity utilization.
[Figure: normalized effective capacities of previous designs (1.7, 2.0, 2.3, 2.6, 3.1, with 16 B and 10 B sub-blocks) fall short of the 3.9 potential]
Normalized Effective Capacity = Number of Valid LLC Blocks / MAX Number of (Uncompressed) Blocks
(Limit 3) Energy-Expensive Re-Compaction
VSC requires energy-expensive re-compaction: when an update to block B grows it to 2 sub-blocks, the blocks stored after it must be shifted, costing up to 3X higher LLC dynamic energy!
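The re-compaction cost can be illustrated with a toy model of one VSC set, where compressed blocks are stored contiguously as runs of sub-blocks (the layout and sizes are hypothetical):

```python
def recompact(layout, block, new_size):
    """Grow `block` to `new_size` sub-blocks in a contiguous VSC set.

    layout: list of (block_id, n_sub_blocks) runs stored back-to-back.
    Returns (new_layout, moved), where `moved` counts the sub-blocks that
    must be re-read and re-written to make room: the source of VSC's
    energy-expensive re-compaction."""
    new_layout = [(bid, new_size if bid == block else n) for bid, n in layout]
    moved, seen = 0, False
    for bid, n in layout:
        if seen:
            moved += n  # everything after the grown block shifts
        if bid == block:
            seen = True
    return new_layout, moved

# Hypothetical set: A(2), B(1), C(3), D(2) sub-blocks; an update grows B to 2.
layout, moved = recompact([("A", 2), ("B", 1), ("C", 3), ("D", 2)], "B", 2)
# moved == 5: C's 3 sub-blocks and D's 2 sub-blocks are shifted.
```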
Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions
Decoupled Compressed Cache
(1) Exploiting spatial locality: low area overhead.
(2) Decoupling the tag/data mapping: eliminates energy-expensive re-compaction and reduces internal fragmentation.
(3) Co-DCC, dynamically co-compacting super-blocks: further reduces internal fragmentation.
(1) Exploiting Spatial Locality
DCC tracks LLC blocks at super-block granularity.
[Figure: a conventional compressed cache needs 4X tags, while DCC needs only 2X super-block tags. A quad (Q) super tag covers blocks A, B, C, D with a small per-block state field (state A, state B, state C, state D); a singleton (S) covers the lone block E.]
Up to 4X blocks with low area overhead!
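The tag-area saving from super-block tracking can be sketched with back-of-the-envelope arithmetic. The address and field widths below are assumptions for illustration, not the paper's exact layout:

```python
# Assumed geometry: 48-bit addresses, 64 B blocks, 4096 sets,
# 4-block super-blocks.
ADDR_BITS, BLOCK_OFF, SET_BITS = 48, 6, 12

per_block_tag = ADDR_BITS - BLOCK_OFF - SET_BITS      # conventional tag: 30 b
super_tag = per_block_tag - 2                         # 4 blocks share it: 28 b
per_block_state = 3                                   # e.g. valid/compression state

# Tag storage needed to track 4 neighboring blocks:
conventional = 4 * per_block_tag                      # 4 separate tags
dcc = super_tag + 4 * per_block_state                 # 1 super tag + 4 states
# With 2X super tags, DCC can track up to 4X blocks for far less tag area.
```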
(2) Decoupling the Tag/Data Mapping
DCC decouples the tag/data mapping to eliminate re-compaction.
[Figure: super tags (quad Q: A, B, C, D; singleton S: E) with flexible allocation: when B is updated, its new sub-blocks can be placed in any free sub-blocks instead of shifting neighbors]
(2) Decoupling the Tag/Data Mapping
Back pointers identify the owner block of each sub-block.
[Figure: each data sub-block carries a back pointer (Tag ID, Blk ID) naming its owner in the super tags (quad Q: A, B, C, D; singleton S: E)]
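A minimal sketch of this decoupled mapping (the set layout below is hypothetical): each data sub-block carries a back pointer, and a read simply selects the sub-blocks whose pointer matches, wherever in the set they sit:

```python
def read_block(back_pointers, tag_id, blk_id):
    """Return the indices of the sub-blocks owned by (tag_id, blk_id).

    back_pointers: one (tag_id, blk_id) entry per data sub-block, or None
    if the sub-block is free. Because ownership is recorded per sub-block,
    a block's sub-blocks need not be contiguous, so no re-compaction is
    ever required."""
    return [i for i, bp in enumerate(back_pointers) if bp == (tag_id, blk_id)]

# Hypothetical set with 8 sub-blocks; block B of quad Q owns sub-blocks 1 and 6.
bps = [("Q", "A"), ("Q", "B"), None, ("S", "E"),
       ("Q", "A"), None, ("Q", "B"), ("Q", "C")]
owned = read_block(bps, "Q", "B")  # [1, 6]
```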
(3) Co-Compacting Super-Blocks
Co-DCC dynamically co-compacts super-blocks, further reducing internal fragmentation.
[Figure: the compressed blocks of quad Q (A, B, C, D) share sub-blocks instead of each block padding its own]
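Co-compaction can be sketched as packing a super-block's compressed blocks back-to-back and recording per-block Begin offsets plus an END; a simplified model with illustrative sizes and a 16 B sub-block:

```python
def co_compact(compressed_sizes, sub_block=16):
    """Pack a super-block's compressed blocks contiguously.

    Returns (begin_offsets, end, sub_blocks_used). Only the final END is
    rounded up to a sub-block boundary, so the super-block pays for
    internal fragmentation once, not once per block."""
    begins, offset = [], 0
    for size in compressed_sizes:
        begins.append(offset)
        offset += size
    used = -(-offset // sub_block)  # ceiling division
    return begins, offset, used

# Quad A, B, C, D compressed to 20, 10, 25, and 9 bytes:
begins, end, used = co_compact([20, 10, 25, 9])
# begins == [0, 20, 30, 55], end == 64, used == 4 sub-blocks,
# versus 6 sub-blocks if each block were padded to 16 B separately.
```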
Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions
Experimental Methodology
Integrated DCC with the AMD Bulldozer cache.
- We model the timing and allocation constraints of sequential regions at the LLC in detail.
- No need for an alignment network.
Verilog implementation and synthesis of the tag-match and sub-block-selection logic.
- One additional cycle of latency due to sub-block selection.
Experimental Methodology
Full-system simulation with a simulator based on GEMS.
Wide range of applications with different levels of cache sensitivity:
- Commercial workloads: apache, jbb, oltp, zeus
- SPEC OMP: ammp, applu, equake, mgrid, wupwise
- PARSEC: blackscholes, canneal, freqmine
- SPEC 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

Cores: Eight OOO cores, 3.2 GHz
L1I$/L1D$: Private, 32 KB, 8-way
L2$: Private, 256 KB, 8-way
L3$: Shared, 8 MB, 16-way, 8 banks
Main Memory: 4 GB, 16 banks, DDR3 with 800 MHz bus frequency
Effective LLC Capacity

Components             FixedC/VSC-2X   DCC     Co-DCC
Tag Array              6.3%            2.1%    11.3%
Back Pointer Array     0               4.4%    5.4%
(De-)Compressors       1.8%            1.8%    1.8%
Total Area Overhead    8.1%            8.3%    18.5%

[Figure: normalized LLC area and normalized effective LLC capacity for Baseline, FixedC, VSC, DCC, Co-DCC, and a 2X Baseline]
(Co-)DCC Energy Consumption
(Co-)DCC reduces system energy by reducing the number of accesses to main memory.
[Figure: normalized system energy per configuration: 0.93, 0.96, 0.97, 0.91, and 0.88, with (Co-)DCC the lowest]
Summary
Analyzed the limits of compressed caching:
• Limited number of tags
• Internal fragmentation
• Energy-expensive re-compaction
Decoupled Compressed Cache:
• Improves the performance and energy of compressed caching
• Decoupled super-blocks
• Non-contiguous sub-blocks
Co-DCC further reduces internal fragmentation.
Practical designs [details in the paper].

Backup slides: (de-)compression overhead, DCC data array organization with AMD Bulldozer, DCC timing, DCC lookup, applications, Co-DCC design, LLC effective capacity, LLC miss rate, memory dynamic energy, LLC dynamic energy.
Backup
(De-)Compression Overhead

Parameters               Compressor   Decompressor
Pipeline Depth           6            2
Latency (cycles)         16           9
Area (mm²)               0.016        0.016
Power Consumption (mW)   25.84        19.01
DCC Data Array Organization: AMD Bulldozer
[Figure: the data array is organized as four independently addressed sub-ranks (SR0-SR3), latched by alternating A-phase and B-phase flops as in AMD Bulldozer; each sub-rank receives its own sub-block address along with the set address, so the sub-blocks of one block can be read from non-contiguous locations. Example: A0 is uncompressed; B1 and C2 are compressed to 2 sub-blocks.]
DCC Lookup
1. Access the super tags and back pointers in parallel.
2. Find the matching back pointers.
3. Read the corresponding sub-blocks and decompress.
[Figure: reading block C of quad Q (A, B, C, D; singleton S: E); back-pointer matches select C's sub-blocks from the data array]
Applications
SPEC 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm
[Figure: workloads grouped by sensitivity: sensitive to cache capacity and latency, sensitive to cache capacity, sensitive to cache latency, and cache-insensitive]
Co-DCC Design
[Figure: a co-compacted super-block: the compressed blocks A0, A1, A2 are packed back-to-back across shared sub-blocks (sub-block 0 through sub-block 7), with a Begin pointer per block (A0-Begin, A1-Begin, ...) and a single A-END pointer marking the end of the region.]
[Figure: Co-DCC tag entry: Super-Block Tag, Sharers (4b), and per-block Cstate (3b), Comp (1b), and Begin (7b) fields for blocks 0-3, plus an END pointer (7b).]