Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Somayeh Sardashti and David A. Wood University of Wisconsin-Madison
The PowerPoint presentation is available at: http://www.cs.wisc.edu/multifacet/papers/micro13_dcc.pptx
Communication vs. Computation
[Figure (Keckler, Micro 2011): the energy gap between communication and computation is roughly 200X]
Improving cache utilization is critical for energy efficiency!
Compressed Cache: Compress and Compact Blocks
+ Higher effective cache size
+ Small area overhead
+ Higher system performance
+ Lower system energy

Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction
Decoupled Compressed Cache (DCC)
Saving system energy by improving LLC utilization through cache compression.
Key ideas:
- Decoupled super-blocks
- Non-contiguous sub-blocks
Decoupled Compressed Cache (DCC)
Saving system energy by improving LLC utilization through cache compression.
Outperforms a 2X LLC with only 1.08X the LLC area: 14% higher performance and 12% lower energy.
Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions
Compressed Caching
Compress cache blocks.
Compact compressed blocks to make room. Add more tags to increase effective capacity.
[Figure: tag array and data array of a compressed cache]
Compression
(1) Compression: how to compress blocks?
• There are different compression algorithms.
• Not the focus of this work.
• But which algorithm matters!
[Figure: a 64-byte block passes through the compressor and shrinks to 20 bytes]
Compression Potentials
A high compression ratio potentially yields a large normalized effective cache capacity.
[Figure: compression ratio vs. cycles to decompress for several compression algorithms, with ratios of 1.5, 2.8, and 3.9]
Compression Ratio = Original Size / Compressed Size
We use C-PACK+Z for the rest of the talk!
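The two formulas above are easy to make concrete. A minimal sketch in Python; the 8 MB LLC and block counts below are illustrative numbers, not results from the talk:

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Compression Ratio = Original Size / Compressed Size (slide formula)."""
    return original_size / compressed_size

def normalized_effective_capacity(valid_blocks: int, max_uncompressed_blocks: int) -> float:
    """Valid blocks held, relative to what an uncompressed cache can hold."""
    return valid_blocks / max_uncompressed_blocks

# A 64-byte block compressed to 20 bytes, as in the earlier figure:
ratio = compression_ratio(64, 20)  # 3.2

# If compression let an 8 MB LLC (131072 64-byte frames) hold 393216 valid
# blocks, its normalized effective capacity would be:
cap = normalized_effective_capacity(393216, 131072)  # 3.0
```

A ratio of 3.9 is thus an upper bound on effective capacity; the compaction scheme determines how much of it is actually realized.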
Compaction
(2) Compaction: how to store and find blocks?
• Critical to achieving the compression potential.
• This work focuses on compaction.
[Figure: Fixed-Size Compressed Cache (FixedC) [Kim, WMPI 2002; Yang, MICRO 2002]: each compressed block is stored in a fixed-size data slot, causing internal fragmentation!]
Compaction
(2) Compaction: how to store and find blocks?
[Figure: Variable-Size Compressed Cache (VSC) [Alameldeen, ISCA 2004]: each compressed block occupies a variable number of contiguous sub-blocks]
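The difference between the two compaction schemes can be sketched in a few lines. This is a simplified model, assuming FixedC allocates a half-block (32 B) per compressible block and VSC allocates 16 B sub-blocks; the block sizes are illustrative:

```python
import math

def fixedc_bytes(compressed: int) -> int:
    # FixedC (simplified): a compressed block occupies a fixed half-block
    # (32 B), or a full 64 B block if it does not fit in half.
    return 32 if compressed <= 32 else 64

def vsc_bytes(compressed: int, sub_block: int = 16) -> int:
    # VSC: allocate a variable number of contiguous sub-blocks.
    return math.ceil(compressed / sub_block) * sub_block

# Hypothetical compressed sizes of four blocks, in bytes:
sizes = [20, 33, 10, 64]
fixedc_waste = sum(fixedc_bytes(s) - s for s in sizes)  # bytes lost to padding
vsc_waste = sum(vsc_bytes(s) - s for s in sizes)
# VSC's finer allocation granularity leaves less internal fragmentation.
```

The finer the sub-block, the less fragmentation, but also the more metadata needed to locate each block, which is the tension the next slides quantify.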
Previous Compressed Caches
(Limit 1) Limited tags/metadata: high area overhead from adding 4X or more tags.
(Limit 2) Internal fragmentation: low cache capacity utilization.
[Figure: normalized effective capacities of previous designs (1.7, 2.0, 2.3, 2.6, 3.1, with 16 B and 10 B sub-blocks) fall short of the 3.9 potential]
Normalized Effective Capacity = Number of Valid LLC Blocks / MAX Number of (Uncompressed) Blocks
(Limit 3) Energy-Expensive Re-Compaction
VSC requires energy-expensive re-compaction: when an update to block B grows it to 2 sub-blocks, the blocks stored after it must be shifted, costing up to 3X higher LLC dynamic energy!
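The re-compaction cost can be illustrated with a toy model of one VSC set, where compressed blocks are stored contiguously as runs of sub-blocks (the layout and sizes are hypothetical):

```python
def recompact(layout, block, new_size):
    """Grow `block` to `new_size` sub-blocks in a contiguous VSC set.

    layout: list of (block_id, n_sub_blocks) runs stored back-to-back.
    Returns (new_layout, moved), where `moved` counts the sub-blocks that
    must be re-read and re-written to make room: the source of VSC's
    energy-expensive re-compaction."""
    new_layout = [(bid, new_size if bid == block else n) for bid, n in layout]
    moved, seen = 0, False
    for bid, n in layout:
        if seen:
            moved += n  # everything after the grown block shifts
        if bid == block:
            seen = True
    return new_layout, moved

# Hypothetical set: A(2), B(1), C(3), D(2) sub-blocks; an update grows B to 2.
layout, moved = recompact([("A", 2), ("B", 1), ("C", 3), ("D", 2)], "B", 2)
# moved == 5: C's 3 sub-blocks and D's 2 sub-blocks are shifted.
```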
Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions
Decoupled Compressed Cache
(1) Exploiting spatial locality: low area overhead.
(2) Decoupling the tag/data mapping: eliminates energy-expensive re-compaction and reduces internal fragmentation.
(3) Co-DCC, dynamically co-compacting super-blocks: further reduces internal fragmentation.
(1) Exploiting Spatial Locality
DCC tracks LLC blocks at super-block granularity.
[Figure: a conventional compressed cache needs 4X tags, while DCC needs only 2X super-block tags. A quad (Q) super tag covers blocks A, B, C, D with a small per-block state field (state A, state B, state C, state D); a singleton (S) covers the lone block E.]
Up to 4X blocks with low area overhead!
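The tag-area saving from super-block tracking can be sketched with back-of-the-envelope arithmetic. The address and field widths below are assumptions for illustration, not the paper's exact layout:

```python
# Assumed geometry: 48-bit addresses, 64 B blocks, 4096 sets,
# 4-block super-blocks.
ADDR_BITS, BLOCK_OFF, SET_BITS = 48, 6, 12

per_block_tag = ADDR_BITS - BLOCK_OFF - SET_BITS      # conventional tag: 30 b
super_tag = per_block_tag - 2                         # 4 blocks share it: 28 b
per_block_state = 3                                   # e.g. valid/compression state

# Tag storage needed to track 4 neighboring blocks:
conventional = 4 * per_block_tag                      # 4 separate tags
dcc = super_tag + 4 * per_block_state                 # 1 super tag + 4 states
# With 2X super tags, DCC can track up to 4X blocks for far less tag area.
```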
(2) Decoupling the Tag/Data Mapping
DCC decouples the tag/data mapping to eliminate re-compaction.
[Figure: super tags (quad Q: A, B, C, D; singleton S: E) with flexible allocation: when B is updated, its new sub-blocks can be placed in any free sub-blocks instead of shifting neighbors]
(2) Decoupling the Tag/Data Mapping
Back pointers identify the owner block of each sub-block.
[Figure: each data sub-block carries a back pointer (Tag ID, Blk ID) naming its owner in the super tags (quad Q: A, B, C, D; singleton S: E)]
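A minimal sketch of this decoupled mapping (the set layout below is hypothetical): each data sub-block carries a back pointer, and a read simply selects the sub-blocks whose pointer matches, wherever in the set they sit:

```python
def read_block(back_pointers, tag_id, blk_id):
    """Return the indices of the sub-blocks owned by (tag_id, blk_id).

    back_pointers: one (tag_id, blk_id) entry per data sub-block, or None
    if the sub-block is free. Because ownership is recorded per sub-block,
    a block's sub-blocks need not be contiguous, so no re-compaction is
    ever required."""
    return [i for i, bp in enumerate(back_pointers) if bp == (tag_id, blk_id)]

# Hypothetical set with 8 sub-blocks; block B of quad Q owns sub-blocks 1 and 6.
bps = [("Q", "A"), ("Q", "B"), None, ("S", "E"),
       ("Q", "A"), None, ("Q", "B"), ("Q", "C")]
owned = read_block(bps, "Q", "B")  # [1, 6]
```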
(3) Co-Compacting Super-Blocks
Co-DCC dynamically co-compacts super-blocks, further reducing internal fragmentation.
[Figure: the compressed blocks of quad Q (A, B, C, D) share sub-blocks instead of each block padding its own]
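Co-compaction can be sketched as packing a super-block's compressed blocks back-to-back and recording per-block Begin offsets plus an END; a simplified model with illustrative sizes and a 16 B sub-block:

```python
def co_compact(compressed_sizes, sub_block=16):
    """Pack a super-block's compressed blocks contiguously.

    Returns (begin_offsets, end, sub_blocks_used). Only the final END is
    rounded up to a sub-block boundary, so the super-block pays for
    internal fragmentation once, not once per block."""
    begins, offset = [], 0
    for size in compressed_sizes:
        begins.append(offset)
        offset += size
    used = -(-offset // sub_block)  # ceiling division
    return begins, offset, used

# Quad A, B, C, D compressed to 20, 10, 25, and 9 bytes:
begins, end, used = co_compact([20, 10, 25, 9])
# begins == [0, 20, 30, 55], end == 64, used == 4 sub-blocks,
# versus 6 sub-blocks if each block were padded to 16 B separately.
```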
Outline
- Motivation
- Compressed caching
- Our proposal: Decoupled Compressed Cache
- Experimental results
- Conclusions
Experimental Methodology
Integrated DCC with the AMD Bulldozer cache.
- We model the timing and allocation constraints of sequential regions at the LLC in detail.
- No need for an alignment network.
Verilog implementation and synthesis of the tag-match and sub-block-selection logic.
- One additional cycle of latency due to sub-block selection.
Experimental Methodology
Full-system simulation with a simulator based on GEMS.
Wide range of applications with different levels of cache sensitivity:
- Commercial workloads: apache, jbb, oltp, zeus
- SPEC OMP: ammp, applu, equake, mgrid, wupwise
- PARSEC: blackscholes, canneal, freqmine
- SPEC 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

Cores: Eight OOO cores, 3.2 GHz
L1I$/L1D$: Private, 32 KB, 8-way
L2$: Private, 256 KB, 8-way
L3$: Shared, 8 MB, 16-way, 8 banks
Main Memory: 4 GB, 16 banks, DDR3 with 800 MHz bus frequency
Effective LLC Capacity

Components             FixedC/VSC-2X   DCC     Co-DCC
Tag Array              6.3%            2.1%    11.3%
Back Pointer Array     0               4.4%    5.4%
(De-)Compressors       1.8%            1.8%    1.8%
Total Area Overhead    8.1%            8.3%    18.5%

[Figure: normalized LLC area and normalized effective LLC capacity for Baseline, FixedC, VSC, DCC, Co-DCC, and a 2X Baseline]
(Co-)DCC Energy Consumption
(Co-)DCC reduces system energy by reducing the number of accesses to main memory.
[Figure: normalized system energy per configuration: 0.93, 0.96, 0.97, 0.91, and 0.88, with (Co-)DCC the lowest]
Summary
Analyzed the limits of compressed caching:
• Limited number of tags
• Internal fragmentation
• Energy-expensive re-compaction
Decoupled Compressed Cache:
• Improves the performance and energy of compressed caching
• Decoupled super-blocks
• Non-contiguous sub-blocks
Co-DCC further reduces internal fragmentation.
Practical designs [details in the paper].

Backup slides: (de-)compression overhead, DCC data array organization with AMD Bulldozer, DCC timing, DCC lookup, applications, Co-DCC design, LLC effective capacity, LLC miss rate, memory dynamic energy, LLC dynamic energy.
Backup
(De-)Compression Overhead

Parameters               Compressor   Decompressor
Pipeline Depth           6            2
Latency (cycles)         16           9
Area (mm²)               0.016        0.016
Power Consumption (mW)   25.84        19.01
DCC Data Array Organization: AMD Bulldozer
[Figure: the data array is organized as four independently addressed sub-ranks (SR0-SR3), latched by alternating A-phase and B-phase flops as in AMD Bulldozer; each sub-rank receives its own sub-block address along with the set address, so the sub-blocks of one block can be read from non-contiguous locations. Example: A0 is uncompressed; B1 and C2 are compressed to 2 sub-blocks.]
DCC Lookup
1. Access the super tags and back pointers in parallel.
2. Find the matching back pointers.
3. Read the corresponding sub-blocks and decompress.
[Figure: reading block C of quad Q (A, B, C, D; singleton S: E); back-pointer matches select C's sub-blocks from the data array]
Applications
SPEC 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm
[Figure: workloads grouped by sensitivity: sensitive to cache capacity and latency, sensitive to cache capacity, sensitive to cache latency, and cache-insensitive]
Co-DCC Design
[Figure: a co-compacted super-block: the compressed blocks A0, A1, A2 are packed back-to-back across shared sub-blocks (sub-block 0 through sub-block 7), with a Begin pointer per block (A0-Begin, A1-Begin, ...) and a single A-END pointer marking the end of the region.]
[Figure: Co-DCC tag entry: Super-Block Tag, Sharers (4b), and per-block Cstate (3b), Comp (1b), and Begin (7b) fields for blocks 0-3, plus an END pointer (7b).]