CloudCache: Expanding and Shrinking Private Caches


DESCRIPTION

CloudCache: Expanding and Shrinking Private Caches. Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers. L2 cache design challenges: heterogeneous workloads, multiple VMs/apps on a single chip, average data center utilization of 15~30%, tiled many-core CMPs.

TRANSCRIPT

CloudCache: Expanding and Shrinking Private Caches

Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers

L2 cache design challenges

• Heterogeneous workloads
  – Multiple VMs/apps on a single chip
  – Average data center utilization: 15~30%

• Tiled many-core CMPs
  – Intel 48-core SCC, Tilera 100-core CMP
  – Many L2 banks: 10s to 100s

• L2 cache management
  – Capacity allocation
  – Remote L2 access latency
  – Distributed on-chip directory

Question: how should the L2 cache resources of a many-core CMP be managed?

CloudCache approach

Design philosophy: aggressively and flexibly allocate capacity based on workloads' demand

Key techniques:
  – Global capacity partitioning
  – Cache chain links with nearby L2 banks
  – Limited target broadcast

Beneficiaries: gain performance
Benefactors: sacrifice performance

CloudCache example

[Figure: threads with differing capacity demand mapped onto a tiled 64-core CMP]

Remote L2 access is still slow

[Figure: a remote directory access followed by a remote data access on the tiled 64-core CMP]

• Remote directory access is on the critical path

CloudCache solution

• Remote directory access is on the critical path
• Limited target broadcast (LTB)

[Figure: LTB on the tiled 64-core CMP]

Purposes of proposed techniques

I. Cloud (partitioned capacity) formation
  • Global capacity partitioning
  • Distance-aware cache chain links

II. Directory access latency minimization
  • Limited target broadcast

I. Cloud formation

A four-step process forms clouds based on workload demand:

Step 1: Monitoring
Step 2: Capacity determination
Step 3: L2 bank/token allocation
Step 4: Chain link formation

Step 1: Monitoring

• GCA (global capacity allocator)
• Hit count per LRU position: "allocated capacity" + "monitoring capacity" (of 32 ways)
• Partitioning: utility [Qureshi & Patt, MICRO '06] and QoS

[Figure: L2 banks send hit counts over the network to the GCA, which returns cache allocation info]
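The hit-count-per-LRU-position monitoring that feeds the GCA can be sketched in a few lines. This is a minimal Python sketch, assuming a 32-way recency stack per thread; the class and method names are illustrative, not taken from the paper.

    class LRUStackMonitor:
        """Tracks hits per LRU stack position for one thread.

        By the stack property of LRU, sum(hits[:k]) estimates how many
        hits the thread would see if it were given k ways.
        """

        def __init__(self, num_ways=32):
            self.num_ways = num_ways
            self.stack = []                 # tags, most recently used first
            self.hits = [0] * num_ways      # hit count per LRU position

        def access(self, tag):
            if tag in self.stack:
                pos = self.stack.index(tag)     # 0 = MRU, num_ways-1 = LRU
                self.hits[pos] += 1
                self.stack.remove(tag)
            elif len(self.stack) == self.num_ways:
                self.stack.pop()                # evict the LRU tag
            self.stack.insert(0, tag)           # install at the MRU position

        def hits_with_ways(self, k):
            """Estimated hit count if this thread had k ways."""
            return sum(self.hits[:k])

The GCA would periodically collect each thread's hits array (the hit counts sent over the network in the figure) and reset it for the next interval.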

Step 2: Capacity determination

[Figure: the GCA's allocation engine combines hit counts from each thread's allocated capacity and from its monitoring capacity to build per-thread utility curves over capacity]

Output: per-thread capacity that minimizes overall misses
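One way the allocation engine could turn those counters into capacities, in the spirit of utility-based partitioning [Qureshi & Patt, MICRO '06], is a greedy loop that repeatedly grants one capacity unit to the thread that gains the most hits from it. The sketch below is an assumption for illustration; the paper's engine also honors QoS constraints that are omitted here.

    def determine_capacity(hit_curves, total_units):
        """Greedy utility-based partitioning.

        hit_curves[t][k]: estimated hits for thread t with k capacity units
                          (built from the Step 1 LRU-position counters).
        Returns the number of capacity units granted to each thread.
        """
        alloc = [0] * len(hit_curves)
        for _ in range(total_units):
            best_thread, best_gain = None, 0
            for t, curve in enumerate(hit_curves):
                if alloc[t] + 1 >= len(curve):
                    continue                    # thread cannot use more capacity
                gain = curve[alloc[t] + 1] - curve[alloc[t]]
                if gain > best_gain:
                    best_thread, best_gain = t, gain
            if best_thread is None:
                break                           # no thread benefits from more
            alloc[best_thread] += 1
        return alloc

Maximizing total hits this way is the same as minimizing overall misses for a fixed access stream, which matches the stated output of this step.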

Step 3: Bank/token allocation

1. Local L2 cache first
2. Threads with larger capacity demand first
3. Closer L2 banks first

[Figure: worked example; a thread with a total capacity of 2.75 banks is given bank 0 / token 1, bank 2 / token 1, and bank 3 / token 0.75, while a thread with 1.25 banks is given bank 1 / token 1 and bank 3 / token 0.25; allocation repeats until each thread's remaining capacity reaches 0]

Output: a bank/token assignment for each thread
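A rough sketch of this greedy order in Python, assuming fractional tokens (the fraction of a bank granted to a thread) and a hypothetical hop_distance helper; this is a simplified reading of the slide, not the paper's exact algorithm.

    def allocate_banks(demands, home_bank, hop_distance, free):
        """demands:      {thread: capacity demand in banks, e.g. 2.75}
           home_bank:    {thread: the thread's local L2 bank}
           hop_distance: hop_distance(bank_a, bank_b) -> hops
           free:         {bank: free fraction of the bank, 1.0 if untouched}
           Returns {thread: [(bank, token), ...]}.
        """
        remaining = dict(demands)
        plan = {t: [] for t in demands}

        def grant(t, bank, amount):
            token = min(free[bank], remaining[t], amount)
            if token > 0:
                plan[t].append((bank, token))
                free[bank] -= token
                remaining[t] -= token

        # Rule 1: every thread claims (up to one bank of) its local L2 first.
        for t in demands:
            grant(t, home_bank[t], 1.0)

        # Rule 2: serve threads with larger remaining demand first,
        # Rule 3: taking closer L2 banks before farther ones.
        for t in sorted(remaining, key=remaining.get, reverse=True):
            for bank in sorted(free, key=lambda b: hop_distance(home_bank[t], b)):
                if remaining[t] <= 0:
                    break
                grant(t, bank, remaining[t])
        return plan

With demands like those in the example (2.75, 1.25, and 0.25 banks), this pass yields assignments of the same shape as the slide's bank/token table.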

Step 4: Building cache chain links

[Figure: the banks allocated to a thread (bank 0 / token 1, bank 2 / token 1, bank 3 / token 0.75, total capacity 2.75) are linked in order of hop distance (0, 1, 2) to form a virtual L2 cache, with the local bank at the MRU end and the farthest bank at the LRU end]
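Chain-link formation is then just an ordering of the banks granted in Step 3. A minimal sketch under the same assumptions as above; the comment at the end mirrors the slide's example, and the demotion behavior described in the docstring is the natural reading of the MRU-to-LRU ordering.

    def build_chain(thread, plan, home_bank, hop_distance):
        """Order a thread's allocated banks by hop distance from its tile.

        The result is the thread's virtual L2 cache: a block evicted from
        one link can be demoted to the next (farther) link, and blocks
        evicted from the last link leave the chip.
        """
        home = home_bank[thread]
        return sorted(plan[thread], key=lambda bt: hop_distance(home, bt[0]))

    # With the slide's allocation for thread 0,
    #   plan[0] = [(0, 1.0), (2, 1.0), (3, 0.75)]
    # and hop distances 0, 1, 2 from tile 0, the chain runs
    #   bank 0 (MRU end) -> bank 2 -> bank 3 (LRU end), total capacity 2.75.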


II. Limited target broadcast (LTB)

• Private data:
  – LTB without a directory lookup
  – Update the directory afterwards

• Shared data:
  – Directory-based coherence

• Private data is far more common than shared data

Limited target broadcast protocol

• Data is fetched with LTB
• If another core accesses the directory for the fetched data before the directory is updated, the directory holds stale information
• The protocol that handles this race is detailed in the paper
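A very rough sketch of the LTB decision on an L2 miss. All names here (the local private/shared hint, the network calls) are hypothetical stand-ins; the real race handling is the protocol detailed in the paper.

    def handle_l2_miss(block, chain, local_hints, network):
        """Limited target broadcast (LTB) versus a directory lookup."""
        if not local_hints.is_private(block):
            # Shared data: ordinary directory-based coherence.
            return network.directory_lookup(block)

        # Private data: broadcast only to the banks in this thread's chain,
        # keeping the remote directory lookup off the critical path.
        reply = network.broadcast(block, targets=[bank for bank, _ in chain])

        # The directory is updated afterwards; if another core reaches the
        # directory for this block before the update lands, the directory
        # is briefly stale, which the full protocol must resolve.
        network.update_directory_async(block)
        return reply

Because private data dominates shared data, most misses take the broadcast path and avoid the directory latency.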

Experimental setup

• TPTS simulator [Lee et al., SPE '10; Cho et al., ICPP '08]
  – 64-core CMP with an 8×8 2D mesh, 4 cycles/hop
  – Core: Intel Atom-like two-issue in-order pipeline
  – Directory-based MESI protocol
  – Four independent DRAM controllers, four ports per controller
  – DRAM with Samsung DDR3-1600 timing

• Workloads
  – SPEC2006 (10B cycles), classified as high/medium/low based on MPKI at varying cache capacities
  – PARSEC (simlarge input set), 16 threads per application

Evaluated schemes

• Shared
• Private
• DSR [HPCA 2009]
  – Spiller and receiver
• ECC [ISCA 2010]
  – Local partitioning/monitoring
• CloudCache

Impact of global partitioning

[Figure: MPKI of Shared, Private, DSR, ECC, and CloudCache for high-, medium-, and low-demand workloads]

CloudCache aggressively allocates capacity with global information.

L2 cache access latency

[Figure: access count vs. L2 access latency for 401.bzip2, one panel per scheme]

• Shared: widely spread latency
• Private: fast local access + off-chip access
• DSR & ECC: fast local access + widely spread latency
• CloudCache: fast local access + fast remote access

16 threads, throughput

[Figure: speedup of Shared, DSR, ECC, and CloudCache relative to Private for the Comb, Light, Medium, Heavy, and AVG workload mixes]

16 threads, beneficiaries

[Figure: number of beneficiaries, average speedup, and maximum speedup for Shared, DSR, ECC, and CloudCache]

• Benefactors' performance: <1% degradation
  – Please see the paper for graphs

32 / 64 threads, throughput

[Figure: speedup of Shared, DSR, ECC, and CloudCache relative to Private for the Comb, Light, and AVE workload mixes, shown for 32 and 64 threads]

Multithreaded workload (PARSEC)

[Figure: speedup relative to Private for Shared, DSR, ECC, and CloudCache across five workload combinations (Comb1 to Comb5)]

Conclusion

• Unbounded shared capacity is EVIL

• CloudCache: private caches for threads
  – Capacity allocation with global partitioning
  – Cache chain links with nearby L2 banks
  – Limited target broadcast

• HW overhead is very small (~5 KB)

Use CloudCache!
