CloudCache: Expanding and Shrinking Private Caches


DESCRIPTION

CloudCache: Expanding and Shrinking Private Caches. Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers. L2 cache design challenges: heterogeneous workloads, multiple VMs/apps on a single chip, average data center utilization of 15~30%, tiled many-core CMPs.

TRANSCRIPT

CloudCache: Expanding and Shrinking Private Caches

Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers

L2 cache design challenges

• Heterogeneous workloads
  – Multiple VMs/apps on a single chip
  – Average data center utilization: 15~30%

• Tiled many-core CMPs
  – Intel 48-core SCC, Tilera 100-core CMP
  – Many L2 banks: 10s to 100s

• L2 cache management
  – Capacity allocation
  – Remote L2 access latency
  – Distributed on-chip directory

Question: how should the L2 cache resources of a many-core CMP be managed?

CloudCache approach

Design philosophy: aggressively and flexibly allocate capacity based on workloads' demand

Key techniques:
  – Global capacity partitioning
  – Cache chain links with nearby L2 banks
  – Limited target broadcast

Beneficiaries: gain performance
Benefactors: sacrifice performance

CloudCache example

[Figure: threads with differing capacity demand mapped onto a tiled 64-core CMP]

Remote L2 access is still slow

[Figure: a remote directory access followed by a remote data access on the tiled 64-core CMP]

• Remote directory access is on the critical path

CloudCache solution

• Remote directory access is on the critical path
• Limited target broadcast (LTB)

[Figure: LTB on the tiled 64-core CMP]

Purposes of proposed techniques

I. Cloud (partitioned capacity) formation
  • Global capacity partitioning
  • Distance-aware cache chain links

II. Directory access latency minimization
  • Limited target broadcast

I. Cloud formation

A four-step process forms clouds based on workload demand:

Step 1: Monitoring
Step 2: Capacity determination
Step 3: L2 bank/token allocation
Step 4: Chain link formation

Step 1: Monitoring

• GCA (global capacity allocator)
• Hit count per LRU position: "allocated capacity" + "monitoring capacity" (of 32 ways)
• Partitioning: utility [Qureshi & Patt, MICRO '06] and QoS

[Figure: L2 banks send hit counts over the network to the GCA, which returns cache allocation info]
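The hit-count-per-LRU-position monitoring that feeds the GCA can be sketched in a few lines. This is a minimal Python sketch, assuming a 32-way recency stack per thread; the class and method names are illustrative, not taken from the paper.

    class LRUStackMonitor:
        """Tracks hits per LRU stack position for one thread.

        By the stack property of LRU, sum(hits[:k]) estimates how many
        hits the thread would see if it were given k ways.
        """

        def __init__(self, num_ways=32):
            self.num_ways = num_ways
            self.stack = []                 # tags, most recently used first
            self.hits = [0] * num_ways      # hit count per LRU position

        def access(self, tag):
            if tag in self.stack:
                pos = self.stack.index(tag)     # 0 = MRU, num_ways-1 = LRU
                self.hits[pos] += 1
                self.stack.remove(tag)
            elif len(self.stack) == self.num_ways:
                self.stack.pop()                # evict the LRU tag
            self.stack.insert(0, tag)           # install at the MRU position

        def hits_with_ways(self, k):
            """Estimated hit count if this thread had k ways."""
            return sum(self.hits[:k])

The GCA would periodically collect each thread's hits array (the hit counts sent over the network in the figure) and reset it for the next interval.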

Step 2: Capacity determination

[Figure: the GCA's allocation engine combines hit counts from each thread's allocated capacity and from its monitoring capacity to build per-thread utility curves over capacity]

Output: per-thread capacity that minimizes overall misses
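One way the allocation engine could turn those counters into capacities, in the spirit of utility-based partitioning [Qureshi & Patt, MICRO '06], is a greedy loop that repeatedly grants one capacity unit to the thread that gains the most hits from it. The sketch below is an assumption for illustration; the paper's engine also honors QoS constraints that are omitted here.

    def determine_capacity(hit_curves, total_units):
        """Greedy utility-based partitioning.

        hit_curves[t][k]: estimated hits for thread t with k capacity units
                          (built from the Step 1 LRU-position counters).
        Returns the number of capacity units granted to each thread.
        """
        alloc = [0] * len(hit_curves)
        for _ in range(total_units):
            best_thread, best_gain = None, 0
            for t, curve in enumerate(hit_curves):
                if alloc[t] + 1 >= len(curve):
                    continue                    # thread cannot use more capacity
                gain = curve[alloc[t] + 1] - curve[alloc[t]]
                if gain > best_gain:
                    best_thread, best_gain = t, gain
            if best_thread is None:
                break                           # no thread benefits from more
            alloc[best_thread] += 1
        return alloc

Maximizing total hits this way is the same as minimizing overall misses for a fixed access stream, which matches the stated output of this step.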

Step 3: Bank/token allocation

1. Local L2 cache first
2. Threads with larger capacity demand first
3. Closer L2 banks first

[Figure: worked example; a thread with a total capacity of 2.75 banks is given bank 0 / token 1, bank 2 / token 1, and bank 3 / token 0.75, while a thread with 1.25 banks is given bank 1 / token 1 and bank 3 / token 0.25; allocation repeats until each thread's remaining capacity reaches 0]

Output: a bank/token assignment for each thread
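A rough sketch of this greedy order in Python, assuming fractional tokens (the fraction of a bank granted to a thread) and a hypothetical hop_distance helper; this is a simplified reading of the slide, not the paper's exact algorithm.

    def allocate_banks(demands, home_bank, hop_distance, free):
        """demands:      {thread: capacity demand in banks, e.g. 2.75}
           home_bank:    {thread: the thread's local L2 bank}
           hop_distance: hop_distance(bank_a, bank_b) -> hops
           free:         {bank: free fraction of the bank, 1.0 if untouched}
           Returns {thread: [(bank, token), ...]}.
        """
        remaining = dict(demands)
        plan = {t: [] for t in demands}

        def grant(t, bank, amount):
            token = min(free[bank], remaining[t], amount)
            if token > 0:
                plan[t].append((bank, token))
                free[bank] -= token
                remaining[t] -= token

        # Rule 1: every thread claims (up to one bank of) its local L2 first.
        for t in demands:
            grant(t, home_bank[t], 1.0)

        # Rule 2: serve threads with larger remaining demand first,
        # Rule 3: taking closer L2 banks before farther ones.
        for t in sorted(remaining, key=remaining.get, reverse=True):
            for bank in sorted(free, key=lambda b: hop_distance(home_bank[t], b)):
                if remaining[t] <= 0:
                    break
                grant(t, bank, remaining[t])
        return plan

With demands like those in the example (2.75, 1.25, and 0.25 banks), this pass yields assignments of the same shape as the slide's bank/token table.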

Step 4: Building cache chain links

[Figure: the banks allocated to a thread (bank 0 / token 1, bank 2 / token 1, bank 3 / token 0.75, total capacity 2.75) are linked in order of hop distance (0, 1, 2) to form a virtual L2 cache, with the local bank at the MRU end and the farthest bank at the LRU end]
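Chain-link formation is then just an ordering of the banks granted in Step 3. A minimal sketch under the same assumptions as above; the comment at the end mirrors the slide's example, and the demotion behavior described in the docstring is the natural reading of the MRU-to-LRU ordering.

    def build_chain(thread, plan, home_bank, hop_distance):
        """Order a thread's allocated banks by hop distance from its tile.

        The result is the thread's virtual L2 cache: a block evicted from
        one link can be demoted to the next (farther) link, and blocks
        evicted from the last link leave the chip.
        """
        home = home_bank[thread]
        return sorted(plan[thread], key=lambda bt: hop_distance(home, bt[0]))

    # With the slide's allocation for thread 0,
    #   plan[0] = [(0, 1.0), (2, 1.0), (3, 0.75)]
    # and hop distances 0, 1, 2 from tile 0, the chain runs
    #   bank 0 (MRU end) -> bank 2 -> bank 3 (LRU end), total capacity 2.75.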


II. Limited target broadcast (LTB)

• Private data:
  – LTB without a directory lookup
  – Update the directory afterwards

• Shared data:
  – Directory-based coherence

• Private data is far more common than shared data

Limited target broadcast protocol

• Data is fetched with LTB
• If another core accesses the directory for the fetched data before the directory is updated, the directory holds stale information
• The protocol that handles this race is detailed in the paper
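A very rough sketch of the LTB decision on an L2 miss. All names here (the local private/shared hint, the network calls) are hypothetical stand-ins; the real race handling is the protocol detailed in the paper.

    def handle_l2_miss(block, chain, local_hints, network):
        """Limited target broadcast (LTB) versus a directory lookup."""
        if not local_hints.is_private(block):
            # Shared data: ordinary directory-based coherence.
            return network.directory_lookup(block)

        # Private data: broadcast only to the banks in this thread's chain,
        # keeping the remote directory lookup off the critical path.
        reply = network.broadcast(block, targets=[bank for bank, _ in chain])

        # The directory is updated afterwards; if another core reaches the
        # directory for this block before the update lands, the directory
        # is briefly stale, which the full protocol must resolve.
        network.update_directory_async(block)
        return reply

Because private data dominates shared data, most misses take the broadcast path and avoid the directory latency.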

Experimental setup

• TPTS simulator [Lee et al., SPE '10; Cho et al., ICPP '08]
  – 64-core CMP with an 8×8 2D mesh, 4 cycles/hop
  – Core: Intel Atom-like two-issue in-order pipeline
  – Directory-based MESI protocol
  – Four independent DRAM controllers, four ports per controller
  – DRAM with Samsung DDR3-1600 timing

• Workloads
  – SPEC2006 (10B cycles), classified as high/medium/low based on MPKI at varying cache capacities
  – PARSEC (simlarge input set), 16 threads per application

Evaluated schemes

• Shared
• Private
• DSR [HPCA 2009]
  – Spiller and receiver
• ECC [ISCA 2010]
  – Local partitioning/monitoring
• CloudCache

Impact of global partitioning

[Figure: MPKI of Shared, Private, DSR, ECC, and CloudCache for high-, medium-, and low-demand workloads]

CloudCache aggressively allocates capacity with global information.

L2 cache access latency

[Figure: access count vs. L2 access latency for 401.bzip2, one panel per scheme]

• Shared: widely spread latency
• Private: fast local access + off-chip access
• DSR & ECC: fast local access + widely spread latency
• CloudCache: fast local access + fast remote access

16 threads, throughput

[Figure: speedup of Shared, DSR, ECC, and CloudCache relative to Private for the Comb, Light, Medium, Heavy, and AVG workload mixes]

16 threads, beneficiaries

[Figure: number of beneficiaries, average speedup, and maximum speedup for Shared, DSR, ECC, and CloudCache]

• Benefactors' performance: <1% degradation
  – Please see the paper for graphs

32 / 64 threads, throughput

[Figure: speedup of Shared, DSR, ECC, and CloudCache relative to Private for the Comb, Light, and AVE workload mixes, shown for 32 and 64 threads]

Multithreaded workload (PARSEC)

[Figure: speedup relative to Private for Shared, DSR, ECC, and CloudCache across five workload combinations (Comb1 to Comb5)]

Conclusion

• Unbounded shared capacity is EVIL

• CloudCache: private caches for threads
  – Capacity allocation with global partitioning
  – Cache chain links with nearby L2 banks
  – Limited target broadcast

• HW overhead is very small (~5 KB)

Use CloudCache!
