
Page 1: CloudCache Expanding and Shrinking Private Caches

CloudCache: Expanding and Shrinking Private Caches

Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers

Page 2: CloudCache Expanding and Shrinking Private Caches

L2 cache design challenges

• Heterogeneous workloads
  – Multiple VMs/apps in a single chip
  – Average data center utilization: 15~30%

• Tiled many-core CMPs
  – Intel 48-core SCC, Tilera 100-core CMP
  – Many L2 banks: 10s ~ 100s

• L2 cache management
  – Capacity allocation
  – Remote L2 access latency
  – Distributed on-chip directory

Question: How should the L2 cache resources of a many-core CMP be managed?

Page 3: CloudCache Expanding and Shrinking Private Caches

CloudCache approach

Design philosophy: aggressively and flexibly allocate capacity based on workloads’ demand

Key techniques:
• Global capacity partitioning
• Cache chain links w/ nearby L2 banks
• Limited target broadcast

Beneficiaries: gain performance
Benefactors: sacrifice performance

Page 4: CloudCache Expanding and Shrinking Private Caches

CloudCache example

[Figure: threads with differing capacity demands mapped onto a tiled 64-core CMP]


Page 6: CloudCache Expanding and Shrinking Private Caches

Remote L2 access still slow

[Figure: tiled 64-core CMP]

Page 7: CloudCache Expanding and Shrinking Private Caches

Remote L2 access still slow

[Figure: remote directory lookup followed by data transfer on the tiled 64-core CMP]

• Remote directory access is on the critical path

Page 8: CloudCache Expanding and Shrinking Private Caches

CloudCache solution

• Remote directory access is on the critical path
• Solution: limited target broadcast (LTB)

[Figure: limited target broadcast on the tiled 64-core CMP]

Page 9: CloudCache Expanding and Shrinking Private Caches

Purposes of proposed techniques

I. Cloud (partitioned capacity) formation
   • Global capacity partitioning
   • Distance-aware cache chain links

II. Directory access latency minimization
   • Limited target broadcast

Page 10: CloudCache Expanding and Shrinking Private Caches

I. Cloud formation

A four-step process forms clouds based on workload demand:

Step 1: Monitoring
Step 2: Capacity determination
Step 3: L2 bank/token allocation
Step 4: Chain link formation

Page 11: CloudCache Expanding and Shrinking Private Caches

Step 1: Monitoring

• GCA (global capacity allocator)
• Hit count per LRU position
  – "allocated capacity" + "monitoring capacity" (32 ways total)
• Partitioning: utility [Qureshi & Patt, MICRO ‘06] and QoS

[Figure: each tile sends its hit counts to the GCA over the network; the GCA sends back cache allocation info]
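A minimal sketch of the monitoring step, under assumptions of my own: each thread counts hits per LRU stack position over a 32-way window (UMON-style bookkeeping behind utility-based partitioning) and ships the counters to the GCA every epoch. The class and method names are illustrative, not from the paper.

class LRUMonitor:
    """Per-thread hit counters over a 32-way LRU stack ("allocated" + "monitoring" ways)."""

    def __init__(self, ways=32):
        self.ways = ways
        self.stack = []              # sampled tags, index 0 = MRU, last = LRU
        self.hits = [0] * ways       # hit count per LRU stack position

    def access(self, tag):
        if tag in self.stack:
            pos = self.stack.index(tag)
            self.hits[pos] += 1      # a hit at position p needs at least p+1 ways
            self.stack.pop(pos)
        elif len(self.stack) == self.ways:
            self.stack.pop()         # monitor full: drop the LRU tag
        self.stack.insert(0, tag)    # promote/insert the accessed tag at MRU

    def report(self):
        # Shipped to the GCA each epoch; hits[0..k-1] sum to the hits k ways would keep.
        return list(self.hits)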

Page 12: CloudCache Expanding and Shrinking Private Caches

Step 2: Capacity determination

[Figure: hit counts from the allocated capacity and from the monitoring capacity (plotted against capacity) feed the GCA's allocation engine]

Output: capacity to minimize overall misses
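The slides point to utility-based partitioning [Qureshi & Patt, MICRO ‘06]. Below is a deliberately simplified greedy sketch of how the GCA's allocation engine could turn the reported hit counters into per-thread capacities; the real UCP algorithm uses a lookahead pass and CloudCache also considers QoS, neither of which is modeled here.

def partition(hits, total_ways):
    """hits[t][p]: thread t's hit count at LRU position p, i.e. the extra hits
    thread t would gain from its (p+1)-th way. Returns ways allocated per thread."""
    alloc = [0] * len(hits)
    for _ in range(total_ways):
        # Hand the next way to the thread with the largest marginal utility.
        best, best_gain = None, -1
        for t, counters in enumerate(hits):
            if alloc[t] < len(counters) and counters[alloc[t]] > best_gain:
                best, best_gain = t, counters[alloc[t]]
        if best is None:
            break
        alloc[best] += 1
    return alloc

# Example: thread 0 keeps benefiting from extra ways, thread 1 saturates quickly.
# partition([[90, 60, 30, 10], [80, 5, 1, 0]], 4)  ->  [3, 1]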

Page 13: CloudCache Expanding and Shrinking Private Caches

Step 3: Bank/token allocation

1. Local L2 cache first
2. Threads w/ larger capacity demand first
3. Closer L2 banks first

Example (reconstructed from the slide's numbers; a token of 1 is one whole bank):

  Total cap.   Cap. to allocate          Bank / token
  2.75         2.75 → 1.75 → 0.75 → 0    0 / 1, 2 / 1, 3 / 0.75
  1.25         1.25 → 0.25 → 0           1 / 1, 3 / 0.25

Repeat until every thread's capacity is placed.

Output: bank / token for each thread (an allocation sketch follows below)
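A sketch of the allocation rules under my own assumptions (the paper's exact token bookkeeping may differ): a token of 1.0 stands for one whole L2 bank, every thread claims its local bank first, and the rest of each thread's capacity is taken from the closest banks that still have free tokens, serving larger demands first. All names are illustrative.

def allocate_banks(demand, home, banks, hops):
    """demand: {thread: capacity in banks}; home: {thread: local bank};
    banks: bank ids; hops(a, b): hop distance between banks a and b."""
    free = {b: 1.0 for b in banks}        # remaining tokens per bank (1.0 = whole bank)
    remaining = dict(demand)
    alloc = {t: [] for t in demand}

    def take(t, b):
        got = min(remaining[t], free[b])
        if got > 0:
            free[b] -= got
            remaining[t] -= got
            alloc[t].append((b, got))

    by_demand = sorted(demand, key=demand.get, reverse=True)
    for t in by_demand:                   # rule 1: local L2 bank first
        take(t, home[t])
    for t in by_demand:                   # rule 2: larger capacity demand first
        for b in sorted(banks, key=lambda b: hops(home[t], b)):  # rule 3: closer banks first
            if remaining[t] <= 0:
                break
            take(t, b)
    return alloc

# With demand {0: 2.75, 1: 1.25}, home {0: 0, 1: 1}, and 2x2-mesh hop distances,
# this reproduces the slide's example:
#   thread 0 -> (0, 1.0), (2, 1.0), (3, 0.75);  thread 1 -> (1, 1.0), (3, 0.25)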

Page 14: CloudCache Expanding and Shrinking Private Caches

Step 4: Building cache chain links

Example (the thread with total cap. 2.75, bank / token = 0 / 1, 2 / 1, 3 / 0.75):

  Virtual L2 cache (MRU → LRU):  bank 0 → bank 2 → bank 3
  Hop distance:                  0         1         2

[Figure: the allocated banks linked by hop distance into one virtual L2 cache on the tile grid]
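A minimal sketch of the chain-link idea; block granularity, capacities, and all names here are my own assumptions. The banks from Step 3 are ordered by hop distance into one virtual L2 cache, a hit is promoted toward the nearest link, and each link's LRU victim spills one link farther from the core before finally leaving the cloud.

from collections import OrderedDict

class CacheChain:
    """One thread's virtual L2 cache: links ordered nearest (MRU end) to farthest (LRU end)."""

    def __init__(self, alloc, home, hops, blocks_per_token=4):
        # alloc: [(bank, token), ...] from Step 3; sort links by hop distance from home.
        self.links = [(bank, int(token * blocks_per_token), OrderedDict())
                      for bank, token in sorted(alloc, key=lambda bt: hops(home, bt[0]))]

    def access(self, tag):
        for i, (bank, cap, blocks) in enumerate(self.links):
            if tag in blocks:
                del blocks[tag]
                self._insert(0, tag)          # promote hits to the nearest link
                return ('hit', bank, i)       # i = position along the chain
        self._insert(0, tag)                  # miss: fill into the nearest link
        return ('miss', None, None)

    def _insert(self, i, tag):
        if i == len(self.links):
            return                            # pushed off the LRU end: leaves the cloud
        bank, cap, blocks = self.links[i]
        blocks[tag] = True
        blocks.move_to_end(tag, last=False)   # this link's MRU position
        if len(blocks) > cap:
            victim, _ = blocks.popitem()      # this link's LRU block
            self._insert(i + 1, victim)       # spill one link farther from the core

With this organization, frequently reused blocks migrate toward the local bank while cold blocks drift toward the farthest link of the chain.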

Page 15: CloudCache Expanding and Shrinking Private Caches

Purposes of proposed techniques

I. Cloud formation
   • Global capacity partitioning
   • Distance-aware cache chain links

II. Directory access latency minimization
   • Limited target broadcast


Page 17: CloudCache Expanding and Shrinking Private Caches

II. Limited target broadcast (LTB)

• Private data:
  – LTB w/o dir. lookup
  – Update directory

Page 18: CloudCache Expanding and Shrinking Private Caches

II. Limited target broadcast (LTB)

• Private data:
  – LTB w/o dir. lookup
  – Update directory
• Shared data:
  – Dir.-based coherence
• Private accesses >> shared accesses

Page 19: CloudCache Expanding and Shrinking Private Caches

Limited target broadcast protocol

• Data is fetched w/ LTB
• Race: before the directory is updated, another core may access the directory for the fetched data
  – Stale data in the directory
• Protocol is detailed in the paper (a simplified request-path sketch follows below)
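A minimal sketch of the LTB request path from slides 17-19, under my own assumptions: how a block is judged private vs. shared is abstracted into a likely_private predicate, and the race above (a directory access arriving before the update) is not modeled; the full protocol is in the paper. All names and the send hook are illustrative.

def l2_miss_request(addr, requester, chain_links, dir_home, likely_private, send):
    """chain_links: banks of this thread's cloud; dir_home(addr): directory home tile;
    send(dst, msg): hypothetical network-send hook."""
    if likely_private(addr):
        # LTB: probe only the few banks in this thread's cloud; no directory
        # lookup sits on the critical path.
        for bank in chain_links:
            send(bank, ('ltb_probe', addr, requester))
        # The directory is updated afterwards so later requesters still find the block.
        send(dir_home(addr), ('dir_update', addr, requester))
    else:
        # Shared data keeps the baseline directory-based MESI path.
        send(dir_home(addr), ('dir_request', addr, requester))

# Usage with trivial stand-ins:
# l2_miss_request(0x80, 0, [0, 2, 3], lambda a: a % 64,
#                 lambda a: True, lambda dst, msg: print(dst, msg))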

Page 20: CloudCache Expanding and Shrinking Private Caches

Experimental setup

• TPTS [Lee et al., SPE ‘10, Cho et al., ICPP ‘08]

– 64-core CMP with 8×8 2D mesh, 4 cycles/hop
– Core: Intel Atom-like two-issue in-order pipeline
– Directory-based MESI protocol
– Four independent DRAM controllers, four ports/controller
– DRAM with Samsung DDR3-1600 timing

• Workloads
  – SPEC2006 (10B cycles)
    • Classified high/medium/low by MPKI at varying cache capacities
  – PARSEC (simlarge input set)
    • 16 threads/application

Page 21: CloudCache Expanding and Shrinking Private Caches

Evaluated schemes

• Shared

• Private

• DSR [HPCA 2009]
  – Spiller and Receiver
• ECC [ISCA 2010]
  – Local partitioning/monitoring

• CloudCache

Page 22: CloudCache Expanding and Shrinking Private Caches

Impact of global partitioning

[Figure: high-, medium-, and low-MPKI thread panels comparing Shared, Private, DSR, ECC, and CloudCache]


Page 25: CloudCache Expanding and Shrinking Private Caches

Impact of global partitioning

[Figure: same comparison as on the previous slide]

CloudCache aggressively allocates capacity with global information.

Page 26: CloudCache Expanding and Shrinking Private Caches

L2 cache access latency (x-axis) vs. access count (y-axis), 401.bzip2

[Figure: per-scheme histograms of L2 access latency]

• Shared: widely spread latency
• Private: fast local access + off-chip access
• DSR & ECC: fast local access + widely spread latency
• CloudCache: fast local access + fast remote access

Page 27: CloudCache Expanding and Shrinking Private Caches

16 threads, throughput

[Figure: throughput of Shared, DSR, ECC, and CloudCache relative to Private (100%) for Comb, Light, Medium, and Heavy workload mixes and their average]


Page 31: CloudCache Expanding and Shrinking Private Caches

16 threads, beneficiaries

[Figure: number of beneficiaries, average speedup, and maximum speedup of beneficiary threads under Shared, DSR, ECC, and CloudCache]


Page 33: CloudCache Expanding and Shrinking Private Caches

16 threads, beneficiaries

[Figure: same beneficiary comparison as on the previous slide]

• Benefactors’ performance: <1% degradation
  – Please see the paper for graphs

Page 34: CloudCache Expanding and Shrinking Private Caches

32 / 64 threads, throughput

[Figure: throughput of Shared, DSR, ECC, and CloudCache relative to Private for Comb and Light mixes and their average, with 32 threads and with 64 threads]

Page 35: CloudCache Expanding and Shrinking Private Caches

Multithreaded workloads (PARSEC)

[Figure: speedup over Private of Shared, DSR, ECC, and CloudCache for five PARSEC combinations (Comb1-Comb5)]

Page 36: CloudCache Expanding and Shrinking Private Caches

Conclusion

• Unbounded shared capacity is EVIL

• CloudCache: private caches for threads
  – Capacity allocation with global partitioning
  – Cache chain links with nearby L2 banks
  – Limited target broadcast

• HW overhead is very small (~5KB).

Use CloudCache!

Page 37: CloudCache Expanding and Shrinking Private Caches

CloudCache: Expanding and Shrinking Private Caches

Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers