TRANSCRIPT
StimulusCache: Boosting Performance of Chip Multiprocessors
with Excess Cache
Hyunjin Lee Sangyeun Cho Bruce R. Childers
Dept. of Computer Science, University of Pittsburgh
Staggering processor chip yield
[Figure: probability distribution of the number of sound cores in an 8-core CMP; x-axis: # of sound cores (1-8), y-axis: probability (0%-35%)]
IBM Cell initial yield = 14%
Two sources of yield loss
• Physical defects
• Process variations
Smaller device sizes
• Critical defect size shrinks
• Process variations become more severe
As a result, yield is severely limited
Core disabling to rescue
Recent multicore processors employ "core disabling"
• Disable failed cores to salvage sound cores in a chip
• Significant yield improvement
• IBM Cell: 14% → 40% with core disabling of a single faulty core
Yet this approach unnecessarily disables many "good components"
• Ex: AMD Phenom X3 disables the L2 cache of faulty cores
But… is it economical?
Core disabling uneconomical
• Many unused L2 caches exist
• Problem exacerbated with many cores
[Figure: probability of each number of sound cores/L2 caches for the 8-core and 32-core CMPs, plotted separately for the L2 cache, the processing logic, and the whole core (L2 + processing logic)]
StimulusCache: basic idea
• Exploit the "excess cache" (EC) in a failed core
• Core disabling (yield ↑) + larger cache capacity (performance ↑)
Simple HW architecture extension
• Cache controller has knowledge about EC utilization
• L2 caches are chain-linked using vector tables
Modest OS support
• OS manages the hardware data structures in cache controllers to set up EC utilization policies
Sizable performance improvement
StimulusCache design issues
Questions
1: How to arrange ECs to give to cores?
2: Where to place data, in ECs or local L2?
3: What HW support is needed?
4: How to flexibly and dynamically allocate ECs?
[Diagram: 8-core CMP; cores 0-3 are failed and provide excess caches, cores 4-7 are the working cores]
Excess cache utilization policies
Two policy dimensions: temporal (static vs. dynamic allocation) and spatial (private vs. sharing)
• Static private: simple, but limited performance
• Dynamic sharing: complex, but maximized performance
Excess cache utilization policies
• Private: exclusive to a core → performance isolation, limited capacity usage
• Sharing: shared by multiple cores → performance interference, maximum capacity usage
Excess cache utilization policies
             Static            Dynamic
  Private    Static private    BAD (not evaluated)
  Sharing    Static sharing    Dynamic sharing
Static private policy
• Symmetric allocation: every working core gets the same number of ECs
• Asymmetric allocation: working cores get different numbers of ECs
[Diagram: four ECs assigned to the L2 caches of cores 0-3, one per core (symmetric) vs. unevenly (asymmetric)]
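A minimal sketch of the static private policy (not from the talk; the core counts, the ec_owner array, and the assign_private() helper are hypothetical): each working core receives a fixed, exclusive set of excess caches, either evenly (symmetric) or unevenly (asymmetric).

#include <stdio.h>

#define NUM_CORES 4   /* working cores                       */
#define NUM_ECS   4   /* excess caches taken from dead cores */

/* ec_owner[n] = working core that privately owns excess cache n (-1 = unused). */
int ec_owner[NUM_ECS];

/* Hand out ECs according to a fixed per-core quota; under the static private
 * policy this assignment does not change until the OS reprograms it. */
void assign_private(const int quota[NUM_CORES]) {
    int ec = 0;
    for (int c = 0; c < NUM_CORES; c++)
        for (int k = 0; k < quota[c] && ec < NUM_ECS; k++)
            ec_owner[ec++] = c;
    while (ec < NUM_ECS)
        ec_owner[ec++] = -1;
}

int main(void) {
    int symmetric[NUM_CORES]  = {1, 1, 1, 1};   /* every core gets one EC       */
    int asymmetric[NUM_CORES] = {2, 2, 0, 0};   /* favor the cache-hungry cores */

    assign_private(symmetric);
    for (int n = 0; n < NUM_ECS; n++) printf("symmetric:  EC %d -> core %d\n", n, ec_owner[n]);

    assign_private(asymmetric);
    for (int n = 0; n < NUM_ECS; n++) printf("asymmetric: EC %d -> core %d\n", n, ec_owner[n]);
    return 0;
}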
Static sharing policy
• All ECs are shared by all working cores
[Diagram: the L2 caches of cores 0-3 all use the shared chain EC #1 -> EC #2 -> main memory]
Dynamic sharing policy
• Flow-in#N: count of data blocks flowing into EC #N
• Hit#N: count of data blocks that hit in EC #N
[Diagram: cores 0-3 share the chain EC #1 -> EC #2 -> main memory; the controller tracks Flow-in and Hit counts per EC]
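A minimal sketch of these monitoring counters (only the Flow-in#N and Hit#N names come from the talk; the struct and helper names are assumptions): the controller bumps Flow-in when a block enters an EC and Hit when a lookup hits there, and the policy compares their ratio.

#include <stdint.h>

#define NUM_ECS 2

/* Per-EC monitoring counters kept by the cache controller. */
struct ec_counters {
    uint64_t flow_in;   /* Flow-in#N: blocks that flowed into EC #N */
    uint64_t hits;      /* Hit#N: requests that hit in EC #N        */
};

struct ec_counters ec_stats[NUM_ECS];

/* A block evicted from an upstream cache enters EC n. */
void record_flow_in(int n) { ec_stats[n].flow_in++; }

/* A lookup from a core hits in EC n. */
void record_hit(int n) { ec_stats[n].hits++; }

/* Hits per flowed-in block: the usefulness metric the sharing policy compares. */
double hit_per_flow_in(int n) {
    return ec_stats[n].flow_in ? (double)ec_stats[n].hits / (double)ec_stats[n].flow_in
                               : 0.0;
}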
Dynamic sharing policy
• Hit/Flow-in ratio high → allocate more ECs
• Hit/Flow-in ratio low → allocate fewer ECs
[Diagram: the same shared EC chain, with per-core Hit and Flow-in counts guiding allocation]
Dynamic sharing policy: allocation walk-through
• Core 0: needs at least 1 EC (hits in EC #1), no harmful effect on EC #2 → allocate 2 ECs
• Core 1: needs at least 2 ECs (hits in EC #2) → allocate 2 ECs
• Core 2: needs at least 1 EC (hits in EC #1), harmful effect on EC #2 → allocate 1 EC
• Core 3: no benefit from ECs → allocate 0 ECs
Resulting allocation (ECs per core 0-3): 2, 2, 1, 0
• Maximized capacity utilization
• Minimized capacity interference
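The walk-through can be condensed into a small decision sketch. This is an interpretation of the slides, not the exact hardware algorithm: a core keeps at least as many ECs as the deepest level it is hitting in, and extends further down the shared chain only when it would not pollute an EC it gets no hits from. The struct, the counter values, and the allocate_ecs() helper are assumptions.

#include <stdio.h>

#define NUM_CORES 4
#define NUM_ECS   2

/* Counters sampled per core and per EC over one monitoring interval. */
struct core_ec_stats {
    unsigned flow_in[NUM_ECS];   /* blocks this core pushed into EC #n */
    unsigned hits[NUM_ECS];      /* this core's hits in EC #n          */
};

/* How many ECs of the shared chain this core may use next interval:
 *  - keep at least as many ECs as the deepest one the core hits in;
 *  - extend further only while the next EC is either untouched by this
 *    core or still giving it hits (i.e., the core is not just polluting it). */
int allocate_ecs(const struct core_ec_stats *s) {
    int needed = 0;
    for (int n = 0; n < NUM_ECS; n++)
        if (s->hits[n] > 0)
            needed = n + 1;

    int granted = needed;
    while (granted < NUM_ECS &&
           (s->flow_in[granted] == 0 || s->hits[granted] > 0))
        granted++;
    return granted;
}

int main(void) {
    /* The four cores from the walk-through; counter values are made up. */
    struct core_ec_stats cores[NUM_CORES] = {
        { .flow_in = {10, 0}, .hits = {8, 0} },  /* core 0: useful EC #1, EC #2 untouched */
        { .flow_in = {10, 6}, .hits = {4, 3} },  /* core 1: useful hits down to EC #2     */
        { .flow_in = {10, 7}, .hits = {5, 0} },  /* core 2: pollutes EC #2                */
        { .flow_in = {10, 8}, .hits = {0, 0} },  /* core 3: no benefit from ECs           */
    };
    for (int c = 0; c < NUM_CORES; c++)
        printf("core %d -> %d EC(s)\n", c, allocate_ecs(&cores[c]));   /* 2, 2, 1, 0 */
    return 0;
}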
HW architecture: Vector table
Per-core vector table in each L2 cache controller (8-bit example for an 8-core CMP):
• ECAV (Excess Cache Allocation Vector): one bit per core, marking which excess caches this core may use (e.g., 1 0 1 1 0 0 0 0)
• NECP (Next Excess Cache Pointers): a valid bit plus the index of the core holding the next excess cache in the chain; an invalid entry points to main memory
• SCV (Shared Core Vector): one bit per core, marking which cores share this cache
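A sketch of how these three per-core structures might be declared; only the names ECAV, NECP, and SCV come from the talk, while the field widths and layout are illustrative assumptions.

#include <stdint.h>

#define NUM_CORES 8

/* One vector table per L2 cache controller. */
struct vector_table {
    uint8_t ecav;            /* ECAV: bit n set => this core may place and search
                                data in the excess cache at core n               */
    uint8_t scv;             /* SCV: bit n set => core n uses this cache, so
                                coherence actions must also be sent to it        */
    struct {
        uint8_t valid;       /* 1 => the next level is another excess cache      */
        uint8_t next_core;   /* core whose EC is the next level in the chain;
                                ignored (next level is main memory) when !valid  */
    } necp;                  /* NECP: next excess cache pointer                  */
};

struct vector_table vt[NUM_CORES];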
ECAV: Excess Cache Allocation Vector — data search support
• On a local L2 miss, the core searches the excess caches marked in its ECAV
• Example: core 6's ECAV = 1 0 1 1 0 0 0 0 → search the ECs at cores 0, 2, and 3
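A sketch of the data-search use of the ECAV (the probe callback and the function name are hypothetical): the controller can probe every excess cache whose bit is set, rather than walking the chain one hop at a time.

#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 8

/* Probe every excess cache named in the requesting core's ECAV. */
void search_excess_caches(uint8_t ecav, void (*probe)(int core)) {
    for (int n = 0; n < NUM_CORES; n++)
        if (ecav & (1u << n))
            probe(n);                     /* tag lookup at the EC of core n */
}

void probe_stub(int core) { printf("probe EC at core %d\n", core); }

int main(void) {
    /* Core 6's example ECAV from the talk: ECs at cores 0, 2, and 3. */
    uint8_t ecav_core6 = (1u << 0) | (1u << 2) | (1u << 3);
    search_excess_caches(ecav_core6, probe_stub);
    return 0;
}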
SCV: Shared Core Vector — cache coherency support
• Each cache's SCV marks the cores that use it, so coherence requests reach all sharers
• Example: SCV = 0 0 0 0 0 0 1 0 at the ECs of cores 0, 2, and 3 → core 6 uses those excess caches
NECP: Next Excess Cache Pointers — data promotion/eviction support
• Each NECP entry holds a valid bit and the index of the core whose EC is the next level; valid = 0 means the next level is main memory
• Example: core 6 → EC at core 3 → EC at core 2 → EC at core 0 → main memory
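A sketch of the chain that the NECP entries encode, using the Core 6 example (the print_eviction_path() helper is hypothetical): evicted blocks move down the chain toward main memory, and a block that hits deeper in the chain is promoted back toward the requesting core along the same path.

#include <stdio.h>

#define NUM_CORES 8

/* NECP entry: valid = 0 means the next level is main memory;
 * otherwise 'next' names the core whose excess cache comes next. */
struct necp { int valid; int next; };

struct necp necp_of[NUM_CORES];

/* Follow a core's eviction path: local L2 -> EC -> ... -> main memory. */
void print_eviction_path(int core) {
    printf("core %d L2", core);
    int level = core;
    while (necp_of[level].valid) {
        level = necp_of[level].next;
        printf(" -> EC at core %d", level);
    }
    printf(" -> main memory\n");
}

int main(void) {
    /* The Core 6 example from the talk: L2 -> EC@3 -> EC@2 -> EC@0 -> memory. */
    necp_of[6] = (struct necp){ 1, 3 };
    necp_of[3] = (struct necp){ 1, 2 };
    necp_of[2] = (struct necp){ 1, 0 };
    necp_of[0] = (struct necp){ 0, 0 };
    print_eviction_path(6);
    return 0;
}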
Software support
OS enforces an excess cache utilization policy before programming cache controllers
• Explicit decision by administrator
• Autonomous decision based on system monitoring
OS may program cache controllers
• At system boot-up time
• Before a program starts
• During program execution
OS may take workload characteristics into account before programming
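A sketch of the OS side, assuming the vector tables are exposed as memory-mapped registers; the register offsets, the ec_policy struct, and the mmio_write32() helper are hypothetical and not something the talk specifies.

#include <stdint.h>

#define NUM_CORES 8

/* Hypothetical MMIO register offsets inside each controller's register block. */
enum { REG_ECAV = 0x00, REG_SCV = 0x04, REG_NECP = 0x08 };

/* Platform-specific MMIO write; assumed to exist elsewhere in the kernel. */
extern void mmio_write32(int core, uint32_t reg, uint32_t val);

/* One utilization policy, expressed as the vector-table contents per core. */
struct ec_policy {
    uint32_t ecav[NUM_CORES];
    uint32_t scv[NUM_CORES];
    uint32_t necp[NUM_CORES];   /* packed {valid, next-core index} */
};

/* Applied at boot, before a program starts, or during execution, once the
 * administrator or a monitoring daemon has chosen the policy. */
void program_excess_caches(const struct ec_policy *p) {
    for (int c = 0; c < NUM_CORES; c++) {
        mmio_write32(c, REG_ECAV, p->ecav[c]);
        mmio_write32(c, REG_SCV,  p->scv[c]);
        mmio_write32(c, REG_NECP, p->necp[c]);
    }
}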
Experimental setup
• Intel Atom-like cores with a 16-stage pipeline @ 2 GHz
• Memory hierarchy: L1 32KB I/D, 1 cycle; L2 512KB, 10 cycles; main memory 300 cycles, contention modeled
• On-chip network: crossbar for the 8-core CMP (4 cores + 4 ECs), 2D mesh for the 32-core CMP (16 cores + 16 ECs), contention modeled
• Workloads: SPEC CPU2006, SPLASH-2, and SPECjbb2005
Static private – single thread
[Figure: performance improvement with 1-4 ECs under the static private policy for single-threaded SPEC CPU2006 workloads (INT and FP; light, medium, and heavy groups): h264ref, hmmer, astar, bzip2, mcf, gcc, gromacs, gamess, soplex, sphinx3, GemsFDTD, milc]
Static private – multithread
[Figure: performance improvement of static.private for multi-programmed workloads (HHHH, LLHH2, LLHH4, LLMM1, LLMM3, MMHH1, MMHH3, MMMM) and multi-threaded workloads (cholesky, lu); annotations mark the all-"H", with-"H", and without-"H" workload groups, noting more than 40% improvement]
Static sharing – multithread
[Figure: performance improvement of static.sharing vs. static.private for the same multi-programmed and multi-threaded workloads; annotations note significant improvement, but capacity interference for the multithreaded workloads]
Dynamic sharing – multithread
[Figure: performance improvement of dynamic.sharing vs. static.sharing and static.private for the same multi-programmed and multi-threaded workloads; annotations note additional improvement and that the capacity interference is avoided]
Dynamic sharing – individual workloads
[Figure: per-core additional performance improvement of dynamic sharing over static sharing for workloads LLLL through HHHH]
• Significant additional improvement over static sharing
• Minimal degradation relative to static sharing
• StimulusCache is always better than the baseline
Conclusions
Processing logic yield vs. L2 cache yield
• A large number of excess L2 caches is available
StimulusCache
• Core disabling (yield ↑) + larger cache capacity (performance ↑)
• Simple HW architecture extension + modest OS support
Three excess cache utilization policies
• Static private, static sharing, and dynamic sharing
Performance improvement of up to 135%