TRANSCRIPT
StimulusCache: Boosting Performance of Chip Multiprocessors
with Excess Cache
Hyunjin Lee Sangyeun Cho Bruce R. Childers
Dept. of Computer Science, University of Pittsburgh
Staggering processor chip yield
[Figure: probability distribution of the number of sound cores in an 8-core CMP; x-axis: # of sound cores (1-8), y-axis: probability (0%-35%)]
IBM Cell initial yield = 14%
Two sources of yield loss
• Physical defects
• Process variations
Smaller device sizes
• Critical defect size shrinks
• Process variations become more severe
As a result, yield is severely limited
Core disabling to rescue
Recent multicore processors employ "core disabling"
• Disable failed cores to salvage sound cores in a chip
• Significant yield improvement
• IBM Cell: 14% → 40% with core disabling of a single faulty core
Yet this approach unnecessarily disables many "good components"
• Ex: AMD Phenom X3 disables the L2 cache of faulty cores
But… is it economical?
Core disabling uneconomical
• Many unused L2 caches exist
• Problem exacerbated with many cores
[Figure: probability of each number of sound cores/L2 caches for the 8-core and 32-core CMPs, plotted separately for the L2 cache, the processing logic, and the whole core (L2 + processing logic)]
StimulusCache: basic idea
• Exploit the "excess cache" (EC) in a failed core
• Core disabling (yield ↑) + larger cache capacity (performance ↑)
Simple HW architecture extension
• Cache controller has knowledge about EC utilization
• L2 caches are chain-linked using vector tables
Modest OS support
• OS manages the hardware data structures in cache controllers to set up EC utilization policies
Sizable performance improvement
StimulusCache design issues
Questions
1: How to arrange ECs to give to cores?
2: Where to place data, in ECs or local L2?
3: What HW support is needed?
4: How to flexibly and dynamically allocate ECs?
[Diagram: 8-core CMP; cores 0-3 are failed and provide excess caches, cores 4-7 are the working cores]
Excess cache utilization policies
Two policy dimensions: temporal (static vs. dynamic allocation) and spatial (private vs. sharing)
• Static private: simple, but limited performance
• Dynamic sharing: complex, but maximized performance
Excess cache utilization policies
• Private: exclusive to a core → performance isolation, limited capacity usage
• Sharing: shared by multiple cores → performance interference, maximum capacity usage
Excess cache utilization policies
             Static            Dynamic
  Private    Static private    BAD (not evaluated)
  Sharing    Static sharing    Dynamic sharing
Static private policy
• Symmetric allocation: every working core gets the same number of ECs
• Asymmetric allocation: working cores get different numbers of ECs
[Diagram: four ECs assigned to the L2 caches of cores 0-3, one per core (symmetric) vs. unevenly (asymmetric)]
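A minimal sketch of the static private policy (not from the talk; the core counts, the ec_owner array, and the assign_private() helper are hypothetical): each working core receives a fixed, exclusive set of excess caches, either evenly (symmetric) or unevenly (asymmetric).

#include <stdio.h>

#define NUM_CORES 4   /* working cores                       */
#define NUM_ECS   4   /* excess caches taken from dead cores */

/* ec_owner[n] = working core that privately owns excess cache n (-1 = unused). */
int ec_owner[NUM_ECS];

/* Hand out ECs according to a fixed per-core quota; under the static private
 * policy this assignment does not change until the OS reprograms it. */
void assign_private(const int quota[NUM_CORES]) {
    int ec = 0;
    for (int c = 0; c < NUM_CORES; c++)
        for (int k = 0; k < quota[c] && ec < NUM_ECS; k++)
            ec_owner[ec++] = c;
    while (ec < NUM_ECS)
        ec_owner[ec++] = -1;
}

int main(void) {
    int symmetric[NUM_CORES]  = {1, 1, 1, 1};   /* every core gets one EC       */
    int asymmetric[NUM_CORES] = {2, 2, 0, 0};   /* favor the cache-hungry cores */

    assign_private(symmetric);
    for (int n = 0; n < NUM_ECS; n++) printf("symmetric:  EC %d -> core %d\n", n, ec_owner[n]);

    assign_private(asymmetric);
    for (int n = 0; n < NUM_ECS; n++) printf("asymmetric: EC %d -> core %d\n", n, ec_owner[n]);
    return 0;
}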
Static sharing policy
• All ECs are shared by all working cores
[Diagram: the L2 caches of cores 0-3 all use the shared chain EC #1 -> EC #2 -> main memory]
Dynamic sharing policy
• Flow-in#N: count of data blocks flowing into EC #N
• Hit#N: count of data blocks that hit in EC #N
[Diagram: cores 0-3 share the chain EC #1 -> EC #2 -> main memory; the controller tracks Flow-in and Hit counts per EC]
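A minimal sketch of these monitoring counters (only the Flow-in#N and Hit#N names come from the talk; the struct and helper names are assumptions): the controller bumps Flow-in when a block enters an EC and Hit when a lookup hits there, and the policy compares their ratio.

#include <stdint.h>

#define NUM_ECS 2

/* Per-EC monitoring counters kept by the cache controller. */
struct ec_counters {
    uint64_t flow_in;   /* Flow-in#N: blocks that flowed into EC #N */
    uint64_t hits;      /* Hit#N: requests that hit in EC #N        */
};

struct ec_counters ec_stats[NUM_ECS];

/* A block evicted from an upstream cache enters EC n. */
void record_flow_in(int n) { ec_stats[n].flow_in++; }

/* A lookup from a core hits in EC n. */
void record_hit(int n) { ec_stats[n].hits++; }

/* Hits per flowed-in block: the usefulness metric the sharing policy compares. */
double hit_per_flow_in(int n) {
    return ec_stats[n].flow_in ? (double)ec_stats[n].hits / (double)ec_stats[n].flow_in
                               : 0.0;
}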
Dynamic sharing policy
• Hit/Flow-in ratio high → allocate more ECs
• Hit/Flow-in ratio low → allocate fewer ECs
[Diagram: the same shared EC chain, with per-core Hit and Flow-in counts guiding allocation]
Dynamic sharing policy: allocation walk-through
• Core 0: needs at least 1 EC (hits in EC #1), no harmful effect on EC #2 → allocate 2 ECs
• Core 1: needs at least 2 ECs (hits in EC #2) → allocate 2 ECs
• Core 2: needs at least 1 EC (hits in EC #1), harmful effect on EC #2 → allocate 1 EC
• Core 3: no benefit from ECs → allocate 0 ECs
Resulting allocation (ECs per core 0-3): 2, 2, 1, 0
• Maximized capacity utilization
• Minimized capacity interference
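The walk-through can be condensed into a small decision sketch. This is an interpretation of the slides, not the exact hardware algorithm: a core keeps at least as many ECs as the deepest level it is hitting in, and extends further down the shared chain only when it would not pollute an EC it gets no hits from. The struct, the counter values, and the allocate_ecs() helper are assumptions.

#include <stdio.h>

#define NUM_CORES 4
#define NUM_ECS   2

/* Counters sampled per core and per EC over one monitoring interval. */
struct core_ec_stats {
    unsigned flow_in[NUM_ECS];   /* blocks this core pushed into EC #n */
    unsigned hits[NUM_ECS];      /* this core's hits in EC #n          */
};

/* How many ECs of the shared chain this core may use next interval:
 *  - keep at least as many ECs as the deepest one the core hits in;
 *  - extend further only while the next EC is either untouched by this
 *    core or still giving it hits (i.e., the core is not just polluting it). */
int allocate_ecs(const struct core_ec_stats *s) {
    int needed = 0;
    for (int n = 0; n < NUM_ECS; n++)
        if (s->hits[n] > 0)
            needed = n + 1;

    int granted = needed;
    while (granted < NUM_ECS &&
           (s->flow_in[granted] == 0 || s->hits[granted] > 0))
        granted++;
    return granted;
}

int main(void) {
    /* The four cores from the walk-through; counter values are made up. */
    struct core_ec_stats cores[NUM_CORES] = {
        { .flow_in = {10, 0}, .hits = {8, 0} },  /* core 0: useful EC #1, EC #2 untouched */
        { .flow_in = {10, 6}, .hits = {4, 3} },  /* core 1: useful hits down to EC #2     */
        { .flow_in = {10, 7}, .hits = {5, 0} },  /* core 2: pollutes EC #2                */
        { .flow_in = {10, 8}, .hits = {0, 0} },  /* core 3: no benefit from ECs           */
    };
    for (int c = 0; c < NUM_CORES; c++)
        printf("core %d -> %d EC(s)\n", c, allocate_ecs(&cores[c]));   /* 2, 2, 1, 0 */
    return 0;
}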
HW architecture: Vector table
Per-core vector table in each L2 cache controller (8-bit example for an 8-core CMP):
• ECAV (Excess Cache Allocation Vector): one bit per core, marking which excess caches this core may use (e.g., 1 0 1 1 0 0 0 0)
• NECP (Next Excess Cache Pointers): a valid bit plus the index of the core holding the next excess cache in the chain; an invalid entry points to main memory
• SCV (Shared Core Vector): one bit per core, marking which cores share this cache
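A sketch of how these three per-core structures might be declared; only the names ECAV, NECP, and SCV come from the talk, while the field widths and layout are illustrative assumptions.

#include <stdint.h>

#define NUM_CORES 8

/* One vector table per L2 cache controller. */
struct vector_table {
    uint8_t ecav;            /* ECAV: bit n set => this core may place and search
                                data in the excess cache at core n               */
    uint8_t scv;             /* SCV: bit n set => core n uses this cache, so
                                coherence actions must also be sent to it        */
    struct {
        uint8_t valid;       /* 1 => the next level is another excess cache      */
        uint8_t next_core;   /* core whose EC is the next level in the chain;
                                ignored (next level is main memory) when !valid  */
    } necp;                  /* NECP: next excess cache pointer                  */
};

struct vector_table vt[NUM_CORES];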
ECAV: Excess Cache Allocation Vector — data search support
• On a local L2 miss, the core searches the excess caches marked in its ECAV
• Example: core 6's ECAV = 1 0 1 1 0 0 0 0 → search the ECs at cores 0, 2, and 3
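A sketch of the data-search use of the ECAV (the probe callback and the function name are hypothetical): the controller can probe every excess cache whose bit is set, rather than walking the chain one hop at a time.

#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 8

/* Probe every excess cache named in the requesting core's ECAV. */
void search_excess_caches(uint8_t ecav, void (*probe)(int core)) {
    for (int n = 0; n < NUM_CORES; n++)
        if (ecav & (1u << n))
            probe(n);                     /* tag lookup at the EC of core n */
}

void probe_stub(int core) { printf("probe EC at core %d\n", core); }

int main(void) {
    /* Core 6's example ECAV from the talk: ECs at cores 0, 2, and 3. */
    uint8_t ecav_core6 = (1u << 0) | (1u << 2) | (1u << 3);
    search_excess_caches(ecav_core6, probe_stub);
    return 0;
}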
SCV: Shared Core Vector — cache coherency support
• Each cache's SCV marks the cores that use it, so coherence requests reach all sharers
• Example: SCV = 0 0 0 0 0 0 1 0 at the ECs of cores 0, 2, and 3 → core 6 uses those excess caches
NECP: Next Excess Cache Pointers — data promotion/eviction support
• Each NECP entry holds a valid bit and the index of the core whose EC is the next level; valid = 0 means the next level is main memory
• Example: core 6 → EC at core 3 → EC at core 2 → EC at core 0 → main memory
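A sketch of the chain that the NECP entries encode, using the Core 6 example (the print_eviction_path() helper is hypothetical): evicted blocks move down the chain toward main memory, and a block that hits deeper in the chain is promoted back toward the requesting core along the same path.

#include <stdio.h>

#define NUM_CORES 8

/* NECP entry: valid = 0 means the next level is main memory;
 * otherwise 'next' names the core whose excess cache comes next. */
struct necp { int valid; int next; };

struct necp necp_of[NUM_CORES];

/* Follow a core's eviction path: local L2 -> EC -> ... -> main memory. */
void print_eviction_path(int core) {
    printf("core %d L2", core);
    int level = core;
    while (necp_of[level].valid) {
        level = necp_of[level].next;
        printf(" -> EC at core %d", level);
    }
    printf(" -> main memory\n");
}

int main(void) {
    /* The Core 6 example from the talk: L2 -> EC@3 -> EC@2 -> EC@0 -> memory. */
    necp_of[6] = (struct necp){ 1, 3 };
    necp_of[3] = (struct necp){ 1, 2 };
    necp_of[2] = (struct necp){ 1, 0 };
    necp_of[0] = (struct necp){ 0, 0 };
    print_eviction_path(6);
    return 0;
}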
Software support
OS enforces an excess cache utilization policy before programming cache controllers
• Explicit decision by administrator
• Autonomous decision based on system monitoring
OS may program cache controllers
• At system boot-up time
• Before a program starts
• During program execution
OS may take workload characteristics into account before programming
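A sketch of the OS side, assuming the vector tables are exposed as memory-mapped registers; the register offsets, the ec_policy struct, and the mmio_write32() helper are hypothetical and not something the talk specifies.

#include <stdint.h>

#define NUM_CORES 8

/* Hypothetical MMIO register offsets inside each controller's register block. */
enum { REG_ECAV = 0x00, REG_SCV = 0x04, REG_NECP = 0x08 };

/* Platform-specific MMIO write; assumed to exist elsewhere in the kernel. */
extern void mmio_write32(int core, uint32_t reg, uint32_t val);

/* One utilization policy, expressed as the vector-table contents per core. */
struct ec_policy {
    uint32_t ecav[NUM_CORES];
    uint32_t scv[NUM_CORES];
    uint32_t necp[NUM_CORES];   /* packed {valid, next-core index} */
};

/* Applied at boot, before a program starts, or during execution, once the
 * administrator or a monitoring daemon has chosen the policy. */
void program_excess_caches(const struct ec_policy *p) {
    for (int c = 0; c < NUM_CORES; c++) {
        mmio_write32(c, REG_ECAV, p->ecav[c]);
        mmio_write32(c, REG_SCV,  p->scv[c]);
        mmio_write32(c, REG_NECP, p->necp[c]);
    }
}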
Experimental setup
• Intel Atom-like cores with a 16-stage pipeline @ 2 GHz
• Memory hierarchy: L1 32KB I/D, 1 cycle; L2 512KB, 10 cycles; main memory 300 cycles, contention modeled
• On-chip network: crossbar for the 8-core CMP (4 cores + 4 ECs), 2D mesh for the 32-core CMP (16 cores + 16 ECs), contention modeled
• Workloads: SPEC CPU2006, SPLASH-2, and SPECjbb2005
Static private – single thread
[Figure: performance improvement with 1-4 ECs under the static private policy for single-threaded SPEC CPU2006 workloads (INT and FP; light, medium, and heavy groups): h264ref, hmmer, astar, bzip2, mcf, gcc, gromacs, gamess, soplex, sphinx3, GemsFDTD, milc]
Static private – multithread
[Figure: performance improvement of static.private for multi-programmed workloads (HHHH, LLHH2, LLHH4, LLMM1, LLMM3, MMHH1, MMHH3, MMMM) and multi-threaded workloads (cholesky, lu); annotations mark the all-"H", with-"H", and without-"H" workload groups, noting more than 40% improvement]
Static sharing – multithread
[Figure: performance improvement of static.sharing vs. static.private for the same multi-programmed and multi-threaded workloads; annotations note significant improvement, but capacity interference for the multithreaded workloads]
Dynamic sharing – multithread
[Figure: performance improvement of dynamic.sharing vs. static.sharing and static.private for the same multi-programmed and multi-threaded workloads; annotations note additional improvement and that the capacity interference is avoided]
Dynamic sharing – individual workloads
[Figure: per-core additional performance improvement of dynamic sharing over static sharing for workloads LLLL through HHHH]
• Significant additional improvement over static sharing
• Minimal degradation relative to static sharing
• StimulusCache is always better than the baseline
Conclusions
Processing logic yield vs. L2 cache yield
• A large number of excess L2 caches is available
StimulusCache
• Core disabling (yield ↑) + larger cache capacity (performance ↑)
• Simple HW architecture extension + modest OS support
Three excess cache utilization policies
• Static private, static sharing, and dynamic sharing
Performance improvement of up to 135%