University of Michigan, Advanced Computer Architecture Laboratory
StageWeb: Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric
Shantanu Gupta Amin Ansari Shuguang Feng Scott Mahlke
University of Michigan - Ann Arbor
June 29, 2010
Reliability Threats
• Transient faults due to cosmic rays and alpha particles (increase exponentially with the number of devices on chip)
• Silicon defects (manufacturing defects and device wear-out)
► Negative bias temperature instability (NBTI)
► Oxide breakdown
► Electromigration
• Process variation (random and systematic variations)
► Intra-die ILD thickness
► Speed binning on a die
[Figure: transistor cross-section (source, gate, drain, N+ regions, P substrate, oxide) and a 3x3 grid of cores (C) with a frequency scale]
Fault Tolerance Aspects
• Detect and diagnose: has anything gone wrong? Figure out the cause.
• Reconfigure: isolate the broken components.
• Recover: resume execution from a safe point.
Reconfiguring a Multi-core
• At the coarsest level, cores can be disabled.
• Rumors that industry already uses this
► IBM Cell w/ 7 SPEs, AMD Tri-Core
• Can’t scale to higher failure rates!
[Figure: grid of cores (C) shown at Year 1, Year 3, Year 5, and Year 7]
Reconfiguration Granularity
Moving from CORE level to STAGE level to MODULE level gives better resource utilization; moving the other way gives lower complexity.
For 100% area overhead (redundancy):
• CORE level: 100% MTTF ↑
► -- Poor MTTF gains
► + Easy to implement
► ElastIC, DT’06; Reunion, MICRO’06; Configurable Isolation, ISCA’07
• STAGE level: 170% MTTF ↑
► + Good MTTF gains
► + Circuit / architectural boundary
► + Full coverage
• MODULE level: 200% MTTF ↑
► + Best MTTF gains
► -- Complex implementation
► Online Diagnosis of Hard Faults, MICRO’05; Ultra Low-Cost Defect Protection, ASPLOS’06
[Figure: pipeline stages FETCH, DEC, EXEC, MEM, WB]
CMP Fabric
[Figure: four cores (Core 0 to Core 3), each containing Stage 1, Stage 2, Stage 3, ..., Stage N; an inset shows the stages of one pipeline separated by latches]
The StageNet (SN) Fabric
• StageNet Slice (SNS): Stage 1, Stage 2, Stage 3, ..., Stage N
• Crossbar switch (inputs, outputs)
• Configuration manager
• Wearout sensors: delay, temperature, current
[Figure: four StageNet slices with the labeled crossbar switch, configuration manager, and wearout sensors]
A 4-Slice SN chip
[Figure: four slices, each with Fetch, Decode, Issue, and Ex/Mem stages, plus a configuration manager]
Performance Comparison: Pipeline vs. SN Slice
• > 5X slowdown for the SN slice, due to:
1. Control stall
2. Data forwarding
3. Transmission delays
[Figure: commit-time lines for a 10-instruction sequence (with a branch and a register dependency) on the 5-stage pipeline and on the SN slice; block diagrams of the pipeline (Fetch, Decode, Issue, Ex/Mem, WB with latches, GenPC, branch predictor, register file, branch resolution, bypass, register wb) and of the SN slice (the same stages joined through double buffers)]
SN Slice Microarchitecture [MICRO’08]
1. Control Handling: Stream ID
• Control flow handling
• Eliminates flush signals
2. Data Forwarding: Bypass $
• Stores previous results
• Fully associative structure
• Emulates data forwarding
3. Transmission Delays: Macro-Ops
• Send instruction bundles
• Amortizes transfer delay
• Increases system utilization
[Figure: SN slice block diagram (Fetch with GenPC and branch predictor, Decode, Issue with register file, Ex/Mem, joined through double buffers) extended with stream IDs (SID), a macro-op generator, and a Bypass $; an example instruction sequence (>>, ST, LD, +, /, &, <<) grouped into macro-ops]
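The Bypass $ idea (a small, fully associative store of previous results that emulates data forwarding) can be sketched as follows. This is a minimal illustration, not the MICRO’08 design: the class name `BypassCache`, the capacity of 4, the oldest-first eviction policy, and the helper `execute` are all assumptions made for the example.

```python
from collections import OrderedDict


class BypassCache:
    """Small fully associative store of recent results, keyed by register id.

    Capacity and oldest-first eviction are assumptions of this sketch.
    """

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # register id -> most recent result

    def record(self, reg, value):
        # A newer write to the same register replaces the older entry.
        self.entries.pop(reg, None)
        self.entries[reg] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest result

    def lookup(self, reg, regfile_value):
        # Prefer a recent result held in the cache; otherwise fall back to
        # the (possibly stale) value that was read from the register file.
        return self.entries.get(reg, regfile_value)


def execute(cache, dst, src_a, src_b, regfile):
    """Run one add instruction in the execute stage using the bypass cache."""
    a = cache.lookup(src_a, regfile[src_a])
    b = cache.lookup(src_b, regfile[src_b])
    result = a + b
    cache.record(dst, result)
    return result
```

In this sketch a dependent instruction sees the producer's result even though the register file has not been updated yet, which is the effect that forwarding paths provide in a conventional pipeline.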
SN Slice Performance [MICRO’08]
[Figure: bar chart of normalized runtime (y-axis, 0 to 6) for benchmarks 3des, g721decode, g721encode, idct, rawcaudio, rawdaudio, rijndael, mcf, eqn, grep, wc, and the mean. Series: SNS + StreamID; SNS + StreamID + Bypass$; SNS + StreamID + Bypass$ + MOPs. Annotation: 10% slowdown]
SN System - scaling to 100+ cores?
[Figure: an array of slices, each with F, D, I, and E/M stages]
1. Crossbars don’t scale well due to wiring / layout complexity
► Area
► Delay
► Power
2. Interconnection prone to failures
► Single point of failure
► Links have no redundancy
StageWeb: Scaling to 100+ cores
• In a large many-core system, small groups of cores can form an SN
• What’s the right size for an SN island?
[Figure: a traditional many-core (cores with F, D, I, E/M stages and L2 $ banks) beside a StageWeb many-core (a 4x4 grid of SN islands)]
StageWeb: Scaling to 100+ cores
• In a large many-core system, small groups of cores can form an SN
• What’s the right size for an SN island?
• Unfortunately, a single crossbar can’t scale to 8-10 pipelines!
[Figure: plot with regions labeled "Good scaling" and "Poor scaling"]
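As a rough illustration of why one large crossbar scales poorly, the sketch below counts crosspoints under the textbook assumption that an N-input, N-output crossbar needs N x N crosspoints. The function names and the island sizes are hypothetical, and real area, delay, and power also depend on wiring and layout, which this model ignores.

```python
def crosspoints(num_pipelines):
    """Crosspoints in one full crossbar that joins num_pipelines stage
    outputs to num_pipelines stage inputs (simple N x N model)."""
    return num_pipelines * num_pipelines


def island_crosspoints(total_pipelines, island_size):
    """Total crosspoints at one stage boundary when the pipelines are split
    into equal islands, each with its own local crossbar."""
    if total_pipelines % island_size != 0:
        raise ValueError("island_size must divide total_pipelines")
    islands = total_pipelines // island_size
    return islands * crosspoints(island_size)
```

Under this model, one crossbar spanning 64 pipelines needs 4096 crosspoints per stage boundary, while sixteen 4-pipeline islands need 256 in total.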
Interconnection Alternatives
1. Connectivity
a) Single
b) Single + Front-Back
c) Overlap
d) Overlap + Front-Back
[Figure: eight slices (Fetch, Decode, Issue, Ex/Mem) grouped into Islands 1 to 4, with front-end and back-end crossbars]
Interconnection Alternatives
1. Connectivity
a) Single
b) Single + Front-Back
c) Overlap
d) Overlap + Front-Back
2. Reliability
a) Crossbar
b) Crossbar with spares
c) Fault-tolerant crossbar
[Figure: eight slices (Fetch, Decode, Issue, Ex/Mem) grouped into Islands 1 to 4; input/output diagrams of the three crossbar types]
Interconnection Configuration
• Faults in stages, crossbar ports, or links force a reconfiguration
[Figure: four slices (Fetch, Decode, Issue, Ex/Mem) in Island 1 and Island 2]
Interconnection Configuration
• Single crossbar configuration
► Local to every island
[Figure: four slices (Fetch, Decode, Issue, Ex/Mem) in Island 1 and Island 2, with pipelines re-formed inside each island]
Interconnection Configuration
• Overlap crossbar configuration
► Sweep islands, forming pipelines opportunistically
[Figure: four slices (Fetch, Decode, Issue, Ex/Mem) covered by overlapping Islands 1, 2, and 3]
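The two configuration policies (single crossbars local to every island, and overlapping crossbars swept island by island) can be sketched as below. This is an illustrative model only: the `health` data layout, the stage names, the greedy lowest-index choice, and the `step` parameter that sets how far islands overlap are assumptions of the example, and interconnect faults are not modeled.

```python
STAGES = ("fetch", "decode", "issue", "exmem")


def pipelines_single(health, island_size):
    """Count logical pipelines when each island has its own local crossbars.

    health[i][s] is True when stage s of slice i still works.
    """
    total = 0
    for start in range(0, len(health), island_size):
        island = health[start:start + island_size]
        # Inside an island any working stage can feed any other, so the
        # scarcest stage type limits the number of pipelines.
        total += min(sum(sl[s] for sl in island) for s in STAGES)
    return total


def pipelines_overlap(health, island_size, step):
    """Greedy sweep over overlapping islands (windows advance by `step`)."""
    used = [{s: False for s in STAGES} for _ in health]
    total = 0
    last_start = max(len(health) - island_size, 0)
    for start in range(0, last_start + 1, step):
        window = range(start, min(start + island_size, len(health)))
        while True:
            # Pick one unused working stage of every type inside the window.
            pick = {}
            for s in STAGES:
                for i in window:
                    if health[i][s] and not used[i][s]:
                        pick[s] = i
                        break
            if len(pick) < len(STAGES):
                break  # this island cannot form another pipeline
            for s, i in pick.items():
                used[i][s] = True
            total += 1
    return total
```

With two slices that have lost their fetch stage next to two slices that have lost their decode stage, disjoint 2-slice islands form no pipeline in this model, while the overlapping sweep salvages one from the middle island.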
StageWeb Benefits
1. Scalability
► Scaling SN to benefit 100+ core systems
2. Interconnection Reliability
► Handling faults in crossbars and links
3. Process Variation
► Slower components can be isolated in a multi-core chip
Mitigating Process Variation
• Severe process variation and lifetime wearout can result in a disparity of health for various resources
• StageNet can effectively isolate strong/weak resources
[Figure: four slices (Fetch, Decode, Issue, Ex/Mem) whose stages are marked Fast, Medium, or Slow on a frequency scale, then regrouped into pipelines]
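The benefit of isolating strong and weak resources can be illustrated with a small calculation. The sketch assumes that a pipeline runs at the frequency of its slowest stage and that any stage can be connected to any other stage of the neighboring type; the function names and the sample frequencies are hypothetical.

```python
def pipeline_frequencies_fixed(stage_freqs):
    """Each core uses its own stages, so its slowest stage sets its clock.

    stage_freqs[i] is a list of per-stage maximum frequencies for core i.
    """
    return sorted((min(core) for core in stage_freqs), reverse=True)


def pipeline_frequencies_regrouped(stage_freqs):
    """Regroup stages by speed: the k-th fastest stage of every type is
    combined into the k-th logical pipeline."""
    num_stages = len(stage_freqs[0])
    columns = [
        sorted((core[s] for core in stage_freqs), reverse=True)
        for s in range(num_stages)
    ]
    return [min(col[k] for col in columns) for k in range(len(stage_freqs))]


def cores_meeting(freqs, target):
    """Number of pipelines that can run at or above the target frequency."""
    return sum(f >= target for f in freqs)
```

For two cores with stage frequencies `[1.0, 0.8, 1.0, 0.9]` and `[0.8, 1.0, 0.9, 1.0]`, neither fixed core reaches a 0.9 target, while regrouping yields one pipeline built entirely from stages running at 1.0.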
Evaluation
• OpenRISC 1200 cores (4-stage in-order)
• 12 configurations compared, 64 cores each
► Interconnections: Single, Single + Front/Back, Overlapping, Overlapping + Front/Back
► Crossbar types: w/o spares, w/ spares, fault-tolerant
• Experiments
► Lifetime evaluations: throughput and total work
► Process variation: speed binning on a die
Lifetime Reliability Evaluations
• Monte Carlo simulation with 300+ lifetime experiments
• Each lifetime experiment involves:
► Assigning a time-to-failure to all stages
► Killing components at their failure times
► Reconfiguring the system to isolate broken components
► Repeating this until no logical pipeline can be formed
• Cumulative work and throughput are recorded
► Number of cores: 64
► Technology node: 90 nm
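The lifetime experiment loop described above can be sketched as a short Monte Carlo simulation. This is a simplified model under stated assumptions, not the authors' simulator: times-to-failure are drawn from an exponential distribution, every live pipeline contributes one unit of throughput, all stages sit in one fully connected island, and interconnect faults are ignored. The names `lifetime_work`, `mean_work`, and `share_stages` are hypothetical.

```python
import random

NUM_STAGE_TYPES = 4  # fetch, decode, issue, ex/mem


def lifetime_work(num_cores, mean_ttf, share_stages, rng):
    """Simulate one chip lifetime and return its cumulative work.

    Work is the integral of the number of live pipelines over time.
    """
    # Assign a time-to-failure to every stage (exponential distribution
    # is an assumption of this sketch).
    events = sorted(
        (rng.expovariate(1.0 / mean_ttf), core, stage)
        for core in range(num_cores)
        for stage in range(NUM_STAGE_TYPES)
    )
    alive = [[True] * NUM_STAGE_TYPES for _ in range(num_cores)]

    def live_pipelines():
        if share_stages:
            # Stage-level reconfiguration: the scarcest stage type limits
            # how many logical pipelines can be formed.
            return min(
                sum(alive[c][s] for c in range(num_cores))
                for s in range(NUM_STAGE_TYPES)
            )
        # Core-level disabling: a core works only if all its stages work.
        return sum(all(core) for core in alive)

    work, now = 0.0, 0.0
    for fail_time, core, stage in events:
        pipelines = live_pipelines()
        if pipelines == 0:
            break  # stop once no logical pipeline can be formed
        work += pipelines * (fail_time - now)  # work done since last event
        now = fail_time
        alive[core][stage] = False  # kill the component, then reconfigure
    return work


def mean_work(trials, **kwargs):
    """Average cumulative work over many Monte Carlo lifetimes."""
    return sum(lifetime_work(**kwargs) for _ in range(trials)) / trials
```

Calling `mean_work(300, num_cores=64, mean_ttf=10.0, share_stages=True, rng=random.Random(0))` and the same call with `share_stages=False` gives a stage-sharing versus core-disabling comparison in this toy model; the `mean_ttf` value is arbitrary.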
Cumulative Work
[Figure: bar chart of normalized cumulative work (y-axis, 0.8 to 1.8) for three crossbar types (Xbar w/o spare, Xbar w/ spare, Fault-Tolerant Xbar). Series: Single Xbar, Single + F/B Xbar, Overlap Xbar, Overlap + F/B Xbar. Annotation: ~70% more work!]
Cumulative Work (area neutral)
[Figure: bar chart of normalized cumulative work (y-axis, 0.8 to 1.5) for three crossbar types (Xbar w/o spare, Xbar w/ spare, Fault-Tolerant Xbar). Series: Single Xbar, Single + F/B Xbar, Overlap Xbar, Overlap + F/B Xbar. Annotation: 52 cores]
Best StageWeb Configuration
• Overlapping interconnection network
• 52 cores
• 6 adjacent slices connected by each crossbar
• Fault-tolerant crossbars
Throughput over time
[Figure: line chart of peak throughput (IPC, y-axis, 0 to 60) versus time in years (x-axis, 0 to 29.9) for CMP, StageWeb, and StageWeb (area neutral)]
Mitigating Process Variation
[Figure: histogram of number of cores (y-axis, 0 to 16) versus normalized frequency (x-axis, 0.73 to 1.00) for Traditional CMP and StageWeb CMP; annotated values: 27 and 45]
For a given frequency target, StageWeb can operate:
1. More cores, OR
2. Same # of cores at lower voltage
Conclusions
• Architectural innovations will be crucial in tackling technological uncertainties
• StageWeb is a potential solution
► Allows fine-grained isolation of failures
► Most reliability gains from grouping 8-10 pipelines
► Scalable to 100+ cores
• StageWeb can also mitigate process variation by grouping together faster and slower parts
Thank You
http://cccp.eecs.umich.edu
Back up slides
Impact of Defects on CMP Yield
Overlapping Network
Simple + 2nd Level Crossbars
Overlapping + 2nd Level Crossbar
Interconnection Alternatives
1. Connectivity
a) Simple
b) Simple + Front-Back
c) Overlap
d) Overlap + Front-Back
2. Reliability
a) Crossbar
b) Crossbar with spares
c) Fault-tolerant crossbar
[Figure: four slices (Fetch, Decode, Issue, Ex/Mem) in Island 1 and Island 2, with front-end and back-end crossbars; input/output diagrams of the three crossbar types]
SN System Level Issues
[Figure: an array of slices, each with F, D, I, and E/M stages]
1. Crossbars don’t scale well due to wiring / layout complexity
► Area
► Delay
► Power
2. Interconnection prone to failures
► Single point of failure
► Links have no redundancy