Transient Fault Detection via Simultaneous Multithreading
Shubhendu S. Mukherjee
VSSAD, Alpha Technology, Compaq Computer Corporation
Shrewsbury, Massachusetts
Steven K. Reinhardt
Electrical Engineering & Computer Sciences, University of Michigan
Ann Arbor, Michigan
27th Annual International Symposium on Computer Architecture (ISCA), 2000
Slide 2
Transient Faults
• Faults that persist for a “short” duration
• Cause: cosmic rays (e.g., neutrons)
• Effect: knock off electrons, discharge capacitors
• Solution: no practical absorbent for cosmic rays
– 1 fault per 1000 computers per year (estimated fault rate)
• Future is worse: smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins
Slide 3
Fault Detection in Compaq Himalaya System
[Figure: two cycle-by-cycle lockstepped microprocessors, each executing R1 (R2); inputs are replicated to both and outputs compared. Memory covered by ECC, RAID array covered by parity, Servernet covered by CRC.]
Replicated Microprocessors + Cycle-by-Cycle Lockstepping
Slide 4
Fault Detection via Simultaneous Multithreading
[Figure: the same Himalaya-style system, but with the two replicated microprocessors replaced by two threads. Memory covered by ECC, RAID array covered by parity, Servernet covered by CRC.]
Replicated Microprocessors + Cycle-by-Cycle Lockstepping → Threads?
Slide 5
Simultaneous Multithreading (SMT)
[Figure: instructions from Thread1 and Thread2 enter a shared instruction scheduler, which issues them to shared functional units.]
Example: Alpha 21464
Slide 6
Simultaneous & Redundantly Threaded Processor (SRT)
+ Less hardware compared to replicated microprocessors
– SMT needs ~5% more hardware over a uniprocessor
– SRT adds very little hardware overhead to existing SMT
+ Better performance than complete replication
– better use of resources
+ Lower cost
– avoids complete replication
– market volume of SMT & SRT
SRT = SMT + Fault Detection
Slide 7
SRT Design Challenges
• Lockstepping doesn’t work: SMT may issue the same instruction from redundant threads in different cycles
• Must carefully fetch/schedule instructions from redundant threads
– branch misprediction
– cache miss
Disclaimer: This talk focuses only on fault detection, not recovery
Slide 8
Contributions & Outline
• Sphere of Replication (SoR)
• Output comparison for SRT
• Input replication for SRT
• Performance optimizations for SRT
• SRT outperforms on-chip replicated microprocessors
• Related work
• Summary
Slide 9
Sphere of Replication (SoR)
[Figure: two execution copies inside the sphere of replication; input replication and output comparison sit at the boundary between the sphere and the rest of the system.]
• Logical boundary of redundant execution within a system
• Trade-off between information, time, & space redundancy
Slide 10
Example Spheres of Replication
[Figure, left (Compaq Himalaya): the sphere of replication encloses two microprocessors, with input replication and output comparison at its boundary; memory covered by ECC, RAID array covered by parity, Servernet covered by CRC.]
[Figure, right (ORH-Dual: On-Chip Replicated Hardware, similar to IBM G5): the sphere encloses two pipelines, with input replication and output comparison at its boundary; instruction cache and data cache covered by ECC.]
Slide 11
Sphere of Replication for SRT
[Figure: SMT pipeline showing fetch PC, instruction cache, decode, register rename, integer and FP register files, integer/FP/load-store units, RUU, and data cache, shared by Thread 0 and Thread 1, each executing instructions such as R1 (R2), R3 = R1 + R7, and R8 = R7 * 2.]
• Excludes instruction and data caches
• Alternate SoRs possible (e.g., exclude register file) … not in this talk
Slide 12
Output Comparison in SRT
[Figure: sphere of replication with two execution copies; output comparison at the boundary to the rest of the system.]
• Compare & validate output before sending it outside the SoR
Slide 13
Output Comparison
• <address, data> for stores from redundant threads: compare & validate at commit time
• <address> for uncached loads from redundant threads
• <address> for cached loads from redundant threads: not required
• other output comparison based on the boundary of an SoR
[Figure: a store queue holding stores from both redundant threads; matching entries pass through output comparison on their way to the data cache.]
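The commit-time store comparison above can be sketched in a few lines of Python. This is an illustrative model only, not the paper’s hardware design: class and method names are invented, and the real store queue matches entries by thread and program order in hardware rather than with software queues.

```python
from collections import deque

class StoreComparator:
    """Toy model of SRT output comparison for stores. Each redundant
    thread enqueues its committed stores; a store leaves the sphere of
    replication only after both copies agree on <address, data>."""

    def __init__(self):
        self.pending = [deque(), deque()]  # per-thread committed stores
        self.to_cache = []                 # stores validated and released

    def commit_store(self, thread_id, address, data):
        self.pending[thread_id].append((address, data))
        # When both threads have a committed store at the queue head, compare.
        while self.pending[0] and self.pending[1]:
            s0 = self.pending[0].popleft()
            s1 = self.pending[1].popleft()
            if s0 != s1:
                raise RuntimeError(f"transient fault detected: {s0} != {s1}")
            self.to_cache.append(s0)  # validated: may exit the SoR

cmp = StoreComparator()
cmp.commit_store(0, 0x1000, 42)   # leading thread commits first
cmp.commit_store(1, 0x1000, 42)   # trailing thread's copy matches
assert cmp.to_cache == [(0x1000, 42)]
```

A mismatch between the two copies raises an error in this sketch, standing in for the fault-detection signal a real SRT core would assert.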
Slide 14
Input Replication in SRT
[Figure: sphere of replication with two execution copies; input replication at the boundary to the rest of the system.]
• Replicate & deliver the same input (coming from outside the SoR) to redundant copies
Slide 15
Input Replication
• Cached load data
– pair loads from redundant threads: too slow
– allow both loads to probe the cache: false faults with I/O or multiprocessors
– Load Value Queue (LVQ): pre-designated leading & trailing threads
[Figure: the leading thread’s load (R1 (R2)) probes the cache and deposits its value in the LVQ; the trailing thread’s corresponding load reads the value from the LVQ instead of the cache.]
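The LVQ idea can be sketched as follows. This is a hedged illustration, not the paper’s microarchitecture: names are invented, and a dictionary stands in for the data cache. The point it demonstrates is why the LVQ avoids false faults: even if memory changes between the two threads’ loads (I/O or another processor), both threads observe the same value.

```python
from collections import deque

class LoadValueQueue:
    """Toy LVQ: the leading thread's loads probe the cache and record
    <address, value>; the trailing thread consumes entries in program
    order, checking only that the addresses match."""

    def __init__(self, cache):
        self.cache = cache
        self.lvq = deque()

    def leading_load(self, address):
        value = self.cache[address]       # only the leading thread probes the cache
        self.lvq.append((address, value))
        return value

    def trailing_load(self, address):
        addr, value = self.lvq.popleft()  # in-order consumption from the LVQ
        if addr != address:
            raise RuntimeError("address mismatch: possible transient fault")
        return value                      # same value => inputs replicated exactly

cache = {0x2000: 7}
lvq = LoadValueQueue(cache)
v0 = lvq.leading_load(0x2000)
cache[0x2000] = 9                 # memory changes (I/O or another processor)...
v1 = lvq.trailing_load(0x2000)    # ...yet both threads still see the same value
assert v0 == v1 == 7
```

Had the trailing thread probed the cache directly instead, it would have read 9 and the output comparison would later have flagged a fault that never happened.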
Slide 16
Input Replication (contd.)
• Cached load data, alternate solution: Active Load Address Buffer
• Special cases
– cycle- or time-sensitive instructions
– external interrupts
Slide 17
Outline
• Sphere of Replication (SoR)
• Output comparison for SRT
• Input replication for SRT
• Performance optimizations for SRT
• SRT outperforms on-chip replicated microprocessors
• Related work
• Summary
Slide 18
Performance Optimizations
• Slack fetch
– maintain a constant slack of instructions between leading and trailing threads
+ leading thread prefetches cache misses
+ leading thread prefetches correct branch outcomes
• Branch Outcome Queue
– feed branch outcomes from the leading to the trailing thread
• Combine the above two
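The two optimizations above can be sketched together in a small fetch-policy model. All names and parameters here are made up for illustration (the slack target, fetch width, and queue shape are not the paper’s values); it only shows the control idea: fetch for the leading thread until the slack target is reached, and let the trailing thread consume resolved branch outcomes instead of predicting.

```python
from collections import deque

class SlackFetcher:
    """Toy model of slack fetch plus a branch outcome queue."""

    def __init__(self, target_slack=256):
        self.target_slack = target_slack
        self.fetched = [0, 0]            # instructions fetched per thread
        self.branch_outcomes = deque()   # leading -> trailing thread

    def choose_thread(self):
        slack = self.fetched[0] - self.fetched[1]
        # Prefer the leading thread until the slack target is reached,
        # then give the trailing thread a fetch slot to hold the slack.
        return 0 if slack < self.target_slack else 1

    def fetch_cycle(self, width=8):
        t = self.choose_thread()
        self.fetched[t] += width
        return t

    def resolve_branch(self, taken, target):
        self.branch_outcomes.append((taken, target))  # leading thread resolves

    def trailing_predict(self):
        # The trailing thread's "prediction" is the resolved outcome,
        # so it never suffers a misprediction.
        return self.branch_outcomes.popleft()

f = SlackFetcher(target_slack=16)
threads = [f.fetch_cycle() for _ in range(4)]  # leading runs ahead first
f.resolve_branch(True, 0x400)                  # leading thread resolves a branch
outcome = f.trailing_predict()                 # trailing thread reuses the outcome
```

Because the leading thread runs ahead, its cache misses and branch resolutions complete before the trailing thread needs them, which is where the performance benefit comes from.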
Slide 19
Baseline Architecture Parameters
L1 instruction cache: 64K bytes, 4-way associative, 32-byte blocks, single ported
L1 data cache: 64K bytes, 4-way associative, 32-byte blocks, four read/write ports
Unified L2 cache: 1M bytes, 4-way associative, 64-byte blocks
Branch predictor: hybrid local/global (like 21264); 13-bit global history register indexing 8K-entry global PHT and 8K-entry choice table; 2K 11-bit local history registers indexing 2K local PHT; 4K-entry BTB, 16-entry RAS (per thread)
Fetch/decode/issue/commit width: 8 instructions/cycle (fetch can span 3 basic blocks)
Functional units: 6 Int ALU, 2 Int Multiply, 4 FP Add, 2 FP Multiply
Fetch-to-decode latency: 5 cycles
Decode-to-execution latency: 10 cycles
Slide 20
Target Architectures
• SRT: SMT + fault detection
– output comparison
– input replication (Load Value Queue)
– slack fetch + Branch Outcome Queue
• ORH-Dual: On-Chip Replicated Hardware
– each pipeline of the dual has half the resources of SRT
– the two pipelines share the fetch stage (including branch predictor)
Slide 21
Performance Model & Benchmarks
• SimpleScalar 3.0
– modified to support SMT by Steve Raasch, U. of Michigan
– SMT/SimpleScalar modified to support SRT
• Benchmarks
– compiled with gcc 2.6 + full optimization
– subset of the SPEC95 suite (11 benchmarks)
– skipped between 300 million and 20 billion instructions
– simulated 200 million instructions for each benchmark
Slide 22
SRT vs. ORH-Dual
[Figure: bar chart of per-benchmark speedup of SRT over ORH-Dual; y-axis “Speedup over ORH-Dual” from 0.0 to 1.4.]
Average improvement = 16%, Maximum = 29%
Slide 23
Recent Related Work
• Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998
+ First to propose use of SMT for fault detection
• AR-SMT, Rotenberg, FTCS, 1999
+ Forwards values from leading to checker thread
• DIVA, Austin, MICRO, 1999
+ Converts checker thread into a simple processor
Slide 24
Improvements over Prior Work
• Sphere of Replication (SoR)
– e.g., AR-SMT register file must be augmented with ECC
– e.g., DIVA must handle uncached loads in a special way
• Output comparison
– e.g., AR-SMT & DIVA compare all instructions; SRT compares selected ones based on the SoR
• Input replication
– e.g., AR-SMT & DIVA detect false transient faults; SRT avoids this problem using the LVQ
• Slack fetch
Slide 25
Summary
• Simultaneous & Redundantly Threaded Processor (SRT)
– SMT + fault detection
– sphere of replication
– output comparison of committed store instructions
– input replication via the load value queue
– slack fetch & branch outcome queue
• SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average & up to 29%