Transient Fault Detection via Simultaneous Multithreading
Shubhendu S. Mukherjee
VSSAD, Alpha Technology, Compaq Computer Corporation
Shrewsbury, Massachusetts
Steven K. Reinhardt
Electrical Engineering & Computer Sciences, University of Michigan
Ann Arbor, Michigan
27th Annual International Symposium on Computer Architecture (ISCA), 2000
Slide 2
Transient Faults
• Faults that persist for a “short” duration
• Cause: cosmic rays (e.g., neutrons)
• Effect: knock off electrons, discharge capacitors
• Solution: no practical absorbent for cosmic rays
– 1 fault per 1000 computers per year (estimated fault rate)
• Future is worse: smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins
Slide 3
Fault Detection in Compaq Himalaya System
[Figure: two cycle-by-cycle lockstepped microprocessors, each executing R1 (R2); inputs are replicated to both and outputs compared. Memory covered by ECC, RAID array covered by parity, Servernet covered by CRC.]
Replicated Microprocessors + Cycle-by-Cycle Lockstepping
Slide 4
Fault Detection via Simultaneous Multithreading
[Figure: the same Himalaya-style system, but with the two replicated microprocessors replaced by two threads. Memory covered by ECC, RAID array covered by parity, Servernet covered by CRC.]
Replicated Microprocessors + Cycle-by-Cycle Lockstepping → Threads?
Slide 5
Simultaneous Multithreading (SMT)
[Figure: instructions from Thread1 and Thread2 enter a shared instruction scheduler, which issues them to shared functional units.]
Example: Alpha 21464
Slide 6
Simultaneous & Redundantly Threaded Processor (SRT)
+ Less hardware compared to replicated microprocessors
– SMT needs ~5% more hardware over a uniprocessor
– SRT adds very little hardware overhead to existing SMT
+ Better performance than complete replication
– better use of resources
+ Lower cost
– avoids complete replication
– market volume of SMT & SRT
SRT = SMT + Fault Detection
Slide 7
SRT Design Challenges
• Lockstepping doesn’t work: SMT may issue the same instruction from redundant threads in different cycles
• Must carefully fetch/schedule instructions from redundant threads
– branch misprediction
– cache miss
Disclaimer: This talk focuses only on fault detection, not recovery
Slide 8
Contributions & Outline
• Sphere of Replication (SoR)
• Output comparison for SRT
• Input replication for SRT
• Performance optimizations for SRT
• SRT outperforms on-chip replicated microprocessors
• Related work
• Summary
Slide 9
Sphere of Replication (SoR)
[Figure: two execution copies inside the sphere of replication; input replication and output comparison sit at the boundary between the sphere and the rest of the system.]
• Logical boundary of redundant execution within a system
• Trade-off between information, time, & space redundancy
Slide 10
Example Spheres of Replication
[Figure, left (Compaq Himalaya): the sphere of replication encloses two microprocessors, with input replication and output comparison at its boundary; memory covered by ECC, RAID array covered by parity, Servernet covered by CRC.]
[Figure, right (ORH-Dual: On-Chip Replicated Hardware, similar to IBM G5): the sphere encloses two pipelines, with input replication and output comparison at its boundary; instruction cache and data cache covered by ECC.]
Slide 11
Sphere of Replication for SRT
[Figure: SMT pipeline showing fetch PC, instruction cache, decode, register rename, integer and FP register files, integer/FP/load-store units, RUU, and data cache, shared by Thread 0 and Thread 1, each executing instructions such as R1 (R2), R3 = R1 + R7, and R8 = R7 * 2.]
• Excludes instruction and data caches
• Alternate SoRs possible (e.g., exclude register file) … not in this talk
Slide 12
Output Comparison in SRT
[Figure: sphere of replication with two execution copies; output comparison at the boundary to the rest of the system.]
• Compare & validate output before sending it outside the SoR
Slide 13
Output Comparison
• <address, data> for stores from redundant threads: compare & validate at commit time
• <address> for uncached loads from redundant threads
• <address> for cached loads from redundant threads: not required
• other output comparison based on the boundary of an SoR
[Figure: a store queue holding stores from both redundant threads; matching entries pass through output comparison on their way to the data cache.]
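The commit-time store comparison above can be sketched in a few lines of Python. This is an illustrative model only, not the paper’s hardware design: class and method names are invented, and the real store queue matches entries by thread and program order in hardware rather than with software queues.

```python
from collections import deque

class StoreComparator:
    """Toy model of SRT output comparison for stores. Each redundant
    thread enqueues its committed stores; a store leaves the sphere of
    replication only after both copies agree on <address, data>."""

    def __init__(self):
        self.pending = [deque(), deque()]  # per-thread committed stores
        self.to_cache = []                 # stores validated and released

    def commit_store(self, thread_id, address, data):
        self.pending[thread_id].append((address, data))
        # When both threads have a committed store at the queue head, compare.
        while self.pending[0] and self.pending[1]:
            s0 = self.pending[0].popleft()
            s1 = self.pending[1].popleft()
            if s0 != s1:
                raise RuntimeError(f"transient fault detected: {s0} != {s1}")
            self.to_cache.append(s0)  # validated: may exit the SoR

cmp = StoreComparator()
cmp.commit_store(0, 0x1000, 42)   # leading thread commits first
cmp.commit_store(1, 0x1000, 42)   # trailing thread's copy matches
assert cmp.to_cache == [(0x1000, 42)]
```

A mismatch between the two copies raises an error in this sketch, standing in for the fault-detection signal a real SRT core would assert.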
Slide 14
Input Replication in SRT
[Figure: sphere of replication with two execution copies; input replication at the boundary to the rest of the system.]
• Replicate & deliver the same input (coming from outside the SoR) to redundant copies
Slide 15
Input Replication
• Cached load data
– pair loads from redundant threads: too slow
– allow both loads to probe the cache: false faults with I/O or multiprocessors
– Load Value Queue (LVQ): pre-designated leading & trailing threads
[Figure: the leading thread’s load (R1 (R2)) probes the cache and deposits its value in the LVQ; the trailing thread’s corresponding load reads the value from the LVQ instead of the cache.]
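The LVQ idea can be sketched as follows. This is a hedged illustration, not the paper’s microarchitecture: names are invented, and a dictionary stands in for the data cache. The point it demonstrates is why the LVQ avoids false faults: even if memory changes between the two threads’ loads (I/O or another processor), both threads observe the same value.

```python
from collections import deque

class LoadValueQueue:
    """Toy LVQ: the leading thread's loads probe the cache and record
    <address, value>; the trailing thread consumes entries in program
    order, checking only that the addresses match."""

    def __init__(self, cache):
        self.cache = cache
        self.lvq = deque()

    def leading_load(self, address):
        value = self.cache[address]       # only the leading thread probes the cache
        self.lvq.append((address, value))
        return value

    def trailing_load(self, address):
        addr, value = self.lvq.popleft()  # in-order consumption from the LVQ
        if addr != address:
            raise RuntimeError("address mismatch: possible transient fault")
        return value                      # same value => inputs replicated exactly

cache = {0x2000: 7}
lvq = LoadValueQueue(cache)
v0 = lvq.leading_load(0x2000)
cache[0x2000] = 9                 # memory changes (I/O or another processor)...
v1 = lvq.trailing_load(0x2000)    # ...yet both threads still see the same value
assert v0 == v1 == 7
```

Had the trailing thread probed the cache directly instead, it would have read 9 and the output comparison would later have flagged a fault that never happened.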
Slide 16
Input Replication (contd.)
• Cached load data, alternate solution: Active Load Address Buffer
• Special cases
– cycle- or time-sensitive instructions
– external interrupts
Slide 17
Outline
• Sphere of Replication (SoR)
• Output comparison for SRT
• Input replication for SRT
• Performance optimizations for SRT
• SRT outperforms on-chip replicated microprocessors
• Related work
• Summary
Slide 18
Performance Optimizations
• Slack fetch
– maintain a constant slack of instructions between leading and trailing threads
+ leading thread prefetches cache misses
+ leading thread prefetches correct branch outcomes
• Branch Outcome Queue
– feed branch outcomes from the leading to the trailing thread
• Combine the above two
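The two optimizations above can be sketched together in a small fetch-policy model. All names and parameters here are made up for illustration (the slack target, fetch width, and queue shape are not the paper’s values); it only shows the control idea: fetch for the leading thread until the slack target is reached, and let the trailing thread consume resolved branch outcomes instead of predicting.

```python
from collections import deque

class SlackFetcher:
    """Toy model of slack fetch plus a branch outcome queue."""

    def __init__(self, target_slack=256):
        self.target_slack = target_slack
        self.fetched = [0, 0]            # instructions fetched per thread
        self.branch_outcomes = deque()   # leading -> trailing thread

    def choose_thread(self):
        slack = self.fetched[0] - self.fetched[1]
        # Prefer the leading thread until the slack target is reached,
        # then give the trailing thread a fetch slot to hold the slack.
        return 0 if slack < self.target_slack else 1

    def fetch_cycle(self, width=8):
        t = self.choose_thread()
        self.fetched[t] += width
        return t

    def resolve_branch(self, taken, target):
        self.branch_outcomes.append((taken, target))  # leading thread resolves

    def trailing_predict(self):
        # The trailing thread's "prediction" is the resolved outcome,
        # so it never suffers a misprediction.
        return self.branch_outcomes.popleft()

f = SlackFetcher(target_slack=16)
threads = [f.fetch_cycle() for _ in range(4)]  # leading runs ahead first
f.resolve_branch(True, 0x400)                  # leading thread resolves a branch
outcome = f.trailing_predict()                 # trailing thread reuses the outcome
```

Because the leading thread runs ahead, its cache misses and branch resolutions complete before the trailing thread needs them, which is where the performance benefit comes from.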
Slide 19
Baseline Architecture Parameters
L1 instruction cache: 64K bytes, 4-way associative, 32-byte blocks, single ported
L1 data cache: 64K bytes, 4-way associative, 32-byte blocks, four read/write ports
Unified L2 cache: 1M bytes, 4-way associative, 64-byte blocks
Branch predictor: hybrid local/global (like 21264); 13-bit global history register indexing 8K-entry global PHT and 8K-entry choice table; 2K 11-bit local history registers indexing 2K local PHT; 4K-entry BTB, 16-entry RAS (per thread)
Fetch/decode/issue/commit width: 8 instructions/cycle (fetch can span 3 basic blocks)
Functional units: 6 Int ALU, 2 Int Multiply, 4 FP Add, 2 FP Multiply
Fetch-to-decode latency: 5 cycles
Decode-to-execution latency: 10 cycles
Slide 20
Target Architectures
• SRT: SMT + fault detection
– output comparison
– input replication (Load Value Queue)
– slack fetch + Branch Outcome Queue
• ORH-Dual: On-Chip Replicated Hardware
– each pipeline of the dual has half the resources of SRT
– the two pipelines share the fetch stage (including branch predictor)
Slide 21
Performance Model & Benchmarks
• SimpleScalar 3.0
– modified to support SMT by Steve Raasch, U. of Michigan
– SMT/SimpleScalar modified to support SRT
• Benchmarks
– compiled with gcc 2.6 + full optimization
– subset of the SPEC95 suite (11 benchmarks)
– skipped between 300 million and 20 billion instructions
– simulated 200 million instructions for each benchmark
Slide 22
SRT vs. ORH-Dual
[Figure: bar chart of per-benchmark speedup of SRT over ORH-Dual; y-axis “Speedup over ORH-Dual” from 0.0 to 1.4.]
Average improvement = 16%, Maximum = 29%
Slide 23
Recent Related Work
• Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998
+ First to propose use of SMT for fault detection
• AR-SMT, Rotenberg, FTCS, 1999
+ Forwards values from leading to checker thread
• DIVA, Austin, MICRO, 1999
+ Converts checker thread into a simple processor
Slide 24
Improvements over Prior Work
• Sphere of Replication (SoR)
– e.g., AR-SMT register file must be augmented with ECC
– e.g., DIVA must handle uncached loads in a special way
• Output comparison
– e.g., AR-SMT & DIVA compare all instructions; SRT compares selected ones based on the SoR
• Input replication
– e.g., AR-SMT & DIVA detect false transient faults; SRT avoids this problem using the LVQ
• Slack fetch
Slide 25
Summary
• Simultaneous & Redundantly Threaded Processor (SRT)
– SMT + fault detection
– sphere of replication
– output comparison of committed store instructions
– input replication via the load value queue
– slack fetch & branch outcome queue
• SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average & up to 29%