AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs

Presenter: Lin Huang
Lin Huang and Qiang Xu
CUhk REliable computing laboratory (CURE)
The Chinese University of Hong Kong

Upload: shalom

Post on 15-Jan-2016



TRANSCRIPT

Page 1: AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs

Presenter: Lin Huang

Lin Huang and Qiang Xu

CUhk REliable computing laboratory (CURE)

The Chinese University of Hong Kong

Page 2: Lifetime Reliability Becomes a Serious Concern

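[Figure: bathtub curve of failure rate versus time, with infant mortality, useful life, and wearout regions; annotated with the 180nm, 130nm, and 90nm technology nodes and with lifetimes of ~10 years, ~7 years, and < 7 years [T. M. Mak]]

Failure mechanisms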

Electromigration

NBTI

TDDB

Reliability-related factors

Temperature

Supply voltage

Frequency

Page 3: Design-Stage Decisions Affect Lifetime Reliability

SPECIFICATION
  Functionality
  Power consumption
  Area constraint
  Thermal issue
  Expected service life
  …

IC
  DPM / DTM: DVFS, timeout, thermal throttling, power gating
  Redundancy: level, quantity
  Task allocation: round-robin, optimized

Without an efficient yet accurate lifetime reliability simulation framework, making good decisions is extremely difficult, if not impossible!

Page 4: The Challenges in Simulation-Based Lifetime Reliability Analysis

Increasing failure rate

Exponential distribution assumption in previous work

[Figure: bathtub curve of failure rate versus time, with infant mortality, useful life, and wearout regions]
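The contrast between a constant and an increasing failure rate can be illustrated with the Weibull distribution, whose failure rate is h(t) = (β/α)(t/α)^(β-1). The sketch below is illustrative only: the scale parameter `ALPHA_YEARS` and the shape values are hypothetical and do not come from the slides. A shape of β = 1 reduces to the exponential distribution, while β > 1 gives a wearout-style increasing failure rate.

```python
def weibull_failure_rate(t, alpha, beta):
    """Failure (hazard) rate of a Weibull distribution with scale alpha and shape beta."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)


ALPHA_YEARS = 10.0          # hypothetical scale parameter, in years
times = [1.0, 5.0, 10.0]    # points in time, in years

# beta = 1: exponential distribution, the failure rate never changes.
exponential = [weibull_failure_rate(t, ALPHA_YEARS, 1.0) for t in times]

# beta = 2 (hypothetical): the failure rate grows with time, as in the wearout region.
wearout = [weibull_failure_rate(t, ALPHA_YEARS, 2.0) for t in times]

for t, h_exp, h_wear in zip(times, exponential, wearout):
    print(f"t={t:5.1f} y  exponential h={h_exp:.3f}  weibull(beta=2) h={h_wear:.3f}")
```

An exponential model therefore cannot represent aging: a ten-year-old part is treated as exactly as likely to fail as a new one.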

Page 5: The Challenges in Simulation-Based Lifetime Reliability Analysis

Operational temperature varies significantly and rapidly

Obtained with HotSpot 4.0 [Huang-ieeetc08]

How to achieve efficient yet accurate lifetime reliability simulation with such limited information, when failure mechanisms follow arbitrary failure distributions?

Page 6: Key Idea

General failure distribution with a general scale parameter, by which time is divided
  Example: Weibull failure distribution

Suppose we can express the reliability function as

and can be computed according to limited tracing information
  Example: reliability function
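As a reference for the Weibull example, the standard two-parameter Weibull reliability function is written out below. The symbols α (scale) and β (shape) are notation assumed here and are not necessarily the slide's own. The second line is one common way, again an assumption rather than the authors' stated formula, to handle a scale parameter that changes over time: each interval of length τ_i is divided by the scale parameter α_i that applies during it, and the results are accumulated.

```latex
R(t) = \exp\!\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right]

R(t) = \exp\!\left[-\left(\sum_{i} \frac{\tau_i}{\alpha_i}\right)^{\beta}\right],
\qquad \sum_{i} \tau_i = t
```

With a single constant scale parameter the second form reduces to the first.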

Page 7: Key Idea

Aging rate
  Captures the impact of a given usage strategy

Reliability-related usage strategy
  A combination of
    Dynamic power/thermal management
    Trigger mechanism
    Load-sharing strategy
  given an application flow with certain characteristics

Page 8: Key Idea

[Diagram: temperature, supply voltage, and frequency under a usage strategy; a representative workload on a past/future timeline; aging rate]

Page 9: Key Idea

[Diagram: representative workload on a past/future timeline]

Page 10: Proposed Simulation Framework: AgeSim – Step One: Simulation and Tracing

[Diagram, evaluated at every time step]
Inputs: Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, Redundancy Scheme
Simulators: Power / Thermal Manager, Power Simulator, Temperature Simulator
Per-step data: Execution Mode, Power (Data), Temperature (Data)

Page 11: Proposed Simulation Framework: AgeSim – Step One: Simulation and Tracing

[Diagram]
Inputs: Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, Redundancy Scheme
Simulators: Power / Thermal Manager, Power Simulator, Temperature Simulator
Per-step data: Execution Mode, Power (Data), Temperature (Data)
Output: Reliability-Related Factors, written to a Trace File
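The tracing step can be pictured as a loop that, at every time step, picks an execution mode, looks up its power, updates the temperature, and appends a record to the trace. The sketch below is a toy stand-in, not the AgeSim implementation: the mode names, power values, and the first-order thermal model are all hypothetical, and a real flow would call dedicated power and temperature simulators instead.

```python
from dataclasses import dataclass


@dataclass
class TraceRecord:
    time_s: float
    mode: str
    power_w: float
    temperature_c: float


# All constants below are hypothetical illustration values.
MODE_POWER_W = {"run": 20.0, "idle": 5.0}
AMBIENT_C = 45.0
THERMAL_RESISTANCE_C_PER_W = 1.5
THERMAL_TIME_CONSTANT_S = 0.05


def simulate(application_flow, time_step_s):
    """Produce one trace record per time step.

    application_flow: list of execution modes, one entry per time step.
    """
    trace = []
    temperature = AMBIENT_C
    for step, mode in enumerate(application_flow):
        power = MODE_POWER_W[mode]                      # toy "power simulator"
        steady_state = AMBIENT_C + THERMAL_RESISTANCE_C_PER_W * power
        # Toy "temperature simulator": first-order move toward steady state.
        temperature += (steady_state - temperature) * (time_step_s / THERMAL_TIME_CONSTANT_S)
        trace.append(TraceRecord((step + 1) * time_step_s, mode, power, temperature))
    return trace


trace = simulate(["run"] * 5 + ["idle"] * 5, time_step_s=0.01)
for record in trace:
    print(f"{record.time_s:.2f}s {record.mode:4s} {record.power_w:5.1f}W {record.temperature_c:6.2f}C")
```

The resulting list plays the role of the trace file: it holds the reliability-related factors per time step and is all that the aging-rate step needs.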

Page 12: Proposed Simulation Framework: AgeSim – Step Two: Aging Rate Calculation

[Diagram: Reliability-Related Factors and Trace File are combined to compute the Aging rate]
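One way to turn a trace into an aging rate is sketched below. It assumes, and this is an assumption rather than the formula on the slide, a Weibull model whose scale parameter follows an Arrhenius-style temperature dependence, and it defines the aging rate as the time-averaged reciprocal of that scale parameter. Every constant and function name here is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5      # Boltzmann constant in eV/K
ACTIVATION_ENERGY_EV = 0.7         # hypothetical activation energy
REFERENCE_TEMPERATURE_K = 345.0    # hypothetical reference temperature
REFERENCE_SCALE_YEARS = 10.0       # hypothetical Weibull scale at the reference temperature
SHAPE = 2.0                        # hypothetical Weibull shape parameter


def scale_parameter(temperature_k):
    """Weibull scale (years) at a given temperature, Arrhenius-style (assumed model)."""
    exponent = (ACTIVATION_ENERGY_EV / BOLTZMANN_EV_PER_K) * (
        1.0 / temperature_k - 1.0 / REFERENCE_TEMPERATURE_K
    )
    return REFERENCE_SCALE_YEARS * math.exp(exponent)


def aging_rate(trace):
    """Time-averaged 1/scale over a trace of (duration, temperature_k) pairs, in 1/years."""
    total_duration = sum(duration for duration, _ in trace)
    weighted = sum(duration / scale_parameter(temp) for duration, temp in trace)
    return weighted / total_duration


def mttf_years(rate, shape=SHAPE):
    """Mean time to failure for R(t) = exp(-(rate * t) ** shape)."""
    return math.gamma(1.0 + 1.0 / shape) / rate


cool_trace = [(1.0, 335.0), (1.0, 345.0)]
hot_trace = [(1.0, 345.0), (1.0, 365.0)]
for name, trace in (("cool", cool_trace), ("hot", hot_trace)):
    rate = aging_rate(trace)
    print(f"{name}: aging rate {rate:.4f} per year, MTTF {mttf_years(rate):.2f} years")
```

Because only durations and temperatures enter the sum, a short trace of a representative workload is enough to compute the rate.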

Page 13: Model Validation

By average temperature: 28.3% error in MTTF

By AgeSim: almost identical results
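Why an average temperature can mislead is easy to see with a small numeric experiment. The sketch below assumes an Arrhenius-style aging speed and a hypothetical two-level temperature trace. It is not the validation setup from the slide, and the error it prints depends entirely on these made-up values.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5   # Boltzmann constant in eV/K
ACTIVATION_ENERGY_EV = 0.7      # hypothetical activation energy


def relative_aging(temperature_k):
    """Arrhenius-style aging speed, up to a constant factor (assumed model)."""
    return math.exp(-ACTIVATION_ENERGY_EV / (BOLTZMANN_EV_PER_K * temperature_k))


# Hypothetical trace: (fraction of time, temperature in kelvin).
trace = [(0.5, 330.0), (0.5, 370.0)]

# Shortcut: evaluate the model once, at the average temperature.
average_temperature = sum(weight * temp for weight, temp in trace)
rate_from_average = relative_aging(average_temperature)

# Trace-based: average the aging speed itself over the trace.
rate_from_trace = sum(weight * relative_aging(temp) for weight, temp in trace)

# MTTF is inversely proportional to the aging rate, so the constant factor cancels.
mttf_overestimate = rate_from_trace / rate_from_average - 1.0
print(f"average temperature: {average_temperature:.1f} K")
print(f"MTTF overestimated by {mttf_overestimate:.1%} when using the average temperature")
```

The aging speed grows faster than linearly with temperature in this range, so hot intervals contribute far more wear than cool intervals save, and the average-temperature shortcut overestimates MTTF.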

Page 14: Case Study I: Dynamic Voltage and Frequency Scaling

No DVFS
DVFS1: low voltage is 90% Vdd
DVFS2: low voltage is 80% Vdd

[State diagram: states HVIdle, HVRun, LVIdle, and LVRun. Task arrival and task departure switch between the Idle and Run states at each voltage level. Transitions labeled T>TH and T<TL switch between the high-voltage and low-voltage states.]
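The DVFS policy can be written as a small table-driven state machine. The sketch below is one reading of the state labels, and the transition directions are an assumption: task arrival and departure toggle between Idle and Run, a temperature above TH drops to the low-voltage states, and a temperature below TL returns to the high-voltage states. The threshold values are hypothetical.

```python
# (state, event) -> next state. Directions of the thermal transitions are assumed.
TRANSITIONS = {
    ("HVIdle", "task_arrival"): "HVRun",
    ("HVRun", "task_departure"): "HVIdle",
    ("LVIdle", "task_arrival"): "LVRun",
    ("LVRun", "task_departure"): "LVIdle",
    ("HVIdle", "T>TH"): "LVIdle",
    ("HVRun", "T>TH"): "LVRun",
    ("LVIdle", "T<TL"): "HVIdle",
    ("LVRun", "T<TL"): "HVRun",
}


def step(state, event):
    """Return the next state; events with no matching transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def thermal_event(temperature_c, t_high_c, t_low_c):
    """Map a temperature reading to a thermal trigger, or None inside the hysteresis band."""
    if temperature_c > t_high_c:
        return "T>TH"
    if temperature_c < t_low_c:
        return "T<TL"
    return None


def run(initial_state, events):
    """Apply a sequence of events and return every state visited."""
    states = [initial_state]
    for event in events:
        states.append(step(states[-1], event))
    return states


T_HIGH_C, T_LOW_C = 85.0, 70.0  # hypothetical thresholds
events = [
    "task_arrival",
    thermal_event(90.0, T_HIGH_C, T_LOW_C),
    "task_departure",
    thermal_event(65.0, T_HIGH_C, T_LOW_C),
]
visited = run("HVIdle", events)
print(" -> ".join(visited))
```

Using two thresholds rather than one gives hysteresis, which keeps the policy from oscillating when the temperature hovers near a single trip point.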

Page 15: Case Study I: Dynamic Voltage and Frequency Scaling

System load: the ratio between the task arrival rate and the service rate
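Written as a formula, with λ for the task arrival rate and μ for the service rate (symbols assumed here, following common queueing notation), the system load is:

```latex
\rho = \frac{\lambda}{\mu}
```

For example, 8 task arrivals per second against a service rate of 10 tasks per second gives a load of 0.8.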

Page 16: Case Study I: Dynamic Voltage and Frequency Scaling

System load: the ratio between the task arrival rate and the service rate

Page 17: Case Study II: Task Allocation on Multi-Core Processors

Random allocation

Performance-aware allocation
  Always choose the available core with the highest frequency

[Figure: Example Chip Frequency Map [Sarangi-ieeetsm08]]
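The two allocation policies differ only in how they pick among the currently available cores. The sketch below uses a hypothetical four-core frequency map, not the chip map from the slide, and hypothetical function names.

```python
import random

# Hypothetical per-core maximum frequencies in GHz (process variation makes them differ).
FREQUENCY_GHZ = {0: 3.2, 1: 2.8, 2: 3.5, 3: 3.0}


def random_allocation(available_cores, rng):
    """Pick any available core with equal probability."""
    return rng.choice(sorted(available_cores))


def performance_aware_allocation(available_cores, frequency_ghz=FREQUENCY_GHZ):
    """Always pick the available core with the highest frequency."""
    return max(sorted(available_cores), key=lambda core: frequency_ghz[core])


rng = random.Random(0)        # seeded so repeated runs behave the same
available = {0, 1, 3}         # core 2 is busy in this example
random_pick = random_allocation(available, rng)
fastest_pick = performance_aware_allocation(available)
print(f"random allocation chose core {random_pick}")
print(f"performance-aware allocation chose core {fastest_pick}")
```

Always favouring the fastest cores concentrates work, and therefore heat and wear, on them, which is why the choice of policy matters for lifetime reliability.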

Page 18: Case Study II: Task Allocation on Multi-Core Processors

System load: the ratio between the task arrival rate and the service rate

Page 19: Discussion on the Flexibility of AgeSim

Task allocation and scheduling for MPSoC under lifetime reliability constraint

Multiprocessor with different redundancy schemes
  Example: gracefully degrading redundancy, standby redundancy

Page 20: Conclusion

Lifetime reliability has become a serious concern for high-performance ICs

Design-stage decisions significantly affect system reliability

We propose an efficient yet accurate simulation framework to evaluate system reliability under various usage strategies
  Arbitrary failure distribution
  Fine-grained tracing for representative workloads

AgeSim is effective and flexible

Page 21: Thank you for your attention!

Page 22: Backup Slides

Multiple representative workloads
Aging rate
Accuracy
Key idea

Page 23: Multiple Representative Workloads

The proposed method could be easily extended to analyze the system with multiple representative workloads

We can organize the workloads into a hyper-workload with their occurrence probabilities

We can extract the aging rate and occurrence probability for each workload and then compute the unified aging rate by
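A natural form for the unified aging rate, given here as an assumption and not as the slide's own equation, is the probability-weighted sum of the per-workload aging rates. The symbols Ω_i (aging rate of workload i) and p_i (its occurrence probability, read as the fraction of time spent in that workload) are notation introduced for this sketch.

```latex
\Omega = \sum_{i} p_i \, \Omega_i, \qquad \sum_{i} p_i = 1
```

For instance, two workloads with aging rates 0.08 and 0.12 per year that occur 75% and 25% of the time would give a unified rate of 0.75 × 0.08 + 0.25 × 0.12 = 0.09 per year.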

Page 24: Aging Rate

[Figure: failure rate versus time]

Aging rate is independent of time

Page 25: Accuracy

Page 26: Key Idea

[Diagram: the Power State Machine, Trigger Mechanism, Application Flow, Load-sharing Strategy, and Redundancy Scheme make up the processor usage strategy, which determines the aging rate and, from it, the reliability function]