Fault Detection

Sathish S. Vadhiyar

Source/Credits: From Referenced Papers

Introduction

Fine Grain Cycle Sharing (FGCS)
  Host computers allow guest jobs to utilize CPU cycles
  Availability of host computers varies
  Guest jobs may incur resource failures
Need to predict availability of host computers
  A scheduling system can allocate guest jobs based on the availability of host computers

Kinds of Non-Availabilities

FRC (Failures Caused by Resource Contention)
  A guest job may significantly impact host processes
  Hence a guest job can be removed
FRR (Failures Caused by Resource Revocation)
  A machine owner suspends resource contribution without notice
  Hardware-software failures occur

Resource Failure Prediction

A multi-state failure model and application of a semi-Markov Process (SMP) to predict the temporal reliability
Predicting the probability that no resource failure will occur on a machine in a future time window
Observing host resource usage values in a time window; calculating parameters of the SMP based on host resource usage values

Multi-state resource failure model

FRR – 2 states
  A machine is either available or unavailable
FRC
  Failures when host processes incur noticeable slowdown due to contention from guest processes
  A host processor can first decrease the priority of guest processes; if this does not help, the guest process is terminated
  Measured host resource usage is used as an indicator of noticeable slowdown

Initial Experiments

To study relations between host resource usage and FRC, experiments were conducted to simulate resource contention between a guest process and host processes
Host-group – an aggregated set of host processes with various resource usages
Slowdown of host group – reduction of its CPU utilization due to the contending guest process
Host programs are run with their isolated CPU usage between 10% and 100%
Guest process – a CPU-bound program

Experiments on CPU contention

Also measured reduction rate of host CPU usage for a host-group
Experiments repeated with different host groups, with host priority 0 and guest priority 0 and 19 (renice)
Measured reduction rate plotted as a function of isolated host CPU usage, LH
Found 2 thresholds for LH
  Th1 – value of LH above which the guest process needs to be reniced to keep the reduction rate below 5%
  Th2 – value of LH above which the guest process needs to be suspended to keep the reduction rate below 5%
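The threshold search can be illustrated with a minimal sketch. It assumes the reduction rate is the fractional drop in host CPU usage relative to the isolated run, and that a threshold is read off as the largest measured LH up to which the reduction stays within 5%. The function names and all sample values are hypothetical and are not data from the referenced paper.

```python
def reduction_rate(isolated_usage, contended_usage):
    """Fractional drop in host CPU usage caused by the contending guest."""
    return (isolated_usage - contended_usage) / isolated_usage


def find_threshold(samples, limit=0.05):
    """Largest isolated usage LH up to which the reduction stays within limit.

    samples: list of (isolated_usage, contended_usage) pairs in any order.
    Returns None if even the smallest measured LH exceeds the limit.
    """
    threshold = None
    for isolated, contended in sorted(samples):
        if reduction_rate(isolated, contended) > limit:
            break
        threshold = isolated
    return threshold


# Hypothetical measurements in percent CPU (assumed values, for illustration).
equal_priority = [(10, 10.0), (20, 19.6), (30, 27.0), (50, 40.0)]
reniced_guest = [(10, 10.0), (20, 20.0), (30, 29.4), (50, 48.5), (70, 60.0)]

th1 = find_threshold(equal_priority)  # guest at priority 0
th2 = find_threshold(reniced_guest)   # guest at priority 19
```

With these assumed samples, Th1 comes from the runs where the guest keeps priority 0 and Th2 from the runs where the guest is reniced to 19.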

State model for FRC

3 states
  S1 – when LH < Th1; ignore resource contention due to guest processes; slowdown already less than 5%
  S2 – when Th1 < LH < Th2; renice guest processes for slowdown to be < 5%
  S3 – when LH > Th2; terminate guest process
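The three CPU-contention states map directly onto a small classifier. The sketch below is illustrative only: the function name, the action labels, and the threshold values in the example are assumptions, and because the slides leave the boundary cases LH = Th1 and LH = Th2 undefined, the sketch assigns both to S2.

```python
# Action associated with each state, as described in the state model.
ACTIONS = {"S1": "none", "S2": "renice", "S3": "terminate"}


def frc_state(host_cpu_usage, th1, th2):
    """Map measured host CPU usage LH to S1, S2 or S3.

    Boundary values (LH == th1 or LH == th2) are treated as S2; this is an
    assumption, since the source only gives strict inequalities.
    """
    if host_cpu_usage < th1:
        return "S1"
    if host_cpu_usage <= th2:
        return "S2"
    return "S3"


# Example with hypothetical thresholds Th1 = 20% and Th2 = 60%.
state = frc_state(45.0, th1=20.0, th2=60.0)
action = ACTIONS[state]
```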

Experiments on CPU and Memory Contention

Memory thrashing occurs when the total memory of guest and host processes exceeds the available memory size
Experiments were conducted to verify that memory thrashing does not depend on guest priority
S4 – state for failure due to memory thrashing

Multi-State Failure Model

The proposed prediction algorithm predicts the probability that a machine will never transfer to S3, S4, or S5 within a future time window
Transitions
  Between S1, S2, S3 – decided by measured host CPU usage
  To S4 – when memory is limited

Semi-Markov Process Model (SMP)

Applicable when the next transition depends only on
  Current state
  How long the system has been in the current state
Transition probabilities depend on the amount of time elapsed since the last change in state
SMP is defined by a 3-tuple
  S – finite set of states
  Q – state transition matrix
  H – holding time mass function matrix

SMP (Contd…)

The most important statistics of an SMP are the interval transition probabilities, P
To calculate P
  Continuous-time SMP is expensive
  Hence the work develops a discrete-time SMP model

SMP for Resource Availability

TR – probability of never transferring to S3, S4 or S5 within an arbitrary time window, W
$S_{init}$ – initial system state
W – time window from $W_{init}$ to $W_{init} + T$
Q and H are calculated from statistics in history logs obtained by monitoring host resource usage
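One way to obtain Q and H from a history log is simple frequency counting. The sketch below assumes the log has already been reduced to a chronological list of (state, holding time) sojourns, with holding times expressed in whole discretization intervals. The log format, the function name, and the sample log are hypothetical; the paper's actual estimation procedure may differ.

```python
from collections import Counter, defaultdict


def estimate_smp(sojourns):
    """Estimate Q and H by counting transitions in a sojourn log.

    sojourns: chronological list of (state, holding_time) pairs.
    Returns (Q, H) where Q[i][j] is the fraction of departures from i that
    went to j, and H[i][j][m] is the fraction of i -> j transitions that
    happened after holding i for m intervals.
    """
    transitions = Counter()
    holding = defaultdict(Counter)
    # The final sojourn has no successor, so its holding time is not used.
    for (state, held), (next_state, _) in zip(sojourns, sojourns[1:]):
        transitions[(state, next_state)] += 1
        holding[(state, next_state)][held] += 1

    departures = Counter()
    for (state, _), count in transitions.items():
        departures[state] += count

    Q = defaultdict(dict)
    H = defaultdict(dict)
    for (i, j), count in transitions.items():
        Q[i][j] = count / departures[i]
        H[i][j] = {m: c / count for m, c in holding[(i, j)].items()}
    return dict(Q), dict(H)


# Hypothetical log: state names follow the slides, durations are made up.
log = [("S1", 3), ("S2", 1), ("S1", 2), ("S2", 1), ("S1", 3), ("S3", 0)]
Q, H = estimate_smp(log)
```

In this made-up log, S1 is left three times (twice to S2, once to S3), so Q["S1"] assigns 2/3 to S2 and 1/3 to S3.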

SMP for Resource Availability

$P_{i,j}(m) = P_{i,j}(W_{init}, W_{init} + m)$
$P^{1}_{i,k}(l)$ – interval transition probabilities for a one-step transition
d – time unit of a discretization interval
Q and H are calculated from statistics in history logs obtained by monitoring host resource usage
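The temporal reliability can be sketched with a standard discrete-time SMP recursion in which the failure states are treated as absorbing. This is an illustration under stated assumptions, not necessarily the exact formulation in the referenced paper: Q[i][k] is the probability that the next state after i is k, H[i][k][l] is the probability that this transition takes l intervals (l >= 1), the process has just entered the initial state, and the window length is given as a number of intervals (T/d). All names and the sample values are hypothetical.

```python
def temporal_reliability(Q, H, init, failure_states, steps):
    """Probability of not entering any failure state within `steps` intervals."""
    states = set(Q) | set(failure_states) | {init}
    for row in Q.values():
        states |= set(row)

    # absorbed[n][i]: probability that a process which has just entered
    # state i reaches a failure state within n intervals.
    absorbed = [{s: (1.0 if s in failure_states else 0.0) for s in states}]
    for n in range(1, steps + 1):
        row = {}
        for i in states:
            if i in failure_states:
                row[i] = 1.0  # failure states are never left
                continue
            total = 0.0
            for k, q in Q.get(i, {}).items():
                for l, h in H[i][k].items():
                    if 1 <= l <= n:  # a transition takes at least one interval
                        total += q * h * absorbed[n - l][k]
            row[i] = total
        absorbed.append(row)
    return 1.0 - absorbed[steps][init]


# Hypothetical parameters: every transition takes exactly one interval.
Q = {"S1": {"S2": 0.5, "S3": 0.5}, "S2": {"S1": 1.0}}
H = {"S1": {"S2": {1: 1.0}, "S3": {1: 1.0}}, "S2": {"S1": {1: 1.0}}}
tr = temporal_reliability(Q, H, "S1", {"S3", "S4", "S5"}, steps=3)
```

Because failure states short-circuit to probability 1 and zero-length transitions are excluded, each step only touches the nonzero entries of Q and H.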

System Design and Implementation

Client requests job submission
Client's job scheduler queries the gateways on available machines for temporal availabilities
Chooses a machine and spawns a guest job
During job execution, the monitor detects state transitions and notifies the gateway
Gateway renices or kills the guest processes accordingly
Resource monitor uses simple commands like 'top' to calculate CPU usage
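The machine-selection step might look like the following sketch. The slides only say that the scheduler queries the gateways and chooses a machine; picking the host with the highest predicted temporal reliability, the optional minimum cutoff, and the host names are all assumptions made for illustration.

```python
def choose_machine(reliability_by_host, minimum=0.0):
    """Return the host with the highest predicted temporal reliability.

    reliability_by_host: dict mapping host name to the TR value reported
    by its gateway. Hosts below `minimum` are skipped; returns None if no
    host qualifies.
    """
    candidates = {
        host: tr for host, tr in reliability_by_host.items() if tr >= minimum
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical gateway replies for a requested time window.
replies = {"host-a": 0.62, "host-b": 0.91, "host-c": 0.40}
chosen = choose_machine(replies, minimum=0.5)
```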

Computation in Solving SMP

Matrix sparsity in the SMP is exploited to reduce computations
The sparse matrix is constructed based on 2 facts:
  It takes a finite amount of time to transition from one state to another
  S3, S4, S5 are unrecoverable failure states

Prediction Accuracy

TR gets close to 0 for large time windows

Appropriate Training Size

Comparison with Linear Regression Techniques

Injecting Noises

References

Resource Failure Prediction in Fine-Grained Cycle Sharing Systems. X. Ren, S. Lee, R. Eigenmann, S. Bagchi. HPDC 2006.
