Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

Maria Chtepen, Filip H.A. Claeys, Bart Dhoedt, Member, IEEE, Filip De Turck, Member, IEEE, Piet Demeester, Senior Member, IEEE, and Peter A. Vanrolleghem

Presented by Chen, Ting-Wei

Post on 21-Dec-2015


Table of Contents

• Introduction

• Adaptive Checkpointing Heuristics

• Replication-Based Heuristics

• Conclusion and Future Work

Introduction

• A novel fault-tolerant algorithm combines
  – Checkpointing
  – Replication
• It is evaluated in
  – A newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE)

Introduction (cont.)

• Simulation
  – Runs employ workload and system parameters
    • Taken from the logs of several large-scale parallel production systems
  – Uses the discrete-event grid simulator DSiDE

Introduction (cont.)

• Throughput and fault tolerance comparable to
  – Static checkpointing with optimal parameters
  – Replication with optimal parameters

Adaptive Checkpointing Heuristics

• The Checkpointing Model
  – Limits
    • Runtime overhead (C)
    • Network latency (L)
    • Recovery delay (R)
  – Concentrates on the reduction of the checkpointing runtime overhead

Adaptive Checkpointing Heuristics (cont.)

– Problem
  • Assumes the execution time can be determined exactly in advance
– Simulation
  • Gives the upper bounds of the algorithms' performance with respect to this parameter

Adaptive Checkpointing Heuristics (cont.)

• Last Failure Dependent Checkpointing (LastFailureCP)
  – Goal
    • To reduce the overhead
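The slide states only the goal of LastFailureCP, and a later slide attributes its gains to the omission of redundant checkpoints. The Python sketch below shows one plausible omission rule and is not taken from the slides: it assumes the heuristic remembers the time of the last failure on each resource and skips a requested checkpoint when the resource has already stayed up for at least the job's expected execution time. The names `Resource`, `last_failure`, and `should_checkpoint` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    # Assumed bookkeeping: time of the last detected failure on this resource
    # (or the system start time if it has never failed).
    last_failure: float


def should_checkpoint(now: float, resource: Resource, job_exec_time: float) -> bool:
    """Decide whether a requested checkpoint should actually be taken.

    Assumed rule (not from the slides): omit the checkpoint when the time
    since the last failure is at least the job's expected execution time.
    """
    time_since_failure = now - resource.last_failure
    return time_since_failure < job_exec_time
```

Under this reading, a resource that failed recently keeps checkpointing, while a resource with a long failure-free history skips checkpoints and so avoids their runtime overhead.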

Adaptive Checkpointing Heuristics (cont.)

• Mean Failure Dependent Checkpointing (MeanFailureCP)
  – Unlike LastFailureCP, which only considers checkpoint omissions, it modifies the checkpointing interval based on runtime information
    • The remaining job execution time
    • The average failure interval of the resource
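The interval adaptation of MeanFailureCP can be sketched in Python as follows. The slide names only the two inputs (remaining job execution time and average failure interval of the resource); the direction of the adjustment, the fixed step, and the clamping bounds are assumptions made for illustration, and `adapt_interval` is a hypothetical name.

```python
def adapt_interval(
    interval: float,
    remaining_time: float,
    mean_failure_interval: float,
    step: float,
    min_interval: float,
    max_interval: float,
) -> float:
    """Return the next checkpointing interval.

    Assumed rule: if the job is expected to finish before the next failure
    (remaining time below the mean failure interval), checkpoint less often;
    otherwise checkpoint more often.
    """
    if remaining_time < mean_failure_interval:
        new_interval = interval + step
    else:
        new_interval = interval - step
    # Keep the interval inside the (hypothetical) allowed range.
    return max(min_interval, min(max_interval, new_interval))
```

Clamping keeps the interval from shrinking to zero on unreliable resources or growing without bound on stable ones.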

Adaptive Checkpointing Heuristics (cont.)

• DSiDE Simulation Environment
  – Goal
    • Validate
  – Architecture
    • DExec
    • DGen
  – Each DSiDE event has a time stamp
    • Provided a priori or at runtime
  – Supports several types of dynamic system modifications

Adaptive Checkpointing Heuristics (cont.)

• The DSiDE simulator architecture

Adaptive Checkpointing Heuristics (cont.)

– The resource performed useful computations

\[ A_r = \left(1 - \frac{\sum_{n=1}^{N} \left(t^{r}_{r,n} - t^{f}_{r,n}\right)}{T_{sim}}\right) \times 100 \]

– Total grid availability

\[ A_{grid} = \frac{\sum_{r=1}^{R} A_r}{T_{sim} \cdot R} \times 100 \]

– DSiDE provides a set of events to specify network links and routes
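The two availability measures on this slide can be illustrated with a short Python sketch. It rests on an interpretation rather than on definitions given in the slides: each outage is taken to be a (failure time, restore time) pair, resource availability is the percentage of the simulated time T_sim not spent in outages, and grid availability is the total uptime of all R resources divided by T_sim × R. The function names and the outage representation are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed representation of one outage: (failure_time, restore_time).
Outage = Tuple[float, float]


def downtime(outages: Sequence[Outage]) -> float:
    """Total time a resource spent failed."""
    return sum(restore - fail for fail, restore in outages)


def resource_availability(outages: Sequence[Outage], t_sim: float) -> float:
    """Percentage of the simulated time during which the resource was up."""
    return (1.0 - downtime(outages) / t_sim) * 100.0


def grid_availability(per_resource: Sequence[Sequence[Outage]], t_sim: float) -> float:
    """Percentage of total resource time (T_sim * R) during which resources were up."""
    total_uptime = sum(t_sim - downtime(outages) for outages in per_resource)
    return total_uptime / (t_sim * len(per_resource)) * 100.0
```

With equal simulated time for every resource, the grid figure computed this way equals the mean of the per-resource percentages.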

Adaptive Checkpointing Heuristics (cont.)

• Simulation Results
  – To compare the performance
    • Checkpointing heuristics
    • Realistic workload
    • System failure model

Adaptive Checkpointing Heuristics (cont.)

– Job submission time
  • 80% (7 a.m. to 9 p.m.)
  • 20% (9 p.m. to 7 a.m.)
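A workload generator that follows this day/night split could look like the Python sketch below. Only the 80%/20% proportions and the two time windows come from the slide; drawing uniformly within each window is an assumption, and `sample_submit_hour` is a hypothetical name.

```python
import random


def sample_submit_hour(rng: random.Random) -> float:
    """Draw a job submission time of day, in hours within [0, 24)."""
    if rng.random() < 0.8:
        # Daytime window: 7 a.m. to 9 p.m.
        return rng.uniform(7.0, 21.0)
    # Night window: 9 p.m. to 7 a.m., sampled past midnight and wrapped around.
    return rng.uniform(21.0, 31.0) % 24.0
```

Passing in a seeded `random.Random` instance keeps simulation runs reproducible.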

Adaptive Checkpointing Heuristics (cont.)

– Execution time
  • More than 80 percent of all submitted jobs have medium execution times
  • 1 hour to 6 hours

Adaptive Checkpointing Heuristics (cont.)

– As the checkpointing interval I decreases, longer jobs can get processed
– The increase in job runtime comes into effect
– The results
  • The results achieved with PeriodicCP are partially improved by LastFailureCP due to the omission of redundant checkpoints
  • The technique provides the best results for short checkpointing intervals
  • The effectiveness of LastFailureCP strongly depends on failure periodicity

Adaptive Checkpointing Heuristics (cont.)

• Failures occur quite periodically
  – Can easily be predicted by the algorithm
  – LastFailureCP will perform similarly to PeriodicCP
• The fully dynamic scheme of MeanFailureCP proves to be the most effective
• A selective increase in checkpointing keeps the number of processed jobs and the average execution time of MeanFailureCP more or less constant
• For the PeriodicCP and LastFailureCP algorithms, the performance drops considerably

Replication-based Heuristics

• Load-Dependent Replication (LoadDependentRep)
  – Provides fault tolerance in distributed environments through replication
    • Idle resources can be utilized to run job copies without significantly delaying the execution of the original job

Replication-based Heuristics (cont.)

– The algorithm requires a number of parameters to be provided in advance
  • Minimum number of job copies (Repmin)
  • Maximum number of job copies (Repmax)
  • The CPU limit (CL)

Replication-based Heuristics (cont.)

– The outcome of the comparison of the CPU availability (CA) with the CPU limit (CL) determines the choice for the next job to be scheduled
  • CA ≥ CL: a job with fewer than Repmax replicas
  • 0 < CA < CL: a job with fewer than Repmin replicas
  • CA = 0: skip the current scheduling round
– When one of the job duplicates finishes, the other replicas are automatically canceled
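The three-way selection rule above can be sketched in Python. The thresholds and outcomes follow the slide; the tie-breaking choice (prefer the eligible job with the fewest copies) and all names are assumptions made for illustration.

```python
from typing import Dict, Optional


def pick_job_to_replicate(
    ca: float,
    cl: float,
    replicas: Dict[str, int],
    rep_min: int,
    rep_max: int,
) -> Optional[str]:
    """Pick the next job to receive a copy, or None to skip this round.

    ca: current CPU availability, cl: CPU limit,
    replicas: number of copies currently running per job.
    """
    if ca <= 0:
        return None  # CA = 0: skip the current scheduling round
    # CA >= CL allows up to Repmax copies; 0 < CA < CL allows only Repmin.
    limit = rep_max if ca >= cl else rep_min
    candidates = [job for job, count in replicas.items() if count < limit]
    if not candidates:
        return None
    # Assumption: among eligible jobs, prefer the one with the fewest copies.
    return min(candidates, key=lambda job: replicas[job])
```

For example, with `rep_min=1`, `rep_max=3`, and `cl=40`, a job that already has one copy is eligible only while `ca` is at least 40.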

Replication-based Heuristics (cont.)

• Failure Detection and Load Dependent Replication (FailureDependentRep)
  – Increases the fault tolerance of the previously discussed LoadDependentRep heuristic
  – Offers a higher level of fault tolerance compared to solely replication-based strategies
  – Does not ensure job execution

Replication-based Heuristics (cont.)

• Adaptive Checkpoint and Replication-Based Fault Tolerance (CombinedFT)
  – Dynamically switches between both techniques based on runtime information on system load
    • Checkpointing mode
    • Replication mode

Replication-based Heuristics (cont.)

– Checkpointing mode
  • CPU availability is low (CA < CL)
  • CombinedFT rolls back the earlier distributed active job replicas (ARj)
  • Starts job checkpointing
    – ARj > 0
    – ARj = 0 & CA > 0
    – ARj = 0 & CA = 0 & ∃i: ARi > 1
    – ARj = 0 & CA = 0 & ¬∃i: ARi > 1
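The four conditions listed for checkpointing mode are mutually exclusive, which a small Python classifier makes explicit. The slide does not say which action CombinedFT takes in each case, so the sketch only identifies the case; the enum member names and the function `classify` are hypothetical.

```python
from enum import Enum
from typing import Dict


class Case(Enum):
    HAS_ACTIVE_REPLICA = "ARj > 0"
    FREE_CPU = "ARj = 0 & CA > 0"
    OTHER_JOB_HAS_SPARE_REPLICA = "ARj = 0 & CA = 0 & exists i: ARi > 1"
    NOTHING_AVAILABLE = "ARj = 0 & CA = 0 & no i with ARi > 1"


def classify(job: str, active_replicas: Dict[str, int], ca: float) -> Case:
    """Map job j to one of the four checkpointing-mode cases.

    active_replicas holds ARi for every job i; ca is the CPU availability.
    """
    if active_replicas.get(job, 0) > 0:
        return Case.HAS_ACTIVE_REPLICA
    if ca > 0:
        return Case.FREE_CPU
    # ARj = 0 here, so job j itself can never satisfy ARi > 1.
    if any(count > 1 for count in active_replicas.values()):
        return Case.OTHER_JOB_HAS_SPARE_REPLICA
    return Case.NOTHING_AVAILABLE
```

Checking the conditions in this order guarantees that exactly one case is returned for any combination of ARj, CA, and the other jobs' replica counts.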

Replication-based Heuristics (cont.)

– Replication mode
  • Entered when either the system load decreases or enough resources are restored from failure (CA ≥ CL)
  • All jobs with fewer than Repmax replicas are considered for submission to the available resources
  • Each job is assigned to the fastest resource connected to a grid site S with the maximum SpeedS and the smallest number of identical replicas
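The placement rule for replication mode can be sketched in Python. The slide names two criteria for the target site (maximum speed and the smallest number of identical replicas) without ranking them; the sketch assumes that the replica count is compared first and speed breaks ties. The `Site` structure and `place_replica` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Site:
    name: str
    speed: float          # speed of the fastest free resource at the site
    free_resources: int   # number of resources currently available
    replicas: Dict[str, int] = field(default_factory=dict)  # job -> copies running here


def place_replica(job: str, total_replicas: int, rep_max: int, sites: List[Site]) -> Optional[str]:
    """Return the name of the site that should run a new copy of job, or None."""
    if total_replicas >= rep_max:
        return None  # only jobs with fewer than Repmax replicas are considered
    free_sites = [s for s in sites if s.free_resources > 0]
    if not free_sites:
        return None
    # Assumed ordering: fewest identical replicas first, then highest speed.
    best = min(free_sites, key=lambda s: (s.replicas.get(job, 0), -s.speed))
    return best.name
```

Spreading copies of the same job across sites means that a single site failure is less likely to take out every replica at once.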

Replication-based Heuristics (cont.)

• Simulation Results
  – Approaches
    • Unconditional RL(1)
    • Unconditional RL(2)
    • Unconditional RL(3)
    • LoadDependentRL(1, 3, 40)
    • FailureDependentRL(1, 3, 40)
    • MeanFailureCP
    • CombinedFT

Conclusion and Future Work

• Fault tolerance forms an important problem
  – Job checkpointing
  – Replication
• Evaluated in the DSiDE grid simulator
• The runtime overhead characteristic of periodic checkpointing can be reduced

Conclusion and Future Work (cont.)

• Advantage
  – When the distributed system properties are not known in advance, both techniques can best be applied
• Future Work
  – Scheduling methods will be considered

Thank you for your attention