Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI
Aurélien Bouteiller [email protected], joint work with F. Cappello, G. Krawezik, P. Lemarinier
Cluster & Grid group, Grand Large Project, http://www.lri.fr/~gk/MPICH-V


Page 1: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Aurélien Bouteiller [email protected]
joint work with F. Cappello, G. Krawezik, P. Lemarinier

Cluster & Grid group, Grand Large Project
http://www.lri.fr/~gk/MPICH-V

Page 2: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

HPC trend: Clusters are getting larger

High performance computers have more and more nodes (more than 8000 for ASCI Q, more than 5000 for the BigMac cluster; one third of the Top500 installations have more than 500 processors). More components increase the fault probability: the ASCI-Q full-system MTBF is estimated analytically at a few hours (Petrini, LANL), and a 5-hour job on 4096 processors has less than a 50% chance of terminating.
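As a rough check of that last figure, under the usual assumption of independent, exponentially distributed failures, the probability that a T-hour run sees no fault is about exp(-T / MTBF_system); for a 5-hour job, exp(-5 / 7) ≈ 0.49, so a system MTBF of a few hours is indeed enough to push the completion chance below 50%.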

Many numerical applications use the MPI (Message Passing Interface) library => need for an automatic fault tolerant MPI.

Page 3: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Fault tolerant MPI: a classification of fault tolerant message passing environments considering (A) the level in the software stack where fault tolerance is managed and (B) the fault tolerance technique.

[Figure: classification chart. Axes: automatic vs. non automatic; technique (coordinated/checkpoint based vs. log based: optimistic, pessimistic, causal); level in the software stack (Framework, API, Communication Lib.). Entries include: Sender based Mess. Log. (1 fault, sender based), Pruitt 98 (2 faults, sender based), Manetho (n faults), Egida, Optimistic recovery in distributed systems (n faults with coherent checkpoint), Cocheck (independent of MPI), Starfish (enrichment of MPI), Clip (semi-transparent checkpoint), MPI/FT (redundancy of tasks), FT-MPI (modification of MPI routines, user fault treatment), MPI-FT (n faults, centralized server), MPICH-V2 (n faults, distributed logging), MPICH-CL (n faults, coordinated checkpoint).]

Several protocols provide automatic fault tolerance against n faults with automatic recovery in MPI applications: global (coordinated) checkpointing and pessimistic/causal message log.

=> Compare fault tolerant protocols within a single MPI implementation.

Page 4: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Outline

● Introduction
● Coordinated checkpoint vs. message log
● Comparison framework
● Performances
● Conclusions and future works

Page 5: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Fault Tolerant protocols: Problem of inconsistent states

Uncoordinated checkpoint: the problem of inconsistent states

The order of message receptions is a nondeterministic event.

A message recorded as received but not as sent is inconsistent. The domino effect can lead to rolling back to the beginning of the execution in case of a fault.

=> Possible loss of the whole execution and unpredictable fault cost.

[Figure: space-time diagram of processes P0, P1, P2 with checkpoints C1, C2, C3 and messages m1, m2, m3, illustrating an inconsistent recovery line.]

Page 6: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Fault Tolerant protocols: Global Checkpoint 1/2

Communication Induced Checkpointing
Does not require global synchronisation to provide a globally coherent snapshot.
Drawbacks studied in L. Alvisi, E. Elnozahy, S. Rao, S. A. Husain, and A. De Mel. An analysis of communication induced checkpointing. In 29th Symposium on Fault-Tolerant Computing (FTCS'99). IEEE Press, June 1999.

The number of forced checkpoints increases linearly with the number of nodes: it does not scale.

The unpredictable checkpoint frequency may lead to taking an overestimated number of checkpoints.

The detection of a possibly inconsistent state induces blocking checkpoints of some processes, and blocking checkpoints have a dramatic overhead on fault free execution.

=> These protocols may not be usable in practice.

Page 7: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Fault Tolerant protocols: Global Checkpoint 2/2

Coordinated checkpoint
All processes coordinate their checkpoints so that the global system state is coherent (Chandy & Lamport algorithm).

Negligible overhead on fault free execution.

Requires global synchronization (checkpointing may take a long time because of checkpoint server stress).

In the case of a single fault, all processes have to roll back to their checkpoints: high cost of fault recovery.

=> Efficient when fault frequency is low. A minimal sketch of the coordination step is given below.
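The sketch below is an illustration only, not the MPICH-CL code: it reduces the coordination to two barriers and assumes no message is in flight when the barrier completes, whereas the actual Chandy & Lamport algorithm uses marker messages so that channel contents are recorded rather than drained. write_local_image() is a hypothetical helper.

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical helper: serialize this rank's state (to disk or to a checkpoint server). */
    static void write_local_image(int rank, int epoch)
    {
        char path[64];
        snprintf(path, sizeof(path), "ckpt_rank%d_epoch%d.img", rank, epoch);
        /* ... dump the process image to `path` or stream it to the checkpoint server ... */
    }

    /* Blocking coordinated checkpoint: the two barriers are where the
     * "global synchronization" cost of the protocol comes from. */
    void coordinated_checkpoint(MPI_Comm comm, int epoch)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_Barrier(comm);               /* no application message may cross the checkpoint line */
        write_local_image(rank, epoch);  /* every process saves its state */
        MPI_Barrier(comm);               /* the new checkpoint wave is valid once all ranks are done */
    }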

Page 8: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Fault tolerant protocols: Message Log 1/2

Pessimistic log
All messages received by a process are logged on a reliable medium before the process can causally influence the rest of the system.

Non negligible overhead on network performance in fault free execution.

No need for global synchronization; does not stress the checkpoint servers.

No need to roll back non failed processes: the fault recovery overhead is limited.

=> Efficient when fault frequency is high. A minimal sketch of the logging rule is given below.
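The sketch below shows only the ordering constraint the pessimistic protocol enforces, not the MPICH-V2 daemon code; log_event_to_logger() and next_sequence_number() are hypothetical helpers standing in for the channel to the reliable event logger.

    #include <mpi.h>

    /* Hypothetical helpers for the event-logger channel. */
    extern void log_event_to_logger(int src, int tag, long seq); /* blocks until the logger acks */
    extern long next_sequence_number(void);

    /* Pessimistic logging: the reception event (its determinant) must be safely
     * logged before this process is allowed to causally influence anyone else,
     * i.e. before it may send any new message. */
    int logged_recv(void *buf, int count, MPI_Datatype type,
                    int src, int tag, MPI_Comm comm, MPI_Status *status)
    {
        int rc = MPI_Recv(buf, count, type, src, tag, comm, status);
        if (rc == MPI_SUCCESS)
            log_event_to_logger(status->MPI_SOURCE, status->MPI_TAG, next_sequence_number());
        return rc;
    }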

Page 9: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Fault tolerant protocols: Message Log 2/2

Causal log
Designed to improve the fault free performance of pessimistic log: messages are logged locally and causal dependencies are piggybacked on outgoing messages.

Non negligible overhead on fault free execution, slightly better than pessimistic log.

No global synchronisation; does not stress the checkpoint server.

Only failed processes are rolled back: a failed process retrieves the events it needs from the processes that causally depend on it (if no process depends on it, nothing has to be retrieved).

Fault recovery overhead is limited but greater than with pessimistic log. A sketch of the piggybacking idea is given below.
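The sketch below illustrates the causal-log idea only; the wire format is an assumption, not the MPICH-V2C one. Determinants of receptions not yet acknowledged by the event logger are piggybacked in front of the payload instead of blocking the sender.

    #include <mpi.h>
    #include <string.h>

    typedef struct { int src; int tag; long seq; } determinant_t;

    #define MAX_PENDING 64
    typedef struct {
        determinant_t pending[MAX_PENDING]; /* logged locally, logger ack not yet received */
        int npending;                       /* shrinks when the event logger acknowledges */
    } causal_state_t;

    /* Prepend the pending determinants to the payload so that the receiver holds
     * the causal history needed to replay this process if it crashes.
     * Assumes the framed message fits in the 4 KB scratch buffer. */
    int causal_send(causal_state_t *cs, const void *payload, int len,
                    int dest, int tag, MPI_Comm comm)
    {
        char msg[4096];
        int hdr = cs->npending * (int)sizeof(determinant_t);

        memcpy(msg, &cs->npending, sizeof(int));
        memcpy(msg + sizeof(int), cs->pending, (size_t)hdr);
        memcpy(msg + sizeof(int) + hdr, payload, (size_t)len);

        return MPI_Send(msg, (int)sizeof(int) + hdr + len, MPI_BYTE, dest, tag, comm);
    }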

Page 10: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Comparison: Related works

Egida compared log based techniques: Sriram Rao, Lorenzo Alvisi, Harrick M. Vin. The cost of recovery in message logging protocols. In 17th Symposium on Reliable Distributed Systems (SRDS), pages 10-18, IEEE Press, October 1998.

- Causal log is better for single node faults
- Pessimistic log is better for concurrent faults

There is no existing comparison of coordinated checkpoint and message log protocols, and no comparable implementations of them: coordinated checkpoint has a high fault recovery overhead, while message logging has a high overhead on fault free performance.

Suspected: the fault frequency implies a tradeoff.

=> Compare coordinated checkpoint and pessimistic logging.

Several protocols provide automatic fault tolerance in MPI applications (coordinated checkpoint, causal message log, pessimistic message log); all of them have been studied theoretically but not compared experimentally.

Page 11: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Outline

● Introduction
● Coordinated checkpoint vs. message log
● Related work
● Comparison framework
● Performances
● Conclusions and future works

Page 12: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Architectures

MPICH-CL: Chandy & Lamport algorithm, coordinated checkpoint

MPICH-V2: pessimistic sender based message log

We designed MPICH-CL and MPICH-V2 in a shared framework to perform a fair comparison of coordinated checkpoint and pessimistic message log

Page 13: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Communication daemon

[Figure: node architecture. The MPI process is bound to the CL/V2 communication daemon, which relays its sends and receives; the daemon forwards reception events to the Event Logger and keeps the message payload locally (V2 only), and, through CSAC, sends the checkpoint image to the Checkpoint Server under checkpoint control, with acknowledgments.]

CL and V2 share the same architecture; only the communication daemon includes the protocol specific actions.

Page 14: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Generic device: based on MPICH-1.2.5
– A new device: ‘ch_v2’ or ‘ch_cl’ device
– All ch_xx device functions are blocking communication functions built over the TCP layer

[Figure: software stack. MPI_Send goes through the ADI (MPID_SendControl, MPID_SendChannel), the Channel Interface and the Chameleon Interface down to the V2/CL device interface. The binding provides:
_bsend: blocking send
_brecv: blocking receive
_probe: check for any message available
_from: get the source of the last message
_Init: initialize the client
_Finalize: finalize the client]

A sketch of this binding as a C structure is given below.
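This sketch only pictures the binding as a table of blocking primitives; the function names come from the slide, but the exact signatures are assumptions.

    /* V2/CL device interface: every primitive is a blocking operation
     * implemented over TCP sockets by the communication daemon. */
    typedef struct v2cl_device {
        void (*init)(int argc, char **argv);               /* _Init: initialize the client */
        void (*finalize)(void);                            /* _Finalize: finalize the client */
        void (*bsend)(int dest, const void *buf, int len); /* _bsend: blocking send */
        void (*brecv)(void *buf, int len);                 /* _brecv: blocking receive */
        int  (*probe)(void);                               /* _probe: any message available? */
        int  (*from)(void);                                /* _from: source of the last message */
    } v2cl_device_t;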

Page 15: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Node Checkpointing

User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)

Clone checkpointing + non blocking checkpoint

On a checkpoint order, the libmpichv/CSAC layer:
(1) forks a clone of the process
(2) terminates the ongoing communications
(3) closes the sockets
(4) calls ckpt_and_exit()

The checkpoint image is sent to the reliable checkpoint server on the fly (local storage does not ensure fault tolerance). A restarted process resumes execution, using CSAC, just after (4): it reopens its sockets and returns. A minimal sketch of this sequence is given below.
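The sketch below assumes hypothetical helpers terminate_ongoing_comms(), close_sockets() and reopen_sockets(), and treats ckpt_and_exit() as the CSAC entry point; it is an illustration of the sequence above, not the libmpichv code.

    #include <unistd.h>

    /* Hypothetical helpers of the libmpichv/CSAC layer. */
    extern void terminate_ongoing_comms(void);
    extern void close_sockets(void);
    extern void reopen_sockets(void);
    extern void ckpt_and_exit(void);   /* CSAC: dump the image, then exit the clone */

    void checkpoint_on_order(void)
    {
        if (fork() == 0) {             /* (1) clone the address space (copy-on-write) */
            terminate_ongoing_comms(); /* (2) flush communications in the clone */
            close_sockets();           /* (3) sockets cannot be checkpointed */
            ckpt_and_exit();           /* (4) image streamed to the checkpoint server */
            /* A restarted image resumes here: */
            reopen_sockets();          /* reconnect, then return into the computation */
            return;
        }
        /* Parent: keeps computing while the clone performs the checkpoint I/O,
         * which makes the checkpoint non blocking for the application. */
    }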

Page 16: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Checkpoint scheduler policy

● In MPICH-V2, the checkpoint scheduler is not required by the pessimistic protocol; it is used to minimize the size of the checkpointed payload, using a best effort heuristic.
Policy: permanent individual checkpoint.

● In MPICH-CL, the checkpoint scheduler is a dedicated process used to initiate checkpoints.
Policy: checkpoint every n seconds, where n is a runtime parameter (see the sketch after the figure).

[Figure: two diagrams of the checkpoint scheduler, the nodes and the checkpoint server: in MPICH-V2 the scheduler triggers individual checkpoints and the message payload is sent to the checkpoint server; in MPICH-CL the scheduler initiates a synchronization of all nodes before they checkpoint to the server.]
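The MPICH-CL policy can be pictured as the loop below; send_ckpt_order_to_all_nodes() is a hypothetical helper standing in for whatever triggers a coordinated checkpoint wave.

    #include <unistd.h>

    extern void send_ckpt_order_to_all_nodes(void);  /* hypothetical: starts a checkpoint wave */

    /* Dedicated checkpoint scheduler process: checkpoint every n seconds,
     * where n is a runtime parameter. */
    void checkpoint_scheduler(unsigned n_seconds)
    {
        for (;;) {
            sleep(n_seconds);
            send_ckpt_order_to_all_nodes();
        }
    }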

Page 17: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Outline

● Introduction
● Coordinated checkpoint vs. message log
● Comparison framework
● Performances
● Conclusions and future works

Page 18: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Experimental conditions

Cluster: 32 Athlon 1800+ CPUs, 1 GB RAM, IDE disk
+ 16 dual Pentium III, 500 MHz, 512 MB, IDE disk
+ 48-port 100 Mb/s Ethernet switch

Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp)

[Figure: compute nodes connected through the network to a single reliable node hosting the Checkpoint Server + Event Logger (V2 only) + Checkpoint Scheduler + Dispatcher.]

Page 19: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Bandwidth and latency

Latency for a 0 byte MPI message : MPICH-P4 ( 77us), MPICH-CL (154us), MPICH-V2 (277us)

Latency is higher in MPICH-CL than in P4 due to additional memory copies. Latency is even higher in MPICH-V2 due to the event logging:

A receiving process can send a new message only when the reception event has been successfully logged (3 TCP messages for a communication)

Page 20: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Benchmark applications

Validating our implementations on the NAS BT benchmark (classes A and B) shows performance comparable to the P4 reference implementation. As expected, MPICH-CL reaches better fault free performance than MPICH-V2.

Page 21: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Checkpoint server Performance

Time to checkpoint all processes concurrently on a single checkpoint server: a second process does not increase the checkpoint time (it fills otherwise unused bandwidth); beyond that, the checkpoint time increases linearly with the number of processes.

Time to checkpoint a process according to its size: the checkpoint time increases linearly with the checkpoint size; a memory swap overhead appears at 512 MB (because of the fork).
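As a rough sanity check, assuming the checkpoint server's 100 Mb/s link is the bottleneck: 100 Mb/s ≈ 12.5 MB/s, so a 512 MB image needs at least about 40 s to reach the server, which is the right order of magnitude for the measured linear growth.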

[Figures: checkpoint time (seconds) vs. size of process (MB); time to checkpoint all processes (seconds) vs. number of processes checkpointing simultaneously.]

Page 22: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

BT Checkpoint and Restart Performance

● For the same dataset, the per-process image size decreases when the number of processes increases.

● As a consequence, the time to checkpoint remains constant as the number of processes grows.

● Performing a complete asynchronous checkpoint takes as much time as a coordinated checkpoint.

● The time to restart after a fault decreases with the number of nodes for V2 and does not change for CL.

Page 23: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Fault impact on performances

NAS benchmark BT, class B, 25 nodes (32 MB per-process image size)

Average time to perform a checkpoint: MPICH-CL: 68 s; MPICH-V2: 73.9 s

Average time to recover from a failure: MPICH-CL: 65.8 s; MPICH-V2: 5.3 s

If we consider a 1 GB memory occupation for every process, an extrapolation gives a 2000 s checkpoint time for 25 nodes in MPICH-CL; the minimum fault interval ensuring progress of the computation then becomes about one hour.
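This extrapolation is simple bandwidth arithmetic: 25 nodes × 1 GB = 25 GB of images, and a single checkpoint server behind a 100 Mb/s link absorbs at most about 12.5 MB/s, hence 25 GB / 12.5 MB/s ≈ 2000 s per checkpoint wave.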

MPICH-V2 can tolerate a high fault rate. MPICH-CL cannot ensure termination of the execution for a high fault rate.

Page 24: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Outline

● Introduction
● Coordinated checkpoint vs. message log
● Comparison framework
● Performances
● Conclusions and future works

Page 25: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Conclusion

MPICH-CL and MPICH-V2 are two comparable implementations of fault tolerant MPI derived from MPICH-1.2.5, one using coordinated checkpoint, the other pessimistic message log.

We have compared the overhead of these two techniques according to the fault frequency.

The recovery overhead is the main factor differentiating their performance.

We have found a crossover point beyond which message log becomes better than coordinated checkpoint. On our test application this point appears near one fault per 10 minutes. With a 1 GB per-process application, coordinated checkpoint does not ensure progress of the computation under one fault every hour.

Page 26: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

Perspectives

Larger scale experiments: use more nodes and applications with a realistic amount of memory.

High performance network experiments: Myrinet, Infiniband.

Comparison with causal log => MPICH-V2C vs. an augmented MPICH-CL:
- MPICH-V2C is a causal log implementation, removing the high latency impact induced by the pessimistic log.
- MPICH-CL is being modified to restart non failed nodes from their local checkpoint, removing the high restart overhead.

Page 27: Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI

MPICH-V2C

● MPICH-V2 suffers from high latency (pessimistic protocol).
● MPICH-V2C corrects this drawback at the expense of an increase in average message size (causal log protocol).

[Figure: timelines of a process P and the Event Logger (EL).
Pessimistic log: after receptions r1 and r2, P could have sent s1 immediately, but it must wait for the EL's acknowledgment of the log of r1 and r2, so the actual send of s1 is delayed.
Causal log: P sends s1 before the acknowledgment from the EL; in that case the causality information for r1 and r2 is piggybacked on s1. Once the ACK for r1 and r2 arrives, P can stop piggybacking (s2 carries nothing to piggyback).]