Coordinated Checkpoint Versus Message Log For Fault Tolerant MPI
Aurélien Bouteiller <bouteill@lri.fr>, joint work with F. Cappello, G. Krawezik, P. Lemarinier
Cluster & Grid group, Grand Large Project, http://www.lri.fr/~gk/MPICH-V
HPC trend: Clusters are getting larger
High performance computers have more and more nodes (more than 8000 for ASCI Q, more than 5000 for the BigMac cluster; one third of the Top500 installations have more than 500 processors). More components increase the fault probability: the ASCI Q full-system MTBF is estimated analytically at a few hours (Petrini, LANL), and a 5-hour job on 4096 processors has less than a 50% chance to terminate.
Many numerical applications use the MPI (Message Passing Interface) library, hence the need for an automatic fault tolerant MPI.
Fault tolerant MPI: a classification of fault tolerant message passing environments, considering (A) the level in the software stack where fault tolerance is managed and (B) the fault tolerance technique.
[Classification chart. Vertical axis: level in the software stack (framework, API, communication library). Horizontal axis: fault tolerance technique, from non-automatic to automatic, the automatic side split into checkpoint based (coordinated, communication induced) and log based (optimistic, causal, pessimistic). Entries: Cocheck (independent of MPI), Starfish (enrichment of MPI), Clip (semi-transparent checkpoint), FT-MPI (modification of MPI routines, user fault treatment), MPI/FT (redundancy of tasks), MPI-FT (N faults, centralized server), Optimistic recovery in distributed systems (n faults with coherent checkpoint), Manetho (n faults), Egida, Sender based message log (1 fault, sender based), Pruitt 98 (2 faults, sender based), MPICH-V2 (N faults, distributed logging), MPICH-CL (N faults, coordinated checkpoint).]
Several protocols perform fault tolerance in MPI applications with N faults and automatic recovery: global checkpointing, and pessimistic or causal message log.
Goal: compare fault tolerant protocols within a single MPI implementation.
Outline
Introduction
Coordinated checkpoint vs message log
Comparison framework
Performances
Conclusions and future works
Fault tolerant protocols: problem of inconsistent states
Uncoordinated checkpoint: the problem of inconsistent states.
The order of message receptions is a nondeterministic event.
Messages received but not sent are inconsistent.
The domino effect can lead to a rollback to the beginning of the execution in case of fault.
Possible loss of the whole execution and unpredictable fault cost.
[Space-time diagram: processes P0, P1, P2 with checkpoints C1, C2, C3 and messages m1, m2, m3, illustrating an inconsistent recovery line.]
Fault tolerant protocols: global checkpoint 1/2
Communication induced checkpointing
Does not require global synchronization to provide a globally coherent snapshot.
Drawbacks studied in: L. Alvisi, E. Elnozahy, S. Rao, S. A. Husain, and A. De Mel. An analysis of communication induced checkpointing. In 29th Symposium on Fault-Tolerant Computing (FTCS'99), IEEE Press, June 1999.
- The number of forced checkpoints increases linearly with the number of nodes: it does not scale.
- The unpredictable checkpoint frequency may lead to taking an overestimated number of checkpoints.
- Detecting a possibly inconsistent state induces a blocking checkpoint of some processes, and blocking checkpoints have a dramatic overhead on fault free execution.
These protocols may not be usable in practice.
Fault tolerant protocols: global checkpoint 2/2
Coordinated checkpoint
All processes coordinate their checkpoints so that the global system state is coherent (Chandy & Lamport algorithm; a sketch follows).
Negligible overhead on fault free execution.
Requires global synchronization (checkpointing may take a long time because of checkpoint server stress).
In the case of a single fault, all processes have to roll back to their checkpoints: high cost of fault recovery.
Efficient when fault frequency is low.
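To make the coordination concrete, here is a minimal sketch of the Chandy & Lamport marker algorithm in C. It is an illustration only, not MPICH-CL code: all helper functions (take_local_checkpoint, send_marker_on_all_channels, log_channel_message, deliver) are hypothetical.

/* Minimal sketch of Chandy & Lamport coordinated checkpointing.
 * Illustration only; the helpers below are assumed to exist. */
#include <stdbool.h>

#define MAX_CHANNELS 64

static bool checkpoint_taken;
static bool marker_seen[MAX_CHANNELS];     /* one flag per incoming channel */

void take_local_checkpoint(void);          /* assumed: saves process image  */
void send_marker_on_all_channels(void);    /* assumed: control message      */
void log_channel_message(int ch, const void *msg, int len); /* assumed     */
void deliver(const void *msg, int len);    /* assumed: hand to application  */

/* Called when a marker arrives on incoming channel `ch`. */
void on_marker(int ch)
{
    if (!checkpoint_taken) {
        checkpoint_taken = true;
        take_local_checkpoint();           /* record own state first        */
        send_marker_on_all_channels();     /* then propagate the marker     */
    }
    marker_seen[ch] = true;                /* channel `ch` is now recorded  */
}

/* Called for every application message received on channel `ch`. */
void on_message(int ch, const void *msg, int len)
{
    /* Messages in transit between our checkpoint and the marker on `ch`
     * belong to the channel state and must be recorded. */
    if (checkpoint_taken && !marker_seen[ch])
        log_channel_message(ch, msg, len);
    deliver(msg, len);
}

The key invariant: a process checkpoints before forwarding the marker, and records every message arriving on a channel after its own checkpoint but before that channel's marker, so the saved channel states complete the global snapshot.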
Fault tolerant protocols: message log 1/2
Pessimistic log
All messages received by a process are logged on a reliable medium before the process can causally influence the rest of the system (a sketch follows).
Non negligible overhead on network performance during fault free execution.
No need to perform global synchronization; does not stress the checkpoint servers.
No need to roll back non failed processes: the fault recovery overhead is limited.
Efficient when fault frequency is high.
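A minimal sketch of the pessimistic receive path, assuming a remote Event Logger reachable over TCP. The names (recv_event_t, send_to_event_logger, wait_logger_ack, deliver_payload) are hypothetical, not the MPICH-V2 API.

#include <stdint.h>

typedef struct {
    uint32_t src;    /* sender rank                                   */
    uint32_t tag;    /* message tag                                   */
    uint64_t seq;    /* reception order: the nondeterministic outcome */
} recv_event_t;

void send_to_event_logger(const recv_event_t *ev);  /* assumed: TCP send */
void wait_logger_ack(void);                         /* assumed: blocking */
void deliver_payload(const void *payload, int len); /* assumed           */

void on_receive(uint32_t src, uint32_t tag, uint64_t seq,
                const void *payload, int len)
{
    recv_event_t ev = { src, tag, seq };
    send_to_event_logger(&ev);   /* extra TCP message #1                 */
    wait_logger_ack();           /* extra TCP message #2: blocks until
                                  * the event is safely logged           */
    /* Only now may the process causally influence others. */
    deliver_payload(payload, len);
}

The blocking acknowledgement is exactly what shows up later in the latency measurements: three TCP messages per communication instead of one.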
Fault tolerant protocols: message log 2/2
Causal log
Designed to improve the fault free performance of pessimistic log: messages are logged locally and causal dependencies are piggybacked on outgoing messages (a sketch follows).
Non negligible overhead on fault free execution, though slightly better than pessimistic log.
No global synchronization; does not stress the checkpoint server.
Only failed processes are rolled back: a failed process retrieves its causality information from the processes that depend on it (events on which no process depends need not be recovered).
The fault recovery overhead is limited, but greater than with pessimistic log.
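For contrast, a sketch of the causal variant under the same assumptions (hypothetical names, not the MPICH-V2C code): events are acknowledged asynchronously, and any still-unacknowledged events travel piggybacked on outgoing messages.

#include <stdint.h>
#include <string.h>

typedef struct { uint32_t src, tag; uint64_t seq; } recv_event_t;

#define MAX_UNACKED 1024
static recv_event_t unacked[MAX_UNACKED];  /* events the EL has not acked */
static int n_unacked;

void async_send_to_event_logger(const recv_event_t *ev);  /* assumed */

void on_receive_event(recv_event_t ev)
{
    unacked[n_unacked++] = ev;         /* keep until the EL acknowledges */
    async_send_to_event_logger(&ev);   /* no wait: delivery is immediate */
}

/* Every outgoing message carries the not-yet-acknowledged events. */
int piggyback_causality(recv_event_t *out, int max)
{
    int n = n_unacked < max ? n_unacked : max;
    memcpy(out, unacked, n * sizeof *out);
    return n;                     /* 0 once the EL has acked everything */
}

void on_logger_ack(int n_acked)   /* EL acknowledged the oldest events  */
{
    n_unacked -= n_acked;
    memmove(unacked, unacked + n_acked, n_unacked * sizeof *unacked);
}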
Comparison: related works
Egida compared log based techniques: Sriram Rao, Lorenzo Alvisi, Harrick M. Vin. The cost of recovery in message logging protocols. In 17th Symposium on Reliable Distributed Systems (SRDS), pages 10-18, IEEE Press, October 1998.
- Causal log is better for single node faults.
- Pessimistic log is better for concurrent faults.
Several protocols provide automatic fault tolerance in MPI applications (coordinated checkpoint, causal message log, pessimistic message log); all of them have been studied theoretically but never compared. There is no existing comparison of coordinated and message log protocols, and no comparable implementations of the two: coordinated checkpoint suffers a high fault recovery overhead, message logging a high overhead on fault free performance.
Suspected: fault frequency implies a tradeoff.
=> Compare coordinated checkpoint and pessimistic logging.
Outline
Introduction
Coordinated checkpoint vs message log
Related work
Comparison framework
Performances
Conclusions and future works
Architectures
MPICH-CL: Chandy & Lamport algorithm, coordinated checkpoint.
MPICH-V2: pessimistic sender based message log.
We designed MPICH-CL and MPICH-V2 in a shared framework to perform a fair comparison of coordinated checkpoint and pessimistic message log.
Communication daemon
[Diagram: on each node, the MPI process is connected to the CL/V2 daemon, which relays Send and Receive operations; the daemon forwards the reception event to the Event Logger (V2 only, with ack), keeps the payload (V2 only), and exchanges the checkpoint image and checkpoint control with the Checkpoint Server through CSAC.]
CL and V2 share the same architecture
The communication daemon includes the protocol specific actions:
– a new device: the 'ch_v2' or 'ch_cl' device
– all ch_xx device functions are blocking communication functions built over the TCP layer
– generic device: based on MPICH-1.2.5
[Diagram: MPICH software stack. MPI_Send enters the ADI (MPID_SendControl / MPID_SendChannel), layered over the Channel Interface and the Chameleon Interface; the V2/CL device interface provides the binding, listed below (a sketch of the two blocking calls follows the list):]
– _bsend: blocking send
– _brecv: blocking receive
– _probe: check for any message available
– _from: get the source of the last message
– _Init: initialize the client
– _Finalize: finalize the client
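As a rough illustration of what "blocking communication functions built over the TCP layer" means, here is a hypothetical sketch of the two data-movement bindings; the v2cl_* names are invented for the example, and the real MPICH-1.2.5 device differs in detail.

#include <sys/types.h>
#include <sys/socket.h>

/* _bsend: blocking send, looping until the whole buffer is on the wire. */
int v2cl_bsend(int sock, const char *buf, int len)
{
    int sent = 0;
    while (sent < len) {
        ssize_t n = send(sock, buf + sent, len - sent, 0);
        if (n <= 0)
            return -1;            /* connection failure: report upward */
        sent += (int)n;
    }
    return 0;
}

/* _brecv: blocking receive, looping until `len` bytes have arrived. */
int v2cl_brecv(int sock, char *buf, int len)
{
    int got = 0;
    while (got < len) {
        ssize_t n = recv(sock, buf + got, len - got, 0);
        if (n <= 0)
            return -1;
        got += (int)n;
    }
    return 0;
}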
Node Checkpointing
User-level checkpoint: Condor Stand Alone Checkpointing (CSAC), linked with libmpichv.
Clone checkpointing + non blocking checkpoint. On a checkpoint order:
(1) fork
(2) terminate ongoing communications
(3) close sockets
(4) call ckpt_and_exit()
The checkpoint image is sent on the fly to the reliable checkpoint server (local storage does not ensure fault tolerance). Execution resumes using CSAC just after (4): reopen sockets and return. A sketch follows.
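A minimal sketch of the clone-checkpoint sequence in C, assuming ckpt_and_exit() is a CSAC-style call that captures the calling process image; the other helpers are hypothetical.

#include <sys/types.h>
#include <unistd.h>

void terminate_ongoing_comms(void);      /* assumed: drain in-flight msgs */
void close_sockets(void);                /* assumed                       */
void reopen_sockets(void);               /* assumed                       */
void ckpt_and_exit(const char *server);  /* assumed CSAC-style call:
                                          * captures the image and exits  */

void on_checkpoint_order(const char *ckpt_server)
{
    pid_t pid = fork();                  /* (1) clone the process image   */
    if (pid == 0) {
        /* Child: its memory is a frozen copy of the parent's state. */
        terminate_ongoing_comms();       /* (2)                           */
        close_sockets();                 /* (3)                           */
        ckpt_and_exit(ckpt_server);      /* (4) image streamed to the CS  */
        /* On restart from the image, execution resumes here. */
        reopen_sockets();
        return;
    }
    /* Parent: resumes computation immediately (non blocking checkpoint). */
}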
Checkpoint scheduler policy
● In MPICH-V2, the checkpoint scheduler is not required by the pessimistic protocol; it is used to minimize the size of the checkpointed payload, using a best effort heuristic. Policy: permanent individual checkpoint.
● In MPICH-CL, the checkpoint scheduler is a dedicated process used to initiate checkpoints. Policy: checkpoint every n seconds, where n is a runtime parameter (a sketch follows).
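The MPICH-CL policy amounts to a trivial loop in the dedicated scheduler process; a sketch, with initiate_checkpoint_wave() standing in for whatever message actually starts the Chandy & Lamport wave.

#include <unistd.h>

void initiate_checkpoint_wave(void);  /* assumed: starts the C&L wave */

void checkpoint_scheduler(unsigned n_seconds)  /* n: runtime parameter */
{
    for (;;) {
        sleep(n_seconds);
        initiate_checkpoint_wave();   /* coordinated wave every n seconds */
    }
}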
[Diagram: checkpoint scheduling. MPICH-V2: the checkpoint scheduler orders nodes to checkpoint individually; checkpoint images and logged message payload flow to the checkpoint server (steps 1-3). MPICH-CL: the checkpoint scheduler first triggers a synchronization of all nodes, which then checkpoint to the checkpoint server (steps 1-3).]
Outline
Introduction
Coordinated checkpoint vs message log
Comparison framework
Performances
Conclusions and future works
Experimental conditions
Cluster: 32 Athlon 1800+ CPUs (1 GB RAM, IDE disk), plus 16 dual Pentium III (500 MHz, 512 MB, IDE disk), connected by a 48-port 100 Mb/s Ethernet switch.
Linux 2.4.18, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp).
A single reliable node hosts the Checkpoint Server, the Event Logger (V2 only), the Checkpoint Scheduler, and the Dispatcher.
Bandwidth and latency
Latency for a 0-byte MPI message: MPICH-P4 (77 µs), MPICH-CL (154 µs), MPICH-V2 (277 µs).
Latency is high in MPICH-CL because of additional memory copies compared to P4. It is even higher in MPICH-V2 because of event logging: a receiving process can send a new message only once the reception event has been successfully logged (3 TCP messages per communication).
Benchmark applications
Validating our implementations on the NAS BT benchmark, classes A and B, shows performance comparable to the P4 reference implementation. As expected, MPICH-CL reaches better fault free performance than MPICH-V2.
Checkpoint server performance
Time to checkpoint all processes concurrently on a single checkpoint server: a second process does not increase the checkpoint time, as it fills otherwise unused bandwidth; additional processes increase the checkpoint time linearly.
Time to checkpoint a process according to its size: the checkpoint time increases linearly with the checkpoint size; a memory swap overhead appears at 512 MB (due to the fork).
[Plots: time to checkpoint all processes (seconds) vs number of processes checkpointing simultaneously; time to checkpoint a process (seconds) vs size of process (MB).]
BT Checkpoint and Restart Performance
● Considering the same dataset, the per-process image size decreases when the number of processes increases.
● As a consequence, the time to checkpoint remains constant as the number of processes increases.
● Performing a complete asynchronous checkpoint takes as much time as a coordinated checkpoint.
● The time to restart after a fault decreases with the number of nodes for V2 and does not change for CL.
Fault impact on performance
NAS benchmark BT class B on 25 nodes (32 MB per-process image size).
Average time to perform a checkpoint: MPICH-CL: 68 s; MPICH-V2: 73.9 s.
Average time to recover from a failure: MPICH-CL: 65.8 s; MPICH-V2: 5.3 s.
If we consider a 1 GB memory occupation for every process, an extrapolation gives a 2000 s checkpoint time for 25 nodes in MPICH-CL, and the minimum fault interval ensuring progression of the computation becomes about 1 hour.
MPICH-V2 can tolerate a high fault rate; MPICH-CL cannot ensure termination of the execution under a high fault rate.
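A back-of-the-envelope check of that extrapolation, assuming checkpoint and recovery times scale linearly with image size (1 GB is 32 times the measured 32 MB images):

$t_{\mathrm{ckpt}} \approx 68\,\mathrm{s} \times 32 \approx 2176\,\mathrm{s}, \qquad t_{\mathrm{recover}} \approx 65.8\,\mathrm{s} \times 32 \approx 2106\,\mathrm{s}$

Between two faults, MPICH-CL must complete at least one recovery and one new checkpoint, roughly $4300\,\mathrm{s}$ in total, which is indeed on the order of one hour.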
Outline
Introduction
Coordinated checkpoint vs message log
Comparison framework
Performances
Conclusions and future works
Conclusion
MPICH-CL and MPICH-V2 are two comparable implementations of fault tolerant MPI derived from MPICH-1.2.5, one using coordinated checkpoint, the other pessimistic message log.
We have compared the overhead of the two techniques according to the fault frequency; the recovery overhead is the main factor differentiating their performance.
We have found a crossover point beyond which message log becomes better than coordinated checkpoint. On our test application this point appears near one fault per 10 minutes. With a 1 GB application, coordinated checkpoint does not ensure progress of the computation under one fault per hour.
Perspectives
Larger scale experiments: use more nodes and applications with a realistic amount of memory.
High performance network experiments: Myrinet, Infiniband.
Comparison with causal log: MPICH-V2C vs augmented MPICH-CL.
– MPICH-V2C is a causal log implementation, removing the high latency impact induced by the pessimistic log.
– MPICH-CL is being modified to restart non failed nodes from local checkpoints, removing the high restart overhead.
MPICH-V2C
● MPICH-V2 suffers from high latency (pessimistic protocol).
● MPICH-V2C corrects this drawback at the expense of an increase in the average message size (causal log protocol).
[Timing diagrams: a process P and the Event Logger EL, after receptions r1 and r2. Pessimistic log: P could have sent s1 immediately but has to wait for the EL's acknowledgement of the log of the preceding receptions, so s1 is delayed (s1 delayed vs s1 actual). Causal log: P sends s1 before the acknowledgement from the EL, piggybacking the causality information on the message; once the ACK for r1 and r2 arrives, P can stop piggybacking (s2 carries nothing).]