SC 2002, November 20, 2002
MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes
G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, Th. Herault, P. Lemarinier,
O. Lodygensky, F. Magniette, V. Neri, A. Selikhov
Cluster & GRID group, LRI, University of Paris, [email protected], www.lri.fr/~fci
Outline
• Introduction
• Motivations & Objectives
• Architecture
• Performance
• Future work
• Concluding remarks
Large Scale Parallel and Distributed Systems and Node Volatility
• Industry and academia are building larger and larger computing facilities for technical computing (research and production). Platforms with thousands of nodes are becoming common: tera-scale machines (US ASCI, French Tera), large-scale clusters (Score III, etc.), Grids and PC Grids (SETI@home, XtremWeb, Entropia, UD, BOINC).
• These large scale systems have frequent failures/disconnections: the ASCI-Q full-system MTBF is estimated (analytically) at a few hours (Petrini, LANL: "Scaling to Thousands of Processors with Buffered Coscheduling", Workshop on Scaling to New Heights, Pittsburgh, May 2002); a 5-hour job with 4096 processors has less than a 50% chance of terminating.
• PC Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour).
• When failures/disconnections cannot be avoided, they become a characteristic of the system, called Volatility.
• Many HPC applications use the message passing paradigm → we need a volatility-tolerant message passing environment.
Fault tolerant message passing: a long history of research!
Three main parameters distinguish the proposed fault tolerance techniques:
Transparency: application checkpointing, MP API + fault management, or automatic.
– application ckpt: the application stores intermediate results and restarts from them
– MP API + FM: the message passing API returns errors, to be handled by the programmer (see the sketch after this list)
– automatic: the runtime detects faults and handles recovery
Checkpoint coordination: none, coordinated, or uncoordinated.
– coordinated: all processes are synchronized and the network is flushed before the checkpoint; all processes roll back from the same snapshot
– uncoordinated: each process checkpoints independently of the others; each process is restarted independently of the others
Message logging: none, pessimistic, optimistic, or causal.
– pessimistic: all messages are logged on reliable media and used for replay
– optimistic: all messages are logged on non-reliable media; if 1 node fails, replay is done according to the other nodes' logs; if more than 1 node fails, roll back to the last coherent checkpoint
– causal: optimistic + antecedence graph; reduces the recovery time
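For illustration, here is a minimal sketch (not from the talk) of the "MP API + FM" style: the MPI library is asked to return error codes instead of aborting, and the programmer handles recovery. The calls are standard MPI-1; recover_from_peer_failure() is a hypothetical application routine.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical application-level recovery routine (assumption). */
static void recover_from_peer_failure(int peer)
{
    fprintf(stderr, "peer %d failed: application-level recovery\n", peer);
    /* e.g., restart the computation from application-stored results */
}

int main(int argc, char **argv)
{
    int rc, rank, buf = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Ask MPI to return error codes rather than abort the whole job
       (MPI-1 call, matching the MPICH generation discussed here). */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {
        rc = MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS)
            recover_from_peer_failure(1);
    } else if (rank == 1) {
        rc = MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        if (rc != MPI_SUCCESS)
            recover_from_peer_failure(0);
    }
    MPI_Finalize();
    return 0;
}
```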
Related work
[Figure: a classification of fault tolerant message passing environments, considering A) the level in the software stack where fault tolerance is managed (Framework, API, Communication Lib.) and B) the fault tolerance technique (checkpoint based vs. log based: optimistic, causal or pessimistic logging, including causal logging + coordinated checkpoint; automatic vs. non automatic).]
• Cocheck: independent of MPI [Ste96]
• Starfish: enrichment of MPI [AF99]
• Clip: semi-transparent checkpoint [CLP97]
• Optimistic recovery in distributed systems: n faults with coherent checkpoint [SY85]
• Sender based Message Logging: 1 fault, sender based [JZ87]
• Pruitt 98: 2 faults, sender based [PRU98]
• Manethon: faults [EZ92]
• Egida [RAV99]
• MPI/FT: redundancy of tasks [BNC01]
• FT-MPI: modification of MPI routines, user fault treatment [FD00]
• MPI-FT: n faults, centralized server [LNLE00]
• MPICH-V: n faults, distributed logging
The gap: no automatic/transparent, n-fault tolerant, scalable message passing environment.
Outline
• Introduction
• Motivations & Objectives
• Architecture
• Performance
• Future work
• Concluding remarks
Objectives and constraints
Goal: execute existing or new MPI applications. The programmer's view is unchanged: [figure: one PC client calls MPI_Send(), another PC client calls MPI_Recv()]; an example program follows below.
Objective summary:
1) Automatic fault tolerance
2) Transparency for the programmer & user
3) Tolerate n faults (n being the number of MPI processes)
4) Firewall bypass (tunnel) for cross-domain execution
5) Scalable infrastructure/protocols
6) Avoid global synchronizations (ckpt/restart)
7) Theoretical verification of the protocols
Problems:
1) volatile nodes (any number, at any time)
2) firewalls (PC Grids)
3) non-named receptions (they must be replayed in the same order as in the previous, failed execution)
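A plain, unmodified MPI program is all the programmer writes; under MPICH-V, fault tolerance is automatic. This sketch assumes only a 2-process run; note that, unlike the "MP API + FM" example earlier, no fault-handling code appears in the source.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 1234;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("received %d\n", value); /* replayed identically after a restart */
    }
    MPI_Finalize();
    return 0;
}
```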
Uncoordinated checkpoint
Coordinated checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes → global synchronization and network flush → not scalable. [Figure: nodes sync, checkpoint, and after failure detection all nodes are globally stopped and restarted from the same snapshot.]
Uncoordinated checkpoint: no global synchronization (scalable); nodes may checkpoint at any time, independently of the others; but non-deterministic events, in particular in-transit messages, must be logged. [Figure: nodes checkpoint independently; after failure detection, only the failed node restarts.]
Pessimistic message logging on Channel Memories
• A set of reliable nodes called "Channel Memories" (CMs) logs every message.
• All communications are implemented by one PUT and one GET operation to the CM.
• PUT and GET operations are transactions (a sketch of a PUT follows below).
• When a process restarts, it replays all its communications using the Channel Memory.
• The CM stores and delivers messages in FIFO order, ensuring a consistent state for each receiver.
• The CM also works as a tunnel for firewall-protected nodes (PC Grids): distributed pessimistic remote logging.
[Figure: a node PUTs each message to a Channel Memory over the network and the destination node GETs it from the CM; the CM (stable) also acts as a tunnel across firewalls.]
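The following is an illustrative sketch, an assumption rather than the MPICH-V wire protocol, of a node-side PUT transaction: every send becomes one PUT to the destination's Channel Memory, and waiting for the CM's ack is what makes the logging pessimistic (the message is known to be safely logged before the send completes).

```c
#include <stdint.h>
#include <unistd.h>

enum { CM_PUT = 1, CM_GET = 2 };

typedef struct {              /* hypothetical framing header */
    uint32_t op;              /* CM_PUT or CM_GET             */
    uint32_t src, dst, tag;   /* MPI envelope                 */
    uint32_t len;             /* payload length in bytes      */
} cm_hdr_t;

static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0) return -1;        /* connection to the CM lost */
        p += n; len -= (size_t)n;
    }
    return 0;
}

/* Blocking PUT transaction: header + payload, then wait for the ack. */
int put_to_cm(int cm_fd, uint32_t src, uint32_t dst, uint32_t tag,
              const void *data, uint32_t len)
{
    cm_hdr_t h = { CM_PUT, src, dst, tag, len };
    uint32_t ack;
    if (write_all(cm_fd, &h, sizeof h) < 0) return -1;
    if (write_all(cm_fd, data, len) < 0)    return -1;
    return read(cm_fd, &ack, sizeof ack) == (ssize_t)sizeof ack ? 0 : -1;
}
```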
Putting all together: Sketch of execution with a crash
[Figure: pseudo time scale of an execution with MPI processes 0, 1 and 2, two Channel Memories (CM) and a Checkpoint Server (CS). Each process sends checkpoint images to the CS; when a process crashes, it rolls back to its latest process checkpoint and replays its messages from the CM. Worst condition: in-transit message + checkpoint.]
Outline
• Introduction
• Motivations & Objectives
• Architecture
• Performance
• Future work
• Concluding remarks
Global architecture
MPICH-V:
– Communications library: an MPICH device with Channel Memories.
– Run-time: executes/manages the instances of MPI processes on the nodes.
Using MPICH-V only requires re-linking the application with libmpichv instead of libmpich.
[Figure: global architecture — the Dispatcher, Channel Memories, Checkpoint Servers and compute nodes connected through the network, with firewalls between domains; numbered arrows (1 to 5) show the execution steps.]
Dispatcher (stable)
– Initializes the execution: distributes the roles (CM, CS and Node) to the participating nodes (launching the appropriate job) and checks readiness.
– Launches the instances of MPI processes on the Nodes.
– Monitors the Node states (alive signal, or time-out), as sketched below.
– Reschedules the MPI process instances of dead nodes on available nodes.
[Figure: the Dispatcher distributes roles to Channel Memories, Checkpoint Servers and Nodes; each MPI process instance sends alive signals; on a failure, a new MPI process instance is launched on an available node.]
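Here is an illustrative sketch, an assumption rather than the MPICH-V source, of the dispatcher's failure detector: a node is declared dead when no alive signal has arrived within a time-out, and its MPI process instance is rescheduled.

```c
#include <time.h>

#define TIMEOUT 30              /* hypothetical time-out, in seconds */

extern void reschedule(int rank);   /* hypothetical: relaunch elsewhere */

typedef struct {
    int    rank;                /* MPI process instance on this node */
    time_t last_alive;          /* time of the last alive signal     */
    int    dead;
} node_state_t;

/* Called whenever an alive signal arrives from a node. */
void on_alive_signal(node_state_t *n)
{
    n->last_alive = time(NULL);
}

/* Periodic scan: detect time-outs and reschedule the MPI process. */
void monitor(node_state_t *nodes, int count)
{
    time_t now = time(NULL);
    for (int i = 0; i < count; i++) {
        if (!nodes[i].dead && now - nodes[i].last_alive > TIMEOUT) {
            nodes[i].dead = 1;
            /* the new instance restarts from its last checkpoint and
               replays its messages from its home Channel Memory */
            reschedule(nodes[i].rank);
        }
    }
}
```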
Channel Memory (stable)
• Multithreaded server: one thread polls, treats events and releases the other threads.
• Open sockets: one per attached Node, one per home checkpoint server of the attached Nodes, one for the Dispatcher.
• Handles incoming messages (PUT transaction + control) and outgoing messages (GET transaction + control).
• Message queues are FIFO, ensuring a total order on each receiver's messages (see the sketch below).
• Out-of-core message storage (memory + disc) with garbage collection: messages older than the current checkpoint image of each node are removed.
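The following is an illustrative sketch, not the actual MPICH-V source, of the per-receiver FIFO queue: PUT logs a message, GET delivers in FIFO order (the same order is replayed after a restart), and garbage collection drops messages older than the receiver's current checkpoint.

```c
#include <stdlib.h>
#include <string.h>

typedef struct msg {
    struct msg *next;
    size_t      len;
    char        data[1];          /* payload, copied at PUT time */
} msg_t;

typedef struct {                  /* one queue per attached receiver */
    msg_t *head, *tail;           /* logged messages, oldest first   */
    msg_t *cursor;                /* next message to deliver         */
} fifo_t;

/* PUT transaction: log the message in arrival order. */
void cm_log_put(fifo_t *q, const char *data, size_t len)
{
    msg_t *m = malloc(sizeof(msg_t) + len);
    m->next = NULL;
    m->len  = len;
    memcpy(m->data, data, len);
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
    if (!q->cursor) q->cursor = m;  /* nothing was pending */
}

/* GET transaction: deliver the next message in FIFO order; a crashed
   receiver sees the same total order again during replay. */
msg_t *cm_log_get(fifo_t *q)
{
    msg_t *m = q->cursor;
    if (m) q->cursor = m->next;
    return m;                     /* stays logged until collected */
}

/* On restart, replay from the oldest logged message (garbage
   collection guarantees it is not older than the last checkpoint). */
void cm_replay(fifo_t *q) { q->cursor = q->head; }

/* Garbage collection, run when the receiver completes a checkpoint:
   messages delivered before that checkpoint will never be replayed. */
void cm_gc(fifo_t *q)
{
    while (q->head && q->head != q->cursor) {
        msg_t *m = q->head;
        q->head = m->next;
        free(m);
    }
    if (!q->head) q->tail = NULL;
}
```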
Mapping Channel Memories with nodes
Several CM coordination constraints:
1) Force a total order on the messages for each receiver.
2) Avoid coordination messages among CMs.
Our solution: each Node is "attached" to only one "home" CM:
• a node receives messages from its home CM;
• a node sends messages to the home CM of the destination node.
[Figure: nodes 0, 1, 2, ..., N mapped to Channel Memories; the home CM of node 2 receives all messages destined to node 2.]
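A minimal sketch of the home-CM routing rule follows; the modulo mapping is an assumption for illustration, since any fixed node-to-CM assignment satisfies the two constraints above.

```c
#define MAX_CMS 64                /* hypothetical bound */

int num_cms;                      /* number of Channel Memories */
int cm_socket[MAX_CMS];           /* one open socket per CM     */

int home_cm(int rank) { return rank % num_cms; }

/* Every send goes to the *destination's* home CM ... */
int fd_for_send(int dst) { return cm_socket[home_cm(dst)]; }

/* ... and every receive comes from the receiver's *own* home CM,
   which can therefore impose a total order on that receiver's
   messages without any CM-to-CM coordination. */
int fd_for_recv(int me)  { return cm_socket[home_cm(me)]; }
```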
Checkpoint Server (stable)
• Multiprocess server: one process polls, treats events and dispatches the job to the other processes.
• Open sockets: one per attached Node, one per home CM of the attached Nodes.
• Handles incoming messages (PUT ckpt transaction) and outgoing messages (GET ckpt transaction + control).
• Checkpoint images are stored on a reliable medium (disc): one file per Node, whose name is given by the Node (see the sketch below).
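As a sketch of the one-file-per-node storage rule, under the assumption of a hypothetical directory layout (the real CS file naming is not shown in the talk):

```c
#include <stdio.h>

/* Store one checkpoint image per node, under the name given by the
   node. A real server would write to a temporary file and rename()
   it, so the previous image survives a crash mid-transfer. */
int store_ckpt(const char *node_name, const void *image, size_t len)
{
    char path[256];
    FILE *f;
    snprintf(path, sizeof path, "ckpt/%s.img", node_name); /* hypothetical */
    f = fopen(path, "wb");
    if (!f) return -1;
    if (fwrite(image, 1, len, f) != len) { fclose(f); return -1; }
    return fclose(f) == 0 ? 0 : -1;
}
```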
Node (volatile): checkpointing
• User-level checkpoint: Condor Stand Alone Checkpointing (CSAC).
• Clone checkpointing + non-blocking checkpoint: on a checkpoint order, the process (1) forks; the clone (2) terminates the ongoing communications, (3) closes its sockets and (4) calls ckpt_and_exit().
• The checkpoint image is sent to the CS on the fly (not stored locally).
• The checkpoint order is triggered locally (not by a dispatcher signal).
• On restart, CSAC resumes execution just after (4), reopens the sockets and returns.
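Here is a sketch of the clone (fork-based) non-blocking checkpoint, assuming a CSAC-style ckpt_and_exit() entry point; the two helper functions are hypothetical names for steps (2) and (3) above.

```c
#include <sys/types.h>
#include <unistd.h>

extern void terminate_ongoing_coms(void); /* (2), hypothetical helper */
extern void close_cm_sockets(void);       /* (3), hypothetical helper */
extern void ckpt_and_exit(void);          /* (4), CSAC-style entry point */

void checkpoint_now(void)                 /* triggered locally, see above */
{
    pid_t pid = fork();                   /* (1) clone the process image */
    if (pid == 0) {
        /* child: a frozen copy of the parent's state */
        terminate_ongoing_coms();         /* (2) drain in-flight traffic */
        close_cm_sockets();               /* (3) sockets are not in the image */
        ckpt_and_exit();                  /* (4) image streamed to the CS */
    }
    /* parent: resumes computation immediately, so checkpointing
       does not block the application */
}
```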
Library: based on MPICH
– A new device: the 'ch_cm' device.
– All ch_cm device functions are blocking communication functions built over the TCP layer.
[Figure: binding of the ch_cm device in the MPICH stack — MPI_Send → MPID_SendControl / MPID_SendChannel → ADI → Channel Interface → Chameleon Interface → ch_cm device interface.]
The device interface:
• _cmInit: initialize the client
• _cmFinalize: finalize the client
• _cmbsend: blocking send
• _cmbrecv: blocking receive
• _cmprobe: check for any message available
• _cmfrom: get the source of the last message
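For concreteness, plausible C prototypes for these entry points, inferred from the descriptions above; the exact MPICH-V prototypes may differ.

```c
/* ch_cm device interface (inferred signatures, not the real header) */
void _cmInit(int rank, const char *cm_host, int cm_port); /* initialize the client */
void _cmFinalize(void);                                   /* finalize the client   */
int  _cmbsend(int dst, int tag, const void *buf, int len); /* blocking send        */
int  _cmbrecv(int src, int tag, void *buf, int maxlen);    /* blocking receive     */
int  _cmprobe(void);   /* check whether any message is available       */
int  _cmfrom(void);    /* source rank of the last received message     */
```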
Outline
• Introduction
• Motivations & Objectives
• Architecture
• Performance
• Future work
• Concluding remarks
Experimental platform
• Icluster-Imag: 216 PIII 733 MHz nodes, 256 MB per node; 5 subsystems of 32 to 48 nodes, each on a 100BaseT switch (~4.8 Gb/s aggregate), connected by a 1 Gb/s switch mesh; Linux, PGI Fortran or GCC compiler.
• Very close to a typical building LAN; node volatility is simulated.
• XtremWeb as the software environment (launching MPICH-V).
• NAS BT benchmark as a complex application (high comm/comp ratio).
Basic performance
RTT ping-pong: 2 nodes, 2 Channel Memories, blocking communications; mean over 100 measurements.
• Bandwidth reaches 5.6 MB/s with MPICH-V vs. 10.5 MB/s with P4: a performance degradation of a factor ~2 compared to P4, but MPICH-V tolerates an arbitrary number of faults.
• This is reasonable, since every message crosses the network twice (store and forward through the CM).
[Figure: RTT vs. message size (0 to 384 kB) for P4, ch_cm with 1 CM out-of-core, ch_cm with 1 CM in-core, and ch_cm with 1 CM out-of-core (best).]
Impact of sharing a Channel Memory
Individual communication time according to the number of nodes attached to one CM (simultaneous communications). Asynchronous token ring (#tokens = #nodes), mean over 100 executions; the tokens rotate simultaneously around the ring, so there are always #nodes communications at the same time.
• The CM response time (as seen by a node) increases linearly with the number of nodes.
• The standard deviation is below 3% across nodes → fair distribution of the CM resource.
[Figure: communication time vs. token size (0 to 384 kB) for 1, 2, 4, 8 and 12 nodes sharing one CM.]
Impact of the number of threads in the Channel Memory
Individual communication time according to the number of nodes attached to one CM and the number of threads in the CM. Asynchronous token ring (#tokens = #nodes), mean over 100 executions.
• Increasing the number of threads reduces the CM response time, whatever the number of nodes using the same CM.
[Figure: communication time vs. token size for various numbers of attached nodes and CM threads.]
Impact of remote checkpoint on node performance
Time between the reception of a checkpoint signal and the actual restart: fork, checkpoint, compress, transfer to the CS, way back, decompress, restart.
• The cost of a remote checkpoint is close to that of a local one (the overhead can be as low as 2%), because compression and transfer are overlapped.
[Figure: checkpoint RTT for bt.W.4 (2 MB), bt.A.4 (43 MB), bt.B.4 (21 MB) and bt.A.1 (201 MB), distant (Ethernet 100BaseT) vs. local (disc); measured pairs of 1.8/1.4 s, 78/62 s, 50/44 s and 214/208 s, i.e. overheads ranging from +28% down to +2%.]
Stressing the checkpoint server: checkpoint RTT for simultaneous checkpoints
RTT experienced by every node for simultaneous checkpoints (checkpoint signals are synchronized), according to the number of checkpointing nodes on a single CS (BT.A.1).
• The RTT increases almost linearly with the number of nodes once network saturation is reached (going from 1 to 2 nodes).
[Figure: checkpoint RTT (about 200 to 500 s) vs. number of simultaneous checkpoints on a single CS.]
Impact of checkpointing on application performance
Performance reduction for NAS BT.A.4 according to the number of consecutive checkpoints; a single checkpoint server serves the 4 MPI tasks (P4 driver); checkpoints are performed at random times on each node (no synchronization).
• When 4 checkpoints are performed per process, performance is about 94% of that of a non-checkpointed execution.
• Several nodes can use the same CS.
[Figure: relative performance (%) vs. number of checkpoints during BT.A.4 (0 to 4), comparing uniprocessor and dual-processor nodes, and blocking vs. non-blocking checkpointing.]
Performance of re-execution
Time for the re-execution of a token ring on 8 nodes, according to the token size and the number of restarted nodes.
• The system can survive the crash of all the MPI processes.
• Re-execution is faster than the initial execution because the messages are already available in the CM (stored by the previous execution).
[Figure: re-execution time vs. token size (64 kB to 256 kB) for 0 to 8 restarted nodes.]
Global operation performance
MPI all-to-all for 9 nodes (1 CM). [Figure: bar chart with times of about 0.3, 0.7 and 2.1 s; the chart annotation indicates a factor of about 3.]
Putting all together: performance scalability
Performance of MPI-PovRay:
• Parallelized version of the PovRay ray tracer application.
• 1 CM for 8 MPI processes.
• Renders a complex 450x350 scene.
• The comm/comp ratio is about 10% for 16 MPI processes.
Execution time:
  #nodes               16        32        64        128
  MPICH-P4             744 sec.  372 sec.  191 sec.  105 sec.
  MPICH-V              757 sec.  382 sec.  190 sec.  107 sec.
  P4/V ratio           .98       .97       .99       .98
  Speedup (MPICH-V)    1         1.98      3.98      7.07
MPICH-V provides performance similar to P4, plus fault tolerance (at the cost of 1 CM for every 8 nodes).
Putting all together: performance with volatile nodes
Performance of BT.A.9 with frequent faults:
• 3 CMs, 2 CSs (4 nodes on one CS, 5 on the other).
• 1 checkpoint every 130 seconds on each node (not synchronized).
• The checkpoint overhead is about 23%; with 10 faults, performance is 68% of the fault-free performance.
MPICH-V allows the application to survive node volatility (~1 fault per 2 minutes), and the performance degradation under frequent faults stays reasonable.
[Figure: total execution time (s) vs. number of faults during the execution (0 to 10, i.e. up to ~1 fault/110 s); the base execution, without checkpoint and fault, is about 610 s, and times grow to ~1100 s.]
Putting all together: MPICH-V vs. MPICH-P4 on NAS BT
• 1 CM per MPI process, 1 CS for 4 MPI processes.
• 1 checkpoint every 120 seconds on each node (whole application checkpointed).
• MPICH-V compares favorably to MPICH-P4 for all configurations on this platform for BT class A.
• The differences in communication times are due to the way asynchronous communications are handled by each environment.
[Figure: NAS BT class A execution times for MPICH-P4 and for MPICH-V in three configurations: CM but no logs, CM with logs, and CM+CS+ckpt.]
Outline
• Introduction
• Motivations & Objectives
• Architecture
• Performance
• Future work
• Concluding remarks
Future Work
[Figure: future architecture — Dispatcher, redundant Channel Memories and redundant Checkpoint Servers, nodes behind firewalls.]
• Channel Memories reduce the communication performance:
  – change packet transit from store-and-forward to wormhole;
  – or remove the CMs (cluster case): message logging on the nodes, with the communication causality vectors stored separately on the CSs.
• Remove the need for stable resources: add redundancy.
Outline
• Introduction
• Motivations & Objectives
• Architecture
• Performance
• Future work
• Concluding remarks
Concluding remarks
MPICH-V is:
• a full-fledged fault tolerant MPI environment (library + runtime);
• uncoordinated checkpoint + distributed pessimistic message logging;
• built from Channel Memories, Checkpoint Servers, a Dispatcher and compute nodes.
Main results:
• Raw communication performance (RTT) is about ½ that of MPICH-P4.
• Scalability is as good as that of P4 (up to 128 nodes) for MPI-PovRay.
• MPICH-V allows applications to survive node volatility (~1 fault per 2 minutes).
• When frequent faults occur, the performance degradation stays reasonable.
• NAS BT performance is comparable to MPICH-P4 (up to 25 nodes).
www.lri.fr/~fci/Group