
Page 1: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, Th. Herault, P. Lemarinier,

O. Lodygensky, F. Magniette, V. Neri, A. Selikhov

Cluster & GRID group, LRI, University of Paris South. [email protected], www.lri.fr/~fci

Page 2: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Outline

• Introduction

• Motivations & Objectives

• Architecture

• Performance

• Future work

• Concluding remarks

Page 3: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

• Industry and academia are building larger and larger computing facilities for technical computing (research and production).

Platforms with 1000s of nodes are becoming common: Tera Scale Machines (US ASCI, French Tera), Large Scale Clusters (Score III, etc.), Grids, PC-Grids (Seti@home, XtremWeb, Entropia, UD, Boinc)

Large Scale Parallel and Distributed systems and node Volatility

• These large scale systems have frequent failures/disconnections:

The ASCI-Q full-system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job with 4096 processors has less than a 50% chance of terminating.

PC Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour).

• Many HPC applications use the message passing paradigm.

We need a volatility-tolerant message passing environment.

• When failures/disconnections cannot be avoided, they become a characteristic of the system, called volatility.

(Reference: Scaling to Thousands of Processors with Buffered Coscheduling, Workshop: Scaling to New Heights, Pittsburgh, May 2002.)

Page 4: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Fault tolerant Message passing: a long history of research!

3 main parameters distinguish the proposed FT techniques:

Transparency: application checkpointing, MP API+Fault management, automatic.

application ckpt: the application stores intermediate results and restarts from them

MP API+FM: message passing API returns errors to be handled by the programmer

automatic: the runtime detects faults and handles recovery

Checkpoint coordination: no, coordinated, uncoordinated.

coordinated: all processes are synchronized, network is flushed before ckpt;

all processes roll back from the same snapshot

uncoordinated: each process checkpoints independently of the others

each process is restarted independently of the others

Message logging: no, pessimistic, optimistic, causal.

pessimistic: all messages are logged on reliable media and used for replay

optimistic: all messages are logged on non-reliable media. If one node fails, replay is done according to the other nodes' logs; if more than one node fails, roll back to the last coherent checkpoint.

causal: optimistic + antecedence graph, which reduces the recovery time

Related work

Page 5: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Related work

[Figure: a classification of fault tolerant message passing environments, considering (A) the level in the software stack where fault tolerance is managed (Framework, API, Communication Lib.) and (B) the fault tolerance technique (non-automatic vs. automatic; checkpoint based vs. log based; pessimistic, causal, or optimistic (sender based) log; causal logging + coordinated checkpoint). Entries:
- Cocheck: independent of MPI [Ste96]
- Starfish: enrichment of MPI [AF99]
- Clip: semi-transparent checkpoint [CLP97]
- Optimistic recovery in distributed systems: n faults with coherent checkpoint [SY85]
- Sender based Mess. Log.: 1 fault, sender based [JZ87]
- Pruitt 98: 2 faults, sender based [PRU98]
- Manetho: n faults [EZ92]
- Egida [RAV99]
- MPI/FT: redundancy of tasks [BNC01]
- FT-MPI: modification of MPI routines, user fault treatment [FD00]
- MPI-FT: n faults, centralized server [LNLE00]
- MPICH-V: n faults, distributed logging]

There is no automatic/transparent, n-fault tolerant, scalable message passing environment.

Page 6: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Outline

• Introduction

• Motivations & Objectives

• Architecture

• Performance

• Future work

• Concluding remarks

Page 7: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Goal: execute existing or new MPI Apps

Programmer’s view unchanged:

[Diagram: a PC client calling MPI_Send() communicates with a PC client calling MPI_Recv().]

Objective summary:
1) Automatic fault tolerance
2) Transparency for the programmer & user
3) Tolerate n faults (n being the number of MPI processes)
4) Firewall bypass (tunnel) for cross-domain execution
5) Scalable infrastructure/protocols
6) Avoid global synchronizations (ckpt/restart)
7) Theoretical verification of protocols

Objectives and constraints

Problems:
1) Volatile nodes (any number, at any time)
2) Firewalls (PC Grids)
3) Non-named receptions (should be replayed in the same order as in the previous, failed execution)

Page 8: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Coordinated checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes, which requires a global synchronization and a network flush and is therefore not scalable.

[Diagrams: coordinated checkpoint (nodes, ckpt, failure, detection/global stop, restart) vs. uncoordinated checkpoint (nodes, ckpt, failure, detection, restart).]

Uncoordinated checkpoint:
- No global synchronization (scalable).
- Nodes may checkpoint at any time (independently of the others).
- Need to log nondeterministic events: in-transit messages.

Page 9: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

A set of reliable nodes called “Channel Memories” logs every message.

All communications are implemented by one PUT and one GET operation to the CM.

PUT and GET operations are transactions.

When a process restarts, it replays all communications using the Channel Memory.

The CM stores and delivers messages in FIFO order, ensuring a consistent state for each receiver.

[Diagram: nodes Put messages to and Get messages from a Channel Memory over the network.]

Pessimistic message logging on Channel Memories

CM also works as a tunnel for firewall protected nodes (PC-Grids)

[Diagram: distributed pessimistic remote logging — firewall-protected nodes Put/Get through a Channel Memory acting as a stable tunnel.]
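To make the PUT/GET/replay mechanics concrete, here is a minimal C sketch (not the MPICH-V source: the names cm_put, cm_get, and cm_replay_reset and the in-memory storage are illustrative assumptions) of a Channel-Memory-like log that delivers messages in FIFO order and replays the same sequence after a restart.

```c
/* Minimal sketch (hypothetical names, not the MPICH-V implementation): an
 * in-memory model of a Channel Memory that logs every message pessimistically
 * and replays them in FIFO order after a crash. */
#include <stdio.h>

#define MAX_MSGS 64
#define MAX_LEN  128

typedef struct {
    char log[MAX_MSGS][MAX_LEN]; /* every message is kept: pessimistic log */
    int  logged;                 /* number of messages stored so far       */
    int  next_get;               /* GET cursor: FIFO delivery order        */
} channel_memory;

/* PUT: the sender hands the message to the CM; it is logged before delivery. */
void cm_put(channel_memory *cm, const char *msg) {
    if (cm->logged < MAX_MSGS)
        snprintf(cm->log[cm->logged++], MAX_LEN, "%s", msg);
}

/* GET: the receiver asks its home CM for the next message, in FIFO order. */
const char *cm_get(channel_memory *cm) {
    return (cm->next_get < cm->logged) ? cm->log[cm->next_get++] : NULL;
}

/* After a crash, the restarted receiver rewinds its GET cursor and replays
 * the logged messages in exactly the same order as the failed execution. */
void cm_replay_reset(channel_memory *cm) { cm->next_get = 0; }

int main(void) {
    channel_memory cm = {0};
    cm_put(&cm, "msg 0 from rank 1");
    cm_put(&cm, "msg 1 from rank 3");

    printf("first run : %s\n", cm_get(&cm));   /* msg 0 */
    printf("first run : %s\n", cm_get(&cm));   /* msg 1 */

    cm_replay_reset(&cm);                      /* receiver crashed, restarted */
    printf("replay    : %s\n", cm_get(&cm));   /* same msg 0, same order      */
    return 0;
}
```

Because every message is logged before it is consumed, a restarted receiver obtains exactly the delivery order of the failed execution, which is what the pessimistic logging scheme guarantees.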

Page 10: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Putting all together: Sketch of execution with a crash

[Timeline diagram (pseudo time scale): processes 0, 1, and 2 exchange messages through two CMs and send checkpoint images to a CS; one process crashes and rolls back to its latest process checkpoint. The worst condition is illustrated: an in-transit message combined with a checkpoint.]

Page 11: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Outline

• Introduction

• Motivations & Objectives

• Architecture

• Performance

• Future work

• Concluding remarks

Page 12: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

MPICH-V:
– Communication library: an MPICH device built on Channel Memories.
– Runtime: executes/manages instances of MPI processes on nodes.
Only requires re-linking the application with libmpichv instead of libmpich.

Global architecture

[Diagram: the Dispatcher, Channel Memory, and Checkpoint Server run on stable nodes and communicate over the network with compute Nodes located behind firewalls; numbered arrows (1-5) show the execution flow.]

Page 13: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Dispatcher (stable)
-- Initializes the execution: distributes roles (CM, CS, and Node) to participating nodes (launches the appropriate job), checks readiness

-- Launches the instances of MPI processes on Nodes

-- Monitors the Node state (alive signal, or time-out)

-- Reschedules tasks on available nodes for dead MPI process instances

[Diagram: the Dispatcher distributes roles to Channel Memories, Checkpoint Servers, and Nodes, monitors alive signals from the MPI process instances, and launches a new MPI process instance when a failure is detected.]
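The alive-signal/time-out monitoring described above can be sketched as follows (a minimal C sketch with hypothetical names and a hypothetical 10-second timeout, not the MPICH-V dispatcher code):

```c
/* Minimal sketch (hypothetical, not the MPICH-V dispatcher): timeout-based
 * monitoring of alive signals and rescheduling of MPI process instances. */
#include <stdio.h>
#include <time.h>

#define NPROC   4       /* MPI process instances to supervise           */
#define TIMEOUT 10      /* seconds without an alive signal => node dead */

typedef struct {
    int    node;        /* node currently hosting this MPI process      */
    time_t last_alive;  /* time of the last alive signal received       */
} instance;

/* Called whenever the dispatcher receives an alive signal from a node. */
void on_alive(instance *p) { p->last_alive = time(NULL); }

/* Periodic scan: detect dead instances and restart them elsewhere.     */
void monitor(instance inst[], int n, int (*pick_spare_node)(void)) {
    time_t now = time(NULL);
    for (int i = 0; i < n; i++) {
        if (now - inst[i].last_alive > TIMEOUT) {
            int spare = pick_spare_node();
            printf("proc %d: node %d timed out, restarting on node %d\n",
                   i, inst[i].node, spare);
            inst[i].node = spare;       /* reschedule the instance; it will  */
            inst[i].last_alive = now;   /* restart from its checkpoint and   */
        }                               /* replay messages from its home CM  */
    }
}

static int next_spare = 100;
static int pick_spare(void) { return next_spare++; }

int main(void) {
    instance inst[NPROC];
    for (int i = 0; i < NPROC; i++) { inst[i].node = i; on_alive(&inst[i]); }
    inst[2].last_alive -= 2 * TIMEOUT;  /* simulate a node that stopped signalling */
    monitor(inst, NPROC, pick_spare);
    return 0;
}
```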

Page 14: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Channel Memory (stable)

Multithreaded server with out-of-core message storage and garbage collection.

[Diagram of the Channel Memory internals:]
- Open sockets: one per attached Node, one per home Checkpoint Server of an attached Node, and one for the Dispatcher.
- Incoming messages (Put transaction + control) and outgoing messages (Get transaction + control): one thread polls, treats the event, and releases the other threads.
- Message queues are kept in memory and on disk (out-of-core), in FIFO order, to ensure a total order on each receiver's messages.
- Garbage collection removes messages older than the current checkpoint image of each node.
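The FIFO queue plus the garbage-collection rule ("remove messages older than the current checkpoint image") can be sketched as follows; the structures and sequence-number bookkeeping are assumptions for illustration, not the MPICH-V data layout:

```c
/* Minimal sketch (hypothetical structures): a per-receiver FIFO message queue
 * with garbage collection of messages older than the receiver's current
 * checkpoint image. */
#include <stdio.h>
#include <stdlib.h>

typedef struct msg {
    long        seq;    /* delivery order for this receiver (FIFO)    */
    struct msg *next;
} msg;

typedef struct {
    msg *head, *tail;   /* FIFO queue of logged messages               */
    long ckpt_seq;      /* last message already covered by the node's  */
} receiver_queue;       /* current checkpoint image                    */

void enqueue(receiver_queue *q, long seq) {
    msg *m = malloc(sizeof *m);
    m->seq = seq; m->next = NULL;
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
}

/* Garbage collection: once a node has checkpointed, any message it had
 * already received is never needed again for replay and can be dropped. */
void gc(receiver_queue *q) {
    while (q->head && q->head->seq <= q->ckpt_seq) {
        msg *old = q->head;
        q->head = old->next;
        if (!q->head) q->tail = NULL;
        free(old);
    }
}

int main(void) {
    receiver_queue q = {0};
    for (long s = 1; s <= 5; s++) enqueue(&q, s);
    q.ckpt_seq = 3;          /* the node's new checkpoint covers msgs 1..3 */
    gc(&q);                  /* messages 1..3 are discarded                */
    for (msg *m = q.head; m; m = m->next) printf("kept seq %ld\n", m->seq);
    return 0;
}
```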

Page 15: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Mapping Channel Memories with nodes

[Diagram: nodes 0, 1, 2, …, N each attached to one of several CMs; one CM is highlighted as the "home" CM for node 2.]

Several CM coordination constraints:
1) Force a total order on the messages for each receiver.
2) Avoid coordination messages among CMs.

Our solution: each Node is "attached" to only one "home" CM.
• A node receives messages from its home CM.
• A node sends messages to the home CM of the destination node.
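To illustrate why a single home CM per node satisfies both constraints, here is a sketch assuming a hypothetical round-robin attachment policy (the actual MPICH-V mapping policy may differ): every process can compute any rank's home CM locally, sends go to the destination's home CM, and receives come from the node's own home CM.

```c
/* Minimal sketch (hypothetical policy): a static "home CM" mapping and the
 * resulting routing rule: send to the destination's home CM, receive from
 * your own home CM. */
#include <stdio.h>

#define NUM_CM 3

/* Every process can compute any rank's home CM without coordination. */
int home_cm(int rank) { return rank % NUM_CM; }

void send_msg(int src, int dst) {
    printf("rank %d PUTs a message for rank %d on CM %d\n",
           src, dst, home_cm(dst));        /* destination's home CM */
}

void recv_msg(int dst) {
    printf("rank %d GETs its next message from CM %d\n",
           dst, home_cm(dst));             /* its own home CM       */
}

int main(void) {
    send_msg(0, 2);   /* rank 0 -> rank 2 goes through CM 2 % 3 = 2 */
    recv_msg(2);      /* rank 2 only ever reads from CM 2           */
    return 0;
}
```

Since each receiver reads from exactly one CM, that CM alone can impose a total (FIFO) order on the receiver's messages (constraint 1) without any coordination messages between CMs (constraint 2).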

Page 16: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Checkpoint Server (stable)

Multiprocess server.

[Diagram of the Checkpoint Server internals:]
- Checkpoint images are stored on reliable media (disk): one file per Node, with the name given by the Node.
- Open sockets: one per attached Node and one per home CM of the attached Nodes.
- Incoming messages (Put ckpt transaction) and outgoing messages (Get ckpt transaction + control): one process polls, treats the event, and dispatches the job to the other processes.
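A minimal sketch of the "one checkpoint image file per Node, name given by the Node" storage rule; store_ckpt/load_ckpt and the file naming are hypothetical, and the real server streams images over sockets rather than taking in-memory buffers:

```c
/* Minimal sketch (hypothetical, not the MPICH-V CS source): store one
 * checkpoint image file per node and return the latest image on restart. */
#include <stdio.h>

/* Store an incoming checkpoint image (Put ckpt transaction). */
int store_ckpt(const char *node_name, const void *img, size_t len) {
    char path[256];
    snprintf(path, sizeof path, "ckpt_%s.img", node_name); /* 1 file per node */
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(img, 1, len, f);
    fclose(f);
    return written == len ? 0 : -1;
}

/* Return the latest checkpoint image for a restarting node (Get ckpt). */
long load_ckpt(const char *node_name, void *buf, size_t max) {
    char path[256];
    snprintf(path, sizeof path, "ckpt_%s.img", node_name);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    long n = (long)fread(buf, 1, max, f);
    fclose(f);
    return n;
}

int main(void) {
    char image[] = "fake checkpoint image bytes";
    char back[64] = {0};
    store_ckpt("node7", image, sizeof image);
    long n = load_ckpt("node7", back, sizeof back);
    printf("restored %ld bytes: %s\n", n, back);
    return 0;
}
```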

Page 17: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Node (volatile): checkpointing

User-level checkpoint: Condor Stand-alone Checkpointing (CSAC)

Clone checkpointing + non-blocking checkpoint

[Diagram: the node's software stack (code, CSAC, libmpichv) receiving a checkpoint order.]

On a checkpoint order: (1) fork; (2) terminate ongoing communications; (3) close sockets; (4) call ckpt_and_exit().

The checkpoint image is sent to the CS on the fly (not stored locally). The checkpoint order is triggered locally (not by a dispatcher signal).

On restart, execution resumes using CSAC just after (4), then reopens sockets and returns.
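A minimal sketch of the fork-based, non-blocking checkpoint sequence above, assuming the forked clone performs steps (2)-(4) while the original process keeps computing; checkpoint_and_exit() is a placeholder standing in for CSAC's ckpt_and_exit(), and the communication/socket handling is reduced to stubs:

```c
/* Minimal sketch of a fork-based, non-blocking checkpoint. The real call
 * would be CSAC's ckpt_and_exit(); everything here is simplified. */
#include <stdio.h>
#include <unistd.h>

/* Placeholder for CSAC's ckpt_and_exit(): write the image and terminate. */
static void checkpoint_and_exit(void) {
    printf("[clone %d] image streamed to the Checkpoint Server\n", getpid());
    _exit(0);
}

static void terminate_ongoing_comms(void) { /* (2) finish pending transfers */ }
static void close_sockets(void)           { /* (3) close CM/CS sockets      */ }

/* Triggered locally by the checkpoint timer, not by a dispatcher signal. */
void on_checkpoint_order(void) {
    pid_t clone = fork();                 /* (1) fork a clone of the process */
    if (clone == 0) {                     /* clone: take the checkpoint      */
        terminate_ongoing_comms();        /* (2)                             */
        close_sockets();                  /* (3)                             */
        checkpoint_and_exit();            /* (4) image sent on the fly       */
    }
    /* parent: keeps computing while the clone checkpoints (non-blocking) */
}

int main(void) {
    on_checkpoint_order();
    printf("[parent %d] continues the MPI computation without blocking\n",
           getpid());
    return 0;
}
```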

Page 18: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Library: based on MPICH

– A new device: the 'ch_cm' device.
– All ch_cm device functions are blocking communication functions built over the TCP layer.

[Diagram: the MPICH layering and binding — MPI_Send → ADI → Channel Interface (MPID_SendControl / MPID_SendChannel) → Chameleon Interface → CM device interface.]

CM device interface functions:
- _cmInit: initialize the client
- _cmFinalize: finalize the client
- _cmbsend: blocking send
- _cmbrecv: blocking receive
- _cmprobe: check for any message available
- _cmfrom: get the source of the last message
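As an illustration only, a blocking device binding of this kind could be represented as a table of entry points; the signatures and stub bodies below are hypothetical and do not reproduce the real ch_cm interface:

```c
/* Hypothetical sketch of a blocking, CM-backed device binding; the real
 * ch_cm signatures in MPICH-V are not reproduced here. */
#include <stdio.h>
#include <stddef.h>

typedef struct cm_device {
    int (*init)(int rank);                            /* _cmInit              */
    int (*finalize)(void);                            /* _cmFinalize          */
    int (*bsend)(int dst, const void *buf, size_t n); /* _cmbsend (blocking)  */
    int (*brecv)(void *buf, size_t max);              /* _cmbrecv (blocking)  */
    int (*probe)(void);                               /* _cmprobe             */
    int (*from)(void);                                /* _cmfrom              */
} cm_device;

/* Stub implementations standing in for the TCP transactions with the CM. */
static int my_rank, last_src = -1;
static int dev_init(int rank) { my_rank = rank; return 0; }
static int dev_finalize(void) { return 0; }
static int dev_bsend(int dst, const void *b, size_t n) {
    (void)b; printf("PUT %zu bytes for rank %d on its home CM\n", n, dst);
    return 0;
}
static int dev_brecv(void *b, size_t m) {
    (void)b; (void)m; last_src = 0;
    printf("GET next message from rank %d's home CM\n", my_rank);
    return 0;
}
static int dev_probe(void) { return 0; /* nothing pending */ }
static int dev_from(void)  { return last_src; }

int main(void) {
    cm_device dev = { dev_init, dev_finalize, dev_bsend,
                      dev_brecv, dev_probe, dev_from };
    dev.init(1);
    dev.bsend(2, "hello", 5);
    dev.brecv(NULL, 0);
    printf("last message came from rank %d\n", dev.from());
    dev.finalize();
    return 0;
}
```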

Page 19: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Outline

• Introduction

• Motivations & Objectives

• Architecture

• Performance

• Future work

• Concluding remarks

Page 20: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Experimental platform

[Diagram: network topology with ~4.8 Gb/s and ~1 Gb/s links.]

Icluster-Imag: 216 PIII 733 MHz, 256 MB per node; 5 subsystems with 32 to 48 nodes, 100BaseT switches, and a 1 Gb/s switch mesh between subsystems; Linux, PGI Fortran or GCC compiler. Very close to a typical building LAN. Node volatility is simulated.

XtremWeb is used as the software environment (launching MPICH-V); the NAS BT benchmark serves as a complex application (high comm/comp ratio).

Page 21: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

RTT ping-pong: 2 nodes, 2 Channel Memories, blocking communications.

Performance degradation of a factor of 2 (compared to P4), but MPICH-V tolerates an arbitrary number of faults.

This is reasonable since every message crosses the network twice (store and forward through the CM).

Basic performance

[Plot: RTT (s, mean over 100 measurements) vs. message size (0 to 384 kB) for P4 and for ch_cm with 1 CM (out-of-core, in-core, and out-of-core best); bandwidth reaches 10.5 MB/s for P4 vs. 5.6 MB/s for ch_cm, roughly a factor of 2.]
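For reference, the kind of ping-pong commonly used to measure RTT looks like the following generic MPI sketch (not the authors' measurement harness; the 64 kB message size and 100 iterations are arbitrary choices):

```c
/* Generic MPI ping-pong RTT measurement. Compile with mpicc, run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, iters = 100, size = 64 * 1024;   /* 64 kB messages, 100 trips */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(size);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - t0) / iters;   /* mean round-trip time */
    if (rank == 0)
        printf("mean RTT for %d-byte messages: %f s\n", size, rtt);
    free(buf);
    MPI_Finalize();
    return 0;
}
```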

Page 22: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Impact of sharing a Channel Memory

Individual communication time according to the number of nodes attached to one CM (simultaneous communications).

Asynchronous token ring (#tokens = #nodes); mean over 100 executions. Tokens rotate simultaneously around the ring, so there are always #nodes communications at the same time.

The CM response time (as seen by a node) increases linearly with the number of nodes. The standard deviation is below 3% across nodes, indicating a fair distribution of the CM resource.

[Plot: time (s, 0 to 0.5) vs. token size (0 to 384 kB) for 1, 2, 4, 8, and 12 nodes attached to one CM.]
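The benchmark pattern can be sketched generically as an asynchronous token ring in which every rank circulates a token at the same time, so #tokens = #nodes communications are always in flight (not the authors' benchmark code; the token size is arbitrary):

```c
/* Generic MPI token ring: every rank passes a token to its right neighbor
 * while receiving one from its left neighbor, all simultaneously. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, np, size = 64 * 1024;            /* 64 kB token */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    int right = (rank + 1) % np, left = (rank + np - 1) % np;
    char *token = malloc(size), *incoming = malloc(size);

    double t0 = MPI_Wtime();
    for (int hop = 0; hop < np; hop++) {       /* one full revolution */
        MPI_Request req;
        MPI_Irecv(incoming, size, MPI_CHAR, left, 0, MPI_COMM_WORLD, &req);
        MPI_Send(token, size, MPI_CHAR, right, 0, MPI_COMM_WORLD);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        char *tmp = token; token = incoming; incoming = tmp; /* pass it on */
    }
    double per_hop = (MPI_Wtime() - t0) / np;  /* individual comm. time */
    if (rank == 0)
        printf("mean time per hop for %d-byte tokens: %f s\n", size, per_hop);

    free(token); free(incoming);
    MPI_Finalize();
    return 0;
}
```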

Page 23: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Impact of the number of threads in the Channel Memory

Individual communication time according to the number of nodes attached to one CM and to the number of threads in the CM.

Asynchronous token ring (#tokens = #nodes); mean over 100 executions.

[Plot: time (s) vs. token size (bytes).]

Increasing the number of threads reduces the CM response time, whatever the number of nodes using the same CM.

Page 24: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Impact of remote checkpoint on node performance

Measured time between the reception of a checkpoint signal and the actual restart: fork, ckpt, compress, transfer to the CS, way back, decompress, restart.

The cost of a remote checkpoint is close to that of a local checkpoint (the overhead can be as low as 2%), because compression and transfer are overlapped.

[Bar chart: checkpoint RTT (s) for bt.w.4 (2 MB), bt.A.4 (43 MB), bt.B.4 (21 MB), and bt.A.1 (201 MB), comparing distributed checkpointing over Ethernet 100BaseT with local checkpointing to disk. Measured pairs (dist./local): 1.8/1.4 s (+28%), 50/44 s (+14%), 78/62 s (+25%), and 214/208 s (+2%).]

Page 25: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Stressing the checkpoint server: ckpt RTT for simultaneous ckpts

RTT experienced by every node for simultaneous checkpoints (checkpoint signals are synchronized), according to the number of checkpointing nodes.

[Plot: checkpoint RTT (s, roughly 200 to 500) vs. the number of simultaneous checkpoints on a single CS (BT.A.1).]

The RTT increases almost linearly with the number of nodes, once network saturation is reached (from 1 to 2 nodes).

Page 26: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Impact of checkpointing on application performance

Performance reduction for NAS BT.A.4 according to the number of consecutive checkpoints.

A single checkpoint server for 4 MPI tasks (P4 driver); checkpoints are performed at random times on each node (no synchronization).

[Plots: relative performance (%) vs. number of checkpoints during BT.A.4 (0 to 4): one plot compares uniprocessor vs. dual-processor nodes, the other compares blocking vs. non-blocking checkpointing.]

When 4 checkpoints are performed per process, performance is about 94% of that of a non-checkpointed execution. Several nodes can use the same CS.

Page 27: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Performance of re-execution

Time for the re-execution of a token ring on 8 nodes after a crash, according to the token size and the number of restarted nodes.

[Plot: time (s, up to ~0.3) vs. token size (64 kB to 256 kB) for 0 to 8 restarted nodes.]

The system can survive the crash of all MPI processes. Re-execution is faster than the original execution because the messages are already available in the CM (stored by the previous execution).

Page 28: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Global operation performance

[Chart: MPI all-to-all for 9 nodes (1 CM); values of about 0.3, 0.7, and 2.1 s, with an annotated factor of roughly x3.]

Page 29: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Performance of MPI-PovRay

Putting all together: Performance scalability

• Parallelized version of the PovRay raytracer application
• 1 CM for 8 MPI processes
• Render a complex 450x350 scene
• Comm/comp ratio is about 10% for 16 MPI processes

MPICH-V provides performance similar to P4, plus fault tolerance (at the cost of 1 CM every 8 nodes).

Execution time:

#nodes        16         32         64         128
MPICH-P4      744 sec.   372 sec.   191 sec.   105 sec.
MPICH-V       757 sec.   382 sec.   190 sec.   107 sec.
P4/V ratio    .98        .97        .99        .98
Speedup       1          1.98       3.98       7.07

Page 30: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Performance of BT.A.9 with frequent faults

Putting all together: Performance with volatile nodes

• 3 CM, 2 CS (4 nodes on one CS, 5 on the other)
• 1 checkpoint every 130 seconds on each node (not synchronized)

The overhead of checkpointing is about 23%. With 10 faults, performance is 68% of that of a fault-free run.

MPICH-V allows the application to survive node volatility (about 1 fault every 2 minutes), and the performance degradation with frequent faults stays reasonable.

[Plot: total execution time (s, roughly 610 to 1100) vs. number of faults during the execution (0 to 10); the base execution time without checkpointing and faults is shown for reference; the fault rate is about 1 fault per 110 s.]

Page 31: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Putting all together: MPICH-V vs. MPICH-P4 on NAS BT

• 1 CM per MPI process, 1 CS for 4 MPI processes
• 1 checkpoint every 120 seconds on each node (Whole)

[Plot: NAS BT class A performance for MPICH-P4 and for MPICH-V with CM but no logs, CM with logs, and CM+CS+ckpt.]

MPICH-V compares favorably to MPICH-P4 for all configurations on this platform for BT class A.

The differences in communication times are due to the way asynchronous communications are handled by each environment.

Page 32: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Outline

• Introduction

• Motivations & Objectives

• Architecture

• Performance

• Future work

• Concluding remarks

Page 33: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Future Work

[Diagram: a future architecture with redundant Channel Memories and Checkpoint Servers, a Dispatcher, and compute nodes behind firewalls on the network.]

Channel Memories reduce the communication performance:
- change packet transit from store-and-forward to wormhole;
- remove the CMs (cluster case): message logging on the nodes, with the communication causality vector stored separately on the CSs.

Remove the need for stable resources: add redundancy.

Page 34: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Outline

• Introduction

• Motivations & Objectives

• Architecture

• Performance

• Future work

• Concluding remarks

Page 35: MPICH-V: Toward a scalable fault tolerant MPI for Volatile nodes

Concluding remarks

MPICH-V:

• A full-fledged fault tolerant MPI environment (lib + runtime).
• Uncoordinated checkpoint + distributed pessimistic message logging.
• Channel Memories, Checkpoint Servers, Dispatcher, and nodes.

Main results:

• Raw communication performance (RTT) is about half that of MPICH-P4.
• Scalability is as good as that of P4 (128 nodes) for MPI-PovRay.
• MPICH-V allows applications to survive node volatility (about 1 fault every 2 minutes).
• When frequent faults occur, the performance degradation is reasonable.
• NAS BT performance is comparable to MPICH-P4 (up to 25 nodes).

www.lri.fr/~fci/Group