
Fault Tolerant MPI: Protocols and Implementations

Brian J. Argauer, Stephen R. Byers

May 23, 2006

Multiple Processor Systems (EECC 756), Dr. Muhammad Shaaban

Outline

- Motivation for Fault Tolerance
- Techniques: Checkpoint, Message Logging
- Implementations: MPICH-V1, MPICH-V2, MPICH-VCL
- CoCheck Framework
- Conclusion

Fault Tolerance Motivation

Current trend toward larger clusters, distributed and GRID computing.

Sources of failure:
- Nodes
- Network
- Human factors

Thousands of nodes reduce MTBF to hours or minutes.

Techniques: Checkpoint

Capture the entire state of each task: application, stack, allocated memory, etc.

On program failure:
- Kill survivors
- Restart from the last consistent and complete set of checkpoints

Coordinated:
- Expensive
- All tasks stop message passing, write to disks simultaneously, then continue message passing

Uncoordinated:
- Nodes checkpoint at different times
- In-flight messages retained via explicit logging
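The coordinated variant above can be sketched in a few lines of Python. This is an illustrative model only: `Task`, `coordinated_checkpoint`, and the use of `pickle` as a stand-in for writing task images to a checkpoint server are all invented for this sketch, not part of any real MPI library.

```python
# Minimal sketch of a coordinated checkpoint round (invented names):
# all tasks stop message passing, write their images "simultaneously",
# and a failed run restarts from the complete set of images.
import pickle

class Task:
    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "data": []}
        self.inbox = []          # messages delivered but not yet processed

    def checkpoint(self):
        # Persist the full task image: application state plus any
        # messages still buffered locally.
        return pickle.dumps({"state": self.state, "inbox": self.inbox})

    def restore(self, image):
        snap = pickle.loads(image)
        self.state, self.inbox = snap["state"], snap["inbox"]

def coordinated_checkpoint(tasks):
    # Message passing is stopped while this dict is built, so the set
    # of images is consistent by construction: no message is in flight.
    return {t.rank: t.checkpoint() for t in tasks}

tasks = [Task(r) for r in range(4)]
tasks[0].state["step"] = 7
images = coordinated_checkpoint(tasks)

tasks[0].state["step"] = 99          # progress after the checkpoint...
tasks[0].restore(images[0])          # ...rolled back after a failure
print(tasks[0].state["step"])        # -> 7
```

The expense the slide mentions is visible here: every task must pause and write its whole image at the same time, which serializes the run around disk bandwidth.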

Techniques: Message Logging

Pessimistic log:
- Transaction logging
- No incoherent states can be reached
- Can handle an unbounded number of faults

Optimistic log:
- Log messages asynchronously
- Assume part of the log is lost when faults occur
- Either roll back the entire application if too many faults occur, or assume only one fault at a time can occur in the system

Causal log:
- Piggybacks causal dependency information on messages, combining advantages of the pessimistic and optimistic approaches
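A minimal sketch of the pessimistic approach, with invented names: `stable_log` stands in for remote stable storage. Because each message is logged synchronously *before* delivery, a restarted receiver can always replay its receive sequence in the original order, which is why no incoherent state is reachable.

```python
# Pessimistic message logging sketch: log first, deliver second.
stable_log = []                      # stands in for remote stable storage

def send(dest_inbox, msg):
    stable_log.append(msg)           # synchronous log first (pessimistic)
    dest_inbox.append(msg)           # then deliver to the receiver

def replay(since=0):
    # After a fault, the restarted process re-receives the logged
    # messages in the original order, however many faults occurred.
    return stable_log[since:]

inbox = []
send(inbox, ("rank0", "a"))
send(inbox, ("rank2", "b"))
inbox.clear()                        # a crash loses the volatile inbox
print(replay())                      # -> [('rank0', 'a'), ('rank2', 'b')]
```

An optimistic logger would append to `stable_log` asynchronously, gaining speed on the failure-free path but risking a partially lost log at the moment of a fault.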

Technique Decision

- Checkpointing is efficient at low fault frequency
- Message logging is efficient at higher fault frequency

MPICH-V Introduction

Traditional MPI:
- Static resources
- Limited error handling
- Node failure stalls or slows down other nodes

MPICH-V:
- Research effort to provide an MPI implementation based on MPICH
- Automatic fault tolerant MPI library
- Implementations: MPICH-V1, MPICH-V2, MPICH-VCausal, MPICH-VCL

Fault Tolerant Overview

[Diagram classifying implementations by recovery style: Automatic — CoCheck, MPICH-V1/V2, MPICH-V/Causal, MPICH-VCL; Non-Automatic — FT-MPI]

MPICH-V1 Goals

Volatility tolerance:
- Redundancy
- Task migration

Highly distributed:
- Scalable, asynchronous checkpointing
- No global synchronization

Inter-administration domain communications:
- Security tools for GRID deployment
- Use a non-protected relay between client and server nodes if both client and server are firewalled

MPICH-V1 Overview

- Designed from a standard MPI implementation
- Runs existing MPI applications without modification
- Suitable for very large scale computing using heterogeneous networks
- Uncoordinated checkpoint
- Remote pessimistic message logging

MPICH-V1 Architecture

Checkpoint Server:
- Stores and provides task images
- Images sent to the CS as generated by nodes
- Each image is a clone of the running process on a given node

Channel Memory:
- Storage of in-transit messages
- Repository services

Dispatcher:
- Resource scheduling
- Task management

MPICH-V1 Performance


MPICH-V2

- Pessimistic logging for large clusters
- Uncoordinated checkpoint
- Nodes store the messages they send locally
- Event Loggers store the sequence of received messages for each node
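A rough sketch of this split, with invented names rather than MPICH-V2's actual API: senders keep message payloads in a local log, while a central event logger records only the order in which each node received messages. Recovery then replays by pulling payloads back from the surviving senders in the logged order.

```python
# Sender-based payload logging plus a central event logger (illustrative
# names only). The heavy data stays on the senders; the event logger
# stores only small (sender, sequence-number) records per receiver.
from collections import defaultdict

sender_logs = defaultdict(list)      # rank -> payloads it has sent
event_log = defaultdict(list)        # rank -> [(sender, seq), ...]

def send(src, dst, payload):
    seq = len(sender_logs[src])
    sender_logs[src].append(payload) # payload kept locally on the sender
    event_log[dst].append((src, seq))# event logger keeps order only

def recover(dst):
    # Replay: fetch the receive order from the event logger, then pull
    # each payload back out of the corresponding sender's local log.
    return [sender_logs[src][seq] for src, seq in event_log[dst]]

send(0, 2, "x")
send(1, 2, "y")
send(0, 2, "z")
print(recover(2))                    # -> ['x', 'y', 'z']
```

The design choice this illustrates: unlike V1's Channel Memory, no full message body crosses the network twice on the failure-free path, which is what makes the scheme attractive for large clusters.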


MPICH-VCL

- Newest MPICH-V implementation
- Designed for applications that depend on extra-low latency
- Coordinated checkpoint

MPICH-VCL Performance

- NAS benchmark BT, Class B, 25 nodes, Fast Ethernet
- Performance crossover point between checkpointing and message logging: 1 fault every 3 minutes


CoCheck Framework

- Abstraction framework above the message passing layer
- Easily adaptable and portable to different MPI implementations through the use of wrapper functions
- Provides consistency
- Considers checkpointing and process migration
- tuMPI (its native MPI implementation)

State Consistency Problem

- Processes A, B, C
- Circles = events, arrows = message sending
- S, S', S'' = checkpoint snapshots
- Notice S'' is inconsistent
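The rule behind the S'' example can be stated as a small predicate (event indices here are hypothetical stand-ins for the circles in the figure): a snapshot cut is inconsistent when some message was *received* inside the cut but *sent* outside it, so the snapshot records an effect without its cause.

```python
# Consistency check for a checkpoint cut (illustrative sketch).
def consistent(cut, messages):
    # cut[p] = index of process p's last event included in the snapshot
    # messages = [(sender, send_idx, receiver, recv_idx), ...]
    for s, s_idx, r, r_idx in messages:
        if r_idx <= cut[r] and s_idx > cut[s]:
            return False             # message received "from the future"
    return True

msgs = [("A", 2, "B", 3)]            # A's event 2 sends to B's event 3
print(consistent({"A": 2, "B": 3}, msgs))   # -> True (send is in the cut)
print(consistent({"A": 1, "B": 3}, msgs))   # -> False (like S'')
```

In the second call the cut includes B's receive (event 3) but not A's send (event 2), which is exactly the shape of the inconsistent snapshot S'' on the slide.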

Clearing Communication Lines

- Uses a coordinator process
- Coordinator sends a "Ready Message" (RM) when a checkpoint or migration is needed
- When a process receives an RM on a channel, it assumes no more communication will arrive on that channel
- Once all RMs are received, the process can checkpoint or migrate
- On restart, check for messages in the buffer
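The ready-message drain can be sketched as follows. This is a simplified model with invented names, not CoCheck's real interface: each process treats an RM on a channel as "no more application messages follow here," and buffers any in-flight messages it drains for replay after restart.

```python
# Ready-message (RM) protocol sketch for clearing communication lines.
class Process:
    def __init__(self, peers):
        self.pending_rm = set(peers) # channels not yet drained
        self.buffered = []           # in-flight messages kept for restart

    def receive(self, src, msg):
        if msg == "RM":
            self.pending_rm.discard(src)   # channel from src is now clear
        else:
            self.buffered.append((src, msg))  # retain in-flight message

    def ready_to_checkpoint(self):
        # Safe only once an RM has arrived on every incoming channel:
        # at that point no application message can still be in flight.
        return not self.pending_rm

p = Process(peers={"A", "B"})
p.receive("A", "late-data")          # in-flight message, buffered
p.receive("A", "RM")
print(p.ready_to_checkpoint())       # -> False, still waiting on B
p.receive("B", "RM")
print(p.ready_to_checkpoint())       # -> True; checkpoint or migrate now
```

On restart, the process would consume `buffered` before reading fresh messages, matching the "check for messages in buffer" step above.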

Performance & Future Research

- Single-processor migration results vs. number of processors and size of checkpoint image
- 8 machines: mix of Sun SparcStation 2 and Sparc 10
- Dominating factor: image size

Future considerations:
- Automatic load balancing and performance

Conclusion

Current trend:
- Increasing cluster size
- Lower MTBF
- Fault tolerance increasingly important

Fault tolerant implementations in MPI offer an assortment of solutions. New research is yielding improvements and ideas to enhance the efficiency and robustness of fault tolerant systems.

Questions?

References:

- G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, A. Selikhov, "MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes," LRI, Université de Paris Sud, Orsay, France (2002), IEEE.
- G. Stellner, "CoCheck: Checkpointing and Process Migration for MPI," Institut für Informatik der Technischen Universität München, München, Germany.
- G. Fagg, "Fault Tolerant MPI," Linux Magazine (November 2004). [Online]. Available: http://www.linux-mag.com/index2.php?option=com_content&task=view&id=1781&Itemid=2070&pop=1&page=0
- A. Bouteiller, P. Lemarinier, G. Krawezik, F. Cappello, "Coordinated Checkpoint versus Message Log for Fault Tolerant MPI," LRI, Université de Paris Sud, Orsay, France.
- W. Gropp, E. Lusk, "Fault Tolerance in MPI Programs," Argonne National Laboratory, Argonne, IL.
- A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, F. Cappello, "MPICH-V Project: A Multiprotocol Automatic Fault Tolerant MPI," INRIA/LRI, Université de Paris Sud, Orsay, France.
- "MPICH-V Introduction." [Online]. Available: http://mpich-v.lri.fr/index.php?section=intro&subsection=intro
