@spcl eth fault tolerance for remote memory access ... · logging memory accesses vs. messages...

Post on 08-Aug-2020

7 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

spcl.inf.ethz.ch

@spcl_eth

MACIEJ BESTA, TORSTEN HOEFLER

Fault Tolerance for Remote Memory Access

Programming Models

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory

Process p

A

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Process p Process q

A

BB

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

Process p Process q

A

BB

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

Process p Process q

A

BB

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

Process p Process q

A

BB

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

Process p Process q

A

BB

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

BB

AA

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

A

B

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

flush

A

B

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

flush

A

B

spcl.inf.ethz.ch

@spcl_eth

One-sided communication

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

flush

A

B

spcl.inf.ethz.ch

@spcl_eth

One-sided communication

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

flush

A

B

no active

participation

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

spcl.inf.ethz.ch

@spcl_eth

4

REMOTE MEMORY ACCESS PROGRAMMING

Supported by many HPC libraries and languages

spcl.inf.ethz.ch

@spcl_eth

4

REMOTE MEMORY ACCESS PROGRAMMING

Supported by many HPC libraries and languages

spcl.inf.ethz.ch

@spcl_eth

4

REMOTE MEMORY ACCESS PROGRAMMING

Supported by many HPC libraries and languages

spcl.inf.ethz.ch

@spcl_eth

4

REMOTE MEMORY ACCESS PROGRAMMING

Supported by many HPC libraries and languages

spcl.inf.ethz.ch

@spcl_eth

5

REMOTE MEMORY ACCESS PROGRAMMING

Enables significant speedups over message passing in

many types of applications, e.g.:

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

[2] D. Petrovic et al., High-performance RMA-based broadcast on the Intel SCC. SPAA’12

spcl.inf.ethz.ch

@spcl_eth

5

REMOTE MEMORY ACCESS PROGRAMMING

Enables significant speedups over message passing in

many types of applications, e.g.: Speedup of ~1.5 for communication patterns in graph analytics

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

[2] D. Petrovic et al., High-performance RMA-based broadcast on the Intel SCC. SPAA’12

spcl.inf.ethz.ch

@spcl_eth

5

REMOTE MEMORY ACCESS PROGRAMMING

Enables significant speedups over message passing in

many types of applications, e.g.: Speedup of ~1.5 for communication patterns in graph analytics

Speedup of ~1.4-2 in physics computations

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

[2] D. Petrovic et al., High-performance RMA-based broadcast on the Intel SCC. SPAA’12

spcl.inf.ethz.ch

@spcl_eth

6

FAULT TOLERANCE + RMA

spcl.inf.ethz.ch

@spcl_eth

6

15.8h of MTBF (for nodes) for the TSUBAME2

FAULT TOLERANCE + RMA

spcl.inf.ethz.ch

@spcl_eth

7

FAULT TOLERANCE + RMA

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Message Passing

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Message Passing

Coordinated

Checkpointing (CC)

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Message Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Message Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

Message Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

logging memory

accesses vs. messages

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

logging memory

accesses vs. messages

checkpointing in RMA-

based applications

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

logging memory

accesses vs. messages

checkpointing in RMA-

based applications

fault tolerance

mechanisms and

schemes

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

logging memory

accesses vs. messages

checkpointing in RMA-

based applications

performance

fault tolerance

mechanisms and

schemes

spcl.inf.ethz.ch

@spcl_eth

8

OVERVIEW OF OUR RESEARCH

spcl.inf.ethz.ch

@spcl_eth

Generic model

8

OVERVIEW OF OUR RESEARCH

spcl.inf.ethz.ch

@spcl_eth

Generic model

8

OVERVIEW OF OUR RESEARCH

spcl.inf.ethz.ch

@spcl_eth

Generic model

8

OVERVIEW OF OUR RESEARCH

spcl.inf.ethz.ch

@spcl_eth

Generic model

8

OVERVIEW OF OUR RESEARCH

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

8

OVERVIEW OF OUR RESEARCH

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA Schemes

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA Schemes

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Schemes

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Basic scheme

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processes

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processes

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processes

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

8

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

8

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

8

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

9

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

10

COORDINATED CHECKPOINTING (MP)

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

global

rollback

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

send

coordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

send

coordinated

checkpoint

recv

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

send

coordinated

checkpoint

recv

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

put

coordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

put

coordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

put

coordinated

checkpoint

flush

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

put

coordinated

checkpoint

flush

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

13

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

13

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

13

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A

B

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A

B

actions are

non-blocking

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A

B

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A

B

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B

C

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B

C

D

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B

C

D

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD

E

F

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD

E

F

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD E F

actions are

non-blocking

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD E F

actions are

non-blockingEpoch 0

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD E F

actions are

non-blockingEpoch 0

Epoch 1

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD E F

actions are

non-blockingEpoch 0

Epoch 1

Epoch 2

data will be valid

upon synchronizing

memories

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

X

Y

Z

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

Epoch 0

X

Y

Z

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

Epoch 0

Epoch 1

X

Y

Z

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

B

C

E

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

B

C

E

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

B

C

E

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

B

C

E

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

B

C

B Cput put

E

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

B

C

B Cput put

E

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

B

C

B Cput put

E

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

B

C

B Cput put

E Fget getE

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

B

C

B Cput put

E Fget getE

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

B

C

B Cput put

E Fget getE

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

memory

EC

B

C

B Cput put

E Fget getE

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

memory

EC 0

B

C

B Cput put

E Fget getE

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

memory

EC 1

B

C

B Cput put

E Fget getE

F

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

memory

EC 2

B

C

B Cput put

E Fget getE

F

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

17

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

17

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

17

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

rollback

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

dependency

rollback

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

dependency

rollback

rollback

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

dependency

dependency

rollback

rollback

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

dependency

dependency

rollback

rollback rollback

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

logging a

message

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

local

rollback

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

get the log and

replay the message

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

memoryCC

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

memoryCC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

memoryCC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

memory

Logs of puts

CC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

CC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

CC

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 1

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 1

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 1

1

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 1

1

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

1

spcl.inf.ethz.ch

@spcl_eth

20

RMA: LOGGING GETS

Proc p Proc q

C

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

20

RMA: LOGGING GETS

Proc p Proc q

A

B

C

D

E

F

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

20

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory memoryD

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

memoryD

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

memoryD

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

memoryD

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

memoryD

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

Data is not

yet valid

memoryD

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

Data is not

yet valid

memoryD

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get

Data is not

yet valid

memoryD

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

D1

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

D

1

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

D

1...

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

D

1...

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

DD

1...

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

DD

Data is valid

1...

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

D

Now this data

may be invalid

Data is valid

1...

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

D

Now this data

may be invalid

Data is valid

1...

Logs of gets

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

+ #get

+ EC( )

D

Now this data

may be invalid

Data is valid

1...

Logs of gets

D

1

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

+ #get

+ EC( )

D

Now this data

may be invalid

Data is valid

delete #get

and EC

1...

Logs of gets

D

1

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

EC 1memory

+ #get

+ EC( )

D

Now this data

may be invalid

Data is valid

delete #get

and EC

...Logs of gets

D

1

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

22

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

22

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

22

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

App. dataChpt. data

DataP

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

App. data

DataP

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

App. data

DataP

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

X

Y

Z

App. data

DataP

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

X

Y

Z

App. data

DataP

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

X

Y

App. data

DataP

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

X

Y

App. data

DataP

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X

Y

App. data

DataPX + #put + EC( )0

Y + #put + EC( )0

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X

Y

App. data

DataP

Epoch 0

X + #put + EC( )0

Y + #put + EC( )0

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X

Y

X + #put + EC( )0

Y + #put + EC( )0

App. data

DataP

Epoch 0

X + #put + EC( )0

Y + #put + EC( )0

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X

Y

X + #put + EC( )0

Y + #put + EC( )0

App. data

DataP

Epoch 0

X + #put + EC( )0

Y + #put + EC( )0

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X + #put + EC( )0

Y + #put + EC( )0

App. data

DataPX + #put + EC( )0

Y + #put + EC( )0

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

Stage 2: replay actions beyond the checkpoint

DataP

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

A

B

C

D

E

F

Stage 2: replay actions beyond the checkpoint

DataP

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

A

B

C

D

E

F

Stage 2: replay actions beyond the checkpoint

DataP

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Stage 2: replay actions beyond the checkpoint

DataP

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Stage 2: replay actions beyond the checkpoint

DataP

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )2

F + #get + EC( )2

Stage 2: replay actions beyond the checkpoint

DataP

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )2

F + #get + EC( )2

Stage 2: replay actions beyond the checkpoint

DataP

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )2

F + #get + EC( )2

Stage 2: replay actions beyond the checkpoint

DataP

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

E + #get + EC( )2

F + #get + EC( )2

D + #get + EC( )1

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

E + #get + EC( )2

F + #get + EC( )2

D + #get + EC( )1

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

state

restored!

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

25

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

25

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

25

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

A Cray XE/XT

supercomputer

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

A Cray XE/XT

supercomputer

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

3 chassis: ...

A Cray XE/XT

supercomputer

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

3 chassis: ...

A Cray XE/XT

supercomputer

...8 blades:

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

3 chassis:

4 nodes:

...

...

A Cray XE/XT

supercomputer

...8 blades:

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

3 chassis:

4 nodes:

...

...

...

A Cray XE/XT

supercomputer

...8 blades:

32 cores:

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

A single hardware crash may kill multiple processes

...4 cabinets:

3 chassis:

4 nodes:

...

...

...

A Cray XE/XT

supercomputer

...8 blades:

32 cores:

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

A single hardware crash may kill multiple processes

...4 cabinets:

3 chassis:

4 nodes:

...

...

...

A Cray XE/XT

supercomputer

...8 blades:

32 cores:

Up to 128 process

failures (assuming

1 process per core)

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

A single hardware crash may kill multiple processes

Introduced protocols usually cannot handle > 1 process crash

...4 cabinets:

3 chassis:

4 nodes:

...

...

...

A Cray XE/XT

supercomputer

...8 blades:

32 cores:

Up to 128 process

failures (assuming

1 process per core)

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

G

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

G

A

B

C

Application data

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

G

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Application data

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

G

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Parity data

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups:

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failure

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failure

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failure

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...at every

hierarchy level

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...at every

hierarchy level

...

...

...

...

...

spcl.inf.ethz.ch

@spcl_eth

30

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

spcl.inf.ethz.ch

@spcl_eth

30

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓/𝑑𝑎𝑦:

𝑁𝑟 𝑜𝑓 𝑝𝑎𝑟𝑖𝑡𝑦 𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑠

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

31

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

31

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

31

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

coordinated checkpointing

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

coordinated checkpointing

topology-awareness

uses

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

[2] J.T.Daly, A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems,

Volume 22 Issue 3, February 2006, Pages 303-312

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

usescoordinated checkpointing

Daly’s [2]

formula

topology-awareness

uses

≈ 2𝑀𝑇

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

[2] J.T.Daly, A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems,

Volume 22 Issue 3, February 2006, Pages 303-312

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

usescoordinated checkpointing

Daly’s [2]

formula

topology-awareness

uses

≈ 2𝑀𝑇

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

usesdemand

checkpoints

spcl.inf.ethz.ch

@spcl_eth

33

Evaluation on CSCS Monte Rosa

1,496 computing Cray XE6 nodes

47,872 schedulable cores

46TB memory

4 protocols

2 applications

PERFORMANCE

spcl.inf.ethz.ch

@spcl_eth

34

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

no-FT: no fault tolerance

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

no-FT: no fault tolerance

f-daly: using Daly’s interval

1-5% slower than no-FT

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

no-FT: no fault tolerance

f-daly: using Daly’s interval

1-5% slower than no-FT

f-no-daly: no Daly’s interval

1-15% slower than no-FT

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

SCR: a popular checkpoint

/ restart library

21-67% slower than no-FT

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

no-FT: no fault tolerance

f-daly: using Daly’s interval

1-5% slower than no-FT

f-no-daly: no Daly’s interval

1-15% slower than no-FT

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

35

PERFORMANCE: UNCOORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

35

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: UNCOORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

35

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

How the log size impacts the

number of uncoordinated

checkpoints and the

performance of the code?

PERFORMANCE: UNCOORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

35

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

How the log size impacts the

number of uncoordinated

checkpoints and the

performance of the code?

PERFORMANCE: UNCOORDINATED CHECKPOINTING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

36

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

36

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

spcl.inf.ethz.ch

@spcl_eth

36

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

no-FT: no fault tolerance

spcl.inf.ethz.ch

@spcl_eth

36

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

no-FT: no fault tolerance

FTRMA: logging puts (FFT

code does not use gets)

8-9% slower than no-FT

spcl.inf.ethz.ch

@spcl_eth

36

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

no-FT: no fault tolerance

FTRMA: logging puts (FFT

code does not use gets)

8-9% slower than no-FT

ML: a simple protocol

based on message logging

18% slower than no-FT

spcl.inf.ethz.ch

@spcl_eth

37

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

no-FT: no fault tolerance

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

no-FT: no fault tolerance

f-puts: logging puts

12% slower than no-FT

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

no-FT: no fault tolerance

f-puts: logging puts

12% slower than no-FT

f-puts-gets: logging puts

and gets

33% slower than no-FT

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

no-FT: no fault tolerance

f-puts: logging puts

12% slower than no-FT

ML: logging puts and gets

40% slower than no-FT

f-puts-gets: logging puts

and gets

33% slower than no-FT

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

spcl.inf.ethz.ch

@spcl_eth

Thanks to:

Paul Hargrove (and the whole UPC team)

and the MPI Forum RMA WG …

… and the institutions:

39

ACKNOWLEDGMENTS

spcl.inf.ethz.ch

@spcl_eth

Thank you

for your attention

40

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

memory

Logs of puts

CC

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

memory

Logs of puts

CC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

CC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

CC

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 2

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 2

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 2

2

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 2

2

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

2

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory memory

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

memory

App. data

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

memory

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

memory

Application

state before

the checkpoint

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

memorymemory

Application

state before

the checkpoint

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memorymemory

Application

state before

the checkpoint

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

take a local

checkpointApp. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataPDataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

DataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

DataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

DataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

DataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...Application

state some time

later, after some

communication...

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...Application

state some time

later, after some

communication...

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...State lost!

DataP

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

DataC

Chpt. data

take a local

checkpointApp. data

Chpt. data

...

...State lost!

can clear

the logs

spcl.inf.ethz.ch

@spcl_eth

memory

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

The Epoch Condition

DataC

Chpt. data

App. data

Chpt. data

...

...State lost!

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

Fetch the data

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

...

Fetch the data

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

...

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

...

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

...

DataP

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!DataP

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!DataP

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!DataP

Fetch the data

Decode

the data

DataP

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

DataP

Fetch the data

Decode

the data

DataP

Stage 1: restore the state upon checkpointing

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

memory

Chpt. data

RMA: RECOVERY

DataP

top related