A new transformation scheme based on active replication strategy that tolerates failures

Hamoudi Kalla, Alain Girault and Yves Sorel

Pop Art team and Aoste team

Paris, April 23, 2004

Outline

• Introduction

• Model and problem

• State of the art

• The proposed fault-tolerant method for tolerating:

• Processor failures

• Communication media failures

• Both processor and communication media failures

• Example

• Conclusion and future work

Introduction

[Figure: design flow. A compiler turns the high level program into a model of the algorithm. This model, together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, feeds the fault-tolerant distribution and scheduling heuristic, which produces a fault-tolerant distributed static schedule. A code generator then turns this schedule into fault-tolerant distributed code.]

Models : Application algorithm

a. Algorithm graph

« I1 and I2 » are input operations (sensors)

« O » is an output operation (actuator)

« A, B and C » are computation operations

« A → C » is a data dependence

[Figure: algorithm graph with operations I1, I2, A, B, C and O]

Models : Hardware architecture

b. Architecture graph

« P1, P2 and P3 » are processors

« L12, L13 and L23 » are point-to-point communication links

« B1 » is a multipoint communication link

« com1 and com2 » are communicators

[Figure: two architecture graphs over processors P1, P2 and P3: an architecture with point-to-point links (L12, L13, L23) and an architecture with multipoint links (B1). A processor is detailed as a memory, an operator and the communicators com1 and com2.]

Models : Component Failures

1. Only processors and communication media (point-to-point and multipoint) can fail.

2. Failures can be characterized as transient or permanent.

3. At most a fixed number of processors can fail-stop.

4. At most a fixed number of communication media can fail-stop, partially or completely.

[Figure: three panels illustrating processor failures, partial communication media failures and complete communication media failures]

Problem

Find a distributed schedule of the algorithm on the architecture that is fault-tolerant to processor and communication media failures.

[Figure: the algorithm graph (I1, I2, A, B, C, O) and the architecture graph (P1, P2, P3 with links L12, L13, L23) are the inputs of the distribution/scheduling performed by SynDEx*]

*SynDEx is a system-level CAD software tool for optimizing the implementation of real-time embedded applications on multicomponent architectures.

State of the art

“A system is fault tolerant if it can mask the presence of faults in the system by using hardware and/or software redundancy”

[Figure: (a) approaches for tolerating processor failures and (b) approaches for tolerating communication media failures, shown on the algorithm graph and the architecture graph]


State of the art



1. Active software redundancy (Hashimoto et al., 2002 for (a); Fragopoulou and Akl, 1995 for (b)):

(a) Multiple redundant copies of an operation are scheduled on different processors.

(b) Multiple redundant copies of a message are sent along disjoint paths.

2. Passive software redundancy (Qin et al., 2002 for (a); Sriram et al., 1999 for (b)):

(a) Each operation is replicated as a primary copy and backup copies, but only the primary is executed.

(b) One copy of the message is sent and, if it fails, another copy is transmitted.







• Processor failures

• Communication media failures (point-to-point links)

• Both processor and communication media failures

• Example


The proposed fault-tolerant method

Principle (1) :

We use active software redundancy for both operations and communications.

Motivations :

Makes the recovery from failures bounded.

Makes the system predictable.

Easier to integrate into SynDEx.

Principle (2) :

[Figure: the algorithm graph (Alg), the number NPF of processor failures and the number NLF of link failures are the inputs of a graph transformation, which produces a new Alg with redundancy and exclusion relations. SynDEx then takes this new Alg, the architecture graph (Arc) and the real-time and embedding constraints, and generates a fault-tolerant distributed real-time executive.]

Algorithm graph transformation (1) : Tolerating NPF processor failures

[Figure: (a) initial algorithm sub-graph: A → B, with data dependence « data »; (b1) final algorithm sub-graph: NPF+1 replicas of A and NPF+1 replicas of B]
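The replication step for processor failures can be sketched in a few lines of Python. This is only an illustration: the function name `replicate_operations`, the replica naming (A1, A2, ...) and the choice to connect every replica of a producer to every replica of its consumer are assumptions, since the original figure does not survive in the text.

```python
def replicate_operations(operations, dependences, npf):
    """Replicate each operation npf + 1 times (hypothetical helper).

    operations  -- list of operation names, e.g. ["A", "B"]
    dependences -- list of (producer, consumer) pairs, e.g. [("A", "B")]
    npf         -- number of processor failures to tolerate
    """
    # Each operation gets npf + 1 replicas named A1, A2, ...
    replicas = {op: [f"{op}{k}" for k in range(1, npf + 2)] for op in operations}
    # Assumption: every consumer replica listens to every producer replica.
    edges = [
        (src_rep, dst_rep)
        for src, dst in dependences
        for src_rep in replicas[src]
        for dst_rep in replicas[dst]
    ]
    return replicas, edges


replicas, edges = replicate_operations(["A", "B"], [("A", "B")], npf=1)
```

With NPF=1 this yields two replicas of A and two replicas of B. Placing the replicas of one operation on different processors is left to the scheduler and is not modelled here.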

Algorithm graph transformation (2) : Tolerating NLF link failures

[Figure: (a) initial algorithm sub-graph: A → B, with data dependence « data »; (b2) final algorithm sub-graph: one replica of A, one replica of B and NLF+1 replicas of « data »]
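As a rough illustration of this second transformation, the sketch below replicates a single data dependence NLF+1 times while leaving the operations untouched. The function name `replicate_dependence` and the copy labels (data1, data2, ...) are hypothetical; routing each copy over a different link is assumed to be the scheduler's job and is not shown.

```python
def replicate_dependence(src, dst, label, nlf):
    """Return nlf + 1 copies of the data dependence src -> dst (hypothetical helper)."""
    return [(src, dst, f"{label}{k}") for k in range(1, nlf + 2)]


copies = replicate_dependence("A", "B", "data", nlf=1)
```

With NLF=1, B has two incoming copies of « data », so the loss of one link still leaves one copy that can reach B.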

Algorithm graph transformation (3) : Tolerating NPF processor and NLF link failures (NPF=1 and NLF=1)

[Figure: five steps. (a) Initial algorithm sub-graph: A → B, with data dependence « data ». (b) Operations redundancy: two replicas of A and two replicas of B. (c) Data-dependence redundancy. (d) Data-dependence distribution (1). (e) Data-dependence distribution (2), with a routing operation R.]

Algorithm graph transformation (4) : Tolerating NPF processor and NLF link failures (NPF=1 and NLF=1)

[Figure: step (e), data-dependence distribution (2), expanded into two cases (Case 1 and Case 2), each with two replicas of A (A1, A2), two replicas of B (B1, B2), a routing operation R and the data dependence « data »]

Algorithm graph transformation (5) : Tolerating NPF processor and NLF link failures (NPF>=1 and NLF>=1)

[Figure: (a) initial algorithm sub-graph: A → B, with data dependence « data »; (b) final algorithm sub-graph: NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R]
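The general case can be sketched as follows. The slides only state the counts (NPF+1 replicas of each operation, NLF routing operations R), so the wiring below is an assumption made for illustration: every replica of B listens to every replica of A and to every routing operation, and routing operation Rk is fed by one replica of A chosen round-robin. The names `transform` and `incoming_copies` are hypothetical, and the exclusion relations are not modelled.

```python
def transform(src, dst, npf, nlf):
    """Build the redundant sub-graph for one dependence src -> dst (illustrative)."""
    src_reps = [f"{src}{k}" for k in range(1, npf + 2)]   # NPF+1 replicas of src
    dst_reps = [f"{dst}{k}" for k in range(1, npf + 2)]   # NPF+1 replicas of dst
    routers = [f"R{k}" for k in range(1, nlf + 1)]        # NLF routing operations
    edges = []
    # Direct copies: each consumer replica hears each producer replica.
    for d in dst_reps:
        for s in src_reps:
            edges.append((s, d))
    # Routed copies: each router relays the data to all consumer replicas.
    for i, r in enumerate(routers):
        feeder = src_reps[i % len(src_reps)]  # assumption: round-robin feeder
        edges.append((feeder, r))
        for d in dst_reps:
            edges.append((r, d))
    return src_reps, dst_reps, routers, edges


def incoming_copies(edges, node):
    """Count how many copies of the data reach the given node."""
    return sum(1 for _, d in edges if d == node)


a_reps, b_reps, routers, edges = transform("A", "B", npf=1, nlf=1)
```

Under these assumptions each replica of B has NPF+NLF+1 incoming copies of the data, which is three copies for NPF=1 and NLF=1.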

[Figure: Principle (2) revisited. The graph transformation takes the sub-graph A → B (data dependence « data »), NPF processor failures and NLF link failures, and yields the new Alg with redundancy and exclusion relations: NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R. SynDEx combines it with the architecture graph Arc and the real-time and embedding constraints to produce the fault-tolerant distributed real-time executive.]

Implementation

B1 receives its input data NPF+NLF+1 times (here NPF=1 and NLF=1). As soon as it receives the first input, B1 is executed, and it ignores the later inputs:

start time (B1) = min ( end communication [A1, A2, R] )

[Figure: a transformed algorithm sub-graph (two replicas of A: A1 and A2; two replicas of B; routing operation R; data dependence « data ») and an architecture graph (processors P1 to P4; links L12, L14, L23, L24, L34) are given to SynDEx, which produces a temporary schedule showing A1, A2, R, B1 and the « data » communications over time]
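The "first input wins" rule can be illustrated with a small Python sketch. The arrival times are made-up values, `None` stands for a copy that never arrives because of a failure, and the function name `start_time` is hypothetical.

```python
def start_time(arrival_times):
    """Earliest end of communication among the redundant copies that arrived.

    arrival_times maps each sender (e.g. "A1", "A2", "R") to the time at
    which its copy of the data finishes arriving, or None if it is lost.
    """
    received = [t for t in arrival_times.values() if t is not None]
    if not received:
        # More failures occurred than the redundancy was sized for.
        raise RuntimeError("no copy of the input data arrived")
    return min(received)


# Illustrative end-of-communication times for the three copies sent to B1.
arrivals = {"A1": 7.0, "A2": 5.0, "R": 9.0}
b1_start = start_time(arrivals)

# Same scenario when the copy from A2 is lost: B1 starts on the next copy.
b1_start_after_failure = start_time({"A1": 7.0, "A2": None, "R": 9.0})
```

Because B1 never waits for a timeout before switching to another copy, the delay caused by a failure is bounded by the arrival time of the slowest surviving copy.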






• Processor failures

• Communication media failures (multipoint links)

• Both processor and communication media failures

• Example


1. We use the active software redundancy of operations: each operation is replicated on NPF+1 different processors to tolerate NPF processor failures.

[Figure: algorithm sub-graph, architecture graph (P1, P2, P3, P4) and temporary schedule with replicas B1 and B2]


2. We use the passive software redundancy of communications.

3. We split each data communication into NLF messages (data fragmentation).
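Data fragmentation can be pictured with the sketch below, which cuts a payload into NLF messages and reassembles them. The helper names `fragment` and `reassemble` and the choice of contiguous, near-equal chunks are assumptions for illustration; the slides do not say how the data is cut.

```python
from typing import List


def fragment(data: bytes, nlf: int) -> List[bytes]:
    """Split data into nlf contiguous fragments (hypothetical helper)."""
    if nlf < 1:
        raise ValueError("nlf must be at least 1")
    size = -(-len(data) // nlf)  # ceiling division: size of each fragment
    return [data[i * size:(i + 1) * size] for i in range(nlf)]


def reassemble(fragments: List[bytes]) -> bytes:
    """Rebuild the original data from its fragments, in order."""
    return b"".join(fragments)


parts = fragment(b"0123456789", 3)
restored = reassemble(parts)
```

A receiver that gets some fragments but not others can tell that the medium failed only partially, which is one way to read the distinction between partial and complete failures made in this method.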

Why data fragmentation ?

1. Distinction between complete and partial communication link failures

2. Rapid recovery from processor and communication link failures

1. Recovery from processor failures

2. Recovery from partial communication link failures

3. Recovery from complete communication media failures

Example (1)

Example (2)

Conclusion and future work

Result

A new method to tolerate both communication link and processor failures in distributed real-time systems, which may reduce the overhead of the recovery from failures.

Future work

Benchmarks.

Using passive redundancy to tolerate communication link failures.

Taking into account sensor and actuator failures.

References

[Hashimoto et al., 2002]. Hashimoto, K., Tsuchiya, T., and Kikuno, T. (2002). Effective scheduling of duplicated tasks for fault tolerance in multiprocessor systems. IEICE Transactions on Information and Systems.

[Fragopoulou and Akl, 1995]. Fragopoulou, P. and Akl, S.G. (1995). Fault tolerant communication algorithms on the star network using disjoint paths. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS’95, Kingston, Canada.

[Qin et al., 2002]. Qin, X., Jiang, H., and Swanson, D.R. (2002). An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems. In Proceedings of the 31st International Conference on Parallel Processing, Vancouver, Canada.

[Sriram et al., 1999]. Sriram, R., Manimaran, G., and Murthy, C.S.R. (1999). An integrated scheme for establishing dependable real-time channels in multihop networks. In Proc. ICCCN, pages 528–533.

Questions ?
