a new transformation scheme based on active replication strategy that tolerates failures

A new transformation scheme based on active replication strategy that tolerates failures Hamoudi Kalla, Alain Girault and Yves Sorel Pop Art team and Aoste team Paris, April 23, 2004


Page 1: A new transformation scheme based on active replication strategy that  tolerates failures

A new transformation scheme based on active replication strategy that tolerates failures

Hamoudi Kalla, Alain Girault and Yves Sorel
Pop Art team and Aoste team

Paris, April 23, 2004

Page 2: A new transformation scheme based on active replication strategy that  tolerates failures

Outline

• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures
  • Both processor and communication media failures
• Example
• Conclusion and future work

Page 3: A new transformation scheme based on active replication strategy that  tolerates failures

Introduction

[Figure: design flow. A compiler turns the high-level program into a model of the algorithm. This model, together with the architecture specification, distribution constraints, execution times, real-time constraints and failure specification, is the input of the fault-tolerant distribution and scheduling heuristic, which produces a fault-tolerant distributed static schedule. A code generator then produces the fault-tolerant distributed code.]

Page 4: A new transformation scheme based on active replication strategy that  tolerates failures

Models: Application algorithm

a. Algorithm graph

« I1 and I2 » are input operations (sensors)

« O » is the output operation (actuator)

« A, B and C » are computation operations

« A → C » is a data dependence

[Figure: algorithm graph over the operations I1, I2, A, B, C and O]

Page 5: A new transformation scheme based on active replication strategy that  tolerates failures

Models: Hardware architecture

b. Architecture graph

« P1, P2 and P3 » are processors

« L12, L13 and L23 » are point-to-point communication links

« B1 » is a multipoint communication link

« com1 and com2 » are communicators

[Figure: two architecture graphs over P1, P2 and P3, one with the point-to-point links L12, L13 and L23 and one with the multipoint link B1; a processor is drawn as an operator, a memory and the communicators com1 and com2]

Page 6: A new transformation scheme based on active replication strategy that  tolerates failures

Models: Component failures

1. Only processors and communication media (point-to-point and multipoint) can fail.

2. Failures can be characterized as transient or permanent.

3. At most a fixed number of processors can fail-stop.

4. At most a fixed number of communication media can fail-stop, partially or completely.

[Figure: three panels (processor failures, partial communication media failures, complete communication media failures) drawn on the processors P1, P2 and P3, connected either by the links L12, L13 and L23 or by the medium m1]

Page 7: A new transformation scheme based on active replication strategy that  tolerates failures

Problem

Find a distributed schedule of the algorithm on the architecture that is fault-tolerant to processor and communication media failures.

[Figure: the algorithm graph (I1, I2, A, B, C, O) and the architecture graph (P1, P2, P3 with links L12, L13, L23) are given to SynDEx* for distribution/scheduling]

* SynDEx is a system-level CAD software tool for optimizing the implementation of real-time embedded applications on multicomponent architectures.

Page 8: A new transformation scheme based on active replication strategy that  tolerates failures

State of the art

“A system is fault tolerant if it can mask the presence of faults in the system by using hardware and/or software redundancy”

(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication media failures

[Figure: the algorithm graph and the architecture graph given to SynDEx for distribution/scheduling, with an extra processor P4 added to the architecture graph]

Page 9: A new transformation scheme based on active replication strategy that  tolerates failures

State of the art

[Figure: the algorithm graph and the architecture graph given to SynDEx for distribution/scheduling, with extra copies of I1 and A added to the algorithm graph]

Page 10: A new transformation scheme based on active replication strategy that  tolerates failures

State of the art

In each item below, (a) refers to approaches for tolerating processor failures and (b) to approaches for tolerating communication media failures.

1. Active software redundancy (Hashimoto et al., 2002 (a); Fragopoulou and Akl, 1995 (b)):

(a) Multiple redundant copies of an operation are scheduled on different processors.

(b) Multiple redundant copies of a message are sent along disjoint paths.

2. Passive software redundancy (Qin et al., 2002 (a); Sriram et al., 1999 (b)):

(a) Each operation is replicated as a primary copy and backup copies, but only the primary is executed.

(b) One copy of the message is sent and, if it fails, another copy is transmitted.
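To make the contrast between the two schemes concrete, here is a minimal Python sketch that compares when a result becomes available under each of them. It is an illustration only, not code from the cited works: the function names, the fixed `timeout` used to detect a failed copy, and the numbers are all assumptions.

```python
def active_result_time(end_times, failed):
    """Active redundancy: every replica runs; the result is ready when
    the earliest replica on a surviving processor finishes."""
    alive = [t for proc, t in end_times.items() if proc not in failed]
    return min(alive) if alive else None


def passive_result_time(processors, exec_time, timeout, failed):
    """Passive redundancy: the primary (first processor) runs alone; each
    failed copy costs one detection timeout (an assumed mechanism)
    before the next backup is started."""
    elapsed = 0.0
    for proc in processors:
        if proc not in failed:
            return elapsed + exec_time
        elapsed += timeout
    return None


# One operation with copies on P1 and P2; P1 fails.
print(active_result_time({"P1": 3.0, "P2": 4.0}, failed={"P1"}))   # 4.0
print(passive_result_time(["P1", "P2"], 3.0, 5.0, failed={"P1"}))  # 8.0
```

In this toy setting the active scheme pays for running every copy but needs no failure detection, whereas the passive scheme adds one detection delay per failed copy.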

Page 11: A new transformation scheme based on active replication strategy that  tolerates failures

Outline

• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures (point-to-point links)
  • Both processor and communication media failures
• Example
• Conclusion and future work

Page 12: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Principle (1):

We use active software redundancy for both operations and communications.

Motivations:

• Makes the recovery from failures bounded.
• Makes the system predictable.
• Easier to integrate into SynDEx.

Page 13: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Principle (2):

[Figure: method overview. A graph transformation takes the algorithm graph (Alg), the number NPF of processor failures and the number NLF of link failures, and produces a new Alg with redundancy and exclusion relations. This new Alg, the architecture graph (Arc) and the real-time and embedding constraints are the inputs of the fault-tolerant distribution and scheduling heuristic (SynDEx), which produces a fault-tolerant distributed real-time executive.]

Page 14: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Algorithm graph transformation (1): Tolerating NPF processor failures

[Figure: (a) initial algorithm sub-graph, where A sends data to B; (b1) final algorithm sub-graph, with NPF+1 replicas of A and NPF+1 replicas of B]
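The replication step can be sketched in a few lines of Python. This is a hedged illustration rather than the SynDEx implementation: the function name and the edge-list representation are hypothetical, and the sketch assumes that every replica of the producer sends its data to every replica of the consumer.

```python
def replicate_operations(edges, npf):
    """Replace every operation X by npf+1 replicas X1..X(npf+1) and every
    data dependence (X, Y) by one dependence per pair of replicas
    (assumption: all producer replicas feed all consumer replicas)."""
    copies = range(1, npf + 2)
    return {
        (f"{src}{i}", f"{dst}{j}")
        for src, dst in edges
        for i in copies
        for j in copies
    }


# Initial sub-graph: A sends data to B; tolerate NPF = 1 processor failure.
for edge in sorted(replicate_operations([("A", "B")], npf=1)):
    print(edge)
```

With NPF = 1 the single dependence from A to B becomes four dependences between A1, A2 and B1, B2.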

Page 15: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Algorithm graph transformation (2): Tolerating NLF link failures

[Figure: (a) initial algorithm sub-graph, where A sends data to B; (b2) final algorithm sub-graph, with one replica of A, one replica of B and NLF+1 replicas of data]

Page 16: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Algorithm graph transformation (3): Tolerating NPF processor and NLF link failures (NPF=1 and NLF=1)

[Figure, five panels: (a) initial algorithm sub-graph, where A sends data to B; (b) operations redundancy, with two replicas of A and two replicas of B; (c) data-dependence redundancy; (d) data-dependence distribution (1); (e) data-dependence distribution (2), drawn with an operation R between a replica of A and the replica Bi]

Page 17: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Algorithm graph transformation (4): Tolerating NPF processor and NLF link failures (NPF=1 and NLF=1)

[Figure: panel (e), data-dependence distribution (2), drawn with the replica A1, the operation R and the replica Bi, followed by two alternatives labelled Case 1 and Case 2, each drawn with A1, A2, R, B1 and B2]

Page 18: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Algorithm graph transformation (5): Tolerating NPF processor and NLF link failures (NPF>=1 and NLF>=1)

[Figure: (a) initial algorithm sub-graph, where A sends data to B; (b) final algorithm sub-graph, with NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R]
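The general transformation can be sketched as follows in Python. The slides only state the counts (NPF+1 replicas of each operation, NLF routing operations R); the wiring used here is an assumption: every replica of A sends directly to every replica of B, and each routing operation is fed by one replica of A, chosen round-robin, and relays the data to every replica of B. The names `transform_dependence` and `R1`, `R2`, ... are hypothetical.

```python
def transform_dependence(src, dst, npf, nlf):
    """Transform one dependence src -> dst into NPF+1 replicas of each
    operation plus NLF routing operations (named R1.. here, hypothetical)."""
    srcs = [f"{src}{i}" for i in range(1, npf + 2)]
    dsts = [f"{dst}{i}" for i in range(1, npf + 2)]
    edges = [(s, d) for s in srcs for d in dsts]   # direct copies
    for k in range(1, nlf + 1):
        router = f"R{k}"
        feeder = srcs[(k - 1) % len(srcs)]         # assumed: round-robin
        edges.append((feeder, router))
        edges += [(router, d) for d in dsts]       # relayed copies
    return edges


edges = transform_dependence("A", "B", npf=1, nlf=1)
inputs_of_b1 = [s for s, d in edges if d == "B1"]
print(inputs_of_b1)  # ['A1', 'A2', 'R1']
```

Under these assumptions each replica of B has NPF+NLF+1 incoming copies of the data, which matches the count given for B1 on the implementation slide.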

Page 19: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

[Figure: method overview with the transformation result. The graph transformation takes the sub-graph in which A sends data to B, together with NPF and NLF, and produces a new Alg with redundancy and exclusion relations: NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R. This new Alg, the architecture graph Arc and the real-time and embedding constraints are the inputs of the fault-tolerant distribution and scheduling heuristic (SynDEx), which produces a fault-tolerant distributed real-time executive.]

Page 20: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Implementation

B1 will receive its input data NPF+NLF+1 times (three times here, since NPF=1 and NLF=1). As soon as it receives the first input, B1 is executed, and it ignores the later inputs:

start time (B1) = min ( end of communication [A1, A2, R] )

[Figure: a transformed algorithm sub-graph (A1, A2, R and B1), an architecture graph (P1, P2, P3, P4 with links L12, L23, L34, L14 and L24) and the temporary schedule produced by SynDEx, in which the data reaches B1 from A1, A2 and R]
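The first-input rule for B1 can be sketched in Python as follows. The class and function names are hypothetical, the times are arbitrary units, and representing a copy lost to a failure by `math.inf` is an assumption made for this illustration.

```python
import math


def start_time(arrivals):
    """Earliest end of communication among the copies that arrive;
    copies lost to a failure are recorded as math.inf."""
    return min(arrivals.values())


class FirstInputWins:
    """Consumer that accepts the first copy of its input and drops the rest."""

    def __init__(self):
        self.value = None
        self.accepted_from = None

    def deliver(self, sender, value):
        if self.accepted_from is not None:
            return False          # a copy was already used: ignore this one
        self.value, self.accepted_from = value, sender
        return True


# End-of-communication times; the copy from A1 is lost.
arrivals = {"A1": math.inf, "A2": 6.0, "R": 9.0}
print(start_time(arrivals))  # 6.0

b1 = FirstInputWins()
for sender in sorted(arrivals, key=arrivals.get):
    if arrivals[sender] < math.inf:
        b1.deliver(sender, "data")
print(b1.accepted_from)  # A2
```

Because B1 starts on the earliest copy, losing one copy only shifts its start time to the next arrival, with no separate detection step.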

Page 21: A new transformation scheme based on active replication strategy that  tolerates failures

Outline

• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures (multipoint links)
  • Both processor and communication media failures
• Example
• Conclusion and future work

Page 22: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

1. We use the active software redundancy of operations: each operation is replicated on NPF+1 different processors to tolerate NPF processor failures.

[Figure: algorithm sub-graph, architecture graph (P1, P2, P3, P4) and temporary schedule, showing the replicas B1 and B2]
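The placement rule stated above can be checked with a short Python sketch. The function name and the dictionary layout (operation name mapped to the processors hosting its replicas) are hypothetical choices for this illustration.

```python
def tolerates_processor_failures(placement, npf):
    """True if every operation has replicas on at least npf+1 distinct
    processors, so that any npf processor failures leave one replica."""
    return all(len(set(procs)) >= npf + 1 for procs in placement.values())


# Hypothetical placement of the replicas of A and B on P1..P4.
placement = {"A": ["P1", "P2"], "B": ["P3", "P4"]}
print(tolerates_processor_failures(placement, npf=1))            # True
print(tolerates_processor_failures({"A": ["P1", "P1"]}, npf=1))  # False
```

The second call fails the check because both replicas of A sit on P1, so a single failure of P1 would lose the operation.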

Page 23: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

2. Use the passive software redundancy of communications.

3. Split each data communication into NLF messages (data fragmentation).
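Data fragmentation can be illustrated with the following Python sketch. The slides do not say how the data is cut, so the near-equal split used here is an assumption, as are the function names; the parameter `n` stands for the number of messages (NLF in the text).

```python
def fragment(data, n):
    """Split data into n consecutive fragments whose sizes differ by at
    most one byte (an assumed splitting policy)."""
    base, extra = divmod(len(data), n)
    fragments, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        fragments.append(data[start:start + size])
        start += size
    return fragments


def reassemble(fragments):
    """Concatenate the fragments in order to recover the original data."""
    return b"".join(fragments)


parts = fragment(b"sensor-reading", 3)
print(parts)
print(reassemble(parts) == b"sensor-reading")
```

Sending the data as several smaller messages means that a failure affects individual fragments rather than the whole communication.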

Page 24: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

Why data fragmentation?

1. Distinction between complete and partial communication link failures.

2. Enables rapid recovery from processor and communication link failures.

Page 25: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

1. Recovery from processor failures

Page 26: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

2. Recovery from partial communication link failures

Page 27: A new transformation scheme based on active replication strategy that  tolerates failures

The proposed fault-tolerant method

3. Recovery from complete communication media failures

Page 28: A new transformation scheme based on active replication strategy that  tolerates failures


Example (1)

Page 29: A new transformation scheme based on active replication strategy that  tolerates failures


Example (2)

Page 30: A new transformation scheme based on active replication strategy that  tolerates failures

Conclusion and future work

Result

A new method to tolerate both communication link and processor failures in distributed real-time systems, which may reduce the overhead of the recovery from failures.

Future work

• Benchmarks.
• Using passive redundancy to tolerate communication link failures.
• Taking into account sensor and actuator failures.

Page 31: A new transformation scheme based on active replication strategy that  tolerates failures

References

[Hashimoto et al., 2002] Hashimoto, K., Tsuchiya, T., and Kikuno, T. (2002). Effective scheduling of duplicated tasks for fault tolerance in multiprocessor systems. IEICE Transactions on Information and Systems.

[Fragopoulou and Akl, 1995] Fragopoulou, P. and Akl, S.G. (1995). Fault tolerant communication algorithms on the star network using disjoint paths. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS’95, Kingston, Canada.

[Qin et al., 2002] Qin, X., Jiang, H., and Swanson, D.R. (2002). An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems. In Proceedings of the 31st International Conference on Parallel Processing, Vancouver, Canada.

[Sriram et al., 1999] Sriram, R., Manimaran, G., and Murthy, C.S.R. (1999). An integrated scheme for establishing dependable real-time channels in multihop networks. In Proc. ICCCN, pages 528–533.

Page 32: A new transformation scheme based on active replication strategy that  tolerates failures

Questions?