Intermediate Presentation(05/04/15) Autonomous Failure Detection for Supporting Fault Tolerant Parallel Computation 05/04/15 Taura Lab. Master 2nd 46432 Yuuki Horita


Intermediate Presentation(05/04/15)

Autonomous Failure Detection for Supporting Fault Tolerant Parallel Computation

05/04/15
Taura Lab. Master 2nd
46432 Yuuki Horita


Background

Large-scale computation runs in parallel on a great number of nodes in distributed environments (Grid) over a long period of time

High failure rate
• Node / Process Failures
• Network Failures

Fault tolerance is becoming more important


Fault tolerant computing

[Figure: stages of fault-tolerant computing, labeled Computing, Failures, Failure Detection, Recovery, Resuming, The end…]


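Failure Detection: Heartbeat strategy

[Figure: timeline of processes X and Y. Y sends a heartbeat msg to X every Thb; after Tto with no heartbeat, X concludes "Y is probably dead"]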

① A process Y sends a message, called a heartbeat, to another process X at a regular time interval Thb

② After Y dies, X receives no more heartbeats from Y

③ X suspects Y once a period of Thb+Tto has passed since the last heartbeat was received
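
The three steps above can be sketched from X's side as follows. This is a minimal illustration, not the presented implementation: the class and method names are hypothetical, and time is passed in explicitly so the example runs without real timers.

```python
class HeartbeatMonitor:
    """X's view of the processes it monitors (illustrative names)."""

    def __init__(self, t_hb, t_to):
        self.t_hb = t_hb      # heartbeat interval Thb [s]
        self.t_to = t_to      # additional timeout Tto [s]
        self.last_seen = {}   # process id -> arrival time of its last heartbeat

    def on_heartbeat(self, pid, now):
        # Step 1: record the arrival time of a heartbeat from pid.
        self.last_seen[pid] = now

    def suspects(self, now):
        # Step 3: suspect a process once Thb + Tto has passed
        # since its last heartbeat was received.
        deadline = self.t_hb + self.t_to
        return sorted(p for p, t in self.last_seen.items() if now - t > deadline)


monitor = HeartbeatMonitor(t_hb=1.0, t_to=0.5)
monitor.on_heartbeat("Y", now=0.0)
monitor.on_heartbeat("Y", now=1.0)
print(monitor.suspects(now=2.0))  # Y is still within Thb + Tto
print(monitor.suspects(now=2.6))  # Y has been silent for longer than Thb + Tto
```

A larger Tto lowers the chance of suspecting a slow but live process, at the cost of a longer detection latency.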


Objective

To design and implement a failure detection service for supporting fault-tolerant parallel computation


Contributions

Propose a new failure detection approach for fault-tolerant parallel computation

• high autonomy
  • address join/leave of procs.
  • support Grid environments with less manual configuration
• high consistency
  • all the procs. obtain consistent failure information
• high efficiency
  • more efficient than other autonomous approaches (the overhead with 313 procs. was at most about 2% where the heartbeat interval is 0.1[s])


Agenda

• Background
• Demands / Related Works
• Our Approach
• Experiments
• Summary



Demands for Failure Detection

System demand (Autonomy)
• Adaptability/Fault-tolerance: address join/leave of processes
• Accessibility: need less manual configuration

Information demand (Consistency)
• Consistency: must provide consistent information

Performance demand (Efficiency)
• Low overhead: don't degrade application performance
• Low detection latency: report failure events as soon as possible
• Accuracy: fewer false positives


Hierarchical style

MDS (Globus Project), NWS [R. Wolski '97, N.T. Spring '99]

• a single point of failure may lead to system failure
• manual configuration may be cumbersome

(Autonomy Problem)


Gossip style [R. Renesse '98]

• utilize the mechanism of rumor spreading
• each process periodically sends a gossip message (like a heartbeat) to a randomly selected process
• a gossip message includes {node, heartbeat} for all processes
  • node: a process identifier
  • heartbeat: the latest time when some node received node's heartbeat
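
As a rough illustration of the gossip mechanism described above, the sketch below keeps one {node: heartbeat} table per process and merges tables by taking the newest heartbeat per node. The function names, the four process names, and the use of a simulated clock are assumptions made for the example; it is not the code of the cited system.

```python
import random


def merge(local, received):
    """Keep, for every node, the newest heartbeat time seen so far."""
    for node, heartbeat in received.items():
        if heartbeat > local.get(node, float("-inf")):
            local[node] = heartbeat


def gossip_round(tables, now, rng):
    """Each process refreshes its own entry and gossips to one random peer."""
    nodes = list(tables)
    for sender in nodes:
        tables[sender][sender] = now
        target = rng.choice([n for n in nodes if n != sender])
        # The gossip message is a copy of the sender's whole table.
        merge(tables[target], dict(tables[sender]))


rng = random.Random(0)
tables = {n: {} for n in "ABCD"}
for step in range(1, 11):
    gossip_round(tables, now=float(step), rng=rng)
print(tables["A"])  # A's view of the latest heartbeat of every process
```

Each process would then judge failures from its own table, which is why different processes can reach different conclusions at a given moment.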


Gossip style

Heartbeats are automatically propagated to all processes in a certain amount of time

• each process judges process failure independently (Consistency Problem)
• it takes longer to detect failures (Efficiency Problem)



Basic Design

Separation of failure detection and information propagation
• Each process is monitored by some processes (Failure-detection phase)
• If a process detects process failures, it broadcasts the information (Information-propagation phase)

• the overhead under normal conditions will be low (Efficiency)

• the failure information will be shared (Consistency)


Failure Detection

Each process autonomously acts so that it is always monitored by some processes

Each process
• requests k randomly selected neighbor processes to monitor it (neighbor: directly connectable)
• sends heartbeats to them at a regular time interval Thb
• requests again in the same way if a monitoring process has failed (self-repairing)

[Figure: example monitoring network with k = 2. A → B: A sends heartbeats to B (B monitors A)]
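
The selection and self-repair steps can be sketched as below. This is a simplified model under stated assumptions: every process is taken to be a neighbor of every other, the set of failed processes is given, and the helper names `choose_monitors` and `repair` are hypothetical.

```python
import random


def choose_monitors(pid, neighbors, k, rng):
    """Pick k random neighbors that will monitor pid (fewer if not enough exist)."""
    candidates = [n for n in neighbors if n != pid]
    return set(rng.sample(candidates, min(k, len(candidates))))


def repair(pid, monitors, neighbors, failed, k, rng):
    """Drop failed monitors and request replacements until k are alive again."""
    alive = monitors - failed
    candidates = [
        n for n in neighbors if n != pid and n not in failed and n not in alive
    ]
    missing = min(k - len(alive), len(candidates))
    return alive | set(rng.sample(candidates, missing))


rng = random.Random(1)
procs = list(range(8))
monitors = {p: choose_monitors(p, procs, k=2, rng=rng) for p in procs}

failed = {3}  # process 3 dies
monitors = {
    p: repair(p, m, procs, failed, k=2, rng=rng)
    for p, m in monitors.items()
    if p not in failed
}
print(monitors)  # every survivor is again monitored by 2 live processes
```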


Information Propagation

flood along the monitoring network
• no need for extra connections
• redundant paths for broadcast (fault-tolerant)
• at most 2k messages per proc. (scalable)

Can we guarantee that the monitoring network is connected?
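
A minimal sketch of flooding along the monitoring network follows. It assumes that a monitoring relation can carry failure information in both directions and that each process forwards a given notice once to every peer and drops duplicates; the function name and the example links are illustrative.

```python
from collections import deque


def flood(links, origin):
    """Broadcast from origin; return the processes reached and messages sent."""
    # Assumption: each monitoring relation is usable in both directions.
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {origin}
    queue = deque([origin])
    messages = 0
    while queue:
        current = queue.popleft()
        for peer in adjacency.get(current, ()):
            messages += 1          # forward once to every peer
            if peer not in seen:   # duplicates are dropped on receipt
                seen.add(peer)
                queue.append(peer)
    return seen, messages


# (A, B) means A sends heartbeats to B, i.e. B monitors A.
links = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 1)]
reached, sent = flood(links, origin=1)
print(reached, sent)  # {1, 2, 3, 4} 10
```

In this model the total number of messages is twice the number of monitoring relations, so the cost of one broadcast grows with k times the number of processes rather than with its square.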


Connectivity of Monitoring Network

We calculated the probability that the monitoring network is disconnected

[Figure: probability of disconnection (log scale, 1.00E-16 to 1.00E+00) vs. # of nodes (1 to 36), for k = 1, k = 2, k = 3]

The probability of disconnection is negligible if k >= 3
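
For very small networks the probability can be checked by brute force. The sketch below is not the calculation used for the plot; it assumes every process can reach every other, treats monitoring relations as undirected links, and enumerates every way the processes can each pick k monitors. The function names are illustrative, and the enumeration is only feasible for a handful of nodes.

```python
from fractions import Fraction
from itertools import combinations, product


def connected(n, edges):
    """Depth-first search over an undirected graph on nodes 0..n-1."""
    adjacency = {p: set() for p in range(n)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for q in adjacency[stack.pop()] - seen:
            seen.add(q)
            stack.append(q)
    return len(seen) == n


def exact_disconnect_probability(n, k):
    """Enumerate every way the n processes can each pick k monitors."""
    choices = [
        list(combinations([q for q in range(n) if q != p], k)) for p in range(n)
    ]
    total = bad = 0
    for picks in product(*choices):
        edges = [(p, q) for p, chosen in enumerate(picks) for q in chosen]
        total += 1
        bad += not connected(n, edges)
    return Fraction(bad, total)


for k in (1, 2, 3):
    print(k, exact_disconnect_probability(n=5, k=k))
```

With 4 processes and k = 1, for example, the only disconnected outcomes are the three ways of splitting the processes into two mutually monitoring pairs, out of 3^4 = 81 outcomes.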


Support Grid Environments

The connectivity between different networks is often limited (e.g. NAT, firewall)

[Figure: Cluster A and Cluster B, each behind a Gateway; the figure is labeled "Disconnected!"]


Support Grid Environments

[Figure: two groups of k monitoring requests]

For each process, each of its neighbor processes should be either monitoring it directly or adjacent to k of its monitoring processes
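
The rule above can be written as a small check. The sketch below reads "adjacent to k" as "adjacent to at least k", which is an assumption, and the two-cluster topology with gateways 1 and 2 is a made-up example rather than the one in the slides.

```python
def satisfies_condition(pid, monitors, adjacency, k):
    """Check the rule for one process pid, given the set of its monitors."""
    for neighbor in adjacency[pid]:
        if neighbor in monitors:
            continue  # this neighbor monitors pid directly
        # Otherwise it must be adjacent to at least k of pid's monitors.
        if len(adjacency[neighbor] & monitors) < k:
            return False
    return True


# Hypothetical topology: 1 and 2 are gateways that can reach everyone;
# 3, 4 (cluster A) and 5, 6 (cluster B) cannot reach the other cluster.
adjacency = {
    1: {2, 3, 4, 5, 6},
    2: {1, 3, 4, 5, 6},
    3: {1, 2, 4},
    4: {1, 2, 3},
    5: {1, 2, 6},
    6: {1, 2, 5},
}

# Gateway 1 monitored only from cluster A: cluster B is left out.
print(satisfies_condition(1, {3, 4}, adjacency, k=2))     # False
# Adding monitors that cluster B can reach satisfies the rule.
print(satisfies_condition(1, {2, 3, 5}, adjacency, k=2))  # True
```

In this example a gateway ends up with more than k monitors, because its monitors have to be reachable from both clusters.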


Support Grid Environments

[Figure: worked example with k = 2 on a network of nodes 1 to 9. Legend: monitoring it directly; adjacent to k of its monitoring processes. Table labels: neighbor processes; monitoring directly]


Support Grid Environments

[Figure: continuation of the k = 2 example on nodes 1 to 9, shown over several steps. Legend: monitoring it directly; adjacent to k of its monitoring processes. Table label: monitoring directly]



Experiment Environment

ISTBS Cluster (112 nodes × 2 CPU)
• Xeon 2.4GHz × 70 + Xeon 2.8GHz × 42
• 105 nodes (7 nodes down) located at Hongo

SHEEP Cluster (65 nodes × 2 CPU)
• Xeon 2.4GHz × 65
• 65 nodes located at Kashiwa

[Figure: SHEEP cluster in Kashiwa and ISTBS cluster in Hongo, connected via the Internet]


Demonstration (Java Applet)

Legend: a process; monitoring

• many processes will die concurrently, three times (turn black and disappear)
• the surviving processes will detect all of the failures (change in color)
• processes will repair the broken monitoring relations (add new edges)


Connectivity under failures

simulate the connectivity of the monitoring network under some failures
• check whether the monitoring network is connected when F failures happen concurrently
• 1.8×10^9 trials in each case


Connectivity under failures

# of procs.       10   20   40   80   160
k=3, p=0.01        3    4    5    8    13
k=3, p=0.0001      2    2    3    3     4
k=4, p=0.01        4    6    9   14    24
k=4, p=0.0001      3    4    4    6    10

Each entry is the maximum number of concurrent failures for which the probability of disconnection is less than p
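
A small-scale version of this kind of simulation might look like the sketch below. It is an illustration only: it assumes every process can reach every other, picks the failed processes uniformly at random, checks connectivity before any self-repair takes place, and uses far fewer trials than the 1.8×10^9 of the actual experiment, so it cannot resolve probabilities as small as those in the table. The function names are hypothetical.

```python
import random


def survivors_connected(n, k, failures, rng):
    """Build a random monitoring network, kill some processes, test connectivity."""
    adjacency = {p: set() for p in range(n)}
    for p in range(n):
        others = [q for q in range(n) if q != p]
        for q in rng.sample(others, min(k, len(others))):
            adjacency[p].add(q)
            adjacency[q].add(p)

    dead = set(rng.sample(range(n), failures))
    alive = [p for p in range(n) if p not in dead]
    if not alive:
        return True

    # Depth-first search restricted to surviving processes.
    seen, stack = {alive[0]}, [alive[0]]
    while stack:
        for q in adjacency[stack.pop()]:
            if q not in dead and q not in seen:
                seen.add(q)
                stack.append(q)
    return len(seen) == len(alive)


def disconnect_rate(n, k, failures, trials, rng):
    """Fraction of trials in which the survivors are not all connected."""
    bad = sum(not survivors_connected(n, k, failures, rng) for _ in range(trials))
    return bad / trials


rng = random.Random(0)
for f in (0, 2, 4, 6):
    print(f, disconnect_rate(n=20, k=3, failures=f, trials=2000, rng=rng))
```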


Efficiency

measured the execution time of a Fibonacci program under the following autonomous failure detection services
• all-to-all
• Gossip
• ours

parameters
• # of processes: 2 ~ 313
• k = 3
• Thb = 0.1, 1.0 [s]
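
To give a feel for why the three services scale differently, the sketch below counts traffic per heartbeat interval under a simple model. The formulas are modeling assumptions, not measurements from the experiment: all-to-all is taken to mean every process heartbeats every other process, gossip sends one message per process carrying an entry for every process, and our approach sends one heartbeat to each of k monitors.

```python
def messages_per_interval(n, k):
    """Messages sent per heartbeat interval under the assumed model."""
    return {
        "all-to-all": n * (n - 1),  # every process heartbeats every other one
        "gossip": n,                # one gossip message per process
        "ours": n * k,              # every process heartbeats its k monitors
    }


def entries_per_interval(n, k):
    """Same comparison, counted in {node, heartbeat} entries carried."""
    return {
        "all-to-all": n * (n - 1),
        "gossip": n * n,            # each gossip message carries n entries
        "ours": n * k,
    }


for n in (2, 100, 313):
    print(n, messages_per_interval(n, k=3), entries_per_interval(n, k=3))
```

At 313 processes and k = 3 this model gives 939 heartbeats per interval for our approach, against 97656 for all-to-all.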


Results (Efficiency)

[Figure: execution time [s] (axis 20 to 27) vs. # of processes (N) (axis 0 to 400). Series: all-to-all (Thb=0.1[s]), gossip (Thb=0.1[s]), ours (Thb=0.1[s]), all-to-all (Thb=1.0[s]). Annotations: 10% overhead (N = 127); over 5% overhead (N = 153)]

The overhead is at most around 2%


Summary

• proposed a new failure detection technique for fault-tolerant parallel computation
• showed that
  • our system could be autonomously constructed in Grid environments
  • our system has high fault-tolerance
  • it is more efficient than other autonomous approaches


Future Work

• handling network partitioning
• sharing load on dynamic process join
• showing its practicality by implementing a fault-tolerant parallel application using it


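Publications

• Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A Communication Library for Supporting Fault-Tolerant Parallel Computation in Distributed Environments. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2004). May 2004. (Poster paper)

• Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A Failure Detection Mechanism in the Phoenix Programming Model. Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP2004). July 2004.

• Yuuki Horita, Kenjiro Taura, Takashi Chikayama. An Autonomous Failure Detection Mechanism for Supporting Fault-Tolerant Parallel Computation. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2005). May 2005. (To be presented)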