Intermediate Presentation(05/04/15)

Autonomous Failure Detection for Supporting Fault Tolerant Parallel Computation

05/04/15, Taura Lab., Master 2nd year, 46432 Yuuki Horita


Background

Large-scale computation runs in parallel on a great number of nodes in distributed environments (Grid) over a long period of time

High failure rate

• Node / Process Failures

• Network Failures

Fault tolerance is becoming more important


Fault tolerant computing

[Figure: flow of fault-tolerant computing, with stages labeled Computing, Failures, Failure Detection, Recovery, Resuming, and "The end…"]


Failure Detection

Heartbeat strategy

[Figure: timeline of processes X and Y. Y sends a heartbeat msg to X every Thb; after the timeout Tto expires, X concludes "Y is probably dead".]

① A process Y sends a message, called a heartbeat, to another process X at a regular time interval Thb

② After Y dies, X receives no more heartbeats from Y

③ X suspects Y once a period of Thb + Tto has passed since the last heartbeat was received
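The three steps above can be sketched in a few lines of Python. This is only an illustration of the timeout rule, not the implementation used in this work; the class name `HeartbeatMonitor`, its methods, and the injectable `clock` argument are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Process X: suspects a peer when no heartbeat arrived for more than Thb + Tto."""

    def __init__(self, thb, tto, clock=time.monotonic):
        self.thb = thb            # heartbeat interval Thb [s]
        self.tto = tto            # extra timeout Tto [s]
        self.clock = clock        # injectable clock (an assumption, eases testing)
        self.last_seen = {}       # peer -> time of the last heartbeat received

    def on_heartbeat(self, peer):
        """Step 1: record the arrival time of a heartbeat from peer."""
        self.last_seen[peer] = self.clock()

    def suspects(self):
        """Step 3: peers whose last heartbeat is older than Thb + Tto."""
        now = self.clock()
        limit = self.thb + self.tto
        return sorted(p for p, t in self.last_seen.items() if now - t > limit)
```

X would call `on_heartbeat("Y")` whenever a heartbeat arrives and poll `suspects()` periodically.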


Objective

To design and implement a failure detection service for supporting fault-tolerant parallel computation


Contributions

Propose a new failure detection approach for fault-tolerant parallel computation

• high autonomy: addresses join/leave of processes; supports Grid environments with less manual configuration

• high consistency: all the processes obtain consistent failure information

• high efficiency: more efficient than other autonomous approaches (the overhead with 313 processes was at most about 2% when the heartbeat interval is 0.1 [s])


Agenda

• Background

• Demands / Related Works

• Our Approach

• Experiments

• Summary


Demands for Failure Detection

System demand (→ Autonomy)

• Adaptability / fault tolerance: address join/leave of processes

• Accessibility: need less manual configuration

Information demand (→ Consistency)

• Consistency: must provide consistent information

Performance demand (→ Efficiency)

• Low overhead: do not degrade application performance

• Low detection latency: report failure events as soon as possible

• Accuracy: few false positives


Hierarchical style

MDS (Globus Project), NWS [R. Wolski ’97, N. T. Spring ’99]

• a single point of failure may lead to system failure

• manual configuration may be cumbersome

→ Autonomy Problem


Gossip style [R. van Renesse ’98]

• utilizes the mechanism of rumor spreading

• each process periodically sends a gossip message (like a heartbeat) to a randomly selected process

• a gossip message includes {node, heartbeat} for all processes

node: a process identifier

heartbeat: the latest time at which some node received node’s heartbeat
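As a rough illustration of the gossip mechanism described above, the following sketch keeps one {node: heartbeat} table per process and merges tables by taking the newer heartbeat for each node. The function names `merge` and `gossip_round` and the table layout are assumptions made for this example; it is not the protocol of the cited work in full.

```python
import random


def merge(local, incoming):
    """Keep, for every node, the most recent heartbeat time seen so far."""
    for node, heartbeat in incoming.items():
        if heartbeat > local.get(node, float("-inf")):
            local[node] = heartbeat
    return local


def gossip_round(tables, now, rng=random):
    """One period: each process refreshes its own entry and gossips once.

    tables: {process name: {node: heartbeat}}, needs at least two processes.
    """
    names = sorted(tables)
    for sender in names:
        tables[sender][sender] = now                       # own heartbeat
        target = rng.choice([n for n in names if n != sender])
        merge(tables[target], tables[sender])              # randomly selected receiver
```

Repeating `gossip_round` spreads every heartbeat to all tables over several periods, which is why detection takes longer than with direct heartbeats.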


Gossip style

• heartbeats are automatically propagated to all processes within a certain amount of time

• each process judges process failures independently

→ Consistency Problem

• it takes longer to detect failures

→ Efficiency Problem


Basic Design

Separation of failure detection and information propagation

Each process is monitored by some processes (Failure-detection phase)

If a process detects process failures, it broadcasts the information (Information-propagation phase)

• the overhead under normal conditions will be low (Efficiency)

• the failure information will be shared (Consistency)


Failure Detection

Each process autonomously acts so that it is always monitored by some processes

Each process

• requests k randomly selected neighbor processes to monitor it (neighbor: directly connectable)

• sends heartbeats to them at a regular time interval Thb

• requests again in the same way if a monitoring process has failed (self-repairing)

[Figure: example monitoring network with k = 2. A → B: A sends heartbeats to B (B monitors A)]
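The selection and self-repairing behavior described on this slide might look like the sketch below. The class `MonitoredProcess` and its methods are hypothetical names; real monitoring requests would be network messages, which are replaced here by plain set updates.

```python
import random


class MonitoredProcess:
    """A process that keeps k randomly chosen neighbors as its monitors."""

    def __init__(self, name, neighbors, k, rng=random):
        self.name = name
        self.neighbors = set(neighbors)   # directly connectable processes
        self.k = k
        self.rng = rng
        self.monitors = set()             # processes that receive our heartbeats
        self.repair()

    def repair(self):
        """Request new random neighbors until k monitors exist (or none are left)."""
        candidates = sorted(self.neighbors - self.monitors)
        missing = self.k - len(self.monitors)
        chosen = self.rng.sample(candidates, min(missing, len(candidates)))
        self.monitors.update(chosen)

    def on_monitor_failed(self, dead):
        """Self-repair: forget the dead monitor and pick a replacement."""
        self.neighbors.discard(dead)
        self.monitors.discard(dead)
        self.repair()
```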


Information Propagation

Flood along the monitoring network

• no need for extra connections

• redundant paths for broadcast (→ fault-tolerant)

• at most 2k messages per process (→ scalable)

Can we guarantee that the monitoring network is connected?
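Flooding along the monitoring network can be illustrated with a breadth-first traversal. In this sketch, `links` is an assumed undirected adjacency map built from the monitoring relations, and every process that learns of a failure forwards the notice once on each of its links.

```python
from collections import deque


def flood(links, origin):
    """Return the set of processes reached and the number of messages sent.

    links: {process: set of processes it shares a monitoring relation with}
    """
    reached = {origin}
    queue = deque([origin])
    messages = 0
    while queue:
        sender = queue.popleft()
        for peer in links[sender]:
            messages += 1                 # one notice per link of the sender
            if peer not in reached:       # duplicates are dropped by the receiver
                reached.add(peer)
                queue.append(peer)
    return reached, messages
```

Because each process forwards only along its own monitoring links, the number of messages it sends is bounded by the number of those links.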


Connectivity of Monitoring Network

We calculated the probability that the monitoring network is disconnected

[Figure: probability of disconnection (log scale, 1.00E-16 to 1.00E+00) versus # of nodes (1 to 36), for k = 1, k = 2, and k = 3]

The probability of disconnection is negligible if k >= 3
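One way to get a feeling for this probability is a Monte Carlo estimate, sketched below. It assumes that every process can connect to every other process directly, and the function names are hypothetical. Sampling cannot resolve the very small probabilities shown for k = 3, so this is not a substitute for the calculation reported here.

```python
import random


def is_connected(n, edges):
    """Depth-first search over an undirected graph on nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


def disconnect_probability(n, k, trials, rng):
    """Estimate P(disconnected) when every node picks k random monitors."""
    bad = 0
    for _ in range(trials):
        edges = []
        for v in range(n):
            others = [u for u in range(n) if u != v]
            edges += [(v, u) for u in rng.sample(others, min(k, len(others)))]
        bad += not is_connected(n, edges)
    return bad / trials
```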


Support Grid Environments

The connectivity between different networks is often limited (e.g. NAT, firewall)

[Figure: Cluster A and Cluster B, each behind a Gateway, labeled "Disconnected!"]


Support Grid Environments

[Figure: two groups of processes, each labeled "K monitoring requests"]

For each process, every one of its neighbor processes should be either monitoring it directly or adjacent to k of its monitoring processes
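The rule above can be written as a check that a process runs over its neighbors. This sketch reads "adjacent to k" as "adjacent to at least k", which is an assumption, and the function and parameter names are hypothetical.

```python
def uncovered_neighbors(neighbors, monitors, adjacency, k):
    """Neighbors of a process that violate the coverage rule.

    neighbors: set of processes directly connectable to the process
    monitors:  set of processes currently monitoring it
    adjacency: {process: set of processes it is directly connectable to}
    """
    violating = []
    for q in sorted(neighbors):
        if q in monitors:
            continue                          # q monitors the process directly
        shared = len(adjacency[q] & monitors)
        if shared < k:                        # q is adjacent to fewer than k monitors
            violating.append(q)
    return violating
```

A process could, for example, send further monitoring requests to the returned neighbors until the list is empty.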

[Figure: worked example with k = 2 on a network of processes numbered 1 to 9. A table of counters for the neighbor processes is updated step by step as monitoring requests are issued, and processes are marked "monitoring directly" or "adjacent to k of its monitoring processes".]


Experiment Environment

ISTBS Cluster (112 nodes × 2 CPU), located at Hongo

• Xeon 2.4 GHz × 70 + Xeon 2.8 GHz × 42

• 105 nodes (7 nodes down)

SHEEP Cluster (65 nodes × 2 CPU), located at Kashiwa

• Xeon 2.4 GHz × 65

• 65 nodes

[Figure: the ISTBS cluster in Hongo and the SHEEP cluster in Kashiwa, connected via the Internet]


Demonstration (Java Applet)

[Figure legend: a node is a process; an edge is a monitoring relation]

• many processes will die concurrently, three times (they turn black and disappear)

• the surviving processes will detect all of the failures (they change color)

• processes will repair the broken monitoring relations (new edges are added)


Connectivity under failures

• simulate the connectivity of the monitoring network under some failures

• check whether the monitoring network is connected when F failures happen concurrently

• 1.8×10^9 trials in each case


Connectivity under failures

Maximum number of failures for which the probability of disconnection is less than p:

# of procs.      10   20   40   80  160
k=3, p=0.01       3    4    5    8   13
k=3, p=0.0001     2    2    3    3    4
k=4, p=0.01       4    6    9   14   24
k=4, p=0.0001     3    4    4    6   10
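A simulation of this kind could be structured as in the sketch below. It assumes all processes are directly connectable, ignores self-repair, and uses far fewer trials than the experiment reported here, so it is not expected to reproduce the numbers in the table. All function names are hypothetical.

```python
import random


def survives(n, k, f, rng):
    """Build a random monitoring network, kill f nodes, test connectivity."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for u in rng.sample([u for u in range(n) if u != v], k):
            adj[v].add(u)
            adj[u].add(v)
    alive = set(range(n)) - set(rng.sample(range(n), f))
    if not alive:
        return True
    start = min(alive)
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()] & alive:    # only surviving nodes relay
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == alive


def max_tolerated_failures(n, k, p, trials, rng):
    """Largest F whose estimated disconnection rate stays below p."""
    f = 0
    while f + 1 < n:
        bad = sum(not survives(n, k, f + 1, rng) for _ in range(trials))
        if bad / trials >= p:
            break
        f += 1
    return f
```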


Efficiency

Measured the execution time of a Fibonacci program under the following autonomous failure detection services

• all-to-all

• Gossip

• ours

Parameters

• # of processes: 2 to 313

• k = 3

• Thb = 0.1, 1.0 [s]
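To see why the three services may differ in cost, it can help to count the messages sent during one heartbeat interval. The sketch below assumes that all-to-all means every process sends a heartbeat to every other process, and it counts messages only; it does not predict the measured execution times. The function name is hypothetical.

```python
def messages_per_interval(n, k):
    """Heartbeat-style messages sent by all n processes during one Thb."""
    return {
        "all-to-all": n * (n - 1),   # every process heartbeats every other one
        "gossip": n,                 # one gossip message per process ...
        "gossip entries": n * n,     # ... each carrying an entry for all n processes
        "ours": n * k,               # every process heartbeats its k monitors
    }
```

For example, `messages_per_interval(313, 3)` gives 97656 messages for all-to-all and 939 for the proposed approach.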


Results (Efficiency)

[Figure: execution time [s] (axis 20 to 27) versus # of processes N (axis 0 to 400) for all-to-all (Thb = 0.1 [s]), gossip (Thb = 0.1 [s]), ours (Thb = 0.1 [s]), and all-to-all (Thb = 1.0 [s]). Annotations: "10% overhead (N = 127)" and "over 5% overhead (N = 153)".]

The overhead is at most around 2%


Summary

Proposed a new failure detection technique for fault-tolerant parallel computation

Showed that

• our system can be autonomously constructed in Grid environments

• our system has high fault tolerance

• it is more efficient than other autonomous approaches


Future Work

• handling network partitioning

• sharing load on dynamic process join

• showing its practicality by implementing a fault-tolerant parallel application using it


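Publications

Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A communication library for supporting fault-tolerant parallel computation in distributed environments. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2004). May 2004. (poster paper, in Japanese)

Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A failure detection mechanism in the Phoenix programming model. Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP2004). July 2004. (in Japanese)

Yuuki Horita, Kenjiro Taura, Takashi Chikayama. An autonomous failure detection mechanism for supporting fault-tolerant parallel computation. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2005). May 2005. (to be presented, in Japanese)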
