Fault-Tolerant Networks and Fault-Tolerant Routing
Soner Dedeoğlu, 10/12/2015


Outline
• Introduction
• Measures of Resilience
  ▫ Graph-Theoretical Measures
  ▫ Computer Network Measures
• Common Network Topologies and Their Resilience
  ▫ Multi-stage and Extra-Stage Networks
  ▫ Crossbar Networks
  ▫ Regular Mesh and Interstitial Mesh
  ▫ Hypercube Network
• Fault-Tolerant Routing
  ▫ Hypercube Fault-Tolerant Routing
  ▫ Origin-Based Routing in the Mesh


Types of Interconnection Networks
• Shared-memory multiprocessor systems: the network connects processors and memories
  ▫ Processors read from or write into memories.
• Distributed systems: the network connects processors
  ▫ Each processor has its own local memory.
  ▫ Processors communicate through messages while executing parts of a common application.
• Wide-area networks: the network connects a large number of processors that operate independently
  ▫ Information is shared through packets.
  ▫ Communication passes over switchboxes and routers.


Network Topology
• Definition: the organization of the network
• One or more paths between the source and the destination
• Uni- or bi-directional links and switchboxes
• If only a single path exists between sender and receiver, one fault along that path disconnects the two communicating terminals.
• Fault tolerance is achieved by having multiple paths and/or spare units.


Graph-Theoretical Measures
• Node (link) connectivity
  ▫ Minimum number of nodes (links) that must be removed to disconnect the graph
• Distance between nodes
  ▫ Smallest number of links on a path between them
• Diameter
  ▫ Longest distance between any two nodes in the graph
• Diameter stability
  ▫ Rate of increase in diameter due to faulty nodes
  ▫ Persistence: smallest number of nodes that must fail in order to increase the diameter of the graph
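
The distance and diameter measures above can be illustrated with a short breadth-first search. This is only a sketch: the adjacency-dictionary graph format, the helper names (`distances_from`, `diameter`, `remove_node`) and the six-node ring used as an example are assumptions made for illustration, not part of the original slides.

```python
from collections import deque

def distances_from(graph, source):
    """Breadth-first search: smallest number of links from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(graph):
    """Longest distance in the graph, or None if the graph is disconnected."""
    longest = 0
    for source in graph:
        dist = distances_from(graph, source)
        if len(dist) < len(graph):
            return None
        longest = max(longest, max(dist.values()))
    return longest

def remove_node(graph, failed):
    """Copy of the graph with one (hypothetical) failed node removed."""
    return {u: [v for v in nbrs if v != failed]
            for u, nbrs in graph.items() if u != failed}

# Hypothetical example: a ring of six nodes.
ring6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(diameter(ring6), diameter(remove_node(ring6, 0)))
```

In this ring a single node failure already lengthens the longest distance, so its persistence in the sense defined above is 1.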


Computer Network Measures
• Reliability: R(t)
  ▫ Probability that all nodes are operational and can communicate during the time interval [0, t]
  ▫ Path reliability: the same definition for a specific sender-receiver pair
• Bandwidth
  ▫ Maximum rate of flow of messages
• Connectability: Q(t)
  ▫ Expected number of source-destination pairs still connected at time t in the presence of faults


Types of Topologies
• TYPE-1
  ▫ Input and output nodes are connected through links and switchboxes (multi-stage networks, crossbar).
  ▫ Resilience measures: bandwidth, connectability
• TYPE-2
  ▫ Nodes are connected directly by links, not through switchboxes. Nodes are computational units and also serve as switches (mesh, hypercube).
  ▫ Resilience measure: (path) reliability


Multi-stage Networks
• Butterfly network
  ▫ Built out of 2x2 switches, each with two input and two output ports
  ▫ Switch settings: straight, cross, upper broadcast, lower broadcast


Multi-stage Networks (cont’d)
• k-stage butterfly network
  ▫ $2^k$ inputs and $2^k$ outputs
  ▫ $2^{k-1}$ switches in each stage
  ▫ Connections follow a recursive pattern from input to output.
  ▫ There is only one path from any given input to a specific output (NOT FAULT TOLERANT).
  ▫ If a failure occurs, the system still operates, but in a degraded mode.


Extra-stage Networks
• Stage 0 is duplicated at the input of the network.
• Bypass multiplexers are placed around the switchboxes at the input and output stages. A failed switch can be bypassed by routing around it.
• FAULT TOLERANT up to one faulty switchbox anywhere in the system.


Resilience Analysis of Butterfly: Bandwidth Calculation (without failures)
• Bandwidth
  ▫ Expected number of access requests from the processors that reach the memories
• Assumptions
  ▫ In each cycle, each processor generates with probability $p_r$ a request to a memory module, directed to any of the N memory modules with equal probability 1/N.
  ▫ Requests in each cycle are independent of requests in previous cycles.
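
The derivation that followed on the original slides did not survive extraction. As a hedged illustration of how the assumptions above can be used, the sketch below follows a common stage-by-stage argument for networks of 2x2 switches: if each switch input carries a request with probability p and the request is equally likely to want either output, then an output carries a request with probability 1 - (1 - p/2)^2. The function name, the assumption that blocked requests are simply dropped, and the recursion itself are illustrative and may differ from the author's derivation.

```python
def butterfly_bandwidth(k, p_r):
    """Expected number of requests reaching the memories in a fault-free
    k-stage butterfly with N = 2**k inputs (sketch under stated assumptions)."""
    n = 2 ** k
    p = p_r  # probability that a network input line carries a request
    for _ in range(k):
        # An output line of a 2x2 switch is busy unless neither input
        # sends a request toward it (each does so with probability p/2).
        p = 1.0 - (1.0 - p / 2.0) ** 2
    return n * p

print(butterfly_bandwidth(3, 1.0))
```

With k = 1 and p_r = 1 the single switch delivers 1.5 requests per cycle on average, since both processors pick the same memory half of the time.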


Resilience Analysis of Butterfly: Bandwidth Calculation (including failures)


Resilience Analysis of Butterfly: Connectability (including failures)
• Connectability
  ▫ Expected number of connected sender-receiver pairs
• Senders and receivers are fault-free.
• Exactly one path exists between each pair.
• Each path has k+1 links and k switchboxes.
• Probabilities of link and switchbox failures
  ▫ $q_l$ and $q_s$, respectively
• Probabilities of a fault-free link and a fault-free switchbox
  ▫ $p_l = 1 - q_l$ and $p_s = 1 - q_s$, respectively
• Probability of a fault-free path
  ▫ $p_l^{k+1} p_s^{k}$
• There are $2^{2k} = N^2$ sender-receiver pairs, so
  $Q = 2^{2k} \, p_l^{k+1} p_s^{k} = N^2 \, p_l^{k+1} p_s^{k}$
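
A minimal numerical sketch of the connectability formula above; the function name and the sample failure probabilities are illustrative assumptions.

```python
def butterfly_connectability(k, q_l, q_s):
    """Q = 2^(2k) * p_l^(k+1) * p_s^k for a k-stage butterfly."""
    p_l = 1.0 - q_l  # probability that a link is fault-free
    p_s = 1.0 - q_s  # probability that a switchbox is fault-free
    pairs = 2 ** (2 * k)  # N^2 sender-receiver pairs
    return pairs * p_l ** (k + 1) * p_s ** k

# Hypothetical values: 3 stages, 1% link and 2% switchbox failure probability.
print(butterfly_connectability(3, 0.01, 0.02))
```

Because every pair depends on a single path, Q falls geometrically with the number of stages even for small failure probabilities.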


Resilience Analysis of Extra-Stage: Connectability
• Each processor-memory pair is connected by two disjoint paths.
  ▫ Probability of at least one fault-free path = Prob{1st path is fault-free} + Prob{2nd path is fault-free} – Prob{both paths are fault-free}
• This probability can assume one of the following two expressions:
  $A = (1 - q_l^2) \, p_l^{k} \, (1 - q_l^2) + p_l^{k+2} - p_l^{2}$
  $B = 2 (1 - q_l^2) \, p_l^{k+1} - p_l^{2k+2} (1 - q_l^2)^2$
• $(1 - q_l^2)$ is the probability that, for a switchbox with a bypass multiplexer, at least one link is operational.
• There are $2^{2k} = N^2$ sender-receiver pairs, so
  $Q = (A + B) \, 2^{2k} / 2 = (A + B) \, N^2 / 2$


Crossbar Networks
• There is a switchbox for each sender-receiver pair.
• NxM crossbar:
  ▫ N senders, M receivers, NM switches
• The (i, j) switchbox connects the row i input to the column j output and is capable of
  ▫ propagating a message along the row from the left link to the right link
  ▫ propagating a message along the column from the bottom link to the top link
  ▫ turning a message from the left link to the top link
• Failure of any switchbox will disconnect certain pairs (NOT FAULT TOLERANT).


Redundant Crossbars
• A row and a column of switches are added.
• Input and output connections are augmented.
  ▫ Each input can be sent to either of two rows, and each output can be received on either of two columns.
• If a switch becomes faulty, the row and column to which it belongs are replaced by the spare row and column (FAULT TOLERANT).


Resilience Analysis of Crossbar: Connectability
• Probability that a link is faulty: $q_l$
• Probability that a link is fault-free: $p_l = 1 - q_l$
• The probability of switchbox failures is included in the link failure probabilities.
• For input i to be connected to output j, the message has to go through i+j links.

$$Q = \sum_{i=1}^{N} \sum_{j=1}^{M} p_l^{\,i+j} = \frac{1 - p_l^{N}}{1 - p_l} \cdot \frac{1 - p_l^{M}}{1 - p_l} \, p_l^{2}$$
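
Since input i reaches output j over i+j links, the crossbar connectability is a double sum of $p_l^{i+j}$, which factors into two geometric series. The sketch below compares the direct sum with that closed form; the function names and sample values are illustrative assumptions.

```python
def crossbar_q_sum(n, m, q_l):
    """Direct double sum of p_l^(i+j) over all N x M input-output pairs."""
    p = 1.0 - q_l
    return sum(p ** (i + j) for i in range(1, n + 1) for j in range(1, m + 1))

def crossbar_q_closed(n, m, q_l):
    """Closed form obtained by factoring the sum into two geometric series."""
    p = 1.0 - q_l
    if q_l == 0.0:
        return float(n * m)  # fault-free links: every pair is connected
    return p ** 2 * (1.0 - p ** n) * (1.0 - p ** m) / (1.0 - p) ** 2

# Hypothetical 4x3 crossbar with a 10% link failure probability.
print(crossbar_q_sum(4, 3, 0.1), crossbar_q_closed(4, 3, 0.1))
```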

Mesh Networks
• Two-dimensional NxM rectangular mesh network
  ▫ All nodes are computing nodes.
  ▫ There are no separate switchboxes.
• Message sending
  ▫ A path from source to destination is identified and the message is forwarded along that path.
• Conventional mesh reliability
  ▫ $R_{mesh}(t) = [R(t)]^{NM}$
  ▫ $R(t)$: reliability of a single node
• NOT FAULT TOLERANT


Interstitial Mesh Networks
• (1, 4) interstitial redundancy
  ▫ A spare node can be switched in to replace any failed neighbor.
  ▫ Each primary node has a single spare node, while each spare node serves as a spare for four primary nodes.
  ▫ Redundancy overhead = 25%
• (4, 4) interstitial redundancy
  ▫ Each primary node has four spare nodes.
  ▫ Each spare node is a spare for four primary nodes.
  ▫ Redundancy overhead = 100%


Resilience Analysis of Interstitial Mesh: Reliability
• (1, 4) interstitial mesh
  ▫ The mesh is of size NxM, where N and M are even numbers.
  ▫ Cluster: four primary nodes with one spare
  ▫ The mesh has NM/4 clusters in total.
  ▫ $R(t)$: reliability of a single primary or spare node
  ▫ Reliability of a cluster: $R_{cluster}(t) = R^5(t) + 5 R^4(t) [1 - R(t)]$
  ▫ Reliability of the mesh: $R_{mesh}(t) = [R_{cluster}(t)]^{NM/4}$
• (4, 4) interstitial mesh
  ▫ There is no simple algorithm to calculate the reliability of a (4, 4) interstitial mesh.
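
To see what the single-spare clusters buy, the sketch below evaluates the conventional mesh formula $[R(t)]^{NM}$ next to the cluster-based formula above for one value of the node reliability. The function names and the sample numbers are illustrative assumptions.

```python
def mesh_reliability(n, m, r):
    """Conventional N x M mesh: all NM nodes must be operational."""
    return r ** (n * m)

def cluster_reliability(r):
    """Four primaries plus one spare: at most one of the five nodes may fail."""
    return r ** 5 + 5 * r ** 4 * (1.0 - r)

def interstitial_mesh_reliability(n, m, r):
    """Mesh built from NM/4 independent clusters (N and M must be even)."""
    if n % 2 or m % 2:
        raise ValueError("N and M must be even")
    return cluster_reliability(r) ** (n * m // 4)

# Hypothetical 4x4 mesh whose nodes each have reliability 0.99 at time t.
print(mesh_reliability(4, 4, 0.99), interstitial_mesh_reliability(4, 4, 0.99))
```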


Hypercube Networks
• $H_n$: an n-dimensional hypercube network with $2^n$ nodes
• A 0-dimensional hypercube $H_0$ is a single node.
• $H_n$ is constructed by connecting the corresponding nodes of two $H_{n-1}$ networks.
• The edges added to connect the corresponding nodes are called dimension-(n-1) edges.
• Each node in an n-dimensional hypercube has n edges incident upon it.
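
The recursive construction above is often expressed with binary node labels: a dimension-d edge joins two labels that differ only in bit d. The sketch below assumes that labeling (nodes as integers 0 to 2^n - 1); the function names are illustrative.

```python
def hypercube_neighbors(node, n):
    """Neighbors of a node in H_n: flip each of the n address bits in turn."""
    return [node ^ (1 << d) for d in range(n)]

def hypercube_edges(n):
    """All edges of H_n as (dimension, low endpoint, high endpoint) triples."""
    edges = []
    for node in range(2 ** n):
        for d in range(n):
            other = node ^ (1 << d)
            if node < other:  # list each edge once
                edges.append((d, node, other))
    return edges

def hamming_distance(a, b):
    """Number of differing address bits, i.e. links on a shortest path."""
    return bin(a ^ b).count("1")

print(hypercube_neighbors(0b000, 3), len(hypercube_edges(3)))
```

Each node has n neighbors, so H_n has n * 2^(n-1) edges, of which 2^(n-1) belong to each dimension.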


Fault Tolerance in Hypercubes
• $H_n$ (n ≥ 2) can tolerate link failures because multiple paths exist between source and destination.
• Node failures can disrupt the operation.
• Adding fault tolerance to overcome node failures
  ▫ Adding one or more spare nodes
  ▫ Increasing the number of communication ports of each original node from n to n+1
  ▫ Connecting the extra ports through additional links to the spare node
  ▫ Using crossbar switches with outputs connected to the spare node reduces the number of ports of the spare node to n+1.


Resilience Analysis of Hypercube: Reliability
• Links and nodes all fail independently.
• The reliability of $H_n$ is the product of
  ▫ the reliability of the $2^n$ nodes, and
  ▫ the probability that every node can communicate with every other node.
• Exact evaluation of this probability is difficult because multiple paths connect each source-destination pair.
• A lower bound on the reliability can be evaluated as the sum of the probabilities of three mutually exclusive cases in which the network is connected.


Resilience Analysis of Hypercube: Reliability (cont’d)
• $H_n$ is decomposed into two $H_{n-1}$ hypercubes, A and B, and the dimension-(n-1) links connecting them.
• CASE-1:
  ▫ Both A and B are operational and at least one dimension-(n-1) link is functional.
• CASE-2:
  ▫ One of A, B is operational and the other is not, and all dimension-(n-1) links are functional.
• CASE-3:
  ▫ Only one of A, B is operational, exactly one dimension-(n-1) link is faulty, and that link is connected in the nonoperational $H_{n-1}$ to a node that has at least one functional link to another node.


Resilience Analysis of Hypercube: Reliability (cont’d)
• Notation
  ▫ $q_c$: probability of a node failure
  ▫ $q_l$: probability of a link failure
  ▫ $NR(H_n, q_l, q_c)$: reliability of hypercube $H_n$
• Assumption
  ▫ Nodes are perfectly reliable ($q_c = 0$).
• CASE-1:
  $\text{Prob\{Case-1\}} = [NR(H_{n-1}, q_l, 0)]^2 \, (1 - q_l^{2^{n-1}})$
• CASE-2:
  $\text{Prob\{Case-2\}} = 2 \, NR(H_{n-1}, q_l, 0) \, [1 - NR(H_{n-1}, q_l, 0)] \, (1 - q_l)^{2^{n-1}}$
• CASE-3:
  $\text{Prob\{Case-3\}} = 2 \, NR(H_{n-1}, q_l, 0) \, [1 - NR(H_{n-1}, q_l, 0)] \, 2^{n-1} q_l (1 - q_l)^{2^{n-1} - 1} \, (1 - q_l^{\,n-1})$
• $NR(H_n, q_l, 0) = \text{Prob\{Case-1\}} + \text{Prob\{Case-2\}} + \text{Prob\{Case-3\}}$
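
The three case probabilities define a recursion on the dimension, which the sketch below evaluates bottom-up exactly as the expressions are written above. The starting value NR(H_0) = 1 (a single, perfectly reliable node) and the function name are assumptions made for this illustration.

```python
def hypercube_reliability(n, q_l):
    """Evaluate the three-case recursion for NR(H_n, q_l, 0) as stated above."""
    nr = 1.0  # assumed base case: H_0 is one perfectly reliable node
    for dim in range(1, n + 1):
        links = 2 ** (dim - 1)  # number of dimension-(dim-1) links
        prev = nr               # NR(H_{dim-1}, q_l, 0)
        one_half_down = 2.0 * prev * (1.0 - prev)
        case1 = prev ** 2 * (1.0 - q_l ** links)
        case2 = one_half_down * (1.0 - q_l) ** links
        case3 = (one_half_down * links * q_l * (1.0 - q_l) ** (links - 1)
                 * (1.0 - q_l ** (dim - 1)))
        nr = case1 + case2 + case3
    return nr

# Hypothetical link failure probability of 1% in a 4-dimensional hypercube.
print(hypercube_reliability(4, 0.01))
```

For n = 1 the recursion reduces to 1 - q_l, the probability that the single link between the two nodes works.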


Concepts
• Objective:
  ▫ Get a message from source to destination despite a subset of the network being faulty.
• Basic idea:
  ▫ If the shortest or most convenient path is unavailable because of failures, reroute the message through other paths to the destination.
• Unicast routing:
  ▫ A message is sent from a source to just one destination.
• Multicast routing:
  ▫ Copies of a message are sent to a number of nodes.


Classification of Routing Algorithms
• Centralized routing
  ▫ A central controller knows the network state (faulty links or nodes, congested links) and selects the path for each message.
• Distributed routing
  ▫ Each intermediate node decides which node to send the message to next.
• Unique routing
  ▫ One path for each source-destination pair
• Adaptive routing
  ▫ The path is selected according to network conditions (congestion).


Hypercube Fault-Tolerant Routing
• Basic idea:
  ▫ List the dimensions along which the packet must travel, and traverse them one by one.
  ▫ As edges are traversed, they are crossed off the list.
  ▫ If, due to a link or node failure, the desired link is not available, another edge in the list, if any, is chosen for traversal.
  ▫ If the packet arrives at some node to find all dimensions on its list down, it backtracks to the previous node and tries again.


Hypercube Fault-Tolerant Routing (cont’d)
• TD: list of dimensions that the message has traveled on, in order of traversal
• $TD^R$: the list TD in reversed order
• $\bigoplus_{i=1}^{k}$: exclusive-or operation carried out k times sequentially
• D: destination
• S: source
• $d = D \oplus S$
• SR(A): the set of relative addresses reachable by traversing each of the dimensions listed in A
• $e_i^n$: n-bit vector consisting of a 1 in the ith bit position and 0 everywhere else
• Append operation: adds a dimension to the end of the list TD
• TRANSMIT(j): send the packet ($d \oplus e_j$, message payload, TD with j appended) along the dimension-j link from the present node


Hypercube Fault-Tolerant Routing (cont’d)
• Case study:
  ▫ We are given an $H_3$ with faulty node 011.
  ▫ S = 000 wants to send a message to D = 111.
  ▫ At 000, d = 111, so the node sends the message out on dimension 0, to node 001.
  ▫ At node 001, d = 110 and TD = (0). This node attempts to send the message out on its dimension-1 edge. However, because node 011 is down, it cannot do so.
  ▫ Since bit 2 of d is also 1, it checks and finds that the dimension-2 edge to 101 is available.
  ▫ The message is now sent to 101, from which it makes its way to 111.
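
The case study can be replayed with a simplified sketch of the basic idea: try the dimensions still set in d in increasing order, skip faulty neighbors, and backtrack when every remaining dimension is blocked. This is not the full algorithm of the cited work; in particular it only explores shortest paths and models node faults only. The function name and the representation of nodes as integers are assumptions.

```python
def route(src, dst, n, faulty):
    """Return the list of nodes visited from src to dst in H_n, or None if no
    fault-free shortest path exists (simplified sketch, node faults only)."""
    path = [src]
    tried = {src: set()}  # dimensions already attempted at each visited node
    while path:
        node = path[-1]
        if node == dst:
            return path
        d = node ^ dst  # bit j set means dimension j still has to be crossed
        for j in range(n):
            if (d >> j) & 1 and j not in tried[node]:
                tried[node].add(j)
                nxt = node ^ (1 << j)
                if nxt not in faulty and nxt not in tried:
                    tried[nxt] = set()
                    path.append(nxt)
                    break
        else:
            path.pop()  # every listed dimension is blocked: backtrack
    return None

# The case study: H_3 with faulty node 011, message from 000 to 111.
print([format(v, "03b") for v in route(0b000, 0b111, 3, {0b011})])
```

The printed route visits 000, 001, 101 and 111, matching the steps described above.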


Origin-Based Routing in the Mesh
• Assumptions:
  ▫ Two-dimensional NxN mesh with at most N-1 failures
  ▫ All faulty regions are square; if they are not, additional nodes are declared to have pseudo faults.
  ▫ Each node knows the distance along each direction to the nearest faulty region in that direction.
  ▫ One node is defined as the origin.
  ▫ The origin is chosen so that its row and column do not have any faulty nodes.


Origin-Based Routing in the Mesh (cont’d)
• Sending a message from S to D
• IN-path: edges that take the message closer to the origin
• OUT-path: edges that take the message farther away from the origin
• Outbox: the smallest rectangular region that contains both the origin and the destination
• Safe node: V is a safe node with respect to D and a set of faulty nodes F if
  ▫ V is in the outbox for D
  ▫ There exists a fault-free OUT-path from V to D
• Diagonal band: the diagonal band for D consists of all nodes V in the outbox such that $x_V - y_V = x_D - y_D + e$, where $e \in \{-1, 0, 1\}$
• Once we get to a safe node, there exists an OUT-path from that node to D.
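
The outbox and diagonal-band definitions translate directly into membership tests. The sketch below assumes nodes are (x, y) integer pairs and that the origin sits at (0, 0) by default; the function names are illustrative.

```python
def in_outbox(v, dest, origin=(0, 0)):
    """True if v lies in the smallest rectangle containing origin and dest."""
    (vx, vy), (dx, dy), (ox, oy) = v, dest, origin
    return (min(ox, dx) <= vx <= max(ox, dx)
            and min(oy, dy) <= vy <= max(oy, dy))

def in_diagonal_band(v, dest, origin=(0, 0)):
    """True if v is in the outbox and x_V - y_V = x_D - y_D + e, e in {-1, 0, 1}."""
    if not in_outbox(v, dest, origin):
        return False
    (vx, vy), (dx, dy) = v, dest
    e = (vx - vy) - (dx - dy)
    return e in (-1, 0, 1)

# Hypothetical destination at (4, 3) with the origin at (0, 0).
dest = (4, 3)
print([v for v in [(3, 2), (2, 0), (4, 0), (5, 1)] if in_diagonal_band(v, dest)])
```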


Origin-Based Routing in the Mesh (cont’d)
• Three-phase algorithm
• PHASE-1:
  ▫ The message is routed on an IN-path until it reaches the outbox, at node U.
• PHASE-2:
  ▫ Compute the distance from U to the nearest safe node and compare it to the distance to the nearest faulty region in that direction. If the safe node is closer than the fault, route to the safe node; otherwise, continue to route on the IN links.
• PHASE-3:
  ▫ Once the message is at a safe node U, if there is a safe nonfaulty neighbor V that is closer to the destination, send it to V; otherwise, U must be on the edge of a faulty region. In such a case, move the message along the edge of the faulty region toward the destination D, and turn toward the diagonal band when it arrives at the corner of the faulty square.


Origin-Based Routing in the Mesh (cont’d)
• Case study:
  ▫ A message is routed from node S, at the northwest end of the network, to D.
  ▫ The message first moves along the IN links, getting closer to the origin.
  ▫ It enters the outbox at node A.
  ▫ Since there is a failure directly east of A, it continues on the IN links until it reaches the origin.
  ▫ Then it continues, skirting the edge of the faulty region, until it reaches node B.
  ▫ At this point, it recognizes the existence of a safe node immediately to the north and sends the message through this node to the destination.


• I. Koren and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.

• D. M. Blough and N. Bagherzadeh, “Near-Optimal Message Routing and Broadcasting in Faulty Hypercubes,” International Journal of Parallel Programming, Vol. 19, pp. 405–423, October 1990.

• R. Libeskind-Hadas and E. Brandt, “Origin-Based Fault-Tolerant Routing in the Mesh,” IEEE Symposium on High Performance Computer Architecture, pp. 102–111, 1995.
