
Page 1:

©2003 Dror Feitelson

Parallel Computing Systems
Part II: Networks and Routing

Dror Feitelson

Hebrew University

Page 2:

The Issues

• Topology
  – The road map: what connects to what

• Routing
  – Selecting a route to the desired destination

• Flow Control
  – Traffic lights: who gets to go

• Switching
  – The mechanics of moving bits

Dally, VLSI & Parallel Computation chap. 3, 1990

Page 3:

Topologies and Routing

Page 4:

The Model

• Network is separate from processors (in old systems, nodes did the switching)

• Network composed of switches and links

• Each processor connected to some switch

(figure: PEs attached to a network of switches and links)

Page 5:

Considerations

• Diameter
  – Expected to correlate with maximal latency

• Switch degree
  – Harder to implement switches with high degree

• Capacity
  – Potential for serving multiple communications at once

• Number of switches
  – Obviously affects network cost

• Existence of a simple routing function
  – Implemented in hardware

Page 6:

Hypercubes

• Multi-dimensional cubes

• n dimensions, N = 2^n nodes

• Nodes identified by n-bit numbers

• Each node connected to other nodes that differ in a single bit

• Degree: log N

• Diameter: log N

• Cost: N switches (one per node)

• Used in Intel iPSC, nCUBE

Page 7:

Recursive Construction

Page 8:

Node Numbering

• Each node has an n-bit number

• Each bit corresponds to a dimension of the hypercube

(figure: 3-dimensional hypercube with nodes numbered 000 through 111)

Page 9:

Routing

Given source and destination, correct one bit at a time

Example: go from 001 to 110

In what order?

(figure: possible routes from 001 to 110, e.g. 001 → 000 → 010 → 110)
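A sketch of this bit-correction step in C (the function name is illustrative; fixing the correction order, here lowest differing bit first, is what dimension-order/e-cube routing does):

  /* One step of hypercube routing: nodes are n-bit numbers and
     neighbors differ in exactly one bit, so each hop corrects one
     bit of the current address toward the destination. */
  unsigned next_hop(unsigned current, unsigned dest)
  {
      unsigned diff = current ^ dest;    /* bits still to correct */
      if (diff == 0)
          return current;                /* arrived */
      return current ^ (diff & -diff);   /* flip lowest differing bit */
  }

With this fixed order, 001 to 110 goes 001 → 000 → 010 → 110.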

Page 10:

Mesh

• N nodes arranged in a rectangle or square

• Each node connected to 4 neighbors

• Diameter: 2√N (for a √N × √N square)

• Degree: 4

• Cost: N switches (one per node)

• Used in Intel Paragon

Page 11:

Routing

• Each node identified by x,y coordinates

• X-Y routing: first along one dimension, then along the other

• Deadlock prevention
  – Always route along X dimension first
  – Turn model: disallow only one turn
  – Odd-even turns: disallow different turns in odd or even rows/columns
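A minimal sketch of X-Y routing in C (port names and coordinate conventions are illustrative):

  /* Dimension-order (X-Y) routing on a mesh: fully correct the X
     coordinate before touching Y; this forbids the turns that
     could close a cyclic dependency, preventing deadlock. */
  typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

  port_t xy_route(int x, int y, int dest_x, int dest_y)
  {
      if (x < dest_x) return EAST;     /* first along X... */
      if (x > dest_x) return WEST;
      if (y < dest_y) return NORTH;    /* ...then along Y */
      if (y > dest_y) return SOUTH;
      return LOCAL;                    /* arrived: deliver to the PE */
  }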

Page 12:

Adaptive Routing

• Decide route on-line according to conditions
  – Avoid congested links
  – Circumvent failed links

Dally, IEEE Trans. Par. Dist. Syst., 1993

Page 13:

Congestion Avoidance

Desired pattern

Page 14:

Congestion Avoidance

Dimension order routing: first along X, then along Y

Congestion at top right (assuming pipelining)

Page 15:

Congestion Avoidance

Adaptive routing: disjoint paths can be used

Throughput increased 7-fold

Page 16:

Fault Tolerance

Example: dimension order routing

Page 17:

Fault Tolerance

Example: dimension order routing

Fault causes many nodes to be inaccessible

Page 18:

Adaptive Routing

• Decide route on-line according to conditions
  – Avoid congested links
  – Circumvent failed links

• In mesh, source and destination nodes define a rectangle of all minimal routes

• Also possible to use non-minimal routes

Page 19:

Torus

• Mesh with wrap-around

• Reduces diameter to √N

• In 2D this is topologically a donut

• But…
  – Harder to prevent deadlocks
  – Longer cables than simple mesh
  – Harder to partition

• Used in Cray T3D/T3E

Page 20:

Hypercubes vs. Meshes

• Hypercubes have a smaller diameter

• But this is not a real advantage
  – We live in 3D, so some cables have to be long
  – They also overlap other cables
  – Given a certain space, each cable must therefore be thinner (fewer wires in parallel)
  – Result: fewer hops, but each takes more time

• Meshes also have a smaller degree

Dally, IEEE Trans. Comput., 1990

Page 21:

Network Bisection

• Definition: bisection width is the minimal number of wires that, when cut, will divide the network into two halves

• Assume W wires in each link
  – Binary tree: B = 1 · W
  – Mesh: B = √N · W
  – Hypercube: B = (N/2) · W

• Assumption: wire bisection is fixed

Note: count wires, not links!
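The formulas translate directly into C (a sketch; N is the number of nodes, W the wires per link):

  #include <math.h>

  /* Bisection width B, in wires, for the three topologies above. */
  double bisection_tree(double N, double W)      { return 1 * W; }
  double bisection_mesh(double N, double W)      { return sqrt(N) * W; }
  double bisection_hypercube(double N, double W) { return (N / 2) * W; }

Note that for N = 256, a hypercube with W = 1 and a mesh with W = 8 both give B = 128 wires, which is exactly the fixed-bisection comparison made three slides ahead.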

Page 22:

Network Bisection

• Assume machine packs N nodes in 3D

• Expected traffic proportional to N

• Bisection proportional to N^(2/3) (assuming arranged in 3D cube)

• May be proportional to N^(1/2) (if arranged in plane)

Page 23:

Communication Parameters

• N: number of nodes in system

• W: wires in each link

• B: bisection width

• D: average distance between nodes

• L: message length

• p: time to forward header

Time for message arrival: pD + L/W cycles

(assumes wormhole routing = pipelining)
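A worked sketch of this model in C, using the two configurations compared on the next slide (p is not given there, so p = 1 cycle is assumed for illustration):

  #include <stdio.h>

  /* Wormhole (pipelined) latency model: pD + L/W cycles. */
  double latency(double p, double D, double L, double W)
  {
      return p * D + L / W;
  }

  int main(void)
  {
      double L = 256;                                        /* message length, bits */
      printf("8-cube:     %.1f\n", latency(1, 4.0,  L, 1));  /* 260.0 cycles */
      printf("16x16 mesh: %.1f\n", latency(1, 11.6, L, 8));  /*  43.6 cycles */
      return 0;
  }

For this 256-bit message L/W dominates and the mesh's wider links win; for very short messages pD dominates and the cube's smaller distance wins.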

Page 24:

Hypercube vs. Mesh

Assumptions: N = 256 nodes, B = 128 wires

                        Binary 8-cube   16×16 mesh
  W (wires per link)          1              8
  D (average distance)        4             11.6

(figure: latency [cycles] vs. message size [bits]; for short messages the pD term dominates and the cube is faster, for long messages the L/W term dominates and the mesh is faster)

Page 25:

A Generalization

• K-ary N-cubes
  – N dimensions as in a hypercube
  – K nodes along each dimension as in a mesh

• Hypercubes are 2-ary N-cubes

• Meshes are K-ary 2-cubes

• Good tradeoffs provided by 2, 3, and maybe 4 dimensions

Page 26:

Capacity Problem

• Each message must occupy several links on the way to its destination, depending on the distance

• If all nodes attempt to transmit at the same time, there will not be enough space

• Possible solution: diluted networks
  – Only some nodes have PEs attached

• Alternative solution: multistage networks and crossbars

Page 27:

Multistage Networks

• Organized as several stages of switches (logarithmic in N)

• Various interconnection patterns among the stages

• Diameter: log N

• Switch degree: constant

• Cost: O(N log N)
  – Constant depends on switch degree

• Used in IBM SP2

Page 28:

Example: Cube Network

• Cube-like pattern among stages

• Routing according to address bits: 0 = top exit, 1 = bottom exit

• Example: go from 0010 to 1100

• Or from anywhere else!

(figure: multistage network with inputs and outputs numbered 0000 through 1111, showing the route from 0010 to 1100)
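A sketch of this destination-tag rule in C (whether address bits are consumed most-significant-first depends on the wiring between stages; MSB-first is assumed here):

  /* Routing through a log2(N)-stage network: at each stage, one bit
     of the destination address selects the exit port. The route
     depends only on the destination, hence "from anywhere else". */
  int exit_port(unsigned dest, int stage, int n_stages)
  {
      return (dest >> (n_stages - 1 - stage)) & 1;   /* 0 = top, 1 = bottom */
  }

For destination 1100 in a 4-stage network the successive exits are 1, 1, 0, 0, whatever the source.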

Page 29:

Problems

• Only one path from each source to each destination
  – No fault tolerance
  – Susceptible to congestion

• Hot-spot can block the whole network (tree saturation)

• Popular patterns also lead to congestion

Page 30:

Solution

• Use extra stages

• Obviously increases cost

Page 31:

Fat-Tree Implementation

• Tree: routing through common ancestor

• Problem: root becomes a bottleneck

• Fat-tree: make top of tree fatter (i.e. with more bandwidth)

• Most commonly implemented using a multistage network with multiple “roots”

• Adaptiveness by selection of root

• Used in Connection Machine CM-5

Leiserson, IEEE Trans. Comput., 1985

Page 32:

(figure-only slide)

Page 33:

(figure-only slide)

Page 34:

(figure-only slide)

Page 35:

Crossbars

• A switch for each possible connection

• Diameter: 1

• Degree: 4

• Cost: N2 switches

• No congestion if destination is free

• Used in Earth Simulator
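A sketch of why a free destination means no congestion (the arbitration scheme is illustrative):

  #include <stdbool.h>

  /* One arbitration round on an N x N crossbar: each input requests
     one output (-1 = idle); a request is granted unless another
     input already claimed that output. */
  void crossbar_arbitrate(const int request[], bool granted[], int n)
  {
      bool taken[n];
      for (int i = 0; i < n; i++) taken[i] = false;
      for (int i = 0; i < n; i++) {
          granted[i] = (request[i] >= 0) && !taken[request[i]];
          if (granted[i]) taken[request[i]] = true;
      }
  }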

Page 36:

Irregular Networks

• It is now possible to buy switches and cables to construct your own network
  – Myrinet
  – Quadrics
  – Switched giga/fast Ethernet

• Multistage network topologies are often recommended

• But you can connect the switches in arbitrary ways too

Page 37:

Myrinet Components

(figure: a host node with processor and memory on a PCI bus, attached to a NIC containing a LANai processor, memory, communication buffers, the Myrinet control program, and a routing table; the NIC connects to switches, which link to other nodes and switches)

Boden et al., IEEE Micro, 1995

Page 38:

Source Routing

• Routes decided by the source node

• Routing instructions included in packet header

• Typically several bits for each switch along the way, specifying which exit port to use

• Allows easier control by nodes

• Also simpler switches
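A sketch of such a header in C (the layout is illustrative, not any particular network's wire format):

  #include <string.h>

  #define MAX_HOPS 16

  /* Source route: one exit-port number per switch along the path,
     computed by the source from its routing table. */
  struct route_header {
      int n_hops;
      unsigned char port[MAX_HOPS];
  };

  /* Switches stay simple: strip the first instruction and obey it;
     no routing tables or decisions needed in the switch. */
  int take_next_port(struct route_header *h)
  {
      int p = h->port[0];
      memmove(h->port, h->port + 1, --h->n_hops);
      return p;
  }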

Page 39:

Exercise

• Adaptive routing requires switches that make routing decisions themselves

• In source routing, routing is decided at the source, and switches just operate according to instructions

• Is it possible to create some form of adaptive source routing?
  – What are the goals?
  – Can they be achieved using source routing?
  – If not, can they be approximated?

Page 40:

Routing on Irregular Networks

• Find minimal routes (e.g. using Dijkstra’s algorithm)

• Problem: may lead to deadlock (assuming mutual blocking at switches due to buffer constraints)

Page 41:

Up-Down Routing

• Give a direction to each edge in the network

• Routing first goes with the direction, and then against it

• This prevents cycles, and hence deadlocks

• Need to ensure connectivity
  – Simple option: start with a spanning tree
  – But might lead to congestion at the root
  – Also many non-minimal routes

Schroeder, IEEE J. Select Areas Comm., 1991
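A sketch of the resulting legality check in C (the encoding of hop directions is illustrative):

  #include <stdbool.h>

  enum dir { WITH = 0, AGAINST = 1 };   /* relative to the edge's direction */

  /* A route is legal iff every hop "with" the direction precedes
     every hop "against" it; together with acyclic edge directions
     this rules out waiting cycles, and hence deadlocks. */
  bool legal_up_down(const enum dir *hops, int n)
  {
      bool turned = false;
      for (int i = 0; i < n; i++) {
          if (hops[i] == AGAINST)
              turned = true;
          else if (turned)
              return false;   /* "with" after "against": not allowed */
      }
      return true;
  }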

Page 42:

Flow Control

Page 43:

Switching Mechanisms

• Circuit switching
  – Create a dedicated circuit and then transmit data on it
  – Like traditional telephone network

• Message switching
  – Store and forward the whole message at each switch

• Packet switching
  – Partition message into fixed-size packets and send them independently
  – Guarantee buffer availability at receiver
  – Pipeline the packets to reduce latency
  – But need to reconstruct the message

Page 44:

Three Levels

• Message
  – Unit of communication at the application level

• Packet
  – Unit of transmission
  – Each packet has a header and is routed independently
  – Large messages partitioned into multiple packets

• Flit
  – Unit of flow control
  – Each packet divided into flits that follow each other

Page 45:

Moving Packets

• Store and forward of complete packets
  – Latency depends on number of hops

• Virtual cut-through: start forwarding as soon as header is decoded
  – Allows overlap of transmission on consecutive links
  – Still need buffers big enough for full packets in case one gets stuck

• Wormhole routing: block in place
  – Reduces required buffer size at cost of additional blocking

Dally, VLSI & Parallel Computation chap. 3, 1990

Page 46:

Store and Forward

(figure: the whole packet advances one hop at a time from source to destination; latency = (L/W) · D)

Page 47:

Virtual Cut Through

(figure: flits pipeline through the switches from source to destination; latency = L/W + D)

Page 48:

Wormhole Routing

• Wormhole routing operates on flits – the units of flow control
  – Typically what can be transmitted in a single cycle
  – Equal to the number of wires in each link

• Packet header is typically one or two flits

• Flits don’t contain routing information
  – They must follow previous flits without interleaving with flits of another packet

• Each switch only has buffer space for a small number of flits
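A sketch of the per-channel state this implies (field names are illustrative):

  #include <stdbool.h>

  /* Since only the header carries routing information, a wormhole
     switch remembers, per input channel, where the current packet
     is going; body flits simply follow until the tail passes. */
  struct channel_state {
      int  out_port;     /* set when the header flit was routed */
      bool mid_packet;   /* true between header and tail flits  */
  };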

Page 49:

Collisions

What happens if a stream of flits arrives at a switch, and the desired output port is busy?

• Store whole packet in a buffer (called virtual cut-through)

• Block in-place across multiple switches (called wormhole routing)

• Drop the data
  – Resources are lost!!!

• Misroute: keep moving, but in the wrong direction

Page 50:

Throughput and Load

(figure: throughput (output) vs. offered load (input); throughput rises with load up to saturation, then levels off under blocking or buffering but collapses under dropping)

Page 51:

Throttling

• Sources are usually permitted to inject traffic as fast as they wish

• Throttling slows them down by placing a limit on the rate of injecting new traffic

• This puts a cap on the maximal throughput possible

• But it also prevents excessive congestion, and may lead to better overall performance
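One simple realization is a window on outstanding packets per source; a sketch (the counter-based policy is one illustrative choice):

  /* Injection throttling: at most `limit` packets of this source
     may be in the network at once; further injections must wait. */
  struct source {
      int in_flight;   /* injected but not yet delivered */
      int limit;       /* throttling threshold */
  };

  int try_inject(struct source *s)
  {
      if (s->in_flight >= s->limit)
          return 0;     /* throttled: caps peak throughput, */
      s->in_flight++;   /* but keeps congestion in check    */
      return 1;
  }

  void packet_delivered(struct source *s) { s->in_flight--; }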

Page 52:

Deadlock

Page 53:

Deadlocks

• In wormhole routing, packets hold switch resources while they move
  – Flit buffers
  – Output ports

• Another packet may arrive that needs the same resources

• Cyclic dependencies may lead to deadlock

Page 54:

Deadlocks

Page 55:

Dependencies

• Deadlocks are the most dramatic problems

• But can also just lead to inefficiency
  – A blocked packet still holds its channels (because flits need to stay contiguous to maintain routing)
  – Another packet may be able to utilize these channels

Page 56:

Inefficiency

Page 57:

Virtual Channels

• Divide the buffers in each switch into several virtual channels

• Each virtual channel also has its own state and routing information

• Virtual channels share the use of physical resources

Dally, IEEE Trans. Par. Dist. Syst., 1992

Page 58:

Efficiency!

(figure: the red packet occupies some (not all!!!) of the buffer space, while the green packet actually uses the link)

Page 59:

Deadlocks Again

Virtual channels can also be used to solve the deadlock problem:

– In a network with diameter D, create D virtual channels on each link

– Newly injected messages can only use virtual channel no. 1

– Packets coming on virtual channel i can only move to virtual channel i+1

– Virtual channels used are strictly ordered, so no cycles

– This version limits flexibility, hence inefficient
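The ordering rule itself is tiny; a sketch in C (channel numbering as on the slide):

  /* Ordered virtual channels: packets enter on channel 1 and move
     to channel i+1 at every hop. Channel numbers only increase
     along a route, so no cycle of waiting packets can form, and D
     channels suffice because no route is longer than the diameter. */
  int injection_channel(void)      { return 1; }
  int next_channel(int current_vc) { return current_vc + 1; }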

Page 60:

Dally’s Methodology

• Create a routing function using minimal paths

• Remove arcs to make this acyclic, and hence deadlock free

• If this results in disconnecting the routing function, duplicate links using virtual channels

Dally, IEEE Trans. Comput., 1987

Page 61:

Duato’s Methodology

• Start with a deadlock-free routing function (e.g. Dally’s)

• Duplicate all channels by virtualization

• The extended routing function allows use of the new or the original channels; but once an original channel is used, you cannot revert to new channels

• Works for any topology

Duato, IEEE Trans. Par. Dist. Syst., 1993

Page 62:

Performance

Page 63:

Methodology

• Network simulation
  – Network topology
  – Routing algorithm
  – Packet-level simulation of flow control (including virtual circuits)

• Workloads
  – Synthetic patterns
    • Uniformly distributed random destinations
    • Hot spots
  – Real applications
    • Requires co-simulation of application and network

Experiment with different options

Page 64:

Simulation Results

• Metrics:
  – Throughput: packets per second delivered, or fraction of capacity supported
  – Latency: delay in delivering packets

• Results:
  – Adaptive routing and virtual circuits improve both metrics under loaded conditions

Page 65:

However…

This does not necessarily translate into improved application performance

• Realistic applications typically do not create sufficiently high communication loads
  – There is not a lot of congestion
  – So overcoming it is not an issue

• Supporting virtual channels and adaptive routing comes at a cost
  – Switches are more complex and therefore slower

Vaidya et al., IEEE Trans. Parallel Distrib. Syst., 2001

Page 66:

The Bottom Line

• Virtual circuits are useful for deadlock prevention

• Virtual circuits and adaptive routing may hurt performance more than they improve it

• It may be more important to make switches fast

Page 67:

Switching

Page 68:

Switch Elements

• Input ports

• Output ports

• Buffers

• A crossbar connecting the inputs to the outputs

(figure: switch with a central crossbar)

Page 69:

Input Buffering

• Buffers are associated with input ports

• If desired output port is busy, no more data enters

• Suffers from head-of-line blocking

(figure: input-buffered switch)
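A sketch of why the head blocks the line (data structures are illustrative):

  #include <stdbool.h>

  /* Input buffering: only the packet at the head of an input queue
     can compete for the crossbar. */
  struct input_queue {
      int head_dest;   /* output port wanted by the head packet */
      int length;      /* packets waiting at this input */
  };

  bool can_advance(const struct input_queue *q, const bool output_busy[])
  {
      /* If the head's output is busy, everything behind it waits,
         even packets whose own outputs are free. */
      return q->length > 0 && !output_busy[q->head_dest];
  }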

Page 70:

Output Buffering

• Buffers are associated with output ports

• Packets block only if their desired output is busy

(figure: output-buffered switch)

Page 71:

Central Queue

• Queues are associated with output ports

• Buffer space is shared
  – More for busier inputs
  – More for busier outputs

(figure: switch with a shared central queue)

Stunkel et al., IBM Syst. J., 1995