©2003 Dror Feitelson
Parallel Computing Systems
Part II: Networks and Routing
Dror Feitelson
Hebrew University
The Issues
• Topology – The road map: what connects to what
• Routing – Selecting a route to the desired destination
• Flow control – Traffic lights: who gets to go
• Switching – The mechanics of moving bits
Dally, VLSI & Parallel Computation chap. 3, 1990
Topologies and Routing
The Model
• Network is separate from processors (in old systems the nodes did the switching)
• Network composed of switches and links
• Each processor connected to some switch
[Figure: processing elements (PEs), each attached to a switch in the network]
Considerations
• Diameter – Expected to correlate with maximal latency
• Switch degree – Harder to implement switches with high degree
• Capacity – Potential of serving multiple communications at once
• Number of switches – Obviously affects network cost
• Existence of a simple routing function – Implemented in hardware
Hypercubes
• Multi-dimensional cubes
• n dimensions, N = 2^n nodes
• Nodes identified by n-bit numbers
• Each node connected to other nodes that differ in a single bit
• Degree: log N
• Diameter: log N
• Cost: N switches (one per node)
• Used in Intel iPSC, nCUBE
Recursive Construction
Node Numbering
• Each node has an n-bit number
• Each bit corresponds to a dimension of the hypercube
[Figure: 3-cube with nodes labeled 000 through 111; each bit corresponds to one dimension]
Routing
Given source and destination, correct one bit at a time.
Example: go from 001 to 110
(one route: 001 → 000 → 010 → 110)
In what order?
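One answer to the "in what order?" question is simply to correct bits from the least significant up. A minimal sketch (the function name and the bit order are illustrative choices, not from the slides):

```python
def hypercube_route(src: int, dst: int, n: int):
    """Hypercube routing: repeatedly flip one bit on which the
    current node and the destination differ (lowest bit first)."""
    path = [src]
    cur = src
    for bit in range(n):
        if (cur ^ dst) & (1 << bit):
            cur ^= 1 << bit
            path.append(cur)
    return path

# The slide's example, 001 to 110:
print([format(v, "03b") for v in hypercube_route(0b001, 0b110, 3)])
# → ['001', '000', '010', '110']
```

Any other bit order also yields a minimal route; adaptive schemes exploit exactly this freedom.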
Mesh
• N nodes arranged in a rectangle or square
• Each node connected to 4 neighbors
• Diameter: 2√N
• Degree: 4
• Cost: N switches (one per node)
• Used in Intel Paragon
Routing
• Each node identified by x,y coordinates
• X-Y routing: first along one dimension, then along the other
• Deadlock prevention
  – Always route along X dimension first
  – Turn model: disallow only one turn
  – Odd-even turns: disallow different turns in odd or even rows/columns
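The X-then-Y rule can be sketched in a few lines (coordinate layout is illustrative):

```python
def xy_route(src, dst):
    """Dimension-order (X-Y) routing on a 2D mesh: fully correct
    the x coordinate, then the y coordinate.  Only ever turning
    from X to Y (never back) rules out cyclic dependencies."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# → [(0, 0), (1, 0), (2, 0), (2, 1)]
```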
Adaptive Routing
• Decide route on-line according to conditions
  – Avoid congested links
  – Circumvent failed links
Dally, IEEE Trans. Par. Dist. Syst., 1993
Congestion Avoidance
Desired pattern
Congestion Avoidance
Dimension order routing: first along X, then along Y.
Congestion at top right (assuming pipelining).
Congestion Avoidance
Adaptive routing: disjoint paths can be used
Throughput increased 7-fold
Fault Tolerance
Example: dimension order routing
Fault Tolerance
Example: dimension order routing
Fault causes many nodes to be inaccessible
Adaptive Routing
• Decide route on-line according to conditions
  – Avoid congested links
  – Circumvent failed links
• In mesh, source and destination nodes define a rectangle of all minimal routes
• Also possible to use non-minimal routes
Torus
• Mesh with wrap-around
• Reduces diameter to √N
• In 2D this is topologically a donut
• But…
  – Harder to prevent deadlocks
  – Longer cables than a simple mesh
  – Harder to partition
• Used in Cray T3D/T3E
Hypercubes vs. Meshes
• Hypercubes have a smaller diameter
• But this is not a real advantage
  – We live in 3D, so some cables have to be long
  – They also overlap other cables
  – Given a certain space, each cable must therefore be thinner (fewer wires in parallel)
  – Result: fewer hops, but each takes more time
• Meshes also have a smaller degree
Dally, IEEE Trans. Comput., 1990
Network Bisection
• Definition: bisection width is the minimal number of wires that, when cut, will divide the network into two halves
  Assume W wires in each link
  – Binary tree: B = 1 · W
  – Mesh: B = √N · W
  – Hypercube: B = N/2 · W
• Assumption: wire bisection is fixed
Note: count wires, not links!
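The three formulas are easy to check numerically; a sketch with N = 256 (the function name is mine):

```python
import math

def bisection_width(topology: str, N: int, W: int) -> float:
    """Bisection width in wires, per the slide's formulas
    (N nodes, W wires per link)."""
    if topology == "tree":
        return 1 * W             # only the root link crosses the cut
    if topology == "mesh":
        return math.sqrt(N) * W  # a row of sqrt(N) links is cut
    if topology == "hypercube":
        return (N // 2) * W      # all links of one dimension are cut
    raise ValueError(topology)

for t in ("tree", "mesh", "hypercube"):
    print(t, bisection_width(t, N=256, W=1))
# → tree 1, mesh 16.0, hypercube 128
```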
Network Bisection
• Assume machine packs N nodes in 3D
• Expected traffic proportional to N
• Bisection proportional to N^(2/3) (assuming arranged in a 3D cube)
• May be proportional to N^(1/2) (if arranged in a plane)
Communication Parameters
• N: number of nodes in system
• W: wires in each link
• B: bisection width
• D: average distance between nodes
• L: message length
• p: time to forward header
Time for message arrival: pD + L/W cycles
(assumes wormhole routing = pipelining)
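The model is easy to evaluate; a sketch using the figures from the hypercube/mesh comparison that follows (p is not specified on the slides, so p = 2 cycles is an assumed value):

```python
def wormhole_latency(p, D, L, W):
    """Slide's model: p*D cycles for the header to cross D hops,
    plus L/W cycles to stream an L-bit message over W wires."""
    return p * D + L / W

p = 2  # assumed header-forwarding time
for L in (16, 256, 1024):
    cube = wormhole_latency(p, D=4, L=L, W=1)     # binary 8-cube
    mesh = wormhole_latency(p, D=11.6, L=L, W=8)  # 16x16 mesh
    print(L, cube, mesh)
```

For short messages the pD term favors the hypercube's shorter distances; for long messages the L/W term favors the mesh's wider links.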
Hypercube vs. Mesh

Assumptions: N = 256 nodes, B = 128 wires

                 Binary 8-cube    16×16 mesh
W (wires/link)   1                8
D (avg. dist.)   4                11.6

[Graph: latency (cycles, 0–1200) vs. message size (0–1024 bits) for cube and mesh: for short messages pD dominates and the cube wins; for long messages L/W dominates and the mesh wins]
A Generalization
• K-ary N-cubes
  – N dimensions as in a hypercube
  – K nodes along each dimension as in a mesh
• Hypercubes are 2-ary N-cubes
• Meshes are K-ary 2-cubes
• Good tradeoffs provided by 2, 3, and maybe 4 dimensions
Capacity Problem
• Each message must occupy several links on the way to its destination, depending on the distance
• If all nodes attempt to transmit at the same time, there will not be enough space
• Possible solution: diluted networks – only some nodes have PEs attached
• Alternative solution: multistage networks and crossbars
Multistage Networks
• Organized as several (logarithmic) stages of switches
• Various interconnection patterns among the stages
• Diameter: log N
• Switch degree: constant
• Cost: O(N log N) switches
  – Constant depends on switch degree
• Used in IBM SP2
Example: Cube Network
• Cube-like pattern among stages
• Routing according to address bits: 0 = top exit, 1 = bottom exit
• Example: go from 0010 to 1100
• Or from anywhere else!
[Figure: multistage cube network with inputs 0000–1111 and outputs 0000–1111; the route from 0010 to 1100 is highlighted]
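The self-routing property can be sketched as code. Assuming each of the log N stages examines one destination-address bit, most significant first (the bit order is an assumption about this particular drawing):

```python
def multistage_route(dst: int, n: int):
    """Exit taken at each of n stages: a destination bit of 0
    selects the top exit, 1 the bottom exit -- independent of
    which input the packet entered on."""
    return ["top" if b == "0" else "bottom" for b in format(dst, f"0{n}b")]

# The slide's example, destination 1100:
print(multistage_route(0b1100, 4))
# → ['bottom', 'bottom', 'top', 'top']
```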
Problems
• Only one path from each source to each destination
  – No fault tolerance
  – Susceptible to congestion
• Hot-spot can block the whole network (tree saturation)
• Popular patterns also lead to congestion
Solution
• Use extra stages
• Obviously increases cost
Fat-Tree Implementation
• Tree: routing through common ancestor
• Problem: root becomes a bottleneck
• Fat-tree: make top of tree fatter (i.e. with more bandwidth)
• Most commonly implemented using a multistage network with multiple “roots”
• Adaptiveness by selection of root
• Used in Connection Machine CM-5
Leiserson, IEEE Trans. Comput, 1989
Crossbars
• A switch for each possible connection
• Diameter: 1
• Degree: 4
• Cost: N^2 switches
• No congestion if destination is free
• Used in Earth Simulator
Irregular Networks
• It is now possible to buy switches and cables to construct your own network
  – Myrinet
  – Quadrics
  – Switched giga/fast Ethernet
• Multistage network topologies are often recommended
• But you can connect the switches in arbitrary ways too
Myrinet Components
[Figure: host node with processor and memory on the PCI bus, connected to a NIC; the NIC contains a LANai processor and memory holding the Myrinet control program, a routing table, and communication buffers, and connects via switches to other nodes and switches]
Boden et al., IEEE Micro, 1995
Source Routing
• Routes decided by the source node
• Routing instructions included in packet header
• Typically several bits for each switch along the way, specifying which exit port to use
• Allows easier control by nodes
• Also simpler switches
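A sketch of the mechanism (the header layout and port numbers are illustrative, not Myrinet's actual encoding):

```python
def forward(packet):
    """One switch hop under source routing: consume the first
    routing entry to pick the exit port, forward the remainder."""
    port, *rest = packet["route"]
    return port, {"route": rest, "payload": packet["payload"]}

pkt = {"route": [3, 0, 2], "payload": b"data"}
while pkt["route"]:
    port, pkt = forward(pkt)  # each switch uses up one entry
    print("exit port", port)
```

The switch does no lookup at all, which is what makes it simple and fast.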
Exercise
• Adaptive routing requires switches that make routing decisions themselves
• In source routing, routing is decided at the source, and switches just operate according to instructions
• Is it possible to create some form of adaptive source routing?
  – What are the goals?
  – Can they be achieved using source routing?
  – If not, can they be approximated?
Routing on Irregular Networks
• Find minimal routes (e.g. using Dijkstra's algorithm)
• Problem: may lead to deadlock (assuming mutual blocking at switches due to buffer constraints)
Up-Down Routing
• Give a direction to each edge in the network
• Routing first goes with the direction, and then against it
• This prevents cycles, and hence deadlocks
• Need to ensure connectivity
  – Simple option: start with a spanning tree
  – But might lead to congestion at the root
  – Also many non-minimal routes
Schroeder, IEEE J. Select Areas Comm., 1991
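Legality under the up*/down* rule is easy to check. A sketch, where `up` is an assumed set of directed edges pointing toward the spanning-tree root:

```python
def is_up_down(route, up):
    """A route is legal if no 'up' hop follows a 'down' hop:
    zero or more hops with the direction, then zero or more
    against it.  This ordering is what prevents cycles."""
    gone_down = False
    for a, b in zip(route, route[1:]):
        if (a, b) in up:       # hop with the assigned direction
            if gone_down:
                return False   # up after down: illegal
        else:                  # hop against the direction
            gone_down = True
    return True

up = {(1, 0), (2, 0), (3, 1), (4, 2)}   # node 0 is the root
print(is_up_down([3, 1, 0, 2, 4], up))  # → True  (up, up, down, down)
print(is_up_down([0, 1, 0], up))        # → False (down, then up)
```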
Flow Control
Switching Mechanisms
• Circuit switching
  – Create a dedicated circuit and then transmit data on it
  – Like the traditional telephone network
• Message switching
  – Store and forward the whole message at each switch
• Packet switching
  – Partition message into fixed-size packets and send them independently
  – Guarantee buffer availability at receiver
  – Pipeline the packets to reduce latency
  – But need to reconstruct the message
Three Levels
• Message
  – Unit of communication at the application level
• Packet
  – Unit of transmission
  – Each packet has a header and is routed independently
  – Large messages partitioned into multiple packets
• Flit
  – Unit of flow control
  – Each packet divided into flits that follow each other
Moving Packets
• Store and forward of complete packets
  – Latency depends on number of hops
• Virtual cut-through: start forwarding as soon as header is decoded
  – Allows overlap of transmission on consecutive links
  – Still need buffers big enough for full packets in case one gets stuck
• Wormhole routing: block in place
  – Reduces required buffer size at cost of additional blocking
Dally, VLSI & Parallel Computation chap. 3, 1990
Store and Forward
[Figure: time-space diagram from source to destination; each of the D hops waits for the whole packet, so latency = D · L/W]
Virtual Cut Through
[Figure: time-space diagram from source to destination; flits pipeline across hops, so latency = L/W + D]
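The two time-space figures correspond to simple formulas; a sketch (p = 1 cycle per hop for the header is an assumed value):

```python
def store_and_forward(D, L, W, p=1):
    """Each of D switches receives the whole packet before sending."""
    return D * (p + L / W)

def cut_through(D, L, W, p=1):
    """Header pays p cycles per hop; the body pipelines behind it."""
    return p * D + L / W

D, L, W = 10, 1024, 8
print(store_and_forward(D, L, W))  # → 1290.0
print(cut_through(D, L, W))        # → 138.0
```

With pipelining the distance D adds cycles, not whole packet times.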
Wormhole Routing
• Wormhole routing operates on flits – the units of flow control
  – Typically what can be transmitted in a single cycle
  – Equal to the number of wires in each link
• Packet header is typically one or two flits
• Flits don't contain routing information
  – They must follow previous flits without interleaving with flits of another packet
• Each switch only has buffer space for a small number of flits
Collisions
What happens if a stream of flits arrives at a switch, and the desired output port is busy?
• Store whole packet in a buffer (called virtual cut-through)
• Block in-place across multiple switches (called wormhole routing)
• Drop the data – resources are lost!!!
• Misroute: keep moving, but in the wrong direction
Throughput and Load
[Graph: throughput (output) vs. offered load (input); throughput rises until saturation, then levels off with blocking or buffering, or falls with dropping]
Throttling
• Sources are usually permitted to inject traffic as fast as they wish
• Throttling slows them down by placing a limit on the rate of injecting new traffic
• This puts a cap on the maximal throughput possible
• But it also prevents excessive congestion, and may lead to better overall performance
Deadlock
Deadlocks
• In wormhole routing, packets hold switch resources while they move
  – Flit buffers
  – Output ports
• Another packet may arrive that needs the same resources
• Cyclic dependencies may lead to deadlock
Deadlocks
Dependencies
• Deadlocks are the most dramatic problem
• But can also just lead to inefficiency
  – A blocked packet still holds its channels (because flits need to stay contiguous to maintain routing)
  – Another packet may be able to utilize these channels
Inefficiency
Virtual Channels
• Divide the buffers in each switch into several virtual channels
• Each virtual channel also has its own state and routing information
• Virtual channels share the use of physical resources
Dally, IEEE Trans. Par. Dist. Syst., 1992
Efficiency!
Red packet occupies some (not all!!!) buffer space
Green packet actually uses link
Deadlocks Again
Virtual channels can also be used to solve the deadlock problem
– In a network with diameter D, create D virtual channels on each link
– Newly injected messages can only use virtual channel no. 1
– Packets coming on virtual channel i can only move to virtual channel i+1
– Virtual channels used are strictly ordered, so no cycles
– This version limits flexibility, hence inefficient
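The strictly ordered scheme can be stated in a few lines (the function name is mine):

```python
def hop_vcs(num_hops: int, D: int):
    """Virtual channel used on each hop: injection starts on
    VC 1, and a packet arriving on VC i continues on VC i+1.
    With D VCs per link, any route of up to D hops gets a
    strictly increasing VC sequence, so no cyclic wait can form."""
    assert num_hops <= D, "need as many VCs per link as the diameter"
    return list(range(1, num_hops + 1))

print(hop_vcs(4, D=6))  # → [1, 2, 3, 4]
```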
Dally’s Methodology
• Create a routing function using minimal paths
• Remove arcs to make this acyclic, and hence deadlock free
• If this results in disconnecting the routing function, duplicate links using virtual channels
Dally, IEEE Trans. Comput., 1987
Duato’s Methodology
• Start with a deadlock-free routing function (e.g. Dally’s)
• Duplicate all channels by virtualization
• The extended routing function allows use of the new or the original channels; but once an original channel is used, you cannot revert to new channels
• Works for any topology
Duato, IEEE Trans. Par. Dist. Syst., 1993
Performance
Methodology
• Network simulation
  – Network topology
  – Routing algorithm
  – Packet-level simulation of flow control (including virtual circuits)
• Workloads
  – Synthetic patterns
    • Uniformly distributed random destinations
    • Hot spots
  – Real applications – requires co-simulation of application and network
Experiment with different options
Simulation Results
• Metrics:
  – Throughput: packets per second delivered, or fraction of capacity supported
  – Latency: delay in delivering packets
• Results:
  – Adaptive routing and virtual circuits improve both metrics under loaded conditions
However…
This does not necessarily translate into improved application performance
• Realistic applications typically do not create sufficiently high communication loads
  – There is not a lot of congestion
  – So overcoming it is not an issue
• Supporting virtual channels and adaptive routing comes at a cost
  – Switches are more complex and therefore slower
Vaidya et al., IEEE Trans. Parallel Distrib. Syst., 2001
The Bottom Line
• Virtual circuits are useful for deadlock prevention
• Virtual circuits and adaptive routing may hurt performance more than they improve it
• It may be more important to make switches fast
Switching
Switch Elements
• Input ports
• Output ports
• Buffers
• A crossbar connecting the inputs to the outputs
[Figure: switch with input ports, buffers, a crossbar (Xbar), and output ports]
Input Buffering
• Buffers are associated with input ports
• If desired output port is busy, no more data enters
• Suffers from head-of-line blocking
[Figure: crossbar with buffers at the input ports]
Output Buffering
• Buffers are associated with output ports
• Packets block only if their desired output is busy
[Figure: crossbar with buffers at the output ports]
Central Queue
• Queues are associated with output ports
• Buffer space is shared
  – More for busier inputs
  – More for busier outputs
[Figure: crossbar with a shared central queue]
Stunkel et al., IBM Syst. J., 1995