cs/ece 757: advanced computer architecture ii (parallel ...mr56/ece757/lecture17.pdftopologies...

31
Page 1 CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture) Interconnection Network Design (Chapter 10) Copyright 2001 Mark D. Hill University of Wisconsin-Madison Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks! (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt, & Singh CS/ECE 757 2 Interconnection Networks Goal: Communication between computers Warning: Terminology-rich environment Focus on Networks for Parallel Computing today’s System Area Networks exhibit many of the same properties Traffic Patterns Arbitrary (bisection bandwidth matters) Near-neighbor Permutations (e.g., FFT) Multicasts or Broadcasts? Latency or Bandwidth Critical?

Upload: others

Post on 24-Feb-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 1

CS/ECE 757: Advanced Computer Architecture II(Parallel Computer Architecture)

Interconnection Network Design (Chapter 10)

Copyright 2001 Mark D. HillUniversity of Wisconsin-Madison

Slides are derived from work bySarita Adve (Illinois), Babak Falsafi (CMU),

Alvy Lebeck (Duke), Steve Reinhardt (Michigan),and J. P. Singh (Princeton). Thanks!

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 2

Interconnection Networks

• Goal: Communication between computers• Warning: Terminology-rich environment• Focus on Networks for Parallel Computing

– today’s System Area Networks exhibit many of the same properties

• Traffic Patterns– Arbitrary (bisection bandwidth matters)– Near-neighbor– Permutations (e.g., FFT)– Multicasts or Broadcasts?

– Latency or Bandwidth Critical?

Page 2: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 2

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 3

Terms

Network characterized by• Topology

– physical structure of the graph

• Routing Algorithm– through which paths can message flow

• Switching Strategy– How data in message traverses its route – Circuit Switched vs Packet Switched

• Flow Control– When does a packet (or portions of it) move along its route

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 4

More Terms

• Given topology constructed by linking switches and network interfaces, must deliver packet from node A to node B

• Link: cable with connectors on each end– connect switches to other switches or network interfaces

• Switch: N inputs N outputs (degree N)• Phit: Minimum # of bits physically moved across link

in one cycle• Flit: Minimum # of bits move across link as a single

unit• Packet: unit that requires routing information, some

number of flits

Page 3: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 3

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 5

Outline

• Topology

• Switching, Routing, & Deadlock

• Switch Design

• Flow Control

• Case Studies

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 6

Topology

• Structure of the interconnect• Determines

– Switch Degree: number of links from a node– Diameter: number of links crossed between nodes on maximum

shortest path– Average distance: number of hops to random destination– Bisection: minimum number of links that separate the network into

two halves

• Don’t forget Network Interface (NI)– “ On-” and “ off-ramp” to the interconnect “ freeway”– NI latency can dominate interconnect latency

(reducing the importance of topology)

Page 4: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 4

(C) 2003 J. E. Smith CS/ECE 757 7

Topology

Node

SW Interface

HW Interface

linklink

bandwidth

Node

SW Interface

HW Interface

linklink

bandwidth

. . .

. . .

Interconnect

Bisection Bandwidth

Overhead

Latency

Application Application

(C) 2003 J. E. Smith CS/ECE 757 8

Buses

• Direct interconnect• Performance/Cost:

– Switch cost: N– Wire cost: const– Avg. latency: const– Bisection B/W: const– Not neighbor optimized– May be local optimized

• Capable of broadcast (good for MP coherence)• Bandwidth not scalable => major problem

– Hierarchical buses?» Bisection B/W remains constant» Becomes neighbor optimized

MemoryMemoryMemoryMemory

Proc

cache

Proc

cache

Proc

cache

Proc

cache

Page 5: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 5

(C) 2003 J. E. Smith CS/ECE 757 9

Rings

• Direct interconnect• Performance/Cost:

– Switch cost: N– Wire cost: N– Avg. latency: N / 2– Bisection B/W: const– neighbor optimized, if bi-directional– probably local optimized

• Not easily scalable– Hierarchical rings?

» Bisection B/W remains constant» Becomes neighbor optimized

M

P

RING

S

M

P

S

M

P

S

(C) 2003 J. E. Smith CS/ECE 757 10

Hypercubes

• n-dimensional unit cube• Direct interconnect• Performance/Cost:

– Switch cost: N log2 N– Wire cost: (N log2 N) /2– Avg. latency: (log2 N) /2– Bisection B/W: N /2– neighbor optimized– probably local optimized

• latency and bandwidth scale well, BUT– individual switch complexity

grows=> max size is built in

S

M P

S

M P

S

M P

S

M P

S

M P

S

M P

S

M P

S

M P

000

110001 101

111011

010

100

Page 6: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 6

(C) 2003 J. E. Smith CS/ECE 757 11

2D Torus

• Direct interconnect• Performance/Cost:

– Switch cost: N– Wire cost: 2 N– Avg. latency: N 1/2

– Bisection B/W: 2 N 1/2

– neighbor optimized– probably local optimized

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

(C) 2003 J. E. Smith CS/ECE 757 12

2D Torus, contd.

• Cost scales well• Latency and bandwidth do not scale as well as

hypercube, BUT– difference is relatively small for practical-sized systems

• Weave nodes to make inter-node latencies const.• Routing can be tricky (deadlocks)• 2D Mesh similar, but without wraparound

SM P

SM P

SM P

SM P

SM P

SM P

SM P

SM P

Page 7: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 7

(C) 2003 J. E. Smith CS/ECE 757 13

3D Torus

• Direct interconnect• Performance/Cost:

– Switch cost: N– Wire cost: 3 N– Avg. latency: 3(N1/3 / 2)– Bisection B/W: 2 N2/3

– neighbor optimized– probably local optimized

(C) 2003 J. E. Smith CS/ECE 757 14

3D Torus, contd.

• Cost scales well• Latency and bandwidth do not scale as well

as hypercube, BUT– difference is relatively small for practical-sized systems

• Seems to have become an interconnect of choice:– Cray, Intel, Tera, DASH, etc.

• Routing can be tricky (deadlocks)• 3D Mesh similar, but without wraparound

Page 8: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 8

(C) 2003 J. E. Smith CS/ECE 757 15

Crossbars

• Indirect interconnect• Performance/Cost:

– Switch cost: N 2

– Wire cost: 2 N– Avg. latency: const– Bisection B/W: N– Not neighbor optimized– Not local optimized

Processor

MemoryMemoryMemoryMemory

MemoryMemoryMemoryMemory

Processor

Processor

Processor

corssbar

(C) 2003 J. E. Smith CS/ECE 757 16

Crossbars, contd.

• Capable of broadcast• No network conflicts• Cost does not scale well• Often implemented with

muxes

mux

P

M

P

P

P

mux

M

mux

M

mux

M

Page 9: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 9

(C) 2003 J. E. Smith CS/ECE 757 17

Multistage Interconnection Networks (MINs)

• AKA: Banyan, Baseline, Omega networks, etc.

• Indirect interconnect• Crossbars interconnected

with shuffles• Can be viewed as

overlapped MUX trees• Destination address

specifies the path• The shuffle interconnect

routes addresses in a binary fashion

• This can most easily be seen with MINs in the form at right

0101

1011

(C) 2003 J. E. Smith CS/ECE 757 18

MINs, contd.

• f switch outputs => decode log2 f bits in switch• Performance/Cost:

– Switch cost: (N log f N) / f– Wire cost: N log f N– Avg. latency: log f N– Bisection B/W: N– Not neighbor optimized

• Problem (in combination with latency growth)– Not local optimized

• Capable of broadcast• Commonly used in large UMA systems

Page 10: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 10

(C) 2003 J. E. Smith CS/ECE 757 19

Multistage Nets, Equivalence

• By rearranging switches, multistage nets can be shown to be equivalent

A

B

C

D

A

B

C

D

A

B

C

D

E

F

G

H

A

B

C

D

E

F

G

H

E

F

G

H

E

F

G

H

I

J

K

L

M

N

O

P

I

J

M

N

K

L

O

P

(C) 2003 J. E. Smith CS/ECE 757 20

Fat Trees

• Tree-like, with constant bandwidth at all levels

• Closely related to MINs• Indirect interconnect• Performance/Cost:

– Switch cost: N log f N

– Wire cost: f N log f N

– Avg. latency: approx 2 log f N– Bisection B/W: f N– neighbor optimized– may be local optimized

• Capable of broadcastP

SM

P

SM

P

SM

P

SM

P

SM

P

SM

P

SM

P

SM

Page 11: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 11

(C) 2003 J. E. Smith CS/ECE 757 21

Fat Trees, Contd.

• The MIN-derived Fat Tree, is, in fact, a Fat Tree:• However, the switching "nodes" in effect do not have full

crossbar connectivity

P

SM

P

SM

P

SM

P

SM

P

SM

P

SM

P

SM

P

SM

A B C D E F G H

I J K L M N O P

Q R S T U V W X

A B C D E F G H

I J K L M N O P

Q R S T U V W X

S S S S S S S S

M M M M M M M M

P P P P P P P P

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 22

Important Topologies

N = 1024

Type Degree Diameter Ave Dist Bisection Diam Ave D

1D mesh 2 N-1 2N/3 1

2D mesh 4 2(N1/2 - 1) 2N1/2 / 3 N1/2 63 21

3D mesh 6 3(N1/3 - 1) 3N1/3 / 3 N2/3 ~30 ~10

nD mesh 2n n(N1/n - 1) nN1/n / 3 N(n-1) / n

(N = kn)

Ring 2 N / 2 N/4 2

2D torus 4 N1/2 N1/2 / 2 2N1/2 32 16

k-ary n-cube 2n n(N1/n) nN1/n/2 15 8 (3D) (N = kn) nk/2 nk/4 2kn-1

Hypercube n n = LogN n/2 N/2 10 5

Page 12: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 12

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 23

N = 1024

Type Degree Diameter Ave Dist Bisection Diam Ave D

2D Tree 3 2Log2 N ~2Log2 N 1 20 ~20

4D Tree 5 2Log4 N 2Log4 N - 2/3 1 10 9.33

kD k+1 Logk N

2D fat tree 4 Log2 N N

2D butterfly 4 Log2 N Log2 N N/2 20 20

Topologies (cont)

CM-5 Thinned Fat Tree

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 24

Butterfly

N/2 Butterfly

°°°

N/2 Butterfly

°°°

Benes Network

N Butterfly

ReversedN

Butterfly

°°°

• Routes all permutations w/o conflict

• Notice similarity to Fat Tree (Fold in half)

• Randomization is major breakthrough

• All paths equal length

• Unique path from anyinput to any output

• Conflicts cause tree saturation

Multistage: nodes at ends, switches in middle

Page 13: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 13

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 25

Outline

• Topology

• Switching, Routing, & Deadlock

• Switch Design

• Flow Control

• Case Studies

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 26

Queue on each end• Can send both ways (“ Bi-directional, Full

Duplex” )• Rules for communication? “ protocol”

– Synchronous send » Need Request & Response signaling

– Name for standard group of bits sent: Packet

ABCs of Networks

• Starting Point: Send bits between 2 computers

Page 14: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 14

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 27

A Simple Example

• What is the packet format?– Fixed? (for HW Interpretation)– Number bytes?

Request/Response

Address/Data

1 bit 32 bits

0: Please send data from Address1: Packet contains data corresponding to request

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 28

Questions About Simple Example

• What if more than 2 computers want to communicate?

– Need node identifier field (destination) in packet– Routing and topology

• What if packet is garbled in transit?– Add error detection field in packet (e.g., CRC)

• What if packet is lost?– More elaborate protocols to detect loss (e.g., NAK, time outs)

• What if multiple processes/machine?– Dispatch– Queue per process

• Questions such as these lead to more complex protocols and packet formats

Page 15: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 15

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 29

General Packet Format

• Header– routing and control information

• Payload– carries data (non HW specific information)– can be further divided (framing, protocol stacks…)

• Error Code– generally at tail of packet so it can be generated on the way out

Header Payload Error Code

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 30

Message vs. Packet

• A Message may be composed of several packets• Applications reason about messages• Network transfers packets• Small fixed size packets. Problems?

Fragmentation and reassembly (SW overhead)• Variable Size packets. Problems?

Congestion

Page 16: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 16

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 31

Packet Switched vs Circuit Switched

Circuit Switched• Establish Route then Send Data• Telephone systemPacket Switched• Route each packet individually• Delivery Guarantees

– Reliable– In order, what if not?

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 32

Routing

• Store-and-forward• Cut-through• Virtual cut-through• Wormhole

Page 17: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 17

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 33

Store and Forward

• Store-and-forward policy: each switch waits for the full packet to arrive in the switch before it is sent on to the next switch

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 34

Cut Through

• Cut-through routing: switch examines the header, decides where to send the message, and then starts forwarding it immediately

Page 18: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 18

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 35

Virtual Cut-Through

• What to do if output port is blocked?• Lets the tail continue when the head is blocked,

absorbing the whole message into a single switch. – Requires a buffer large enough to hold the largest packet.

• Degenerates to store-and-forward with high contention

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 36

Wormhole

• When the head of the message is blocked the message stays strung out over the network

– Potentially blocks other messages (needs only buffer the piece of the packet that is sent between switches).

– CM-5 used it, with each switch buffer being 4 bits per port.– Myrinet uses it

• Interaction with Pack Size• Can cause tree saturation…

Page 19: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 19

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 37

Store and Forward vs. Cut-Through

• Advantage– Latency reduces from function of:

number of intermediate switches times the size of the packet

to

time for 1st part of the packet to negotiate the switches + the packet size ÷ interconnect BW

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 38

Routing Algorithm

• How do I know where a packet should go?• Arithmetic• Source-Based• Table Lookup• Adaptive—route based on network state

(e.g., contention)

Page 20: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 20

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 39

Arithmetic Routing

• For regular topology, simple arithmetic to determine route

• 2D Mesh (Also called NEWS network)– packet header contains signed offset to destination– switch ++ or -- one field of header (x or y dimension)– when x == 0 and y == 0, then at correct processor

• Requires ALU in switch• Must recompute CRC

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 40

Source Based and Table Lookup Routing

Source Based• Source specifies output port for each switch in route• Very Simple Switches

– no control state– strip output port off header

• Myrinet uses thisTable Lookup• Very Small Header, index into table for output port• Big tables, must be kept up to date...

Page 21: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 21

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 41

001

000

101

100

010 110

111011

Deterministic v.s. Adaptive Routing

• Deterministic—follows a pre-specified route

– mesh: dimension-order routing» (x1, y1) -> (x2, y2)» first Dx = x2 - x1,» then Dy = y2 - y1,

– hypercube: edge-cube routing» X = x0x1x2 . . .xn -> Y = y0y1y2 . . .yn» R = X xor Y» Traverse dimensions of differing

address in order– tree: common ancestor

• Adaptive—route determined by contention for output port

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 42

Adaptive Routing

• Essential for fault tolerance– at least multipath

• Can improve utilization of the network• Simple deterministic algorithms easily run into bad

permutations

• Fully/partially adaptive, minimal/non-minimal• Can introduce complexity or anomalies• Little adaptation goes a long way!

Page 22: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 22

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 43

Deadlock

Guri Sohi

• Break a Necessary Condition– Use more than one resource– Not willing to release resource in use– Cycle in order of recourse use

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 44

Deadlock Free Routing

• Virtual Channels– Not virtual cut-through– Add buffers so flits of wormhole packets can be interleaved

• Up*-Down*– Number switches: higher = farther away from processors– route up, make one turn, route down

• Turn Model Routing– Restrict order of turns

» West First» North Last» Negative First

– Can increase number of hops

Page 23: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 23

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 45

Minimal turn restrictions in 2D

West-f irst

north-last negative first

-x +x

+y

-y

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 46

Outline

• Topology

• Switching, Routing, & Deadlock

• Switch Design

• Flow Control

• Case Studies

Page 24: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 24

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 47

A Generic Switch

• At minimum, must route inputs to outputs

InputBuffer

OutputBuffer

Cross-bar

ControlRouting, Scheduling

Receiver Transmitter

OutputPorts

InputPorts

VLSI makes it easier to create larger fully connected switches

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 48

Switch Design Issues

• ports, pin limited• data path (cross bar designs)• non-blocking crossbar• routing logic per input

– ALU– table– Finite State Machine for cut-through

Page 25: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 25

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 49

Switch Buffering

• need to absorb some transient peaks• shared buffer pool

– need high bandwidth– one output could hog all buffer space

• input buffers

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 50

Input Buffering

• buffer per port• routing logic• head of line (HOL) blocking

– subsequent packet may be routed to unused output port

Page 26: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 26

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 51

Output Buffering

• Buffers “ logically” associated with output– split on either side of cross bar

• Arbitration for physical link (output scheduling)– static priority– random– round-robin– oldest-first

• Effects of adaptive routing?– Select output based on availability– requires feedback from output port

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 52

Stacked Dimension Switch

• Uses only 2x2 switch to build higher dimension switch

2x2

2x2

2x2

zin zout

yin

xin

yout

xout

Hostin

Hostout

Page 27: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 27

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 53

Outline

• Topology

• Switching, Routing, & Deadlock

• Switch Design

• Flow Control

• Case Studies

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 54

Congestion Control

• Packet switched networks do not reserve bandwidth; this leads to contention

• Solution: prevent packets from entering until contention is reduced (e.g., metering lights)

• Options:– End-to-end Flow Control– Link-level Flow Control

Page 28: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 28

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 55

Flow Control

• Packet discarding: If a packet arrives at a switch and there is no room in the buffer, the packet is discarded

– no communication between switches, requires higher level protocol

• Flow control: between pairs of receivers and senders; use feedback to tell the sender when it is allowed to send the next packet

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 56

Link-Level Flow Control

Data

Ready

• Transfer single flit when receiver is ready• Could have long links with many flits in flight

Page 29: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 29

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 57

Credit-based (Window) Flow Control

• Receiver gives N credits to sender– sender decrements count– stops sending if zero– receiver sends back credit as it drains its buffer– bundle credits to reduce overhead

• Must account for link latency

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 58

Water Level

• high water, low water• stop & go back to source switch (Myrinet)• can send redundant stop/go

Stop

Go

Incoming phits

Outgoing phits

Page 30: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 30

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 59

Outline

• Topology

• Switching, Routing, & Deadlock

• Switch Design

• Flow Control

• Case Studies

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 60

Case Study Cray T3D

• 1024 switch nodes each connected to 2 processors• 3D Torus, bidirectional, 300 MB/s• Link: 16 bits, 8 control bits• Variable size packet (multiple of 16 bits)• Logical request & response networks

– 2 virtual channels each for deadlock

• Stacked dimension routing• Wormhole for large packets, virtual cut-through for

small packets

Page 31: CS/ECE 757: Advanced Computer Architecture II (Parallel ...mr56/ece757/Lecture17.pdfTopologies (cont) CM-5 Thinned Fat Tree (C) 2001 Mark D. Hill from Adve,Falsafi, Lebeck, Reinhardt,

Page 31

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 61

IBM SP-2 (Vulcan)

• Switch has eight bidirectional 40 MB/s links• Link: 8 data bits, 1 tag, 1 reverse flow-control• Flit is 16 bits, phit is 8• input FIFO + output FIFO + central buffer 128 8-byte

segments

(C) 2001 Mark D. Hill from Adve,Falsafi , Lebeck, Reinhardt, & Singh CS/ECE 757 62

Real Machines

• Wide links, smaller routing delay• Tremendous variation