why the network matters - virginia techaccel.cs.vt.edu/files/lecture4.pdf · 2012-02-09 · design...

Copyright © 2009 by W. Feng. Based on material from Matthew Sottile.

Why the Network Matters
Week 2, Lecture 2


So Far …

•  Overview of Multicore Systems … Why Memory Matters … Memory Architectures …

•  Emerging Chip Multiprocessors (CMP)
    –  Increasing number of cores on a chip
    –  Cache coherency “shopping list”
    –  Memory performance → network performance

•  Today: The Network –  Moving data around to support “shopping list” model

•  How to connect processors to memory and its impact on performance and applications

–  (Much of the material today is derived from Culler and Singh.)



Data Mobility

•  It’s actually all about the data, data, data.
•  No matter how fast the functional units of a system, the performance bottleneck has always been (and will continue to be) moving data around.
•  Challenge
    –  How to efficiently feed the functional units?
    –  How to lay out and track data and get it quickly from A to B?



Granularity

•  Increasing our focus in granularity
    –  Functional unit pipelines
    –  Single and multicore cache hierarchies
    –  Coherence to manage nondeterminism between tightly coupled cores
•  And now … interconnection networks …
    –  Cannot practically sustain bus snooping protocols in hardware
    –  Use to interconnect “small” multiprocessors to form arbitrarily large supercomputers.



Interconnection Networks

•  The network that connects processing elements together.
•  Broad applicability
    –  Infrastructure in shared and distributed memory systems that ties processors to memories and to each other.
•  Examples:
    (1) Distributed memory system with potentially large message sizes → SGI Altix.
    (2) Massively parallel collections of small processors that communicate in small amounts but frequently → GPGPU :-)




Interconnection Networks for Multicore?

•  On-chip
    –  How cores are linked together.
•  Off-chip
    –  How CMPs are connected to motherboard buses.
•  Recall the bi-directional circular “EIB” interconnect on the Cell.
    –  On-chip or off-chip interconnect?



Design Factors

•  Economic factors for the actual hardware
•  Performance
    –  Peak
    –  Sustained / Actual / Practical
    –  Other
•  Routing and switching characteristics



Design Dimensions

•  Topology
    –  The physical interconnection structure of the network
•  Routing algorithms
    –  The method for choosing which route messages take through the network graph from source to destination
•  Switching strategy
    –  How the data in a message traverse the route
•  Flow control
    –  Determination of when a message (or portions thereof) moves along its route.



Terminology

•  Channel: A link between two nodes on the network, including buffers to hold data
•  Bandwidth: b = wf, where w is the channel width and f = 1/T is the signaling rate for a cycle time of T
•  Degree: Connectivity of a node (# channels to/from a node)
•  Route: A path through the network graph
•  Diameter: Length of the maximum shortest path between any two nodes
•  Routing Distance: Number of links traversed en route between two nodes
•  Average Distance: Average routing distance over all pairs of nodes



Bandwidth

•  Raw bandwidth is b = wf, where w = width and f = frequency

•  Effective bandwidth is reduced by the overhead n_E of encapsulating a packet of size n: only n out of every n + n_E units sent are payload.

•  If a switch delays routing decisions by d, the bandwidth degrades further.
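The bullets above can be turned into a small calculator. This is only a sketch: the slide's exact expression did not survive, so the model b · n / (n + n_E + w · d) used here is an assumption (a common first-order form), with d counted in channel cycles during which the link carries no payload. The function and parameter names are illustrative.

```python
def raw_bandwidth(width_bits, freq_hz):
    """Raw channel bandwidth b = w * f, in bits per second."""
    return width_bits * freq_hz


def effective_bandwidth(n_bits, envelope_bits, width_bits, freq_hz, delay_cycles=0):
    """Payload bandwidth under the assumed model b * n / (n + n_E + w * d).

    n_bits        -- payload size n
    envelope_bits -- encapsulation overhead n_E (header, trailer, ...)
    delay_cycles  -- assumed switch routing delay d, in channel cycles
    """
    b = raw_bandwidth(width_bits, freq_hz)
    occupied_bits = n_bits + envelope_bits + width_bits * delay_cycles
    return b * n_bits / occupied_bits


if __name__ == "__main__":
    # Hypothetical channel: 16 bits wide, 1 GHz signaling rate.
    print(raw_bandwidth(16, 1e9))
    print(effective_bandwidth(1024, 128, 16, 1e9))
    print(effective_bandwidth(1024, 128, 16, 1e9, delay_cycles=4))
```

Under this model, small packets suffer most: the fixed n_E and d terms dominate when n is small.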



Bisection Bandwidth

•  Multiple nodes on an interconnect send messages at the same time. How to measure?
•  Bisection Bandwidth: Sum of the bandwidths of the minimum set of channels that, if removed, partition the network into two equal unconnected sets of nodes.
    –  Value? If all nodes communicate in a uniform pattern, half the messages will be expected to cross the bisection in each direction.
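For very small networks the definition can be checked by brute force. The sketch below tries every split of the nodes into two equal halves and counts the channels that cross the split. It assumes an even node count and equal bandwidth on every channel, so the bisection bandwidth is just the returned count times b. The helper names are illustrative, and the exhaustive search is only practical for a handful of nodes.

```python
from itertools import combinations


def chain(n):
    """Nodes 0..n-1 in a line; each edge is one channel."""
    return [(i, i + 1) for i in range(n - 1)]


def ring(n):
    """A chain whose two ends are also connected."""
    return [(i, (i + 1) % n) for i in range(n)]


def bisection_width(n, edges):
    """Minimum number of channels cut by any split into two halves of n // 2 nodes."""
    best = len(edges)
    for half in combinations(range(n), n // 2):
        side = set(half)
        cut = sum((u in side) != (v in side) for u, v in edges)
        best = min(best, cut)
    return best


if __name__ == "__main__":
    print(bisection_width(8, chain(8)))
    print(bisection_width(8, ring(8)))
```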



Routing

•  A request between two processors must be routed in some way, preferably in an optimal manner that minimizes hops.
•  Desirable properties
    –  Simple → low complexity, low overhead, ease of correctness (deadlock free)
    –  Minimal latency in the presence of large message sizes.



Routing Strategies

•  Store-and-forward: A method typically used in LAN or WAN networks. Data is sent in packets that are received in their entirety at switches before being forwarded.
•  Cut-through routing: A method that reduces the latency for packets traversing a path. Think of it as network pipelining.



Store-and-Forward vs. Cut-Through Routing

•  Store-and-forward makes the routing decision only when all phits of a packet have been received.
•  Cut-through routing makes the routing decision immediately upon receiving the first physical unit (phit) of the packet, and all subsequent phits “cut through” along this route.
•  Train analogy
    –  What train scenario looks like store-and-forward?
    –  What train scenario looks like cut-through?



Train Analogy

•  Train at a station (as a connection to the next station)
    –  Store-and-forward routing: The entire train must stop before moving on.
•  Train encountering a railroad switch
    –  Cut-through routing: The first car makes the “decision” as to which direction to take, and all the others simply follow along.



Routing Strategies: Analysis

•  What’s the big deal?
•  Latency
    –  Let h be the routing distance, b the bandwidth, n the size of the message, and d the delay at each switch.
    –  How to make a store-and-forward look more like a cut-through, thus reaping some benefits of pipelining?
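Using h, b, n, and d as defined above, the textbook unloaded-network latencies are h(n/b + d) for store-and-forward and n/b + hd for cut-through. The slide's own expressions were lost in extraction, so the sketch below uses those standard forms as an assumption. The third function illustrates one answer to the closing question: split the message into packets so that successive packets overlap on different hops. It ignores per-packet header overhead, and `packet_size` is an illustrative parameter.

```python
import math


def store_and_forward_latency(h, n, b, d):
    """Whole message is received at each of h hops before moving on."""
    return h * (n / b + d)


def cut_through_latency(h, n, b, d):
    """Only the head pays the per-hop delay; the body streams behind it."""
    return n / b + h * d


def packetized_store_and_forward_latency(h, n, b, d, packet_size):
    """Store-and-forward with the message split into packets that pipeline across hops."""
    packets = math.ceil(n / packet_size)
    stage_time = packet_size / b + d  # time one packet spends on one hop
    return (h + packets - 1) * stage_time


if __name__ == "__main__":
    h, n, b, d = 4, 1000, 100, 1  # hypothetical units: hops, bytes, bytes/us, us
    print(store_and_forward_latency(h, n, b, d))
    print(cut_through_latency(h, n, b, d))
    print(packetized_store_and_forward_latency(h, n, b, d, packet_size=100))
```

Shrinking the packet size moves the packetized result toward the cut-through figure, until the per-hop delay d starts to dominate each small packet.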



Anatomy of a Switch



The Crossbar

•  Provides the internal switching structure for the switch.
•  Non-blocking crossbar
    +  Guarantees a path between each distinct input and output simultaneously in any permutation
    –  Costs go up quadratically. Cost of a full NxN crossbar, N = # inputs = # outputs?
•  Anatomy of a fully-connected NxN crossbar?
    –  Collection of multiplexers that forms a crossbar …



The Crossbar

•  Provides the internal switching structure for the switch.
•  Blocking crossbar
    –  Pros and cons complement the above.
    –  “Degenerate” crossbar is a bus.
        •  Cost of a bus-based NxN “crossbar”?
    –  Multistage interconnection network (MIN)?
        •  What does it look like?
        •  Cost of a MIN NxN “crossbar”?
•  More on this … coming up next in Topology …
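As a rough way to compare the three options, one can count components. The sketch below assumes a full crossbar needs one crosspoint per input/output pair, a bus needs one attachment per port on a single shared medium, and a MIN is built butterfly-style from log2(N) stages of N/2 two-by-two switches (N a power of two). These are idealized counts for illustration, not a hardware cost model, and the function names are made up here.

```python
def crossbar_crosspoints(n):
    """Full N x N crossbar: one crosspoint per (input, output) pair."""
    return n * n


def bus_attachments(n):
    """Bus as a degenerate crossbar: N attachments, but one transfer at a time."""
    return n


def min_2x2_switches(n):
    """Butterfly-style MIN: log2(N) stages of N/2 two-by-two switches."""
    stages = n.bit_length() - 1
    if n < 2 or 2 ** stages != n:
        raise ValueError("n must be a power of two, at least 2")
    return stages * (n // 2)


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, crossbar_crosspoints(n), bus_attachments(n), min_2x2_switches(n))
```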



Topology

•  Oftentimes infeasible to connect every processing element to every other.
    –  Example
        •  Macroscale, e.g., cluster supercomputers
            –  PE count: O(1,000) to O(10,000)
            –  Functionally possible but very, very expensive.
                »  As much as half the price of a supercomputer
        •  Microscale, e.g., emerging chip multiprocessors like Cell, GPGPU
            –  Larger interconnect → larger real estate required for “non-compute” entities
•  Solution: Be smart about how to connect PEs together.
•  This connection pattern is the topology.



Simple Topology

•  One-Dimensional Topologies
    –  Chain
        •  Order all N processors in a line, numbered 1 … N, and connect each processor P with processors P-1 and P+1 (where they exist).
        •  Sending a message from P1 to P4 must traverse 3 links.
        •  Best case? Average case? Worst case?
    –  Torus (or Ring)
        •  Instead of letting ends dangle, connect first to last to form a ring.
        •  Best case? Average case? Worst case?
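The best, average, and worst cases can be explored numerically. This sketch measures hop counts between all pairs of distinct processors, assuming shortest-path routing and a bidirectional ring. The worst case is the diameter and the mean is the average distance defined under Terminology. Function names are illustrative.

```python
from itertools import combinations


def chain_distance(i, j, n):
    """Hops between processors i and j on an n-node chain."""
    return abs(i - j)


def ring_distance(i, j, n):
    """Hops on an n-node bidirectional ring: go whichever way is shorter."""
    direct = abs(i - j)
    return min(direct, n - direct)


def distance_stats(n, distance):
    """Return (best, average, worst) hop count over all pairs of distinct nodes."""
    hops = [distance(i, j, n) for i, j in combinations(range(n), 2)]
    return min(hops), sum(hops) / len(hops), max(hops)


if __name__ == "__main__":
    print("chain:", distance_stats(8, chain_distance))
    print("ring: ", distance_stats(8, ring_distance))
```

Closing the chain into a ring roughly halves the worst case, at the cost of one extra link.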




•  Which of these is the Cell’s EIB?



The Effect of Adding Dimensions

•  Increase to two dimensions, i.e., 1-D chain → 2-D grid
    –  Each side (or dimension) will have how many processors?
    –  What about a k-dimensional “grid”?
•  For 2-D, connect each processor to its neighbors.
    –  Up to 4 connections per processor.
    –  Boundaries can be wired to form a 2-D torus.



2-D Mesh and Torus



Higher-Dimensional Meshes and Tori

•  Keep playing this “trick” of embedding processors into grids of increasing dimensionality.
•  Key Observation
    –  Each time the dimension is increased, the # of point-to-point connections for each processor increases.
•  Generalization
    –  The # of point-to-point connections per node within a k-dimensional grid is?
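One way to explore the generalization is to enumerate neighbors directly. The sketch below assumes N processors arranged as a k-dimensional torus with the same number of processors along every side, so each side holds N^(1/k) of them. With at least three processors per side, every node then has 2k distinct neighbors. On a mesh without wraparound, boundary nodes have fewer. The function names are illustrative.

```python
def side_length(n, k):
    """Processors per side when n processors form a k-dimensional grid."""
    side = round(n ** (1.0 / k))
    if side ** k != n:
        raise ValueError("n is not a perfect k-th power")
    return side


def torus_neighbors(coord, side):
    """Neighbors of a node in a k-dimensional torus, k = len(coord)."""
    neighbors = set()
    for axis in range(len(coord)):
        for step in (-1, 1):
            shifted = list(coord)
            shifted[axis] = (shifted[axis] + step) % side  # wrap at the boundary
            neighbors.add(tuple(shifted))
    return neighbors


if __name__ == "__main__":
    for k in (1, 2, 3):
        side = side_length(64, k)
        degree = len(torus_neighbors((0,) * k, side))
        print(f"k={k}: side={side}, connections per node={degree}")
```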



Hypercubes

•  A d-dimensional hypercube has 2^d corners, each of which is an endpoint for d edges.
•  Such interconnection networks were the rage of the 1980s and 1990s.
•  Pros and Cons?



Trees

•  Another topology for attacking the hop count problem …
•  Hop distance is logarithmic. Yay!
•  Bisection bandwidth is O(1) due to single critical node at root. Boo! (See figure to right.)



Butterflies
“Extend the ‘tree’ with butterflies …”

•  Takes same logarithmic-depth approach but with multiple roots.
•  Can be built out of basic 2x2 switches.
•  For N = 2^d nodes, we have log_2 N levels of switches.



Butterflies
“Extend the ‘tree’ with butterflies …”

•  Pro
    –  Natural correspondence to algorithmic structures, e.g., Fast Fourier Transform (FFT) and sorting networks.
•  Con
    –  Cost of short diameter (logarithmic) and bisection (N/2) is $$$$.
        •  Each node needs log_2 N switches!



Butterflies → Fat Trees

•  Butterflies are related to another topology encountered in practice – fat trees – particularly in large cluster supercomputers.



Topology Properties

* : d = dimension
** : Bisection can be 1 for some switches, N for crossbar



Topologies and Routing

•  Topologies with regular structure have simple routing algorithms.

•  Example: Hypercube (2-D and 3-D)
    –  Simply labeling the nodes with the binary encodings of the numbers 0 … 2^N – 1 (N here being the hypercube dimension) yields a convenient routing pattern.



Connectivity and Routing: Hypercube

•  Connectivity: A matter of edges between nodes that differ by exactly one bit.

•  Routing: A to B must traverse the dimensions that have bits on in XOR(A, B).

•  Shortest Path Length: Hamming distance between A and B
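The three bullets above translate almost directly into code. The sketch below assumes nodes are labeled with integers whose bits are the hypercube coordinates, and it corrects the differing bits from lowest to highest. That ordering is one arbitrary choice (often called dimension-order routing). Any order of the set bits in XOR(A, B) gives a path of the same length.

```python
def hypercube_route(src, dst):
    """Return the list of nodes visited from src to dst, fixing one differing bit per hop."""
    path = [src]
    node = src
    diff = src ^ dst  # set bits mark the dimensions that must be traversed
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit  # move along this dimension
            path.append(node)
        diff >>= 1
        bit += 1
    return path


def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    route = hypercube_route(0b000, 0b101)
    print([format(node, "03b") for node in route])
    print(len(route) - 1 == hamming_distance(0b000, 0b101))
```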



Routing Algorithms

•  Key Insight
    –  Build algorithms that take advantage of intrinsic properties of the topology.
•  Other Considerations
    –  Minimize hop counts
    –  Minimize data transmissions
•  What happens when link to root (in a tree) goes down due to heat?
•  Consider a torus-based network where each processor holds a set of numbers.
•  Goal: Compute the sum of all numbers and store the result on each processor.



Global Sum on a Torus

1.  Each processor computes the sum of its local data.
2.  Each processor sends its sum to its left neighbor. The sum received from the neighbor is added to the local sum. This new partial sum is passed to the left.
3.  After sqrt(P) steps, the partial sum along one dimension (i.e., row) returns to each processor.
4.  Repeat 1-3 but along the other dimension (i.e., column).

•  Total time for data set of size N split over P processors?
•  Is there a faster way?
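The steps can be simulated sequentially to check that every processor ends up with the global total. This sketch makes two assumptions. First, it reads step 2 as "new partial sum = own local sum + partial sum just received". Second, it uses sqrt(P) - 1 exchanges per dimension, since one more exchange would count each processor's own value twice. The torus is modeled as a square list of lists, and all names are illustrative.

```python
def ring_allreduce(local):
    """Simulate one ring: every position ends up holding sum(local)."""
    side = len(local)
    partial = list(local)
    for _ in range(side - 1):
        # Everyone sends left at once, so processor i receives from processor i + 1.
        received = [partial[(i + 1) % side] for i in range(side)]
        partial = [local[i] + received[i] for i in range(side)]
    return partial


def torus_global_sum(grid):
    """grid[r][c] is the list of numbers held by the processor at row r, column c."""
    side = len(grid)
    # Step 1: purely local sums, no communication.
    local = [[sum(numbers) for numbers in row] for row in grid]
    # Steps 2-3: combine along each row.
    row_sums = [ring_allreduce(row) for row in local]
    # Step 4: repeat along each column, starting from the row sums.
    col_sums = [ring_allreduce([row_sums[r][c] for r in range(side)])
                for c in range(side)]
    return [[col_sums[c][r] for c in range(side)] for r in range(side)]


if __name__ == "__main__":
    grid = [[[1, 2], [3]],
            [[4], [5, 6]]]
    print(torus_global_sum(grid))
```

Each processor takes part in 2 * (sqrt(P) - 1) exchanges in this version, which matches the order of the 2 * sqrt(P) transmissions discussed next.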



Considerations

•  Faster than sequential?
•  Local sums obviously faster.
    –  Concurrently computing the partial sums of N/P elements is faster than any one processor computing the sum of all N elements.
•  Problem?
    –  Interconnect overhead to execute the 2 * sqrt(P) transmissions may be quite high relative to the computing capability of each processor.
•  Why is the above a problem?
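A back-of-the-envelope model makes the trade-off concrete. The sketch below assumes a fixed cost `t_add` per addition and a fixed cost `t_msg` per message, charges the parallel version N/P local additions plus 2 * sqrt(P) message-and-add steps, and ignores contention and overlap. Both parameter names and the sample values are made up for illustration.

```python
import math


def sequential_time(n, t_add):
    """One processor adds all n numbers."""
    return n * t_add


def torus_sum_time(n, p, t_add, t_msg):
    """Local sums of n/p numbers, then 2 * sqrt(p) exchange-and-add steps."""
    local = (n / p) * t_add
    communication = 2 * math.sqrt(p) * (t_msg + t_add)
    return local + communication


if __name__ == "__main__":
    t_add, t_msg = 1.0, 1000.0  # hypothetical nanoseconds
    for n in (16, 10**6, 10**9):
        print(n, sequential_time(n, t_add), torus_sum_time(n, 16, t_add, t_msg))
```

With these sample numbers the parallel sum loses badly for 16 values and wins comfortably for a billion: the communication term is fixed by P, so it only pays off once N/P is large enough.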



Performance: Machine Balance

•  Last example refers to the need to “balance a machine or algorithm” …

•  Quantity that we are tuning? “Surface” (communication) to “volume” (computation) ratio.

•  Performance factors to consider … performance profiling
    –  Time to compute a local sum over a local data set.
    –  Time to send a single small message over the interconnect.

•  Performance profiling will come into play when using the CPU vs. CPU+GPGPU, e.g., adding a grid of 16 numbers on a quad-core CPU vs. CPU+GPGPU.



Architectural Aspects

•  Currently, caches are still a key performance enhancement for multiprocessor systems (just like single-CPU systems).

•  Caches require some additional logic to make them continue to function and provide determinism in the main memory of a compute node.

•  Coherence protocols and any other form of data transport between cores require an interconnection network.
    –  At scale, all-to-all bus-like structures are infeasible.
    –  Solution: Novel topologies that sacrifice peak performance (avg latency, bandwidth, contention characteristics, etc.) for the economic (and physical) factors underlying their design and manufacturing.




Multicore Considerations

•  Interconnection networks are constrained more in the multicore context than in the large-scale SMP world. Why?

•  But AMD Barcelona quad-core processor utilizes 11 Cu layers.
    –  Relative to # transistors in the two planar dimensions of the processor, the CPU remains for all intents and purposes … flat.
    –  Cramming a sophisticated interconnection network that is not planar into a limited number of layers is quite hard. (Caveat: Proximity interconnect.) Thus, there is a limitation on the type of interconnect on-chip.




Multicore at Scale

•  Life is becoming more interesting as core counts continue to increase.
    –  Intel Terascale Chip: 80 cores
    –  Tilera: “Reconfigurable” 64 cores based on a 2-D mesh topology.
    –  AMD/ATi HD 4870: 800 cores
    –  NVIDIA GeForce GTX 280: 240 cores. Why not 256? :-)
•  In the not-too-distant future, interconnect topology will be back “in vogue” for parallel computing …





•  This concludes the “architectural stuff” … now onto …




Parallel Software: Correctness & Performance







Correctness

•  Hardest aspect of parallel algorithm design and parallel programming? Writing programs that are “correct” …
    –  What good is a program that generates wrong answers faster?
•  What do we mean by correctness?
    –  Traditionally, proving that a given algorithm produces the output that is desired.
    –  Example: Prim’s algorithm produces a minimum spanning tree.
    –  Correctness means that the tree produced by Prim’s algorithm is indeed a minimum spanning tree.



Underlying Assumption

•  Traditional algorithms take the following for granted:
    –  The machine is deterministic.
    –  Only one flow of control is active at any given time.

•  Nondeterminism only comes into play in a purely theoretical sense when talking about automata theory, NFAs vs. DFAs, and P vs. NP.

•  This is not the sort of nondeterminism that we are talking about here. What are we talking about?

•  When two uncoordinated flows of control interact with each other, there is no guarantee, without explicit guidance, that the relative effects and interactions of the multiple threads of control will happen in a predictable order.
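A tiny model shows what an unpredictable order can do. The sketch below does not use real threads. It enumerates every legal interleaving of two hypothetical flows of control, each performing an unsynchronized load, add, store on one shared counter, and collects the final values. The three-step decomposition of "counter += 1" is an assumption about how the increment is carried out.

```python
def interleavings(a, b):
    """Yield every merge of step lists a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving; return the final value of the shared counter."""
    shared = 0
    register = {}  # each flow's private copy of the counter
    for flow, op in schedule:
        if op == "load":
            register[flow] = shared
        elif op == "add":
            register[flow] += 1
        elif op == "store":
            shared = register[flow]
    return shared


flow0 = [(0, "load"), (0, "add"), (0, "store")]
flow1 = [(1, "load"), (1, "add"), (1, "store")]
outcomes = sorted({run(s) for s in interleavings(flow0, flow1)})

if __name__ == "__main__":
    print(outcomes)
```

Two increments should leave the counter at 2, yet any schedule in which both flows load before either stores leaves it at 1. Preventing that requires explicit coordination such as a lock.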



Performance

•  The “holy grail” of parallel computing …

•  A parallel program should run at least as fast as the sequential equivalent for a fixed input size.
    –  One may use parallelism to increase the volume that can be computed, in which case comparisons of time are not as important. (Weak scaling)



Performance and Correctness (or Correctness and Performance?)

•  Performance and correctness are often intimately coupled.
    –  Without protections in place, a program can run very quickly but suffer from severe correctness problems.
    –  Very conservative decisions can be made to ensure correctness but at the cost of significant performance degradation. Example of this?
•  Other performance factors are unrelated to the logic put in place to maintain determinism and correctness.
    –  Example: Granularity of computation and communication can be poorly chosen, resulting in abysmal performance.
