Post on 31-Dec-2015
Chapter 2
Parallel Architectures
Outline
Interconnection networks
Processor arrays
Multiprocessors
Multicomputers
Flynn's taxonomy
Interconnection Networks
Uses of interconnection networks
Connect processors to shared memory
Connect processors to each other
Interconnection media types
Shared medium
Switched medium
Shared versus Switched Media
Shared Medium
Allows only one message at a time
Messages are broadcast
Each processor "listens" to every message
Arbitration is decentralized
Collisions require resending of messages
Ethernet is an example
Switched Medium
Supports point-to-point messages between pairs of processors
Each processor has its own path to the switch
Advantages over shared media:
Allows multiple messages to be sent simultaneously
Allows scaling of the network to accommodate an increase in processors
Switch Network Topologies
View a switched network as a graph
Vertices = processors or switches
Edges = communication paths
Two kinds of topologies
Direct
Indirect
Direct Topology
Ratio of switch nodes to processor nodes is 1:1
Every switch node is connected to
1 processor node
At least 1 other switch node
Indirect Topology
Ratio of switch nodes to processor nodes is greater than 1:1
Some switches simply connect other switches
Evaluating Switch Topologies
Diameter: distance between the two farthest nodes. The clique K_n is best, with d = O(1), but its edge count is m = O(n^2); a path P_n or cycle C_n has m = O(n), but d = O(n) as well.
Bisection width: minimum number of edges in a cut that divides the network into two roughly equal halves; it determines the minimum bandwidth of the network. K_n's bisection width is O(n), but C_n's is O(1).
Degree: number of edges per node; a constant-degree board can be mass-produced.
Constant edge length? (yes/no)
Planar? Easier to build.
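The diameter/edge-count tradeoff between K_n and C_n can be checked numerically. A minimal sketch (function names are my own): build adjacency lists for the two topologies and measure the diameter with breadth-first search.

```python
from collections import deque

def diameter(adj):
    """Longest shortest-path distance over all node pairs
    (BFS from every node; fine for small graphs)."""
    best = 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

def cycle(n):
    """C_n: each node links to its two neighbors on the ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def clique(n):
    """K_n: each node links to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}
```

For n = 8, the clique has diameter 1 while the cycle's diameter is n/2 = 4, matching the O(1) vs. O(n) claim above.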
2-D Mesh Network
Direct topology
Switches arranged into a 2-D lattice
Communication allowed only between neighboring switches
Variants allow wraparound connections between switches on the edge of the mesh
2-D Meshes and Tori
Evaluating 2-D Meshes
Diameter: Θ(n^(1/2))
Number of edges: m = Θ(n)
Bisection width: Θ(n^(1/2))
Number of edges per switch: 4
Constant edge length? Yes
Planar? Yes
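The Θ(n^(1/2)) diameter follows directly from coordinates: in a k × k mesh (n = k²) a packet crosses at most k − 1 rows and k − 1 columns. A small sketch, with hypothetical helper names:

```python
def mesh_diameter(k):
    """k x k mesh without wraparound: worst case crosses k-1 rows
    and k-1 columns, so d = 2(k - 1) = Theta(sqrt(n)) with n = k*k."""
    return 2 * (k - 1)

def torus_diameter(k):
    """Wraparound (torus) halves the worst-case distance per axis."""
    return 2 * (k // 2)
```

For k = 32 (n = 1024), the mesh diameter is 62 while the torus variant cuts it to 32.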
Binary Tree Network
Indirect topology
n = 2^d processor nodes, n − 1 switches
Evaluating Binary Tree Network
Diameter: 2 log n
Number of edges: m = O(n)
Bisection width: 1
Edges per node: 3
Constant edge length? No
Planar? Yes
Hypertree Network
Indirect topology
Shares the low diameter of a binary tree
Greatly improves bisection width
From the "front" looks like a k-ary tree of height d
From the "side" looks like an upside-down binary tree of height d
Hypertree Network
Evaluating 4-ary Hypertree
Diameter: log n
Bisection width: n / 2
Edges per node: 6
Constant edge length? No
Butterfly Network
Indirect topology
n = 2^d processor nodes connected by n(log n + 1) switching nodes
[Figure: butterfly with n = 8 processor columns 0–7; switch nodes labeled (rank, column) for ranks 0–3]
Butterfly Network Routing
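Butterfly routing can be sketched as destination-based bit correction: at each rank the packet either goes straight or takes the cross edge that fixes one bit of the column address. This is a sketch under an assumed labeling (most significant bit corrected first); real machines may order the bits differently.

```python
def butterfly_route(src, dst, d):
    """Follow a packet from source column `src` (rank 0) to destination
    column `dst` (rank d) in a butterfly with 2**d processor columns.
    At rank i the cross edge flips one address bit; straight vs. cross
    is decided by the corresponding bit of the destination address."""
    col = src
    path = [(0, col)]
    for i in range(d):
        bit = d - 1 - i                  # correct the MSB first (assumed)
        if ((col ^ dst) >> bit) & 1:     # bits disagree: take cross edge
            col ^= 1 << bit
        path.append((i + 1, col))
    return path
```

After d hops every address bit has been corrected, so the path length log n matches the diameter stated on the next slide.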
Evaluating Butterfly Network
Diameter: log n
Bisection width: n / 2
Edges per node: 4
Constant edge length? No
Hypercube
Direct topology
2 × 2 × … × 2 mesh
Number of nodes a power of 2
Node addresses 0, 1, …, 2^k − 1
Node i connected to the k nodes whose addresses differ from i in exactly one bit position
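The bit-flip neighbor rule can be written down directly; the Hamming distance between two addresses then gives the minimum hop count, which is why the diameter is log n. A minimal sketch (function names are my own):

```python
def hypercube_neighbors(i, k):
    """Neighbors of node i in a k-dimensional hypercube:
    flip each of the k address bits in turn."""
    return [i ^ (1 << b) for b in range(k)]

def hypercube_distance(i, j):
    """Minimum hop count = Hamming distance between the addresses."""
    return bin(i ^ j).count("1")
```

In a 3-cube, node 0 (000) neighbors 1, 2, and 4, and the farthest node is 7 (111) at distance 3 = log n.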
Hypercube Addressing
[Figure: 4-dimensional hypercube; 16 nodes labeled 0000–1111, with edges between addresses that differ in exactly one bit]
Hypercubes Illustrated
Evaluating Hypercube Network
Diameter: log n
Bisection width: n / 2
Edges per node: log n
Constant edge length? No
Shuffle-exchange
Direct topology
Number of nodes a power of 2
Nodes have addresses 0, 1, …, 2^k − 1
Two outgoing links from node i:
Shuffle link to node LeftCycle(i) (left rotation of the k-bit address)
Exchange link to node xor(i, 1)
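Both links are one-line bit operations on the k-bit address; a sketch (function names are my own):

```python
def shuffle(i, k):
    """LeftCycle: rotate the k-bit address of node i left by one
    (the 'perfect shuffle')."""
    return ((i << 1) | (i >> (k - 1))) & ((1 << k) - 1)

def exchange(i):
    """Exchange link: flip the least significant address bit."""
    return i ^ 1
```

With k = 3, node 5 (101) shuffles to 3 (011) and exchanges to 4 (100).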
Shuffle-exchange Illustrated
[Figure: shuffle-exchange network on nodes 0–7]
Shuffle-exchange Addressing
[Figure: shuffle-exchange network on 16 nodes, addresses 0000–1111]
Evaluating Shuffle-exchange
Diameter: 2 log n − 1
Bisection width: n / log n
Edges per node: 2
Constant edge length? No
Comparing Networks
All have logarithmic diameter except the 2-D mesh
Hypertree, butterfly, and hypercube have bisection width n / 2
All have a constant number of edges per node except the hypercube
Only the 2-D mesh keeps edge lengths constant as network size increases
Vector Computers
Vector computer: instruction set includes operations on vectors as well as scalars
Two ways to implement vector computers:
Pipelined vector processor: streams data through pipelined arithmetic units (e.g., Cray-1, Cray-2)
Processor array: many identical, synchronized arithmetic processing elements (e.g., MasPar MP-1, MP-2)
Why Processor Arrays?
Historically, high cost of a control unit
Scientific applications have data parallelism
Processor Array
Data/instruction Storage
Front-end computer
Program
Data manipulated sequentially
Processor array
Data manipulated in parallel
Processor Array Performance
Performance: work done per time unit
Performance of a processor array depends on:
Speed of processing elements
Utilization of processing elements
Performance Example 1
1024 processors
Each adds a pair of integers in 1 µsec
What is the performance when adding two 1024-element vectors (one element per processor)?
Performance = 1024 operations / 1 µsec = 1.024 × 10^9 ops/sec
Performance Example 2
512 processors
Each adds two integers in 1 µsec
Performance adding two vectors of length 600?
The first 512 element pairs take 1 µsec; the remaining 88 take another 1 µsec.
Performance = 600 operations / 2 µsec = 3 × 10^8 ops/sec
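Both examples follow the same rule: ceil(n/p) parallel steps, but only n useful operations. A sketch (the function name is my own):

```python
import math

def processor_array_perf(p, n, step_time):
    """ops/sec for p processing elements adding two n-element vectors:
    each step adds up to p element pairs, so ceil(n/p) steps are needed,
    while only n operations are useful work."""
    steps = math.ceil(n / p)
    return n / (steps * step_time)
```

Example 1 gives 1024 / 1 µsec = 1.024 × 10^9 ops/sec; Example 2 gives 600 / 2 µsec = 3 × 10^8 ops/sec, showing how utilization drops when the vector length is not a multiple of the array size.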
2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements
if (COND) then A else B
Processor Array Shortcomings
Not all problems are data-parallel
Speed drops for conditionally executed code
Don't adapt to multiple users well
Do not scale down well to "starter" systems
Rely on custom VLSI for processors
Expense of control units has dropped
Multiprocessors
Multiprocessor: multiple-CPU computer with a shared memory
Same address on two different CPUs refers to the same memory location
Avoid three problems of processor arrays:
Can be built from commodity CPUs
Naturally support multiple users
Maintain efficiency in conditional code
Centralized Multiprocessor
Straightforward extension of the uniprocessor
Add CPUs to the bus
All processors share the same primary memory
Memory access time is the same for all CPUs
Uniform memory access (UMA) multiprocessor
Symmetrical multiprocessor (SMP), e.g., Sequent Balance series, SGI Power and Challenge series
Centralized Multiprocessor
Private and Shared Data
Private data: items used only by a single processor
Shared data: values used by multiple processors
In a multiprocessor, processors communicate via shared data values
Problems Associated with Shared Data
Cache coherence
Replicating data across multiple caches reduces contention
How to ensure different processors have the same value for the same address?
Synchronization
Mutual exclusion
Barrier
Cache-coherence Problem
CPUs A and B each have a cache; both are connected to a memory holding X = 7.
CPU A reads X: A's cache holds 7.
CPU B reads X: both caches hold 7.
CPU B writes 2 to X: memory and B's cache hold 2, but A's cache still holds the stale value 7.
Write Invalidate Protocol
A cache controller monitors ("snoops") the bus; initially both caches hold X = 7.
CPU B broadcasts an intent to write X.
CPU A invalidates its copy of X.
CPU B writes 2 to X; no stale copy remains.
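The write-invalidate idea can be sketched as a toy simulation (class and method names are my own; real snooping caches track MSI/MESI block states rather than just dropping entries):

```python
class SnoopingCache:
    """Toy write-invalidate cache on a shared bus: a writer broadcasts
    its intent and every other cache drops its copy before the write."""
    def __init__(self, bus):
        self.data = {}        # address -> cached value
        self.bus = bus
        bus.append(self)      # register on the shared bus

    def read(self, memory, addr):
        if addr not in self.data:
            self.data[addr] = memory[addr]   # read miss: fetch from memory
        return self.data[addr]

    def write(self, memory, addr, value):
        for cache in self.bus:               # broadcast intent to write
            if cache is not self:
                cache.data.pop(addr, None)   # other caches invalidate
        self.data[addr] = value
        memory[addr] = value                 # write-through, for simplicity

# replay the slide sequence
bus, memory = [], {"X": 7}
a, b = SnoopingCache(bus), SnoopingCache(bus)
a.read(memory, "X"); b.read(memory, "X")     # both caches hold 7
b.write(memory, "X", 2)                      # A's copy is invalidated
```

After the write, A's next read misses and fetches the fresh value 2, so no stale copy survives.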
Distributed Multiprocessor
Distribute primary memory among processors
Increase aggregate memory bandwidth and lower average memory access time
Allow a greater number of processors
Also called a non-uniform memory access (NUMA) multiprocessor, e.g., SGI Origin series
Distributed Multiprocessor
Cache Coherence
Some NUMA multiprocessors do not support it in hardware
Only instructions and private data in cache
Large memory access time variance
Implementation more difficult
No shared memory bus to "snoop"
Directory-based protocol needed
Directory-based Protocol
Distributed directory contains information about cacheable memory blocks
One directory entry for each cache block
Each entry has:
Sharing status
Which processors have copies
Sharing Status
Uncached: block not in any processor's cache
Shared: cached by one or more processors; read only
Exclusive: cached by exactly one processor; the processor has written the block; the copy in memory is obsolete
Directory-based Protocol: Interconnection Network
[Figure: each node contains a CPU with a cache, a slice of local memory, and a slice of the directory; the nodes are joined by the interconnection network]
Directory-based Protocol Example
The directory entry for block X records its sharing status (U = uncached, S = shared, E = exclusive) and a bit vector with one presence bit per CPU (CPU 0, CPU 1, CPU 2). Initially X = 7 in memory and the entry is X: U 0 0 0.
CPU 0 reads X: read miss; the directory supplies the block; entry X: S 1 0 0; CPU 0's cache holds 7.
CPU 2 reads X: read miss; entry X: S 1 0 1; CPU 2's cache also holds 7.
CPU 0 writes 6 to X: write miss; the directory invalidates CPU 2's copy; entry X: E 1 0 0; CPU 0's cache holds 6.
CPU 1 reads X: read miss; CPU 0 switches to shared and supplies the block; memory is updated to 6; entry X: S 1 1 0; CPU 0 and CPU 1 both hold 6.
CPU 2 writes 5 to X: write miss; the directory invalidates the copies at CPU 0 and CPU 1; entry X: E 0 0 1; CPU 2's cache holds 5.
CPU 0 writes 4 to X: write miss; the directory takes the block away from CPU 2 and memory is updated to 5; entry X: E 1 0 0; CPU 0 receives the block and sets X to 4.
CPU 0 writes back the X block: data write back; memory is updated to 4; entry X: U 0 0 0.
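The trace above can be replayed with a toy single-block simulation (a sketch with invented names; a real protocol exchanges messages between distributed directory slices rather than calling methods on one object):

```python
class DirectoryBlock:
    """Toy directory entry for one memory block: status is 'U'
    (uncached), 'S' (shared), or 'E' (exclusive); `caches` maps
    CPU id -> cached value and plays the role of the bit vector."""
    def __init__(self, value):
        self.memory = value
        self.status = "U"
        self.caches = {}

    def read(self, cpu):
        if self.status == "E":   # owner's copy is newest: update memory
            self.memory = next(iter(self.caches.values()))
        self.caches[cpu] = self.memory   # supply the block to the reader
        self.status = "S"
        return self.caches[cpu]

    def write(self, cpu, value):
        if self.status == "E":   # take the block away from the owner
            self.memory = next(iter(self.caches.values()))
        self.caches = {cpu: value}       # invalidate all other copies
        self.status = "E"

    def write_back(self, cpu):
        self.memory = self.caches.pop(cpu)   # data write back
        self.status = "U"
```

Running the slide sequence (reads by CPUs 0, 2, 1 and writes of 6, 5, 4, then a write back) reproduces the directory states U → S → E → S → E → U shown above.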
Multicomputer
Distributed-memory multiple-CPU computer
Same address on different processors refers to different physical memory locations
Processors interact through message passing
Commercial multicomputers: Intel iPSC/1 and iPSC/2, Intel Paragon, nCUBE 1 and 2
Commodity clusters, e.g., Cheetah
Asymmetrical Multicomputer
Asymmetrical MC Advantages
Back-end processors dedicated to parallel computations: easier to understand, model, and tune performance
Only a simple back-end operating system needed: easy for a vendor to create
Asymmetrical MC Disadvantages
Front-end computer is a single point of failure
Single front-end computer limits the scalability of the system
Primitive operating system in back-end processors makes debugging difficult
Every application requires development of both a front-end and a back-end program
Symmetrical Multicomputer
Symmetrical MC Advantages
Alleviates the performance bottleneck caused by a single front-end computer
Better support for debugging
Every processor executes the same program
Symmetrical MC Disadvantages
More difficult to maintain the illusion of a single "parallel computer"
No simple way to balance program development workload among processors
More difficult to achieve high performance when multiple processes run on each processor
ParPar Cluster, A Mixed Model
Commodity Cluster
Co-located computers
Dedicated to running parallel jobs
No keyboards or displays
Identical operating system
Identical local disk images
Administered as an entity
Network of Workstations
Dispersed computers
First priority: person at the keyboard
Parallel jobs run in the background
Different operating systems
Different local images
Checkpointing and restarting important
Flynn’s Taxonomy
Instruction stream
Data stream
Single vs. multiple
Four combinations:
SISD, SIMD, MISD, MIMD
SISD
Single Instruction, Single Data
Single-CPU systems
Note: co-processors (functional, I/O) don't count
Example: PCs
SIMD
Single Instruction, Multiple Data
Two architectures fit this category:
Pipelined vector processor (e.g., Cray-1)
Processor array (e.g., Connection Machine CM-1, MasPar 1000/2000)
MISD
Multiple Instruction, Single Data
Example: systolic array(?)
MIMD
Multiple Instruction, Multiple Data
Multiple-CPU computers:
Multiprocessors
Multicomputers
Summary
Commercial parallel computers appeared in the 1980s
Multiple-CPU computers now dominate
Small-scale: centralized multiprocessors
Large-scale: distributed memory architectures (multiprocessors or multicomputers)