
Chapter 2

Parallel Architectures

Outline

Interconnection networks
Processor arrays
Multiprocessors
Multicomputers
Flynn’s taxonomy

Interconnection Networks

Uses of interconnection networks
  Connect processors to shared memory
  Connect processors to each other

Interconnection media types
  Shared medium
  Switched medium

Shared versus Switched Media


Shared Medium

Allows only one message at a time
Messages are broadcast
Each processor “listens” to every message
Arbitration is decentralized
Collisions require resending of messages
Ethernet is an example

Switched Medium

Supports point-to-point messages between pairs of processors
Each processor has its own path to switch
Advantages over shared media
  Allows multiple messages to be sent simultaneously
  Allows scaling of network to accommodate increase in processors

Switch Network Topologies

View switched network as a graph
  Vertices = processors or switches
  Edges = communication paths
Two kinds of topologies
  Direct
  Indirect

Direct Topology

Ratio of switch nodes to processor nodes is 1:1
Every switch node is connected to
  1 processor node
  At least 1 other switch node

Indirect Topology

Ratio of switch nodes to processor nodes is greater than 1:1
Some switches simply connect other switches

Evaluating Switch Topologies

Diameter
  Distance between farthest two nodes
  Clique K_n is best: d = O(1), but #edges m = O(n^2)
  m = O(n) in a path P_n or cycle C_n, but d = O(n) as well
Bisection width
  Min. number of edges in a cut that divides the network into two roughly equal halves
  Determines the min. bandwidth of the network
  K_n’s bisection width is Θ(n^2), but C_n’s is O(1)
Degree = number of edges per node
  Constant-degree board can be mass produced
Constant edge length? (yes/no)
Planar? Easier to build
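The diameter, edge-count, and degree criteria can be checked numerically on small graphs. The following is a minimal sketch, assuming an adjacency-list representation; the helper names (`path`, `cycle`, `clique`, `diameter`, `edge_count`, `max_degree`) are illustrative and not part of the slides.

```python
from collections import deque


def path(n):
    """Path P_n: node v is linked to v-1 and v+1 where they exist."""
    return {v: [u for u in (v - 1, v + 1) if 0 <= u < n] for v in range(n)}


def cycle(n):
    """Cycle C_n (n >= 3): a path with the two ends joined."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}


def clique(n):
    """Clique K_n: every node is linked to every other node."""
    return {v: [u for u in range(n) if u != v] for v in range(n)}


def eccentricity(graph, source):
    """Largest hop count from source to any node, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in graph[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return max(dist.values())


def diameter(graph):
    """Distance between the two farthest nodes."""
    return max(eccentricity(graph, v) for v in graph)


def edge_count(graph):
    """Each undirected edge appears in two adjacency lists."""
    return sum(len(nbrs) for nbrs in graph.values()) // 2


def max_degree(graph):
    return max(len(nbrs) for nbrs in graph.values())
```

For n = 8 the clique has diameter 1 but 28 edges, while the path and cycle have only 7 and 8 edges but diameters 7 and 4, which is the trade-off the slide describes.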

2-D Mesh Network

Direct topology
Switches arranged into a 2-D lattice
Communication allowed only between neighboring switches
Variants allow wraparound connections between switches on edge of mesh

2-D Meshes Torus



Evaluating 2-D Meshes

Diameter: Θ(n^(1/2))
m = Θ(n)
Bisection width: Θ(n^(1/2))
Number of edges per switch: 4
Constant edge length? Yes
Planar
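These figures can be reproduced for a small square mesh. The sketch below assumes a side-by-side lattice of n = side × side switches and measures the cut down the middle column boundary as a stand-in for the bisection; the function names are hypothetical.

```python
from collections import deque


def mesh_2d(side, wraparound=False):
    """Adjacency sets for a side x side mesh; wraparound=True gives a torus."""
    graph = {}
    for r in range(side):
        for c in range(side):
            nbrs = set()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if wraparound:
                    rr, cc = rr % side, cc % side
                if 0 <= rr < side and 0 <= cc < side and (rr, cc) != (r, c):
                    nbrs.add((rr, cc))
            graph[(r, c)] = nbrs
    return graph


def diameter(graph):
    """Longest shortest path, using breadth-first search from every node."""
    best = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in graph[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        best = max(best, max(dist.values()))
    return best


def middle_cut_width(graph, side):
    """Number of edges crossing from the left half of the columns to the right half."""
    half = side // 2
    return sum(
        1
        for (_, c), nbrs in graph.items()
        if c < half
        for (_, cc) in nbrs
        if cc >= half
    )
```

With side = 4 (n = 16), the plain mesh has diameter 6 and a middle cut of 4 edges, both proportional to n^(1/2); the wraparound variant shortens the diameter to 4 and doubles the cut to 8.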

Binary Tree Network

Indirect topology
n = 2^d processor nodes, n-1 switches



Evaluating Binary Tree Network

Diameter: 2 log n
m = O(n)
Bisection width: 1
Edges / node: 3
Constant edge length? No
Planar

Hypertree Network

Indirect topology
Shares low diameter of binary tree
Greatly improves bisection width
From “front” looks like k-ary tree of height d
From “side” looks like upside down binary tree of height d

Hypertree Network



Evaluating 4-ary Hypertree

Diameter: log n
Bisection width: n / 2
Edges / node: 6
Constant edge length? No

Butterfly Network

Indirect topology
n = 2^d processor nodes connected by n(log n + 1) switching nodes

[Figure: butterfly network with 8 processor nodes (0 to 7) and switching nodes labeled (rank, column), ranks 0 to 3 and columns 0 to 7]

Butterfly Network Routing
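The routing figure for this slide did not survive extraction. As a rough sketch of one common scheme, assume switching nodes are labeled (rank, column) with ranks 0 to d, and that each hop from rank i to rank i+1 either keeps the column or flips its i-th most significant bit. Under that assumed wiring, a message can be steered by matching the column to the destination address one bit per rank. The function name `butterfly_route` is hypothetical.

```python
def butterfly_route(src_col, dst_col, d):
    """Switch nodes visited, as (rank, column), from rank 0 down to rank d.

    Assumes the wiring described in the lead-in: the hop out of rank i may
    change only the i-th most significant bit of the d-bit column number.
    """
    col = src_col
    route = [(0, col)]
    for rank in range(d):
        bit = 1 << (d - 1 - rank)  # i-th most significant bit of the column
        # Keep the column if this bit already agrees with the destination,
        # otherwise take the cross edge that flips it.
        col = (col & ~bit) | (dst_col & bit)
        route.append((rank + 1, col))
    return route
```

Every route makes exactly d = log n hops between switching ranks, which matches the logarithmic diameter quoted for the butterfly.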



Evaluating Butterfly Network

Diameter: log n
Bisection width: n / 2
Edges per node: 4
Constant edge length? No


Hypercube

Direct topology
2 x 2 x … x 2 mesh
Number of nodes a power of 2
Node addresses 0, 1, …, 2^k - 1
Node i connected to k nodes whose addresses differ from i in exactly one bit position
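The one-bit-difference rule maps directly onto XOR. A minimal sketch, with illustrative function names that are not from the slides:

```python
def hypercube_neighbors(i, k):
    """Addresses of the k nodes that differ from node i in exactly one bit."""
    return [i ^ (1 << b) for b in range(k)]


def hypercube_distance(a, b):
    """Minimum hop count between two nodes: the number of differing address bits."""
    return bin(a ^ b).count("1")
```

In a 4-dimensional hypercube, node 0110 is linked to 0111, 0100, 0010, and 1110, and the farthest pair (such as 0000 and 1111) is k = log n hops apart.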

Hypercube Addressing

[Figure: 4-dimensional hypercube with 16 nodes addressed 0000 to 1111]

Hypercubes Illustrated

Evaluating Hypercube Network

Diameter: log n
Bisection width: n / 2
Edges per node: log n
Constant edge length? No


Shuffle-exchange

Direct topology
Number of nodes a power of 2
Nodes have addresses 0, 1, …, 2^k - 1
Two outgoing links from node i
  Shuffle link to node LeftCycle(i)
  Exchange link to node [xor (i, 1)]
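Both links are simple bit operations on the k-bit node address. The sketch below assumes LeftCycle means a one-position left rotation of the k-bit address; the Python names are illustrative.

```python
def left_cycle(i, k):
    """Rotate the k-bit address i left by one position (the shuffle link)."""
    mask = (1 << k) - 1
    return ((i << 1) & mask) | (i >> (k - 1))


def exchange(i):
    """Flip the least significant bit of the address (the exchange link)."""
    return i ^ 1
```

For k = 3, the shuffle links send nodes 0 through 7 to 0, 2, 4, 6, 1, 3, 5, 7, and applying the shuffle k times returns every node to itself.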

Shuffle-exchange Illustrated

[Figure: shuffle-exchange network with 8 nodes, labeled 0 to 7]

Shuffle-exchange Addressing

[Figure: shuffle-exchange network with 16 nodes addressed 0000 to 1111]

Evaluating Shuffle-exchange

Diameter: 2 log n - 1
Bisection width: n / log n
Edges per node: 2


Comparing Networks

All have logarithmic diameter except 2-D mesh
Hypertree, butterfly, and hypercube have bisection width n / 2
All have constant edges per node except hypercube
Only 2-D mesh keeps edge lengths constant as network size increases

Vector Computers

Vector computer: instruction set includes operations on vectors as well as scalars
Two ways to implement vector computers
  Pipelined vector processor: streams data through pipelined arithmetic units (CRAY-I, II)
  Processor array: many identical, synchronized arithmetic processing elements (MasPar’s MP-I, II)

Why Processor Arrays?

Historically, high cost of a control unit
Scientific applications have data parallelism

Processor Array



Data/instruction Storage

Front end computer
  Program
  Data manipulated sequentially
Processor array
  Data manipulated in parallel

Processor Array Performance

Performance: work done per time unit
Performance of processor array
  Speed of processing elements
  Utilization of processing elements

Performance Example 1

1024 processors
Each adds a pair of integers in 1 μsec
What is performance when adding two 1024-element vectors (one per processor)?

Performance = 1024 operations / 1 μsec = 1.024 × 10^9 ops/sec

Performance Example 2

512 processors
Each adds two integers in 1 μsec
Performance adding two vectors of length 600?
Two addition steps are needed, since 600 elements do not fit on 512 processors at once

Performance = 600 operations / 2 μsec = 3 × 10^8 ops/sec
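Both performance examples follow one rule: the array needs ceil(vector length / processors) lockstep addition steps, and performance is the number of useful operations divided by total time. A small sketch of that arithmetic, assuming every step takes the same time; the function name and parameters are illustrative.

```python
import math


def array_performance(num_processors, vector_length, op_time_sec):
    """Operations per second for an element-wise vector add on a processor array.

    Assumes each processing element handles one element per step and that
    all elements in a step are processed in lockstep.
    """
    steps = math.ceil(vector_length / num_processors)
    return vector_length / (steps * op_time_sec)
```

With 1 μsec per addition, 1024 processors on 1024 elements give about 1.024 × 10^9 ops/sec, while 512 processors on 600 elements need two steps and give about 3 × 10^8 ops/sec, because 424 processors sit idle during the second step.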

2-D Processor Interconnection Network



Each VLSI chip has 16 processing elements

if (COND) then A else B

Processor Array Shortcomings

Not all problems are data-parallel
Speed drops for conditionally executed code
Don’t adapt to multiple users well
Do not scale down well to “starter” systems
Rely on custom VLSI for processors
Expense of control units has dropped
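The slowdown on conditional code comes from the single control unit: for "if (COND) then A else B" it has to broadcast A and B one after the other, and each processing element is masked off during the branch it does not take. The following sketch models that with plain Python lists, assuming both branches take one step of equal length; `simd_if_else` is a hypothetical name.

```python
def simd_if_else(data, cond, then_op, else_op):
    """Simulate a processor array running if/else on one element per PE.

    Returns the results and the fraction of PE time slots doing useful work.
    """
    mask = [cond(x) for x in data]

    # Step 1: the control unit broadcasts the "then" branch.
    # PEs whose condition is false are masked off and keep their value.
    result = [then_op(x) if m else x for x, m in zip(data, mask)]
    active_step1 = sum(mask)

    # Step 2: the control unit broadcasts the "else" branch.
    # Now the PEs that ran the "then" branch are masked off.
    result = [r if m else else_op(x) for x, r, m in zip(data, result, mask)]
    active_step2 = len(data) - sum(mask)

    utilization = (active_step1 + active_step2) / (2 * len(data))
    return result, utilization
```

Under these assumptions each element is active in exactly one of the two steps, so utilization is 50% regardless of how the condition splits the data.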

Multiprocessors

Multiprocessor: multiple-CPU computer with a shared memory
Same address on two different CPUs refers to the same memory location
Avoid three problems of processor arrays
  Can be built from commodity CPUs
  Naturally support multiple users
  Maintain efficiency in conditional code

Centralized Multiprocessor

Straightforward extension of uniprocessor
Add CPUs to bus
All processors share same primary memory
Memory access time same for all CPUs
  Uniform memory access (UMA) multiprocessor
  Symmetrical multiprocessor (SMP): Sequent Balance Series, SGI Power and Challenge series

Centralized Multiprocessor



Private and Shared Data

Private data: items used only by a single processor
Shared data: values used by multiple processors
In a multiprocessor, processors communicate via shared data values

Problems Associated with Shared Data

Cache coherence
  Replicating data across multiple caches reduces contention
  How to ensure different processors have same value for same address?
Synchronization
  Mutual exclusion
  Barrier
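The two synchronization primitives can be illustrated with Python threads standing in for processors that share memory. This is only a sketch: `run_workers` and its parameters are made up for the example, and `threading.Lock` and `threading.Barrier` play the roles of mutual exclusion and barrier.

```python
import threading


def run_workers(num_threads=4, increments=1000):
    """Each thread adds to a shared counter, then waits at a barrier."""
    total = 0
    lock = threading.Lock()                   # mutual exclusion on the shared counter
    barrier = threading.Barrier(num_threads)  # no thread passes until all arrive
    seen_after_barrier = []

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:                        # only one thread updates total at a time
                total += 1
        barrier.wait()                        # wait until every thread has finished adding
        with lock:
            seen_after_barrier.append(total)  # every thread now sees the final sum

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total, seen_after_barrier
```

The lock prevents lost updates to the shared value, and the barrier guarantees that no thread reads the total before all contributions are in.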

Cache-coherence Problem

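[Figure sequence: CPU A and CPU B, each with a cache, sharing one memory that holds X]
1. Memory holds X = 7; neither cache holds X
2. One CPU reads X: its cache now holds 7
3. The other CPU reads X: both caches hold 7
4. CPU B writes 2 to X: memory and CPU B’s cache hold 2, but CPU A’s cache still holds 7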

Write Invalidate Protocol

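[Figure sequence: CPU A and CPU B, each with a cache and a cache control monitor, sharing one memory that holds X]
1. Memory holds X = 7; both caches hold 7
2. One CPU broadcasts “Intent to write X”
3. The other cache invalidates its copy of X; only the writer’s cache still holds 7
4. The writer stores 2: its cache and memory hold X = 2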
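The write-invalidate sequence can be mimicked with a toy bus model. This sketch assumes write-through caches, a single shared bus, and one value per address; the classes `SnoopyBus` and `Cache` are hypothetical and leave out everything a real cache controller would need.

```python
class SnoopyBus:
    """Shared bus that every cache controller monitors."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.caches = []

    def broadcast_intent_to_write(self, addr, writer):
        # Every other cache sees the broadcast and drops its copy.
        for cache in self.caches:
            if cache is not writer:
                cache.lines.pop(addr, None)


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_intent_to_write(addr, self)
        self.lines[addr] = value
        self.bus.memory[addr] = value         # write-through (an assumption)
```

A short usage example that replays the four frames, with CPU A chosen arbitrarily as the writer:

```python
bus = SnoopyBus({"X": 7})
cache_a, cache_b = Cache(bus), Cache(bus)
cache_a.read("X")
cache_b.read("X")        # both caches hold 7
cache_a.write("X", 2)    # cache B's copy is invalidated
cache_b.read("X")        # miss, so B fetches the new value 2
```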

Distributed Multiprocessor

Distribute primary memory among processors
Increase aggregate memory bandwidth and lower average memory access time
Allow greater number of processors
Also called non-uniform memory access (NUMA) multiprocessor: SGI Origin Series

Distributed Multiprocessor



Cache Coherence

Some NUMA multiprocessors do not support it in hardware
  Only instructions, private data in cache
  Large memory access time variance
Implementation more difficult
  No shared memory bus to “snoop”
  Directory-based protocol needed

Directory-based Protocol

Distributed directory contains information about cacheable memory blocks
One directory entry for each cache block
Each entry has
  Sharing status
  Which processors have copies

Sharing Status

Uncached
  Block not in any processor’s cache
Shared
  Cached by one or more processors
  Read only
Exclusive
  Cached by exactly one processor
  Processor has written block
  Copy in memory is obsolete
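The three sharing states can be tied together in a toy model of a single directory entry. This is a sketch under simplifying assumptions: one memory block, a bit vector with one bit per CPU, and no modeling of network messages or timing. The class `DirectoryEntry` and its method names are hypothetical.

```python
class DirectoryEntry:
    """Directory entry, memory copy, and per-CPU cached copies of one block."""

    def __init__(self, value, num_cpus):
        self.status = "U"                 # U = uncached, S = shared, E = exclusive
        self.sharers = [0] * num_cpus     # bit vector: which CPUs hold a copy
        self.memory = value
        self.caches = [None] * num_cpus   # None means "not cached"

    def _flush_owner(self):
        """Copy the exclusive owner's value back to memory."""
        owner = self.sharers.index(1)
        self.memory = self.caches[owner]

    def read(self, cpu):
        if self.caches[cpu] is not None:  # cache hit
            return self.caches[cpu]
        if self.status == "E":            # read miss on an exclusive block:
            self._flush_owner()           # owner's data goes back, block becomes shared
        self.caches[cpu] = self.memory
        self.sharers[cpu] = 1
        self.status = "S"
        return self.caches[cpu]

    def write(self, cpu, value):
        if self.status == "E" and self.sharers[cpu] != 1:
            self._flush_owner()           # take the block away from its owner
        for other in range(len(self.sharers)):
            if other != cpu:              # invalidate every other copy
                self.caches[other] = None
                self.sharers[other] = 0
        self.caches[cpu] = value
        self.sharers[cpu] = 1
        self.status = "E"                 # memory copy is now obsolete

    def write_back(self, cpu):
        assert self.status == "E" and self.sharers[cpu] == 1
        self.memory = self.caches[cpu]
        self.caches[cpu] = None
        self.sharers[cpu] = 0
        self.status = "U"
```

A usage example with an arbitrary choice of CPUs: starting from X = 7, a read by CPU 0 makes the entry S with bit vector 1 0 0, and a later write of 6 by CPU 0 makes it E with bit vector 1 0 0 while memory still holds 7.

```python
x = DirectoryEntry(7, 3)
x.read(0)       # status S, sharers [1, 0, 0]
x.write(0, 6)   # status E, sharers [1, 0, 0], memory still 7
x.read(1)       # owner's data flushed: memory 6, status S, sharers [1, 1, 0]
```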

Directory-based Protocol

[Figure: CPU 0, CPU 1, and CPU 2, each with its own cache, local memory, and directory, connected by an interconnection network]

Directory-based Protocol

[Figure sequence: CPU 0, CPU 1, and CPU 2 with their caches, memories, and directories on an interconnection network. The directory entry for X holds a sharing status followed by a bit vector with one bit per CPU.]

Initial state
  Memory holds X = 7
  Directory entry X: U 0 0 0

CPU 0 Reads X
  Read miss
  Directory entry becomes X: S 1 0 0
  CPU 0’s cache receives X = 7

A second CPU reads X
  Read miss
  Two caches now hold X = 7

CPU 0 Writes 6 to X
  Write miss
  The other cached copy is invalidated
  Directory entry becomes X: E 1 0 0
  CPU 0’s cache holds X = 6; memory still holds 7

Another CPU reads X
  Read miss
  Owner’s block is switched to shared
  Memory is updated to X = 6
  Two caches hold X = 6

Another CPU writes 5 to X
  Write miss
  The other cached copies are invalidated
  The writer’s cache holds X = 5; memory still holds 6

CPU 0 writes 4 to X
  Write miss
  Block is taken away from its owner
  Memory is updated to X = 5
  CPU 0’s cache receives X = 5, then holds X = 4

CPU 0 Writes Back X Block
  Data write back
  Memory holds X = 4; no cache holds X

Multicomputer

Distributed memory multiple-CPU computer
Same address on different processors refers to different physical memory locations
Processors interact through message passing
Commercial multicomputers: iPSC I, II, Intel Paragon, Ncube I, II
Commodity clusters, e.g., Cheetah

Asymmetrical Multicomputer



Asymmetrical MC Advantages

Back-end processors dedicated to parallel computations
  Easier to understand, model, tune performance
Only a simple back-end operating system needed
  Easy for a vendor to create

Asymmetrical MC Disadvantages

Front-end computer is a single point of failure
Single front-end computer limits scalability of system
Primitive operating system in back-end processors makes debugging difficult
Every application requires development of both front-end and back-end program

Symmetrical Multicomputer



Symmetrical MC Advantages

Alleviate performance bottleneck caused by single front-end computer
Better support for debugging
Every processor executes same program

Symmetrical MC Disadvantages

More difficult to maintain illusion of single “parallel computer”
No simple way to balance program development workload among processors
More difficult to achieve high performance when multiple processes on each processor

ParPar Cluster, A Mixed Model



Commodity Cluster

Co-located computers
Dedicated to running parallel jobs
No keyboards or displays
Identical operating system
Identical local disk images
Administered as an entity

Network of Workstations

Dispersed computers
First priority: person at keyboard
Parallel jobs run in background
Different operating systems
Different local images
Checkpointing and restarting important

Flynn’s Taxonomy

Instruction stream
Data stream
Single vs. multiple
Four combinations
  SISD
  SIMD
  MISD
  MIMD
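The four combinations come from two independent yes/no questions: is there more than one instruction stream, and more than one data stream? A tiny sketch of that mapping, where `flynn_class` is an illustrative helper rather than anything defined in the slides:

```python
def flynn_class(instruction_streams, data_streams):
    """Name the Flynn category for the given stream counts."""
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"
```

For example, one instruction stream driving 1024 data streams, as in a processor array, comes out as SIMD.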

SISD

Single Instruction, Single Data
Single-CPU systems
Note: co-processors don’t count
  Functional
  I/O
Example: PCs

SIMD

Single Instruction, Multiple Data
Two architectures fit this category
  Pipelined vector processor (e.g., Cray-1)
  Processor array (e.g., Connection Machine CM-1, MasPar 1000/2000)

MISD

Multiple Instruction, Single Data
Example: systolic array??



MIMD

Multiple Instruction, Multiple Data
Multiple-CPU computers
  Multiprocessors
  Multicomputers

Summary

Commercial parallel computers appeared in 1980s
Multiple-CPU computers now dominate
Small-scale: Centralized multiprocessors
Large-scale: Distributed memory architectures (multiprocessors or multicomputers)
