Post on 31-Dec-2015
Chapter 2
Parallel Architectures
Outline
Interconnection networks
Processor arrays
Multiprocessors
Multicomputers
Flynn's taxonomy
Interconnection Networks
Uses of interconnection networks
Connect processors to shared memory
Connect processors to each other
Interconnection media types
Shared medium
Switched medium
Shared versus Switched Media
Shared Medium
Allows only one message at a time
Messages are broadcast
Each processor "listens" to every message
Arbitration is decentralized
Collisions require resending of messages
Ethernet is an example
Switched Medium
Supports point-to-point messages between pairs of processors
Each processor has its own path to the switch
Advantages over shared media:
Allows multiple messages to be sent simultaneously
Allows scaling of the network to accommodate an increase in processors
Switch Network Topologies
View a switched network as a graph
Vertices = processors or switches
Edges = communication paths
Two kinds of topologies
Direct
Indirect
Direct Topology
Ratio of switch nodes to processor nodes is 1:1
Every switch node is connected to
1 processor node
At least 1 other switch node
Indirect Topology
Ratio of switch nodes to processor nodes is greater than 1:1
Some switches simply connect other switches
Evaluating Switch Topologies
Diameter: distance between the two farthest nodes. The clique K_n is best, with d = O(1), but its edge count is m = O(n^2); a path P_n or cycle C_n has m = O(n), but d = O(n) as well.
Bisection width: minimum number of edges in a cut that divides the network into two roughly equal halves; it determines the minimum bandwidth of the network. K_n's bisection width is O(n), but C_n's is O(1).
Degree: number of edges per node; a constant-degree board can be mass-produced.
Constant edge length? (yes/no)
Planar? Easier to build.
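The diameter/edge-count tradeoff between K_n and C_n can be checked numerically. A minimal sketch (function names are my own): build adjacency lists for the two topologies and measure the diameter with breadth-first search.

```python
from collections import deque

def diameter(adj):
    """Longest shortest-path distance over all node pairs
    (BFS from every node; fine for small graphs)."""
    best = 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

def cycle(n):
    """C_n: each node links to its two neighbors on the ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def clique(n):
    """K_n: each node links to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}
```

For n = 8, the clique has diameter 1 while the cycle's diameter is n/2 = 4, matching the O(1) vs. O(n) claim above.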
2-D Mesh Network
Direct topology
Switches arranged into a 2-D lattice
Communication allowed only between neighboring switches
Variants allow wraparound connections between switches on the edge of the mesh
2-D Meshes and Tori
Evaluating 2-D Meshes
Diameter: Θ(n^(1/2))
Number of edges: m = Θ(n)
Bisection width: Θ(n^(1/2))
Number of edges per switch: 4
Constant edge length? Yes
Planar? Yes
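The Θ(n^(1/2)) diameter follows directly from coordinates: in a k × k mesh (n = k²) a packet crosses at most k − 1 rows and k − 1 columns. A small sketch, with hypothetical helper names:

```python
def mesh_diameter(k):
    """k x k mesh without wraparound: worst case crosses k-1 rows
    and k-1 columns, so d = 2(k - 1) = Theta(sqrt(n)) with n = k*k."""
    return 2 * (k - 1)

def torus_diameter(k):
    """Wraparound (torus) halves the worst-case distance per axis."""
    return 2 * (k // 2)
```

For k = 32 (n = 1024), the mesh diameter is 62 while the torus variant cuts it to 32.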
Binary Tree Network
Indirect topology
n = 2^d processor nodes, n − 1 switches
Evaluating Binary Tree Network
Diameter: 2 log n
Number of edges: m = O(n)
Bisection width: 1
Edges per node: 3
Constant edge length? No
Planar? Yes
Hypertree Network
Indirect topology
Shares the low diameter of a binary tree
Greatly improves bisection width
From the "front" looks like a k-ary tree of height d
From the "side" looks like an upside-down binary tree of height d
Hypertree Network
Evaluating 4-ary Hypertree
Diameter: log n
Bisection width: n / 2
Edges per node: 6
Constant edge length? No
Butterfly Network
Indirect topology
n = 2^d processor nodes connected by n(log n + 1) switching nodes
[Figure: butterfly with n = 8 processor columns 0–7; switch nodes labeled (rank, column) for ranks 0–3]
Butterfly Network Routing
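Butterfly routing can be sketched as destination-based bit correction: at each rank the packet either goes straight or takes the cross edge that fixes one bit of the column address. This is a sketch under an assumed labeling (most significant bit corrected first); real machines may order the bits differently.

```python
def butterfly_route(src, dst, d):
    """Follow a packet from source column `src` (rank 0) to destination
    column `dst` (rank d) in a butterfly with 2**d processor columns.
    At rank i the cross edge flips one address bit; straight vs. cross
    is decided by the corresponding bit of the destination address."""
    col = src
    path = [(0, col)]
    for i in range(d):
        bit = d - 1 - i                  # correct the MSB first (assumed)
        if ((col ^ dst) >> bit) & 1:     # bits disagree: take cross edge
            col ^= 1 << bit
        path.append((i + 1, col))
    return path
```

After d hops every address bit has been corrected, so the path length log n matches the diameter stated on the next slide.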
Evaluating Butterfly Network
Diameter: log n
Bisection width: n / 2
Edges per node: 4
Constant edge length? No
Hypercube
Direct topology
2 × 2 × … × 2 mesh
Number of nodes a power of 2
Node addresses 0, 1, …, 2^k − 1
Node i connected to the k nodes whose addresses differ from i in exactly one bit position
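The bit-flip neighbor rule can be written down directly; the Hamming distance between two addresses then gives the minimum hop count, which is why the diameter is log n. A minimal sketch (function names are my own):

```python
def hypercube_neighbors(i, k):
    """Neighbors of node i in a k-dimensional hypercube:
    flip each of the k address bits in turn."""
    return [i ^ (1 << b) for b in range(k)]

def hypercube_distance(i, j):
    """Minimum hop count = Hamming distance between the addresses."""
    return bin(i ^ j).count("1")
```

In a 3-cube, node 0 (000) neighbors 1, 2, and 4, and the farthest node is 7 (111) at distance 3 = log n.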
Hypercube Addressing
[Figure: 4-dimensional hypercube; 16 nodes labeled 0000–1111, with edges between addresses that differ in exactly one bit]
Hypercubes Illustrated
Evaluating Hypercube Network
Diameter: log n
Bisection width: n / 2
Edges per node: log n
Constant edge length? No
Shuffle-exchange
Direct topology
Number of nodes a power of 2
Nodes have addresses 0, 1, …, 2^k − 1
Two outgoing links from node i:
Shuffle link to node LeftCycle(i) (left rotation of the k-bit address)
Exchange link to node xor(i, 1)
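Both links are one-line bit operations on the k-bit address; a sketch (function names are my own):

```python
def shuffle(i, k):
    """LeftCycle: rotate the k-bit address of node i left by one
    (the 'perfect shuffle')."""
    return ((i << 1) | (i >> (k - 1))) & ((1 << k) - 1)

def exchange(i):
    """Exchange link: flip the least significant address bit."""
    return i ^ 1
```

With k = 3, node 5 (101) shuffles to 3 (011) and exchanges to 4 (100).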
Shuffle-exchange Illustrated
[Figure: shuffle-exchange network on nodes 0–7]
Shuffle-exchange Addressing
[Figure: shuffle-exchange network on 16 nodes, addresses 0000–1111]
Evaluating Shuffle-exchange
Diameter: 2 log n − 1
Bisection width: n / log n
Edges per node: 2
Constant edge length? No
Comparing Networks
All have logarithmic diameter except the 2-D mesh
Hypertree, butterfly, and hypercube have bisection width n / 2
All have a constant number of edges per node except the hypercube
Only the 2-D mesh keeps edge lengths constant as network size increases
Vector Computers
Vector computer: instruction set includes operations on vectors as well as scalars
Two ways to implement vector computers:
Pipelined vector processor: streams data through pipelined arithmetic units (e.g., Cray-1, Cray-2)
Processor array: many identical, synchronized arithmetic processing elements (e.g., MasPar MP-1, MP-2)
Why Processor Arrays?
Historically, high cost of a control unit
Scientific applications have data parallelism
Processor Array
Data/instruction Storage
Front-end computer
Program
Data manipulated sequentially
Processor array
Data manipulated in parallel
Processor Array Performance
Performance: work done per time unit
Performance of a processor array depends on:
Speed of processing elements
Utilization of processing elements
Performance Example 1
1024 processors
Each adds a pair of integers in 1 µsec
What is the performance when adding two 1024-element vectors (one element per processor)?
Performance = 1024 operations / 1 µsec = 1.024 × 10^9 ops/sec
Performance Example 2
512 processors
Each adds two integers in 1 µsec
Performance adding two vectors of length 600?
The first 512 element pairs take 1 µsec; the remaining 88 take another 1 µsec.
Performance = 600 operations / 2 µsec = 3 × 10^8 ops/sec
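Both examples follow the same rule: ceil(n/p) parallel steps, but only n useful operations. A sketch (the function name is my own):

```python
import math

def processor_array_perf(p, n, step_time):
    """ops/sec for p processing elements adding two n-element vectors:
    each step adds up to p element pairs, so ceil(n/p) steps are needed,
    while only n operations are useful work."""
    steps = math.ceil(n / p)
    return n / (steps * step_time)
```

Example 1 gives 1024 / 1 µsec = 1.024 × 10^9 ops/sec; Example 2 gives 600 / 2 µsec = 3 × 10^8 ops/sec, showing how utilization drops when the vector length is not a multiple of the array size.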
2-D Processor Interconnection Network
Each VLSI chip has 16 processing elements
if (COND) then A else B
Processor Array Shortcomings
Not all problems are data-parallel
Speed drops for conditionally executed code
Don't adapt to multiple users well
Do not scale down well to "starter" systems
Rely on custom VLSI for processors
Expense of control units has dropped
Multiprocessors
Multiprocessor: multiple-CPU computer with a shared memory
Same address on two different CPUs refers to the same memory location
Avoid three problems of processor arrays:
Can be built from commodity CPUs
Naturally support multiple users
Maintain efficiency in conditional code
Centralized Multiprocessor
Straightforward extension of the uniprocessor
Add CPUs to the bus
All processors share the same primary memory
Memory access time is the same for all CPUs
Uniform memory access (UMA) multiprocessor
Symmetrical multiprocessor (SMP), e.g., Sequent Balance series, SGI Power and Challenge series
Centralized Multiprocessor
Private and Shared Data
Private data: items used only by a single processor
Shared data: values used by multiple processors
In a multiprocessor, processors communicate via shared data values
Problems Associated with Shared Data
Cache coherence
Replicating data across multiple caches reduces contention
How to ensure different processors have the same value for the same address?
Synchronization
Mutual exclusion
Barrier
Cache-coherence Problem
CPUs A and B each have a cache; both are connected to a memory holding X = 7.
CPU A reads X: A's cache holds 7.
CPU B reads X: both caches hold 7.
CPU B writes 2 to X: memory and B's cache hold 2, but A's cache still holds the stale value 7.
Write Invalidate Protocol
A cache controller monitors ("snoops") the bus; initially both caches hold X = 7.
CPU B broadcasts an intent to write X.
CPU A invalidates its copy of X.
CPU B writes 2 to X; no stale copy remains.
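The write-invalidate idea can be sketched as a toy simulation (class and method names are my own; real snooping caches track MSI/MESI block states rather than just dropping entries):

```python
class SnoopingCache:
    """Toy write-invalidate cache on a shared bus: a writer broadcasts
    its intent and every other cache drops its copy before the write."""
    def __init__(self, bus):
        self.data = {}        # address -> cached value
        self.bus = bus
        bus.append(self)      # register on the shared bus

    def read(self, memory, addr):
        if addr not in self.data:
            self.data[addr] = memory[addr]   # read miss: fetch from memory
        return self.data[addr]

    def write(self, memory, addr, value):
        for cache in self.bus:               # broadcast intent to write
            if cache is not self:
                cache.data.pop(addr, None)   # other caches invalidate
        self.data[addr] = value
        memory[addr] = value                 # write-through, for simplicity

# replay the slide sequence
bus, memory = [], {"X": 7}
a, b = SnoopingCache(bus), SnoopingCache(bus)
a.read(memory, "X"); b.read(memory, "X")     # both caches hold 7
b.write(memory, "X", 2)                      # A's copy is invalidated
```

After the write, A's next read misses and fetches the fresh value 2, so no stale copy survives.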
Distributed Multiprocessor
Distribute primary memory among processors
Increase aggregate memory bandwidth and lower average memory access time
Allow a greater number of processors
Also called a non-uniform memory access (NUMA) multiprocessor, e.g., SGI Origin series
Distributed Multiprocessor
Cache Coherence
Some NUMA multiprocessors do not support it in hardware
Only instructions and private data in cache
Large memory access time variance
Implementation more difficult
No shared memory bus to "snoop"
Directory-based protocol needed
Directory-based Protocol
Distributed directory contains information about cacheable memory blocks
One directory entry for each cache block
Each entry has:
Sharing status
Which processors have copies
Sharing Status
Uncached: block not in any processor's cache
Shared: cached by one or more processors; read only
Exclusive: cached by exactly one processor; the processor has written the block; the copy in memory is obsolete
Directory-based Protocol: Interconnection Network
[Figure: each node contains a CPU with a cache, a slice of local memory, and a slice of the directory; the nodes are joined by the interconnection network]
Directory-based Protocol Example
The directory entry for block X records its sharing status (U = uncached, S = shared, E = exclusive) and a bit vector with one presence bit per CPU (CPU 0, CPU 1, CPU 2). Initially X = 7 in memory and the entry is X: U 0 0 0.
CPU 0 reads X: read miss; the directory supplies the block; entry X: S 1 0 0; CPU 0's cache holds 7.
CPU 2 reads X: read miss; entry X: S 1 0 1; CPU 2's cache also holds 7.
CPU 0 writes 6 to X: write miss; the directory invalidates CPU 2's copy; entry X: E 1 0 0; CPU 0's cache holds 6.
CPU 1 reads X: read miss; CPU 0 switches to shared and supplies the block; memory is updated to 6; entry X: S 1 1 0; CPU 0 and CPU 1 both hold 6.
CPU 2 writes 5 to X: write miss; the directory invalidates the copies at CPU 0 and CPU 1; entry X: E 0 0 1; CPU 2's cache holds 5.
CPU 0 writes 4 to X: write miss; the directory takes the block away from CPU 2 and memory is updated to 5; entry X: E 1 0 0; CPU 0 receives the block and sets X to 4.
CPU 0 writes back the X block: data write back; memory is updated to 4; entry X: U 0 0 0.
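The trace above can be replayed with a toy single-block simulation (a sketch with invented names; a real protocol exchanges messages between distributed directory slices rather than calling methods on one object):

```python
class DirectoryBlock:
    """Toy directory entry for one memory block: status is 'U'
    (uncached), 'S' (shared), or 'E' (exclusive); `caches` maps
    CPU id -> cached value and plays the role of the bit vector."""
    def __init__(self, value):
        self.memory = value
        self.status = "U"
        self.caches = {}

    def read(self, cpu):
        if self.status == "E":   # owner's copy is newest: update memory
            self.memory = next(iter(self.caches.values()))
        self.caches[cpu] = self.memory   # supply the block to the reader
        self.status = "S"
        return self.caches[cpu]

    def write(self, cpu, value):
        if self.status == "E":   # take the block away from the owner
            self.memory = next(iter(self.caches.values()))
        self.caches = {cpu: value}       # invalidate all other copies
        self.status = "E"

    def write_back(self, cpu):
        self.memory = self.caches.pop(cpu)   # data write back
        self.status = "U"
```

Running the slide sequence (reads by CPUs 0, 2, 1 and writes of 6, 5, 4, then a write back) reproduces the directory states U → S → E → S → E → U shown above.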
Multicomputer
Distributed-memory multiple-CPU computer
Same address on different processors refers to different physical memory locations
Processors interact through message passing
Commercial multicomputers: Intel iPSC/1 and iPSC/2, Intel Paragon, nCUBE 1 and 2
Commodity clusters, e.g., Cheetah
Asymmetrical Multicomputer
Asymmetrical MC Advantages
Back-end processors dedicated to parallel computations: easier to understand, model, and tune performance
Only a simple back-end operating system needed: easy for a vendor to create
Asymmetrical MC Disadvantages
Front-end computer is a single point of failure
Single front-end computer limits the scalability of the system
Primitive operating system in back-end processors makes debugging difficult
Every application requires development of both a front-end and a back-end program
Symmetrical Multicomputer
Symmetrical MC Advantages
Alleviates the performance bottleneck caused by a single front-end computer
Better support for debugging
Every processor executes the same program
Symmetrical MC Disadvantages
More difficult to maintain the illusion of a single "parallel computer"
No simple way to balance program development workload among processors
More difficult to achieve high performance when multiple processes run on each processor
ParPar Cluster, A Mixed Model
Commodity Cluster
Co-located computers
Dedicated to running parallel jobs
No keyboards or displays
Identical operating system
Identical local disk images
Administered as an entity
Network of Workstations
Dispersed computers
First priority: person at the keyboard
Parallel jobs run in the background
Different operating systems
Different local images
Checkpointing and restarting important
Flynn’s Taxonomy
Instruction stream
Data stream
Single vs. multiple
Four combinations:
SISD, SIMD, MISD, MIMD
SISD
Single Instruction, Single Data
Single-CPU systems
Note: co-processors (functional, I/O) don't count
Example: PCs
SIMD
Single Instruction, Multiple Data
Two architectures fit this category:
Pipelined vector processor (e.g., Cray-1)
Processor array (e.g., Connection Machine CM-1, MasPar 1000/2000)
MISD
Multiple Instruction, Single Data
Example: systolic array(?)
MIMD
Multiple Instruction, Multiple Data
Multiple-CPU computers:
Multiprocessors
Multicomputers
Summary
Commercial parallel computers appeared in the 1980s
Multiple-CPU computers now dominate
Small-scale: centralized multiprocessors
Large-scale: distributed memory architectures (multiprocessors or multicomputers)