TRANSCRIPT
Lecture 3, TTH 03:30PM-04:45PM
Dr. Jianjun Hu
http://mleg.cse.sc.edu/edu/csce569/
CSCE569 Parallel Computing
University of South Carolina, Department of Computer Science and Engineering
Outline
- Parallel Architecture
- Interconnection networks
- Processor arrays
- Multiprocessors
- Multicomputers
- Flynn's taxonomy
Interconnection Networks
- Uses of interconnection networks
  - Connect processors to shared memory
  - Connect processors to each other
- Interconnection media types
  - Shared medium
  - Switched medium
Shared versus Switched Media
Shared Medium
- Allows only one message at a time
- Messages are broadcast
- Each processor "listens" to every message
- Arbitration is decentralized
- Collisions require resending of messages
- Ethernet is an example
Switched Medium
- Supports point-to-point messages between pairs of processors
- Each processor has its own path to the switch
- Advantages over shared media
  - Allows multiple messages to be sent simultaneously
  - Allows scaling of the network to accommodate an increase in processors
Switch Network Topologies
- View a switched network as a graph
  - Vertices = processors or switches
  - Edges = communication paths
- Two kinds of topologies: direct and indirect
Direct Topology
- Ratio of switch nodes to processor nodes is 1:1
- Every switch node is connected to
  - 1 processor node
  - At least 1 other switch node
Indirect Topology
- Ratio of switch nodes to processor nodes is greater than 1:1
- Some switches simply connect other switches
Evaluating Switch Topologies
- Diameter
- Bisection width
- Number of edges per node
- Constant edge length? (yes/no)
2-D Mesh Network
- Direct topology
- Switches arranged into a 2-D lattice
- Communication allowed only between neighboring switches
- Variants allow wraparound connections between switches on the edge of the mesh
2-D Meshes
Evaluating 2-D Meshes
- Diameter: Θ(n^(1/2))
- Bisection width: Θ(n^(1/2))
- Number of edges per switch: 4
- Constant edge length? Yes
Binary Tree Network
- Indirect topology
- n = 2^d processor nodes, n - 1 switches
Evaluating Binary Tree Network
- Diameter: 2 log n
- Bisection width: 1
- Edges per node: 3
- Constant edge length? No
Hypertree Network
- Indirect topology
- Shares the low diameter of a binary tree
- Greatly improves bisection width
- From the "front" looks like a k-ary tree of height d
- From the "side" looks like an upside-down binary tree of height d
Hypertree Network
Evaluating 4-ary Hypertree
- Diameter: log n
- Bisection width: n / 2
- Edges per node: 6
- Constant edge length? No
Butterfly Network
- Indirect topology
- n = 2^d processor nodes connected by n(log n + 1) switching nodes
[Figure: butterfly network for n = 8; processor nodes 0-7 at the top, switching nodes (r, c) arranged in ranks 0-3 of 8 columns each]
Butterfly Network Routing
Evaluating Butterfly Network
- Diameter: log n
- Bisection width: n / 2
- Edges per node: 4
- Constant edge length? No
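Butterfly routing can be sketched in a few lines of Python. This is my own illustration, not code from the lecture: the function name and the MSB-first bit order are assumptions. At each rank the switch compares one bit of the current column with the corresponding bit of the destination; if they differ, the packet takes the cross edge, which flips that bit.

```python
# Sketch of destination-based routing in a butterfly network with
# n = 2**d processor nodes (here d = 3, n = 8). A packet entering at
# rank 0, column `src` reaches column `dst` at rank d by examining
# one destination bit per rank, most significant bit first.

def butterfly_route(src, dst, d=3):
    """Return the list of (rank, column) switches visited."""
    col = src
    path = [(0, col)]
    for rank in range(d):
        bit = d - 1 - rank                        # bit examined at this rank
        if (col >> bit) & 1 != (dst >> bit) & 1:
            col ^= 1 << bit                       # cross edge: flip the bit
        path.append((rank + 1, col))
    return path

print(butterfly_route(2, 5))  # [(0, 2), (1, 6), (2, 4), (3, 5)]
```

Every route traverses exactly d = log n ranks, which matches the diameter given above.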
Hypercube
- Direct topology
- 2 x 2 x ... x 2 mesh
- Number of nodes is a power of 2
- Node addresses 0, 1, ..., 2^k - 1
- Node i connected to k nodes whose addresses differ from i in exactly one bit position
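The neighbor rule above has a direct one-line realization: flipping each of the k address bits in turn yields the k neighbors. A minimal sketch (function name is mine):

```python
# Neighbors of node i in a k-dimensional hypercube: the nodes whose
# addresses differ from i in exactly one bit position.

def hypercube_neighbors(i, k):
    return [i ^ (1 << b) for b in range(k)]

# Node 0101 in a 4-cube:
print([format(m, '04b') for m in hypercube_neighbors(0b0101, 4)])
# ['0100', '0111', '0001', '1101']
```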
Hypercube Addressing
[Figure: 4-dimensional hypercube with nodes labeled by the 4-bit addresses 0000-1111; each edge connects two addresses that differ in exactly one bit]
Hypercubes Illustrated
Evaluating Hypercube Network
- Diameter: log n
- Bisection width: n / 2
- Edges per node: log n
- Constant edge length? No
Shuffle-exchange
- Direct topology
- Number of nodes is a power of 2
- Nodes have addresses 0, 1, ..., 2^k - 1
- Two outgoing links from node i:
  - Shuffle link to node LeftCycle(i) (left cyclic rotation of i's k-bit address)
  - Exchange link to node xor(i, 1)
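The two link functions can be written out directly; this is my own sketch of the definitions above, with LeftCycle realized as a k-bit left rotation:

```python
# The two outgoing links of node i in a 2**k-node shuffle-exchange
# network.

def shuffle(i, k):
    """Shuffle link: left cyclic rotation of i's k-bit address."""
    return ((i << 1) | (i >> (k - 1))) & ((1 << k) - 1)

def exchange(i):
    """Exchange link: flip the least significant bit."""
    return i ^ 1

# For k = 3: node 101 shuffles to 011 and exchanges to 100.
print(format(shuffle(0b101, 3), '03b'), format(exchange(0b101), '03b'))
```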
Shuffle-exchange Illustrated
[Figure: 8-node shuffle-exchange network; nodes 0-7 labeled with the 3-bit addresses 000-111, with shuffle and exchange links drawn between them]
Evaluating Shuffle-exchange
- Diameter: 2 log n - 1
- Bisection width: n / log n
- Edges per node: 2
- Constant edge length? No
Comparing Networks
- All have logarithmic diameter except the 2-D mesh
- Hypertree, butterfly, and hypercube have bisection width n / 2
- All have a constant number of edges per node except the hypercube
- Only the 2-D mesh keeps edge lengths constant as network size increases
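The comparison above can be tabulated for a concrete size. The script below plugs n = 64 into the formulas from the preceding slides (for the 2-D mesh without wraparound I use the exact diameter 2(√n - 1), consistent with the slides' Θ(√n)):

```python
# Evaluation criteria for each topology at n = 64 processor nodes.
import math

n = 64
d = int(math.log2(n))
rows = [
    # (name, diameter, bisection width, edges/node, constant edge length)
    ("2-D mesh",         int(2 * (math.sqrt(n) - 1)), int(math.sqrt(n)), 4, True),
    ("binary tree",      2 * d,                       1,                 3, False),
    ("4-ary hypertree",  d,                           n // 2,            6, False),
    ("butterfly",        d,                           n // 2,            4, False),
    ("hypercube",        d,                           n // 2,            d, False),
    ("shuffle-exchange", 2 * d - 1,                   n // d,            2, False),
]
for name, diam, bisect, edges, const in rows:
    print(f"{name:16s} diameter={diam:3d} bisection={bisect:3d} "
          f"edges/node={edges} constant-edge={const}")
```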
Vector Computers
- Vector computer: instruction set includes operations on vectors as well as scalars
- Two ways to implement vector computers
  - Pipelined vector processor: streams data through pipelined arithmetic units
  - Processor array: many identical, synchronized arithmetic processing elements
Why Processor Arrays?
- Historically, high cost of a control unit
- Scientific applications have data parallelism
Processor Array
Data/instruction Storage
- Front-end computer: program; data manipulated sequentially
- Processor array: data manipulated in parallel
Processor Array Performance
- Performance: work done per time unit
- Performance of a processor array depends on
  - Speed of processing elements
  - Utilization of processing elements
Performance Example 1
- 1024 processors
- Each adds a pair of integers in 1 µsec
- What is performance when adding two 1024-element vectors (one element per processor)?
- Performance = 1024 operations / 1 µsec = 1.024 x 10^9 ops/sec
Performance Example 2
- 512 processors
- Each adds two integers in 1 µsec
- Performance adding two vectors of length 600?
- The first 512 additions take 1 µsec; the remaining 88 take another µsec
- Performance = 600 operations / 2 µsec = 3 x 10^8 ops/sec
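Both examples follow the same pattern: round the vector length up to a whole number of parallel steps, then divide operations by elapsed time. A small sketch (my own function, assuming 1 µsec per addition as in the examples):

```python
# Processor-array throughput for element-wise addition of two vectors.
import math

def array_performance(num_procs, vector_len, op_time_us=1.0):
    """Ops/sec; each processing element does one addition per step."""
    steps = math.ceil(vector_len / num_procs)  # parallel steps needed
    elapsed_us = steps * op_time_us
    return vector_len / elapsed_us * 1e6       # convert µsec to sec

print(array_performance(1024, 1024))  # Example 1: 1.024 x 10^9 ops/sec
print(array_performance(512, 600))    # Example 2: 3 x 10^8 ops/sec
```

Example 2 shows why utilization matters: in the second step only 88 of 512 processing elements are busy, so throughput drops well below the peak.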
2-D Processor Interconnection Network
- Each VLSI chip has 16 processing elements
Processor Array Shortcomings
- Not all problems are data-parallel
- Speed drops for conditionally executed code
- Don't adapt well to multiple users
- Do not scale down well to "starter" systems
- Rely on custom VLSI for processors
- Expense of control units has dropped
Multiprocessors
- Multiprocessor: multiple-CPU computer with a shared memory
- Same address on two different CPUs refers to the same memory location
- Avoid three problems of processor arrays:
  - Can be built from commodity CPUs
  - Naturally support multiple users
  - Maintain efficiency in conditional code
Centralized Multiprocessor
- Straightforward extension of a uniprocessor: add CPUs to the bus
- All processors share the same primary memory
- Memory access time is the same for all CPUs
  - Uniform memory access (UMA) multiprocessor
  - Symmetrical multiprocessor (SMP)
Centralized Multiprocessor
Private and Shared Data
- Private data: items used only by a single processor
- Shared data: values used by multiple processors
- In a multiprocessor, processors communicate via shared data values
Problems Associated with Shared Data
- Cache coherence
  - Replicating data across multiple caches reduces contention
  - How to ensure different processors have the same value for the same address?
- Synchronization
  - Mutual exclusion
  - Barrier
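The two synchronization primitives named above can be illustrated with Python threads; this is a sketch of the concepts, not hardware-level code. The lock provides mutual exclusion around the shared counter, and the barrier makes every worker wait until all have finished:

```python
# Mutual exclusion and a barrier, sketched with Python threads.
import threading

counter = 0
lock = threading.Lock()
barrier = threading.Barrier(4)

def worker():
    global counter
    for _ in range(1000):
        with lock:        # mutual exclusion: only one thread at a time
            counter += 1  # may enter the critical section
    barrier.wait()        # block until all 4 workers reach this point

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000 -- without the lock, updates could be lost
```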
Cache-coherence Problem
[Figure sequence: CPU A and CPU B each have a cache connected to a shared memory holding X = 7. (1) CPU A reads X and caches 7. (2) CPU B reads X and caches 7. (3) CPU B writes X = 2; memory is updated to 2, but CPU A's cache still holds the stale value 7.]
Write Invalidate Protocol
[Figure sequence: both caches hold X = 7 and a cache-control monitor snoops the bus. (1) CPU B broadcasts its intent to write X. (2) CPU A's copy of X is invalidated. (3) CPU B writes X = 2 in its cache.]
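The invalidation sequence can be mimicked with a toy simulation. This is a deliberately simplified sketch (write-through, no MESI states, class names are mine): before a CPU writes a cached value, all other copies are dropped, so no cache can serve a stale value afterwards.

```python
# Toy write-invalidate simulation: a shared bus drops other caches'
# copies of an address before a write proceeds.

class Bus:
    def __init__(self):
        self.caches = []

    def invalidate(self, writer, addr):
        for c in self.caches:
            if c is not writer:
                c.lines.pop(addr, None)  # drop any other cached copy

class Cache:
    def __init__(self, bus):
        self.lines = {}
        self.bus = bus
        bus.caches.append(self)

    def read(self, memory, addr):
        if addr not in self.lines:       # miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, memory, addr, value):
        self.bus.invalidate(self, addr)  # broadcast intent to write
        self.lines[addr] = value
        memory[addr] = value             # write-through for simplicity

bus = Bus()
memory = {'X': 7}
a, b = Cache(bus), Cache(bus)
a.read(memory, 'X'); b.read(memory, 'X')  # both caches hold X = 7
b.write(memory, 'X', 2)                   # A's copy is invalidated
print('X' in a.lines, b.lines['X'], memory['X'])  # False 2 2
```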
Memory Consistency
- Use a memory consistency model to achieve memory consistency
- Defines the allowable behavior of the memory system as seen by the programmer of parallel programs
- The consistency model guarantees that, for any sequence of read/write requests to the memory system, only allowable outputs are produced
Sequential Consistency Model
- Requires that all write operations appear to all processors in the same order
- Can be guaranteed by the following sufficient conditions:
  1) Every processor issues memory operations in program order
  2) After issuing a write, a processor waits until the write has completed before issuing its next operation
  3) After issuing a read, a processor waits until both the read and the write whose value the read returns have completed
  4) Together these enforce the orderings R->R, R->W, W->W, and W->R, where X->Y means X must complete before Y
Network Embedding/Mapping
- Embedding: mapping the nodes of network A onto the nodes of network B
- Example: how to embed a ring or a 2-D mesh into a hypercube
- The Reflected Gray Code (RGC) gives such a mapping: consecutive codewords differ in exactly one bit, so neighboring ring nodes map to adjacent hypercube nodes
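The ring-to-hypercube embedding via the reflected Gray code fits in a few lines; this sketch uses the standard formula g(i) = i XOR (i >> 1):

```python
# Ring-to-hypercube embedding via the reflected Gray code: ring node i
# is placed at hypercube node gray(i). Consecutive codewords differ in
# exactly one bit, so ring neighbors are hypercube neighbors.

def gray(i):
    return i ^ (i >> 1)

d = 3  # 3-cube hosting a ring of 8 nodes
ring = [gray(i) for i in range(2 ** d)]
print([format(g, '03b') for g in ring])
# ['000', '001', '011', '010', '110', '111', '101', '100']
```

Note that the wraparound pair (last and first labels) also differs in one bit, so the whole ring, not just the open chain, embeds with dilation 1. A 2-D mesh embeds similarly by applying a Gray code to each coordinate.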
Routing
- Routing algorithms depend on network topology, network contention, and network congestion
- Deterministic or adaptive routing algorithms
  - XY routing for 2-D meshes
  - E-cube routing for hypercubes
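Both deterministic algorithms named above can be sketched briefly (my own illustrative code; the lowest-dimension-first order in the E-cube sketch is one common convention):

```python
# XY routing on a 2-D mesh: correct the X coordinate fully, then Y.
def xy_route(src, dst):
    (x, y), path = src, [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# E-cube routing on a hypercube: fix the differing address bits in a
# fixed dimension order (lowest bit first here).
def ecube_route(src, dst):
    node, path = src, [src]
    diff, bit = src ^ dst, 0
    while diff:
        if diff & 1:
            node ^= 1 << bit     # traverse the link in this dimension
            path.append(node)
        diff >>= 1
        bit += 1
    return path

print(xy_route((0, 0), (2, 1)))   # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(ecube_route(0b000, 0b101))  # [0, 1, 5]
```

Because every packet corrects dimensions in the same fixed order, both algorithms are deadlock-free on their respective topologies.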
Routing for Omega Network
[Figure: 8-input Omega network with switching stages 0, 1, 2]
- An Omega network allows at most 8 concurrent messages (for the 8-input network shown)
- It is a blocking network, since only one path exists for each source-destination pair (A -> B)
Non-blocking Network: Benes
- For any permutation of inputs to outputs, there exists a setting of the Benes network's switches that realizes all connections without collision
- How do we find this configuration?
Distributed Multiprocessor
- Distribute primary memory among processors
- Increases aggregate memory bandwidth and lowers average memory access time
- Allows a greater number of processors
- Also called a non-uniform memory access (NUMA) multiprocessor
Distributed Multiprocessor
Multicomputer
- Distributed-memory multiple-CPU computer
- Same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters
Asymmetrical Multicomputer
Asymmetrical MC Advantages
- Back-end processors dedicated to parallel computations
  - Easier to understand, model, and tune performance
- Only a simple back-end operating system needed
  - Easy for a vendor to create
Asymmetrical MC Disadvantages
- Front-end computer is a single point of failure
- Single front-end computer limits scalability of the system
- Primitive operating system in back-end processors makes debugging difficult
- Every application requires development of both a front-end and a back-end program
Symmetrical Multicomputer
Symmetrical MC Advantages
- Alleviates the performance bottleneck caused by a single front-end computer
- Better support for debugging
- Every processor executes the same program
Symmetrical MC Disadvantages
- More difficult to maintain the illusion of a single "parallel computer"
- No simple way to balance program development workload among processors
- More difficult to achieve high performance when multiple processes run on each processor
ParPar Cluster, A Mixed Model
Commodity Cluster
- Co-located computers
- Dedicated to running parallel jobs
- No keyboards or displays
- Identical operating system
- Identical local disk images
- Administered as an entity
Network of Workstations
- Dispersed computers
- First priority: person at the keyboard
- Parallel jobs run in the background
- Different operating systems
- Different local images
- Checkpointing and restarting important
Flynn's Taxonomy
- Instruction stream and data stream, each single vs. multiple
- Four combinations: SISD, SIMD, MISD, MIMD
SISD
- Single Instruction, Single Data
- Single-CPU systems (note: co-processors don't count)
- Example: PCs
SIMD
- Single Instruction, Multiple Data
- Two architectures fit this category:
  - Pipelined vector processor (e.g., Cray-1)
  - Processor array (e.g., Connection Machine)
MISD
- Multiple Instruction, Single Data
- Example: systolic array
MIMD
- Multiple Instruction, Multiple Data
- Multiple-CPU computers: multiprocessors and multicomputers
Summary
- Commercial parallel computers appeared in the 1980s
- Multiple-CPU computers now dominate
- Small-scale: centralized multiprocessors
- Large-scale: distributed memory architectures (multiprocessors or multicomputers)
Summary (1/2)
- High performance computing
  - U.S. government
  - Capital-intensive industries
  - Many companies and research labs
- Parallel computers
  - Commercial systems
  - Commodity-based systems