Introduction to Parallel Processing with Multi-core
Part III – Architecture

Jie Liu, Ph.D.
Professor, Department of Computer Science
Western Oregon University, USA
Outline

1. Three models of parallel computation
2. Processor organizations
3. Processor arrays
4. Multiprocessors
5. Multi-computers
6. Flynn’s taxonomy
7. Affordable parallel computers
8. Algorithms with processor organizations
Processor Organization
In a parallel computer, processors need to “cooperate.” To do so, a processor must be able to “reach” other processors. The method of connecting the processors in a parallel computer is called the processor organization.
In a processor organization chart, vertices represent processors and edges represent communication paths.
Processor Organization Criteria

• Diameter: the largest distance between two nodes. The lower, the better, because it affects communication costs.
• Bisection width: the minimum number of edges that must be removed to divide the network into two halves (within one). The higher, the better, because it affects the number of concurrent communication channels.
• Number of edges per node: we consider this best if it is constant, i.e., independent of the number of processors, because it affects scalability.
• Maximum edge length: again, we consider this best if it is constant, i.e., independent of the number of processors, because it affects scalability.
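For small networks, the diameter and edges-per-node criteria can be checked directly by program. A short Python sketch (the adjacency-list representation and the 6-node ring example are illustrative, not from the slides):

```python
from collections import deque

def diameter(adj):
    """Largest shortest-path distance between any pair of nodes (BFS from each)."""
    def eccentricity(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

# A 6-node ring: node i connects to its two neighbors.
n = 6
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(diameter(ring))                       # 3, i.e., floor(n/2)
print(max(len(v) for v in ring.values()))   # 2 edges per node (constant)
```

Bisection width is much harder to compute in general (it is a graph-partitioning problem), which is why it is usually derived analytically per topology, as on the following slides.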
Mesh Networks
A mesh always has a dimension q, which could be 1, 2, 3, or even higher. Each interior node can communicate with 2q other processors.
Mesh Networks (2)

For a mesh with k^q nodes (as shown):
• Diameter: q(k − 1) (too large to admit NC-class algorithms)
• Bisection width: k^(q−1) (reasonable)
• Maximum number of edges per node: 2q (constant, good)
• Maximum edge length: constant (good)

Many parallel computers used this architecture because it is simple and scalable. The Intel Paragon XP/S used it.
Hypertree Networks

The figure shows a hypertree of degree k = 4 and depth d = 2.
Hypertree Networks (2)

For a hypertree of degree k and depth d (generally we only consider the case k = 4):
• Number of nodes: 2^d(2^(d+1) − 1)
• Diameter: 2d (good for designing NC-class algorithms)
• Bisection width: 2^(d+1) (reasonable)
• Maximum number of edges per node: 6 (roughly constant)
• Maximum edge length: depends on d

Only one parallel computer, Thinking Machines’ CM-5 (Connection Machine), used this architecture. Its designed maximum number of processors was 64K, and the processors were vector processors capable of performing 32 pairs of arithmetic operations per clock cycle.
Butterfly Network

A butterfly network has (k + 1)2^k nodes. The one on the right has k = 3.

In practice, rank 0 and rank k are combined, so each node has four connections.

Each rank contains n = 2^k nodes. If n(i, j) is the jth node on the ith rank, then it connects to two nodes on rank i − 1: n(i − 1, j) and n(i − 1, m), where m is the integer formed by inverting the ith most significant bit of the k-bit binary representation of j.

For example, n(2, 3) is connected to n(1, 3) and n(1, 1): 3 is 011, and inverting the second most significant bit gives 001, which is 1.
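The connection rule can be made concrete in a few lines of Python (a sketch; the function name and the (rank, column) pairing are illustrative):

```python
def rank_below_neighbors(i, j, k):
    """Rank-(i-1) nodes connected to n(i, j) in a butterfly with n = 2^k columns."""
    # Invert the i-th most significant bit of j's k-bit representation.
    m = j ^ (1 << (k - i))
    return [(i - 1, j), (i - 1, m)]

# n(2, 3) with k = 3: 3 is 011; inverting the 2nd most significant bit gives 001 = 1.
print(rank_below_neighbors(2, 3, 3))  # [(1, 3), (1, 1)]
```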
Butterfly Network (2)

For a butterfly network with (k + 1)2^k nodes:
• Diameter: 2k − 1 (good for designing NC-class algorithms)
• Bisection width: 2^(k−1) (very good)
• Maximum number of edges per node: 4 (constant)
• Maximum edge length: depends on k

The network is also called an omega network. A few computers used this connection network, including BBN’s TC2000.
Routing on Butterfly Network

To route a message from rank 0 to a node on rank k, each switch node picks off the leading bit of the destination address: if the bit is zero, the message goes to the left; otherwise, it goes to the right. The chart shows routing from n(0, 2) to n(3, 5).

Routing a message from rank k to a node on rank 0 works the same way: each switch node picks off the leading bit and sends the message left on a zero, right on a one.
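This is destination-tag routing: at rank i the message moves to the column whose ith most significant bit agrees with the destination column. A Python sketch (the function name and the mapping of bits to "left/right" are illustrative assumptions):

```python
def butterfly_route(src_col, dst_col, k):
    """Columns visited when routing from n(0, src_col) to n(k, dst_col)."""
    path = [(0, src_col)]
    col = src_col
    for i in range(1, k + 1):
        bit = 1 << (k - i)                    # the i-th most significant bit
        col = (col & ~bit) | (dst_col & bit)  # make it agree with the destination
        path.append((i, col))
    return path

# The slide's example: from n(0, 2) to n(3, 5), i.e., destination 101.
print(butterfly_route(2, 5, 3))  # [(0, 2), (1, 6), (2, 4), (3, 5)]
```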
Hypercube

A hypercube is a butterfly in which each column of switch nodes is collapsed into a single node. A binary d-cube has n = 2^d processors and an equal number of switch nodes.

The chart on the right shows a hypercube of degree 4.

Two switch nodes are connected if their binary labels differ in exactly one bit position.
Hypercube Measures

For a hypercube with n = 2^k nodes:
• Diameter: k (good for designing NC-class algorithms)
• Bisection width: n/2 (very good)
• Maximum number of edges per node: k
• Maximum edge length: depends on the number of nodes

Routing: find the differing bits and correct them one at a time, either from left to right or from right to left. For example, to go from 0100 to 1011 we can route 0100 → 1100 → 1000 → 1010 → 1011, or 0100 → 0101 → 0111 → 0011 → 1011.

nCUBE Corporation made machines of this structure up to k = 13 (theoretically). The company was later bought by Oracle.
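The left-to-right variant of this bit-fixing routing can be sketched in Python (the function name is illustrative):

```python
def hypercube_route(src, dst, k):
    """Route in a k-cube by fixing differing bits one at a time, MSB first."""
    path = [src]
    cur = src
    for bit in range(k - 1, -1, -1):   # most significant bit first
        mask = 1 << bit
        if (cur ^ dst) & mask:         # this bit still differs from dst
            cur ^= mask                # cross the corresponding cube edge
            path.append(cur)
    return path

print([format(x, "04b") for x in hypercube_route(0b0100, 0b1011, 4)])
# ['0100', '1100', '1000', '1010', '1011']
```

Each step crosses one hypercube edge, so the route length equals the number of differing bits, never more than the diameter k.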
Shuffle-Exchange Network

It has n = 2^k nodes numbered 0, 1, …, n − 1, and two kinds of connections: shuffle and exchange.

Exchange connections link two nodes whose numbers differ in their least significant bit.

Shuffle connections link node i with node 2i mod (n − 1) for i < n − 1 (node n − 1 is linked to itself).

Below is a shuffle-exchange network with n = 2^3 = 8.
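Because n = 2^k, the shuffle connection is simply a cyclic left shift of the node number's k-bit representation, which agrees with 2i mod (n − 1) for i < n − 1. A Python sketch (function names are illustrative):

```python
def shuffle(i, k):
    """Perfect shuffle: cyclic left shift of i's k-bit representation."""
    n = 1 << k
    return ((i << 1) | (i >> (k - 1))) & (n - 1)

def exchange(i):
    """Exchange link: flip the least significant bit."""
    return i ^ 1

k, n = 3, 8
print([shuffle(i, k) for i in range(n)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```

Note that shuffle(n − 1) = n − 1 (all ones rotates to itself), which is why the modular formula is stated only for i < n − 1.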
Shuffle-Exchange Network (2)

For a shuffle-exchange network with n = 2^k nodes:
• Diameter: 2k − 1 (good for designing NC-class algorithms)
• Bisection width: n/k (very good)
• Maximum number of edges per node: 2
• Maximum edge length: depends on the number of nodes

Routing is not easy, and it is hard to build a physical shuffle-exchange network because the connections cross each other. This architecture is studied mainly for its theoretical significance.
Summary
Processor Array
Processor Array (2)

Parallel computers employing processor-array technology can perform many arithmetic operations per clock cycle, achieved either with pipelined vector processors, such as the Cray-1, or with a processor array, such as Thinking Machines’ CM-200.

This type of parallel computer did not really survive because:
• the special CPUs were very expensive;
• it is hard to utilize all the processors;
• it cannot handle if-then-else statements well, because all the processors must carry out the same instruction;
• partitioning is very difficult;
• it really needs very large amounts of data, which makes I/O nearly impossible.
Multiprocessors

Parallel computers with multiple CPUs and a shared memory space.
• + can use commodity CPUs at reasonable cost
• + support multiple users
• + different CPUs can execute different instructions

UMA (uniform memory access), also called symmetric multiprocessor (SMP): all processors access any memory address at the same cost.
• The Sequent Symmetry could have up to 32 Intel 80386 processors, with all the CPUs sharing the same bus.
• The problem with SMP/UMA is that the number of processors is limited.

NUMA (nonuniform memory access): each processor can access its own memory, though that memory is also accessible by others, much more cheaply.
• Processors are connected through some interconnection network, such as a butterfly.
• Kendall Square Research machines supported over 1000 processors.
• The interconnection network costs too much, around 40% of the overall cost.
UMA vs. NUMA
Cache Coherence Problem
Multicomputers

Parallel computers with multiple CPUs and NO shared memory; processors interact through message passing.
• + all the advantages of multiprocessors, plus the possibility of a very large number of CPUs
• − message passing is hard to implement and takes a long time to carry out

The first generation of message passing was store-and-forward: a processor receives the complete message and then forwards it to the next processor (iPSC and nCUBE/10).

The second generation was circuit-switched: a path is first established at some cost, and subsequent messages then use the path without the start-up cost (iPSC/2 and nCUBE 2).

The cost of message passing (www.cs.bgu.ac.il/~discsp/papers/ConstrDisAC.pdf) has three parameters:
• T_s: startup time, incurred even for an empty message
• T_b: per-byte cost
• T_fp: the cost of one floating-point operation, for comparison
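These parameters give the usual linear cost model T(n) = T_s + n·T_b for an n-byte message. A minimal sketch (the numeric values below are made-up assumptions, not measurements of any machine):

```python
def message_time(n_bytes, t_s, t_b):
    """Linear cost model: startup time plus per-byte transfer time."""
    return t_s + n_bytes * t_b

# Hypothetical figures: 100 microseconds startup, 10 nanoseconds per byte.
t_s, t_b = 100e-6, 10e-9
print(message_time(0, t_s, t_b))           # even an empty message pays t_s
print(message_time(1_000_000, t_s, t_b))   # 1 MB message
```

Comparing T(n) with T_fp tells you how many arithmetic operations a communication step is "worth," which is the key ratio when deciding whether parallelizing a loop pays off.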
Multicomputers – nCUBE

An nCUBE parallel computer has three parts: the frontend, the backend, and the I/O subsystem. The frontend is a fully functioning computer. The backend nodes, each a computer of its own, run a simple OS that supports message passing. Note that the capability of the frontend stays the same regardless of the number of processors in the backend. The largest nCUBE can have 8K processors.
Multicomputers – CM-5

Each node consists of a SPARC CPU, up to 32 MB of RAM, and four pipelined vector units, each delivering 32 Mflops. The machine can have up to 16K nodes, for a theoretical peak speed of 2 teraflops (in 1991).
Flynn’s Taxonomy

• SISD – single-core PC
• SIMD – processor array or CM-200
• MISD – systolic array
• MIMD – multi-core PC, nCUBE, Symmetry, CM-5, Paragon XP/S

|                      | Single Data | Multiple Data |
| -------------------- | ----------- | ------------- |
| Single-Instruction   | SISD        | SIMD          |
| Multiple-Instruction | MISD        | MIMD          |
Inexpensive “Parallel Computers”

• Beowulf: PCs connected by a switch
• NOW: workstations on an intranet
• Multi-core: PCs with a few multicore CPUs

|            | # of nodes  | Cost | Performance | Easy to program | Dedicated |
| ---------- | ----------- | ---- | ----------- | --------------- | --------- |
| Beowulf    | Few to 100  | OK   | OK          | OK              | Yes       |
| NOW        | 100s        | None | OK          | No              | No        |
| Multi-core | Two to a few | Low | Low         | Yes             | No        |
Summation on Hypercube
Summation on Hypercube

for j ← (log p) − 1 down to 0 do
{
    for all p_i where 0 <= i < p − 1
    {
        if (i < 2^j)
        {
            // variable tmp on p_i receives the
            // value of sum from p_(i + 2^j)
            tmp ⇐ [i + 2^j] sum
            sum = sum + tmp
        }
    }
}
Summation on Hypercube (2): What could the code look like?
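As one sketch of an answer, here is a sequential Python simulation of the pseudocode above, with one array slot playing the role of each processor's sum variable (the function name is illustrative; a real machine would use message passing instead of array indexing):

```python
import math

def hypercube_sum(values):
    """Simulate hypercube summation: p values reduced in log p steps."""
    p = len(values)                      # p must be a power of two
    s = list(values)                     # s[i] is processor p_i's 'sum'
    for j in range(int(math.log2(p)) - 1, -1, -1):
        for i in range(2 ** j):          # only processors with i < 2^j are active
            tmp = s[i + 2 ** j]          # tmp receives sum from p_(i + 2^j)
            s[i] = s[i] + tmp
    return s[0]                          # the total ends up on p_0

print(hypercube_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

Each pass halves the number of active processors, and the pair (i, i + 2^j) always differs in exactly one bit, so every transfer crosses a single hypercube edge.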
Summation on 2-D Mesh – SIMD Code

The mesh has l × l processors, 1-based.

for i ← l − 1 down to 1 do    // push from right to left
{
    for all p_(j,i) where 1 <= j <= l do    // only l processors are working
    {
        // variable tmp on p_(j,i) receives the value of sum from p_(j,i+1)
        tmp ⇐ [j, i+1] sum
        sum = sum + tmp
    }
}

for i ← l − 1 down to 1 do    // push from bottom up
{
    for all p_(i,1) do    // really only two processors are working
    {
        // variable tmp on p_(i,1) receives the value of sum from p_(i+1,1)
        tmp ⇐ [i+1, 1] sum
        sum = sum + tmp
    }
}
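A sequential Python simulation of the two phases above (0-based indices here, unlike the 1-based pseudocode; the function name is illustrative):

```python
def mesh_sum(grid):
    """Simulate SIMD summation on an l x l mesh; grid[row][col], 0-based."""
    l = len(grid)
    g = [row[:] for row in grid]
    # Phase 1: push partial sums from right to left within every row.
    for i in range(l - 2, -1, -1):       # columns, rightmost pair first
        for j in range(l):               # all l row-processors work in parallel
            g[j][i] += g[j][i + 1]
    # Phase 2: push the row totals from bottom to top in the leftmost column.
    for i in range(l - 2, -1, -1):
        g[i][0] += g[i + 1][0]
    return g[0][0]                       # the total ends at the corner processor

print(mesh_sum([[1, 2], [3, 4]]))  # 10
```

The running time is Θ(l), i.e., Θ(√p) for p processors, which reflects the mesh's large diameter compared to the hypercube's log p steps.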
Summation on 2-D mesh
Summation on Shuffle-Exchange

Shuffle-exchange SIMD code:

for j ← 0 to (log p) − 1 do
{
    for all p_i where 0 <= i < p − 1
    {
        Shuffle(sum) ⇐ sum
        Exchange(tmp) ⇐ sum
        sum = sum + tmp
    }
}
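A sequential Python simulation of this loop, interpreting Shuffle(sum) ⇐ sum as every node sending its sum across its shuffle link, and Exchange(tmp) ⇐ sum as tmp arriving over the exchange link (this interpretation, like the function name, is an assumption for illustration):

```python
import math

def shuffle_exchange_sum(values):
    """Simulate SIMD summation on a shuffle-exchange network of p = 2^k nodes."""
    p = len(values)
    k = int(math.log2(p))
    s = list(values)
    for _ in range(k):
        # Shuffle(sum) <= sum: send sum across the shuffle (cyclic-left-shift) link.
        shuffled = [0] * p
        for i in range(p):
            dest = ((i << 1) | (i >> (k - 1))) & (p - 1)
            shuffled[dest] = s[i]
        # Exchange(tmp) <= sum: tmp arrives over the exchange (LSB-flip) link.
        tmp = [shuffled[i ^ 1] for i in range(p)]
        s = [shuffled[i] + tmp[i] for i in range(p)]
    return s[0]    # after log p steps every node holds the total

print(shuffle_exchange_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

Unlike the hypercube version, every processor stays active in every step, and after log p steps all p processors hold the total, not just p_0.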