Introduction to Parallel Processing with Multi-core
Part III – Architecture
Jie Liu, Ph.D.
Professor, Department of Computer Science
Western Oregon University, USA
Part I outline
1. Three models of parallel computation
2. Processor organizations
3. Processor arrays
4. Multiprocessors
5. Multi-computers
6. Flynn’s taxonomy
7. Affordable parallel computers
8. Algorithms with processor organizations
Processor Organization
In a parallel computer, processors need to “cooperate.” To do so, a processor must be able to “reach” other processors. The method of connecting processors in a parallel computer is called processor organization.
In a processor organization chart, vertices represent processors and edges represent communication paths.
Processor Organization Criteria
Diameter: the largest distance between two nodes. The lower, the better, because it affects communication costs.
Bisection width: the minimum number of edges that must be removed to divide the network into two halves (within one). The higher, the better, because it affects the number of concurrent communication channels.
Number of edges per node: we consider this best if it is constant, i.e., independent of the number of processors, because it affects scalability.
Maximum edge length: again, we consider this best if it is constant, i.e., independent of the number of processors, because it affects scalability.
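As an illustration (not from the slides), a short Python sketch that measures the diameter criterion with breadth-first search; the 8-node ring network is a made-up example:

```python
from collections import deque

def diameter(adj):
    """Largest shortest-path distance between any two nodes (BFS from each)."""
    def bfs(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(bfs(v) for v in adj)

# A ring of 8 processors: node i connects to its two neighbors.
n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(diameter(ring))  # 4
```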
Mesh Networks
1. A mesh always has a dimension q, which could be 1, 2, 3, or even higher.
2. Each interior node can communicate with 2q other processors.
Mesh Networks (2)
For a mesh with k^q nodes (as shown)
• Diameter: q(k − 1) (too large to have NC algorithms)
• Bisection width: k^(q−1) (reasonable)
• Maximum number of edges per node: 2q (constant -- good)
• Maximum edge length: constant (good)
Many parallel computers used this architecture because it is simple and scalable
Intel Paragon XP/S used this architecture
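A sketch of the mesh figures above (helper names are mine): build a q-dimensional mesh in Python and check that the corner-to-corner distance equals q(k − 1):

```python
from collections import deque
import itertools

def mesh_adj(k, q):
    """q-dimensional mesh with k nodes per side (no wraparound)."""
    adj = {}
    for node in itertools.product(range(k), repeat=q):
        nbrs = []
        for d in range(q):
            for step in (-1, 1):
                c = node[d] + step
                if 0 <= c < k:
                    nbrs.append(node[:d] + (c,) + node[d + 1:])
        adj[node] = nbrs
    return adj

def dist(adj, src, dst):
    """Shortest-path distance between two nodes via BFS."""
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        u, d = queue.popleft()
        if u == dst:
            return d
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append((v, d + 1))

k, q = 4, 2
adj = mesh_adj(k, q)
print(dist(adj, (0, 0), (k - 1, k - 1)))  # 6 == q*(k-1)
```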
Hypertree Networks
Degree k = 4 and depth d = 2
Hypertree Networks
For a hypertree of degree k and depth d; generally, we only consider the cases where k = 4
• Number of nodes: 2^d (2^(d+1) − 1)
• Diameter: 2d (good for designing NC-class algorithms)
• Bisection width: 2^(d+1) (reasonable)
• Maximum number of edges per node: 6 (kind of constant)
• Maximum edge length: changes depending on d
Only one parallel computer, Thinking Machines’ CM-5 (Connection Machine), used this architecture
The designed maximum number of processors was 64K
The processors were vector processors capable of performing 32 pairs of arithmetic operations per clock cycle.
Butterfly Network
A butterfly network has (k + 1)2^k nodes. The one on the right has k = 3.
In practice, rank 0 and rank k are combined, so each node has four connections.
Each rank contains n = 2^k nodes. If n(i, j) is the jth node on the ith rank, then it connects to two nodes on rank i − 1: n(i−1, j) and n(i−1, m), where m is the integer formed by inverting the ith most significant bit in the k-bit binary representation of j.
For example, n(2, 3) is connected to n(1, 3) and n(1, 1) because 3 is 011, and inverting the second most significant bit makes it 001, which is 1.
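The connection rule can be checked mechanically. A small Python sketch (the function name is mine) that reproduces the n(2, 3) example:

```python
def up_neighbors(i, j, k):
    """Nodes on rank i-1 that node (i, j) of a rank-k butterfly connects to.
    One keeps column j; the other flips the i-th most significant bit of
    j's k-bit label (bit positions counted from 1 at the MSB)."""
    m = j ^ (1 << (k - i))  # flip the i-th most significant bit
    return (i - 1, j), (i - 1, m)

print(up_neighbors(2, 3, 3))  # ((1, 3), (1, 1))
```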
Butterfly Network (2)
For a butterfly network with (k + 1)2^k nodes
• Diameter: 2k − 1 (good for designing NC-class algorithms)
• Bisection width: 2^(k−1) (very good)
• Maximum number of edges per node: 4 (constant)
• Maximum edge length: changes depending on k
The network is also called an omega network. A few computers used this connection network, including BBN’s TC2000
Routing on Butterfly Network
To route a message from rank 0 to a node on rank k, each switch node picks off the lead bit of the message. If the bit is zero, the message goes to the left; otherwise, it goes to the right. The chart shows routing from n(0, 2) to n(3, 5).
Routing a message from rank k to a node on rank 0 works the same way: each switch node picks off the lead bit, going left on zero and right on one.
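A Python sketch of this routing scheme (names are mine); each hop sets one destination bit, which reproduces the n(0, 2) to n(3, 5) path:

```python
def route(src_col, dst_col, k):
    """Path of (rank, column) switches from rank 0 to rank k.
    Between ranks i-1 and i a message may keep its column or flip the
    i-th most significant bit, so the switch sets that bit to the
    corresponding bit of the destination label."""
    col = src_col
    path = [(0, col)]
    for i in range(1, k + 1):
        bit = (dst_col >> (k - i)) & 1
        col = (col & ~(1 << (k - i))) | (bit << (k - i))
        path.append((i, col))
    return path

print(route(2, 5, 3))  # [(0, 2), (1, 6), (2, 4), (3, 5)]
```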
Hypercube
A hypercube is a butterfly in which each column of switch nodes is collapsed into a single node. A binary d-cube has n = 2^d processors and an equal number of switch nodes.
The chart on the right shows a hypercube of degree 4.
Two switch nodes are connected if their binary labels differ in exactly one bit position.
Hypercube Measures
For a hypercube with n = 2^k nodes
• Diameter: k (good for designing NC-class algorithms)
• Bisection width: n/2 (very good)
• Maximum number of edges per node: k
• Maximum edge length: depends on the number of nodes
Routing: just find the difference, one bit at a time, either from left to right or from right to left. For example, for 0100 to 1011 we can go
0100, 1100, 1000, 1010, 1011, or 0100, 0101, 0111, 0011, 1011
A company named nCUBE Corporation made machines of this structure up to k = 13 (theoretically). The company was later bought by Oracle.
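The bit-fixing routing can be sketched in Python (the function name is mine); fixing bits from left to right reproduces the first 0100-to-1011 path above:

```python
def hypercube_route(src, dst, k):
    """Greedy routing on a binary k-cube: flip each differing bit,
    from most significant to least significant."""
    path, cur = [src], src
    for b in range(k - 1, -1, -1):
        if (cur ^ dst) >> b & 1:
            cur ^= 1 << b
            path.append(cur)
    return path

print([format(x, '04b') for x in hypercube_route(0b0100, 0b1011, 4)])
# ['0100', '1100', '1000', '1010', '1011']
```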
Shuffle-Exchange Network
It has n = 2^k nodes numbered 0, 1, …, n − 1. It has two kinds of connections: shuffle and exchange.
Exchange connections link two nodes whose numbers differ in their least significant bit.
Shuffle connections link node i with node 2i mod (n − 1).
Below is a Shuffle-Exchange network with n = 32.
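A Python sketch of the two connection types (function names are mine), using n = 8 for brevity:

```python
def shuffle(i, n):
    """Perfect-shuffle neighbor: left-rotate the log2(n)-bit label,
    i.e. i -> 2i mod (n - 1); node n-1 (all ones) maps to itself."""
    return i * 2 % (n - 1) if i < n - 1 else i

def exchange(i):
    """Exchange neighbor: flip the least significant bit."""
    return i ^ 1

n = 8
print([shuffle(i, n) for i in range(n)])  # [0, 2, 4, 6, 1, 3, 5, 7]
print(exchange(5))                        # 4
```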
Shuffle-Exchange Network (2)
• For a Shuffle-Exchange Network with n = 2^k nodes
– Diameter: 2k − 1 (good for designing NC-class algorithms)
– Bisection width: n/k (very good)
– Maximum number of edges per node: 2
– Maximum edge length: depends on the number of nodes
• Routing is not easy. It is hard to build a real Shuffle-Exchange Network because the links cross each other.
• This architecture is studied for its theoretical significance
Summary
Processor Array
Processor Array (2)
Parallel computers employing processor array technology can perform many arithmetic operations per clock cycle, achieved either with pipelined vector processors, such as the Cray-1, or with processor arrays, such as Thinking Machines’ CM-200.
This type of parallel computer did not really survive because
• $$$$$ because of the special CPUs
• Hard to utilize all processors
• Cannot handle if-then-else types of statements well because all the processors must carry out the same instruction
• Partitioning is very difficult
• It really needs to deal with very large amounts of data, which makes I/O impossible
Multiprocessors
Parallel computers with multiple CPUs and a shared memory space.
• + can use commodity CPUs at reasonable cost
• + support multiple users
• + different CPUs can execute different instructions
UMA – uniform memory access, also called symmetric multiprocessor (SMP) – all the processors access any memory address with the same cost
• The Sequent Symmetry could have up to 32 Intel 80386 processors
• All the CPUs share the same bus
• The problem with SMP/UMA is that the number of processors is limited
NUMA – nonuniform memory access – a processor can access its own memory, though accessible by others, much more cheaply.
• Processors are connected through some connection network, such as a butterfly
• Kendall Square Research supported over 1000 processors
• The connection network costs too much, around 40% of the overall cost
UMA vs. NUMA
Cache Coherence Problem
Multicomputers
Parallel computers with multiple CPUs and NO shared memory. Processors interact through message passing.
• + all the advantages of multiprocessors, and it is possible to have a large number of CPUs
• - message passing is hard to implement and takes a lot of time to carry out
The first generation of message passing was store-and-forward, where a processor receives the complete message and then forwards it to the next processor
• iPSC and nCUBE/10
The second generation of message passing was circuit-switched, where a path is first established with a cost, then subsequent messages use the path without the startup cost
• iPSC/2 and nCUBE 2
The cost of message passing (www.cs.bgu.ac.il/~discsp/papers/ConstrDisAC.pdf)
• T_s: startup time – must be paid even to send an empty message
• T_b: per-byte cost
• T_fp: cost of one floating-point operation, for comparison reasons
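With these parameters, the time to send one n-byte message under the usual linear cost model is T = T_s + n·T_b. A Python sketch with made-up timing figures:

```python
def message_cost(n_bytes, t_s, t_b):
    """Time to send one n-byte message: a fixed startup latency (t_s)
    plus a per-byte transfer cost (t_b)."""
    return t_s + n_bytes * t_b

# Hypothetical figures: 100 microsecond startup, 10 ns per byte.
print(message_cost(0, 100e-6, 10e-9))          # an empty message still pays t_s
print(message_cost(1_000_000, 100e-6, 10e-9))  # 1 MB: transfer cost dominates
```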
Multicomputers--nCUBE
An nCUBE parallel computer has three parts: the frontend, the backend, and the I/O subsystem. The frontend is a fully functioning computer. The backend nodes, each a computer of its own, run a simple OS that supports message passing. Note that the capability of the frontend stays the same regardless of the number of processors at the backend.
The largest nCUBE could have 8K processors.
Multicomputers—CM-5
Each node consists of a SPARC CPU, up to 32 MB of RAM, and four pipelined vector units, each delivering 32 MFLOPS
It can have up to 16K nodes, with a theoretical peak speed of 2 teraflops (in 1991)
Flynn’s Taxonomy
SISD – single-core PC
SIMD – processor array or CM-200
MISD – systolic array
MIMD – multiple-core PC, nCUBE, Symmetry, CM-5, Paragon XP/S

                        Single Data    Multiple Data
Single-Instruction      SISD           SIMD
Multiple-Instruction    MISD           MIMD
Inexpensive “Parallel Computers”
Beowulf
• PCs connected by a switch
NOW
• Workstations on an intranet
Multi-core
• PCs with a few multicore CPUs

             # of nodes   Cost   Performance   Easy to program   Dedicated
Beowulf      Few to 100   OK     OK            OK                Yes
NOW          100s         None   OK            No                No
Multi-core   Two to few   Low    Low           Yes               No
Summation on Hypercube
for j ← (log p) − 1 down to 0 do
{
    for all p_i where 0 <= i < p − 1
    {
        if (i < 2^j)
            // variable tmp on p_i receives
            // value of sum from p_(i + 2^j)
            tmp <= [i + 2^j] sum
            sum = sum + tmp
    }
}
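A sequential Python simulation of this reduction (names are mine); after the last round, node 0 holds the total:

```python
def hypercube_sum(values):
    """Simulate the hypercube reduction: in the round for step size j,
    each node i < j adds in the partial sum of node i + j, halving the
    active half each round. len(values) must be a power of 2."""
    sums = list(values)
    j = len(sums) // 2
    while j >= 1:
        for i in range(j):          # done in parallel on a real machine
            sums[i] += sums[i + j]
        j //= 2
    return sums[0]

print(hypercube_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```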
Summation on Hypercube (2)
What could the code look like?
Summation on 2-D mesh –SIMD Code
The mesh has l x l processors, 1-based
for i ← l − 1 down to 1 do   // push from right to left
{
    for all p_(j,i) where 1 <= j <= l do   // only l processors are working
    {
        // variable tmp on p_(j,i) receives value of sum from p_(j,i+1)
        tmp <= [j, i+1] sum
        sum = sum + tmp
    }
}
for i ← l − 1 down to 1 do   // push from bottom up
{
    for all p_(i,1)   // really only two processors are working
    {
        // variable tmp on p_(i,1) receives value of sum from p_(i+1,1)
        tmp <= [i+1, 1] sum
        sum = sum + tmp
    }
}
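A sequential Python simulation of the two sweeps (names are mine):

```python
def mesh_sum(grid):
    """Simulate summation on an l x l mesh: first fold columns
    right-to-left so column 0 holds the row sums, then fold rows
    bottom-to-top so grid[0][0] holds the total.
    (0-based here; the slide's pseudocode is 1-based.)"""
    l = len(grid)
    sums = [row[:] for row in grid]
    for i in range(l - 2, -1, -1):       # push from right to left
        for j in range(l):
            sums[j][i] += sums[j][i + 1]
    for i in range(l - 2, -1, -1):       # push from bottom up
        sums[i][0] += sums[i + 1][0]
    return sums[0][0]

print(mesh_sum([[1, 2], [3, 4]]))  # 10
```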
Summation on 2-D mesh
Summation on Shuffle-exchange
Shuffle-exchange SIMD code
for j ← 0 to (log p) − 1
{
    for all p_i where 0 <= i < p − 1
    {
        Shuffle(sum) <= sum
        Exchange(tmp) <= sum
        sum = sum + tmp
    }
}
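A sequential Python simulation (names are mine); note that every node, not just node 0, ends up holding the total:

```python
def shuffle_exchange_sum(values):
    """Simulate the SIMD sum: each round, every node's value first
    travels over its shuffle link (i -> 2i mod (p-1), with p-1 fixed),
    then each node adds the value arriving on its exchange link
    (flip the low bit). len(values) must be a power of 2."""
    p = len(values)
    sums = list(values)
    for _ in range(p.bit_length() - 1):          # log2(p) rounds
        shuffled = [0] * p
        for i in range(p):
            dest = i * 2 % (p - 1) if i < p - 1 else i
            shuffled[dest] = sums[i]
        sums = [shuffled[i] + shuffled[i ^ 1] for i in range(p)]
    return sums

print(shuffle_exchange_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # [36, 36, ..., 36]
```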