Introduction to Parallel Processing with Multi-core
Part III – Architecture
Jie Liu, Ph.D., Professor, Department of Computer Science, Western Oregon University, USA
[email protected]

TRANSCRIPT

Page 1:

Introduction to Parallel Processing with Multi-core

Part III – Architecture

Jie Liu, Ph.D.
Professor, Department of Computer Science
Western Oregon University, USA

[email protected]

1

Page 2:

Part III outline
1. Three models of parallel computation
2. Processor organizations
3. Processor arrays
4. Multiprocessors
5. Multi-computers
6. Flynn's taxonomy
7. Affordable parallel computers
8. Algorithms with processor organizations

2

Page 3:

Processor Organization

In a parallel computer, processors need to “cooperate.” To do so, a processor must be able to “reach” other processors. The method of connecting processors in a parallel computer is called the processor organization.

In a processor organization chart, vertices represent processors and edges represent communication paths.

3

Page 4:

Processor Organization Criteria

• Diameter: the largest distance between two nodes. The lower, the better, because it affects communication costs.
• Bisection width: the minimum number of edges that must be removed to divide the network into two halves (within one). The higher, the better, because it affects the number of concurrent communication channels.
• Number of edges per node: we consider this to be best when it is constant, i.e., independent of the number of processors, because it affects scalability.
• Maximum edge length: again, we consider this to be best when it is constant, i.e., independent of the number of processors, because it affects scalability.

4

Page 5:

Mesh Networks

1. A mesh always has a dimension q, which could be 1, 2, 3, or even higher.

2. Each interior node can communicate with 2q other processors.

5

Page 6:

Mesh Networks (2)

For a mesh with k^q nodes (as shown)
• Diameter: q(k-1) (too large to have NC algorithms)
• Bisection width: k^(q-1) (reasonable)
• Maximum number of edges per node: 2q (constant -- good)
• Maximum edge length: constant (good)

Many parallel computers used this architecture because it is simple and scalable.

Intel Paragon XP/S used this architecture.

6
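As a quick sanity check of these measures, here is a small sketch (illustrative only; the helper name and the sample sizes are mine, not from the slides) that evaluates them for a given q and k:

# Illustrative helper: network measures of a q-dimensional mesh
# with k nodes per side, using the formulas listed above.
def mesh_measures(q, k):
    return {
        "nodes": k ** q,                   # k^q processors
        "diameter": q * (k - 1),           # k-1 hops along each of the q dimensions
        "bisection_width": k ** (q - 1),   # cut one "slice" of the mesh
        "max_edges_per_node": 2 * q,       # one neighbor per direction per dimension
    }

print(mesh_measures(2, 4))   # a 4 x 4 two-dimensional mesh
print(mesh_measures(3, 8))   # an 8 x 8 x 8 three-dimensional mesh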

Page 7:

Hypertree Networks

7

Degree k = 4 and depth d = 2

Page 8:

Hypertree Networks

For a hypertree of degree k and depth d, generally we only consider the cases where k = 4
• Number of nodes: 2^d (2^(d+1) - 1)
• Diameter: 2d (good for designing NC-class algorithms)
• Bisection width: 2^(d+1) (reasonable)
• Maximum number of edges per node: 6 (kind of constant)
• Maximum edge length: changes depending on d

Only one parallel computer, Thinking Machines’ CM-5 (Connection Machine), used this architecture.

The designed maximum number of processors was 64K. The processors were vector processors that were capable of performing 32 pairs of arithmetic operations per clock cycle.

8
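For concreteness, a small sketch (illustrative only; the helper name is mine) that evaluates the formulas above for a degree-4 hypertree of depth d:

# Illustrative helper: measures of a degree-4 hypertree of depth d,
# using the formulas listed above.
def hypertree_measures(d):
    return {
        "nodes": 2**d * (2**(d + 1) - 1),
        "diameter": 2 * d,
        "bisection_width": 2**(d + 1),
        "max_edges_per_node": 6,
    }

print(hypertree_measures(2))   # the depth-2 example shown earlier
print(hypertree_measures(4))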

Page 9:

Butterfly Network

A butterfly network has (k+1) * 2^k nodes. The one on the right has k = 3.

In practice, rank 0 and rank k are combined, so each node has four connections.

Each rank contains n = 2^k nodes. If n(i, j) is the jth node on the ith rank, then it connects to two nodes on rank i-1: n(i-1, j) and n(i-1, m), where m is the integer formed by inverting the ith most significant bit in the k-bit binary number of j.

For example, n(2,3) is connected to n(1,3) and n(1,1) because 3 is 011, and inverting the second most significant bit makes it 001, which is 1.

9
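The connection rule can be made concrete with a short sketch (illustrative; the function name is mine, and node naming follows the n(i, j) convention above):

# Illustrative sketch: the two rank-(i-1) neighbors of butterfly node n(i, j),
# where columns carry k-bit labels.
def butterfly_up_neighbors(i, j, k):
    straight = (i - 1, j)            # same column, previous rank
    m = j ^ (1 << (k - i))           # invert the i-th most significant bit of j
    return straight, (i - 1, m)

# The slide's example: n(2, 3) with k = 3 connects to n(1, 3) and n(1, 1).
print(butterfly_up_neighbors(2, 3, 3))   # ((1, 3), (1, 1))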

Page 10:

Butterfly Network (2)

For a butterfly network with (k+1) * 2^k nodes
• Diameter: 2k - 1 (good for designing NC-class algorithms)
• Bisection width: 2^(k-1) (very good)
• Maximum number of edges per node: 4 (constant)
• Maximum edge length: changes depending on k

The network is also called an ____ network. A few computers used this connection network, including BBN’s TC2000.

10

Page 11:

Routing on Butterfly Network

To route a message from rank 0 to a node on rank k, each switch node picks off the lead bit of the message. If the bit is zero, it goes to the left; otherwise, it keeps to the right. The chart shows routing from n(0, 2) to n(3, 5).

To route a message from rank k to a node on rank 0, each switch node likewise picks off the lead bit of the message. If the bit is zero, it goes to the left; otherwise, it keeps to the right.

11
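A sketch of the top-down routing rule (illustrative; it interprets “left” and “right” as taking the outgoing edge whose column label has a 0 or a 1 in the bit being consumed, and the function name is mine):

# Illustrative sketch: destination-tag routing from rank 0 to rank k on a
# butterfly. At rank i the switch consumes the i-th most significant bit of
# the destination column and sets that bit of the current column accordingly.
def butterfly_route(src_col, dst_col, k):
    path = [(0, src_col)]
    col = src_col
    for i in range(1, k + 1):
        bit = (dst_col >> (k - i)) & 1                      # bit picked off the message
        col = (col & ~(1 << (k - i))) | (bit << (k - i))    # go left (0) or right (1)
        path.append((i, col))
    return path

# The slide's example: routing from n(0, 2) to n(3, 5) with k = 3.
print(butterfly_route(2, 5, 3))   # [(0, 2), (1, 6), (2, 4), (3, 5)]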

Page 12:

Hypercube

A hypercube is a butterfly in which each column of switch nodes is collapsed into a single node. A binary d-cube has n = 2^d processors and an equal number of switch nodes.

The chart on the right shows a hypercube of degree 4.

Two switch nodes are connected if their binary labels differ in exactly one bit position.

12
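A one-line check of that connection rule (illustrative sketch; the function name is mine):

# Illustrative sketch: the neighbors of node i in a d-dimensional hypercube
# are exactly the labels that differ from i in one bit position.
def hypercube_neighbors(i, d):
    return [i ^ (1 << b) for b in range(d)]

print([format(x, "04b") for x in hypercube_neighbors(0b0101, 4)])
# ['0100', '0111', '0001', '1101']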

Page 13:

Hypercube Measures

For a hypercube with n = 2^k nodes
• Diameter: k (good for designing NC-class algorithms)
• Bisection width: n/2 (very good)
• Maximum number of edges per node: k
• Maximum edge length: depends on the number of nodes

Routing: just find the difference, one bit at a time, either from left to right or from right to left. For example, to go from 0100 to 1011 we can go
0100 -> 1100 -> 1000 -> 1010 -> 1011, or 0100 -> 0101 -> 0111 -> 0011 -> 1011.

A company named nCUBE Corporation made machines of this structure with up to k = 13 (theoretically). The company was later bought by Oracle.

13
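The bit-fixing routing just described, as a small sketch (illustrative; the function name is mine):

# Illustrative sketch: route between hypercube nodes by fixing the differing
# bits one at a time, here from the most significant bit to the least.
def hypercube_route(src, dst, k):
    path = [src]
    cur = src
    for b in range(k - 1, -1, -1):       # scan the k bits left to right
        if (cur ^ dst) & (1 << b):       # this bit still differs
            cur ^= 1 << b                # flip it, moving to a neighbor
            path.append(cur)
    return path

# The slide's example: 0100 -> 1100 -> 1000 -> 1010 -> 1011.
print([format(x, "04b") for x in hypercube_route(0b0100, 0b1011, 4)])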

Page 14:

Shuffle-Exchange Network

It has n = 2^k nodes numbered 0, 1, ..., n-1. It has two kinds of connections: shuffle and exchange.

Exchange connections link two nodes whose numbers differ in their least significant bit.

Shuffle connections link node i with node 2i mod (n-1).

Below is a shuffle-exchange network with n = 2^3 = 8.

14
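A sketch of the two connection types (illustrative; the function names are mine, and the shuffle is implemented as a cyclic left shift of the k-bit label, which equals 2i mod (n-1) for every node except n-1, which maps to itself):

# Illustrative sketch: shuffle and exchange neighbors in a shuffle-exchange
# network with n = 2**k nodes.
def exchange(i):
    return i ^ 1                                    # flip the least significant bit

def shuffle(i, k):
    n = 1 << k
    return ((i << 1) | (i >> (k - 1))) & (n - 1)    # rotate the k-bit label left

k = 3
for i in range(1 << k):
    print(i, "exchange ->", exchange(i), "shuffle ->", shuffle(i, k))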

Page 15:

Shuffle-Exchange Network

• For a Shuffle-Exchange Network with n = 2^k nodes
  – Diameter: 2k - 1 (good for designing NC-class algorithms)
  – Bisection width: n/k (very good)
  – Maximum number of edges per node: 2
  – Maximum edge length: depends on the number of nodes

• Routing is not easy. It is hard to build a real Shuffle-Exchange Network because there are lines crossing each other.

• This architecture is studied for its theoretical significance.

15

Page 16:

Summary

16

Page 17:

Processor Array

17

Page 18:

Processor Array (2)

Parallel computers employing processor array technology can perform many arithmetic operations per clock cycle, achieved either with pipelined vector processors, such as the Cray-1, or with processor arrays, such as Thinking Machines’ CM-200.

This type of parallel computer did not really survive because
• $$$$$ because of the special CPUs
• Hard to utilize all processors
• Cannot handle if-then-else types of statements well because all the processors must carry out the same instruction
• Partitioning is very difficult
• It really needs to deal with very large amounts of data, which makes I/O impossible

18

Page 19:

Multiprocessors

Parallel computers with multiple CPUs and a shared memory space.
• + can use commodity CPUs at reasonable cost
• + support multiple users
• + different CPUs can execute different instructions

UMA – uniform memory access, also called symmetric multiprocessor (SMP) – all the processors access any memory address with the same cost
• Sequent Symmetry can have up to 32 Intel 80386 processors
• All the CPUs share the same bus
• The problem with SMP/UMA is that the number of processors is limited

NUMA – nonuniform memory access – a processor can access its own memory, though accessible by others, much more cheaply
• Processors are connected through some connection network, such as a butterfly
• Kendall Square Research supported over 1000 processors
• The connection network costs too much, around 40% of the overall cost

19

Page 20:

UMA VS. NUMA

20

Page 21:

Cache Coherence Problem

21

Page 22:

Multicomputers

Parallel computers with multiple CPUs and NO shared memory. Processors interact through message passing.
• + all the advantages of multiprocessors, and it is possible to have a large number of CPUs
• - message passing is hard to implement and takes a lot of time to carry out

The first generation of message passing was store-and-forward, where a processor receives the complete message and then forwards it to the next processor
• iPSC and nCUBE/10

The second generation of message passing was circuit-switched, where a path is first established at a cost, then subsequent messages use the path without the startup cost
• iPSC/2 and nCUBE 2

The cost of message passing (www.cs.bgu.ac.il/~discsp/papers/ConstrDisAC.pdf)
• startup time T_s – must occur even if you send an empty message
• per-byte cost T_b
• cost of one floating-point operation T_fp, for comparison reasons

22
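Putting those three parameters together, a minimal sketch of the usual latency/bandwidth cost model (illustrative; the numeric values are made-up placeholders, not measurements, and the variable names simply follow the slide's T_s, T_b, and T_fp):

# Illustrative cost model for message passing.
T_s = 100e-6    # startup time per message, in seconds (placeholder value)
T_b = 10e-9     # cost per byte, in seconds (placeholder value)
T_fp = 1e-9     # cost of one floating-point operation, in seconds (placeholder value)

def message_time(num_bytes):
    # Time to send one message: pay the startup cost even for an empty message,
    # then a per-byte cost for the payload.
    return T_s + num_bytes * T_b

# For comparison, express the cost of sending 1000 doubles in "equivalent flops".
flops_equivalent = message_time(8 * 1000) / T_fp
print(f"Sending 1000 doubles costs roughly {flops_equivalent:,.0f} flops of time")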

Page 23:

Multicomputers--nCUBE

An nCUBE parallel computer has three parts: the frontend, the backend, and the I/O system. The frontend is a fully functioning computer. The backend nodes, each a computer of its own, run a simple OS that supports message passing. Note that the capability of the frontend stays the same regardless of the number of processors at the backend. The largest nCUBE can have 8K processors.

23

Page 24:

Multicomputers—CM-5

Each node consists of a SPARC CPU, up to 32 MB of RAM, and four pipelined vector processing units, each with 32 MFlops.

It can have up to 16K nodes, with a theoretical peak speed of 2 teraflops (in 1991).

24

Page 25:

Flynn’s Taxonomy
• SISD – single-core PC
• SIMD – processor array or CM-200
• MISD – systolic array
• MIMD – multi-core PC, nCUBE, Symmetry, CM-5, Paragon XP/S

                        Single Data   Multiple Data
Single-Instruction      SISD          SIMD
Multiple-Instruction    MISD          MIMD

25

Page 26:

Inexpensive “Parallel Computers”

• Beowulf: PCs connected by a switch
• NOW: workstations on an intranet
• Multi-core: PCs with a few multicore CPUs

             # of nodes    Cost   Performance   Easy to program   Dedicated
Beowulf      Few to 100    OK     OK            OK                Yes
NOW          100s          None   OK            No                No
Multi-core   Two to few    Low    Low           Yes               No

26

Page 27:

Summation on Hypercube

27

Page 28:

Summation on Hypercube

for j := (log p) - 1 down to 0 do
{
  for all P_i where 0 <= i <= p - 1
  {
    if (i < 2^j)
      // variable tmp on P_i receives
      // the value of sum from P_(i + 2^j)
      tmp <= [i + 2^j] sum
      sum = sum + tmp
  }
}

28
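What this loop computes can be checked with a minimal sequential simulation (illustrative; the Python list s stands in for the per-processor variable sum, and the function name is mine):

# Illustrative simulation of the hypercube summation above: s[i] plays the
# role of the variable "sum" on processor P_i.
import math

def hypercube_sum(values):
    p = len(values)                      # assume p is a power of two
    s = list(values)
    for j in range(int(math.log2(p)) - 1, -1, -1):
        for i in range(p):
            if i < 2**j:
                tmp = s[i + 2**j]        # P_i receives sum from P_(i + 2^j)
                s[i] = s[i] + tmp
    return s[0]                          # the total ends up on P_0

print(hypercube_sum(list(range(8))))     # 0 + 1 + ... + 7 = 28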

Page 29:

Summation on Hypercube (2)

What could the code look like?

29

Page 30:

Summation on 2-D Mesh – SIMD Code

The mesh has l x l processors, 1-based.

for i := l - 1 down to 1 do                    // push from right to left
{
  for all P_(j,i) where 1 <= j <= l do         // only l processors are working
  {
    // variable tmp on P_(j,i) receives the value of sum from P_(j,i+1)
    tmp <= [j, i+1] sum
    sum = sum + tmp
  }
}

for i := l - 1 down to 1 do                    // push from bottom up
{
  for all P_(i,1)                              // really only two processors are working
  {
    // variable tmp on P_(i,1) receives the value of sum from P_(i+1,1)
    tmp <= [i+1, 1] sum
    sum = sum + tmp
  }
}

30
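A minimal sequential simulation of the two phases (illustrative; a Python list of lists stands in for the l x l mesh, converted to 0-based indexing, and the function name is mine):

# Illustrative simulation of the 2-D mesh summation above, using 0-based
# indexing: grid[j][i] is the "sum" variable on processor P_(j+1, i+1).
def mesh_sum(grid):
    l = len(grid)
    # Phase 1: push from right to left; column 0 accumulates each row's sum.
    for i in range(l - 2, -1, -1):
        for j in range(l):
            grid[j][i] += grid[j][i + 1]     # receive from the right neighbor
    # Phase 2: push from bottom up along the leftmost column.
    for i in range(l - 2, -1, -1):
        grid[i][0] += grid[i + 1][0]         # receive from the neighbor below
    return grid[0][0]                        # total at the top-left processor

print(mesh_sum([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))   # 45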

Page 31:

Summation on 2-D mesh

31

Page 32:

Summation on Shuffle-exchange

Shuffle-exchange SIMD code

for j := 0 to (log p) - 1 do
{
  for all P_i where 0 <= i <= p - 1
  {
    Shuffle(sum) <= sum
    Exchange(tmp) <= sum
    sum = sum + tmp
  }
}

32
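A minimal sequential simulation of this loop (illustrative; each SIMD statement is modeled by first computing where every value goes and then updating all processors at once, and the function name is mine):

# Illustrative simulation of the shuffle-exchange summation above.
def shuffle_exchange_sum(values):
    p = len(values)                       # assume p = 2**k
    k = p.bit_length() - 1
    s = list(values)
    for _ in range(k):
        # Shuffle(sum) <= sum : each shuffle neighbor's sum receives my sum.
        shuffled = [0] * p
        for i in range(p):
            dest = ((i << 1) | (i >> (k - 1))) & (p - 1)   # shuffle link of node i
            shuffled[dest] = s[i]
        s = shuffled
        # Exchange(tmp) <= sum : each exchange neighbor's tmp receives my sum.
        tmp = [0] * p
        for i in range(p):
            tmp[i ^ 1] = s[i]
        # sum = sum + tmp on every processor.
        s = [s[i] + tmp[i] for i in range(p)]
    return s[0]                           # every processor ends with the total

print(shuffle_exchange_sum(list(range(8))))   # 0 + 1 + ... + 7 = 28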