Parallel Analysis of Algorithms: PRAM + CGM


Outline
- Parallel Performance
- Parallel Models
  - Shared Memory (PRAM, SMP)
  - Distributed Memory (BSP, CGM)

Question: Professor Speedy says he has a parallel algorithm for sorting n arbitrary items in n time using p > 1 processors. Do you believe him?

Performance of a Parallel Algorithm

n : problem size (e.g., sort n numbers)
p : number of processors
T(p) : parallel time
Ts : sequential time (of the optimal sequential algorithm)
s(p) = Ts / T(p) : speedup (1 ≤ s(p) ≤ p)

[Figure: speedup s(p) versus p; the diagonal s(p) = p separates super-linear (above), linear (on), and sub-linear (below) speedup curves]


Speedup
- Linear speedup s(p) = p : optimal
- Super-linear speedup s(p) > p : impossible

Proof: Assume parallel algorithm A has a speedup s > p on p processors, i.e., s = Ts / T(p) > p. Hence Ts > T(p) · p. Simulate A on a sequential, single-processor machine. Then T(1) ≤ T(p) · p < Ts. Hence Ts was not optimal. Contradiction.


Amdahl’s Law
Let f, 0 < f < 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup is s ≤ 1 / [f + (1-f)/p].

Proof: Let Ts be the sequential time. Then T(p) ≥ f·Ts + (1-f)·Ts/p. Hence
s ≤ Ts / [f·Ts + (1-f)·Ts/p] = 1 / [f + (1-f)/p].
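To make the bound concrete, here is a small C++ sketch (mine, not from the slides) that evaluates Amdahl's bound for a hypothetical sequential fraction f = 0.05; the bound approaches 1/f = 20 no matter how many processors are added:

```cpp
#include <cstdio>

// Amdahl bound: maximum speedup for sequential fraction f on p processors.
double amdahl(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main() {
    // e.g., f = 0.05: the bound approaches 1/f = 20 as p grows.
    for (int p : {1, 10, 100, 1000})
        std::printf("p = %4d  ->  s <= %.2f\n", p, amdahl(0.05, p));
}
```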

Amdahl’s Law

[Figure: (a) one processor: serial section f·ts followed by parallelizable sections (1-f)·ts, total ts; (b) p processors: serial section f·ts plus parallel sections of length (1-f)·ts/p, total tp]

Amdahl’s Law
[Figure: execution time for p = 1, 5, 10, 1000; the parallelizable part shrinks with p while the serial part stays constant]

Amdahl’s Law

s(p) ≤ 1 / [f + (1-f)/p]
- f → 0 : s(p) → p
- f → 1 : s(p) → 1
- f = 0.5 : s(p) = 2p/(p+1) ≤ 2
- f = 1/k : s(p) = k / [1 + (k-1)/p] ≤ k

[Figure: maximum speedup s as a function of k, for f = 1/k; the bound is s ≤ k]


Scaled or Relative Speedup

Ts may be unknown (in fact, for most real experiments this is the case)

Relative speedup: s'(p) = T(1) / T(p)

s'(p) ≥ s(p)


Efficiency
e(p) = s(p) / p : efficiency (0 ≤ e ≤ 1)
Optimal linear speedup s(p) = p ⇔ e(p) = 1
e'(p) = s'(p) / p : relative efficiency
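As a small illustration (a sketch of mine, with made-up timings), computing relative speedup and efficiency from measured wall-clock times:

```cpp
#include <cstdio>

// Hypothetical measured wall-clock times (seconds); T(1) stands in for Ts,
// so these are the *relative* quantities s'(p) and e'(p).
double speedup(double t1, double tp)           { return t1 / tp; }
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

int main() {
    const double t1   = 120.0;                       // assumed T(1)
    const double tp[] = {120.0, 65.0, 38.0, 26.0};   // assumed T(p), p = 1,2,4,8
    const int    ps[] = {1, 2, 4, 8};
    for (int i = 0; i < 4; ++i)
        std::printf("p=%d  s'=%.2f  e'=%.2f\n",
                    ps[i], speedup(t1, tp[i]), efficiency(t1, tp[i], ps[i]));
}
```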


Outline
- Parallel Analysis of Algorithms
- Models
  - Shared Memory (PRAM, SMP)
  - Distributed Memory (BSP, CGM)


Parallel Random Access Machine (PRAM)
- Exclusive-Read (ER) / Concurrent-Read (CR)
- Exclusive-Write (EW) / Concurrent-Write (CW)

[Figure: p processors (proc. 1 … proc. p) connected to a shared memory of cells 1 … n]

Concurrent-Write (CW) resolution rules:
- Common: all processors must write the same value
- Arbitrary: an arbitrary value “wins”
- Smallest: the smallest value “wins”
- Priority: the processor with the smallest ID number “wins”


Default: CREW (Concurrent Read, Exclusive Write)
p = O(n) : fine grained, massively parallel


Performance of a PRAM Algorithm
- Optimal: T = O( Ts / p )
- Efficient: T = O( log^k(n) · Ts / p )
- NC: T = O( log^k(n) ) for p = poly(n)


Example: Multiply n Numbers
Input: a1, a2, …, an
Output: a1 * a2 * a3 * … * an
(* : an associative operator)


Algorithm 1
p = n/2: combine the items pairwise in a balanced binary tree; each of the log n levels is one parallel step.
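A minimal sequential simulation of Algorithm 1 (my sketch; on the PRAM each inner loop below is a single parallel step with p = n/2 processors, and n is assumed a power of two):

```cpp
#include <cstdio>
#include <vector>

// Simulate Algorithm 1: combine pairs level by level (here * is +).
// On a PRAM, all pairs of one level are combined in one parallel step.
int tree_reduce(std::vector<int> a) {
    for (std::size_t len = a.size(); len > 1; len /= 2) {
        for (std::size_t i = 0; i < len / 2; ++i)  // one "parallel" level
            a[i] = a[2 * i] + a[2 * i + 1];
    }
    return a[0];
}

int main() {
    std::printf("%d\n", tree_reduce({1, 2, 3, 4, 5, 6, 7, 8}));  // 36
}
```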


Analysis
p = n/2, T = O( log n )
Ts = O(n), Ts / p = O(1)
Algorithm is efficient & NC, but not optimal.


Algorithm 2
- Make available only p = n / log n processors.
- Execute Algorithm 1 using “rescheduling”: whenever Algorithm 1 has a parallel step where m > n / log n processors are used, simulate this step by a “phase” consisting of m / (n / log n) steps for the n / log n processors (see the sketch below).
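A sketch of the rescheduling simulation (mine, assuming n is a power of two): each level needing m > p virtual processors becomes a phase of ⌈m/p⌉ steps.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Algorithm 2 sketch: the reduction tree of Algorithm 1, but a level that
// needs m > p processors is rescheduled as a phase of ceil(m/p) steps.
int reduce_rescheduled(std::vector<int> a, std::size_t p) {
    for (std::size_t len = a.size(); len > 1; len /= 2) {
        std::size_t m = len / 2;                  // virtual processors used
        for (std::size_t s = 0; s < m; s += p) {  // one phase: ceil(m/p) steps
            std::size_t e = std::min(s + p, m);
            for (std::size_t i = s; i < e; ++i)   // one step of p processors
                a[i] = a[2 * i] + a[2 * i + 1];
        }
    }
    return a[0];
}

int main() {
    std::printf("%d\n", reduce_rescheduled({1, 2, 3, 4, 5, 6, 7, 8}, 2));  // 36
}
```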


Analysis
Number of steps in phase i: (n / 2^i) / (n / log n) = log n / 2^i
T = O( Σ_{i≥1} log n / 2^i ) = O( log n · Σ_{i≥1} 1/2^i ) = O( log n )
p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )
Algorithm is efficient & NC & optimal.


Problem 2: List Ranking
Input: a linked list represented by an array P (P(i) = successor of node i).
Output: the distance of each node to the last node.

Algorithm: Pointer Jumping
Assign processor i to node i.
Initialize (all processors i in parallel):
  D(i) := 0 if P(i) = i, 1 otherwise
REPEAT log n TIMES (all processors i in parallel):
  D(i) := D(i) + D(P(i))
  P(i) := P(P(i))
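A sequential simulation sketch (mine): the PRAM loop is synchronous, so all reads of a round see the old D and P; sequentially this is modeled by double buffering.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Pointer jumping, simulated sequentially with double buffering so that
// every round reads the *old* D and P of all nodes, as on a synchronous PRAM.
std::vector<int> list_rank(std::vector<int> P) {
    const int n = (int)P.size();
    std::vector<int> D(n);
    for (int i = 0; i < n; ++i) D[i] = (P[i] == i) ? 0 : 1;

    const int rounds = (int)std::ceil(std::log2(n > 1 ? n : 2));
    for (int r = 0; r < rounds; ++r) {
        std::vector<int> D2 = D, P2 = P;      // snapshot = "in parallel"
        for (int i = 0; i < n; ++i) {
            D2[i] = D[i] + D[P[i]];
            P2[i] = P[P[i]];
        }
        D.swap(D2); P.swap(P2);
    }
    return D;   // D[i] = distance from node i to the last node
}

int main() {
    // List 0 -> 1 -> 2 -> 3 (node 3 is last, so P(3) = 3).
    for (int d : list_rank({1, 2, 3, 3})) std::printf("%d ", d);  // 3 2 1 0
}
```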

Analysis
p = n, T = O( log n )
Efficient & NC, but not optimal.

Problem 3: Partial Sums
Input: a1, a2, …, an
Output: a1, a1+a2, a1+a2+a3, …, a1+a2+a3+…+an

Parallel Recursion
1. Compute (in parallel): a1+a2, a3+a4, a5+a6, …, a(n-1)+an.
2. Recursively (all processors together) solve the problem for the n/2 numbers a1+a2, a3+a4, a5+a6, …, a(n-1)+an. The result is: (a1+a2), (a1+a2+a3+a4), (a1+a2+a3+a4+a5+a6), …, (a1+a2+…+a(n-3)+a(n-2)), (a1+a2+…+a(n-1)+an).
3. Compute each gap (the missing odd-length prefixes) by adding a single element to its predecessor (see the sketch below).
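A sequential simulation of this recursion (my sketch, n a power of two); each loop marked “parallel” is one O(1) PRAM step using p = n processors:

```cpp
#include <cstdio>
#include <vector>

// Parallel-recursion prefix sums, simulated sequentially.
void prefix_sums(std::vector<long long>& a) {
    const std::size_t n = a.size();
    if (n == 1) return;
    std::vector<long long> b(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)   // step 1 (parallel): pair sums
        b[i] = a[2 * i] + a[2 * i + 1];
    prefix_sums(b);                           // step 2: recurse on n/2 numbers
    for (std::size_t i = 1; i < n; ++i) {     // step 3 (parallel): fill in
        if (i % 2 == 1) a[i] = b[i / 2];      // even-length prefix: from recursion
        else            a[i] += b[i / 2 - 1]; // odd-length gap: predecessor + a_i
    }                                         // a[0] is already correct
}

int main() {
    std::vector<long long> a = {1, 2, 3, 4, 5, 6, 7, 8};
    prefix_sums(a);
    for (long long x : a) std::printf("%lld ", x);  // 1 3 6 10 15 21 28 36
}
```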

Analysis
p = n, T(n) = T(n/2) + O(1), T(1) = O(1) ⇒ T(n) = O( log n )
Efficient and NC, but not optimal.

Improving through rescheduling
Set p = n / log n and simulate the previous algorithm, as before.

Analysis
Number of steps in phase i: (n / 2^i) / (n / log n) = log n / 2^i
T = O( Σ_{i≥1} log n / 2^i ) = O( log n )
p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )
Algorithm is efficient & NC & optimal.

Problem 4: Sorting
Input: a1, a2, …, an
Output: a1, a2, …, an permuted into sorted order


Unimodal sequence: 9 10 13 17 21 19 16 15
Bitonic sequence: a cyclic shift of a unimodal sequence, e.g., 16 15 9 10 13 17 21 19

Bitonic Sorting (Batcher)

Properties of bitonic sequences
Let X = x1 x2 … xn x(n+1) x(n+2) … x(2n) be bitonic.
L(X) = y1 … yn, where yi = min{xi, x(n+i)}
U(X) = z1 … zn, where zi = max{xi, x(n+i)}
(1) L(X) and U(X) are bitonic.
(2) Every element of L(X) is smaller than every element of U(X).

Bitonic Merge: sorting a bitonic sequence

A bitonic sequence of length n can be sorted in time O(log n) using p = n processors, by applying the L(X)/U(X) split recursively (log n levels of compare-exchanges).

Sorting an arbitrary sequence a1, a2, …, an:
- split a1, a2, …, an into two sub-sequences: a1, …, a(n/2) and a(n/2)+1, a(n/2)+2, …, an
- recursively, in parallel, sort each sub-sequence using p/2 processors
- merge the two sorted sub-sequences into one sorted sequence using bitonic merge

Note: if X and Y are sorted sequences (increasing order), then X · Y^R (X followed by the reverse of Y) is a bitonic sequence.

Analysis
p = n, T(n) = T(n/2) + O(log n), T(1) = O(1) ⇒ T(n) = O( log² n )
Efficient and NC, but not optimal.
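A sequential simulation of Batcher's scheme (my sketch, n a power of two); each inner compare-exchange pass is one O(1) parallel step on p = n processors:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Bitonic merge: split a bitonic range into L(X) and U(X), recurse on both.
void bitonic_merge(std::vector<int>& a, std::size_t lo, std::size_t n, bool up) {
    if (n <= 1) return;
    const std::size_t m = n / 2;
    for (std::size_t i = lo; i < lo + m; ++i)      // L(X)/U(X) split, in parallel
        if ((a[i] > a[i + m]) == up) std::swap(a[i], a[i + m]);
    bitonic_merge(a, lo, m, up);
    bitonic_merge(a, lo + m, m, up);
}

void bitonic_sort(std::vector<int>& a, std::size_t lo, std::size_t n, bool up) {
    if (n <= 1) return;
    bitonic_sort(a, lo, n / 2, true);              // sort first half ascending
    bitonic_sort(a, lo + n / 2, n / 2, false);     // second half descending: X·Y^R
    bitonic_merge(a, lo, n, up);                   // whole range is bitonic: merge
}

int main() {
    std::vector<int> a = {16, 15, 9, 10, 13, 17, 21, 19};
    bitonic_sort(a, 0, a.size(), true);
    for (int x : a) std::printf("%d ", x);         // 9 10 13 15 16 17 19 21
}
```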


So what about an SMP machine?
Is it a PRAM? EREW? CREW? CRCW? How does OpenMP play into this?


OpenMP/SMP
≈ CREW PRAM, but coarse grained.
T(p) ≥ f·Ts + (1-f)·Ts/p, for f = sequential fraction
T(n,p) = f·Ts + Σ over all parallel regions of (max thread time per fork)
[Figure: fork-join model: a master thread forks parallel regions and joins after each]
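A minimal OpenMP illustration (my sketch, not from the slides) of one parallel region whose cost is the time of the slowest thread plus fork/join overhead:

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

// One parallel region: the master thread forks a team, each thread sums a
// chunk, and the region's cost is the *maximum* thread time (plus fork/join).
int main() {
    std::vector<double> a(1 << 20, 1.0);
    double sum = 0.0;

    const double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+ : sum)   // one parallel region
    for (long i = 0; i < (long)a.size(); ++i)
        sum += a[i];
    const double t1 = omp_get_wtime();

    std::printf("sum = %.0f, T(p) with p = %d threads: %.6f s\n",
                sum, omp_get_max_threads(), t1 - t0);
}
```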


Outline
- Parallel Analysis of Algorithms
- Models
  - Shared Memory (PRAM, SMP)
  - Distributed Memory (BSP, CGM)

Distributed Memory Models

Parallel Computing
p : number of processors
n : problem size
Ts(n) : sequential time
T(p,n) : parallel time
Speedup: S(p,n) = Ts(n) / T(p,n)

Goal: obtain linear speedup S(p,n) = p.

Parallel Computers: Beowulf cluster, Blue Gene/Q, Cray XK7, custom MPP (Tianhe-2), …

Parallel Machine Models
How to abstract the machine into a simplified model such that:
- algorithm/application design is not hampered by too many details
- calculated time complexity predictions match the actually observed running times (with sufficient accuracy)

Parallel Machine Models
- PRAM
- Fine grained networks (array, ring, mesh, hypercube)
- Bulk Synchronous Parallelism (BSP), Valiant, 1990
- Coarse Grained Multicomputer (CGM), Dehne & Rau-Chaplin, 1993
- Multithreading (CILK), Leiserson, 1995
- many more...

PRAM
p = O(n) processors, massively parallel.
[Figure: processors P1 … Pp attached to a shared memory of cells 1 … n]

Example: PRAM Sort
Sort by repeated merging:
- Bitonic Sort: O(log n) per merge ⇒ O(log² n)
- Cole: O(1) per merge ⇒ O(log n)

Fine Grained Networks
p = O(n) processors, massively parallel.
[Figure: 2D mesh of processors]

Example: Mesh Sort
O(n^(1/2)) time, via recursive sub-mesh merging.

Back to reality...
Would anyone use a parallel machine with n processors in order to sort n items? Of course NOT…
Typical parallel machines have large ratios n/p (e.g., n/p = 16M).

Brent's Theorem
Mapping fine grained ⇒ coarse grained, via virtual processors: if we simulate n virtual processors on p real processors, then S(p) = S(n) · p/n.
S(n) = O(n) "optimal" ⇒ S(p) = O(p) "optimal"
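A sketch (mine) of the virtual-processor mapping: each of p real threads plays about n/p virtual processors per parallel step, so one step costs O(n/p) instead of O(1), giving S(p) = S(n) · p/n.

```cpp
#include <omp.h>
#include <vector>

// Brent-style simulation: n virtual PRAM processors on p real threads.
// Each thread executes the work of roughly n/p virtual processors,
// so one synchronous parallel step costs O(n/p) instead of O(1).
void parallel_step(std::vector<int>& state) {
    const long n = (long)state.size();        // one virtual processor per cell
    #pragma omp parallel for schedule(static) // p = omp_get_max_threads()
    for (long v = 0; v < n; ++v)
        state[v] += 1;                        // work of virtual processor v
}
```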

The Problem!
Fine grained PRAM and fixed network algorithms are VERY slow when implemented on commercial parallel machines. Why?

Why?
[Figure: measured speedup S(p) for p up to n, far below the linear ideal]
The assumption is not true: in most cases, S(n) is NOT optimal.
[Figure: e.g., for mesh sort, S(n) = n log n / n^(1/2) = n^(1/2) log n, not the optimal S(n) = n]

Coarse Grained Multicomputer (CGM), Dehne & Rau-Chaplin, 1993
[Figure: CGM speedup S(p) staying near-linear for p << n]

CGM:
- coarse grained memory
- coarse grained computation
- coarse grained communication

Coarse Grained Memory
Ignore small n/p; e.g., assume n/p > p.
[Figure: p processors (proc + mem + comm units), each holding n/p data, connected by a network or shared memory]

Coarse Grained Computation
Compute in supersteps with barrier synchronization (as in BSP).
[Figure: processors computing in rounds 1, 2, 3, separated by barriers]

Coarse Grained Communication
- All communication steps are h-relations, h = O(n/p).
- No individual messages.
[Figure: computation rounds separated by h-relation communication phases]

h-Relation
[Figure: each processor sends and receives at most h = O(n/p) data per communication step]

CGM complexity measures:
- number of rounds (e.g., O(1), O(log p), …)
- scalability (e.g., n/p > p)
- local computation
- communication volume

CGM = coarse grained memory + coarse grained computation + coarse grained communication
⇒ practical parallel algorithms, efficient and portable.

Deterministic Sample Sort (CGM Algorithm)
1. Sort locally and create a p-sample (each processor).
2. Send all p-samples to processor 1.
3. Processor 1: sort all received samples and compute the global p-sample.
4. Broadcast the global p-sample.
5. Bucket locally according to the global p-sample.
6. Send bucket i to processor i.
7. Re-sort locally.
[Figure: p processors, each holding n/p data items and a p-sample]

Det. Sample Sort: Analysis
- O(1) rounds, for n/p > p²
- O( (n/p) log n ) local computation
- Goodrich (FOCS'98): O(1) rounds for n/p > p^ε
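One plausible way to realize these rounds with MPI (a sketch of mine: the helper name cgm_sample_sort, the splitter choice, and using rank 0 as "processor 1" are illustrative; MPI_Gather, MPI_Bcast, MPI_Alltoall, and MPI_Alltoallv are the real MPI calls):

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Teaching sketch of CGM deterministic sample sort, assuming n/p > p^2.
std::vector<int> cgm_sample_sort(std::vector<int> local, MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    const int m = (int)local.size();                 // n/p items per processor

    // Round 1: sort locally, pick p regularly spaced samples.
    std::sort(local.begin(), local.end());
    std::vector<int> sample(p);
    for (int i = 0; i < p; ++i) sample[i] = local[(long)i * m / p];

    // Round 2: gather all p*p samples on rank 0 ("processor 1").
    std::vector<int> all(rank == 0 ? p * p : 0);
    MPI_Gather(sample.data(), p, MPI_INT, all.data(), p, MPI_INT, 0, comm);

    // Rank 0: sort the samples, choose p-1 global splitters.
    std::vector<int> splitters(p - 1);
    if (rank == 0) {
        std::sort(all.begin(), all.end());
        for (int i = 1; i < p; ++i) splitters[i - 1] = all[i * p];
    }
    // Round 3: broadcast the global p-sample (splitters).
    MPI_Bcast(splitters.data(), p - 1, MPI_INT, 0, comm);

    // Bucket locally: bucket i = items up to splitters[i] (last bucket: rest).
    std::vector<int> scnt(p), sdsp(p);
    for (int i = 0, j = 0; i < p; ++i) {
        sdsp[i] = j;
        while (j < m && (i == p - 1 || local[j] <= splitters[i])) ++j;
        scnt[i] = j - sdsp[i];
    }

    // Round 4: exchange bucket sizes, then the buckets (an h-relation).
    std::vector<int> rcnt(p), rdsp(p);
    MPI_Alltoall(scnt.data(), 1, MPI_INT, rcnt.data(), 1, MPI_INT, comm);
    int total = 0;
    for (int i = 0; i < p; ++i) { rdsp[i] = total; total += rcnt[i]; }
    std::vector<int> result(total);
    MPI_Alltoallv(local.data(), scnt.data(), sdsp.data(), MPI_INT,
                  result.data(), rcnt.data(), rdsp.data(), MPI_INT, comm);

    std::sort(result.begin(), result.end());         // re-sort locally
    return result;
}
```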

