Parallel Analysis of Algorithms: PRAM + CGM


Outline
- Parallel Performance
- Parallel Models
  - Shared Memory (PRAM, SMP)
  - Distributed Memory (BSP, CGM)

Question: Professor Speedy says he has a parallel algorithm for sorting n arbitrary items in n time using p > 1 processors. Do you believe him?

Performance of a Parallel Algorithm

n : problem size (e.g., sort n numbers)
p : number of processors
T(p) : parallel time
Ts : sequential time (of the optimal sequential algorithm)
s(p) = Ts / T(p) : speedup (1 ≤ s(p) ≤ p)

[Figure: speedup s(p) versus p; the diagonal s(p) = p separates super-linear (above), linear (on), and sub-linear (below) speedup curves]


Speedup
- Linear speedup s(p) = p : optimal
- Super-linear speedup s(p) > p : impossible

Proof: Assume parallel algorithm A has a speedup s > p on p processors, i.e., s = Ts / T(p) > p. Hence Ts > T(p) · p. Simulate A on a sequential, single-processor machine. Then T(1) ≤ T(p) · p < Ts. Hence Ts was not optimal. Contradiction.


Amdahl’s Law
Let f, 0 < f < 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup is s ≤ 1 / [f + (1-f)/p].

Proof: Let Ts be the sequential time. Then T(p) ≥ f·Ts + (1-f)·Ts/p. Hence
s ≤ Ts / [f·Ts + (1-f)·Ts/p] = 1 / [f + (1-f)/p].
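To make the bound concrete, here is a small C++ sketch (mine, not from the slides) that evaluates Amdahl's bound for a hypothetical sequential fraction f = 0.05; the bound approaches 1/f = 20 no matter how many processors are added:

```cpp
#include <cstdio>

// Amdahl bound: maximum speedup for sequential fraction f on p processors.
double amdahl(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main() {
    // e.g., f = 0.05: the bound approaches 1/f = 20 as p grows.
    for (int p : {1, 10, 100, 1000})
        std::printf("p = %4d  ->  s <= %.2f\n", p, amdahl(0.05, p));
}
```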

Amdahl’s Law

[Figure: (a) one processor: serial section f·ts followed by parallelizable sections (1-f)·ts, total ts; (b) p processors: serial section f·ts plus parallel sections of length (1-f)·ts/p, total tp]

Amdahl’s Law
[Figure: execution time for p = 1, 5, 10, 1000; the parallelizable part shrinks with p while the serial part stays constant]

Amdahl’s Law

s(p) ≤ 1 / [f + (1-f)/p]
- f → 0 : s(p) → p
- f → 1 : s(p) → 1
- f = 0.5 : s(p) = 2p/(p+1) ≤ 2
- f = 1/k : s(p) = k / [1 + (k-1)/p] ≤ k

[Figure: maximum speedup s as a function of k, for f = 1/k; the bound is s ≤ k]


Scaled or Relative Speedup

Ts may be unknown (in fact, for most real experiments this is the case)

Relative speedup: s'(p) = T(1) / T(p)

s'(p) ≥ s(p)


Efficiency
e(p) = s(p) / p : efficiency (0 ≤ e ≤ 1)
Optimal linear speedup s(p) = p ⇔ e(p) = 1
e'(p) = s'(p) / p : relative efficiency
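As a small illustration (a sketch of mine, with made-up timings), computing relative speedup and efficiency from measured wall-clock times:

```cpp
#include <cstdio>

// Hypothetical measured wall-clock times (seconds); T(1) stands in for Ts,
// so these are the *relative* quantities s'(p) and e'(p).
double speedup(double t1, double tp)           { return t1 / tp; }
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

int main() {
    const double t1   = 120.0;                       // assumed T(1)
    const double tp[] = {120.0, 65.0, 38.0, 26.0};   // assumed T(p), p = 1,2,4,8
    const int    ps[] = {1, 2, 4, 8};
    for (int i = 0; i < 4; ++i)
        std::printf("p=%d  s'=%.2f  e'=%.2f\n",
                    ps[i], speedup(t1, tp[i]), efficiency(t1, tp[i], ps[i]));
}
```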


Outline
- Parallel Analysis of Algorithms
- Models
  - Shared Memory (PRAM, SMP)
  - Distributed Memory (BSP, CGM)


Parallel Random Access Machine (PRAM)
- Exclusive-Read (ER) / Concurrent-Read (CR)
- Exclusive-Write (EW) / Concurrent-Write (CW)

[Figure: p processors (proc. 1 … proc. p) connected to a shared memory of cells 1 … n]

Concurrent-Write (CW) resolution rules:
- Common: all processors must write the same value
- Arbitrary: an arbitrary value “wins”
- Smallest: the smallest value “wins”
- Priority: the processor with the smallest ID number “wins”


Default: CREW (Concurrent Read, Exclusive Write)
p = O(n) : fine grained, massively parallel


Performance of a PRAM Algorithm
- Optimal: T = O( Ts / p )
- Efficient: T = O( log^k(n) · Ts / p )
- NC: T = O( log^k(n) ) for p = poly(n)


Example: Multiply n Numbers
Input: a1, a2, …, an
Output: a1 * a2 * a3 * … * an
(* : an associative operator)


Algorithm 1
p = n/2: combine the items pairwise in a balanced binary tree; each of the log n levels is one parallel step.
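A minimal sequential simulation of Algorithm 1 (my sketch; on the PRAM each inner loop below is a single parallel step with p = n/2 processors, and n is assumed a power of two):

```cpp
#include <cstdio>
#include <vector>

// Simulate Algorithm 1: combine pairs level by level (here * is +).
// On a PRAM, all pairs of one level are combined in one parallel step.
int tree_reduce(std::vector<int> a) {
    for (std::size_t len = a.size(); len > 1; len /= 2) {
        for (std::size_t i = 0; i < len / 2; ++i)  // one "parallel" level
            a[i] = a[2 * i] + a[2 * i + 1];
    }
    return a[0];
}

int main() {
    std::printf("%d\n", tree_reduce({1, 2, 3, 4, 5, 6, 7, 8}));  // 36
}
```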


Analysis
p = n/2, T = O( log n )
Ts = O(n), Ts / p = O(1)
Algorithm is efficient & NC, but not optimal.


Algorithm 2
- Make available only p = n / log n processors.
- Execute Algorithm 1 using “rescheduling”: whenever Algorithm 1 has a parallel step where m > n / log n processors are used, simulate this step by a “phase” consisting of m / (n / log n) steps for the n / log n processors (see the sketch below).
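A sketch of the rescheduling simulation (mine, assuming n is a power of two): each level needing m > p virtual processors becomes a phase of ⌈m/p⌉ steps.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Algorithm 2 sketch: the reduction tree of Algorithm 1, but a level that
// needs m > p processors is rescheduled as a phase of ceil(m/p) steps.
int reduce_rescheduled(std::vector<int> a, std::size_t p) {
    for (std::size_t len = a.size(); len > 1; len /= 2) {
        std::size_t m = len / 2;                  // virtual processors used
        for (std::size_t s = 0; s < m; s += p) {  // one phase: ceil(m/p) steps
            std::size_t e = std::min(s + p, m);
            for (std::size_t i = s; i < e; ++i)   // one step of p processors
                a[i] = a[2 * i] + a[2 * i + 1];
        }
    }
    return a[0];
}

int main() {
    std::printf("%d\n", reduce_rescheduled({1, 2, 3, 4, 5, 6, 7, 8}, 2));  // 36
}
```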


Analysis
Number of steps in phase i: (n / 2^i) / (n / log n) = log n / 2^i
T = O( Σ_{i≥1} log n / 2^i ) = O( log n · Σ_{i≥1} 1/2^i ) = O( log n )
p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )
Algorithm is efficient & NC & optimal.


Problem 2: List Ranking
Input: a linked list represented by an array P (P(i) = successor of node i).
Output: the distance of each node to the last node.

Algorithm: Pointer Jumping
Assign processor i to node i.
Initialize (all processors i in parallel):
  D(i) := 0 if P(i) = i, 1 otherwise
REPEAT log n TIMES (all processors i in parallel):
  D(i) := D(i) + D(P(i))
  P(i) := P(P(i))
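A sequential simulation sketch (mine): the PRAM loop is synchronous, so all reads of a round see the old D and P; sequentially this is modeled by double buffering.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Pointer jumping, simulated sequentially with double buffering so that
// every round reads the *old* D and P of all nodes, as on a synchronous PRAM.
std::vector<int> list_rank(std::vector<int> P) {
    const int n = (int)P.size();
    std::vector<int> D(n);
    for (int i = 0; i < n; ++i) D[i] = (P[i] == i) ? 0 : 1;

    const int rounds = (int)std::ceil(std::log2(n > 1 ? n : 2));
    for (int r = 0; r < rounds; ++r) {
        std::vector<int> D2 = D, P2 = P;      // snapshot = "in parallel"
        for (int i = 0; i < n; ++i) {
            D2[i] = D[i] + D[P[i]];
            P2[i] = P[P[i]];
        }
        D.swap(D2); P.swap(P2);
    }
    return D;   // D[i] = distance from node i to the last node
}

int main() {
    // List 0 -> 1 -> 2 -> 3 (node 3 is last, so P(3) = 3).
    for (int d : list_rank({1, 2, 3, 3})) std::printf("%d ", d);  // 3 2 1 0
}
```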

Analysis
p = n, T = O( log n )
Efficient & NC, but not optimal.

Problem 3: Partial Sums
Input: a1, a2, …, an
Output: a1, a1+a2, a1+a2+a3, …, a1+a2+a3+…+an

Parallel Recursion
1. Compute (in parallel): a1+a2, a3+a4, a5+a6, …, a(n-1)+an.
2. Recursively (all processors together) solve the problem for the n/2 numbers a1+a2, a3+a4, a5+a6, …, a(n-1)+an. The result is: (a1+a2), (a1+a2+a3+a4), (a1+a2+a3+a4+a5+a6), …, (a1+a2+…+a(n-3)+a(n-2)), (a1+a2+…+a(n-1)+an).
3. Compute each gap (the missing odd-length prefixes) by adding a single element to its predecessor (see the sketch below).
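A sequential simulation of this recursion (my sketch, n a power of two); each loop marked “parallel” is one O(1) PRAM step using p = n processors:

```cpp
#include <cstdio>
#include <vector>

// Parallel-recursion prefix sums, simulated sequentially.
void prefix_sums(std::vector<long long>& a) {
    const std::size_t n = a.size();
    if (n == 1) return;
    std::vector<long long> b(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)   // step 1 (parallel): pair sums
        b[i] = a[2 * i] + a[2 * i + 1];
    prefix_sums(b);                           // step 2: recurse on n/2 numbers
    for (std::size_t i = 1; i < n; ++i) {     // step 3 (parallel): fill in
        if (i % 2 == 1) a[i] = b[i / 2];      // even-length prefix: from recursion
        else            a[i] += b[i / 2 - 1]; // odd-length gap: predecessor + a_i
    }                                         // a[0] is already correct
}

int main() {
    std::vector<long long> a = {1, 2, 3, 4, 5, 6, 7, 8};
    prefix_sums(a);
    for (long long x : a) std::printf("%lld ", x);  // 1 3 6 10 15 21 28 36
}
```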

Analysis
p = n, T(n) = T(n/2) + O(1), T(1) = O(1) ⇒ T(n) = O( log n )
Efficient and NC, but not optimal.

Improving through rescheduling
Set p = n / log n and simulate the previous algorithm, as before.

Analysis
Number of steps in phase i: (n / 2^i) / (n / log n) = log n / 2^i
T = O( Σ_{i≥1} log n / 2^i ) = O( log n )
p = n / log n, Ts / p = O( n / [n / log n] ) = O( log n )
Algorithm is efficient & NC & optimal.

Problem 4: Sorting
Input: a1, a2, …, an
Output: a1, a2, …, an permuted into sorted order


Unimodal sequence: 9 10 13 17 21 19 16 15
Bitonic sequence: a cyclic shift of a unimodal sequence, e.g., 16 15 9 10 13 17 21 19

Bitonic Sorting (Batcher)

Properties of bitonic sequences
Let X = x1 x2 … xn x(n+1) x(n+2) … x(2n) be bitonic.
L(X) = y1 … yn, where yi = min{xi, x(n+i)}
U(X) = z1 … zn, where zi = max{xi, x(n+i)}
(1) L(X) and U(X) are bitonic.
(2) Every element of L(X) is smaller than every element of U(X).

Bitonic Merge: sorting a bitonic sequence

A bitonic sequence of length n can be sorted in time O(log n) using p = n processors, by applying the L(X)/U(X) split recursively (log n levels of compare-exchanges).

Sorting an arbitrary sequence a1, a2, …, an:
- split a1, a2, …, an into two sub-sequences: a1, …, a(n/2) and a(n/2)+1, a(n/2)+2, …, an
- recursively, in parallel, sort each sub-sequence using p/2 processors
- merge the two sorted sub-sequences into one sorted sequence using bitonic merge

Note: if X and Y are sorted sequences (increasing order), then X · Y^R (X followed by the reverse of Y) is a bitonic sequence.

Analysis
p = n, T(n) = T(n/2) + O(log n), T(1) = O(1) ⇒ T(n) = O( log² n )
Efficient and NC, but not optimal.
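A sequential simulation of Batcher's scheme (my sketch, n a power of two); each inner compare-exchange pass is one O(1) parallel step on p = n processors:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Bitonic merge: split a bitonic range into L(X) and U(X), recurse on both.
void bitonic_merge(std::vector<int>& a, std::size_t lo, std::size_t n, bool up) {
    if (n <= 1) return;
    const std::size_t m = n / 2;
    for (std::size_t i = lo; i < lo + m; ++i)      // L(X)/U(X) split, in parallel
        if ((a[i] > a[i + m]) == up) std::swap(a[i], a[i + m]);
    bitonic_merge(a, lo, m, up);
    bitonic_merge(a, lo + m, m, up);
}

void bitonic_sort(std::vector<int>& a, std::size_t lo, std::size_t n, bool up) {
    if (n <= 1) return;
    bitonic_sort(a, lo, n / 2, true);              // sort first half ascending
    bitonic_sort(a, lo + n / 2, n / 2, false);     // second half descending: X·Y^R
    bitonic_merge(a, lo, n, up);                   // whole range is bitonic: merge
}

int main() {
    std::vector<int> a = {16, 15, 9, 10, 13, 17, 21, 19};
    bitonic_sort(a, 0, a.size(), true);
    for (int x : a) std::printf("%d ", x);         // 9 10 13 15 16 17 19 21
}
```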


So what about an SMP machine?
Is it a PRAM? EREW? CREW? CRCW? How does OpenMP play into this?


OpenMP/SMP
≈ CREW PRAM, but coarse grained.
T(p) ≥ f·Ts + (1-f)·Ts/p, for f = sequential fraction
T(n,p) = f·Ts + Σ over all parallel regions of (max thread time per fork)
[Figure: fork-join model: a master thread forks parallel regions and joins after each]
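A minimal OpenMP illustration (my sketch, not from the slides) of one parallel region whose cost is the time of the slowest thread plus fork/join overhead:

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

// One parallel region: the master thread forks a team, each thread sums a
// chunk, and the region's cost is the *maximum* thread time (plus fork/join).
int main() {
    std::vector<double> a(1 << 20, 1.0);
    double sum = 0.0;

    const double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+ : sum)   // one parallel region
    for (long i = 0; i < (long)a.size(); ++i)
        sum += a[i];
    const double t1 = omp_get_wtime();

    std::printf("sum = %.0f, T(p) with p = %d threads: %.6f s\n",
                sum, omp_get_max_threads(), t1 - t0);
}
```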


Outline
- Parallel Analysis of Algorithms
- Models
  - Shared Memory (PRAM, SMP)
  - Distributed Memory (BSP, CGM)

Distributed Memory Models

Parallel Computing
p : number of processors
n : problem size
Ts(n) : sequential time
T(p,n) : parallel time
Speedup: S(p,n) = Ts(n) / T(p,n)

Goal: obtain linear speedup S(p,n) = p.

Parallel Computers: Beowulf cluster, Blue Gene/Q, Cray XK7, custom MPP (Tianhe-2), …

Parallel Machine Models
How to abstract the machine into a simplified model such that:
- algorithm/application design is not hampered by too many details
- calculated time complexity predictions match the actually observed running times (with sufficient accuracy)

Parallel Machine Models
- PRAM
- Fine grained networks (array, ring, mesh, hypercube)
- Bulk Synchronous Parallelism (BSP), Valiant, 1990
- Coarse Grained Multicomputer (CGM), Dehne & Rau-Chaplin, 1993
- Multithreading (CILK), Leiserson, 1995
- many more...

PRAM
p = O(n) processors, massively parallel.
[Figure: processors P1 … Pp attached to a shared memory of cells 1 … n]

Example: PRAM Sort
Sort by repeated merging:
- Bitonic Sort: O(log n) per merge ⇒ O(log² n)
- Cole: O(1) per merge ⇒ O(log n)

Fine Grained Networks
p = O(n) processors, massively parallel.
[Figure: 2D mesh of processors]

Example: Mesh Sort
O(n^(1/2)) time, via recursive sub-mesh merging.

Back to reality...
Would anyone use a parallel machine with n processors in order to sort n items? Of course NOT…
Typical parallel machines have large ratios n/p (e.g., n/p = 16M).

Brent's Theorem
Mapping fine grained ⇒ coarse grained, via virtual processors: if we simulate n virtual processors on p real processors, then S(p) = S(n) · p/n.
S(n) = O(n) "optimal" ⇒ S(p) = O(p) "optimal"
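A sketch (mine) of the virtual-processor mapping: each of p real threads plays about n/p virtual processors per parallel step, so one step costs O(n/p) instead of O(1), giving S(p) = S(n) · p/n.

```cpp
#include <omp.h>
#include <vector>

// Brent-style simulation: n virtual PRAM processors on p real threads.
// Each thread executes the work of roughly n/p virtual processors,
// so one synchronous parallel step costs O(n/p) instead of O(1).
void parallel_step(std::vector<int>& state) {
    const long n = (long)state.size();        // one virtual processor per cell
    #pragma omp parallel for schedule(static) // p = omp_get_max_threads()
    for (long v = 0; v < n; ++v)
        state[v] += 1;                        // work of virtual processor v
}
```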

The Problem!
Fine grained PRAM and fixed network algorithms are VERY slow when implemented on commercial parallel machines. Why?

Why?
[Figure: measured speedup S(p) for p up to n, far below the linear ideal]
The assumption is not true: in most cases, S(n) is NOT optimal.
[Figure: e.g., for mesh sort, S(n) = n log n / n^(1/2) = n^(1/2) log n, not the optimal S(n) = n]

Coarse Grained Multicomputer (CGM), Dehne & Rau-Chaplin, 1993
[Figure: CGM speedup S(p) staying near-linear for p << n]

CGM:
- coarse grained memory
- coarse grained computation
- coarse grained communication

Coarse Grained Memory
Ignore small n/p; e.g., assume n/p > p.
[Figure: p processors (proc + mem + comm units), each holding n/p data, connected by a network or shared memory]

Coarse Grained Computation
Compute in supersteps with barrier synchronization (as in BSP).
[Figure: processors computing in rounds 1, 2, 3, separated by barriers]

Coarse Grained Communication
- All communication steps are h-relations, h = O(n/p).
- No individual messages.
[Figure: computation rounds separated by h-relation communication phases]

h-Relation
[Figure: each processor sends and receives at most h = O(n/p) data per communication step]

CGM complexity measures:
- number of rounds (e.g., O(1), O(log p), …)
- scalability (e.g., n/p > p)
- local computation
- communication volume

CGM = coarse grained memory + coarse grained computation + coarse grained communication
⇒ practical parallel algorithms, efficient and portable.

Deterministic Sample Sort (CGM Algorithm)
1. Sort locally and create a p-sample (each processor).
2. Send all p-samples to processor 1.
3. Processor 1: sort all received samples and compute the global p-sample.
4. Broadcast the global p-sample.
5. Bucket locally according to the global p-sample.
6. Send bucket i to processor i.
7. Re-sort locally.
[Figure: p processors, each holding n/p data items and a p-sample]

Det. Sample Sort: Analysis
- O(1) rounds, for n/p > p²
- O( (n/p) log n ) local computation
- Goodrich (FOCS'98): O(1) rounds for n/p > p^ε
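One plausible way to realize these rounds with MPI (a sketch of mine: the helper name cgm_sample_sort, the splitter choice, and using rank 0 as "processor 1" are illustrative; MPI_Gather, MPI_Bcast, MPI_Alltoall, and MPI_Alltoallv are the real MPI calls):

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Teaching sketch of CGM deterministic sample sort, assuming n/p > p^2.
std::vector<int> cgm_sample_sort(std::vector<int> local, MPI_Comm comm) {
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    const int m = (int)local.size();                 // n/p items per processor

    // Round 1: sort locally, pick p regularly spaced samples.
    std::sort(local.begin(), local.end());
    std::vector<int> sample(p);
    for (int i = 0; i < p; ++i) sample[i] = local[(long)i * m / p];

    // Round 2: gather all p*p samples on rank 0 ("processor 1").
    std::vector<int> all(rank == 0 ? p * p : 0);
    MPI_Gather(sample.data(), p, MPI_INT, all.data(), p, MPI_INT, 0, comm);

    // Rank 0: sort the samples, choose p-1 global splitters.
    std::vector<int> splitters(p - 1);
    if (rank == 0) {
        std::sort(all.begin(), all.end());
        for (int i = 1; i < p; ++i) splitters[i - 1] = all[i * p];
    }
    // Round 3: broadcast the global p-sample (splitters).
    MPI_Bcast(splitters.data(), p - 1, MPI_INT, 0, comm);

    // Bucket locally: bucket i = items up to splitters[i] (last bucket: rest).
    std::vector<int> scnt(p), sdsp(p);
    for (int i = 0, j = 0; i < p; ++i) {
        sdsp[i] = j;
        while (j < m && (i == p - 1 || local[j] <= splitters[i])) ++j;
        scnt[i] = j - sdsp[i];
    }

    // Round 4: exchange bucket sizes, then the buckets (an h-relation).
    std::vector<int> rcnt(p), rdsp(p);
    MPI_Alltoall(scnt.data(), 1, MPI_INT, rcnt.data(), 1, MPI_INT, comm);
    int total = 0;
    for (int i = 0; i < p; ++i) { rdsp[i] = total; total += rcnt[i]; }
    std::vector<int> result(total);
    MPI_Alltoallv(local.data(), scnt.data(), sdsp.data(), MPI_INT,
                  result.data(), rcnt.data(), rdsp.data(), MPI_INT, comm);

    std::sort(result.begin(), result.end());         // re-sort locally
    return result;
}
```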

