parallel algorithm

Parallel Algorithms

Computation Models

• Goal of computation model is to provide a realistic representation of the costs of programming.

• Model provides algorithm designers and programmers a measure of algorithm complexity which helps them decide what is “good” (i.e. performance-efficient)

Goal for Modeling

• We want to develop computational models which accurately represent the cost and performance of programs

• If model is poor, optimum in model may not coincide with optimum observed in practice

Model Real World

optimum

Models of Computation

What’s a model good for??

• Provides a way to think about computers. Influences design of:

• Architectures

• Languages

• Algorithms

• Provides a way of estimating how well a program will perform.

Cost in model should be roughly same as cost of executing program

The Random Access Machine Model

RAM model of serial computers:– Memory is a sequence of words, each

capable of containing an integer.

– Each memory access takes one unit of time

– Basic operations (add, multiply, compare) take one unit time.

– Instructions are not modifiable

– Read-only input tape, write-only output tape

Has RAM influenced our thinking?

Language design:

No way to designate registers, cache, DRAM.

Most convenient disk access is as streams.

How do you express atomic read/modify/write?

Machine & system design:

It’s not very easy to modify code.

Systems pretend instructions are executed in-order.

Performance Analysis:

Primary measures are operations/sec (MFlop/sec, MHz, ...)

What’s the difference between Quicksort and Heapsort??

What about parallel computers

• RAM model is generally considered a very successful “bridging model” between programmer and hardware.

• “Since RAM is so successful, let’s generalize it for parallel computers ...”

PRAM [Parallel Random Access Machine]

PRAM composed of:– P processors, each with its own unmodifiable

program.

– A single shared memory composed of a sequence of words, each capable of containing an arbitrary integer.

– a read-only input tape.

– a write-only output tape.

PRAM model is a synchronous, MIMD, shared address space parallel computer.

(Introduced by Fortune and Wyllie, 1978)

PRAM model of computation

• p processors, each with local memory

• Synchronous operation

• Shared memory reads and writes

• Each processor has unique id in range 1-p

Shared memory

Characteristics• At each unit of time, a processor is either

active or idle (depending on id)

• All processors execute same program

• At each time step, all processors execute same instruction on different data (“data-parallel”)

• Focuses on concurrency only

Variants of PRAM modelExclusive

WriteConcurrent

ExclusiveRead

EREW ERCW

ConcurrentRead

CREW CRCW

More PRAM taxonomy• Different protocols can be used for reading

and writing shared memory.– EREW - exclusive read, exclusive write

A program isn’t allowed to have two processors access the same memory location at the same time.

– CREW - concurrent read, exclusive write

– CRCW - concurrent read, concurrent write Needs protocol for arbitrating write conflicts

– CROW – concurrent read, owner writeEach memory location has an official “owner”

• PRAM can emulate a message-passing machine by partitioning memory into private memories.

Sub-variants of CRCW• Common CRCW

– CW iff all processors writing same value

• Arbitrary CRCW– Arbitrary value of write set stored

• Priority CRCW– Value of min-index processor stored

• Combining CRCW

Why study PRAM algorithms?• Well-developed body of literature on

design and analysis of such algorithms

• Baseline model of concurrency

• Explicit model– Specify operations at each step

– Scheduling of operations on processors

• Robust design paradigm

Work-Time paradigm• Higher-level abstraction for PRAM algorithms

• WT algorithm = (finite) sequence of time steps with arbitrary number of operations at each step

• Two complexity measures– Step complexity T(n)

– Work complexity W(n)

WT algorithm work-efficient if W(n) = (TS(n))

optimal sequential Algorithm

Designing PRAM algorithms• Balanced trees

• Pointer jumping

• Euler tours

• Divide and conquer

• Symmetry breaking

• . . .

Balanced trees• Key idea: Build balanced binary tree on

input data, sweep tree up and down

• “Tree” not a data structure, often a control structure (e.g., recursion)

Alg : Sum • Given: Sequence a of n = 2k elements

• Given: Binary associative operator +

• Compute: S = a1 + ... + an

WT description of sum

integer B[1..n]forall i in 1 : n do B[i] := ai

enddofor h = 1 to k do forall i in 1 : n/2h do B[i] := B[2i-1] + B[2i] enddoenddoS := B[1]

Points to note about WT pgm• Global program: no references to

processor id

• Contains both serial and concurrent operations

• Semantics of forall

• Order of additions different from sequential order: associativity critical

Analysis of scan operation• Algorithm is correct

(lg n) steps, (n) work

• EREW model

• Two variants– Inclusive: as discussed

– Exclusive: s1 = I, sk = x1 + ... + xk-1

• If n not power of 2, pad to next power

1)(1 2

Complexity measures of Sum• Recall definitions of

step complexity T(n) and work complexity W(n)

• Concurrent execution reduces number of steps

)(lg11)( nknT

How to do prefix sum ?• Input: Sequence x of n = 2k elements,

binary associative operator +

• Output: Sequence s of n = 2k elements, with sk = x1 + ... + xk

• Example:

x = [1, 4, 3, 5, 6, 7, 0, 1]

s = [1, 5, 8, 13, 19, 26, 26, 27]

List Ranking

• List ranking problem– Given a singly linked list L with n objects, for each node,

compute the distance to the end of the list

• If d denotes the distance– node.d = 0 if node.next = nil

– node.next.d + 1 otherwise

• Serial algorithm: O(n)

• Parallel algorithm– Assign one processor for each node

– Assume there are as many processors as list objects

– For each node i, perform1. i.d = i.d + i.next.d

2. i.next = i.next.next // pointer jumping

List Ranking - Pointer Jumping

• List_ranking(L)1. for each node i, in parallel do

2. if i.next = nil then i.d = 0

3. else i.d = 1

4. while exists a node i, such that i.next != nil do

5. for each node i, in parallel do

6. if i.next != nil then

7. i.d = i.d + i.next.d // i updates i itself

8. i.next = i.next.next

• Analysis– After a pointer jumping, a list is transformed into two (interleaved) lists

– After that, four (interleaved) lists

– Each pointer jumping doubles the number of lists and halves their length

– After log n, all lists contain only one node

– Total time: O(log n)

List Ranking - Example

List Ranking - Discussion

• Synchronization is important– In step 8 (i.next = i.next.next), all processors must read right hand side

before any processor write left hand side

• The list ranking algorithm is EREW– If we assume in step 7 (i.d = i.d + i.next.d) all processors read i.d and

then read i.next.d

– If j.next = i, i and j do not read i.d concurrently

• Work performance– performs O(n log n) work since n processors in O(log n) time

• Work efficient– A PRAM algorithm is work efficient w.r.t another algorithm if two

algorithms are within a constant factor

– Is the link ranking algorithm work-efficient w.r.t the serial algorithm?• No, because O(n log n) versus O(n)

• Speedup – S = n / log n

Parallel Prefix on a List

• Prefix computation

– Input <x1, x2, .., xn>, a binary, associative operator

– Output <y1, y2, .., yn>

– Prefix computation: yk = x1 x2 .. xk

• Example

– if xk = 1 for k=1..n and = +

– Then yk = k, for k = 1..n

• Serial algorithm: O(n)

• Notation

– [i, j] = xi xi+1 .. xj

• [k, k] = xk • [i, k] [k+1, j] = [i, j]

• Idea: perform prefix computation on a linked list so that

– each node k contains [k, k] = xk initially

– finally each node k contains [1, k] = yk

Parallel Prefix on a List (2)

• List_prefix(L, X)

// L: list, X: <x1, x2, .., xn>

1. for each node i, in parallel

2. i.y = xi

3. While exists a node i such that i.next != nil do

4. for each node i, in parallel do

5. if i.next != nil then

6. i.next.y = i.y i.next.y // i updates its successor

7. i.next = i.next.next

• Analysis– Initially k-th node has [k,k] as y-value, points to (k+1)-th node

– At the first iteration, • k-th node fetches [k+1,k+1] from its successor and

• perform [k,k] [k+1,k+1] resulting in [k,k+1] and • update its successor

– At the second iteration• k-th node fetches [k+1,k+2] from its successor and

• perform [k-1,k] [k+1,k+2] resulting in [k-1,k+2] and • update its successor

• Running time: O(log n)

– After log n, all lists contain only one node

• Work performed: O(n log n)

• Speedup

– S = n / log n

Pointer jumping• Fast parallel processing of linked data

structures (lists, trees)

• Convention: Draw trees with edges directed from children to parents

• Example: Finding the roots of forest represented as parent array P– P[i] = j if and only if (i, j) is a forest edge

– P[i] = i if and only if i is a root

Algorithm (Roots of forest)

forall i in 1:n do S[i] := P[i] while S[i] != S[S[i]] do S[i] := S[S[i]] endwhileenddo

Initial state of forest

After one iteration

After another iteration

Concurrent Read – Finding Roots

Analysis of pointer jumping• Termination detection?

• At each step, tree distance between i and S[i] doubles unless S[i] is a root

• CREW model

• Correctness by induction on h

• O(lg h) steps, O(n lg h) work

• TS(n) = O(n)

• Not work-efficient unless h constant

Concurrent Read – Finding Roots

• This is a CREW algorithm

• Suppose Exclusive-Read is used, what will be the running time?

– Initially only one node i has root information

– First iteration: Another node reads from the node i

• Totally two nodes are filled up

– Second iteration: Another two nodes can reads from the two nodes

• Totally four nodes are filled up

– k-th iteration: 2k-1 nodes are filled up

– If there are n nodes, k=log n

– So Find_root with Exclusive-Read takes O(log n).

• O(log log n) vs. O(log n)

Euler tours• Technique for fast optimal processing of

tree data

• Euler circuit of directed graph: directed cycle that traverses each edge exactly once

• Represent (rooted) tree by Euler circuit of its directed version

Trees = balanced parentheses

( ( ( ) ( ) ) ( ) ( ( ) ( ) ( ) ) )

Key property: The parenthesis subsequence corresponding to a subtree is balanced.

Computing the Depth

• Problem definition– Given a binary tree with n nodes, compute the depth

of each node

• Serial algorithm takes O(n) time

• A simple parallel algorithm– Starting from root, compute the depths level by level

– Still O(n) because the height of the tree could be as high as n

• Euler tour algorithm– Uses parallel prefix computation

Computing the Depth (2)• Euler tour: A cycle that traverses each edge exactly once in a

graph– It is a directed version of a tree

• Regard an undirected edge into two directed edges

– Any directed version of a tree has an Euler tour by traversing the tree• in a DFS way forming a linked list.

• Employ 3*n processors– Each node i has fields i.parent, i.left, i.right

– Each node i has three processors, i.A, i.B, and i.C.

• Three processors in each node of the tree are linked as follows– i.A = i.left.A if i.left != nil

– i.B if i.left = nil

– i.B = i.right.A if i.right != nil

– i.C if i.right = nil

– i.C = i.parent.B if i is the left child

– i.parent.C if i is the right child

– nil if i.parent = nil

Computing the Depth (3)

• Algorithm– Construct the Euler tour for the tree – O(1) time

– Assign 1 to all A processors, 0 to B processors, -1 to C processors

– Perform a parallel prefix computation

– The depth of each node resides in its C processor

• O(log n)– Actually log 3n

• EREW because no concurrent read or write

• Speedup– S = n/log n

Computing the Depth (4)

Broadcasting on a PRAM

• “ Broadcast” can be done on CREW PRAM in O(1) steps:– Broadcaster sends value to shared memory

– Processors read from shared memory

• Requires lg(P) steps on EREW PRAM.

PPPPPPPP

Concurrent Write – Finding Max

• Finding max problem– Given an array of n elements, find the maximum(s)

– sequential algorithm is O(n)

• Data structure for parallel algorithm– Array A[1..n]

– Array m[1..n]. m[i] is true if A[i] is the maximum

– Use n2 processors

• Fast_max(A, n)1. for i = 1 to n do, in parallel

2. m[i] = true // A[i] is potentially maximum

3. for i = 1 to n, j = 1 to n do, in parallel

4. if A[i] < A[j] then

5. m[i] = false

6. for i = 1 to n do, in parallel

7. if m[i] = true then max = A[i]

8. return max

• Time complexity: O(1)

Concurrent Write – Finding Max

• Concurrent-write– In step 4 and 5, processors with A[i] < A[j] write the same value

‘false’ into the same location m[i]

– This actually implements m[i] = (A[i] A[1]) … (A[i] A[n])

• Is this work efficient? – No, n2 processors in O(1)

– O(n2) work vs. sequential algorithm is O(n)

• What is the time complexity for the Exclusive-write? – Initially elements “think” that they might be the maximum

– First iteration: For n/2 pairs, compare. • n/2 elements might be the maximum.

– Second iteration: n/4 elements might be the maximum.• log n th iteration: one element is the maximum.

– So Fast_max with Exclusive-write takes O(log n).

• O(1) (CRCW) vs. O(log n) (EREW)

Simulating CRCW with EREW• CRCW algorithms are faster than EREW algorithms

– How much fast?

• Theorem

– A p-processor CRCW algorithm can be no more than O(log p) times faster than the best p-processor EREW algorithm

• Proof by simulating CRCW steps with EREW steps– Assumption: A parallel sorting takes O(log n) time with n processors

– When CRCW processor pi write a datum xi into a location li, EREW pi writes the pair (li, xi) into a separate location A[i]

• Note EREW write is exclusive, while CRCW may be concurrent

– Sort A by li

• O(log p) time by assumption

– Compare adjacent elements in A

– For each group of the same elements, only one processor, say first, write xi into the global memory li.

• Note this is also exclusive.

– Total time complexity: O(log p)

Simulating CRCW with EREW (2)

CRCW versus EREW - Discussion

• CRCW– Hardware implementations are expensive

– Used infrequently

– Easier to program, runs faster, more powerful.

– Implemented hardware is slower than that of EREW• In reality one cannot find maximum in O(1) time

• EREW– Programming model is too restrictive

• Cannot implement powerful algorithms

parallel algorithm

simulating crcw

euler tour

sequential algorithm

parallel prefix

work complexity

memory location

serial algorithm

pram model

Documents

a parallel algorithm for global routing

parallel algorithm in linear algebra

parallel ½-approx weighted matching algorithm

parallel algorithm design - uio.no

a parallel algorithm for solving general tridiagonal...

ecse 420 - parallel cholesky algorithm - report

issn: a differential evolution algorithm parallel ... · a...

a parallel biobjective shortest path algorithm

parallel algorithm models

a simple parallel algorithm for the maximal independent...

parallel search algorithm

considerations in parallel algorithm design

parallel algorithm design - paginas.fe.up.pt

parallel algorithm for natural neighbour interpolation

parallel algorithm design

parallel algorithm dr seyed javadi (zakizade)

3 parallel algorithm design - ntnu.edu.tw

chapter 3. parallel algorithm design methodology · csci...

parallel prefix computation - wmich.edu overview • a...

a new parallel prefix-scan algorithm for gpus ·...