TRANSCRIPT
Parallelising Dynamic Programming
Raphael Reitzig
University of Kaiserslautern, Department of Computer Science, Algorithms and Complexity Group
September 27th, 2012
Vision: Compile dynamic programming recurrences into efficient parallel code.
Goal 1: Understand what efficiency means in parallel algorithms.
Goal 2: Characterise dynamic programming recurrences in a suitable way.
Goal 3: Find and implement efficient parallel algorithms for DP.
Analysing Parallelism
Complexity theory
Classifies problems
Focuses on inherent parallelism
Answers: How many processors do you need to be really fast on inputs of a given size?
But... p grows with n – no statement about constant p and growing n!
Amdahl’s law
Parallel speedup ≤ 1 / ((1 − γ) + γ/p), where γ is the parallelisable fraction of the work.
Answers: How many processors can you utilise on given inputs?
But... does not capture growth of n!
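Amdahl's bound can be sketched as a one-line function; `gamma` is the parallelisable fraction of the work and `p` the processor count (names chosen here for illustration):

```python
def amdahl_speedup(gamma, p):
    """Upper bound on parallel speedup when a fraction gamma of the
    work is parallelisable and p processors are available:
    S <= 1 / ((1 - gamma) + gamma / p)."""
    return 1.0 / ((1.0 - gamma) + gamma / p)

# With gamma = 0.9, even infinitely many processors cannot beat
# 1 / (1 - 0.9) = 10x speedup; 4 processors give at most ~3.08x.
bound = amdahl_speedup(0.9, 4)
```

Note the law's limitation mentioned above: γ is taken as fixed, so the bound says nothing about how the speedup behaves as the input size n grows.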
Work and depth
Work W = T_1^A and depth D = T_∞^A.
Brent’s Law: a schedule with W/p ≤ T_p^A < W/p + D is possible in a certain setting.
But... has limited applicability and D can be slippery!
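As a small illustration of the work/depth view (a sketch, not taken from the talk): summing n numbers with a balanced binary tree of additions has work W = n − 1 and depth D = ⌈log₂ n⌉, and Brent's bound gives an upper estimate on the p-processor time:

```python
import math

def brent_bound(work, depth, p):
    """Brent-style greedy scheduling: some schedule achieves
    W/p <= T_p < W/p + D; return the upper bound W/p + D."""
    return work / p + depth

# Example: parallel sum of n numbers as a balanced addition tree.
n = 1 << 20
W, D = n - 1, math.ceil(math.log2(n))  # W = n - 1, D = log2(n)
bound = brent_bound(W, D, 8)           # upper bound on T_8
```

The "slippery" part flagged above is D: for real programs the depth of the dependency structure is often hard to pin down.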
Relative runtimes
Speedup S_p^A := T_1^A / T_p^A
Efficiency E_p^A := T^B / (p · T_p^A)
But... what are good values?
Clear: S_p^A ∈ [0, p] and E_p^A ∈ [0, 1] – but we can certainly not always hit the optima!
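The two ratios are straightforward to compute from measured runtimes; here T^B is read as the runtime of a reference sequential algorithm (an assumption based on the notation above):

```python
def speedup(t1, tp):
    """S_p^A = T_1^A / T_p^A: how much faster A runs on p
    processors than on one."""
    return t1 / tp

def efficiency(t_ref, tp, p):
    """E_p^A = T^B / (p * T_p^A), with t_ref the runtime of the
    sequential reference algorithm B (assumed interpretation)."""
    return t_ref / (p * tp)

# Perfect scaling: 10s sequential, 2.5s on 4 processors.
s = speedup(10.0, 2.5)          # 4.0
e = efficiency(10.0, 2.5, 4)    # 1.0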
Proposal: Asymptotic relative runtimes
Definition
S_p^A(∞) := liminf_{n→∞} S_p^A(n)  =? p
E_p^A(∞) := liminf_{n→∞} E_p^A(n)  =? 1
Goal: Find parallel algorithms that are asymptotically as scalable and efficient as possible for all p.
Disclaimer
This means: A good parallel algorithm can utilise any number of processors if the inputs are large enough.
Not: More processors are always better.
Just as in sequential algorithmics.
Afterthoughts
Machine model: Keep it simple – (P)RAM with p processors and spawn/join.
Which quantities to analyse? Elementary operations, memory accesses, inter-thread communication, ...
Implicit interaction – blocking, communication via memory, ... – is invisible in code!
Attacking Dynamic Programming
Disclaimer
Only two dimensions
Only finite domains
Only rectangular domains
Memoisation-table point-of-view
Reducing to dependencies
e(i, j) :=
  0    if i = j = 0
  j    if i = 0 ∧ j > 0
  i    if i > 0 ∧ j = 0
  min{ e(i − 1, j) + 1,
       e(i, j − 1) + 1,
       e(i − 1, j − 1) + [v_i ≠ w_j] }    else
Gold standard
[Figure: memoisation table with cells whose dependency structure is unknown]
Simplification
[Figure: a cell’s possible direct dependencies – its eight neighbours DL, D, DR, UL, U, UR, L, R]
Three cases
Assuming dependencies are area-complete and uniform, there are only three cases up to symmetry.
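To illustrate how such a dependency pattern enables parallelism (a sketch, not the talk's implementation): for the edit-distance dependencies, all cells on one anti-diagonal i + j = d depend only on diagonals d − 1 and d − 2, so each diagonal could be filled in parallel:

```python
def edit_distance_wavefront(v, w):
    """Edit distance filled by anti-diagonals (wavefront order).
    Cells with the same i + j are mutually independent, so each
    inner loop could be distributed over p processors."""
    n, m = len(v), len(w)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(n + m + 1):                      # diagonal index
        for i in range(max(0, d - m), min(n, d) + 1):
            j = d - i
            if i == 0:
                e[i][j] = j
            elif j == 0:
                e[i][j] = i
            else:
                e[i][j] = min(e[i - 1][j] + 1,
                              e[i][j - 1] + 1,
                              e[i - 1][j - 1] + (v[i - 1] != w[j - 1]))
    return e[n][m]
```

The code is still sequential, but the loop over one diagonal has no internal dependencies – that is where a spawn/join construct would go.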
Facing Reality
Challenges
Contention
Method of synchronisation
Metal issues (moving threads, cache sync)
Challenges
Contention
Method of synchronisation
Metal issues (moving threads, cache sync)
Challenges
Contention
Method of synchronisation
Metal issues (moving threads, cache sync)
Performance Examples
Edit distance on two-core shared memory machine:
[Two plots: input size from 0 to 1.4·10^5 on the x-axis, values 0–2.5 on the y-axis]
Performance Examples
Edit distance on four-core NUMA machine:
[Two plots: input size from 0 to 4·10^5 on the x-axis, values 0–4 on the y-axis]
Performance Examples
Pseudo-Bellman-Ford on two-core shared memory machine:
[Two plots: input size from 0 to 1.4·10^5 on the x-axis, values 0–2.5 and 0–4 on the y-axes]
Performance Examples
Pseudo-Bellman-Ford on four-core NUMA machine:
[Two plots: input size from 0 to 4·10^5 on the x-axis, values 0–4 and 0–8 on the y-axes]
Future Work
Fill gaps in theory (caching and communication).
Generalise theory to more dimensions and interleaved DPs.
Improve and extend implementations.
More experiments (different problems, more diverse machines).
Improve compiler integration (detection, backtracing, result functions).
Integrate with other tools.