
Parallelising Dynamic Programming

Raphael Reitzig

University of Kaiserslautern, Department of Computer Science, Algorithms and Complexity Group

September 27th, 2012

Vision: Compile dynamic programming recurrences into efficient parallel code.

Goal 1: Understand what efficiency means in parallel algorithms.

Goal 2: Characterise dynamic programming recurrences in a suitable way.

Goal 3: Find and implement efficient parallel algorithms for DP.


Analysing Parallelism

Complexity theory

Classifies problems

Focuses on inherent parallelism

Answers: How many processors do you need to be really fast on inputs of a given size?

But... p grows with n – no statement about constant p and growing n!


Amdahl’s law

Parallel speedup ≤ 1 / ((1 − γ) + γ/p).

Answers: How many processors can you utilise on given inputs?

But... does not capture growth of n!
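As a made-up illustration of the bound: with a parallelisable fraction γ = 0.9, four processors yield a speedup of at most 1 / (0.1 + 0.9/4) ≈ 3.1, and no number of processors can push it beyond 1/0.1 = 10 on that fixed input.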


Work and depth

Work W = T^A_1 and depth D = T^A_∞.

Brent’s Law: an A with W/p ≤ T^A_p < W/p + D is possible in a certain setting.

But... has limited applicability and D can be slippery!
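A textbook illustration (not from the talk): summing n numbers with a balanced reduction tree has work W ∈ Θ(n) and depth D ∈ Θ(log n), so Brent’s bound promises a p-processor time of roughly n/p + log n – the processors pay off once n/p dominates log n.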


Relative runtimes

Speedup S^A_p := T^A_1 / T^A_p

Efficiency E^A_p := T^B / (p · T^A_p), where B is the sequential reference algorithm

But... what are good values?

Clear: S^A_p ∈ [0, p] and E^A_p ∈ [0, 1] – but we can certainly not always hit the optima!
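With made-up timings: if the sequential reference B takes T^B = 12 s and the parallel algorithm A takes T^A_1 = 12 s and T^A_4 = 4 s, then S^A_4 = 12/4 = 3 and E^A_4 = 12/(4 · 4) = 0.75.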


Proposal: Asymptotic relative runtimes

Definition

S^A_p(∞) := liminf_{n→∞} S^A_p(n)  ?=  p

E^A_p(∞) := liminf_{n→∞} E^A_p(n)  ?=  1

Goal

Find parallel algorithms that are asymptotically as scalable and efficient as possible for all p.
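As an illustration with an invented speedup curve: an algorithm with S^A_p(n) = p · n/(n + p²) has S^A_p(∞) = p for every fixed p, so it is asymptotically perfectly scalable even though small inputs see hardly any speedup.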


Disclaimer

This means: A good parallel algorithm can utilise any number of processors if the inputs are large enough.

Not: More processors are always better.

Just as in sequential algorithmics.


Afterthoughts

Machine model: Keep it simple – a (P)RAM with p processors and spawn/join.

Which quantities to analyse?

Elementary operations, memory accesses, inter-thread communication, ...

Implicit interaction – blocking, communication via memory, ... – is invisible in code!
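A minimal sketch of what spawn/join looks like in code (C++ with std::thread; the function and its name are my own, purely illustrative): p workers are spawned on disjoint blocks and then joined, and only the explicit operations are visible.

    #include <numeric>
    #include <thread>
    #include <vector>

    // Spawn p workers on disjoint blocks of the input, then join.
    // Elementary operations and memory accesses are explicit here;
    // blocking and communication through the memory system are not.
    long parallelSum(const std::vector<long>& xs, unsigned p) {
        if (p == 0) p = 1;
        std::vector<long> partial(p, 0L);
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < p; ++t) {
            workers.emplace_back([&, t] {                                  // spawn
                const std::size_t lo = xs.size() * t / p;
                const std::size_t hi = xs.size() * (t + 1) / p;
                partial[t] = std::accumulate(xs.begin() + lo, xs.begin() + hi, 0L);
            });
        }
        for (std::thread& worker : workers) worker.join();                 // join
        return std::accumulate(partial.begin(), partial.end(), 0L);
    }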


Attacking Dynamic Programming

Disclaimer

Only two dimensions

Only finite domains

Only rectangular domains

Memoisation-table point-of-view

Reducing to dependencies

e(i, j) :=
    0                                       if i = j = 0
    j                                       if i = 0 ∧ j > 0
    i                                       if i > 0 ∧ j = 0
    min { e(i − 1, j) + 1,
          e(i, j − 1) + 1,
          e(i − 1, j − 1) + [ v_i ≠ w_j ] }  else
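The later performance slides call this recurrence edit distance; read bottom-up it is an ordinary table fill. A minimal sequential C++ sketch (function name and signature are my own, purely illustrative):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Fill the memoisation table row by row, exactly following the recurrence.
    // e[i][j] is the edit distance between the first i characters of v
    // and the first j characters of w.
    int editDistance(const std::string& v, const std::string& w) {
        const std::size_t n = v.size(), m = w.size();
        std::vector<std::vector<int>> e(n + 1, std::vector<int>(m + 1, 0));
        for (std::size_t j = 0; j <= m; ++j) e[0][j] = static_cast<int>(j);   // i = 0
        for (std::size_t i = 0; i <= n; ++i) e[i][0] = static_cast<int>(i);   // j = 0
        for (std::size_t i = 1; i <= n; ++i)
            for (std::size_t j = 1; j <= m; ++j)
                e[i][j] = std::min({ e[i - 1][j] + 1,                             // from above
                                     e[i][j - 1] + 1,                             // from the left
                                     e[i - 1][j - 1] + (v[i - 1] != w[j - 1]) }); // diagonal
        return e[n][m];
    }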


Gold standard

[Figure: memoisation table with cells marked “?”]

Simplification

[Figure: a cell and its eight neighbours, labelled DL, D, DR, UL, U, UR, L, R]

Three cases

Assuming dependencies are area-complete and uniform, there are only three cases up to symmetry:

[Figures: example dependency patterns marked impossible or possible, and the three resulting cases]
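For the diagonal dependency case this table admits a wavefront sweep: every cell with i + j = d depends only on cells of earlier anti-diagonals, so the cells of one anti-diagonal can be computed concurrently. A rough spawn/join sketch in C++ (my own illustrative code, not the compiler's output; spawning fresh threads per diagonal is deliberately naive):

    #include <algorithm>
    #include <string>
    #include <thread>
    #include <vector>

    // Fill the edit-distance table anti-diagonal by anti-diagonal.
    // Cells on the same anti-diagonal i + j = d are mutually independent.
    int editDistanceWavefront(const std::string& v, const std::string& w,
                              unsigned p = std::thread::hardware_concurrency()) {
        const std::size_t n = v.size(), m = w.size();
        std::vector<std::vector<int>> e(n + 1, std::vector<int>(m + 1, 0));
        for (std::size_t j = 0; j <= m; ++j) e[0][j] = static_cast<int>(j);
        for (std::size_t i = 0; i <= n; ++i) e[i][0] = static_cast<int>(i);

        for (std::size_t d = 2; d <= n + m; ++d) {                 // one anti-diagonal per round
            const std::size_t iLo = (d > m) ? d - m : 1;           // keep 1 <= i <= n, 1 <= j <= m
            const std::size_t iHi = std::min(n, d - 1);
            if (iLo > iHi) continue;
            const unsigned cells = static_cast<unsigned>(iHi - iLo + 1);
            const unsigned threads = std::max(1u, std::min(p, cells));

            std::vector<std::thread> workers;
            for (unsigned t = 0; t < threads; ++t) {
                workers.emplace_back([&, t] {                      // spawn: strided share of the diagonal
                    for (std::size_t i = iLo + t; i <= iHi; i += threads) {
                        const std::size_t j = d - i;
                        e[i][j] = std::min({ e[i - 1][j] + 1,
                                             e[i][j - 1] + 1,
                                             e[i - 1][j - 1] + (v[i - 1] != w[j - 1]) });
                    }
                });
            }
            for (std::thread& worker : workers) worker.join();     // join before the next diagonal
        }
        return e[n][m];
    }

A real implementation would keep a fixed pool of threads and hand each a contiguous block of the diagonal; the per-diagonal spawn/join here is exactly where the contention and synchronisation costs listed under Challenges below bite.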

Facing Reality

Challenges

Contention

Method of synchronisation

Metal issues (moving threads, cache sync)


Performance Examples

Edit distance on two-core shared memory machine:

[Two plots over input sizes up to 1.4·10^5; vertical axes from 0 to 2.5]

Performance Examples

Edit distance on four-core NUMA machine:

[Two plots over input sizes up to 4·10^5; vertical axes from 0 to 4]

Performance Examples

Pseudo-Bellman-Ford on two-core shared memory machine:

[Two plots over input sizes up to 1.4·10^5; vertical axes from 0 to 2.5 and 0 to 4]

Performance Examples

Pseudo-Bellman-Ford on four-core NUMA machine:

[Two plots over input sizes up to 4·10^5; vertical axes from 0 to 4 and 0 to 8]

Future Work

Fill gaps in theory (caching and communication).

Generalise theory to more dimensions and interleaved DPs.

Improve and extend implementations.

More experiments (different problems, more diverse machines).

Improve compiler integration (detection, backtracing, result functions).

Integrate with other tools.

