TRANSCRIPT
Parallelising Dynamic Programming
Raphael Reitzig
University of Kaiserslautern, Department of Computer Science, Algorithms and Complexity Group
September 27th, 2012
Vision: Compile dynamic programming recurrences into efficient parallel code.
Goal 1: Understand what efficiency means in parallel algorithms.
Goal 2: Characterise dynamic programming recurrences in a suitable way.
Goal 3: Find and implement efficient parallel algorithms for DP.
Analysing Parallelism
Complexity theory
Classifies problems
Focuses on inherent parallelism
Answers: How many processors do you need to be really fast on inputs of a given size?
But... p grows with n – no statement about constant p and growing n!
Amdahl’s law
Parallel speedup ≤ 1 / ((1 − γ) + γ/p), where γ is the parallelisable fraction of the work.
Answers: How many processors can you utilise on given inputs?
But... does not capture growth of n!
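Amdahl's bound can be sketched as a one-line function; `gamma` is the parallelisable fraction of the work and `p` the processor count (names chosen here for illustration):

```python
def amdahl_speedup(gamma, p):
    """Upper bound on parallel speedup when a fraction gamma of the
    work is parallelisable and p processors are available:
    S <= 1 / ((1 - gamma) + gamma / p)."""
    return 1.0 / ((1.0 - gamma) + gamma / p)

# With gamma = 0.9, even infinitely many processors cannot beat
# 1 / (1 - 0.9) = 10x speedup; 4 processors give at most ~3.08x.
bound = amdahl_speedup(0.9, 4)
```

Note the law's limitation mentioned above: γ is taken as fixed, so the bound says nothing about how the speedup behaves as the input size n grows.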
Work and depth
Work W = T_1^A and depth D = T_∞^A.
Brent’s Law: a schedule with W/p ≤ T_p^A < W/p + D is possible in a certain setting.
But... has limited applicability and D can be slippery!
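As a small illustration of the work/depth view (a sketch, not taken from the talk): summing n numbers with a balanced binary tree of additions has work W = n − 1 and depth D = ⌈log₂ n⌉, and Brent's bound gives an upper estimate on the p-processor time:

```python
import math

def brent_bound(work, depth, p):
    """Brent-style greedy scheduling: some schedule achieves
    W/p <= T_p < W/p + D; return the upper bound W/p + D."""
    return work / p + depth

# Example: parallel sum of n numbers as a balanced addition tree.
n = 1 << 20
W, D = n - 1, math.ceil(math.log2(n))  # W = n - 1, D = log2(n)
bound = brent_bound(W, D, 8)           # upper bound on T_8
```

The "slippery" part flagged above is D: for real programs the depth of the dependency structure is often hard to pin down.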
Relative runtimes
Speedup S_p^A := T_1^A / T_p^A
Efficiency E_p^A := T^B / (p · T_p^A)
But... what are good values?
Clear: S_p^A ∈ [0, p] and E_p^A ∈ [0, 1] – but we can certainly not always hit the optima!
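The two ratios are straightforward to compute from measured runtimes; here T^B is read as the runtime of a reference sequential algorithm (an assumption based on the notation above):

```python
def speedup(t1, tp):
    """S_p^A = T_1^A / T_p^A: how much faster A runs on p
    processors than on one."""
    return t1 / tp

def efficiency(t_ref, tp, p):
    """E_p^A = T^B / (p * T_p^A), with t_ref the runtime of the
    sequential reference algorithm B (assumed interpretation)."""
    return t_ref / (p * tp)

# Perfect scaling: 10s sequential, 2.5s on 4 processors.
s = speedup(10.0, 2.5)          # 4.0
e = efficiency(10.0, 2.5, 4)    # 1.0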
Proposal: Asymptotic relative runtimes
Definition
S_p^A(∞) := liminf_{n→∞} S_p^A(n)  =? p
E_p^A(∞) := liminf_{n→∞} E_p^A(n)  =? 1
Goal: Find parallel algorithms that are asymptotically as scalable and efficient as possible for all p.
Disclaimer
This means: A good parallel algorithm can utilise any number of processors if the inputs are large enough.
Not: More processors are always better.
Just as in sequential algorithmics.
Afterthoughts
Machine model: Keep it simple – (P)RAM with p processors and spawn/join.
Which quantities to analyse? Elementary operations, memory accesses, inter-thread communication, ...
Implicit interaction – blocking, communication via memory, ... – is invisible in code!
Attacking Dynamic Programming
Disclaimer
Only two dimensions
Only finite domains
Only rectangular domains
Memoisation-table point-of-view
Reducing to dependencies
e(i, j) :=
  0    if i = j = 0
  j    if i = 0 ∧ j > 0
  i    if i > 0 ∧ j = 0
  min{ e(i − 1, j) + 1,
       e(i, j − 1) + 1,
       e(i − 1, j − 1) + [v_i ≠ w_j] }    else
Gold standard
[Figure: memoisation table with cells whose dependency structure is unknown]
Simplification
[Figure: a cell’s possible direct dependencies – its eight neighbours DL, D, DR, UL, U, UR, L, R]
Three cases
Assuming dependencies are area-complete and uniform, there are only three cases up to symmetry.
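To illustrate how such a dependency pattern enables parallelism (a sketch, not the talk's implementation): for the edit-distance dependencies, all cells on one anti-diagonal i + j = d depend only on diagonals d − 1 and d − 2, so each diagonal could be filled in parallel:

```python
def edit_distance_wavefront(v, w):
    """Edit distance filled by anti-diagonals (wavefront order).
    Cells with the same i + j are mutually independent, so each
    inner loop could be distributed over p processors."""
    n, m = len(v), len(w)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(n + m + 1):                      # diagonal index
        for i in range(max(0, d - m), min(n, d) + 1):
            j = d - i
            if i == 0:
                e[i][j] = j
            elif j == 0:
                e[i][j] = i
            else:
                e[i][j] = min(e[i - 1][j] + 1,
                              e[i][j - 1] + 1,
                              e[i - 1][j - 1] + (v[i - 1] != w[j - 1]))
    return e[n][m]
```

The code is still sequential, but the loop over one diagonal has no internal dependencies – that is where a spawn/join construct would go.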
Facing Reality
Challenges
Contention
Method of synchronisation
Metal issues (moving threads, cache sync)
Challenges
Contention
Method of synchronisation
Metal issues (moving threads, cache sync)
Challenges
Contention
Method of synchronisation
Metal issues (moving threads, cache sync)
Performance Examples
Edit distance on two-core shared memory machine:
[Two plots: input size from 0 to 1.4·10^5 on the x-axis, values 0–2.5 on the y-axis]
Performance Examples
Edit distance on four-core NUMA machine:
[Two plots: input size from 0 to 4·10^5 on the x-axis, values 0–4 on the y-axis]
Performance Examples
Pseudo-Bellman-Ford on two-core shared memory machine:
[Two plots: input size from 0 to 1.4·10^5 on the x-axis, values 0–2.5 and 0–4 on the y-axes]
Performance Examples
Pseudo-Bellman-Ford on four-core NUMA machine:
[Two plots: input size from 0 to 4·10^5 on the x-axis, values 0–4 and 0–8 on the y-axes]
Future Work
Fill gaps in theory (caching and communication).
Generalise theory to more dimensions and interleaved DPs.
Improve and extend implementations.
More experiments (different problems, more diverse machines).
Improve compiler integration (detection, backtracing, result functions).
Integrate with other tools.