TRANSCRIPT
Futures, Scheduling, and Work Distribution
Companion slides for The Art of Multiprocessor Programming
by Maurice Herlihy & Nir Shavit
(Some images in this lecture courtesy of Charles Leiserson)
Art of Multiprocessor Programming 2
How to write Parallel Apps?
• Split a program into parallel parts
• In an effective way
• Thread management
Art of Multiprocessor Programming 3

Matrix Multiplication

(figure: C = A · B)

Art of Multiprocessor Programming 4

Matrix Multiplication

c_ij = Σ_{k=0}^{N-1} a_ik · b_kj
Art of Multiprocessor Programming 6
Matrix Multiplication

class Worker extends Thread {        // a thread
  int row, col;                      // which matrix entry to compute
  Worker(int row, int col) {
    this.row = row; this.col = col;
  }
  public void run() {                // actual computation
    double dotProduct = 0.0;
    for (int i = 0; i < n; i++)
      dotProduct += a[row][i] * b[i][col];
    c[row][col] = dotProduct;
  }
}
Art of Multiprocessor Programming 10

Matrix Multiplication

void multiply() {
  Worker[][] worker = new Worker[n][n];
  // create n x n threads
  for (int row …)
    for (int col …)
      worker[row][col] = new Worker(row, col);
  // start them
  for (int row …)
    for (int col …)
      worker[row][col].start();
  // wait for them to finish
  for (int row …)
    for (int col …)
      worker[row][col].join();
}

What’s wrong with this picture?
Art of Multiprocessor Programming 15

Thread Overhead

• Threads require resources
  – Memory for stacks
  – Setup, teardown
  – Scheduler overhead
• Short-lived threads
  – Bad ratio of work to overhead
Art of Multiprocessor Programming 16

Thread Pools

• More sensible to keep a pool of long-lived threads
• Threads are assigned short-lived tasks
  – Run the task
  – Rejoin pool
  – Wait for next assignment
Art of Multiprocessor Programming 17

Thread Pool = Abstraction

• Insulates the programmer from the platform
  – Big machine, big pool
  – Small machine, small pool
• Portable code
  – Works across platforms
  – Worry about the algorithm, not the platform
Art of Multiprocessor Programming 18

ExecutorService Interface

• In java.util.concurrent
  – Task = Runnable object
    • If no result value is expected
    • Executor calls its run() method
  – Task = Callable<T> object
    • If a result value of type T is expected
    • Executor calls its T call() method
Art of Multiprocessor Programming 19

Future<T>

Callable<T> task = …;
…
Future<T> future = executor.submit(task);
…
T value = future.get();

Submitting a Callable<T> task returns a Future<T> object. The Future’s get() method blocks until the value is available.
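The submit/get pattern above can be run end-to-end; here is a minimal sketch (the class name FutureDemo and the toy task are ours, not from the slides):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static int demo() {
        try {
            ExecutorService executor = Executors.newFixedThreadPool(2);
            Callable<Integer> task = () -> 6 * 7;           // toy Callable<Integer>
            Future<Integer> future = executor.submit(task); // returns immediately
            int value = future.get();                       // blocks until the value is available
            executor.shutdown();
            return value;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```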
Art of Multiprocessor Programming 22

Future<?>

Runnable task = …;
…
Future<?> future = executor.submit(task);
…
future.get();

Submitting a Runnable task returns a Future<?> object. The Future’s get() method blocks until the computation is complete.
Art of Multiprocessor Programming 25

Note

• ExecutorService submissions
  – Like New England traffic signs
  – Are purely advisory in nature
• The executor
  – Like the Boston driver
  – Is free to ignore any such advice
  – And could execute tasks sequentially …
Art of Multiprocessor Programming 26

Matrix Addition

(figure: C and A + B split into quadrants; C_ij = A_ij + B_ij for i, j ∈ {0,1}, giving 4 parallel additions)
Art of Multiprocessor Programming 28

Matrix Addition Task

class AddTask implements Runnable {
  Matrix a, b; // add this!
  public void run() {
    if (a.dim == 1) {
      c[0][0] = a[0][0] + b[0][0]; // base case: add directly
    } else {
      // partition a, b into half-size matrices aij and bij (constant-time operation)
      // submit 4 tasks
      Future<?> f00 = exec.submit(new AddTask(a00, b00));
      …
      Future<?> f11 = exec.submit(new AddTask(a11, b11));
      // let them finish
      f00.get(); …; f11.get();
      …
    }
  }
}
Art of Multiprocessor Programming 33
Dependencies

• Matrix example is not typical
  – Tasks are independent
  – Don’t need results of one task to complete another
• Often tasks are not independent
Art of Multiprocessor Programming 34

Fibonacci

F(n) = 1 if n = 0 or 1
F(n) = F(n-1) + F(n-2) otherwise

• Note
  – Potential parallelism
  – Dependencies
Art of Multiprocessor Programming 35

Disclaimer

• This Fibonacci implementation is egregiously inefficient
  – So don’t try this at home or on the job!
  – But it illustrates our point: how to deal with dependencies
• Exercise: make this implementation efficient!
Art of Multiprocessor Programming 36

Multithreaded Fibonacci

class FibTask implements Callable<Integer> {
  static ExecutorService exec = Executors.newCachedThreadPool();
  int arg;
  public FibTask(int n) { arg = n; }
  public Integer call() {
    if (arg > 2) {
      // parallel calls
      Future<Integer> left = exec.submit(new FibTask(arg-1));
      Future<Integer> right = exec.submit(new FibTask(arg-2));
      // pick up & combine results
      return left.get() + right.get();
    } else {
      return 1;
    }
  }
}
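A compilable version of FibTask needs the checked exceptions of get() handled; this sketch wraps them in a try/catch (the wrapper and class Fib are ours) and keeps the slides’ logic, with F(1) = F(2) = 1:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Fib {
    static ExecutorService exec = Executors.newCachedThreadPool();

    static class FibTask implements Callable<Integer> {
        int arg;
        FibTask(int n) { arg = n; }
        public Integer call() {
            if (arg > 2) {
                try {
                    // parallel recursive calls
                    Future<Integer> left  = exec.submit(new FibTask(arg - 1));
                    Future<Integer> right = exec.submit(new FibTask(arg - 2));
                    return left.get() + right.get(); // pick up & combine results
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return 1; // F(1) = F(2) = 1
        }
    }

    public static int fib(int n) {
        return new FibTask(n).call();
    }

    public static void main(String[] args) {
        System.out.println(fib(10)); // 55
        exec.shutdown();
    }
}
```

The unbounded cached thread pool matters here: with a small fixed pool, every worker could end up blocked in get() waiting for subtasks that cannot be scheduled.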
Art of Multiprocessor Programming 39

The Blumofe-Leiserson DAG Model

• A multithreaded program is
  – A directed acyclic graph (DAG)
  – That unfolds dynamically
• Each node is a single unit of work
Art of Multiprocessor Programming 40

Fibonacci DAG

(figure, built up over several slides: the DAG for fib(4); fib(4) spawns fib(3) and fib(2), fib(3) spawns fib(2) and fib(1), and each fib(2) spawns fib(1) nodes; “call” edges fan out and “get” edges join the results back)
Art of Multiprocessor Programming 45

How Parallel is That?

• Define work: total time on one processor
• Define critical-path length:
  – Longest dependency path
  – Can’t beat that!

(figure: unfolded DAG)
Art of Multiprocessor Programming 46
Parallelism?
Art of Multiprocessor Programming 47
Serial fraction = 3/18 = 1/6 …
Amdahl’s Law says speedup cannot exceed 6.
Work?

(figure: the DAG with its nodes numbered 1-18)

T1: time needed on one processor

Just count the nodes … T1 = 18
Critical Path?

T∞: time needed on as many processors as you like

(figure: the nodes on the longest path numbered 1-9)

Longest path … T∞ = 9
Art of Multiprocessor Programming 51

Notation Watch

• TP = time on P processors
• T1 = work (time on 1 processor)
• T∞ = critical path length (time on ∞ processors)
Art of Multiprocessor Programming 52

Simple Laws

• Work Law: TP ≥ T1/P
  – In one step, can’t do more than P work
• Critical Path Law: TP ≥ T∞
  – Can’t beat infinite resources
Art of Multiprocessor Programming 53

Performance Measures

• Speedup on P processors
  – Ratio T1/TP
  – How much faster with P processors
• Linear speedup: T1/TP = Θ(P)
• Max speedup (average parallelism): T1/T∞
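Plugging in the DAG from the previous slides (T1 = 18, T∞ = 9) makes these measures concrete; a small sketch of the two laws (the helper names are ours, the numbers are the slides’):

```java
public class Measures {
    // Work Law and Critical Path Law combined:
    // any schedule needs at least max{T1/P, Tinf} steps.
    public static double lowerBound(double t1, double tinf, int p) {
        return Math.max(t1 / p, tinf);
    }

    // Max speedup (average parallelism) is T1 / Tinf.
    public static double parallelism(double t1, double tinf) {
        return t1 / tinf;
    }

    public static void main(String[] args) {
        System.out.println(parallelism(18, 9));   // 2.0
        System.out.println(lowerBound(18, 9, 4)); // 9.0: the critical path dominates
    }
}
```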
Sequential Composition

(figure: A followed by B)

Work: T1(A) + T1(B)
Critical Path: T∞(A) + T∞(B)
Parallel Composition

(figure: A and B side by side)

Work: T1(A) + T1(B)
Critical Path: max{T∞(A), T∞(B)}
Art of Multiprocessor Programming 56

Matrix Addition

(figure, repeated: C and A + B split into quadrants; C_ij = A_ij + B_ij for i, j ∈ {0,1}, giving 4 parallel additions)
Art of Multiprocessor Programming 58

Addition

• Let AP(n) be the running time
  – For an n x n matrix
  – On P processors
• For example
  – A1(n) is work
  – A∞(n) is critical-path length
Art of Multiprocessor Programming 59

Addition

• Work is
  A1(n) = 4 A1(n/2) + Θ(1)
  (4 spawned additions; partition, synch, etc. cost Θ(1))
Art of Multiprocessor Programming 60

Addition

• Work is
  A1(n) = 4 A1(n/2) + Θ(1) = Θ(n²)
  Same as double-loop summation
Art of Multiprocessor Programming 61

Addition

• Critical-path length is
  A∞(n) = A∞(n/2) + Θ(1) = Θ(log n)
  (the 4 spawned additions run in parallel; partition, synch, etc. cost Θ(1))
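The two recurrences can be checked mechanically; this sketch (class and method names ours) evaluates A1 and A∞ for power-of-two n with unit cost for the partition step:

```java
public class AdditionCost {
    // Work: A1(n) = 4*A1(n/2) + 1, A1(1) = 1, which solves to Θ(n²).
    public static long work(long n) {
        return n == 1 ? 1 : 4 * work(n / 2) + 1;
    }

    // Critical path: A∞(n) = A∞(n/2) + 1, A∞(1) = 1, which solves to Θ(log n).
    public static long span(long n) {
        return n == 1 ? 1 : span(n / 2) + 1;
    }

    public static void main(String[] args) {
        System.out.println(work(4)); // 21, on the order of n²
        System.out.println(span(4)); // 3 = log2(4) + 1
    }
}
```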
Art of Multiprocessor Programming 63

Matrix Multiplication Redux

(figure: C = A · B)

Art of Multiprocessor Programming 64

Matrix Multiplication Redux

(figure: C, A, and B each split into four half-size blocks, indexed 11, 12, 21, 22)
Art of Multiprocessor Programming 65

First Phase …

(figure: compute the half-size block products A_ik · B_kj; 8 multiplications)
Art of Multiprocessor Programming 66

Second Phase …

(figure: sum the paired block products into the C_ij blocks; 4 additions)
Art of Multiprocessor Programming 67

Multiplication

• Work is
  M1(n) = 8 M1(n/2) + A1(n)
  (8 parallel multiplications, plus the final addition)
Art of Multiprocessor Programming 68

Multiplication

• Work is
  M1(n) = 8 M1(n/2) + Θ(n²) = Θ(n³)
  Same as the serial triple-nested loop
Art of Multiprocessor Programming 69

Multiplication

• Critical-path length is
  M∞(n) = M∞(n/2) + A∞(n)
        = M∞(n/2) + Θ(log n)
        = Θ(log² n)
  (the half-size multiplications run in parallel, followed by the final addition)
Art of Multiprocessor Programming 71

Parallelism

• M1(n)/M∞(n) = Θ(n³/log² n)
• To multiply two 1000 x 1000 matrices
  – 1000³/10² = 10⁷
• Much more than the number of processors on any real machine
Art of Multiprocessor Programming 72

Shared-Memory Multiprocessors

• Parallel applications
  – Do not have direct access to HW processors
• Mix of other jobs
  – All run together
  – Come & go dynamically
Art of Multiprocessor Programming 73

Ideal Scheduling Hierarchy

(figure: a user-level scheduler maps tasks directly onto processors)

Art of Multiprocessor Programming 74

Realistic Scheduling Hierarchy

(figure: a user-level scheduler maps tasks onto threads; a kernel-level scheduler maps threads onto processors)
Art of Multiprocessor Programming 75

For Example

• Initially, all P processors are available for the application
• A serial computation
  – Takes over one processor
  – Leaving P-1 for us
  – Waits for I/O
  – We get that processor back …
Art of Multiprocessor Programming 76

Speedup

• Map threads onto P processors
• Cannot get P-fold speedup
  – What if the kernel doesn’t cooperate?
• Can try for speedup proportional to P
Art of Multiprocessor Programming 77

Scheduling Hierarchy

• User-level scheduler
  – Tells kernel which threads are ready
• Kernel-level scheduler
  – Synchronous (for analysis, not correctness!)
  – Picks pi threads to schedule at step i
Greedy Scheduling

• A node is ready if its predecessors are done
• Greedy: schedule as many ready nodes as possible
• An optimal scheduler is greedy (why?)
• But not every greedy scheduler is optimal

Art of Multiprocessor Programming 79

Greedy Scheduling

There are P processors.

Complete step:
• ≥ P nodes ready
• run any P
Incomplete step:
• < P nodes ready
• run them all
Art of Multiprocessor Programming 80

Theorem

For any greedy scheduler,

  TP ≤ T1/P + T∞

• TP: actual time
• T1/P: no better than the work divided among the processors
• T∞: no better than the critical-path length
Art of Multiprocessor Programming 84

Proof:

Number of complete steps ≤ T1/P …
… because each performs P work.

Number of incomplete steps ≤ T∞ …
… because each shortens the unexecuted critical path by 1.

QED
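The bound can be sanity-checked by simulating a greedy scheduler on a small DAG; the diamond-shaped example, the class name Greedy, and the method signature are ours, not from the slides:

```java
import java.util.ArrayList;
import java.util.List;

public class Greedy {
    // Greedily schedule a DAG given each node's predecessor list:
    // each step runs up to p ready nodes; returns the number of steps (TP).
    public static int schedule(int[][] preds, int p) {
        int n = preds.length, done = 0, steps = 0;
        boolean[] finished = new boolean[n];
        while (done < n) {
            List<Integer> ready = new ArrayList<>();
            for (int v = 0; v < n; v++) {
                if (finished[v]) continue;
                boolean ok = true;
                for (int u : preds[v]) ok &= finished[u]; // ready iff all predecessors done
                if (ok) ready.add(v);
            }
            int run = Math.min(p, ready.size());          // complete or incomplete step
            for (int i = 0; i < run; i++) finished[ready.get(i)] = true;
            done += run;
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // a diamond DAG: node 0 precedes {1,2,3}, which all precede 4; T1 = 5, T∞ = 3
        int[][] preds = { {}, {0}, {0}, {0}, {1, 2, 3} };
        int t2 = schedule(preds, 2);
        System.out.println(t2);                 // 4 steps
        System.out.println(t2 <= 5 / 2.0 + 3);  // the bound T1/P + T∞ holds
    }
}
```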
Art of Multiprocessor Programming 85

Near-Optimality

Theorem: any greedy scheduler is within a factor of 2 of optimal.

Remark: finding the optimal schedule is NP-hard!
Art of Multiprocessor Programming 86

Proof of Near-Optimality

Let TP* be the optimal time.

  TP* ≥ max{T1/P, T∞}   (from the work and critical-path laws)
  TP ≤ T1/P + T∞        (theorem)
  so TP ≤ 2 max{T1/P, T∞} ≤ 2 TP*

QED
Art of Multiprocessor Programming 87

Work Distribution

(cartoon: one thread has all the work while another sleeps, idle; “zzz…”)

Art of Multiprocessor Programming 88

Work Dealing

(cartoon: the overloaded thread deals some of its work to the idle one; “Yes!”)

Art of Multiprocessor Programming 89

The Problem with Work Dealing

(cartoon: every thread is busy trying to deal work to the others; “D’oh!”)

Art of Multiprocessor Programming 90

Work Stealing

(cartoon: a thread with no work steals from a busy one; “No work… Yes!”)
Art of Multiprocessor Programming 91

Lock-Free Work Stealing

• Each thread has a pool of ready work
• Remove work without synchronizing
• If you run out of work, steal someone else’s
• Choose victim at random
Art of Multiprocessor Programming 92

Local Work Pools

Each work pool is a Double-Ended Queue.

Art of Multiprocessor Programming 93

Work DEQueue¹

(figure: the owner works at the bottom of the deque with pushBottom and popBottom)

¹ Double-Ended Queue
Art of Multiprocessor Programming 94

Obtain Work

• Obtain work with popBottom
• Run the task until it blocks or terminates

Art of Multiprocessor Programming 95

New Work

• When a node is unblocked or spawned, pushBottom it
Art of Multiprocessor Programming 96

Whatcha Gonna Do When the Well Runs Dry?

(cartoon: the local deque is empty; “@&%$!!”)

Art of Multiprocessor Programming 97

Steal Work from Others

Pick a random thread’s DEQueue.

Art of Multiprocessor Programming 98

Steal this Task!

(figure: the thief takes a task from the top of the victim’s deque with popTop)
Art of Multiprocessor Programming 99

Task DEQueue

• Methods
  – pushBottom
  – popBottom
  – popTop
• pushBottom and popBottom never happen concurrently (only the owner calls them)
• They are the most common operations: make them fast (minimize use of CAS)
Art of Multiprocessor Programming 101

Ideal

• Wait-free
• Linearizable
• Constant time

Fortune cookie: “It is better to be young, rich and beautiful, than old, poor, and ugly”

Art of Multiprocessor Programming 102

Compromise

• Method popTop may fail if
  – A concurrent popTop succeeds, or
  – A concurrent popBottom takes the last task

Blame the victim!
Art of Multiprocessor Programming 103

Dreaded ABA Problem

(figure sequence: a thief reads top and prepares a CAS; meanwhile the owner pops every task with popBottom and pushes fresh ones, so top returns to its old value; the thief’s CAS then succeeds, “Uh-Oh … Yes!”, claiming a task that is no longer the one it read)
Art of Multiprocessor Programming 111

Fix for Dreaded ABA

(figure: top now pairs the index with a stamp that is incremented on every change, so a stale CAS fails; bottom is unchanged)
Art of Multiprocessor Programming 112

Bounded DEQueue

public class BDEQueue {
  AtomicStampedReference<Integer> top;  // index & stamp, updated atomically
  volatile int bottom;  // index of the bottom task; no need to synchronize,
                        // but a memory barrier is needed (hence volatile)
  Runnable[] tasks;     // array holding the tasks
  …
}
Art of Multiprocessor Programming 116

pushBottom()

public class BDEQueue {
  …
  void pushBottom(Runnable r) {
    tasks[bottom] = r;  // bottom is the index at which to store the new task
    bottom++;           // adjust the bottom index
  }
  …
}
Art of Multiprocessor Programming 119

Steal Work

public Runnable popTop() {
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = oldTop + 1;  // read top (value & stamp)
  int oldStamp = stamp[0], newStamp = oldStamp + 1;  // compute new value & stamp
  if (bottom <= oldTop)       // quit if the queue is empty
    return null;
  Runnable r = tasks[oldTop];
  if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))  // try to steal the task
    return r;
  return null;                // give up if a conflict occurs
}
Art of Multiprocessor Programming 125

Take Work

Runnable popBottom() {
  if (bottom == 0) return null;  // make sure the queue is non-empty
  bottom--;                      // prepare to grab the bottom task
  Runnable r = tasks[bottom];
  int[] stamp = new int[1];
  int oldTop = top.get(stamp), newTop = 0;           // read top & prepare new values
  int oldStamp = stamp[0], newStamp = oldStamp + 1;
  if (bottom > oldTop)           // top & bottom are one or more apart: no conflict
    return r;
  if (bottom == oldTop) {        // at most one task left
    bottom = 0;                  // deque will be empty even if the steal fails (why?)
    if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
      return r;                  // I win the CAS: the last task is mine
  }
  // I lost the CAS, so a thief must have won and taken the task
  // (bottom could even be less than top); still reset top and bottom,
  // since the deque is empty
  top.set(newTop, newStamp);
  bottom = 0;
  return null;
}
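Putting the three methods together gives a compilable deque; here is a single-threaded sanity check (the capacity parameter, constructor, and initialization are ours; note that compareAndSet on the boxed Integer compares references, which is safe in this sketch only because small int values are cached by Integer.valueOf):

```java
import java.util.concurrent.atomic.AtomicStampedReference;

public class BDEQueue {
    AtomicStampedReference<Integer> top = new AtomicStampedReference<>(0, 0);
    volatile int bottom = 0;
    Runnable[] tasks;

    public BDEQueue(int capacity) { tasks = new Runnable[capacity]; }

    void pushBottom(Runnable r) {
        tasks[bottom] = r;
        bottom++;
    }

    public Runnable popTop() {
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = oldTop + 1;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom <= oldTop) return null;        // empty
        Runnable r = tasks[oldTop];
        if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
            return r;                             // stole the task
        return null;                              // lost a race
    }

    public Runnable popBottom() {
        if (bottom == 0) return null;
        bottom--;
        Runnable r = tasks[bottom];
        int[] stamp = new int[1];
        int oldTop = top.get(stamp), newTop = 0;
        int oldStamp = stamp[0], newStamp = oldStamp + 1;
        if (bottom > oldTop) return r;            // no conflict possible
        if (bottom == oldTop) {                   // last task: race with thieves
            bottom = 0;
            if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
                return r;
        }
        top.set(newTop, newStamp);                // deque is empty; reset
        bottom = 0;
        return null;
    }

    public static void main(String[] args) {
        BDEQueue q = new BDEQueue(8);
        Runnable a = () -> {}, b = () -> {}, c = () -> {};
        q.pushBottom(a); q.pushBottom(b); q.pushBottom(c);
        System.out.println(q.popTop() == a);      // a thief takes the oldest task
        System.out.println(q.popBottom() == c);   // the owner takes the newest
        System.out.println(q.popBottom() == b);
        System.out.println(q.popBottom() == null);
    }
}
```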
Art of Multiprocessor Programming 135

Old English Proverb

• “May as well be hanged for stealing a sheep as a goat”
• From which we conclude:
  – Stealing was punished severely
  – Sheep were worth more than goats
Art of Multiprocessor Programming 136

Variations

• Stealing is expensive
  – Pay for a CAS
  – Only one task taken
• What if we
  – Move more than one task
  – Randomly balance loads?
Art of Multiprocessor Programming 137

Work Balancing

(figure: queues of sizes 5 and 2 are rebalanced to ⌊(5+2)/2⌋ = 3 and ⌈(5+2)/2⌉ = 4)
Art of Multiprocessor Programming 138

Work-Balancing Thread

public void run() {
  int me = ThreadID.get();
  while (true) {
    Runnable task = queue[me].deq();
    if (task != null) task.run();                 // keep running tasks
    int size = queue[me].size();
    if (random.nextInt(size+1) == size) {         // with probability 1/(size+1)
      int victim = random.nextInt(queue.length);  // choose a random victim
      int min = …, max = …;                       // lock queues in canonical order
      synchronized (queue[min]) {
        synchronized (queue[max]) {
          balance(queue[min], queue[max]);        // rebalance the queues
        }
      }
    }
  }
}
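The balance(...) helper is not shown on the slides; here is one plausible sketch (entirely ours) that evens out two queues to the ⌊·⌋/⌈·⌉ split pictured earlier:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class Balancer {
    // Move tasks from the larger queue to the smaller one until their
    // sizes differ by at most 1, as in the slides' 5/2 -> 4/3 example.
    static void balance(Deque<Runnable> q0, Deque<Runnable> q1) {
        Deque<Runnable> small = q0.size() <= q1.size() ? q0 : q1;
        Deque<Runnable> large = small == q0 ? q1 : q0;
        while (large.size() - small.size() > 1)
            small.addLast(large.removeLast());
    }

    // Hypothetical demo: start with queues of sizes 5 and 2.
    public static int[] demo() {
        Deque<Runnable> a = new ArrayDeque<>(), b = new ArrayDeque<>();
        for (int i = 0; i < 5; i++) a.addLast(() -> {});
        for (int i = 0; i < 2; i++) b.addLast(() -> {});
        balance(a, b);
        return new int[] { a.size(), b.size() };
    }

    public static void main(String[] args) {
        int[] sizes = demo();
        System.out.println(sizes[0] + " " + sizes[1]); // 4 3
    }
}
```

In the slides’ setting the two queues would be the locked victim and local deques; any real implementation must also preserve the deques’ ordering invariants.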
Art of Multiprocessor Programming 144

Work Stealing & Balancing

• Clean separation between the application & scheduling layer
• Works well when the number of processors fluctuates
• Works on “black-box” operating systems
Art of Multiprocessor Programming 145
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
• You are free:– to Share — to copy, distribute and transmit the work – to Remix — to adapt the work
• Under the following conditions:– Attribution. You must attribute the work to “The Art of Multiprocessor
Programming” (but not in any way that suggests that the authors endorse you or your use of the work).
– Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to– http://creativecommons.org/licenses/by-sa/3.0/.
• Any of the above conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.
Art of Multiprocessor Programming 146