Early Experiences on Accelerating Dijkstra's Algorithm Using Transactional Memory
Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris
Computing Systems Laboratory, School of Electrical and Computer Engineering
National Technical University of Athens
{anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr
http://www.cslab.ece.ntua.gr
May 29, 2009
CSLab, National Technical University of Athens
Outline
1 Dijkstra’s Basics
2 Straightforward Parallelization Scheme
3 Helper-Threading Scheme
4 Experimental Evaluation
5 Conclusions
Anastopoulos et al. (NTUA) MTAAP’09 May 29, 2009 2 / 27
The Basics of Dijkstra’s Algorithm
SSSP Problem
Directed graph G = (V, E), weight function w : E → R+, source vertex s
∀v ∈ V : compute δ(v) = min{w(p) : p is a path s ⇝ v}
Shortest path estimate d(v)
gradually converges to δ(v) through relaxations
relax(v, w): d(w) ← min{d(w), d(v) + w(v, w)}, i.e. can we find a better path s ⇝ w by going through v?
Three partitions of vertices
Settled: d(v) = δ(v)
Queued: d(v) > δ(v) and d(v) ≠ ∞
Unreached: d(v) = ∞
The Basics of Dijkstra’s Algorithm
Serial algorithm
Input: G = (V, E), w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π
foreach v ∈ V do
    d[v] ← ∞
    π[v] ← nil
    Insert(Q, v)
end
d[s] ← 0
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
    end
end
[Figure: example graph with source S and vertices A-E; edge weights and current distance estimates shown, ∞ for unreached vertices]
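Rendered as runnable code, the serial algorithm looks as follows. This is a sketch under two assumptions not in the slides: the graph is an adjacency dict `{u: [(v, w), ...]}`, and, since Python's `heapq` offers no DecreaseKey, relaxations push duplicate queue entries and stale ones are skipped at extraction (a standard workaround with the same asymptotic bound).

```python
import heapq

def dijkstra(adj, s):
    """Serial Dijkstra over an adjacency dict {u: [(v, weight), ...]}.
    Returns the distance dict d and the predecessor dict pi."""
    d = {v: float('inf') for v in adj}
    pi = {v: None for v in adj}
    d[s] = 0
    Q = [(0, s)]                      # (estimate, vertex) pairs
    while Q:
        du, u = heapq.heappop(Q)      # ExtractMin
        if du > d[u]:
            continue                  # stale duplicate entry: u already settled
        for v, w in adj[u]:           # relax all out-edges of u
            total = du + w
            if total < d[v]:
                d[v] = total
                pi[v] = u
                heapq.heappush(Q, (total, v))  # stands in for DecreaseKey
    return d, pi
```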
The Basics of Dijkstra’s Algorithm
[Figure: binary min-heap before and after a DecreaseKey; lowering vertex i's key from 17 to 6 sifts i up past its ancestors j and k]
Min-priority queue implemented as binary min-heap
maintains all but the settled vertices
min-heap property: ∀i : d(parent(i)) ≤ d(i)
amortizes the cost of multiple ExtractMin's and DecreaseKey's
O((|E| + |V|) log |V|) time complexity
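A minimal sketch of such a binary min-heap with an explicit DecreaseKey, as the slide describes: a `pos` index maps each vertex to its array slot so DecreaseKey can find it in O(1) and then sift it up by parent-child swaps. Class and method names are illustrative, not from the talk.

```python
class MinHeap:
    """Binary min-heap over vertices with DecreaseKey support (sketch).
    key[v] is v's key; pos[v] is v's slot in the heap array a."""
    def __init__(self, keys):
        self.key = dict(keys)
        self.a = sorted(self.key, key=self.key.get)  # a sorted array is a valid heap
        self.pos = {v: i for i, v in enumerate(self.a)}

    def _swap(self, i, j):
        self.a[i], self.a[j] = self.a[j], self.a[i]
        self.pos[self.a[i]], self.pos[self.a[j]] = i, j

    def decrease_key(self, v, new_key):
        """Lower v's key, then sift v up while its parent's key is larger."""
        assert new_key <= self.key[v]
        self.key[v] = new_key
        i = self.pos[v]
        while i > 0 and self.key[self.a[(i - 1) // 2]] > new_key:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def extract_min(self):
        """Remove and return the vertex with the minimum key."""
        self._swap(0, len(self.a) - 1)
        m = self.a.pop()
        del self.pos[m]
        i, n = 0, len(self.a)
        while True:                                  # sift down
            kids = [j for j in (2 * i + 1, 2 * i + 2) if j < n]
            if not kids:
                return m
            c = min(kids, key=lambda j: self.key[self.a[j]])
            if self.key[self.a[i]] <= self.key[self.a[c]]:
                return m
            self._swap(i, c)
            i = c
```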
Straightforward Parallelization
Fine-grain parallelization at the inner loop level
Fine-Grain Multi-Threaded
/* Initialization phase same as in the serial code */
while Q ≠ ∅ do
    Barrier
    if tid = 0 then
        u ← ExtractMin(Q)
    Barrier
    for v adjacent to u in parallel do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            Begin-Atomic
            DecreaseKey(Q, v, sum)
            End-Atomic
            d[v] ← sum
            π[v] ← u
    end
end
Issues:
speedup bounded by average out-degree
concurrent heap updates due to DecreaseKey's
barrier synchronization overhead
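A hedged sketch of this scheme with Python threads: `threading.Barrier` plays the role of the two barriers, thread 0 performs ExtractMin, and one `Lock` around the queue update stands in for Begin-Atomic/End-Atomic (the cgs-lock variant). To stay short, the heap is simplified to a dict of unsettled estimates scanned for its minimum; the edge split `adj[u][tid::nthreads]` is an illustrative choice, not the authors' code.

```python
import threading

def dijkstra_fgmt(adj, s, nthreads=4):
    """Fine-grain multi-threaded Dijkstra sketch with a single
    coarse-grain lock protecting the queue (cgs-lock style)."""
    d = {v: float('inf') for v in adj}
    d[s] = 0
    Q = dict(d)                               # unsettled vertices -> estimates
    barrier = threading.Barrier(nthreads)
    heap_lock = threading.Lock()              # one lock for the whole queue
    current = [None]

    def worker(tid):
        while True:
            barrier.wait()
            if tid == 0:                      # only thread 0 extracts the min
                current[0] = min(Q, key=Q.get) if Q else None
                if current[0] is not None:
                    del Q[current[0]]
            barrier.wait()
            u = current[0]
            if u is None:
                return                        # queue drained: all threads exit
            for v, w in adj[u][tid::nthreads]:  # split out-edges across threads
                total = d[u] + w
                with heap_lock:               # Begin-Atomic ... End-Atomic
                    if v in Q and total < d[v]:
                        d[v] = total
                        Q[v] = total          # the DecreaseKey stand-in

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return d
```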
Concurrent Heap Updates with Locks
[Figure: binary min-heap under concurrent DecreaseKey operations (i: 17 → 6, k: 12 → 4, j: 9 → 2); each operation is a sequence of parent-child swaps along a path towards the root, and overlapping paths conflict]
Coarse-grain synchronization (cgs-lock)
enforces atomicity at the level of a DecreaseKey operation
one lock for the entire heap
serializes DecreaseKey's
Fine-grain synchronization (fgs-lock)
enforces atomicity at the level of a single swap
allows multiple swap sequences to execute in parallel as long as they are temporally non-overlapping
separate locks for each parent-child pair
Performance of FGMT with Locks
[Figure: multithreaded speedup vs. number of threads (2-16) for cgs-lock, perfbar+cgs-lock and perfbar+fgs-lock]
Software barriers dominate total execution time
72% with 2 threads, 88% with 8
replace with idealized (simulated) zero-latency barriers
Fgs-lock scheme more scalable, but still fails to outperform serial
locking overhead (2 locks + 2 unlocks per swap)
Concurrent Heap Updates with TM
[Figure: the same concurrent heap-update scenario as before, now TM-based]
Coarse-grain synchronization (cgs-tm)
enclose DecreaseKey within a transaction
allows multiple swap sequences to execute in parallel as long as they are spatially (and temporally) non-overlapping
a conflicting transaction stalls and retries, or aborts
Fine-grain synchronization (fgs-tm)
enclose each swap operation within a transaction
atomicity as in fgs-lock
shorter but more transactions
Performance of FGMT with TM
[Figure: multithreaded speedup vs. number of threads (2-16) for perfbar+cgs-lock, perfbar+fgs-lock, perfbar+cgs-tm and perfbar+fgs-tm]
TM-based schemes offer speedup up to ∼ 1.1
less overhead for cgs-tm, yet equally able to exploit available concurrency
Helper-Threading Scheme
Motivation
expose more parallelism to each thread
eliminate costly barrier synchronization
Rationale
in serial, relaxations are performed only from the extracted (settled) vertex
allow relaxations for out-edges of queued vertices, hoping that some of them might already be settled
main thread operates as in the serial algorithm
assign the next t vertices in the queue (x2 . . . xt+1) to t helper threads
helper thread k relaxes all out-edges of vertex xk
[Figure: example graph as before, annotated with the vertices extracted at steps i-1 and i]
speculation on the status of d(xk)
if already optimal, the main thread will be offloaded
if not optimal, any suboptimal relaxations will eventually be corrected by the main thread
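The correctness of this speculation can be demonstrated sequentially: since every estimate satisfies d(x) ≥ δ(x), a speculative relaxation d(y) ← d(x) + w(x, y) can never push d(y) below δ(y), so extra relaxations from queued vertices are harmless. A sketch simulating the helpers in one thread (O(V²) queue for brevity; an illustration, not the authors' TM implementation):

```python
def dijkstra_with_speculation(adj, s, t=1):
    """Serial Dijkstra plus simulated helper threads: each step, after
    the minimum vertex is settled and relaxed as usual, the out-edges
    of the next t queued vertices are also relaxed using their current
    (possibly suboptimal) estimates."""
    d = {v: float('inf') for v in adj}
    d[s] = 0
    Q = set(adj)                     # unsettled vertices

    def relax_from(x):
        for y, w in adj[x]:
            if y in Q and d[x] + w < d[y]:
                d[y] = d[x] + w      # may be premature; later corrected

    while Q:
        order = sorted(Q, key=d.get)
        u = order[0]                 # ExtractMin by the main thread
        Q.discard(u)
        relax_from(u)                # the main thread's relaxations
        for x in order[1:1 + t]:     # helper k speculatively relaxes from x_k
            relax_from(x)
    return d
```

With t = 0 this degenerates to plain Dijkstra, and any t ≥ 0 yields the same final distances.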
Execution Pattern
[Figure: execution pattern over steps k, k+1, k+2 for Serial (one thread), FGMT (threads 1-4, barrier-synchronized), and Helper Threads (main plus helpers 1-3, with kill signals at the end of each step); operations: extract-min, relax edges, read tid-th min]
the main thread stops all helpers at the end of each iteration
unfinished work will be corrected, as with mis-speculated distances
Helper-Threading Scheme
Main thread
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    done ← 0
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        Begin-Xact
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
        End-Xact
    end
    Begin-Xact
    done ← 1
    End-Xact
end

Helper thread
while Q ≠ ∅ do
    while done = 1 do ;
    x ← ReadMin(Q, tid)
    stop ← 0
    foreach y adjacent to x and while stop = 0 do
        Begin-Xact
        if done = 0 then
            sum ← d[x] + w(x, y)
            if d[y] > sum then
                DecreaseKey(Q, y, sum)
                d[y] ← sum
                π[y] ← x
        else
            stop ← 1
        End-Xact
    end
end
for a single neighbour, the check for relaxation, updates to the heap, and updates to the d, π arrays are enclosed within a transaction
performed "all-or-none"
on a conflict, only one thread commits
interruption of helper threads implemented through TM as well
Why with TM?
composable: all dependent atomic sub-operations are composed into a large atomic operation, without limiting concurrency
optimistic
easily programmable
Experimental Setup
Full-system simulation
Simics 3.0.31 in conjunction with GEMS toolset 2.1
boots unmodified Solaris 10 (UltraSPARC III Cu)
LogTM ("Signature Edition")
eager version management
eager conflict detection
on a conflict, a transaction stalls and either retries or aborts
HYBRID conflict resolution policy
favors older transactions
Hardware platform
single CMP system (configurations up to 32 cores)
private L1 caches (64KB), shared L2 cache (2MB)
Software
Pthreads for threading and synchronization
Simics "magic" instructions to simulate idealized barriers
Sun Studio 12 C compiler (-xO3)
Graphs
Three graph families
Random: G(n, m) model
SSCA#2: cliques with varying size (1 to C), connected with probability P
R-MAT: power-law degree distributions
GTgraph graph generator
Fixed #nodes (10K), varying density
sparse (∼10K edges)
medium (∼100K edges)
dense (∼200K edges)
Speedups
[Figure: six speedup plots vs. number of threads (2-16): ssca2-10000x28351, ssca2-10000x118853, rmat-10000x200000, rmat-10000x10000, rand-10000x100000, rand-10000x200000; curves: perfbar+cgs-lock, perfbar+cgs-tm, perfbar+fgs-lock, perfbar+fgs-tm, helper]
Helper-Threading
speedups in 6 out of 9 cases (not all shown), up to 1.46
performance improves with increasing density
main thread not obstructed by helpers (<1% abort rate in all cases)
FGMT with TM
speedups only with perfect barriers
optimistic parallelism does exist in concurrent queue updates
Conclusions
FGMT
conventional synchronization mechanisms incur unacceptable overhead
TM reduces overheads and highlights the existence of parallelism, but still requires very efficient barriers to offer some speedup
HT with TM
exposes more parallelism and eliminates barrier synchronization
noteworthy speedups with minimal code extensions
Future work
more aggressive parallelization schemes
dynamic adaptation of helper threads to algorithm’s execution phases
explore impact of TM characteristics
applicability of HT to other SSSP algorithms (∆-stepping, Bellman-Ford) and other similar ("greedy") applications
Thank you! Questions?
BACKUP SLIDES
DecreaseKey
Input: min-priority queue Q, vertex u, new key value for vertex u
Q[u] ← value
i ← u
while parent(i).key ≥ value do
    Begin-Xact
    if swap(u, i) then
        i ← parent(i)
    End-Xact
end
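In runnable form, with a shared lock standing in for the per-swap Begin-Xact/End-Xact (the slides use a hardware transaction per swap), the loop might look like this; the class layout and names are illustrative assumptions:

```python
import threading

class TxHeap:
    """Sketch of the per-swap DecreaseKey: the key is updated first,
    then the vertex moves up one parent-child swap at a time, each
    swap done atomically so concurrent operations interleave safely."""
    def __init__(self, items):
        items = sorted(items, key=lambda kv: kv[1])  # sorted array is a valid heap
        self.a = [v for v, _ in items]
        self.key = dict(items)
        self.pos = {v: i for i, v in enumerate(self.a)}
        self.xact = threading.Lock()      # stand-in for a transaction

    def decrease_key(self, u, value):
        self.key[u] = value               # Q[u] <- value
        while True:
            with self.xact:               # Begin-Xact ... End-Xact
                i = self.pos[u]           # re-read: u may have been moved
                p = (i - 1) // 2
                if i == 0 or self.key[self.a[p]] <= value:
                    return                # heap property restored
                self.a[i], self.a[p] = self.a[p], self.a[i]   # one swap
                self.pos[self.a[i]], self.pos[self.a[p]] = i, p
```

Re-reading `pos[u]` inside each atomic section mirrors why the per-swap granularity works: between two of u's swaps, other operations may reshuffle the heap, but each individual swap observes a consistent state.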
Distribution of DecreaseKey's between main and helper threads
[Figure: % DecreaseKey ops w.r.t. serial execution vs. number of threads (up to 32) for (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K; curves: serial, HT-main, HT-helper, HT-all]
Overall and main thread transaction abort rates
[Figure: abort rate vs. number of threads (2-32) for (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K; curves: overall and main thread]
Breakdown of main thread’s total cycles
[Figure: main thread cycle breakdown vs. number of threads (2-32) for (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K; curves: total cycles, non-xact cycles, xact cycles, xact overhead cycles]
Non-transactional (non-xact), transactional (xact) and overhead (xact-overhead) cycles
Percentage of useful (xact) and wasted (xact-overhead)transactional cycles
[Figure: % cycle breakdown for the average transaction (xact vs. xact overhead cycles) vs. number of threads (2-32) for (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K]
Write-set sizes
The timeline of execution
[Figure: cycles of main thread (per 100 iterations) vs. outer loop iterations (0-10000); curves: serial, HT-2thr, HT-4thr, HT-best]