Early Experiences on Accelerating Dijkstra's Algorithm Using Transactional Memory
Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris
Computing Systems Laboratory, School of Electrical and Computer Engineering
National Technical University of Athens
{anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr
http://www.cslab.ece.ntua.gr
May 29, 2009
CSLab, National Technical University of Athens
Outline
1 Dijkstra’s Basics
2 Straightforward Parallelization Scheme
3 Helper-Threading Scheme
4 Experimental Evaluation
5 Conclusions
Anastopoulos et al. (NTUA) MTAAP’09 May 29, 2009 2 / 27
The Basics of Dijkstra’s Algorithm
SSSP Problem
Directed graph G = (V, E), weight function w : E → R+, source vertex s
∀v ∈ V : compute δ(v) = min{w(p) : p is a path s ⇝ v}
Shortest path estimate d(v)
gradually converges to δ(v) through relaxations
relax(v, w): d(w) ← min{d(w), d(v) + w(v, w)}, i.e. can we find a better path s ⇝ w by going through v?
Three partitions of vertices
Settled: d(v) = δ(v)
Queued: d(v) > δ(v) and d(v) ≠ ∞
Unreached: d(v) = ∞
The Basics of Dijkstra’s Algorithm
Serial algorithm
Input: G = (V, E), w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π
foreach v ∈ V do
    d[v] ← ∞
    π[v] ← nil
    Insert(Q, v)
end
d[s] ← 0
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
    end
end
[Figure: example graph with source S and vertices A-E; edge weights and current distance estimates shown, ∞ for unreached vertices]
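Rendered as runnable code, the serial algorithm looks as follows. This is a sketch under two assumptions not in the slides: the graph is an adjacency dict `{u: [(v, w), ...]}`, and, since Python's `heapq` offers no DecreaseKey, relaxations push duplicate queue entries and stale ones are skipped at extraction (a standard workaround with the same asymptotic bound).

```python
import heapq

def dijkstra(adj, s):
    """Serial Dijkstra over an adjacency dict {u: [(v, weight), ...]}.
    Returns the distance dict d and the predecessor dict pi."""
    d = {v: float('inf') for v in adj}
    pi = {v: None for v in adj}
    d[s] = 0
    Q = [(0, s)]                      # (estimate, vertex) pairs
    while Q:
        du, u = heapq.heappop(Q)      # ExtractMin
        if du > d[u]:
            continue                  # stale duplicate entry: u already settled
        for v, w in adj[u]:           # relax all out-edges of u
            total = du + w
            if total < d[v]:
                d[v] = total
                pi[v] = u
                heapq.heappush(Q, (total, v))  # stands in for DecreaseKey
    return d, pi
```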
The Basics of Dijkstra’s Algorithm
[Figure: binary min-heap before and after a DecreaseKey; lowering vertex i's key from 17 to 6 sifts i up past its ancestors j and k]
Min-priority queue implemented as binary min-heap
maintains all but the settled vertices
min-heap property: ∀i : d(parent(i)) ≤ d(i)
amortizes the cost of multiple ExtractMin's and DecreaseKey's
O((|E| + |V|) log |V|) time complexity
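A minimal sketch of such a binary min-heap with an explicit DecreaseKey, as the slide describes: a `pos` index maps each vertex to its array slot so DecreaseKey can find it in O(1) and then sift it up by parent-child swaps. Class and method names are illustrative, not from the talk.

```python
class MinHeap:
    """Binary min-heap over vertices with DecreaseKey support (sketch).
    key[v] is v's key; pos[v] is v's slot in the heap array a."""
    def __init__(self, keys):
        self.key = dict(keys)
        self.a = sorted(self.key, key=self.key.get)  # a sorted array is a valid heap
        self.pos = {v: i for i, v in enumerate(self.a)}

    def _swap(self, i, j):
        self.a[i], self.a[j] = self.a[j], self.a[i]
        self.pos[self.a[i]], self.pos[self.a[j]] = i, j

    def decrease_key(self, v, new_key):
        """Lower v's key, then sift v up while its parent's key is larger."""
        assert new_key <= self.key[v]
        self.key[v] = new_key
        i = self.pos[v]
        while i > 0 and self.key[self.a[(i - 1) // 2]] > new_key:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def extract_min(self):
        """Remove and return the vertex with the minimum key."""
        self._swap(0, len(self.a) - 1)
        m = self.a.pop()
        del self.pos[m]
        i, n = 0, len(self.a)
        while True:                                  # sift down
            kids = [j for j in (2 * i + 1, 2 * i + 2) if j < n]
            if not kids:
                return m
            c = min(kids, key=lambda j: self.key[self.a[j]])
            if self.key[self.a[i]] <= self.key[self.a[c]]:
                return m
            self._swap(i, c)
            i = c
```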
Straightforward Parallelization
Fine-grain parallelization at the inner loop level
Fine-Grain Multi-Threaded
/* Initialization phase same as in the serial code */
while Q ≠ ∅ do
    Barrier
    if tid = 0 then
        u ← ExtractMin(Q)
    Barrier
    for v adjacent to u in parallel do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            Begin-Atomic
            DecreaseKey(Q, v, sum)
            End-Atomic
            d[v] ← sum
            π[v] ← u
    end
end
Issues:
speedup bounded by average out-degree
concurrent heap updates due to DecreaseKey's
barrier synchronization overhead
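A hedged sketch of this scheme with Python threads: `threading.Barrier` plays the role of the two barriers, thread 0 performs ExtractMin, and one `Lock` around the queue update stands in for Begin-Atomic/End-Atomic (the cgs-lock variant). To stay short, the heap is simplified to a dict of unsettled estimates scanned for its minimum; the edge split `adj[u][tid::nthreads]` is an illustrative choice, not the authors' code.

```python
import threading

def dijkstra_fgmt(adj, s, nthreads=4):
    """Fine-grain multi-threaded Dijkstra sketch with a single
    coarse-grain lock protecting the queue (cgs-lock style)."""
    d = {v: float('inf') for v in adj}
    d[s] = 0
    Q = dict(d)                               # unsettled vertices -> estimates
    barrier = threading.Barrier(nthreads)
    heap_lock = threading.Lock()              # one lock for the whole queue
    current = [None]

    def worker(tid):
        while True:
            barrier.wait()
            if tid == 0:                      # only thread 0 extracts the min
                current[0] = min(Q, key=Q.get) if Q else None
                if current[0] is not None:
                    del Q[current[0]]
            barrier.wait()
            u = current[0]
            if u is None:
                return                        # queue drained: all threads exit
            for v, w in adj[u][tid::nthreads]:  # split out-edges across threads
                total = d[u] + w
                with heap_lock:               # Begin-Atomic ... End-Atomic
                    if v in Q and total < d[v]:
                        d[v] = total
                        Q[v] = total          # the DecreaseKey stand-in

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(nthreads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return d
```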
Concurrent Heap Updates with Locks
[Figure: binary min-heap under concurrent DecreaseKey operations (i: 17 → 6, k: 12 → 4, j: 9 → 2); each operation is a sequence of parent-child swaps along a path towards the root, and overlapping paths conflict]
Coarse-grain synchronization (cgs-lock)
enforces atomicity at the level of a DecreaseKey operation
one lock for the entire heap
serializes DecreaseKey's
Fine-grain synchronization (fgs-lock)
enforces atomicity at the level of a single swap
allows multiple swap sequences to execute in parallel as long as they are temporally non-overlapping
separate locks for each parent-child pair
Performance of FGMT with Locks
[Figure: multithreaded speedup vs. number of threads (2-16) for cgs-lock, perfbar+cgs-lock and perfbar+fgs-lock]
Software barriers dominate total execution time
72% with 2 threads, 88% with 8
replace with idealized (simulated) zero-latency barriers
Fgs-lock scheme more scalable, but still fails to outperform serial
locking overhead (2 locks + 2 unlocks per swap)
Concurrent Heap Updates with TM
[Figure: the same concurrent heap-update scenario as before, now TM-based]
Coarse-grain synchronization (cgs-tm)
enclose DecreaseKey within a transaction
allows multiple swap sequences to execute in parallel as long as they are spatially (and temporally) non-overlapping
a conflicting transaction stalls and retries, or aborts
Fine-grain synchronization (fgs-tm)
enclose each swap operation within a transaction
atomicity as in fgs-lock
shorter but more transactions
Performance of FGMT with TM
[Figure: multithreaded speedup vs. number of threads (2-16) for perfbar+cgs-lock, perfbar+fgs-lock, perfbar+cgs-tm and perfbar+fgs-tm]
TM-based schemes offer speedup up to ∼ 1.1
less overhead for cgs-tm, yet equally able to exploit available concurrency
Helper-Threading Scheme
Motivation
expose more parallelism to each thread
eliminate costly barrier synchronization
Rationale
in serial, relaxations are performed only from the extracted (settled) vertex
allow relaxations for out-edges of queued vertices, hoping that some of them might already be settled
main thread operates as in the serial algorithm
assign the next t vertices in the queue (x2 . . . xt+1) to t helper threads
helper thread k relaxes all out-edges of vertex xk
[Figure: example graph as before, annotated with the vertices extracted at steps i-1 and i]
speculation on the status of d(xk)
if already optimal, the main thread will be offloaded
if not optimal, any suboptimal relaxations will eventually be corrected by the main thread
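The correctness of this speculation can be demonstrated sequentially: since every estimate satisfies d(x) ≥ δ(x), a speculative relaxation d(y) ← d(x) + w(x, y) can never push d(y) below δ(y), so extra relaxations from queued vertices are harmless. A sketch simulating the helpers in one thread (O(V²) queue for brevity; an illustration, not the authors' TM implementation):

```python
def dijkstra_with_speculation(adj, s, t=1):
    """Serial Dijkstra plus simulated helper threads: each step, after
    the minimum vertex is settled and relaxed as usual, the out-edges
    of the next t queued vertices are also relaxed using their current
    (possibly suboptimal) estimates."""
    d = {v: float('inf') for v in adj}
    d[s] = 0
    Q = set(adj)                     # unsettled vertices

    def relax_from(x):
        for y, w in adj[x]:
            if y in Q and d[x] + w < d[y]:
                d[y] = d[x] + w      # may be premature; later corrected

    while Q:
        order = sorted(Q, key=d.get)
        u = order[0]                 # ExtractMin by the main thread
        Q.discard(u)
        relax_from(u)                # the main thread's relaxations
        for x in order[1:1 + t]:     # helper k speculatively relaxes from x_k
            relax_from(x)
    return d
```

With t = 0 this degenerates to plain Dijkstra, and any t ≥ 0 yields the same final distances.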
Execution Pattern
[Figure: execution pattern over steps k, k+1, k+2 for Serial (one thread), FGMT (threads 1-4, barrier-synchronized), and Helper Threads (main plus helpers 1-3, with kill signals at the end of each step); operations: extract-min, relax edges, read tid-th min]
the main thread stops all helpers at the end of each iteration
unfinished work will be corrected, as with mis-speculated distances
Helper-Threading Scheme
Main thread
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    done ← 0
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        Begin-Xact
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
        End-Xact
    end
    Begin-Xact
    done ← 1
    End-Xact
end

Helper thread
while Q ≠ ∅ do
    while done = 1 do ;
    x ← ReadMin(Q, tid)
    stop ← 0
    foreach y adjacent to x and while stop = 0 do
        Begin-Xact
        if done = 0 then
            sum ← d[x] + w(x, y)
            if d[y] > sum then
                DecreaseKey(Q, y, sum)
                d[y] ← sum
                π[y] ← x
        else
            stop ← 1
        End-Xact
    end
end
for a single neighbour, the check for relaxation, updates to the heap, and updates to the d, π arrays are enclosed within a transaction
performed "all-or-none"
on a conflict, only one thread commits
interruption of helper threads implemented through TM as well
Why with TM?
composable: all dependent atomic sub-operations are composed into a large atomic operation, without limiting concurrency
optimistic
easily programmable
Experimental Setup
Full-system simulation
Simics 3.0.31 in conjunction with GEMS toolset 2.1
boots unmodified Solaris 10 (UltraSPARC III Cu)
LogTM ("Signature Edition")
eager version management
eager conflict detection
on a conflict, a transaction stalls and either retries or aborts
HYBRID conflict resolution policy
favors older transactions
Hardware platform
single CMP system (configurations up to 32 cores)
private L1 caches (64KB), shared L2 cache (2MB)
Software
Pthreads for threading and synchronization
Simics "magic" instructions to simulate idealized barriers
Sun Studio 12 C compiler (-xO3)
Graphs
Three graph families
Random: G(n, m) model
SSCA#2: cliques with varying size (1 to C), connected with probability P
R-MAT: power-law degree distributions
GTgraph graph generator
Fixed #nodes (10K), varying density
sparse (∼10K edges)
medium (∼100K edges)
dense (∼200K edges)
Speedups
[Figure: six speedup plots vs. number of threads (2-16): ssca2-10000x28351, ssca2-10000x118853, rmat-10000x200000, rmat-10000x10000, rand-10000x100000, rand-10000x200000; curves: perfbar+cgs-lock, perfbar+cgs-tm, perfbar+fgs-lock, perfbar+fgs-tm, helper]
Helper-Threading
speedups in 6 out of 9 cases (not all shown), up to 1.46
performance improves with increasing density
main thread not obstructed by helpers (<1% abort rate in all cases)
FGMT with TM
speedups only with perfect barriers
optimistic parallelism does exist in concurrent queue updates
Conclusions
FGMT
conventional synchronization mechanisms incur unacceptable overhead
TM reduces overheads and highlights the existence of parallelism, but still requires very efficient barriers to offer some speedup
HT with TM
exposes more parallelism and eliminates barrier synchronization
noteworthy speedups with minimal code extensions
Future work
more aggressive parallelization schemes
dynamic adaptation of helper threads to algorithm’s execution phases
explore impact of TM characteristics
applicability of HT to other SSSP algorithms (∆-stepping, Bellman-Ford) and other similar ("greedy") applications
Thank you! Questions?
BACKUP SLIDES
DecreaseKey
Input: min-priority queue Q, vertex u, new key value for vertex u
Q[u] ← value
i ← u
while parent(i).key ≥ value do
    Begin-Xact
    if swap(u, i) then
        i ← parent(i)
    End-Xact
end
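In runnable form, with a shared lock standing in for the per-swap Begin-Xact/End-Xact (the slides use a hardware transaction per swap), the loop might look like this; the class layout and names are illustrative assumptions:

```python
import threading

class TxHeap:
    """Sketch of the per-swap DecreaseKey: the key is updated first,
    then the vertex moves up one parent-child swap at a time, each
    swap done atomically so concurrent operations interleave safely."""
    def __init__(self, items):
        items = sorted(items, key=lambda kv: kv[1])  # sorted array is a valid heap
        self.a = [v for v, _ in items]
        self.key = dict(items)
        self.pos = {v: i for i, v in enumerate(self.a)}
        self.xact = threading.Lock()      # stand-in for a transaction

    def decrease_key(self, u, value):
        self.key[u] = value               # Q[u] <- value
        while True:
            with self.xact:               # Begin-Xact ... End-Xact
                i = self.pos[u]           # re-read: u may have been moved
                p = (i - 1) // 2
                if i == 0 or self.key[self.a[p]] <= value:
                    return                # heap property restored
                self.a[i], self.a[p] = self.a[p], self.a[i]   # one swap
                self.pos[self.a[i]], self.pos[self.a[p]] = i, p
```

Re-reading `pos[u]` inside each atomic section mirrors why the per-swap granularity works: between two of u's swaps, other operations may reshuffle the heap, but each individual swap observes a consistent state.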
Distribution of DecreaseKey's between main and helper threads
[Figure: % DecreaseKey ops w.r.t. serial execution vs. number of threads (up to 32) for (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K; curves: serial, HT-main, HT-helper, HT-all]
Overall and main thread transaction abort rates
[Figure: abort rate vs. number of threads (2-32) for (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K; curves: overall and main thread]
Breakdown of main thread’s total cycles
[Figure: main thread cycle breakdown vs. number of threads (2-32) for (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K; curves: total cycles, non-xact cycles, xact cycles, xact overhead cycles]
Non-transactional (non-xact), transactional (xact) and overhead (xact-overhead) cycles
Percentage of useful (xact) and wasted (xact-overhead)transactional cycles
[Figure: % cycle breakdown for the average transaction (xact vs. xact overhead cycles) vs. number of threads (2-32) for (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K]
Write-set sizes
The timeline of execution
[Figure: cycles of main thread (per 100 iterations) vs. outer loop iterations (0-10000); curves: serial, HT-2thr, HT-4thr, HT-best]