
Early Experiences on Accelerating Dijkstra’s Algorithm Using Transactional Memory

Nikos Anastopoulos, Konstantinos Nikas, Georgios Goumas and Nectarios Koziris

Computing Systems Laboratory
School of Electrical and Computer Engineering
National Technical University of Athens
{anastop,knikas,goumas,nkoziris}@cslab.ece.ntua.gr

http://www.cslab.ece.ntua.gr

May 29, 2009

CSLab, National Technical University of Athens


Outline

1 Dijkstra’s Basics

2 Straightforward Parallelization Scheme

3 Helper-Threading Scheme

4 Experimental Evaluation

5 Conclusions

Anastopoulos et al. (NTUA) MTAAP’09 May 29, 2009 2 / 27


The Basics of Dijkstra’s Algorithm

SSSP Problem

Directed graph G = (V, E), weight function w : E → R+, source vertex s

∀v ∈ V : compute δ(v) = min{w(p) : p is a path from s to v}

Shortest path estimate d(v)

gradually converges to δ(v) through relaxations

relax(v, w): d(w) = min{d(w), d(v) + w(v, w)}
  - can we find a better path from s to w by going through v?

Three partitions of vertices

Settled: d(v) = δ(v)

Queued: d(v) > δ(v) and d(v) ≠ ∞

Unreached: d(v) = ∞
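The relaxation step on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the dictionaries `d` and `pred`, the function name `relax`, and the three-vertex example are assumptions made for the sketch.

```python
import math


def relax(d, pred, v, w, weight):
    """Try to improve the estimate d[w] by going through v.

    Returns True if d[w] was lowered, False otherwise.
    """
    candidate = d[v] + weight
    if candidate < d[w]:
        d[w] = candidate
        pred[w] = v
        return True
    return False


# Hypothetical example: source "s" and two other vertices.
d = {"s": 0, "a": math.inf, "b": math.inf}
pred = {"s": None, "a": None, "b": None}

relax(d, pred, "s", "a", 5)  # "a" moves from unreached to queued
relax(d, pred, "s", "b", 2)
relax(d, pred, "b", "a", 1)  # a better path s -> b -> a lowers d["a"]
```

Each successful call lowers one estimate, which is how d(v) converges towards δ(v).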


The Basics of Dijkstra’s Algorithm

Serial algorithm

Input: G = (V, E), w : E → R+, source vertex s, min-priority queue Q
Output: shortest distance array d, predecessor array π

foreach v ∈ V do
    d[v] ← ∞
    π[v] ← nil
    Insert(Q, v)
end
d[s] ← 0
while Q ≠ ∅ do
    u ← ExtractMin(Q)
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
        end
    end
end

[Figure: example weighted directed graph with vertices S, A, B, C, D, E]
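The serial algorithm can be tried out with the short Python sketch below. It is an illustration under stated assumptions rather than the authors' implementation: Python's `heapq` has no DecreaseKey, so the sketch pushes a new entry on every improvement and skips stale entries on extraction, and vertices enter the queue when first reached instead of all being inserted up front. The function name `dijkstra` and the graph `example_graph` are hypothetical.

```python
import heapq
import math


def dijkstra(graph, source):
    """Serial Dijkstra over graph = {vertex: [(neighbour, weight), ...]}."""
    d = {v: math.inf for v in graph}
    pred = {v: None for v in graph}
    d[source] = 0
    heap = [(0, source)]
    settled = set()
    while heap:
        dist_u, u = heapq.heappop(heap)  # plays the role of ExtractMin
        if u in settled:
            continue                     # stale entry left by an earlier push
        settled.add(u)
        for v, w in graph[u]:
            total = dist_u + w
            if d[v] > total:
                d[v] = total
                pred[v] = u
                heapq.heappush(heap, (total, v))  # stands in for DecreaseKey
    return d, pred


# Hypothetical four-vertex graph (not the graph drawn on the slide).
example_graph = {
    "s": [("a", 10), ("b", 5)],
    "a": [("c", 1)],
    "b": [("a", 3), ("c", 9)],
    "c": [],
}
dist, pred = dijkstra(example_graph, "s")
```

Here the shortest path to "c" runs through "b" and then "a", so both the estimate of "a" and that of "c" are lowered after their first relaxation.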


The Basics of Dijkstra’s Algorithm

[Figure: binary min-heap before and after a DecreaseKey on node i (key 17 → 6)]

Min-priority queue implemented as binary min-heap

maintains all but the settled vertices

min-heap property: ∀i : d(parent(i)) ≤ d(i)

amortizes the cost of multiple ExtractMin's and DecreaseKey's
  - O((|E| + |V|) log |V|) time complexity
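A binary min-heap that supports DecreaseKey needs to know where each vertex currently sits in the heap array. The sketch below shows one common way to do that with a position map. The class name `IndexedMinHeap` and its method names are assumptions for illustration, not taken from the slides.

```python
class IndexedMinHeap:
    """Binary min-heap with a position map, so DecreaseKey can find an item."""

    def __init__(self):
        self.items = []  # item stored at each heap slot
        self.keys = []   # key stored at each heap slot
        self.pos = {}    # item -> current heap slot

    def __len__(self):
        return len(self.items)

    def _swap(self, i, j):
        self.items[i], self.items[j] = self.items[j], self.items[i]
        self.keys[i], self.keys[j] = self.keys[j], self.keys[i]
        self.pos[self.items[i]] = i
        self.pos[self.items[j]] = j

    def _sift_up(self, i):
        # Swap with the parent until the min-heap property holds again.
        while i > 0:
            parent = (i - 1) // 2
            if self.keys[parent] <= self.keys[i]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.items)
        while True:
            smallest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < n and self.keys[child] < self.keys[smallest]:
                    smallest = child
            if smallest == i:
                break
            self._swap(i, smallest)
            i = smallest

    def insert(self, item, key):
        self.items.append(item)
        self.keys.append(key)
        self.pos[item] = len(self.items) - 1
        self._sift_up(len(self.items) - 1)

    def extract_min(self):
        top_item, top_key = self.items[0], self.keys[0]
        self._swap(0, len(self.items) - 1)
        self.items.pop()
        self.keys.pop()
        del self.pos[top_item]
        if self.items:
            self._sift_down(0)
        return top_item, top_key

    def decrease_key(self, item, key):
        i = self.pos[item]
        if key > self.keys[i]:
            raise ValueError("new key is larger than the current key")
        self.keys[i] = key
        self._sift_up(i)


heap = IndexedMinHeap()
for name, key in [("a", 5), ("b", 7), ("c", 8), ("i", 17)]:
    heap.insert(name, key)
heap.decrease_key("i", 6)  # "i" moves up past "b"
order = [heap.extract_min() for _ in range(len(heap))]
```

The sequence of swaps performed by `_sift_up` is the part that later slides protect with locks or transactions.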


Straightforward Parallelization

Fine-grain parallelization at the inner loop level

Fine-Grain Multi-Threaded

/* Initialization phase same as in the serial code */
while Q ≠ ∅ do
    Barrier
    if tid = 0 then
        u ← ExtractMin(Q)
    Barrier
    for v adjacent to u in parallel do
        sum ← d[u] + w(u, v)
        if d[v] > sum then
            Begin-Atomic
            DecreaseKey(Q, v, sum)
            End-Atomic
            d[v] ← sum
            π[v] ← u
        end
    end
end

[Figure: example weighted directed graph with vertices S, A, B, C, D, E]

Issues:

speedup bounded by average out-degree

concurrent heap updates due to DecreaseKey's

barrier synchronization overhead
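The structure of the fine-grain multi-threaded scheme can be sketched with Python threads. This is only a structural illustration under several assumptions: CPython's global interpreter lock prevents any real speedup, a single lock around each update plays the role of Begin-Atomic/End-Atomic, the relaxation check is also kept under that lock for simplicity, and `heapq` with stale-entry skipping stands in for DecreaseKey. The names `fgmt_dijkstra`, `worker` and `example_graph` are hypothetical.

```python
import heapq
import math
import threading


def fgmt_dijkstra(graph, source, num_threads=3):
    d = {v: math.inf for v in graph}
    pred = {v: None for v in graph}
    d[source] = 0
    heap = [(0, source)]
    settled = set()
    heap_lock = threading.Lock()              # one lock for the whole heap
    barrier = threading.Barrier(num_threads)
    current = {"u": None}                     # vertex extracted in this round

    def worker(tid):
        while True:
            barrier.wait()                    # all relaxations of the last round are done
            if tid == 0:
                current["u"] = None
                while heap:                   # ExtractMin, skipping stale entries
                    _, u = heapq.heappop(heap)
                    if u not in settled:
                        settled.add(u)
                        current["u"] = u
                        break
            barrier.wait()                    # every thread now sees the same u
            u = current["u"]
            if u is None:                     # queue exhausted: all threads leave together
                return
            # Each thread relaxes its own slice of u's out-edges.
            for v, w in graph[u][tid::num_threads]:
                total = d[u] + w
                with heap_lock:
                    if d[v] > total:
                        d[v] = total
                        pred[v] = u
                        heapq.heappush(heap, (total, v))

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return d, pred


# Hypothetical four-vertex graph.
example_graph = {
    "s": [("a", 10), ("b", 5)],
    "a": [("c", 1)],
    "b": [("a", 3), ("c", 9)],
    "c": [],
}
dist, pred = fgmt_dijkstra(example_graph, "s")
```

With only two out-edges per vertex in this example, at most two of the three threads have work in any round, which illustrates why the speedup is bounded by the average out-degree.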


Concurrent Heap Updates with Locks

[Figure: concurrent DecreaseKey operations on nodes i, j and k of the binary min-heap]

Coarse-grain synchronization (cgs-lock)
  - enforces atomicity at the level of a DecreaseKey operation
  - one lock for the entire heap
  - serializes DecreaseKey's

Fine-grain synchronization (fgs-lock)
  - enforces atomicity at the level of a single swap
  - allows multiple swap sequences to execute in parallel as long as they are temporally non-overlapping
  - separate locks for each parent-child pair


Performance of FGMT with Locks

[Plot: multithreaded speedup vs. number of threads (2 to 16); series: cgs-lock, perfbar+cgs-lock, perfbar+fgs-lock]

Software barriers dominate total execution time

72% with 2 threads, 88% with 8

replace with idealized (simulated) zero-latency barriers

Fgs-lock scheme more scalable, but still fails to outperform serial

locking overhead (2 locks + 2 unlocks per swap)


Concurrent Heap Updates with TM

[Figure: concurrent DecreaseKey operations on nodes i, j and k of the binary min-heap]

TM-based

Coarse-grain synchronization (cgs-tm)
  - enclose DecreaseKey within a transaction
  - allows multiple swap sequences to execute in parallel as long as they are spatially (and temporally) non-overlapping
  - conflicting transaction stalls and retries or aborts

Fine-grain synchronization (fgs-tm)
  - enclose each swap operation within a transaction
  - atomicity as in fgs-lock
  - shorter but more transactions


Performance of FGMT with TM

[Plot: multithreaded speedup vs. number of threads (2 to 16); series: perfbar+cgs-lock, perfbar+fgs-lock, perfbar+cgs-tm, perfbar+fgs-tm]

TM-based schemes offer speedup up to ∼ 1.1

less overhead for cgs-tm, yet equally able to exploit available concurrency


Helper-Threading Scheme

Motivation

expose more parallelism to each thread

eliminate costly barrier synchronization

Rationale

in serial, relaxations are performed only from the extracted (settled) vertex

allow relaxations for out-edges of queued vertices, hoping that some of them might already be settled
  - main thread operates as in the serial algorithm
  - assign the next t vertices in the queue (x_2 ... x_{t+1}) to t helper threads
  - helper thread k relaxes all out-edges of vertex x_k

[Figure: example weighted directed graph with vertices S, A, B, C, D, E]

speculation on the status of d(x_k)
  - if already optimal, main thread will be offloaded
  - if not optimal, any suboptimal relaxations will be corrected eventually by main thread


Execution Pattern

[Figure: execution timelines for steps k, k+1 and k+2 under the Serial, FGMT (Threads 1 to 4) and Helper Threads (Main, Helpers 1 to 3) schemes. Phases shown: extract-min, relax edges, read tid-th min. In the Helper Threads timeline, helpers are killed at the end of each step.]

the main thread stops all helpers at the end of each iteration

unfinished work will be corrected, as with mis-speculated distances


Helper-Threading Scheme

Main thread

while Q ≠ ∅ do
    u ← ExtractMin(Q)
    done ← 0
    foreach v adjacent to u do
        sum ← d[u] + w(u, v)
        Begin-Xact
        if d[v] > sum then
            DecreaseKey(Q, v, sum)
            d[v] ← sum
            π[v] ← u
        End-Xact
    end
    Begin-Xact
    done ← 1
    End-Xact
end

Helper thread

while Q ≠ ∅ do
    while done = 1 do ;
    x ← ReadMin(Q, tid)
    stop ← 0
    foreach y adjacent to x and while stop = 0 do
        Begin-Xact
        if done = 0 then
            sum ← d[x] + w(x, y)
            if d[y] > sum then
                DecreaseKey(Q, y, sum)
                d[y] ← sum
                π[y] ← x
        else
            stop ← 1
        End-Xact
    end
end

for a single neighbour, the check for relaxation, updates to the heap, and updates to d, π arrays, are enclosed within a transaction
  - performed “all-or-none”
  - on a conflict, only one thread commits

interruption of helper threads implemented through TM, as well
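The interplay between the main thread and the helpers in the pseudocode above can be imitated in Python. The sketch rests on several assumptions and is not the authors' implementation: Python has no transactional memory, so a single lock (`xact`) stands in for Begin-Xact/End-Xact and therefore serializes what TM would run optimistically; the queue is a plain set scanned for its smallest entries instead of a heap, and the scan for ReadMin and the ExtractMin step are also placed under the lock to protect that set; and CPython's global interpreter lock rules out any real speedup. All names are hypothetical.

```python
import heapq
import math
import threading
import time


def helper_threaded_dijkstra(graph, source, num_helpers=2):
    d = {v: math.inf for v in graph}
    pred = {v: None for v in graph}
    d[source] = 0
    queue = set(graph)              # all vertices that are not yet settled
    xact = threading.Lock()         # stand-in for a transaction
    state = {"done": True}          # helpers wait until the first ExtractMin

    def relax(u, v, w):             # must be called with xact held
        total = d[u] + w
        if d[v] > total:
            d[v] = total
            pred[v] = u

    def helper(k):
        while queue:
            while state["done"] and queue:
                time.sleep(0)       # spin until the main thread starts a step
            with xact:              # ReadMin(Q, k): k-th smallest queued vertex
                nearest = heapq.nsmallest(k, queue, key=d.__getitem__)
            if len(nearest) < k:
                time.sleep(0)
                continue
            x = nearest[-1]
            for y, w in graph[x]:
                stop = False
                with xact:
                    if state["done"]:
                        stop = True  # main thread ended the step: abandon x
                    else:
                        relax(x, y, w)
                if stop:
                    break
                time.sleep(0)       # Python-specific: let other threads run

    helpers = [threading.Thread(target=helper, args=(k,))
               for k in range(1, num_helpers + 1)]
    for t in helpers:
        t.start()

    while queue:                    # main thread: as in the serial algorithm
        with xact:
            u = min(queue, key=d.__getitem__)
            queue.remove(u)
            state["done"] = False
        for v, w in graph[u]:
            with xact:
                relax(u, v, w)
        with xact:
            state["done"] = True

    for t in helpers:
        t.join()
    return d, pred


# Hypothetical four-vertex graph.
example_graph = {
    "s": [("a", 10), ("b", 5)],
    "a": [("c", 1)],
    "b": [("a", 3), ("c", 9)],
    "c": [],
}
dist, pred = helper_threaded_dijkstra(example_graph, "s")
```

A helper only ever lowers an estimate to the length of a real path, and the main thread still extracts every vertex and relaxes all of its out-edges, so the final distances match the serial result regardless of how far each helper got.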


Why with TM?

composable
  - all dependent atomic sub-operations composed into a large atomic operation, without limiting concurrency

optimistic

easily programmable


Experimental Setup

Full-system simulation

Simics 3.0.31 in conjunction with GEMS toolset 2.1

boots unmodified Solaris 10 (UltraSPARC III Cu)

LogTM (“Signature Edition”)

eager version management

eager conflict detection
  - on a conflict, a transaction stalls and either retries or aborts

HYBRID conflict resolution policy
  - favors older transactions

Hardware platform

single CMP system (configurations up to 32 cores)

private L1 caches (64KB), shared L2 cache (2MB)

Software

Pthreads for threading and synchronization

Simics “magic” instructions to simulate idealized barriers

Sun Studio 12 C compiler (-xO3)


Graphs

Three graph families

Random: G (n,m) model

SSCA#2: cliques with varying size (1 – C), connected with probability P

R-MAT: power-law degree distributions

GTgraph graph generator

Fixed #nodes (10K), varying density

sparse (∼10K edges)

medium (∼100K edges)

dense (∼200K edges)
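To make the densities concrete: with 10K nodes, roughly 10K, 100K and 200K edges correspond to average out-degrees of about 1, 10 and 20. The sketch below generates a small random directed graph in the spirit of the G(n, m) model and computes that average. It is not GTgraph's generator. The function names, the weight range, the seed handling and the exclusion of self-loops and duplicate edges are all assumptions made for the illustration.

```python
import random


def gnm_random_graph(n, m, max_weight=100, seed=0):
    """Directed graph with n vertices and m distinct edges chosen at random."""
    if m > n * (n - 1):
        raise ValueError("too many edges for a simple directed graph")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:                       # no self-loops in this sketch
            edges.add((u, v))
    graph = {v: [] for v in range(n)}
    for u, v in sorted(edges):
        graph[u].append((v, rng.randint(1, max_weight)))
    return graph


def average_out_degree(graph):
    return sum(len(out_edges) for out_edges in graph.values()) / len(graph)


# Scaled-down stand-in for the "medium" density (out-degree about 10).
small = gnm_random_graph(n=100, m=1000)
avg = average_out_degree(small)
```

The average out-degree is simply m / n, which is why denser graphs give each relaxation round more work to share.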


Speedups

[Plots: speedup vs. number of threads (2 to 16); series: perfbar+cgs-lock, perfbar+cgs-tm, perfbar+fgs-lock, perfbar+fgs-tm, helper; panels: ssca2-10000x28351, ssca2-10000x118853, rmat-10000x200000, rmat-10000x10000, rand-10000x100000, rand-10000x200000]

Helper-Threading

speedups in 6 out of 9 cases (not all shown), up to 1.46

performance improves with increasing density

main thread not obstructed by helpers (<1% abort rate in all cases)

FGMT with TM

speedups only with perfect barriers

optimistic parallelism does exist in concurrent queue updates


Conclusions

FGMT

conventional synchronization mechanisms incur unacceptable overhead

TM reduces overheads and highlights the existence of parallelism, but still requires very efficient barriers to offer some speedup

HT with TM

exposes more parallelism and eliminates barrier synchronization

noteworthy speedups with minimal code extensions

Future work

more aggressive parallelization schemes

dynamic adaptation of helper threads to algorithm’s execution phases

explore impact of TM characteristics

applicability of HT on other SSSP algorithms (∆-stepping, Bellman-Ford) and other similar (“greedy”) applications


Thank you! Questions?


BACKUP SLIDES


DecreaseKey

Input: min-priority queue Q, vertex u, new key value for vertex u

Q[u] ← value
i ← u
while (parent(i).key ≥ value) do
    Begin-Xact
    if swap(u, i) then
        i ← parent(i)
    End-Xact
end


Distribution of DecreaseKey’s between main and helper threads

[Plots: % DecreaseKey ops w.r.t. serial execution vs. number of threads (0 to 32); series: serial, HT-main, HT-helper, HT-all; panels: (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K]


Overall and main thread transaction abort rates

[Plots: abort rate vs. number of threads (2 to 32); series: overall, main thread; panels: (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K]


Breakdown of main thread’s total cycles

[Plots: main thread cycle breakdown vs. number of threads (2 to 32); series: total cycles, non-xact cycles, xact cycles, xact overhead cycles; panels: (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K]

Non-transactional (non-xact), transactional (xact) and overhead (xact-overhead)


Percentage of useful (xact) and wasted (xact-overhead) transactional cycles

[Plots: % cycle breakdown for average transaction vs. number of threads (2 to 32); series: xact cycles, xact overhead cycles; panels: (a) 10Kx10K, (b) 10Kx200K, (c) 10Kx1000K]


Write-set sizes


The timeline of execution

[Plot: cycles of main thread (per 100 iterations) vs. outer loop iterations (0 to 10000); series: serial, HT-2thr, HT-4thr, HT-best]
