Scalability comparison: Traditional fork-join-based parallelism vs. Goroutines Porting the Barcelona OpenMP Tasks Suite to Go Artjom Simon https://github.com/artjomsimon/go-bots Know Your Gophers 2015-05-12

Scalability comparison: Traditional fork-join-based parallelism vs. Goroutines

Porting the Barcelona OpenMP Tasks Suite to Go

Artjom Simon
https://github.com/artjomsimon/go-bots

Know Your Gophers

2015-05-12

Traditional approach in C

Cilk:

    cilk_spawn task();
    [...]
    cilk_sync;

OpenMP:

    #pragma omp parallel
    {
        #pragma omp task
        [...]
        #pragma omp taskwait
        [...]
    }

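Go: Parallel For Loop Pattern¹

    queue := make(chan int)
    done := make(chan bool)
    NP := runtime.GOMAXPROCS(0) // 0 only queries the current setting

    // Producer: send every loop index, then close the queue.
    go func() {
        for i := 0; i < n; i++ { queue <- i }
        close(queue)
    }()

    // NP workers: process indices until the queue is closed.
    for i := 0; i < NP; i++ {
        go func() {
            for i := range queue { work(i) }
            done <- true
        }()
    }

    // Join: wait for one signal per worker.
    for i := 0; i < NP; i++ { <-done }

¹ Benchmarking Usability and Performance of Multicore Languages, PDF: http://arxiv.org/pdf/1302.2837v2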
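The fragment above leaves `n` and `work` undefined. A self-contained sketch of the same pattern might look like the following; `parallelFor` and `sumSquares` are hypothetical names chosen for illustration and do not come from the talk or the cited paper.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// parallelFor calls work(i) for every i in [0, n), spreading the indices
// over one goroutine per available processor.
func parallelFor(n int, work func(i int)) {
	queue := make(chan int)
	done := make(chan bool)
	np := runtime.GOMAXPROCS(0) // argument 0 only queries the current setting

	// Producer: feed all loop indices into the queue, then close it.
	go func() {
		for i := 0; i < n; i++ {
			queue <- i
		}
		close(queue)
	}()

	// Workers: drain the queue until it is closed, then signal completion.
	for w := 0; w < np; w++ {
		go func() {
			for i := range queue {
				work(i)
			}
			done <- true
		}()
	}

	// Join: wait for one completion signal per worker.
	for w := 0; w < np; w++ {
		<-done
	}
}

// sumSquares is a hypothetical workload used only for illustration.
// The shared total is updated atomically because work runs concurrently.
func sumSquares(n int) int64 {
	var total int64
	parallelFor(n, func(i int) {
		atomic.AddInt64(&total, int64(i*i))
	})
	return total
}

func main() {
	fmt.Println(sumSquares(100)) // sum of i*i for i in 0..99
}
```

The unbuffered `queue` channel means each index is handed directly to whichever worker is free, so the load balances itself at the cost of one channel operation per iteration.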

Barcelona OpenMP Tasks Suite²

² https://github.com/alcides/bots

...used in academic publications³

³ http://www.sarc-ip.org/files/null/Workshop/1234128788173__TSchedStrat-iwomp08.pdf

Micro benchmarks

[Plot: spc (opteron). x-axis: OMP_NUM_THREADS (1, 8, 16, 32, 48); y-axis: speedup relative to sequential; series: n=1000µs, n=100µs, n=10µs]

Figure: Speedup spc (icc), 10 000 Tasks

Task pools: Variations

• notaskpool: Starts goroutines as needed, without any limit; uses a WaitGroup for synchronization

• simple-queue: A buffered channel of func()s holds the task queue. n goroutines receive the func()s and execute them

• goroutines-dispatcher: A dispatcher function that executes a task in a goroutine only if a global counter of running goroutines is < n

• const-goroutines: n goroutines remove tasks from a doubly linked list
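To make the differences concrete, here is a minimal sketch of three of these variants (notaskpool, simple-queue, const-goroutines). It is not the code from the go-bots repository: the function names, the fixed task slice, the mutex around the list, and the rule that workers exit once the list is empty are all assumptions made for illustration.

```go
package main

import (
	"container/list"
	"fmt"
	"sync"
	"sync/atomic"
)

// runNoTaskPool: one goroutine per task, joined with a WaitGroup.
func runNoTaskPool(tasks []func()) {
	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		go func(t func()) {
			defer wg.Done()
			t()
		}(t)
	}
	wg.Wait()
}

// runSimpleQueue: n workers receive func()s from a buffered channel.
func runSimpleQueue(tasks []func(), n int) {
	queue := make(chan func(), len(tasks))
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range queue {
				t()
			}
		}()
	}
	for _, t := range tasks {
		queue <- t
	}
	close(queue) // lets the workers' range loops end
	wg.Wait()
}

// runConstGoroutines: n workers remove tasks from a doubly linked list.
// Assumption: the list is filled up front and guarded by a mutex.
func runConstGoroutines(tasks []func(), n int) {
	pending := list.New()
	for _, t := range tasks {
		pending.PushBack(t)
	}
	var mu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				mu.Lock()
				front := pending.Front()
				if front == nil {
					mu.Unlock()
					return // list drained, worker exits
				}
				t := pending.Remove(front).(func())
				mu.Unlock()
				t()
			}
		}()
	}
	wg.Wait()
}

// runAll is a hypothetical driver: it runs k counting tasks through each
// variant and returns how many task executions were observed in total.
func runAll(k, n int) int64 {
	var counter int64
	tasks := make([]func(), k)
	for i := range tasks {
		tasks[i] = func() { atomic.AddInt64(&counter, 1) }
	}
	runNoTaskPool(tasks)
	runSimpleQueue(tasks, n)
	runConstGoroutines(tasks, n)
	return atomic.LoadInt64(&counter)
}

func main() {
	fmt.Println(runAll(1000, 4)) // 3 variants x 1000 tasks
}
```

The goroutines-dispatcher variant is left out because the slide does not say what happens to a task when the counter has already reached n.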

Micro benchmarks

[Plot: spc (opteron). x-axis: OMP_NUM_THREADS (1, 8, 16, 32, 48); y-axis: speedup relative to sequential; series: gcc, icc, clang, go-notaskpool, go-simple-queue, go-const-goroutines, go-goroutine-dispatch]

Figure: Speedup spc, n=100µs, 10 000 Tasks

BOTS: nqueens

• N-Queens problem with n=12

• Recursive backtracking search

• No cut-off when creating tasks
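A recursive backtracking search that creates a task at every branch, with no cut-off, could be sketched in Go as below. This is an illustration in the notaskpool style and not the ported BOTS code; `safe`, `place`, and `countQueens` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// safe reports whether a queen in the next row at column col is not
// attacked by the queens already placed (cols[r] is the column in row r).
func safe(cols []int, col int) bool {
	row := len(cols)
	for r, c := range cols {
		if c == col || row-r == col-c || row-r == c-col {
			return false
		}
	}
	return true
}

// place extends a partial placement by one row and spawns a goroutine
// for every safe column, without any cut-off depth.
func place(n int, cols []int, count *int64, wg *sync.WaitGroup) {
	defer wg.Done()
	if len(cols) == n {
		atomic.AddInt64(count, 1)
		return
	}
	for col := 0; col < n; col++ {
		if safe(cols, col) {
			// Each child gets its own copy of the board state.
			next := make([]int, len(cols)+1)
			copy(next, cols)
			next[len(cols)] = col
			wg.Add(1) // registered before the parent's deferred Done runs
			go place(n, next, count, wg)
		}
	}
}

// countQueens returns the number of ways to place n non-attacking queens.
func countQueens(n int) int64 {
	var count int64
	var wg sync.WaitGroup
	wg.Add(1)
	go place(n, nil, &count, &wg)
	wg.Wait()
	return count
}

func main() {
	fmt.Println(countQueens(8)) // 92 solutions on an 8x8 board
}
```

Because every branch becomes a goroutine, the number of tasks grows with the size of the search tree, which is what makes this benchmark a stress test for task creation overhead.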

Results: BOTS (nqueens)

[Plot: nqueens (opteron). x-axis: CPU cores (1, 8, 16, 32, 48); y-axis: speedup (0 to 10); series: gcc, icc, clang, go-const-goroutines, go-dispatch, go-notaskpool, gccgo-const-goroutines, gccgo-dispatch, gccgo-notaskpool]

Figure: Speedup for nqueens -n 12, parallel

BOTS: sparselu

• LU factorization of a sparse block matrix

• 50x50 matrix, 100x100 sub-block matrices

Results: BOTS (sparselu)

[Plot: sparselu (opteron). x-axis: CPU cores (1, 8, 16, 32, 48); y-axis: speedup (0 to 30); series: gcc, icc, clang, go-const-goroutines, go-dispatch, go-notaskpool, go-simplequeue, gccgo-const-goroutines, gccgo-dispatch, gccgo-notaskpool]

Figure: Speedup sparselu -n 50 -m 100, parallel

Problem: Recursion (dependencies!)

Memory

[Bar chart: spc-par 10000 1000, 4 Threads (opteron). y-axis: RSS [Kbytes], 0 to 1.5·10^5; series: gcc, icc, clang, go-notaskpool, go-const-goroutines, gccgo-notaskpool, gccgo-const-goroutines]

Figure: Memory comparison (Resident Set Size), spc parallel

Side effect: Possible heap corruption bug in Go 1.4?

Questions?

Thank you!

Image credits

Icon N-Queens problem: Colin M.L. Burnett, Wikimedia Commons (GFDL & BSD & GPL), http://commons.wikimedia.org/wiki/File:Chess_d45.svg (2015-03-09)