
MCSTL: Multi-Core Standard Template Library

Practical Implementation of Parallel Algorithms for Shared-Memory Systems

Peter Sanders, Johannes Singler

Institute for Theoretical Computer Science, University of Karlsruhe

December 13th, 2006

Peter Sanders, Johannes Singler MCSTL - Practical Parallelism


Lecture Contents

Introduction

Platform Support

Algorithms

Conclusion


What is this Lecture About?

Theory vs. Practice
- machine model vs. concrete machine(s)
- pseudo-code vs. existing C++ library

Communication Network vs. Shared Memory
- implicit communication
- cache hierarchy, NUMA, bandwidth sharing

Synchronous PRAM vs. Asynchronous PEs
- synchronization a problem itself
- n = p vs. n ≫ p
- core allocation not static, other processes interfere


Why Multi-Cores?

- easy use of high transistor budget
- energy efficient (at reduced clock speeds)
- increase in clock speed: largely exhausted
- instruction level parallelism: exhausted
- SIMD/Vector: only for special applications

Multi-cores will be everywhere: mobile devices ... super computers


Hardware Nowadays

- dual-cores omnipresent
- mainstream quad-core available
- Sun T1: 8 cores, 32 threads
- high-end shared-memory servers with many more cores (on multiple chips)


Programming Multicores

- automatic parallelization? only for simple loops
- explicitly parallel? too complicated for everyday use
- libraries of parallelized algorithms!

natural starting point: standard libraries of programming languages


Basic Approach

Make Using Parallel Algorithms “as easy as winking”.
Functionality of the C++ Standard Template Library

Why STL?
- many efficient and useful algorithms included
- simple interface, very well-known among developers
- template mechanism is known to allow low-overhead algorithm libraries
- recompilation of existing programs may suffice
- C++ accepted and efficient language


Goals

- parallelize all time-consuming STL algorithms
- speedup already for small inputs (scale down)
- high speedup for medium/large inputs
- dynamically choose algorithms and tuning parameters
- coexist with other forms of parallelization: load balancing even for regular computations


Special Requirements for a Library

Generality
- genericity (templates)
- only few assumptions about input data types
- good scalability in terms of use cases

Compatibility to
- existing libraries
- platforms


Layers

[Figure: layer stack, top to bottom: Application, MCSTL, Threading Support, Hardware.]


Threading Support

- OpenMP: currently used (basic primitives)
  - example:

        #pragma omp parallel num_threads(p)
        {
            iam = omp_get_thread_num();
            ...
            #pragma omp barrier/single/master
            ...
        }

  - quite elegant
  - no permanent separation possible
  - still works when compiler ignores pragmas
  - growing compiler support (gcc, Sun, Intel, MS)
- atomic operations
  - fetch-and-add
  - compare-and-swap
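The OpenMP pattern on this slide can be fleshed out as follows. This is a minimal sketch, not MCSTL code: the function `sum_in_parallel` and all variable names are made up for illustration, and the `_OPENMP` guard is one possible way to keep the program valid when the compiler ignores the pragmas.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: sums `data` with a team of up to p threads.
long sum_in_parallel(const std::vector<long>& data, int p) {
    assert(p >= 1);
    std::vector<long> partial(p, 0);  // one result slot per thread
    int team = 1;                     // actual team size; stays 1 without OpenMP

    #pragma omp parallel num_threads(p)
    {
        int iam = 0;
#ifdef _OPENMP
        iam = omp_get_thread_num();
        #pragma omp single
        team = omp_get_num_threads();  // implicit barrier at the end of single
#endif
        // Thread iam handles elements iam, iam + team, iam + 2 * team, ...
        long local = 0;
        for (std::size_t j = iam; j < data.size(); j += team)
            local += data[j];
        partial[iam] = local;  // a single write per thread to the shared array
    }
    long total = 0;
    for (long s : partial) total += s;
    return total;
}
```

Compiled without OpenMP support, the region runs once with `iam = 0` and `team = 1`, so the function degrades to a plain sequential sum.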


Implemented Algorithms

- find, find_if, mismatch, ...
- partial_sum (prefix sum)
- partition
- nth_element/partial_sort
- merge
- sort, stable_sort
- random_shuffle
- embarrassingly parallel (for_each, transform, ...)

≥ 50 % of STL

Extension to STL
- multiway_merge


Dependency Graph of Lecture Contents

[Figure: MCSTL overview as a dependency graph with three layers: Platform Support, Sequential Helper Algorithms, Parallel Algorithms. Nodes: for_each etc., find etc., partial_sum, partition, nth_element, partial_sort, sort (quick), sort (mwms), random_shuffle, (multiway_)merge, multi-sequence selection, exact splitting, tournament trees, lock-free DS, atomic operations. Other labels: load-balancing, fine-grained communication, consistency, placement, init, initial splits.]


Shared-Memory Hardware

- cache coherency protocol makes memory view consistent, introduces implicit communication
  - cores invalidate entries in cache when other core writes (snooping)
  - overhead only for actual transfer of data
  - granularity is one cache line: avoid false sharing!
- “cache level 0” = registers exempted, variable values not updated in memory (from other core’s point of view)
  - declare variable type volatile (once per variable)
  - #pragma omp flush(variable) when an update is suspected (once per update)
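One common way to follow the "avoid false sharing" advice is to give each thread's data its own cache line. The sketch below assumes 64-byte cache lines (typical for current x86 CPUs, but an assumption here); `PackedCounter`, `PaddedCounter` and `lines_touched` are illustrative names, not part of MCSTL.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Counters stored back to back: several of them share one cache line,
// so writes by different cores keep invalidating each other's copy.
struct PackedCounter { long value = 0; };

// Each counter is aligned to, and therefore occupies, a full cache line.
struct alignas(kCacheLine) PaddedCounter { long value = 0; };

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned");

// Number of cache lines covered by an array of `count` objects of size
// `object_size`, assuming the array starts on a line boundary.
constexpr std::size_t lines_touched(std::size_t count, std::size_t object_size) {
    return (count * object_size + kCacheLine - 1) / kCacheLine;
}
```

With eight threads, an array of `PackedCounter` fits into a single line that all cores fight over, while an array of `PaddedCounter` spreads over eight lines at the price of unused padding.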


Atomic Operations

a few operations are executed atomically, i.e. without any chance of interference

- fetch_and_add(x, i)
  - t := x; r := x; r := r + i; x := r; return t;
  - allows concurrent iteration over sequence
- compare_and_swap(x, c, r)
  - if (x = c) { x := r; return c [true]; } else return r [false];
  - secure state transition, can emulate fetch_and_add and others by using it in a loop
- slower than usual operations, in particular when concurrent
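Both statements on this slide, emulating fetch-and-add with compare-and-swap in a loop and using fetch-and-add for concurrent iteration, can be sketched with `std::atomic` from C++11, which postdates these slides. The function names are hypothetical, and `std::thread` stands in for the OpenMP threads used by the library.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// fetch_and_add emulated by compare-and-swap in a loop.
long fetch_and_add_via_cas(std::atomic<long>& x, long i) {
    long t = x.load();
    // On failure, compare_exchange_weak stores the current value of x in t,
    // so the next attempt works with fresh data.
    while (!x.compare_exchange_weak(t, t + i)) {
    }
    return t;  // value before the addition
}

// Concurrent iteration: p threads claim the indices 0 .. n-1 one at a time.
// Returns how often each index was visited (expected: exactly once).
std::vector<int> visit_all(long n, int p) {
    assert(n >= 0 && p >= 1);
    std::vector<int> visits(n, 0);
    std::atomic<long> next(0);
    std::vector<std::thread> team;
    for (int t = 0; t < p; ++t)
        team.emplace_back([&] {
            for (long j = fetch_and_add_via_cas(next, 1); j < n;
                 j = fetch_and_add_via_cas(next, 1))
                ++visits[j];  // index j was handed to this thread only
        });
    for (std::thread& th : team) th.join();
    return visits;
}
```

Claiming one element per atomic operation is the worst case for the "slower than usual" remark; claiming whole blocks amortizes the cost.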


find, find_if, mismatch, ...

find the first position in a sequence satisfying a predicate

Analysis
- O(n) sequential time if first hit is at position n (unknown)
- naive parallel algorithm needs Ω(m/p)
- parallelization not worthwhile for small n

[Figure: sequence of length m with the first hit at position n.]


find: Algorithm

- start sequentially up to position m_0
- dynamic load balancing using fetch-and-add
- scale up and down using geometrically growing block sizes
- first successful thread grabs remaining work

[Figure: the sequence is searched sequentially up to m_0; beyond m_0, blocks of growing size are assigned to p0, p1, p2, p3, ...]
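The ingredients listed above can be combined as in the following sketch. It is an illustration under several assumptions, not the MCSTL implementation: `find_first` and its parameters are hypothetical, block sizes simply double, `std::thread` replaces OpenMP, and instead of letting the first successful thread grab the remaining work, the smallest hit is kept in an atomic minimum.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Returns the index of the first element satisfying pred, or data.size().
// pred must be safe to call from several threads at once.
template <typename T, typename Pred>
long long find_first(const std::vector<T>& data, Pred pred, int p,
                     long long m0 = 1000, long long first_block = 1000) {
    assert(p >= 1 && m0 >= 0 && first_block >= 1);
    const long long n = static_cast<long long>(data.size());
    // Sequential start: short searches never pay for thread creation.
    const long long seq_end = std::min(m0, n);
    for (long long j = 0; j < seq_end; ++j)
        if (pred(data[j])) return j;
    if (seq_end == n) return n;

    std::atomic<long long> best(n);        // smallest hit position so far
    std::atomic<long long> next_block(0);  // block counter for fetch-and-add
    auto worker = [&] {
        for (;;) {
            const long long b = next_block.fetch_add(1);
            // Block b starts at seq_end + first_block * (2^b - 1).
            // Stop, without overflowing, if that start lies past the end.
            if (b >= 62 || ((1LL << b) - 1) > (n - seq_end) / first_block) return;
            const long long lo = seq_end + first_block * ((1LL << b) - 1);
            if (lo >= best.load()) return;  // nothing left before the best hit
            const long long hi = std::min(lo + first_block * (1LL << b), n);
            for (long long j = lo; j < hi && j < best.load(); ++j)
                if (pred(data[j])) {
                    long long cur = best.load();
                    // Atomic minimum, emulated with compare-and-swap.
                    while (j < cur && !best.compare_exchange_weak(cur, j)) {
                    }
                    break;
                }
        }
    };
    std::vector<std::thread> team;
    for (int t = 0; t < p; ++t) team.emplace_back(worker);
    for (std::thread& th : team) th.join();
    return best.load();
}
```

Because blocks are handed out in increasing order and a thread only stops at a block that starts at or after the current best hit, every position before the final result is inspected by some thread.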


[Figure: Find n in the sequence [1,...,10^8] of integers on 4-way Opteron. Speedup (0 to 4) versus position of found element (1 to 10^8). Curves: sequential; MCSTL find with 2, 3, 4 threads; naive parallel with 2, 3, 4 threads.]


partial_sum

Distinction from Algorithms Seen so Far
- n ≫ p: multiple elements per PE, sum must be calculated in preprocessing step, prefix sum in postprocessing step
- 2n + O(1) additions in total, not optimal, speedup only p/2, particularly bad for small p
- O(log p) communication steps
- shared-memory advantage: can split data arbitrarily

Practical Algorithm for Shared Memory
- divide input into p + 1 pieces
- double calculation for first part can be avoided
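The two speedup figures can be checked with a short calculation. It assumes unit cost per addition and ignores lower-order synchronization terms; the names T_seq, T_theory and T_practice are introduced only for this illustration.

```latex
\begin{align*}
T_{\text{seq}} &\approx n \\
T_{\text{theory}} &\approx \underbrace{\tfrac{n}{p}}_{\text{local sums}} + O(\log p)
  + \underbrace{\tfrac{n}{p}}_{\text{local prefix sums}}
  &&\Rightarrow\quad \frac{T_{\text{seq}}}{T_{\text{theory}}} \approx \frac{p}{2} \\
T_{\text{practice}} &\approx 2 \cdot \frac{n}{p+1} + O(p)
  &&\Rightarrow\quad \frac{T_{\text{seq}}}{T_{\text{practice}}} \approx \frac{p+1}{2}
\end{align*}
```

With p + 1 pieces, every PE still touches two pieces, but each piece has only n/(p+1) elements. For p = 2 this means a speedup of about 1.5 instead of 1.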


partial_sum: Algorithm

Processor i ∈ 0 ... p − 1

1. i = 0: compute partial sums of part 0, S[0] := last one
   i > 0: compute S[i] := sum of part i
2. i = 0: compute partial sums of S[i] sequentially
3. i ≥ 0: compute partial sums of part i + 1 using S[i]

Analysis
- only 3 synchronizations (constant)
- time complexity O(n/p + p), no hidden factor 2
- speedup (p+1)/2 for n ≫ p
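The three steps can be written down directly. In this sketch the function name `partial_sum_p` and the way the part boundaries are computed are assumptions; the loops over i carry OpenMP pragmas, so they run in parallel when OpenMP is enabled and sequentially otherwise.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// In-place inclusive prefix sums of `a`, organised for p processors.
void partial_sum_p(std::vector<long>& a, int p) {
    assert(p >= 1);
    const std::size_t n = a.size();
    // Boundaries of the p + 1 parts: part j is [bound[j], bound[j + 1]).
    std::vector<std::size_t> bound(p + 2);
    for (int j = 0; j <= p + 1; ++j) bound[j] = n * j / (p + 1);
    std::vector<long> S(p, 0);

    // Step 1: processor 0 finishes part 0; processor i > 0 only sums part i.
    #pragma omp parallel for num_threads(p)
    for (int i = 0; i < p; ++i) {
        if (i == 0) {
            std::partial_sum(a.begin(), a.begin() + bound[1], a.begin());
            S[0] = bound[1] > 0 ? a[bound[1] - 1] : 0;
        } else {
            S[i] = std::accumulate(a.begin() + bound[i],
                                   a.begin() + bound[i + 1], 0L);
        }
    }

    // Step 2: sequential prefix sums over the p block sums.
    std::partial_sum(S.begin(), S.end(), S.begin());

    // Step 3: processor i computes the prefix sums of part i + 1,
    // starting from S[i], the sum of all earlier parts.
    #pragma omp parallel for num_threads(p)
    for (int i = 0; i < p; ++i) {
        long run = S[i];
        for (std::size_t j = bound[i + 1]; j < bound[i + 2]; ++j) {
            run += a[j];
            a[j] = run;
        }
    }
}
```

Part 0 is only touched in step 1 and part p only in step 3, which is where the saving over the two-pass scheme with p pieces comes from.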


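partial_sum: Scheme

[Figure: the three steps on the input for p = 3: p0, p1, p2 work in parallel in step 1, p0 alone in step 2, and p0, p1, p2 in parallel again in step 3.]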


partial_sum: Results

[Figure: Prefix sum of integers on Sun T1. Speedup (0 to 10) versus n (10000 to 10^8). Curves: sequential; 1, 2, 3, 4, 8, 16, 32 threads.]


partition

[Figure: elements < pivot on the left, elements > pivot on the right.]

Sequential Algorithm
- scan from both ends
- swap elements into the desired order when they are on the wrong sides
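The sequential scheme of scanning from both ends and swapping misplaced pairs looks like this in code. The function name and the choice of `int` elements are assumptions made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Moves all elements < pivot in front of all elements >= pivot.
// Returns the index of the first element of the second group.
std::size_t partition_by_pivot(std::vector<int>& a, int pivot) {
    std::size_t left = 0, right = a.size();
    for (;;) {
        // Skip elements that are already on the correct side.
        while (left < right && a[left] < pivot) ++left;
        while (left < right && !(a[right - 1] < pivot)) --right;
        if (left == right) return left;
        // a[left] >= pivot and a[right - 1] < pivot: both on the wrong side.
        std::swap(a[left], a[right - 1]);
        ++left;
        --right;
    }
}
```

Every element is inspected once and each swap fixes two elements, so the running time is linear.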


Parallel Partitioning [Tsigas Zhang 2003]

1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the “middle”
3. recurse on the middle

[Figure: p0, p1, p2 scan blocks of the input from both ends; unfinished blocks are swapped in parallel; the rest is handled recursively or sequentially.]

- time complexity O(n/p + B log p)


partition: Example

3 processors (p0, p1, p2), B = 3, pivot 50, no special cases

[Figure: successive states of the example array during parallel partitioning.]


[Figure: Partitioning of 32-bit integers on Sun T1. Speedup (0 to 16) versus n (10000 to 10^8). Curves: sequential; 1, 2, 3, 4, 8, 16, 32 threads.]


nth_element, partial_sort, quicksort

[Figure: nth_element places the element of rank n with the smaller elements before it; partial_sort additionally sorts the part before rank n.]

Algorithms
- nth_element: quickselect, linear recursion using partition
- partial_sort: nth_element, then sort
- quicksort: recursion using partition, load balancing using work stealing

Parallel implementations profit from each other
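How nth_element and partial_sort build on partition can be shown sequentially. The sketch below uses `std::partition` where a parallel library would plug in its parallel partition; the names `quickselect` and `partial_sort_prefix`, the middle-element pivot and the three-way split are choices made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Rearranges `a` so that a[rank] holds the element of that rank in sorted
// order, with nothing larger before it and nothing smaller after it.
void quickselect(std::vector<int>& a, std::size_t rank) {
    assert(rank < a.size());
    auto first = a.begin(), last = a.end();
    const auto target = a.begin() + rank;
    while (last - first > 1) {
        const int pivot = *(first + (last - first) / 2);
        // Three-way split: [first, mid1) < pivot, [mid1, mid2) == pivot,
        // [mid2, last) > pivot.
        auto mid1 = std::partition(first, last,
                                   [pivot](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, last,
                                   [pivot](int x) { return !(pivot < x); });
        if (target < mid1) last = mid1;         // continue on the left only
        else if (target >= mid2) first = mid2;  // continue on the right only
        else return;                            // target is one of the pivots
    }
}

// partial_sort as "nth_element, then sort": sorts the `count` smallest
// elements into the front of `a`.
void partial_sort_prefix(std::vector<int>& a, std::size_t count) {
    assert(count <= a.size());
    if (count == 0) return;
    quickselect(a, count - 1);
    std::sort(a.begin(), a.begin() + count);
}
```

Unlike quicksort, quickselect continues on one side only (linear recursion), so replacing the two calls to `std::partition` is the only place where parallelism enters.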


Multi-Sequence Selection

Problem Definition
find element with global rank r in k sorted sequences S_i

Usage
split at elements with global rank n/p, 2n/p, 3n/p, ..., (p − 1)n/p and redistribute elements: sequences of the same length (±1) on each PE
- guaranteed even for many equal elements

Solution
[Varman et al. 1991], see next slide


Multi-Sequence Selection: Algorithm

Idea
- partition into two sets with desired ratio (corresponds to rank)
- start with middle element
- refine partition by recursively adding the elements in the middle of both sides, taking only O(k) time for each step
- running time O(k log |S_i|); O(k log k log |S_i|) for the practical variant
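To make the problem statement concrete, here is a deliberately simpler stand-in, not the algorithm of Varman et al.: it bisects on the element value and uses binary search inside each sequence. It only works for integer keys and is slower than the bound on this slide, but it produces the same kind of result, a split position per sequence, and it handles equal elements. All names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Seq = std::vector<int>;

// Returns split[i] such that the prefixes S[i][0 .. split[i]) together hold
// exactly r elements and no prefix element is larger than a remaining one.
std::vector<std::size_t> multiseq_split(const std::vector<Seq>& S, std::size_t r) {
    std::size_t total = 0;
    long long lo = 0, hi = 0;
    bool seen = false;
    for (const Seq& s : S) {
        total += s.size();
        if (s.empty()) continue;
        lo = seen ? std::min<long long>(lo, s.front()) : s.front();
        hi = seen ? std::max<long long>(hi, s.back()) : s.back();
        seen = true;
    }
    assert(r <= total);
    std::vector<std::size_t> split(S.size(), 0);
    if (r == 0) return split;

    // Bisect for the smallest value v with at least r elements <= v.
    while (lo < hi) {
        const long long mid = lo + (hi - lo) / 2;
        std::size_t at_most = 0;
        for (const Seq& s : S)
            at_most += std::upper_bound(s.begin(), s.end(), mid) - s.begin();
        if (at_most >= r) hi = mid; else lo = mid + 1;
    }
    const int v = static_cast<int>(lo);

    // Everything < v goes left; elements equal to v are handed out in
    // sequence order until exactly r elements are on the left.
    std::size_t taken = 0;
    for (std::size_t i = 0; i < S.size(); ++i) {
        split[i] = std::lower_bound(S[i].begin(), S[i].end(), v) - S[i].begin();
        taken += split[i];
    }
    for (std::size_t i = 0; i < S.size() && taken < r; ++i) {
        const std::size_t ties =
            (std::upper_bound(S[i].begin(), S[i].end(), v) - S[i].begin()) - split[i];
        const std::size_t extra = std::min(ties, r - taken);
        split[i] += extra;
        taken += extra;
    }
    return split;
}
```

Returning a partition instead of a single element is what makes the split exact when many elements are equal: ties are distributed so that the left side has exactly r elements.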


Multi-Sequence Selection: Example

k = 4, N = k · n = 4 · 7 = 28; select global rank 14

1 2 6 7 9 11 15

2 8 9 17 23 24 25

6 7 9 12 23 24 25

3 8 10 13 14 17 19

Middle elements: 7, 17, 12, 13


Multi-Sequence Selection: Example (continued)

k = 4, N = k · n = 4 · 7 = 28; select global rank 14

Middle element of each sequence plus the middle elements of both sides:

2 7 11

8 17 24

7 12 24

8 13 17


Multi-Sequence Selection: Remarks

Implementation Problems
- non-uniform length, length not equal to 2^i − 1: “conceptual padding”, running time ∼ log max_i |S_i|
- finding ranks ≠ (1/2) Σ_i |S_i|, short sequences: complicated special cases at ends of sequences
- equal elements: find partition directly, not element with specified global rank


Sequential multiway_merge

Problem Definition
merge k sorted sequences into one sorted sequence

Solution
use a tournament tree, usually implemented as loser tree
- binary tree in array
- optimal O(log k) running time per merge step
- efficient computation of indices
- downside: tricky without sentinels and/or k not being a power of 2
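A loser tree stored in an array can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the function name is made up, k is rounded up to a power of two with padding sources, and an "exhausted" test plays the role of the sentinels mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Merges k sorted sequences with a loser tree stored in an array.
std::vector<int> multiway_merge(const std::vector<std::vector<int>>& seqs) {
    const std::size_t k = seqs.size();
    std::size_t K = 1;                   // k rounded up to a power of two
    while (K < k) K *= 2;
    std::vector<std::size_t> pos(K, 0);  // read position per source
    std::size_t total = 0;
    for (const auto& s : seqs) total += s.size();

    // Sources >= k are padding; they count as exhausted, i.e. as +infinity.
    auto exhausted = [&](std::size_t i) {
        return i >= k || pos[i] == seqs[i].size();
    };
    // True if the head of source i must be output before the head of source j.
    auto beats = [&](std::size_t i, std::size_t j) {
        if (exhausted(i)) return false;
        if (exhausted(j)) return true;
        const int a = seqs[i][pos[i]], b = seqs[j][pos[j]];
        return a < b || (a == b && i < j);  // ties: lower index first
    };

    // Build: play all matches bottom-up; each inner node stores the loser.
    // Leaf K + i stands for source i; node / 2 is the parent of node.
    std::vector<std::size_t> tree(K, 0), winner(2 * K, 0);
    for (std::size_t i = 0; i < K; ++i) winner[K + i] = i;
    for (std::size_t node = K - 1; node >= 1; --node) {
        const std::size_t l = winner[2 * node], r = winner[2 * node + 1];
        const bool left_wins = beats(l, r);
        winner[node] = left_wins ? l : r;
        tree[node] = left_wins ? r : l;
    }
    tree[0] = winner[1];  // overall winner

    std::vector<int> out;
    out.reserve(total);
    for (std::size_t t = 0; t < total; ++t) {
        std::size_t cur = tree[0];
        out.push_back(seqs[cur][pos[cur]]);  // deleteMin
        ++pos[cur];                          // insertNext
        // Replay only the matches on the path from this leaf to the root.
        for (std::size_t node = (cur + K) / 2; node >= 1; node /= 2)
            if (beats(tree[node], cur)) std::swap(tree[node], cur);
        tree[0] = cur;
    }
    return out;
}
```

Each output element costs one comparison per tree level, which gives the O(log k) time per merge step; storing losers instead of winners means the replay never has to look at sibling nodes.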


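Loser Tree

[Figure: a loser tree over eight sequences with heads 6, 2, 7, 9, 1, 4, 7, 4, shown before and after one deleteMin+insertNext step. The winner 1 is output and replaced by the next element 3 of its sequence; after replaying the matches on the path to the root, 3 is stored as the loser at the root and 2 is the new winner.]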
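The loser tree described above can be sketched in C++ as follows. This is a minimal illustration, not the MCSTL implementation: the function name `loser_tree_merge`, the restriction to `int` and the array layout (matches in nodes 1..k-1, leaves in nodes k..2k-1) are assumptions made for the example. Exhausted sequences simply lose every match, which sidesteps sentinels, and the layout works for any k, not only powers of 2.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Merge k sorted sequences with a loser tree (illustrative sketch).
std::vector<int> loser_tree_merge(const std::vector<std::vector<int>>& seqs) {
    const std::size_t k = seqs.size();
    std::vector<int> out;
    if (k == 0) return out;

    std::vector<std::size_t> pos(k, 0);  // next unread position in each sequence
    // true if the head of sequence a beats the head of sequence b
    auto beats = [&](std::size_t a, std::size_t b) {
        if (pos[a] == seqs[a].size()) return false;  // exhausted sequences always lose
        if (pos[b] == seqs[b].size()) return true;
        const int x = seqs[a][pos[a]], y = seqs[b][pos[b]];
        return x < y || (x == y && a < b);           // ties: lower index first (stable)
    };

    // Nodes 1..k-1 are matches, nodes k..2k-1 are leaves (leaf k+i = sequence i).
    // tree[j] stores the loser of match j, tree[0] the overall winner.
    std::vector<std::size_t> tree(k), winner(2 * k);
    for (std::size_t i = 0; i < k; ++i) winner[k + i] = i;
    for (std::size_t j = k - 1; j >= 1; --j) {
        const std::size_t a = winner[2 * j], b = winner[2 * j + 1];
        const bool a_wins = beats(a, b);
        winner[j] = a_wins ? a : b;
        tree[j] = a_wins ? b : a;
    }
    tree[0] = winner[1];

    std::size_t total = 0;
    for (const auto& s : seqs) total += s.size();
    out.reserve(total);
    for (std::size_t step = 0; step < total; ++step) {
        std::size_t cur = tree[0];
        assert(pos[cur] < seqs[cur].size());
        out.push_back(seqs[cur][pos[cur]++]);        // deleteMin + insertNext
        // replay the matches on the path from the leaf to the root: O(log k)
        for (std::size_t j = (k + cur) / 2; j >= 1; j /= 2)
            if (beats(tree[j], cur)) std::swap(tree[j], cur);
        tree[0] = cur;
    }
    return out;
}
```

Only the stored loser has to be inspected at each node on the way up, which is why a loser tree needs one comparison per level.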


Parallel (multiway) merge

How to divide the problem?
- find slabs, i.e. consistent sets of sections from the sequences
- two possibilities:
  - (randomized) splitting by sampling
  - exact splitting into parts of equal size (using multi-sequence selection)

[Figure: k sorted sequences cut into slabs, one slab for each of the processors p0, p1, p2, p3.]
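Exact splitting can be illustrated with a small multi-sequence partition routine. The sketch below is an assumption-laden simplification, not the algorithm analysed in the lecture: it binary-searches each sequence for the element of global rank r and is therefore slower than the stated bound. The names `multiseq_partition` and `Seq` are hypothetical. Ties are broken by (sequence index, position), so equal elements are handled by finding the partition directly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Seq = std::vector<int>;

// Returns split positions s[i] with sum(s) == r such that every element left of
// a split is <= every element right of a split (illustrative sketch).
std::vector<std::size_t> multiseq_partition(const std::vector<Seq>& seqs, std::size_t r) {
    const std::size_t k = seqs.size();
    std::size_t n = 0;
    for (const Seq& s : seqs) n += s.size();
    assert(r <= n);

    std::vector<std::size_t> split(k);
    if (r == n) {  // everything belongs to the left part
        for (std::size_t t = 0; t < k; ++t) split[t] = seqs[t].size();
        return split;
    }

    // Number of elements ordered before pivot seqs[i][j], per sequence, under the
    // total order (value, sequence index, position).
    auto splits_before = [&](std::size_t i, std::size_t j) {
        const int v = seqs[i][j];
        std::vector<std::size_t> s(k);
        for (std::size_t t = 0; t < k; ++t) {
            const Seq& q = seqs[t];
            if (t < i)
                s[t] = static_cast<std::size_t>(std::upper_bound(q.begin(), q.end(), v) - q.begin());
            else if (t > i)
                s[t] = static_cast<std::size_t>(std::lower_bound(q.begin(), q.end(), v) - q.begin());
            else
                s[t] = j;
        }
        return s;
    };

    // The rank of seqs[i][j] grows strictly with j, so binary search works.
    for (std::size_t i = 0; i < k; ++i) {
        std::size_t lo = 0, hi = seqs[i].size();
        while (lo < hi) {
            const std::size_t mid = lo + (hi - lo) / 2;
            std::vector<std::size_t> s = splits_before(i, mid);
            std::size_t rank = 0;
            for (std::size_t c : s) rank += c;
            if (rank == r) return s;
            if (rank < r) lo = mid + 1; else hi = mid;
        }
    }
    assert(false && "unreachable: some element has global rank r");
    return split;
}
```

Calling the routine with r = n/p, 2n/p, ... yields slab boundaries of equal size that the processors can then merge independently.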


Parallel (multiway) merge: Analysis

- time complexity $O\left(\frac{1}{p}\left(n \log k + k \log k \cdot \log \max_j |S_j|\right)\right)$
- no full linear speedup
- good in practice
- special case $k = p$: $O\left(\frac{n}{p} \log k + \log p \cdot \log \max_j |S_j|\right)$
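As a quick check, and assuming the factor 1/p in the general bound applies to both terms, the special case follows by substituting k = p:

```latex
\frac{1}{p}\Bigl(n \log k + k \log k \cdot \log \max_j |S_j|\Bigr)\Big|_{k=p}
  = \frac{n}{p} \log p + \frac{p \log p}{p} \cdot \log \max_j |S_j|
  = \frac{n}{p} \log p + \log p \cdot \log \max_j |S_j|
```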


Parallel (multiway) merge: Results

[Plot: multiway merging of pairs of 64-bit integers on Sun T1. Speedup (0 to 25) over n (100 to 10^7) for the sequential version and for 1, 2, 3, 4, 8, 16 and 32 threads.]


sort, stable_sort

Parallel Multiway Mergesort
+ few, cache-efficient local memory accesses
+ stable variant easy
- needs twice the space

Quicksort
+ in-place
± dynamic load-balancing due to unequal splitting
- more global memory access
- not stable

both variants implemented in the MCSTL


Parallel Multiway Mergesort

Procedure
1. divide sequence into p parts of equal size
2. in parallel sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy result back to original position

[Figure: the sequence split into four parts, one per processor p0, p1, p2, p3.]
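The four steps of the procedure can be sketched with standard threads. This is a simplified illustration under stated assumptions: the function name `multiway_mergesort` is hypothetical, and step 3 uses a plain sequential p-way merge as a stand-in for the parallel p-way merge of the lecture.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of parallel multiway mergesort; only step 2 runs in parallel here.
void multiway_mergesort(std::vector<int>& a, std::size_t p) {
    assert(p >= 1);
    const std::size_t n = a.size();

    // 1. divide into p parts of (almost) equal size: part i is [bound[i], bound[i+1])
    std::vector<std::size_t> bound(p + 1);
    for (std::size_t i = 0; i <= p; ++i) bound[i] = i * n / p;

    // 2. sort the parts locally, one thread per part
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < p; ++i)
        workers.emplace_back([&a, &bound, i] {
            std::stable_sort(a.begin() + static_cast<std::ptrdiff_t>(bound[i]),
                             a.begin() + static_cast<std::ptrdiff_t>(bound[i + 1]));
        });
    for (std::thread& t : workers) t.join();

    // 3. p-way merge into temporary memory (sequential stand-in, assumption)
    std::vector<int> tmp;
    tmp.reserve(n);
    std::vector<std::size_t> pos(bound.begin(), bound.end() - 1);
    for (std::size_t out = 0; out < n; ++out) {
        std::size_t best = p;  // p means "no part chosen yet"
        for (std::size_t i = 0; i < p; ++i)
            if (pos[i] < bound[i + 1] && (best == p || a[pos[i]] < a[pos[best]]))
                best = i;      // strict '<' keeps the merge stable
        tmp.push_back(a[pos[best]++]);
    }

    // 4. copy the result back to the original position
    std::copy(tmp.begin(), tmp.end(), a.begin());
}
```

The temporary vector is the "twice the space" cost of this variant; the local sorts touch only one part each, which is what makes them cache-friendly.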


Parallel Multiway Mergesort: Analysis

Running Time
- time complexity $O\left(\frac{n \log n}{p} + p \log p \cdot \log \frac{n}{p}\right)$
- one multi-sequence partition per PE

Comparison to (Deterministic) Sample Sort
- very similar, only splitting differs
- exact splitting ⇐⇒ approximation guaranteed
- DSS' time complexity: $O\left(\frac{n \log n}{p} + p \log p\right)$
- tradeoff possible using oversampling
- global communication volume: 2n (copy back)
- local memory movement: $\frac{n}{p} \log_2 \frac{n}{p}$


Parallel Multiway Mergesort: Practical Issues

- copy to temporary memory first? or merge to temporary memory and copy back later?
- compute starting positions sequentially


[Plot: Multiway Mergesort of 64-bit integers on Sun T1. Speedup (0 to 20) over n (100 to 10^7) for the sequential version and for 1, 2, 3, 4, 8, 16 and 32 threads.]


Parallel Quicksort

Basic Algorithm
1. partition the sequence in parallel
2. if group consists of more than one processor:
   2.1 divide group according to data balance
   2.2 continue with 1. recursively
3. otherwise: sort the piece sequentially

Problem
load balancing may be very poor, in particular with small p, bad splitters

Solution
keep basic algorithm, dynamically balance work in last step


Parallel Load-Balanced Quicksort

1. partition the sequence in parallel
2. if group consists of more than one processor:
   2.1 divide group according to data balance
   2.2 continue with 1. recursively
3. otherwise: quicksort the piece sequentially
   push the piece onto a local stack
   while unsorted elements exist
   3.1 if non-empty: pop a piece from local stack
   3.2 otherwise: take (large) piece from bottom of other PE's stack (work-stealing)
   3.3 partition piece
   3.4 push right part onto stack, sort left part recursively
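Step 3 (the work-stealing phase) can be sketched as follows. Several simplifications are assumptions of this example and not part of the lecture: the parallel partitioning of steps 1 and 2 is skipped, so all work starts on the stack of thread 0; each stack is protected by a mutex instead of being lock-free; victims are tried round-robin; and termination is detected with a shared counter of elements that are not yet in their final position. All names (`PieceStack`, `balanced_quicksort`) are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Piece = std::pair<std::size_t, std::size_t>;  // index range [first, second)

// Mutex-protected stand-in for the lock-free deque (assumption of this sketch).
struct PieceStack {
    std::mutex m;
    std::deque<Piece> d;
    void push(Piece p) { std::lock_guard<std::mutex> g(m); d.push_back(p); }
    bool pop(Piece& p) {    // owner: newest piece from the top
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return false;
        p = d.back(); d.pop_back(); return true;
    }
    bool steal(Piece& p) {  // thief: oldest (usually large) piece from the bottom
        std::lock_guard<std::mutex> g(m);
        if (d.empty()) return false;
        p = d.front(); d.pop_front(); return true;
    }
};

void balanced_quicksort(std::vector<int>& a, std::size_t num_threads, std::size_t small = 64) {
    assert(num_threads >= 1);
    std::vector<PieceStack> stacks(num_threads);
    std::atomic<std::size_t> unsorted(a.size());  // elements not yet in final position
    stacks[0].push(Piece(0, a.size()));

    auto worker = [&](std::size_t me) {
        while (unsorted.load() > 0) {
            Piece pc;
            bool got = stacks[me].pop(pc);                       // 3.1
            for (std::size_t v = 1; !got && v < num_threads; ++v)
                got = stacks[(me + v) % num_threads].steal(pc);  // 3.2
            if (!got) { std::this_thread::yield(); continue; }
            while (pc.second - pc.first > small) {               // 3.3
                const int pivot = a[pc.first + (pc.second - pc.first) / 2];
                auto first = a.begin() + static_cast<std::ptrdiff_t>(pc.first);
                auto last = a.begin() + static_cast<std::ptrdiff_t>(pc.second);
                auto m1 = std::partition(first, last, [pivot](int x) { return x < pivot; });
                auto m2 = std::partition(m1, last, [pivot](int x) { return !(pivot < x); });
                unsorted -= static_cast<std::size_t>(m2 - m1);   // pivot copies are final
                stacks[me].push(Piece(static_cast<std::size_t>(m2 - a.begin()), pc.second));  // 3.4
                pc.second = static_cast<std::size_t>(m1 - a.begin());
            }
            std::sort(a.begin() + static_cast<std::ptrdiff_t>(pc.first),
                      a.begin() + static_cast<std::ptrdiff_t>(pc.second));
            unsorted -= pc.second - pc.first;
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) threads.emplace_back(worker, t);
    for (std::thread& t : threads) t.join();
}
```

Because the owner pops the most recently pushed piece and thieves take the oldest one, a steal tends to transfer a large amount of work, which keeps the number of steals low.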


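Parallel Load-Balanced Quicksort: Scheme

[Figure: the input is partitioned in parallel by p0, p1 and p2; one part is partitioned in parallel again by p0 and p1; finally p0, p1 and p2 sort their pieces sequentially and steal work from each other.]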


Parallel Load-Balanced Quicksort: Practice

- omit stack operations for small parts
- use lock-free stack data structure
  - some thread makes progress in every step
  - no mutexes or semaphores are used
  - many lock-free data structures known, many use linked lists
  - simple one used here
- how to detect termination?
- erratic performance if more threads than processors: why?


Lock Free (Restricted) Double-Ended Queue

Requirements
- push_front, pop_front not concurrently, issued only by one specific thread
- pop_back concurrently from all other threads
- number of elements is limited (logarithmic)
- no is_empty, no top, because semantics unclear
- pop_* may fail

Solution
- circular buffer with front and back pointer
- encode front and back pointer into one word to allow synchronous atomic update using compare-and-swap


Lock Free (Restricted) Double-Ended Queue

Code for pop_back
before := pointers
while (before.front > before.back)
    after := (before.front, before.back + 1)
    if (cas(pointers, before, after))
        item := *(before.back)
        return true
    before := pointers   // re-read after a failed cas
return false

Code for push_front
*(pointers.front) := item
fetch_and_add(pointers.front, 1)


Lock Free (Restricted) Double-Ended Queue

Code for pop_front
before := pointers
while (before.front > before.back)
    after := (before.front - 1, before.back)
    if (cas(pointers, before, after))
        item := *(after.front)
        return true
    before := pointers   // re-read after a failed cas
return false
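A C++ rendering of the pseudocode might look as follows. It is a sketch under explicit assumptions, not the MCSTL class: the name `RestrictedDeque` is hypothetical, front and back are packed as two 32-bit counters into one 64-bit atomic word, and a linear buffer of fixed capacity replaces the circular buffer so that a slot that has been stolen is never written again.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One owner thread calls push_front/pop_front; any other thread may call pop_back.
// T must be default-constructible and copy-assignable.
template <typename T>
class RestrictedDeque {
    std::vector<T> slots_;
    std::atomic<std::uint64_t> pointers_{0};  // high 32 bits: front, low 32 bits: back

    static std::uint32_t front_of(std::uint64_t w) { return static_cast<std::uint32_t>(w >> 32); }
    static std::uint32_t back_of(std::uint64_t w) { return static_cast<std::uint32_t>(w); }
    static std::uint64_t pack(std::uint32_t f, std::uint32_t b) {
        return (static_cast<std::uint64_t>(f) << 32) | b;
    }

public:
    explicit RestrictedDeque(std::size_t capacity) : slots_(capacity) {
        assert(capacity < (std::uint64_t(1) << 32));
    }

    bool push_front(const T& item) {  // owner only
        const std::uint32_t f = front_of(pointers_.load());
        if (f == slots_.size()) return false;         // out of slots (linear buffer)
        slots_[f] = item;                             // slot f is not yet visible to thieves
        pointers_.fetch_add(std::uint64_t(1) << 32);  // front := front + 1
        return true;
    }

    bool pop_front(T& item) {         // owner only
        std::uint64_t before = pointers_.load();
        while (front_of(before) > back_of(before)) {
            const std::uint64_t after = pack(front_of(before) - 1, back_of(before));
            // on failure, compare_exchange_weak stores the current value in `before`
            if (pointers_.compare_exchange_weak(before, after)) {
                item = slots_[front_of(after)];
                return true;
            }
        }
        return false;
    }

    bool pop_back(T& item) {          // any thief thread
        std::uint64_t before = pointers_.load();
        while (front_of(before) > back_of(before)) {
            const std::uint64_t after = pack(front_of(before), back_of(before) + 1);
            if (pointers_.compare_exchange_weak(before, after)) {
                item = slots_[back_of(before)];
                return true;
            }
        }
        return false;
    }
};
```

A short single-threaded usage example: after `push_front(1)`, `push_front(2)`, `push_front(3)`, a `pop_back` yields 1 and two `pop_front` calls yield 3 and then 2. Packing both indices into one word is what lets a single compare-and-swap decide atomically whether the deque is still non-empty.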


Lock Free (Restricted) Double-Ended Queue

Properties
- lock-free, but not wait-free
- pointer back increases monotonically ⇒ no concurrency problems at queue back
- pointer front does not increase monotonically ⇒ no problem, since no concurrent push and pop allowed at queue front
- in case of failure: retry or done


Balanced Quicksort: Analysis

- time complexity $O\left(\frac{n \log n}{p} + B \log p\right)$
- communication volume + local memory movement: $n \log_2 n$
- good speedups require fast random-access across PE boundaries


Balanced Quicksort: Results

[Plot: Balanced Quicksort for 32-bit integers on 2 Dual-Core-Xeons. Speedup (0 to 4) over n (100 to 10^7) for the sequential version and for 1, 2, 3, 4 and 8 threads.]


Balanced Quicksort: Problem Analysis

Problem
- not so nice performance
- particularly bad with too few processors
- where is the problem?
- processor fully loaded while stealing when there is no piece available

Solution
- give the processor to another thread if no work is found ⇒ yield
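The effect of yielding can be shown with a tiny polling loop. The helper `wait_for_work` and its flag parameter are hypothetical and only illustrate the idea; `std::this_thread::yield` is the standard C++ call used here as a stand-in for whatever yield primitive the platform offers.

```cpp
#include <atomic>
#include <thread>

// Waits until `work_available` becomes true. Instead of spinning at full load,
// the thread gives up its time slice whenever it finds nothing to do.
// Returns the number of unsuccessful polls (for illustration only).
unsigned long wait_for_work(const std::atomic<bool>& work_available) {
    unsigned long failed_polls = 0;
    while (!work_available.load(std::memory_order_acquire)) {
        ++failed_polls;
        std::this_thread::yield();  // let another runnable thread use this processor
    }
    return failed_polls;
}
```

With more threads than processors, a thief that spins without yielding can occupy the very processor that the thread holding the remaining work needs, which is one plausible reading of the erratic performance noted earlier.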


Balanced Quicksort: Results with yield

[Plot: Balanced Quicksort with Yield for 32-bit integers on 2 Dual-Core-Xeons. Speedup (0 to 4) over n (100 to 10^7) for the sequential version and for 1, 2, 3, 4 and 8 threads.]


Balanced Quicksort: Comparison to PMWMS

[Plot: Multiway Mergesort for 32-bit integers on 2 Dual-Core-Xeons. Speedup (0 to 4) over n (100 to 10^7) for the sequential version and for 1, 2, 3, 4 and 8 threads.]


Random Permutation (random_shuffle)

Standard Sequential Algorithm (e.g. STL)
for 0 ≤ i < n: swap(a[i], a[rand(i, n − 1)])

Cache efficient (parallel) algorithm
1. distribute randomly to (local) buckets
1b. (copy local buckets to global buckets)
2. permute buckets

[Figure: elements are distributed to buckets and each bucket is permuted.]
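Both variants can be sketched sequentially in C++. The function names, the use of `std::mt19937` and the restriction to `int` are assumptions of this example; in a parallel version each thread would distribute its share of the input to local buckets, and the buckets would be sized to fit into cache.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Standard sequential shuffle (Fisher-Yates): position i receives a uniformly
// chosen element of a[i..n-1].
void sequential_shuffle(std::vector<int>& a, std::mt19937& rng) {
    for (std::size_t i = 0; i + 1 < a.size(); ++i) {
        std::uniform_int_distribution<std::size_t> pick(i, a.size() - 1);
        std::swap(a[i], a[pick(rng)]);
    }
}

// Two-phase variant: (1) throw every element into a random bucket,
// (2) permute each bucket on its own and concatenate the buckets.
void bucket_shuffle(std::vector<int>& a, std::size_t num_buckets, std::mt19937& rng) {
    assert(num_buckets >= 1);
    std::vector<std::vector<int>> buckets(num_buckets);
    std::uniform_int_distribution<std::size_t> pick_bucket(0, num_buckets - 1);
    for (int x : a) buckets[pick_bucket(rng)].push_back(x);  // step 1

    std::size_t out = 0;
    for (std::vector<int>& b : buckets) {                    // step 2
        sequential_shuffle(b, rng);  // random accesses stay inside one bucket
        for (int x : b) a[out++] = x;
    }
}
```

The random accesses of the second phase are confined to one bucket at a time, which is where the cache efficiency comes from.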


Random Permutation (random_shuffle)

- time complexity $O\left(\frac{n}{p} + p\right)$, global communication volume n
- cache efficiency very important (factor 2)

[Plot: cache-aware random shuffling of integers on 4-way Opteron. Speedup (0 to 8) over n (10^6 to 10^8) for the sequential version and for 1, 2, 3 and 4 threads.]


Embarrassingly Parallel Computation

- semantics
  - process a set of elements completely independently
  - atomic units called jobs, running time unknown
- parallelization
  - easy in principle (uniform workload) ⇒ static load-balancing
  - interesting for non-uniform workload ⇒ dynamic load-balancing
- possible solutions
  - equal splitting: perfect for uniform workload
  - master-worker: possibly considerable overhead (communication in each step)
  - work-stealing: communication only when necessary


Dynamic Load Balancing for for_each etc.

- using work-stealing
  - divide iteration range into equal intervals initially
  - idle threads steal half the interval from random victim
- no explicit synchronization with victims needed (using fetch_and_add)
- adaptive granularity control (cache!)
- logarithmic number of steals suffices with high probability
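The stealing scheme for iteration ranges can be sketched as below. This example deliberately differs from the description above and says so: each interval is protected by a mutex rather than updated with fetch_and_add, victims are tried round-robin instead of at random, and the grain size is a fixed parameter instead of being adapted. The names `Range` and `for_each_index_balanced` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct Range {  // remaining iterations [begin, end) of one thread
    std::mutex m;
    std::size_t begin = 0, end = 0;
};

// Calls f(i) exactly once for every i in [0, n), using num_threads threads.
template <typename F>
void for_each_index_balanced(std::size_t n, std::size_t num_threads, std::size_t grain, F f) {
    assert(num_threads >= 1 && grain >= 1);
    std::vector<Range> ranges(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t) {  // equal intervals initially
        ranges[t].begin = t * n / num_threads;
        ranges[t].end = (t + 1) * n / num_threads;
    }

    auto worker = [&](std::size_t me) {
        for (;;) {
            std::size_t lo, hi;
            {   // take the next chunk of at most `grain` iterations from the own interval
                std::lock_guard<std::mutex> g(ranges[me].m);
                lo = ranges[me].begin;
                hi = std::min(lo + grain, ranges[me].end);
                ranges[me].begin = hi;
            }
            if (lo < hi) {
                for (std::size_t i = lo; i < hi; ++i) f(i);
                continue;
            }
            // own interval is empty: steal the upper half of a victim's interval
            bool stolen = false;
            for (std::size_t v = 1; v < num_threads && !stolen; ++v) {
                Range& victim = ranges[(me + v) % num_threads];
                std::size_t s_lo = 0, s_hi = 0;
                {
                    std::lock_guard<std::mutex> g(victim.m);
                    const std::size_t left = victim.end - victim.begin;
                    if (left > 0) {
                        s_hi = victim.end;
                        s_lo = victim.end - (left + 1) / 2;
                        victim.end = s_lo;
                    }
                }
                if (s_lo < s_hi) {
                    std::lock_guard<std::mutex> g(ranges[me].m);
                    ranges[me].begin = s_lo;
                    ranges[me].end = s_hi;
                    stolen = true;
                }
            }
            if (!stolen) return;  // no interval had work left
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; ++t) threads.emplace_back(worker, t);
    for (std::thread& t : threads) t.join();
}
```

Since every steal halves the victim's remaining interval, an interval can only be stolen from a logarithmic number of times before it is empty, which matches the intuition behind the bound on the number of steals.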


[Plot: Mandelbrot on 4-way Opteron, at most 1000 iterations per pixel. Speedup (0 to 4) over number of pixels (100 to 10^5) for the sequential version and for 2, 3 and 4 threads, each balanced and unbalanced.]


Outline

Introduction

Platform Support

Algorithms

Conclusion


Conclusion

- MCSTL provides a very easy way to incorporate parallelism into programs on an algorithmic level
- performance is excellent for large inputs
- basic algorithms known but detailed design and performance engineering nontrivial
- successful integration into STXXL (external memory)


Future Work

- complete STL functionality
- better automatic algorithm and parameter selection
- machine model adequate for design and analysis of multithreaded algorithms
- beyond STL


Algorithms & DS to be Implemented

- containers: initialization, bulk operations
- priority queues
- some embarrassingly parallel functions (e.g. valarray)
- memory transfer operations (reverse, copy)?
- set operations (set_union, ...)


More About All That

- MCSTL website: http://algo2.iti.uni-karlsruhe.de/singler/mcstl/
- lab course (Praktikum) next semester: extension/usage of MCSTL
- student research projects and diploma theses (Studien-/Diplomarbeiten)
