Introduction Platform Support Algorithms Conclusion 1/70
MCSTL: Multi-Core Standard Template Library
Practical Implementation of Parallel Algorithms for Shared-Memory Systems
Peter Sanders, Johannes Singler
Institute for Theoretical Computer Science, University of Karlsruhe
December 13th, 2006
Peter Sanders, Johannes Singler MCSTL - Practical Parallelism
Lecture Contents
Introduction
Platform Support
Algorithms
Conclusion
Outline
Introduction
Platform Support
Algorithms
Conclusion
What is this Lecture About?
Theory vs. Practice
- machine model vs. concrete machine(s)
- pseudo-code vs. existing C++ library
Communication Network vs. Shared Memory
- implicit communication
- cache hierarchy, NUMA, bandwidth sharing
Synchronous PRAM vs. Asynchronous PEs
- synchronization is a problem in itself
- n = p vs. n ≫ p
- core allocation not static, other processes interfere
Why Multi-Cores?
- easy use of high transistor budget
- energy efficient (at reduced clock speeds)
- increase in clock speed largely exhausted
- instruction-level parallelism exhausted
- SIMD/vector only for special applications
Multi-cores will be everywhere: mobile devices . . . supercomputers
Hardware Nowadays
- dual-cores omnipresent
- mainstream quad-cores available
- Sun T1: 8 cores, 32 threads
- high-end shared-memory servers with many more cores (on multiple chips)
Programming Multicores
- automatic parallelization? only for simple loops
- explicitly parallel? too complicated for everyday use
- libraries of parallelized algorithms!
Natural starting point: the standard libraries of programming languages.
Basic Approach
Make using parallel algorithms "as easy as winking": provide the functionality of the C++ Standard Template Library.
Why STL?
- many efficient and useful algorithms included
- simple interface, very well known among developers
- the template mechanism is known to allow low-overhead algorithm libraries
- recompilation of existing programs may suffice
- C++ is an accepted and efficient language
Goals
- parallelize all time-consuming STL algorithms
- speedup already for small inputs ⇒ scale down
- high speedup for medium/large inputs
- dynamically choose algorithms and tuning parameters
- coexist with other forms of parallelization
- load balancing even for regular computations
Special Requirements for a Library
Generality
- genericity (templates)
- only few assumptions about input data types
- good scalability in terms of use cases
Compatibility with
- existing libraries
- platforms
Layers (top to bottom)
Application
MCSTL
Threading Support
Hardware
Threading Support
- OpenMP: currently used (basic primitives)
- example:

    #pragma omp parallel num_threads(p)
    {
      iam = omp_get_thread_num();
      ...
    }
    #pragma omp barrier/single/master
    ...

- quite elegant
- no permanent separation possible
- still works when the compiler ignores the pragmas
- growing compiler support (gcc, Sun, Intel, MS)
- atomic operations:
  - fetch-and-add
  - compare-and-swap
Implemented Algorithms
- find, find_if, mismatch, . . .
- partial_sum (prefix sum)
- partition
- nth_element / partial_sort
- merge
- sort, stable_sort
- random_shuffle
- embarrassingly parallel (for_each, transform, . . . )
≥ 50 % of the STL
Extension to the STL:
- multiway_merge
Dependency Graph of Lecture Contents
[Figure: MCSTL overview. Platform support: atomic operations, lock-free DS, load-balancing, fine-grained communication, consistency. Sequential helper algorithms: tournament trees, multi-sequence selection. Parallel algorithms: for_each etc., find etc., partial_sum, partition, partial_sort/nth_element, sort (quick), sort (mwms), random_shuffle, (multiway_)merge. Edges labeled: exact splitting, initial splits, placement, init.]
Outline
Introduction
Platform Support
Algorithms
Conclusion
Shared-Memory Hardware
- the cache coherency protocol makes the memory view consistent, introduces implicit communication
- cores invalidate entries in their cache when another core writes (snooping)
- overhead only for actual transfer of data
- granularity is one cache line: avoid false sharing!
- "cache level 0" = registers are exempted; variable values are not updated in memory (from another core's point of view)
  - declare the variable volatile (once per variable)
  - #pragma omp flush(variable) when an update is suspected (once per update)
Atomic Operations
A few operations are executed atomically, without any chance of interference.
- fetch_and_add(x, i):
  t := x; x := x + i; return t;
  - allows concurrent iteration over a sequence
- compare_and_swap(x, c, r):
  if (x = c) { x := r; return c [true]; } else { return x [false]; }
  - secure state transition; can emulate fetch_and_add and others by using it in a loop
- slower than a usual operation, in particular under concurrency
Outline
Introduction
Platform Support
Algorithms
Conclusion
find, find_if, mismatch, . . .
Find the first position in a sequence satisfying a predicate.
Analysis
- O(m) sequential time if the first hit is at (unknown) position m
- a naive parallel algorithm needs Ω(m/p)
- parallelization not worthwhile for small m
[Figure: sequence of length n, first hit at position m]
find: Algorithm
- start sequentially up to position m0
- dynamic load balancing using fetch-and-add
- scale up and down using geometrically growing block sizes
- the first successful thread grabs the remaining work
[Figure: sequential prefix up to m0, then blocks claimed dynamically by p0 . . . p3]
[Plot: speedup vs. position of the found element, finding n in the sequence [1,...,10^8] of integers on a 4-way Opteron; curves: MCSTL find with 2-4 threads, naive parallel with 2-4 threads, sequential]
partial_sum
Differences from the Algorithms Seen so Far
- n ≫ p: multiple elements per PE; the sum must be calculated in a preprocessing step, the prefix sum in a postprocessing step
- 2n + O(1) additions in total, not optimal; speedup only p/2, particularly bad for small p
- O(log p) communication steps
- shared-memory advantage: can split the data arbitrarily
Practical Algorithm for Shared Memory
- divide the input into p + 1 pieces
- the double calculation for the first part can be avoided
partial_sum: Algorithm
Processor i ∈ 0 . . . p − 1:
1. i = 0: compute partial sums of part 0, S[0] := last one
   i > 0: compute S[i] := sum of part i
2. i = 0: compute partial sums of S[i] sequentially
3. i ≥ 0: compute partial sums of part i + 1 using S[i]
Analysis
- only 3 synchronizations (constant)
- time complexity O(n/p + p), no hidden factor 2
- speedup (p + 1)/2 for n ≫ p
partial_sum: Scheme
[Figure: input divided among the PEs; p0 . . . p2 process their parts, p0 prefix-sums the part sums, then all PEs prefix-sum their parts using the offsets]
partial_sum: Results
[Plot: speedup vs. n for the prefix sum of integers on Sun T1; curves: sequential, 1-32 threads]
partition
[Figure: sequence partitioned into elements < pivot and > pivot]
Sequential Algorithm
- scan from both ends
- swap to the desired order when contrary
Parallel Partitioning [Tsigas, Zhang 2003]
1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the "middle"
3. recurse on the middle
[Figure: input blocks scanned by p0 . . . p2; the unfinished rest is treated recursively or sequentially; blocks are swapped in parallel]
- time complexity O(n/p + B log p)
partition: Example
[Figure: step-by-step blockwise partitioning of an example sequence; 3 processors, B = 3, pivot 50, no special cases]
[Plot: speedup vs. n for partitioning of 32-bit integers on Sun T1; curves: sequential, 1-32 threads]
nth_element, partial_sort, quicksort
[Figure: sequence split at rank n into elements < and ≥ the rank-n element; partial_sort additionally sorts the left part, nth_element does not]
Algorithms
- nth_element: quickselect, linear recursion using partition
- partial_sort: nth_element, then sort
- quicksort: recursion using partition, load balancing using work stealing
Parallel implementations profit from each other.
Multi-Sequence Selection
Problem Definition
Find the element with global rank r in k sorted sequences S_i.
Usage
Split at the elements with global ranks n/p, 2n/p, 3n/p, . . . , (p − 1)n/p and redistribute the elements ⇒ sequences of the same length (±1) on each PE.
- guaranteed even for many equal elements
Solution
[Varman et al. 1991], see next slide
Multi-Sequence Selection: Algorithm
Idea
- partition into two sets with the desired ratio (corresponds to the rank)
- start with the middle elements
- refine the partition by recursively adding the elements in the middle of both sides, taking only O(k) time for each step
- running time O(k log |S_i|); O(k log k · log |S_i|) for the practical variant
Multi-Sequence Selection: Example
k = 4, N = k · n = 4 · 7 = 28; select global rank 14
[Figure: four sorted sequences
1 2 6 7 9 11 15
2 8 9 17 23 24 25
6 7 9 12 23 24 25
3 8 10 13 14 17 19
with the current middle elements 7, 17, 12, 13 marked]
Multi-Sequence Selection: Example (continued)
k = 4, N = k · n = 4 · 7 = 28; select global rank 14
[Figure: refinement step with the remaining candidate elements per sequence: 2 7 11 | 8 17 24 | 7 12 24 | 8 13 17]
Multi-Sequence Selection: Remarks
Implementation Problems
- non-uniform lengths, lengths not equal to 2^i − 1: "conceptual padding" ⇒ running time ∼ log max_i |S_i|
- finding ranks ≠ (1/2) Σ_i |S_i|, short sequences: complicated special cases at the ends of the sequences
- equal elements: find the partition directly, not an element with a specified global rank
Sequential multiway_merge
Problem Definition
Merge k sorted sequences into one sorted sequence.
Solution
Use a tournament tree, usually implemented as a loser tree.
- binary tree in an array
- optimal O(log k) running time per merge step
- efficient computation of indices
- downside: tricky without sentinels and/or when k is not a power of 2
Loser Tree
[Figure: loser tree over the keys 6, 2, 7, 9, 1, 4, 7, 4 stored in an array; a deleteMin + insertNext step removes the winner 1, inserts the next element 3, and replays the matches along its path]
Parallel (multiway_)merge
How to divide the problem?
- find slabs, i. e. consistent sets of sections from the sequences
- two possibilities:
  - (randomized) splitting by sampling
  - exact splitting into parts of equal size (using multi-sequence selection)
[Figure: k sequences cut into slabs assigned to p0 . . . p3]
Parallel (multiway_)merge: Analysis
- time complexity O((1/p)(n log k + k log k · log max_j |S_j|))
- no full linear speedup
- good in practice
- special case k = p: O((n/p) log k + log p · log max_j |S_j|)
Parallel (multiway_)merge: Results
[Plot: speedup vs. n for multiway merging of pairs of 64-bit integers on Sun T1; curves: sequential, 1-32 threads]
sort, stable_sort
Parallel Multiway Mergesort
+ few, cache-efficient local memory accesses
+ stable variant easy
− needs twice the space
Quicksort
+ in-place
± dynamic load balancing due to unequal splitting
− more global memory access
− not stable
Both variants are implemented in the MCSTL.
Parallel Multiway Mergesort
Procedure
1. divide the sequence into p parts of equal size
2. in parallel, sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy the result back to the original position
[Figure: parts sorted locally by p0 . . . p3, then merged in parallel]
Parallel Multiway Mergesort: Analysis
Running Time
- time complexity O((n log n)/p + p log p · log(n/p))
- one multi-sequence selection per PE
Comparison to (Deterministic) Sample Sort
- very similar, only the splitting differs
- exact splitting ⇐⇒ approximation guaranteed
- DSS's time complexity: O((n log n)/p + p log p)
- tradeoff possible using oversampling
- global communication volume: 2n (copy back)
- local memory movement: (n/p) log₂(n/p)
Parallel Multiway Mergesort: Practical Issues
- copy to temporary memory first? or merge to temporary memory and copy back later?
- compute the starting positions sequentially
[Plot: speedup vs. n for multiway mergesort of 64-bit integers on Sun T1; curves: sequential, 1-32 threads]
Parallel Quicksort
Basic Algorithm
1. partition the sequence in parallel
2. if the group consists of more than one processor:
   2.1 divide the group according to data balance
   2.2 continue with 1. recursively
3. otherwise: sort the piece sequentially
Problem
Load balancing may be very poor, in particular with small p and bad splitters.
Solution
Keep the basic algorithm; dynamically balance the work in the last step.
Parallel Load-Balanced Quicksort
1. partition the sequence in parallel
2. if the group consists of more than one processor:
   2.1 divide the group according to data balance
   2.2 continue with 1. recursively
3. otherwise: quicksort the piece sequentially:
   push the piece onto a local stack;
   while unsorted elements exist:
   3.1 if the local stack is non-empty: pop a piece from it
   3.2 otherwise: take a (large) piece from the bottom of another PE's stack (work stealing)
   3.3 partition the piece
   3.4 push the right part onto the stack, sort the left part recursively
Parallel Load-Balanced Quicksort: Scheme
[Figure: input partitioned in parallel by p0 . . . p2; the groups split and partition in parallel again; finally sequential sorting per PE, with idle PEs stealing work]
Parallel Load-Balanced Quicksort: Practice
- omit stack operations for small parts
- use a lock-free stack data structure:
  - every thread makes progress in every step
  - no mutexes or semaphores are used
  - many lock-free data structures are known, many use linked lists
  - a simple one is used here
- how to detect termination?
- erratic performance if there are more threads than processors: why?
Lock-Free (Restricted) Double-Ended Queue
Requirements
- push_front, pop_front not concurrently, issued only by one specific thread
- pop_back concurrently from all other threads
- the number of elements is limited (logarithmic)
- no is_empty, no top, because their semantics would be unclear
- pop_* may fail
Solution
- circular buffer with front and back pointers
- encode the front and back pointers into one word to allow a synchronous atomic update using compare-and-swap
Lock-Free (Restricted) Double-Ended Queue
Code for pop_back:
  before := pointers
  while (before.front > before.back)
    after := (before.front, before.back + 1)
    if (cas(pointers, before, after))
      item := *(before.back)
      return true
  return false
Code for pop_front: as pop_back, but with
  after := (before.front − 1, before.back)
and item := *(after.front).
Code for push_front:
  *(pointers.front) := item
  fetch_and_add(pointers.front, 1)
Lock-Free (Restricted) Double-Ended Queue
Properties
- lock-free, but not wait-free
- the back pointer increases monotonically ⇒ no concurrency problems at the queue's back
- the front pointer does not increase monotonically ⇒ no problem, since no concurrent push and pop are allowed at the queue's front
- in case of failure: retry, or done
Balanced Quicksort: Analysis

I time complexity O((n log n)/p + B log p)
I communication volume + local memory movement: n log₂ n
I good speedups require fast random access across PE boundaries
Balanced Quicksort: Results

[Speedup plot: "Balanced Quicksort for 32-bit integers on 2 Dual-Core-Xeons"; x-axis: n (100 to 10^7), y-axis: speedup (0 to 4); series: sequential, 1, 2, 3, 4, 8 threads]
Balanced Quicksort: Problem Analysis

Problem
I disappointing performance
I particularly bad with too few processors
I where is the problem?
I a processor stays fully loaded while trying to steal, even when no piece is available

Solution
I switch to another thread if no work is found ⇒ yield
Balanced Quicksort: Results with yield

[Speedup plot: "Balanced Quicksort with Yield for 32-bit integers on 2 Dual-Core-Xeons"; x-axis: n (100 to 10^7), y-axis: speedup (0 to 4); series: sequential, 1, 2, 3, 4, 8 threads]
Balanced Quicksort: Comparison to PMWMS

[Speedup plot: "Multiway Mergesort for 32-bit integers on 2 Dual-Core-Xeons"; x-axis: n (100 to 10^7), y-axis: speedup (0 to 4); series: sequential, 1, 2, 3, 4, 8 threads]
Random Permutation (random_shuffle)

Standard sequential algorithm (e. g. STL):
    for 0 ≤ i < n: swap(a[i], a[rand(i, n − 1)])

Cache-efficient (parallel) algorithm:
1. distribute elements randomly to (local) buckets
1b. (copy local buckets to global buckets)
2. permute buckets

[diagram: elements scattered into buckets, then each bucket permuted]
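The two-step scheme above can be sketched sequentially in C++ (a hypothetical `bucket_shuffle` helper, not MCSTL's code; the bucket count `k` is an assumed tuning parameter chosen so a bucket fits in cache):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Sequential sketch of the two-step scheme (hypothetical helper, not
// MCSTL's code; the bucket count k is an assumed tuning parameter).
// Step 1 scatters every element into a uniformly random bucket; step 2
// runs Fisher-Yates inside each cache-sized bucket. Concatenating the
// buckets then yields a uniform random permutation with mostly-local
// memory accesses.
template <typename T>
void bucket_shuffle(std::vector<T> &a, std::mt19937 &rng, std::size_t k = 256) {
    std::vector<std::vector<T>> bucket(k);
    std::uniform_int_distribution<std::size_t> pick(0, k - 1);
    for (const T &x : a)
        bucket[pick(rng)].push_back(x);          // step 1: random scatter
    std::size_t pos = 0;
    for (auto &b : bucket) {                     // step 2: permute each bucket
        std::shuffle(b.begin(), b.end(), rng);
        for (const T &x : b)
            a[pos++] = x;                        // concatenate in bucket order
    }
}
```

The parallel version additionally gives each thread local buckets in step 1 and copies them into global buckets (step 1b), so the per-bucket permutations in step 2 can run independently on different threads.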
Random Permutation (random_shuffle)

I time complexity O(n/p + p), global communication volume n
I cache efficiency very important (factor 2)
[Speedup plot: "Cache-aware random shuffling of integers on 4-way Opteron"; x-axis: n (10^6 to 10^8), y-axis: speedup (0 to 8); series: sequential, 1, 2, 3, 4 threads]
Embarrassingly Parallel Computation

I semantics
  I process a set of elements completely independently
  I atomic units called jobs, running time unknown
I parallelization
  I easy in principle (uniform workload) ⇒ static load balancing
  I interesting for non-uniform workload ⇒ dynamic load balancing
I possible solutions
  I equal splitting: perfect for uniform workload
  I master-worker: possibly considerable overhead (communication in each step)
  I work stealing: communication only when necessary
Dynamic Load Balancing for for_each etc.

I using work stealing
  I divide the iteration range into equal intervals initially
  I idle threads steal half the interval from a random victim
  I no explicit synchronization with victims needed (using fetch_and_add)
  I adaptive granularity control (cache!)
  I a logarithmic number of steals suffices with high probability

[diagram: per-thread intervals; an idle thread steals half of a victim's remaining interval]
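A minimal sketch of the interval-stealing idea (illustrative names, not MCSTL's API; for simplicity the owner uses the same CAS loop rather than the fetch_and_add mentioned above): each worker claims indices from its own packed (begin, end) word, and an idle worker steals the upper half of a victim's remainder with one CAS, so owner and thief need no locks.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Illustrative sketch of interval work stealing (hypothetical names, not
// MCSTL's API). Both bounds of an interval are packed into one 64-bit
// word so a single CAS can shrink it from either end without locks.
class Interval {
    std::atomic<uint64_t> packed_;
    static uint64_t pack(uint32_t b, uint32_t e) { return (uint64_t(b) << 32) | e; }
public:
    void reset(uint32_t b, uint32_t e) { packed_.store(pack(b, e)); }

    bool take(uint32_t &idx) {                       // owner: claim one index
        uint64_t p = packed_.load();
        for (;;) {
            uint32_t b = uint32_t(p >> 32), e = uint32_t(p);
            if (b >= e) return false;                // interval drained
            if (packed_.compare_exchange_weak(p, pack(b + 1, e))) {
                idx = b;
                return true;
            }                                        // lost a race: p reloaded
        }
    }

    bool steal(uint32_t &sb, uint32_t &se) {         // thief: take upper half
        uint64_t p = packed_.load();
        for (;;) {
            uint32_t b = uint32_t(p >> 32), e = uint32_t(p);
            if (b >= e || e - b < 2) return false;   // too little to steal
            uint32_t mid = b + (e - b) / 2;
            if (packed_.compare_exchange_weak(p, pack(b, mid))) {
                sb = mid; se = e;                    // thief owns [mid, e)
                return true;
            }
        }
    }
};

template <typename F>
void parallel_for_each(uint32_t n, unsigned nthreads, F f) {
    std::vector<Interval> part(nthreads);
    for (unsigned t = 0; t < nthreads; ++t)          // equal initial split
        part[t].reset(uint32_t(uint64_t(t) * n / nthreads),
                      uint32_t(uint64_t(t + 1) * n / nthreads));
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            uint32_t i, b, e;
            for (;;) {
                while (part[t].take(i)) f(i);        // drain own interval
                bool stolen = false;                 // empty: sweep victims
                for (unsigned v = 0; v < nthreads && !stolen; ++v)
                    if (v != t && part[v].steal(b, e)) {
                        part[t].reset(b, e);
                        stolen = true;
                    }
                if (!stolen) break;                  // nothing stealable left
            }
        });
    for (auto &th : pool) th.join();
}
```

Every index is claimed exactly once because the CAS makes owner and thief operations on one interval mutually exclusive; a worker may exit while a victim still holds a single last item, but that item is then consumed by its owner, so correctness is unaffected.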
[Speedup plot: "Mandelbrot on 4-way Opteron, at most 1000 iterations per pixel"; x-axis: number of pixels (100 to 10^5), y-axis: speedup (0 to 4); series: sequential; 2, 3, 4 threads balanced; 2, 3, 4 threads unbalanced]
Outline
Introduction
Platform Support
Algorithms
Conclusion
Conclusion

I MCSTL provides a very easy way to incorporate parallelism into programs at the algorithmic level
I performance is excellent for large inputs
I the basic algorithms are known, but detailed design and performance engineering are nontrivial
I successful integration into STXXL (external memory)
Future Work

I complete STL functionality
I better automatic algorithm and parameter selection
I a machine model adequate for the design and analysis of multithreaded algorithms
I beyond STL
Algorithms & Data Structures to be Implemented

I containers: initialization, bulk operations
I priority queues
I some embarrassingly parallel functions (e. g. valarray)
I memory transfer operations (reverse, copy)?
I set operations (set_union, . . . )
More About All That

I MCSTL website: http://algo2.iti.uni-karlsruhe.de/singler/mcstl/
I lab course (Praktikum) next semester: extension/usage of the MCSTL
I student research projects and diploma theses (Studien-/Diplomarbeiten)