Introduction Platform Support Algorithms Conclusion 1/70
MCSTL: Multi-Core Standard Template Library
Practical Implementation of Parallel Algorithms for Shared-Memory Systems
Peter Sanders, Johannes Singler
Institute for Theoretical Computer Science, University of Karlsruhe
December 13th, 2006
Peter Sanders, Johannes Singler MCSTL - Practical Parallelism
Lecture Contents
Introduction
Platform Support
Algorithms
Conclusion
Outline
Introduction
Platform Support
Algorithms
Conclusion
What is this Lecture About?
Theory vs. Practice
- machine model vs. concrete machine(s)
- pseudo-code vs. existing C++ library
Communication Network vs. Shared Memory
- implicit communication
- cache hierarchy, NUMA, bandwidth sharing
Synchronous PRAM vs. Asynchronous PEs
- synchronization is a problem in itself
- n = p vs. n ≫ p
- core allocation not static, other processes interfere
Why Multi-Cores?
- easy use of high transistor budget
- energy efficient (at reduced clock speeds)
- increase in clock speed largely exhausted
- instruction-level parallelism exhausted
- SIMD/vector only for special applications
Multi-cores will be everywhere: mobile devices . . . supercomputers
Hardware Nowadays
- dual-cores omnipresent
- mainstream quad-cores available
- Sun T1: 8 cores, 32 threads
- high-end shared-memory servers with many more cores (on multiple chips)
Programming Multicores
- automatic parallelization? only for simple loops
- explicitly parallel? too complicated for everyday use
- libraries of parallelized algorithms!
Natural starting point: the standard libraries of programming languages.
Basic Approach
Make using parallel algorithms "as easy as winking": provide the functionality of the C++ Standard Template Library.
Why STL?
- many efficient and useful algorithms included
- simple interface, very well known among developers
- the template mechanism is known to allow low-overhead algorithm libraries
- recompilation of existing programs may suffice
- C++ is an accepted and efficient language
Goals
- parallelize all time-consuming STL algorithms
- speedup already for small inputs ⇒ scale down
- high speedup for medium/large inputs
- dynamically choose algorithms and tuning parameters
- coexist with other forms of parallelization
- load balancing even for regular computations
Special Requirements for a Library
Generality
- genericity (templates)
- only few assumptions about input data types
- good scalability in terms of use cases
Compatibility with
- existing libraries
- platforms
Layers (top to bottom)
Application
MCSTL
Threading Support
Hardware
Threading Support
- OpenMP: currently used (basic primitives)
- example:

    #pragma omp parallel num_threads(p)
    {
      iam = omp_get_thread_num();
      ...
    }
    #pragma omp barrier/single/master
    ...

- quite elegant
- no permanent separation possible
- still works when the compiler ignores the pragmas
- growing compiler support (gcc, Sun, Intel, MS)
- atomic operations:
  - fetch-and-add
  - compare-and-swap
Implemented Algorithms
- find, find_if, mismatch, . . .
- partial_sum (prefix sum)
- partition
- nth_element / partial_sort
- merge
- sort, stable_sort
- random_shuffle
- embarrassingly parallel (for_each, transform, . . . )
≥ 50 % of the STL
Extension to the STL:
- multiway_merge
Dependency Graph of Lecture Contents
[Figure: MCSTL overview. Platform support: atomic operations, lock-free DS, load-balancing, fine-grained communication, consistency. Sequential helper algorithms: tournament trees, multi-sequence selection. Parallel algorithms: for_each etc., find etc., partial_sum, partition, partial_sort/nth_element, sort (quick), sort (mwms), random_shuffle, (multiway_)merge. Edges labeled: exact splitting, initial splits, placement, init.]
Outline
Introduction
Platform Support
Algorithms
Conclusion
Shared-Memory Hardware
- the cache coherency protocol makes the memory view consistent, introduces implicit communication
- cores invalidate entries in their cache when another core writes (snooping)
- overhead only for actual transfer of data
- granularity is one cache line: avoid false sharing!
- "cache level 0" = registers are exempted; variable values are not updated in memory (from another core's point of view)
  - declare the variable volatile (once per variable)
  - #pragma omp flush(variable) when an update is suspected (once per update)
Atomic Operations
A few operations are executed atomically, without any chance of interference.
- fetch_and_add(x, i):
  t := x; x := x + i; return t;
  - allows concurrent iteration over a sequence
- compare_and_swap(x, c, r):
  if (x = c) { x := r; return c [true]; } else { return x [false]; }
  - secure state transition; can emulate fetch_and_add and others by using it in a loop
- slower than a usual operation, in particular under concurrency
Outline
Introduction
Platform Support
Algorithms
Conclusion
find, find_if, mismatch, . . .
Find the first position in a sequence satisfying a predicate.
Analysis
- O(m) sequential time if the first hit is at (unknown) position m
- a naive parallel algorithm needs Ω(m/p)
- parallelization not worthwhile for small m
[Figure: sequence of length n, first hit at position m]
find: Algorithm
- start sequentially up to position m0
- dynamic load balancing using fetch-and-add
- scale up and down using geometrically growing block sizes
- the first successful thread grabs the remaining work
[Figure: sequential prefix up to m0, then blocks claimed dynamically by p0 . . . p3]
[Plot: speedup vs. position of the found element, finding n in the sequence [1,...,10^8] of integers on a 4-way Opteron; curves: MCSTL find with 2-4 threads, naive parallel with 2-4 threads, sequential]
partial_sum
Differences from the Algorithms Seen so Far
- n ≫ p: multiple elements per PE; the sum must be calculated in a preprocessing step, the prefix sum in a postprocessing step
- 2n + O(1) additions in total, not optimal; speedup only p/2, particularly bad for small p
- O(log p) communication steps
- shared-memory advantage: can split the data arbitrarily
Practical Algorithm for Shared Memory
- divide the input into p + 1 pieces
- the double calculation for the first part can be avoided
partial_sum: Algorithm
Processor i ∈ 0 . . . p − 1:
1. i = 0: compute partial sums of part 0, S[0] := last one
   i > 0: compute S[i] := sum of part i
2. i = 0: compute partial sums of S[i] sequentially
3. i ≥ 0: compute partial sums of part i + 1 using S[i]
Analysis
- only 3 synchronizations (constant)
- time complexity O(n/p + p), no hidden factor 2
- speedup (p + 1)/2 for n ≫ p
partial_sum: Scheme
[Figure: input divided among the PEs; p0 . . . p2 process their parts, p0 prefix-sums the part sums, then all PEs prefix-sum their parts using the offsets]
partial_sum: Results
[Plot: speedup vs. n for the prefix sum of integers on Sun T1; curves: sequential, 1-32 threads]
partition
[Figure: sequence partitioned into elements < pivot and > pivot]
Sequential Algorithm
- scan from both ends
- swap to the desired order when contrary
Parallel Partitioning [Tsigas, Zhang 2003]
1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the "middle"
3. recurse on the middle
[Figure: input blocks scanned by p0 . . . p2; the unfinished rest is treated recursively or sequentially; blocks are swapped in parallel]
- time complexity O(n/p + B log p)
partition: Example
[Figure: step-by-step blockwise partitioning of an example sequence; 3 processors, B = 3, pivot 50, no special cases]
[Plot: speedup vs. n for partitioning of 32-bit integers on Sun T1; curves: sequential, 1-32 threads]
nth_element, partial_sort, quicksort
[Figure: sequence split at rank n into elements < and ≥ the rank-n element; partial_sort additionally sorts the left part, nth_element does not]
Algorithms
- nth_element: quickselect, linear recursion using partition
- partial_sort: nth_element, then sort
- quicksort: recursion using partition, load balancing using work stealing
Parallel implementations profit from each other.
Multi-Sequence Selection
Problem Definition
Find the element with global rank r in k sorted sequences S_i.
Usage
Split at the elements with global ranks n/p, 2n/p, 3n/p, . . . , (p − 1)n/p and redistribute the elements ⇒ sequences of the same length (±1) on each PE.
- guaranteed even for many equal elements
Solution
[Varman et al. 1991], see next slide
Multi-Sequence Selection: Algorithm
Idea
- partition into two sets with the desired ratio (corresponds to the rank)
- start with the middle elements
- refine the partition by recursively adding the elements in the middle of both sides, taking only O(k) time for each step
- running time O(k log |S_i|); O(k log k · log |S_i|) for the practical variant
Multi-Sequence Selection: Example
k = 4, N = k · n = 4 · 7 = 28; select global rank 14
[Figure: four sorted sequences
1 2 6 7 9 11 15
2 8 9 17 23 24 25
6 7 9 12 23 24 25
3 8 10 13 14 17 19
with the current middle elements 7, 17, 12, 13 marked]
Multi-Sequence Selection: Example (continued)
k = 4, N = k · n = 4 · 7 = 28; select global rank 14
[Figure: refinement step with the remaining candidate elements per sequence: 2 7 11 | 8 17 24 | 7 12 24 | 8 13 17]
Multi-Sequence Selection: Remarks
Implementation Problems
- non-uniform lengths, lengths not equal to 2^i − 1: "conceptual padding" ⇒ running time ∼ log max_i |S_i|
- finding ranks ≠ (1/2) Σ_i |S_i|, short sequences: complicated special cases at the ends of the sequences
- equal elements: find the partition directly, not an element with a specified global rank
Sequential multiway_merge
Problem Definition
Merge k sorted sequences into one sorted sequence.
Solution
Use a tournament tree, usually implemented as a loser tree.
- binary tree in an array
- optimal O(log k) running time per merge step
- efficient computation of indices
- downside: tricky without sentinels and/or when k is not a power of 2
Loser Tree
[Figure: loser tree over the keys 6, 2, 7, 9, 1, 4, 7, 4 stored in an array; a deleteMin + insertNext step removes the winner 1, inserts the next element 3, and replays the matches along its path]
Parallel (multiway_)merge
How to divide the problem?
- find slabs, i. e. consistent sets of sections from the sequences
- two possibilities:
  - (randomized) splitting by sampling
  - exact splitting into parts of equal size (using multi-sequence selection)
[Figure: k sequences cut into slabs assigned to p0 . . . p3]
Parallel (multiway_)merge: Analysis
- time complexity O((1/p)(n log k + k log k · log max_j |S_j|))
- no full linear speedup
- good in practice
- special case k = p: O((n/p) log k + log p · log max_j |S_j|)
Parallel (multiway_)merge: Results
[Plot: speedup vs. n for multiway merging of pairs of 64-bit integers on Sun T1; curves: sequential, 1-32 threads]
sort, stable_sort
Parallel Multiway Mergesort
+ few, cache-efficient local memory accesses
+ stable variant easy
− needs twice the space
Quicksort
+ in-place
± dynamic load balancing due to unequal splitting
− more global memory access
− not stable
Both variants are implemented in the MCSTL.
Parallel Multiway Mergesort
Procedure
1. divide the sequence into p parts of equal size
2. in parallel, sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy the result back to the original position
[Figure: parts sorted locally by p0 . . . p3, then merged in parallel]
Parallel Multiway Mergesort: Analysis
Running Time
- time complexity O((n log n)/p + p log p · log(n/p))
- one multi-sequence selection per PE
Comparison to (Deterministic) Sample Sort
- very similar, only the splitting differs
- exact splitting ⇐⇒ approximation guaranteed
- DSS's time complexity: O((n log n)/p + p log p)
- tradeoff possible using oversampling
- global communication volume: 2n (copy back)
- local memory movement: (n/p) log₂(n/p)
Parallel Multiway Mergesort: Practical Issues
- copy to temporary memory first? or merge to temporary memory and copy back later?
- compute the starting positions sequentially
[Plot: speedup vs. n for multiway mergesort of 64-bit integers on Sun T1; curves: sequential, 1-32 threads]
Parallel Quicksort
Basic Algorithm
1. partition the sequence in parallel
2. if the group consists of more than one processor:
   2.1 divide the group according to data balance
   2.2 continue with 1. recursively
3. otherwise: sort the piece sequentially
Problem
Load balancing may be very poor, in particular with small p and bad splitters.
Solution
Keep the basic algorithm; dynamically balance the work in the last step.
Parallel Load-Balanced Quicksort
1. partition the sequence in parallel
2. if the group consists of more than one processor:
   2.1 divide the group according to data balance
   2.2 continue with 1. recursively
3. otherwise: quicksort the piece sequentially:
   push the piece onto a local stack;
   while unsorted elements exist:
   3.1 if the local stack is non-empty: pop a piece from it
   3.2 otherwise: take a (large) piece from the bottom of another PE's stack (work stealing)
   3.3 partition the piece
   3.4 push the right part onto the stack, sort the left part recursively
Parallel Load-Balanced Quicksort: Scheme
[Figure: input partitioned in parallel by p0 . . . p2; the groups split and partition in parallel again; finally sequential sorting per PE, with idle PEs stealing work]
Parallel Load-Balanced Quicksort: Practice
- omit stack operations for small parts
- use a lock-free stack data structure:
  - every thread makes progress in every step
  - no mutexes or semaphores are used
  - many lock-free data structures are known, many use linked lists
  - a simple one is used here
- how to detect termination?
- erratic performance if there are more threads than processors: why?
Lock-Free (Restricted) Double-Ended Queue
Requirements
- push_front, pop_front not concurrently, issued only by one specific thread
- pop_back concurrently from all other threads
- the number of elements is limited (logarithmic)
- no is_empty, no top, because their semantics would be unclear
- pop_* may fail
Solution
- circular buffer with front and back pointers
- encode the front and back pointers into one word to allow a synchronous atomic update using compare-and-swap
Lock-Free (Restricted) Double-Ended Queue
Code for pop_back:
  before := pointers
  while (before.front > before.back)
    after := (before.front, before.back + 1)
    if (cas(pointers, before, after))
      item := *(before.back)
      return true
  return false
Code for pop_front: as pop_back, but with
  after := (before.front − 1, before.back)
and item := *(after.front).
Code for push_front:
  *(pointers.front) := item
  fetch_and_add(pointers.front, 1)
Lock-Free (Restricted) Double-Ended Queue
Properties
- lock-free, but not wait-free
- the back pointer increases monotonically ⇒ no concurrency problems at the queue's back
- the front pointer does not increase monotonically ⇒ no problem, since no concurrent push and pop are allowed at the queue's front
- in case of failure: retry, or done
Balanced Quicksort: Analysis

I time complexity O((n log n)/p + B log p)
I communication volume + local memory movement: n log₂ n
I good speedups require fast random access across PE boundaries
Balanced Quicksort: Results

[Speedup plot: "Balanced Quicksort for 32-bit integers on 2 Dual-Core-Xeons"; x-axis: n (100 to 10^7), y-axis: speedup (0 to 4); series: sequential, 1, 2, 3, 4, 8 threads]
Balanced Quicksort: Problem Analysis

Problem
I disappointing performance
I particularly bad with too few processors
I where is the problem?
I a processor stays fully loaded while trying to steal, even when no piece is available

Solution
I switch to another thread if no work is found ⇒ yield
Balanced Quicksort: Results with yield

[Speedup plot: "Balanced Quicksort with Yield for 32-bit integers on 2 Dual-Core-Xeons"; x-axis: n (100 to 10^7), y-axis: speedup (0 to 4); series: sequential, 1, 2, 3, 4, 8 threads]
Balanced Quicksort: Comparison to PMWMS

[Speedup plot: "Multiway Mergesort for 32-bit integers on 2 Dual-Core-Xeons"; x-axis: n (100 to 10^7), y-axis: speedup (0 to 4); series: sequential, 1, 2, 3, 4, 8 threads]
Random Permutation (random_shuffle)

Standard sequential algorithm (e. g. STL):
    for 0 ≤ i < n: swap(a[i], a[rand(i, n − 1)])

Cache-efficient (parallel) algorithm:
1. distribute elements randomly to (local) buckets
1b. (copy local buckets to global buckets)
2. permute buckets

[diagram: elements scattered into buckets, then each bucket permuted]
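The two-step scheme above can be sketched sequentially in C++ (a hypothetical `bucket_shuffle` helper, not MCSTL's code; the bucket count `k` is an assumed tuning parameter chosen so a bucket fits in cache):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Sequential sketch of the two-step scheme (hypothetical helper, not
// MCSTL's code; the bucket count k is an assumed tuning parameter).
// Step 1 scatters every element into a uniformly random bucket; step 2
// runs Fisher-Yates inside each cache-sized bucket. Concatenating the
// buckets then yields a uniform random permutation with mostly-local
// memory accesses.
template <typename T>
void bucket_shuffle(std::vector<T> &a, std::mt19937 &rng, std::size_t k = 256) {
    std::vector<std::vector<T>> bucket(k);
    std::uniform_int_distribution<std::size_t> pick(0, k - 1);
    for (const T &x : a)
        bucket[pick(rng)].push_back(x);          // step 1: random scatter
    std::size_t pos = 0;
    for (auto &b : bucket) {                     // step 2: permute each bucket
        std::shuffle(b.begin(), b.end(), rng);
        for (const T &x : b)
            a[pos++] = x;                        // concatenate in bucket order
    }
}
```

The parallel version additionally gives each thread local buckets in step 1 and copies them into global buckets (step 1b), so the per-bucket permutations in step 2 can run independently on different threads.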
Random Permutation (random_shuffle)

I time complexity O(n/p + p), global communication volume n
I cache efficiency very important (factor 2)
[Speedup plot: "Cache-aware random shuffling of integers on 4-way Opteron"; x-axis: n (10^6 to 10^8), y-axis: speedup (0 to 8); series: sequential, 1, 2, 3, 4 threads]
Embarrassingly Parallel Computation

I semantics
  I process a set of elements completely independently
  I atomic units called jobs, running time unknown
I parallelization
  I easy in principle (uniform workload) ⇒ static load balancing
  I interesting for non-uniform workload ⇒ dynamic load balancing
I possible solutions
  I equal splitting: perfect for uniform workload
  I master-worker: possibly considerable overhead (communication in each step)
  I work stealing: communication only when necessary
Dynamic Load Balancing for for_each etc.

I using work stealing
  I divide the iteration range into equal intervals initially
  I idle threads steal half the interval from a random victim
  I no explicit synchronization with victims needed (using fetch_and_add)
  I adaptive granularity control (cache!)
  I a logarithmic number of steals suffices with high probability

[diagram: per-thread intervals; an idle thread steals half of a victim's remaining interval]
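A minimal sketch of the interval-stealing idea (illustrative names, not MCSTL's API; for simplicity the owner uses the same CAS loop rather than the fetch_and_add mentioned above): each worker claims indices from its own packed (begin, end) word, and an idle worker steals the upper half of a victim's remainder with one CAS, so owner and thief need no locks.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Illustrative sketch of interval work stealing (hypothetical names, not
// MCSTL's API). Both bounds of an interval are packed into one 64-bit
// word so a single CAS can shrink it from either end without locks.
class Interval {
    std::atomic<uint64_t> packed_;
    static uint64_t pack(uint32_t b, uint32_t e) { return (uint64_t(b) << 32) | e; }
public:
    void reset(uint32_t b, uint32_t e) { packed_.store(pack(b, e)); }

    bool take(uint32_t &idx) {                       // owner: claim one index
        uint64_t p = packed_.load();
        for (;;) {
            uint32_t b = uint32_t(p >> 32), e = uint32_t(p);
            if (b >= e) return false;                // interval drained
            if (packed_.compare_exchange_weak(p, pack(b + 1, e))) {
                idx = b;
                return true;
            }                                        // lost a race: p reloaded
        }
    }

    bool steal(uint32_t &sb, uint32_t &se) {         // thief: take upper half
        uint64_t p = packed_.load();
        for (;;) {
            uint32_t b = uint32_t(p >> 32), e = uint32_t(p);
            if (b >= e || e - b < 2) return false;   // too little to steal
            uint32_t mid = b + (e - b) / 2;
            if (packed_.compare_exchange_weak(p, pack(b, mid))) {
                sb = mid; se = e;                    // thief owns [mid, e)
                return true;
            }
        }
    }
};

template <typename F>
void parallel_for_each(uint32_t n, unsigned nthreads, F f) {
    std::vector<Interval> part(nthreads);
    for (unsigned t = 0; t < nthreads; ++t)          // equal initial split
        part[t].reset(uint32_t(uint64_t(t) * n / nthreads),
                      uint32_t(uint64_t(t + 1) * n / nthreads));
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            uint32_t i, b, e;
            for (;;) {
                while (part[t].take(i)) f(i);        // drain own interval
                bool stolen = false;                 // empty: sweep victims
                for (unsigned v = 0; v < nthreads && !stolen; ++v)
                    if (v != t && part[v].steal(b, e)) {
                        part[t].reset(b, e);
                        stolen = true;
                    }
                if (!stolen) break;                  // nothing stealable left
            }
        });
    for (auto &th : pool) th.join();
}
```

Every index is claimed exactly once because the CAS makes owner and thief operations on one interval mutually exclusive; a worker may exit while a victim still holds a single last item, but that item is then consumed by its owner, so correctness is unaffected.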
[Speedup plot: "Mandelbrot on 4-way Opteron, at most 1000 iterations per pixel"; x-axis: number of pixels (100 to 10^5), y-axis: speedup (0 to 4); series: sequential; 2, 3, 4 threads balanced; 2, 3, 4 threads unbalanced]
Outline
Introduction
Platform Support
Algorithms
Conclusion
Conclusion

I MCSTL provides a very easy way to incorporate parallelism into programs at the algorithmic level
I performance is excellent for large inputs
I the basic algorithms are known, but detailed design and performance engineering are nontrivial
I successful integration into STXXL (external memory)
Future Work

I complete STL functionality
I better automatic algorithm and parameter selection
I a machine model adequate for the design and analysis of multithreaded algorithms
I beyond STL
Algorithms & Data Structures to be Implemented

I containers: initialization, bulk operations
I priority queues
I some embarrassingly parallel functions (e. g. valarray)
I memory transfer operations (reverse, copy)?
I set operations (set_union, . . . )
More About All That

I MCSTL website: http://algo2.iti.uni-karlsruhe.de/singler/mcstl/
I lab course (Praktikum) next semester: extension/usage of the MCSTL
I student research projects and diploma theses (Studien-/Diplomarbeiten)