university of californiastellar.cct.lsu.edu/pubs/c++17 parallel algorithms and beyond.pdfc++17...

86
UNIVERSITY OF CALIFORNIA

Upload: others

Post on 28-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

UNIVERSITY OF

CALIFORNIA

Page 2: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

C++17 Parallel Algorithms

and Beyond

Bryce Adelstein Lelbach

Computer Architecture Group, Computing Research Division

CppCon 2016

Page 3: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

C++17 Parallel Algorithms

and Beyond

Bryce Adelstein Lelbach

Computer Architecture Group, Computing Research Division

CppCon 2016

Page 4: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

4

Page 5: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Hardware is increasingly parallel and increasingly

diverse. Vendor-neutral parallel programming

abstractions are desperately needed to avoid leaving

performance on the table.

C++11/14 provide low-level concurrency primitives, but

no real higher-level generic abstractions for parallel

programming.

Copyright (C) 2016 Bryce Adelstein Lelbach

5

Page 6: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

6

Q: What are standard algorithms?

Page 7: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

The standard says the algorithms library

describes components that C++

programs may use to perform

algorithmic operations on containers

and other sequences

(25.1 [algorithms.general] p1)

Copyright (C) 2016 Bryce Adelstein Lelbach

7

Page 8: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Q: What are standard algorithms?

A: Generic operations on sequences.

Copyright (C) 2016 Bryce Adelstein Lelbach

8

Page 9: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Q: What are standard algorithms?

A: Generic operations on sequences.

(except min(), max(), etc)

Copyright (C) 2016 Bryce Adelstein Lelbach

9

Page 10: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Three types of standard algorithms:

• Non-modifying sequence operations.

• Mutating sequence operations.

• Sorting and related operations.

Copyright (C) 2016 Bryce Adelstein Lelbach

10

Page 11: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

The Committee Draft of C++17 includes a

new standard parallel algorithms library

which provides parallelized versions of

these sequence operations.

Copyright (C) 2016 Bryce Adelstein Lelbach

11

Page 12: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

12

Q: What does parallel mean?

Page 13: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

13

Bit-Level Parallelism

Instruction-Level Parallelism

Vector-Level Parallelism

Task-Level Parallelism

Process-Level Parallelism

implicit

explicit

Page 14: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

14

Bit-Level Parallelism

Instruction-Level Parallelism

Vector-Level Parallelism

Task-Level Parallelism

Process-Level Parallelism

implicit

explicit

Page 15: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

15

Bit-Level Parallelism

Instruction-Level Parallelism

Vector-Level Parallelism

Task-Level Parallelism

Process-Level Parallelism

implicit

explicit

Page 16: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Parallel algorithms library components:

• ExecutionPolicy concept.

• Three standard execution policies.

• New ExecutionPolicy overloads for most

existing standard algorithms.

• New “unordered” algorithms based on

existing “ordered” algorithms.

• Fused algorithms.

Copyright (C) 2016 Bryce Adelstein Lelbach

16

Page 17: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

An ExecutionPolicy describes how a

generic algorithm may be parallelized.

They allow programmers to request

parallelism and describe constraints.

Copyright (C) 2016 Bryce Adelstein Lelbach

17

Page 18: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Standard execution policies:

• std::seq – operations are indeterminately

sequenced in the calling thread.

• std::par – operations are indeterminately

sequenced with respect to each other within

the same thread.

• std::par_unseq – operations are

unsequenced with respect to each other

and possibly interleaved.

Copyright (C) 2016 Bryce Adelstein Lelbach

18

Page 19: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Suppose we are using the following binary

operation with std::transform():

double multiply(double x, double y){ return x * y; }

std::transform(// "Left" input sequence.x.begin(), x.end(),y.begin(), // "Right" input sequence.x.begin(), // Output sequence.multiply);

Copyright (C) 2016 Bryce Adelstein Lelbach

19

Page 20: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Suppose we are using the following binary

operation with std::transform():

load x[i] to a scalar registerload y[i] to a scalar registermultiply x[i] and y[i]store the result to x[i]

Copyright (C) 2016 Bryce Adelstein Lelbach

20

Page 21: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

21

std::par

load x[i ] to a scalar registerload y[i ] to a scalar registermultiply x[i ] and y[i ]store the result to x[I ]load x[i+1] to a scalar registerload y[i+1] to a scalar registermultiply x[i+1] and y[i+1]store the result to x[i+1]load x[i+2] to a scalar registerload y[i+2] to a scalar registermultiply x[i+2] and y[i+2]store the result to x[i+2]load x[i+3] to a scalar registerload y[i+3] to a scalar registermultiply x[i+3] and y[i+3]store the result to x[i+3]

Page 22: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::par_unseq

load x[i ] to a scalar registerload x[i+1] to a scalar registerload x[i+2] to a scalar registerload x[i+3] to a scalar registerload y[i ] to a scalar registerload y[i+1] to a scalar registerload y[i+2] to a scalar registerload y[i+3] to a scalar registermultiply x[i ] and y[i ]multiply x[i+1] and y[i+1]multiply x[i+2] and y[i+2]multiply x[i+3] and y[i+3]store the result to x[i ]store the result to x[i+1]store the result to x[i+2]store the result to x[i+3]

Copyright (C) 2016 Bryce Adelstein Lelbach

22

std::par

load x[i ] to a scalar registerload y[i ] to a scalar registermultiply x[i ] and y[i ]store the result to x[I ]load x[i+1] to a scalar registerload y[i+1] to a scalar registermultiply x[i+1] and y[i+1]store the result to x[i+1]load x[i+2] to a scalar registerload y[i+2] to a scalar registermultiply x[i+2] and y[i+2]store the result to x[i+2]load x[i+3] to a scalar registerload y[i+3] to a scalar registermultiply x[i+3] and y[i+3]store the result to x[i+3]

Page 23: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::par_unseq

load x[i:i+3] to a vector registerload y[i:i+3] to a vector registermultiply x[i:i+3] and y[i:i+3]store the results to x[i:i+3]

Copyright (C) 2016 Bryce Adelstein Lelbach

23

std::par

load x[i ] to a scalar registerload y[i ] to a scalar registermultiply x[i ] and y[i ]store the result to x[I ]load x[i+1] to a scalar registerload y[i+1] to a scalar registermultiply x[i+1] and y[i+1]store the result to x[i+1]load x[i+2] to a scalar registerload y[i+2] to a scalar registermultiply x[i+2] and y[i+2]store the result to x[i+2]load x[i+3] to a scalar registerload y[i+3] to a scalar registermultiply x[i+3] and y[i+3]store the result to x[i+3]

Page 24: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

24

adjacent_difference is_sorted[_until] Rotate[_copy]

adjacent_find lexicographical_compare Search[_n]

all_of max_element set_difference

any_of merge set_intersection

copy[_if|_n] min_element set_symmetric_difference

count[_if] minmax_element set_union

equal mismatch sort

fill[_n] move stable_partition

find[_end|_first_of|_if|_if_not] none_of stable_sort

for_each nth_element swap_ranges

generate[_n] partial_sort[_copy] transform

includes partition[_copy] uninitialized_copy[_n]

inplace_merge remove[_copy|_copy_if|_if] uninitialized_fill[_n]

is_heap[_until] Replace[_copy|_copy_if|_if] unique

is_partitioned Reverse[_copy] unique_copy

New ExecutionPolicy overloads for most existing standard algorithms.

Page 25: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<T> x = // ...

std::sort(std::par, x.begin(), x.end());

Copyright (C) 2016 Bryce Adelstein Lelbach

25

Page 26: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<T> x = // ...

std::for_each(std::par_unseq,x.begin(), x.end(), process);

Copyright (C) 2016 Bryce Adelstein Lelbach

26

Page 27: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<T> x = // ...

#pragma omp parallel for simdfor (std::size_t i = 0; i < x.size(); ++i)process(x[i]);

Copyright (C) 2016 Bryce Adelstein Lelbach

27

Page 28: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

New “unordered” algorithms based

on existing “ordered” algorithms.

• std::reduce()

• std::inclusive_scan()

• std::exclusive_scan()

• std::transform_reduce()

Copyright (C) 2016 Bryce Adelstein Lelbach

28

Page 29: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::reduce() - unordered std::accumulate()

T v = std::reduce([ep,]first, last,[init,] [op])

Copyright (C) 2016 Bryce Adelstein Lelbach

29

Page 30: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::accumulate():

first, acc = init

then for every it in [first, last) in order

acc = binary_op(acc, *it)

std::reduce():

GSUM(binary_op, init, *first, ...)

Copyright (C) 2016 Bryce Adelstein Lelbach

30

Page 31: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Commutativity: Changing the order of operations does

not change the result.

• Integer Addition: x + y == y + x

• Integer Multiplication: xy == yx

• Integer Subtraction: x – y != y - x

Associativity: The grouping of operations does not

change the result.

• Integer Addition: (x + y) + z == x + (y + z)

• Integer Multiplication: (xy)z == x(yz)

• Integer Subtraction: (x – y) – z == x – (y – z)

Copyright (C) 2016 Bryce Adelstein Lelbach

31

Page 32: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

GNSUM(op, a1, …, aN) =

a1 N == 1

op(GNSUM(op, a1, …, ak), otherwise

GNSUM(op, ak+1, …, aN))

Copyright (C) 2016 Bryce Adelstein Lelbach

32

Page 33: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

GSUM(op, a1, …, aN) == GNSUM(op, b1, …, bN)

where b1, …, bN may be any permutation of b1, …, bN

Copyright (C) 2016 Bryce Adelstein Lelbach

33

Page 34: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x{1e-2, 1e-1, 1e0, 1e1, 1e2};

// sum ~= 111.111double sum = std::accumulate(x.begin(), x.end(), 0.0);

Copyright (C) 2016 Bryce Adelstein Lelbach

34

Page 35: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x{1e-2, 1e-1, 1e0, 1e1, 1e2};

// sum ~= 111.111double sum = 0.0;

sum = sum + x[0];sum = sum + x[1];sum = sum + x[2];sum = sum + x[3];sum = sum + x[4];

Copyright (C) 2016 Bryce Adelstein Lelbach

35

Page 36: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x{1e-2, 1e-1, 1e0, 1e1, 1e2};

// sum ~= 111.111double sum = std::reduce(x.begin(), x.end(), 0.0);

Copyright (C) 2016 Bryce Adelstein Lelbach

36

Page 37: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x{1e-2, 1e-1, 1e0, 1e1, 1e2};

// sum ~= 111.111double sum = 0.0;

sum = sum + x[0];sum = sum + x[1];sum = sum + x[2];sum = sum + x[3];sum = sum + x[4];

Copyright (C) 2016 Bryce Adelstein Lelbach

37

Page 38: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x{1e-2, 1e-1, 1e0, 1e1, 1e2};

// sum ~= 111.111double sum = 0.0;

sum = sum + x[1];sum = sum + x[0];sum = sum + x[2];sum = sum + x[4];sum = sum + x[3];

Copyright (C) 2016 Bryce Adelstein Lelbach

38

Page 39: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x{1e-2, 1e-1, 1e0, 1e1, 1e2};

// GNSUM(plus, x[2], x[3], x[4])double sum0 = x[2] + x[3] + x[4];

// GNSUM(plus, 0.0, x[0], x[1])double sum1 = 0.0 + x[0] + x[1];

// GNSUM(plus, 0.0, x[0], ..., x[4])// = plus(GNSUM(plus, x[2], x[3], x[4]),// GNSUM(plus, 0.0, x[0], x[1]))double sum = sum0 + sum1;

Copyright (C) 2016 Bryce Adelstein Lelbach

39

sum0 sum1

sum

Page 40: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x{1e-2, 1e-1, 1e0, 1e1, 1e2};

// GNSUM(plus, x[0], x[4], x[2])double sum0 = x[0] + x[4] + x[2];

// GNSUM(plus, 0.0, x[3], x[1])double sum1 = 0.0 + x[3] + x[1];

// GNSUM(plus, 0.0, x[0], ..., x[4])// = plus(GNSUM(plus, x[0], x[4], x[2]),// GNSUM(plus, 0.0, x[3], x[1]))double sum = sum0 + sum1;

Copyright (C) 2016 Bryce Adelstein Lelbach

40

sum0 sum1

sum

Page 41: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::inclusive_scan() - unordered std::partial_sum().

OutputIt it =std::inclusive_scan([ep,]

first, last, output,[op,] [init]);

Copyright (C) 2016 Bryce Adelstein Lelbach

41

Page 42: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Copyright (C) 2016 Bryce Adelstein Lelbach

42

std::inclusive_scan() - unordered std::partial_sum().

OutputIt it =std::inclusive_scan([ep,]

first, last, output,[op,] [init]);

Page 43: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

*(output) = *first;*(output+1) = *first + *(first+1);*(output+2) = *first + *(first+1) + *(first+2);// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

43

Page 44: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::exclusive_scan() - exclusive prefix sum.

OutputIt it =std::exclusive_scan([ep,]

first, last, output,[op,] [init]);

Copyright (C) 2016 Bryce Adelstein Lelbach

44

Page 45: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

*(output) = init;*(output+1) = init + *first;*(output+2) = init + *first + *(first+1);// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

45

Page 46: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Fused algorithms:

• transform_reduce()

• transform_inclusive_scan()

• transform_exclusive_scan()

Copyright (C) 2016 Bryce Adelstein Lelbach

46

Page 47: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

C++17 Parallel Algorithms

and Beyond

Bryce Adelstein Lelbach

Computer Architecture Group, Computing Research Division

CppCon 2016

Page 48: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

C++17 Parallel Algorithms

and Beyond

Bryce Adelstein Lelbach

Computer Architecture Group, Computing Research Division

CppCon 2016

Page 49: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

C++17 Parallel Algorithms

and Beyond

Bryce Adelstein Lelbach

Computer Architecture Group, Computing Research Division

CppCon 2016

Page 50: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::transform_reduce() - unordered std::transform_reduce().

R v =std::transform_reduce([ep,]

first, last,trans_op, init, reduce_op);

Copyright (C) 2016 Bryce Adelstein Lelbach

50

Page 51: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

R trans_op(T const&);R reduce_op(R const&, R const&);

R v = reduce_op(trans_op(x[i]), trans_op(x[i+1]));

Copyright (C) 2016 Bryce Adelstein Lelbach

51

Page 52: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

R v = reduce_op(

reduce_op(trans_op(x[i] ), trans_op(x[i+1])),

reduce_op(trans_op(x[i+2]), trans_op(x[i+3])));

Copyright (C) 2016 Bryce Adelstein Lelbach

52

reduce_op

reduce_op

trans_opx[i+2]

trans_opx[i+2]

reduce_op

trans_opx[i]

trans_opx[i+2]

Page 53: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::transform_reduce() - unordered std::transform_reduce().

R v =std::transform_reduce([ep,]

first1, last1, first2,trans_op, init, reduce_op);

Copyright (C) 2016 Bryce Adelstein Lelbach

54

Page 54: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...

double norm =std::sqrt((x[0] * x[0]) + (x[1] * x[1]) + /* ... */);

Copyright (C) 2016 Bryce Adelstein Lelbach

55

Page 55: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...

double norm =std::sqrt(std::transform_reduce(std::par_unseq,

// Input sequence.x.begin(), x.end(),

// Unary transform op.[] (double x) { return x * x; },

// Initial reduction value.double(0.0),

// Reduction op.[] (double xl, double xr) { return xl + xr; }

));

Copyright (C) 2016 Bryce Adelstein Lelbach

56

Page 56: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...

double norm =std::sqrt(std::transform_reduce(std::par_unseq,

// Input sequence.x.begin(), x.end(),

// Unary transform op.[] (double x) { return x * x; },

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

57

Page 57: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...

double norm =std::sqrt(std::transform_reduce(std::par_unseq,

// Input sequence.x.begin(), x.end(),

// Unary transform op.[] (double x) { return x * x; },

// Initial reduction value.double(0.0),

// Reduction op.[] (double xl, double xr) { return xl + xr; }

));

Copyright (C) 2016 Bryce Adelstein Lelbach

58

Page 58: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...std::vector<double> y = // ...

double dot_product =(x[0] * y[0]) + (x[1] * y[1]) + // ...

Copyright (C) 2016 Bryce Adelstein Lelbach

59

Page 59: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...std::vector<double> y = // ...

double dot_product = std::transform_reduce(std::par_unseq,

// "Left" input sequence.x.begin(), x.end(),

// "Right" input sequence.y.begin(),

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

60

Page 60: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...std::vector<double> y = // ...

double dot_product = std::transform_reduce(std::par_unseq,

// "Left" input sequence.x.begin(), x.end(),

// "Right" input sequence.y.begin(),

// Binary transform op.[] (double x, double y) { return x * y; },

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

61

Page 61: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...std::vector<double> y = // ...

double dot_product = std::transform_reduce(std::par_unseq,

// "Left" input sequence.x.begin(), x.end(),

// "Right" input sequence.y.begin(),

// Binary transform op.[] (double x, double y) { return x * y; },

// Initial value for reduction.double(0.0),

// Reduction op.[] (double x, double y) { return x + y; }

);

Copyright (C) 2016 Bryce Adelstein Lelbach

62

Page 62: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::vector<double> x = // ...std::vector<double> y = // ...

double dot_product = std::transform_reduce(std::par_unseq,

// Input sequence.boost::counting_iterator<std::size_t>(0),boost::counting_iterator<std::size_t>(x.size()),

// Unary transform op.[&x, &y] (std::size_t i) { return x[i] * y[i]; },

// Initial value for reduction.double(0.0),

// Reduction op.[] (double x, double y) { return x + y; }

);

Copyright (C) 2016 Bryce Adelstein Lelbach

63

Page 63: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::size_t word_count(std::string_view s) {// Goal: Count the number of word "beginnings" in the// input sequence.

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

64

Page 64: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

bool is_word_beginning(char left, char right) {// If left is a space and right is not, we've hit a// new word.return std::isspace(left) && !std::isspace(right);

}

Copyright (C) 2016 Bryce Adelstein Lelbach

65

Page 65: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::size_t word_count(std::string_view s) {if (s.empty()) return 0;

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

66

Page 66: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::size_t word_count(std::string_view s) {if (s.empty()) return 0;

// If the first character is not a space, then it's the// beginning of a word.std::size_t wc = (!std::isspace(s.front()) ? 1 : 0);

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

67

Page 67: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::size_t word_count(std::string_view s) {// ...

// Count the number of characters that start a new wordwc +=std::transform_reduce(std::par_unseq,

// "Left" input: s[0], s[1], ..., s[s.size() - 2]s.begin(), s.end() - 1,// "Right" input: s[1], s[2], ..., s[s.size() - 1]s.begin() + 1,

// Binary transform op: Return 1 when we hit a new word.is_word_beginning,

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

68

Page 68: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::size_t word_count(std::string_view s) {// ...

// Count the number of characters that start a new wordwc +=std::transform_reduce(std::par_unseq,

// "Left" input: s[0], s[1], ..., s[s.size() - 2]s.begin(), s.end() - 1,// "Right" input: s[1], s[2], ..., s[s.size() - 1]s.begin() + 1,

// Binary transform op: Return 1 when we hit a new word.is_word_beginning,

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

69

Page 69: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

std::size_t word_count(std::string_view s) {// ...

// Count the number of characters that start a new wordwc +=std::transform_reduce(std::par_unseq,

// "Left" input: s[0], s[1], ..., s[s.size() - 2]s.begin(), s.end() - 1,// "Right" input: s[1], s[2], ..., s[s.size() - 1]s.begin() + 1,

// Binary transform op: Return 1 when we hit a new word.is_word_beginning,

std::size_t(0), // Initial value for reduction.std::plus<std::size_t>() // Reduction op.

);

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

71

Page 70: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

input sequence:"Whose woods these are I think I know.\n""His house is in the village though; \n""He will not see me stopping here \n""To watch his woods fill up with snow.\n"

First Stanza of Stopping by Woods on a Snowy Evening, Robert Frost

post-transform pseudo-sequence:0000010000010000010001010000010100000 100010000010010010001000000010000000001001000010001000100100000000100000000010010000010001000001000010010000100000

Copyright (C) 2016 Bryce Adelstein Lelbach

72

Page 71: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

input sequence:"Whose woods these are I think I know.\n""His house is in the village though; \n""He will not see me stopping here \n""To watch his woods fill up with snow.\n"

First Stanza of Stopping by Woods on a Snowy Evening, Robert Frost

post-transform pseudo-sequence:1 + 1 + 1 + 1+1 + 1+1 +

1 + 1 + 1 +1 +1 + 1 + 1 + 1 +1 + 1 + 1 + 1 +1 + 1 + 1 +1 + 1 + 1 + 1 + 1 +1 + 1

Copyright (C) 2016 Bryce Adelstein Lelbach

73

Page 72: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

bool is_word_beginning(char left, char right) {return std::isspace(left) && !std::isspace(right);

}

std::size_t word_count(std::string_view s) {if (s.empty()) return 0;

std::size_t wc = (!std::isspace(s.front()) ? 1 : 0);

wc +=std::transform_reduce(std::par_unseq,s.begin(), s.end() - 1,s.begin() + 1,is_word_beginning,std::size_t(0),std::plus<std::size_t>()

);

return wc;}

Copyright (C) 2016 Bryce Adelstein Lelbach

74

Page 73: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

bool is_word_beginning(char left, char right) {return std::isspace(left) && !std::isspace(right);

}

std::size_t word_count(std::string_view s) {if (s.empty()) return 0;

std::size_t wc =std::transform_reduce(std::par_unseq,s.begin(), s.end() - 1,s.begin() + 1,is_word_beginning,std::size_t(!std::isspace(s.front()) ? 1 : 0),std::plus<std::size_t>()

);

return wc;}

Copyright (C) 2016 Bryce Adelstein Lelbach

75

Page 74: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Sparse histogram:

• Goal: Find all the unique values in a

sequence and count the number of times

they occur.

• Example:

– Input: a, b, c, c, a, a, b, b, b, b, e

–Output keys: [a, b, c, e]

–Output counts: [2, 5, 3, 1]

Copyright (C) 2016 Bryce Adelstein Lelbach

76

Page 75: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {std::vector<T> hist_keys;std::vector<std::size_t> hist_counts;

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

77

Page 76: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {std::vector<T> hist_keys;std::vector<std::size_t> hist_counts;

if (x.empty())return std::make_tuple(std::move(hist_keys),

std::move(hist_counts));

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

78

Page 77: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {std::vector<T> hist_keys;std::vector<std::size_t> hist_counts;

if (x.empty())return std::make_tuple(std::move(hist_keys),

std::move(hist_counts));

// Sort x to bring equal elements together.std::sort(std::par_unseq, x.begin(), x.end());

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

79

Page 78: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {// ...

// Count the number of unique elements.std::size_t num_unique_elements = // ...

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

80

Page 79: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {// ...

// Count the number of unique elements.std::size_t num_unique_elements = std::transform_reduce(std::par_unseq,

x.begin(), x.end() - 1, // x[0], x[1], ..., x[x.size() - 2]x.begin() + 1, // x[1], x[2], ..., x[x.size() - 1]

// Transform op: Return 1 if right is a new unique element.[] (auto&& left, auto&& right)// If the right is not equal to the left, then we've// hit the next unique element.{ return left != right; },

std::size_t(1), // x isn’t empty, so 1 unique key minimum.std::plus<std::size_t>() // Reduction operation.

);

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

81

Page 80: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {// ...

// Allocate storage.hist_keys.resize (num_unique_elements);hist_counts.resize(num_unique_elements);

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

82

Page 81: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {// ...

// Count the number of occurrences of each unique key.hpx::reduce_by_key(hpx::par_unseq,

// Input key sequence.x.begin(), x.end(),// Input value sequence.boost::constant_iterator<std::size_t>(1),

// Output key sequence.hist_keys.begin(),// Output value sequence.hist_counts.begin()

);

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

83

Page 82: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {// ...

// Count the number of occurrences of each unique key.hpx::reduce_by_key(hpx::par_unseq,

// Input key sequence.x.begin(), x.end(),// Input value sequence.boost::constant_iterator<std::size_t>(1),

// Output key sequence.hist_keys.begin(),// Output value sequence.hist_counts.begin()

);

// ...

Copyright (C) 2016 Bryce Adelstein Lelbach

84

Page 83: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

auto sparse_histogram(std::vector<T> const& x) {// ...

hpx::reduce_by_key(hpx::par_unseq,

// Input key sequence.x.begin(), x.end(),// Input value sequence.boost::constant_iterator<std::size_t>(1),

// Output key sequence.hist_keys.begin(),// Output value sequence.hist_counts.begin()

);

return std::make_tuple(std::move(hist_keys),std::move(hist_counts));

}

Copyright (C) 2016 Bryce Adelstein Lelbach

85

Page 84: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Parallel algorithms exception handling

• If an element access function exits via an

uncaught exception, std::terminate() is

called.

Copyright (C) 2016 Bryce Adelstein Lelbach

86

Page 85: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Parallel algorithms exception handling

• If an element access function exits via an

uncaught exception, std::terminate() is

called.

• Parallel algorithms may also throw

std::bad_alloc if temporary memory

resources are needed for execution and

none are available.

Copyright (C) 2016 Bryce Adelstein Lelbach

87

Page 86: UNIVERSITY OF CALIFORNIAstellar.cct.lsu.edu/pubs/C++17 Parallel Algorithms and Beyond.pdfC++17 Parallel Algorithms and Beyond Bryce Adelstein Lelbach Computer Architecture Group, Computing

Acknowledgments

HPX: github.com/STEllAR-GROUP/hpx

Thrust: github.com/thrust/thrust

Boost.Compute: github.com/boostorg/compute

Jared Hoberock

Michael Garland

Grant Mercer

Hartmut Kaiser

Copyright (C) 2016 Bryce Adelstein Lelbach

88