Page 1:

Introduction & Tutorial
MCC 2019 SkePU tutorial

August Ernstsson, Christoph Kessler
{firstname.lastname}@liu.se
Linköping University, Sweden

November 27, 2019
ida.liu.se/labs/pelab/skepu/

Page 2:

Overview

• Introduction and history
• SkePU interface
• Installation
• Skeletons in depth
  • Map
  • Reduce
  • MapReduce
  • Scan
  • MapOverlap
• Preview: Future SkePU
• Demonstration, Outlook, Discussion…

Page 3:

Skeleton Programming

Page 4:

Skeleton Programming :: Motivation

Programming parallel systems is hard!

• Resource utilization
• Synchronization, Communication
• Memory consistency
• Different hardware architectures, heterogeneity

Skeleton programming (algorithmic skeletons)

• A high-level parallel programming concept
• Inspired by functional programming
• Generic computational patterns
• Abstracts architecture-specific issues

Page 5:

Skeleton Programming :: Skeletons

Skeletons: parametrizable higher-order constructs

• Map
• Reduce
• MapReduce
• Scan
• and others

[Figure: Map and MapReduce skeleton diagrams]

Page 6:

Skeleton Programming :: User Functions

User functions: user-defined operators

[Figure: example user functions Mult, Add, Abs, Pow]

Page 7:

Skeleton Programming :: Example

Skeleton parametrization example: dot product operation

[Figure: MapReduce instantiated with Mult and Add]

Page 8:

Skeleton Programming :: Map

Sequential algorithm

[Figure: Map with Mult applied element-wise to (1, 2, …, 8) and (2, 2, …, 2), producing (2, 4, …, 16) one element after another]

Page 9:

Skeleton Programming :: Map

Parallel algorithm

[Figure: the same Map computation with all Mult invocations executed in parallel]

Page 10:

Skeleton Programming :: Reduce

Sequential algorithm

[Figure: Reduce with Add over (1, 2, …, 8), accumulating partial sums 3, 6, 10, 15, 21, 28, 36 from left to right]

Page 11:

Skeleton Programming :: Reduce

Parallel algorithm? Dependencies!

[Figure: the same left-to-right reduction; each Add depends on the previous partial sum, so it cannot be parallelized directly]

Page 12:

Skeleton Programming :: Reduce

Parallel algorithm (assuming associativity)

[Figure: tree-shaped Reduce with Add over (1, 2, …, 8): pairwise sums 3, 7, 11, 15; then 10, 26; then 36]

Page 13:

SkePU

Page 14:

But First: Modern C++

• SkePU uses ”modern” C++

    // ”auto” type specifier
    auto addOneMap = skepu2::Map<1>(addOneFunc);
    skepu2::Vector<float> input(size), res(size);
    input.randomize(0, 9);

    // Lambda expression (capture by reference)
    auto dur = skepu2::benchmark::measureExecTime([&]
    {
        addOneMap(res, input);
    });

• Implementation reliant on variadic templates, template metaprogramming, and other C++11 features

Page 15:

Features

✓ Efficient parallel algorithms
✓ Memory management and data movement
✓ Automatic backend selection and tuning

Page 16:

History

• SkePU 1, released 2010
  • C++ template-based interface (limited arity)
  • Multi-backend using macro-based code generation
• SkePU 2, released 2016
  • C++11 variadic template interface (flexible arity)
  • Multi-backend using source-to-source precompiler
• SkePU 3, in development for 2020
  • Expanding skeleton set, container set, and expressivity

Page 17:

Features

• Skeleton programming framework
  • C++11 library with skeleton and data container classes
  • A Clang-based source-to-source pre-compiler
• Smart containers: Vector<T>, Matrix<T>
  • In development: SparseMatrix<T>
• For heterogeneous multicore systems
  • Multiple backends
• Active research tool with a number of publications 2010-2019 (see website)

Page 18:

Skeletons

• Skeletons provided by SkePU
  • Map
  • Reduce
  • MapReduce
  • Scan
  • MapOverlap
  • (In development) MapPairs

Page 19:

Backend Architecture

C++ interface (skeletons, smart containers, …)

Backend: C++ | OpenMP | OpenCL | CUDA | MPI (internal prototype)
Target:  CPU | Multi-core CPU | Accelerator | GPU | Cluster

Page 20:

Smart Containers

[Figure: Vector, Matrix, and Sparse matrix containers]

Page 21:

Smart Containers

• C++ template class instance
• Contains:
  • CPU memory buffer pointer (alt. StarPU handles)
  • Size information (size, width/height)
  • OpenCL/CUDA/MPI handles
  • Consistency states
• Template type can be a custom struct, but be careful!
  • Data layout not verified across backends/languages

Page 22:

Using SkePU

Page 23:

Source-to-Source Translation

Compiler: high-level program (C++) → low-level program (Assembly)
Source-to-source translator: high-level program (C++) → same-level program (C++)

Page 24:

Compilation Architecture

Program sources → source-to-source compiler → backend sources (C++, OpenCL, etc.)
Backend sources + SkePU runtime library → backend compiler (e.g., GCC) → executable

Handled by Makefiles

Page 25:

Installation

• Two main installation options:
  • Use the provided pre-built pre-compiler tool and library (for 64-bit Linux systems)
  • Use the installation script to download and build the pre-compiler (requires time and disk space for the LLVM/Clang source trees)

See website: https://www.ida.liu.se/labs/pelab/skepu/

Page 26:

Syntax

int add(int a, int b, int m)
{
    return (a + b) % m;
}

auto vec_sum = Map<2>(add);

vec_sum(result, v1, v2, 5);

[Figure: Map skeleton instantiated with the add user function]

Page 27:

User Functions

• User functions are C++ (rather, C) functions
• The signature is analyzed by the pre-compiler to extract the skeleton signature
• Each skeleton has its own expected patterns for UF parameters (but the general structure is shared)
• The UF body is side-effect-free C (compatible with CUDA/OpenCL)
  • No communication/synchronization
  • No memory allocation
  • No disk I/O

int add(int a, int b, int m)
{
    return (a + b) % m;
}

Page 28:

Flexibility

• Variable arity on Map and MapReduce skeletons
• Index argument (of current Map’d container element)
• Uniform arguments
• Smart container arguments accessible freely inside user function
  • Read-only / write-only / read-write copy modes
• User function templates

(Several of these features are combined in the sketch below.)
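A minimal, hedged sketch combining an index argument, a random-access container argument, and a uniform argument; the function and variable names are illustrative and not taken from the tutorial.

    float scale_and_offset(skepu2::Index1D idx, float x, const skepu2::Vec<float> offsets, float scale)
    {
        // idx.i is the position of the current element; offsets is freely indexable
        return scale * x + offsets[idx.i % offsets.size];
    }

    // Arity <1> counts only the element-wise parameter
    auto transform = skepu2::Map<1>(scale_and_offset);

    skepu2::Vector<float> in(1000), offsets(10), out(1000);

    // Result container first, then element-wise, container, and uniform arguments
    transform(out, in, offsets, 2.5f);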

Page 29:

Advanced Example

template<typename T>
T abs(T input)
{
    return input < 0 ? -input : input;
}

template<typename T>
T mvmult(Index1D row, const Mat<T> m, const Vec<T> v)
{
    T res = 0;
    for (size_t i = 0; i < v.size; ++i)
        res += m[row.i * m.cols + i] * v[i];
    return abs(res);
}

Page 30:

Advanced/Experimental Features

• Multi-variant user function specialization
  • Targeting backend
• Custom types
• Chained user functions
• In-line lambda syntax for user functions
• ”Intrinsic” functions: some functions exist in the standard library of most SkePU backends, and SkePU allows certain such functions to be called from a user function. Examples: sin(x), pow(x, e) (see the sketch below)
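A hedged sketch of calling backend-provided math functions such as sin() and pow() from a user function; which functions are actually available depends on the backend, and the names below are illustrative.

    float wave(float x)
    {
        return pow(sin(x), 2.0f);   // sin and pow provided by the backend's standard library
    }

    auto squaredSine = skepu2::Map<1>(wave);

    skepu2::Vector<float> in(1024), out(1024);
    squaredSine(out, in);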

Page 31:

Lambda Syntax

auto vec_sum = Map<2>([](int a, int b)
{
    return a + b;
});

// ...

vec_sum(result, v1, v2);

Page 32:

Compilation options

Page 33:

Compilation

• Precompiler options
  • skepu-tool -name map_precompiled map.cpp -dir bin -opencl -- [Clang flags]
• Handled by Makefiles
  • Makefile.in: set up your system configuration, e.g. backend compiler
  • Makefile.include: set up backends
  • Makefile: set up programs

Page 34:

Backend Specification

auto spec = skepu2::BackendSpec{skepu2::Backend::typeFromString(argv[2])};

• Sets the backend from a string, which must match one of:
  • ”CPU”
  • ”OpenMP”
  • ”OpenCL”
  • ”CUDA”

(Applying a spec to a skeleton instance is sketched below.)
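The slide shows how to construct a BackendSpec; the sketch below applies it to a skeleton instance. The instance-level setBackend() setter is an assumption here; consult the SkePU documentation of your version for the exact call. The add user function is the one from the Syntax slide.

    auto spec = skepu2::BackendSpec{skepu2::Backend::typeFromString("OpenMP")};

    auto vec_sum = skepu2::Map<2>(add);
    vec_sum.setBackend(spec);             // subsequent invocations use the chosen backend

    skepu2::Vector<int> v1(100, 1), v2(100, 2), result(100);
    vec_sum(result, v1, v2, 5);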

Page 35:

Smart Containers

Page 36:

Smart Containers

• container.updateHost();
  • Flushes and ensures that the CPU buffer contains up-to-date values.
• container[i] = value;
  • Managed element access. Flushes if needed.
• container(i) = value;
  • Direct element access into the raw CPU buffer.

(Note: changed in SkePU 3. A usage sketch follows below.)
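A short, hedged usage sketch of the access operations listed above (SkePU 2 semantics; the variable names are illustrative).

    skepu2::Vector<float> v(100);

    // ... run a skeleton that leaves the freshest copy of v on a GPU ...

    v.updateHost();      // make sure the CPU buffer is up to date
    float a = v[0];      // managed access: flushes device copies if needed
    v(1) = 3.14f;        // direct access into the raw CPU buffer, no coherence check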

Page 37:

Map

Page 38:

Map Parameter Groups

• Three groups of user function parameters:
  • Element-wise: only one element per user function call
  • Random-access containers: replicated for each memory space (e.g. GPUs); proxy types Vec<T> and Mat<T> in the user function
  • Uniform scalars: same values everywhere
• Argument groups are variadic (flexible count, including 0)
• The above order must be obeyed (element-wise first, etc.)
• The parallelism (number of user function invocations) is always determined by the return container (first argument), also in the case of element-wise arity 0
• Also applies to MapReduce, MapOverlap!

(See the sketch below for the ordering at both definition and call site.)
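A hedged sketch of the three parameter groups in the required order (element-wise, then random-access container, then uniform scalar); the names are illustrative.

    float blend(float x, const skepu2::Vec<float> weights, float bias)
    {
        return weights[0] * x + bias;   // weights is freely indexable inside the user function
    }

    auto blender = skepu2::Map<1>(blend);   // <1>: one element-wise parameter

    skepu2::Vector<float> in(512), out(512), weights(4);
    blender(out, in, weights, 0.5f);        // same group order at the call site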

Page 39:

Map – Iterators

• Optionally, use iterators with Map (and most other skeletons), as in the sketch below
  • mapper(r.begin(), r.end(), v1.begin(), v2.begin());
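For context, a hedged, self-contained version of the iterator call shown above; the user function and container names are illustrative.

    float plus_f(float a, float b) { return a + b; }

    auto mapper = skepu2::Map<2>(plus_f);

    skepu2::Vector<float> v1(100), v2(100), r(100);

    // Iterator ranges instead of whole containers
    mapper(r.begin(), r.end(), v1.begin(), v2.begin());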

Page 40:

Map Example

float sum(float a, float b)
{
    return a + b;
}

Vector<float> vector_sum(Vector<float> &v1, Vector<float> &v2)
{
    auto vsum = Map<2>(sum);
    Vector<float> result(v1.size());
    return vsum(result, v1, v2);
}

Page 41:

Reduce

Page 42:

Reduce Modes

• 1D Reduce
  • Regular Vector
  • Matrix RowWise (returns Vector)
  • Matrix ColWise (returns Vector)
• ”2D” Reduce
  • Regular Matrix (treated as a vector)

instance.setReduceMode(ReduceMode::RowWise) // default
instance.setReduceMode(ReduceMode::ColWise)

• instance.setStartValue(value)
  • Set Reduction start value. Defaults to 0-initialized. (See the sketch below.)
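A hedged sketch of setStartValue(): a product reduction where the 0-initialized default would be wrong, so the start value is set to 1 (names are illustrative).

    float mult_f(float a, float b) { return a * b; }

    auto product = skepu2::Reduce(mult_f);
    product.setStartValue(1.0f);

    skepu2::Vector<float> v(10, 2.0f);   // 10 elements, all 2.0
    float p = product(v);                // 2^10 = 1024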

Page 43:

Reduce Example

float min_f(float a, float b)
{
    return (a < b) ? a : b;
}

float min_element(Vector<float> &v)
{
    auto min_calc = Reduce(min_f);
    return min_calc(v);
}

Page 44:

Reduce Example

float plus_f(float a, float b) { return a + b; }

float max_f(float a, float b) { return (a > b) ? a : b; }

auto max_sum = skepu2::Reduce(plus_f, max_f);
max_sum.setReduceMode(skepu2::ReduceMode::RowWise);
r = max_sum(m);

Page 45:

MapReduce

Page 46:

MapReduce

• instance.setDefaultSize(size_t)
  • When the element-wise arity is 0, this controls the number of user function invocations (that is, the size of the ”virtual” temporary container between the Map and Reduce steps); see the sketch below
• instance.setStartValue(value)
  • Set Reduction start value. Defaults to 0-initialized.
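A hedged sketch of an element-wise-arity-0 MapReduce, assuming an index-plus-uniform user function in the style of the advanced examples; names are illustrative. The number of user function invocations comes from setDefaultSize().

    float sq_term(skepu2::Index1D idx, float scale)
    {
        return scale * (float)(idx.i * idx.i);
    }

    float plus_f(float a, float b) { return a + b; }

    auto sum_sq = skepu2::MapReduce<0>(sq_term, plus_f);
    sum_sq.setDefaultSize(100);

    float s = sum_sq(2.0f);   // 2 * (0*0 + 1*1 + … + 99*99)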

Page 47:

MapReduce Example

float add(float a, float b) { return a + b; }

float mult(float a, float b) { return a * b; }

float dot_product(Vector<float> &v1, Vector<float> &v2)
{
    auto dotprod = MapReduce<2>(mult, add);
    return dotprod(v1, v2);
}

Page 48:

Scan

Page 49:

Scan

• instance.setScanMode(mode)
  • Set the scan mode: ScanMode::Inclusive (default) or ScanMode::Exclusive (see the sketch below)
• instance.setStartValue(value)
  • Set start value of scans. Defaults to 0-initialized.
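A hedged sketch of an exclusive scan (prefix sum) with an explicit start value; names are illustrative.

    float plus_f(float a, float b) { return a + b; }

    auto prefix = skepu2::Scan(plus_f);
    prefix.setScanMode(skepu2::ScanMode::Exclusive);
    prefix.setStartValue(0.0f);

    skepu2::Vector<float> in(8, 1.0f), out(8);
    prefix(out, in);   // out = 0, 1, 2, 3, 4, 5, 6, 7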

Page 50:

Scan Example

float max_f(float a, float b) { return (a > b) ? a : b; }

Vector<float> partial_max(Vector<float> &v)
{
    auto premax = Scan(max_f);
    Vector<float> result(v.size());
    return premax(result, v);
}

Page 51:

MapOverlap

Page 52:

MapOverlap Modes

• 1D MapOverlap
  • Regular Vector
  • Matrix RowWise
  • Matrix ColWise
• 2D MapOverlap
  • Regular Matrix
• Separable MapOverlap (2D-with-1D, using a 1D user function)
  • Matrix RowColWise
  • Matrix ColRowWise

instance.setOverlapMode(Overlap::RowWise) // default
instance.setOverlapMode(Overlap::ColWise)
instance.setOverlapMode(Overlap::RowColWise)
instance.setOverlapMode(Overlap::ColRowWise)

Page 53:

MapOverlap Edge Mode

• instance.setOverlap(x, [y])
  • Set overlap radius
• instance.setEdgeMode(mode)
  • Edge::Pad
    • instance.setPad(pad): set the pad value
  • Edge::Duplicate (default)
  • Edge::Cyclic

(Configuring an instance is sketched below.)
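A hedged configuration sketch using the setters above; the conv user function from the next slide is assumed, and namespace qualification may differ in your SkePU version.

    auto convol = skepu2::MapOverlap(conv);
    convol.setOverlap(2);                      // overlap radius
    convol.setEdgeMode(skepu2::Edge::Pad);     // out-of-range reads yield the pad value
    convol.setPad(0.0f);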

Page 54:

MapOverlap 1D Example

float conv(int overlap, size_t stride, const float *v,
           const Vec<float> stencil, float scale)
{
    float res = 0;
    for (int i = -overlap; i <= overlap; ++i)
        res += stencil[i + overlap] * v[i*stride];
    return res / scale;
}

Vector<float> convolution(Vector<float> &v)
{
    auto convol = MapOverlap(conv);
    Vector<float> stencil {1, 2, 4, 2, 1};
    Vector<float> result(v.size());
    convol.setOverlap(2);
    return convol(result, v, stencil, 10);
}

Page 55:

MapOverlap 2D Example

float over_2d(int ox, int oy, size_t stride, const float *m,
              const skepu2::Mat<float> filter)
{
    float res = 0;
    for (int y = -oy; y <= oy; ++y)
        for (int x = -ox; x <= ox; ++x)
            res += m[y*(int)stride + x]
                 * filter.data[(y+oy)*ox + (x+ox)];
    return res;
}

Page 56:

SkePU 3 preview

Page 57:

SkePU 3 preview

• New skeleton
  • MapPairs
• Higher-dimensionality containers
  • Tensor3<T>, Tensor4<T>
• Multiple return values from user functions
  • Can already return by struct: array-of-records format
  • New feature allows result in multiple separate arrays
• Improved MapOverlap user-function syntax
• And more to come!

[Figure (preview): MapPairs takes a group of vertical inputs (Varity static, Vsize dynamic) and a group of horizontal inputs (Harity static, Hsize dynamic) and produces a Vsize × Hsize output indexed by (i, j)]

Page 58:

MapPairs (preview)

int uf(int a, int b) { return a * b; }

// …

auto pairs = skepu::MapPairs<1, 1>(uf);
skepu::Vector<int> v1(Vsize, 3), h1(Hsize, 7);
skepu::Matrix<int> res(Vsize, Hsize);
pairs(res, v1, h1);

Page 59:

MapOverlap (preview)

float over_1d(skepu::Region1D<float> r, int scale)
{
    return (r(-2)*4 + r(-1)*2 + r(0) + r(1)*2 + r(2)*4) / scale;
}

float over_2d(skepu::Region2D<float> r, const skepu::Mat<float> stencil)
{
    float res = 0;
    for (int i = -r.oi; i <= r.oi; ++i)
        for (int j = -r.oj; j <= r.oj; ++j)
            res += r(i, j) * stencil(i + r.oi, j + r.oj);
    return res;
}

Page 60:

SkePU Demo

Page 61:

SkePU in Current Research

Page 62:

SkePU in Teaching