programming models for heterogeneous chips
DESCRIPTION
Talk by Professor Rafael Asenjo Plaza, given on October 24 at the School of Computer Science (Facultad de Informática).
TRANSCRIPT
Programming Models for Heterogeneous Chips
Rafael Asenjo Dept. of Computer Architecture
University of Malaga, Spain.
Agenda
• Motivation
• Hardware – Heterogeneous chips – Integrated GPUs – Advantages
• Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB
2
Motivation • A new mantra: Power and Energy saving • In all domains
3
Motivation
• GPUs came to rescue: – Massive Data Parallel Code at a
low price in terms of power – Supercomputers and servers:
NVIDIA
• GREEN500 Top 15:
• TOP500: – 45 systems w. NVIDIA – 19 systems w. Xeon Phi
4
Motivation
• There is (parallel) life beyond supercomputers:
5
Motivation
• Plenty of GPUs elsewhere: – Integrated GPUs on more than 90% of shipped processors
6
Motivation
• Plenty of GPUs on desktops and laptops: – Desktops (35 – 130W) and laptops (15– 57 W):
7
Intel Haswell AMD APU Kaveri
http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/ http://techguru3d.com/4th-gen-intel-haswell-processors-architecture-and-lineup/
Motivation
8
Motivation
• Plenty of integrated GPUs in mobile devices.
9
Samsung Exynos 5 Octa (2 - 6 W)
http://www.samsung.com/us/showcase/galaxy-smartphones-and-tablets/
Samsung Galaxy S5 SM-G900H
Samsung Galaxy Note Pro 12
Motivation
• Plenty of integrated GPUs in mobile devices.
10
Qualcomm Snapdragon 800 (2 - 6 W)
https://www.qualcomm.com/products/snapdragon/processors/800
Nexus 5
Nokia Lumia
Sony Xperia
Motivation
• Plenty of room for improvements – Want to make the most out of the CPU and the GPU – Lack of programming models – “Heterogeneous exec., but homogeneous programming” – Huge potential impact
• Servers and supercomputing market – Google: porting the search engine for ARM and PowerPC – AMD Seattle Server-on-a-Chip based on Cortex-A57 (v8) – Mont Blanc project: supercomputer made of ARM
• Once commodity processors took over • Be prepared for when mobile processors do so
– E4’s EK003 Servers: X-Gene ARM A57 (8 cores) + K20
11
Agenda
• Motivation
• Hardware – Heterogeneous chips – Integrated GPUs – Advantages
• Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB
12
Hardware
13
Intel Haswell AMD Kaveri
Samsung Exynos 5 Octa Qualcomm Snapdragon 800
Intel Haswell
14
• Modular design – 2 or 4 cores – GPU
• GT-1: 10 EU • GT-2: 20 EU • GT-3: 40 EU
• TSX: HW transactional memory – HLE (hardware lock elision)
• XACQUIRE • XRELEASE
– RTM (restricted transactional memory) • XBEGIN • XEND
http://www.anandtech.com/show/6355/intels-haswell-architecture
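Since the slide only names the TSX primitives, here is a minimal hedged sketch (not from the talk) of how RTM is typically used through the compiler intrinsics; the shared counter and the fallback convention are illustrative assumptions.

#include <immintrin.h>

// Try to update a shared counter inside a hardware transaction;
// return false so the caller can take a lock-based fallback path on abort.
bool add_transactional(int &counter, int value) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        counter += value;   // executed transactionally
        _xend();            // commit
        return true;
    }
    return false;           // transaction aborted (e.g., conflict or capacity overflow)
}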
Intel Haswell
• Three frequency domains – Cores – GPU – LLC and Ring
• On the older Ivy Bridge – Only 2 domains – Cores and LLC together – Even GPU-only work → CPU frequency ↑
• OpenCL driver only for Win. • PCM as power monitor
15
http://www.anandtech.com/show/7744/intel-reveals-new-haswell-details-at-isscc-2014
Intel Iris Graphics
16
https://software.intel.com/en-us/articles/opencl-fall-webinar-series
Intel Iris Graphics
17
• GPU slice – 2 sub slices – 20 EU (GPU cores) – Local L3 cache (256KB) – 16 barriers per sub slice – 2 x 64KB Local mem.
• 2 GPU slices = 40 EU • Up to 7 in-flight EU-threads • 8, 16 or 32 SIMD lanes per EU-thread • In flight: 7 x 40 x 32 = 8960 work-items • Each EU → 2 x 4-wide FPUs
– 40 x 8 x 2 (fmadd) = 640 simultaneous ops – at 1.3 GHz → 832 GFLOPS
Intel Iris GPU
18
Matrix work-group ≈ block
EU-threads (SIMD16) ≈ warp ≈ wavefront
AMD Kaveri
• Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores.
19 http://wccftech.com/
AMD Kaveri
• Steamroller microarch. – Each module → 2 "Cores" – 2 threads, each with
• 4x superscalar INT • 2x SIMD4 FP
– 3.7GHz
• Max GFLOPS: 3.7 GHz x 4 threads x 4-wide x 2 (fmadd) = 118 GFLOPS
20
AMD Graphics Core Next (GCN)
• In Kaveri, GCN takes 47% of the die – 8 Compute Units (CU) – Each CU: 4 SIMD16 units – Each SIMD16: 16 lanes – Total: 512 FPUs – 720 MHz
• Max GFLOPS = 0.72 GHz x 512 FPUs x 2 (fmadd) = 737 GFLOPS
• CPU + GPU → 855 GFLOPS
21
OpenCL execution on GCN
Work-group → wavefronts (64 work-items) → pools
22
[Diagram: a work-group's wavefronts are distributed over the four SIMD16 units (SIMD0-SIMD3) of a CU; 4 pools hold 4 wavefronts in flight per SIMD, and each wavefront takes 4 clock cycles to execute.]
HSA (Heterogeneous System Architecture)
– CPU, GPU, DSPs..
• Scheduled in three phases →
• Second phase: Kaveri – hUMA – Same pointers used on CPU
and GPU – Cache coherency
23
• HSA Foundation’s goal: Productivity on heterogeneous HW
Kaveri’s main HSA features
• hUMA – Shared and coherent view of up to 32GB
• Heterogeneous queuing (hQ) – CPU and GPU can create and dispatch work
24
HSA Motivation
• Too many steps to get the job done
25
[Diagram: job-dispatch steps split across Application, OS and GPU]
Application: Transfer buffer to GPU → Copy/Map memory → Queue job
OS: Schedule job
GPU: Start job → Finish job
OS: Schedule application
Application: Get buffer → Copy/Map memory
http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
Requirements
• Lower-overhead job dispatch requires four mechanisms: – Shared Virtual Memory
• Send pointers (not data) back and forth between HSA agents. – System Coherency
• Data accesses to global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance
– Signaling • HSA Agents can directly create/access signal objects.
– Signaling a signal object (this will wake up HSA agents waiting upon the object)
– Query current object – Wait on the current object (various conditions supported).
– User mode queueing • Enables user space applications to directly, without OS intervention,
enqueue jobs (“Dispatch Packets”) for HSA agents.
26
Non-HSA Shared Virtual Memory
• Multiple Virtual memory address spaces
27
[Diagram: CPU0 and the GPU each have their own virtual address space; the same physical page is reached through different mappings (VA1→PA1 for the CPU, VA2→PA1 for the GPU).]
HSA Shared Virtual Memory
• Common Virtual Memory for all HSA agents
28
[Diagram: CPU0 and the GPU share a single virtual address space; the same VA→PA mapping is valid for both HSA agents.]
After adding SVM
• With SVM we get rid of copy/map memory back and forth
29
[Dispatch diagram repeated: the Copy/Map memory steps are no longer needed.]
After adding coherency
• If the CPU allocates a global pointer, the GPU sees that value
30
[Dispatch diagram repeated, with the steps made unnecessary by coherency removed.]
After adding signaling
• The CPU can wait on a signal object
31
[Dispatch diagram repeated, with the steps made unnecessary by signaling removed.]
After adding user-level enqueuing
• The user directly enqueues the job without OS intervention
32
[Dispatch diagram repeated, with the OS-mediated steps removed thanks to user-level enqueuing.]
Success!!
• That is definitely much simpler, with far less overhead
33
[Diagram: the remaining steps — Application: Queue job → GPU: Start job → Finish job.]
OpenCL 2.0
• OpenCL 2.0 will contain most of the features of HSA – Intel's version supports HSA for Core M (Broadwell), on Windows – AMD's version does not support fine-grain SVM.
• AMD 1.2 beta driver – http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-1-2-beta-driver/
– Only for Windows 8.1 – Example: allocating “Coherent Host Memory” on Kaveri:
34
#include <CL/cl_ext.h>    // implements SVM
#include "hsa_helper.h"   // AMD helper functions
…
cl_svm_mem_flags_amd flags = CL_MEM_READ_WRITE |
                             CL_MEM_SVM_FINE_GRAIN_BUFFER_AMD |
                             CL_MEM_SVM_ATOMICS_AMD;
volatile std::atomic_int *data;
data = (volatile std::atomic_int *) clSVMAlloc(context, flags,
                                               MAX_DATA * sizeof(volatile std::atomic_int), 0);
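For reference, under standard OpenCL 2.0 (rather than the AMD 1.2 beta extension above) the same idea uses clSVMAlloc plus clSetKernelArgSVMPointer; a minimal hedged sketch, where context, kernel and N are assumed to exist:

// Allocate a fine-grained SVM buffer and hand the same pointer to CPU code and to a kernel.
float *buf = (float *) clSVMAlloc(context,
                                  CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                  N * sizeof(float), 0);
buf[0] = 1.0f;                                // the CPU writes through the shared pointer
clSetKernelArgSVMPointer(kernel, 0, buf);     // the GPU kernel sees the same address
// ... enqueue the kernel, then read buf[] directly on the CPU ...
clSVMFree(context, buf);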
Samsung Exynos 5
35
• Odroid XU-E and XU3 bare boards • Sports an Exynos 5 Octa
– big.LITTLE architecture – big: Cortex-A15 quad – LITTLE: Cortex-A7 quad
• Exynos 5 Octa 5410 – Only 4 CPU-cores active at a time – GPU: Power VR SGX544MP3 (Imagination Technologies)
• 3 GPU-cores at 533 MHz → 51 GFLOPS
• Exynos 5 Octa 5422 – All 8 CPU-cores can be working simultaneously – GPU: ARM Mali-T628 MP6
• 6 GPU-cores at 533 MHz → 102 GFLOPS
180$
Power VR SGX544MP3
• OpenCL 1.1 for Android
• Some limitations: – Compute units: 1 – Max WG size: 1 – Local mem: 1KB – Peak MFLOPS:
• 12 Ops per ck • 3 SIMD-ALUs x 4-wide
• Power monitor: – 4 x INA231 monitors
• A15, A7, GPU, Mem. • Instant Power • Every 260ms
36
[Figures: SGX GPU architecture diagram; Texas Instruments INA231 power monitor.]
37
ARM Mali-T628 MP6
• Supporting: – OpenGL® ES 3.0 – OpenCL™ 1.1 – DirectX® 11 – Renderscript™
• Cache L2 size – 32 – 256KB per core
• 6 Cores – 16 FP units – 2 SIMD4 each
• Other – Built-in MMU – Standard ARM Bus
• AMBA 4 ACE-Lite
38
Mali architecture
Qualcomm Snapdragon
39
Snapdragon 800
40
Snapdragon 800
• CPU: Quad-core Krait 400 up to 2.26GHz (ARMv7 ISA) – Similar to Cortex-A15. 11 stage integer pipeline with 3-way
decode and 4-way out-of-order speculative issue superscalar execution
– Pipelined VFPv4 and 128-bit wide NEON (SIMD) – 4 KB + 4 KB direct-mapped L0 cache – 16 KB + 16 KB 4-way set-associative L1 cache – 2 MB (quad-core) L2 cache
• GPU: Adreno 330, 450MHz – OpenGL ES 3.0, DirectX, OpenCL 1.2, RenderScript – 32 Execution Units. Each with 2 x SIMD4 units
• DSP: Hexagon 600MHz
41
Measuring power
• Snapdragon Performance Visualizer
• Trepn Profiler
• Power Tutor – Tuned for Nexus One – Model with 5%
precision – Open Source
42
More development boards
• Jetson TK1 board – Tegra K1 – Kepler GPU with 192 CUDA cores – 4-Plus-1 quad-core ARM Cortex A15 – Linux + CUDA – 180$
• Arndale – Exynos 5420 – big.LITTLE (A15 + A7) – GPU Mali T628 MP6 – Linux + OpenCL – 200$
• …
43
Advantages of integrated GPUs
• Discrete and integrated GPUs: different goals – NVIDIA Kepler: 2880 CUDA cores, 235W, 4.3 TFLOPS – Intel Iris 5200: 40 EU x 8 SIMD, 15-28W, 0.83 TFLOPS – PowerVR: 3 EU x 16 SIMD, < 1W, 0.051 TFLOPS
• Higher bandwidth between CPU and GPU. – Shared DRAM
• Avoid PCI data transfer – Shared LLC (Last Level Cache)
• Data coherence in some cases…
• CPU and GPU may have similar performance – It’s more likely that they can collaborate
• Cheaper!
44
Integrated GPUs are also improving
45
Agenda
• Motivation
• Hardware – Heterogeneous chips – Integrated GPUs – Advantages
• Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB
46
Programming models for heterogeneous
• Targeted at a single device – CUDA (NVIDIA) – OpenCL (Khronos Group standard) – OpenACC (C, C++ or Fortran + directives → OpenMP 4.0) – C++ AMP (Microsoft's extension of C++; HSA recently announced its own version) – RenderScript (Google's Java API for Android) – ParallDroid (Java + directives, from ULL, Spain) – Many more (SYCL, Numba Python, IBM Java, Matlab, R, JavaScript, …)
• Targeted at several devices (discrete GPUs) – Qilin (C++ and Qilin API compiled to TBB+CUDA) – OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler) – XKaapi – StarPU
• Targeted at several devices (integrated GPUs) – Qualcomm MARE – Intel Concord
47
OpenCL on mobile devices
48
http://streamcomputing.eu/blog/2014-06-30/opencl-support-recent-android-smartphones/
OpenCL running on CPU
49
[Chart: execution time (ms) of the Base, Auto, T-Auto, SSE, AVX-SSE, AVX and OpenCL versions on a 3.3 GHz Ivy Bridge CPU, with per-version percentages annotated.]
The AVX version is about 1.8x faster than the OpenCL one, but requires about 1.8x more Halstead effort.
“Easy, Fast and Energy Efficient Object Detection on Heterogeneous On-Chip Architectures”, E. Totoni, M. Dikmen, M. J. Garzaran, ACM Transactions on Architecture and Code Optimization (TACO),10(4), December 2013.
Complexities of AVX intrinsics:

__m256 image_cache0 = _mm256_broadcast_ss(&fr_ptr[pixel_offsets[0]]);
curr_filter = _mm256_load_ps(&fb_array[fi]);                              // load
temp_sum = _mm256_add_ps(_mm256_mul_ps(image_cache7, curr_filter),
                         temp_sum);                                       // multiply-add
temp_sum2 = _mm256_insertf128_ps(temp_sum,
                _mm256_extractf128_ps(temp_sum, 1), 0);                   // copy high half to low half
cpm = _mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS);                      // compare
r = _mm256_movemask_ps(cpm);
if (r & (1 << 1)) {
    best_ind = filter_ind + 2;                                            // store index
    int control = 1 | (1 << 2) | (1 << 4) | (1 << 6);
    max_fil = _mm256_permute_ps(temp_sum2, control);                      // store max
    r = _mm256_movemask_ps(_mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS));
}

50
OpenCL doesn’t have to be tough
51
Courtesy: Khronos Group
Libraries and languages using OpenCL
52
Courtesy: AMD
Libraries and languages using OpenCL
53
Courtesy: AMD
Libraries and languages using OpenCL (cont.)
54
Courtesy: AMD
Libraries and languages using OpenCL (cont.)
55
Courtesy: AMD
C++AMP
• C++ Accelerated Massive Parallelism • Pioneered by Microsoft
– Requirements: Windows 7 + Visual Studio 2012 • Followed by Intel's experimental implementation
– C++ AMP on Clang/LLVM and OpenCL (AWOL since 2013) • Now HSA Foundation taking the lead • Keywords: restrict(device), array_view, parallel_for_each,…
– Example: SUM = A + B; // (2D arrays)
56
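A minimal hedged sketch of that SUM = A + B example in C++ AMP (not the slide's original code; the function, array names and sizes are assumed, and the restriction specifier is written restrict(amp) as in Microsoft's implementation):

#include <amp.h>
#include <vector>
using namespace concurrency;

void add_2d(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& SUM, int N) {          // N x N matrices stored row-major
    array_view<const float, 2> a(N, N, A), b(N, N, B);
    array_view<float, 2> sum(N, N, SUM);
    sum.discard_data();                                // no need to copy SUM to the device
    parallel_for_each(sum.get_extent(), [=](index<2> idx) restrict(amp) {
        sum[idx] = a[idx] + b[idx];                    // runs on the accelerator
    });
    sum.synchronize();                                 // copy the result back
}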
OpenCL Ecosystem
57
Courtesy: Khronos Group
SYCL’s flavour: A[i]=B[i]*2
58
Work-in-progress implementations:
- AMD: triSYCL → https://github.com/amd/triSYCL
- Codeplay: http://www.codeplay.com/
Advantages: 1. Easy to understand the concept of work-groups 2. Performance-portable between CPU and GPU 3. Barriers are automatically deduced
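A minimal hedged sketch of what that A[i] = B[i] * 2 kernel looks like in (provisional, 2014-era) SYCL; the buffer and accessor names are illustrative:

#include <CL/sycl.hpp>
#include <vector>
using namespace cl::sycl;

void double_vec(std::vector<float>& A, std::vector<float>& B) {
    queue q;                                           // default device (CPU or GPU)
    buffer<float, 1> bufA(A.data(), range<1>(A.size()));
    buffer<float, 1> bufB(B.data(), range<1>(B.size()));
    q.submit([&](handler& cgh) {
        auto a = bufA.get_access<access::mode::write>(cgh);
        auto b = bufB.get_access<access::mode::read>(cgh);
        cgh.parallel_for<class doubler>(range<1>(A.size()),
            [=](id<1> i) { a[i] = b[i] * 2; });
    });
}   // buffers are destroyed here and the results are copied back into A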
StarPU
• A runtime system for heterogeneous architectures
• Dynamically schedule tasks on all processing units – See a pool of heterogeneous
cores
• Avoid unnecessary data transfers between accelerators – Software SVM for
heterogeneous machines
59
[Diagram: StarPU sees a pool of heterogeneous processing units — CPU cores plus GPUs with their own memory — and runs a task such as A = A + B wherever its data A and B can be made available.]
Overview of StarPU
• Maximizing PU occupancy, minimizing data transfers • Ideas:
60
– Accept tasks that may have multiple implementations
• Together with potential inter-dependencies – Leads to a dynamic acyclic graph of
tasks
– Provide a high-level data management layer (Virtual Shared Memory VSM)
• Application should only describe – which data may be accessed by tasks – how data may be divided
[Diagram: software stack — applications, parallel compilers and parallel libraries sit on top of StarPU, which drives the CPU and GPU through the CUDA and OpenCL drivers.]
Tasks scheduling
• Dealing with heterogeneous hardware accelerators
61
• Tasks = – Data input & output – Dependencies with other tasks – Multiple implementations
• E.g. CUDA + CPU • Scheduling hints
• StarPU provides an Open Scheduling platform – Scheduling algorithm = plug-ins – Predefined set of popular policies
[Diagram: the same StarPU stack; a task f reads A in RW mode and B in R mode and carries cpu, gpu and spu implementations.]
Tasks scheduling
• Predefined set of popular policies
62
• Eager Scheduler – First come, first served policy – Only one queue
• Work Stealing Scheduler – Load balancing policy – One queue per worker
• Priority Scheduler – Describe the relative importance
of tasks – One queue per priority
[Diagrams: Eager scheduler — a single task queue feeding 3 CPUs and 2 GPUs; Work-Stealing scheduler — one queue per worker; Priority scheduler — one queue per priority level (prio0, prio1, prio2).]
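As a hedged usage note (not from the talk): with the standard StarPU runtime these predefined policies are typically selected at run time through the STARPU_SCHED environment variable, for example:

#include <stdlib.h>
#include <starpu.h>

int main(void) {
    setenv("STARPU_SCHED", "dmda", 1);   /* or "eager", "ws", "prio", "dm", ... */
    if (starpu_init(NULL) != 0) return 1;
    /* ... register data and submit tasks ... */
    starpu_shutdown();
    return 0;
}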
Tasks scheduling
• Predefined set of popular policies
63
• Dequeue Model (DM) Scheduler – Using codelet performance models
• Kernel calibration on each available computing device – Raw history model of kernels’ past execution times – Refined models using regression on kernels’ execution times history
• Dequeue Model Data Aware (DMDA) Scheduler – Data transfer cost vs kernel offload
benefit – Transfer cost modelling ( ) – Bus calibration
[Diagrams: Gantt-style timelines of tasks on cpu1-cpu3 and gpu1-gpu2 under the DM and DMDA schedulers.]
Some results (MxV, 4 CPUs, 1 GPU)
64
StarPU config: Eager, 4 CPUs, 1 GPU — StarPU config: DMDA, 4 CPUs, 1 GPU
StarPU config: Eager, 3 CPUs, 1 GPU — StarPU config: DMDA, 3 CPUs, 1 GPU
Terminology
• A Codelet. . . – . . . relates an abstract computation kernel to its implementation(s) – . . . can be instantiated into one or more tasks – . . . defines characteristics common to a set of tasks
• A Task. . . – . . . is an instantiation of a Codelet – . . . atomically executes a kernel from its beginning to its end – . . . receives some input – . . . produces some output
• A Data Handle. . . – . . . designates a piece of data managed by StarPU – . . . is typed (vector, matrix, etc.) – . . . can be passed as input/output for a Task
65
Basic Example: Scaling a Vector
66
Declaring a codelet:

struct starpu_codelet scal_cl = {
    .cpu_funcs  = { scal_cpu_f, NULL },    /* kernel functions */
    .cuda_funcs = { scal_cuda_f, NULL },
    .nbuffers   = 1,                       /* number of data pieces */
    .modes      = { STARPU_RW },           /* data access mode */
};

Kernel functions:

void scal_cpu_f(void *buffers[], void *cl_arg)   /* kernel function prototype */
{
    /* retrieve the data handle */
    struct starpu_vector_interface *vector_handle = buffers[0];
    /* get the pointer from the data handle */
    float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
    /* get the small-size inline data */
    float *ptr_factor = cl_arg;
    /* do the computation */
    for (unsigned i = 0; i < NX; i++)
        vector[i] *= *ptr_factor;
}

void scal_cuda_f(void *buffers[], void *cl_arg) { … }
Basic Example: Scaling a Vector
67
Main code:

float factor = 3.14;
float vector1[NX];
float vector2[NX];
/* declare the data handles */
starpu_data_handle_t vector_handle1;
starpu_data_handle_t vector_handle2;
/* ..... */
/* register the pieces of data and get the handles (now under StarPU control) */
starpu_vector_data_register(&vector_handle1, 0, (uintptr_t)vector1,
                            NX, sizeof(vector1[0]));
starpu_vector_data_register(&vector_handle2, 0, (uintptr_t)vector2,
                            NX, sizeof(vector2[0]));
/* non-blocking task submits (params: codelet, StarPU-managed data, small-size inline data) */
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle1,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle2,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
/* wait for all tasks submitted so far */
starpu_task_wait_for_all();
/* unregister the pieces of data (the handles are destroyed;
   the vectors are now back under user control) */
starpu_data_unregister(vector_handle1);
starpu_data_unregister(vector_handle2);
/* ..... */
Qualcomm MARE
• MARE is a programming model and a runtime system that provides simple yet powerful abstractions for parallel, power-efficient software – Simple C++ API allows developers to express concurrency – User-level library that runs on any Android device, and on Linux,
Mac OS X, and Windows platforms
• The goal of MARE is to reduce the effort required to write apps that fully utilize heterogeneous SoCs
• Concepts: – Tasks are units of work that can be asynchronously executed – Groups are sets of tasks that can be canceled or waited on
68
Basic Example: Hello World
69
More complex example: C=A+B on GPU
70
More complex example: C=A+B on GPU
71
MARE departures
• Similarities with TBB – Based on tasks and 2-level API (task level and templates)
• pfor_each, ptransform, pscan, … • Synchronous Dataflow classes ≈ TBB’s Flow Graphs
– Concurrent data structures: queue, stack, … • Departures
– Expression of dependencies is first class – Flexible group membership and work or group cancelation – Optimized for some Qualcomm chips
• Power classes: – Static: mare::power::mode {efficient, saver, …} – Dynamic: mare::power::set_goal(desired, tolerance)
• Aware of the mobile architecture: aggressive power management – Cores can be shut down or affected by DVFS
72
MARE results
• Zoomm web browser implemented on top of MARE
73
C. Cascaval, et al.. ZOOMM: a parallel web browser engine for multicore mobile devices. In Symposium on Principles and practice of parallel programming, PPoPP ’13, pages 271–280, 2013.
MARE results
• Bullet Physics parallelized with MARE
74
Courtesy: Calin Cascaval
Intel Concord
• C++ heterogeneous programming framework for integrated CPU and GPU processors – Shared Virtual Memory (SVM) in software – Adapts existing data-parallel C++ constructs to heterogeneous
computing using TBB – Available open source as Intel Heterogeneous Research
Compiler (iHRC) at https://github.com/IntelLabs/iHRC/
• Papers: – Rajkishore Barik, Tatiana Shpeisman, et al. Efficient mapping of
irregular C++ applications to integrated GPUs. CGO 2014. – Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian
Lewis, Chunling Hu, and Keshav Pingali. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014.
75
Intel Concord
• Extends the TBB API: – parallel_for_hetero(int numiters, const Body &B, bool device); – parallel_reduce_hetero(int numiters, const Body &B, bool device);
76 Courtesy: Intel
Example: parallel_for_hetero
• Concord compiler generates OpenCL version – Automatically takes care of the data thanks to SVM
77 Courtesy: Intel
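A hedged sketch of how a body for parallel_for_hetero might look, based only on the signature above (the Body interface and all names here are assumptions, not Intel's actual code):

class VecAdd {                       // the same C++ body runs on the CPU or the GPU
    float *a, *b, *c;
public:
    VecAdd(float *a_, float *b_, float *c_) : a(a_), b(b_), c(c_) {}
    void operator()(int i) const { c[i] = a[i] + b[i]; }
};

void vec_add_hetero(float *A, float *B, float *C, int N, bool on_gpu) {
    // Concord generates the OpenCL version of the body; SVM lets the GPU reuse the CPU pointers.
    parallel_for_hetero(N, VecAdd(A, B, C), on_gpu);   // 'on_gpu' is the 'device' flag of the signature
}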
Concord framework
78 Courtesy: Intel
SVM SW implementation on Haswell
79
SVM translation in OpenCL code
• svm_const is a runtime constant and is computed once • Every CPU pointer before dereference on the GPU is
converted into GPU address space using AS_GPU_PTR
80
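A hedged sketch of the pointer-translation idea described above (illustrative only; the macro and kernel below are not the actual Concord-generated code):

#define AS_GPU_PTR(type, cpu_ptr) ((__global type *)((ulong)(cpu_ptr) + svm_const))

__kernel void deref_example(ulong svm_const,           /* computed once per dispatch  */
                            ulong cpu_array,           /* pointer value from the CPU  */
                            __global float *out) {
    int i = get_global_id(0);
    __global float *p = AS_GPU_PTR(float, cpu_array);  /* translate before dereference */
    out[i] = p[i];
}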
Concord results
81
Speedup & Energy savings vs multicore CPU
82
Heterogeneous execution on both devices
• Iteration space distributed among the available devices • Problem: find the best data partition • Example: Barnes Hut and Facedetect relative execution time
– Varying the amount of work offloaded to the GPU – For BH the optimum is 40% of the work carried out on the GPU – For FD the optimum is 0% of the work carried out on the GPU
83
Partitioning based on on-line profiling
Naïve profiling vs. asymmetric profiling
84
• Naïve profiling: assign a chunk to the CPU and another to the GPU, compute both, wait at a barrier, then partition the rest of the iteration space according to the relative speeds.
• Asymmetric profiling: assign a chunk just to the GPU while the CPU keeps computing; when the GPU is done, partition the rest of the iteration space according to the relative speeds.
Agenda
• Motivation
• Hardware – Heterogeneous chips – Integrated GPUs – Advantages
• Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB
85
Our heterogeneous parallel_for
86
Angeles Navarro, Antonio Vilches, Francisco Corbera and Rafael Asenjo, "Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures", The Journal of Supercomputing, May 2014.
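The details are in the paper above; purely as an illustration (not the authors' implementation), a heterogeneous parallel_for can be sketched in TBB as one feeder task serving chunks to the GPU while the CPU cores consume the rest of the iteration space — every name below is an assumption:

#include <tbb/parallel_for.h>
#include <tbb/task_group.h>
#include <atomic>
#include <algorithm>

template <typename CpuBody, typename GpuOffload>
void hetero_parallel_for_sketch(int begin, int end, int gpu_chunk,
                                CpuBody cpu_body, GpuOffload gpu_offload) {
    std::atomic<int> next(begin);                           // shared iteration counter
    tbb::task_group tg;
    tg.run([&] {                                            // GPU feeder task
        int i;
        while ((i = next.fetch_add(gpu_chunk)) < end)
            gpu_offload(i, std::min(i + gpu_chunk, end));   // e.g. enqueue an OpenCL kernel
    });
    const int cpu_chunk = 64;                               // arbitrary CPU grain for the sketch
    int i;
    while ((i = next.fetch_add(cpu_chunk)) < end)
        tbb::parallel_for(i, std::min(i + cpu_chunk, end), cpu_body);
    tg.wait();                                              // join the GPU feeder
}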
Comparison with StarPU
• MxV benchmark – Three schedulers tested: greedy, work-stealing, HEFT – Static chunk size: 2000, 200 and 20 matrix rows
91
Choosing the GPU block size
• Belviranli, M. E., Bhuyan, L. N., & Gupta, R. (2013). A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4), 57:1–57:20.
92
GPU block size for irregular codes
• Adapt between time-steps and inside the time-step
93
[Chart: Barnes-Hut average throughput per chunk size (40 to 819600), for static and adaptive chunk sizes at time steps 0 and 30.]
GPU block size for irregular codes
• Throughput variation along the iteration space – For two different time-steps – Different GPU chunk-sizes
94
[Charts: Barnes-Hut throughput variation along the iteration space at time steps 0 and 5, for GPU chunk sizes 320, 640, 1280 and 2560.]
Adapting the GPU chunk-size
• Assumption: – Irregular behavior as a sequence of regimes of regular behavior
95
[Diagram: measured GPU throughput samples are fitted to a·ln(x)+b; the next chunk size moves between G(t-1)/2 and G(t-1)*2 — increased while λG keeps growing, decreased when λG falls — with G = a/thld near the plateau.]
[Charts: GPU throughput and the chunk size chosen by LogFit along the iteration space, for two Barnes-Hut time steps.]
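Purely as an illustration (not the authors' LogFit code), the adaptation rule suggested by the figure can be sketched as:

// Grow the GPU chunk while the measured throughput keeps improving; shrink it when it drops.
size_t next_gpu_chunk(size_t prev_chunk, double prev_thr, double curr_thr) {
    if (curr_thr > prev_thr * 1.05)   // still on the rising part of a*ln(x)+b
        return prev_chunk * 2;        // G(t) = G(t-1) * 2
    if (curr_thr < prev_thr * 0.95)   // past the knee: throughput falling
        return prev_chunk / 2;        // G(t) = G(t-1) / 2
    return prev_chunk;                // near the plateau: keep the current size
}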
Preliminary results: Energy-Performance
• Static: Oracle-like static partition of the work based on profiling • Concord: Intel's approach: GPU size computed once • HDSS: Belviranli et al.'s approach: GPU size computed once • LogFit: our dynamic CPU and GPU chunk-size partitioner
96
On Haswell
[Chart: Barnes-Hut energy per iteration (Joules) vs. performance (iterations per ms) for Static, Concord, HDSS and LogFit.]
[Chart: Barnes-Hut offline search for the static partition — execution time in seconds as the percentage of the iteration space offloaded to the GPU goes from 0% to 100%.]
Preliminary results: Energy-Performance
97
[Charts: energy per iteration (Joules) vs. performance (iterations per ms) for SpMV, CFD, Nbody and one more benchmark, comparing Static, Concord, HDSS and LogFit.]
• w.r.t. Static: improvements of up to 52% (18% on average)
• w.r.t. Concord and HDSS: improvements of up to 94% and 69% (28% and 27% on average)
Our heterogeneous pipeline • ViVid, an object detection application • Contains three main kernels that form a pipeline
• Would like to answer the following questions:
– Granularity: coarse or fine grained parallelism? – Mapping of stages: where do we run them (CPU/GPU)? – Number of cores: how many of them when running on CPU? – Optimum: what metric do we optimize (time, energy, both)?
98
[Diagram: three-stage pipeline — input frame → Stage 1: Filter (response and index matrices) → Stage 2: Histogram (histograms) → Stage 3: Classifier (detection response) → output.]
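For illustration only (not the authors' code, and with a hypothetical token type and stage functions), such a three-stage pipeline maps naturally onto tbb::parallel_pipeline:

#include <tbb/pipeline.h>
#include <cstdio>

struct Frame { int id; };                               // hypothetical token type
static Frame* read_frame()              { static int n = 0; return n < 100 ? new Frame{n++} : nullptr; }
static Frame* filter_stage(Frame* f)    { /* filter kernel (CPU or GPU) */ return f; }
static Frame* histogram_stage(Frame* f) { /* histogram kernel */ return f; }
static void   classify_stage(Frame* f)  { std::printf("frame %d done\n", f->id); delete f; }

void run_vivid_like_pipeline(int ntokens) {
    tbb::parallel_pipeline(ntokens,
        tbb::make_filter<void, Frame*>(tbb::filter::serial_in_order,
            [](tbb::flow_control& fc) -> Frame* {
                Frame* f = read_frame();                 // input stage
                if (!f) fc.stop();
                return f;
            })
      & tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,
            [](Frame* f) { return filter_stage(f); })    // Stage 1: Filter
      & tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,
            [](Frame* f) { return histogram_stage(f); }) // Stage 2: Histogram
      & tbb::make_filter<Frame*, void>(tbb::filter::serial_in_order,
            [](Frame* f) { classify_stage(f); }));       // Stage 3: Classifier + output
}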
Granularity
• Coarse grain: – CG
• Medium grain: – MG
• Fine grain:
– Fine grain also in the CPU via AVX intrinsics
99
[Diagrams: coarse-grain mapping — each pipeline stage handles an item on a single CPU core; medium-grain — each stage processes its item in parallel on several CPU cores; fine-grain — stages offloaded to the GPU (and vectorized with AVX on the CPU).]
More mappings
100
[Diagrams: additional mappings that mix devices per stage — e.g. Stage 1 on the GPU with Stages 2 and 3 on CPU cores, and other CPU/GPU combinations.]
Accounting for all alternatives
• In general: nC CPU cores, 1 GPU and p pipeline stages
# alternatives = 2^p x (nC + 2)
• For Rodinia's SRAD benchmark (p = 6, nC = 4) → 384 alternatives
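As a quick check of that count for SRAD:

\#\text{alternatives} = 2^{p} \times (n_C + 2) = 2^{6} \times (4 + 2) = 64 \times 6 = 384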
101
[Diagram: a generic p-stage pipeline in which every stage can run either on the GPU or on the nC CPU cores.]
Framework and Model
• Key idea: 1. Run only on the GPU 2. Run only on the CPU 3. Analytically extrapolate for heterogeneous execution 4. Find out the best configuration → RUN
102
[Diagram: DP-MG configuration used to collect λ (throughput) and E (energy) for the homogeneous CPU-only and GPU-only runs.]
Environmental Setup: Benchmarks
• Four Benchmarks – ViVid (Low and High Definition inputs)
– SRAD
– Tracking
– Scene Recognition
103
[Diagrams: ViVid — 3 stages (Filter, Histogram, Classifier); SRAD — 6 stages (Extraction, Prep., Reduction, Comp. 1, Comp. 2, Statistics); Tracking — 1 stage; Scene Recognition — 2 stages (Feature extraction, SVM).]
ViVid: throughput/energy on Ivy Bridge
106
LD (600x416), Ivy Bridge HD (1920x1080), Ivy Bridge
[Charts: throughput/energy vs. number of threads (CG) for the LD (600x416) and HD (1920x1080) inputs on Ivy Bridge.]
[Diagrams: the CP-CG mapping (GPU-CPU path vs. CPU path, one core per stage) and the CP-MG mapping (all cores cooperate within each stage).]
ViVid: throughput/energy on Haswell
107
LD, Haswell HD, Haswell
[Charts: throughput/energy vs. number of threads (CG) for the LD and HD inputs on Haswell.]
[Diagram: the CP-MG mapping.]
Does Higher Throughput imply Lower Energy?
108
[Charts: Haswell, HD input — throughput- and energy-related metrics vs. number of threads (CG).]
SRAD on Ivy Bridge
109
[Charts: SRAD throughput/energy vs. number of threads (CG) on Ivy Bridge.]
[Diagrams: the CP-MG and DP-MG mappings of SRAD's six stages across the CPU cores and the GPU.]
SRAD on Haswell
110
[Charts: SRAD throughput/energy vs. number of threads (CG) on Haswell.]
[Diagram: the DP-CG mapping of SRAD's six stages.]
On-going work
• Test the model in other heterogeneous chips • ViVid LD running on Odroid XU-E
– Ivy Bridge: 65 fps, 0.7 J/frame and 92 fps/J with CP-CG(5) – Exynos 5: 0.7 fps, 11 J/frame and 0.06 fps/J with CP-CG(4)
112
[Charts: estimated vs. measured throughput and throughput/energy for the CP-CG, DP-CG and CPU-CG configurations with 1 to 4 threads.]
Future Work • Consider energy for parallel loops scheduling/partitioning
• Consider MARE as an alternative to our TBB-based implem.
• Apply DVFS to reduce energy consumption – For video applications: no need to go faster than 33 fps
• Also explore other parallel patterns: – Reduce – Parallel_do – …
113
Conclusions • Plenty of heterogeneous on-chip architectures out there
• It is important to use both devices – Need to find the best mapping/distribution/scheduling out of the
many possible alternatives.
• Programming models and runtimes aimed at this goal are in their infancy: they may have a huge impact on the mobile market
• Challenges: – Hide hardware complexity – Consider energy in the partition/scheduling decisions – Minimize overhead of adaptation policies
114
Collaborators
• Mª Ángeles González Navarro (UMA, Spain) • Francisco Corbera (UMA, Spain) • Antonio Vilches (UMA, Spain) • Andrés Rodríguez (UMA, Spain) • Alejandro Villegas (UMA, Spain) • Rubén Gran (U. Zaragoza, Spain) • Maria Jesús Garzarán (UIUC, USA) • Mert Dikmen (UIUC, USA) • Kurt Fellows (UIUC, USA) • Ehsan Totoni (UIUC, USA)
115