Programming Models for Heterogeneous Chips Rafael Asenjo Dept. of Computer Architecture University of Malaga, Spain.


DESCRIPTION

Talk by Professor Rafael Asenjo Plaza, given on 24 October at the Facultad de Informática, Universidad Complutense de Madrid.

TRANSCRIPT

Page 1: Programming Models for  Heterogeneous Chips

Programming Models for Heterogeneous Chips

Rafael Asenjo Dept. of Computer Architecture

University of Malaga, Spain.

Page 2: Programming Models for  Heterogeneous Chips

Agenda

•  Motivation

•  Hardware –  Heterogeneous chips –  Integrated GPUs –  Advantages

•  Software –  Programming models for heterogeneous systems –  Programming models for heterogeneous chips –  Our approach based on TBB

2

Page 3: Programming Models for  Heterogeneous Chips

Motivation

•  A new mantra: Power and Energy saving
•  In all domains

3

Page 4: Programming Models for  Heterogeneous Chips

Motivation

•  GPUs came to the rescue:
–  Massive data-parallel code at a low price in terms of power
–  Supercomputers and servers: NVIDIA

•  GREEN500 Top 15:

•  TOP500:
–  45 systems w. NVIDIA
–  19 systems w. Xeon Phi

4

Page 5: Programming Models for  Heterogeneous Chips

Motivation

•  There is (parallel) life beyond supercomputers:

5

Page 6: Programming Models for  Heterogeneous Chips

Motivation

•  Plenty of GPUs elsewhere: –  Integrated GPUs on more than 90% of shipped processors

6

Page 7: Programming Models for  Heterogeneous Chips

Motivation

•  Plenty of GPUs on desktops and laptops:
–  Desktops (35–130 W) and laptops (15–57 W):

7

Intel Haswell AMD APU Kaveri

http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/ http://techguru3d.com/4th-gen-intel-haswell-processors-architecture-and-lineup/

Page 8: Programming Models for  Heterogeneous Chips

Motivation

8

Page 9: Programming Models for  Heterogeneous Chips

Motivation

•  Plenty of integrated GPUs in mobile devices.

9

Samsung Exynos 5 Octa (2 - 6 W)

http://www.samsung.com/us/showcase/galaxy-smartphones-and-tablets/

Samsung Galaxy S5 SM-G900H

Samsung Galaxy Note Pro 12

Page 10: Programming Models for  Heterogeneous Chips

Motivation

•  Plenty of integrated GPUs in mobile devices.

10

Qualcomm Snapdragon 800 (2 - 6 W)

https://www.qualcomm.com/products/snapdragon/processors/800

Nexus 5

Nokia Lumia

Sony Xperia

Page 11: Programming Models for  Heterogeneous Chips

Motivation

•  Plenty of room for improvements
–  Want to make the most out of the CPU and the GPU
–  Lack of programming models
–  “Heterogeneous exec., but homogeneous programming”
–  Huge potential impact

•  Servers and supercomputing market
–  Google: porting the search engine to ARM and PowerPC
–  AMD Seattle Server-on-a-Chip based on Cortex-A57 (ARMv8)
–  Mont-Blanc project: a supercomputer made of ARM processors
•  Once, commodity processors took over
•  Be prepared for when mobile processors do the same
–  E4’s EK003 servers: X-Gene ARM A57 (8 cores) + K20

11

Page 12: Programming Models for  Heterogeneous Chips

Agenda

•  Motivation

•  Hardware –  Heterogeneous chips –  Integrated GPUs –  Advantages

•  Software –  Programming models for heterogeneous systems –  Programming models for heterogeneous chips –  Our approach based on TBB

12

Page 13: Programming Models for  Heterogeneous Chips

Hardware

13

Intel Haswell AMD Kaveri

Samsung Exynos 5 Octa Qualcomm Snapdragon 800

Page 14: Programming Models for  Heterogeneous Chips

Intel Haswell

14

•  Modular design
–  2 or 4 cores
–  GPU
•  GT-1: 10 EU
•  GT-2: 20 EU
•  GT-3: 40 EU

•  TSX: HW transactional memory
–  HLE (HW lock elision)
•  XACQUIRE
•  XRELEASE
–  RTM (Restricted TM)
•  XBEGIN
•  XEND

http://www.anandtech.com/show/6355/intels-haswell-architecture

Page 15: Programming Models for  Heterogeneous Chips

Intel Haswell

•  Three frequency domains
–  Cores
–  GPU
–  LLC and Ring

•  On the older Ivy Bridge
–  Only 2 domains
–  Cores and LLC clocked together
–  So GPU-only work still drives the CPU frequency up

•  OpenCL driver only for Windows
•  PCM as power monitor

15

http://www.anandtech.com/show/7744/intel-reveals-new-haswell-details-at-isscc-2014

Page 16: Programming Models for  Heterogeneous Chips

Intel Iris Graphics

16

https://software.intel.com/en-us/articles/opencl-fall-webinar-series

Page 17: Programming Models for  Heterogeneous Chips

Intel Iris Graphics

17

•  GPU slice
–  2 sub-slices
–  20 EU (GPU cores)
–  Local L3 cache (256 KB)
–  16 barriers per sub-slice
–  2 × 64 KB local memory

•  2 GPU slices = 40 EU
•  Up to 7 in-flight EU-threads
•  SIMD8, SIMD16 or SIMD32 per EU-thread
•  In flight: 7 × 40 × 32 = 8960 work-items
•  Each EU → 2 × 4-wide FPUs
–  40 × 8 × 2 (fmadd) = 640 simultaneous ops
–  At 1.3 GHz → 832 GFLOPS

Page 18: Programming Models for  Heterogeneous Chips

Intel Iris GPU

18

Matrix work-group ≈ block

EU-threads (SIMD16) ≈ warp ≈ wavefront

Page 19: Programming Models for  Heterogeneous Chips

AMD Kaveri

•  Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores.

19 http://wccftech.com/

Page 20: Programming Models for  Heterogeneous Chips

AMD Kaveri

•  Steamroller microarchitecture
–  Each module → 2 “cores”
–  2 threads, each with
•  4× superscalar INT
•  2× SIMD4 FP
–  3.7 GHz

•  Max GFLOPS: 3.7 GHz × 4 threads × 4-wide × 2 (fmadd) = 118 GFLOPS

20

Page 21: Programming Models for  Heterogeneous Chips

AMD Graphics Core Next (GCN)

•  In Kaveri, the GCN part takes 47% of the die
–  8 Compute Units (CU)
–  Each CU: 4 × SIMD16
–  Each SIMD16: 16 lanes
–  Total: 512 FPUs
–  720 MHz

•  Max GFLOPS = 0.72 GHz × 512 FPUs × 2 (fmadd) = 737 GFLOPS

•  CPU + GPU → 855 GFLOPS

21

Page 22: Programming Models for  Heterogeneous Chips

OpenCL execution on GCN

Work-group → wavefronts (64 work-items) → pools

22

[Diagram: a work-group’s wavefronts spread across the four SIMD16 units of CU0 (SIMD0–SIMD3), several wavefronts deep]

4 pools: 4 wavefronts in flight per SIMD
4 clock cycles to execute each wavefront

Page 23: Programming Models for  Heterogeneous Chips

HSA (Heterogeneous System Architecture)

•  HSA Foundation’s goal: productivity on heterogeneous HW
–  CPU, GPU, DSPs…

•  Rolled out in three phases

•  Second phase: Kaveri
–  hUMA
–  Same pointers used on CPU and GPU
–  Cache coherency

23

Page 24: Programming Models for  Heterogeneous Chips

Kaveri’s main HSA features

•  hUMA
–  Shared and coherent view of up to 32 GB

•  Heterogeneous queuing (hQ)
–  CPU and GPU can create and dispatch work

24

Page 25: Programming Models for  Heterogeneous Chips

HSA Motivation

•  Too many steps to get the job done

25

[Diagram: dispatch steps across Application, OS and GPU — transfer buffer to GPU, copy/map memory, queue job, schedule job, start job, finish job, schedule application, get buffer, copy/map memory]

http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/

Page 26: Programming Models for  Heterogeneous Chips

Requirements

•  Enabling lower-overhead job dispatch requires four mechanisms:
–  Shared Virtual Memory
•  Send pointers (not data) back and forth between HSA agents.
–  System Coherency
•  Data accesses to the global memory segment from all HSA agents shall be coherent without the need for explicit cache maintenance.
–  Signaling
•  HSA agents can directly create/access signal objects.
–  Signaling a signal object (this wakes up HSA agents waiting upon the object)
–  Query the current object
–  Wait on the current object (various conditions supported)
–  User-mode queueing
•  Enables user-space applications to enqueue jobs (“Dispatch Packets”) for HSA agents directly, without OS intervention.

26

Page 27: Programming Models for  Heterogeneous Chips

Non-HSA Shared Virtual Memory

•  Multiple virtual memory address spaces

27

[Diagram: CPU0 and the GPU each have their own virtual address space; the same physical page is reached through different mappings (VA1→PA1 on the CPU, VA2→PA1 on the GPU)]

Page 28: Programming Models for  Heterogeneous Chips

HSA Shared Virtual Memory

•  Common virtual memory for all HSA agents

28

[Diagram: CPU0 and the GPU share one virtual address space; the same VA→PA mapping is valid for both]

Page 29: Programming Models for  Heterogeneous Chips

After adding SVM

•  With SVM we get rid of copying/mapping memory back and forth

29

[Same dispatch diagram as before, with the copy/map-memory steps removed]

Page 30: Programming Models for  Heterogeneous Chips

After adding coherency

•  If the CPU allocates a global pointer, the GPU sees that value

30

[Same dispatch diagram as before, further simplified by coherency]

Page 31: Programming Models for  Heterogeneous Chips

After adding signaling

•  The CPU can wait on a signal object

31

[Same dispatch diagram as before, further simplified by signaling]

Page 32: Programming Models for  Heterogeneous Chips

After adding user-level enqueuing

•  The user directly enqueues the job without OS intervention

32

[Same dispatch diagram as before, further simplified by user-level enqueuing]

Page 33: Programming Models for  Heterogeneous Chips

Success!!

•  That’s definitely way simpler, with less overhead

33

[Final diagram — remaining steps: the application queues the job; the GPU starts and finishes it]

Page 34: Programming Models for  Heterogeneous Chips

OpenCL 2.0

•  OpenCL 2.0 will contain most of the features of HSA
–  Intel’s version supports HSA for Core M (Broadwell), on Windows.
–  AMD’s version does not support fine-grain SVM.

•  AMD OpenCL 1.2 beta driver
–  http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-1-2-beta-driver/
–  Only for Windows 8.1
–  Example: allocating “Coherent Host Memory” on Kaveri:

34

#include <CL/cl_ext.h>   // Implements SVM
#include "hsa_helper.h"  // AMD helper functions
...
cl_svm_mem_flags_amd flags = CL_MEM_READ_WRITE |
                             CL_MEM_SVM_FINE_GRAIN_BUFFER_AMD |
                             CL_MEM_SVM_ATOMICS_AMD;
volatile std::atomic_int *data;
data = (volatile std::atomic_int *) clSVMAlloc(context, flags,
                             MAX_DATA * sizeof(volatile std::atomic_int), 0);

Page 35: Programming Models for  Heterogeneous Chips

Samsung Exynos 5

35

•  Odroid XU-E and XU3 bare boards (≈ $180)
•  Sports an Exynos 5 Octa
–  big.LITTLE architecture
–  big: Cortex-A15 quad
–  LITTLE: Cortex-A7 quad

•  Exynos 5 Octa 5410
–  Only 4 CPU cores active at a time
–  GPU: PowerVR SGX544MP3 (Imagination Technologies)
•  3 GPU cores at 533 MHz → 51 GFLOPS

•  Exynos 5 Octa 5422
–  All 8 CPU cores can work simultaneously
–  GPU: ARM Mali-T628 MP6
•  6 GPU cores at 533 MHz → 102 GFLOPS

Page 36: Programming Models for  Heterogeneous Chips

Power VR SGX544MP3

•  OpenCL 1.1 for Android

•  Some limitations:
–  Compute units: 1
–  Max work-group size: 1
–  Local memory: 1 KB
–  Peak MFLOPS: 12 ops per clock (3 SIMD ALUs × 4-wide)

•  Power monitor:
–  4 × INA231 monitors (A15, A7, GPU, memory)
•  Instant power readings
•  Every 260 ms

36

SGX architecture

Page 37: Programming Models for  Heterogeneous Chips

Texas Instrument INA231

37

Page 38: Programming Models for  Heterogeneous Chips

ARM Mali-T628 MP6

•  Supporting:
–  OpenGL ES 3.0
–  OpenCL 1.1
–  DirectX 11
–  RenderScript

•  L2 cache size
–  32–256 KB per core

•  6 cores
–  16 FP units
–  2 × SIMD4 each

•  Other
–  Built-in MMU
–  Standard ARM bus (AMBA 4 ACE-Lite)

38

Mali architecture

Page 39: Programming Models for  Heterogeneous Chips

Qualcomm Snapdragon

39

Page 40: Programming Models for  Heterogeneous Chips

Snapdragon 800

40

Page 41: Programming Models for  Heterogeneous Chips

Snapdragon 800

•  CPU: quad-core Krait 400 up to 2.26 GHz (ARMv7 ISA)
–  Similar to a Cortex-A15: 11-stage integer pipeline with 3-way decode and 4-way out-of-order speculative-issue superscalar execution
–  Pipelined VFPv4 and 128-bit wide NEON (SIMD)
–  4 KB + 4 KB direct-mapped L0 cache
–  16 KB + 16 KB 4-way set-associative L1 cache
–  2 MB (quad-core) L2 cache

•  GPU: Adreno 330 at 450 MHz
–  OpenGL ES 3.0, DirectX, OpenCL 1.2, RenderScript
–  32 execution units, each with 2 × SIMD4 units

•  DSP: Hexagon at 600 MHz

41

Page 42: Programming Models for  Heterogeneous Chips

Measuring power

•  Snapdragon Performance Visualizer

•  Trepn Profiler

•  PowerTutor
–  Tuned for the Nexus One
–  Power model with 5% precision
–  Open source

42

Page 43: Programming Models for  Heterogeneous Chips

More development boards

•  Jetson TK1 board
–  Tegra K1
–  Kepler GPU with 192 CUDA cores
–  4-Plus-1 quad-core ARM Cortex-A15
–  Linux + CUDA
–  $180

•  Arndale
–  Exynos 5420
–  big.LITTLE (A15 + A7)
–  GPU: Mali-T628 MP6
–  Linux + OpenCL
–  $200

•  …

43

Page 44: Programming Models for  Heterogeneous Chips

Advantages of integrated GPUs

•  Discrete and integrated GPUs: different goals
–  NVIDIA Kepler: 2880 CUDA cores, 235 W, 4.3 TFLOPS
–  Intel Iris 5200: 40 EU × 8 SIMD, 15–28 W, 0.83 TFLOPS
–  PowerVR: 3 EU × 16 SIMD, < 1 W, 0.051 TFLOPS

•  Higher bandwidth between CPU and GPU
–  Shared DRAM: avoids PCI data transfers
–  Shared LLC (Last-Level Cache)
–  Data coherence in some cases…

•  CPU and GPU may have similar performance
–  It’s more likely that they can collaborate

•  Cheaper!

44

Page 45: Programming Models for  Heterogeneous Chips

Integrated GPUs are also improving

45

Page 46: Programming Models for  Heterogeneous Chips

Agenda

•  Motivation

•  Hardware –  Heterogeneous chips –  Integrated GPUs –  Advantages

•  Software –  Programming models for heterogeneous systems –  Programming models for heterogeneous chips –  Our approach based on TBB

46

Page 47: Programming Models for  Heterogeneous Chips

Programming models for heterogeneous systems

•  Targeted at a single device
–  CUDA (NVIDIA)
–  OpenCL (Khronos Group standard)
–  OpenACC (C, C++ or Fortran + directives → OpenMP 4.0)
–  C++ AMP (Microsoft’s C++ extension; the HSA Foundation recently announced its own version)
–  RenderScript (Google’s Java API for Android)
–  ParallDroid (Java + directives, from ULL, Spain)
–  Many more (SYCL, Numba Python, IBM Java, Matlab, R, JavaScript, …)

•  Targeted at several devices (discrete GPUs)
–  Qilin (C++ and Qilin API compiled to TBB + CUDA)
–  OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler)
–  XKaapi
–  StarPU

•  Targeted at several devices (integrated GPUs)
–  Qualcomm MARE
–  Intel Concord

47

Page 48: Programming Models for  Heterogeneous Chips

OpenCL on mobile devices

48

http://streamcomputing.eu/blog/2014-06-30/opencl-support-recent-android-smartphones/

Page 49: Programming Models for  Heterogeneous Chips

OpenCL running on CPU

49

[Bar chart: execution time (ms) of the Base, Auto, T-Auto, SSE, AVX-SSE, AVX and OpenCL versions, with per-version percentages annotated]

The AVX code version is
–  1.8× faster than OpenCL
–  1.8× more Halstead effort

CPU Ivy Bridge 3.3 GHz

“Easy, Fast and Energy Efficient Object Detection on Heterogeneous On-Chip Architectures”, E. Totoni, M. Dikmen, M. J. Garzaran, ACM Transactions on Architecture and Code Optimization (TACO),10(4), December 2013.

Page 50: Programming Models for  Heterogeneous Chips

Complexities of AVX Intrinsics

__m256 image_cache0 = _mm256_broadcast_ss(&fr_ptr[pixel_offsets[0]]);  // load
curr_filter = _mm256_load_ps(&fb_array[fi]);
temp_sum = _mm256_add_ps(_mm256_mul_ps(image_cache7, curr_filter),
                         temp_sum);                                    // multiply-add
temp_sum2 = _mm256_insertf128_ps(temp_sum,
                _mm256_extractf128_ps(temp_sum, 1), 0);                // copy high to low
cpm = _mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS);                   // compare
r = _mm256_movemask_ps(cpm);

if (r & (1 << 1)) {
    best_ind = filter_ind + 2;                                         // store index
    int control = 1 | (1 << 2) | (1 << 4) | (1 << 6);
    max_fil = _mm256_permute_ps(temp_sum2, control);                   // store max
    r = _mm256_movemask_ps(_mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS));
}

50

Page 51: Programming Models for  Heterogeneous Chips

OpenCL doesn’t have to be tough

51

Courtesy: Khronos Group

Page 52: Programming Models for  Heterogeneous Chips

Libraries and languages using OpenCL

52

Courtesy: AMD

Page 53: Programming Models for  Heterogeneous Chips

Libraries and languages using OpenCL

53

Courtesy: AMD

Page 54: Programming Models for  Heterogeneous Chips

Libraries and languages using OpenCL (cont.)

54

Courtesy: AMD

Page 55: Programming Models for  Heterogeneous Chips

Libraries and languages using OpenCL (cont.)

55

Courtesy: AMD

Page 56: Programming Models for  Heterogeneous Chips

C++AMP

•  C++ Accelerated Massive Parallelism
•  Pioneered by Microsoft
–  Requirements: Windows 7 + Visual Studio 2012
•  Followed by Intel’s experimental implementation
–  C++ AMP on Clang/LLVM and OpenCL (AWOL since 2013)
•  Now the HSA Foundation is taking the lead
•  Keywords: restrict(device), array_view, parallel_for_each, …
–  Example: SUM = A + B; // (2D arrays)

56

Page 57: Programming Models for  Heterogeneous Chips

OpenCL Ecosystem

57

Courtesy: Khronos Group

Page 58: Programming Models for  Heterogeneous Chips

SYCL’s flavour: A[i]=B[i]*2

58

Work-in-progress implementations:
-  AMD: triSYCL → https://github.com/amd/triSYCL
-  Codeplay: http://www.codeplay.com/

Advantages:
1. Easy to understand the concept of work-groups
2. Performance-portable between CPU and GPU
3. Barriers are automatically deduced

Page 59: Programming Models for  Heterogeneous Chips

StarPU

•  A runtime system for heterogeneous architectures

•  Dynamically schedules tasks on all processing units
–  Sees a pool of heterogeneous cores

•  Avoids unnecessary data transfers between accelerators
–  Software SVM for heterogeneous machines

59

[Diagram: several CPUs and GPUs, each GPU with its own memory M; running the task A = A + B requires StarPU to move copies of A and B to the unit that executes it]

Page 60: Programming Models for  Heterogeneous Chips

Overview of StarPU

•  Maximizing PU occupancy, minimizing data transfers
•  Ideas:
–  Accept tasks that may have multiple implementations, together with potential inter-dependencies
•  Leads to a dynamic acyclic graph of tasks
–  Provide a high-level data management layer (Virtual Shared Memory, VSM)
•  The application should only describe which data may be accessed by tasks, and how data may be divided

60

[Stack diagram: Applications → Parallel Compilers / Parallel Libraries → StarPU → Drivers (CUDA, OpenCL) → CPU, GPU, …]

Page 61: Programming Models for  Heterogeneous Chips

Tasks scheduling

•  Dealing with heterogeneous hardware accelerators

•  Tasks =
–  Data input & output
–  Dependencies with other tasks
–  Multiple implementations
•  E.g. CUDA + CPU
–  Scheduling hints

•  StarPU provides an open scheduling platform
–  Scheduling algorithms are plug-ins
–  Predefined set of popular policies

61

[Diagram: a task f(A RW, B R) with cpu, gpu and spu implementations submitted to the StarPU stack]

Page 62: Programming Models for  Heterogeneous Chips

Tasks scheduling

•  Predefined set of popular policies

•  Eager scheduler
–  First-come, first-served policy
–  Only one queue

•  Work-stealing scheduler
–  Load-balancing policy
–  One queue per worker

•  Priority scheduler
–  Describes the relative importance of tasks
–  One queue per priority

62

[Diagrams: tasks flowing through the Eager, Work-Stealing and Priority (prio0/prio1/prio2) schedulers into queues serving 3 CPUs and 2 GPUs]

Page 63: Programming Models for  Heterogeneous Chips

Tasks scheduling

•  Predefined set of popular policies

63

•  Dequeue Model (DM) scheduler
–  Uses codelet performance models
•  Kernel calibration on each available computing device
–  Raw history model of kernels’ past execution times
–  Refined models using regression on kernels’ execution-time history

•  Dequeue Model Data-Aware (DMDA) scheduler
–  Data-transfer cost vs. kernel-offload benefit
–  Transfer cost modelling
–  Bus calibration

63

[Diagrams: the DM and DMDA schedulers assigning tasks to per-unit timelines (cpu1–cpu3, gpu1, gpu2); DMDA also accounts for transfer time]

Page 64: Programming Models for  Heterogeneous Chips

Some results (MxV, 4 CPUs, 1 GPU)

64

[Plots — StarPU configurations: Eager vs. DMDA, with 4 CPUs + 1 GPU and with 3 CPUs + 1 GPU]

Page 65: Programming Models for  Heterogeneous Chips

Terminology

•  A Codelet…
–  relates an abstract computation kernel to its implementation(s)
–  can be instantiated into one or more tasks
–  defines characteristics common to a set of tasks

•  A Task…
–  is an instantiation of a Codelet
–  atomically executes a kernel from its beginning to its end
–  receives some input
–  produces some output

•  A Data Handle…
–  designates a piece of data managed by StarPU
–  is typed (vector, matrix, etc.)
–  can be passed as input/output for a Task

65

Page 66: Programming Models for  Heterogeneous Chips

Basic Example: Scaling a Vector

66

Declaring a Codelet:

struct starpu_codelet scal_cl = {
    .cpu_funcs  = { scal_cpu_f, NULL },   /* kernel functions */
    .cuda_funcs = { scal_cuda_f, NULL },
    .nbuffers   = 1,                      /* number of data pieces */
    .modes      = { STARPU_RW },          /* data access mode */
};

Kernel functions:

void scal_cpu_f(void *buffers[], void *cl_arg)
{
    /* retrieve the data handle and get a pointer from it */
    struct starpu_vector_interface *vector_handle = buffers[0];
    float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
    /* get small-size inline data */
    float *ptr_factor = cl_arg;
    /* do the computation */
    for (int i = 0; i < NX; i++)
        vector[i] *= *ptr_factor;
}

void scal_cuda_f(void *buffers[], void *cl_arg) { … }

Page 67: Programming Models for  Heterogeneous Chips

Basic Example: Scaling a Vector

67

Main code:

float factor = 3.14;
float vector1[NX];
float vector2[NX];

/* declare data handles */
starpu_data_handle_t vector_handle1;
starpu_data_handle_t vector_handle2;
/* ..... */

/* register pieces of data and get the handles (now under StarPU control) */
starpu_vector_data_register(&vector_handle1, 0, (uintptr_t)vector1,
                            NX, sizeof(vector1[0]));
starpu_vector_data_register(&vector_handle2, 0, (uintptr_t)vector2,
                            NX, sizeof(vector2[0]));

/* non-blocking task submits (params: codelet, StarPU-managed data,
   small-size inline data) */
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle1,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle2,
                   STARPU_VALUE, &factor, sizeof(factor), 0);

/* wait for all tasks submitted so far */
starpu_task_wait_for_all();

/* unregister pieces of data (the handles are destroyed; the vectors
   are now back under user control) */
starpu_data_unregister(vector_handle1);
starpu_data_unregister(vector_handle2);
/* ..... */

Page 68: Programming Models for  Heterogeneous Chips

Qualcomm MARE

•  MARE is a programming model and a runtime system that provides simple yet powerful abstractions for parallel, power-efficient software
–  A simple C++ API allows developers to express concurrency
–  User-level library that runs on any Android device, and on Linux, Mac OS X, and Windows platforms

•  The goal of MARE is to reduce the effort required to write apps that fully utilize heterogeneous SoCs

•  Concepts:
–  Tasks are units of work that can be asynchronously executed
–  Groups are sets of tasks that can be canceled or waited on

68

Page 69: Programming Models for  Heterogeneous Chips

Basic Example: Hello World

69

Page 70: Programming Models for  Heterogeneous Chips

More complex example: C=A+B on GPU

70

Page 71: Programming Models for  Heterogeneous Chips

More complex example: C=A+B on GPU

71

Page 72: Programming Models for  Heterogeneous Chips

MARE departures

•  Similarities with TBB
–  Based on tasks and a 2-level API (task level and templates)
•  pfor_each, ptransform, pscan, …
•  Synchronous dataflow classes ≈ TBB’s flow graphs
–  Concurrent data structures: queue, stack, …

•  Departures
–  Expression of dependencies is first class
–  Flexible group membership and work or group cancellation
–  Optimized for some Qualcomm chips
•  Power classes:
–  Static: mare::power::mode {efficient, saver, …}
–  Dynamic: mare::power::set_goal(desired, tolerance)
•  Aware of the mobile architecture: aggressive power management
–  Cores can be shut down or affected by DVFS

72

Page 73: Programming Models for  Heterogeneous Chips

MARE results

•  Zoomm web browser implemented on top of MARE

73

C. Cascaval, et al.. ZOOMM: a parallel web browser engine for multicore mobile devices. In Symposium on Principles and practice of parallel programming, PPoPP ’13, pages 271–280, 2013.

Page 74: Programming Models for  Heterogeneous Chips

MARE results

•  Bullet Physics parallelized with MARE

74

Courtesy: Calin Cascaval

Page 75: Programming Models for  Heterogeneous Chips

Intel Concord

•  C++ heterogeneous programming framework for integrated CPU and GPU processors
–  Shared Virtual Memory (SVM) in software
–  Adapts existing data-parallel C++ constructs to heterogeneous computing using TBB
–  Available open source as the Intel Heterogeneous Research Compiler (iHRC) at https://github.com/IntelLabs/iHRC/

•  Papers:
–  Rajkishore Barik, Tatiana Shpeisman, et al. Efficient mapping of irregular C++ applications to integrated GPUs. CGO 2014.
–  Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian Lewis, Chunling Hu, and Keshav Pingali. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014.

75

Page 76: Programming Models for  Heterogeneous Chips

Intel Concord

•  Extends the TBB API:
–  parallel_for_hetero(int numiters, const Body &B, bool device);
–  parallel_reduce_hetero(int numiters, const Body &B, bool device);

76 Courtesy: Intel

Page 77: Programming Models for  Heterogeneous Chips

Example: Parellel_for_hetero

•  Concord compiler generates OpenCL version –  Automatically takes care of the data thanks to SVM

77 Courtesy: Intel

Page 78: Programming Models for  Heterogeneous Chips

Concord framework

78 Courtesy: Intel

Page 79: Programming Models for  Heterogeneous Chips

SVM SW implementation on Haswell

79

Page 80: Programming Models for  Heterogeneous Chips

SVM translation in OpenCL code

•  svm_const is a runtime constant and is computed once
•  Every CPU pointer is converted into the GPU address space with AS_GPU_PTR before it is dereferenced on the GPU

80

Page 81: Programming Models for  Heterogeneous Chips

Concord results

81

Page 82: Programming Models for  Heterogeneous Chips

Speedup & Energy savings vs multicore CPU

82

Page 83: Programming Models for  Heterogeneous Chips

Heterogeneous execution on both devices

•  Iteration space distributed among the available devices
•  Problem: find the best data partition
•  Example: Barnes-Hut and Facedetect relative execution times
–  Varying the amount of work offloaded to the GPU
–  For BH the optimum is 40% of the work carried out on the GPU
–  For FD the optimum is 0% of the work carried out on the GPU

83

Page 84: Programming Models for  Heterogeneous Chips

Partitioning based on on-line profiling

Naïve profiling:
1. Assign a chunk to the CPU and a chunk to the GPU
2. Compute the chunks on the CPU and on the GPU; barrier
3. According to the relative speeds, partition the rest of the iteration space

Asymmetric profiling:
1. Assign a chunk just to the GPU; the CPU computes meanwhile
2. When the GPU is done, partition the rest of the iteration space according to the relative speeds

84

Page 85: Programming Models for  Heterogeneous Chips

Agenda

•  Motivation

•  Hardware –  Heterogeneous chips –  Integrated GPUs –  Advantages

•  Software –  Programming models for heterogeneous systems –  Programming models for heterogeneous chips –  Our approach based on TBB

85

Page 86: Programming Models for  Heterogeneous Chips

Our heterogeneous parallel_for

86

Angeles Navarro, Antonio Vilches, Francisco Corbera and Rafael Asenjo. Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures. The Journal of Supercomputing, May 2014.

Page 87: Programming Models for  Heterogeneous Chips

Comparison with StarPU

•  MxV benchmark
–  Three schedulers tested: greedy, work-stealing, HEFT
–  Static chunk sizes: 2000, 200 and 20 matrix rows

91

Page 88: Programming Models for  Heterogeneous Chips

Choosing the GPU block size

•  Belviranli, M. E., Bhuyan, L. N., & Gupta, R. (2013). A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4), 57:1–57:20.

92

Page 89: Programming Models for  Heterogeneous Chips

GPU block size for irregular codes

•  Adapt between time-steps and inside the time-step

93

[Plot: Barnes-Hut average throughput per chunk size (40–819600), comparing static and adaptive policies at time steps 0 and 30]

Page 90: Programming Models for  Heterogeneous Chips

GPU block size for irregular codes

•  Throughput variation along the iteration space –  For two different time-steps –  Different GPU chunk-sizes

94

[Plots: Barnes-Hut throughput variation along the iteration space at time steps 0 and 5, for GPU chunk sizes 320, 640, 1280 and 2560]

Page 91: Programming Models for  Heterogeneous Chips

Adapting the GPU chunk-size

•  Assumption: –  Irregular behavior as a sequence of regimes of regular behavior

95

[Diagram: the GPU throughput λG is sampled while the chunk size is doubled (G(t−1)×2) or halved (G(t−1)/2); once the samples fit the logarithmic model a·ln(x)+b, the next chunk size is taken as G = a/thld]

[Plots: GPU throughput and LogFit chunk size along the iteration space]

Page 92: Programming Models for  Heterogeneous Chips

Preliminary results: Energy-Performance

•  Static: oracle-like static partition of the work based on profiling
•  Concord: Intel’s approach; GPU chunk size computed once
•  HDSS: Belviranli et al.’s approach; GPU chunk size computed once
•  LogFit: our dynamic CPU and GPU chunk-size partitioner

96

[Plots, on Haswell: Barnes-Hut energy per iteration (Joules) vs. performance (iterations per ms) for Static, Concord, HDSS and LogFit; and an offline search for the best static partition — execution time in seconds vs. the percentage of the iteration space offloaded to the GPU]

Page 93: Programming Models for  Heterogeneous Chips

Preliminary results: Energy-Performance

97

[Plots: energy per iteration (Joules) vs. performance (iterations per ms) for the SpMV, CFD and Nbody benchmarks (plus a fourth), comparing Static, Concord, HDSS and LogFit]

•  w.r.t. Static: up to 52% (18% on average)

•  w.r.t. Concord and HDSS: up to 94% and 69% of (28% and 27% on average)

Page 94: Programming Models for  Heterogeneous Chips

Our heterogeneous pipeline

•  ViVid, an object detection application
•  Contains three main kernels that form a pipeline

•  We would like to answer the following questions:
–  Granularity: coarse- or fine-grained parallelism?
–  Mapping of stages: where do we run them (CPU/GPU)?
–  Number of cores: how many of them when running on the CPU?
–  Optimum: which metric do we optimize (time, energy, both)?

98

[Pipeline diagram: Input frame → Stage 1 (Filter) → index/response matrices → Stage 2 (Histogram) → histograms → Stage 3 (Classifier) → detection response → Output]

Page 95: Programming Models for  Heterogeneous Chips

Granularity

•  Coarse grain (CG): one core per stage, processing one item at a time
•  Medium grain (MG): several cores collaborate on each item within a stage
•  Fine grain: stages run on the GPU's many cores
–  Fine grain also on the CPU via AVX intrinsics

99

[Diagrams: CG, MG and fine-grain configurations of the three-stage pipeline; C = core]

Page 96: Programming Models for  Heterogeneous Chips

More mappings

100

[Diagrams: further stage-to-device mappings, e.g. Stage 1 on the GPU with Stages 2 and 3 on multiple CPU cores, plus a table of the possible CPU/GPU assignments per stage]

Page 97: Programming Models for  Heterogeneous Chips

Accounting for all alternatives

•  In general: nC CPU cores, 1 GPU and p pipeline stages

# alternatives = 2^p × (nC + 2)

•  For Rodinia's SRAD benchmark (p=6, nC=4) → 384 alternatives
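A quick sanity check on the formula (what each factor counts is our reading of the slides, not stated explicitly):

```python
def num_alternatives(p, nc):
    # 2 granularity choices per stage gives 2**p combinations; each is
    # multiplied by the (nc + 2) device/thread-count options.
    return 2 ** p * (nc + 2)
```

For SRAD (p=6, nC=4) this gives the 384 alternatives quoted above.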

101

[Diagram: generalized p-stage pipeline in which each stage runs either on the GPU or on up to nC CPU cores]

Page 98: Programming Models for  Heterogeneous Chips

Framework and Model

•  Key idea:
1.  Run only on the GPU
2.  Run only on the CPU
3.  Analytically extrapolate to heterogeneous execution
4.  Find the best configuration → RUN
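A minimal sketch of steps 3 and 4, assuming a simple additive-throughput extrapolation (the function name and the formulas are our assumptions, not the authors' exact model):

```python
def best_config(lam_cpu, lam_gpu, e_cpu, e_gpu, metric="throughput"):
    # lam = throughput (items/s) and e = energy per item (Joules), both
    # measured in the homogeneous CPU-only and GPU-only runs (steps 1-2).
    # Extrapolation assumption: heterogeneous throughputs add, and energy per
    # item is the throughput-weighted mix of the per-device energies.
    lam_het = lam_cpu + lam_gpu
    e_het = (lam_cpu * e_cpu + lam_gpu * e_gpu) / lam_het
    configs = {
        "CPU": (lam_cpu, e_cpu),
        "GPU": (lam_gpu, e_gpu),
        "CPU+GPU": (lam_het, e_het),
    }
    if metric == "throughput":
        return max(configs, key=lambda c: configs[c][0])
    # otherwise optimize throughput per energy-per-item (higher is better)
    return max(configs, key=lambda c: configs[c][0] / configs[c][1])
```

Depending on the metric, the heterogeneous configuration is not always the winner; a power-hungry device can make CPU-only the most energy-efficient choice.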

102

[Diagrams: homogeneous GPU-only and CPU-only runs of the three-stage pipeline, used to collect λ and E (homogeneous values) for the DP-MG mapping]

Page 99: Programming Models for  Heterogeneous Chips

Environmental Setup: Benchmarks

•  Four benchmarks:
–  ViVid (Low and High Definition inputs): Filter → Histogram → Classifier (3 stages)
–  SRAD: Extraction → Prep. → Reduction → Comp. 1 → Comp. 2 → Statistics (6 stages)
–  Tracking: a single tracking stage
–  Scene Recognition: Feature extraction → SVM (2 stages)

103

Page 100: Programming Models for  Heterogeneous Chips

ViVid: throughput/energy on Ivy Bridge

106

[Figures: throughput and throughput/energy vs. Num-Threads (CG) for the LD (600x416) and HD (1920x1080) inputs on Ivy Bridge]

[Diagrams: the CP-CG mapping (with a GPU-CPU path and a CPU path) and the CP-MG mapping; C = core]

Page 101: Programming Models for  Heterogeneous Chips

ViVid: throughput/energy on Haswell

107

[Figures: throughput and throughput/energy vs. Num-Threads (CG) for the LD and HD inputs on Haswell]

[Diagram: the CP-MG mapping; C = core]

Page 102: Programming Models for  Heterogeneous Chips

Does Higher Throughput imply Lower Energy?

108

Haswell, HD

[Figures: three metrics vs. Num-Threads (CG) for ViVid HD on Haswell, including throughput/energy]

Page 103: Programming Models for  Heterogeneous Chips

SRAD on Ivy Bridge

109

[Figures: throughput and energy vs. Num-Threads (CG) for SRAD on Ivy Bridge]

[Diagrams: the CP-MG and DP-MG mappings of the six-stage SRAD pipeline; C = core]

Page 104: Programming Models for  Heterogeneous Chips

SRAD on Haswell

110

[Figures: throughput and energy vs. Num-Threads (CG) for SRAD on Haswell]

[Diagram: the DP-CG mapping of the six-stage SRAD pipeline]

Page 105: Programming Models for  Heterogeneous Chips

On-going work

•  Test the model on other heterogeneous chips
•  ViVid LD running on Odroid XU-E
–  Ivy Bridge: 65 fps, 0.7 J/frame and 92 fps/J with CP-CG(5)
–  Exynos 5: 0.7 fps, 11 J/frame and 0.06 fps/J with CP-CG(4)
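The fps/J figures above are simply frame rate divided by energy per frame; a unit-level sanity check (not code from the talk):

```python
def fps_per_joule(fps, joules_per_frame):
    # Efficiency metric used in the bullets above: frame rate per Joule spent
    # on each frame.
    return fps / joules_per_frame

# Ivy Bridge, CP-CG(5): 65 fps at 0.7 J/frame -> about 92 fps/J
# Exynos 5, CP-CG(4): 0.7 fps at 11 J/frame -> about 0.06 fps/J
```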

112

[Figures: estimated vs. measured Throughput and Throughput/Energy for 1 to 4 threads, for the CP-CG, DP-CG and CPU-CG configurations]

Page 106: Programming Models for  Heterogeneous Chips

Future Work

•  Consider energy in parallel-loop scheduling/partitioning

•  Consider MARE as an alternative to our TBB-based implementation

•  Apply DVFS to reduce energy consumption
–  For video applications: no need to go faster than 33 fps

•  Also explore other parallel patterns:
–  Reduce
–  Parallel_do
–  …

113

Page 107: Programming Models for  Heterogeneous Chips

Conclusions

•  Plenty of heterogeneous on-chip architectures out there

•  It is important to use both devices
–  Need to find the best mapping/distribution/scheduling out of the many possible alternatives

•  Programming models and runtimes aimed at this goal are in their infancy: they may have a huge impact on the mobile market

•  Challenges:
–  Hide hardware complexity
–  Consider energy in partitioning/scheduling decisions
–  Minimize the overhead of adaptation policies

114

Page 108: Programming Models for  Heterogeneous Chips

Collaborators

•  Mª Ángeles González Navarro (UMA, Spain)
•  Francisco Corbera (UMA, Spain)
•  Antonio Vilches (UMA, Spain)
•  Andrés Rodríguez (UMA, Spain)
•  Alejandro Villegas (UMA, Spain)
•  Rubén Gran (U. Zaragoza, Spain)
•  María Jesús Garzarán (UIUC, USA)
•  Mert Dikmen (UIUC, USA)
•  Kurt Fellows (UIUC, USA)
•  Ehsan Totoni (UIUC, USA)

115

Page 109: Programming Models for  Heterogeneous Chips

Questions

[email protected] http://www.ac.uma.es/~asenjo

116