programming models for heterogeneous chips
DESCRIPTION
Talk by Professor Rafael Asenjo Plaza, given on October 24 at the School of Computer Science (Facultad de Informática).
TRANSCRIPT
Programming Models for Heterogeneous Chips
Rafael Asenjo Dept. of Computer Architecture
University of Malaga, Spain.
Agenda
• Motivation
• Hardware – Heterogeneous chips – Integrated GPUs – Advantages
• Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB
2
Motivation • A new mantra: Power and Energy saving • In all domains
3
Motivation
• GPUs came to rescue: – Massive Data Parallel Code at a
low price in terms of power – Supercomputers and servers:
NVIDIA
• GREEN500 Top 15:
• TOP500: – 45 systems w. NVIDIA – 19 systems w. Xeon Phi
4
Motivation
• There is (parallel) life beyond supercomputers:
5
Motivation
• Plenty of GPUs elsewhere: – Integrated GPUs on more than 90% of shipped processors
6
Motivation
• Plenty of GPUs on desktops and laptops: – Desktops (35 – 130W) and laptops (15– 57 W):
7
Intel Haswell AMD APU Kaveri
http://www.techspot.com/photos/article/770-amd-a8-7600-kaveri/ http://techguru3d.com/4th-gen-intel-haswell-processors-architecture-and-lineup/
Motivation
8
Motivation
• Plenty of integrated GPUs in mobile devices.
9
Samsung Exynos 5 Octa (2 - 6 W)
http://www.samsung.com/us/showcase/galaxy-smartphones-and-tablets/
Samsung Galaxy S5 SM-G900H
Samsung Galaxy Note Pro 12
Motivation
• Plenty of integrated GPUs in mobile devices.
10
Qualcomm Snapdragon 800 (2 - 6 W)
https://www.qualcomm.com/products/snapdragon/processors/800
Nexus 5
Nokia Lumia
Sony Xperia
Motivation
• Plenty of room for improvements – Want to make the most out of the CPU and the GPU – Lack of programming models – “Heterogeneous exec., but homogeneous programming” – Huge potential impact
• Servers and supercomputing market – Google: porting the search engine for ARM and PowerPC – AMD Seattle Server-on-a-Chip based on Cortex-A57 (v8) – Mont Blanc project: supercomputer made of ARM
• Once commodity processors took over • Be prepared for when mobile processors do so
– E4’s EK003 Servers: X-Gene ARM A57 (8 cores) + K20
11
Agenda
• Motivation
• Hardware – Heterogeneous chips – Integrated GPUs – Advantages
• Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB
12
Hardware
13
Intel Haswell AMD Kaveri
Samsung Exynos 5 Octa Qualcomm Snapdragon 800
Intel Haswell
14
• Modular design – 2 or 4 cores – GPU
• GT-1: 10 EU • GT-2: 20 EU • GT-3: 40 EU
• TSX: HW transactional memory – HLE (hardware lock elision)
• XACQUIRE • XRELEASE
– RTM (restricted transactional memory) • XBEGIN • XEND
http://www.anandtech.com/show/6355/intels-haswell-architecture
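Since the slide only names the TSX primitives, here is a minimal hedged sketch (not from the talk) of how RTM is typically used through the compiler intrinsics; the shared counter and the fallback convention are illustrative assumptions.

#include <immintrin.h>

// Try to update a shared counter inside a hardware transaction;
// return false so the caller can take a lock-based fallback path on abort.
bool add_transactional(int &counter, int value) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        counter += value;   // executed transactionally
        _xend();            // commit
        return true;
    }
    return false;           // transaction aborted (e.g., conflict or capacity overflow)
}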
Intel Haswell
• Three frequency domains – Cores – GPU – LLC and Ring
• On the older Ivy Bridge – Only 2 domains – Cores and LLC together – Even GPU-only work → CPU frequency ↑
• OpenCL driver only for Win. • PCM as power monitor
15
http://www.anandtech.com/show/7744/intel-reveals-new-haswell-details-at-isscc-2014
Intel Iris Graphics
16
https://software.intel.com/en-us/articles/opencl-fall-webinar-series
Intel Iris Graphics
17
• GPU slice – 2 sub slices – 20 EU (GPU cores) – Local L3 cache (256KB) – 16 barriers per sub slice – 2 x 64KB Local mem.
• 2 GPU slices = 40 EU • Up to 7 in-flight EU-threads • 8, 16 or 32 SIMD lanes per EU-thread • In flight: 7 x 40 x 32 = 8960 work-items • Each EU → 2 x 4-wide FPUs
– 40 x 8 x 2 (fmadd) = 640 simultaneous ops – at 1.3 GHz → 832 GFLOPS
Intel Iris GPU
18
Matrix work-group ≈ block
EU-threads (SIMD16) ≈ warp ≈ wavefront
AMD Kaveri
• Steamroller microarch (2 – 4 “Cores”) + 8 GCN Cores.
19 http://wccftech.com/
AMD Kaveri
• Steamroller microarch. – Each module → 2 "Cores" – 2 threads, each with
• 4x superscalar INT • 2x SIMD4 FP
– 3.7GHz
• Max GFLOPS: 3.7 GHz x 4 threads x 4-wide x 2 (fmadd) = 118 GFLOPS
20
AMD Graphics Core Next (GCN)
• In Kaveri, GCN takes 47% of the die – 8 Compute Units (CU) – Each CU: 4 SIMD16 units – Each SIMD16: 16 lanes – Total: 512 FPUs – 720 MHz
• Max GFLOPS = 0.72 GHz x 512 FPUs x 2 (fmadd) = 737 GFLOPS
• CPU + GPU → 855 GFLOPS
21
OpenCL execution on GCN
Work-group → wavefronts (64 work-items) → pools
22
[Diagram: a work-group's wavefronts are distributed over the four SIMD16 units (SIMD0-SIMD3) of a CU; 4 pools hold 4 wavefronts in flight per SIMD, and each wavefront takes 4 clock cycles to execute.]
HSA (Heterogeneous System Architecture)
– CPU, GPU, DSPs..
• Scheduled in three phases →
• Second phase: Kaveri – hUMA – Same pointers used on CPU
and GPU – Cache coherency
23
• HSA Foundation’s goal: Productivity on heterogeneous HW
Kaveri’s main HSA features
• hUMA – Shared and coherent view of up to 32GB
• Heterogeneous queuing (hQ) – CPU and GPU can create and dispatch work
24
HSA Motivation
• Too many steps to get the job done
25
[Diagram: job-dispatch steps split across Application, OS and GPU]
Application: Transfer buffer to GPU → Copy/Map memory → Queue job
OS: Schedule job
GPU: Start job → Finish job
OS: Schedule application
Application: Get buffer → Copy/Map memory
http://www.hsafoundation.com/hot-chips-2013-hsa-foundation-presented-deeper-detail-hsa-hsail/
Requirements
• Lower-overhead job dispatch requires four mechanisms: – Shared Virtual Memory
• Send pointers (not data) back and forth between HSA agents. – System Coherency
• Data accesses to global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance
– Signaling • HSA Agents can directly create/access signal objects.
– Signaling a signal object (this will wake up HSA agents waiting upon the object)
– Query current object – Wait on the current object (various conditions supported).
– User mode queueing • Enables user space applications to directly, without OS intervention,
enqueue jobs (“Dispatch Packets”) for HSA agents.
26
Non-HSA Shared Virtual Memory
• Multiple Virtual memory address spaces
27
[Diagram: CPU0 and the GPU each have their own virtual address space; the same physical page is reached through different mappings (VA1→PA1 for the CPU, VA2→PA1 for the GPU).]
HSA Shared Virtual Memory
• Common Virtual Memory for all HSA agents
28
[Diagram: CPU0 and the GPU share a single virtual address space; the same VA→PA mapping is valid for both HSA agents.]
After adding SVM
• With SVM we get rid of copy/map memory back and forth
29
[Dispatch diagram repeated: the Copy/Map memory steps are no longer needed.]
After adding coherency
• If the CPU allocates a global pointer, the GPU sees that value
30
[Dispatch diagram repeated, with the steps made unnecessary by coherency removed.]
After adding signaling
• The CPU can wait on a signal object
31
[Dispatch diagram repeated, with the steps made unnecessary by signaling removed.]
After adding user-level enqueuing
• The user directly enqueues the job without OS intervention
32
[Dispatch diagram repeated, with the OS-mediated steps removed thanks to user-level enqueuing.]
Success!!
• That is definitely much simpler, with far less overhead
33
[Diagram: the remaining steps — Application: Queue job → GPU: Start job → Finish job.]
OpenCL 2.0
• OpenCL 2.0 will contain most of the features of HSA – Intel's version supports HSA for Core M (Broadwell), on Windows – AMD's version does not support fine-grain SVM.
• AMD 1.2 beta driver – http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-1-2-beta-driver/
– Only for Windows 8.1 – Example: allocating “Coherent Host Memory” on Kaveri:
34
#include <CL/cl_ext.h>    // implements SVM
#include "hsa_helper.h"   // AMD helper functions
…
cl_svm_mem_flags_amd flags = CL_MEM_READ_WRITE |
                             CL_MEM_SVM_FINE_GRAIN_BUFFER_AMD |
                             CL_MEM_SVM_ATOMICS_AMD;
volatile std::atomic_int *data;
data = (volatile std::atomic_int *) clSVMAlloc(context, flags,
                                               MAX_DATA * sizeof(volatile std::atomic_int), 0);
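For reference, under standard OpenCL 2.0 (rather than the AMD 1.2 beta extension above) the same idea uses clSVMAlloc plus clSetKernelArgSVMPointer; a minimal hedged sketch, where context, kernel and N are assumed to exist:

// Allocate a fine-grained SVM buffer and hand the same pointer to CPU code and to a kernel.
float *buf = (float *) clSVMAlloc(context,
                                  CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                  N * sizeof(float), 0);
buf[0] = 1.0f;                                // the CPU writes through the shared pointer
clSetKernelArgSVMPointer(kernel, 0, buf);     // the GPU kernel sees the same address
// ... enqueue the kernel, then read buf[] directly on the CPU ...
clSVMFree(context, buf);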
Samsung Exynos 5
35
• Odroid XU-E and XU3 bare boards • Sports an Exynos 5 Octa
– big.LITTLE architecture – big: Cortex-A15 quad – LITTLE: Cortex-A7 quad
• Exynos 5 Octa 5410 – Only 4 CPU-cores active at a time – GPU: Power VR SGX544MP3 (Imagination Technologies)
• 3 GPU-cores at 533 MHz → 51 GFLOPS
• Exynos 5 Octa 5422 – All 8 CPU-cores can be working simultaneously – GPU: ARM Mali-T628 MP6
• 6 GPU-cores at 533 MHz → 102 GFLOPS
180$
Power VR SGX544MP3
• OpenCL 1.1 for Android
• Some limitations: – Compute units: 1 – Max WG size: 1 – Local mem: 1KB – Peak MFLOPS:
• 12 Ops per ck • 3 SIMD-ALUs x 4-wide
• Power monitor: – 4 x INA231 monitors
• A15, A7, GPU, Mem. • Instant Power • Every 260ms
36
[Figures: SGX GPU architecture diagram; Texas Instruments INA231 power monitor.]
37
ARM Mali-T628 MP6
• Supporting: – OpenGL® ES 3.0 – OpenCL™ 1.1 – DirectX® 11 – Renderscript™
• Cache L2 size – 32 – 256KB per core
• 6 Cores – 16 FP units – 2 SIMD4 each
• Other – Built-in MMU – Standard ARM Bus
• AMBA 4 ACE-Lite
38
Mali architecture
Qualcomm Snapdragon
39
Snapdragon 800
40
Snapdragon 800
• CPU: Quad-core Krait 400 up to 2.26GHz (ARMv7 ISA) – Similar to Cortex-A15. 11 stage integer pipeline with 3-way
decode and 4-way out-of-order speculative issue superscalar execution
– Pipelined VFPv4 and 128-bit wide NEON (SIMD) – 4 KB + 4 KB direct-mapped L0 cache – 16 KB + 16 KB 4-way set-associative L1 cache – 2 MB (quad-core) L2 cache
• GPU: Adreno 330, 450MHz – OpenGL ES 3.0, DirectX, OpenCL 1.2, RenderScript – 32 Execution Units. Each with 2 x SIMD4 units
• DSP: Hexagon 600MHz
41
Measuring power
• Snapdragon Performance Visualizer
• Trepn Profiler
• Power Tutor – Tuned for Nexus One – Model with 5%
precision – Open Source
42
More development boards
• Jetson TK1 board – Tegra K1 – Kepler GPU with 192 CUDA cores – 4-Plus-1 quad-core ARM Cortex A15 – Linux + CUDA – 180$
• Arndale – Exynos 5420 – big.LITTLE (A15 + A7) – GPU Mali T628 MP6 – Linux + OpenCL – 200$
• …
43
Advantages of integrated GPUs
• Discrete and integrated GPUs: different goals – NVIDIA Kepler: 2880 CUDA cores, 235W, 4.3 TFLOPS – Intel Iris 5200: 40 EU x 8 SIMD, 15-28W, 0.83 TFLOPS – PowerVR: 3 EU x 16 SIMD, < 1W, 0.051 TFLOPS
• Higher bandwidth between CPU and GPU. – Shared DRAM
• Avoid PCI data transfer – Shared LLC (Last Level Cache)
• Data coherence in some cases…
• CPU and GPU may have similar performance – It’s more likely that they can collaborate
• Cheaper!
44
Integrated GPUs are also improving
45
Agenda
• Motivation
• Hardware – Heterogeneous chips – Integrated GPUs – Advantages
• Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB
46
Programming models for heterogeneous
• Targeted at a single device – CUDA (NVIDIA) – OpenCL (Khronos Group standard) – OpenACC (C, C++ or Fortran + directives → OpenMP 4.0) – C++ AMP (Microsoft's extension of C++; HSA recently announced its own version) – RenderScript (Google's Java API for Android) – ParallDroid (Java + directives, from ULL, Spain) – Many more (SYCL, Numba Python, IBM Java, Matlab, R, JavaScript, …)
• Targeted at several devices (discrete GPUs) – Qilin (C++ and Qilin API compiled to TBB+CUDA) – OmpSs (OpenMP-like directives + Nanos++ runtime + Mercurium compiler) – XKaapi – StarPU
• Targeted at several devices (integrated GPUs) – Qualcomm MARE – Intel Concord
47
OpenCL on mobile devices
48
http://streamcomputing.eu/blog/2014-06-30/opencl-support-recent-android-smartphones/
OpenCL running on CPU
49
[Chart: execution time (ms) of the Base, Auto, T-Auto, SSE, AVX-SSE, AVX and OpenCL versions on a 3.3 GHz Ivy Bridge CPU, with per-version percentages annotated.]
The AVX version is about 1.8x faster than the OpenCL one, but requires about 1.8x more Halstead effort.
“Easy, Fast and Energy Efficient Object Detection on Heterogeneous On-Chip Architectures”, E. Totoni, M. Dikmen, M. J. Garzaran, ACM Transactions on Architecture and Code Optimization (TACO),10(4), December 2013.
Complexities of AVX intrinsics:

__m256 image_cache0 = _mm256_broadcast_ss(&fr_ptr[pixel_offsets[0]]);
curr_filter = _mm256_load_ps(&fb_array[fi]);                              // load
temp_sum = _mm256_add_ps(_mm256_mul_ps(image_cache7, curr_filter),
                         temp_sum);                                       // multiply-add
temp_sum2 = _mm256_insertf128_ps(temp_sum,
                _mm256_extractf128_ps(temp_sum, 1), 0);                   // copy high half to low half
cpm = _mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS);                      // compare
r = _mm256_movemask_ps(cpm);
if (r & (1 << 1)) {
    best_ind = filter_ind + 2;                                            // store index
    int control = 1 | (1 << 2) | (1 << 4) | (1 << 6);
    max_fil = _mm256_permute_ps(temp_sum2, control);                      // store max
    r = _mm256_movemask_ps(_mm256_cmp_ps(temp_sum2, max_fil, _CMP_GT_OS));
}

50
OpenCL doesn’t have to be tough
51
Courtesy: Khronos Group
Libraries and languages using OpenCL
52
Courtesy: AMD
Libraries and languages using OpenCL
53
Courtesy: AMD
Libraries and languages using OpenCL (cont.)
54
Courtesy: AMD
Libraries and languages using OpenCL (cont.)
55
Courtesy: AMD
C++AMP
• C++ Accelerated Massive Parallelism • Pioneered by Microsoft
– Requirements: Windows 7 + Visual Studio 2012 • Followed by Intel's experimental implementation
– C++ AMP on Clang/LLVM and OpenCL (AWOL since 2013) • Now HSA Foundation taking the lead • Keywords: restrict(device), array_view, parallel_for_each,…
– Example: SUM = A + B; // (2D arrays)
56
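A minimal hedged sketch of that SUM = A + B example in C++ AMP (not the slide's original code; the function, array names and sizes are assumed, and the restriction specifier is written restrict(amp) as in Microsoft's implementation):

#include <amp.h>
#include <vector>
using namespace concurrency;

void add_2d(const std::vector<float>& A, const std::vector<float>& B,
            std::vector<float>& SUM, int N) {          // N x N matrices stored row-major
    array_view<const float, 2> a(N, N, A), b(N, N, B);
    array_view<float, 2> sum(N, N, SUM);
    sum.discard_data();                                // no need to copy SUM to the device
    parallel_for_each(sum.get_extent(), [=](index<2> idx) restrict(amp) {
        sum[idx] = a[idx] + b[idx];                    // runs on the accelerator
    });
    sum.synchronize();                                 // copy the result back
}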
OpenCL Ecosystem
57
Courtesy: Khronos Group
SYCL’s flavour: A[i]=B[i]*2
58
Work-in-progress implementations:
- AMD: triSYCL → https://github.com/amd/triSYCL
- Codeplay: http://www.codeplay.com/
Advantages: 1. Easy to understand the concept of work-groups 2. Performance-portable between CPU and GPU 3. Barriers are automatically deduced
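A minimal hedged sketch of what that A[i] = B[i] * 2 kernel looks like in (provisional, 2014-era) SYCL; the buffer and accessor names are illustrative:

#include <CL/sycl.hpp>
#include <vector>
using namespace cl::sycl;

void double_vec(std::vector<float>& A, std::vector<float>& B) {
    queue q;                                           // default device (CPU or GPU)
    buffer<float, 1> bufA(A.data(), range<1>(A.size()));
    buffer<float, 1> bufB(B.data(), range<1>(B.size()));
    q.submit([&](handler& cgh) {
        auto a = bufA.get_access<access::mode::write>(cgh);
        auto b = bufB.get_access<access::mode::read>(cgh);
        cgh.parallel_for<class doubler>(range<1>(A.size()),
            [=](id<1> i) { a[i] = b[i] * 2; });
    });
}   // buffers are destroyed here and the results are copied back into A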
StarPU
• A runtime system for heterogeneous architectures
• Dynamically schedule tasks on all processing units – See a pool of heterogeneous
cores
• Avoid unnecessary data transfers between accelerators – Software SVM for
heterogeneous machines
59
[Diagram: StarPU sees a pool of heterogeneous processing units — CPU cores plus GPUs with their own memory — and runs a task such as A = A + B wherever its data A and B can be made available.]
Overview of StarPU
• Maximizing PU occupancy, minimizing data transfers • Ideas:
60
– Accept tasks that may have multiple implementations
• Together with potential inter-dependencies – Leads to a dynamic acyclic graph of
tasks
– Provide a high-level data management layer (Virtual Shared Memory VSM)
• Application should only describe – which data may be accessed by tasks – how data may be divided
[Diagram: software stack — applications, parallel compilers and parallel libraries sit on top of StarPU, which drives the CPU and GPU through the CUDA and OpenCL drivers.]
Tasks scheduling
• Dealing with heterogeneous hardware accelerators
61
• Tasks = – Data input & output – Dependencies with other tasks – Multiple implementations
• E.g. CUDA + CPU • Scheduling hints
• StarPU provides an Open Scheduling platform – Scheduling algorithm = plug-ins – Predefined set of popular policies
[Diagram: the same StarPU stack; a task f reads A in RW mode and B in R mode and carries cpu, gpu and spu implementations.]
Tasks scheduling
• Predefined set of popular policies
62
• Eager Scheduler – First come, first served policy – Only one queue
• Work Stealing Scheduler – Load balancing policy – One queue per worker
• Priority Scheduler – Describe the relative importance
of tasks – One queue per priority
[Diagrams: Eager scheduler — a single task queue feeding 3 CPUs and 2 GPUs; Work-Stealing scheduler — one queue per worker; Priority scheduler — one queue per priority level (prio0, prio1, prio2).]
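As a hedged usage note (not from the talk): with the standard StarPU runtime these predefined policies are typically selected at run time through the STARPU_SCHED environment variable, for example:

#include <stdlib.h>
#include <starpu.h>

int main(void) {
    setenv("STARPU_SCHED", "dmda", 1);   /* or "eager", "ws", "prio", "dm", ... */
    if (starpu_init(NULL) != 0) return 1;
    /* ... register data and submit tasks ... */
    starpu_shutdown();
    return 0;
}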
Tasks scheduling
• Predefined set of popular policies
63
• Dequeue Model (DM) Scheduler – Using codelet performance models
• Kernel calibration on each available computing device – Raw history model of kernels’ past execution times – Refined models using regression on kernels’ execution times history
• Dequeue Model Data Aware (DMDA) Scheduler – Data transfer cost vs kernel offload
benefit – Transfer cost modelling ( ) – Bus calibration
[Diagrams: Gantt-style timelines of tasks on cpu1-cpu3 and gpu1-gpu2 under the DM and DMDA schedulers.]
Some results (MxV, 4 CPUs, 1 GPU)
64
StarPU config: Eager, 4 CPUs, 1 GPU — StarPU config: DMDA, 4 CPUs, 1 GPU
StarPU config: Eager, 3 CPUs, 1 GPU — StarPU config: DMDA, 3 CPUs, 1 GPU
Terminology
• A Codelet. . . – . . . relates an abstract computation kernel to its implementation(s) – . . . can be instantiated into one or more tasks – . . . defines characteristics common to a set of tasks
• A Task. . . – . . . is an instantiation of a Codelet – . . . atomically executes a kernel from its beginning to its end – . . . receives some input – . . . produces some output
• A Data Handle. . . – . . . designates a piece of data managed by StarPU – . . . is typed (vector, matrix, etc.) – . . . can be passed as input/output for a Task
65
Basic Example: Scaling a Vector
66
Declaring a codelet:

struct starpu_codelet scal_cl = {
    .cpu_funcs  = { scal_cpu_f, NULL },    /* kernel functions */
    .cuda_funcs = { scal_cuda_f, NULL },
    .nbuffers   = 1,                       /* number of data pieces */
    .modes      = { STARPU_RW },           /* data access mode */
};

Kernel functions:

void scal_cpu_f(void *buffers[], void *cl_arg)   /* kernel function prototype */
{
    /* retrieve the data handle */
    struct starpu_vector_interface *vector_handle = buffers[0];
    /* get the pointer from the data handle */
    float *vector = STARPU_VECTOR_GET_PTR(vector_handle);
    /* get the small-size inline data */
    float *ptr_factor = cl_arg;
    /* do the computation */
    for (unsigned i = 0; i < NX; i++)
        vector[i] *= *ptr_factor;
}

void scal_cuda_f(void *buffers[], void *cl_arg) { … }
Basic Example: Scaling a Vector
67
Main code:

float factor = 3.14;
float vector1[NX];
float vector2[NX];
/* declare the data handles */
starpu_data_handle_t vector_handle1;
starpu_data_handle_t vector_handle2;
/* ..... */
/* register the pieces of data and get the handles (now under StarPU control) */
starpu_vector_data_register(&vector_handle1, 0, (uintptr_t)vector1,
                            NX, sizeof(vector1[0]));
starpu_vector_data_register(&vector_handle2, 0, (uintptr_t)vector2,
                            NX, sizeof(vector2[0]));
/* non-blocking task submits (params: codelet, StarPU-managed data, small-size inline data) */
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle1,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
starpu_task_insert(&scal_cl, STARPU_RW, vector_handle2,
                   STARPU_VALUE, &factor, sizeof(factor), 0);
/* wait for all tasks submitted so far */
starpu_task_wait_for_all();
/* unregister the pieces of data (the handles are destroyed;
   the vectors are now back under user control) */
starpu_data_unregister(vector_handle1);
starpu_data_unregister(vector_handle2);
/* ..... */
Qualcomm MARE
• MARE is a programming model and a runtime system that provides simple yet powerful abstractions for parallel, power-efficient software – Simple C++ API allows developers to express concurrency – User-level library that runs on any Android device, and on Linux,
Mac OS X, and Windows platforms
• The goal of MARE is to reduce the effort required to write apps that fully utilize heterogeneous SoCs
• Concepts: – Tasks are units of work that can be asynchronously executed – Groups are sets of tasks that can be canceled or waited on
68
Basic Example: Hello World
69
More complex example: C=A+B on GPU
70
More complex example: C=A+B on GPU
71
MARE departures
• Similarities with TBB – Based on tasks and 2-level API (task level and templates)
• pfor_each, ptransform, pscan, … • Synchronous Dataflow classes ≈ TBB’s Flow Graphs
– Concurrent data structures: queue, stack, … • Departures
– Expression of dependencies is first class – Flexible group membership and work or group cancelation – Optimized for some Qualcomm chips
• Power classes: – Static: mare::power::mode {efficient, saver, …} – Dynamic: mare::power::set_goal(desired, tolerance)
• Aware of the mobile architecture: aggressive power management – Cores can be shut down or affected by DVFS
72
MARE results
• Zoomm web browser implemented on top of MARE
73
C. Cascaval, et al.. ZOOMM: a parallel web browser engine for multicore mobile devices. In Symposium on Principles and practice of parallel programming, PPoPP ’13, pages 271–280, 2013.
MARE results
• Bullet Physics parallelized with MARE
74
Courtesy: Calin Cascaval
Intel Concord
• C++ heterogeneous programming framework for integrated CPU and GPU processors – Shared Virtual Memory (SVM) in software – Adapts existing data-parallel C++ constructs to heterogeneous
computing using TBB – Available open source as Intel Heterogeneous Research
Compiler (iHRC) at https://github.com/IntelLabs/iHRC/
• Papers: – Rajkishore Barik, Tatiana Shpeisman, et al. Efficient mapping of
irregular C++ applications to integrated GPUs. CGO 2014. – Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian
Lewis, Chunling Hu, and Keshav Pingali. Adaptive heterogeneous scheduling on integrated GPUs. PACT 2014.
75
Intel Concord
• Extends the TBB API: – parallel_for_hetero(int numiters, const Body &B, bool device); – parallel_reduce_hetero(int numiters, const Body &B, bool device);
76 Courtesy: Intel
Example: parallel_for_hetero
• Concord compiler generates OpenCL version – Automatically takes care of the data thanks to SVM
77 Courtesy: Intel
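A hedged sketch of how a body for parallel_for_hetero might look, based only on the signature above (the Body interface and all names here are assumptions, not Intel's actual code):

class VecAdd {                       // the same C++ body runs on the CPU or the GPU
    float *a, *b, *c;
public:
    VecAdd(float *a_, float *b_, float *c_) : a(a_), b(b_), c(c_) {}
    void operator()(int i) const { c[i] = a[i] + b[i]; }
};

void vec_add_hetero(float *A, float *B, float *C, int N, bool on_gpu) {
    // Concord generates the OpenCL version of the body; SVM lets the GPU reuse the CPU pointers.
    parallel_for_hetero(N, VecAdd(A, B, C), on_gpu);   // 'on_gpu' is the 'device' flag of the signature
}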
Concord framework
78 Courtesy: Intel
SVM SW implementation on Haswell
79
SVM translation in OpenCL code
• svm_const is a runtime constant and is computed once • Every CPU pointer before dereference on the GPU is
converted into GPU address space using AS_GPU_PTR
80
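A hedged sketch of the pointer-translation idea described above (illustrative only; the macro and kernel below are not the actual Concord-generated code):

#define AS_GPU_PTR(type, cpu_ptr) ((__global type *)((ulong)(cpu_ptr) + svm_const))

__kernel void deref_example(ulong svm_const,           /* computed once per dispatch  */
                            ulong cpu_array,           /* pointer value from the CPU  */
                            __global float *out) {
    int i = get_global_id(0);
    __global float *p = AS_GPU_PTR(float, cpu_array);  /* translate before dereference */
    out[i] = p[i];
}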
Concord results
81
Speedup & Energy savings vs multicore CPU
82
Heterogeneous execution on both devices
• Iteration space distributed among the available devices • Problem: find the best data partition • Example: Barnes Hut and Facedetect relative execution time
– Varying the amount of work offloaded to the GPU – For BH the optimum is 40% of the work carried out on the GPU – For FD the optimum is 0% of the work carried out on the GPU
83
Partitioning based on on-line profiling
Naïve profiling vs. asymmetric profiling
84
• Naïve profiling: assign a chunk to the CPU and another to the GPU, compute both, wait at a barrier, then partition the rest of the iteration space according to the relative speeds.
• Asymmetric profiling: assign a chunk just to the GPU while the CPU keeps computing; when the GPU is done, partition the rest of the iteration space according to the relative speeds.
Agenda
• Motivation
• Hardware – Heterogeneous chips – Integrated GPUs – Advantages
• Software – Programming models for heterogeneous systems – Programming models for heterogeneous chips – Our approach based on TBB
85
Our heterogeneous parallel_for
86
Angeles Navarro, Antonio Vilches, Francisco Corbera and Rafael Asenjo, "Strategies for maximizing utilization on multi-CPU and multi-GPU heterogeneous architectures", The Journal of Supercomputing, May 2014.
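The details are in the paper above; purely as an illustration (not the authors' implementation), a heterogeneous parallel_for can be sketched in TBB as one feeder task serving chunks to the GPU while the CPU cores consume the rest of the iteration space — every name below is an assumption:

#include <tbb/parallel_for.h>
#include <tbb/task_group.h>
#include <atomic>
#include <algorithm>

template <typename CpuBody, typename GpuOffload>
void hetero_parallel_for_sketch(int begin, int end, int gpu_chunk,
                                CpuBody cpu_body, GpuOffload gpu_offload) {
    std::atomic<int> next(begin);                           // shared iteration counter
    tbb::task_group tg;
    tg.run([&] {                                            // GPU feeder task
        int i;
        while ((i = next.fetch_add(gpu_chunk)) < end)
            gpu_offload(i, std::min(i + gpu_chunk, end));   // e.g. enqueue an OpenCL kernel
    });
    const int cpu_chunk = 64;                               // arbitrary CPU grain for the sketch
    int i;
    while ((i = next.fetch_add(cpu_chunk)) < end)
        tbb::parallel_for(i, std::min(i + cpu_chunk, end), cpu_body);
    tg.wait();                                              // join the GPU feeder
}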
Comparison with StarPU
• MxV benchmark – Three schedulers tested: greedy, work-stealing, HEFT – Static chunk size: 2000, 200 and 20 matrix rows
91
Choosing the GPU block size
• Belviranli, M. E., Bhuyan, L. N., & Gupta, R. (2013). A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans. Archit. Code Optim., 9(4), 57:1–57:20.
92
GPU block size for irregular codes
• Adapt between time-steps and inside the time-step
93
[Chart: Barnes-Hut average throughput per chunk size (40 to 819600), for static and adaptive chunk sizes at time steps 0 and 30.]
GPU block size for irregular codes
• Throughput variation along the iteration space – For two different time-steps – Different GPU chunk-sizes
94
[Charts: Barnes-Hut throughput variation along the iteration space at time steps 0 and 5, for GPU chunk sizes 320, 640, 1280 and 2560.]
Adapting the GPU chunk-size
• Assumption: – Irregular behavior as a sequence of regimes of regular behavior
95
[Diagram: measured GPU throughput samples are fitted to a·ln(x)+b; the next chunk size moves between G(t-1)/2 and G(t-1)*2 — increased while λG keeps growing, decreased when λG falls — with G = a/thld near the plateau.]
[Charts: GPU throughput and the chunk size chosen by LogFit along the iteration space, for two Barnes-Hut time steps.]
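Purely as an illustration (not the authors' LogFit code), the adaptation rule suggested by the figure can be sketched as:

// Grow the GPU chunk while the measured throughput keeps improving; shrink it when it drops.
size_t next_gpu_chunk(size_t prev_chunk, double prev_thr, double curr_thr) {
    if (curr_thr > prev_thr * 1.05)   // still on the rising part of a*ln(x)+b
        return prev_chunk * 2;        // G(t) = G(t-1) * 2
    if (curr_thr < prev_thr * 0.95)   // past the knee: throughput falling
        return prev_chunk / 2;        // G(t) = G(t-1) / 2
    return prev_chunk;                // near the plateau: keep the current size
}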
Preliminary results: Energy-Performance
• Static: Oracle-like static partition of the work based on profiling • Concord: Intel's approach: GPU size computed once • HDSS: Belviranli et al.'s approach: GPU size computed once • LogFit: our dynamic CPU and GPU chunk-size partitioner
96
On Haswell
[Chart: Barnes-Hut energy per iteration (Joules) vs. performance (iterations per ms) for Static, Concord, HDSS and LogFit.]
[Chart: Barnes-Hut offline search for the static partition — execution time in seconds as the percentage of the iteration space offloaded to the GPU goes from 0% to 100%.]
Preliminary results: Energy-Performance
97
[Charts: energy per iteration (Joules) vs. performance (iterations per ms) for SpMV, CFD, Nbody and one more benchmark, comparing Static, Concord, HDSS and LogFit.]
• w.r.t. Static: improvements of up to 52% (18% on average)
• w.r.t. Concord and HDSS: improvements of up to 94% and 69% (28% and 27% on average)
Our heterogeneous pipeline • ViVid, an object detection application • Contains three main kernels that form a pipeline
• Would like to answer the following questions:
– Granularity: coarse or fine grained parallelism? – Mapping of stages: where do we run them (CPU/GPU)? – Number of cores: how many of them when running on CPU? – Optimum: what metric do we optimize (time, energy, both)?
98
[Diagram: three-stage pipeline — input frame → Stage 1: Filter (response and index matrices) → Stage 2: Histogram (histograms) → Stage 3: Classifier (detection response) → output.]
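For illustration only (not the authors' code, and with a hypothetical token type and stage functions), such a three-stage pipeline maps naturally onto tbb::parallel_pipeline:

#include <tbb/pipeline.h>
#include <cstdio>

struct Frame { int id; };                               // hypothetical token type
static Frame* read_frame()              { static int n = 0; return n < 100 ? new Frame{n++} : nullptr; }
static Frame* filter_stage(Frame* f)    { /* filter kernel (CPU or GPU) */ return f; }
static Frame* histogram_stage(Frame* f) { /* histogram kernel */ return f; }
static void   classify_stage(Frame* f)  { std::printf("frame %d done\n", f->id); delete f; }

void run_vivid_like_pipeline(int ntokens) {
    tbb::parallel_pipeline(ntokens,
        tbb::make_filter<void, Frame*>(tbb::filter::serial_in_order,
            [](tbb::flow_control& fc) -> Frame* {
                Frame* f = read_frame();                 // input stage
                if (!f) fc.stop();
                return f;
            })
      & tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,
            [](Frame* f) { return filter_stage(f); })    // Stage 1: Filter
      & tbb::make_filter<Frame*, Frame*>(tbb::filter::parallel,
            [](Frame* f) { return histogram_stage(f); }) // Stage 2: Histogram
      & tbb::make_filter<Frame*, void>(tbb::filter::serial_in_order,
            [](Frame* f) { classify_stage(f); }));       // Stage 3: Classifier + output
}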
Granularity
• Coarse grain: – CG
• Medium grain: – MG
• Fine grain:
– Fine grain also in the CPU via AVX intrinsics
99
[Diagrams: coarse-grain mapping — each pipeline stage handles an item on a single CPU core; medium-grain — each stage processes its item in parallel on several CPU cores; fine-grain — stages offloaded to the GPU (and vectorized with AVX on the CPU).]
More mappings
100
[Diagrams: additional mappings that mix devices per stage — e.g. Stage 1 on the GPU with Stages 2 and 3 on CPU cores, and other CPU/GPU combinations.]
Accounting for all alternatives
• In general: nC CPU cores, 1 GPU and p pipeline stages
# alternatives = 2^p x (nC + 2)
• For Rodinia's SRAD benchmark (p = 6, nC = 4) → 384 alternatives
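As a quick check of that count for SRAD:

\#\text{alternatives} = 2^{p} \times (n_C + 2) = 2^{6} \times (4 + 2) = 64 \times 6 = 384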
101
[Diagram: a generic p-stage pipeline in which every stage can run either on the GPU or on the nC CPU cores.]
Framework and Model
• Key idea: 1. Run only on the GPU 2. Run only on the CPU 3. Analytically extrapolate for heterogeneous execution 4. Find out the best configuration → RUN
102
[Diagram: DP-MG configuration used to collect λ (throughput) and E (energy) for the homogeneous CPU-only and GPU-only runs.]
Environmental Setup: Benchmarks
• Four Benchmarks – ViVid (Low and High Definition inputs)
– SRAD
– Tracking
– Scene Recognition
103
[Diagrams: ViVid — 3 stages (Filter, Histogram, Classifier); SRAD — 6 stages (Extraction, Prep., Reduction, Comp. 1, Comp. 2, Statistics); Tracking — 1 stage; Scene Recognition — 2 stages (Feature extraction, SVM).]
ViVid: throughput/energy on Ivy Bridge
106
LD (600x416), Ivy Bridge HD (1920x1080), Ivy Bridge
[Charts: throughput/energy vs. number of threads (CG) for the LD (600x416) and HD (1920x1080) inputs on Ivy Bridge.]
[Diagrams: the CP-CG mapping (GPU-CPU path vs. CPU path, one core per stage) and the CP-MG mapping (all cores cooperate within each stage).]
ViVid: throughput/energy on Haswell
107
LD, Haswell HD, Haswell
[Charts: throughput/energy vs. number of threads (CG) for the LD and HD inputs on Haswell.]
[Diagram: the CP-MG mapping.]
Does Higher Throughput imply Lower Energy?
108
[Charts: Haswell, HD input — throughput- and energy-related metrics vs. number of threads (CG).]
SRAD on Ivy Bridge
109
[Charts: SRAD throughput/energy vs. number of threads (CG) on Ivy Bridge.]
[Diagrams: the CP-MG and DP-MG mappings of SRAD's six stages across the CPU cores and the GPU.]
SRAD on Haswell
110
[Charts: SRAD throughput/energy vs. number of threads (CG) on Haswell.]
[Diagram: the DP-CG mapping of SRAD's six stages.]
On-going work
• Test the model in other heterogeneous chips • ViVid LD running on Odroid XU-E
– Ivy Bridge: 65 fps, 0.7 J/frame and 92 fps/J with CP-CG(5) – Exynos 5: 0.7 fps, 11 J/frame and 0.06 fps/J with CP-CG(4)
112
[Charts: estimated vs. measured throughput and throughput/energy for the CP-CG, DP-CG and CPU-CG configurations with 1 to 4 threads.]
Future Work • Consider energy for parallel loops scheduling/partitioning
• Consider MARE as an alternative to our TBB-based implem.
• Apply DVFS to reduce energy consumption – For video applications: no need to go faster than 33 fps
• Also explore other parallel patterns: – Reduce – Parallel_do – …
113
Conclusions • Plenty of heterogeneous on-chip architectures out there
• It is important to use both devices – Need to find the best mapping/distribution/scheduling out of the
many possible alternatives.
• Programming models and runtimes aimed at this goal are in their infancy: they may have a huge impact on the mobile market
• Challenges: – Hide hardware complexity – Consider energy in the partition/scheduling decisions – Minimize overhead of adaptation policies
114
Collaborators
• Mª Ángeles González Navarro (UMA, Spain) • Francisco Corbera (UMA, Spain) • Antonio Vilches (UMA, Spain) • Andrés Rodríguez (UMA, Spain) • Alejandro Villegas (UMA, Spain) • Rubén Gran (U. Zaragoza, Spain) • Maria Jesús Garzarán (UIUC, USA) • Mert Dikmen (UIUC, USA) • Kurt Fellows (UIUC, USA) • Ehsan Totoni (UIUC, USA)
115