[Lecture slide transcript — Eindhoven University of Technology]
HETEROGENEOUS CPU+GPU COMPUTING
Ana Lucia Varbanescu – University of Amsterdam [email protected]
Significant contributions by: Stijn Heldens (U Twente), Jie Shen (NUDT, China), Basilio Fraguela (A Coruna University, ESP)
Today’s agenda
• Preliminaries
• Part I: Introduction to CPU+GPU heterogeneous computing
• Performance promise vs. challenges
• Part II: Programming models
• Part III: Workload partitioning models
• Static vs. Dynamic partitioning
• Part IV: Static partitioning and Glinda
• Part V: Tools for (programming) heterogeneous systems
• Low-level to high-level
• Take home message
Goal • Discuss heterogeneous computing as a promising
solution for efficient resource utilization • And performance!
• Introduce methods for efficient heterogeneous computing • Programming • Partitioning
• Provide comparisons & selection criteria
• Current challenges and open research questions.
• Fair to others, but we advertise our research :-)
Heterogeneous platforms • Systems combining main processors and accelerators
• e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC • Everywhere from supercomputers to mobile devices
Heterogeneous platforms • Host-accelerator hardware model
[Figure: host–accelerator model — a Host (CPUs) connected over PCIe / shared memory to one or more Accelerators: GPUs, MICs, FPGAs, …]
Heterogeneous platforms • Top 500 (June 2015)
[Chart: Top 500 systems; accelerated systems highlighted, up to 195 cores/node]
All systems are based on multi-cores. 90 systems (18%) have accelerators. Of those, 50% are NVIDIA GPUs and 30% are Intel MICs (Xeon Phi).
[Figure: CPU (few cores) + GPU (thousands of cores)]
Our focus today … • A heterogeneous platform = CPU + GPU
• Most solutions work for other/multiple accelerators
• An application workload = an application + its input dataset
• Workload partitioning = workload distribution among the processing units of a heterogeneous system
BEFORE WE START … Basic knowledge about CPUs and GPUs
Generic multi-core CPU
• Hardware threads, SIMD units (vector lanes)
• L1 and L2 dedicated caches
• Shared L3/L4 cache
• Main memory, I/O
• Figures of merit: peak performance, bandwidth
Multi-core CPUs
• Architecture • Few large cores • (Integrated GPUs) • Vector units
• Streaming SIMD Extensions (SSE) • Advanced Vector Extensions (AVX)
• Stand-alone • Memory
• Shared, multi-layered • Per-core caches + shared caches
• Programming • Multi-threading • OS Scheduler
Parallelism
• Core-level parallelism ~ task/data parallelism (coarse)
• 4–12 powerful cores
• Hardware hyperthreading (2x)
• Local caches
• Symmetrical or asymmetrical threading model
• Implemented by the programmer
• SIMD parallelism = data parallelism (fine)
• 4 SP / 2 DP floating-point operations per vector instruction
• 256-bit vectors
• Runs the same instruction on different data
• Sensitive to divergence
• NOT the same instruction => performance loss
• Implemented by the programmer OR the compiler
Programming models
• Pthreads + intrinsics
• TBB – Threading Building Blocks
• Threading library
• OpenCL
• To be discussed …
• OpenMP
• Traditional parallel library
• High-level, pragma-based
• Cilk
• Simple divide-and-conquer model
(Level of abstraction increases down the list.)
A GPU Architecture
Integration into host system • Typically PCI Express 2.0
• Theoretical speed: 8 GB/s
• Effective: ≤ 6 GB/s (in reality 4–6 GB/s)
• v3.0 recently available
• Double bandwidth
• Less protocol overhead
(NVIDIA) GPUs • Architecture
• Many (100s) slim cores • Sets of (32 or 192) cores grouped into “multiprocessors” with
shared memory • SM(X) = streaming multiprocessors
• Work as accelerators • Memory
• Shared L2 cache • Per-core caches + shared caches • Off-chip global memory
• Programming • Symmetric multi-threading • Hardware scheduler
GPU Parallelism • Data parallelism (fine-grain) • SIMT (Single Instruction Multiple Thread) execution
• Many threads execute concurrently • Same instruction • Different data elements • HW automatically handles divergence
• Not same as SIMD because of multiple register sets, addresses, and flow paths*
• Hardware multithreading • HW resource allocation & thread scheduling
• Excess of threads to hide latency • Context switching is (basically) free
*http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Specific programming model: CUDA • CUDA: Compute Unified Device Architecture
• C/C++ extensions • Other wrappers exist
• Straightforward mapping onto hardware • Hierarchy of threads (map to cores)
• Configurable at logical level • Various memory spaces (map to physical mem. spaces)
• Usable via variable scopes
• SIMT: single instruction, multiple threads • 1000s of threads running concurrently • Hardware multi-threading
• GPU threads are lightweight
CUDA: Hierarchy of threads • Each thread executes the kernel code
• One thread runs on one CUDA core
• Threads are logically grouped into thread blocks • Threads in the same block can cooperate • Threads in different blocks cannot cooperate
• All thread blocks are logically organized in a Grid • 1D or 2D or 3D • Threads and blocks have unique IDs
• A grid specifies how many instances of the kernel are run
CUDA Model of Parallelism
Hierarchy of threads
[Figure: Thread ⊂ Block ⊂ Grid]
• CUDA virtualizes the physical hardware • A block is a virtualized streaming multiprocessor
• threads, shared memory • A thread is a virtualized scalar processor
• registers, PC, state
• Execution model: • Threads execute in warps (32 threads per warp)
• Called “wavefronts” by AMD (64 threads) • All threads in a warp execute the same code
• On different data • Blocks = multiple warps
• Scheduled independently on the same SM
CUDA Model of Parallelism
CPU vs. GPU
[Figure: CPU die — large control logic and cache, few ALUs; GPU die — many small ALUs]
CPU: low latency, high flexibility. Excellent for irregular codes with limited parallelism.
GPU: high throughput. Excellent for massively parallel workloads.
PART I Heterogeneous processing: pro’s and con’s
Hardware Performance metrics • Clock frequency [GHz] = absolute hardware speed
• Memories, CPUs, interconnects
• Operational speed [GFLOPs] • Instructions per cycle + frequency
• Memory bandwidth [GB/s] • differs a lot between different memories on chip
• Power [Watt]
• Derived metrics • FLOP/Byte, FLOP/Watt
Peak = chips * cores * threads/core * vector_lanes * FLOPs/cycle * clock frequency
• Some examples:
• Intel Core i7 CPU: 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
• NVIDIA GTX 580 GPU: 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
• ATI HD 6970: 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
Theoretical peak performance
Performance ratio (CPU:GPU): 1:10 !!!
Bandwidth = memory bus frequency * transfers per cycle * bus width
• Memory clock != CPU clock!
• The result is in bits: divide by 8 for GB/s
• Some examples:
• Intel Core i7 DDR3: 1.333 * 2 * 64 / 8 = 21 GB/s
• NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 / 8 = 192 GB/s
• ATI HD 6970 GDDR5: 1.375 * 4 * 256 / 8 = 176 GB/s
DRAM Memory bandwidth
Performance ratio (CPU:GPU): 1:8 !!!
• Chip manufactures specify Thermal Design Power (TDP) • We can measure dissipated power
• Whole system • Typically (much) lower than TDP
• Power efficiency • FLOPS / Watt
• Examples (with theoretical peak and TDP) • Intel Core i7: 154 / 160 = 1.0 GFLOPs/W • NVIDIA GTX 580: 1581 / 244 = 6.3 GFLOPs/W • ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
Power
| Processor | Cores | Threads/ALUs | GFLOPS | Bandwidth (GB/s) |
|---|---|---|---|---|
| Sun Niagara 2 | 8 | 64 | 11.2 | 76 |
| IBM BG/P | 4 | 8 | 13.6 | 13.6 |
| IBM Power 7 | 8 | 32 | 265 | 68 |
| Intel Core i7 | 4 | 16 | 85 | 25.6 |
| AMD Barcelona | 4 | 8 | 37 | 21.4 |
| AMD Istanbul | 6 | 6 | 62.4 | 25.6 |
| AMD Magny-Cours | 12 | 12 | 125 | 25.6 |
| Cell/B.E. | 8 | 8 | 205 | 25.6 |
| NVIDIA GTX 580 | 16 | 512 | 1581 | 192 |
| NVIDIA GTX 680 | 8 | 1536 | 3090 | 192 |
| AMD HD 6970 | 384 | 1536 | 2703 | 176 |
| AMD HD 7970 | 32 | 2048 | 3789 | 264 |
| Intel Xeon Phi 7120 | 61 | 240 | 2417 | 352 |
Summary
GPU vs. CPU performance
[Chart: theoretical peak GFLOPS over time — NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3 GHz dual-core P4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon)]
1 GFLOP = 10^9 ops
These are theoretical numbers! In practice, efficiency is much lower!
GPU vs. CPU performance
[Chart: theoretical peak memory bandwidth over time, GPUs vs. CPUs]
1 GB = 8 x 10^9 bits
These are theoretical numbers! In practice, efficiency is much lower!
Heterogeneity vs. Homogeneity • Increase performance
• Both devices work in parallel • Gain is much more than 10%
• Decrease data communication • Which is often the bottleneck of the system
• Different devices for different roles
• Increase flexibility and reliability • Choose one/all *PUs for execution • Fall-back solution when one *PU fails
• Increase power efficiency • Cheaper per flop
Example 1: dot product • Dot product
• Compute the dot product of 2 (1D) arrays
• Performance • TG = execution time on GPU • TC = execution time on CPU • TD = data transfer time CPU-GPU
• GPU best or CPU best?
Example 1: dot product
[Bar chart: execution time (ms), y-axis 0–250, comparing TG, TD, TC, and TMax]
Example 2: separable convolution • Separable convolution (CUDA SDK)
• Apply a convolution filter (kernel) on a large image. • Separable kernel allows applying
• Horizontal first • Vertical second
• Performance • TG = execution time on GPU • TC = execution time on CPU • TD = data transfer time
• GPU best or CPU best?
Example 2: separable convolution
[Bar chart: execution time (ms), y-axis 0–180, comparing TG, TD, TC, and TMax]
Example 3: matrix multiply • Matrix multiply
• Compute the product of 2 matrices
• Performance • TG = execution time on GPU • TC = execution time on CPU • TD = data transfer time CPU-GPU
• GPU best or CPU best?
Example 3: matrix multiply
[Bar chart: execution time (ms), y-axis 0–450, comparing TG, TD, TC, and TMax]
Example 4: Sound ray tracing
Example 4: Sound ray tracing
Which hardware? • Our application has … • Massive data-parallelism … • No data dependency between rays … • Compute-intensive per ray …
• … clearly, this is a perfect GPU workload !!!
Results [1]
[Bar chart: execution time (s), y-axis 0–180, for workload W9 (1.3 GB): Only GPU vs. Only CPU]
Only 2.2x performance improvement! We expected 100x …
[Figure: (a) sound speed (330–350 m/s) vs. altitude (0.1–0.8 km) near the aircraft; (b) ray paths over relative distance from source (1–5 km) for launch angles −40° to +40°]
Workload profile
[Chart: #simulation iterations (0–7000) per ray ID (0–7000)]
Peak processing: ~7000 iterations. Bottom processing: ~500 iterations.
Results [2]
[Bar chart: execution time (ms)]
62% performance improvement compared to “Only-GPU”
• Graph traversal (Breadth First Search, BFS) • Traverses all vertices “in levels”
Example 5: Graph processing (BFS)
• … is data-dependent • … has poor locality • … has low computation-to-memory-ops ratio …
• CPU or GPU?
Graph processing
BFS – normalized
[Bar chart: normalized performance (0–1) for Xeon (CPU), Tesla (GPU), and GTX (GPU) on graphs WT, CR, 1M, SW, EU, CH, ST, ES, 64K, WV, 4K]
Performance depends on the diameter and the degree: large diameter => CPU; high degree => GPU.
Using both will allow the right balance!
So … • There are very few GPU-only applications
• CPU – GPU communication bottleneck. • Increasing performance of CPUs
• A part of the computation can be done by the CPU. • How to program an application to enable this? • Which part?
Main challenges: programming and workload partitioning!
PART II Challenge 1: Programming
Programming models (PMs) • Heterogeneous computing = a mix of different processors
on the same platform. • Programming
• Mix of programming models • One(/several?) for CPUs – OpenMP • One(/several?) for GPUs – CUDA
• Single programming model (unified) • OpenCL is a popular choice
[Figure: programming models arranged from high level (OmpSs, OpenMP 4.0, …) to low level]
OpenCL in a nutshell • Open standard for portable multi-core programming • Architecture independent
• Explicit support for multi-/many-cores
• Low-level host API • High-level bindings (e.g., Java, Python)
• Separate kernel language • Run-time compilation • Supports (some) architecture-dependent optimizations
• Explicit & implicit
The OpenCL platform model
[Figure: a host program launches compute kernels; kernels execute as work-items, grouped into work-groups]
The OpenCL memory model
The OpenCL virtual platform
Programming in OpenCL • Kernels are the main functional units in OpenCL
• Kernels are executed by work-items • Work-items are mapped transparently on the hardware platform
• Functional portability is guaranteed • Programs run correctly on different families of hardware • Explicit platform-specific optimizations are dangerous
• Performance portability is NOT guaranteed
OpenCL is an efficient programming model for heterogeneous platforms iff we specialize the code to fit
different processors.
• Functional portability guaranteed by the standard • Performance portability is NOT guaranteed
• vs. CUDA: • Used to be comparable (2012) • Lagging behind due to lack of support from NVIDIA
• vs. OpenMP/other CPU models: 3 challenges • GPU-like programming styles
• Memory access patterns, data transfer, (local memory) => MUST remove • Parallelism granularity
• Architectural mismatches with CPUs => MUST tune • OpenCL compilers/runtime libraries/…
• Still under development => MUST test
OpenCL for heterogeneous platforms
OpenCL is an efficient programming model for heterogeneous platforms iff we specialize the code to fit
different processors.
Heterogeneous Computing PMs
High level
Low level
Generic Specific
OpenACC, OpenMP 4.0 OmpSS, StarPU, … HPL
HyGraph, Cashmere, GlassWing
TOTEM OpenCL OpenMP+CUDA
Domain and/or application specific.
Focus on: productivity and performance
Domain specific, focus on performance. More difficult to use. Quite rare.
The most common at the moment. Useful for performance, but more difficult to use in practice
High productivity; not all applications are easy to implement.
Heterogeneous computing PMs • CUDA + OpenMP/TBB
• Typical combination for NVIDIA GPUs • Individual development per *PU • Glue code can be challenging
• OpenCL (KHRONOS group) • Functional portability => can be used as a unified model • Performance portability via code specialization
• HPL (University of A Coruna, Spain) • Library on top of OpenCL, to automate code specialization
Heterogeneous computing PMs • StarPU (INRIA, France)
• Special API for coding • Runtime system for scheduling
• OmpSS (UPC + BSC, Spain) • C + OpenCL/CUDA kernels • Runtime system for scheduling and communication optimization
Heterogeneous computing PMs

• Cashmere (VU Amsterdam + NLeSC)
  • Dedicated to divide-and-conquer solutions
  • OpenCL backend

• GlassWing (VU Amsterdam)
  • Dedicated to MapReduce applications

• TOTEM (U. of British Columbia, Canada)
  • Graph processing
  • CUDA + multi-threading

• HyGraph (TU Delft, U Twente, UvA, NL)
  • Graph processing
  • Based on CUDA + OpenMP
End of part II

• Questions?
PART III Challenge 2: Workload partitioning
Workload

• A workload is a DAG (directed acyclic graph) of "kernels"
• Five workload classes (I–V), from simplest to most general:
  • SK-One: a single kernel, executed once
  • SK-Loop: a single kernel, executed repeatedly in a loop
  • MK-Seq: multiple kernels (k0, k1, …, kn) executed in sequence
  • MK-Loop: multiple kernels executed repeatedly in a loop
  • MK-DAG: multiple kernels with general DAG dependencies
Determining the partition

• Static partitioning (SP) vs. Dynamic partitioning (DP)

[Figure: the workload is split between the GPU (thousands of cores) and the CPU (multiple cores)]
Static vs. dynamic

• Static partitioning
  • + can be computed before runtime => no overhead
  • + can detect GPU-only/CPU-only cases
  • + no unnecessary CPU-GPU data transfers
  • -- does not work for all applications

• Dynamic partitioning
  • + responds to runtime performance variability
  • + works for all applications
  • -- incurs (high) runtime scheduling overhead
  • -- might introduce (high) CPU-GPU data-transfer overhead
  • -- might not work for CPU-only/GPU-only cases
Determining the partition

• Static partitioning (SP) vs. Dynamic partitioning (DP)

[Figure: the same CPU (multiple cores) / GPU (thousands of cores) partitioning, annotated:]
  • Static: (near-)optimal partition, but low applicability
  • Dynamic: often sub-optimal, but high applicability
Heterogeneous Computing today

| | Static | Dynamic |
|---|---|---|
| **Single kernel** | Systems/frameworks: Qilin, Insieme, SKMD, Glinda, …; libraries: HPL, … Limited applicability; low overhead => high performance. | Sporadic attempts and light runtime systems. Not interesting, given that static and run-time-based systems exist. |
| **Multi-kernel (complex DAG)** | Glinda 2.0. Low overhead => high performance; still limited in applicability. | Run-time-based systems: StarPU, OmpSS, … High applicability, high overhead. |
End of part III

• Questions?