isca final presentation - applications
Post on 19-Jun-2015
HSA APPLICATIONS
WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS
WITH J.P. BORDES AND JUAN GOMEZ
USE CASES SHOWING HSA ADVANTAGE
Programming Technique: Pointer-based Data Structures
  Use Case: Binary tree searches
  Description: GPU performs parallel searches in a CPU-created binary tree.
  HSA Advantage: CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.
Programming Technique: Platform Atomics
  Use Case: Work-Group Dynamic Task Management
  Description: GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads.
  Use Case: Binary tree updates
  Description: CPU and GPU operate simultaneously on the tree, both making modifications.
  HSA Advantage: CPU and GPU can synchronize using Platform Atomics. Higher performance through parallel operations, reducing the need for data copying and reconciling.
Programming Technique: Large Data Sets
  Use Case: Hierarchical data searches
  Description: Applications include object recognition, collision detection, global illumination, BVH.
  HSA Advantage: CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.
Programming Technique: CPU Callbacks
  Use Case: Middleware user-callbacks
  Description: GPU processes work items, some of which require a call to a CPU function to fetch new data.
  HSA Advantage: GPU can invoke CPU functions from within a GPU kernel. Simpler programming does not require “split kernels”. Higher performance through parallel operations.
© Copyright 2014 HSA Foundation. All Rights Reserved
UNIFIED COHERENT MEMORY FOR POINTER-BASED DATA STRUCTURES
UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES
Legacy
[Diagram labels: SYSTEM MEMORY with TREE (nodes with L and R child pointers) and RESULT BUFFER; GPU with KERNEL; GPU MEMORY with FLAT TREE and RESULT BUFFER]
UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES
HSA and full OpenCL 2.0
[Diagram labels: SYSTEM MEMORY with TREE (nodes with L and R child pointers) and RESULT BUFFER; GPU with KERNEL]
POINTER DATA STRUCTURES - CODE COMPLEXITY
[Figure: code listings compared for HSA and Legacy; the listings are not preserved in this transcript]
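The contrast between the two approaches can be sketched in plain C++, with ordinary host functions standing in for the GPU kernel. This is only an illustration under stated assumptions, not the code from the slide: the names `Node`, `FlatNode`, `flatten`, `search_flat`, and `search_ptr` are hypothetical. The legacy path first converts the pointer tree into an index-based array that remains valid after being copied to another memory, while the HSA-style path searches the CPU-built tree through its original pointers.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Pointer-based node as the CPU would naturally build it (hypothetical layout).
struct Node {
    int key;
    std::unique_ptr<Node> left;
    std::unique_ptr<Node> right;
};

// Index-based node for the legacy path: child pointers become array indices,
// so the structure stays valid after a copy into a different address space.
struct FlatNode {
    int key;
    std::int32_t left;   // index into the flat array, -1 if absent
    std::int32_t right;  // index into the flat array, -1 if absent
};

// Standard iterative binary-search-tree insertion; duplicates are ignored.
void insert(std::unique_ptr<Node>& root, int key) {
    std::unique_ptr<Node>* cur = &root;
    while (*cur) {
        if (key == (*cur)->key) return;
        cur = key < (*cur)->key ? &(*cur)->left : &(*cur)->right;
    }
    *cur = std::make_unique<Node>();
    (*cur)->key = key;
}

// Legacy preprocessing step: walk the pointer tree and emit a flat copy.
// Returns the index of the emitted node, or -1 for an empty subtree.
std::int32_t flatten(const Node* n, std::vector<FlatNode>& out) {
    if (!n) return -1;
    const std::int32_t self = static_cast<std::int32_t>(out.size());
    out.push_back({n->key, -1, -1});
    const std::int32_t l = flatten(n->left.get(), out);
    const std::int32_t r = flatten(n->right.get(), out);
    out[self].left = l;   // index access: push_back may have reallocated
    out[self].right = r;
    return self;
}

// Legacy "kernel": search the flattened copy by following indices.
bool search_flat(const std::vector<FlatNode>& tree, int key) {
    std::int32_t i = tree.empty() ? -1 : 0;
    while (i != -1) {
        if (key == tree[i].key) return true;
        i = key < tree[i].key ? tree[i].left : tree[i].right;
    }
    return false;
}

// HSA-style "kernel": search the original tree in place by following pointers.
bool search_ptr(const Node* n, int key) {
    while (n) {
        if (key == n->key) return true;
        n = key < n->key ? n->left.get() : n->right.get();
    }
    return false;
}
```

In this sketch the extra `FlatNode` type and the `flatten` pass exist only to serve the legacy path; the in-place search needs neither.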
POINTER DATA STRUCTURES - PERFORMANCE
[Chart: Binary Tree Search. X axis: tree size (# nodes): 1M, 5M, 10M, 25M. Y axis: search rate (nodes / ms), 0 to 60,000. Series: CPU (1 core), CPU (4 core), Legacy APU, HSA APU.]
Measured in AMD labs Jan 1-3 on system shown in back up slide
PLATFORM ATOMICS FOR DYNAMIC TASK MANAGEMENT
PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
Legacy*
[Diagram labels: SYSTEM MEMORY and GPU MEMORY, each with QUEUE 1, QUEUE 2, NUM. WRITTEN TASKS and NUM. CONSUMED TASKS (all counters start at 0); TASKS POOL; GPU with WORK-GROUP 1 to WORK-GROUP 4]
Animation steps, as far as they can be recovered from the frame labels and counter values (which copy each counter belongs to is inferred from label order):
1. Asynchronous transfer: NUM. WRITTEN TASKS on the SYSTEM MEMORY side goes from 0 to 4.
2. Asynchronous transfer: NUM. WRITTEN TASKS on the GPU MEMORY side goes from 0 to 4.
3. Atomic add, four times: NUM. CONSUMED TASKS on the GPU MEMORY side counts 1, 2, 3, 4.
4. Zero-copy: NUM. CONSUMED TASKS on the SYSTEM MEMORY side goes from 0 to 4.
*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010
PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT
HSA and full OpenCL 2.0
[Diagram labels: HOST COHERENT MEMORY with TASKS POOL (QUEUE 1, QUEUE 2), NUM. WRITTEN TASKS and NUM. CONSUMED TASKS (all counters start at 0); GPU MEMORY; GPU with WORK-GROUP 1 to WORK-GROUP 4]
Animation steps, as far as they can be recovered from the frame labels and counter values:
1. memcpy: NUM. WRITTEN TASKS goes from 0 to 4.
2. Platform atomic add, four times: NUM. CONSUMED TASKS counts 1, 2, 3, 4.
PLATFORM ATOMICS – CODE COMPLEXITY
HSA: Host enqueue function: 20 lines of code
Legacy: Host enqueue function: 102 lines of code
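The HSA-style flow from the task management frames can be approximated on a CPU alone. The sketch below assumes C++ `std::atomic` and `std::thread` as stand-ins for platform atomics and GPU work-groups; the names `TaskPool`, `host_enqueue`, `worker`, and `process_all` are hypothetical, and this is not the 20-line host enqueue function the slide refers to. The host writes tasks into a shared pool and publishes the count, then each worker claims the next task with an atomic add on the consumed counter.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// One pool visible to producer and consumers alike (no second copy to reconcile).
struct TaskPool {
    std::vector<int> tasks;                    // task payloads
    std::atomic<std::size_t> num_written{0};   // published by the host
    std::atomic<std::size_t> num_consumed{0};  // advanced by the workers
    explicit TaskPool(std::size_t capacity) : tasks(capacity) {}
};

// Host side: plain writes into the pool, then a release store publishes them.
// Assumes a single producer and a batch that fits the remaining capacity.
void host_enqueue(TaskPool& pool, const std::vector<int>& batch) {
    const std::size_t base = pool.num_written.load(std::memory_order_relaxed);
    for (std::size_t i = 0; i < batch.size(); ++i) pool.tasks[base + i] = batch[i];
    pool.num_written.store(base + batch.size(), std::memory_order_release);
}

// Worker side: an atomic add hands out a unique ticket; tickets past the
// published count mean the pool is drained. Note that num_consumed can
// overshoot num_written by up to one ticket per worker in this sketch.
void worker(TaskPool& pool, std::atomic<long>& sum) {
    for (;;) {
        const std::size_t i = pool.num_consumed.fetch_add(1, std::memory_order_relaxed);
        if (i >= pool.num_written.load(std::memory_order_acquire)) return;
        sum.fetch_add(pool.tasks[i], std::memory_order_relaxed);  // "process" the task
    }
}

// Enqueue one batch, let num_workers threads drain it, return the sum of payloads.
long process_all(const std::vector<int>& batch, unsigned num_workers) {
    TaskPool pool(batch.size());
    host_enqueue(pool, batch);
    std::atomic<long> sum{0};
    std::vector<std::thread> group;
    for (unsigned w = 0; w < num_workers; ++w)
        group.emplace_back(worker, std::ref(pool), std::ref(sum));
    for (auto& t : group) t.join();
    return sum.load();
}
```

Because every ticket is handed out exactly once, each task is processed by exactly one worker regardless of how the threads interleave.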
PLATFORM ATOMICS - PERFORMANCE
[Chart: execution time (ms), 0 to 700. Series: Legacy implementation (ms), HSA implementation (ms). X axis: tasks per insertion (64, 128, 256, 512), grouped by tasks pool size (4096, 16384).]
PLATFORM ATOMICS FOR CPU/GPU COLLABORATION
PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION
Legacy
[Diagram labels: TREE, INPUT BUFFER, GPU KERNEL]
Only GPU can work on input array. Concurrent processing not possible.
HSA and full OpenCL 2.0
[Diagram labels: TREE, INPUT BUFFER, GPU KERNEL, CPU 0, CPU 1]
Both CPU+GPU operating on same data structure concurrently.
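One way several agents can modify the same tree concurrently is to link new nodes with an atomic compare-and-swap. The sketch below is CPU-only and assumption-laden: C++ threads stand in for the CPU cores and the GPU kernel, the tree is insert-only, nodes are never freed, and the names `Node`, `insert`, `count`, and `build_concurrently` are hypothetical rather than taken from the presentation.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Tree node whose child links can be updated atomically by any agent.
struct Node {
    int key;
    std::atomic<Node*> left{nullptr};
    std::atomic<Node*> right{nullptr};
    explicit Node(int k) : key(k) {}
};

// Lock-free insert: walk down to an empty link and try to install the new
// node with compare-and-swap; if another agent got there first, keep
// descending from the node it installed. Returns false for duplicate keys.
bool insert(std::atomic<Node*>& root, int key) {
    Node* fresh = new Node(key);
    std::atomic<Node*>* link = &root;
    for (;;) {
        Node* cur = link->load(std::memory_order_acquire);
        if (cur == nullptr) {
            Node* expected = nullptr;
            if (link->compare_exchange_strong(expected, fresh,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire))
                return true;
            cur = expected;  // the winner's node now occupies this link
        }
        if (key == cur->key) {
            delete fresh;
            return false;
        }
        link = key < cur->key ? &cur->left : &cur->right;
    }
}

// Count nodes once all writers have finished.
std::size_t count(const Node* n) {
    if (!n) return 0;
    return 1 + count(n->left.load()) + count(n->right.load());
}

// Run one inserting thread per batch of keys, all on the same tree.
// Nodes are intentionally leaked to keep the sketch short.
std::size_t build_concurrently(std::atomic<Node*>& root,
                               const std::vector<std::vector<int>>& batches) {
    std::vector<std::thread> agents;
    for (const auto& batch : batches)
        agents.emplace_back([&root, &batch] {
            for (int k : batch) insert(root, k);
        });
    for (auto& a : agents) a.join();
    return count(root.load());
}
```

No copy of the tree is made per agent and no merge step is needed afterwards, which is the property the slide attributes to platform atomics.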
UNIFIED COHERENT MEMORY FOR LARGE DATA SETS
PROCESSING LARGE DATA SETS
The CPU creates a large data structure in System Memory. Computations using the data are offloaded to the GPU.
[Diagram labels: SYSTEM MEMORY holding a large 3D spatial data structure, Level 1 to Level 5; GPU]
Compare HSA and Legacy methods
LEGACY ACCESS USING GPU MEMORY
Legacy
GPU Memory is smaller
Have to copy and process in chunks
[Diagram labels: SYSTEM MEMORY, GPU, GPU MEMORY]
LEGACY ACCESS TO LARGE STRUCTURES
[Diagram labels: SYSTEM MEMORY holding a large 3D spatial data structure, Level 1 to Level 5; GPU; GPU MEMORY]
COPY ONE CHUNK AT A TIME
Copy of top 2 levels of hierarchy
PROCESS ONE CHUNK AT A TIME
FIRST KERNEL
COPY ONE CHUNK AT A TIME
Copy of bottom 3 levels of one branch of the hierarchy
PROCESS ONE CHUNK AT A TIME
SECOND KERNEL
COPY ONE CHUNK AT A TIME
Copy of bottom 3 levels of a different branch of the hierarchy
PROCESS ONE CHUNK AT A TIME
Nth KERNEL
LARGE SPATIAL DATA STRUCTURE
HSA and full OpenCL 2.0
[Diagram labels: SYSTEM MEMORY holding a large 3D spatial data structure, Level 1 to Level 5; GPU with KERNEL]
GPU CAN TRAVERSE ENTIRE HIERARCHY
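The difference between chunked and in-place access can be illustrated with a small CPU-only model. Everything here is an assumption made for illustration: a 5-level binary hierarchy over integer ranges replaces the 3D spatial structure, deep copies stand in for transfers to GPU memory, and the names `Node`, `build`, `copy_levels`, `descend`, `legacy_search`, `hsa_search`, and `Stats` are hypothetical. The legacy path copies the top 2 levels, runs a first pass to pick a branch, copies that branch, and runs a second pass; the HSA-style path descends the original structure once.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
    int lo = 0, hi = 0;                        // half-open range [lo, hi)
    std::vector<std::unique_ptr<Node>> kids;   // empty at the leaf level
};

// Build a binary hierarchy with the given number of levels over [lo, hi).
std::unique_ptr<Node> build(int lo, int hi, int levels) {
    auto n = std::make_unique<Node>();
    n->lo = lo;
    n->hi = hi;
    if (levels > 1) {
        const int mid = lo + (hi - lo) / 2;
        n->kids.push_back(build(lo, mid, levels - 1));
        n->kids.push_back(build(mid, hi, levels - 1));
    }
    return n;
}

// Model of a transfer: deep-copy `depth` levels starting at src, counting nodes.
std::unique_ptr<Node> copy_levels(const Node& src, int depth, std::size_t& copied) {
    auto n = std::make_unique<Node>();
    n->lo = src.lo;
    n->hi = src.hi;
    ++copied;
    if (depth > 1)
        for (const auto& k : src.kids)
            n->kids.push_back(copy_levels(*k, depth - 1, copied));
    return n;
}

// Model of a kernel: descend as far as the available nodes allow.
const Node* descend(const Node* n, int x) {
    while (!n->kids.empty()) {
        const Node* next = nullptr;
        for (const auto& k : n->kids)
            if (k->lo <= x && x < k->hi) next = k.get();
        if (!next) break;
        n = next;
    }
    return n;
}

struct Stats {
    int leaf_lo;               // lower bound of the leaf range containing x
    std::size_t nodes_copied;  // transfer volume in nodes
    int kernel_launches;
};

// Legacy: copy the top 2 levels, launch, copy the selected branch, launch again.
Stats legacy_search(const Node& root, int x) {
    Stats s{0, 0, 0};
    auto top = copy_levels(root, 2, s.nodes_copied);
    const Node* hit = descend(top.get(), x);
    ++s.kernel_launches;
    const Node* branch = &root;  // host maps the result back to the original
    for (const auto& k : root.kids)
        if (k->lo == hit->lo && k->hi == hit->hi) branch = k.get();
    auto chunk = copy_levels(*branch, 4, s.nodes_copied);
    s.leaf_lo = descend(chunk.get(), x)->lo;
    ++s.kernel_launches;
    return s;
}

// HSA style: one launch over the original structure, nothing copied.
Stats hsa_search(const Node& root, int x) {
    return Stats{descend(&root, x)->lo, 0, 1};
}
```

Both paths reach the same leaf; in this model the legacy path pays for two copies and two launches where the in-place path pays for one launch and no copy.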
CALLBACKS
CALLBACKS: COMMON SITUATION IN HC
Parallel processing algorithm with branches
• A seldom taken branch requires new data from the CPU
On legacy systems, the algorithm must be split:
• Process Kernel 1 on GPU
• Check for CPU callbacks and if any, process on CPU
• Process Kernel 2 on GPU
Example algorithm from Image Processing
• Perform a filter
• Calculate average LUMA in each tile
• Compare LUMA against threshold and call CPU callback if exceeded (rare)
• Perform special processing on tiles with callbacks
[Figure: Input Image, Output Image]
CALLBACKS
Legacy
[Timeline diagram: GPU THREADS 0, 1, 2, ..., N against TIME. 1st KERNEL: START to END, with early termination due to need for callback. CPU callbacks. 2nd KERNEL: START to END.]
Continuation kernel finishes up the kernel's work; this results in poor GPU utilization.
CALLBACKS
Input Image: 1 Tile = 1 OpenCL Work Item
GPU
• Work items compute average RGB value of all the pixels in a tile
• Work items also compute average Luma from the average RGB
• If average Luma > threshold, workgroup invokes CPU CALLBACK
• In parallel with callback, continue compute
CPU
• For selected tiles, update average Luma value (set to RED)
GPU
• Work items apply the Luma value to all pixels in the tile
Output Image
GPU to CPU callbacks use Shared Virtual Memory (SVM) Semaphores, implemented using Platform Atomic Compare-and-Swap.
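A semaphore built on compare-and-swap, as described for the callback mechanism, might look roughly like the following CPU-only sketch. The assumptions are significant: C++ threads stand in for GPU work items and the CPU service thread, a per-tile state word plays the role of the SVM semaphore, and the names `Tile`, `work_item`, `cpu_service`, `process`, and the state constants are hypothetical. A work item whose Luma exceeds the threshold raises a request with CAS and waits for completion; the service thread claims requests with CAS, updates the tile, and signals that it is done.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

enum : int { kIdle, kRequested, kServicing, kDone };

struct Tile {
    float avg_luma = 0.0f;
    std::atomic<int> state{kIdle};  // per-tile semaphore word in shared memory
};

// "GPU" work item: compute the tile value, request a callback if needed.
void work_item(Tile& t, float luma, float threshold) {
    t.avg_luma = luma;
    if (luma > threshold) {
        int expected = kIdle;
        // CAS raises the request and publishes avg_luma to the CPU side.
        t.state.compare_exchange_strong(expected, kRequested,
                                        std::memory_order_acq_rel);
        // Independent work could continue here before waiting.
        while (t.state.load(std::memory_order_acquire) != kDone)
            std::this_thread::yield();
    }
}

// "CPU" service loop: claim each pending request with CAS, run the callback
// (here: overwrite the tile value), then mark the tile as done.
void cpu_service(std::vector<Tile>& tiles, const std::atomic<bool>& stop,
                 float replacement) {
    while (!stop.load(std::memory_order_acquire)) {
        for (Tile& t : tiles) {
            int expected = kRequested;
            if (t.state.compare_exchange_strong(expected, kServicing,
                                                std::memory_order_acq_rel)) {
                t.avg_luma = replacement;
                t.state.store(kDone, std::memory_order_release);
            }
        }
        std::this_thread::yield();
    }
}

// Run one work item per tile next to a single service thread.
std::vector<float> process(const std::vector<float>& lumas, float threshold,
                           float replacement) {
    std::vector<Tile> tiles(lumas.size());
    std::atomic<bool> stop{false};
    std::thread cpu(cpu_service, std::ref(tiles), std::cref(stop), replacement);
    std::vector<std::thread> gpu;
    for (std::size_t i = 0; i < lumas.size(); ++i)
        gpu.emplace_back(work_item, std::ref(tiles[i]), lumas[i], threshold);
    for (auto& t : gpu) t.join();  // every request is serviced before this returns
    stop.store(true, std::memory_order_release);
    cpu.join();
    std::vector<float> out;
    for (const Tile& t : tiles) out.push_back(t.avg_luma);
    return out;
}
```

The work items never exit to let the CPU run; only the tiles that need the callback wait, which matches the single-kernel timeline rather than the split-kernel one.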
CALLBACKS
HSA and full OpenCL 2.0
[Timeline diagram: GPU THREADS 0, 1, 2, ..., N against TIME. One KERNEL: START to END, with CPU callbacks.]
A few kernel threads need CPU callback services, but they are serviced immediately.
SUMMARY - HSA ADVANTAGE
Programming Technique: Pointer-based Data Structures
  Use Case: Binary tree searches
  Description: GPU performs parallel searches in a CPU-created binary tree.
  HSA Advantage: CPU and GPU have access to the entire unified coherent memory. GPU can access existing data structures containing pointers.
Programming Technique: Platform Atomics
  Use Case: Work-Group Dynamic Task Management
  Description: GPU directly operates on a task pool managed by the CPU for algorithms with dynamic computation loads.
  Use Case: Binary tree updates
  Description: CPU and GPU operate simultaneously on the tree, both making modifications.
  HSA Advantage: CPU and GPU can synchronize using Platform Atomics. Higher performance through parallel operations, reducing the need for data copying and reconciling.
Programming Technique: Large Data Sets
  Use Case: Hierarchical data searches
  Description: Applications include object recognition, collision detection, global illumination, BVH.
  HSA Advantage: CPU and GPU have access to the entire unified coherent memory. GPU can operate on huge models in place, reducing copy and kernel launch overhead.
Programming Technique: CPU Callbacks
  Use Case: Middleware user-callbacks
  Description: GPU processes work items, some of which require a call to a CPU function to fetch new data.
  HSA Advantage: GPU can invoke CPU functions from within a GPU kernel. Simpler programming does not require “split kernels”. Higher performance through parallel operations.
QUESTIONS?