ISCA Final Presentation - Applications


HSA APPLICATIONS

WEN-MEI HWU, PROFESSOR, UNIVERSITY OF ILLINOIS
WITH J.P. BORDES AND JUAN GOMEZ

USE CASES SHOWING HSA ADVANTAGE

Programming technique: Pointer-based Data Structures
Use case: Binary tree searches
Description: The GPU performs parallel searches in a CPU-created binary tree.
HSA advantage: The CPU and GPU have access to the entire unified coherent memory; the GPU can access existing data structures containing pointers.

Programming technique: Platform Atomics
Use case: Work-group dynamic task management
Description: The GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads.
HSA advantage: The CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Programming technique: Platform Atomics
Use case: Binary tree updates
Description: The CPU and GPU operate simultaneously on the tree, both making modifications.
HSA advantage: The CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Programming technique: Large Data Sets
Use case: Hierarchical data searches
Description: Applications include object recognition, collision detection, global illumination, and BVH.
HSA advantage: The CPU and GPU have access to the entire unified coherent memory; the GPU can operate on huge models in place, reducing copy and kernel-launch overhead.

Programming technique: CPU Callbacks
Use case: Middleware user callbacks
Description: The GPU processes work items, some of which require a call to a CPU function to fetch new data.
HSA advantage: The GPU can invoke CPU functions from within a GPU kernel; simpler programming (no "split kernels") and higher performance through parallel operations.


UNIFIED COHERENT MEMORY FOR POINTER-BASED DATA STRUCTURES

UNIFIED COHERENT MEMORY: MORE EFFICIENT POINTER DATA STRUCTURES

Legacy

[Figure sequence, Legacy: the CPU builds the binary tree (nodes with L/R child pointers) and a result buffer in system memory. Before the GPU kernel can search the tree, it has to be flattened and copied, together with the result buffer, into GPU memory, and the results copied back to system memory afterwards.]

HSA and full OpenCL 2.0

[Figure sequence, HSA: the GPU kernel searches the tree and fills the result buffer directly in system memory, following the CPU-written L/R pointers in place; no flattening and no copies into GPU memory are needed.]

POINTER DATA STRUCTURES - CODE COMPLEXITY

[Side-by-side code listing: HSA vs. Legacy source (the code itself was not captured in this transcript).]

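As a rough illustration of why the HSA listing is so much shorter, here is a minimal sketch of an OpenCL 2.0 kernel that searches a CPU-built tree directly through fine-grain shared virtual memory. It is not the slide's listing; the Node layout, kernel name, and argument names are assumptions. The host would allocate the nodes with clSVMAlloc and pass root, queries, and found via clSetKernelArgSVMPointer, whereas the legacy path would first have to flatten the tree and copy it into a GPU buffer.

/* Illustrative OpenCL C 2.0 kernel (assumed layout, not the slide's code):
 * search a CPU-built binary tree in place through fine-grain SVM.           */
typedef struct Node {
    int          key;
    struct Node *left;     /* child pointers written by the CPU; the same    */
    struct Node *right;    /* addresses are valid on the GPU under SVM       */
} Node;

__kernel void tree_search(__global Node      *root,      /* SVM pointer      */
                          __global const int *queries,   /* keys to look up  */
                          __global int       *found)     /* result buffer    */
{
    size_t i   = get_global_id(0);
    int    key = queries[i];
    Node  *n   = root;                  /* __global converts to generic space */
    while (n) {                         /* follow the CPU-written pointers    */
        if (n->key == key) { found[i] = 1; return; }
        n = (key < n->key) ? n->left : n->right;
    }
    found[i] = 0;
}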

POINTER DATA STRUCTURES - PERFORMANCE

[Chart: Binary Tree Search. Search rate (nodes/ms), 0 to 60,000, vs. tree size (1M, 5M, 10M, 25M nodes) for CPU (1 core), CPU (4 core), Legacy APU, and HSA APU.]

Measured in AMD labs Jan 1-3 on the system shown in the backup slide.


PLATFORM ATOMICS FOR DYNAMIC TASK MANAGEMENT

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

Legacy*

[Figure sequence, Legacy: the CPU manages a task pool in system memory, while the per-work-group queues (Queue 1, Queue 2 for each of Work-Groups 1-4) and their NUM. WRITTEN TASKS / NUM. CONSUMED TASKS counters live in GPU memory. The animation steps through an asynchronous transfer that publishes four tasks (NUM. WRITTEN TASKS goes from 0 to 4), a series of atomic adds in GPU memory as a work-group consumes them (NUM. CONSUMED TASKS counts 1, 2, 3, 4), and a final zero-copy step that makes the counters visible to the host again.]

*Chen et al., Dynamic load balancing on single- and multi-GPU systems, IPDPS 2010

PLATFORM ATOMICS: ENABLING MORE EFFICIENT DYNAMIC TASK MANAGEMENT

HSA and full OpenCL 2.0

[Figure sequence, HSA: the task pool, the per-work-group queues, and the NUM. WRITTEN TASKS / NUM. CONSUMED TASKS counters all live in host coherent memory; GPU memory is unused. The CPU publishes four tasks with a plain memcpy into the pool (NUM. WRITTEN TASKS goes from 0 to 4), and the work-groups consume them with platform atomic adds on the shared counters (NUM. CONSUMED TASKS counts 1, 2, 3, 4), with no staging transfers between CPU and GPU.]
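To make the HSA-side consumer concrete, the sketch below has one work-item per work-group drain the shared pool using platform-scope atomics on the counters shown in the figure. The kernel name, argument names, and the placeholder "work" are assumptions, and a production queue would handle the drained-pool race more carefully than this does.

/* Illustrative OpenCL C 2.0 kernel (assumed names): work-groups consume tasks
 * from a host-managed pool in fine-grain SVM using platform-scope atomics.   */
__kernel void consume_tasks(volatile __global atomic_int *num_written,   /* bumped by the CPU */
                            volatile __global atomic_int *num_consumed,  /* bumped by the GPU */
                            __global const int *task_pool,
                            __global int *results)
{
    if (get_local_id(0) != 0) return;          /* one leader per work-group    */
    for (;;) {
        /* claim the next task slot; the CPU observes this update coherently   */
        int idx = atomic_fetch_add_explicit(num_consumed, 1,
                                            memory_order_acq_rel,
                                            memory_scope_all_svm_devices);
        int avail = atomic_load_explicit(num_written, memory_order_acquire,
                                         memory_scope_all_svm_devices);
        if (idx >= avail)                      /* pool drained for now; a real  */
            break;                             /* queue would return the slot   */
        results[idx] = task_pool[idx] * 2;     /* placeholder "work" per task   */
    }
}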

PLATFORM ATOMICS – CODE COMPLEXITY

HSA: host enqueue function, 20 lines of code.

Legacy: host enqueue function, 102 lines of code.

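The brevity of the HSA host enqueue comes from fine-grain SVM with atomics: publishing tasks is just a store plus an atomic increment, with no staging buffer, map/unmap, or explicit transfer. The sketch below shows that shape under assumed names (task_pool, num_written, num_consumed), with error checking omitted; the three pointers would be handed to the consumer kernel above with clSetKernelArgSVMPointer.

/* Illustrative host-side C: allocate the shared state in fine-grain SVM with
 * atomics, then enqueue tasks with memcpy plus one atomic add (assumed names). */
#include <CL/cl.h>
#include <stdatomic.h>
#include <string.h>

#define POOL_SIZE 4096

typedef struct {
    int        *task_pool;
    atomic_int *num_written;    /* same 32-bit layout as the kernel's atomic_int */
    atomic_int *num_consumed;
} SharedState;

static SharedState alloc_shared(cl_context ctx)
{
    cl_svm_mem_flags flags = CL_MEM_READ_WRITE |
                             CL_MEM_SVM_FINE_GRAIN_BUFFER |
                             CL_MEM_SVM_ATOMICS;
    SharedState s;
    s.task_pool    = clSVMAlloc(ctx, flags, POOL_SIZE * sizeof(int), 0);
    s.num_written  = clSVMAlloc(ctx, flags, sizeof(atomic_int), 0);
    s.num_consumed = clSVMAlloc(ctx, flags, sizeof(atomic_int), 0);
    atomic_init(s.num_written, 0);
    atomic_init(s.num_consumed, 0);
    return s;
}

/* Host enqueue: write the tasks, then publish them with one atomic add that a
 * running kernel observes through the coherent SVM mapping.                   */
static void enqueue_tasks(SharedState *s, const int *tasks, int count)
{
    int base = atomic_load_explicit(s->num_written, memory_order_relaxed);
    memcpy(s->task_pool + base, tasks, count * sizeof(int));
    atomic_fetch_add_explicit(s->num_written, count, memory_order_release);
}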

PLATFORM ATOMICS - PERFORMANCE

[Chart: execution time (ms), 0 to 700, for the Legacy implementation vs. the HSA implementation, over tasks-per-insertion values of 64, 128, 256, and 512 and task pool sizes of 4096 and 16384.]


PLATFORM ATOMICS FOR CPU/GPU COLLABORATION

PLATFORM ATOMICS: ENABLING EFFICIENT GPU/CPU COLLABORATION

Legacy

[Figure, Legacy: only the GPU kernel can work on the input array (the tree and input buffer); concurrent processing with the CPU is not possible.]

HSA and full OpenCL 2.0

[Figure, HSA: the GPU kernel and CPU cores 0 and 1 operate on the same data structure (the tree and input buffer) concurrently.]
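The key to safe concurrent modification is that both sides agree on ownership of a node through the same platform-scope compare-and-swap. The fragment below is a minimal illustration under assumed names (a per-node node_owner claim word, hypothetical gpu_update and cpu_claim helpers); a real tree update would do more than bump a value, but the claim protocol is the part HSA makes possible.

/* ---- Device side (OpenCL C 2.0, assumed names) ---- */
__kernel void gpu_update(volatile __global atomic_int *node_owner,  /* 0 = unclaimed  */
                         __global int *node_value,
                         int gpu_tag)                               /* any nonzero id */
{
    size_t n = get_global_id(0);
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(&node_owner[n], &expected, gpu_tag,
                                                memory_order_acq_rel,
                                                memory_order_relaxed,
                                                memory_scope_all_svm_devices)) {
        node_value[n] += 1;        /* placeholder modification of the claimed node */
    }
}

/* ---- Host side (C11 atomics on the same fine-grain SVM allocation) ---- */
#include <stdatomic.h>
#include <stdbool.h>

bool cpu_claim(atomic_int *node_owner, int node, int cpu_tag)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(&node_owner[node], &expected, cpu_tag,
                                                   memory_order_acq_rel,
                                                   memory_order_relaxed);
}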

UNIFIED COHERENT MEMORY FOR LARGE DATA SETS

PROCESSING LARGE DATA SETS

The CPU creates a large data structure in system memory; computations using the data are offloaded to the GPU.

[Figure: a large 3D spatial data structure organized in Levels 1-5 in system memory, with the GPU alongside. The following slides compare the HSA and Legacy methods.]

LEGACY ACCESS USING GPU MEMORY

Legacy: GPU memory is smaller than the data set, so the structure has to be copied and processed in chunks.

[Figure (also titled "LEGACY ACCESS TO LARGE STRUCTURES"): the large 3D spatial data structure (Levels 1-5) in system memory next to the much smaller GPU memory.]

COPY AND PROCESS ONE CHUNK AT A TIME

Legacy

[Figure sequence, Legacy: the top two levels of the hierarchy are copied into GPU memory and a first kernel processes them; when the traversal reaches data that is not resident (shown as "?"), the kernel has to stop. The bottom three levels of one branch are then copied in and a second kernel processes them, followed by the bottom three levels of a different branch, and so on up to an Nth kernel. Every chunk costs a host-to-GPU copy plus a kernel launch.]

LARGE SPATIAL DATA STRUCTURE: GPU CAN TRAVERSE THE ENTIRE HIERARCHY

HSA and full OpenCL 2.0

[Figure sequence, HSA: the whole large 3D spatial data structure (Levels 1-5) stays in system memory, and a single kernel traverses the hierarchy level by level in place; no chunked copies and no extra kernel launches are needed.]
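To show what "traverse in place" looks like on the kernel side, here is a hedged sketch that walks a CPU-built, five-level hierarchy directly in system memory with a small private stack. HNode, its fields, and the containment test are assumptions standing in for whatever node format a real BVH or spatial grid would use.

/* Illustrative OpenCL C 2.0 kernel (assumed node format): traverse the whole
 * hierarchy in system memory through SVM; no chunked copies are needed.      */
typedef struct HNode {
    float min[3], max[3];     /* axis-aligned bounds of this cell              */
    int   first_child;        /* index of the first child, or -1 for a leaf    */
    int   child_count;
} HNode;

int contains(__global const HNode *n, __global const float *p)
{
    return p[0] >= n->min[0] && p[0] <= n->max[0] &&
           p[1] >= n->min[1] && p[1] <= n->max[1] &&
           p[2] >= n->min[2] && p[2] <= n->max[2];
}

__kernel void traverse(__global const HNode *nodes,    /* entire Level 1-5 tree */
                       __global const float *queries,  /* x,y,z per query point */
                       __global int *hit_leaf)
{
    size_t q = get_global_id(0);
    __global const float *p = &queries[3 * q];
    int stack[64];                       /* private traversal stack             */
    int sp = 0;
    stack[sp++] = 0;                     /* node 0 is the Level-1 root          */
    hit_leaf[q] = -1;
    while (sp > 0) {
        int ni = stack[--sp];
        __global const HNode *n = &nodes[ni];
        if (!contains(n, p)) continue;
        if (n->first_child < 0) { hit_leaf[q] = ni; break; }     /* leaf hit    */
        for (int c = 0; c < n->child_count && sp < 64; ++c)
            stack[sp++] = n->first_child + c;                    /* descend     */
    }
}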

CALLBACKS

A common situation in heterogeneous computing (HC): a parallel processing algorithm has branches, and a seldom-taken branch requires new data from the CPU.

On legacy systems, the algorithm must be split:
• Process Kernel 1 on the GPU
• Check for CPU callbacks and, if any, process them on the CPU
• Process Kernel 2 on the GPU

Example algorithm from image processing:
• Perform a filter
• Calculate the average Luma in each tile
• Compare the Luma against a threshold and call a CPU callback if it is exceeded (rare)
• Perform special processing on tiles with callbacks

[Figure: input image and output image.]


CALLBACKS

Legacy

[Figure, Legacy: a timeline of GPU threads 0..N across a 1st kernel and a 2nd kernel. Threads that need a CPU callback terminate early; the CPU callbacks are serviced between the kernels, and a continuation (2nd) kernel finishes the remaining work. The split results in poor GPU utilization.]

CALLBACKS

[Figure: input image and output image; 1 tile = 1 OpenCL work item.]

GPU:
• Work items compute the average RGB value of all the pixels in a tile
• Work items also compute the average Luma from the average RGB
• If the average Luma exceeds the threshold, the work-group invokes a CPU callback
• In parallel with the callback, compute continues

CPU:
• For selected tiles, update the average Luma value (set to RED)

GPU:
• Work items apply the Luma value to all pixels in the tile

GPU-to-CPU callbacks use Shared Virtual Memory (SVM) semaphores, implemented using platform atomic compare-and-swap.

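One way to picture the SVM semaphore is as a small mailbox in fine-grain SVM that the kernel raises with a platform-scope compare-and-swap and the CPU services while the kernel keeps running. The sketch below is a single-mailbox assumption, not the presentation's code; Callback, its fields, the state values, and fetch_new_luma_for_tile are all hypothetical, and a real implementation would typically use one mailbox per work-group or a pool of them.

/* ---- Shared mailbox states (mirrored on both sides) ---- */
enum { CB_FREE = 0, CB_CLAIMED = 1, CB_POSTED = 2, CB_SERVICED = 3 };

/* ---- Device side (OpenCL C 2.0): called by one work-item on behalf of its
 *      work-group when the tile's average Luma exceeds the threshold ---- */
typedef struct Callback {
    volatile atomic_int state;
    int   tile;        /* request argument written by the GPU */
    float luma;        /* reply written by the CPU            */
} Callback;

float cpu_callback(volatile __global Callback *cb, int tile)
{
    int expected = CB_FREE;
    /* grab the mailbox with a platform atomic compare-and-swap */
    while (!atomic_compare_exchange_strong_explicit(&cb->state, &expected, CB_CLAIMED,
                                                    memory_order_acq_rel,
                                                    memory_order_relaxed,
                                                    memory_scope_all_svm_devices))
        expected = CB_FREE;
    cb->tile = tile;                                   /* fill the request ...  */
    atomic_store_explicit(&cb->state, CB_POSTED,       /* ... then publish it   */
                          memory_order_release, memory_scope_all_svm_devices);
    while (atomic_load_explicit(&cb->state, memory_order_acquire,
                                memory_scope_all_svm_devices) != CB_SERVICED)
        ;                                              /* wait for the CPU      */
    float result = cb->luma;
    atomic_store_explicit(&cb->state, CB_FREE,         /* release the mailbox   */
                          memory_order_release, memory_scope_all_svm_devices);
    return result;
}

/* ---- Host side (C11 atomics on the same SVM allocation; atomic_int here is
 *      the C11 type and must match the kernel's 32-bit layout) ---- */
#include <stdatomic.h>

typedef struct { atomic_int state; int tile; float luma; } HostCallback;

float fetch_new_luma_for_tile(int tile);               /* hypothetical helper   */

void service_callbacks(HostCallback *cb, volatile int *kernel_done)
{
    while (!*kernel_done) {
        if (atomic_load_explicit(&cb->state, memory_order_acquire) == CB_POSTED) {
            cb->luma = fetch_new_luma_for_tile(cb->tile);
            atomic_store_explicit(&cb->state, CB_SERVICED, memory_order_release);
        }
    }
}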

CALLBACKS

HSA and full OpenCL 2.0

[Figure, HSA: a timeline of GPU threads 0..N for a single kernel. The few threads that need CPU callback services are serviced immediately, in parallel with the rest of the kernel; no split into separate kernels is required.]

SUMMARY - HSA ADVANTAGE

Programming technique: Pointer-based Data Structures
Use case: Binary tree searches
Description: The GPU performs parallel searches in a CPU-created binary tree.
HSA advantage: The CPU and GPU have access to the entire unified coherent memory; the GPU can access existing data structures containing pointers.

Programming technique: Platform Atomics
Use case: Work-group dynamic task management
Description: The GPU directly operates on a task pool managed by the CPU, for algorithms with dynamic computation loads.
HSA advantage: The CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Programming technique: Platform Atomics
Use case: Binary tree updates
Description: The CPU and GPU operate simultaneously on the tree, both making modifications.
HSA advantage: The CPU and GPU can synchronize using platform atomics; higher performance through parallel operations, reducing the need for data copying and reconciling.

Programming technique: Large Data Sets
Use case: Hierarchical data searches
Description: Applications include object recognition, collision detection, global illumination, and BVH.
HSA advantage: The CPU and GPU have access to the entire unified coherent memory; the GPU can operate on huge models in place, reducing copy and kernel-launch overhead.

Programming technique: CPU Callbacks
Use case: Middleware user callbacks
Description: The GPU processes work items, some of which require a call to a CPU function to fetch new data.
HSA advantage: The GPU can invoke CPU functions from within a GPU kernel; simpler programming (no "split kernels") and higher performance through parallel operations.

© Copyright 2014 HSA Foundation. All Rights Reserved

QUESTIONS?
