8-Hegde

TRANSCRIPT

  • Slide 1/74

    HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) AND THE SOFTWARE ECOSYSTEM

    MANJU HEGDE, CORPORATE VP, PRODUCTS GROUP, AMD

  • Slide 2/74

    OUTLINE

    Motivation

    HSA architecture v1

    Software stack

    Workload analysis

    Software Ecosystem

  • Slide 3/74

    PARADIGM SHIFTS?

    [Figure: three performance-over-time curves, one per era, each with a "we are here" marker.]

    Single-Core Era: single-thread performance vs. time. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity. Programming: Assembly, C/C++, Java.

    Multi-Core Era: throughput performance vs. time (# of processors). Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW, scalability. Programming: pthreads, OpenMP / TBB.

    Heterogeneous Systems Era: modern application performance vs. time. Enabled by: abundant data parallelism, power-efficient GPUs. Programming: Shader, CUDA.

  • Slide 4/74

    WITNESS DISCRETE CPU AND DISCRETE GPU COMPUTE

    [Diagram: CPU1...CPUN sharing coherent CPU memory, connected over PCIe to a discrete GPU with its own GPU memory.]

    Compute acceleration works well for large offload

    Slow data transfer between CPU and GPU

    Expert programming necessary to take advantage of GPU compute

  • Slide 5/74

    FIRST AND SECOND GENERATION APUs

    [Diagram: CPU1...CPUN in a coherent CPU partition, GPU in its own partition, joined by a high-speed internal bus.]

    First integration of CPU and GPU on-chip

    Common physical memory, but not to the programmer

    Faster transfer of data between CPU and GPU, to enable more code to run on the GPU

  • Slide 6/74

    COMMON PHYSICAL MEMORY, BUT NOT TO THE PROGRAMMER

    [Diagram: CPU and GPU on one chip, each with its own view of memory.]

    CPU explicitly copies data to GPU memory

    GPU completes computation

    CPU explicitly copies result back to CPU memory
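    In today's model, the host code must stage every transfer explicitly. A minimal sketch of the three steps above in OpenCL 1.x host terms (the function and names are illustrative, not from the slides; a real program also needs the usual context/queue/kernel setup, assumed here):

    #include <CL/cl.h>

    // Run a kernel over n floats with explicit staging copies.
    // Assumes ctx, queue, and kernel were created by standard OpenCL boilerplate.
    void run_with_copies(cl_context ctx, cl_command_queue queue,
                         cl_kernel kernel, const float *in, float *out, size_t n)
    {
        // 1. CPU explicitly copies input data into GPU memory.
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
        clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, n * sizeof(float), in, 0, NULL, NULL);

        // 2. GPU completes the computation.
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        // 3. CPU explicitly copies the result back to CPU memory.
        clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n * sizeof(float), out, 0, NULL, NULL);

        clReleaseMemObject(d_in);
        clReleaseMemObject(d_out);
    }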

  • Slide 7/74

    WHAT ARE THE PROBLEMS WE ARE TRYING TO SOLVE?

    SoCs are quickly falling into the same many-CPU-core bottlenecks as the PC.

    To move beyond this, we need to look at the right processor(s) and/or execution device for a given workload at reasonable power,

    while addressing the core issues of:

    Easier to program

    Easier to optimize

    Easier to load balance

    High performance

    Lower power

  • Slide 8/74

    COMBINE INTO A UNIFIED PROGRAMMING MODEL

    [Diagram: CPU, GPU, audio processor, video hardware, DSP, image signal processor, fixed-function accelerator, and encode/decode engines, all sharing memory, coherency, and user mode queues.]

  • Slide 9/74

    WHO IS DOING THIS? HSA FOUNDATION MEMBERSHIP, JUNE 2013

    [Slide: member logos grouped by tier.]

    Founders

    Promoters

    Supporters

    Contributors

    Academic

    Associates

    http://www.multicorewareinc.com/index.php
    http://www.apical.co.uk/
  • Slide 10/74

    HSA FOUNDATION'S FOCUS

    Identify design features to make accelerators first class processors

    Attract mainstream programmers

    Create a platform architecture for ALL accelerators

  • Slide 11/74

    HSA ARCHITECTURE V1

    [Diagram: CPU, GPU, audio processor, video hardware, DSP, fixed-function accelerator, and encode/decode engines over shared memory, coherency, and user mode queues.]

    GPU compute C++ support

    User Mode Scheduling

    Fully coherent memory between CPU & GPU

    GPU uses pageable system memory via CPU pointers

    GPU graphics pre-emption

    GPU compute context switch

  • Slide 12/74

    HSA KEY FEATURES

    [Diagram: CPU and GPU, each with caches, sharing virtual memory through hardware coherency, backed by physical memory.]

    Entire memory space: Both CPU and GPU can access and allocate any location in the system's virtual memory space.

    Coherent memory: Ensures CPU and GPU caches both see an up-to-date view of data.

    Pageable memory: The GPU can seamlessly access virtual memory addresses that are not present in physical memory.

  • Slide 13/74

    CPU / GPU UNIFORM MEMORY

    [Diagram: CPU and GPU sharing one uniform memory.]

    WITH HSA:

    CPU simply passes a pointer to GPU

    GPU completes computation

    CPU can read the result directly; no copying needed!
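    In API terms, the closest widely available expression of this is shared virtual memory in the OpenCL 2.x runtime that the later software-stack slides name. A hedged sketch using coarse-grained SVM (ctx, queue, and kernel assumed to exist; with fine-grained SVM the map/unmap calls disappear as well):

    #include <CL/cl.h>

    void run_with_svm(cl_context ctx, cl_command_queue queue,
                      cl_kernel kernel, size_t n)
    {
        // One allocation visible to both CPU and GPU: no staging buffers.
        float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0);

        // CPU fills the array in place (map/unmap is the coarse-grained handshake).
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, NULL, NULL);
        for (size_t i = 0; i < n; ++i) data[i] = (float)i;
        clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

        // CPU simply passes the pointer to the GPU...
        clSetKernelArgSVMPointer(kernel, 0, data);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        // ...and reads the result in place: no copy back.
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ, data, n * sizeof(float), 0, NULL, NULL);
        /* consume data[0..n-1] */
        clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

        clSVMFree(ctx, data);
    }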

HSA SOFTWARE STACK

  • Slide 14/74

    HSA ARCHITECTURE V1: HSA SOFTWARE STACK

    [Diagram: the HSA v1 platform (CPU, GPU, audio processor, video hardware, DSP, fixed-function accelerator, encode/decode engines over shared memory, coherency, and user mode queues) beneath the software stack.]

    Hardware features: GPU compute C++ support; User Mode Scheduling; fully coherent memory between CPU & GPU; GPU uses pageable system memory via CPU pointers; GPU graphics pre-emption; GPU compute context switch.

    Software stack, top to bottom: Apps; Task Queuing Libraries; HSA Domain Libraries, OpenCL 2.x Runtime; HSA Runtime; HSA JIT; HSA Kernel Mode Driver.

  • Slide 15/74

    HETEROGENEOUS COMPUTE DISPATCH

    How compute dispatch operates today in the driver model

    How compute dispatch improves under HSA

  • Slide 16/74

    TODAY'S COMMAND AND DISPATCH FLOW

    [Diagram: Application A's command flow passes through Direct3D and the user mode driver into a command buffer; the kernel mode driver manages a soft queue and a DMA buffer, whose data finally lands on the single hardware queue.]

  • Slide 17/74

    TODAY'S COMMAND AND DISPATCH FLOW

    [Diagram: the same Direct3D / user mode driver / command buffer / soft queue / DMA buffer path, repeated per application: A, B, and C each have their own stack, but all feed the single hardware queue.]

  • Slide 18/74

    TODAY'S COMMAND AND DISPATCH FLOW

    [Diagram: as before, with the hardware queue shown interleaving work from A, B, and C (A, C, B, A, B); every application contends for one queue through the kernel mode driver.]

  • Slide 19/74

    TODAY'S COMMAND AND DISPATCH FLOW

    [Diagram: the per-application driver stacks and the shared, interleaved hardware queue, shown one last time before the HSA flow.]

  • Slide 20/74

    HSA COMMAND AND DISPATCH FLOW

    [Diagram: Applications A, B, and C each own a hardware queue on the GPU, with an optional dispatch buffer; packets flow directly from application to hardware.]

    No APIs

    No soft queues

    No user mode drivers

    No kernel mode transitions

    No overhead

    Application codes directly to the hardware

    User mode queuing

    Hardware scheduling

    Low dispatch times
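    A user mode queue can be pictured as a ring buffer in shared memory with atomic read/write indices: the application publishes a packet and bumps the write index; the hardware consumes packets on its own, with no driver call on the dispatch path. A simplified, single-producer C++ illustration (this is not the real HSA AQL packet format, which carries more fields):

    #include <atomic>
    #include <cstdint>

    // Illustrative dispatch packet; real AQL packets are richer.
    struct DispatchPacket {
        uint64_t kernel_object;    // handle of the finalized kernel code
        uint64_t kernarg_address;  // kernel arguments, already in shared memory
        uint32_t grid_size;        // number of work-items to launch
    };

    struct UserModeQueue {
        static const uint32_t kSize = 256;     // power of two
        DispatchPacket packets[kSize];
        std::atomic<uint64_t> write_index{0};  // bumped by the application
        std::atomic<uint64_t> read_index{0};   // bumped by the packet processor

        // Application-side enqueue: no API call, no kernel mode transition.
        bool enqueue(const DispatchPacket &p) {
            uint64_t w = write_index.load(std::memory_order_relaxed);
            if (w - read_index.load(std::memory_order_acquire) >= kSize)
                return false;                  // queue full, caller retries
            packets[w % kSize] = p;
            // Publish the packet before the new index becomes visible.
            write_index.store(w + 1, std::memory_order_release);
            return true;
        }
    };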

  • Slide 21/74

    COMMAND AND DISPATCH: CPU AND GPU

    [Diagram: CPU1, CPU2, and the GPU all produce and consume work through a shared application/runtime layer.]

  • Slide 22/74

    MAKING GPUS AND APUS EASIER TO PROGRAM: TASK QUEUING RUNTIMES

    Popular pattern for task and data parallel programming on SMP systems today

    Characterized by:

    A work queue per core

    A runtime library that divides large loops into tasks and distributes them to queues

    A work stealing runtime that keeps the system balanced

    HSA is designed to extend this pattern to run on heterogeneous systems (see the sketch below).
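    A minimal sketch of that pattern: one deque per worker, owners push and pop at the back, and an idle worker steals the oldest task from another queue's front. (Mutexes keep the sketch short; production runtimes use lock-free deques.)

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <vector>

    using Task = std::function<void()>;

    struct WorkerQueue {
        std::deque<Task> tasks;
        std::mutex lock;
    };

    class WorkStealingPool {
        std::vector<WorkerQueue> queues_;
    public:
        explicit WorkStealingPool(std::size_t workers) : queues_(workers) {}

        // Owner pushes new tasks onto the back of its own queue.
        void push(std::size_t self, Task t) {
            std::lock_guard<std::mutex> g(queues_[self].lock);
            queues_[self].tasks.push_back(std::move(t));
        }

        // Run one task: own queue first (LIFO), then steal from a victim (FIFO).
        bool try_run(std::size_t self) {
            Task t;
            {
                std::lock_guard<std::mutex> g(queues_[self].lock);
                if (!queues_[self].tasks.empty()) {
                    t = std::move(queues_[self].tasks.back());
                    queues_[self].tasks.pop_back();
                }
            }
            for (std::size_t v = 0; !t && v < queues_.size(); ++v) {
                if (v == self) continue;
                std::lock_guard<std::mutex> g(queues_[v].lock);
                if (!queues_[v].tasks.empty()) {
                    t = std::move(queues_[v].tasks.front());  // steal the oldest
                    queues_[v].tasks.pop_front();
                }
            }
            if (!t) return false;  // nothing anywhere: system is balanced or idle
            t();
            return true;
        }
    };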

  • Slide 23/74

    TASK QUEUING RUNTIME ON CPUs

    [Diagram: a work stealing runtime over four x86 CPUs, each with a CPU worker and its queue (Q). CPU threads and memory only; no GPU threads.]

  • Slide 24/74

    TASK QUEUING RUNTIME ON THE HSA PLATFORM

    [Diagram: the same work stealing runtime over four x86 CPU workers and their queues, now extended with GPU SIMD units fed by a GPU manager and fetch logic, so CPU threads and GPU threads share the queues and memory.]

  • Slide 25/74

    [Two software stacks compared, both running on the hardware (APUs, CPUs, GPUs):]

    Driver Stack: Apps; Domain Libraries; OpenCL 1.x, DX Runtimes, User Mode Drivers; Graphics Kernel Mode Driver.

    HSA Software Stack: Apps; Task Queuing Libraries; HSA Domain Libraries, OpenCL 2.x Runtime; HSA Runtime; HSA JIT; HSA Kernel Mode Driver.

    [Legend: user mode components; kernel mode components; components contributed by third parties.]

  • Slide 26/74

    HSA INTERMEDIATE LANGUAGE - HSAIL

    HSAIL is the intermediate language for parallel compute in HSA. It is generated by a high-level compiler (LLVM, gcc, Java VM, etc.) and compiled down to GPU ISA or another parallel processor ISA by an IHV finalizer. The finalizer may execute at run time, install time, or build time, depending on platform type.

    HSAIL is a low-level instruction set designed for parallel compute in a shared virtual memory environment. HSAIL is SIMT in form and does not dictate hardware microarchitecture.

    HSAIL is designed for fast compile time, moving most optimizations to the HL compiler.

    HSAIL is at the same level as PTX: an intermediate assembly or virtual machine target.

    It is represented as bit-code in a BRIG file format, with support for late binding of libraries.

  • Slide 27/74

    HSA BRINGS A MODERN OPEN COMPILATION FOUNDATION

    This brings about a fully competitive, rich, complete compilation stack architecture, enabling the creation of a broader set of GPU computing tools, languages, and libraries.

    HSAIL supports LLVM and other compilers (GCC, Java VM).

    The two stacks, side by side:

    CUDA:   EDG or CLANG -> NVVM IR -> LLVM -> PTX -> hardware
    OpenCL: EDG or CLANG -> SPIR -> LLVM -> HSAIL -> hardware

  • Slide 28/74

    OPENCL AND HSA

    HSA is an optimized platform architecture for OpenCL:

    Not an alternative to OpenCL

    Focused on the hardware platform more than the API

    Ready to support many more languages than C/C++

    OpenCL on HSA will benefit from:

    Avoidance of wasteful copies

    Low latency dispatch

    Improved memory model

    Pointers shared between CPU and GPU

    HSA also exposes a lower-level programming interface.

    Optimized libraries may choose the lower-level interface.

  • Slide 29/74

  • Slide 30/74

    AMD'S OPEN SOURCE COMMITMENT TO HSA

    We will open source our Linux execution and compilation stack:

    Jump start the ecosystem

    Allow a single shared implementation where appropriate

    Enable university research in all areas

    Component Name        | AMD-Specific | Rationale
    HSA Bolt Library      | No           | Enable understanding and debug
    HSAIL Code Generator  | No           | Enable research
    LLVM Contributions    | No           | Industry and academic collaboration
    HSA Assembler         | No           | Enable understanding and debug
    HSA Runtime           | No           | Standardize on a single runtime
    HSA Finalizer         | Yes          | Enable research and debug
    HSA Kernel Driver     | Yes          | For inclusion in Linux distros

  • Slide 31/74

    WORKLOAD ANALYSIS

  • Slide 32/74

    HAAR FACE DETECTION: A CORNERSTONE TECHNOLOGY FOR COMPUTER VISION

  • Slide 33/74

    LOOKING FOR FACES IN ALL THE RIGHT PLACES

    Quick HD calculations:
    Search square = 21 x 21 pixels
    Pixels = 1920 x 1080 = 2,073,600
    Search squares = 1900 x 1060 = ~2 million (one per valid 21 x 21 window position)

  • Slide 34/74

    LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME

    More HD calculations:
    70% scaling in H and V
    Total pixels = 4.07 million
    Search squares = 3.8 million

  • Slide 35/74

    HAAR CASCADE STAGES

    [Diagram: Stage N evaluates features k, l, m; if it passes, Stage N+1 evaluates features p, q, r; and so on.]

  • Slide 36/74

    22 CASCADE STAGES, EARLY OUT BETWEEN EACH

    [Diagram: Stage 1 -> Stage 2 -> ... -> Stage 21 -> Stage 22; any stage can reject the search square early with NO FACE.]
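    Per search square, the cascade is a chain of cheap tests with an early out. A sketch of that control flow (evalFeature and the stage thresholds stand in for the real trained classifier data):

    // Assumed helper: evaluates one trained Haar feature on the integral image.
    float evalFeature(const float *integralImage, int x, int y, int featureIndex);

    struct Stage {
        int firstFeature, numFeatures;
        float threshold;
    };

    bool isFace(const float *integralImage, int x, int y,
                const Stage *stages, int numStages)   // 22 stages on this slide
    {
        for (int s = 0; s < numStages; ++s) {
            float sum = 0.0f;
            for (int f = 0; f < stages[s].numFeatures; ++f)
                sum += evalFeature(integralImage, x, y, stages[s].firstFeature + f);
            if (sum < stages[s].threshold)
                return false;                         // early out: NO FACE
        }
        return true;                                  // survived all stages
    }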

    Final HD calculations:
    Search squares = 3.8 million
    Average features per square = 124
    Calculations per feature = 100
    Calculations per frame = 47 GCalcs

    Calculation rate:
    30 frames/sec = 1.4 TCalcs/second
    60 frames/sec = 2.8 TCalcs/second

    ...and this only finds front-facing faces.
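    The arithmetic behind those numbers, spelled out with the slide's inputs:

    #include <cstdio>

    int main() {
        const double searchSquares   = 3.8e6;  // across all pyramid scales
        const double featuresPerSq   = 124;    // average, thanks to early out
        const double calcsPerFeature = 100;

        const double calcsPerFrame = searchSquares * featuresPerSq * calcsPerFeature;
        std::printf("per frame: %.1f GCalcs\n", calcsPerFrame / 1e9);          // ~47.1
        std::printf("30 fps:    %.1f TCalcs/s\n", 30 * calcsPerFrame / 1e12);  // ~1.4
        std::printf("60 fps:    %.1f TCalcs/s\n", 60 * calcsPerFrame / 1e12);  // ~2.8
    }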

  • Slide 37/74

    CASCADE DEPTH ANALYSIS

    [Chart: cascade depth reached across the frame, bucketed 0-5, 5-10, 10-15, 15-20, and 20-25 stages.]

  • Slide 38/74

    PROCESSING TIME/STAGE

    [Chart: time (ms) per cascade stage (stages 1-8, and 9-22 grouped), CPU vs. GPU, on a Trinity A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz).]

    AMD A10-4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)

  • Slide 39/74

    PERFORMANCE CPU-VS-GPU

    [Chart: images/sec vs. number of cascade stages run on the GPU (0-8, and all 22), for CPU-only, GPU-only, and HSA configurations, on a Trinity A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz).]

    AMD A10-4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)

  • Slide 40/74

    HAAR SOLUTION: RUN DIFFERENT CASCADES ON GPU AND CPU

    By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload:

    +2.5x increased performance

    2.5x decreased energy per frame

  • Slide 41/74

    GAMEPLAY RIGID BODY PHYSICS

  • Slide 42/74

    RIGID BODY PHYSICS SIMULATION

    Rigid-body physics simulation is a way to animate and interact with objects, widely used in games and movie production. It is used to drive game play and for visual effects (eye candy).

    Physics simulation is used in much of today's software:

    Middleware physics engines such as Bullet, Havok, PhysX

    Games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3

    3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D, Lightwave

    Industrial applications such as Siemens NX8 Mechatronics Concept Design

    Medical applications such as surgery trainers

    Robotics simulation

    But GPU-accelerated rigid-body physics is not used in game play, only in effects.

  • Slide 43/74

    RIGID BODY PHYSICS - ALGORITHM

    Broad-phase collision detection: find potentially interacting object pairs using bounding shape approximations (see the sketch below).

    Mid-phase collision detection: perform full overlap testing between potentially interacting pairs.

    Narrow-phase collision detection, computing contact points: compute exact contact information for the various shape types.

    Setup and solve constraints: compute constraint forces for natural motion and stable stacking.
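    Concretely, the broad phase reduces to cheap axis-aligned bounding-box overlap tests. A brute-force sketch (real engines replace the O(n^2) loop with sweep-and-prune or spatial hashing, as the next slides discuss):

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct AABB { float min[3], max[3]; };

    static bool overlaps(const AABB &a, const AABB &b) {
        for (int axis = 0; axis < 3; ++axis)
            if (a.max[axis] < b.min[axis] || b.max[axis] < a.min[axis])
                return false;                 // separated along this axis
        return true;
    }

    // Emit candidate pairs for the mid and narrow phases to examine.
    std::vector<std::pair<int, int>> broadPhase(const std::vector<AABB> &boxes) {
        std::vector<std::pair<int, int>> pairs;
        for (std::size_t i = 0; i < boxes.size(); ++i)
            for (std::size_t j = i + 1; j < boxes.size(); ++j)
                if (overlaps(boxes[i], boxes[j]))
                    pairs.emplace_back((int)i, (int)j);
        return pairs;
    }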

  • Slide 44/74

    RIGID BODY PHYSICS - CHALLENGES & SOLUTIONS

  • Slide 45/74

    Implementation Challenges:

    Simulation is a pipeline of many different algorithms, some of which are more suitable for the CPU while others are more suitable for the GPU.

    Many CPU optimizations (e.g. early outs) aren't efficient on GPUs, requiring the use of more brute-force but GPU-friendly algorithms. The diversity of intersection algorithms causes load balancing challenges.

    Varying object sizes require more complex and difficult-to-parallelize broad-phase algorithms; sweep-and-prune uses incremental sorting and traversal of lists.

    Narrow-phase algorithms (such as SAT or GJK) cause thread divergence.

    Benefits of HSA:

    Avoidance of the data copy to/from GPU and of the overhead of maintaining two copies of simulation state - SMA, COH

    Usage of early out optimizations and more efficient load balancing - ENQ

    More efficient serial aspects of broad-phase can run on the CPU - SMA, COH

    Improved handling of thread divergence - ENQ

    EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch

  • Slide 46/74

    GESTURE RECOGNITION

  • Slide 47/74

    Gesture recognition is an emerging, natural way of interacting with a computer.

    It is compute intensive: the computational complexity depends on the number and complexity of recognized gestures.

    It strongly benefits from the availability of depth information.

    Applications: browsing (previous/next, scroll), media players (next/previous song/video/image, pause/start), collaboration tools such as slideshows, gaming (finger/hand as the controller), immersive environments, virtual reality.

    Today's systems are tuned to today's HW, lacking in robustness and usability, which can only be achieved by use of special-purpose HW. They do not do well for:

    A wide variety of useful gestures (one or two hand, multiple finger, arm or full body)

    Motion-dependent gestures (e.g. finger pinch), which require correlating information from multiple frames

    Adaptability to variable lighting conditions

    Larger region/distance of input, enabled by processing higher-resolution video

  • Slide 48/74

    ALGORITHM PIPELINE

    Image processing: adaptive light normalization; edge and corner detection; erode/dilate/threshold filter, to produce a feature image.

    Depth analysis (for fg/bg segmentation, if using stereo cameras): sparse approach; correlate salient points in the feature image, and validate via local histogram matching in the original image.

    Connected components analysis, for hand identification (based on level sets): the GPU can recognize local connectivity with a parallel scan; the CPU can apply transitivity of labels (the neighbor of your neighbor is your neighbor), as in the sketch below.

    Feature vector (local histogram) extraction: global (HOG on tiles) or contextual (SURF/SIFT keypoints).

    Find the best match of the histogram with the training set (support vector machine); optionally update the training set.

    Update temporal model state machine.
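    The CPU-side transitivity step maps naturally onto union-find: the GPU's local-connectivity pass reports pairs of provisional labels that touch, and the CPU flattens them into final component labels. A minimal sketch:

    #include <numeric>
    #include <utility>
    #include <vector>

    // Flatten label equivalences reported by the GPU's parallel scan:
    // "the neighbor of your neighbor is your neighbor".
    std::vector<int> resolveLabels(int numLabels,
                                   const std::vector<std::pair<int, int>> &equiv)
    {
        std::vector<int> parent(numLabels);
        std::iota(parent.begin(), parent.end(), 0);  // each label starts as its own root

        auto find = [&](int x) {
            while (parent[x] != x) {
                parent[x] = parent[parent[x]];       // path halving
                x = parent[x];
            }
            return x;
        };

        for (const auto &e : equiv) {                // union each reported pair
            int a = find(e.first), b = find(e.second);
            if (a != b) parent[b] = a;
        }
        for (int i = 0; i < numLabels; ++i)          // final flattening pass
            parent[i] = find(i);
        return parent;                               // parent[i] = component of label i
    }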


  • Slide 49/74

    GESTURE RECOGNITION CHALLENGES AND SOLUTIONS

    Implementation Challenges:

    Transfer of raw image data from CPU to GPU adds latency.

    Feature matching and depth reconstruction is a divergent workload, as images are sparsely populated by keypoints, which require extensive processing.

    Connected component analysis on the GPU uses a parallel scan, of which the last stages of reduction are more efficiently performed on the CPU.

    High overhead of the per-frame updates to the GPU copy of the feature database, for unsupervised learning algorithms (e.g. Oja's rule).

    Benefits of HSA:

    Avoidance of the latency of duplicating data in GPU memory - SMA

    Higher GPU utilization is achieved via wavefront reshaping - ENQ

    Reduction is most optimally implemented by using both CPU and GPU - COH, SMA

    CPU can update the database while the GPU is accessing it - SMA, COH

    EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch

  • Slide 50/74

    RAY TRACING

  • Slide 51/74

    A photo-realistic visualization method that is widely used in movie production and high-fidelity visual effects.

    Used in many of today's photorealistic rendering packages:

    Maxwell Render (photorealistic high-end renderer)

    Nvidia's OptiX (Nvidia GPU ray tracing renderer)

    POV-Ray (popular CPU-only ray tracer)

    LuxMark (popular ray tracing benchmark)

    A rendering method that is friendly to parallelism, but not trivially ported to parallel architectures, due to the complexity of an efficient implementation.

    It is, however, not used in interactive applications due to performance limitations.

  • Slide 52/74

    RAY TRACING - CHALLENGES & SOLUTIONS

  • Slide 53/74

    Implementation Challenges:

    Scene database and acceleration data structure can be huge. E.g. a power plant scene (shown left) contains 12.7M polygons, has a size of 500 MBytes, and an acceleration data structure of 250MB-1.5GB (depending on renderer).

    Today's GPUs have problems fitting them into video memory.

    The acceleration data structure has to be built and updated using the CPU and transferred to video memory; 8ms to transfer the above data structure (250MB) to the GPU.

    Benefits of HSA:

    GPU compute units can access the scene and acceleration data structure from main memory - SMA, PM

    Avoidance of acceleration data structure copy to GPU memory - SMA

    EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; USD: USer Mode Dispatch

  • Slide 54/74

    RAY TRACING - CHALLENGES & SOLUTIONS (CONTINUED)

    Implementation Challenges:

    Dynamic scenes are impractical with current GPU compute implementations: data structure build time is too long for interactive frame rates; simple data structures can be built fast, but are difficult to traverse; faster traversal requires complex structures that take a long time to compute and are difficult to transfer to the GPU.

    Ray divergence, caused by child rays hitting different object types with different shading models (both GPUs & APUs like regular operations), results in lower utilization of CUs.

    The number of rays can be immense (in the billions), and the ray intersection process is compute intensive; the power plant scene at 1080p needs a conservative est. 2 billion rays.

    Benefits of HSA:

    CPU updates to the scene are transparently and immediately available (without any transfer penalty) to the GPU - SMA, PM

    Casting of child rays with no CPU-GPU round trip - ENQ

    Wavefront reshaping can improve CU utilization - ENQ

    EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch

  • Slide 55/74

    ACCELERATING MEMCACHED: A CLOUD SERVER WORKLOAD

  • Slide 56/74

    MEMCACHED

    A distributed memory object caching system used in cloud servers.

    Generally used for short-term storage and caching, handling requests that would otherwise require database or file system accesses.

    Used by Facebook, YouTube, Twitter, Wikipedia, Flickr, and others.

    Effectively a large distributed hash table that responds to store and get requests received over the network. Conceptually:

    store(key, object)

    object = get(key)
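    The get path is the part that parallelizes across requests: hash the key, then walk one hash-chain. A stripped-down sketch of that lookup (FNV-1a is used here for illustration; memcached's real hashing and item layout differ):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct Item {
        const char *key;
        std::size_t keyLen;
        void       *value;
        Item       *next;   // hash-chain link
    };

    static std::uint64_t fnv1a(const char *key, std::size_t len) {
        std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
        for (std::size_t i = 0; i < len; ++i) {
            h ^= (unsigned char)key[i];
            h *= 1099511628211ull;                  // FNV-1a 64-bit prime
        }
        return h;
    }

    // One lookup; a GPU offload runs thousands of these over a batch of requests.
    void *get(Item *const *table, std::size_t numBuckets,
              const char *key, std::size_t keyLen)
    {
        for (Item *it = table[fnv1a(key, keyLen) % numBuckets]; it; it = it->next)
            if (it->keyLen == keyLen && std::memcmp(it->key, key, keyLen) == 0)
                return it->value;
        return nullptr;     // miss: the caller falls back to the database
    }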

  • Slide 57/74

    OFFLOADING MEMCACHED KEY LOOKUP TO THE GPU

    [Chart: key look-up performance, each bar broken into data transfer and execution time, for a multithreaded CPU, a Radeon HD 5870, a Trinity A10-5800K, and a Zacate E-series APU.]

    T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt, "Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems," Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2012), April 2012. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6189209
  • Slide 58/74

    ACCELERATING JAVA: GOING BEYOND NATIVE LANGUAGES

  • Slide 59/74

    GPU PROGRAMMING OPTIONS FOR JAVA PROGRAMMERS

    Existing Java GPU (OpenCL/CUDA) bindings require coding a kernel in a domain-specific language:

    // JOCL/OpenCL kernel code
    __kernel void squares(__global const float *in, __global float *out)
    {
        int gid = get_global_id(0);
        out[gid] = in[gid] * in[gid];
    }

    Along with the Java host code to:

    Initialize the data

    Select/Initialize execution device

    Allocate or define memory buffers for args/parameters

    Compile 'Kernel' for a selected device

    Enqueue/Send arg buffers to device

    Execute the kernel

    Read results buffers back from the device

    Cleanup (remove buffers/queues/device handles)

    Use the results

    import static org.jocl.CL.*;
    import org.jocl.*;

    public class Sample {
        public static void main(String[] args) {
            // Create input and output arrays
            int size = ...;
            float inArray[] = new float[size];
            float outArray[] = new float[size];
            for (int i = 0; i < size; i++) { ...