8-Hegde

TRANSCRIPT

  • Slide 1/74

    HETEROGENEOUS SYSTEM ARCHITECTURE (HSA) AND THE SOFTWARE ECOSYSTEM

    MANJU HEGDE, CORPORATE VP, PRODUCTS GROUP, AMD

  • Slide 2/74

    OUTLINE

    Motivation

    HSA architecture v1

    Software stack

    Workload analysis

    Software Ecosystem

  • Slide 3/74

    PARADIGM SHIFTS?

    [Figure: three performance-over-time curves, one per era, each with a "we are here" marker.]

    Single-Core Era: single-thread performance vs. time. Enabled by: Moore's Law, voltage scaling. Constrained by: power, complexity. Programming: Assembly, C/C++, Java.

    Multi-Core Era: throughput performance vs. time (# of processors). Enabled by: Moore's Law, SMP architecture. Constrained by: power, parallel SW, scalability. Programming: pthreads, OpenMP / TBB.

    Heterogeneous Systems Era: modern application performance vs. time. Enabled by: abundant data parallelism, power-efficient GPUs. Programming: Shader, CUDA.

  • Slide 4/74

    WITNESS DISCRETE CPU AND DISCRETE GPU COMPUTE

    [Diagram: CPU1...CPUN sharing coherent CPU memory, connected over PCIe to a discrete GPU with its own GPU memory.]

    Compute acceleration works well for large offload

    Slow data transfer between CPU and GPU

    Expert programming necessary to take advantage of GPU compute

  • Slide 5/74

    FIRST AND SECOND GENERATION APUs

    [Diagram: CPU1...CPUN in a coherent CPU partition, GPU in its own partition, joined by a high-speed internal bus.]

    First integration of CPU and GPU on-chip

    Common physical memory, but not to the programmer

    Faster transfer of data between CPU and GPU, to enable more code to run on the GPU

  • Slide 6/74

    COMMON PHYSICAL MEMORY, BUT NOT TO THE PROGRAMMER

    [Diagram: CPU and GPU on one chip, each with its own view of memory.]

    CPU explicitly copies data to GPU memory

    GPU completes computation

    CPU explicitly copies result back to CPU memory
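    In today's model, the host code must stage every transfer explicitly. A minimal sketch of the three steps above in OpenCL 1.x host terms (the function and names are illustrative, not from the slides; a real program also needs the usual context/queue/kernel setup, assumed here):

    #include <CL/cl.h>

    // Run a kernel over n floats with explicit staging copies.
    // Assumes ctx, queue, and kernel were created by standard OpenCL boilerplate.
    void run_with_copies(cl_context ctx, cl_command_queue queue,
                         cl_kernel kernel, const float *in, float *out, size_t n)
    {
        // 1. CPU explicitly copies input data into GPU memory.
        cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
        cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
        clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, n * sizeof(float), in, 0, NULL, NULL);

        // 2. GPU completes the computation.
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
        clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        // 3. CPU explicitly copies the result back to CPU memory.
        clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n * sizeof(float), out, 0, NULL, NULL);

        clReleaseMemObject(d_in);
        clReleaseMemObject(d_out);
    }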

  • Slide 7/74

    WHAT ARE THE PROBLEMS WE ARE TRYING TO SOLVE?

    SoCs are quickly falling into the same many-CPU-core bottlenecks as the PC.

    To move beyond this, we need to look at the right processor(s) and/or execution device for a given workload at reasonable power,

    while addressing the core issues of:

    Easier to program

    Easier to optimize

    Easier to load balance

    High performance

    Lower power

  • Slide 8/74

    COMBINE INTO A UNIFIED PROGRAMMING MODEL

    [Diagram: CPU, GPU, audio processor, video hardware, DSP, image signal processor, fixed-function accelerator, and encode/decode engines, all sharing memory, coherency, and user mode queues.]

  • Slide 9/74

    WHO IS DOING THIS? HSA FOUNDATION MEMBERSHIP, JUNE 2013

    [Slide: member logos grouped by tier.]

    Founders

    Promoters

    Supporters

    Contributors

    Academic

    Associates

    http://www.multicorewareinc.com/index.php
    http://www.apical.co.uk/
  • Slide 10/74

    HSA FOUNDATION'S FOCUS

    Identify design features to make accelerators first class processors

    Attract mainstream programmers

    Create a platform architecture for ALL accelerators

  • Slide 11/74

    HSA ARCHITECTURE V1

    [Diagram: CPU, GPU, audio processor, video hardware, DSP, fixed-function accelerator, and encode/decode engines over shared memory, coherency, and user mode queues.]

    GPU compute C++ support

    User Mode Scheduling

    Fully coherent memory between CPU & GPU

    GPU uses pageable system memory via CPU pointers

    GPU graphics pre-emption

    GPU compute context switch

  • Slide 12/74

    HSA KEY FEATURES

    [Diagram: CPU and GPU, each with caches, sharing virtual memory through hardware coherency, backed by physical memory.]

    Entire memory space: Both CPU and GPU can access and allocate any location in the system's virtual memory space.

    Coherent memory: Ensures CPU and GPU caches both see an up-to-date view of data.

    Pageable memory: The GPU can seamlessly access virtual memory addresses that are not present in physical memory.

  • Slide 13/74

    CPU / GPU UNIFORM MEMORY

    [Diagram: CPU and GPU sharing one uniform memory.]

    WITH HSA:

    CPU simply passes a pointer to GPU

    GPU completes computation

    CPU can read the result directly; no copying needed!
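    In API terms, the closest widely available expression of this is shared virtual memory in the OpenCL 2.x runtime that the later software-stack slides name. A hedged sketch using coarse-grained SVM (ctx, queue, and kernel assumed to exist; with fine-grained SVM the map/unmap calls disappear as well):

    #include <CL/cl.h>

    void run_with_svm(cl_context ctx, cl_command_queue queue,
                      cl_kernel kernel, size_t n)
    {
        // One allocation visible to both CPU and GPU: no staging buffers.
        float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0);

        // CPU fills the array in place (map/unmap is the coarse-grained handshake).
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, NULL, NULL);
        for (size_t i = 0; i < n; ++i) data[i] = (float)i;
        clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

        // CPU simply passes the pointer to the GPU...
        clSetKernelArgSVMPointer(kernel, 0, data);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

        // ...and reads the result in place: no copy back.
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ, data, n * sizeof(float), 0, NULL, NULL);
        /* consume data[0..n-1] */
        clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

        clSVMFree(ctx, data);
    }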

HSA SOFTWARE STACK

  • Slide 14/74

    HSA ARCHITECTURE V1: HSA SOFTWARE STACK

    [Diagram: the HSA v1 platform (CPU, GPU, audio processor, video hardware, DSP, fixed-function accelerator, encode/decode engines over shared memory, coherency, and user mode queues) beneath the software stack.]

    Hardware features: GPU compute C++ support; User Mode Scheduling; fully coherent memory between CPU & GPU; GPU uses pageable system memory via CPU pointers; GPU graphics pre-emption; GPU compute context switch.

    Software stack, top to bottom: Apps; Task Queuing Libraries; HSA Domain Libraries, OpenCL 2.x Runtime; HSA Runtime; HSA JIT; HSA Kernel Mode Driver.

  • Slide 15/74

    HETEROGENEOUS COMPUTE DISPATCH

    How compute dispatch operates today in the driver model

    How compute dispatch improves under HSA

  • Slide 16/74

    TODAY'S COMMAND AND DISPATCH FLOW

    [Diagram: Application A's command flow passes through Direct3D and the user mode driver into a command buffer; the kernel mode driver manages a soft queue and a DMA buffer, whose data finally lands on the single hardware queue.]

  • Slide 17/74

    TODAY'S COMMAND AND DISPATCH FLOW

    [Diagram: the same Direct3D / user mode driver / command buffer / soft queue / DMA buffer path, repeated per application: A, B, and C each have their own stack, but all feed the single hardware queue.]

  • Slide 18/74

    TODAY'S COMMAND AND DISPATCH FLOW

    [Diagram: as before, with the hardware queue shown interleaving work from A, B, and C (A, C, B, A, B); every application contends for one queue through the kernel mode driver.]

  • Slide 19/74

    TODAY'S COMMAND AND DISPATCH FLOW

    [Diagram: the per-application driver stacks and the shared, interleaved hardware queue, shown one last time before the HSA flow.]

  • Slide 20/74

    HSA COMMAND AND DISPATCH FLOW

    [Diagram: Applications A, B, and C each own a hardware queue on the GPU, with an optional dispatch buffer; packets flow directly from application to hardware.]

    No APIs

    No soft queues

    No user mode drivers

    No kernel mode transitions

    No overhead

    Application codes directly to the hardware

    User mode queuing

    Hardware scheduling

    Low dispatch times
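    A user mode queue can be pictured as a ring buffer in shared memory with atomic read/write indices: the application publishes a packet and bumps the write index; the hardware consumes packets on its own, with no driver call on the dispatch path. A simplified, single-producer C++ illustration (this is not the real HSA AQL packet format, which carries more fields):

    #include <atomic>
    #include <cstdint>

    // Illustrative dispatch packet; real AQL packets are richer.
    struct DispatchPacket {
        uint64_t kernel_object;    // handle of the finalized kernel code
        uint64_t kernarg_address;  // kernel arguments, already in shared memory
        uint32_t grid_size;        // number of work-items to launch
    };

    struct UserModeQueue {
        static const uint32_t kSize = 256;     // power of two
        DispatchPacket packets[kSize];
        std::atomic<uint64_t> write_index{0};  // bumped by the application
        std::atomic<uint64_t> read_index{0};   // bumped by the packet processor

        // Application-side enqueue: no API call, no kernel mode transition.
        bool enqueue(const DispatchPacket &p) {
            uint64_t w = write_index.load(std::memory_order_relaxed);
            if (w - read_index.load(std::memory_order_acquire) >= kSize)
                return false;                  // queue full, caller retries
            packets[w % kSize] = p;
            // Publish the packet before the new index becomes visible.
            write_index.store(w + 1, std::memory_order_release);
            return true;
        }
    };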

  • Slide 21/74

    COMMAND AND DISPATCH: CPU AND GPU

    [Diagram: CPU1, CPU2, and the GPU all produce and consume work through a shared application/runtime layer.]

  • Slide 22/74

    MAKING GPUS AND APUS EASIER TO PROGRAM: TASK QUEUING RUNTIMES

    Popular pattern for task and data parallel programming on SMP systems today

    Characterized by:

    A work queue per core

    A runtime library that divides large loops into tasks and distributes them to queues

    A work stealing runtime that keeps the system balanced

    HSA is designed to extend this pattern to run on heterogeneous systems (see the sketch below).
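    A minimal sketch of that pattern: one deque per worker, owners push and pop at the back, and an idle worker steals the oldest task from another queue's front. (Mutexes keep the sketch short; production runtimes use lock-free deques.)

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <vector>

    using Task = std::function<void()>;

    struct WorkerQueue {
        std::deque<Task> tasks;
        std::mutex lock;
    };

    class WorkStealingPool {
        std::vector<WorkerQueue> queues_;
    public:
        explicit WorkStealingPool(std::size_t workers) : queues_(workers) {}

        // Owner pushes new tasks onto the back of its own queue.
        void push(std::size_t self, Task t) {
            std::lock_guard<std::mutex> g(queues_[self].lock);
            queues_[self].tasks.push_back(std::move(t));
        }

        // Run one task: own queue first (LIFO), then steal from a victim (FIFO).
        bool try_run(std::size_t self) {
            Task t;
            {
                std::lock_guard<std::mutex> g(queues_[self].lock);
                if (!queues_[self].tasks.empty()) {
                    t = std::move(queues_[self].tasks.back());
                    queues_[self].tasks.pop_back();
                }
            }
            for (std::size_t v = 0; !t && v < queues_.size(); ++v) {
                if (v == self) continue;
                std::lock_guard<std::mutex> g(queues_[v].lock);
                if (!queues_[v].tasks.empty()) {
                    t = std::move(queues_[v].tasks.front());  // steal the oldest
                    queues_[v].tasks.pop_front();
                }
            }
            if (!t) return false;  // nothing anywhere: system is balanced or idle
            t();
            return true;
        }
    };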

  • Slide 23/74

    TASK QUEUING RUNTIME ON CPUs

    [Diagram: a work stealing runtime over four x86 CPUs, each with a CPU worker and its queue (Q). CPU threads and memory only; no GPU threads.]

  • Slide 24/74

    TASK QUEUING RUNTIME ON THE HSA PLATFORM

    [Diagram: the same work stealing runtime over four x86 CPU workers and their queues, now extended with GPU SIMD units fed by a GPU manager and fetch logic, so CPU threads and GPU threads share the queues and memory.]

  • Slide 25/74

    [Two software stacks compared, both running on the hardware (APUs, CPUs, GPUs):]

    Driver Stack: Apps; Domain Libraries; OpenCL 1.x, DX Runtimes, User Mode Drivers; Graphics Kernel Mode Driver.

    HSA Software Stack: Apps; Task Queuing Libraries; HSA Domain Libraries, OpenCL 2.x Runtime; HSA Runtime; HSA JIT; HSA Kernel Mode Driver.

    [Legend: user mode components; kernel mode components; components contributed by third parties.]

  • Slide 26/74

    HSA INTERMEDIATE LANGUAGE - HSAIL

    HSAIL is the intermediate language for parallel compute in HSA. It is generated by a high-level compiler (LLVM, gcc, Java VM, etc.) and compiled down to GPU ISA or another parallel processor ISA by an IHV finalizer. The finalizer may execute at run time, install time, or build time, depending on platform type.

    HSAIL is a low-level instruction set designed for parallel compute in a shared virtual memory environment. HSAIL is SIMT in form and does not dictate hardware microarchitecture.

    HSAIL is designed for fast compile time, moving most optimizations to the HL compiler.

    HSAIL is at the same level as PTX: an intermediate assembly or virtual machine target.

    It is represented as bit-code in a BRIG file format, with support for late binding of libraries.

  • Slide 27/74

    HSA BRINGS A MODERN OPEN COMPILATION FOUNDATION

    This brings about a fully competitive, rich, complete compilation stack architecture, enabling the creation of a broader set of GPU computing tools, languages, and libraries.

    HSAIL supports LLVM and other compilers (GCC, Java VM).

    The two stacks, side by side:

    CUDA:   EDG or CLANG -> NVVM IR -> LLVM -> PTX -> hardware
    OpenCL: EDG or CLANG -> SPIR -> LLVM -> HSAIL -> hardware

  • Slide 28/74

    OPENCL AND HSA

    HSA is an optimized platform architecture for OpenCL:

    Not an alternative to OpenCL

    Focused on the hardware platform more than the API

    Ready to support many more languages than C/C++

    OpenCL on HSA will benefit from:

    Avoidance of wasteful copies

    Low latency dispatch

    Improved memory model

    Pointers shared between CPU and GPU

    HSA also exposes a lower-level programming interface.

    Optimized libraries may choose the lower-level interface.

  • Slide 29/74

  • Slide 30/74

    AMD'S OPEN SOURCE COMMITMENT TO HSA

    We will open source our Linux execution and compilation stack:

    Jump start the ecosystem

    Allow a single shared implementation where appropriate

    Enable university research in all areas

    Component Name        | AMD-Specific | Rationale
    HSA Bolt Library      | No           | Enable understanding and debug
    HSAIL Code Generator  | No           | Enable research
    LLVM Contributions    | No           | Industry and academic collaboration
    HSA Assembler         | No           | Enable understanding and debug
    HSA Runtime           | No           | Standardize on a single runtime
    HSA Finalizer         | Yes          | Enable research and debug
    HSA Kernel Driver     | Yes          | For inclusion in Linux distros

  • Slide 31/74

    WORKLOAD ANALYSIS

  • Slide 32/74

    HAAR FACE DETECTION: A CORNERSTONE TECHNOLOGY FOR COMPUTER VISION

  • Slide 33/74

    LOOKING FOR FACES IN ALL THE RIGHT PLACES

    Quick HD calculations:
    Search square = 21 x 21 pixels
    Pixels = 1920 x 1080 = 2,073,600
    Search squares = 1900 x 1060 = ~2 million (one per valid 21 x 21 window position)

  • Slide 34/74

    LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME

    More HD calculations:
    70% scaling in H and V
    Total pixels = 4.07 million
    Search squares = 3.8 million

  • Slide 35/74

    HAAR CASCADE STAGES

    [Diagram: Stage N evaluates features k, l, m; if it passes, Stage N+1 evaluates features p, q, r; and so on.]

  • Slide 36/74

    22 CASCADE STAGES, EARLY OUT BETWEEN EACH

    [Diagram: Stage 1 -> Stage 2 -> ... -> Stage 21 -> Stage 22; any stage can reject the search square early with NO FACE.]
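    Per search square, the cascade is a chain of cheap tests with an early out. A sketch of that control flow (evalFeature and the stage thresholds stand in for the real trained classifier data):

    // Assumed helper: evaluates one trained Haar feature on the integral image.
    float evalFeature(const float *integralImage, int x, int y, int featureIndex);

    struct Stage {
        int firstFeature, numFeatures;
        float threshold;
    };

    bool isFace(const float *integralImage, int x, int y,
                const Stage *stages, int numStages)   // 22 stages on this slide
    {
        for (int s = 0; s < numStages; ++s) {
            float sum = 0.0f;
            for (int f = 0; f < stages[s].numFeatures; ++f)
                sum += evalFeature(integralImage, x, y, stages[s].firstFeature + f);
            if (sum < stages[s].threshold)
                return false;                         // early out: NO FACE
        }
        return true;                                  // survived all stages
    }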

    Final HD calculations:
    Search squares = 3.8 million
    Average features per square = 124
    Calculations per feature = 100
    Calculations per frame = 47 GCalcs

    Calculation rate:
    30 frames/sec = 1.4 TCalcs/second
    60 frames/sec = 2.8 TCalcs/second

    ...and this only finds front-facing faces.
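    The arithmetic behind those numbers, spelled out with the slide's inputs:

    #include <cstdio>

    int main() {
        const double searchSquares   = 3.8e6;  // across all pyramid scales
        const double featuresPerSq   = 124;    // average, thanks to early out
        const double calcsPerFeature = 100;

        const double calcsPerFrame = searchSquares * featuresPerSq * calcsPerFeature;
        std::printf("per frame: %.1f GCalcs\n", calcsPerFrame / 1e9);          // ~47.1
        std::printf("30 fps:    %.1f TCalcs/s\n", 30 * calcsPerFrame / 1e12);  // ~1.4
        std::printf("60 fps:    %.1f TCalcs/s\n", 60 * calcsPerFrame / 1e12);  // ~2.8
    }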

  • Slide 37/74

    CASCADE DEPTH ANALYSIS

    [Chart: cascade depth reached across the frame, bucketed 0-5, 5-10, 10-15, 15-20, and 20-25 stages.]

  • Slide 38/74

    PROCESSING TIME/STAGE

    [Chart: time (ms) per cascade stage (stages 1-8, and 9-22 grouped), CPU vs. GPU, on a Trinity A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz).]

    AMD A10-4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)

  • Slide 39/74

    PERFORMANCE CPU-VS-GPU

    [Chart: images/sec vs. number of cascade stages run on the GPU (0-8, and all 22), for CPU-only, GPU-only, and HSA configurations, on a Trinity A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz).]

    AMD A10-4600M APU with Radeon HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4GB RAM; Windows 7 (64-bit); OpenCL 1.1 (873.1)

  • Slide 40/74

    HAAR SOLUTION: RUN DIFFERENT CASCADES ON GPU AND CPU

    By seamlessly sharing data between CPU and GPU, HSA allows the right processor to handle its appropriate workload:

    +2.5x increased performance

    2.5x decreased energy per frame

  • Slide 41/74

    GAMEPLAY RIGID BODY PHYSICS

  • Slide 42/74

    RIGID BODY PHYSICS SIMULATION

    Rigid-body physics simulation is a way to animate and interact with objects, widely used in games and movie production. It is used to drive game play and for visual effects (eye candy).

    Physics simulation is used in much of today's software:

    Middleware physics engines such as Bullet, Havok, PhysX

    Games ranging from Angry Birds and Cut the Rope to Tomb Raider and Crysis 3

    3D authoring tools such as Autodesk Maya, Unity 3D, Houdini, Cinema 4D, Lightwave

    Industrial applications such as Siemens NX8 Mechatronics Concept Design

    Medical applications such as surgery trainers

    Robotics simulation

    But GPU-accelerated rigid-body physics is not used in game play, only in effects.

  • Slide 43/74

    RIGID BODY PHYSICS - ALGORITHM

    Broad-phase collision detection: find potentially interacting object pairs using bounding shape approximations (see the sketch below).

    Mid-phase collision detection: perform full overlap testing between potentially interacting pairs.

    Narrow-phase collision detection, computing contact points: compute exact contact information for the various shape types.

    Setup and solve constraints: compute constraint forces for natural motion and stable stacking.
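    Concretely, the broad phase reduces to cheap axis-aligned bounding-box overlap tests. A brute-force sketch (real engines replace the O(n^2) loop with sweep-and-prune or spatial hashing, as the next slides discuss):

    #include <cstddef>
    #include <utility>
    #include <vector>

    struct AABB { float min[3], max[3]; };

    static bool overlaps(const AABB &a, const AABB &b) {
        for (int axis = 0; axis < 3; ++axis)
            if (a.max[axis] < b.min[axis] || b.max[axis] < a.min[axis])
                return false;                 // separated along this axis
        return true;
    }

    // Emit candidate pairs for the mid and narrow phases to examine.
    std::vector<std::pair<int, int>> broadPhase(const std::vector<AABB> &boxes) {
        std::vector<std::pair<int, int>> pairs;
        for (std::size_t i = 0; i < boxes.size(); ++i)
            for (std::size_t j = i + 1; j < boxes.size(); ++j)
                if (overlaps(boxes[i], boxes[j]))
                    pairs.emplace_back((int)i, (int)j);
        return pairs;
    }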

  • Slide 44/74

    RIGID BODY PHYSICS - CHALLENGES & SOLUTIONS

  • Slide 45/74

    Implementation Challenges:

    Simulation is a pipeline of many different algorithms, some of which are more suitable for the CPU while others are more suitable for the GPU.

    Many CPU optimizations (e.g. early outs) aren't efficient on GPUs, requiring the use of more brute-force but GPU-friendly algorithms. The diversity of intersection algorithms causes load balancing challenges.

    Varying object sizes require more complex and difficult-to-parallelize broad-phase algorithms; sweep-and-prune uses incremental sorting and traversal of lists.

    Narrow-phase algorithms (such as SAT or GJK) cause thread divergence.

    Benefits of HSA:

    Avoidance of the data copy to/from GPU and of the overhead of maintaining two copies of simulation state - SMA, COH

    Usage of early out optimizations and more efficient load balancing - ENQ

    More efficient serial aspects of broad-phase can run on the CPU - SMA, COH

    Improved handling of thread divergence - ENQ

    EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch

  • Slide 46/74

    GESTURE RECOGNITION

  • Slide 47/74

    Gesture recognition is an emerging, natural way of interacting with a computer.

    It is compute intensive: the computational complexity depends on the number and complexity of recognized gestures.

    It strongly benefits from the availability of depth information.

    Applications: browsing (previous/next, scroll), media players (next/previous song/video/image, pause/start), collaboration tools such as slideshows, gaming (finger/hand as the controller), immersive environments, virtual reality.

    Today's systems are tuned to today's HW, lacking in robustness and usability, which can only be achieved by use of special-purpose HW. They do not do well for:

    A wide variety of useful gestures (one or two hand, multiple finger, arm or full body)

    Motion-dependent gestures (e.g. finger pinch), which require correlating information from multiple frames

    Adaptability to variable lighting conditions

    Larger region/distance of input, enabled by processing higher-resolution video

  • Slide 48/74

    ALGORITHM PIPELINE

    Image processing: adaptive light normalization; edge and corner detection; erode/dilate/threshold filter, to produce a feature image.

    Depth analysis (for fg/bg segmentation, if using stereo cameras): sparse approach; correlate salient points in the feature image, and validate via local histogram matching in the original image.

    Connected components analysis, for hand identification (based on level sets): the GPU can recognize local connectivity with a parallel scan; the CPU can apply transitivity of labels (the neighbor of your neighbor is your neighbor), as in the sketch below.

    Feature vector (local histogram) extraction: global (HOG on tiles) or contextual (SURF/SIFT keypoints).

    Find the best match of the histogram with the training set (support vector machine); optionally update the training set.

    Update temporal model state machine.
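    The CPU-side transitivity step maps naturally onto union-find: the GPU's local-connectivity pass reports pairs of provisional labels that touch, and the CPU flattens them into final component labels. A minimal sketch:

    #include <numeric>
    #include <utility>
    #include <vector>

    // Flatten label equivalences reported by the GPU's parallel scan:
    // "the neighbor of your neighbor is your neighbor".
    std::vector<int> resolveLabels(int numLabels,
                                   const std::vector<std::pair<int, int>> &equiv)
    {
        std::vector<int> parent(numLabels);
        std::iota(parent.begin(), parent.end(), 0);  // each label starts as its own root

        auto find = [&](int x) {
            while (parent[x] != x) {
                parent[x] = parent[parent[x]];       // path halving
                x = parent[x];
            }
            return x;
        };

        for (const auto &e : equiv) {                // union each reported pair
            int a = find(e.first), b = find(e.second);
            if (a != b) parent[b] = a;
        }
        for (int i = 0; i < numLabels; ++i)          // final flattening pass
            parent[i] = find(i);
        return parent;                               // parent[i] = component of label i
    }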


  • Slide 49/74

    GESTURE RECOGNITION CHALLENGES AND SOLUTIONS

    Implementation Challenges:

    Transfer of raw image data from CPU to GPU adds latency.

    Feature matching and depth reconstruction is a divergent workload, as images are sparsely populated by keypoints, which require extensive processing.

    Connected component analysis on the GPU uses a parallel scan, of which the last stages of reduction are more efficiently performed on the CPU.

    High overhead of the per-frame updates to the GPU copy of the feature database, for unsupervised learning algorithms (e.g. Oja's rule).

    Benefits of HSA:

    Avoidance of the latency of duplicating data in GPU memory - SMA

    Higher GPU utilization is achieved via wavefront reshaping - ENQ

    Reduction is most optimally implemented by using both CPU and GPU - COH, SMA

    CPU can update the database while the GPU is accessing it - SMA, COH

    EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch

  • Slide 50/74

    RAY TRACING

  • Slide 51/74

    A photo-realistic visualization method that is widely used in movie production and high-fidelity visual effects.

    Used in many of today's photorealistic rendering packages:

    Maxwell Render (photorealistic high-end renderer)

    Nvidia's OptiX (Nvidia GPU ray tracing renderer)

    POV-Ray (popular CPU-only ray tracer)

    LuxMark (popular ray tracing benchmark)

    A rendering method that is friendly to parallelism, but not trivially ported to parallel architectures, due to the complexity of an efficient implementation.

    It is, however, not used in interactive applications due to performance limitations.

  • Slide 52/74

    RAY TRACING - CHALLENGES & SOLUTIONS

  • Slide 53/74

    Implementation Challenges:

    Scene database and acceleration data structure can be huge. E.g. a power plant scene (shown left) contains 12.7M polygons, has a size of 500 MBytes, and an acceleration data structure of 250MB-1.5GB (depending on renderer).

    Today's GPUs have problems fitting them into video memory.

    The acceleration data structure has to be built and updated using the CPU and transferred to video memory; 8ms to transfer the above data structure (250MB) to the GPU.

    Benefits of HSA:

    GPU compute units can access the scene and acceleration data structure from main memory - SMA, PM

    Avoidance of acceleration data structure copy to GPU memory - SMA

    EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; USD: USer Mode Dispatch

  • Slide 54/74

    RAY TRACING - CHALLENGES & SOLUTIONS (CONTINUED)

    Implementation Challenges:

    Dynamic scenes are impractical with current GPU compute implementations: data structure build time is too long for interactive frame rates; simple data structures can be built fast, but are difficult to traverse; faster traversal requires complex structures that take a long time to compute and are difficult to transfer to the GPU.

    Ray divergence, caused by child rays hitting different object types with different shading models (both GPUs & APUs like regular operations), results in lower utilization of CUs.

    The number of rays can be immense (in the billions), and the ray intersection process is compute intensive; the power plant scene at 1080p needs a conservative est. 2 billion rays.

    Benefits of HSA:

    CPU updates to the scene are transparently and immediately available (without any transfer penalty) to the GPU - SMA, PM

    Casting of child rays with no CPU-GPU round trip - ENQ

    Wavefront reshaping can improve CU utilization - ENQ

    EMS: Entire Memory Space; PM: Pageable Memory; COH: Bidirectional Platform Coherency; SMA: Shared Virtual Memory; DYN: Dynamic Memory Allocation; ENQ: GPU ENQueue; USD: USer Mode Dispatch

  • Slide 55/74

    ACCELERATING MEMCACHED: A CLOUD SERVER WORKLOAD

  • Slide 56/74

    MEMCACHED

    A distributed memory object caching system used in cloud servers.

    Generally used for short-term storage and caching, handling requests that would otherwise require database or file system accesses.

    Used by Facebook, YouTube, Twitter, Wikipedia, Flickr, and others.

    Effectively a large distributed hash table that responds to store and get requests received over the network. Conceptually:

    store(key, object)

    object = get(key)
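    The get path is the part that parallelizes across requests: hash the key, then walk one hash-chain. A stripped-down sketch of that lookup (FNV-1a is used here for illustration; memcached's real hashing and item layout differ):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    struct Item {
        const char *key;
        std::size_t keyLen;
        void       *value;
        Item       *next;   // hash-chain link
    };

    static std::uint64_t fnv1a(const char *key, std::size_t len) {
        std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
        for (std::size_t i = 0; i < len; ++i) {
            h ^= (unsigned char)key[i];
            h *= 1099511628211ull;                  // FNV-1a 64-bit prime
        }
        return h;
    }

    // One lookup; a GPU offload runs thousands of these over a batch of requests.
    void *get(Item *const *table, std::size_t numBuckets,
              const char *key, std::size_t keyLen)
    {
        for (Item *it = table[fnv1a(key, keyLen) % numBuckets]; it; it = it->next)
            if (it->keyLen == keyLen && std::memcmp(it->key, key, keyLen) == 0)
                return it->value;
        return nullptr;     // miss: the caller falls back to the database
    }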

  • Slide 57/74

    OFFLOADING MEMCACHED KEY LOOKUP TO THE GPU

    [Chart: key look-up performance, each bar broken into data transfer and execution time, for a multithreaded CPU, a Radeon HD 5870, a Trinity A10-5800K, and a Zacate E-series APU.]

    T. H. Hetherington, T. G. Rogers, L. Hsu, M. O'Connor, and T. M. Aamodt, "Characterizing and Evaluating a Key-Value Store Application on Heterogeneous CPU-GPU Systems," Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2012), April 2012. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6189209
  • Slide 58/74

    ACCELERATING JAVA: GOING BEYOND NATIVE LANGUAGES

  • Slide 59/74

    GPU PROGRAMMING OPTIONS FOR JAVA PROGRAMMERS

    Existing Java GPU (OpenCL/CUDA) bindings require coding a kernel in a domain-specific language:

    // JOCL/OpenCL kernel code
    __kernel void squares(__global const float *in, __global float *out)
    {
        int gid = get_global_id(0);
        out[gid] = in[gid] * in[gid];
    }

    Along with the Java host code to:

    Initialize the data

    Select/Initialize execution device

    Allocate or define memory buffers for args/parameters

    Compile 'Kernel' for a selected device

    Enqueue/Send arg buffers to device

    Execute the kernel

    Read results buffers back from the device

    Cleanup (remove buffers/queues/device handles)

    Use the results

    import static org.jocl.CL.*;
    import org.jocl.*;

    public class Sample {
        public static void main(String[] args) {
            // Create input and output arrays
            int size = ...;
            float inArray[] = new float[size];
            float outArray[] = new float[size];
            for (int i = 0; i < size; i++) { ...