PROGRAMMING Intel® Processor Graphics Chi-Keung (CK) Luk - Intel Principal Engineer Intel Software & Services Group


  • PROGRAMMING Intel® Processor Graphics

    Chi-Keung (CK) Luk - Intel Principal Engineer

    Intel Software & Services Group

  • Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

    • Compute programming on Intel Graphics with:

    • OpenCL

    • CilkPlus

    • Tools

    • Workload performance

    2

    Agenda


    • OpenCL:

    Khronos

    Uri Levy, Yuval Eshkol, Doron Singer

    Robert Ioffe, Aaron Kunze, Ben Ashbaugh, Stephen Junkins, Michal Mrozek

    • CilkPlus:

    Knud J Kirkegaard, Anoop Madhusoodhanan Prabha, Konstantin Bobrovsky, Sergey Dmitriev

    • VTune:

    Alexandr Kurylev, Julia Fedorova

    • Workload Performance:

    Sharad Tripathi, Chinang Ma, Akhila Vidiyala

    Edward Ching, Norbert Egi, Masood Mortazavi, Vivent Cheng, Guangyu Shi

    3

    Acknowledgments for Slide Sources


    OpenCL

    1. Introduction

    2. Optimizing OpenCL for Intel GPUs

    3. Using Shared Virtual Memory (SVM)


    • An open standard managed by the Khronos group

    • A set of C-based APIs on the host that defines a run-time environment

    • Programs written in a C-based language (C++ support since OpenCL 2.1) that run on the device(s)

    OpenCL* (Open Computing Language)


    Mostly C-like

    Kernels (the functions that work-items execute) have a kernel prefix and a void return type

    No support for library functions

    No stdio.h / stdlib.h / math.h / etc..

    But printf is supported

    Based on C99

    kernel void foo (global int* ptr)

    {

    for (int i = 0; i < ...; i++) { ... }  // remainder of the snippet truncated in the source

    }


    Scalar data types

    char / uchar / short / ushort / int / uint / long / ulong

    float / double

    size_t

    Pointers

    Derived data types

    Arrays

    Structures

    Vector data types

    Supported data types


    Vectors exist for all scalar types

    Vector widths are 2, 3, 4, 8, 16

    All arithmetic operations work on vector types

    Component access (XYZW)

    Vectors > 4 use numeric (hexadecimal) indices

    Working with vectors

    uint3 vec0, vec1;

    uint3 result = vec0 + vec1;

    double res1 = dvec0.x + dvec1.z;

    double2 res2 = dvec0.wy + dvec1.xx; // Swizzle

    float res1 = vec16.s5 + vec16.sf;

    float2 res2 = vec8.s37 + vec16.sca; // Swizzle
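Component selection such as `vec16.s5` or the `.s048c` swizzle above is an indexed gather from vector lanes, with each hex digit naming one lane. A plain-C sketch of the semantics (`swizzle4` and its arguments are illustrative names, not OpenCL API):

```c
#include <assert.h>

/* Plain-C sketch of what a swizzle such as ".s048c" computes: gather the
   lanes named by the hex digits (0x0, 0x4, 0x8, 0xC) of a 16-wide vector
   into a 4-wide result. OpenCL does this as a register operation; the
   loop below is only the semantics. */
static void swizzle4(const unsigned char src[16], const int idx[4],
                     unsigned char dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = src[idx[i]];
}
```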


    Functions used for querying the index

    get_global_id(int dimension); // index of the work-item in the entire execution

    get_local_id(int dimension); // index of the work-item within its work-group

    get_group_id(int dimension); // index of the work-group

    A few others..

    Memory

    No support for dynamic memory allocation

    When passing buffers as arguments, specify global / local / constant

    Additional details

    kernel void foo (global const int* ptr, local float* scratch)

    {

    ...

    }


    Many functions supported

    Overloaded to all relevant types and vector widths

    A tiny bit for example:

    Math: sin / cos / min / max / log / pow / sqrt / ….

    Geometric: dot / cross / distance / length / …

    Relational: isequal / isgreater / all / any / select / …

    Built-in functions

    kernel void dot_product(global const int4* a, global const int4* b, global int* out)

    {

    size_t tid = get_global_id(0);

    out[tid] = dot(a[tid], b[tid]);

    }
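Per work-item, the kernel above reduces one `int4` pair to a single value. A scalar C reference of that computation (note: the OpenCL spec defines `dot()` for floating-point vectors, so the `int4` usage on the slide is a simplification; `dot4` is an illustrative name):

```c
/* Scalar reference for the per-work-item computation in the kernel above:
   each work-item tid reduces one int4 pair to a single dot product. */
static int dot4(const int a[4], const int b[4])
{
    int sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += a[i] * b[i];
    return sum;
}
```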


    • Scalar values

    • Use “normal” (C-style) casts

    • Vector values

    • Vector conversions must be done explicitly

    • Source and destination types must have the same vector width

    • Example:

    • Vector construction (from scalars)

    Casts and type conversions

    dstValue = convert_destType(srcValue)

    int8 intVec;

    double8 dVec;

    float8 fVec = convert_float8(intVec);

    float8 fVec2 = convert_float8(dVec);

    ushort3 vecUshort = (ushort3)(0, 12, 7);
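`convert_destType()` works element-wise, like a C cast applied per lane; the numeric suffix only fixes the width, and source and destination widths must match. A plain-C sketch of `convert_float8` on an int source (`convert_float8_ref` is an illustrative name, not OpenCL API):

```c
/* Element-wise semantics of OpenCL's convert_float8(): each lane is
   converted independently, like a C cast per element. The "8" only fixes
   the vector width; source and destination widths must match. */
static void convert_float8_ref(const int src[8], float dst[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (float)src[i];
}
```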


    Based on C99

    With some extra features:

    Vector data types

    Extensive Built-in functions library

    Image handling, Work-group synchronization

    Minus others:

    Recursion

    Function pointers

    Pointers to pointers

    Dynamic memory allocation

    Language summary


    OpenCL

    1. Introduction

    2. Optimizing OpenCL for Intel GPUs

    Use zero copying to transfer data between CPU and GPU

    Maximize EU occupancy

    Maximize compute performance

    Avoid Divergent Control Flow

    Take advantage of large register space

    Optimize data accesses

    3. Using Shared Virtual Memory (SVM)


    Optimizing Host to Device Transfers: Zero Copying

    • Host (CPU) and Device (GPU) share the same physical memory

    • For buffers allocated through the OpenCL™ runtime:

    - Let the OpenCL runtime allocate the system memory:

    Create the buffer with the CL_MEM_ALLOC_HOST_PTR flag (host_ptr is NULL; the runtime allocates suitably aligned memory)

    - OR, use pre-allocated system memory:

    Create the buffer with your system memory pointer and CL_MEM_USE_HOST_PTR

    Allocate system memory aligned to a page (4096 bytes) (e.g., use _aligned_malloc or memalign to allocate)

    Allocate a multiple of cache line size (64 bytes)

    No transfer needed (zero copy)!

    - Use clEnqueueMapBuffer() to access data

    No transfer needed (zero copy)!
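A minimal host-side sketch of the allocation rules above, assuming a POSIX host (`posix_memalign`; on Windows the slide's `_aligned_malloc` plays the same role). The returned pointer would then be passed to `clCreateBuffer()` with `CL_MEM_USE_HOST_PTR`; `alloc_zero_copy` is an illustrative helper, not an OpenCL API:

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate a host buffer that satisfies the zero-copy requirements above:
   4096-byte (page) alignment and a size rounded up to a multiple of the
   64-byte cache line. */
static void *alloc_zero_copy(size_t nbytes, size_t *padded)
{
    size_t size = (nbytes + 63) & ~(size_t)63;  /* round up to 64 bytes */
    void *p = NULL;
    if (posix_memalign(&p, 4096, size) != 0)
        return NULL;
    if (padded)
        *padded = size;
    return p;
}
```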


    Maximizing EU Occupancy

    • Occupancy is a measure of EU thread utilization

    • Two primary things to consider:

    - Launch enough work items to keep EU threads busy

    - In short kernels: use short vector data types and compute multiple pixels to better amortize thread launch cost

    For example, color conversion:

    Before: one pixel per work item

    __global uchar* src, dst;
    p = src[src_idx] * B2Y + src[src_idx + 1] * G2Y + src[src_idx + 2] * R2Y;
    dst[dst_idx] = p;

    After: four pixels per work item

    __global uchar* src_ptr, dst_ptr;
    uchar16 src = vload16(0, src_ptr);
    uchar4 c0 = src.s048c;
    uchar4 c1 = src.s159d;
    uchar4 c2 = src.s26ae;
    uchar4 Y = c0 * B2Y + c1 * G2Y + c2 * R2Y;
    vstore4(Y, 0, dst_ptr);
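A scalar C reference for the four-pixel version: the 16 loaded bytes hold four 4-byte BGRx pixels, and the `.s048c`/`.s159d`/`.s26ae` swizzles deinterleave the B, G, and R channels. The weights are parameters here because the slide does not define `B2Y`/`G2Y`/`R2Y`; `bgrx4_to_y` is an illustrative name:

```c
/* Scalar reference for the "four pixels per work item" version: 16 bytes
   hold four 4-byte BGRx pixels; channels are deinterleaved (what the
   .s048c / .s159d / .s26ae swizzles do) and combined with fixed weights. */
static void bgrx4_to_y(const unsigned char src[16],
                       int b2y, int g2y, int r2y,
                       int y[4])
{
    for (int p = 0; p < 4; ++p)
        y[p] = src[4 * p + 0] * b2y   /* .s048c lanes: B channel */
             + src[4 * p + 1] * g2y   /* .s159d lanes: G channel */
             + src[4 * p + 2] * r2y;  /* .s26ae lanes: R channel */
}
```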


    Maximize Compute Performance

    • Use floats instead of integer data types

    - Because an EU can issue two float operations per cycle

    • Floating-point throughput depends on the data width

    - float16 throughput = 2 x float32 throughput

    - float32 throughput = 4 x float64 throughput

    • Trade accuracy for speed, where appropriate

    - Use “native” built-ins (or use -cl-fast-relaxed-math)

    - Use mad() / fma()(or use -cl-mad-enable)

    x = cos(i); x = native_cos(i);


    Avoid Divergent Control Flow

    “SIMT” ISA with Predication and Branching

    “Divergent” code executes both branches

    Reduced SIMD Efficiency, Increased Power and Exec Time

    this();

    if ( x )

    that();

    else

    another();

    finish();

    SIMD lane

    time

    Example: “x” sometimes true

    SIMD lane

    time

    Example: “x” never true
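A small sketch of the cost model behind this slide: with predication, a SIMD batch pays for both paths whenever its lanes disagree on the branch. All names and cycle counts here are illustrative:

```c
/* Sketch of why divergence hurts: for one SIMD batch, if any lane takes
   the "if" and any lane takes the "else", the hardware executes BOTH
   paths under predication, so the batch pays if_cost + else_cost instead
   of just one of them. Returns the cycles the batch spends. */
static int branch_cost(const int *lane_takes_if, int width,
                       int if_cost, int else_cost)
{
    int any_if = 0, any_else = 0;
    for (int i = 0; i < width; ++i) {
        if (lane_takes_if[i]) any_if = 1;
        else                  any_else = 1;
    }
    return (any_if ? if_cost : 0) + (any_else ? else_cost : 0);
}
```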


    Optimizing Data Accesses


    Take Advantage of Large Register Space

    • Each work item in an OpenCL™ kernel has access to up to 256-512 bytes of register space

    • Bandwidth to registers faster than any memory

    • Loading and processing blocks of pixels in registers is very efficient!

    float sum[PX_PER_WI_X] = { 0.0f };
    float k[KERNEL_SIZE_X];
    float d[PX_PER_WI_X + KERNEL_SIZE_X];   // allocated in registers
    // Load filter kernel in k, input data in d...
    // Compute convolution
    for (px = 0; px < PX_PER_WI_X; ++px)
        for (sx = 0; sx < KERNEL_SIZE_X; ++sx)
            sum[px] = mad(k[sx], d[px + sx], sum[px]);

    Use available registers (up to 512 bytes) instead of memory, where possible!


    Global and Constant Memory

    Global Memory Accesses go through the L3 Cache

    L3 cache line is 64 bytes

    EU thread accesses to the same cache line are collapsed

    • Order of data within cache line does not matter

    • Bandwidth determined by number of cache lines accessed

    • Maximum bandwidth (L3 → EU): 64 bytes / clock / sub slice

    Good: Load at least 32-bits of data at a time, starting from a 32-bit aligned address

    Best: Load 4 x 32-bits of data at a time, starting from a cache line aligned address

    • Loading more than 4 x 32-bits of data is not beneficial
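Since bandwidth on this path is set by the number of distinct 64-byte cache lines touched, not by the byte order within a line, a quick sketch of counting distinct lines for a set of byte addresses (an illustrative helper, not a real profiler):

```c
#include <stdint.h>

/* Count the distinct 64-byte L3 lines touched by a set of byte addresses.
   Contiguous lanes collapse into one line; strided lanes can each pull a
   separate line. Capped at 64 addresses, enough for this sketch. */
static int cache_lines_touched(const uintptr_t *addr, int n)
{
    uintptr_t seen[64];
    int count = 0;
    for (int i = 0; i < n; ++i) {
        uintptr_t line = addr[i] / 64;
        int found = 0;
        for (int j = 0; j < count; ++j)
            if (seen[j] == line) { found = 1; break; }
        if (!found && count < 64)
            seen[count++] = line;
    }
    return count;
}
```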


    Example: Global and Constant Memory Accesses


    • Local memory accesses go through Shared Local Memory (SLM)

    • SLM sits next to the L3 cache in the architecture

    • Key difference: SLM is banked

    • Banked at 4-byte granularity, with 16 banks in total

    • Maximum bandwidth: still 64 bytes / clock / sub slice

    • Supports more access patterns with full bandwidth than Global memory:

    • Reading the same address from a bank => not a bank conflict

    • Reading different addresses from a bank => bank conflict

    • Maximum bandwidth is achieved when there is no bank conflict

    27

    Local Memory Accesses
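A sketch of the bank model described above: 4-byte granularity across 16 banks, so `bank = (byte_addr / 4) % 16`; reads of the same address from a bank broadcast and do not conflict. The helpers are illustrative:

```c
#include <stdint.h>

/* SLM bank model from the slide: 4-byte granularity, 16 banks. Two lanes
   conflict when they hit the same bank at DIFFERENT addresses; reads of
   the same address broadcast and do not conflict. */
static int slm_bank(uintptr_t byte_addr)
{
    return (int)((byte_addr / 4) % 16);
}

static int has_bank_conflict(const uintptr_t *addr, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (slm_bank(addr[i]) == slm_bank(addr[j]) &&
                addr[i] != addr[j])
                return 1;
    return 0;
}
```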


    Example: Local Memory Accesses


    OpenCL

    1. Introduction

    2. Optimizing OpenCL for Intel GPUs

    3. Using Shared Virtual Memory (SVM)

  • Shared Virtual Memory (Pre-history)

    Builds upon the "shared physical memory" (SPM) feature established with OpenCL 1.0 => the CL_MEM_USE_HOST_PTR flag

    Supported on Intel 3rd Gen processors with HD Graphics

    Eliminated buffer copy costs, aka "zero-copy" buffers*

    Buffer must have 4096-byte alignment and a size divisible by 64

    SPM has been available since 2011, but is still not used by many OpenCL apps…

    * See “Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics”

    https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics

  • Shared Virtual Memory (SVM) - Basics

    31

  • 3 types of SVM

    Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300): SVM buffers are mapped to either the CPU or the GPU at any given time; access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands.

    Fine-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5500+): SVM buffers can be accessed from either the CPU or the GPU at any time; use atomics to control access (if the CPU and GPU may try to modify the same memory location). Check CL_DEVICE_SVM_FINE_GRAIN_BUFFER for fine-grained buffer SVM support; CL_DEVICE_SVM_ATOMICS indicates atomics support.

    Fine-grain system memory (Future Intel Processors): the CPU and GPU can share anything allocated from the C-runtime heap (i.e., malloc/new).

  • 3 types of SVM

    Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300): SVM buffers are mapped to either the CPU or the GPU at any given time; access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands.

    Un-mapped state: only the GPU can access the buffer

  • 3 types of SVM

    Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300): SVM buffers are mapped to either the CPU or the GPU at any given time; access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands.

    Mapped state: only the CPU can access the buffer

  • 3 types of SVM

    Fine-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5500+): SVM buffers can be accessed from both the CPU and the GPU at any time; atomics can be used to avoid race conditions.

    Check whether the device supports it (CL_DEVICE_SVM_FINE_GRAIN_BUFFER & CL_DEVICE_SVM_ATOMICS flags)

    A fine-grain SVM buffer allows simultaneous access from the CPU and GPU

  • 3 types of SVM

    Fine-grain system memory (Future Intel Processors): the CPU and GPU can share anything allocated from the C-runtime heap (i.e., malloc/new). This is the ideal end-state, and it requires convergence of OS, H/W, and API support.

    Full CPU/GPU memory coherency for all heap allocations

  • SVM API Basics

    37

  • SVM Kernel Setup

    38

  • 39

    Agenda

    • Compute programming on Intel Graphics with:

    • OpenCL

    • CilkPlus

    • Tools

    • Workload performance

  • Intel CilkPlus for Parallel Programming

    40

    • Adds language extensions to C++ programs to exploit parallelism. Comes in two flavors:

    • MIT Cilk

    • OpenMP

    • Kinds of parallelism exploited:

    • Task-level

    • Loop-level

    • SIMD-level

    • Originally designed for CPUs, now also supports Intel Graphics

    • Unlike OpenCL, no separation of host and device programs

  • Example: Serial C++ version

    void vecadd(int n, float *a, float *b, float *c)

    {

    for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }

    41

  • Example: Parallel CPU version (Cilk flavor)

    42

    void vecadd(int n, float *a, float *b, float *c)

    {

    _Cilk_for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }

  • Example: Intel Graphics version (Cilk flavor)

    43

    void vecadd(int n, float *a, float *b, float *c)

    {

    #pragma offload target(gfx) pin(a, b, c : length(n))

    _Cilk_for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }

  • Example: Parallel CPU version (OpenMP flavor)

    44

    void vecadd(int n, float *a, float *b, float *c)

    {

    #pragma omp parallel for

    for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }

  • Example: Intel Graphics version (OpenMP flavor)

    45

    void vecadd(int n, float *a, float *b, float *c)

    {

    #pragma omp target(gfx) \

    map(tofrom: a[0:n], b[0:n], c[0:n]) map(to: n)

    #pragma omp parallel for

    for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }


    CilkPlus keywords

    • Cilk Plus adds three keywords to C and C++:

    _Cilk_spawn

    _Cilk_sync

    _Cilk_for

    • If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn,

    cilk_sync, and cilk_for.

    • Cilk Plus runtime controls thread creation and scheduling.

    • For GFX offload _Cilk_for is supported

    – No Cilk Plus runtime on the target

    – Scheduling happens on the host side

    46


    cilk_for loop

    • Looks like a normal for loop.

    cilk_for (int x = 0; x < 1000000; ++x) { … }

    • Any or all iterations may execute in parallel with one another.

    • All iterations complete before program continues.

    • Constraints:

    – Limited to a single control variable.

    – Must be able to jump to the start of any iteration at random.

    – Iterations should be independent of one another.

    47


    Array Notation (1)

    • Use a “:” in array subscripts to operate on multiple elements

    A[:] // all of array A

    A[lower_bound : length]

    A[lower_bound : length : stride]

    48

    Explicit Data Parallelism Based on C/C++ Arrays
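The triplet `A[lower : length : stride]` names `length` elements starting at index `lower`, spaced `stride` apart. As a plain-C statement of that semantics (Cilk Plus executes the whole assignment with parallel semantics; `section_assign` is an illustrative name):

```c
/* Plain-C semantics of the array-section triplet A[lower:length:stride]:
   "length" elements starting at "lower", "stride" apart. The loop is only
   the meaning; Cilk Plus may run the lanes in parallel. */
static void section_assign(float *dst, const float *src,
                           int lower, int length, int stride)
{
    for (int k = 0; k < length; ++k)
        dst[lower + k * stride] = src[lower + k * stride];
}
```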


    Array Notation (2)

    • Example: A[:] = B[:]+C[:]

    • An extension to C/C++

    • Perform operations on sections of arrays in parallel

    • Extension has parallel semantics

    • Well suited for code that:

    – performs per-element operations on arrays,

    – without an implied order between them

    – with an intent to execute in vector instructions

    49


    Operations on Array Sections (1)

    • C/C++ operators

    d[:] = a[:] + (b[:] * c[:])

    • Function calls

    b[:] = foo(a[:]); // Call foo() on each element of a[]

    • Reductions combine array elements to get a single result

    // Add all elements of a[]

    sum = __sec_reduce_add(a[:]);

    // More reductions exist...

    • If-then-else and conditional operators allow masked operations

    if (mask[:]) {

    a[:] = b[:]; // If mask[i] is true, a[i]=b[i]

    }

    50
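Plain-C statements of the two idioms above, the `__sec_reduce_add` reduction and the masked assignment (reference implementations with illustrative names, not the Cilk Plus runtime):

```c
/* Plain-C meaning of the idioms above: __sec_reduce_add(a[:]) sums every
   element, and "if (mask[:]) a[:] = b[:]" assigns only where the mask
   lane is true. */
static float sec_reduce_add_ref(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

static void masked_assign_ref(float *a, const float *b,
                              const int *mask, int n)
{
    for (int i = 0; i < n; ++i)
        if (mask[i])
            a[i] = b[i];
}
```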


    Operations on Array Sections (2)

    • Implicit index fills array with index values

    – __sec_implicit_index(0) // index along the 1st rank of the section

    – __sec_implicit_index(1) // index along the 2nd rank of the section

    51

    // fill A with values 0,1,2,3,4....

    A[:] = __sec_implicit_index(0);

    // fill B[i][j] with i+j

    B[:][:] = __sec_implicit_index(0) + __sec_implicit_index(1);

    // fill the lower-left triangle of C with 1

    C[0:n][0:__sec_implicit_index(0)] = 1;
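A plain-C statement of what the three fills above produce, for an n-by-n row-major matrix (`fill_implicit` is an illustrative name):

```c
/* Plain-C meaning of the three __sec_implicit_index() fills above,
   for n-by-n matrices stored row-major. */
static void fill_implicit(int n, int *A, int *B, int *C)
{
    for (int i = 0; i < n; ++i) {
        A[i] = i;                    /* A[:] = index along rank 0 */
        for (int j = 0; j < n; ++j)
            B[i * n + j] = i + j;    /* B[i][j] = i + j */
        for (int j = 0; j < i; ++j)  /* row i gets columns 0..i-1 set: */
            C[i * n + j] = 1;        /* the lower-left triangle */
    }
}
```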


    SIMD Functions (1)

    • A general construct to express data parallelism:

    – Write a function to describe the operation on a single element

    • Annotate it with one of:

    __declspec(vector)

    __attribute__((vector))

    – Invoke the function across a parallel data structure (arrays) or from

    within a vectorizable loop.

    52


    SIMD Functions (2)

    • Polymorphic: a vectorizing compiler may create both array and

    scalar versions of the function.

    • Writing the function is independent of its invocation

    – The function can be invoked on scalars, within serial for or cilk_for loops,

    using array notation, etc.

    53


    SIMD Functions - Example

    • Defining an elemental function:

    __declspec(vector)
    double option_price_call_black_scholes(
        double S, double K, double r, double sigma, double time)
    {
        double time_sqrt = sqrt(time);
        double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
        double d2 = d1 - (sigma*time_sqrt);
        return S*N(d1) - K*exp(-r*time)*N(d2);
    }

    • Invoking the elemental function:

    // The following loop can also use cilk_for
    call[0:N] = option_price_call_black_scholes(S[0:N], K[0:N], r, sigma, time[0:N]);

    The compiler breaks the data into SIMD vectors and calls the function on each vector

    54


    SIMD Annotations

    • The loop annotation informs the compiler that the vectorized loop will have the same

    semantics as the serial loop:

    void f(float *a, const float *b, const int *e, int n)

    {

    #pragma simd

    for (int i = 0; i < n; ++i)

    a[i] = 2 * b[e[i]];

    }

    • Currently implemented as a pragma, but other methods of annotating the loop could be considered.

    • Additional clauses support reductions and other vectorization guidance (matching OpenMP*)

    Potential aliasing and loop-carried dependencies would thwart auto-vectorization

    55


    CilkPlus Example: N-Body simulation


    N-Body simulation

    - N particles

    - Each has its own mass, position and velocity in 3D space

    - Particles are moving by influence of mutual gravitational forces

    - Classic simulation has O(N2) computational complexity


    N-Body simulation: main loop

    _Cilk_for (int i = start; i < end; ++i) {
        Vector3 acc = 0.0f;
        for (int j = 0; j < body_count; ++j) {
            Vector3 dist = old_pos[j] - old_pos[i];
            float len = sqrtf(dot(dist, dist) + epsilon);
            acc += dist * masses[j] / (len * len * len);
        }
        new_vel[i] = old_vel[i] + acc * time;
        new_pos[i] = old_pos[i] + old_vel[i] * time + acc * time * time / 2;
    }

    Straightforward approach is to add #pragma offload to the main loop


    N-Body simulation: #pragma offload

    #pragma offload target(gfx) pin(old_pos, old_vel, new_pos, new_vel: length(body_count))

    _Cilk_for (int i = start; i < end; ++i) {

    Vector3 acc = 0.0f;

    for (int j = 0; j < body_count; ++j) {

    Vector3 dist = old_pos[j] - old_pos[i];

    float len = sqrtf(dot(dist, dist) + epsilon);

    acc += dist * masses[j] / (len * len * len);

    }

    new_vel[i] = old_vel[i] + acc * time;

    new_pos[i] = old_pos[i] + old_vel[i] * time + acc * time * time / 2.0f;

    }


    N-Body simulation: Blocking

    #pragma offload target(gfx) pin(old_pos, old_vel, new_pos, new_vel : length(body_count))
    _Cilk_for (int i = start; i < end; i += TILE) {
        GFXVector3 pos[TILE];
        GFXVector3 acc[TILE];
        pos[0:TILE] = old_pos[i:TILE];
        acc[0:TILE] = 0.0f;
        for (int j = 0; j < body_count; j += TILE) {
            GFXVector3 tpos[TILE];
            float tmass[TILE];              // masses are scalars
            tpos[0:TILE] = old_pos[j:TILE];
            tmass[0:TILE] = masses[j:TILE];
            for (int t = 0; t < TILE; t++) {
                for (int k = 0; k < TILE; k++) {
                    GFXVector3 dist = tpos[t] - pos[k];
                    float inv_len = 1.0f / sqrtf(dot(dist, dist) + epsilon);
                    acc[k] += dist * tmass[t] * inv_len * inv_len * inv_len;
                }
            }
        }
        new_vel[i:TILE] = old_vel[i:TILE] + acc[0:TILE] * time;
        new_pos[i:TILE] = pos[0:TILE] + old_vel[i:TILE] * time + acc[0:TILE] * time * time / 2.0f;
    }


    CilkPlus advanced features

    • Static data declaration

    • Separate file compilation (linking)

    • Recursive functions

    • Shared Local Memory (SLM)

    • Shared Virtual Memory (SVM)

    61


    Where to get Intel OpenCL and Intel CilkPlus?

    • Intel OpenCL:

    – https://software.intel.com/en-us/intel-opencl

    • Intel CilkPlus:

    – https://software.intel.com/en-us/intel-cilk-plus

    62


  • 63

    Agenda

    • Compute programming on Intel Graphics with:

    • Tools

    • VTune

    • GT-Pin

    • Workload performance

  • VTune

    64

    • A platform-wide performance profiler:

    • Memory accesses, storage and IO analysis, interrupts, CPU/GPU concurrency, …

    • Intel GPU analysis in VTune:

    What compute APIs are used (OpenCL, CilkPlus, Media SDK)

    When and what GPU units were utilized by those APIs

    A rich set of hardware metrics showing how the actual machine was utilized

    Hints for performance issues

    OpenCL source / Gen assembly view, with some ability to map performance data back to the original code

    https://software.intel.com/en-us/intel-vtune-amplifier-xe


  • CPU + GPU Utilization (on a Media + OpenCL app.)

    65

  • SW Queue and GPU engines utilization … and performance data attributed to them

    (Timeline view: OpenCL kernels, GEN GPU engines utilization, GPU HW metrics over time, OpenCL host-side API calls, OpenCL queue)

    Compute API Profiling (on an OpenCL application)

    66

  • Architecture Diagram

    67

  • EU Dynamic Instruction Count

    68

  • EU Instruction Latency

    69

  • GT-Pin

    70

    • A Pin-like binary instrumentation tool for the EU in Intel GPUs

    • Command-line interface

    • Used to build a wide range of tools for:

    • Performance analysis, workload tracing, debugging

    • Providing instruction count and latency data to VTune

    • Support:

    • OpenCL, CilkPlus, DirectX, OpenGL

    • Windows, Linux, Android, OS X

    • Will provide API to allow users to write their own tools

    • Availability: first public release in 2016

  • Sample GT-Pin Tool: Opcodeprof

    71

  • Workload Analysis with Opcodeprof

    72

  • Sample GT-Pin Tool: Cacheprof (a cache simulator)

    73

    • Trace accesses to data ports from EU

    • Simulate Intel GPU cache hierarchy:

    • Can also dump memory traces to files for offline analysis

    Simulated caches: L3 Cache, LLC, Instruction Cache, Constant Cache, Sampler Cache, Render Cache

  • Working-set Size Analysis with Cacheprof

    74

    (Figure: working-set size curve; benchmark = gaussian_blur filter)

  • Advanced Tool: GT-PinPoints

    75

    • A tool to find representative regions in OpenCL traces for GPU simulations and workload analysis

    • Based on the proven Simpoint methodology

    • Overview:

    • Results:

    • 223x simulation speedup for 3.0% error or

    • 35x simulation speedup for 0.3% error

    http://www.cs.columbia.edu/~melanie/iiswc2015.pdf


  • 76

    Agenda

    • Compute programming on Intel Graphics with:

    • Tools

    • Workload performance

    • Photoshop

    • Database

  • Adobe Photoshop

    77

    • The de facto industry-standard raster graphics editing software, developed by Adobe

    • Our workloads have 22 feature tests, each of which does image processing such as applying filters, blurring images, and adding effects

  • Photoshop Experimental Setup

    78

    Compare the products with the highest theoretical FLOPs from Intel and Nvidia:

    • Intel Graphics (integrated)

    • Iris Pro 6200 (Broadwell GT3e)

    • Theoretical peak = 883 GFLOPS

    • TDP = 47 W

    • Actual power measured via a public tool called GPU-Z

    • Nvidia GPU (discrete)

    • GTX Titan Z

    • Theoretical FLOPs = 8122 GFlops

    • TDP = 375 W

    • Actual power measured via an Nvidia tool called nvidia-smi
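The theoretical FLOPs figures are consistent with the usual peak-rate formula. The unit counts and clocks below are the publicly listed specs, used here as a plausible reconstruction of the arithmetic rather than an official derivation:

```python
def peak_gflops(units, flops_per_unit_per_cycle, ghz):
    """Theoretical peak = units x FLOPs/unit/cycle x clock (GHz)."""
    return units * flops_per_unit_per_cycle * ghz

# Iris Pro 6200: 48 EUs; each EU has 2 SIMD-4 FP32 pipes, and an FMA
# counts as 2 FLOPs => 2 * 4 * 2 = 16 FLOPs/EU/cycle, at ~1.15 GHz max.
iris = peak_gflops(48, 16, 1.15)        # ~883 GFLOPS

# GTX Titan Z: 5760 CUDA cores, FMA = 2 FLOPs/core/cycle, ~0.705 GHz boost.
titan = peak_gflops(5760, 2, 0.705)     # ~8122 GFLOPS
```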

  • Performance Comparison

    79

    Nvidia is 1-2.5x faster

    Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.

  • GPU Power Measured

    80

    Intel consumes 8-14x less power

    Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.

  • GPU Energy Consumed

    81

    Intel consumes 5-16x less energy

    Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.
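The energy result follows from the two previous slides: energy is power integrated over time, so the relative energy saving is the power saving divided by the slowdown. A sketch with a hypothetical data point inside the ranges above:

```python
def relative_energy_saving(power_saving, slowdown):
    """Energy = power x time: a chip drawing 1/power_saving the power but
    running slowdown times longer saves power_saving/slowdown in energy."""
    return power_saving / slowdown

# Hypothetical point inside the quoted ranges: 10x less power while
# running 2x slower still nets a 5x energy saving.
saving = relative_energy_saving(10, 2)   # 5.0
```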

  • Comparing Integrated and Discrete GPUs on Database Processing

    82

    Paper:

    Unleashing the Hidden Power of Integrated-GPUs for Database Co-Processing, by E. Ching et al. from FutureWei Technologies

    http://subs.emis.de/LNI/Proceedings/Proceedings232/1755.pdf

  • Experimental setup

    83

  • Data transfer time

    84

    Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.
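The data-transfer advantage of the integrated GPU comes from sharing physical DRAM with the CPU: with zero-copy or SVM there is no host-to-device copy at all, while a discrete GPU must first ship the data across PCIe. A back-of-the-envelope sketch with hypothetical numbers:

```python
def transfer_ms(num_bytes, gb_per_s):
    """Time (ms) to move a buffer at a sustained bandwidth (GB/s)."""
    return num_bytes / (gb_per_s * 1e9) * 1e3

# Hypothetical: copying a 1 GB table over PCIe 3.0 x16 at ~12 GB/s
# sustained costs ~83 ms before a discrete GPU can even start, while an
# integrated GPU sharing DRAM with the CPU can skip the copy entirely.
pcie_ms = transfer_ms(1e9, 12)
```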

  • TPC-H decision support benchmark (TPC-H Q1)

    85

    Execution time (lower is better) Throughput per Watt (higher is better)

    Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.

  • TPC-H decision support benchmark (TPC-H Q9)

    86

    Execution time (lower is better) Throughput per Watt (higher is better)

    Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.

  • Summary

    87

    • Intel processor graphics is integrated on die with the CPU:

    • No extra cost for GPU

    • High performance

    • Energy efficient

    • Intel processor graphics comes with a rich software ecosystem:

    • Support for most standard graphics/compute programming APIs

    • Modern highly-optimizing compiler

    • Helpful tools

    • Call to action:

    • Try programming the integrated graphics on your Intel-based laptop!

    • Visit this tutorial’s webpage for the slides and more information

  • FTC disclaimer

    88

    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

  • INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.


    © 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Atom, Core, Iris, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

    89

    Legal Disclaimer and Optimization Notice

    Optimization Notice

    Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

    Notice revision #20110804

  • Legal Notices and Disclaimers

    Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

    No computer system can be absolutely secure.

    Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

    Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries.

    *Other names and brands may be claimed as the property of others.

    © 2015 Intel Corporation.
