TRANSCRIPT
-
PROGRAMMING Intel® Processor Graphics
Chi-Keung (CK) Luk - Intel Principal Engineer
Intel Software & Services Group
-
Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
• Compute programming on Intel Graphics with:
• OpenCL
• CilkPlus
• Tools
• Workload performance
2
Agenda
-
• OpenCL:
Khronos
Uri Levy, Yuval Eshkol, Doron Singer
Robert Ioffe, Aaron Kunze, Ben Ashbaugh, Stephen Junkins, Michal Mrozek
• CilkPlus:
Knud J Kirkegaard, Anoop Madhusoodhanan Prabha, Konstantin Bobrovsky, Sergey Dmitriev
• VTune:
Alexandr Kurylev, Julia Fedorova
• Workload Performance:
Sharad Tripathi, Chinang Ma, Akhila Vidiyala
Edward Ching, Norbert Egi, Masood Mortazavi, Vivent Cheng, Guangyu Shi
3
Acknowledgments for Slide Sources
-
4
OpenCL
1. Introduction
2. Optimizing OpenCL for Intel GPUs
3. Using Shared Virtual Memory (SVM)
-
• An open standard managed by the Khronos group
• A set of C-based APIs on the host that defines a run-time environment
• Programs written in a C-based language (C++ support since OpenCL 2.1) that run on the device(s)
OpenCL* (Open Computing Language)
-
Mostly C-like
Kernels (functions which execute a work-item) have a kernel prefix and a void return type
No support for library functions
No stdio.h / stdlib.h / math.h / etc.
But printf is supported
Based on C99
kernel void foo (global int* ptr)
{
    …
    for (int i = 0; i < …; i++) { … }
}
-
Scalar data types
char / uchar / short / ushort / int / uint / long / ulong
float / double
size_t
Pointers
Derived data types
Arrays
Structures
Vector data types
Supported data types
-
Vectors exist for all scalar types
Vector widths are 2, 3, 4, 8, 16
All arithmetic operations work on vector types
Component access (XYZW)
Vectors wider than 4 use numeric (hexadecimal) indices (.s0 through .sf)
Working with vectors
uint3 vec0, vec1;
…
uint3 result = vec0 + vec1;
double res1 = dvec0.x + dvec1.z;
double2 res2 = dvec0.wy + dvec1.xx; // Swizzle
float res1 = vec16.s5 + vec16.sf;
float2 res2 = vec8.s37 + vec16.sca; // Swizzle
-
Functions used for querying the index
get_global_id(int dimension); // Index of work-item in entire execution
get_local_id(int dimension); // index of work-item in work-group
get_group_id(int dimension); // index of work-group
A few others..
Memory
No support for dynamic memory allocation
When passing buffers as arguments, specify global / local / constant
Additional details
kernel
void foo (global const int* ptr, local float* scratch)
{
    …
}
-
Many functions supported
Overloaded to all relevant types and vector widths
A small sample:
Math: sin / cos / min / max / log / pow / sqrt / ….
Geometric: dot / cross / distance / length / …
Relational: isequal / isgreater / all / any / select / …
Built-in functions
kernel void
dot_product(global const float4* a, global const float4* b, global float* out)
{
    size_t tid = get_global_id(0);
    out[tid] = dot(a[tid], b[tid]);
}
-
• Scalar values
• Use “normal” (C-style) casts
• Vector values
• Vector conversions must be done explicitly
• Source and destination types must have the same vector width
• Example:
• Vector construction (from scalars)
Casts and type conversions
dstValue = convert_destType(srcValue)
int8 intVec;
double8 dVec;
float8 fVec = convert_float8(intVec);
float8 fVec2 = convert_float8(dVec);
ushort3 vecUshort = (ushort3)(0, 12, 7);
-
Based on C99
With some extra features:
Vector data types
Extensive Built-in functions library
Image handling, Work-group synchronization
Minus others:
Recursion
Function pointers
Pointers to pointers
Dynamic memory allocation
Language summary
-
18
OpenCL
1. Introduction
2. Optimizing OpenCL for Intel GPUs
Use zero copying to transfer data between CPU and GPU
Maximize EU occupancy
Maximize compute performance
Avoid Divergent Control Flow
Take advantage of large register space
Optimize data accesses
3. Using Shared Virtual Memory (SVM)
-
19
Optimizing Host to Device Transfers: Zero Copying
• Host (CPU) and Device (GPU) share the same physical memory
• For buffers allocated through the OpenCL™ runtime:
- Let the OpenCL runtime allocate the memory:
Create the buffer with the CL_MEM_ALLOC_HOST_PTR flag (and a NULL host pointer)
- OR, use pre-allocated system memory:
Create the buffer with your system memory pointer and CL_MEM_USE_HOST_PTR
Allocate the memory aligned to a page (4096 bytes) (e.g., use _aligned_malloc or memalign)
Allocate a size that is a multiple of the cache line size (64 bytes)
- Use clEnqueueMapBuffer() to access the data
No transfer needed (zero copy)!
-
20
Maximizing EU Occupancy
• Occupancy is a measure of EU thread utilization
• Two primary things to consider:
- Launch enough work items to keep EU threads busy
- In short kernels: use short vector data types and compute multiple pixels to better amortize thread launch cost
For example, color conversion:
Before: one pixel per work item
__global uchar* src, dst;
uchar p = src[src_idx] * B2Y +
          src[src_idx + 1] * G2Y +
          src[src_idx + 2] * R2Y;
dst[dst_idx] = p;
After: four pixels per work item
__global uchar* src_ptr, dst_ptr;
uchar16 src = vload16(0, src_ptr);
uchar4 c0 = src.s048c;
uchar4 c1 = src.s159d;
uchar4 c2 = src.s26ae;
uchar4 Y = c0 * B2Y + c1 * G2Y + c2 * R2Y;
vstore4(Y, 0, dst_ptr);
-
21
Maximize Compute Performance
• Use floats instead of integer data types
- Because an EU can issue two float operations per cycle
• Floating-point throughput depends on the data width
- half (16-bit) throughput = 2 x float (32-bit) throughput
- float (32-bit) throughput = 4 x double (64-bit) throughput
• Trade accuracy for speed, where appropriate
- Use “native” built-ins (or use -cl-fast-relaxed-math)
- Use mad() / fma() (or use -cl-mad-enable)
x = cos(i); => x = native_cos(i);
-
22
Avoid Divergent Control Flow
“SIMT” ISA with Predication and Branching
“Divergent” code executes both branches
Reduced SIMD Efficiency, Increased Power and Exec Time
this();
if ( x )
that();
else
another();
finish();
(Diagrams: SIMD lanes over time for two cases: "x" sometimes true, and "x" never true)
-
23
Optimizing Data Accesses
-
24
Take Advantage of Large Register Space
• Each work item in an OpenCL™ kernel has access to up to 256-512 bytes of register space
• Bandwidth to registers faster than any memory
• Loading and processing blocks of pixels in registers is very efficient!
float sum[PX_PER_WI_X] = { 0.0f };
float k[KERNEL_SIZE_X];
float d[PX_PER_WI_X + KERNEL_SIZE_X];
// Load filter kernel in k, input data in d
...
// Compute convolution
for (px = 0; px < PX_PER_WI_X; ++px)
    for (sx = 0; sx < KERNEL_SIZE_X; ++sx)
        sum[px] = mad(k[sx], d[px + sx], sum[px]);
(The arrays above are allocated in registers.)
Use available registers (up to 512 bytes) instead of memory, where possible!
-
25
Global and Constant Memory
Global Memory Accesses go through the L3 Cache
L3 cache line is 64 bytes
EU thread accesses to the same cache line are collapsed
• Order of data within cache line does not matter
• Bandwidth determined by number of cache lines accessed
• Maximum bandwidth (L3 => EU): 64 bytes / clock / sub-slice
Good: Load at least 32-bits of data at a time, starting from a 32-bit aligned address
Best: Load 4 x 32-bits of data at a time, starting from a cache line aligned address
• Loading more than 4 x 32-bits of data is not beneficial
-
26
Example: Global and Constant Memory Accesses
-
• Local memory accesses go through SLM (Shared Local Memory)
• SLM sits next to the L3 cache in the architecture
• Key difference: SLM is banked
• Banked at 4-byte granularity, with 16 banks in total
• Maximum bandwidth: still 64 bytes / clock / sub slice
• Supports more access patterns with full bandwidth than Global memory:
• Reading the same address from a bank => not a bank conflict
• Reading different addresses from a bank => bank conflict
• Maximum bandwidth is achieved when there are no bank conflicts
27
Local Memory Accesses
-
28
Example: Local Memory Accesses
-
29
OpenCL
1. Introduction
2. Optimizing OpenCL for Intel GPUs
3. Using Shared Virtual Memory (SVM)
-
Shared Virtual Memory (Pre-history)
Builds upon the "shared physical memory" (SPM) feature established with OpenCL 1.0 => the CL_MEM_USE_HOST_PTR flag
Supported on Intel 3rd Gen processors with HD Graphics
Eliminated buffer copy costs, aka "zero-copy" buffers*
Buffer must have 4096-byte alignment and a size divisible by 64
SPM available since 2011, but still not used by many OpenCL apps…
* See “Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics”
https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics
-
Shared Virtual Memory (SVM) - Basics
31
-
3 types of SVM
1. Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300):
SVM buffers are mapped to either the CPU or the GPU at any given time
Access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands
2. Fine-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5500+):
SVM buffers can be accessed from either CPU or GPU at any time
Use atomics to control access (if the CPU and GPU may try to modify the same memory location)
Check CL_DEVICE_SVM_FINE_GRAIN_BUFFER for fine-grain buffer support, CL_DEVICE_SVM_ATOMICS for atomics support
3. Fine-grain system memory (future Intel processors):
The CPU and GPU can share anything allocated from the C runtime heap (i.e., malloc/new)
-
3 types of SVM
Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300)
SVM buffers are mapped to either CPU or GPU at any given time
Access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands
Un-mapped state: Only GPU can access buffer
-
3 types of SVM
Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300)
SVM buffers are mapped to either CPU or GPU at any given time
Access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands
Mapped state: Only CPU can access buffer
-
3 types of SVM
Fine-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5500+)
SVM buffers can be accessed from both CPU and GPU at any time
Can use atomics to avoid race conditions
Check if the device supports this (the CL_DEVICE_SVM_FINE_GRAIN_BUFFER and CL_DEVICE_SVM_ATOMICS capability flags)
Fine grain SVM buffer allows simultaneous access from CPU & GPU
-
3 types of SVM
Fine-grain system memory (Future Intel Processors)
CPU & GPU can share anything allocated from the C-runtime heap (i.e., malloc/new)
Ideal end-state; requires convergence of OS, H/W, and API support
Full CPU/GPU memory coherency for all heap allocations
-
SVM --- API Basics
37
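This slide's code screenshot did not survive transcription. The following is a sketch of the coarse-grain SVM host flow, using the standard OpenCL 2.0 entry points (clSVMAlloc, clEnqueueSVMMap/Unmap, clSetKernelArgSVMPointer); `ctx`, `queue`, `kernel`, and `n` are assumed to exist already, and the snippet is not runnable without an OpenCL 2.0 device:

```c
// Allocate a coarse-grain SVM buffer (context created earlier, not shown)
float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0);

// CPU side: map before touching the buffer
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, NULL, NULL);
for (size_t i = 0; i < n; ++i) data[i] = (float)i;   // fill on the CPU
clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);       // hand it back to the GPU

// GPU side: pass the same pointer straight to the kernel
clSetKernelArgSVMPointer(kernel, 0, data);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

// Map again to read results on the CPU, then free
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_READ, data, n * sizeof(float), 0, NULL, NULL);
/* ... consume data ... */
clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);
clSVMFree(ctx, data);
```

Note there is no clEnqueueReadBuffer/WriteBuffer anywhere: the same pointer is valid on both sides, and the map/unmap calls only transfer ownership.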
-
SVM --- Kernel Setup
38
-
39
Agenda
• Compute programming on Intel Graphics with:
• OpenCL
• CilkPlus
• Tools
• Workload performance
-
Intel CilkPlus for Parallel Programming
40
• Adds language extensions to C/C++ programs to exploit parallelism. Comes in two flavors:
• MIT Cilk
• OpenMP
• Kinds of parallelism exploited:
• Task-level
• Loop-level
• SIMD-level
• Originally designed for CPUs, now also supports Intel Graphics
• Unlike OpenCL, no separation of host and device programs
-
Example: Serial C++ version
void vecadd(int n, float *a, float *b, float *c)
{
for (int i=0; i < n; i++) {
a[i] = b[i] + c[i];
}
}
41
-
Example: Parallel CPU version (Cilk flavor)
42
void vecadd(int n, float *a, float *b, float *c)
{
_Cilk_for (int i=0; i < n; i++) {
a[i] = b[i] + c[i];
}
}
-
Example: Intel Graphics version (Cilk flavor)
43
void vecadd(int n, float *a, float *b, float *c)
{
#pragma offload target(gfx) pin(a, b, c : length(n))
_Cilk_for (int i=0; i < n; i++) {
a[i] = b[i] + c[i];
}
}
-
Example: Parallel CPU version (OpenMP flavor)
44
void vecadd(int n, float *a, float *b, float *c)
{
#pragma omp parallel for
for (int i=0; i < n; i++) {
a[i] = b[i] + c[i];
}
}
-
Example: Intel Graphics version (OpenMP flavor)
45
void vecadd(int n, float *a, float *b, float *c)
{
#pragma omp target(gfx) \
map(tofrom: a[0:n], b[0:n], c[0:n]) map(to: n)
#pragma omp parallel for
for (int i=0; i < n; i++) {
a[i] = b[i] + c[i];
}
}
-
CilkPlus keywords
• Cilk Plus adds three keywords to C and C++:
_Cilk_spawn
_Cilk_sync
_Cilk_for
• If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.
• Cilk Plus runtime controls thread creation and scheduling.
• For GFX offload _Cilk_for is supported
– No Cilk Plus runtime on the target
– Scheduling happens on the host side
46
-
cilk_for loop
• Looks like a normal for loop.
cilk_for (int x = 0; x < 1000000; ++x) { … }
• Any or all iterations may execute in parallel with one another.
• All iterations complete before program continues.
• Constraints:
– Limited to a single control variable.
– Must be able to jump to the start of any iteration at random.
– Iterations should be independent of one another.
47
-
Array Notation (1)
• Use a “:” in array subscripts to operate on multiple elements
A[:] // all of array A
A[lower_bound : length]
A[lower_bound : length : stride]
48
Explicit Data Parallelism Based on C/C++ Arrays
-
Array Notation (2)
• Example: A[:] = B[:]+C[:]
• An extension to C/C++
• Perform operations on sections of arrays in parallel
• Extension has parallel semantics
• Well suited for code that:
– performs per-element operations on arrays,
– without an implied order between them
– with an intent to execute in vector instructions
49
-
Operations on Array Sections (1)
• C/C++ operators
d[:] = a[:] + (b[:] * c[:])
• Function calls
b[:] = foo(a[:]); // Call foo() on each element of a[]
• Reductions combine array elements to get a single result
// Add all elements of a[]
sum = __sec_reduce_add(a[:]);
// More reductions exist...
• If-then-else and conditional operators allow masked operations
if (mask[:]) {
a[:] = b[:]; // If mask[i] is true, a[i]=b[i]
}
50
-
Operations on Array Sections (2)
• Implicit index fills array with index values
– __sec_implicit_index(0) // index along the 1st rank of the section
– __sec_implicit_index(1) // index along the 2nd rank of the section
51
// fill A with values 0,1,2,3,4....
A[:] = __sec_implicit_index(0);
// fill B[i][j] with i+j
B[:][:] = __sec_implicit_index(0) + __sec_implicit_index(1);
// fill the lower-left triangle of C with 1
C[0:n][0:__sec_implicit_index(0)] = 1;
-
SIMD Functions (1)
• A general construct to express data parallelism:
– Write a function to describe the operation on a single element
• Annotate it with one of:
__declspec(vector)
__attribute__((vector))
– Invoke the function across a parallel data structure (arrays) or from
within a vectorizable loop.
52
-
SIMD Functions (2)
• Polymorphic: a vectorizing compiler may create both array and scalar versions of the function.
• Writing the function is independent of its invocation
– The function can be invoked on scalars, within serial for or cilk_for loops, using array notation, etc.
53
-
SIMD Functions - Example
• Defining an elemental function:
__declspec(vector)
double option_price_call_black_scholes(
    double S, double K, double r, double sigma, double time)
{
    double time_sqrt = sqrt(time);
    double d1 = (log(S/K) + r*time)/(sigma*time_sqrt) + 0.5*sigma*time_sqrt;
    double d2 = d1 - (sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}
• Invoking the elemental function:
// The following loop can also use cilk_for
call[0:N] = option_price_call_black_scholes(S[0:N], K[0:N], r, sigma, time[0:N]);
Compiler breaks data into SIMD vectors
and calls function on each vector
54
-
SIMD Annotations
• A loop annotation informs the compiler that the vectorized loop has the same semantics as the serial loop:
void f(float *a, const float *b, const int *e, int n)
{
#pragma simd
for (int i = 0; i < n; ++i)
a[i] = 2 * b[e[i]];
}
• Currently implemented as a pragma, but other methods of annotating the loop can be considered.
• Additional clauses for reductions and other vectorization guidance (matching OpenMP*)
Potential aliasing and loop-carried dependencies would thwart auto-vectorization
55
-
CilkPlus Example: N-Body simulation
-
N-Body simulation
- N particles
- Each has its own mass, position and velocity in 3D space
- Particles are moving by influence of mutual gravitational forces
- Classic simulation has O(N2) computational complexity
-
N-Body simulation: main loop
_Cilk_for (int i = start; i < end; ++i) {
    Vector3 acc = 0.0f;
    for (int j = 0; j < body_count; ++j) {
        Vector3 dist = old_pos[j] - old_pos[i];
        float len = sqrtf(dot(dist, dist) + epsilon);
        acc += dist * masses[j] / (len * len * len);
    }
    new_vel[i] = old_vel[i] + acc * time;
    new_pos[i] = old_pos[i] + old_vel[i] * time + acc * time * time / 2;
}
The straightforward approach is to add #pragma offload to the main loop
-
N-Body simulation: #pragma offload
#pragma offload target(gfx) pin(old_pos, old_vel, new_pos, new_vel: length(body_count))
_Cilk_for (int i = start; i < end; ++i) {
Vector3 acc = 0.0f;
for (int j = 0; j < body_count; ++j) {
Vector3 dist = old_pos[j] - old_pos[i];
float len = sqrtf(dot(dist, dist) + epsilon);
acc += dist * masses[j] / (len * len * len);
}
new_vel[i] = old_vel[i] + acc * time;
new_pos[i] = old_pos[i] + old_vel[i] * time + acc * time * time / 2.0f;
}
-
N-Body simulation: Blocking
#pragma offload target(gfx) pin(old_pos, old_vel, new_pos, new_vel: length(body_count))
_Cilk_for (int i = start; i < end; i += TILE) {
    GFXVector3 pos[TILE];
    GFXVector3 acc[TILE];
    pos[0:TILE] = old_pos[i:TILE];
    acc[0:TILE] = 0.0f;
    for (int j = 0; j < body_count; j += TILE) {
        GFXVector3 tpos[TILE];
        float tmass[TILE];
        tpos[0:TILE] = old_pos[j:TILE];
        tmass[0:TILE] = masses[j:TILE];
        for (int t = 0; t < TILE; t++) {
            for (int k = 0; k < TILE; k++) {
                GFXVector3 dist = tpos[t] - pos[k];
                float inv_len = 1.0f / sqrtf(dot(dist, dist) + epsilon);
                acc[k] += dist * tmass[t] * inv_len * inv_len * inv_len;
            }
        }
    }
    new_vel[i:TILE] = old_vel[i:TILE] + acc[0:TILE] * time;
    new_pos[i:TILE] = pos[0:TILE] + old_vel[i:TILE] * time + acc[0:TILE] * time * time / 2.0f;
}
-
CilkPlus advanced features
• Static data declaration
• Separate file compilation (linking)
• Recursive functions
• Shared Local Memory (SLM)
• Shared Virtual Memory (SVM)
61
-
Where to get Intel OpenCL and Intel CilkPlus?
• Intel OpenCL:
– https://software.intel.com/en-us/intel-opencl
• Intel CilkPlus:
– https://software.intel.com/en-us/intel-cilk-plus
62
-
63
Agenda
• Compute programming on Intel Graphics with:
• Tools
• VTune
• GT-Pin
• Workload performance
-
VTune
64
• A platform-wide performance profiler:
• Memory accesses, storage and IO analysis, interrupts, CPU/GPU
concurrency, …
• Intel GPU analysis in VTune:
What compute APIs are used (OpenCL, CilkPlus, Media SDK)
When and what GPU units were utilized by those APIs
A rich set of hardware metrics showing how the actual machine was utilized
Hints for performance issues
OpenCL source/Gen assembly view, and some ability to map performance data back to the original code
https://software.intel.com/en-us/intel-vtune-amplifier-xe
-
CPU + GPU Utilization (on a Media + OpenCL app.)
65
-
Compute API Profiling (on an OpenCL application)
(Screenshot callouts: SW queue and GPU engine utilization, with performance data attributed to them; OpenCL kernels on the timeline; GEN GPU engine utilization; GPU HW metrics over time; OpenCL host-side API calls; OpenCL queue)
66
-
Architecture Diagram
67
-
EU Dynamic Instruction Count
68
-
EU Instruction Latency
69
-
GT-Pin
70
• A Pin-like binary instrumentation tool for the EU in Intel GPUs
• Command-line interface
• Used to build a wide range of tools for:
• Performance analysis, workload tracing, debugging
• Providing instruction count and latency data to VTune
• Support:
• OpenCL, CilkPlus, DirectX, OpenGL
• Windows, Linux, Android, OS-X
• Will provide API to allow users to write their own tools
• Availability: first public release in 2016
-
Sample GT-Pin Tool: Opcodeprof
71
-
Workload Analysis with Opcodeprof
72
-
Sample GT-Pin Tool: Cacheprof (a cache simulator)
73
• Trace accesses to data ports from the EUs
• Simulate the Intel GPU cache hierarchy:
L3 cache, LLC, instruction cache, constant cache, sampler cache, render cache
• Can also dump memory traces to files for offline analysis
-
Working-set Size Analysis with Cacheprof
74
(Plot: working-set size; benchmark = gaussian_blur filter)
-
Advanced Tool: GT-PinPoints
75
• A tool to find representative regions in OpenCL traces for GPU simulations and workload analysis
• Based on the proven Simpoint methodology
• Overview:
• Results:
• 223x simulation speedup at 3.0% error, or
• 35x simulation speedup at 0.3% error
http://www.cs.columbia.edu/~melanie/iiswc2015.pdf
-
76
Agenda
• Compute programming on Intel Graphics with:
• Tools
• Workload performance
• Photoshop
• Database
-
Adobe Photoshop
77
• The de facto industry-standard raster graphics editing software, developed by Adobe
• Our workloads comprise 22 feature tests
• Each does image processing such as applying filters, blurring images, and adding effects
-
Photoshop Experimental Setup
78
Compare the products with the highest theoretical FLOPS from Intel and Nvidia:
• Intel Graphics (integrated)
• Iris Pro 6200 (Broadwell GT3e)
• Theoretical peak = 883 GFLOPS
• TDP = 47 W
• Actual power measured via a public tool called GPU-Z
• Nvidia GPU (discrete)
• GTX Titan Z
• Theoretical peak = 8122 GFLOPS
• TDP = 375 W
• Actual power measured via an Nvidia tool called nvidia-smi
-
Performance Comparison
79
Nvidia is 1-2.5x faster
Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.
-
GPU Power Measured
80
Intel consumes 8-14x less power
Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.
-
GPU Energy Consumed
81
Intel consumes 5-16x less energy
Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.
-
Comparing Integrated and Discrete GPUs on Database Processing
82
Paper:
"Unleashing the Hidden Power of Integrated-GPUs for Database Co-Processing", by E. Ching et al. from FutureWei Technologies
http://subs.emis.de/LNI/Proceedings/Proceedings232/1755.pdf
-
Experimental setup
83
-
Data transfer time
84
Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.
-
TPC-H decision support benchmark (TPC-H Q1)
85
Execution time (lower is better) Throughput per Watt (higher is better)
Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.
-
TPC-H decision support benchmark (TPC-H Q9)
86
Execution time (lower is better) Throughput per Watt (higher is better)
Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.
-
Summary
87
• Intel processor graphics is integrated on die with the CPU:
• No extra cost for GPU
• High performance
• Energy efficient
• Intel processor graphics comes with a rich software ecosystem:
• Supports most standard graphics/compute programming APIs
• Modern, highly optimizing compiler
• Helpful tools
• Call to action:
• Try programming the integrated graphics on your Intel-based laptop!
• Visit this tutorial's webpage for the slides and more information
-
FTC disclaimer
88
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
-
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL
DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change
to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating
your contemplated purchases, including the performance of that product when combined with other products.
© 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Atom, Core, Iris, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation
in the U.S. and/or other countries.
89
Legal Disclaimer and Optimization Notice
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
-
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled
hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in
hardware, software, or configuration will affect actual performance. Consult other sources of information
to evaluate performance as you consider your purchase. For more complete information about
performance and benchmark results, visit http://www.intel.com/performance.
Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.