PROGRAMMING Intel® Processor Graphics Chi-Keung (CK) Luk - Intel Principal Engineer Intel Software & Services Group


  • PROGRAMMING Intel® Processor Graphics

    Chi-Keung (CK) Luk - Intel Principal Engineer

    Intel Software & Services Group

  • Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

    • Compute programming on Intel Graphics with:

    • OpenCL

    • CilkPlus

    • Tools

    • Workload performance

    2

    Agenda


    • OpenCL:

    Khronos

    Uri Levy, Yuval Eshkol, Doron Singer

    Robert Ioffe, Aaron Kunze, Ben Ashbaugh, Stephen Junkins, Michal Mrozek

    • CilkPlus:

    Knud J Kirkegaard, Anoop Madhusoodhanan Prabha, Konstantin Bobrovsky, Sergey Dmitriev

    • VTune:

    Alexandr Kurylev, Julia Fedorova

    • Workload Performance:

    Sharad Tripathi, Chinang Ma, Akhila Vidiyala

    Edward Ching, Norbert Egi, Masood Mortazavi, Vivent Cheng, Guangyu Shi

    3

    Acknowledgments for Slide Sources


    OpenCL

    1. Introduction

    2. Optimizing OpenCL for Intel GPUs

    3. Using Shared Virtual Memory (SVM)


    • An open standard managed by the Khronos group

    • A set of C-based APIs on the host that defines a run-time environment

    • Programs written in a C-based language (C++ support since OpenCL 2.1) that run on the device(s)

    OpenCL* (Open Computing Language)


    Mostly C-like

    Kernels (the functions that work-items execute) have a kernel prefix and a void return type

    No support for library functions

    No stdio.h / stdlib.h / math.h / etc..

    But printf is supported

    Based on C99

    kernel void foo (global int* ptr)

    {

    for (int i = 0; i < ...; i++) { ... }  // remainder of the snippet truncated in the source

    }


    Scalar data types

    char / uchar / short / ushort / int / uint / long / ulong

    float / double

    size_t

    Pointers

    Derived data types

    Arrays

    Structures

    Vector data types

    Supported data types


    Vectors exist for all scalar types

    Vector widths are 2, 3, 4, 8, 16

    All arithmetic operations work on vector types

    Component access (XYZW)

    Vectors > 4 use numeric (hexadecimal) indices

    Working with vectors

    uint3 vec0, vec1;

    uint3 result = vec0 + vec1;

    double res1 = dvec0.x + dvec1.z;

    double2 res2 = dvec0.wy + dvec1.xx; // Swizzle

    float res1 = vec16.s5 + vec16.sf;

    float2 res2 = vec8.s37 + vec16.sca; // Swizzle
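Component selection such as `vec16.s5` or the `.s048c` swizzle above is an indexed gather from vector lanes, with each hex digit naming one lane. A plain-C sketch of the semantics (`swizzle4` and its arguments are illustrative names, not OpenCL API):

```c
#include <assert.h>

/* Plain-C sketch of what a swizzle such as ".s048c" computes: gather the
   lanes named by the hex digits (0x0, 0x4, 0x8, 0xC) of a 16-wide vector
   into a 4-wide result. OpenCL does this as a register operation; the
   loop below is only the semantics. */
static void swizzle4(const unsigned char src[16], const int idx[4],
                     unsigned char dst[4])
{
    for (int i = 0; i < 4; ++i)
        dst[i] = src[idx[i]];
}
```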


    Functions used for querying the index

    get_global_id(int dimension); // index of the work-item in the entire execution

    get_local_id(int dimension); // index of the work-item within its work-group

    get_group_id(int dimension); // index of the work-group

    A few others..

    Memory

    No support for dynamic memory allocation

    When passing buffers as arguments, specify global / local / constant

    Additional details

    kernel void foo (global const int* ptr, local float* scratch)

    {

    ...

    }


    Many functions supported

    Overloaded to all relevant types and vector widths

    A tiny bit for example:

    Math: sin / cos / min / max / log / pow / sqrt / ….

    Geometric: dot / cross / distance / length / …

    Relational: isequal / isgreater / all / any / select / …

    Built-in functions

    kernel void dot_product(global const int4* a, global const int4* b, global int* out)

    {

    size_t tid = get_global_id(0);

    out[tid] = dot(a[tid], b[tid]);

    }
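Per work-item, the kernel above reduces one `int4` pair to a single value. A scalar C reference of that computation (note: the OpenCL spec defines `dot()` for floating-point vectors, so the `int4` usage on the slide is a simplification; `dot4` is an illustrative name):

```c
/* Scalar reference for the per-work-item computation in the kernel above:
   each work-item tid reduces one int4 pair to a single dot product. */
static int dot4(const int a[4], const int b[4])
{
    int sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += a[i] * b[i];
    return sum;
}
```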


    • Scalar values

    • Use “normal” (C-style) casts

    • Vector values

    • Vector conversions must be done explicitly

    • Source and destination types must have the same vector width

    • Example:

    • Vector construction (from scalars)

    Casts and type conversions

    dstValue = convert_destType(srcValue)

    int8 intVec;

    double8 dVec;

    float8 fVec = convert_float8(intVec);

    float8 fVec2 = convert_float8(dVec);

    ushort3 vecUshort = (ushort3)(0, 12, 7);
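`convert_destType()` works element-wise, like a C cast applied per lane; the numeric suffix only fixes the width, and source and destination widths must match. A plain-C sketch of `convert_float8` on an int source (`convert_float8_ref` is an illustrative name, not OpenCL API):

```c
/* Element-wise semantics of OpenCL's convert_float8(): each lane is
   converted independently, like a C cast per element. The "8" only fixes
   the vector width; source and destination widths must match. */
static void convert_float8_ref(const int src[8], float dst[8])
{
    for (int i = 0; i < 8; ++i)
        dst[i] = (float)src[i];
}
```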


    Based on C99

    With some extra features:

    Vector data types

    Extensive Built-in functions library

    Image handling, Work-group synchronization

    Minus others:

    Recursion

    Function pointers

    Pointers to pointers

    Dynamic memory allocation

    Language summary


    OpenCL

    1. Introduction

    2. Optimizing OpenCL for Intel GPUs

    Use zero copying to transfer data between CPU and GPU

    Maximize EU occupancy

    Maximize compute performance

    Avoid Divergent Control Flow

    Take advantage of large register space

    Optimize data accesses

    3. Using Shared Virtual Memory (SVM)


    Optimizing Host to Device Transfers: Zero Copying

    • Host (CPU) and Device (GPU) share the same physical memory

    • For buffers allocated through the OpenCL™ runtime:

    - Let the OpenCL runtime allocate the system memory:

    Create the buffer with the CL_MEM_ALLOC_HOST_PTR flag (host_ptr is NULL; the runtime allocates suitably aligned memory)

    - OR, use pre-allocated system memory:

    Create the buffer with your system memory pointer and CL_MEM_USE_HOST_PTR

    Allocate system memory aligned to a page (4096 bytes) (e.g., use _aligned_malloc or memalign to allocate)

    Allocate a multiple of cache line size (64 bytes)

    No transfer needed (zero copy)!

    - Use clEnqueueMapBuffer() to access data

    No transfer needed (zero copy)!
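A minimal host-side sketch of the allocation rules above, assuming a POSIX host (`posix_memalign`; on Windows the slide's `_aligned_malloc` plays the same role). The returned pointer would then be passed to `clCreateBuffer()` with `CL_MEM_USE_HOST_PTR`; `alloc_zero_copy` is an illustrative helper, not an OpenCL API:

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate a host buffer that satisfies the zero-copy requirements above:
   4096-byte (page) alignment and a size rounded up to a multiple of the
   64-byte cache line. */
static void *alloc_zero_copy(size_t nbytes, size_t *padded)
{
    size_t size = (nbytes + 63) & ~(size_t)63;  /* round up to 64 bytes */
    void *p = NULL;
    if (posix_memalign(&p, 4096, size) != 0)
        return NULL;
    if (padded)
        *padded = size;
    return p;
}
```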


    Maximizing EU Occupancy

    • Occupancy is a measure of EU thread utilization

    • Two primary things to consider:

    - Launch enough work items to keep EU threads busy

    - In short kernels: use short vector data types and compute multiple pixels to better amortize thread launch cost

    For example, color conversion:

    Before: one pixel per work item

    __global uchar* src, dst;
    p = src[src_idx] * B2Y + src[src_idx + 1] * G2Y + src[src_idx + 2] * R2Y;
    dst[dst_idx] = p;

    After: four pixels per work item

    __global uchar* src_ptr, dst_ptr;
    uchar16 src = vload16(0, src_ptr);
    uchar4 c0 = src.s048c;
    uchar4 c1 = src.s159d;
    uchar4 c2 = src.s26ae;
    uchar4 Y = c0 * B2Y + c1 * G2Y + c2 * R2Y;
    vstore4(Y, 0, dst_ptr);
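A scalar C reference for the four-pixel version: the 16 loaded bytes hold four 4-byte BGRx pixels, and the `.s048c`/`.s159d`/`.s26ae` swizzles deinterleave the B, G, and R channels. The weights are parameters here because the slide does not define `B2Y`/`G2Y`/`R2Y`; `bgrx4_to_y` is an illustrative name:

```c
/* Scalar reference for the "four pixels per work item" version: 16 bytes
   hold four 4-byte BGRx pixels; channels are deinterleaved (what the
   .s048c / .s159d / .s26ae swizzles do) and combined with fixed weights. */
static void bgrx4_to_y(const unsigned char src[16],
                       int b2y, int g2y, int r2y,
                       int y[4])
{
    for (int p = 0; p < 4; ++p)
        y[p] = src[4 * p + 0] * b2y   /* .s048c lanes: B channel */
             + src[4 * p + 1] * g2y   /* .s159d lanes: G channel */
             + src[4 * p + 2] * r2y;  /* .s26ae lanes: R channel */
}
```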


    Maximize Compute Performance

    • Use floats instead of integer data types

    - Because an EU can issue two float operations per cycle

    • Floating-point throughput depends on the data width

    - float16 throughput = 2 x float32 throughput

    - float32 throughput = 4 x float64 throughput

    • Trade accuracy for speed, where appropriate

    - Use “native” built-ins (or use -cl-fast-relaxed-math)

    - Use mad() / fma()(or use -cl-mad-enable)

    x = cos(i); x = native_cos(i);


    Avoid Divergent Control Flow

    “SIMT” ISA with Predication and Branching

    “Divergent” code executes both branches

    Reduced SIMD Efficiency, Increased Power and Exec Time

    this();

    if ( x )

    that();

    else

    another();

    finish();

    SIMD lane

    time

    Example: “x” sometimes true

    SIMD lane

    time

    Example: “x” never true
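A small sketch of the cost model behind this slide: with predication, a SIMD batch pays for both paths whenever its lanes disagree on the branch. All names and cycle counts here are illustrative:

```c
/* Sketch of why divergence hurts: for one SIMD batch, if any lane takes
   the "if" and any lane takes the "else", the hardware executes BOTH
   paths under predication, so the batch pays if_cost + else_cost instead
   of just one of them. Returns the cycles the batch spends. */
static int branch_cost(const int *lane_takes_if, int width,
                       int if_cost, int else_cost)
{
    int any_if = 0, any_else = 0;
    for (int i = 0; i < width; ++i) {
        if (lane_takes_if[i]) any_if = 1;
        else                  any_else = 1;
    }
    return (any_if ? if_cost : 0) + (any_else ? else_cost : 0);
}
```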


    Optimizing Data Accesses


    Take Advantage of Large Register Space

    • Each work item in an OpenCL™ kernel has access to up to 256-512 bytes of register space

    • Bandwidth to registers faster than any memory

    • Loading and processing blocks of pixels in registers is very efficient!

    float sum[PX_PER_WI_X] = { 0.0f };
    float k[KERNEL_SIZE_X];
    float d[PX_PER_WI_X + KERNEL_SIZE_X];   // allocated in registers
    // Load filter kernel in k, input data in d...
    // Compute convolution
    for (px = 0; px < PX_PER_WI_X; ++px)
        for (sx = 0; sx < KERNEL_SIZE_X; ++sx)
            sum[px] = mad(k[sx], d[px + sx], sum[px]);

    Use available registers (up to 512 bytes) instead of memory, where possible!


    Global and Constant Memory

    Global Memory Accesses go through the L3 Cache

    L3 cache line is 64 bytes

    EU thread accesses to the same cache line are collapsed

    • Order of data within cache line does not matter

    • Bandwidth determined by number of cache lines accessed

    • Maximum bandwidth (L3 → EU): 64 bytes / clock / sub slice

    Good: Load at least 32-bits of data at a time, starting from a 32-bit aligned address

    Best: Load 4 x 32-bits of data at a time, starting from a cache line aligned address

    • Loading more than 4 x 32-bits of data is not beneficial
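Since bandwidth on this path is set by the number of distinct 64-byte cache lines touched, not by the byte order within a line, a quick sketch of counting distinct lines for a set of byte addresses (an illustrative helper, not a real profiler):

```c
#include <stdint.h>

/* Count the distinct 64-byte L3 lines touched by a set of byte addresses.
   Contiguous lanes collapse into one line; strided lanes can each pull a
   separate line. Capped at 64 addresses, enough for this sketch. */
static int cache_lines_touched(const uintptr_t *addr, int n)
{
    uintptr_t seen[64];
    int count = 0;
    for (int i = 0; i < n; ++i) {
        uintptr_t line = addr[i] / 64;
        int found = 0;
        for (int j = 0; j < count; ++j)
            if (seen[j] == line) { found = 1; break; }
        if (!found && count < 64)
            seen[count++] = line;
    }
    return count;
}
```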


    Example: Global and Constant Memory Accesses


    • Local memory accesses go through Shared Local Memory (SLM)

    • SLM sits next to the L3 cache in the architecture

    • Key difference: SLM is banked

    • Banked at 4-byte granularity, with 16 banks in total

    • Maximum bandwidth: still 64 bytes / clock / sub slice

    • Supports more access patterns with full bandwidth than Global memory:

    • Reading the same address from a bank => not a bank conflict

    • Reading different addresses from a bank => bank conflict

    • Maximum bandwidth is achieved when there is no bank conflict

    27

    Local Memory Accesses
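A sketch of the bank model described above: 4-byte granularity across 16 banks, so `bank = (byte_addr / 4) % 16`; reads of the same address from a bank broadcast and do not conflict. The helpers are illustrative:

```c
#include <stdint.h>

/* SLM bank model from the slide: 4-byte granularity, 16 banks. Two lanes
   conflict when they hit the same bank at DIFFERENT addresses; reads of
   the same address broadcast and do not conflict. */
static int slm_bank(uintptr_t byte_addr)
{
    return (int)((byte_addr / 4) % 16);
}

static int has_bank_conflict(const uintptr_t *addr, int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (slm_bank(addr[i]) == slm_bank(addr[j]) &&
                addr[i] != addr[j])
                return 1;
    return 0;
}
```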


    Example: Local Memory Accesses


    OpenCL

    1. Introduction

    2. Optimizing OpenCL for Intel GPUs

    3. Using Shared Virtual Memory (SVM)

  • Shared Virtual Memory (Pre-history)

    Builds upon the "shared physical memory" (SPM) feature established with OpenCL 1.0 => the CL_MEM_USE_HOST_PTR flag

    Supported on Intel 3rd Gen processors with HD Graphics

    Eliminated buffer copy costs, aka "zero-copy" buffers*

    Buffer must have 4096-byte alignment and a size divisible by 64

    SPM has been available since 2011, but is still not used by many OpenCL apps…

    * See “Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics”

    https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics

  • Shared Virtual Memory (SVM) - Basics

    31

  • 3 types of SVM

    Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300): SVM buffers are mapped to either the CPU or the GPU at any given time; access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands.

    Fine-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5500+): SVM buffers can be accessed from either the CPU or the GPU at any time; use atomics to control access (if the CPU and GPU may try to modify the same memory location). Check CL_DEVICE_SVM_FINE_GRAIN_BUFFER for fine-grained buffer SVM support; CL_DEVICE_SVM_ATOMICS indicates atomics support.

    Fine-grain system memory (Future Intel Processors): the CPU and GPU can share anything allocated from the C-runtime heap (i.e., malloc/new).

  • 3 types of SVM

    Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300): SVM buffers are mapped to either the CPU or the GPU at any given time; access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands.

    Un-mapped state: only the GPU can access the buffer

  • 3 types of SVM

    Coarse-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5300): SVM buffers are mapped to either the CPU or the GPU at any given time; access is controlled by clEnqueueSVMMap/clEnqueueSVMUnmap commands.

    Mapped state: only the CPU can access the buffer

  • 3 types of SVM

    Fine-grain buffers (Intel 5th Gen Processors w/ HD Graphics 5500+): SVM buffers can be accessed from both the CPU and the GPU at any time; atomics can be used to avoid race conditions.

    Check whether the device supports it (CL_DEVICE_SVM_FINE_GRAIN_BUFFER & CL_DEVICE_SVM_ATOMICS flags)

    A fine-grain SVM buffer allows simultaneous access from the CPU and GPU

  • 3 types of SVM

    Fine-grain system memory (Future Intel Processors): the CPU and GPU can share anything allocated from the C-runtime heap (i.e., malloc/new). This is the ideal end-state, and it requires convergence of OS, H/W, and API support.

    Full CPU/GPU memory coherency for all heap allocations

  • SVM API Basics

    37

  • SVM Kernel Setup

    38

  • 39

    Agenda

    • Compute programming on Intel Graphics with:

    • OpenCL

    • CilkPlus

    • Tools

    • Workload performance

  • Intel CilkPlus for Parallel Programming

    40

    • Adds language extensions to C++ programs to exploit parallelism. Comes in two flavors:

    • MIT Cilk

    • OpenMP

    • Kinds of parallelism exploited:

    • Task-level

    • Loop-level

    • SIMD-level

    • Originally designed for CPUs, now also supports Intel Graphics

    • Unlike OpenCL, no separation of host and device programs

  • Example: Serial C++ version

    void vecadd(int n, float *a, float *b, float *c)

    {

    for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }

    41

  • Example: Parallel CPU version (Cilk flavor)

    42

    void vecadd(int n, float *a, float *b, float *c)

    {

    _Cilk_for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }

  • Example: Intel Graphics version (Cilk flavor)

    43

    void vecadd(int n, float *a, float *b, float *c)

    {

    #pragma offload target(gfx) pin(a, b, c : length(n))

    _Cilk_for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }

  • Example: Parallel CPU version (OpenMP flavor)

    44

    void vecadd(int n, float *a, float *b, float *c)

    {

    #pragma omp parallel for

    for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }

  • Example: Intel Graphics version (OpenMP flavor)

    45

    void vecadd(int n, float *a, float *b, float *c)

    {

    #pragma omp target(gfx) \

    map(tofrom: a[0:n], b[0:n], c[0:n]) map(to: n)

    #pragma omp parallel for

    for (int i=0; i < n; i++) {

    a[i] = b[i] + c[i];

    }

    }


    CilkPlus keywords

    • Cilk Plus adds three keywords to C and C++:

    _Cilk_spawn

    _Cilk_sync

    _Cilk_for

    • If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn,

    cilk_sync, and cilk_for.

    • Cilk Plus runtime controls thread creation and scheduling.

    • For GFX offload _Cilk_for is supported

    – No Cilk Plus runtime on the target

    – Scheduling happens on the host side

    46


    cilk_for loop

    • Looks like a normal for loop.

    cilk_for (int x = 0; x < 1000000; ++x) { … }

    • Any or all iterations may execute in parallel with one another.

    • All iterations complete before program continues.

    • Constraints:

    – Limited to a single control variable.

    – Must be able to jump to the start of any iteration at random.

    – Iterations should be independent of one another.

    47


    Array Notation (1)

    • Use a “:” in array subscripts to operate on multiple elements

    A[:] // all of array A

    A[lower_bound : length]

    A[lower_bound : length : stride]

    48

    Explicit Data Parallelism Based on C/C++ Arrays
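The triplet `A[lower : length : stride]` names `length` elements starting at index `lower`, spaced `stride` apart. As a plain-C statement of that semantics (Cilk Plus executes the whole assignment with parallel semantics; `section_assign` is an illustrative name):

```c
/* Plain-C semantics of the array-section triplet A[lower:length:stride]:
   "length" elements starting at "lower", "stride" apart. The loop is only
   the meaning; Cilk Plus may run the lanes in parallel. */
static void section_assign(float *dst, const float *src,
                           int lower, int length, int stride)
{
    for (int k = 0; k < length; ++k)
        dst[lower + k * stride] = src[lower + k * stride];
}
```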


    Array Notation (2)

    • Example: A[:] = B[:]+C[:]

    • An extension to C/C++

    • Perform operations on sections of arrays in parallel

    • Extension has parallel semantics

    • Well suited for code that:

    – performs per-element operations on arrays,

    – without an implied order between them

    – with an intent to execute in vector instructions

    49


    Operations on Array Sections (1)

    • C/C++ operators

    d[:] = a[:] + (b[:] * c[:])

    • Function calls

    b[:] = foo(a[:]); // Call foo() on each element of a[]

    • Reductions combine array elements to get a single result

    // Add all elements of a[]

    sum = __sec_reduce_add(a[:]);

    // More reductions exist...

    • If-then-else and conditional operators allow masked operations

    if (mask[:]) {

    a[:] = b[:]; // If mask[i] is true, a[i]=b[i]

    }

    50
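Plain-C statements of the two idioms above, the `__sec_reduce_add` reduction and the masked assignment (reference implementations with illustrative names, not the Cilk Plus runtime):

```c
/* Plain-C meaning of the idioms above: __sec_reduce_add(a[:]) sums every
   element, and "if (mask[:]) a[:] = b[:]" assigns only where the mask
   lane is true. */
static float sec_reduce_add_ref(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

static void masked_assign_ref(float *a, const float *b,
                              const int *mask, int n)
{
    for (int i = 0; i < n; ++i)
        if (mask[i])
            a[i] = b[i];
}
```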


    Operations on Array Sections (2)

    • Implicit index fills array with index values

    – __sec_implicit_index(0) // index along the 1st rank of the section

    – __sec_implicit_index(1) // index along the 2nd rank of the section

    51

    // fill A with values 0,1,2,3,4....

    A[:] = __sec_implicit_index(0);

    // fill B[i][j] with i+j

    B[:][:] = __sec_implicit_index(0) + __sec_implicit_index(1);

    // fill the lower-left triangle of C with 1

    C[0:n][0:__sec_implicit_index(0)] = 1;
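A plain-C statement of what the three fills above produce, for an n-by-n row-major matrix (`fill_implicit` is an illustrative name):

```c
/* Plain-C meaning of the three __sec_implicit_index() fills above,
   for n-by-n matrices stored row-major. */
static void fill_implicit(int n, int *A, int *B, int *C)
{
    for (int i = 0; i < n; ++i) {
        A[i] = i;                    /* A[:] = index along rank 0 */
        for (int j = 0; j < n; ++j)
            B[i * n + j] = i + j;    /* B[i][j] = i + j */
        for (int j = 0; j < i; ++j)  /* row i gets columns 0..i-1 set: */
            C[i * n + j] = 1;        /* the lower-left triangle */
    }
}
```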


    SIMD Functions (1)

    • A general construct to express data parallelism:

    – Write a function to describe the operation on a single element

    • Annotate it with one of:

    __declspec(vector)

    __attribute__((vector))

    – Invoke the function across a parallel data structure (arrays) or from

    within a vectorizable loop.

    52


    SIMD Functions (2)

    • Polymorphic: a vectorizing compiler may create both array and

    scalar versions of the function.

    • Writing the function is independent of its invocation

    – The function can be invoked on scalars, within serial for or cilk_for loops,

    using array notation, etc.

    53


    SIMD Functions - Example

    • Defining an elemental function:

    __declspec(vector)
    double option_price_call_black_scholes(
        double S, double K, double r, double sigma, double time)
    {
        double time_sqrt = sqrt(time);
        double d1 = (log(S/K) + r*time) / (sigma*time_sqrt) + 0.5*sigma*time_sqrt;
        double d2 = d1 - (sigma*time_sqrt);
        return S*N(d1) - K*exp(-r*time)*N(d2);
    }

    • Invoking the elemental function:

    // The following loop can also use cilk_for
    call[0:N] = option_price_call_black_scholes(S[0:N], K[0:N], r, sigma, time[0:N]);

    The compiler breaks the data into SIMD vectors and calls the function on each vector

    54


    SIMD Annotations

    • The loop annotation informs the compiler that the vectorized loop will have the same

    semantics as the serial loop:

    void f(float *a, const float *b, const int *e, int n)

    {

    #pragma simd

    for (int i = 0; i < n; ++i)

    a[i] = 2 * b[e[i]];

    }

    • Currently implemented as a pragma, but other methods of annotating the loop could be considered.

    • Additional clauses support reductions and other vectorization guidance (matching OpenMP*)

    Potential aliasing and loop-carried dependencies would thwart auto-vectorization

    55


    CilkPlus Example: N-Body simulation


    N-Body simulation

    - N particles

    - Each has its own mass, position and velocity in 3D space

    - Particles are moving by influence of mutual gravitational forces

    - Classic simulation has O(N2) computational complexity


    N-Body simulation: main loop

    _Cilk_for (int i = start; i < end; ++i) {
        Vector3 acc = 0.0f;
        for (int j = 0; j < body_count; ++j) {
            Vector3 dist = old_pos[j] - old_pos[i];
            float len = sqrtf(dot(dist, dist) + epsilon);
            acc += dist * masses[j] / (len * len * len);
        }
        new_vel[i] = old_vel[i] + acc * time;
        new_pos[i] = old_pos[i] + old_vel[i] * time + acc * time * time / 2;
    }

    Straightforward approach is to add #pragma offload to the main loop


    N-Body simulation: #pragma offload

    #pragma offload target(gfx) pin(old_pos, old_vel, new_pos, new_vel: length(body_count))

    _Cilk_for (int i = start; i < end; ++i) {

    Vector3 acc = 0.0f;

    for (int j = 0; j < body_count; ++j) {

    Vector3 dist = old_pos[j] - old_pos[i];

    float len = sqrtf(dot(dist, dist) + epsilon);

    acc += dist * masses[j] / (len * len * len);

    }

    new_vel[i] = old_vel[i] + acc * time;

    new_pos[i] = old_pos[i] + old_vel[i] * time + acc * time * time / 2.0f;

    }


    N-Body simulation: Blocking

    #pragma offload target(gfx) pin(old_pos, old_vel, new_pos, new_vel : length(body_count))
    _Cilk_for (int i = start; i < end; i += TILE) {
        GFXVector3 pos[TILE];
        GFXVector3 acc[TILE];
        pos[0:TILE] = old_pos[i:TILE];
        acc[0:TILE] = 0.0f;
        for (int j = 0; j < body_count; j += TILE) {
            GFXVector3 tpos[TILE];
            float tmass[TILE];              // masses are scalars
            tpos[0:TILE] = old_pos[j:TILE];
            tmass[0:TILE] = masses[j:TILE];
            for (int t = 0; t < TILE; t++) {
                for (int k = 0; k < TILE; k++) {
                    GFXVector3 dist = tpos[t] - pos[k];
                    float inv_len = 1.0f / sqrtf(dot(dist, dist) + epsilon);
                    acc[k] += dist * tmass[t] * inv_len * inv_len * inv_len;
                }
            }
        }
        new_vel[i:TILE] = old_vel[i:TILE] + acc[0:TILE] * time;
        new_pos[i:TILE] = pos[0:TILE] + old_vel[i:TILE] * time + acc[0:TILE] * time * time / 2.0f;
    }


    CilkPlus advanced features

    • Static data declaration

    • Separate file compilation (linking)

    • Recursive functions

    • Shared Local Memory (SLM)

    • Shared Virtual Memory (SVM)

    61


    Where to get Intel OpenCL and Intel CilkPlus?

    • Intel OpenCL:

    – https://software.intel.com/en-us/intel-opencl

    • Intel CilkPlus:

    – https://software.intel.com/en-us/intel-cilk-plus

    62


  • 63

    Agenda

    • Compute programming on Intel Graphics with:

    • Tools

    • VTune

    • GT-Pin

    • Workload performance

  • VTune

    64

    • A platform-wide performance profiler:

    • Memory accesses, storage and IO analysis, interrupts, CPU/GPU concurrency, …

    • Intel GPU analysis in VTune:

    What compute APIs are used (OpenCL, CilkPlus, Media SDK)

    When and what GPU units were utilized by those APIs

    A rich set of hardware metrics showing how the actual machine was utilized

    Hints for performance issues

    OpenCL source / Gen assembly view, with some ability to map performance data back to the original code

    https://software.intel.com/en-us/intel-vtune-amplifier-xe


  • CPU + GPU Utilization (on a Media + OpenCL app.)

    65

  • SW Queue and GPU engines utilization … and performance data attributed to them

    (Timeline view: OpenCL kernels, GEN GPU engines utilization, GPU HW metrics over time, OpenCL host-side API calls, OpenCL queue)

    Compute API Profiling (on an OpenCL application)

    66

  • Architecture Diagram

    67

  • EU Dynamic Instruction Count

    68

  • EU Instruction Latency

    69

  • GT-Pin

    70

    • A Pin-like binary instrumentation tool for the EU in Intel GPUs

    • Command-line interface

    • Used to build a wide range of tools for:

    • Performance analysis, workload tracing, debugging

    • Providing instruction count and latency data to VTune

    • Support:

    • OpenCL, CilkPlus, DirectX, OpenGL

    • Windows, Linux, Android, OS X

    • Will provide API to allow users to write their own tools

    • Availability: first public release in 2016

  • Sample GT-Pin Tool: Opcodeprof

    71

  • Workload Analysis with Opcodeprof

    72

  • Sample GT-Pin Tool: Cacheprof (a cache simulator)

    73

    • Trace accesses to data ports from EU

    • Simulate Intel GPU cache hierarchy:

    • Can also dump memory traces to files for offline analysis

    Simulated caches: L3 Cache, LLC, Instruction Cache, Constant Cache, Sampler Cache, Render Cache

  • Working-set Size Analysis with Cacheprof

    74

    (Figure: working-set size curve; benchmark = gaussian_blur filter)

  • Advanced Tool: GT-PinPoints

    75

    • A tool to find representative regions in OpenCL traces for GPU simulations and workload analysis

    • Based on the proven Simpoint methodology

    • Overview:

    • Results:

    • 223x simulation speedup for 3.0% error or

    • 35x simulation speedup for 0.3% error

    http://www.cs.columbia.edu/~melanie/iiswc2015.pdf


  • 76

    Agenda

    • Compute programming on Intel Graphics with:

    • Tools

    • Workload performance

    • Photoshop

    • Database

  • Adobe Photoshop

    77

    • The de facto industry-standard raster graphics editing software, developed by Adobe

    • Our workloads have 22 feature tests, each of which does image processing such as applying filters, blurring images, and adding effects

  • Photoshop Experimental Setup

    78

    Compare the products with the highest theoretical FLOPs from Intel and Nvidia:

    • Intel Graphics (integrated)

    • Iris Pro 6200 (Broadwell GT3e)

    • Theoretical peak = 883 GFLOPS

    • TDP = 47 W

    • Actual power measured via a public tool called GPU-Z

    • Nvidia GPU (discrete)

    • GTX Titan Z

    • Theoretical FLOPs = 8122 GFlops

    • TDP = 375 W

    • Actual power measured via an Nvidia tool called nvidia-smi
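The theoretical FLOPs figures are consistent with the usual peak-rate formula. The unit counts and clocks below are the publicly listed specs, used here as a plausible reconstruction of the arithmetic rather than an official derivation:

```python
def peak_gflops(units, flops_per_unit_per_cycle, ghz):
    """Theoretical peak = units x FLOPs/unit/cycle x clock (GHz)."""
    return units * flops_per_unit_per_cycle * ghz

# Iris Pro 6200: 48 EUs; each EU has 2 SIMD-4 FP32 pipes, and an FMA
# counts as 2 FLOPs => 2 * 4 * 2 = 16 FLOPs/EU/cycle, at ~1.15 GHz max.
iris = peak_gflops(48, 16, 1.15)        # ~883 GFLOPS

# GTX Titan Z: 5760 CUDA cores, FMA = 2 FLOPs/core/cycle, ~0.705 GHz boost.
titan = peak_gflops(5760, 2, 0.705)     # ~8122 GFLOPS
```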

  • Performance Comparison

    79

    Nvidia is 1-2.5x faster

    Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.

  • GPU Power Measured

    80

    Intel consumes 8-14x less power

    Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.

  • GPU Energy Consumed

    81

    Intel consumes 5-16x less energy

    Source: Intel. See slide 78 for experiment configuration, slide 88 for FTC disclaimer.
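The energy result follows from the two previous slides: energy is power integrated over time, so the relative energy saving is the power saving divided by the slowdown. A sketch with a hypothetical data point inside the ranges above:

```python
def relative_energy_saving(power_saving, slowdown):
    """Energy = power x time: a chip drawing 1/power_saving the power but
    running slowdown times longer saves power_saving/slowdown in energy."""
    return power_saving / slowdown

# Hypothetical point inside the quoted ranges: 10x less power while
# running 2x slower still nets a 5x energy saving.
saving = relative_energy_saving(10, 2)   # 5.0
```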

  • Comparing Integrated and Discrete GPUs on Database Processing

    82

    Paper:

    Unleashing the Hidden Power of Integrated-GPUs for Database Co-Processing, by E. Ching et al. from FutureWei Technologies

    http://subs.emis.de/LNI/Proceedings/Proceedings232/1755.pdf

  • Experimental setup

    83

  • Data transfer time

    84

    Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.
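The data-transfer advantage of the integrated GPU comes from sharing physical DRAM with the CPU: with zero-copy or SVM there is no host-to-device copy at all, while a discrete GPU must first ship the data across PCIe. A back-of-the-envelope sketch with hypothetical numbers:

```python
def transfer_ms(num_bytes, gb_per_s):
    """Time (ms) to move a buffer at a sustained bandwidth (GB/s)."""
    return num_bytes / (gb_per_s * 1e9) * 1e3

# Hypothetical: copying a 1 GB table over PCIe 3.0 x16 at ~12 GB/s
# sustained costs ~83 ms before a discrete GPU can even start, while an
# integrated GPU sharing DRAM with the CPU can skip the copy entirely.
pcie_ms = transfer_ms(1e9, 12)
```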

  • TPC-H decision support benchmark (TPC-H Q1)

    85

    Execution time (lower is better) Throughput per Watt (higher is better)

    Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.

  • TPC-H decision support benchmark (TPC-H Q9)

    86

    Execution time (lower is better) Throughput per Watt (higher is better)

    Source: FutureWei Technologies. See slide 83 for experiment configuration, slide 88 for FTC disclaimer.

  • Summary

    87

    • Intel processor graphics is integrated on die with the CPU:

    • No extra cost for GPU

    • High performance

    • Energy efficient

    • Intel processor graphics comes with a rich software ecosystem:

    • Support for most standard graphics/compute programming APIs

    • Modern highly-optimizing compiler

    • Helpful tools

    • Call to action:

    • Try programming the integrated graphics on your Intel-based laptop!

    • Visit this tutorial’s webpage for the slides and more information

  • FTC disclaimer

    88

    Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

  • INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.


    © 2015, Intel Corporation. All rights reserved. Intel, the Intel logo, Atom, Core, Iris, VTune, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.

    89

    Legal Disclaimer and Optimization Notice

    Optimization Notice

    Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

    Notice revision #20110804

  • Legal Notices and Disclaimers

    Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.

    No computer system can be absolutely secure.

    Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

    Intel, the Intel logo and others are trademarks of Intel Corporation in the U.S. and/or other countries.

    *Other names and brands may be claimed as the property of others.

    © 2015 Intel Corporation.
