

Intel® Software Development Products


Getting Existing Code to Auto-Vectorize with Intel® Composer XE

Alex Wells and Brandon Hewitt
March 10, 2011


Before We Start

• All webinars are recorded and will be available on Intel Learning Lab within a week
  – http://www.intel.com/go/learninglab

• Use the Question Module to ask questions
• If you have audio issues, give it 5 seconds to resolve
  – If audio issues persist, we suggest:
    – Drop and reenter the webinar
    – Call into the phone bridge


How Intel® Composer XE can improve your application performance

• Intel® Composer XE is the latest compiler offering from Intel for 32-bit and 64-bit applications
• What are SIMD instructions and how do they help performance
• What is auto-vectorization and how does it help generate optimized SIMD code
• What are the typical obstacles to using the auto-vectorizer and how do I resolve them
• What's improved in the vectorizer in Composer XE
• Skinning Kernel Example
• Using high-level objects while still vectorizing
• Using Strided Array Access to Vectorize Arrays of Structures
• Combining these Techniques into a Kernel Template Library
• Performance with Intel® Advanced Vector Extensions
• More options and info


SIMD: Single Instruction, Multiple Data

• Scalar mode
  – one instruction produces one result
• SIMD processing
  – with Intel® Streaming SIMD Extensions (SSE) or Advanced Vector Extensions (AVX) instructions
  – one instruction can produce multiple results

[Figure: a scalar add X + Y produces one result, while a SIMD add produces x0+y0 through x7+y7 in a single instruction.]
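To make the picture concrete, here is a hand-written SSE sketch (illustrative only; the rest of this deck relies on the compiler to generate such code automatically):

#include <xmmintrin.h>   // SSE intrinsics

// One _mm_add_ps instruction produces four float results at once
void add4(const float *x, const float *y, float *out)
{
    __m128 vx = _mm_loadu_ps(x);             // load x0..x3
    __m128 vy = _mm_loadu_ps(y);             // load y0..y3
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));  // store x0+y0 .. x3+y3
}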


Automatic Vectorization by the Compiler Translates Loops into SIMD Parallelism

• The loop is stripmined (unrolled) with a strip length of 8 for floats with Intel® AVX, or 4 for floats with Intel® SSE

for (i = 0; i <= MAX; i++)
    c[i] = a[i] + b[i];

[Figure: elements A[0..7] and B[0..7] are loaded into vector registers, added element-wise in one instruction, and stored to C[0..7].]


Did my loop auto-vectorize?

icc -vec-report1 -c Multiply.c
Multiply.c(31): (col. 3) remark: LOOP WAS VECTORIZED.

• Vectorization and reports are enabled only at -O2 and above
• Intel® SSE2 enabled by default
• Intel® AVX enabled with /Q[a]xAVX, -[a]xAVX
  – Read the documentation to find the option that works best for you (example command lines below)
• If you use /arch:IA32 or -mia32, the vectorizer is disabled
• /Qvec-reportN or -vec-reportN options enable reports
  – N is a number from 0 to 5
  – 1 reports any loops vectorized
  – 3 is recommended, to get information on loops vectorized and not vectorized, and why
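For example, the options above combine into single invocations like these (illustrative command lines, not from the original slides):

Linux*:    icc -O2 -xAVX -vec-report3 -c Multiply.c
Windows*:  icl /O2 /QxAVX /Qvec-report3 /c Multiply.c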


Obstacles to Successful Vectorization: Pointer Aliasing and Loop Iteration Dependencies

• Vectorization entails changes in the order of operations within a loop, since each SIMD instruction operates on several data elements at once. Vectorization is only possible if this change of order does not change the results of the calculation.

• One major cause can be the assumption of pointer aliasing:

void add(char *cp_a, char *cp_b, int n) {
    for (int i = 0; i < n; i++) {
        cp_a[i] += cp_b[i];
    }
}

• The compiler must by default assume cp_a and cp_b can overlap in memory, thus causing potential loop dependencies (e.g. a write to cp_a[1] might affect the read from cp_b[12])

• For simple cases like this one, the compiler can insert checks at runtime for aliasing and still vectorize
  – Extra checks can impact performance
  – Difficult to resolve for more complicated algorithms


Solutions for Loop Dependencies and Pointer Aliasing

• #pragma ivdep (see the sketch below)
  – Just place before the loop, and the compiler will assume no dependencies unless it can prove otherwise
  – The compiler may still be able to prove dependencies, which will still prevent auto-vectorization
  – Applied per loop
• restrict keyword
  – Use with pointer declarations to assert that memory is only referenced through that pointer
  – Requires an extra compiler option (/Qrestrict or -restrict, or C99)
  – Applied per pointer
• There are also compiler options that affect compiler assumptions about loop aliasing
  – /Qansi-alias, /Oa, /Qalias-args- (Windows*)
  – -ansi-alias, -fno-alias, -fargument-noalias (Linux*/Mac OS*)
• Note that any of these changes may result in incorrect code if the assumptions being changed are not correct (e.g. your pointers do overlap)
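Applied to the add() example from the previous slide, the two source-level fixes look like this (a minimal sketch; the restrict version needs /Qrestrict, -restrict, or C99 mode):

// restrict: promise that cp_a and cp_b never reference the same memory
void add_restrict(char *restrict cp_a, char *restrict cp_b, int n) {
    for (int i = 0; i < n; i++) {
        cp_a[i] += cp_b[i];
    }
}

// ivdep: tell the compiler to assume no loop-carried dependencies
void add_ivdep(char *cp_a, char *cp_b, int n) {
#pragma ivdep
    for (int i = 0; i < n; i++) {
        cp_a[i] += cp_b[i];
    }
}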


Obstacles to Successful Vectorization: Function Calls

• Vectorization cannot occur when function calls are made in the loop
  – Can be hard to recognize that calls are happening, especially in C++
• Solutions
  – Inline wherever possible
  – Use __forceinline on function definitions
  – Use #pragma forceinline recursive to inline all calls (and nested calls) in a statement
  – Use __declspec(vector) on function declarations and definitions to safely auto-vectorize them (see the sketch below)
    – Part of Intel® Cilk™ Plus
  – Many standard math library functions already vectorize
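A minimal sketch of an elemental function (Windows* __declspec syntax shown; on Linux* the Intel compiler spells it __attribute__((vector))):

// The compiler generates both a scalar and a SIMD version of this function
__declspec(vector) float saxpy(float a, float x, float y)
{
    return a * x + y;
}

void run(float a, const float *x, const float *y, float *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = saxpy(a, x[i], y[i]);   // the call no longer blocks vectorization
}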


Obstacles to Successful Vectorization: Not Enough Work

• The compiler may decide that vectorizing a loop will not generate more efficient code
• Solution
  – Use #pragma vector always to override compiler heuristics (see the sketch below)
  – Note that the compiler will still not vectorize if it determines it's unsafe or illegal to do so
  – Can use #pragma vector always assert to give a compile-time error if the loop does not vectorize
  – Be sure that vectorization does improve performance before use
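A minimal sketch of the override (illustrative; verify with -vec-report3 that vectorization actually pays off):

void scale(float *a, float s, int n)
{
    // Heuristics might judge this loop unprofitable; force it anyway
#pragma vector always
    for (int i = 0; i < n; i++)
        a[i] *= s;
}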


Obstacles to Successful Vectorization: C++ STL Data Types

• Can be done, but you need to help the compiler:

$ cat vec-stl-1.cpp
#include <vector>

std::vector<double> XX, YY;

void foo(int iters, int x, int y) {
    for (int i = 0; i < iters; i++) {
        XX[x+i] += YY[y+i];
    }
}

$ icc -vec-report3 -c vec-stl-1.cpp
vec-stl-1.cpp(6): (col. 3) remark: loop was not vectorized: existence of vector dependence.
vec-stl-1.cpp(7): (col. 7) remark: vector dependence: assumed FLOW dependence between _M_start line 7 and _M_start line 7.

• Similar problems with loop dependence (due to the STL implementation)
• Also a problem with the compiler not recognizing global addresses as invariant


Solution for Enabling Auto-vectorization

$ cat vec-stl-2.cpp
#include <vector>

std::vector<double> XX, YY;

void foo(int iters, int x, int y) {
    double *x1 = &XX[x];
    double *y1 = &YY[y];
#pragma ivdep
    for (int i = 0; i < iters; i++) {
        x1[i] += y1[i];
    }
}

$ icc -vec-report3 -c vec-stl-2.cpp
vec-stl-2.cpp(9): (col. 3) remark: LOOP WAS VECTORIZED.


Obstacles to Successful Vectorization: Data Alignment

• Data alignment (on 16 bytes) is a major issue for Intel® SSE performance prior to Intel® Core™ i7
• It's still a potential performance issue, as the compiler will generate runtime checks for alignment
• It will become important again in the next generation of Core i7, with Intel® AVX instructions (this time on 32 bytes)
• Two places where explicit coding is needed:
  – When declaring new pointers
  – When declaring function arguments


Data Alignment Solutions

• For new arrays:
  – __declspec(align(N)) type name[bounds];
• For new dynamically allocated memory:
  – type * p = (type*) _mm_malloc(size, N); ... _mm_free(p);
  – Threading Building Blocks memory allocator: scalable_aligned_malloc / scalable_aligned_free
  – Can also overload new/delete operators for classes
• For function arguments:
  – __assume_aligned(name, N);
• For specific loops:
  – #pragma vector aligned/unaligned
• Can cause runtime exceptions if the assumption is false (see the combined sketch below)
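A minimal sketch combining several of these (assumes 16-byte alignment for Intel® SSE; names are illustrative):

#include <mm_malloc.h>   // _mm_malloc / _mm_free

__declspec(align(16)) float weights[1024];   // statically aligned array

void scale(float *out, const float *in, int n)
{
    __assume_aligned(out, 16);   // promise: both arguments are 16-byte aligned
    __assume_aligned(in, 16);
#pragma vector aligned           // can fault at runtime if the promise is false
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}

int main()
{
    float *buf = (float*) _mm_malloc(1024 * sizeof(float), 16);   // aligned heap block
    scale(buf, weights, 1024);
    _mm_free(buf);
    return 0;
}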


Improvements in Intel® Composer XE (compared to previous versions of Intel® C++ Compiler)

• Mixed data types
  – Loops containing mixed data types (such as float and double) will now auto-vectorize:

$ cat mix-data.c
void foo(int n, float *restrict A, double *restrict B)
{
    int i;
    float t = 0.0f;
    for (i = 0; i < n; i++) {
        A[i] = t;
        B[i] = t;
        t += 1.0f;
    }
}

11.1 Update 7:
$ icc -vec-report3 -restrict -c mix-data.c
mix-data.c(4): (col. 3) remark: loop was not vectorized: unsupported data type.

12.0 Update 2:
$ icc -vec-report3 -restrict -c mix-data.c
mix-data.c(4): (col. 3) remark: LOOP WAS VECTORIZED.


Improvements in Intel® Composer XE (compared to previous versions of Intel® C++ Compiler)

• Support for more complicated nested if statements:

$ cat vec-if.c
void foo(int n, double *A, double *B, double *C)
{
    int i;
#pragma ivdep
    for (i = 0; i < n; i++) {
        if (A[i] > 0 || B[i] < 0) {
            C[i] = 0;
        }
        else {
            C[i] = 1;
        }
    }
}

11.1 Update 7:
$ icc -vec-report3 -c vec-if.c
vec-if.c(9): (col. 7) remark: loop was not vectorized: statement cannot be vectorized.

12.0 Update 2:
$ icc -vec-report3 -c vec-if.c
vec-if.c(4): (col. 3) remark: LOOP WAS VECTORIZED.


Application of these Techniques in Skinning Kernel Example


For auto-vectorization, data needs to be accessed component-wise as arrays

• Skinning kernel example: compute the influence of a joint over a set of vertices
• One array per component of the input and output data streams
• Expand high-level math to be performed per component

The high-level operation

out[i] = (offset[i]*joint)*normalizedWeight[i];

expands per component to:

#pragma vector always assert
#pragma ivdep
for (unsigned int i = 0; i < count; ++i) {
    const float x = offsetX[i];
    const float y = offsetY[i];
    const float z = offsetZ[i];
    const float nw = normalizedWeight[i];
    outX[i] = (x * joint.m[0][0] + y * joint.m[1][0] + z * joint.m[2][0] + joint.m[3][0]) * nw;
    outY[i] = (x * joint.m[0][1] + y * joint.m[1][1] + z * joint.m[2][1] + joint.m[3][1]) * nw;
    outZ[i] = (x * joint.m[0][2] + y * joint.m[1][2] + z * joint.m[2][2] + joint.m[3][2]) * nw;
}


You can express a kernel’s algorithm using higher level objects, like Point or Matrix and still Auto-Vectorize!

• Within a kernel body
  – Local objects can be created on the stack
  – Only arrays external to the kernel body need to be accessed by the control loop's index
  – Fixed-offset data members can be accessed
  – Methods may be called as long as they inline
• Auto-vectorization will still work!

High-level code such as

float length = offset.length();
Vector3 scaledOffset = offset*scale;

inlines down to per-component code like:

const Vector3 vertex(offsetX[i], offsetY[i], offsetZ[i]);
float lengthSquared = (offset.x*offset.x) + (offset.y*offset.y) + (offset.z*offset.z);
float length = sqrt(lengthSquared);


Define high level helper classes for use inside the kernel body

• All helper classes & methods inline code to the stack
• The compiler optimizes out needless copies
  – By the time the auto-vectorizer gets to it, the code resembles the longhand per-component version

const Matrix4x3 joint(usersJoint);
#pragma vector always assert
#pragma ivdep
for (unsigned int i = 0; i < count; ++i) {
    const Vector3 offset(offsetX[i], offsetY[i], offsetZ[i]);
    const float nw = normalizedWeight[i];
    const Point3 out = (offset*joint)*nw;
    outX[i] = out.x();
    outY[i] = out.y();
    outZ[i] = out.z();
}

Algorithm expressed using high-level objects.


Anything special about the helper classes?

• Flat data layout
• All methods are __forceinline
• Otherwise plain old C++

struct Matrix4x3
{
private:
    float m00; float m01; float m02;
    float m10; float m11; float m12;
    float m20; float m21; float m22;
    float m30; float m31; float m32;
public:
    // ... implement methods
};

vs.

struct Matrix4x3
{
private:
    float m[4][3];
public:
    // ... implement methods
};

__forceinline Vector3
Vector3::operator*(float aScalar)
{
    x() *= aScalar;
    y() *= aScalar;
    z() *= aScalar;
    return *this;
}


Multiple Arrays: getting data to the "vector" SSE registers in the CPU

• SIMD lanes (registers) must be loaded with the same logical kernel value from multiple elements
• Best is linear memory access (a contiguous array per component): one instruction to load
• Multiple arrays can lose cache locality and pressure the hardware prefetchers and the TLB vs. a single AOS

C[i] = A[i] + B[i];

[Figure: separate arrays A, B, and C in memory; four consecutive elements of A and B are loaded into SSE registers XMM0 and XMM1, added in one instruction, and stored as C0..C3.]


Tiled SOA: A Hybrid Data Layout

• We can bring back cache locality by changing the data layout to an array of structures of fixed-length arrays
  – Let's just call it "Tiled SOA"
  – Requires an outer loop to walk through the tiles while the inner loop vectorizes over the tile's width
• A single data stream made up of blocks of fixed-size arrays
  – Provides a linear memory access pattern

const size_t Width = 4;
struct InputTile {
    float a[Width];
    float b[Width];
};
InputTile inStreamOfTiles[2500];

[Figure: memory layout A0 A1 A2 A3 B0 B1 B2 B3 A4 A5 A6 A7 B4 B5 B6 B7 ...; each tile's component array maps directly onto an SSE register (XMM0, XMM1).]

for (size_t tileIndex = 0u; tileIndex < 2500; ++tileIndex) {
    const InputTile & tile = inStreamOfTiles[tileIndex];
    const float * A = tile.a;
    const float * B = tile.b;
    float * C_forTile = &C[tileIndex*Width];
#pragma ivdep
    for (size_t index = 0u; index < Width; ++index) {
        C_forTile[index] = A[index] + B[index];
    }
}


Array of Structures: Use Strided Array Access to pack component data into SIMD lanes

• If A and B were in AOS format...
• Keep pointers to each component
• Use strided array access to operate on the correct component within the elements

struct Input {
    float a;
    float b;
};
Input in[10000];

C[i] = in[i].a + in[i].b;

float *A = &in[0].a;
float *B = &in[0].b;
const unsigned int Stride = 2;
C[i] = A[i*Stride] + B[i*Stride];

[Figure: the interleaved a/b members of array in are gathered with stride 2 into SSE registers XMM0 and XMM1, added, and stored contiguously to array C.]


Improving AOS access

• Maintaining a pointer per component increases register pressure, as a different pointer is used to access each component
• Alternatively, the data accesses can be defined in terms of a single pointer + the fixed offset to the data member of the structure + (stride of the structure * index)
• stddef.h contains a macro that computes the fixed offset to a structure's data member at compile time: offsetof(structure, member)
• Would be nice to have a template library that simplifies accessing AOS and Tiled SOA data for vectorization...

typedef unsigned char BYTE;
const BYTE * inArray = reinterpret_cast<const BYTE *>(&in[0]);
for (size_t index = 0u; index < 10000; ++index) {
    const BYTE * pointerA = inArray + offsetof(Input, a) + (sizeof(Input)*index);
    const BYTE * pointerB = inArray + offsetof(Input, b) + (sizeof(Input)*index);
    float localA = *(reinterpret_cast<const float *>(pointerA));
    float localB = *(reinterpret_cast<const float *>(pointerB));
    C[index] = localA + localB;
}


Kernel Template Library (KTL)

• Express algorithms with high-level helper objects
• Maps accesses to multiple arrays into high-level objects
• Hides accessing AOS as strided arrays from the user
• Supports Tiled SOA data streams
• Applies the kernel to each element of the data streams in a data-parallel manner
  – vectorized,
  – threaded,
  – vectorized + threaded,
  – or serial (for non-vectorizing compilers, or for comparison)
• Uses only C++0x standard features, allowing other compilers to still compile the code
• Generates code that optimizes well


Kernel Code that Optimizes Well

• GOAL: lay out code to give the compiler all available information, enabling it to perform better optimizations
• Put the constants, loop control, and kernel body on the same stack level with no interruptions
  – A function call interrupts the optimizer's view of the stack
• No function calls
  – The kernel body must inline inside the control loop
  – All functions or methods called must inline
• Constants and variables used inside the kernel must be declared on the stack
  – A reference or pointer to a constant isn't good enough
  – The compiler can't assume it hasn't changed in between iterations
  – With a local stack instance, the compiler can prove it wasn't changed and load a register only once
• Identify array data as independent


Code Layout that Optimizes Well

#pragma forceinline recursive
{
    // Define the context for the kernel body here:
    // local stack variables for all constants
    Constant1 constant1;
    Constant2 constant2;
    //...
    const Array1 * array1;
    Array2 * array2;
    //...

    // Loop control structure
#pragma ivdep
    for (size_t index = 0; index < ElementCount; ++index) {
        // define kernel body here, using only context variables
        // declared on the stack above
        array2[index] = array1[index]*constant1 + constant2;
    }
}

All of the objects and helpers need to inline for auto-vectorization to work.

Identify array data as independent.

No function calls between the local declarations and the loop to interrupt the stack.


C++0x Lambda Functions with a template function can generate such a Code Layout

// User code defines kernel as lambda function that operates
// on a single element
auto kernel = [=](const size_t &iIndex) {
    array2[iIndex] = array1[iIndex]*constant1 + constant2;
};
// template library function generates the code for loop control
layoutCodeForKernel(kernel, ElementCount);

template <typename KernelT>
__declspec(noinline)
void layoutCodeForKernel(const KernelT &iKernel, const size_t iElementCount)
{
#pragma forceinline recursive
    {
        // Copy kernel closure onto the local stack
        KernelT localKernel(iKernel);
        // Loop control structure
#pragma ivdep
        for (size_t index = 0u; index < iElementCount; ++index) {
            localKernel(index);
        }
    }
}

Use the "auto" keyword to hide the type of the right-hand side.

Use a lambda function with capture by value to define the body of the kernel.

Any external variables accessed will be implicitly captured by value inside the lambda function's closure.


Kernel Template Library Example

• Using KTL to vectorize our skinning example when data is in Array of Structures (AOS) format

struct InputElement {
    Vector3 offset;
    float weight;
};

struct OutputElement {
    Vector3 position;
};

const size_t elementCount = 10000;
InputElement containerInput[elementCount];
OutputElement containerOutput[elementCount];

const Matrix4x3 * joint;


Kernel Template Library Example

• Using KTL to vectorize our skinning example when data is in Array of Structures (AOS) format

const InputElement * inArray = &containerInput[0];
OutputElement * outArray = &containerOutput[0];
const Matrix4x3 localJoint(*joint);

auto kernel = [=](const ElementAccess &access) {
    // Access array pointers as ktl elements
    // (ktl elements encapsulate strided arrays and loop control indexes)
    auto inElem = access.inputElementOfAos(inArray);
    auto outElem = access.outputElementOfAos(outArray);

    // Get all input data out of the element(s) into local high-level objects
    Vector3 localOffset;
    float localWeight;
    inElem.get<offsetof(InputElement, offset)>(localOffset);
    inElem.get<offsetof(InputElement, weight)>(localWeight);

    // Express algorithm using high-level objects
    const Vector3 result = (localOffset*localJoint)*localWeight;

    // Set all output element(s) with the results
    outElem.set<offsetof(OutputElement, position)>(result);
};
ktl::Engine::vectorizeKernel(kernel, elementCount);

ElementAccess will create elements that encapsulate getting or setting data for the current iteration.

Use the C++0x keyword "auto" to hide template types.

Use a C++0x lambda function to define the body of the kernel.

Get the element data onto the stack in terms of high-level objects.

The algorithm is expressed using high-level objects.

Store the results out in terms of high-level objects.

Ask the Engine to vectorize the kernel body.


Kernel Template Library Example

• Can use macros to simplify:

KTL_KERNEL_BEGIN(kernel)
{
    KTL_ELEM_IN_AOS(inElem, inArray)
    KTL_ELEM_OUT_AOS(outElem, outArray)

    KTL_LOCAL_FROM_MEMBER(localOffset, inElem, offset)
    KTL_LOCAL_FROM_MEMBER(localWeight, inElem, weight)

    const Vector3 result = (localOffset*localJoint)*localWeight;

    KTL_SET_MEMBER_WITH(outElem, position, result)
}
KTL_KERNEL_END
ktl::Engine::vectorizeKernel(kernel, elementCount);


Intel® AVX performance

• Automatic CPU dispatch to add Intel® AVX code paths
  – /QaxAVX /Qdiag-enable:cpu-dispatch (full command line below)
• Weighted Joint Deformation: 256 elements (fits in L1), 1,000,000 iterations
• Test platform: Intel® Core™ i7 2nd generation processor [Sandy Bridge]
  – 3.10 GHz, 6MB L3, 4GB RAM, Windows Server 2008* R2
• Compiled with Intel® C++ Composer XE [Version 12.0.0.104 for Intel® 64]
• Microsoft Visual Studio 2010* Release Config plus: /O3 /Oi /Ot /fp:fast /QaxAVX
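An illustrative full command line for such a build (hypothetical source file name):

icl /O3 /Oi /Ot /fp:fast /QaxAVX /Qdiag-enable:cpu-dispatch SkinningKernel.cpp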


What’s Next for Kernel Template Library?

• Containers
  – Runtime Tiled SOA defined and accessed as AOS
  – Take on managing the tiled structure of arrays, allowing the user to pretend it's an AOS for setup/editing and when defining a kernel, while under the hood it is Tiled SOA
• More data types for use in kernels
  – The current data types are just for example purposes
• Gather/scatter or other staging data streams
  – Gather and scatter currently don't vectorize; an additional stage needs to be introduced to gather/scatter to/from an intermediate buffer before the rest of the kernel can vectorize


More Options and Information

• Lots of other ways to get the benefits of vectorization and SIMD instructions
  – Automatic CPU Dispatch (/Qax, -ax)
  – Manual CPU Dispatch (APIs to dispatch multiple versions of specific functions explicitly)
  – Intel® Cilk™ Plus Array Notations and #pragma simd (see the sketch below)
  – Intel® Integrated Performance Primitives and Math Kernel Library
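As a quick taste of the Cilk™ Plus alternatives named above (illustrative snippets only):

// Array notation: one statement describes the whole element-wise operation
c[0:n] = a[0:n] + b[0:n];

// #pragma simd: instructs the compiler to vectorize the loop
#pragma simd
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];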

• KTL is available at http://software.intel.com/en-us/articles/kernel-template-library/
• More presentations at http://software.intel.com/en-us/articles/intel-software-development-products-technical-presentations/
• Go to http://software.intel.com for more


Optimization Notice

Intel compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel Compiler User and Reference Guides” under “Compiler Options." Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel Streaming SIMD Extensions 2 (Intel SSE2), Intel Streaming SIMD Extensions 3 (Intel SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20110228


Legal Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. http://software.intel.com/en-us/articles/intel-sample-source-code-license-agreement/?wapkw=(Samples+Software+License+Agreement)


Configurations: details are on the Intel® AVX performance slide. For more information go to http://www.intel.com/performance

Intel processor numbers are not a measure of performance.  Processor numbers differentiate features within each processor family, not across different processor families.  Go to: http://www.intel.com/products/processor_number

Intel, Core i7 and Cilk Plus are trademarks of Intel Corporation in the U.S. and/or other countries.

Copyright © 2011 Intel Corporation.  All rights reserved.

*Other names and brands may be claimed as the property of others. 


Q&A
