joel falcou, boost.simd

Pragmatic Speedup with Boost.SIMD

Unlocked software performance

Joel Falcou

NumScale

24 February 2016

Challenges of SIMD programming


Parallelism is everywhere

The Obvious One

• Multi-cores
• Many-cores
• Distributed systems

The Embedded One

• Pipeline
• Superscalar, out-of-order CPUs
• SIMD instruction sets


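What is SIMD?

[Figure: SISD versus SIMD; instructions are applied to data to produce results, one value at a time (SISD) or several values at a time (SIMD).]

Principles

• Single Instruction, Multiple Data
• Each operation is applied to N values held in a single register of fixed size (128, 256 or 512 bits)
• Can be up to N times faster than the regular ALU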
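As a rough illustration of the principle above, the following sketch contrasts a scalar addition with a lane-wise addition over a fixed-width block. The `lanes4` type and the function names are hypothetical stand-ins, not part of Boost.SIMD; a real SIMD register is a hardware vector type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 4-lane "register" used only for illustration.
using lanes4 = std::array<float, 4>;

// SISD view: one instruction handles one value.
inline float add_scalar(float a, float b) { return a + b; }

// SIMD view: one conceptual instruction handles all four lanes.
// A compiler may turn this fixed-size loop into a single vector add.
inline lanes4 add_lanes(lanes4 const& a, lanes4 const& b)
{
  lanes4 r{};
  for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] + b[i];
  return r;
}
```

With N = 4 lanes, the lane-wise version issues one quarter of the additions, which is where the "up to N times faster" bound comes from.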



Why use SIMD?

• Speedup of x2 to x16 that may be combined with other sources of parallelism
• Reduces computing time without changing the infrastructure
• Gives great results for all kinds of regular or less regular computing patterns


1001 flavors of SIMD

Intel x86
• MMX 64 bits float, double
• SSE 128 bits float
• SSE2 128 bits int8, int16, int32, int64, double
• SSE3, SSSE3
• SSE4a (AMD)
• SSE4.1, SSE4.2
• AVX 256 bits float, double
• AVX2 256 bits int8, int16, int32, int64
• FMA3
• FMA4, XOP (AMD)
• MIC 512 bits float, double, int32, int64

PowerPC
• AltiVec 128 bits int8, int16, int32, int64, float
• Cell SPU and VSX, 128 bits int8, int16, int32, int64, float, double
• QPX 512 bits double

ARM
• VFP 64 bits float, double
• NEON 64 bits and 128 bits float, int8, int16, int32, int64


SIMD the good ol' way: int32 * int32 -> int32

// NEON
return vmul_s32(a0, a1);   // 64-bit
return vmulq_s32(a0, a1);  // 128-bit

// SSE4.1
return _mm_mullo_epi32(a0, a1);



// SSE2
return _mm_or_si128(
         _mm_and_si128(
           _mm_mul_epu32(a0, a1),
           _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
         ),
         _mm_slli_si128(
           _mm_and_si128(
             _mm_mul_epu32(_mm_srli_si128(a0, 4), _mm_srli_si128(a1, 4)),
             _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
           ),
           4
         )
       );
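The SSE2 sequence above can be read as two widening multiplies whose low halves are interleaved. The sketch below replays that idea with plain integers; `u32x4`, `mul_even` and `mullo32` are hypothetical names introduced for illustration, and `mul_even` only imitates what `_mm_mul_epu32` computes on lanes 0 and 2.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using u32x4 = std::array<std::uint32_t, 4>;

// Scalar stand-in for _mm_mul_epu32: multiplies lanes 0 and 2 as unsigned
// 32-bit values and returns the two full 64-bit products.
inline std::array<std::uint64_t, 2> mul_even(u32x4 const& a, u32x4 const& b)
{
  return { std::uint64_t(a[0]) * b[0], std::uint64_t(a[2]) * b[2] };
}

// Low 32 bits of each lane-wise product, built from two widening multiplies.
inline u32x4 mullo32(u32x4 const& a, u32x4 const& b)
{
  auto even = mul_even(a, b);                 // products of lanes 0 and 2
  u32x4 a_odd{ a[1], a[2], a[3], 0 };         // like _mm_srli_si128(a, 4)
  u32x4 b_odd{ b[1], b[2], b[3], 0 };
  auto odd = mul_even(a_odd, b_odd);          // products of lanes 1 and 3
  // Keep the low 32 bits of each product and put them back in lane order.
  return { std::uint32_t(even[0]), std::uint32_t(odd[0]),
           std::uint32_t(even[1]), std::uint32_t(odd[1]) };
}
```

The low 32 bits of a product are the same for signed and unsigned operands, which is why an unsigned widening multiply is enough for an int32 result.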




// AltiVec
// Reinterpret as u16
short0 = (__vector unsigned short)a0;
short1 = (__vector unsigned short)a1;

// Shifting constant
shift_ = vec_splat_u32(-16);
sf = vec_rl(a1, shift_);

// Compute high part of the product
high = vec_msum(short0, (__vector unsigned short)sf, vec_splat_u32(0));

// Complete by adding low part of the 16-bit product
return vec_add(vec_sl(high, shift_), vec_mulo(short0, short1));
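The AltiVec version builds a 32-bit product from 16-bit halves. A scalar sketch of the same arithmetic follows; the function name `mullo32_from_u16` is hypothetical, and the comments only indicate which intrinsic plays the corresponding role in the code above.

```cpp
#include <cassert>
#include <cstdint>

// With a = ah * 2^16 + al and b = bh * 2^16 + bl, the low 32 bits of a * b
// are ((ah * bl + al * bh) << 16) + al * bl, computed modulo 2^32.
inline std::uint32_t mullo32_from_u16(std::uint32_t a, std::uint32_t b)
{
  std::uint32_t al = a & 0xFFFFu, ah = a >> 16;
  std::uint32_t bl = b & 0xFFFFu, bh = b >> 16;
  std::uint32_t cross = ah * bl + al * bh;  // role of vec_msum on the swapped halves
  std::uint32_t low   = al * bl;            // role of vec_mulo
  return (cross << 16) + low;               // role of vec_sl followed by vec_add
}
```

The `ah * bh` term never appears because it only contributes to bits 32 and above.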



Isn't it a compiler's job?

Autovectorization
• Issues with:
  • memory constraints
  • vectorisability of the code must be obvious
• What about library functions?
• Compiler may be confused or miss a critical clue

Conclusion
• Explicit SIMD can guarantee the level of performance
• Challenge: keeping multi-architecture SIMD code up to date


Our approach

High level abstraction

• Designing a SIMD Domain-Specific Embedded Language (DSEL)
• Abstracting SIMD registers as data blocks
• High-level optimisation at expression scope

Integration within C++

• Make SIMD code generic and cross-hardware
• Integration with the standard library
• Use modern C++ idioms

Boost.SIMD

Boost.SIMD is a candidate for inclusion in Boost


SIMD abstraction

pack<T,N>

pack<T, N>: SIMD register of N elements of type T
pack<T>: same, with an optimal N for the current hardware

Behaves as a value of type T but applies operations to all its elements at once.

Constraints
• T is a fundamental type
• logical<T> is used to handle booleans
• N must be a power of 2
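To make the abstraction concrete, here is a toy sketch of a `pack`-like type. It is not the Boost.SIMD implementation: the `sketch` namespace, the array-based storage and the two operators are assumptions chosen to show the value semantics and the stated constraints with standard C++ only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

namespace sketch {

// Toy stand-in for pack<T, N>: storage is a plain array, not a hardware register.
template <class T, std::size_t N>
struct pack
{
  static_assert(std::is_arithmetic<T>::value, "T must be a fundamental arithmetic type");
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");

  std::array<T, N> lanes;

  explicit pack(T v = T()) { lanes.fill(v); }  // broadcast one value to all lanes
  T  operator[](std::size_t i) const { return lanes[i]; }
  T& operator[](std::size_t i)       { return lanes[i]; }
};

// pack * pack: the operation is applied to every lane.
template <class T, std::size_t N>
pack<T, N> operator*(pack<T, N> a, pack<T, N> const& b)
{
  for (std::size_t i = 0; i < N; ++i) a[i] = static_cast<T>(a[i] * b[i]);
  return a;
}

// pack * scalar: the scalar is broadcast first.
template <class T, std::size_t N>
pack<T, N> operator*(pack<T, N> const& a, T b) { return a * pack<T, N>(b); }

} // namespace sketch
```

Writing `sketch::pack<float, 4>(3.f) * 2.f` then reads like scalar code while touching all four lanes.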



Operations on pack

Basic Operators
• All language operators are available: pack<T> ⊕ pack<T>, pack<T> ⊕ T, T ⊕ pack<T>
• No conversion nor promotion though: uint8_t(255) + uint8_t(1) = uint8_t(0)

Comparisons
• ==, !=, <, <=, > and >= perform SIMD comparisons.
• compare_equal, compare_less return a single boolean.

Other properties
• Models RandomAccessFusionSequence and RandomAccessRange
• p[i] returns a proxy to access the register's internal value
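The absence of promotion and the two kinds of comparison can be mimicked with plain arrays. In this sketch, `u8x4`, `mask4`, `add_u8`, `equal_lanes` and `all_equal` are hypothetical names; they only imitate the behaviour described above and are not library calls.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using u8x4  = std::array<std::uint8_t, 4>;
using mask4 = std::array<bool, 4>;

// Lane-wise addition that stays in uint8_t: 255 + 1 wraps to 0, whereas the
// scalar C++ expression uint8_t(255) + uint8_t(1) is promoted to int and gives 256.
inline u8x4 add_u8(u8x4 a, u8x4 const& b)
{
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = static_cast<std::uint8_t>(a[i] + b[i]);
  return a;
}

// Analogue of the == operator: one boolean per lane.
inline mask4 equal_lanes(u8x4 const& a, u8x4 const& b)
{
  mask4 m{};
  for (std::size_t i = 0; i < a.size(); ++i) m[i] = (a[i] == b[i]);
  return m;
}

// Analogue of a whole-register comparison: a single boolean.
inline bool all_equal(u8x4 const& a, u8x4 const& b) { return a == b; }
```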



Memory access

Loading and Storing

• (aligned_load/store) and (load/store)
• Support for statically known misalignment
• Support for conditional and sparse access (depending on the hardware)

Examples

• aligned_load< pack<T, N> >(p, i) loads a pack from the aligned address p + i.

[Figure: main memory cells 0D to 18; aligned_load<pack<float>>(0x10, 0) returns the pack [10 11 12 13].]

• aligned_load< pack<T, N>, Offset >(p, i) loads a pack from the aligned address p + i - Offset.

[Figure: main memory cells 0D to 18; aligned_load<pack<float>, 2>(0x10, 2) returns the pack [12 13 14 15].]



Shuffle and Swizzle

General Principles

• Elements of a SIMD register can be permuted by the hardware
• Turn complex memory access into computations
• Provided by the shuffle function

Examples:

// a = [ 1 2 3 4 ]
pack<float> a = enumerate< pack<float> >(1);

// b = [ 10 11 12 13 ]
pack<float> b = enumerate< pack<float> >(10);

// res = [ 4 12 0 10 ]
pack<float> res = shuffle<3, 6, -1, 4>(a, b);
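The index convention used in the example (0 to 3 pick from a, 4 to 7 pick from b, -1 yields zero) can be checked with a scalar emulation. The names `f4`, `pick` and `shuffle_lanes` below are hypothetical and work on plain arrays, not on hardware registers.

```cpp
#include <array>
#include <cassert>

using f4 = std::array<float, 4>;

// Selects one output lane: 0..3 read from a, 4..7 read from b, -1 gives zero.
template <int I>
float pick(f4 const& a, f4 const& b)
{
  static_assert(I >= -1 && I < 8, "index out of range");
  return I < 0 ? 0.f : (I < 4 ? a[I & 3] : b[I & 3]);
}

// Compile-time permutation of two 4-lane blocks.
template <int I0, int I1, int I2, int I3>
f4 shuffle_lanes(f4 const& a, f4 const& b)
{
  return { pick<I0>(a, b), pick<I1>(a, b), pick<I2>(a, b), pick<I3>(a, b) };
}
```

Because the indices are template parameters, an implementation is free to select the cheapest permutation instruction for each pattern at compile time.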


// Permutation described by a meta-function
struct reverse_
{
  template <class I, class C>
  struct apply : mpl::int_<C::value - I::value - 1> {};
};

// res = [ n n-1 ... 2 1 ]
pack<float> res = shuffle<reverse_>(a);
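The meta-function maps an output index I and a cardinal C to the source index C - I - 1. The sketch below expresses the same mapping with a `constexpr` function instead of Boost.MPL; `reverse_perm` and `permute` are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Policy giving, for output lane i of n, the input lane to read from.
struct reverse_perm
{
  static constexpr std::size_t index(std::size_t i, std::size_t n) { return n - i - 1; }
};

// Applies a permutation policy to an N-lane block.
template <class Perm, class T, std::size_t N>
std::array<T, N> permute(std::array<T, N> const& a)
{
  std::array<T, N> r{};
  for (std::size_t i = 0; i < N; ++i) r[i] = a[Perm::index(i, N)];
  return r;
}
```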



Integration with the STL

• Algorithms:
  • SIMD transform
  • SIMD fold
  • Use polymorphic functor or lambda for mixing scalar/SIMD
• Iterators:
  • Provide SIMD-aware walkthroughs
  • boost::simd::(aligned_)(input/output_)iterator
  • boost::simd::direct_output_iterator
  • boost::simd::shifted_iterator



std::vector<float, simd::allocator<float> > v(N);

simd::transform( v.begin(), v.end()
               , [](auto const& p) { return p * 2.f; }
               );
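One plausible way such a transform can work is to walk the range in blocks of the register width and finish with a scalar tail, calling the same polymorphic lambda in both cases. The sketch below assumes a width of 4 and uses hypothetical names (`block`, `transform_in_place`, `doubled`); it does not reproduce the library's actual implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t width = 4;              // assumed number of lanes

struct block { std::array<float, width> v; }; // toy stand-in for pack<float>

inline block operator*(block b, float s)
{
  for (auto& x : b.v) x *= s;
  return b;
}

// Applies f block by block, then element by element on the remaining tail.
template <class F>
void transform_in_place(std::vector<float>& data, F f)
{
  std::size_t i = 0;
  for (; i + width <= data.size(); i += width)
  {
    block b;
    for (std::size_t k = 0; k < width; ++k) b.v[k] = data[i + k];  // load
    b = f(b);                                                      // SIMD call
    for (std::size_t k = 0; k < width; ++k) data[i + k] = b.v[k];  // store
  }
  for (; i < data.size(); ++i) data[i] = f(data[i]);               // scalar call
}

// The generic lambda accepts both a block and a float (requires C++14).
inline std::vector<float> doubled(std::vector<float> v)
{
  transform_in_place(v, [](auto const& p) { return p * 2.f; });
  return v;
}
```

This is why a polymorphic functor or lambda matters: the same body must compile for the block type and for the scalar type.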




struct average
{
  template <class T>
  typename T::value_type operator()(T const& t) const
  {
    typename T::value_type d(1./3);
    return (t[0] + t[1] + t[2]) * d;
  }
};

std::vector<float, simd::allocator<float> > in(N), o(N);

std::transform( simd::shifted_iterator<3>(in.begin())
              , simd::shifted_iterator<3>(in.end())
              , simd::aligned_output_begin(o.begin())
              , average()
              );
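Assuming `shifted_iterator<3>` exposes three consecutive shifted views of the input (an interpretation of the slide, not a documented guarantee), the computation amounts to a sliding average over windows of three elements. A scalar reference version, with the hypothetical name `average3`, could look like this:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// out[i] = (in[i] + in[i + 1] + in[i + 2]) / 3 for every complete window.
inline std::vector<float> average3(std::vector<float> const& in)
{
  std::vector<float> out;
  float const d = 1.f / 3;
  for (std::size_t i = 0; i + 3 <= in.size(); ++i)
    out.push_back((in[i] + in[i + 1] + in[i + 2]) * d);
  return out;
}
```

A scalar reference like this is handy for validating the SIMD version, bearing in mind that results may differ by rounding.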



Hardware Optimisations

Problem:
• Most SIMD hardware supports fused operations like fma.
• Those optimisations must remain transparent to the user.
• We use Expression Templates so Boost.SIMD auto-optimizes those patterns.

Examples:
• a * b + c becomes fma(a, b, c)
• a + b * c becomes fma(b, c, a)
• !(a < b) becomes is_nlt(a, b)
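The two kinds of rewrite can be illustrated with scalar standard-library calls. `std::fma` is the standard fused multiply-add; `mul_add` and `is_not_less` are hypothetical wrappers written for this sketch, not Boost.SIMD functions.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// a * b + c evaluated as a single fused operation (one rounding step).
inline double mul_add(double a, double b, double c) { return std::fma(a, b, c); }

// A negated comparison is its own predicate: !(a < b) is not the same as
// a >= b once NaN is involved, so it cannot simply be rewritten as >=.
inline bool is_not_less(double a, double b) { return !(a < b); }
```

With a NaN operand, `is_not_less` returns true while `>=` returns false, which is one reason a library would provide dedicated negated predicates.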



Supported Hardware

Open Source
• Intel SSE2-4, AVX
• PowerPC VMX

Proprietary
• ARM NEON
• Intel AVX2, XOP, FMA3, FMA4
• Intel MIC


Other functions ...

Arithmetic
• saturated arithmetic
• long multiplication
• float/int conversion
• round, floor, ceil, trunc
• sqrt, cbrt
• hypot
• average
• random
• min/max
• rounded division and remainder

Bitwise
• select
• andnot, ornot
• popcnt
• ffs
• ror, rol
• rshr, rshl
• twopower

IEEE
• ilogb, frexp
• ldexp
• next/prev
• ulpdist

Predicates
• comparison to zero
• negated comparisons
• is_unord, is_nan, is_invalid
• is_odd, is_even
• majority


Reduction and SWAR operations

Reduction
• any, all
• nbtrue
• minimum/maximum, posmin/posmax
• sum
• product, dot product

SWAR
• group/split
• splatted reduction
• cumsum
• sort
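The difference between a reduction (register to scalar) and a SWAR operation (register to register) can be sketched on a 4-lane array. The names `i4`, `sum`, `splatted_sum` and `cumsum` below are illustrative scalar emulations, not the library's implementations.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using i4 = std::array<int, 4>;

// Reduction: the four lanes collapse into a single scalar.
inline int sum(i4 const& a) { return a[0] + a[1] + a[2] + a[3]; }

// Splatted reduction: the reduced value is broadcast back to every lane,
// so the result can keep flowing through register-level code.
inline i4 splatted_sum(i4 const& a)
{
  int const s = sum(a);
  return { s, s, s, s };
}

// Cumulative sum (prefix sum): lane i holds a[0] + ... + a[i].
inline i4 cumsum(i4 a)
{
  for (std::size_t i = 1; i < a.size(); ++i) a[i] += a[i - 1];
  return a;
}
```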


Performance!


Basic Functions

Single-precision trigonometrics
Hardware: Core i7 Sandy Bridge, AVX; timings in cycles/value

Function   Range          std   Scalar   SIMD
exp        [−10, 10]      46    38       7
log        [−10, −10]     42    37       5
asin       [−1, 1]        40    35       13
cos        [−20π, 20π]    66    47       6
fast_cos   [−π/4, π/4]    32    9        1.3


Julia set generator

• Generate a fractal image using the Julia function
• Purely compute-bound
• Challenge: workload depends on pixel location


Julia set generator

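template <class T>
typename meta::as_integer<T>::type julia(T const& a, T const& b)
{
  typedef typename meta::as_integer<T>::type int_type;
  int_type iter(0);
  std::size_t i = 0;
  T x(0), y(0);  // assumption: the slide leaves x and y uninitialised
  typename meta::as_logical<T>::type mask;  // assumed type of a SIMD comparison result

  do {
    T x2 = x * x;
    T y2 = y * y;
    T xy = T(2) * x * y;
    x = x2 - y2 + a;
    y = xy + b;
    mask = x2 + y2 < T(4);      // lanes still inside the escape radius
    iter = selinc(mask, iter);  // increment only those lanes
  } while (any(mask) && i++ < 256);

  return iter;
}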
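The key point of the kernel is that all lanes iterate together: each lane's counter is incremented only while that lane is still bounded, and the loop stops when no lane is. The following scalar emulation over four lanes illustrates this; `escape_counts` is a hypothetical name, the starting point (0, 0) and the strict bound of `max_iter` passes are assumptions of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using f4 = std::array<float, 4>;
using i4 = std::array<int, 4>;

// Per-lane iteration counts for z = z * z + (a + i b), four lanes at once.
inline i4 escape_counts(f4 const& a, f4 const& b, int max_iter = 256)
{
  f4 x{}, y{};   // assumed starting point (0, 0)
  i4 iter{};
  bool any_active = true;
  for (int i = 0; any_active && i < max_iter; ++i)
  {
    any_active = false;
    for (std::size_t k = 0; k < 4; ++k)
    {
      float x2 = x[k] * x[k], y2 = y[k] * y[k], xy = 2.f * x[k] * y[k];
      x[k] = x2 - y2 + a[k];
      y[k] = xy + b[k];
      bool mask = x2 + y2 < 4.f;        // is this lane still bounded?
      iter[k] += mask ? 1 : 0;          // emulates selinc(mask, iter)
      any_active = any_active || mask;  // emulates any(mask)
    }
  }
  return iter;
}
```

Lanes that escape early keep being computed until the slowest lane of the block finishes, which is the "workload depends on pixel location" challenge in practice.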



Julia set generator

[Figure: bar chart of cycles per element (cpe, axis from 0 to 800) against image size (256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048) for scalar SSE2, simd SSE2, simd AVX and simd AVX2; speedup labels over the scalar version range from x2.93 to x6.94.]


Motion Detection

• Based on Manzanera's Sigma Delta algorithm
• Background subtraction
• Pixel intensity modeled as Gaussian distributions
• Challenge: low arithmetic intensity


Motion Detection

template <typename T>
T sigma_delta(T& bkg, T const& frm, T& var)
{
  bkg = selinc(bkg < frm, seldec(bkg > frm, bkg));

  T dif = dist(bkg, frm);
  T mul = muls(dif, 3);

  var = if_else( dif != T(0)
               , selinc(var < mul, seldec(var > mul, var))
               , var
               );

  return if_zero_else_one(dif < var);
}
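For a single 8-bit pixel, the update can be spelled out with ordinary branches. This scalar sketch assumes that `dist` is an absolute difference and `muls` a saturated multiplication; `sd_result` and `sigma_delta_pixel` are hypothetical names, and the state is returned by value rather than updated in place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct sd_result { std::uint8_t bkg, var, motion; };

inline sd_result sigma_delta_pixel(std::uint8_t bkg, std::uint8_t frm, std::uint8_t var)
{
  // Move the background estimate one step towards the current frame.
  if (bkg < frm) ++bkg; else if (bkg > frm) --bkg;

  // Absolute difference, then amplification by 3 with saturation at 255.
  auto dif = static_cast<std::uint8_t>(bkg > frm ? bkg - frm : frm - bkg);
  auto mul = static_cast<std::uint8_t>(std::min(3 * dif, 255));

  // Move the variance estimate one step towards the amplified difference.
  if (dif != 0) { if (var < mul) ++var; else if (var > mul) --var; }

  // 0 when the difference is below the variance, 1 (motion) otherwise.
  std::uint8_t motion = dif < var ? 0 : 1;
  return { bkg, var, motion };
}
```

Each pixel needs only a few increments and comparisons per byte loaded, which illustrates the low arithmetic intensity named as the challenge.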



Motion Detection

[Figure: bar chart of frames per second (FPS, axis from 0 to 3·10^4) against frame size (480 x 640, 600 x 800, 1024 x 1280, 1080 x 1920, 2160 x 3840, each marked @5) for scalar SSE2, simd SSE2, simd AVX and simd AVX2; speedup labels over the scalar version range from x3.50 to x26.96.]


Sparse Tridiagonal Solver

• Solve Ax = b with sparse A and multiple x
• Application in fluid mechanics
• Challenge: using SIMD despite being sparse
• Solution: shuffle for local densification



Conclusion



Boost.SIMD

• We can design a C++ library for low-level performance primitives
• A lot of success with a lot of different applications
• Find us on https://github.com/jfalcou/nt2
• Tests, comments and feedback welcome

Soon

• Rewrite before submission to Boost is in progress
• More architectures to come!

Thanks for your attention!
