Pragmatic Speedup with Boost.SIMD
Unlocked software performance
Joel Falcou
NumScale
24 February 2016
Challenges of SIMD programming
Parallelism is everywhere
The Obvious One
• Multi-cores
• Many-cores
• Distributed systems
The Embedded One
• Pipeline
• Superscalar, out-of-order CPUs
• SIMD instruction sets
What is SIMD?
[Figure: SISD versus SIMD execution, showing instructions, data and results]
Principles
• Single Instruction, Multiple Data
• Each operation is applied to N values held in a single fixed-size register (128, 256 or 512 bits)
• Can be up to N times faster than the regular scalar ALU
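To make the SISD/SIMD contrast concrete, here is a minimal sketch of the same element-wise addition written both ways. The function names are hypothetical, the SIMD path assumes an x86 target with SSE2 (it falls back to the scalar loop elsewhere), and n is assumed to be a multiple of 4 to keep the code short.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE/SSE2 intrinsics, used only when the target supports them
#endif

// SISD: one addition per instruction.
void add_scalar(const float* a, const float* b, float* out, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

// SIMD: four additions per instruction on a 128-bit register.
// Assumption for this sketch: n is a multiple of 4.
void add_simd(const float* a, const float* b, float* out, std::size_t n)
{
  assert(n % 4 == 0);
#if defined(__SSE2__)
  for (std::size_t i = 0; i < n; i += 4)
  {
    __m128 va = _mm_loadu_ps(a + i);            // load 4 floats (no alignment required)
    __m128 vb = _mm_loadu_ps(b + i);
    _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // 4 sums in one instruction
  }
#else
  add_scalar(a, b, out, n);                     // portable fallback without SSE2
#endif
}
```

Both functions produce the same results; only the number of elements handled per instruction differs.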
Why use SIMD?
• Speedups of x2 to x16 that can be combined with other sources of parallelism
• Reduces computing time without changing the infrastructure
• Gives great results for all kinds of regular or less regular computing patterns
1001 flavors of SIMD
Intel x86
• MMX: 64 bits (float, double)
• SSE: 128 bits (float)
• SSE2: 128 bits (int8, int16, int32, int64, double)
• SSE3, SSSE3
• SSE4a (AMD)
• SSE4.1, SSE4.2
• AVX: 256 bits (float, double)
• AVX2: 256 bits (int8, int16, int32, int64)
• FMA3
• FMA4, XOP (AMD)
• MIC: 512 bits (float, double, int32, int64)
PowerPC
• AltiVec: 128 bits (int8, int16, int32, int64, float)
• Cell SPU and VSX: 128 bits (int8, int16, int32, int64, float, double)
• QPX: 512 bits (double)
ARM
• VFP: 64 bits (float, double)
• NEON: 64 bits and 128 bits (float, int8, int16, int32, int64)
SIMD the good ol' way: int32 * int32 -> int32
    // NEON
    return vmul_s32(a0, a1);   // 64-bit
    return vmulq_s32(a0, a1);  // 128-bit
    // SSE4.1
    return _mm_mullo_epi32(a0, a1);
    // SSE2
    return _mm_or_si128(
        _mm_and_si128( _mm_mul_epu32(a0, a1)
                     , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0)
                     )
      , _mm_slli_si128(
            _mm_and_si128( _mm_mul_epu32( _mm_srli_si128(a0, 4)
                                        , _mm_srli_si128(a1, 4))
                         , _mm_setr_epi32(0xffffffff, 0, 0xffffffff, 0))
          , 4)
    );
    // Altivec
    // reinterpret as u16
    short0 = (__vector unsigned short)a0;
    short1 = (__vector unsigned short)a1;

    // shifting constant
    shift_ = vec_splat_u32(-16);
    sf = vec_rl(a1, shift_);

    // Compute high part of the product
    high = vec_msum( short0, (__vector unsigned short)sf
                   , vec_splat_u32(0));

    // Complete by adding low part of the 16 bits product
    return vec_add( vec_sl(high, shift_)
                  , vec_mulo(short0, short1));
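The variants above are why hand-written SIMD is hard to maintain: one operation needs one code path per instruction set. A minimal sketch of what a single entry point could look like, selecting the path at compile time, is given below. The function name mul4_int32 is hypothetical, only the x86 paths are covered, and the scalar fallback is an assumption added so the sketch builds on any target.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>   // SSE2
#endif
#if defined(__SSE4_1__)
#include <smmintrin.h>   // SSE4.1
#endif

// Hypothetical helper: multiplies four int32 lanes, keeping the low 32 bits.
void mul4_int32(const std::int32_t* a, const std::int32_t* b, std::int32_t* out)
{
#if defined(__SSE4_1__)
  // One instruction does the whole job.
  const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
  const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_mullo_epi32(va, vb));
#elif defined(__SSE2__)
  // Only a 32x32->64 multiply of lanes 0 and 2 exists: do even and odd lanes
  // separately, keep the low halves, then merge.
  const __m128i va   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
  const __m128i vb   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
  const __m128i mask = _mm_setr_epi32(-1, 0, -1, 0);
  const __m128i even = _mm_and_si128(_mm_mul_epu32(va, vb), mask);
  const __m128i odd  = _mm_slli_si128(
      _mm_and_si128(_mm_mul_epu32(_mm_srli_si128(va, 4), _mm_srli_si128(vb, 4)), mask), 4);
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_or_si128(even, odd));
#else
  // Scalar fallback; unsigned arithmetic avoids signed-overflow undefined behaviour.
  for (int i = 0; i < 4; ++i)
    out[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(a[i]) *
                                       static_cast<std::uint32_t>(b[i]));
#endif
}
```

Every new architecture adds another branch to such a function, which is the maintenance cost the rest of the talk sets out to remove.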
Isn't it the compiler's job?
Autovectorization
• Issues with:
  • memory constraints
  • vectorisability of the code must be obvious
• What about library functions?
• The compiler may be confused or miss a critical clue
Conclusion
• Explicit SIMD can guarantee the level of performance
• Challenge: keeping multi-architecture SIMD code up to date
Our approach
High level abstraction
• Designing a SIMD Domain-Specific Embedded Language (DSEL)
• Abstracting SIMD registers as data blocks
• High-level optimisation at expression scope
Integration within C++
• Make SIMD code generic and cross-hardware
• Integration with the standard library
• Use modern C++ idioms
Boost.SIMD
† Boost.SIMD is a candidate for inclusion in Boost
SIMD abstraction
pack<T,N>
pack<T, N>: SIMD register of N elements of type T
pack<T>: same, with an optimal N for the current hardware
Behaves as a value of type T but applies each operation to all its elements at once.
Constraints
• T is a fundamental type
• logical<T> is used to handle booleans
• N must be a power of 2
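The value semantics described above can be pictured with a small stand-in type. This is only a sketch: mini_pack is a hypothetical name, it stores its lanes in a std::array rather than a hardware register, and it is not how Boost.SIMD implements pack.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for pack<T, N>; lanes live in a plain array here.
template <class T, std::size_t N>
struct mini_pack
{
  static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of 2");
  std::array<T, N> lanes;

  T&       operator[](std::size_t i)       { return lanes[i]; }
  T const& operator[](std::size_t i) const { return lanes[i]; }
};

// pack + pack: one expression, applied to every lane.
template <class T, std::size_t N>
mini_pack<T, N> operator+(mini_pack<T, N> const& a, mini_pack<T, N> const& b)
{
  mini_pack<T, N> r{};
  for (std::size_t i = 0; i < N; ++i) r[i] = static_cast<T>(a[i] + b[i]);
  return r;
}

// pack * scalar: the scalar is applied to every lane.
template <class T, std::size_t N>
mini_pack<T, N> operator*(mini_pack<T, N> const& a, T s)
{
  mini_pack<T, N> r{};
  for (std::size_t i = 0; i < N; ++i) r[i] = static_cast<T>(a[i] * s);
  return r;
}
```

User code then reads like scalar code, for example a + b or a * 2.f, while each operator touches all N lanes.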
Operations on pack
Basic Operators
• All language operators are available: pack<T> ⊕ pack<T>, pack<T> ⊕ T, T ⊕ pack<T>
• No conversion or promotion though: uint8_t(255) + uint8_t(1) = uint8_t(0)
Comparisons
• ==, !=, <, <=, > and >= perform SIMD comparisons.
• compare_equal, compare_less return a single boolean.
Other properties
• Models RandomAccessFusionSequence and RandomAccessRange
• p[i] returns a proxy to access the register's internal value
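Two of the points above are easy to miss: arithmetic stays in the lane type, and there are two kinds of comparison. The sketch below emulates them on plain arrays; u8x4, mask4 and the three function names are hypothetical and only mirror the behaviour described on the slide.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using u8x4  = std::array<std::uint8_t, 4>;  // hypothetical 4-lane pack of uint8_t
using mask4 = std::array<bool, 4>;          // hypothetical per-lane boolean result

// No promotion: the sum is truncated back to uint8_t, so 255 + 1 wraps to 0.
u8x4 add_lanes(u8x4 const& a, u8x4 const& b)
{
  u8x4 r{};
  for (std::size_t i = 0; i < 4; ++i) r[i] = static_cast<std::uint8_t>(a[i] + b[i]);
  return r;
}

// Like operator== on packs: one boolean per lane.
mask4 equal_lanes(u8x4 const& a, u8x4 const& b)
{
  mask4 m{};
  for (std::size_t i = 0; i < 4; ++i) m[i] = (a[i] == b[i]);
  return m;
}

// Like compare_equal: a single boolean for the whole pack.
bool compare_equal_sketch(u8x4 const& a, u8x4 const& b)
{
  for (std::size_t i = 0; i < 4; ++i)
    if (a[i] != b[i]) return false;
  return true;
}
```

The per-lane form feeds selection functions, while the single-boolean form is what an ordinary if statement needs.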
Memory access
Loading and Storing
• aligned_load/aligned_store and load/store
• Support for statically known misalignment
• Support for conditional and sparse access (with respect to the hardware)
Examples
aligned_load< pack<T, N> >(p, i) loads a pack from the aligned address p + i.
[Figure: main memory cells 0D to 18; aligned_load<pack<float>>(0x10, 0) returns the pack 10 11 12 13]
aligned_load< pack<T, N>, Offset >(p, i) loads the pack located at p + i, given that p + i - Offset is an aligned address.
[Figure: main memory cells 0D to 18; aligned_load<pack<float>, 2>(0x10, 2) returns the pack 12 13 14 15]
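One way to picture the statically known misalignment is as two aligned block reads followed by a lane selection. The sketch below emulates that on int arrays; load_misaligned and lanes are hypothetical names, and whether the library uses this exact strategy on a given target is an assumption.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t lanes = 4;  // assumed pack width for this sketch

// Emulates aligned_load<pack, Offset>(p, i): returns the elements at p + i ...,
// reading memory only in blocks that start at the aligned address p + i - Offset.
template <std::size_t Offset>
std::array<int, lanes> load_misaligned(int const* p, std::size_t i)
{
  static_assert(Offset < lanes, "Offset must be smaller than the pack width");
  int const* base = p + i - Offset;            // aligned block start
  std::array<int, lanes> lo{};
  std::array<int, lanes> hi{};
  for (std::size_t k = 0; k < lanes; ++k) lo[k] = base[k];
  if (Offset != 0)                             // second block only when needed
    for (std::size_t k = 0; k < lanes; ++k) hi[k] = base[lanes + k];

  std::array<int, lanes> r{};
  for (std::size_t k = 0; k < lanes; ++k)      // pick lanes across the two blocks
    r[k] = (k + Offset < lanes) ? lo[k + Offset] : hi[k + Offset - lanes];
  return r;
}
```

Because Offset is a template parameter, the lane selection is fixed at compile time, which is what lets real hardware replace it with a single permute or align instruction.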
Shuffle and Swizzle
General Principles
• Elements of a SIMD register can be permuted by the hardware
• Turns complex memory accesses into computations
• Provided by the shuffle function
Examples:
    // a = [ 1 2 3 4 ]
    pack<float> a = enumerate< pack<float> >(1);
    // b = [ 10 11 12 13 ]
    pack<float> b = enumerate< pack<float> >(10);
    // res = [4 12 0 10]
    pack<float> res = shuffle<3,6,-1,4>(a,b);
    struct reverse_
    {
      template <class I, class C>
      struct apply : mpl::int_<C::value - I::value - 1> {};
    };
    // res = [n n-1 ... 2 1]
    pack<float> res = shuffle<reverse_>(a);
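The index convention used by the first shuffle example (0 to 3 select from a, 4 to 7 select from b, -1 yields zero) can be emulated on plain arrays. This is a sketch under that assumed convention; float4, pick and shuffle4 are hypothetical names and the code performs no hardware permutation.

```cpp
#include <array>
#include <cstddef>

using float4 = std::array<float, 4>;  // hypothetical 4-lane pack

// Selects lane I of the concatenation [a | b]; -1 yields zero.
template <int I>
float pick(float4 const& a, float4 const& b)
{
  static_assert(I >= -1 && I < 8, "index out of range");
  std::size_t const k = static_cast<std::size_t>(I < 0 ? 0 : I) % 4;
  return I < 0 ? 0.f : (I < 4 ? a[k] : b[k]);
}

// Compile-time permutation: every output lane is chosen by a template index.
template <int I0, int I1, int I2, int I3>
float4 shuffle4(float4 const& a, float4 const& b)
{
  return float4{{ pick<I0>(a, b), pick<I1>(a, b), pick<I2>(a, b), pick<I3>(a, b) }};
}
```

With a = [1 2 3 4] and b = [10 11 12 13], shuffle4<3,6,-1,4>(a, b) gives [4 12 0 10], and shuffle4<3,2,1,0>(a, a) reverses a, matching the two slide examples.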
Integration with the STL
• Algorithms:
  • SIMD transform
  • SIMD fold
  • Use polymorphic functors or lambdas to mix scalar and SIMD code
• Iterators:
  • Provide SIMD-aware traversals
  • boost::simd::(aligned_)(input/output_)iterator
  • boost::simd::direct_output_iterator
  • boost::simd::shifted_iterator
    std::vector<float, simd::allocator<float>> v(N);
    simd::transform( v.begin(), v.end()
                   , [](auto const& p) { return p * 2.f; }
                   );
    struct average
    {
      template <class T>
      typename T::value_type operator()(T const& t) const
      {
        typename T::value_type d(1./3);
        return (t[0] + t[1] + t[2]) * d;
      }
    };

    std::vector<float, simd::allocator<float>> in(N), o(N);
    std::transform( simd::shifted_iterator<3>(in.begin())
                  , simd::shifted_iterator<3>(in.end())
                  , simd::aligned_output_begin(o.begin())
                  , average()
                  );
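The reason a polymorphic functor is needed can be shown with a small emulation: the algorithm calls it on whole chunks for the bulk of the data and on single scalars for the remainder. The sketch below is not the library's simd::transform; chunk4, chunked_transform and times_two are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical 4-lane chunk standing in for pack<float>.
struct chunk4 { std::array<float, 4> v; };

inline chunk4 operator*(chunk4 c, float s)
{
  for (float& x : c.v) x *= s;
  return c;
}

// Applies f to full 4-wide chunks, then to the scalar remainder.
// f must therefore accept both chunk4 and float.
template <class F>
void chunked_transform(std::vector<float>& data, F f)
{
  std::size_t i = 0;
  for (; i + 4 <= data.size(); i += 4)
  {
    chunk4 c{};
    for (std::size_t k = 0; k < 4; ++k) c.v[k] = data[i + k];
    c = f(c);                                   // "SIMD" call
    for (std::size_t k = 0; k < 4; ++k) data[i + k] = c.v[k];
  }
  for (; i < data.size(); ++i) data[i] = f(data[i]);  // scalar call
}

// Polymorphic functor: the same body works for chunk4 and for float.
struct times_two
{
  template <class T>
  T operator()(T const& p) const { return p * 2.f; }
};
```

A C++14 generic lambda such as [](auto const& p) { return p * 2.f; } plays the same role as times_two.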
Hardware Optimisations
Problem:
• Most SIMD hardware supports fused operations such as fma.
• Those optimisations must remain transparent to the user.
• We use expression templates so that Boost.SIMD optimises those patterns automatically.
Examples:
• a * b + c becomes fma(a, b, c)
• a + b * c becomes fma(b, c, a)
• !(a < b) becomes is_nle(a, b)
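The rewriting idea can be sketched in a few lines: the product is returned as an unevaluated node, and the addition overload that receives this node emits a fused multiply-add. This is a deliberately tiny scalar illustration with hypothetical types (value, product, eval), not the expression-template machinery of the library.

```cpp
#include <cmath>

struct value   { float v; };     // an evaluated operand
struct product { float a, b; };  // an unevaluated a * b node

// a * b does not multiply yet: it only records its operands.
inline product operator*(value a, value b) { return product{a.v, b.v}; }

// a * b + c: the pattern is recognised by overload resolution and fused.
inline value operator+(product p, value c) { return value{std::fma(p.a, p.b, c.v)}; }

// a + b * c: same pattern with the operands swapped.
inline value operator+(value c, product p) { return value{std::fma(p.a, p.b, c.v)}; }

// A product used on its own is evaluated as a plain multiplication.
inline value eval(product p) { return value{p.a * p.b}; }
```

The user still writes a * b + c; the choice of fma happens at compile time through the types of the intermediate expressions.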
Supported Hardware
Open Source
• Intel SSE2-4, AVX
• PowerPC VMX
Proprietary
• ARM NEON
• Intel AVX2, XOP, FMA3, FMA4
• Intel MIC
Other functions ...
Arithmetic
• saturated arithmetic
• long multiplication
• float/int conversion
• round, floor, ceil, trunc
• sqrt, cbrt
• hypot
• average
• random
• min/max
• rounded division and remainder
Bitwise
• select
• andnot, ornot
• popcnt
• ffs
• ror, rol
• rshr, rshl
• twopower
IEEE
• ilogb, frexp
• ldexp
• next/prev
• ulpdist
Predicates
• comparison to zero
• negated comparisons
• is_unord, is_nan, is_invalid
• is_odd, is_even
• majority
Reduction and SWAR operations
Reduction
• any, all
• nbtrue
• minimum/maximum, posmin/posmax
• sum
• product, dot product
SWAR
• group/split
• splatted reduction
• cumsum
• sort
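The difference between a reduction (a pack collapses to one scalar) and a SWAR operation (the result is still a pack) can be emulated on plain arrays. The function names below are hypothetical and only follow the meaning suggested by the names on the slide.

```cpp
#include <array>
#include <cstddef>

using mask4 = std::array<bool, 4>;  // hypothetical per-lane boolean pack
using int4  = std::array<int, 4>;   // hypothetical 4-lane integer pack

// Reductions: the whole pack collapses into a single scalar.
bool any_lane(mask4 const& m)
{
  bool r = false;
  for (bool b : m) r = r || b;
  return r;
}

bool all_lanes(mask4 const& m)
{
  bool r = true;
  for (bool b : m) r = r && b;
  return r;
}

int nbtrue_lanes(mask4 const& m)
{
  int n = 0;
  for (bool b : m) n += b ? 1 : 0;
  return n;
}

int sum_lanes(int4 const& p)
{
  int s = 0;
  for (int x : p) s += x;
  return s;
}

// SWAR-style operation: lane i holds the sum of lanes 0..i.
int4 cumsum_lanes(int4 const& p)
{
  int4 r{};
  int running = 0;
  for (std::size_t i = 0; i < 4; ++i) { running += p[i]; r[i] = running; }
  return r;
}
```

For the pack [1 2 3 4], sum_lanes returns the scalar 10 while cumsum_lanes returns the pack [1 3 6 10].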
Performance!
Basic Functions
Single-precision math functions
Hardware: Core i7 Sandy Bridge, AVX; timings in cycles per value
Function   Range          std   Scalar   SIMD
exp        [−10, 10]      46    38       7
log        [−10, −10]     42    37       5
asin       [−1, 1]        40    35       13
cos        [−20π, 20π]    66    47       6
fast_cos   [−π/4, π/4]    32    9        1.3
Julia set generator
• Generate a fractal image using the Julia function
• Purely compute-bound
• Challenge: workload depends on pixel location
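    template <class T>
    typename meta::as_integer<T>::type julia(T const& a, T const& b)
    {
      typename meta::as_integer<T>::type iter(0);
      std::size_t i = 0;
      T x(0), y(0);   // start from the origin (assumed; left uninitialised on the slide)
      bool active;

      do
      {
        T x2 = x * x;
        T y2 = y * y;
        T xy = T(2) * x * y;
        x = x2 - y2 + a;
        y = xy + b;
        auto mask = x2 + y2 < T(4);   // lanes that have not escaped yet
        iter = selinc(mask, iter);    // increment only those lanes
        active = any(mask);
      } while(active && i++ < 256);

      return iter;
    }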
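The interplay of selinc and any in the kernel above is easier to follow on a lane-by-lane emulation. The sketch below assumes selinc(mask, x) increments x only in lanes where the mask is true and any(mask) is true if at least one lane is; julia4, float4 and int4 are hypothetical names, and the iteration cap of 256 is taken from the slide.

```cpp
#include <array>
#include <cstddef>

using float4 = std::array<float, 4>;
using int4   = std::array<int, 4>;

// Four points iterate in lockstep; the loop stops once no lane is still inside.
int4 julia4(float4 const& a, float4 const& b)
{
  int4   iter{{0, 0, 0, 0}};
  float4 x{{0.f, 0.f, 0.f, 0.f}};
  float4 y{{0.f, 0.f, 0.f, 0.f}};

  for (int step = 0; step < 256; ++step)
  {
    bool any_active = false;
    for (std::size_t k = 0; k < 4; ++k)
    {
      float const x2 = x[k] * x[k];
      float const y2 = y[k] * y[k];
      float const xy = 2.f * x[k] * y[k];
      x[k] = x2 - y2 + a[k];
      y[k] = xy + b[k];
      bool const inside = x2 + y2 < 4.f;  // per-lane mask
      if (inside) ++iter[k];              // selinc: increment only where true
      any_active = any_active || inside;  // any: is some lane still running?
    }
    if (!any_active) break;
  }
  return iter;
}
```

Lanes that escape early keep being computed until the slowest lane finishes, which is the location-dependent workload mentioned as the challenge.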
[Figure: Julia set generator, cycles per element (cpe) versus image size (256 x 256, 512 x 512, 1024 x 1024, 2048 x 2048) for scalar SSE2, simd SSE2, simd AVX and simd AVX2; annotated speedups range from x2.93 to x6.94]
Motion Detection
• Based on Manzanera's Sigma-Delta algorithm
• Background subtraction
• Pixel intensity modeled as Gaussian distributions
• Challenge: low arithmetic intensity
    template <typename T>
    T sigma_delta(T& bkg, T const& frm, T& var)
    {
      bkg = selinc(bkg < frm, seldec(bkg > frm, bkg));

      T dif = dist(bkg, frm);
      T mul = muls(dif, 3);

      var = if_else( dif != T(0)
                   , selinc(var < mul, seldec(var > mul, var))
                   , var
                   );

      return if_zero_else_one( dif < var );
    }
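For reference, the same update can be written for one 8-bit pixel with ordinary branches. This sketch assumes dist is the absolute difference, muls is a saturated multiplication and if_zero_else_one returns 0 where the condition holds; sigma_delta_pixel is a hypothetical name.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar reference for one pixel. Returns 1 for "motion", 0 for "background".
std::uint8_t sigma_delta_pixel(std::uint8_t& bkg, std::uint8_t frm, std::uint8_t& var)
{
  // Move the background estimate one step towards the current frame.
  if (bkg < frm) ++bkg;
  else if (bkg > frm) --bkg;

  // Absolute difference, and three times that difference saturated to 255.
  std::uint8_t const dif = static_cast<std::uint8_t>(bkg > frm ? bkg - frm : frm - bkg);
  std::uint8_t const mul = static_cast<std::uint8_t>(std::min<int>(255, dif * 3));

  // Move the variance estimate one step towards mul, only where the pixel differs.
  if (dif != 0)
  {
    if (var < mul) ++var;
    else if (var > mul) --var;
  }

  return dif < var ? 0 : 1;
}
```

Every branch here becomes a selection (selinc, seldec, if_else) in the SIMD version, so all lanes follow the same instruction stream.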
[Figure: motion detection, frames per second (FPS, axis scaled by 10^4) versus frame size (480 x 640 @5, 600 x 800 @5, 1024 x 1280 @5, 1080 x 1920 @5, 2160 x 3840 @5) for scalar SSE2, simd SSE2, simd AVX and simd AVX2; annotated speedups range from x3.50 to x26.96]
Sparse Tridiagonal Solver
• Solve Ax = b with sparse A and multiple x
• Application in fluid mechanics
• Challenge: using SIMD despite being sparse
• Solution: shuffle for local densification
Conclusion
Boost.SIMD
• We can design a C++ library for low-level performance primitives
• A lot of success with a lot of different applications
• Find us on https://github.com/jfalcou/nt2
• Tests, comments and feedback welcome
Soon
• A rewrite is in progress before submission to Boost
• More architectures to come!
Thanks for your attention!