Download - Accelerate Framework · •SIMD instructions SSE, AVX and NEON ... •Digital signal processing (vDSP)
These are confidential sessions—please refrain from streaming, blogging, or taking pictures
Fast and energy efficient computation
Session 713
Accelerate Framework
Geoff BelterEngineer, Vector and Numerics Group
What is it?Accelerate Framework
• Easy access to a lot of functionality•Accurate• Fast with low energy usage•Works on both OS X and iOS•Optimized for all of generations of hardware
What operations are available?Accelerate Framework
• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)
Session goalsAccelerate Framework
•How Accelerate helps you•Areas of your code likely to benefit from Accelerate•How you use Accelerate
Why is it fast?Accelerate Framework
Why is it fast?Accelerate Framework
• SIMD instructions■ SSE, AVX and NEON
Why is it fast?Accelerate Framework
• SIMD instructions■ SSE, AVX and NEON
•Match the micro-architecture■ Instruction selection and scheduling■ Software pipelining■ Loop unrolling
Why is it fast?Accelerate Framework
• SIMD instructions■ SSE, AVX and NEON
•Match the micro-architecture■ Instruction selection and scheduling■ Software pipelining■ Loop unrolling
•Multi-threaded using GCD
Tips for Successful Use of Accelerate
Tips for Successful Use of Accelerate
• Prepare your data■ Contiguous■ 16-byte aligned
Tips for Successful Use of Accelerate
• Prepare your data■ Contiguous■ 16-byte aligned
•Understand problem size
Tips for Successful Use of Accelerate
• Prepare your data■ Contiguous■ 16-byte aligned
•Understand problem size•Do setup once/destroy at the end
Using Accelerate FrameworkXcode
Using Accelerate FrameworkXcode
Using Accelerate FrameworkXcode
XcodeUsing Accelerate Framework
XcodeUsing Accelerate Framework
XcodeUsing Accelerate Framework
XcodeUsing Accelerate Framework
XcodeUsing Accelerate Framework
XcodeUsing Accelerate Framework
Command L=lineUsing Accelerate Framework
cc -framework Accelerate main.c
Command L=lineUsing Accelerate Framework
What operations are available?Accelerate Framework
• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)
Vectorized image processing libraryvImage
vImageWhat’s available?
vImageWhat’s available?
Additions and Improvements
Additions and Improvements
• Improved conversion support
Additions and Improvements
• Improved conversion support• vImage Buffer creation utilities
Additions and Improvements
• Improved conversion support• vImage Buffer creation utilities• Resampling of 16-bit images
Additions and Improvements
• Improved conversion support• vImage Buffer creation utilities• Resampling of 16-bit images• Streamlined Core Graphics interoperability
Core Graphics Interoperability
Core Graphics Interoperability
•How do I use vImage with my CGImageRef?
Core Graphics Interoperability
•How do I use vImage with my CGImageRef?•New utility functions• vImageBuffer_InitWithCGImage• vImageCreateCGImageFromBuffer
Core Graphics InteroperabilityFrom CGImageRef to vImageBuffer
#include <Accelerate/Accelerate.h>
// Create and prepare CGImageRefCGImageRef inImg;
// Specify FormatvImage_CGImageFormat format = { .bitsPerComponent = 8, .bitsPerPixel = 32, .colorSpace = NULL, .bitmapInfo = kCGImageAlphaFirst, .version = 0, .decode = NULL, .renderingIntent = kCGRenderingIntentDefault,};
// Create vImageBuffervImage_Buffer inBuffer;vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);
Core Graphics InteroperabilityFrom CGImageRef to vImageBuffer
#include <Accelerate/Accelerate.h>
// Create and prepare CGImageRefCGImageRef inImg;
// Specify FormatvImage_CGImageFormat format = { .bitsPerComponent = 8, .bitsPerPixel = 32, .colorSpace = NULL, .bitmapInfo = kCGImageAlphaFirst, .version = 0, .decode = NULL, .renderingIntent = kCGRenderingIntentDefault,};
// Create vImageBuffervImage_Buffer inBuffer;vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);
Core Graphics InteroperabilityFrom CGImageRef to vImageBuffer
#include <Accelerate/Accelerate.h>
// Create and prepare CGImageRefCGImageRef inImg;
// Specify FormatvImage_CGImageFormat format = { .bitsPerComponent = 8, .bitsPerPixel = 32, .colorSpace = NULL, .bitmapInfo = kCGImageAlphaFirst, .version = 0, .decode = NULL, .renderingIntent = kCGRenderingIntentDefault,};
// Create vImageBuffervImage_Buffer inBuffer;vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);
Core Graphics InteroperabilityFrom CGImageRef to vImageBuffer
#include <Accelerate/Accelerate.h>
// Create and prepare CGImageRefCGImageRef inImg;
// Specify FormatvImage_CGImageFormat format = { .bitsPerComponent = 8, .bitsPerPixel = 32, .colorSpace = NULL, .bitmapInfo = kCGImageAlphaFirst, .version = 0, .decode = NULL, .renderingIntent = kCGRenderingIntentDefault,};
// Create vImageBuffervImage_Buffer inBuffer;vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);
Core Graphics InteroperabilityFrom vImageBuffer to CGImageRef
#include <Accelerate/Accelerate.h>
// The output buffervImage_Buffer outBuffer;
// Create CGImageRefvImage_Error error;CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL, !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! NULL, kvImageNoFlags, &error);
Core Graphics InteroperabilityFrom vImageBuffer to CGImageRef
#include <Accelerate/Accelerate.h>
// The output buffervImage_Buffer outBuffer;
// Create CGImageRefvImage_Error error;CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL, !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! NULL, kvImageNoFlags, &error);
Core Graphics InteroperabilityFrom vImageBuffer to CGImageRef
#include <Accelerate/Accelerate.h>
// The output buffervImage_Buffer outBuffer;
// Create CGImageRefvImage_Error error;CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL, !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! NULL, kvImageNoFlags, &error);
Core Graphics Interoperability
• Convert between vImage_CGImageFormat types• Tips for use
■ Create converter once■ Use many times
vImageConvert_AnyToAny
0
10
20
30
ARGB
ARGB
pre
mul
BGRA
pre
mul
RGBA
RGBA
pre
mul
RGBA
(flo
at)
RGBA
(flo
at) p
rem
ul
RGBA
(uin
t16)
RGBA
(uin
t16)
pre
mul
MPi
xel/s
Software JPEG Encode Performance
Bigger is Better
iPhone 5
Old methodAnyToAny
0
10
20
30
ARGB
ARGB
pre
mul
BGRA
pre
mul
RGBA
RGBA
pre
mul
RGBA
(flo
at)
RGBA
(flo
at) p
rem
ul
RGBA
(uin
t16)
RGBA
(uin
t16)
pre
mul
MPi
xel/s
Software JPEG Encode Performance
Bigger is Better
iPhone 5
Old methodAnyToAny
0
10
20
30
ARGB
ARGB
pre
mul
BGRA
pre
mul
RGBA
RGBA
pre
mul
RGBA
(flo
at)
RGBA
(flo
at) p
rem
ul
RGBA
(uin
t16)
RGBA
(uin
t16)
pre
mul
MPi
xel/s
Software JPEG Encode Performance
Bigger is Better
iPhone 5
Old methodAnyToAny
0
10
20
30
ARGB
ARGB
pre
mul
BGRA
pre
mul
RGBA
RGBA
pre
mul
RGBA
(flo
at)
RGBA
(flo
at) p
rem
ul
RGBA
(uin
t16)
RGBA
(uin
t16)
pre
mul
MPi
xel/s
Software JPEG Encode Performance
Bigger is Better
iPhone 5
Old methodAnyToAny
Scaling with a premultiplied PlanarF imageConversion Example
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Scaling with a premultiplied PlanarF imageConversion Example
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Scaling with a premultiplied PlanarF imageConversion Example
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Scaling with a premultiplied PlanarF imageConversion Example
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Scaling with a premultiplied PlanarF imageConversion Example
#include <Accelerate/Accelerate.h>
vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);
// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);
// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);
Conversions in ContextScaling on a premultiplied PlanarF image
0 25 50 75 100
Unpremultiply
Scale
Premultiply
Percentage of time taken
Conversions in ContextScaling on a premultiplied PlanarF image
0 25 50 75 100
Unpremultiply
Scale
Premultiply 0.64%
98.26%
1.10%
Percentage of time taken
vImage vs. OpenCV
The competitionOpenCV
•Open source computer vision library• Image processing module
Points of Comparison
• Execution time• Energy consumed
vImage Speedup over OpenCViPhone 5
0 5 10 15 20 25
Histogram
Max
Box Convolve
Affine Warp
Speedup over OpenCV
Above one is better
(lanczos)
vImage Speedup over OpenCViPhone 5
0 5 10 15 20 25
Histogram
Max
Box Convolve
Affine Warp 6.41
7.88
23.19
1.62
Speedup over OpenCV
Above one is better
(lanczos)
Energy Consumption and Battery Life
• Fast code tends to■ Decrease energy consumption■ Increase battery life
Typical Energy Consumption Profile
Pow
er
Idle power
Instantaneous power
Energy
t1t0
p2
p1
p0
Time
Typical Energy Consumption Profile
Pow
er
Idle power
Instantaneous power
Time
Unoptimized
Opt
imiz
ed
vImage Energy Savings over OpenCViPhone 5
0 2 4 6 8
Histogram
Max
Box Convolve
Affine Warp
Times less energy than OpenCV
(lanczos)
Above one is better
vImage Energy Savings over OpenCViPhone 5
0 2 4 6 8
Histogram
Max
Box Convolve
Affine Warp 6.96
4.05
6.71
0.75
Times less energy than OpenCV
(lanczos)
Above one is better
“Using vImage from the Accelerate framework to dynamically pre-render my sprites. It’s the only way to make it fast. ;-)”
–Twitter User
What operations are available?Accelerate Framework
• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)
Vectorized digital signal processing libraryvDSP
• Basic operations on arrays■ Add, subtract, multiply, conversion, accumulation, etc.
•Discrete Fourier Transform• Convolution and correlation
vDSPWhat’s available?
New and Improved in vDSP
New and Improved in vDSP
•Multi-channel IIR filter
New and Improved in vDSP
•Multi-channel IIR filter• Improved power of 2 support
■ Discrete Fourier Transform (DFT)■ Discrete Cosine Transform (DCT)
Discrete Fourier Transform (DFT)
• Same operation, two entries based on number of points
Discrete Fourier Transform (DFT)
• Same operation, two entries based on number of points
256
FFT
384
DFT
Before
Discrete Fourier Transform (DFT)
• Same operation, two entries based on number of points
256
FFT
384
DFT
256 384
DFT
Before OS X 10.9, iOS 7
#include <Accelerate/Accelerate.h>
// Create and prepare data:float *Ir,*Ii,*Or,*Oi;
// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:float *Ir,*Ii,*Or,*Oi;
// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:float *Ir,*Ii,*Or,*Oi;
// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:float *Ir,*Ii,*Or,*Oi;
// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);
DFT Example
#include <Accelerate/Accelerate.h>
// Create and prepare data:float *Ir,*Ii,*Or,*Oi;
// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);
DFT Example
vDSP vs. FFTW
Fastest Fourier Transform in the westFFTW
•One and multi-dimensional transforms• Real and complex data• Parallel
vDSP Speedup over FFTWDFT on iPhone 5
vDSP Speedup over FFTW
240 256 320 384 480 512 640 768 9600
1
2
3
Spee
dup
Number of Points
DFT on iPhone 5
Above one is better
vDSP Speedup over FFTW
240 256 320 384 480 512 640 768 9600
1
2
3
Spee
dup
Number of Points
DFT on iPhone 5
Above one is better
•Used in FaceTime•DFT one of many DSP routines• Percentage of time spent in DFT
DFT In UseAAC Enhanced Low Delay
DFT In UseAAC Enhanced Low Delay
DFT In UseAAC Enhanced Low Delay
47%54%
DFTEverything Else
FFTW
Percent time spent in DFT
DFT In UseAAC Enhanced Low Delay
47%54%
DFTEverything Else
FFTW
70%
30%
vDSP
Percent time spent in DFT
Data Types in vDSP
• Single and double precision• Real and complex• Support for strided data access
“Wanna do FFT on iOS?Use the Accelerate.framework.
Highly recommended. #ios”
–Twitter User
What operations are available?Accelerate Framework
• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)
Fast Math
Luke ChangEngineer, Vector and Numerics Group
Math for Every Data Length
• Libm for scalar data
float
Math for Every Data Length
• Libm for scalar data• vMathLib for SIMD vectors
float vFloat
Math for Every Data Length
• Libm for scalar data• vMathLib for SIMD vectors• vForce for array data
float vFloat
...float []
Standard C math libraryLibm
Libm
• Standard math library in C• Collection of transcendental functions•Operates on scalar data
■ exp[f ]■ log[f ]■ sin[f ]■ cos[f ]■ pow[f ]■ Etc…
What’s New in Libm?
• Extensions to C11, prefixed with “__”•Added in both iOS 7 and OS X 10.9• __exp10[f ]• __sinpi[f ], __cospi[f ], __tanpi[f ]• __sincos[f ], __sincospi[f ]
__exp10[f]Power of 10
• Commonly used for decibel calculations• Faster than pow(10.0, x)•More accurate than exp(log(10) * x)
■ exp(log(10) * 5.0) =100000.0000000002
__sinpi[f ], __cospi[f ], __tanpi[f ]Trigonometry in Terms of PI
• cospi(x) means cos( *x)• Faster because argument reduction is simpler•More accurate when working with degrees
■ cos(M_PI*0.5) returns 6.123233995736766e-17■ __cospi(0.5) returns exactly 0.0
⇡
__sincos[f ], __sincospi[f ]Sine-Cosine Pairs
• Compute sine and cosine simultaneously• Faster because argument reduction is only done once• Clang will call __sincos[f ] when possible
C11 Features
• Some complex values can’t be specified as literals■ (0.0 + INFINITY * I)
• C11 adds CMPLX macro for this purpose■ CMPLX(0.0, INFINITY)
• CMPLXF and CMPLXL are also available
SIMD vector math libraryvMathLib
vMathLib
• Collection of transcendental functions for SIMD vectors•Operates on SIMD vectors
■ vexp[f ]■ vlog[f ]■ vsin[f ]■ vcos[f ]■ vpow[f ]■ Etc…
Writing your own vector algorithmWhen to Use vMathLib?
•Need transcendental functions in your vector code
Taking sine of a vectorvMathLib Example
•Using Libm
#include <math.h>
vFloat vx = { 1.f, 2.f, 3.f, 4.f };vFloat vy;...float *px = (float *)&vx, *py = (float *)&vy;for( i = 0; i < sizeof(vx)/sizeof(px[0]); ++i ) {
py[i] = sinf(px[i]);}...
Taking sine of a vectorvMathLib Example
•Using vMathLib
#include <Accelerate/Accelerate.h>
vFloat vx = { 1.f, 2.f, 3.f, 4.f };vFloat vy;...vy = vsinf(vx);...
Vectorized math libraryvForce
vForce
• Collection of transcendental functions for arrays•Operates on array data
■ vvexp[f ]■ vvlog[f ]■ vvsin[f ]■ vvcos[f ]■ vvpow[f ]■ Etc…
vForce Example
• Filling a buffer with sine wave using a for loop
#include <math.h>
float buffer[length];float indices[length];
...
for (int i = 0; i < length; i++){ buffer[i] = sinf(indices[i]);}
vForce Example
• Filling a buffer with sine wave using vForce
#include <Accelerate/Accelerate.h>
float buffer[length];float indices[length];
...
vvsinf(buffer, indices, &length);
Better PerformanceMeasured on iPhone 5
vForce
For loop
0 1 2 3 4 5 6
Sines Computed per µs
0 10 20 30 40 50 60
Better PerformanceMeasured on iPhone 5
Bigger is better
56
24
vForce
For loop
0 1 2 3 4 5 6
nJ consumed per sine
0 4 8 12 16 20 24
Less EnergyMeasured on iPhone 5
Smaller is better
vForce
For loop
0 1 2 3 4 5 6
nJ consumed per sine
0 4 8 12 16 20 24
Less EnergyMeasured on iPhone 5
Smaller is better
14.16
23.37
Measured on iPhone 5vForce Performance
0 40 80 120 160 200
truncf
logf
expf
powf
sinf
sincosf
Results computed per µs vForceFor loop
Measured on iPhone 5vForce Performance
0 40 80 120 160 200
truncf
logf
expf
powf
sinf
sincosf 10.6
24
9.2
29
22
161
28.8
56
26.6
120
116
0
Results computed per µs vForceFor loop
826
vForce in Detail
• Supports both float and double•Handles edge cases correctly• Requires minimal data alignment• Supports in-place operation• Improves performance even with small arrays
■ Consider using vForce when more than 16 elements
What operations are available?Accelerate Framework
• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)
Linear Algebra PACKage andBasic Linear Algebra Subprograms
LAPACK and BLAS
Geoff BelterEngineer, Vector and Numerics Group
LAPACK Operations
•High-level linear algebra• Solve linear systems•Matrix factorizations• Eigenvalues and eigenvectors
LINPACK
•How fast can you solve a system of linear equations?
•1000 x 1000 matrices
0 500 1000 1500
LAPACK
“Brand A”
LINPACK benchmark performance in Mflops
Bigger is better
Mflops
0 500 1000 1500
LAPACK
“Brand A” 40
LINPACK benchmark performance in Mflops
Bigger is better
Mflops
0 500 1000 1500
LAPACK
“Brand A”
LINPACK benchmark performance in Mflops
Bigger is better
Mflops
0 500 1000 1500
LAPACK
“Brand A”
LINPACK benchmark performance in Mflops
Mflops
Bigger is better
0 500 1000 1500
LAPACK
“Brand A” 788
LINPACK benchmark performance in Mflops
Mflops
Bigger is better
0 500 1000 1500
LAPACK
Accelerate
“Brand A”
Bigger is better
788
LINPACK benchmark performance in Mflops
Mflops
0 500 1000 1500
LAPACK
Accelerate
“Brand A”
Bigger is better
788
1202
LINPACK benchmark performance in Mflops
Mflops
0 500 1000 1500
LAPACK
Accelerate on iPhone 4S
“Brand A”
Bigger is better
788
1202
LINPACK benchmark performance in Mflops
Mflops
0 500 1000 1500
LAPACK
“Brand A”
Bigger is better
788
1202
LINPACK benchmark performance in Mflops
Mflops
0 500 1000 1500 2000
LAPACK
“Brand A”
Bigger is better
Accelerate on iPhone 5
788
LINPACK benchmark performance in Mflops
Mflops
0
Accelerate on iPhone 4S
LAPACK
“Brand A”
500
Bigger is better
1000 1500 2000 2500 3000 3500 4000
788
LINPACK benchmark performance in Mflops
Mflops
Accelerate on iPhone 5
0
Accelerate on iPhone 4S
LAPACK
“Brand A”
3446
500
Bigger is better
1000 1500 2000 2500 3000 3500 4000
788
LINPACK benchmark performance in Mflops
Mflops
Accelerate on iPhone 5
VS.iPadwith Retina Display G5
Power Mac
iPad with Retina display vs. Power Mac G5LAPACK
• Triumphant return•After 10 years•All fans blazing
LAPACK
0
Accelerate on iPhone 4S
PowerMac G5
iPad with Retina Display
1000 2000 3000 4000
Bigger is better
LINPACK benchmark performance in Mflops
Mflops
LAPACK
0
Accelerate on iPhone 4S
PowerMac G5
iPad with Retina Display
1000 2000 3000 4000
Bigger is better
3643
LINPACK benchmark performance in Mflops
Mflops
LAPACK
0
Accelerate on iPhone 4S
PowerMac G5
iPad with Retina Display
1000 2000 3000 4000
Bigger is better
3643
LINPACK benchmark performance in Mflops
Mflops
LAPACK
0
Accelerate on iPhone 4S
PowerMac G5
iPad with Retina Display
1000 2000 3000 4000
3686
Bigger is better
3643
LINPACK benchmark performance in Mflops
Mflops
LAPACK ExampleSolve linear system
#include <Accelerate/Accelerate.h>
// Create and prepare input and output datadouble *A, *B;__CLPK_integer *ipiv;
// solvedgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);
LAPACK ExampleSolve linear system
#include <Accelerate/Accelerate.h>
// Create and prepare input and output datadouble *A, *B;__CLPK_integer *ipiv;
// solvedgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);
LAPACK ExampleSolve linear system
#include <Accelerate/Accelerate.h>
// Create and prepare input and output datadouble *A, *B;__CLPK_integer *ipiv;
// solvedgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);
BLAS Operations
• Low-level linear algebra• Vector
■ Dot product, scalar product, vector sum
•Matrix-vector■ Matrix-vector product, outer product
•Matrix-matrix■ Matrix multiply
BLAS ExampleMatrix Multiply
#include <Accelerate/Accelerate.h>
// Create and prepare datadouble *A, *B, *C;
// C <-- A * Bcblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 100, 100, 100, 1.0, A, 100, B, 100, 0.0, C, 100);
BLAS ExampleMatrix Multiply
#include <Accelerate/Accelerate.h>
// Create and prepare datadouble *A, *B, *C;
// C <-- A * Bcblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 100, 100, 100, 1.0, A, 100, B, 100, 0.0, C, 100);
BLAS ExampleMatrix Multiply
#include <Accelerate/Accelerate.h>
// Create and prepare datadouble *A, *B, *C;
// C <-- A * Bcblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 100, 100, 100, 1.0, A, 100, B, 100, 0.0, C, 100);
Data Types
• Single and double precision• Real and complex•Multiple data layouts
■ Dense, banded, triangular, etc.■ Transpose, conjugate transpose■ Row and column major
“Playing with the Accelerate.framework today. Having a BLASt.”
–Twitter User
Summary
Lots of functionalityAccelerate Framework
• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)
Accelerate Framework
• Easy access to a lot of functionality•Accurate• Fast with low energy usage•Works on both OS X and iOS•Optimized for all of generations of hardware
Features and benefits
• Prepare your data■ Contiguous■ 16-byte aligned
•Understand problem size•Do setup once/destroy at the end
Accelerate FrameworkTo be successful
If You Need a Feature
If You Need a FeatureRequest It
“Discrete Cosine Transform was my feature request that made it into the
Accelerate Framework. I feel so special!”
–Twitter User
“Thanks Apple for making the Accelerate Framework.”
–Twitter User
Paul DanboldCore OS Technologies [email protected]
George WarnerDTS Sr. Support [email protected]
DocumentationvImage Programming Guidehttp://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vImage/Introduction/Introduction.html
vDSP Programming Guidehttp://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vDSP_Programming_Guide/Introduction/Introduction.html
Apple Developer Forumshttp://devforums.apple.com
More Information
Labs
Accelerate Lab Core OS Lab AThursday 4:30PM