Tutorial Presentation 6

Upload: hisuin

Post on 07-Jul-2018


TRANSCRIPT

  • 8/19/2019 Tutorial Presentation 6

    1/24

    BLAS and Vectorization extensions.

    Carlos Pachajoa

    December 5, 2012


  • 2/24

    Contents

    BLAS

    Vectorization extensions for X86

    GPGPU


  • 3/24

    BLAS

Stands for Basic Linear Algebra Subprograms.

    It is an interface for linear algebra operations. BLAS itself is a specification for Fortran; the equivalent interface in C is CBLAS.

    Use the local implementation with #include <cblas.h>


  • 4/24

    BLAS levels

The operations are divided into three levels:

    Level 1: vector operations like

    y ← αx + y,   x, y ∈ Z^N

    and also dot products and vector norms.

    Level 2: matrix-vector operations like

    y ← αAx + βy,   x, y ∈ Z^N,   A ∈ Z^(M×N)

    and solutions of triangular systems, among others.

    Level 3: matrix-matrix operations like

    C ← αAB + βC,   C ∈ Z^(M×N),   A ∈ Z^(M×P),   B ∈ Z^(P×N)

    There are different calls for the different precisions, and for real or complex numbers.


  • 5/24

    BLAS function naming conventions

The first letter specifies the precision:

      S  for real, single precision.

      D  for real, double precision.

      C  for complex, single precision.

      Z  for complex, double precision.

    The first letter is followed by a function name; for example, xAXPY is y ← αx + y from level 1. Here, x stands for one of the precision letters above. Therefore, SAXPY will perform the operation using single-precision floating-point numbers.


  • 6/24

    CBLAS data representation

CBLAS receives data as contiguous positions in memory and a size. Both matrices and vectors are stored in this way. To specify a matrix, one additionally has to provide a stride (the leading dimension) and define whether it is row- or column-major.

    {1,2,3,4,5,6,7,8,9}  will be

    1 2 3
    4 5 6
    7 8 9

    in row-major order, and

    1 4 7
    2 5 8
    3 6 9

    in column-major order.

    The ordering is given using this enumeration:

    enum CBLAS_ORDER {CblasRowMajor=101, CblasColMajor=102};


  • 7/24

    A function signature

y ← αAx + βy,   y ← αA^T x + βy

    void cblas_sgemv(const enum CBLAS_ORDER Order,
                     const enum CBLAS_TRANSPOSE TransA,
                     const int M, const int N,
                     const float alpha,
                     const float *A, const int lda,
                     const float *X, const int incX,
                     const float beta,
                     float *Y, const int incY);


  • 8/24

    Enumeration types

enum CBLAS_ORDER {          // Storage order
        CblasRowMajor = 101,    /* row-major arrays */
        CblasColMajor = 102     /* column-major arrays */
    };

    enum CBLAS_TRANSPOSE {      // Whether to work with the transpose
        CblasNoTrans = 111,     /* trans = 'N' */
        CblasTrans = 112,       /* trans = 'T' */
        CblasConjTrans = 113    /* trans = 'C' */
    };

    enum CBLAS_UPLO {           // The matrix is upper or lower triangular
        CblasUpper = 121,       /* uplo = 'U' */
        CblasLower = 122        /* uplo = 'L' */
    };

    enum CBLAS_DIAG {           // Whether the matrix is unit triangular
        CblasNonUnit = 131,     /* diag = 'N' */
        CblasUnit = 132         /* diag = 'U' */
    };

    enum CBLAS_SIDE {           // Order of matrix multiplication
        CblasLeft = 141,        /* side = 'L' */
        CblasRight = 142        /* side = 'R' */
    };


  • 9/24

    Some CBLAS implementations

     ATLAS (Automatically Tuned Linear Algebra Software)

MKL (Math Kernel Library)

     CUBLAS


  • 10/24

    Documents with CBLAS routines

http://math-atlas.sourceforge.net/psdoc/cblasqref.ps

    https://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.html


  • 11/24

    SIMD

    Single Instruction, Multiple Data

    Taken from

    http://archive.arstechnica.com/cpu/1q00/simd/figure6.gif


  • 12/24

    SSE

    Streaming SIMD Extensions.

Additional registers in the processor and new operations in the architecture.

    http://en.wikipedia.org/wiki/File:XMM_registers.svg

8 registers with 128 bits each; 4 single-precision floating-point numbers fit in each register.


  • 13/24

    Some instructions

; All instructions end in S
    ; The penultimate letter denotes scalar or vector:
    ; S stands for scalar, P for packed (vector).
    ; Operands are XMM registers

    ; Adds all elements of arrays op1 and op2 into op1
    ADDPS op1, op2

    ; Adds the first elements of op1 and op2 into
    ; the first position of op1
    ADDSS op1, op2


  • 14/24

    Some SSE instructions

vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;

    ; xmm0 = v1.w | v1.z | v1.y | v1.x
    movaps xmm0, [v1]
    ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
    addps xmm0, [v2]
    movaps [vec_res], xmm0

    http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions


  • 15/24

    SSE upgrades

SSE2  Allows multiple types of data fitting in the vectors, including integers and characters, to be represented, and the corresponding operations to be performed. AMD's implementation also doubled the number of XMM registers.

    SSE3  Addition of horizontal operations within the XMM registers, such as data reduction.

    SSSE3  Additional instructions for SSE3.

    SSE4  Introduction of Dword multiply, allowing two pairs of 32-bit integers to be multiplied to produce two 64-bit numbers. Vector dot products.

    http://find/

  • 16/24

    AVX

    Intel’s extension to SSE for the Sandy Bridge microarchitecture,

    introduced in 2011. Also available in AMD’s Bulldozer.

    http://upload.wikimedia.org/wikipedia/commons/f/f5/AVX_registers.svg


  • 17/24

    Automatic vectorization

The Intel compiler can, under certain conditions, vectorize loops in the code.

    This can be activated by using the  -vec  option.

    for(i=0; i < SIZE; i++) A[i] = B[i] + C[i];

  • 18/24

    Obstacles to vectorization

     Non-contiguous memory access

    for(i=0; i < SIZE; i+=stride) A[i] = B[i] + C[i];

      Data dependencies

    for(i=0; i < SIZE; i++) A[i] = A[i-1] + B[i];



  • 19/24

    Guiding ICC vectorization

Pragmas  For example, #pragma ivdep, among others, to control when to vectorize a loop.

      Keywords, such as  restrict .

      Switches  passed to the compiler, such as optimization levels.

    Look at the ICC automatic vectorization documentation [11].


  • 20/24

    GPGPU

http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/


  • 21/24

    GPGPU and CPU

    CPU

      General purpose

      Pipelines

     Few threads

     Lots of cache (Correlation)

    GPGPU

     Specialized for local vector operations

Many cores and threads

     Little cache

     Lower power consumption relative to a CPU


  • 22/24

    CUDA

    Stands for  Compute Unified Device Architecture .

Effectively, a programming model to access and control GPUs using a virtual instruction set, in a similar manner as a CPU.

    Only supported on NVIDIA cards.

    Uses the NVIDIA compiler, and can be programmed using CUDA C/C++, languages based on C/C++.


  • 23/24

    OpenCL

    Stands for  Open Computing Language 

    It also provides access to the GPU.

     It’s an open standard, supported by NVIDIA and AMD,among others.

     Provides a language based on C99.

Functionality provided by a driver.

     Compilation handled by linking to the correct library.


  • 24/24

    References

    http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

    http://www.netlib.org/blas/

    http://www.stanford.edu/class/me200c/tutorial_77/18.1_blas.html

    http://math-atlas.sourceforge.net/faq.html

http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htm

    https://developer.nvidia.com/cublas

    http://www.godevtool.com/TestbugHelp/XMMfpins.htm

    software.intel.com/en-us/avx

    http://www.khronos.org/opencl/

http://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdf

http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htm
