Tutorial Presentation 6

Upload: hisuin

Post on 07-Jul-2018


TRANSCRIPT

  • 8/19/2019 Tutorial Presentation 6

    1/24

    BLAS and Vectorization extensions.

    Carlos Pachajoa

    December 5, 2012


  • 2/24

    Contents

    BLAS

    Vectorization extensions for X86

    GPGPU


  • 3/24

    BLAS

Stands for Basic Linear Algebra Subprograms.

    It is an interface for linear algebra operations. BLAS itself is a specification for Fortran; the equivalent interface in C is CBLAS.

    Use the local implementation with #include <cblas.h>


  • 4/24

    BLAS levels

The operations are divided into three levels:

    Level 1: vector operations like

    y ← αx + y,   x, y ∈ Z^N

    and also dot products and vector norms.

    Level 2: matrix-vector operations like

    y ← αAx + βy,   x, y ∈ Z^N,   A ∈ Z^(M×N)

    and solutions of triangular systems, among others.

    Level 3: matrix-matrix operations like

    C ← αAB + βC,   C ∈ Z^(M×N),   A ∈ Z^(M×P),   B ∈ Z^(P×N)

    There are different calls for the different precisions, and for real or complex numbers.


  • 5/24

    BLAS function naming conventions

The first letter specifies the precision:

      S  for real, single precision.

      D  for real, double precision.

      C  for complex, single precision.

      Z  for complex, double precision.

    The first letter is followed by a function name; for example, xAXPY is y ← αx + y from level 1. Here, x stands for one of the precision letters above. Therefore, SAXPY will perform the operation using single-precision floating-point numbers.


  • 6/24

    CBLAS data representation

CBLAS receives data as contiguous positions in memory and a size. Both matrices and vectors are stored in this way. To specify a matrix, one additionally has to provide a stride (the leading dimension) and define whether it is row- or column-major.

    {1,2,3,4,5,6,7,8,9}  will be

    1 2 3
    4 5 6
    7 8 9

    in row-major order, and

    1 4 7
    2 5 8
    3 6 9

    in column-major order.

    The ordering is given using this enumeration:

    enum CBLAS_ORDER {CblasRowMajor=101, CblasColMajor=102};


  • 7/24

    A function signature

y ← αAx + βy,   y ← αA^T x + βy

    void cblas_sgemv(const enum CBLAS_ORDER Order,
                     const enum CBLAS_TRANSPOSE TransA,
                     const int M, const int N,
                     const float alpha,
                     const float *A, const int lda,
                     const float *X, const int incX,
                     const float beta,
                     float *Y, const int incY);


  • 8/24

    Enumeration types

enum CBLAS_ORDER {          // Storage order
        CblasRowMajor = 101,    /* row-major arrays */
        CblasColMajor = 102     /* column-major arrays */
    };

    enum CBLAS_TRANSPOSE {      // Whether to work with the transpose
        CblasNoTrans = 111,     /* trans = 'N' */
        CblasTrans = 112,       /* trans = 'T' */
        CblasConjTrans = 113    /* trans = 'C' */
    };

    enum CBLAS_UPLO {           // The matrix is upper or lower triangular
        CblasUpper = 121,       /* uplo = 'U' */
        CblasLower = 122        /* uplo = 'L' */
    };

    enum CBLAS_DIAG {           // Whether the matrix is unit triangular
        CblasNonUnit = 131,     /* diag = 'N' */
        CblasUnit = 132         /* diag = 'U' */
    };

    enum CBLAS_SIDE {           // Order of matrix multiplication
        CblasLeft = 141,        /* side = 'L' */
        CblasRight = 142        /* side = 'R' */
    };


  • 9/24

    Some CBLAS implementations

     ATLAS (Automatically Tuned Linear Algebra Software)

MKL (Math Kernel Library)

     CUBLAS


  • 10/24

    Documents with CBLAS routines

http://math-atlas.sourceforge.net/psdoc/cblasqref.ps

    https://developer.apple.com/library/mac/documentation/Accelerate/Reference/BLAS_Ref/Reference/reference.html


  • 11/24

    SIMD

    Single Instruction, Multiple Data

    Taken from

    http://archive.arstechnica.com/cpu/1q00/simd/figure6.gif


  • 12/24

    SSE

    Streaming SIMD Extensions.

Additional registers in the processor and new operations in the architecture.

    http://en.wikipedia.org/wiki/File:XMM_registers.svg

8 registers with 128 bits each; 4 single-precision floating-point numbers fit in each register.


  • 13/24

    Some instructions

; All instructions end in S
    ; The penultimate letter denotes scalar or vector:
    ; S stands for scalar, P for packed (vector).
    ; Operands are XMM registers

    ; Adds all elements of arrays op1 and op2 into op1
    ADDPS op1, op2

    ; Adds the first elements of op1 and op2 into
    ; the first position of op1
    ADDSS op1, op2


  • 14/24

    Some SSE instructions

vec_res.x = v1.x + v2.x;
    vec_res.y = v1.y + v2.y;
    vec_res.z = v1.z + v2.z;
    vec_res.w = v1.w + v2.w;

    ; xmm0 = v1.w | v1.z | v1.y | v1.x
    movaps xmm0, [v1]
    ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
    addps xmm0, [v2]
    movaps [vec_res], xmm0

    http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions


  • 15/24

    SSE upgrades

SSE2  Allows multiple types of data fitting in the vectors, including integers and characters, to be represented, and the corresponding operations to be performed. AMD's implementation also doubled the number of XMM registers.

    SSE3  Addition of horizontal operations within the XMM registers, such as data reduction.

    SSSE3  Additional instructions for SSE3.

    SSE4  Introduction of Dword multiply, allowing two pairs of 32-bit integers to be multiplied to produce two 64-bit numbers. Vector dot products.

    http://find/

  • 16/24

    AVX

    Intel’s extension to SSE for the Sandy Bridge microarchitecture,

    introduced in 2011. Also available in AMD’s Bulldozer.

    http://upload.wikimedia.org/wikipedia/commons/f/f5/AVX_registers.svg


  • 17/24

    Automatic vectorization

The Intel compiler can, under certain conditions, vectorize loops in the code.

    This can be activated by using the  -vec  option.

    for(i=0; i < SIZE; i++) A[i] = B[i] + C[i];

  • 18/24

    Obstacles to vectorization

     Non-contiguous memory access

    for(i=0; i < SIZE; i+=stride) A[i] = B[i] + C[i];

      Data dependencies

    for(i=0; i < SIZE; i++) A[i] = A[i-1] + B[i];



  • 19/24

    Guiding ICC vectorization

Pragmas  For example, #pragma ivdep, among others, to control when to vectorize a loop.

      Keywords, such as  restrict .

      Switches  passed to the compiler, such as optimization levels.

    Look at the ICC automatic vectorization documentation [11].


  • 20/24

    GPGPU

http://blogs.nvidia.com/2009/12/whats-the-difference-between-a-cpu-and-a-gpu/


  • 21/24

    GPGPU and CPU

    CPU

      General purpose

      Pipelines

     Few threads

     Lots of cache (Correlation)

    GPGPU

     Specialized for local vector operations

Many cores and threads

     Little cache

     Lower power consumption relative to a CPU


  • 22/24

    CUDA

    Stands for  Compute Unified Device Architecture .

Effectively, a programming model to access and control GPUs using a virtual instruction set, in a similar manner as a CPU.

    Only supported on NVIDIA cards.

    Uses the NVIDIA compiler, and can be programmed using CUDA C/C++, languages based on C/C++.


  • 23/24

    OpenCL

    Stands for  Open Computing Language 

    It also provides access to the GPU.

     It’s an open standard, supported by NVIDIA and AMD,among others.

     Provides a language based on C99.

Functionality provided by a driver.

     Compilation handled by linking to the correct library.


  • 24/24

    References

    http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions

    http://www.netlib.org/blas/

    http://www.stanford.edu/class/me200c/tutorial_77/18.1_blas.html

    http://math-atlas.sourceforge.net/faq.html

http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-2BCA8900-BD2F-4A15-9044-0AA23D07D0D2.htm

    https://developer.nvidia.com/cublas

    http://www.godevtool.com/TestbugHelp/XMMfpins.htm

    software.intel.com/en-us/avx

    http://www.khronos.org/opencl/

http://developer.download.nvidia.com/CUDA/training/GTC_Express_Sarah_Tariq_June2011.pdf

http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/cpp/lin/optaps/common/optaps_vec_use.htm
