Define the purpose of MKL. Upon completion of this module, you will be able to: identify and...

Post on 14-Dec-2015

Upon completion of this module, you will be able to:

Performance Features

Using the Library

MKL addresses:
◦ Solvers (BLAS, LAPACK)
◦ Eigenvector/eigenvalue solvers (BLAS, LAPACK)
◦ Some quantum chemistry needs (dgemm)
◦ PDEs, signal processing, seismic, solid-state physics (FFTs)
◦ General scientific, financial [vector transcendental functions (VML) and vector random number generators (VSL)]

Software Construction

Geometric Transformation

Don’t use Intel® Math Kernel Library (Intel® MKL) on …

Don’t use Intel® MKL on “small” counts.
Don’t call vector math functions on small n.

§ But you could use Intel® Performance Primitives

BLAS (Basic Linear Algebra Subprograms)
Level 1 BLAS – vector-vector operations
◦ 15 function types
◦ 48 functions
Level 2 BLAS – matrix-vector operations
◦ 26 function types
◦ 66 functions
Level 3 BLAS – matrix-matrix operations
◦ 9 function types
◦ 30 functions
Extended BLAS – level 1 BLAS for sparse vectors
◦ 8 function types
◦ 24 functions

LAPACK (Linear Algebra PACKage)
◦ Solvers and eigensolvers. Many hundreds of routines total
◦ There are more than 1000 total user callable and support routines
Discrete Fourier Transformations (DFT)
◦ Mixed radix, multi-dimensional transforms
◦ Multi-threaded
VML (Vector Math Library)
◦ Set of vectorized transcendental functions
◦ Most of libm functions, but faster
VSL (Vector Statistics Library)
◦ Set of vectorized random number generators

BLAS and LAPACK* are both Fortran
◦ Legacy of high performance computation
VSL and VML have Fortran and C interfaces
DFTs have Fortran 95 and C interfaces
cblas interface: it is more convenient for a C/C++ programmer to call BLAS
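One reason a C interface is more convenient: Fortran BLAS assumes column-major storage, while C arrays are row-major. The sketch below is plain C with no MKL calls, and the helper names (`rm_at`, `cm_at`, `row_to_col_major`) are hypothetical. It only illustrates how the same logical element is indexed under each layout.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major (C convention): element (i, j), where ld is the number of columns per row. */
double rm_at(const double *a, size_t ld, size_t i, size_t j)
{
    return a[i * ld + j];
}

/* Column-major (Fortran convention): element (i, j), where ld is the number of rows per column. */
double cm_at(const double *a, size_t ld, size_t i, size_t j)
{
    return a[j * ld + i];
}

/* Copy a rows x cols row-major matrix into column-major storage. */
void row_to_col_major(const double *src, double *dst, size_t rows, size_t cols)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            dst[j * rows + i] = src[i * cols + j];
}
```

A layout-aware C interface lets the caller state which convention the data uses instead of converting it by hand like this.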

Support 32-bit and 64-bit Intel Processors

Large set of examples and tests
Extensive documentation

The goal of all optimization is maximum speed.
Resource-limited optimization – exhaust one or more resources of the system:
◦ CPU: Register use, FP units
◦ Cache: Keep data in cache as long as possible; deal with cache interleaving.
◦ TLBs: Maximally use data on each page
◦ Memory bandwidth: Minimally access memory
◦ Computer: Use all the processors available using threading
◦ System: Use all the nodes available (cluster software)
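The cache point above can be sketched with a blocked (tiled) matrix multiply. This is plain C written for illustration, not MKL's implementation. The function name `blocked_matmul` and the tile size `TILE` are hypothetical, and a real tile size would be tuned to the target cache.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* hypothetical tile size, chosen only for illustration */

/* C += A * B for n x n row-major matrices. The work is done tile by tile so
   that each block of A, B and C is reused while it is likely still in cache. */
void blocked_matmul(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so n need not be a multiple of TILE. */
                size_t i_end = (ii + TILE < n) ? ii + TILE : n;
                size_t k_end = (kk + TILE < n) ? kk + TILE : n;
                size_t j_end = (jj + TILE < n) ? jj + TILE : n;
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

The arithmetic is the same as the naive triple loop. Only the traversal order changes, which is what reduces memory traffic.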

Most of Intel MKL could be threaded but:
◦ Limited resource is memory bandwidth
◦ Threading level 1 and level 2 BLAS are mostly ineffective (O(n))
There are numerous opportunities for threading:
◦ Level 3 BLAS (O(n^3))
◦ LAPACK* (O(n^3))
◦ FFTs (O(n log(n)))
◦ VML, VSL? Depends on processor and function
All threading is via OpenMP*
All Intel MKL is designed and compiled for thread safety
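To see why memory bandwidth limits level 1 and level 2 BLAS but not level 3, compare floating-point operations per memory word touched. The sketch below uses idealized counts (each operand element read once, each result written once). These are assumptions for illustration, not measurements, and the function names are hypothetical.

```c
#include <assert.h>

/* y = a*x + y (level 1): 2n flops; reads x and y, writes y: 3n words. */
double axpy_flops_per_word(double n)
{
    assert(n > 0.0);
    return (2.0 * n) / (3.0 * n);
}

/* y = A*x + y (level 2): 2n^2 flops; reads A, x, y and writes y: n^2 + 3n words. */
double gemv_flops_per_word(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n) / (n * n + 3.0 * n);
}

/* C = A*B + C (level 3): 2n^3 flops; reads A, B, C and writes C: 4n^2 words. */
double gemm_flops_per_word(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n * n) / (4.0 * n * n);
}
```

Under these counts the level 1 and level 2 ratios stay below 2 for any n, so extra threads mostly wait on memory. The level 3 ratio grows as n/2, which leaves room for threads to do useful work.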

Scenario 1: ifort, BLAS, IA-32 processor:
    ifort myprog.f mkl_c.lib

Scenario 2: CVF, LAPACK, IA-32 processor:
    f77 myprog.f mkl_s.lib

Scenario 3: Statically link a C program with DLL linked at runtime:
    link myprog.obj mkl_c_dll.lib
Note: Optimal binary code will execute at run time based on processor.

Most important LAPACK optimizations:
Threading – effectively uses multiple CPUs
Recursive factorization
◦ Reduces scalar time (Amdahl’s law: t = t_scalar + t_parallel/p)
◦ Extends blocking further into the code
No runtime library support required
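The Amdahl's law formula on this slide, t = t_scalar + t_parallel/p, can be evaluated directly. This is a minimal sketch with hypothetical helper names, where p is the number of processors.

```c
#include <assert.h>

/* Run time on p processors: the scalar part is not sped up,
   the parallel part is divided evenly across p. */
double amdahl_time(double t_scalar, double t_parallel, int p)
{
    assert(p >= 1);
    return t_scalar + t_parallel / (double)p;
}

/* Speedup relative to running on a single processor. */
double amdahl_speedup(double t_scalar, double t_parallel, int p)
{
    return amdahl_time(t_scalar, t_parallel, 1) / amdahl_time(t_scalar, t_parallel, p);
}
```

For example, with t_scalar = 1 and t_parallel = 8, four processors give t = 3. Halving the scalar part to 0.5 gives t = 2.5, which is why reducing scalar time matters more as p grows.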

One-dimensional, two-dimensional, three-dimensional
Multithreaded
Mixed radix
User-specified scaling, transform sign
Transforms on embedded matrices
Multiple one-dimensional transforms on single call
Strides
C and F90 interfaces

Basically a three-step process:
Create a descriptor
◦ Status = DftiCreateDescriptor(MDH, …)
Commit the descriptor (instantiates it)
◦ Status = DftiCommitDescriptor(MDH)
Perform the transform
◦ Status = DftiComputeForward(MDH, X)
Optionally free the descriptor

Vector Math Library: Vectorized transcendental functions – like libm but better (faster)
Interface: Have both Fortran and C interfaces
Multiple accuracies
◦ High accuracy (<1 ulp)
◦ Lower accuracy, faster (<4 ulps)
Special value handling: √(-a), sin(0), and so on
Error handling – cannot duplicate libm here

It is important for financial codes (Monte Carlo simulations)
◦ Exponentials, logarithms
Other scientific codes depend on transcendental functions
◦ Error functions can be big time sinks in some codes

Set of random number generators (RNGs)
Numerous non-uniform distributions
VML used extensively for transformations
Parallel computation support – some functions
User can supply own BRNG or transformations
Five basic RNGs (BRNGs) – bits, integer, FP

◦ MCG31, R250, MRG32, MCG59, WH

Gaussian (two methods)
Exponential
Laplace
Weibull
Cauchy
Rayleigh
Lognormal
Gumbel

Basically a 3-step process:
Create a stream pointer.
◦ VSLStreamStatePtr stream;
Create a stream.
◦ vslNewStream(&stream, VSL_BRNG_MCG31, seed);
Generate a set of RNGs.
◦ vsRngUniform(0, stream, size, out, start, end);
Delete a stream (optional).
◦ vslDeleteStream(&stream);
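The create / generate / delete pattern above can be imitated in plain C to show what a stream holds. This sketch does not call MKL, and every `sketch_` name is hypothetical. The multiplier and modulus are the values commonly quoted for a 31-bit multiplicative congruential generator and are treated here as an assumption. The output is not guaranteed to match VSL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_A 1132489760ULL  /* assumed multiplier */
#define SK_M 2147483647ULL  /* assumed modulus, 2^31 - 1 */

/* The "stream" is just the generator state. */
typedef struct { uint64_t x; } sketch_stream;

/* Create a stream from a seed (a zero state is replaced by 1). */
sketch_stream *sketch_new_stream(uint32_t seed)
{
    sketch_stream *s = malloc(sizeof *s);
    if (s != NULL) {
        s->x = seed % SK_M;
        if (s->x == 0)
            s->x = 1;
    }
    return s;
}

/* Fill out[0..n-1] with values between a and b, advancing the stream. */
void sketch_uniform(sketch_stream *s, size_t n, float *out, float a, float b)
{
    assert(s != NULL && out != NULL && a < b);
    for (size_t i = 0; i < n; ++i) {
        s->x = (SK_A * s->x) % SK_M;  /* x_{k+1} = A * x_k mod M */
        out[i] = a + (b - a) * (float)((double)s->x / (double)SK_M);
    }
}

/* Release the stream and clear the caller's pointer. */
void sketch_delete_stream(sketch_stream **s)
{
    assert(s != NULL);
    free(*s);
    *s = NULL;
}
```

Generating a whole vector per call, rather than one value per call as with `rand()`, is what gives a library room to vectorize the work.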

Compare the performance of C source code (RAND function) and VSL.
Exercise control of the threading capabilities in MKL/VSL.

Intel® Math Kernel Library is a broad scientific/engineering math library.
It is optimized for Intel® processors.
It is threaded for effective use on SMP machines.
