High Performance Parallel Programming Multicore development tools with extensions to many-core. Investment protection. Scale Forward.


Post on 30-May-2020


Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice

Develop & Parallelize Today for Maximum Performance

Use One Software Architecture Today. Scale Forward Tomorrow.

4/29/2014 Intel Confidential - Use under NDA only2

Enabling & Advancing Parallelism

High Performance Parallel Programming

Intel tools, libraries and parallel models extend to multicore, many-core and heterogeneous computing

[Diagram: code, built with common compilers, libraries, and parallel models, targets the multicore CPU, the Intel® Xeon Phi™ coprocessor (many-core), and clusters of multicore and many-core nodes.]

Intel® Software Development Products

Deliver Application Performance

Foundation of Performance, Productivity, and Standards

Advanced Performance / Cluster Performance

§ Intel® Parallel Studio XE
§ Intel® Inspector XE, Intel® VTune™ Amplifier XE, Intel® Advisor
§ Intel® C/C++ and Fortran Compilers w/OpenMP
§ Intel® MKL, Intel® Cilk™ Plus, Intel® TBB Library, Intel® IPP Library
§ Intel® Trace Analyzer and Collector
§ Intel® MPI Library

A Family of Parallel Programming Models

Developer Choice

Choice of high-performance parallel programming models

Intel® Cilk™ Plus
§ C/C++ language extensions to simplify parallelism
§ Open sourced; also an Intel product

Intel® Threading Building Blocks
§ Widely used C++ template library for parallelism
§ Open sourced; also an Intel product

Domain-Specific Libraries
§ Intel® Integrated Performance Primitives
§ Intel® Math Kernel Library

Established Standards
§ Message Passing Interface (MPI)
§ OpenMP*
§ Coarray Fortran
§ OpenCL*

Research and Development
§ Intel® Concurrent Collections
§ Offload Extensions
§ Intel® SPMD Parallel Compiler

Applicable to Multicore and Many-core Programming

Invest in Common Tools and Programming Models

Use One Software Architecture Today. Scale Forward Tomorrow.

Multicore
§ Intel® Xeon® processors are designed for intelligent performance and smart energy efficiency
§ Continuing to advance Intel® Xeon® processor family and instruction set (e.g., Intel® AVX, etc.)

Many-core
§ Intel® Xeon Phi™ coprocessors are ideal for highly parallel computing applications
§ Software development platforms ramping now

Go Parallel with Intel® Cilk™ Plus


Proven Cilk parallel model, teachable in one minute

§ Parallelism in Three Key Words:

§ cilk_spawn

§ cilk_sync

§ cilk_for

Cilk™ Plus: an open specification

§ Recently placed into open source by Intel for the advancement of parallel programming

Learn more at http://cilkplus.org

    // Parallel function invocation, in C
    cilk_for (int i=0; i<n; ++i) {
        Foo(a[i]);
    }

    // Parallel spawn in a recursive fibonacci
    // computation, in C
    int fib (int n) {
        if (n < 2) return 1;
        else {
            int x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return x + y;
        }
    }

Intel® Cilk™ Plus is Applicable to Multicore and Many-core Programming
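The cilk_spawn / cilk_sync keywords need a compiler with Cilk Plus support. As a rough stand-in, the same fork-join shape can be sketched in C++ with standard OpenMP tasks. The names fib_task and fib and the n > 20 cutoff are assumptions of this sketch, not part of the slide. When OpenMP is not enabled the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>

// Same convention as the slide: fib(0) == fib(1) == 1.
long fib_task(int n) {
    assert(n >= 0);
    if (n < 2) return 1;
    long x = 0, y = 0;
    // Plays the role of cilk_spawn: the child call may run as a separate task.
    // The if() clause (an assumption of this sketch) keeps small subproblems
    // from being deferred, which limits tasking overhead.
    #pragma omp task shared(x) if(n > 20)
    x = fib_task(n - 1);
    y = fib_task(n - 2);
    // Plays the role of cilk_sync: wait for the child task before using x.
    #pragma omp taskwait
    return x + y;
}

// Entry point: one thread starts the recursion, the team executes the tasks.
long fib(int n) {
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_task(n);
    return result;
}
```

A call such as fib(10) uses the slide's convention, so it yields the same value as the Cilk version would.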


Go Parallel with Intel® Cilk™ Plus

Data and Task Parallelism as first class citizens in C and C++

§ Vectorization via intuitive notations that automatically span MMX, SSE, AVX, and wider widths in the future including those in the Intel® Xeon Phi™ coprocessors
  § array notations
  § #pragma SIMD controls
  § elemental functions

    // pragma SIMD: User-mandated vectorization
    #pragma simd
    for (i=0; i<n; i++) {
        A[i] = A[i] + B[i] + C[i];
    }

    // Simplify operation using array notations in C/C++:
    a[:] = b[:] + c[:];

    // Elemental functions, in C++ (note the reference parameter),
    // using Cilk Plus:
    __declspec (vector)
    void saxpy(float a, float x, float &y) {
        y += a * x;
    }

Learn more at http://cilkplus.org

Intel® Cilk™ Plus is Applicable to Multicore and Many-core Programming
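Array notation and __declspec(vector) are Cilk Plus extensions. For readers without such a compiler, a minimal C++ sketch of the same two ideas follows, using the standard OpenMP simd pragmas as a stand-in. The function names add_arrays, saxpy_elem, and saxpy are hypothetical, and the pragmas are only hints: a compiler that does not enable OpenMP ignores them and runs plain scalar loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop form of the array-notation statement a[:] = b[:] + c[:].
void add_arrays(std::vector<float>& a, const std::vector<float>& b,
                const std::vector<float>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    const std::size_t n = a.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}

// Scalar function that a compiler may also compile into a vector variant,
// similar in spirit to an elemental function. It returns the new y value
// instead of updating a reference parameter.
#pragma omp declare simd uniform(a)
inline float saxpy_elem(float a, float x, float y) {
    return y + a * x;
}

// Applies the elemental-style function across whole arrays.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = saxpy_elem(a, x[i], y[i]);
    }
}
```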


Go Parallel with Intel® Threading Building Blocks (Intel® TBB)

A popular parallel abstraction for C++ developers

§ A C++ template library

§ Scalable memory allocation

§ Load-balancing

§ Work-stealing task scheduling

§ Thread-safe pipeline

§ Concurrent containers

§ High-level parallel algorithms

§ Numerous synchronization primitives

Intel remains a leading participant and contributor in the TBB open source project as well as a leading supplier of TBB support and supporting tools.

    // Parallel function invocation example, in C++,
    // using TBB:
    parallel_for (0, n,
        [=](int i) {
            Foo(a[i]);
        });

Learn more at http://threadingbuildingblocks.org

Intel® TBB is Applicable to Multicore and Many-core Programming
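To make the parallel_for idea concrete without depending on the TBB headers, here is a minimal sketch that splits an index range into fixed chunks and runs each chunk on its own std::thread. The helper parallel_for_sketch and its workers parameter are hypothetical. This is not the TBB API and it has no work-stealing scheduler; it only illustrates the "apply a body to every index in parallel" pattern.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not TBB): calls body(i) for every i in [first, last),
// dividing the range into at most `workers` contiguous chunks.
void parallel_for_sketch(int first, int last,
                         const std::function<void(int)>& body,
                         int workers = 4) {
    assert(first <= last && workers > 0);
    const int total = last - first;
    const int chunk = (total + workers - 1) / workers;  // ceiling division
    std::vector<std::thread> pool;
    for (int begin = first; begin < last; begin += chunk) {
        const int end = std::min(begin + chunk, last);
        pool.emplace_back([begin, end, &body] {
            for (int i = begin; i < end; ++i) body(i);
        });
    }
    // Join every worker before returning, so results are visible to the caller.
    for (std::thread& t : pool) t.join();
}
```

The body must be safe to call concurrently for different indices, which is the same requirement the TBB version places on Foo.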

Intel® TBB 4.2 New Features

Support for the latest Intel architectures

− Transactional Synchronization Extensions (TSX)

− Intel® Xeon Phi™ coprocessor for Windows

Android* OS support

Windows Store support

Lower memory overhead

Improved handling of large memory requests

Better fork support

Parallel Patterns Library (PPL)* Compatibility


Go Parallel with Message Passing Interface (MPI)

Intel® Message Passing Interface (Intel® MPI)

Extend your cluster solutions to the Intel® Xeon Phi™ coprocessor

§ E.g., Intel® Xeon Phi™ coprocessor in every node of the cluster using Intel® MPI and Intel® Threading Building Blocks and/or Intel® Cilk™ Plus on nodes

§ Same model as an Intel® Xeon® processor based cluster.

Learn more at http://intel.com/go/mpi

Intel is a leading vendor of MPI implementations and tools

[Diagram: clusters with multicore and many-core nodes, alongside multicore clusters.]

MPI is applicable to Multicore and Many-core Programming


Go Parallel with Coarray Fortran

Intel® Fortran Compiler

A standard, explicit notation for data decomposition, such as that often used in message-passing models, expressed in a natural Fortran-like syntax.

For parallel programming on both shared memory and distributed memory systems

    ! Sum in Fortran, using co-array feature:
    REAL SUM[*]
    CALL SYNC_ALL( WAIT=1 )
    DO IMG = 2, NUM_IMAGES()
      IF (IMG == THIS_IMAGE()) THEN
        SUM = SUM + SUM[IMG-1]
      ENDIF
      CALL SYNC_ALL( WAIT=IMG )
    ENDDO

Learn more at http://intel.com/software/products

Coarray Fortran is Applicable to Multicore and Many-core Programming
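In the coarray loop above, each image in turn adds its left neighbor's running total to its own value, so after the sweep image k holds the sum over images 1 through k. The following C++ sketch only models that data flow: each vector element stands in for the SUM variable on one image, and the sequential loop order stands in for the synchronization calls. The name sweep_images is hypothetical and no coarray runtime is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// sum[k] models the coarray element SUM on image k+1.
// Returns the per-image values after the sweep (a running total).
std::vector<float> sweep_images(std::vector<float> sum) {
    for (std::size_t img = 1; img < sum.size(); ++img) {
        // Models: SUM = SUM + SUM[IMG-1], executed only on image IMG.
        sum[img] += sum[img - 1];
    }
    return sum;
}
```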

Go Parallel with Coarray Fortran

Intel® Fortran Compiler

• Three coarray execution models:
  − Images run on host with offload regions
  − Images run on both coprocessor and host
  − Images run natively on the coprocessor
• The last two models require manual upload of the referenced shared object libraries, including MPI (impi.so) and coarray (libicaf.so)

Go Parallel with OpenMP*

Intel® C/C++ and Fortran Compilers

A flexible interface for developing parallel applications

§ An abstraction for multi-threaded solutions

OpenMP* is a standard used by many parallel applications

§ Supported by every major compiler for Fortran, C, and C++

    // C/C++ OpenMP* Pragma
    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<count; i++) {
        float t = (float)((i+0.5)/count);
        pi += 4.0/(1.0+t*t);
    }
    pi /= count;

    ! Fortran OpenMP*
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do

Learn more at http://openmp.org

OpenMP* is Applicable to Multicore and Many-core programming
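The C fragment above omits its declarations. A self-contained C++ version of the same midpoint-rule estimate of pi might look like the sketch below; the function name estimate_pi and the use of double precision are choices made here for illustration. The reduction clause gives each thread a private partial sum and combines them at the end, and a compiler without OpenMP enabled simply runs the loop serially.

```cpp
#include <cassert>
#include <cmath>

// Midpoint-rule integration of 4 / (1 + t^2) over [0, 1], which equals pi.
double estimate_pi(long count) {
    assert(count > 0);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < count; ++i) {
        const double t = (i + 0.5) / count;   // midpoint of interval i
        sum += 4.0 / (1.0 + t * t);
    }
    return sum / count;                        // scale by the interval width
}
```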

Go Parallel with OpenMP*

Intel® C/C++ and Fortran Compilers (C Example)

One Line Change to Offload to the Intel® Xeon Phi™ coprocessor

Intel® Xeon® processor: the OpenMP* loop as written. Intel® Xeon Phi™ coprocessor: the same loop with the offload pragma added.

    main(){
        double pi = 0.0f; long i;

        #pragma offload target (mic)              // the one added line
        #pragma omp parallel for reduction(+:pi)
        for (i=0; i<N; i++)
        {
            double t = (double)((i+0.5)/N);
            pi += 4.0/(1.0+t*t);
        }
        printf("pi = %f\n",pi/N);
    }

OpenMP* is Applicable to Multicore and Many-core Programming

Go Parallel with OpenMP*

Intel® C/C++ and Fortran Compilers (Fortran Example)

One Line Change to Offload to the Intel® Xeon Phi™ coprocessor

Intel® Xeon® processor: the OpenMP* loop as written. Intel® Xeon Phi™ coprocessor: the same loop with the offload directive added.

    !dir$ omp offload target(mic)      ! the one added line
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do

OpenMP* is Applicable to Multicore and Many-core Programming

Partial OpenMP* 4.0

Intel® Composer XE 2013 SP1

• New directives to enable vectorization and offloading of execution to attached devices (i.e., coprocessors or accelerators) for C++ and Fortran

• TARGET Constructs enable creation of a data environment for attached devices, movement of data between host and devices, and execution of constructs on devices

• SIMD Constructs enable loops and functions to be executed concurrently by a thread team using SIMD vector instructions

• See http://openmp.org and the OpenMP API Specification Version 4.0 RC2 for our current implementation of the supported features.

• Changes to features since the OpenMP 4.0 RC2 spec are not yet supported
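As an illustration of the TARGET and SIMD constructs described above, here is a minimal C++ sketch written against the published OpenMP 4.0 syntax. The function scale_add is hypothetical, and whether the region actually runs on an attached device depends on the compiler and runtime; with no device available, or with OpenMP disabled, the loop runs on the host.

```cpp
#include <cassert>
#include <vector>

// y[i] += alpha * x[i] for i in [0, n).
void scale_add(float alpha, const float* x, float* y, int n) {
    assert(n >= 0);
    // target: create a device data environment and run the region there.
    // map(to:) copies x to the device; map(tofrom:) copies y both ways.
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    // parallel for simd: share iterations among threads, then vectorize
    // each thread's portion.
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```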


Other Supported OpenMP* 4.0 features

• OpenMP taskgroup directive construct for syncing child tasks

• Atomic clause seq_cst (sequential consistency) provides an implicit flush after all atomic operations

• OMP_PLACES - processor list available to the execution environment. Allows thread affinity control

• omp_get_proc_bind() - API to find thread affinity policy of next parallel region

• Extended support for Fortran 2003


Go Parallel with C/C++ Language Extensions

Simple Keyword Language Extensions to control offloading to Intel® Xeon Phi™ coprocessor

    // C/C++ Language Extensions
    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    …
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();
    …

C/C++ Language Extensions to Multicore and Many-core Programming

Go Parallel with High Performance Math Kernel Library

Intel® Math Kernel Library (Intel® MKL)

    void foo() /* Intel® Math Kernel Library */
    {
        float *A, *B, *C; /* Matrices */
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
    }

Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor: implicit automatic offloading requires no code changes; simply link with the offload MKL Library.

Intel High Performance Math Kernel Library is Applicable to Multicore and Many-core Programming
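For readers unfamiliar with the BLAS sgemm routine called above, it computes C = alpha * A * B + beta * C on single-precision matrices stored column-major. The naive C++ sketch below shows only that arithmetic for square n x n matrices with no transposes and a leading dimension of n. The name sgemm_reference is hypothetical; this is not how MKL implements the routine and it makes no attempt at performance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = alpha * A * B + beta * C for square n x n matrices.
// Column-major storage: element (row i, column j) lives at index i + j * n.
void sgemm_reference(int n, float alpha, const std::vector<float>& A,
                     const std::vector<float>& B, float beta,
                     std::vector<float>& C) {
    assert(n >= 0);
    const std::size_t need = static_cast<std::size_t>(n) * n;
    assert(A.size() == need && B.size() == need && C.size() == need);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k) {
                acc += A[i + k * n] * B[k + j * n];
            }
            C[i + j * n] = alpha * acc + beta * C[i + j * n];
        }
    }
}
```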

Intel® Math Kernel Library 11.1 New Features

Conditional Numerical Reproducibility (CNR) for unaligned memory

− Balances performance with reproducible results by allowing greater flexibility in code branch choice and by ensuring algorithms are deterministic. (More information: training site or the Intel® MKL User’s Guide.)

− This release extends the feature to remove memory alignment requirements

− Memory alignment is still recommended for best performance

More info: http://software.intel.com/en-us/articles/intel-mkl-11-1/


Use the Same Code for Execution on Intel® Xeon Phi™ coprocessors by Offloading

C/C++ Offload Pragma

    #pragma offload target (mic)
    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<count; i++) {
        float t = (float)((i+0.5)/count);
        pi += 4.0/(1.0+t*t);
    }
    pi /= count;

MKL Implicit Offload

    // MKL implicit offload requires no source code changes, simply link with the offload MKL Library.

MKL Explicit Offload

    #pragma offload target (mic) \
        in(transa, transb, N, alpha, beta) \
        in(A:length(matrix_elements)) \
        in(B:length(matrix_elements)) \
        in(C:length(matrix_elements)) \
        out(C:length(matrix_elements) alloc_if(0))
    sgemm(&transa, &transb, &N, &N, &N, &alpha,
          A, &N, B, &N, &beta, C, &N);

Fortran Offload Directive

    !dir$ omp offload target(mic)
    !$omp parallel do
    do i=1,10
      A(i) = B(i) * C(i)
    enddo
    !$omp end parallel do

C/C++ Language Extensions

    class _Shared common {
        int data1;
        char *data2;
        class common *next;
        void process();
    };
    _Shared class common obj1, obj2;
    …
    _Cilk_spawn _Offload obj1.process();
    _Cilk_spawn obj2.process();

Parallelism with OpenCL*

Intel® OpenCL SDK

OpenCL* is a framework for writing programs that execute across heterogeneous platforms (e.g., CPUs, GPUs, many-core)

Intel is a leading participant in the OpenCL* standard efforts, and a vendor of solutions and related tools with early implementations available today.

OpenCL* addresses the needs of customers in specific segments

    // Simple per element multiplication using OpenCL*:
    kernel void dotprod( global const float *a,
                         global const float *b,
                         global float *c)
    {
        int myid = get_global_id(0);
        c[myid] = a[myid] * b[myid];
    }

Learn more at http://intel.com/go/opencl

OpenCL is applicable to multicore and many-core programming
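Despite its name, the dotprod kernel above computes an element-wise product: each work-item multiplies one pair of elements. The C++ sketch below models that execution on the host, with a loop index standing in for get_global_id(0). The names multiply_work_item and run_range are hypothetical and no OpenCL runtime is used; a real program would create buffers and enqueue the kernel through the OpenCL host API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for the kernel body; gid plays the role of get_global_id(0).
void multiply_work_item(std::size_t gid, const float* a, const float* b, float* c) {
    c[gid] = a[gid] * b[gid];
}

// Stand-in for enqueueing a one-dimensional range with one work-item per element.
std::vector<float> run_range(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    for (std::size_t gid = 0; gid < a.size(); ++gid) {
        multiply_work_item(gid, a.data(), b.data(), c.data());
    }
    return c;
}
```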


SIMD Types for Intel® Architecture

AVX
Vector size: 256 bit
Data types:
• 8, 16, 32, 64 bit integer
• 32 and 64 bit float
VL: 4, 8, 16, 32

Intel® MIC
Vector size: 512 bit
Data types:
• 32 bit integer
• 32 and 64 bit float
VL: 8, 16

[Figure: element-wise operation Xi ◦ Yi across the lanes of a 256-bit register (X1 to X8, bits 0 to 255) and a 512-bit register (X1 to X16, bits 0 to 511).]

Illustrations: Xi, Yi & results 32 bit integer
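The VL (vector length) values listed for each register width follow from dividing the vector size by the element size. A small C++ sketch of that arithmetic, with the hypothetical helper vector_length:

```cpp
#include <cassert>

// Number of elements (VL) that fit in one vector register.
int vector_length(int vector_bits, int element_bits) {
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}
```

For example, 32-bit elements give 8 lanes in a 256-bit register and 16 lanes in a 512-bit register, which matches the X1 to X8 and X1 to X16 lanes in the illustration.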


Running your Application

Execution on the host and Intel® Xeon Phi™ coprocessor

[Diagram: your application, with identified compute-intensive kernels, shown on the Intel host processor (multicore) and on the Intel® Xeon Phi™ coprocessor(s) (many-core). The host offload library and the target offload library each sit on a message library.]

Execution Flow

Without: Intel® Xeon Phi™ coprocessor(s) are absent
• Application starts and executes on host
• At each offload, the construct runs on host cores/threads
• Normal program termination on host

With: Intel® Xeon Phi™ coprocessor(s) are present
• Application starts on host and executes portions on Intel® Xeon Phi™ coprocessor(s)
• At runtime, if Intel® Xeon Phi™ coprocessor(s) are available, the target binary is loaded
• At each offload, the construct runs on the Intel® Xeon Phi™ coprocessor(s)
• At program termination, target binary is unloaded

Intel® Debugger

Using the Intel® Debugger Overview


Debugging of host and target simultaneously

If host application is being debugged, target application is also debugged automatically

Debugger runs on host for both host and target program

Debugger halts and resumes both host and target program synchronously

Full C, C++ and Fortran support on both sides

Future: debugger presents view of one virtual application inside a single GUI

Extensible to cover more than one offload card

[Diagram: user, host program, target program(s), and Intel® Debug Server(s).]

Analyzing your Application

Performance Analysis Tools

Intel® VTune™ Amplifier XE performance profiler

§ Analyze your multicore and many-core performance

• Analyze performance of the application in offload mode

• Support for Intel® Xeon Phi™ coprocessors includes:

– A Linux* hosted command line tool that collects events

– The VTune™ Amplifier XE graphical user interface to display results collected in previous step highlighting bottlenecks, time spent and other details of performance.


Preserve Your Development Investment

Common Tools and Programming Models for Parallelism

[Diagram: programming models grouped by language, spanning multicore, many-core, and heterogeneous computing.]

C/C++ (Intel® C/C++ Compiler): Intel® Cilk Plus, Intel® TBB, Offload Pragmas, OpenCL*, OpenMP*

Fortran (Intel® Fortran Compiler): OpenMP*, Coarray, Offload Directives

Also shown: Intel® MPI, Intel® MKL

Develop Using Parallel Models that Support Heterogeneous Computing




Call to Action

• Evaluate the Intel® Software Development Products, including the family of Parallel Programming Models, for your High Performance needs:

http://www.intel.com/software/products/eval

• For product information see:

http://www.intel.com/software/products

Note: The Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 products include support for Intel® Xeon Phi™ coprocessors prior to the coprocessors being generally available.


Performance Caveats and Notes

Performance varies with each application, regardless of the technology and methods used.

Certain types of HPC applications are amenable to acceleration and it is important to understand their characteristics.

Once an application is identified to take advantage of acceleration, the high level and low level techniques are expected to work equally well.


Backup


Using Language Extensions for Intel® MIC

Simple Offload Extensions with the Intel® Compilers

C/C++

New offload pragma
    Syntax: #pragma offload ( clauses )
    Semantics: Execute next statement on target (which could be an OpenMP* parallel construct)

Place function on target
    Syntax: __declspec ( target ( x ) )
    Semantics: Compile function for host and target

Place data on target
    Syntax: __declspec (target(MIC)) float array [8000];
    Semantics: Two arrays are created, one on the host and one on the Intel® Xeon Phi™ coprocessor

Fortran

New offload directive
    Syntax: !dir$ omp offload <clauses>
    Semantics: Execute next OpenMP* parallel construct on target

Place function on target
    Syntax: !dir$ attributes offload:<x> :: <rtn-name>
    Semantics: Compile function for host and target

Using Language Extensions (contd.)

What                               Syntax                              Semantics
Target specification               target ( name )                     Where to run construct
Inputs                             in ( var-list modifiers_opt )       Copy CPU to target
Outputs                            out ( var-list modifiers_opt )      Copy target to CPU
Inputs & outputs                   inout ( var-list modifiers_opt )    Copy both ways
Non-copied data                    nocopy ( var-list modifiers_opt )   Data is local to target

Modifiers
Specify pointer length             length ( element-count-expr )       Copy that many pointer elements
Control pointer memory allocation  alloc_if ( condition )              Allocate new block of memory for pointer if condition is TRUE
Control freeing of pointer memory  free_if ( condition )               Free memory used for pointer if condition is TRUE

Variables restricted to scalars, arrays and pointers to scalars/arrays, structs (which can be bit-wise copied)
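To show how the clauses in the table combine, here is a minimal C++ sketch of an offloaded element-wise sum. The function vector_add is hypothetical. The pragma uses the in, out, and length clauses as listed above and only takes effect with a compiler that implements the Intel offload extensions; other compilers do not recognize the pragma and ignore it, so the block runs on the host, which mirrors the "coprocessor absent" execution flow.

```cpp
#include <cassert>
#include <vector>

// c[i] = a[i] + b[i] for i in [0, n).
void vector_add(const float* a, const float* b, float* c, int n) {
    assert(n >= 0);
    // in(...):  copy n elements of a and b from the CPU to the target.
    // out(...): copy n elements of c from the target back to the CPU.
    #pragma offload target(mic) in(a, b : length(n)) out(c : length(n))
    {
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }
}
```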


Using other constructs for Intel® Xeon Phi™ coprocessorEnhancements in control layers provide additional flexibility

Feature | Example | Description
Offloading a function call | x = _Offload func(y); | func executes on the Intel® Xeon Phi™ coprocessor
Offloading asynchronously | x = _cilk_spawn _Offload f(y); | Non-blocking offload
Data available on both sides | _Shared int x = 0; | Allocated in the shared memory area; can be synchronized
Function available on both sides | int _Shared f(int x) { return x+1; } | The function can execute on either side
Function call from the Intel® Xeon Phi™ coprocessor to the CPU | x = _Borrow func(y); | The caller runs on the Intel® Xeon Phi™ coprocessor, the callee on the CPU (uses the waiting thread on the CPU)
Offload a parallel for loop (requires Intel® Cilk™ Plus on the Intel® Xeon Phi™ coprocessor) | _Offload _cilk_for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; } | Loop executes in parallel. The loop is implicitly outlined as a function call (borrow inside the loop is disallowed)
Offload array expressions | _Offload a[:] = b[:] <op> c[:]; _Offload a[:] = elemental_func(b[:]); | Array operations execute in parallel
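To illustrate what "implicitly outlined as a function call" means for the offloaded loop, here is a serial sketch in standard C++. The names `outlined_loop_body` and `vector_add` are hypothetical, and a plain `for` loop stands in for `_Offload _cilk_for`; nothing is offloaded or run in parallel in this sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical outlined form of the loop body: everything the loop touches
// is passed in explicitly, which is what makes the body callable as a unit.
void outlined_loop_body(int begin, int end, int* a, const int* b, const int* c) {
    for (int i = begin; i < end; ++i) {
        a[i] = b[i] + c[i];
    }
}

// Hypothetical driver: the original loop is replaced by one call to the
// outlined function covering the full iteration range [0, N).
std::vector<int> vector_add(const std::vector<int>& b, const std::vector<int>& c) {
    assert(b.size() == c.size());
    std::vector<int> a(b.size());
    outlined_loop_body(0, static_cast<int>(a.size()), a.data(), b.data(), c.data());
    return a;
}
```

Because the outlined body receives its data only through its parameters, it has no hidden dependence on the caller's stack frame, which is consistent with the restriction that a borrow call may not appear inside the loop.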


Using the Intel® Math Kernel Library (Intel® MKL)

Heterogeneous Intel® Math Kernel Library (Intel® MKL) automatically extends existing Intel® MKL functions to use the Intel® Xeon Phi™ coprocessor to accelerate computations.

[Diagram: a host side (Application) and an Intel® Xeon Phi™ coprocessor side, each with a messaging layer, a heterogeneous Intel MKL component and an Expert API; also labeled: Intel MKL, Dispatcher.]


Using Array Notation Parallel Constructs

Part of the Intel® Cilk™ Plus Parallel Model

Making it easier to express and exploit vectorization opportunities and wider SIMD units on IA and Intel® Xeon Phi™ coprocessor

Element-wise vector operations: a[:]+b[:]

Function mapping: f(b[:])

Reductions: __sec_reduce(f,0,a[:]);

✓ Deterministic vectorization
✓ Predictable performance
✓ Freely mixable with scalar C/C++
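For readers without a compiler that accepts array notation, the three constructs above can be read as the following standard C++ equivalents. This is a serial sketch of the semantics only; the function names `elementwise_add`, `map_function` and `reduce_with` are hypothetical, and the standard algorithms used here make no vectorization guarantee of their own.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Meaning of a[:] + b[:]: add corresponding elements.
std::vector<int> elementwise_add(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> result(a.size());
    std::transform(a.begin(), a.end(), b.begin(), result.begin(), std::plus<int>());
    return result;
}

// Meaning of f(b[:]): apply f to every element independently.
template <typename F>
std::vector<int> map_function(F f, const std::vector<int>& b) {
    std::vector<int> result(b.size());
    std::transform(b.begin(), b.end(), result.begin(), f);
    return result;
}

// Meaning of __sec_reduce(f, 0, a[:]) as written on the slide:
// fold all elements with f, starting from the given initial value.
template <typename F>
int reduce_with(F f, int initial, const std::vector<int>& a) {
    return std::accumulate(a.begin(), a.end(), initial, f);
}
```

A call such as `reduce_with(std::plus<int>(), 0, values)` sums the elements, mirroring a reduction with addition and an initial value of 0.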


ISV Guidance: Tools for Parallelism

Memory | Lang | Model and Capability | Description
Shared | C++ | Cilk™ Plus | Language extensions for task and vector parallelism. Serial semantics capability = low overhead & powerful
Shared | C++ | TBB | Widely used C++ template library for task parallelism. Contains a rich feature set for general purpose parallelism
Shared | C | Cilk™ Plus | Language extensions for task and vector parallelism. Serial semantics capability = low overhead & powerful
Shared | C | OpenCL* | Emerging industry standard for hybrid (CPU+GPU) computing. Low level; requires deep expertise and advanced knowledge
Shared | C or Fortran | OpenMP* | Industry standard compiler-based language with roots in HPC. Thread based, with many controls to tweak behavior and get performance
Distributed | C++, C or Fortran | MPI | Library-based capability. Enables apps to run on clusters as well as shared memory. Works with all above models


Select from a variety of powerful tools to aid parallelism
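As one concrete instance of the compiler-based models in the table, here is a minimal OpenMP sketch. The function name `dot_product` is hypothetical; the `parallel for`, `reduction` and `schedule` clauses are standard OpenMP and show the kind of control the table refers to. The sketch assumes OpenMP is enabled at compile time (for example `-fopenmp` with GCC); without it the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: dot product of two equally sized vectors.
double dot_product(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const long n = static_cast<long>(x.size());
    double sum = 0.0;

    // Each thread accumulates a private partial sum; reduction(+ : sum)
    // combines the partial sums when the loop finishes.
    // schedule(static) splits the iterations into fixed, even chunks.
    #pragma omp parallel for reduction(+ : sum) schedule(static)
    for (long i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Keeping the loop valid as serial code is a deliberate choice: the same source builds with or without OpenMP, which matches the incremental style of parallelization the directive-based models are meant to support.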


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, Xeon Phi, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
