
Intel® Parallel Composer

Stephen Blair-Chappell, Intel Compiler Labs


Intel® Parallel Composer

CODE & DEBUG PHASE

• Easier, faster parallelism for Windows* apps

• C/C++ compiler and advanced threaded libraries

• Built-in parallel debugger

• Supports OpenMP*

• Save time and increase productivity

Develop effective applications with a C/C++ compiler and comprehensive threaded libraries


Key Features of Composer

• Extensions for parallelism

– Simple concurrent functionality (__task/__taskcomplete)

– Vectorization support for the SSE2/SSE3/SSSE3/SSE4 instruction sets

– OpenMP* 3.0

– Seamless integration into MS Visual Studio*

– Intel® Parallel Debugger Extension - a plug-in to Visual Studio*

– Intel® Threading Building Blocks

– C++ lambda function support (enables simpler interfacing with Intel® TBB)

– Intel® Integrated Performance Primitives

– Integrated array notation, data-parallel Intel® IPP functions

• Parallel build (/MP) feature

• Diagnostics to help develop parallel programs (/Qdiag-enable:thread)

• Threading tutorials with sample code
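For illustration only (not from the original slides), a build using the parallel-build and threading-diagnostics switches listed above might be invoked like this; the source file name is hypothetical:

    icl /MP /Qdiag-enable:thread myapp.cpp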


Intel® Parallel Composer - Extend parallel debugging capabilities

• Data race detection

• Allows filtering to control the amount of data collected

• Serializes parallel regions without recompilation

• Adds a new class of "data breakpoints"

• Adds a window to visualize logs


Feature                                               Compiler Pro 11.0     Composer              Future Intel Compiler

Full C and C++ Support                                √                     √                     √
Fortran Support                                       √                     No Fortran            √
Fortran 2003 Support                                  Several features      No Fortran 2003       Many more features
__task/__taskcomplete (parallel exploration)          No                    √                     √
OpenMP 3.0                                            √                     √                     √
Valarray Specializations                              √                     √                     √
Lambda Functions                                      √                     √                     √
Intel® Threading Building Blocks                      √                     √                     √
Intel® Integrated Performance Primitives              √                     √                     √
Intel® Math Kernel Library                            √                     No                    √
C++ Parallel Debug Plug-in with SSE/vector window
  and OpenMP parallelism window (Windows only)        No Windows debugger   √                     √
GUI Debugger with SSE/vector window and OpenMP
  parallelism window (Linux only)                     √                     No Linux debugger     √
Code Coverage Utility                                 √                     √                     √
Test Select Utility                                   √                     √                     √
Full Fortran Interoperability                         √                     —                     √, except when IPO used
Decimal Floating-point                                √                     √                     √

Section 2: New Features

Concurrent Functionality · New C++0x Features · Debugger Extensions


Concurrent Functionality - Idea

• The parallel programming extensions are intended for quickly getting a program parallelized without learning a great deal about APIs

• A few keywords and the program is parallelized

• If the constructs are not powerful enough in terms of data control then there may be a need to look into other more comprehensive parallel programming methodologies, such as OpenMP.


Concurrent Functionality

• Introduction of novel C/C++ language extensions to make parallel programming easier

• There are four new keywords, used as statement prefixes:

__taskcomplete, __task, __par, and __critical.

• To benefit from the parallelism afforded by these keywords the switch /Qpar must be used

• The runtime system will manage the actual degree of parallelism
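To make the keyword idea concrete, here is a minimal sketch (not from the original slides) of the other two keywords, assuming the statement-prefix form described above; compute() is a hypothetical helper, and the code must be compiled with /Qpar:

int sum = 0;
/* __par asks for the iterations of the following loop to run in parallel (assumed form) */
__par for (int i = 0; i < 1000; i++) {
    int v = compute(i);          /* compute() is a hypothetical function */
    __critical { sum += v; }     /* __critical serializes the update, one thread at a time */
}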


Concurrent Functionality - example

int a[1000], b[1000], c[1000];

void f_sum ( int length, int *a, int *b, int *c )
{
    int i;
    for (i=0; i<length; i++)
    {
        c[i] = a[i] + b[i];
    }
}

// Serial call
f_sum(1000, a, b, c);

// Parallel call
__taskcomplete {
    __task f_sum(500, a, b, c);
    __task f_sum(500, a+500, b+500, c+500);
}


New C++0x Features

New C++0x features enabled by the switch /Qstd=c++0x (Windows):

– lambda functions

– static assertions

– rvalue references

– C99-compliant preprocessor

– __func__ predefined identifier

– variadic templates

– extern templates

… and some more
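For illustration only (not part of the slide), two of the listed features as they might appear in source compiled with /Qstd=c++0x:

// static assertion: checked at compile time
static_assert(sizeof(long long) >= 8, "need a 64-bit integer type");

// lambda function: an unnamed function object defined inline
auto add = [](int a, int b) { return a + b; };
int three = add(1, 2);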


C++0x: Futures

• Future - a mechanism used to provide values that will be accessed in the future and resolved to another value asynchronously

• The definition of futures does not specify whether the computation of the given expression starts immediately or when the result is requested

• Futures are realized in the Intel® Compiler by templates

intel::future<Page*> future_page;
do {  // on user click
    if (user clicked NEXT) {
        page = future_page.get();   // wait for the next page to finish loading
    } else {
        future_page.cancel();       // user clicked END, speculation wasted
        break;
    }
} while (1);


C++0x: Lambda Functions

• A lambda abstraction defines an unnamed function

• Lambda functions in C++ provide:

– treating functions as first-class objects

– composing functions inline

– treating functions as class objects

– enhanced concurrent code, as it is possible to pass around code chunks like objects

– the ability to pass code as a parameter

• Defines the <> operator

std::vector<int> someList;
int total = 0;
std::for_each(someList.begin(), someList.end(),
              <>(int x) : [&total](total += x));
std::cout << total;


VALARRAY methods accelerated by Intel® IPP

• C++ standard template library (STL) container class for arrays, consisting of array methods for high performance computing

• The operations are designed to exploit low-level hardware features, for example vectorization

• In order to take full advantage of valarray, you need an optimizing C++ compiler that recognizes valarray as an intrinsic type and replaces such types by Intel® IPP library calls

// Create a valarray of ints.
valarray_t::value_type ibuf[10] = {0,1,2,3,4,5,6,7,8,9};
valarray_t vi(ibuf, 10);

// Create a valarray of bools for a mask.
maskarray_t::value_type mbuf[10] = {1,0,1,1,1,0,0,1,1,0};
maskarray_t mask(mbuf, 10);

// Double the values of the masked array.
vi[mask] += static_cast<valarray_t> (vi[mask]);

Intel® Parallel Composer - Debugger Extensions


Key Features

• Shared Data Access Detection

• Break on Thread Shared Data Access

• Re-entrant Function Detection

• SIMD SSE Debugging Window

• Enhanced OpenMP* Support

• Serialize OpenMP threaded application execution on the fly

• Insight into thread groups, barriers, locks, wait lists, etc.


Shared Data Access Detection

• Shared data access is a major problem in multi-threaded applications

– Can cause hard-to-diagnose intermittent program failure

– Tool support is required for detection

• Technology built on:

– Code instrumentation by the Intel compiler

– A debug runtime library (RTL) that collects data access traces and triggers the debugger tool

– An add-on that reports and visualizes RTL events while debugging

– The combination enables a large variety of additional debug use cases

[Architecture diagram: the compiler-instrumented application performs memory-access instrumentation and sends RTL events to the Intel Debug Runtime; a debugger extension and GUI extension inside Visual Studio or IDB combine those events with normal debugging.]


Shared Data Access Detection

• Breakpoint model (stop on detection)

• GUI extensions show results & link to source

• Filter capabilities to hide false positives

• New powerful data breakpoint types

– Stop when a 2nd thread accesses a specific address

– Stop on read from an address

Key User Benefit: A simplified feature to detect shared data accesses from multiple threads

Data sharing detection is part of the overall debug process


Shared Data Access Detection - Filtering

• Data Filter

– Specific data items and variables can be excluded

• Code Filter

– Functions can be excluded

– Source files can be excluded

– Address ranges can be excluded

Data sharing detection is selective


Re-Entrant Call Detection

• Automatically halts execution when a function is executed by more than one thread at any given point in time

• Helps identify reentrancy requirements/problems in multi-threaded applications


Enhanced OpenMP* Debugging Support

• Dedicated OpenMP runtime object-information windows

– OpenMP Task and Spawn Tree lists

– Barrier and Lock information

– Task Wait lists

– Thread Team worker lists

• Serialize Parallel Regions

– Change the number of parallel threads dynamically during runtime to 1 or N (all)

– Verify code correctness for serial vs. parallel execution

– Identify whether a runtime issue is really parallelism related

User benefit: Detailed execution state information for OpenMP applications (deadlock detection). Influences execution behavior without recompile!


OpenMP* Task Details


Parallel Debug Plug-In

• Allows Filtering to control amount of data collected

• Adds window to visualize logs

• Can serialize parallel regions


Parallel Run-Control - Use Cases

• Stepping parallel loops

– Problem: state investigation is difficult because threads stop at arbitrary positions

– Parallel Debugger Support: add a syncpoint to stop team threads at the same location

– User Benefit: get and keep a defined program state; operations like private data comparison become meaningful

[Diagram: with a breakpoint plus syncpoint the team steps through the loop in lock-step, instead of each thread taking normal steps to arbitrary positions.]

• Serial Execution

– Problem: a parallel loop computes a wrong result - is it a concurrency or an algorithm issue?

– Parallel Debugger Support: runtime access to the OpenMP num_threads property; set it to 1 for serial execution of the next parallel block (disable/enable parallel)

– User Benefit: verification of an algorithm "on-the-fly" without slowing the entire application down to serial execution; on-demand serial debugging without recompile/restart


SIMD SSE Debugging Window

• SIMD Window (new)

– Supports evaluation of arbitrary-length expressions

– Displays the SSE registers of variables used for SIMD operations

– In-depth insight into data parallelization and vectorization

Section 3: Creating Parallel Code


Implementing Parallelism - Different Methods

Three ways of achieving parallelism: Automatic, Programming, Libraries

• Automatic

– Via Compiler

– No Code Changes

• Programming

– OpenMP

– Native Threads (Win32, POSIX)

– Threading Building Blocks

– MPI

• Using Parallel-enabled Libraries

– MKL

– IPP


Auto Parallelism

Loop-level parallelism automatically supplied by the compiler


Auto-parallelization

• Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives.

– The compiler can identify "easy" candidates for parallelization, but large applications are difficult to analyze.

Windows*:   /Qparallel    /Qpar_report[n]
Linux*:     -parallel     -par_report[n]
Mac*:       -parallel     -par_report[n]
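For illustration only (not from the slide), a typical command line using these switches might look like the following; pi.c is a hypothetical source file:

    Windows:     icl /Qparallel /Qpar_report2 pi.c
    Linux/Mac:   icc -parallel -par_report2 pi.c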


Optimisation Results - pi application

Optimisation                  Time Taken (secs)   Speedup
default                       0.938               1.0
auto-vectorisation            0.375               2.5
auto-parallelism              0.516               1.8
auto-vec. & auto-par.         0.203               4.6


Sample implementations of Parallel Programming

POOP (Parallel Object Oriented Programming) · MPI · No Threads

The Sample Application


Our running Example: The PI program - Numerical Integration

Mathematically, we know that:

    ∫₀¹ 4.0/(1+x²) dx = π

We can approximate the integral as a sum of rectangles:

    ∑ F(xᵢ)∆x ≈ π   (i = 0 … N)

where each rectangle has width ∆x and height F(xᵢ) = 4.0/(1+xᵢ²) at the middle of interval i.

[Plot: F(x) = 4.0/(1+x²) over 0 ≤ x ≤ 1, with values between 2.0 and 4.0.]


PI Program: The sequential program

static long num_steps = 100000;
double step;

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;

    for (i=1; i<= num_steps; i++) {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
}

Native Threads


Threads and Parallel Programming

� Operating Systems 101:

– Process: A unit of work managed by an OS with its own address space (the heap) and OS managed resources.

– Threads: resources within a process that execute the instructions in a program. They have their own program counter and a private memory region (a stack) but share the other resources within the process … including the heap.

� Threads are the natural unit of execution for parallel programs on shared memory hardware.

� The threads share memory so data structures don’t have to be torn apart into distinct pieces.


Programming with Native Threads

� The OS provides an API for creating, managing, and destroying threads.

– Windows* threading API

– Posix threads (on Linux)

� Advantages of threads libraries:

– The thread library gives you detailed control over the threads.

� Disadvantage of thread libraries:

– The thread library REQUIRES that you take detailed control over the threads.


Win32 API

#include <windows.h>
#include <stdio.h>   /* for printf */

#define NUM_THREADS 2

HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step;
double global_sum = 0.0;

void Pi (void *arg)
{
    int i, start;
    double x, sum = 0.0;

    start = *(int *) arg;
    step = 1.0/(double) num_steps;

    for (i=start; i<= num_steps; i=i+NUM_THREADS){
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }

    EnterCriticalSection(&hUpdateMutex);
    global_sum += sum;
    LeaveCriticalSection(&hUpdateMutex);
}

void main ()
{
    double pi; int i;
    DWORD threadID;
    int threadArg[NUM_THREADS];

    for(i=0; i<NUM_THREADS; i++) threadArg[i] = i+1;

    InitializeCriticalSection(&hUpdateMutex);

    for (i=0; i<NUM_THREADS; i++){
        thread_handles[i] = CreateThread(0, 0,
            (LPTHREAD_START_ROUTINE) Pi,
            &threadArg[i], 0, &threadID);
    }

    WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);

    pi = global_sum * step;

    printf(" pi is %f \n", pi);
}


Win32 thread library: It's not as bad as it looks


The Win32 code above breaks down into these steps:

• Setup multi-threading support

• Define the work each thread will do and pack it into a function

• Setup arguments, bookkeeping, and launch the threads

• Wait for all the threads to finish

• Compute and print the final answer

• Update the final answer one thread at a time



Threading Building Blocks


Featured Components

Generic Parallel Algorithms:
    parallel_for, parallel_reduce, pipeline,
    parallel_sort, parallel_while, parallel_scan

Concurrent Containers:
    concurrent_hash_map, concurrent_queue, concurrent_vector

Task Scheduler


Concurrent Containers

• Library provides highly concurrent containers

– STL containers are not concurrency-friendly: attempts to modify them concurrently can corrupt the container

– Standard practice is to wrap a lock around STL containers, which turns the container into a serial bottleneck

• Library provides fine-grained locking or lockless implementations

– Worse single-thread performance, but better scalability

– Can be used with the library, OpenMP, or native threads
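A minimal sketch (not from the slides) of the idea, using the TBB 2.x-era concurrent_vector; the element type and function name are illustrative:

#include "tbb/concurrent_vector.h"

tbb::concurrent_vector<int> results;

/* Unlike std::vector, push_back may be called concurrently from
 * multiple threads without an external lock. */
void record(int value) {
    results.push_back(value);
}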


Generic Programming for C++ developers

� Best known example is C++ STL

� Enables distribution of broadly-useful high-quality algorithms and data structures

� Write best possible algorithm with fewest constraints

– Do not force particular data structure on user

– Classic example: STL std::sort

� Instantiate algorithm to specific situation

– C++ template instantiation, partial specialization, and inlining make resulting code efficient


template <typename Range, typename Body>
void parallel_for(const Range& range, const Body &body);

• Requirements for the parallel_for Body:

    Body::Body(const Body&)                        Copy constructor
    Body::~Body()                                  Destructor
    void Body::operator()(Range& subrange) const   Apply the body to a subrange

• parallel_for partitions the original range into subranges, and deals out subranges to worker threads in a way that:

– Balances load

– Uses cache efficiently

– Scales


Serial Example

static void SerialUpdateVelocity() {
    for( int i=1; i<UniverseHeight-1; ++i )
#pragma ivdep
        for( int j=1; j<UniverseWidth-1; ++j )
            V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
}

The Intel® TBB product has complete serial and parallel versions of this seismic wave simulation.


Parallel Version

struct UpdateVelocityBody {
    void operator()( const blocked_range<int>& range ) const {
        int end = range.end();
        for( int i = range.begin(); i<end; ++i ) {
#pragma ivdep
            for( int j=1; j<UniverseWidth-1; ++j )
                V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
        }
    }
};

void ParallelUpdateVelocity() {
    parallel_for( blocked_range<int>( 1, UniverseHeight-1, GrainSize ),  // pattern establishes the grain size
                  UpdateVelocityBody() );                                // task
}

(blue = original code, red = provided by TBB, black = boilerplate for the library)


Task scheduler example

[Diagram: the range is split recursively until subranges are ≤ grainsize; pending subranges are tasks available to thieves (work stealing).]


TBB Class

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"

// Step and INTERVALS are assumed to be defined elsewhere,
// e.g. #define INTERVALS 100000 and double Step = 1.0/INTERVALS;

class ParallelPi
{
public:
    double pi;

    ParallelPi(): pi(0) {}

    // splitting constructor used by parallel_reduce
    ParallelPi(ParallelPi &body, tbb::split): pi(0) {}

    void operator()(const tbb::blocked_range<int> &r)
    {
        for(int i = r.begin(); i != r.end(); ++i)
        {
            float x = Step * ((float)i-0.5);
            pi += 4.0 / (1.0 + x*x);
        }
    }

    void join(ParallelPi &body) { pi += body.pi; }
};


TBB Class

#include <stdio.h>
#include "tbb/task_scheduler_init.h"

int main()
{
    // TBB versions of this era required a scheduler object to exist first.
    tbb::task_scheduler_init init;

    ParallelPi Pi;

    tbb::parallel_reduce(
        tbb::blocked_range<int>(0, INTERVALS, 100),
        Pi
    );

    printf("Pi = %f\n", Pi.pi/INTERVALS);
    return 0;
}

Section 5: Expressing Parallelism Using the Intel Compiler

Stephen Blair-Chappell, Technical Consulting Engineer, Intel Compiler Labs


OpenMP

A deeper dive into OpenMP


What is OpenMP™?

Portable, Shared Memory Multi-processing API
– Fortran 77, Fortran 90, C, and C++
– Multi-vendor support, for both Unix and Windows

• Standardizes loop-level parallelism

• Supports coarse-grained parallelism

• Combines serial and parallel code in a single source
– No need for separate source code revisions

• See www.openmp.org for standard documents, tutorials, sample code

• Intel is a premier member of the OpenMP Review Board


Parallel APIs: OpenMP*

[Background collage of OpenMP constructs and calls, e.g.:
omp_set_lock(lck); #pragma omp parallel for private(A, B); #pragma omp critical;
C$OMP parallel do shared(a, b, c); C$OMP PARALLEL REDUCTION (+: A, B);
call OMP_INIT_LOCK (ilok); call omp_test_lock(jlok); setenv OMP_SCHEDULE "dynamic";
CALL OMP_SET_NUM_THREADS(10); C$OMP DO lastprivate(XX); C$OMP ORDERED;
C$OMP SINGLE PRIVATE(X); C$OMP SECTIONS; C$OMP MASTER; C$OMP ATOMIC; C$OMP FLUSH;
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C); C$OMP THREADPRIVATE(/ABC/);
C$OMP PARALLEL COPYIN(/blk/); Nthrds = OMP_GET_NUM_PROCS(); !$OMP BARRIER]

OpenMP: An API for Writing Multithreaded Applications

• A set of compiler directives and library routines for parallel application programmers

• Makes it easy to create multithreaded (MT) programs in Fortran, C and C++

• Standardizes the last 15 years of SMP practice


OpenMP Architecture

• Fork-Join Model

• Worksharing constructs

• Synchronization constructs

• Directive/pragma-based parallelism

• Extensive API for finer control


OpenMP Runtime

[Diagram: the user's application contains directives processed by the compiler; environment variables and the runtime library control the threads provided by the operating system.]


OpenMP Programming Model

Fork-Join Parallelism:

• The master thread spawns a team of threads as needed.

• Parallelism is added incrementally until performance goals are met: i.e. the sequential program evolves into a parallel program.

[Diagram: sequential parts executed by the master thread alternate with parallel regions, some of which contain nested parallel regions.]


Intel Compiler Switches for OpenMP

• OpenMP™ support

– /Qopenmp

– /Qopenmp_report{0|1|2}


Basic Syntax

• Fork-Join Model

• Threads are created as the 'parallel' pragma is crossed (several threads, e.g. 4, are created on entry; threads either spin or sleep between regions)

• Data is classed as shared among threads or private to each thread

main() {

    #pragma omp parallel \
            shared(A) private(I)
    {
        // this code is parallel
        ...
    }
}


Hello World

• Running on three threads, this program prints:

Hello World
Hello World
Hello World
Iter: 1
Iter: 2
Iter: 3
Iter: 4
Goodbye World
Goodbye World
Goodbye World

void main()
{
    int i;

    #pragma omp parallel
    {
        printf("Hello World\n");

        #pragma omp for
        for(i=1; i<=4; i++) {
            printf("Iter: %d\n", i);
        }

        printf("Goodbye World\n");
    }
}


Parallel Loop Model

• Threads are created

• Data is classified as shared or private

[Diagram: in a parallel for, A is shared, the iterations are distributed across the threads, and there is a barrier at the end; threads either spin or sleep between regions.]

void* work(float* A) {

    #pragma omp parallel for \
            shared(A) private(i)
    for(i=1; i<=12; i++) {
        /* Iterations divided
         * among threads
         */
    }
}


Data Scope Attributes

• The default status can be modified with

    default (shared | none)

• Scoping attribute clauses

    shared(varname, …)
    private(varname, …)


The Private Clause

� Reproduces the variable for each thread

• Variables are un-initialized; C++ object is default constructed

• Any value external to the parallel region is undefined

void* work(float* c, int N) {

float x, y; int i;

#pragma omp parallel for private(x,y)

for(i=0; i<N; i++) {

x = a[i]; y = b[i];

c[i] = x + y;

}

}


Example: Dot Product

float dot_prod(float* a, float* b, int N)

{

float sum = 0.0;

#pragma omp parallel for shared(sum)

for(int i=0; i<N; i++) {

sum += a[i] * b[i];

}

return sum;

}

What is wrong?


Protect Shared Data

• Must protect access to shared, modifiable data

float dot_prod(float* a, float* b, int N)
{
    float sum = 0.0;

    #pragma omp parallel for shared(sum)
    for(int i=0; i<N; i++) {
        #pragma omp critical
        sum += a[i] * b[i];
    }

    return sum;
}


OpenMP* Critical Construct

• #pragma omp critical [(lock_name)]

• Defines a critical region on a structured block

float R1, R2;

#pragma omp parallel
{   float A, B;

    #pragma omp for
    for(int i=0; i<niters; i++){

        B = big_job(i);
        #pragma omp critical (R1_lock)
        consum (B, &R1);

        A = bigger_job(i);
        #pragma omp critical (R2_lock)
        consum (A, &R2);
    }
}

Threads wait their turn - only one at a time calls consum(), thereby protecting R1 and R2 from race conditions. Naming the critical constructs is optional, but may increase performance.


OpenMP* Reduction Clause

• reduction (op : list)

• The variables in "list" must be shared in the enclosing parallel region

• Inside a parallel or work-sharing construct:

– A PRIVATE copy of each list variable is created and initialized depending on the "op"

– These copies are updated locally by threads

– At the end of the construct, local copies are combined through "op" into a single value and combined with the value in the original SHARED variable


Reduction Example

• Local copy of sum for each thread

• All local copies of sum added together and stored in the "global" variable

#pragma omp parallel for reduction(+:sum)
for(i=0; i<N; i++) {
    sum += a[i] * b[i];
}


C/C++ Reduction Operations

• A range of associative and commutative operators can be used with reduction

• Initial values are the ones that make sense

Operator   Initial Value        Operator   Initial Value
+          0                    &          ~0
*          1                    |          0
-          0                    &&         1
^          0                    ||         0


Schedule Clause

• Iterations are divided according to the schedule statement

#pragma omp parallel for schedule (static, 8)
for( int i = start; i <= end; i += 2 )
{
    if ( TestForPrime(i) ) gPrimesFound++;
}


OpenMP: Parallel for with a reduction

#include <omp.h>

static long num_steps = 100000;
double step;
#define NUM_THREADS 2

void main ()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel for reduction(+:sum) private(x)
    for (i=1; i<= num_steps; i++)
    {
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }

    pi = step * sum;
}


OpenMP FOR-schedule schemas

• The schedule clause defines how loop iterations are assigned to threads

• Compromise between two opposite goals:

– Best thread load balancing

– Minimal controlling overhead

[Diagram: division of the N iterations for a single thread vs. schedule(static), schedule(dynamic, c) with chunk size c, and schedule(guided, f) with minimum chunk size f.]
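As an illustrative sketch (not part of the slide), the three schedule kinds might be requested like this; N, c, f and work() are placeholders:

#pragma omp parallel for schedule(static)       /* equal contiguous chunks             */
for (int i = 0; i < N; i++) work(i);

#pragma omp parallel for schedule(dynamic, c)   /* chunks of c iterations, on demand   */
for (int i = 0; i < N; i++) work(i);

#pragma omp parallel for schedule(guided, f)    /* shrinking chunks, never below f     */
for (int i = 0; i < N; i++) work(i);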


Iterative worksharing versus code replication

• Iterative worksharing -- prints 'Hello World' 10 times, regardless of the number of threads:

#pragma omp parallel for
for( int i = 0; i < 10; i++ )
{
    printf("Hello World\n");
}

• Code replication -- assuming a team of 4 threads, prints 'Hello World' 40 times:

#pragma omp parallel
for( int i = 0; i < 10; i++ )
{
    printf("Hello World\n");
}


Parallel Sections

• Independent sections of code can execute concurrently

#pragma omp parallel sections
{
    #pragma omp section
        phase1();
    #pragma omp section
        phase2();
    #pragma omp section
        phase3();
}

[Diagram: serial execution runs phase1, phase2, phase3 one after another; parallel sections run them concurrently.]


Implicit Barriers

• Several OpenMP constructs have implicit barriers: do, for, single, sections

• Unnecessary barriers hurt performance

• Suppress them, when safe, with nowait

!$omp do
[...]
!$omp end do nowait

#pragma omp for nowait
for(...)
    [...];

!$omp sections
[...]
!$omp end sections nowait

#pragma omp single nowait
{ [...] }


Static and Dynamic Extent

• Static extent or lexical extent is the code that is lexically within the parallel/end parallel directive

• Dynamic extent includes the static extent and the entire call tree of any subroutine or function called in the static extent

• OpenMP directives in the dynamic extent of a parallel region are called orphaned directives

• An orphaned worksharing construct behaves as if the construct were within the lexical extent - the work is divided across the thread team

– The only difference is that slightly different data scoping rules apply


Static and Dynamic Extent

Program main

!$omp parallel <-| Static extent

call foo |<-|

!$omp end parallel <-| |

End |

+ |

| Dynamic extent

Subroutine foo <-| |

!$omp do !orphaned | |

do i = 1,100 | |

enddo |<-|

!$omp end do |

Call X !also dynamic |

End subroutine foo <-|


Communication and data scope

� In OpenMP every variable has a scope that is either shared or private

� By default, all variables have shared scope

� Data scoping clauses that can appear on parallel constructs:

– The shared and private clauses explicitly scope specific variables

– The firstprivate and lastprivate clauses perform initialization and finalization of private variables

– The default clause changes the default scoping rules when variables are not explicitly scoped

– The reduction clause explicitly identifies reduction variables


Communication and data scope

� Each thread has a private stack used for automatic variables

� For all other program variables, the parallel constructs can scope each variable as shared, private, or reduction

� Private variables need to be initialized at the start of a parallel construct. The firstprivate clause will initialize from the global instance

� For parallelized loops, the lastprivate clause will update the global instance from the private value computed with the last iteration
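A small illustrative sketch (not from the slide) of firstprivate initialization and lastprivate finalization; x and N are placeholder names:

int x = 10;                 /* global instance */

#pragma omp parallel for firstprivate(x) lastprivate(x)
for (int i = 0; i < N; i++) {
    /* each thread's private x starts at 10 (firstprivate) */
    x = i;
}
/* after the loop, x holds the value from the sequentially
 * last iteration, i.e. N-1 (lastprivate) */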


Communication and data scope

• Variables default to shared scope, except for these cases:

– Loop index variables to which a parallel do or parallel for applies default to private

– Local variables in subroutines in the dynamic extent default to private unless they are marked with the save attribute (Fortran) or as static (C/C++)

• Data scoping clauses only apply to the named variables within the lexical extent

– Global variables in orphaned constructs are shared by default, regardless of the attribute in the lexical extent

– Automatic (stack allocated) variables in orphaned constructs are always private

– Formal parameters to a subroutine in the dynamic extent acquire their scope from that of the actual variables in the caller's context


Environment Variables

• Standard Environment Variables

– OMP_SCHEDULE: runtime schedule and optional iteration chunk size

– OMP_NUM_THREADS: number of worker threads; defaults to the number of processors

– OMP_DYNAMIC: enables dynamic adjustment of the number of threads

– OMP_NESTED: enables nested parallelism

• Intel Extension Environment Variables

– KMP_ALL_THREADS: maximum number of threads in a parallel region

– KMP_BLOCKTIME: thread wait time before sleeping at the end of a parallel region

– KMP_LIBRARY: runtime execution mode: throughput (default) for multi-user systems, turnaround for a dedicated (single user) system

– KMP_STACKSIZE: worker thread stack size
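For illustration only (not from the slide), the variables might be set like this before running an OpenMP program; the values are arbitrary:

    Windows:  set OMP_NUM_THREADS=4
              set OMP_SCHEDULE=dynamic,8
    Linux:    export OMP_NUM_THREADS=4
              export KMP_BLOCKTIME=200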


Synchronization

• Two kinds of synchronization

– Mutual exclusion synchronization, for exclusive access to data by only one thread at a time

– Event synchronization, for imposing a thread execution order

• Most commonly used synchronization directives

– Data synchronization gives a thread exclusive access to a shared variable

    !$omp critical ... !$omp end critical   - an arbitrarily large block of structured code

    !$omp atomic                            - a single assignment that updates a scalar variable

– Event synchronization signals the occurrence of an event that all threads must synchronize their execution on

    !$omp barrier   - specifies a point in the program where each thread must wait for all other threads to arrive

• Other less-frequently used synchronization directives: !$omp master, !$omp flush, !$omp ordered


The Workqueuing Model

• Work units need not all be known or pre-computed at the beginning of the construct (e.g. while loops and recursive functions)

• taskq specifies an environment (the queue); task specifies the units of work (dynamic)

• One thread executes the taskq block, enqueuing each task it encounters

• All other threads dequeue and execute work from the queue

• Intel-specific extension to OpenMP* 2.5

– Has been accepted into the new 3.0 version (to be released Q2/08)


Why Workqueuing?

Original serial loop:

while(p != NULL){
    do_work(p->data);
    p = p->next;
}

Worksharing version (the list must first be counted and copied into an array):

int n=0, i=0;
NODEPTR q=p;
NODEPTR *r;

while(q != NULL){
    n++;
    q = q->next;
}

r = allocate(n * sizeof(NODEPTR));

while(p != NULL){
    r[i++] = p;
    p = p->next;
}

#pragma omp parallel for
for (i=0; i<n; i++)
    do_work(r[i]->data);

free(r);

Workqueuing version:

#pragma intel omp parallel taskq
{
    while(p != NULL){
        #pragma intel omp task
        do_work(p->data);
        p = p->next;
    }
}


OpenMP* 3.0 Support

• Intel® Compilers 11.0 for C++ and Fortran will be fully compliant with OpenMP* 3.0

– Standard not released yet (very likely May 08!)

– Draft at www.openmp.org

– Four major extensions:

• Tasking for unstructured parallelism

• Loop collapsing

• Enhanced loop scheduling control

• Better support for nested parallelism

… and for all who waited: eventually allows an unsigned int for the loop index variable ☺


OpenMP* 3.0 Tasking - maybe the most relevant new feature

• A task has

– Code to execute

– A data environment (it owns its data)

– An assigned thread that executes the code and uses the data

• Two activities: packaging and execution

– Each encountering thread packages a new instance of a task (code and data)

– Some thread in the team executes the task at some later time

• A task is nothing really new to OpenMP

– Implicitly each PARALLEL directive creates 'tasks', but they have been transparent objects

– OpenMP* 3.0 makes 'tasking' explicit


OpenMP* 3.0 Tasking - Definitions

• Task construct – task directive plus structured block

• Task – the package of code and instructions for allocating data created when a thread encounters a task construct

• Task region – the dynamic sequence of instructions produced by the execution of a task by a thread

#pragma omp task [clause[[,]clause] ...]
    structured-block

where <clause> is one of:

    if (expression)
    untied
    shared (list)
    private (list)
    firstprivate (list)
    default (shared | none)


OpenMP* 3.0 Tasking Example - Postorder Tree Traversal

void postorder(node *p)
{
    if (p->left)
        #pragma omp task          // task scheduling point; threads may
        postorder(p->left);       // switch to execute other tasks

    if (p->right)
        #pragma omp task
        postorder(p->right);

    #pragma omp taskwait          // wait for descendants

    process(p->data);
}

• Parent task suspended until children tasks complete


OpenMP* 3.0 - Enhanced Schedule Control

• Makes schedule(runtime) more useful

– can get/set it with the library routines omp_set_schedule() and omp_get_schedule()

– allows implementations to use their own schedule kinds

• Adds a new schedule kind AUTO, which gives the runtime full freedom to determine the scheduling of iterations to threads

• Allows C++ random access iterators as loop control variables in parallel loops
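A minimal sketch (not from the slide) of the runtime schedule routines; N and work() are placeholders:

#include <omp.h>

omp_set_schedule(omp_sched_dynamic, 4);    /* schedule kind + chunk size */

#pragma omp parallel for schedule(runtime) /* picks up the schedule set above */
for (int i = 0; i < N; i++)
    work(i);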


OpenMP* 3.0 Loop Collapsing

• Allows collapsing of perfectly nested loops

• Forms a single loop and then parallelizes that

• Scheduling of the combined iteration space follows the order of the original, sequential execution

!$omp parallel do collapse(2)
do i=1,n
    do j=1,n
        .....
    end do
end do
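For illustration only (not on the slide), the C/C++ form of the same construct; n and the loop body are placeholders:

#pragma omp parallel for collapse(2)
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
        /* the n*n iterations are scheduled as one collapsed loop */
    }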


OpenMP 3.0: Nested Parallelism

• Better support for nested parallelism

• Per-thread internal control variables

– Allows, for example, calling omp_set_num_threads() inside a parallel region

– Controls the team sizes for the next level of parallelism

• Library routines to determine the depth of nesting, the IDs of parent/grandparent etc. threads, and the team sizes of parent/grandparent etc. teams

– omp_get_active_level()

– omp_get_ancestor(level)

– omp_get_teamsize(level)
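As a closing illustration (not from the slide), nested regions with per-level team sizes; the thread counts are arbitrary:

#include <omp.h>

omp_set_nested(1);                 /* enable nested parallelism (OMP_NESTED) */

#pragma omp parallel num_threads(2)
{
    omp_set_num_threads(3);        /* per-thread control variable: applies to the
                                      next level spawned by this thread */
    #pragma omp parallel
    {
        /* up to 2 x 3 = 6 threads are active here when nesting is enabled */
    }
}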