Parallel Programming with OpenMP
Ing. Andrea Marongiu ([email protected])
The Multicore Revolution is Here!
More instruction-level parallelism is hard to find
– Very complex designs needed for small gain
– Thread-level parallelism appears alive and well
Clock frequency scaling is slowing drastically
– Too much power and heat when pushing the envelope
Cannot communicate across the chip fast enough
– Better to design small local units with short paths
Effective use of billions of transistors
– Easier to reuse a basic unit many times
Potential for very easy scaling
– Just keep adding processors/cores for higher (peak) performance
The Road Here
One processor used to require several chips
Then one processor could fit on one chip
Now many processors fit on a single chip
– This changes everything, since single processors are no longer the default
Multiprocessing is Here
Multiprocessor and multicore systems are the future. [The original slide showed some current examples.]
Vocabulary in the Multi Era
AMP, Asymmetric MP: each processor has local memory; tasks are statically allocated to one processor
SMP, Shared-Memory MP: processors share memory; tasks are dynamically scheduled to any processor
Vocabulary in the Multi Era
Heterogeneous: specialization among processors, often with different instruction sets. Usually an AMP design.
Homogeneous: all processors have the same instruction set and can run any task. Usually an SMP design.
Keeping shared data consistent
Cache coherency:
– Fundamental technology for shared memory
– Local caches for each processor in the system
– Multiple copies of shared data can be present in the caches
– To maintain correct function, the caches have to be coherent
When one processor changes shared data, no other cache will contain the old data (eventually)
Future Embedded Systems
The Software becomes the Problem
The advent of MPSoCs on the marketplace has raised the need for standard parallel programming models.
Parallelism is required to gain performance
– Parallel hardware is “easy” to design
– Parallel software is (very) hard to write
Fundamentally hard to grasp true concurrency
– Especially in complex software environments
Existing software assumes a single processor
– Might break in new and interesting ways
– Multitasking is no guarantee to run on a multiprocessor
Exploiting parallelism has historically required significant programmer effort. This could be alleviated by a programming model that exposes those mechanisms that are useful for the programmer to control, while hiding the details of their implementation.
Message-Passing & Shared-Memory
Message passing: local memory for each task, explicit messages for communication
Shared memory: all tasks can access the same memory, communication is implicit
Programming Parallel Machines
Synchronize & coordinate execution
Communicate data & status between tasks
Ensure parallelism-safe access to shared data
Components of the shared-memory solution:
– All tasks see the same memory
– Locks to protect shared data accesses
– Synchronization primitives to coordinate execution
Programming Model: MPI
Message-passing API
– Explicit messages for communication
– Explicit distribution of data to each thread for work
– Shared memory not visible in the programming model
Best scaling for large systems (1000s of CPUs)
– Quite hard to program
– Well-established in HPC
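Not part of the original slides: a minimal sketch of the message-passing style, using the standard MPI C API (MPI_Init, MPI_Send, MPI_Recv).

/* Minimal MPI sketch: rank 0 sends an explicit message to rank 1.
   Compile with an MPI wrapper, e.g. mpicc; run with mpirun -np 2. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, data;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's ID */
    if (rank == 0) {
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* explicit message */
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Task 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}

Note how the data distribution is fully explicit: nothing is shared, and every exchange is a visible send/receive pair.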
Programming model: POSIX Threads
Standard API with explicit operations
Strong programmer control, arbitrary work in each thread
Create & manipulate
– Locks
– Mutexes
– Threads
– etc.
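Not part of the original slides: a minimal sketch of the explicit control Pthreads gives the programmer, with thread creation, a mutex-protected shared counter, and joining.

/* Pthreads sketch: explicit fork, lock, and join. Compile with -pthread. */
#include <stdio.h>
#include <pthread.h>

static int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    pthread_mutex_lock(&lock);      /* programmer-managed locking */
    counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);  /* explicit fork */
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);                   /* explicit join */
    printf("counter = %d\n", counter);
    return 0;
}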
Programming model: OpenMP
De-facto standard for the shared memory programming model
A collection of compiler directives, library routines and environment variables
Easy to specify parallel execution within a serial code
Requires special support in the compiler
Generates calls to threading libraries (e.g. pthreads)
Focus on loop-level parallel execution
Popular in high-end embedded systems
Fork/Join Parallelism
Initially only the master thread is active
The master thread executes the sequential code
Fork: the master thread creates or awakens additional threads to execute parallel code
Join: at the end of the parallel code, the created threads die or are suspended
Pragmas
Pragma: a compiler directive in C or C++
Stands for “pragmatic information”
A way for the programmer to communicate with the compiler
Compiler free to ignore pragmas: the original sequential semantics are not altered
Syntax:
#pragma omp <rest of pragma>
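As a minimal illustration (not from the slides), the same source keeps its sequential meaning when OpenMP support is off:

/* With OpenMP disabled the pragma is ignored and this prints once;
   with OpenMP enabled (e.g. gcc -fopenmp) every thread prints. */
#include <stdio.h>

int main(void)
{
#pragma omp parallel
    printf("Hello\n");
    return 0;
}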
Components of OpenMP

Directives:
– Parallel regions: #pragma omp parallel
– Work sharing: #pragma omp for, #pragma omp sections
– Synchronization: #pragma omp barrier, #pragma omp critical, #pragma omp atomic

Clauses:
– Data scope attributes: private, shared, reduction

Runtime library:
– Thread forking/joining: omp_parallel_start(), omp_parallel_end()
– Number of threads: omp_get_num_threads()
– Thread IDs: omp_get_thread_num()
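A small sketch (not from the slides) of the user-visible runtime calls:

#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
    {
        int tid  = omp_get_thread_num();    /* this thread's ID */
        int nthr = omp_get_num_threads();   /* threads in the team */
        printf("Thread %d of %d\n", tid, nthr);
    }
    return 0;
}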
#pragma omp parallel
Most important directive
Enables parallel execution through a call to pthread_create
Code within its scope is replicated among threads
A sequential program...

int main()
{
#pragma omp parallel
  {
    printf("\nHello world!");
  }
}

...is easily parallelized. The compiler transforms it into:

int parfun(…)
{
  printf("\nHello world!");
}

int main()
{
  omp_parallel_start(&parfun, …);
  parfun();
  omp_parallel_end();
}

– Code originally contained within the scope of the pragma is outlined to a new function within the compiler
– The #pragma construct in the main function is replaced with calls to the runtime library
– First the runtime is called to fork new threads, passing them a pointer to the function to execute in parallel
– Then the master itself calls the parallel function
– Finally the runtime is called to synchronize the threads with a barrier and suspend them
#pragma omp parallel: data scope attributes

A slightly more complex example:

int main()
{
  int id;
  int a = 5;
#pragma omp parallel
  {
    id = omp_get_thread_num();
    if (id == 0) {
      a = a * 2;
      printf("Master: a = %d.", a);
    } else
      printf("Slave: a = %d.", a);
  }
}

– The runtime call omp_get_thread_num() returns the thread ID: every thread sees a different value
– Master and slave threads access the same variable a

How do we inform the compiler about these different behaviors? With data scope attributes:

#pragma omp parallel shared (a) private (id)

– shared(a): code is inserted to retrieve the address of the shared object from within each parallel thread
– private(id): each thread contains a private copy of this variable
#pragma omp for
The parallel pragma instructs every thread to execute all of the code inside the block
If we encounter a for loop that we want to divide among threads, we use the for pragma
#pragma omp for: loop partitioning algorithm

Iterations are divided into chunks of C = ceil(N / Nthr) elements; thread TID executes iterations from LB = C * TID up to UB = min(C * (TID + 1), N).

Example: N = 10, Nthr = 4, so each DATA CHUNK is C = 3 elements:

Thread ID (TID):   0   1   2   3
LOWER BOUND (LB):  0   3   6   9
UPPER BOUND (UB):  3   6   9  10

#pragma omp parallel for
for (i = 0; i < 10; i++)
  a[i] = i;
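The chunking formula can be written as a small helper function (a sketch, not the actual runtime code):

/* Static chunking: thread TID of Nthr threads gets iterations [LB, UB)
   of a loop with N iterations. */
void chunk_bounds(int N, int Nthr, int TID, int *LB, int *UB)
{
    int C = (N + Nthr - 1) / Nthr;      /* C = ceil(N / Nthr) */
    *LB = C * TID;
    *UB = (C * (TID + 1) < N) ? C * (TID + 1) : N;
}

For N = 10 and Nthr = 4 this reproduces the bounds in the table above.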
#pragma omp for

int main()
{
#pragma omp parallel for
  for (i = 0; i < 10; i++)
    a[i] = i;
}

is transformed into:

int parfun(…)
{
  for (i = LB; i < UB; i++)
    a[i] = i;
}

int main()
{
  omp_parallel_start(&parfun, …);
  parfun();
  omp_parallel_end();
}

– The code of the for loop is moved inside the parallel function; every thread works on a different subset of the iteration space
– The lower and upper bounds (LB, UB) are computed from the thread ID, according to the algorithm described above
#pragma omp sections

The for pragma exploits data parallelism in loops. OpenMP also provides a directive to exploit task parallelism: #pragma omp sections

Task parallelism example:

int main()
{
#pragma omp parallel
  {
    #pragma omp sections
    {
      v = alpha();
      #pragma omp section
      w = beta();
    }
    #pragma omp sections
    {
      x = gamma(v, w);
      #pragma omp section
      y = delta();
    }
  }
  printf("%f\n", epsilon(x, y));
}

– The dataflow graph of this application shows some parallelism (i.e. functions whose execution doesn't depend on the outcomes of other functions)
– alpha and beta execute in parallel
– gamma and delta execute in parallel
– A barrier is implied at the end of each sections pragma; it prevents gamma from consuming v and w before they are produced
#pragma omp barrier
The most important synchronization mechanism in shared-memory fork/join parallel programming
All threads participating in a parallel region wait until everybody has finished before computation flows on
This prevents later stages of the program from working with inconsistent shared data
A barrier is implied at the end of parallel constructs, as well as of for and sections (unless a nowait clause is specified)
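A small sketch (not from the slides) of the implied barrier and the nowait clause:

#include <stdio.h>
#define N 8
static double a[N], b[N];

int main(void)
{
#pragma omp parallel
    {
#pragma omp for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
        /* implied barrier: every element of a[] is ready here */

#pragma omp for nowait     /* b[] is independent of a[]: skip the barrier */
        for (int i = 0; i < N; i++)
            b[i] = i + 1.0;

#pragma omp barrier        /* explicit barrier before the data is consumed */
    }
    printf("%f %f\n", a[N - 1], b[N - 1]);
    return 0;
}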
#pragma omp critical
Critical Section: a portion of code that only one thread at a time may execute
We denote a critical section by putting the pragma
#pragma omp critical
in front of a block of C code
π-finding code example

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) shared(area)
for (i = 0; i < n; i++) {
  x = (i + 0.5) / n;
  area += 4.0 / (1.0 + x * x);
}
pi = area / n;
We must synchronize accesses to shared variable area to avoid inconsistent results. If we don’t..
Race condition
…we set up a race condition in which one process may “race ahead” of another and not see its change to shared variable area
π-finding code example with critical section

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) shared(area)
for (i = 0; i < n; i++) {
  x = (i + 0.5) / n;
#pragma omp critical
  area += 4.0 / (1.0 + x * x);
}
pi = area / n;
#pragma omp critical protects the code within its scope by acquiring a lock before entering the critical section and releasing it after execution
Correctness, not performance!
As a matter of fact, using locks makes execution sequential
To limit this effect we should use fine-grained locking (i.e. make critical sections as small as possible)
A simple instruction to compute the value of area in the previous example is translated into many simpler instructions within the compiler!
The programmer is not aware of the real granularity of the critical section
Correctness, not performance!

[The original slide showed a dump of the intermediate representation of the program within the compiler.]

This is the way the compiler represents the instruction area += 4.0/(1.0 + x*x); :
– a call into the runtime to acquire the lock
– the lock-protected operations (the critical section)
– a call into the runtime to release the lock
THIS IS DONE AT EVERY LOOP ITERATION!
Correctness, not performance!
A programming pattern such as area += 4.0/(1.0 + x*x); in which we:
– Fetch the value of an operand
– Add a value to it
– Store the updated value
is called a reduction, and is so common that OpenMP provides explicit support for it
OpenMP takes care of storing partial results in private variables and combining the partial results after the loop
Correctness, not performance!

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) shared(area) reduction(+:area)
for (i = 0; i < n; i++) {
  x = (i + 0.5) / n;
  area += 4.0 / (1.0 + x * x);
}
pi = area / n;

The reduction clause instructs the compiler to create private copies of the area variable for every thread. At the end of the loop, the partial sums are combined into the shared area variable.

Unoptimized (critical section) code:
– The shared variable is updated at every loop iteration
– Execution of this critical section is SERIALIZED

Reduction code:
– Inside the loop, each thread computes partial sums on a private copy of the reduction variable
– The shared variable is only updated at the end of the loop, when the partial sums are combined
– The combination is a single atomic write, __sync_fetch_and_add(&.omp_data_i->area, area); the target architecture may provide such an instruction
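What the reduction clause does can also be written by hand; this sketch (not from the slides) accumulates into a private partial sum and touches the shared variable once per thread instead of once per iteration:

#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double area = 0.0;

#pragma omp parallel
    {
        double partial = 0.0;           /* private partial sum */
#pragma omp for
        for (int i = 0; i < n; i++) {
            double x = (i + 0.5) / n;
            partial += 4.0 / (1.0 + x * x);
        }
#pragma omp critical                    /* one combine per thread */
        area += partial;
    }
    printf("pi = %f\n", area / n);
    return 0;
}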
Summary
Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy
Memory latency is well recognized as a severe performance bottleneck
MPSoCs feature complex memory hierarchies, with multiple cache levels, and private and shared on-chip and off-chip memories
Using the memory hierarchy efficiently is of the utmost importance for exploiting the computational power of MPSoCs, but..
Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy
It is a difficult task, requiring a deep understanding of the application and its memory access pattern
The OpenMP standard doesn't provide any facilities to deal with data placement and partitioning
Customization of the programming interface would bring the advantages of OpenMP to the MPSoC world
Data Placement

#pragma omp parallel sections
{
  #pragma omp section
  for (i = 0; i < n; i++)
    A[i][rand()] = foo ();

  #pragma omp section
  for (j = 0; j < n; j++)
    B[j] = goo ();
}

[Diagram: four CPUs, each with a private scratchpad memory (SPM), connected through an interconnect to an off-chip SHARED MEMORY]

OpenMP provides means to map parallel tasks to different processors..
Data Placement: Scenario 1 – Off-chip shared memory

(Same code and platform as above, with arrays A and B allocated in the off-chip shared memory.)

..but not to specify the physical placement of data. This can have a great impact on performance!

Memory reference cost = Bus latency + Off-chip memory latency (70 cycles)
Data Placement: Scenario 2 – On-chip remote scratchpad

(Arrays now allocated in an on-chip scratchpad remote to the accessing processor.)

Memory reference cost = Bus latency + On-chip memory latency (2 cycles)
Data Placement: Scenario 3 – On-chip local scratchpad

(Arrays allocated in the local scratchpad of the processor that accesses them.)

Memory reference cost = Local memory latency (1 cycle)
Data Placement
Dealing with efficient data placement is a difficult and time-consuming activity
Why not have the compiler take care of it?
1. How to drive data allocation to specific memory regions? Through an extension of the OpenMP model
2. How to discover the best mapping of data to memories? Static code analysis is not sufficient; application profiling is needed
Extending OpenMP to support Data Distribution
We need a custom directive that enables specific code analysis and transformation
When static code analysis can’t tell how to distribute data we must rely on profiling
The runtime is responsible for exploiting this information to efficiently map arrays to memories
Extending OpenMP to support Data Distribution
The entire process is driven by the custom #pragma omp distributed directive:

{
  int A[m];
  float B[n];
#pragma omp distributed (A, B)
  …
}

is transformed into:

{
  int *A;
  float *B;
  A = distributed_malloc (m);
  B = distributed_malloc (n);
  …
}

– Originally stack-allocated arrays are transformed into pointers, to allow for their explicit placement throughout the memory hierarchy within the program
– The transformed program invokes the runtime to retrieve the profile information which drives data placement
– When no profile information is found, distributed_malloc returns a pointer to the shared memory
Extending OpenMP to support Data Distribution
1. Annotated code is compiled with the custom OpenMP translator (GCC 4.3.2)
2. A first simulation run takes place; arrays are placed in the shared memory
3. Every access to arrays declared with #pragma omp distributed is monitored
4. At the end of the program, an access count map (accesses by each processor to each array location) is retrieved
5. An allocation algorithm is executed that fills in a metadata structure based on the profile information
6. A second simulation run takes place: the metadata is available to distributed_malloc to figure out the most efficient data mapping
Data partitioning
The OpenMP model is focused on loop parallelism
In this parallelization scheme, multiple threads may access different sections (discontiguous addresses) of shared arrays
Data partitioning is the process of tiling data arrays and placing the tiles in memory such that a maximum number of accesses are satisfied from local memory
The most obvious implementation of this concept is the data cache, but..
– Inter-array discontiguity often causes cache conflicts
– Embedded systems impose constraints on energy, predictability and real-time behavior that often make caches unsuitable
A simple example

#pragma omp parallel for
for (i = 0; i < 4; i++)
  for (j = 0; j < 6; j++)
    A[ i ][ j ] = 1.0;

[Diagram: the 4x6 iteration space (i, j), with i = 0..3 and j = 0..5, partitioned among the four processors]

– The iteration space is partitioned between the processors
– The array is accessed with the loop induction variables, A(i,j), so the data space overlaps with the iteration space: each processor accesses a different tile
– The compiler can therefore split the matrix into four smaller arrays and allocate them onto the scratchpads
– No access to remote memories through the bus, since data is allocated locally
Another example

#pragma omp parallel for
for (i = 0; i < 4; i++)
  for (j = 0; j < 6; j++)
    hist[A[i][j]]++;

[Diagram: the iteration space is partitioned as before, but the accessed array is now indexed by the values stored in A, e.g.
  3 7 7 5 4 4
  3 2 2 3 0 0
  1 1 0 2 3 5
  1 1 4 5 5 4
]

Different locations within array hist are accessed by many processors
Another example
In this case static code analysis can't tell anything about the array access pattern
How to decide the most efficient partitioning?
– Split the array in as many tiles as there are processors
– Use access count information to map each tile to the processor that accesses it most
Another example

[Diagram: array hist split into four tiles, mapped onto the four scratchpads]

Access counts:
         TILE 1  TILE 2  TILE 3  TILE 4
PROC1      1       1       3       2
PROC2      2       4       0       0
PROC3      3       2       1       0
PROC4      2       0       4       1

Now processors need to access remote scratchpads, since they work on multiple tiles!!!
Problem with data partitioning
If the iteration space and the data space do not overlap, multiple processors may need to access different tiles
In this case data partitioning introduces addressing difficulties, because the data tiles can become discontiguous in physical memory
How do we generate efficient code to access data when performing loop and data partitioning?
We can further extend the OpenMP programming interface to deal with that!
The programmer only has to specify the intention of partitioning an array throughout the memory hierarchy; the compiler does the necessary instrumentation
Code Instrumentation
In general, the steps for addressing an array element using tiling are:
1. Compute the offset with respect to the base address
2. Identify the tile to which this element belongs
3. Re-compute the index relative to the current tile
4. Load the tile base address from a metadata array
This metadata array is populated during the memory allocation step of the tool-flow. It relies on access count information to figure out the most efficient mapping of array tiles to memories.
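A hypothetical sketch of these four steps, assuming a row-major array split into equal, contiguous tiles (the function and parameter names are illustrative, not the actual runtime's):

/* Tiled addressing sketch: rewrite A[i][j] into a metadata lookup. */
int tiled_read(int **tiles, int i, int j, int ncols, int tile_size)
{
    int offset = i * ncols + j;        /* 1. offset from the base address */
    int tile   = offset / tile_size;   /* 2. tile this element belongs to */
    int index  = offset % tile_size;   /* 3. index within the tile */
    int *base  = tiles[tile];          /* 4. tile base from the metadata array */
    return base[index];
}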
Extending OpenMP to support data partitioning
The instrumentation process is driven by the custom tiled clause, which can be coupled with every parallel and work-sharing construct:

#pragma omp parallel tiled(A)
{
  …
  /* Access memory */
  A[i][j] = foo();
  …
}

is transformed into:

{
  /* Compute offset, tile and index for the distributed array */
  int offset = …;
  int tile = …;
  int index = …;

  /* Read tile base address */
  int *base = tiles[dvar][tile];

  /* Access memory */
  base[index] = foo();
  …
}