
Lecture 3: More on OpenMP

with application to a Monte Carlo solver

Mike Giles

Mathematical Institute


False sharing

In the first lecture, we discussed cache lines and the problem of false sharing, in which different threads try to update different parts of the same cache line.

In Practical 1 you saw the potential consequences of this for OpenMP performance.

In real applications this is probably my most common concern, and it’s hard to diagnose. This is why I check to see whether I am getting a good fraction of peak performance; if I’m not, then I look to see if I might have a false sharing problem.


False sharing

Optimal approach – parallel outer j loop, vector inner i loop

(Diagram: i-j grid illustrating the cache lines and execution vectors.)


False sharing

Poor approach – parallel outer i loop, vector inner j loop

(Diagram: i-j grid illustrating the cache lines and execution vectors.)


Practical 1

Parallelising the inner loop is much worse.

Starting and stopping a large team of threads is expensive, so it is generally best to parallelise the outer loop unless that doesn’t offer enough parallelism.

Following on from this brings us to ideas of parallel sections and work sharing.


Parallel sections

The OpenMP construct

#pragma omp parallel for ...

is really a concatenation of two different constructs:

#pragma omp parallel shared(...) private (...)

{

}

which defines a parallel section of code which is to be executed by a team of threads, and

#pragma omp for

which is a work-sharing directive to say that different threads in the team should do different parts of the loop.
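As an illustrative sketch (the loop body, the array a and its size n are invented for this example), the combined form

void scale(int n, double *a) {
  #pragma omp parallel for shared(a)      /* combined construct */
  for (int i=0; i<n; i++) a[i] *= 2.0;
}

is equivalent to a parallel section containing a work-sharing directive:

void scale2(int n, double *a) {
  #pragma omp parallel shared(a)          /* parallel section executed by the team */
  {
    #pragma omp for                       /* loop iterations shared across the team */
    for (int i=0; i<n; i++) a[i] *= 2.0;
  }
}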


Parallel sections

Some experts argue strongly that it is bad programming practice to use the concatenated form, but I like its simplicity and find it very useful for lots of applications.

However, there are some applications for which the parallel sectionapproach is ideal.

Note that if there are no other pragmas in the parallel section then all threads will execute the same code identically.

The RTL function omp_get_thread_num can be used to get a thread ID (0 up to #threads−1), and this can be used to change what each thread does.
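A minimal sketch of this pattern (the split of work between thread 0 and the others is invented for illustration):

#include <omp.h>
#include <stdio.h>

int main(void) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();              /* thread ID: 0 .. #threads-1 */
    if (tid == 0)
      printf("thread 0 handles the I/O\n");
    else
      printf("thread %d does other work\n", tid);
  }
  return 0;
}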


Work sharing

By default,

#pragma omp for

splits the execution of the subsequent loop into chunks of equal size (or as close as possible) for the different threads to execute.

This can be changed by using a schedule “clause”

schedule(static) – the default

schedule(static,chunk_size) – similar but assigns the specified chunk sizes round-robin to each thread (sometimes useful when the work per loop element varies)

schedule(dynamic,chunk_size) – similar again, but new chunks are assigned when previous ones are completed (again useful for variable work, but loses cache reuse between loops)

see documentation for other alternatives (guided, auto, runtime)
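A sketch of the dynamic schedule for a loop whose cost varies per element (the work function and the chunk size of 8 are invented for illustration):

/* stand-in for real work whose cost grows with the element index */
double expensive_and_variable(int i) {
  double s = 0.0;
  for (int k=0; k<=i; k++) s += 1.0/(k+1);
  return s;
}

void compute(int n, double *result) {
  #pragma omp parallel for schedule(dynamic,8)   /* hand out chunks of 8 as threads finish */
  for (int i=0; i<n; i++)
    result[i] = expensive_and_variable(i);
}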


Reductions

As well as addition, the reduction clause can handle ∗, −, max, min and various logical operators.

It is even possible to define your own reduction operator.

The latest OpenMP version 4.5 extends the syntax to arrays so that

reduction(+:array[:10])

will apply the reduction operation to an array of length 10.

i.e. each loop iteration creates an array of length 10, and these are all added together to create an overall sum array of length 10.

Note: Intel compiler icc version 18.0 is needed for this.
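A sketch of how the array reduction might be used (the histogram example and helper function are invented):

/* placeholder for a real mapping of sample n to one of 10 bins */
int compute_bin(int n) { return n % 10; }

void histogram(int N, double hist[10]) {
  for (int b=0; b<10; b++) hist[b] = 0.0;

  #pragma omp parallel for reduction(+:hist[:10])
  for (int n=0; n<N; n++)
    hist[compute_bin(n)] += 1.0;    /* each thread adds into its own private copy */
}

Each thread accumulates into a private 10-element copy, and the copies are summed into hist at the end of the loop.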


Critical section

Within a parallel section, a critical sub-section is a piece of code which must be executed one thread at a time.

#pragma omp critical

{

}

I’ve used it before for array reductions, but now that those are supported in version 4.5 it’s better to use the OpenMP reduction.

But there may be other times when it’s useful to avoid conflicts when different threads are updating the same variables in memory.

There is also a similar #pragma omp atomic capability for single instructions (e.g. to increment a shared counter)
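A sketch of both (the shared counter and the messages are invented for illustration):

#include <stdio.h>

int main(void) {
  int count = 0;

  #pragma omp parallel shared(count)
  {
    #pragma omp critical
    {                                   /* multi-statement update, one thread at a time */
      count = count + 1;
      printf("count is now %d\n", count);
    }

    #pragma omp atomic                  /* single-instruction update of a shared variable */
    count++;
  }
  return 0;
}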


Single / master / barrier

#pragma omp single

{

}

means that the sub-section is executed only once, by the first thread which reaches it.

#pragma omp master is similar but the execution is by the master thread and there is no synchronisation at the end.

Synchronisation is provided by #pragma omp barrier — threads wait there until they have all reached it
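A sketch putting these together (the helper functions and the data array are invented for illustration):

void read_input(double *data);     /* hypothetical helpers */
void process(double *data);
void write_output(double *data);

void run(double *data) {
  #pragma omp parallel shared(data)
  {
    #pragma omp single
    read_input(data);      /* done once, by the first thread to arrive;
                              implicit barrier at the end of single      */

    process(data);         /* executed by every thread in the team       */

    #pragma omp barrier    /* wait until all threads have finished       */

    #pragma omp master
    write_output(data);    /* master thread only, no barrier afterwards  */
  }
}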


Nested parallelism

So far I have only described a single level of parallelism.

Some implementations also support nested parallelism in which, for example, one might start with a team of 4 threads, and then each of those can in turn generate a team of 5 threads so that now there are 20 threads in total.

This has the potential to get very confusing, and may have little benefit in most situations, but may be very useful in some.

(This is something which developers of mathematical libraries worry about. For high performance they want to use multithreading, but they don’t know whether the higher-level application is already doing multi-threading.)


Nested parallelism

One example is a 3D grid application:

for (int k=0; k<K; k++) {
  for (int j=0; j<J; j++) {
    for (int i=0; i<I; i++) {
      u2[i+I*j+I*J*k] = ....
    }
  }
}

If K = 100 and there are 72 threads then parallelising the outer loop is poor:

some threads do 1 value of k, and some do 2 (poor load-balancing – the loop takes as long as the threads doing 2 values, roughly 1.4× slower than a perfectly even split)

all threads will be reading lots of data from neighbouring threads (poor data locality)


Nested parallelism

One possibility is to manually collapse/merge the outer two loops

for (int jk=0; jk<J*K; jk++) {
  for (int i=0; i<I; i++) {
    u2[i+I*jk] = ....
  }
}

This addresses the load-balancing problem, but not the data access problem.

OpenMP has a collapse(2) clause which has a similar effect.
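A sketch of the collapse clause applied to the original loop nest (the u1-to-u2 copy stands in for the real update, which is elided above):

void update(int I, int J, int K, const double *u1, double *u2) {
  #pragma omp parallel for collapse(2)    /* fuse the k and j loops for scheduling */
  for (int k=0; k<K; k++)
    for (int j=0; j<J; j++)
      for (int i=0; i<I; i++)
        u2[i+I*j+I*J*k] = u1[i+I*j+I*J*k];
}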


Nested parallelism

An alternative is to use 12-way parallelism on the k loop, and then 6-way parallelism on the j loop.

omp_set_num_threads(12);

#pragma omp parallel for shared(...)
for (int k=0; k<K; k++) {

  omp_set_num_threads(6);

  #pragma omp parallel for shared(...)
  for (int j=0; j<J; j++) {
    for (int i=0; i<I; i++) {
      u2[i+I*j+I*J*k] = ....
    }
  }
}


Nested parallelism

It can also be implemented using num_threads clauses:

#pragma omp parallel for shared(...) num_threads(12)
for (int k=0; k<K; k++) {

  #pragma omp parallel for shared(...) num_threads(6)
  for (int j=0; j<J; j++) {
    for (int i=0; i<I; i++) {
      u2[i+I*j+I*J*k] = ....
    }
  }
}

Both of these are likely to be much better than loop collapsing, but I have not tried either myself!


SIMD parallelism

In Practical 1, the compiler automatically vectorised the innermost loop.

Sometimes it is necessary to force (or at least strongly encourage) the compiler to use vectorisation by using the pragma

#pragma omp simd

which has an optional clause simdlen(length)

(SIMD = Single Instruction Multiple Data)
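A minimal sketch (the saxpy-style loop and the simdlen value of 8 are illustrative):

void saxpy(int n, double a, const double *x, double *y) {
  #pragma omp simd simdlen(8)             /* ask for vectorisation in chunks of 8 */
  for (int i=0; i<n; i++)
    y[i] = a*x[i] + y[i];
}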


Memory alignment

This is another topic which we don’t have time to get into in detail.

When loading in a 512-bit vector of data, there are three possibilities:

contiguous and aligned – corresponds perfectly to one 64-byte cache line

contiguous and unaligned – corresponds to pieces of two cache lines

non-contiguous (e.g. u[4*i] or u[index[i]]) – could correspond to several cache lines

The third is obviously worst. A “strided” access (first example) can perhaps be avoided by changing the data layout, or restructuring loops, but a “gather” (second example) is often unavoidable.
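For concreteness, a sketch of the two non-contiguous patterns (the arrays and reductions are invented for illustration):

double strided_sum(int n, const double *u) {
  double s = 0.0;
  for (int i=0; i<n; i++) s += u[4*i];        /* strided: every 4th element   */
  return s;
}

double gather_sum(int n, const double *u, const int *index) {
  double s = 0.0;
  for (int i=0; i<n; i++) s += u[index[i]];   /* gather: indirect addressing  */
  return s;
}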


Memory alignment

The second possibility (contiguous but unaligned) can sometimes be avoided by

declaring arrays to be “aligned” with the start of a cache line

double a[8] __attribute__((aligned(64)));

double *b = (double *) _mm_malloc(num_bytes,64);

_mm_free(b);

telling the compiler (repeatedly!)

void myfunc( double p[], int n ) {
  __assume_aligned(p, 64);
  for (int i=0; i<n; i++) p[i]++;
}

#pragma omp simd aligned(p:64)

for(int i=0; i<n; i++) p[i]++;

Note: I’ve never used these myself, but I regularly think I should


Global scope variables

Global variables are defined outside the main program or any function, and have global scope – which means they can be referenced anywhere.

I often use this for constants, set once and then used throughout the application. I almost never use it for variables which are repeatedly changed – I pass these through function argument lists.

In OpenMP, by default such variables are shared variables.

There is a special threadprivate capability to declare them as private, creating separate copies for each thread – very rarely needed but very useful when it is.
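A minimal sketch with an invented global counter (Practical 2 below uses the same mechanism for the per-thread RNG stream and storage):

int gcount;                          /* global variable, shared by default */
#pragma omp threadprivate(gcount)    /* now each thread gets its own copy  */

void tally(void) {
  #pragma omp parallel
  {
    gcount = 0;    /* initialises this thread's private copy only          */
    /* ... each thread can now update gcount without any conflicts ...     */
  }
}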


OpenMP RTL

The full list of functions is available here:

https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-openmp-run-time-library-routines

There are about 30, but I have mentioned all of those that I have actually used.


Environment variables

The full list of environment variables is available here, as part of a bigger list of environment variables recognised at run-time:

https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-supported-environment-variables

This includes the following:

OMP_DYNAMIC which controls whether the number of threads can be adjusted dynamically (default: FALSE)

OMP_NESTED which controls whether nested parallelism is allowed (default: FALSE) – see the sketch below

OMP_PLACES which controls the placement of threads on cores
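In particular, the nested-team examples earlier only create inner teams if nested parallelism is enabled, either by setting OMP_NESTED=TRUE in the environment or, as in this minimal sketch, with the equivalent RTL call:

#include <omp.h>

int main(void) {
  omp_set_nested(1);    /* RTL equivalent of OMP_NESTED=TRUE */

  /* ... nested parallel regions as on the earlier slides ... */

  return 0;
}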


Additional tools

We don’t have time to discuss software tools for development, debugging and performance profiling:

Intel tools, all part of Parallel Studio XE:
https://software.intel.com/en-us/parallel-studio-xe

◮ Intel Advisor (inc Roofline Analysis)
https://software.intel.com/en-us/advisor
– helps to develop and assess parallel/vector code

◮ Intel Inspector
https://software.intel.com/en-us/intel-inspector-xe
– identifies code errors such as race conditions

◮ Intel VTune Amplifier
https://software.intel.com/en-us/intel-vtune-amplifier-xe
– assesses performance and possible performance bottlenecks


Additional tools

ARM/Allinea tools in Arm Forge
https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge

◮ DDT
https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge/ddt
– parallel debugger

◮ MAP
https://www.arm.com/products/development-tools/hpc-tools/cross-platform/forge/map
– assesses performance and possible performance bottlenecks

Vampir: https://www.vampir.eu/
– free tool developed by the Jülich supercomputing centre in Germany


Practical 2

Practical 2 is a Monte Carlo solver in which we want to compute

\sum_n X_n , \qquad \sum_n X_n^2

where the X_n are each generated using different random numbers.

High-level view:

trivially parallel, just needs final top-level reduction

each thread needs its own random number generator

should use the Intel MKL library to generate random numbers efficiently in chunks

chunks should be big enough for good compute performance, but small enough to stay in L2 cache for good bandwidth


Practical 2

/* each OpenMP thread has its own VSL RNG and storage */

#define RV_BYTES 65536

double *dW;

VSLStreamStatePtr stream;

#pragma omp threadprivate(stream, dW)

Here we use threadprivate to define separate generators for each thread. RV_BYTES will be the size of dW, chosen to fit into each L2 cache.


Practical 2

#pragma omp parallel shared(...) reduction(+:sum1,sum2)
{ ...
  int num_t = omp_get_num_threads();
  int tid   = omp_get_thread_num();

  vslNewStream(&stream, VSL_BRNG_MRG32K3A, 1337);
  long long skip = ((long long) (tid+1)) << 48;
  vslSkipAheadStream(stream, skip);

  dW = (double *)malloc(RV_BYTES);

  int N3 = ((tid+1)*((long long) N)) / num_t
         - ( tid   *((long long) N)) / num_t;

  pathcalc(T,X0,mu,sigma,dt, M,N3, &sum1_t,&sum2_t);

  sum1 += sum1_t; sum2 += sum2_t;
}


Practical 2

/* work out max number of paths in a group */

int bytes_per_path = M*sizeof(double);
int N2 = RV_BYTES / bytes_per_path;

/* loop over all paths in groups of size N2 */

for (int n0=0; n0<N; n0+=N2) {

  /* may have to reduce size of final group */
  if (N2>N-n0) N2 = N-n0;

  /* generate required random numbers for this group */
  vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2,
                stream, M*N2, dW, 0, sqrt(dt));

  ...
}
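As a worked example with an illustrative value of M = 64 timesteps per path, each path needs 64 × 8 = 512 bytes of Gaussian increments, so N2 = 65536 / 512 = 128 paths are generated per group.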


Practical 2

#pragma omp simd reduction(+:sum1,sum2) simdlen(32)
for (int n2=0; n2<N2; n2++) {
  double X = X0;

  for (int m=0; m<M; m++) {
    double delW = dW[m*N2+n2];   // sequential in n2
    X = X*(1.0 + mu*dt + sigma*delW);
  }

  sum1 += X; sum2 += X*X;
}

The compiler generates code which does the outer loop in chunks of 32, with each chunk being vectorised by swapping with the inner loop.


Final comments

this lecture covered a lot of more advanced / exotic features – don’t feel that you need to understand and use all of them

concentrate on the fundamentals, and then expand into using the more advanced ones purely as needed

remember to check on the web for examples, or talk to others with more experience

always monitor your performance to assess whether you are doing a good job
