TRANSCRIPT (slide deck, dated 2016-06-27)
Programming Multi-Core Systems with OpenMP
Clemens Grelck
University of Amsterdam
UvA / SURFsara High Performance Computing and Big Data
Clemens Grelck, University of Amsterdam Programming Multi-Core Systems with OpenMP
Programming Multi-Core Systems with OpenMP
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
Target Multi-core Systems
Small-scale general-purpose (x86) multicore processors:
- Intel / AMD commodity processors with 2, 4, 6 or 8 cores
- potentially hyperthreaded
Target Multi-core Systems
Medium-scale server systems:
- multiple (2 or 4 in practice) identical processors
- each processor with several cores
- high-bandwidth data path between processors
Target Multi-core Systems
Large-scale shared address space compute systems:
- large number of slightly simpler cores
- Sun Microsystems / Oracle Niagara / UltraSPARC T series
- up to 512 hardware threads (T3-4 server)
Design Rationale of OpenMP
Ideal:
- Automatic parallelisation of sequential code.
- No additional parallelisation effort for development, debugging, maintenance, etc.

Problem:
- Data dependences are difficult to assess.
- Compilers must be conservative in their assumptions.

Way out:
- Take or write an ordinary sequential program.
- Add annotations/pragmas/compiler directives that guide parallelisation.
- Let the compiler generate the corresponding code.
OpenMP at a Glance
OpenMP as a programming interface:
- Compiler directives
- Library functions
- Environment variables

C/C++ version:

#pragma omp name [clause]*
structured block

Fortran version:

!$OMP name [clause [, clause]*]
code block
!$OMP END name
Hello World with OpenMP
#include <omp.h>
#include <stdio.h>

int main()
{
    printf( "Starting execution with %d threads:\n",
            omp_get_num_threads());

    #pragma omp parallel
    {
        printf( "Hello world says thread %d of %d.\n",
                omp_get_thread_num(),
                omp_get_num_threads());
    }

    printf( "Execution of %d threads terminated.\n",
            omp_get_num_threads());
    return 0;
}
Hello World with OpenMP
Compilation:
gcc -fopenmp hello_world.c
Output using 4 threads:
Starting execution with 1 threads:
Hello world says thread 2 of 4.
Hello world says thread 3 of 4.
Hello world says thread 1 of 4.
Hello world says thread 0 of 4.
Execution of 1 threads terminated.
Hello World with OpenMP
Using 4 threads:
Who determines the number of threads?
- Environment variable: export OMP_NUM_THREADS=4
- Library function: void omp_set_num_threads( int)
OpenMP Execution Model
Classical Fork/Join:
- Master thread executes serial code.
- Master thread encounters the parallel directive.
- Master and slave threads concurrently execute the parallel block.
- Implicit barrier: wait for all threads to finish.
- Master thread resumes serial execution.
Programming Multi-Core Systems with OpenMP
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
Simple Loop Parallelisation
Example: element-wise vector product:
void elem_prod( double *c, double *a, double *b, int len)
{
int i;
#pragma omp parallel for
for (i=0; i<len; i++)
{
c[i] = a[i] * b[i];
}
}
Prerequisite:
- No data dependence between any two iterations.
- Caution: YOU claim this property!
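The prerequisite is easy to violate. A prefix sum, for instance (our own counterexample, not from the slides), carries a dependence from each iteration to the next, so annotating it with `#pragma omp parallel for` would silently compute wrong results:

```c
/* Each iteration reads acc as left by the previous iteration:
 * a loop-carried dependence, so this loop must stay sequential. */
void prefix_sum(double *out, const double *in, int len)
{
    double acc = 0.0;
    for (int i = 0; i < len; i++) {
        acc += in[i];       /* depends on iteration i-1 */
        out[i] = acc;
    }
}
```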
Directive #pragma omp parallel for
What the compiler directive does for you:
- It starts additional worker threads, depending on OMP_NUM_THREADS.
- It divides the iteration space among all threads.
- It lets all threads execute the loop, restricted to their mutually disjoint subsets.
- It synchronises all threads at an implicit barrier.
- It terminates the worker threads.

Restrictions:
- The directive must directly precede the for-loop.
- The for-loop must match a constrained pattern.
- The trip count of the for-loop must be known in advance.
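The restrictions amount to OpenMP's canonical loop form: an initialisation, a comparison against a loop-invariant bound, and a fixed increment, with no break out of the loop. A sketch (function name is ours) of a conforming loop next to a non-conforming one:

```c
#include <omp.h>

/* Conforming: canonical form, trip count (len) known on entry. */
void scale(double *v, int len, double f)
{
    #pragma omp parallel for
    for (int i = 0; i < len; i++)
        v[i] *= f;
}

/* NOT conforming -- trip count unknown in advance, so the
 * directive cannot be applied (shown as a comment only):
 *
 *   while (node != NULL) { process(node); node = node->next; }
 */
```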
Shared and Private Variables
Example:
#pragma omp parallel for
for (i=0; i<len; i++)
{
res[i] = a[i] * b[i];
}
- Shared variable: one instance for all threads
- Private variable: one instance for each thread
Shared and Private Variables
Example:
#pragma omp parallel for
for (i=0; i<len; i++)
{
res[i] = a[i] * b[i];
}
Who decides that res, a, b, and len are shared variables, whereas i is private?

Default rules:
- All variables are shared.
- Only the loop variables of parallel loops are private.
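Because the default rules are easy to misremember, the `default(none)` clause can be used to turn every unclassified variable into a compile-time error — a sketch (the function name is ours):

```c
#include <omp.h>

/* With default(none) the compiler rejects any variable used in the
 * region that is not listed explicitly, so a forgotten
 * private/shared decision becomes a compile error. */
void elem_prod_explicit(double *res, const double *a,
                        const double *b, int len)
{
    int i;
    #pragma omp parallel for default(none) \
            private(i) shared(res, a, b, len)
    for (i = 0; i < len; i++)
        res[i] = a[i] * b[i];
}
```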
Parallelisation of a Less Simple Loop
Mandelbrot set:

double x, y;
int i, j, max = 200;
int depth[M][N];
...
for (i=0; i<M; i++) {
    for (j=0; j<N; j++) {
        x = (double) i / (double) M;
        y = (double) j / (double) N;
        depth[i][j] = mandelval( x, y, max);
    }
}

Properties to check:
- No data dependencies between loop iterations? YES!
- Trip count known in advance? YES!
- Function mandelval without side-effects?
Parallelisation of a Less Simple Loop
Function mandelval:
int mandelval( double xx, double yy, int max)
{
    int i = 0;
    double x = xx, y = yy;
    while (x*x + y*y <= 4.0 && i < max) {
        double xt = x*x - y*y + xx;
        y = 2.0 * x * y + yy;   /* use the old x, not the updated one */
        x = xt;
        i++;
    }
    return i;
}
Parallelisation of a Less Simple Loop
Properties to check:
- No data dependencies between loop iterations? YES!
- Trip count known in advance? YES!
- Function mandelval without side-effects? YES!
- Only the loop variable i needs to be private? NO! Check x, y, and j.
Parallelisation of a Less Simple Loop
Mandelbrot set:

double x, y;
int i, j, max = 200;
int depth[M][N];
...
#pragma omp parallel for private( x, y, j) shared( M, N, max)
for (i=0; i<M; i++) {
    for (j=0; j<N; j++) {
        x = (double) i / (double) M;
        y = (double) j / (double) N;
        depth[i][j] = mandelval( x, y, max);
    }
}

Private clause:
- Directives may be refined by clauses.
- The private clause allows us to tag any variable as private.
- Caution: private variables are not initialised outside the parallel section!
- The shared clause allows us to explicitly mark shared variables.
Parallelisation of a Less, Less Simple Loop
Mandelbrot set with additional counter:
int total = 0;
...
for (i=0; i<M; i++) {
    for (j=0; j<N; j++) {
        x = (double) i / (double) M;
        y = (double) j / (double) N;
        depth[i][j] = mandelval( x, y, max);
        total = total + depth[i][j];
    }
}

Problems:
- The new variable total introduces a data dependence.
- This dependence could be ignored due to the associativity of addition.
- The variable total must be shared.
- Incrementing total must avoid a race condition.
Parallelisation of a Less, Less Simple Loop
Mandelbrot set with additional counter:
int total = 0;
...
#pragma omp parallel for private( x, y, j)
for (i=0; i<M; i++) {
for (j=0; j<N; j++) {
x = (double) i / (double) M;
y = (double) j / (double) N;
depth[i][j] = mandelval( x, y, max);
#pragma omp critical
{
total = total + depth[i][j];
}
}
}
Critical Regions
The critical directive:
- The directive must immediately precede a new statement block.
- The statement block is executed without interleaving.
- The directive implements a critical region.
Equivalence:
#pragma omp critical
{
<statements >
}
pthread_mutex_lock( &lock);
<statements >
pthread_mutex_unlock( &lock);
Disadvantage:
- All critical regions in the entire program are synchronised with each other.
- This causes unnecessary overhead.
Critical Regions
The named critical directive
- Critical regions may be associated with names.
- Critical regions with identical names are synchronised.
- Critical regions with different names are executed concurrently.
Equivalence:
#pragma omp critical (name)
{
<statements >
}
pthread_mutex_lock( &name_lock );
<statements >
pthread_mutex_unlock( &name_lock );
Reduction Operations
Specific solution: reduction clause
#pragma omp parallel for private( x, y, i, j) \
reduction (+: total)
for (i=0; i<M; i++) {
for (j=0; j<N; j++) {
x = (double) i / (double) M;
y = (double) j / (double) N;
depth[i][j] = mandelval( x, y, max);
total = total + depth[i][j];
}
}
Properties:
- The reduction clause only supports built-in reduction operations: +, *, -, &, |, ^, &&, ||.
- User-defined reductions are only supported via critical regions.
- Bit-identical results are not guaranteed, since the combination order may vary.
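A user-defined reduction can thus be built by hand: each thread accumulates a private partial result and enters a critical region once per thread, not once per iteration. A sketch for a maximum reduction (the function name is ours):

```c
#include <omp.h>
#include <math.h>

/* Hand-rolled reduction: every thread accumulates a private
 * partial maximum and enters the critical region exactly once
 * to merge it -- not once per loop iteration. */
double array_max(const double *a, int len)
{
    double result = -HUGE_VAL;
    #pragma omp parallel
    {
        double local = -HUGE_VAL;            /* per-thread partial */
        #pragma omp for
        for (int i = 0; i < len; i++)
            if (a[i] > local)
                local = a[i];
        #pragma omp critical
        {
            if (local > result)
                result = local;
        }
    }
    return result;
}
```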
Shared and Private Variables Reloaded
Shared variables:
- One instance shared between sequential and parallel execution.
- The value is unaffected by the transition.

Private variables:
- One instance during sequential execution.
- One instance per worker thread during parallel execution.
- No exchange of values.

New: firstprivate variables:
- Like private variables, but ...
- Worker-thread instances are initialised with the master thread's value.
Shared and Private Variables Reloaded
Example:
int a=1, b=2, c=3;
#pragma omp parallel for private( a) \
firstprivate( b) \
shared(c)
for (i=0; i<10; i++) {
// before first iteration:
// a : ?? | b : ?? | c : ??
a++; b++; c=i;
}
// a : ?? | b : ?? | c : ??
Shared and Private Variables Reloaded
Example:
int a=1, b=2, c=3;
#pragma omp parallel for private( a) \
firstprivate( b) \
shared( c)
for (i=0; i<10; i++) {
// before first iteration:
// a : undef | b : 2 | c : undef
a++; b++; c=i;
}
// a : 1 | b : 2 | c : undef
Conditional Parallelisation
Problem:
- Parallel execution of a loop incurs overhead:
  - creation of worker threads
  - scheduling
  - synchronisation barrier
- This overhead must be outweighed by a sufficient workload.
- The workload depends on the loop body and the trip count.
Conditional Parallelisation
Example:
if (len < 1000) {
for (i=0; i<len; i++)
{
res[i] = a[i] * b[i];
}
}
else {
#pragma omp parallel for
for (i=0; i<len; i++)
{
res[i] = a[i] * b[i];
}
}
Conditional Parallelisation
Introducing the if-clause:
if (len < 1000) {
for (i=0; i<len; i++) {
res[i] = a[i] * b[i];
}
}
else {
#pragma omp parallel for
for (i=0; i<len; i++) {
res[i] = a[i] * b[i];
}
}
#pragma omp parallel for if (len >= 1000)
for (i=0; i<len; i++) {
res[i] = a[i] * b[i];
}
Programming Multi-Core Systems with OpenMP
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
Loop Scheduling
Definition:
I Loop scheduling determines which iterations are executed by which thread.
Aim:
I Equal workload distribution
[Figure: four tasks with unequal amounts of work over time; threads that finish early wait at the synchronisation barrier.]
Loop Scheduling
Problem:
I Different situations require different techniques
The schedule clause:
#pragma omp parallel for schedule(<type> [, <chunk>])
for (...)
{
...
}
Properties:
I Clause selects one out of a set of scheduling techniques.
I Optionally, a chunk size can be specified.
I Default chunk size depends on scheduling technique.
Loop Scheduling
Static scheduling:
#pragma omp parallel for schedule(static)
I Loop is subdivided into as many chunks as threads exist.
I Often called block scheduling.
Static scheduling with chunk size:
#pragma omp parallel for schedule(static, <n>)
I Loop is subdivided into chunks of n iterations.
I Chunks are assigned to threads in a round-robin fashion.
I Also called block-cyclic scheduling.
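The round-robin distribution of block-cyclic scheduling can be made explicit. The following helper is not part of OpenMP; it is a purely illustrative model of which thread executes a given iteration under schedule(static, chunk) with a fixed number of threads:

```c
/* Model of block-cyclic scheduling: iterations are grouped into
   chunks of `chunk` consecutive iterations, and chunks are dealt
   out to threads 0, 1, ..., nthreads-1 in round-robin fashion. */
int block_cyclic_owner(int i, int chunk, int nthreads)
{
    return (i / chunk) % nthreads;
}
```

With chunk size 4 and 2 threads, iterations 0 to 3 go to thread 0, iterations 4 to 7 to thread 1, iterations 8 to 11 back to thread 0, and so on.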
Loop Scheduling
Dynamic scheduling:
#pragma omp parallel for schedule(dynamic, <n>)
I Loop is subdivided into chunks of n iterations.
I Chunks are dynamically assigned to threads on their demand.
I Also called self scheduling.
I Default chunk size: 1 iteration.
Properties:
I Allows for dynamic load distribution and adjustment.
I Requires additional synchronization.
I Generates more overhead than static scheduling.
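Dynamic scheduling pays off when iteration costs vary strongly. The following sketch uses a deliberately irregular workload (the function name and the triangular-sum computation are made up for illustration); the reduction clause is standard OpenMP, used here to accumulate safely across threads:

```c
/* Iteration i costs O(i) work, so early iterations are cheap and
   late ones expensive: schedule(dynamic, 1) lets idle threads grab
   the next iteration on demand instead of waiting at the barrier. */
long triangular_sum(int n)
{
    long total = 0;
    int i;
    #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
    for (i = 0; i < n; i++) {
        long s = 0;
        for (int j = 0; j <= i; j++)   /* cost grows with i */
            s += j;
        total += s;
    }
    return total;
}
```

With schedule(static) here, the thread holding the last block of iterations would do most of the work while the others wait; dynamic scheduling evens that out at the price of per-chunk synchronisation.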
Loop Scheduling
Dilemma of chunk size selection:
I Small chunk sizes mean good load balancing, but high synchronisation overhead.
I Large chunk sizes reduce synchronisation overhead, but result in poor load balancing.
Rationale of guided scheduling:
I In the beginning, large chunks keep synchronisation overhead small.
I When approaching the final barrier, small chunks balance the workload.
Loop Scheduling
Guided scheduling:
#pragma omp parallel for schedule(guided, <n>)
I Chunks are dynamically assigned to threads on their demand.
I Initial chunk size is implementation dependent.
I Chunk size decreases exponentially with every assignment.
I Also called guided self scheduling.
I Minimum chunk size: n (default: 1)
Example:
I Total number of iterations: 250
I Initial / minimal chunk size: 50 / 5
I Each chunk is 80% of the previous one:
50 – 40 – 32 – 26 – 21 – 17 – 14 – 12 – 10 – 8 – 6 – 5 – 5 – 4
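The decreasing chunk sizes can be simulated with a few lines of C. Note the hedge: the exact rule is implementation defined, and this "80% of the previous chunk, at least the minimum, at most what remains" model only approximates the sequence on the slide (real runtimes typically derive the chunk from the remaining iteration count):

```c
/* Model of guided scheduling: each chunk is 80% of the previous
   one, never below min_chunk and never beyond the remaining
   iterations. Real OpenMP runtimes use their own implementation-
   defined rule, so this is an approximation, not a specification.
   Returns the number of chunks written to out[] (at most max). */
int guided_chunks(int total, int first, int min_chunk, int out[], int max)
{
    int remaining = total, chunk = first, k = 0;
    while (remaining > 0 && k < max) {
        int c = chunk;
        if (c < min_chunk) c = min_chunk;
        if (c > remaining) c = remaining;
        out[k++] = c;
        remaining -= c;
        chunk = (chunk * 4) / 5;   /* next chunk: 80% of previous */
    }
    return k;
}
```

Whatever the exact rule, the chunk sizes always add up to the total iteration count, and the sequence starts large and ends near the minimum chunk size.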
Programming Multi-Core Systems with OpenMP
OpenMP at a Glance
Loop Parallelization
Scheduling
Outlook
What’s More?
More in OpenMP-2:
I Decouple parallel regions from work sharing
I Control synchronisation barriers
I Task parallel sections
I Low-level locks and condition variables
I ...
More in OpenMP-3:
I Nested parallel regions
I Spawning and synchronisation of tasks
I ...
More information:
I www.openmp.org
The End: Questions?