Parallel processing with OpenMP
users.ics.aalto.fi/suomela/ppc-2016/ppc-lectures-2.pdf

TRANSCRIPT

Parallel processing with OpenMP
#pragma omp
Bit-level parallelism: long words
Instruction-level parallelism: automatic
SIMD, vector instructions: vector types
Multiple threads: OpenMP
GPU: CUDA
GPU + CPU in parallel: CUDA
Multi-core computers
• Modern CPUs have multiple cores (Maari-A: 4)
• Each core has its own execution units and runs its own thread of code, independently
• Each core has access to the main memory
• Some resources are shared (e.g. some caches)
Multi-core programming
• Simply launch multiple threads
  • easy with OpenMP
• If you have ≤ 4 threads running in Maari-A, each thread will run on a different core
  • if you have 1000 threads, the operating system will have to do some time slicing…
OpenMP
• Extension of C, C++, Fortran
• Standardised, widely supported
• Just compile and link your code with:
  • gcc -fopenmp
  • g++ -fopenmp
OpenMP
• You add #pragma omp directives in your code to tell what to parallelise and how
• The compiler and the operating system take care of everything else
• You can often write your code so that it works fine even if you ignore all #pragmas
a();
#pragma omp parallel
{
    b();
}
c();

[Diagram: one thread runs a(); inside the parallel region, each of 4 threads runs b(); after the region, one thread runs c().]
a();
#pragma omp parallel
{
    b();
    #pragma omp for
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
}
d();

[Diagram: one thread runs a(); each of 4 threads runs b(); the iterations c(0)…c(9) are divided among the threads; after the region, one thread runs d().]
a();
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
    d();
}
e();

[Diagram: the iterations c(0)…c(9) are divided among 4 threads; the for construct ends with an implicit barrier, so all threads wait for the last c(i) before each runs d(); finally one thread runs e().]
a();
#pragma omp parallel
{
    #pragma omp for nowait
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
    d();
}
e();

[Diagram: as above, but nowait removes the barrier after the loop: each thread runs d() as soon as its own iterations are done; finally one thread runs e().]
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
}

is equivalent to

#pragma omp parallel for
for (int i = 0; i < 10; ++i) {
    c(i);
}
a();
#pragma omp parallel
{
    b();
    #pragma omp critical
    {
        c();
    }
}
d();

[Diagram: all 4 threads run b() concurrently, but only one thread at a time is inside the critical section running c(); afterwards one thread runs d().]
a();
#pragma omp parallel
{
    b();
    #pragma omp for nowait
    for (int i = 0; i < 10; ++i) {
        c(i);
    }
    #pragma omp critical
    {
        d();
    }
}
e();

[Diagram: each thread runs b(), then its share of c(0)…c(9) without waiting at the end of the loop, then enters the critical section to run d() one thread at a time; finally one thread runs e().]
global_initialisation();
#pragma omp parallel
{
    local_initialisation();
    #pragma omp for nowait
    for (int i = 0; i < 10; ++i) {
        do_some_work(i);
    }
    #pragma omp critical
    {
        update_global_data();
    }
}
report_result();
// shared variable
int sum_shared = 0;
#pragma omp parallel
{
    // private variables (one for each thread)
    int sum_local = 0;
    #pragma omp for nowait
    for (int i = 0; i < 10; ++i) {
        sum_local += i;
    }
    #pragma omp critical
    {
        sum_shared += sum_local;
    }
}
print(sum_shared);
OpenMP: memory model
what can be done without synchronisation?
OpenMP memory model
• Contract between programmer & system
• Local "temporary view", global "memory"
  • threads read & write the temporary view
  • it may or may not be consistent with memory
• Consistency is guaranteed only after a "flush"
OpenMP memory model
• Implicit "flush", e.g.:
  • when entering/leaving "parallel" regions
  • when entering/leaving "critical" regions
• Mutual exclusion:
  • for "critical" regions
int a = 0;
#pragma omp parallel
{
    #pragma omp critical
    {
        a += 1;
    }
}

[Diagram: two threads increment a inside critical sections. Between flushes, a thread's temporary view of a may be stale ("?"), but the flushes on entering and leaving each critical section keep the views consistent with memory, which goes 0 → 1 → 2.]
Simple rules
• Permitted (without explicit synchronisation):
  • multiple threads reading, no thread writing
  • one thread writing, the same thread reading
• Forbidden (without explicit synchronisation):
  • multiple threads writing
  • one thread writing, another thread reading
Simple rules
• Smallest meaningful unit = array element
• Many threads can access the same array
• Just be careful if they access the same array element
  • even if you try to manipulate different bits
Simple rules
• Safe:
  • thread 1: p[0] = q[0] + q[1]
  • thread 2: p[1] = q[1] + q[2]
  • thread 3: p[2] = q[2] + q[3]

Simple rules
• Safe:
  • thread 1: p[0] = p[0] + q[1]
  • thread 2: p[1] = p[1] + q[2]
  • thread 3: p[2] = p[2] + q[3]

Simple rules
• Not permitted without synchronisation:
  • thread 1: p[0] = q[0] + p[1]
  • thread 2: p[1] = q[1] + p[2]
  • thread 3: p[2] = q[2] + p[3]
• "Data race", unspecified behaviour

Simple rules
• Not permitted without synchronisation:
  • thread 1: p[0] = q[0] + q[1]
  • thread 2: p[0] = q[1] + q[2]
  • thread 3: p[0] = q[2] + q[3]
• "Data race", unspecified behaviour

Simple rules
• Not permitted without synchronisation:
  • thread 1: p[0] = 1
  • thread 2: p[0] = 1
  • thread 3: p[0] = 1
• "Data race", unspecified behaviour
Filtering is very easy

void filter(const int* data, int* result) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        result[i] = compute(data[i]);
    }
}
Filtering is very easy

static void median(const array_t x, array_t y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int ia = (i + n - 1) % n;
            int ib = (i + 1) % n;
            int ja = (j + n - 1) % n;
            int jb = (j + 1) % n;
            y[i][j] = median(x[i][j], x[ia][j], x[ib][j],
                             x[i][ja], x[i][jb]);
        }
    }
}
OpenMP: variables
private or shared?
// shared variable
int sum_shared = 0;
#pragma omp parallel
{
    // private variables (one for each thread)
    int sum_local = 0;
    #pragma omp for nowait
    for (int i = 0; i < 10; ++i) {
        sum_local += i;
    }
    #pragma omp critical
    {
        sum_shared += sum_local;
    }
}
print(sum_shared);
Two kinds of variables
• Shared variables
  • shared among all threads
  • be very careful with data races!
• Private variables
  • each thread has its own variable
  • safe and easy
// OK:
for (int i = 0; i < n; ++i) {
    float tmp = x[i];
    y[i] = tmp * tmp;
}

// OK:
float tmp;
for (int i = 0; i < n; ++i) {
    tmp = x[i];
    y[i] = tmp * tmp;
}
// OK:
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    float tmp = x[i];
    y[i] = tmp * tmp;
}

// Bad (data race):
float tmp;
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    tmp = x[i];
    y[i] = tmp * tmp;
}
// OK (just unnecessarily complicated):
#pragma omp parallel
{
    float tmp;
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        tmp = x[i];
        y[i] = tmp * tmp;
    }
}
Two kinds of variables
• Shared variables and private variables
• If necessary, you can customise this:
  • #pragma omp parallel private(x)
  • #pragma omp parallel shared(x)
  • #pragma omp parallel firstprivate(x)
• Seldom needed, defaults usually fine
Best practices
• Use subroutines!
  • much easier to avoid accidents with shared variables this way
• Keep the function with #pragmas as short as possible
  • e.g. just call another function in a for loop
OpenMP: synchronisation
critical sections and atomics
// Good, no critical section needed: 4 ms
#pragma omp parallel for
for (int i = 0; i < 10000000; ++i) {
    ++v[i];
}

// Bad, very slow: 40 000 ms
#pragma omp parallel for
for (int i = 0; i < 10000000; ++i) {
    #pragma omp critical
    {
        ++v[i];
    }
}
// Bad: no data race, but the final value of a is unpredictable
// (increments can be lost between the two critical sections):
int a = 0;
#pragma omp parallel
{
    int b;
    #pragma omp critical
    {
        b = a;
    }
    ++b;
    #pragma omp critical
    {
        a = b;
    }
}
// OK:
int a = 0;
#pragma omp parallel
{
    int b;
    #pragma omp critical
    {
        b = a;
        ++b;
        a = b;
    }
}
// Bad: all threads write to the same output stream
#pragma omp parallel for
for (int i = 0; i < 10; ++i) {
    int v = calculate(i);
    std::cout << v << std::endl;
}

// OK (but no guarantees on the order of lines):
#pragma omp parallel for
for (int i = 0; i < 10; ++i) {
    int v = calculate(i);
    #pragma omp critical
    {
        std::cout << v << std::endl;
    }
}
Naming critical sections
• You can give names to critical sections:
  • #pragma omp critical (myname)
• Different threads can be inside critical sections with different names simultaneously
• No name = the same name
#pragma omp parallel for
for (int i = 0; i < 10; ++i) {
    b();
    #pragma omp critical (xxx)
    {
        c();
    }
    #pragma omp critical (yyy)
    {
        d();
    }
}

[Diagram: b() calls run in parallel; at most one thread at a time runs c() and at most one runs d(), but one thread can be inside c() while another is inside d(), because the sections have different names.]
#pragma omp parallel for
for (int i = 0; i < 10; ++i) {
    int v = calculate(i);
    #pragma omp critical (result)
    {
        result += v;
    }
    #pragma omp critical (output)
    {
        std::cout << v << std::endl;
    }
}
Atomic operation
• Like a tiny critical section
• Very restricted: just for e.g. updating a single variable
• Much more efficient
// Sequential: 200 ms
for (int i = 0; i < n; ++i) {
    int l = v[i] % m;
    ++p[l];
}

// Parallel, atomic: 70 ms
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    int l = v[i] % m;
    #pragma omp atomic
    ++p[l];
}

// Parallel, critical: 40 000 ms
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    int l = v[i] % m;
    #pragma omp critical
    {
        ++p[l];
    }
}
OpenMP: scheduling
#pragma omp for
a();
#pragma omp parallel for
for (int i = 0; i < 16; ++i) {
    c(i);
}
d();

[Diagram: with the default schedule, thread 1 runs c(0)…c(3), thread 2 runs c(4)…c(7), thread 3 runs c(8)…c(11), thread 4 runs c(12)…c(15); then one thread runs d().]
// Good memory locality:
// each thread scans a consecutive part of the array
#pragma omp parallel for
for (int i = 0; i < n; ++i) {
    c(x[i]);
}

[Diagram: the array in memory is split into four consecutive blocks, one per thread.]
a();
#pragma omp parallel for
for (int i = 0; i < 16; ++i) {
    c(i);
}
d();

[Diagram: the default schedule again, consecutive blocks: thread 1 runs c(0)…c(3), thread 2 runs c(4)…c(7), thread 3 runs c(8)…c(11), thread 4 runs c(12)…c(15).]
a();
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < 16; ++i) {
    c(i);
}
d();

[Diagram: round-robin assignment: thread 1 runs c(0), c(4), c(8), c(12); thread 2 runs c(1), c(5), c(9), c(13); and so on.]
a();
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < 16; ++i) {
    c(i);
}
d();

[Diagram: dynamic assignment: iterations are handed out one at a time to whichever thread becomes free next.]
OpenMP scheduling
• Performance for the loop
  for (int i = 0; i < n; ++i) ++v[i];
  with n = 100 000 000:
  • sequential: 50 ms
  • parallel: 50 ms
  • schedule(static,1): 200 ms
  • schedule(dynamic): 4000 ms
OpenMP scheduling
• Performance for the loop
  for (int i = 0; i < n; ++i) v[i] = sqrt(i);
  with n = 100 000 000:
  • sequential: 800 ms
  • parallel: 300 ms
  • schedule(static,1): 300 ms
  • schedule(dynamic): 4000 ms
OpenMP: reductions
… just a convenient shorthand
int g = 0;
#pragma omp parallel
{
    int l = 0;
    #pragma omp for
    for (int i = 0; i < n; ++i) {
        l += v[i];
    }
    #pragma omp atomic
    g += l;
}

≈

int g = 0;
#pragma omp parallel for reduction(+:g)
for (int i = 0; i < n; ++i) {
    g += v[i];
}
OpenMP: speedups in practice
measure!
[Plot: images/second vs. number of threads for median filters MF1 and MF2, 2000 × 2000 pixels, 21 × 21 window, compared with linear speedup.]

[Plot: images/second vs. number of threads; speedup grows up to 4 cores, with a smaller additional gain from Hyper-Threading.]

[Plot: images/second vs. number of threads for median filter MF3, 2000 × 2000 pixels, 21 × 21 window, compared with linear speedup.]

[Plot: images/second vs. number of threads for MF3, 4000 × 4000 pixels, 201 × 201 window, on a 2 × 12-core machine, compared with linear speedup.]
Hyper-threading
• Maari-A:
  • 4 physical cores
  • each core can run 2 hardware threads
• OpenMP defaults:
  • use 4 × 2 = 8 threads
Without hyper-threading
• Each core has 1 thread (1 instruction stream)
• The CPU looks at the instruction stream (up to a certain distance) and executes the next possible instruction
  • it must be independent!
  • it must have execution units available!
With hyper-threading
• Each core has 2 threads (2 instruction streams)
• The CPU looks at both instruction streams
• Possibly more opportunities for finding instructions that can be executed now
Example: multiply and add
• Maari-A computers have "Ivy Bridge" CPUs
• They have independent parallel units for:
  • floating-point multiplication (vectorised)
  • floating-point addition (vectorised)
• Throughput: one "+" and one "×" per cycle
Example: multiply and add
• Code: ++++++… (all independent)
• 1 instruction / cycle
• Hyper-threading does not help:
  • one thread keeps the "+" unit busy
  • the "×" unit has nothing to do
Example: multiply and add
• Code: ××××××… (all independent)
• 1 instruction / cycle
• Hyper-threading does not help:
  • one thread keeps the "×" unit busy
  • the "+" unit has nothing to do
Example: multiply and add
• Code: +×+×+×+×+×+×… (all independent)
• 2 instructions / cycle
• Hyper-threading does not help:
  • one thread is enough to keep both units busy
Example: multiply and add
• Code: ++++++…××××××… (all independent)
• 1 instruction / cycle
  • the CPU does not see far enough ahead
  • first the "+" unit is busy and the "×" unit is idle
  • then the "+" unit is idle and the "×" unit is busy
Example: multiply and add
• Code:
  • thread 1: ++++++… (all independent)
  • thread 2: ××××××… (all independent)
• 1 instruction / cycle without hyper-threading
• 2 instructions / cycle with hyper-threading
Example: multiply and add
• If everything is already perfectly interleaved in your code, hyper-threading does not help
  • getting no speedup may be a good sign!
• May make it easier to program
  • perfect instruction-level parallelism is not necessary for maximum performance
Example: multiply and add
• Maybe a good idea not to rely too much on hyper-threading?
  • typically only more expensive CPUs support hyper-threading
  • almost the same performance with cheaper CPUs and a careful implementation?
OpenMP: summary
OpenMP: summary
• Splits work across multiple threads, benefits from multiple CPU cores
• Each thread still needs to worry about e.g.:
  • instruction-level parallelism
  • vectorisation
  • getting data from memory to the CPU…