Speed Up Your OpenMP Code Without Doing Much · 2020. 8. 16.
Ruud van der Pas, Distinguished Engineer
Oracle Linux and Virtualization Engineering
Oracle, Santa Clara, CA, USA
Agenda
• Motivation
• Low Hanging Fruit
• A Case Study
Motivation
OpenMP and Performance
You can get good performance with OpenMP
And your code will scale
If you do things in the right way
Easy ≠ Stupid
Ease of Use?
The ease of use of OpenMP is a mixed blessing
(but I still prefer it over the alternative)
Ideas are easy and quick to implement
But some constructs are more expensive than others
Often, a small change can have a big impact
In this talk we show some of this low hanging fruit
Low Hanging Fruit
About Single Thread Performance and Scalability
You have to pay attention to single thread performance
Why? If your code performs badly on 1 core, what do you think will happen on 10 cores, 20 cores, … ?
Scalability can mask poor performance!
(a slow code tends to scale better …)
The Basics For All Users
Never tune your code without using a profiler
Do not share data unless you have to
(in other words, use private data as much as you can)
Do not parallelize what does not matter
One “parallel for” is fine. More is EVIL.
Think BIG
(maximize the size of the parallel regions)
The Wrong and Right Way Of Doing Things

The wrong way:

    #pragma omp parallel for
    { <code block 1> }
    #pragma omp parallel for
    { <code block n> }

Parallel region overhead repeated “n” times
No potential for the “nowait” clause

The right way:

    #pragma omp parallel
    {
       #pragma omp for
       { <code block 1> }
       #pragma omp for nowait
       { <code block n> }
    } // End of parallel region

Parallel region overhead only once
Potential for the “nowait” clause
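As an illustration of the single parallel region pattern, here is a minimal sketch in C. The function name `scale_and_fill` and its arrays are hypothetical and not taken from the talk. It assumes the two loops touch different arrays, which is what makes "nowait" safe on the first loop.

```c
#include <stdint.h>

/* Hypothetical example: scale one array and fill another.
   The loops work on different arrays, so a thread that finishes its
   share of the first loop may start on the second one right away. */
void scale_and_fill(double *a, double *b, int64_t n, double factor)
{
    #pragma omp parallel
    {
        /* First work-sharing loop; "nowait" removes its implied barrier. */
        #pragma omp for nowait
        for (int64_t i = 0; i < n; i++)
            a[i] *= factor;

        /* Second work-sharing loop, independent of the first. */
        #pragma omp for nowait
        for (int64_t i = 0; i < n; i++)
            b[i] = 1.0;
    } /* End of parallel region: the only barrier that is left */
}
```

The parallel region is started once, and the barrier at its end is the only synchronization point. If the second loop read values written by the first loop, the "nowait" on the first loop would have to be removed.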
More Basics
Every barrier matters
(and please use them carefully)
The same is true for locks and critical regions
(use atomic constructs where possible)
EVERYTHING Matters
(minor overheads get out of hand eventually)
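To make the advice about atomic constructs concrete, here is a small sketch, assuming a hypothetical histogram routine (`histogram`, `bin_of` and `counts` are illustrative names, not from the talk). Each update is a single increment of one memory location, which is the kind of operation an atomic construct can protect.

```c
#include <stdint.h>

/* Hypothetical example: count how often each bin index occurs.
   Assumes every bin_of[i] is a valid index into counts. */
void histogram(const int *bin_of, int64_t n, int64_t *counts)
{
    #pragma omp parallel for
    for (int64_t i = 0; i < n; i++) {
        /* A critical region would serialize all threads here, whatever
           bin they touch. The atomic construct protects only this one
           update of counts[bin_of[i]]. */
        #pragma omp atomic
        counts[bin_of[i]]++;
    }
}
```

Whether the atomic version is actually faster depends on the compiler, the hardware and how often threads hit the same bin, so a profile should confirm it.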
Another Example

    #pragma omp single
    {
       <some code>
    } // End of single region
    #pragma omp barrier
    <more code>

The second barrier (the explicit "#pragma omp barrier") is redundant because the single construct has an implied barrier already
(this second barrier will still take time though)
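A minimal sketch of the same situation without the redundant barrier, assuming a hypothetical routine `fill_with_offset` (the names are illustrative only). The code after the single construct relies on nothing but the implied barrier at its end.

```c
#include <stdint.h>

/* Hypothetical example: one thread computes a shared offset,
   then all threads use it in a work-sharing loop. */
void fill_with_offset(int64_t *a, int64_t n, int64_t base)
{
    int64_t offset = 0;   /* shared by all threads */

    #pragma omp parallel shared(offset)
    {
        #pragma omp single
        {
            offset = base + 1;   /* executed by one thread only */
        } /* Implied barrier: all threads wait here and see the new offset */

        /* No explicit "#pragma omp barrier" is needed at this point. */
        #pragma omp for
        for (int64_t k = 0; k < n; k++)
            a[k] = k + offset;
    } /* End of parallel region */
}
```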
One More Example

    a[npoint] = value;
    #pragma omp parallel …
    {
       #pragma omp for
       for (int64_t k=0; k<npoint; k++)
          a[k] = -1;
       #pragma omp for
       for (int64_t k=npoint+1; k<n; k++)
          a[k] = -1;
       <more code>
    } // End of parallel region
What Is Wrong With This?
✓ There are 2 barriers
✓ Two times serial and parallel overhead
✓ Performance benefit depends on the values of “npoint” and “n”
The Sequence Of Operations
• a[npoint] = value
• a[0 … npoint-1] = -1, followed by a barrier
• a[npoint+1 … n-1] = -1, followed by a barrier
The Actual Operation Performed
Index: 0 … npoint-1 | npoint | npoint+1 … n-1
Value: -1 … -1 | value | -1 … -1
The Modified Code

    #pragma omp parallel …
    {
       #pragma omp for
       for (int64_t k=0; k<n; k++)
          a[k] = -1;
       #pragma omp single nowait
       {a[npoint] = value;}
       <more code>
    } // End of parallel region

✓ Only one barrier
✓ One time serial and parallel overhead
✓ Performance benefit depends on the value of “n” only
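The modified code can be tried out with a small self-contained sketch. The wrapper function `init_array` and its signature are assumptions made for illustration; only the loop and the single construct follow the slide. It assumes 0 <= npoint < n.

```c
#include <stdint.h>

/* Hypothetical wrapper around the modified code from the slide.
   Assumes 0 <= npoint < n. */
void init_array(int64_t *a, int64_t n, int64_t npoint, int64_t value)
{
    #pragma omp parallel
    {
        /* One loop over the full range instead of two partial loops. */
        #pragma omp for
        for (int64_t k = 0; k < n; k++)
            a[k] = -1;
        /* The implied barrier of the loop guarantees that a[npoint]
           has been set to -1 before it is overwritten below. */

        #pragma omp single nowait
        { a[npoint] = value; }
    } /* End of parallel region */
}
```

The barrier after the loop is what keeps the result correct: without it, the thread executing the single construct could write `value` before another thread stores -1 in the same element.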
A Case Study
The Application
The OpenMP reference version of the Graph 500 benchmark
Structure of the code:
• Construct an undirected graph of the specified size
• Randomly select a key and conduct a BFS search
• Verify the result is a tree
Repeat <n> times
For the benchmark score, only the search time matters
Testing Circumstances
The code uses a parameter SCALE to set the size of the graph
The value used for SCALE is 24 (~9 GB of RAM used)
All experiments were conducted in the Oracle Cloud (“OCI”)
Used a VM instance with 1 Skylake socket (8 cores, 16 threads)
Operating System: Oracle Linux; compiler: gcc 8.3.1
The Oracle Performance Analyzer was used to make the profiles
The Dynamic Behaviour
[Figure: time line showing the graph generation phase, followed by the search and verification phases for 64 keys]
Note: The Oracle Performance Analyzer was used to generate this time line
The Performance For SCALE = 24 (9 GB)
[Chart: Search Time and Parallel Speed Up. X-axis: Number of OpenMP Threads (1 to 16). Y-axes: Search Time (seconds) and Parallel Speed Up.]
✓ Search time reduces as threads are added
✓ Benefit from all 16 (hyper) threads
✓ But the 3.2x parallel speed up is disappointing
Action Plan
Used a profiling tool* to identify the time consuming parts
Found several opportunities to improve the OpenMP part
These are actually shown earlier in this talk
Although simple changes, the improvement is substantial:
*) Use your preferred tool; we used the Oracle Performance Analyzer
Performance Of The Original and Modified Code
[Chart: Search Time and Parallel Speed Up for the Original and Modified code. X-axis: Number of OpenMP Threads (1 to 16). Y-axes: Search Time (seconds) and Parallel Speed Up.]
✓ A noticeable reduction in the search time at 4 threads and beyond
✓ The parallel speed up increases to 6.5x
✓ The search time is reduced by 2x
Are We Done Yet?
The 2x reduction in the search time is encouraging
The efforts to achieve this have been limited
The question is whether there is more to be gained
Let’s look at the dynamic behaviour of the threads:
A Comparison Between 1 And 2 Threads
[Figure: time lines for 1 and 2 threads; the search and verification phases are both marked as faster with 2 threads]
Both phases benefit from using 2 threads
Note: The Oracle Performance Analyzer was used to generate this time line
Zoom In On The Second Thread
[Figure: time line of the search and verification phases, zoomed in on the two threads]
The master thread is fine
The second thread has gaps
Note: The Oracle Performance Analyzer was used to generate this time line
How About 4 Threads?
Adding threads makes things worse
Note: The Oracle Performance Analyzer was used to generate this time line
The Problem

    #pragma omp for
    for (int64_t k = k1; k < oldk2; ++k) {
       <code>
       for (int64_t vo = XOFF(v); vo < veo; ++vo) {
          <more code>
          if (bfs_tree[j] == -1) {
             <even more code and irregular >
          } // End of if
       } // End of for-loop on vo
    } // End of parallel for-loop on k

Regular, structured parallel loop
Irregular loop
Unbalanced flow
Load Imbalance!
The Solution

    #pragma omp for schedule(runtime)
    for (int64_t k = k1; k < oldk2; ++k) {
       <code>
       for (int64_t vo = XOFF(v); vo < veo; ++vo) {
          <more code>
          if (bfs_tree[j] == -1) {
             <even more code and irregular >
          } // End of if
       } // End of for-loop on vo
    } // End of parallel for-loop on k

    $ export OMP_SCHEDULE="dynamic,25"
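The effect of `schedule(runtime)` can be explored with a small stand-in for an irregular loop. The sketch below is not the Graph 500 code; `triangular_work` is a hypothetical function whose inner trip count grows with k, so equally sized static chunks of the outer loop carry very different amounts of work.

```c
#include <stdint.h>

/* Hypothetical irregular workload: iteration k of the outer loop
   performs k+1 inner iterations, so the cost per iteration varies. */
int64_t triangular_work(int64_t n)
{
    int64_t total = 0;

    /* The schedule is taken from the OMP_SCHEDULE environment variable
       at run time, so it can be changed without recompiling. */
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (int64_t k = 0; k < n; k++) {
        for (int64_t j = 0; j <= k; j++)
            total += 1;
    }
    return total;   /* equals n*(n+1)/2 for any schedule */
}
```

With gcc, such code is built with the `-fopenmp` option. Running it once with `OMP_SCHEDULE="static"` and once with `OMP_SCHEDULE="dynamic,25"` allows the two schedules to be compared; the chunk size of 25 is the value from the slide, and the best value for another loop is something to measure rather than assume.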
The Performance With Dynamic Scheduling Added
[Chart: Search Time and Parallel Speed Up for the Original, Modified and dynamic versions, with a Linear speed up reference line. X-axis: Number of OpenMP Threads (1 to 16). Y-axes: Search Time (seconds) and Parallel Speed Up.]
✓ A 1% slow down on a single thread
✓ Near linear scaling for up to 8 threads
✓ The parallel speed up increased to 9.5x
✓ The search time is reduced by 3x
The Improvement
[Figure: time lines comparing the original static scheduling with dynamic scheduling]
Note: The Oracle Performance Analyzer was used to generate this time line
Conclusions
With some simple improvements in the use of OpenMP, a 2x reduction in the search time is realized
Compared to the original code, adding dynamic scheduling reduces the search time by 3x
A good profiling tool is indispensable when tuning applications
Thank You And … Stay Tuned!
Ruud van der Pas, Distinguished Engineer
Oracle Linux and Virtualization Engineering
Oracle, Santa Clara, CA, USA