
Page 1

Speed Up Your OpenMP Code Without Doing Much

Ruud van der Pas, Distinguished Engineer
Oracle Linux and Virtualization Engineering, Oracle, Santa Clara, CA, USA

Page 2

Agenda

• Motivation

• Low Hanging Fruit

• A Case Study

Page 3

Motivation

Page 4

OpenMP and Performance

You can get good performance with OpenMP

And your code will scale

If you do things in the right way

Easy =/= Stupid

Page 5

Ease of Use?

The ease of use of OpenMP is a mixed blessing
(but I still prefer it over the alternative)

Ideas are easy and quick to implement
But some constructs are more expensive than others

Often, a small change can have a big impact

In this talk we show some of this low hanging fruit

Page 6

Low Hanging Fruit

Page 7

About Single Thread Performance and Scalability

You have to pay attention to single thread performance

Why? If your code performs badly on 1 core, what do you think will happen on 10 cores, 20 cores, …?

Scalability can mask poor performance!
(a slow code tends to scale better …)

Page 8

The Basics For All Users

Never tune your code without using a profiler

Do not share data unless you have to
(in other words, use private data as much as you can; see the sketch below)

Do not parallelize what does not matter

One “parallel for” is fine. More is EVIL.

Think BIG
(maximize the size of the parallel regions)
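As an aside, here is a minimal sketch of keeping data private and making the sharing explicit; the variable names and the default(none) clause are my own and not from the slides:

    #include <math.h>
    #include <stdio.h>

    #define N 1000000

    static double x[N];   /* shared result array */

    int main(void)
    {
        double scale = 2.0;

        /* default(none) forces every referenced variable to be scoped
           explicitly, so accidental sharing shows up at compile time. */
        #pragma omp parallel for default(none) shared(x, scale)
        for (int i = 0; i < N; i++) {
            double t = sin((double)i) * scale;  /* t is private: declared inside the loop */
            x[i] = t;
        }

        printf("x[1] = %f\n", x[1]);
        return 0;
    }

Compile with something like gcc -fopenmp (plus -lm here); the loop variable is private by the OpenMP rules, and t is private simply because it is declared inside the loop body.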

Page 9

The Wrong and Right Way Of Doing Things

The wrong way:

    #pragma omp parallel for
    { <code block 1> }

    #pragma omp parallel for
    { <code block n> }

Parallel region overhead repeated “n” times
No potential for the “nowait” clause

The right way:

    #pragma omp parallel
    {
       #pragma omp for
       { <code block 1> }

       #pragma omp for nowait
       { <code block n> }
    } // End of parallel region

Parallel region overhead only once
Potential for the “nowait” clause
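To make the right-hand pattern concrete, here is a minimal, compilable sketch; the arrays and their initialization are invented for illustration and are not part of the original code:

    #include <stdio.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void)
    {
        #pragma omp parallel
        {
            /* First worksharing loop: nowait is safe here because the
               second loop works on a different array, so no thread needs
               to wait for this loop to finish everywhere. */
            #pragma omp for nowait
            for (int i = 0; i < N; i++)
                a[i] = 1.0;

            /* Second worksharing loop: its implied barrier coincides
               with the end of the parallel region. */
            #pragma omp for
            for (int i = 0; i < N; i++)
                b[i] = 2.0;
        } /* End of parallel region: the region overhead is paid only once */

        printf("a[0] = %.1f  b[0] = %.1f\n", a[0], b[0]);
        return 0;
    }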

Page 10

More Basics

Every barrier matters
(and please use them carefully)

The same is true for locks and critical regions
(use atomic constructs where possible; a small sketch follows below)

EVERYTHING Matters
(minor overheads get out of hand eventually)
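For illustration, a shared counter updated with an atomic construct instead of a critical region; this counter example is mine, not from the deck:

    #include <stdio.h>

    int main(void)
    {
        const long n = 1000000;
        long hits = 0;   /* shared counter */

        #pragma omp parallel for
        for (long i = 0; i < n; i++) {
            /* A critical region would also be correct here, but it is a
               general mutual-exclusion construct; atomic restricts itself
               to this single update and can map to a hardware instruction. */
            #pragma omp atomic
            hits++;
        }

        printf("hits = %ld (expected %ld)\n", hits, n);
        return 0;
    }

For a pattern this simple a reduction(+:hits) clause would be cheaper still; the atomic form is shown only to contrast it with a critical region.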

Page 11

Another Example

    #pragma omp single
    {
        <some code>
    } // End of single region

    #pragma omp barrier

    <more code>

The explicit barrier is redundant, because the single construct already has an implied barrier
(this second barrier will still take time though)
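A small compilable sketch of the cleaned-up pattern; the shared variable and the print statement are placeholders of my own:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double shared_value = 0.0;

        #pragma omp parallel shared(shared_value)
        {
            /* One thread initializes the shared data. The implied barrier
               at the end of the single construct already synchronizes all
               threads, so no explicit "#pragma omp barrier" is needed. */
            #pragma omp single
            {
                shared_value = 42.0;
            }

            /* Every thread can safely read shared_value from here on. */
            printf("Thread %d sees %.1f\n", omp_get_thread_num(), shared_value);
        } /* End of parallel region */

        return 0;
    }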

Page 12

One More Example

    a[npoint] = value;
    #pragma omp parallel …
    {
       #pragma omp for
       for (int64_t k=0; k<npoint; k++)
          a[k] = -1;

       #pragma omp for
       for (int64_t k=npoint+1; k<n; k++)
          a[k] = -1;
    } // End of parallel region

    <more code>

Page 13

What Is Wrong With This?

(the same code as on the previous page)

✓ There are 2 barriers

✓ Two times the serial and parallel overhead

✓ The performance benefit depends on the values of “npoint” and “n”

Page 14

The Sequence Of Operations

[Diagram: element npoint holds “value”; the first worksharing loop sets elements 0 … npoint-1 to -1 and ends in a barrier; the second loop sets elements npoint+1 … n-1 to -1 and ends in another barrier.]

Page 15

The Actual Operation Performed

[Diagram: after the code has run, elements 0 … npoint-1 and npoint+1 … n-1 all contain -1, and element npoint contains “value”.]

Page 16

The Modified Code

    #pragma omp parallel …
    {
       #pragma omp for
       for (int64_t k=0; k<n; k++)
          a[k] = -1;

       #pragma omp single nowait
       { a[npoint] = value; }
    } // End of parallel region

    <more code>

✓ Only one barrier

✓ One time the serial and parallel overhead

✓ The performance benefit depends on the value of “n” only
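As a self-contained sketch of the modified code; the array size, npoint, and value below are made-up placeholders:

    #include <stdio.h>
    #include <stdint.h>

    #define N      1000000
    #define NPOINT 12345

    static int64_t a[N];

    int main(void)
    {
        const int64_t value = 7;

        #pragma omp parallel
        {
            /* One worksharing loop over the whole array: one implied barrier. */
            #pragma omp for
            for (int64_t k = 0; k < N; k++)
                a[k] = -1;

            /* One thread patches the single element. The nowait clause is
               safe because the barrier of the loop above has already
               completed, and nothing later in the region reads a[NPOINT]. */
            #pragma omp single nowait
            { a[NPOINT] = value; }
        } /* End of parallel region */

        printf("a[%d] = %lld\n", NPOINT, (long long)a[NPOINT]);
        return 0;
    }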

Page 17

A Case Study

Page 18

The Application

The OpenMP reference version of the Graph 500 benchmark

Structure of the code:

• Construct an undirected graph of the specified size
• Randomly select a key and conduct a BFS search
• Verify the result is a tree

The search and verification steps are repeated <n> times

For the benchmark score, only the search time matters

Page 19

Testing Circumstances

The code uses a parameter SCALE to set the size of the graph

The value used for SCALE is 24 (~9 GB of RAM used)

All experiments were conducted in the Oracle Cloud (“OCI”)

Used a VM instance with 1 Skylake socket (8 cores, 16 threads)

Operating System: Oracle Linux; compiler: gcc 8.3.1

The Oracle Performance Analyzer was used to make the profiles

Page 20

The Dynamic Behaviour

[Timeline: the graph generation phase, followed by the search and verification for 64 keys; within each iteration the search and the verification phases are visible.]

Note: The Oracle Performance Analyzer was used to generate this time line

Page 21

The Performance For SCALE = 24 (9 GB)

[Chart: search time in seconds and parallel speed up as a function of the number of OpenMP threads (1 to 16).]

✓ The search time reduces as threads are added

✓ Benefit from all 16 (hyper) threads

✓ But the 3.2x parallel speed up is disappointing

Page 22

Action Plan

Used a profiling tool* to identify the time consuming parts

Found several opportunities to improve the OpenMP part

These are the patterns shown earlier in this talk

Although the changes are simple, the improvement is substantial:

*) Use your preferred tool; we used the Oracle Performance Analyzer

Page 23

Performance Of The Original and Modified Code

[Chart: search time in seconds and parallel speed up as a function of the number of OpenMP threads (1 to 16), for the original and the modified code.]

✓ A noticeable reduction in the search time at 4 threads and beyond

✓ The parallel speed up increases to 6.5x

✓ The search time is reduced by 2x

Page 24

Are We Done Yet?

The 2x reduction in the search time is encouraging

The efforts to achieve this have been limited

The question is whether there is more to be gained

Let’s look at the dynamic behaviour of the threads:

Page 25

A Comparison Between 1 And 2 Threads

[Timelines comparing the run on 1 thread with the run on 2 threads: both the search and the verification phases are faster with 2 threads.]

Both phases benefit from using 2 threads

Note: The Oracle Performance Analyzer was used to generate this time line

Page 26

Zoom In On The Second Thread

[Timeline: the master thread is busy throughout the search and verification phases, but the second thread shows idle gaps.]

Note: The Oracle Performance Analyzer was used to generate this time line

Page 27

How About 4 Threads?

Adding threads makes things worse

Note: The Oracle Performance Analyzer was used to generate this time line

Page 28

The Problem

    #pragma omp for
    for (int64_t k = k1; k < oldk2; ++k) {            // Regular, structured parallel loop
       <code>
       for (int64_t vo = XOFF(v); vo < veo; ++vo) {   // Irregular loop
          <more code>
          if (bfs_tree[j] == -1) {                    // Unbalanced flow
             <even more code, and irregular>
          } // End of if
       } // End of for-loop on vo
    } // End of parallel for-loop on k

Load Imbalance!

Page 29

The Solution

    #pragma omp for schedule(runtime)
    for (int64_t k = k1; k < oldk2; ++k) {
       <code>
       for (int64_t vo = XOFF(v); vo < veo; ++vo) {
          <more code>
          if (bfs_tree[j] == -1) {
             <even more code, and irregular>
          } // End of if
       } // End of for-loop on vo
    } // End of parallel for-loop on k

    $ export OMP_SCHEDULE="dynamic,25"
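As a stand-alone illustration of the same idea, the sketch below fakes the irregular work per iteration (it is not the benchmark code); schedule(runtime) lets the schedule be picked with OMP_SCHEDULE without recompiling:

    #include <stdio.h>

    #define N 100000

    static double work[N];

    int main(void)
    {
        /* The actual schedule comes from the OMP_SCHEDULE environment
           variable, e.g.  export OMP_SCHEDULE="dynamic,25"  before running. */
        #pragma omp parallel for schedule(runtime)
        for (int k = 0; k < N; k++) {
            /* Deliberately uneven amount of work per iteration, to mimic
               the unbalanced inner loop of the BFS code. */
            int reps = (k % 64 == 0) ? 10000 : 10;
            double s = 0.0;
            for (int r = 0; r < reps; r++)
                s += (double)r / (double)(k + 1);
            work[k] = s;
        }

        printf("work[0] = %f\n", work[0]);
        return 0;
    }

With static scheduling the threads that drew the expensive iterations finish last; "dynamic,25" hands out chunks of 25 iterations as threads become free, which evens out the load.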

Page 30

The Performance With Dynamic Scheduling Added

[Chart: search time in seconds and parallel speed up as a function of the number of OpenMP threads (1 to 16), for the original code, the modified code, and the modified code with dynamic scheduling; a linear speed up reference line is included.]

✓ A 1% slow down on a single thread

✓ Near linear scaling for up to 8 threads

✓ The parallel speed up increased to 9.5x

✓ The search time is reduced by 3x

Page 31

The Improvement

[Timelines comparing the original static scheduling with dynamic scheduling.]

Note: The Oracle Performance Analyzer was used to generate this time line

Page 32

Conclusions

With some simple improvements in the use of OpenMP, a 2x reduction in the search time is realized

Compared to the original code, adding dynamic scheduling reduces the search time by 3x

A good profiling tool is indispensable when tuning applications

Page 33

Thank You And … Stay Tuned!

Ruud van der Pas, Distinguished Engineer
Oracle Linux and Virtualization Engineering, Oracle, Santa Clara, CA, USA