A New Approach for Performance Analysis of OpenMP Programs
Xu Liu, John Mellor-Crummey, Mike Fagan
Rice University
http://hpctoolkit.org
Motivation
• Multicore is used everywhere
  – cell phones
  – laptops
  – supercomputers
• Threads per node grow rapidly
  – IBM Blue Gene/Q: 64 threads
  – IBM Power 7: 128 threads
  – Intel MIC: ~200 threads
• Taking advantage of massive threads is important
Programming model for multi-threading
• Low level
  – pthread
• High level
  – language based: Cilk
  – library based: Intel Threading Building Blocks (TBB)
  – compiler based: OpenMP (standardized, portable)
• OpenMP is the most popular programming model for multi-threading
OpenMP’s fork-join parallelism
[Figure: three successive parallel regions, each starting with a fork and ending with a join]
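A minimal sketch of one fork-join parallel region in C. The function name `parallel_sum` is made up for illustration and is not from the talk. The team of threads is forked where the loop starts and joined where it ends; if the compiler is run without OpenMP enabled, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a[0..n-1] inside a single parallel region.
 * Fork: threads are created (or woken) at the pragma.
 * Join: all threads meet at the implicit barrier after the loop. */
long parallel_sum(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < (long)n; i++) {
        sum += a[i];
    }
    return sum;
}
```

Calling `parallel_sum` three times in a row would give the three fork-join phases drawn in the figure, with serial code in between.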
Challenges in programming with OpenMP
• Performance problems in OpenMP programs
  – insufficient parallelism
  – serialization
  – load imbalance
• Performance tool is needed
[Figure: fork-join timeline; labels: Fork, Join, sync]
Previous work
• Instrumentation-based tools: TAU, Scalasca
  – high measurement overhead
• Sampling-based tools: Intel VTune
  – no specific OpenMP support
• Hybrid tools: Sun Studio/Oracle Analyzer API for OpenMP
  – support low-overhead, sampling-based measurement
  – use large disk space for measurement data
  – insufficient support for statically-linked applications
  – insufficient mechanisms to blame root causes of performance losses
    • attributes waiting to “symptoms” rather than “causes”
• Goal: build a tool that overcomes the shortcomings of existing tools
Our approach
• A hybrid performance tool for OpenMP programs
  – quantify performance losses
  – pinpoint problematic code for optimization
  – low measurement overhead
  – support both dynamically and statically linked binaries
• With the feedback from our tool, we were able to get 1.1x-5.7x speedup for four well-known benchmarks
Overview of our tool
• OpenMP tools API
  – minimal set of support for tools
  – low overhead
• HPCToolkit
  – a leading sampling-based performance tool for parallel programs
  – call path profiler + static binary analyzer + GUI
  – built on top of the OpenMP tools API
OpenMP tools API
• Maintain states
• Provide callbacks
• Low overhead
• Trying to get it adopted into the OpenMP standard
[Figure: fork-join timeline annotated with the thread states working, idle, and overhead, and with entrance, exit, and state callbacks]
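The split between "maintain states" and "provide callbacks" can be sketched roughly as follows. Every name here (`runtime_t`, `runtime_enter_region`, `tool_take_sample`, and so on) is hypothetical and invented for illustration; this is not the interface proposed in the talk. The sketch assumes the runtime records one state per thread and invokes optional callbacks at region entrance and exit, while a sampling tool only reads the current state when a sample fires.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Hypothetical state set, named after the labels on the slide. */
typedef enum { STATE_WORKING, STATE_IDLE, STATE_OVERHEAD, STATE_COUNT } thread_state_t;

/* Hypothetical callback type for parallel-region entrance and exit. */
typedef void (*region_callback_t)(int thread_id, int region_id);

typedef struct {
    thread_state_t state[MAX_THREADS]; /* maintained by the runtime */
    region_callback_t on_enter;        /* optional, registered by the tool */
    region_callback_t on_exit;         /* optional, registered by the tool */
} runtime_t;

void runtime_init(runtime_t *rt)
{
    for (int t = 0; t < MAX_THREADS; t++) rt->state[t] = STATE_IDLE;
    rt->on_enter = NULL;
    rt->on_exit = NULL;
}

/* Runtime side: update the state, then notify the tool if it asked to be told. */
void runtime_enter_region(runtime_t *rt, int thread_id, int region_id)
{
    assert(thread_id >= 0 && thread_id < MAX_THREADS);
    rt->state[thread_id] = STATE_WORKING;
    if (rt->on_enter) rt->on_enter(thread_id, region_id);
}

void runtime_exit_region(runtime_t *rt, int thread_id, int region_id)
{
    assert(thread_id >= 0 && thread_id < MAX_THREADS);
    if (rt->on_exit) rt->on_exit(thread_id, region_id);
    rt->state[thread_id] = STATE_IDLE;
}

/* Tool side: a sample handler only reads the thread's current state. */
void tool_take_sample(const runtime_t *rt, int thread_id, long histogram[STATE_COUNT])
{
    assert(thread_id >= 0 && thread_id < MAX_THREADS);
    histogram[rt->state[thread_id]]++;
}

/* Example tool callback: counts how many regions were entered. */
int regions_entered = 0;
void count_region_entry(int thread_id, int region_id)
{
    (void)thread_id;
    (void)region_id;
    regions_entered++;
}
```

Because the tool reads state only when a sample arrives, the cost on the runtime's fast path is one store per state change plus an optional indirect call, which is one way the low overhead claimed on the slide could be achieved.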
Enhancement of HPCToolkit
• Measure and attribute costs to full user-program calling contexts
  – unified view of calling contexts across all threads
• Shift blame for idleness from symptoms to causes
• Support both OpenMP profiling and tracing
Problem: separate views for different threads
• Worker threads don’t know the full user-level context for work
[Figure: example calling context main→fn.0→fn.1→fn.2]
Call stack snapshot in OpenMP threads
[Figure: call stacks of the OpenMP threads; regions in gray have distributed calling contexts; one annotation reads “too tiny”]
• Online deferred context resolution
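The slides do not spell out how deferred context resolution works, so the following is only one plausible reading. It assumes a worker's sample carries just the frames inside the parallel region plus a region identifier, and that the full path is stitched together once the forking thread's calling context for that region has been published. All names (`worker_sample_t`, `publish_region_context`, `resolve_sample`) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_REGIONS 16

/* Calling context of the thread that forked each region; NULL until known. */
static const char *region_prefix[MAX_REGIONS];

/* A worker's sample: only the frames inside the region are on its stack. */
typedef struct {
    int region_id;
    const char *suffix;
} worker_sample_t;

/* Called once the forking thread's context for a region is available. */
void publish_region_context(int region_id, const char *prefix)
{
    assert(region_id >= 0 && region_id < MAX_REGIONS);
    region_prefix[region_id] = prefix;
}

/* Returns 1 and writes the full path when the prefix is known and fits;
 * returns 0 when resolution must stay deferred. */
int resolve_sample(const worker_sample_t *s, char *out, size_t out_size)
{
    assert(s->region_id >= 0 && s->region_id < MAX_REGIONS);
    const char *prefix = region_prefix[s->region_id];
    if (prefix == NULL) return 0;
    if (strlen(prefix) + 2 + strlen(s->suffix) + 1 > out_size) return 0;
    snprintf(out, out_size, "%s->%s", prefix, s->suffix);
    return 1;
}
```

Deferring the join keeps the worker's sample handler cheap: it records a small suffix and an identifier instead of walking or copying another thread's stack.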
Results of deferred context construction
• main._omp_fn.*: outlined functions, corresponding to <parallel region>
Enhancement of HPCToolkit
• Measure and attribute costs to full user-program calling contexts
• Shift blame for idleness from symptoms to causes
  – undirected blame: load imbalance, serialization
  – directed blame: lock and critical section contention
• Support both OpenMP profiling and tracing
Problem: meaningless hotspots
• The hotspot is do_wait, but the profile doesn’t show why threads are waiting
Undirected blame
• do_wait is the symptom
• Causes are working threads
  – e.g. last thread has more work than other threads
• Blame do_wait on the working threads
  – example: with 2 idle threads and 3 working threads, each working thread takes 2/3 idleness
[Figure: fork-join timeline in which the idle threads sit in do_wait until the join]
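The 2/3 figure can be reproduced with a few lines of C. This is a sketch of the arithmetic only, assuming blame is assigned per sample: each time a sample lands in a working thread, that thread charges itself the number of idle threads divided by the number of working threads. The names `idleness_share`, `attribute_sample`, and `region_metrics_t` are illustrative.

```c
#include <assert.h>

/* Idleness charged to one working thread for one sample. */
double idleness_share(int idle_threads, int working_threads)
{
    assert(idle_threads >= 0 && working_threads > 0);
    return (double)idle_threads / (double)working_threads;
}

/* Metrics accumulated for the code a working thread was executing. */
typedef struct {
    double work;
    double idleness;
} region_metrics_t;

/* Called when a sample lands in a working thread. */
void attribute_sample(region_metrics_t *m, int idle_threads, int working_threads)
{
    m->work += 1.0;
    m->idleness += idleness_share(idle_threads, working_threads);
}
```

With 2 idle and 3 working threads, three samples (one per working thread) add up to 2 units of idleness, which matches the 2 threads that were waiting, so no idleness is lost or double counted.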
Code-centric view: hypre_BoomerAMGRelax
• 20% idleness and 80% work in this parallel region
Serial code in AMG2006 (8 PE, 8 threads)
• 7/8 threads are idle: sequential code
LULESH running on 48 cores
• Calls to free in both the LULESH benchmark and libgomp cause high idleness
• 4x speedup after changing to Google’s tcmalloc
Directed blame
• Thread waiting at lock is the symptom
• Cause is the lock holder
• Blame lock waiting on the lock holder
  – accumulate samples indexed by the lock address
  – lock holder takes these samples at lock release
[Figure: fork-join timeline showing one thread in lockwait while another thread acquires and then releases the lock]
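The two steps above (accumulate waiting samples by lock address, then hand them to the holder at release) can be sketched with a small table. This is a single-threaded illustration under stated assumptions: the fixed-size open-addressing table and the names `blame_on_wait_sample` and `blame_on_release` are hypothetical, and a real tool would need atomic updates because waiters and the holder run concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 256 /* hypothetical fixed capacity */

typedef struct {
    const void *lock;     /* key: address of the lock */
    long waiting_samples; /* samples taken while threads waited on it */
} blame_slot_t;

static blame_slot_t blame_table[TABLE_SIZE];

/* Finds (or claims) the slot for a lock address using linear probing. */
static blame_slot_t *find_slot(const void *lock)
{
    size_t h = (size_t)(((uintptr_t)lock >> 4) % TABLE_SIZE);
    for (size_t probe = 0; probe < TABLE_SIZE; probe++) {
        blame_slot_t *s = &blame_table[(h + probe) % TABLE_SIZE];
        if (s->lock == lock || s->lock == NULL) {
            s->lock = lock;
            return s;
        }
    }
    return NULL; /* table full: the sample is dropped */
}

/* Called when a sample fires in a thread that is waiting for `lock`. */
void blame_on_wait_sample(const void *lock)
{
    assert(lock != NULL);
    blame_slot_t *s = find_slot(lock);
    if (s != NULL) s->waiting_samples++;
}

/* Called by the holder at release: returns the accumulated waiting samples,
 * which the holder attributes to its own calling context, and resets them. */
long blame_on_release(const void *lock)
{
    assert(lock != NULL);
    blame_slot_t *s = find_slot(lock);
    if (s == NULL) return 0;
    long taken = s->waiting_samples;
    s->waiting_samples = 0;
    return taken;
}
```

Indexing by lock address means a waiter never needs to know which thread holds the lock; the holder simply collects whatever accumulated while it held it.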
Lock blame shifting for NAS UA
• Lots of locks
• 8.4% of execution time waiting for locks
• 34% of lock waiting due to locks acquired at highlighted call site
Reducing lock contention for NAS UA
• Use omp_test_lock
• Defer the lock acquisition to the next iteration
• Eliminate most lock contention time
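A generic sketch of the try-lock-and-defer pattern, not the NAS UA code itself. It assumes a histogram-style update where each bin has its own lock: if `omp_test_lock` finds the lock busy, the thread records the update locally and applies it in a later iteration that gets the same lock, with a blocking flush at the end. The function name `histogram_deferred` is made up, and the `#else` branch contains serial stand-ins (an assumption of this sketch) so the code also builds when OpenMP is not enabled.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins so the sketch also builds without OpenMP enabled. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static int  omp_test_lock(omp_lock_t *l)    { if (*l) return 0; *l = 1; return 1; }
#endif

/* Adds one to bins[idx[i]] for every i in 0..n-1, one lock per bin. */
void histogram_deferred(const int *idx, int n, long *bins, int nbins)
{
    omp_lock_t *locks = malloc((size_t)nbins * sizeof *locks);
    assert(locks != NULL);
    for (int b = 0; b < nbins; b++) omp_init_lock(&locks[b]);

    #pragma omp parallel
    {
        /* Per-thread updates whose lock was busy when first attempted. */
        long *deferred = calloc((size_t)nbins, sizeof *deferred);
        assert(deferred != NULL);

        #pragma omp for
        for (int i = 0; i < n; i++) {
            int b = idx[i];
            if (omp_test_lock(&locks[b])) {
                /* Lock was free: apply this update plus any deferred ones. */
                bins[b] += 1 + deferred[b];
                deferred[b] = 0;
                omp_unset_lock(&locks[b]);
            } else {
                /* Lock is busy: keep working and retry in a later iteration. */
                deferred[b]++;
            }
        }

        /* Flush what is still deferred; blocking here happens once per bin. */
        for (int b = 0; b < nbins; b++) {
            if (deferred[b] > 0) {
                omp_set_lock(&locks[b]);
                bins[b] += deferred[b];
                omp_unset_lock(&locks[b]);
            }
        }
        free(deferred);
    }

    for (int b = 0; b < nbins; b++) omp_destroy_lock(&locks[b]);
    free(locks);
}
```

The design choice is to trade a little per-thread memory for never spinning inside the loop; whether that pays off depends on how long the lock is typically held.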
Enhancement of HPCToolkit
• Measure and attribute costs to full user-program calling contexts
• Shift blame for idleness from symptoms to causes
• Support both OpenMP profiling and tracing
Tracing: AMG 2006 (solver phase)
[Figure: trace view; horizontal axis: time line; vertical axis: process/thread rank]
Tool performance
• Measurement overhead
  – OpenMP tools API
  – HPCToolkit scalability
• Insightful optimization feedback
Benchmarks
• Four case studies
  – AMG2006
    • one of the Sequoia benchmarks from LLNL
    • running on 48 threads
  – LULESH
    • an application benchmark from LLNL
    • uses work-sharing parallel regions without nesting and tasking
    • 48 threads
  – BT-MZ.B
    • BT in multi-zone NPB with workload B
    • uses nested parallel regions without tasking
    • 32 threads: 4 for outer region and 8 for inner region
  – HEALTH
    • a benchmark in the Barcelona OpenMP Tasks Suite (BOTS)
    • uses tasking: more than 17 million tasks
    • 8 threads, using medium input
Profiling overhead
• OpenMP tools API implemented in GNU OpenMP (GOMP)
• HPCToolkit takes 200 samples/s/thread

| Benchmark | GOMP (run time) | GOMP + tools API (overhead) | GOMP + tools API + profiling (overhead) |
|---|---|---|---|
| AMG2006 | 54.02s | +0.15% | +5.0% |
| LULESH | 402.34s | +0.05% | +3.6% |
| NAS BT-MZ | 32.10s | +0.16% | +6.6% |
| HEALTH | 71.74s | +0.64% | +3.5% |
Performance gains
| Benchmark | Problem | Optimization | Improvement |
|---|---|---|---|
| AMG2006 | high idleness in parallel regions | use dynamic scheduling for load balance | 1.12x |
| LULESH | serialization in free | switch to tcmalloc | 3.93x |
| NAS BT-MZ | parallelize insignificant work | eliminate unnecessary parallelism | 1.09x |
| HEALTH | high lock contention for task queue | coarsen task granularity | 5.65x |
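Two of the optimizations in the table map onto standard OpenMP clauses, sketched below on toy loops rather than on the benchmarks themselves. `schedule(dynamic, chunk)` hands out iterations on demand, which helps when iterations have uneven cost; the `if` clause runs a region serially when the work is too small to be worth forking. The function names and the `min_parallel_n` threshold are illustrative assumptions.

```c
#include <assert.h>

/* Iteration i does i + 1 units of work, so a static split would leave the
 * thread holding the last block of iterations with the most work. Dynamic
 * scheduling gives out chunks of `chunk` iterations as threads become free. */
long triangular_work(int n, int chunk)
{
    assert(n >= 0 && chunk > 0);
    long total = 0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+ : total)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            total += 1;
        }
    }
    return total;
}

/* Scales x[0..n-1] in place. The loop is parallelized only when n reaches
 * min_parallel_n; below that, the if clause makes the region run serially. */
void scale(double *x, int n, double factor, int min_parallel_n)
{
    assert(n >= 0);
    #pragma omp parallel for if (n >= min_parallel_n)
    for (int i = 0; i < n; i++) {
        x[i] *= factor;
    }
}
```

Both clauses leave the results unchanged; they only affect how iterations are distributed and whether a team of threads is used.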
Summary
• OpenMP tools API: we are involved in designing it
  – incurs very little runtime overhead
  – supports efficient sampling-based measurement tools
  – supports root cause analysis of performance bottlenecks
  – we are trying to get it adopted into the OpenMP standard
• HPCToolkit: we developed it
  – provides low-overhead profiling and tracing
  – supports all OpenMP features
  – delivers insights into performance of OpenMP codes
• The enhanced HPCToolkit now works on IBM Blue Gene/Q