![Page 1: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/1.jpg)
1
Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool
Presented by Chee Wai Lee
Authors: L. V. Kale, Gengbin Zheng, Chee Wai Lee, Sameer Kumar
![Page 2: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/2.jpg)
2
Motivation
Performance optimization is increasingly challenging
– Modern applications are complex and dynamic
– Some may involve a small amount of computation per step
– Performance issues and obstacles change:
Need very good performance analysis tools
– Feedback at the level of applications
– Analysis capabilities
– Scalable views
– Automatic instrumentation
![Page 3: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/3.jpg)
3
Projections
Outline:
– Projections: trace generation
– Projections: views
– Case Study: NAMD, a molecular dynamics program that won a Gordon Bell award at SC’02 by scaling MD for biomolecules to 3,000 processors
– Case Study: CPAIMD, a Car-Parrinello ab initio MD application
– Performance analysis on next-generation supercomputers: challenges
![Page 4: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/4.jpg)
4
Trace Generation
Automatic instrumentation by runtime system
Detailed
– In the log mode each event is recorded in full detail (including timestamp) in an internal buffer.
Summary
– Reduces the size of output files and memory overhead.
– It produces (in the default mode) a few lines of output data per processor.
– This data is recorded in bins corresponding to intervals of size 1 ms by default.
Flexible
– APIs and runtime options for instrumenting user events and data generation control.
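
The binning used by summary mode can be sketched as follows. This is a minimal illustration in Python, not the Projections file format or the Charm++ runtime code: the function name `bin_busy_time` and the `(start, end)` event pairs are assumptions made for the example. Each execution's busy time is accumulated into fixed-width bins (1 ms by default, as stated above), so the stored data grows with the length of the run rather than with the number of events.

```python
def bin_busy_time(events, run_length_us, bin_us=1000):
    """Accumulate busy time per fixed-width bin.

    events: iterable of (start_us, end_us) integer pairs for one processor
            (hypothetical input format); all events must end by run_length_us.
    Returns a list with the busy microseconds recorded in each bin.
    """
    n_bins = -(-run_length_us // bin_us)  # ceiling division
    bins = [0] * n_bins
    for start, end in events:
        t = start
        while t < end:
            idx = t // bin_us
            bin_end = (idx + 1) * bin_us
            chunk = min(end, bin_end) - t  # part of the event inside this bin
            bins[idx] += chunk
            t += chunk
    return bins


# Three executions on one processor over a 4 ms run.
summary = bin_busy_time([(0, 500), (800, 2300), (3900, 4000)], 4000)
utilization = [busy / 1000 for busy in summary]
print(summary)
print(utilization)
```

Dividing each bin by the bin width gives the per-interval utilization that a utilization graph would plot.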
![Page 5: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/5.jpg)
5
![Page 6: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/6.jpg)
6
Post mortem analysis: views
Utilization Graph
– As a function of time interval or processor
– Shows processor utilization
– As well as: time spent on specific parallel methods
Timeline:
– upshot-like, but more details
– Pop-up views of method execution, message arrows, user-level events
Profile: stacked graphs:
– For a given period, breakdown of the time on each processor
• Includes idle time, and message-sending, receiving times
![Page 7: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/7.jpg)
7
![Page 8: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/8.jpg)
8
Projections Views: continued
Overview
– Like a timeline, but includes all processors, and all time!
– Each pixel (x,y) represents utilization of processor y at time x
Histogram of method execution times
– How many method-execution instances had a time of 0-1 ms? 1-2 ms? ..
Performance counters
– Associated with each entry method
– Usual counters, interface to PAPI
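
The histogram of method execution times described above amounts to counting executions per duration bin. The following Python sketch shows the idea under stated assumptions: `execution_time_histogram` is a hypothetical helper, the durations are made-up values, and the 1 ms bin width mirrors the "0-1 ms? 1-2 ms?" wording rather than any documented Projections default.

```python
from collections import Counter


def execution_time_histogram(durations_ms, bin_ms=1.0):
    """Count executions per duration bin: [0, bin), [bin, 2*bin), ..."""
    counts = Counter(int(d // bin_ms) for d in durations_ms)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]


# Hypothetical durations (ms) of entry-method executions.
hist = execution_time_histogram([0.2, 0.7, 0.9, 1.4, 1.5, 3.1])
for i, count in enumerate(hist):
    print(f"{i}-{i + 1} ms: {count}")
```

A long tail in such a histogram points at a few unusually long executions, which is how the view is used in the case studies.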
![Page 9: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/9.jpg)
9
Projections and Performance Analysis
Identify performance bottlenecks.
Verification of performance.
![Page 10: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/10.jpg)
10
Case Studies: Outline
We illustrate the use of Projections
– Through case studies of NAMD & CPAIMD.
– Illustrate the use of different visualization options.
– Show performance debugging methodology.
![Page 11: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/11.jpg)
11
NAMD: A Production MD program
NAMD
• Fully featured program
• NIH-funded development
• Distributed free of charge (~5000 downloads so far)
• Binaries and source code
• Installed at NSF centers
• User training and support
• Large published simulations (e.g., aquaporin simulation featured in SC’02 keynote)
Collaboration with K. Schulten, R. Skeel, and co-workers
![Page 12: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/12.jpg)
12
Molecular Dynamics in NAMD
Collection of [charged] atoms, with bonds
– Newtonian mechanics
– Thousands of atoms (10,000 - 500,000)
At each time-step
– Calculate forces on each atom
• Bonds:
• Non-bonded: electrostatic and van der Waals
– Short-distance: every timestep
– Long-distance: using PME (3D FFT)
– Multiple Time Stepping: PME every 4 timesteps
– Calculate velocities and advance positions
Challenge: femtosecond time-step, millions needed!
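
The multiple time stepping schedule above (short-distance forces every timestep, PME every 4 timesteps) can be sketched as a loop. This is a scheduling illustration only, not NAMD's integrator: `run_timesteps`, the placeholder force callables, and the choice to simply reuse the last long-range result between PME steps are all assumptions made for the example.

```python
def run_timesteps(n_steps, short_range_force, long_range_force, pme_period=4):
    """Toy multiple-time-stepping schedule.

    Short-range forces are evaluated every step; the long-range (PME)
    contribution is recomputed every `pme_period` steps and reused in between.
    Returns how many times each force routine was called.
    """
    calls = {"short": 0, "long": 0}
    cached_long = 0.0
    for step in range(n_steps):
        if step % pme_period == 0:
            cached_long = long_range_force(step)
            calls["long"] += 1
        total_force = short_range_force(step) + cached_long
        calls["short"] += 1
        # A real integrator would now update velocities and positions
        # from total_force; that part is omitted in this sketch.
        _ = total_force
    return calls


# Placeholder force routines (hypothetical, constant values).
calls = run_timesteps(8, lambda s: 1.0, lambda s: 0.1)
print(calls)
```

With a period of 4, the expensive 3D FFT work runs on a quarter of the steps, which is why PME shows up only periodically in timeline views.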
![Page 13: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/13.jpg)
13
NAMD Parallelization using Charm++ with PME
Figure labels: 700 VPs; 192 + 144 VPs; 30,000 VPs
These 30,000+ Virtual Processors (VPs) are mapped to real processors by the Charm++ runtime system
![Page 14: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/14.jpg)
14
Grainsize Issues
A variant of Amdahl’s law, for objects:
– The fastest time can be no shorter than the time for the biggest single object!
– Lesson from previous efforts
Splitting computation objects:
– 30,000 nonbonded compute objects
– Instead of approx 10,000
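
The grainsize argument can be made concrete with a simple lower bound: a step can finish no sooner than the largest single object, and no sooner than the perfectly balanced average load. The Python sketch below uses made-up object times and a hypothetical helper name, `lower_bound_step_time`, to show why splitting the largest objects helps.

```python
def lower_bound_step_time(object_times, n_procs):
    """Lower bound on the time of one step.

    No schedule can beat the perfectly balanced average load, and no
    schedule can finish before the largest single object does.
    """
    return max(max(object_times), sum(object_times) / n_procs)


# Hypothetical workload: nine small objects and one large one (ms).
coarse = [1.0] * 9 + [6.0]
# Same total work with the large object split into three pieces.
fine = [1.0] * 9 + [2.0] * 3

print(lower_bound_step_time(coarse, 8))  # bounded by the 6 ms object
print(lower_bound_step_time(fine, 8))    # bound drops to the 2 ms pieces
```

The total work is identical in both cases; only the size of the biggest object changes the bound.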
![Page 15: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/15.jpg)
15
Distribution of execution times of non-bonded force computation objects (over 24 steps)
Mode: 700 us
![Page 16: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/16.jpg)
16
Message Packing Overhead and Multicast
Effect of Multicast Optimization on Integration Overhead
By eliminating overhead of message copying and allocation.
![Page 17: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/17.jpg)
17
Load Balancing
Processor Utilization against Time on 128 and 1024 processors
On 128 processors, a single load balancing step suffices, but on 1024 processors, we need a “refinement” step.
Figure labels: Aggressive Load Balancing; Refinement Load Balancing
![Page 18: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/18.jpg)
18
Load Balancing Steps
Regular Timesteps
Instrumented Timesteps
Detailed, aggressive Load Balancing: object migration
Refinement Load Balancing
![Page 19: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/19.jpg)
19
Processor Utilization across processors after (a) greedy load balancing and (b) refining. Note that the underloaded processors are left underloaded (as they don’t impact performance); refinement deals only with the overloaded ones
Some overloaded processors
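
The two-stage strategy (a greedy pass, then a refinement pass that touches only overloaded processors) can be sketched in Python. These are simplified stand-ins, not the Charm++ load balancers: the function names, the 5% tolerance, and the object loads are assumptions made for the example.

```python
import heapq


def greedy_assign(object_loads, n_procs):
    """Place objects, heaviest first, on the currently least-loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]
    placement = {p: [] for p in range(n_procs)}
    for obj, load in sorted(enumerate(object_loads), key=lambda x: -x[1]):
        proc_load, p = heapq.heappop(heap)
        placement[p].append(obj)
        heapq.heappush(heap, (proc_load + load, p))
    return placement


def refine(placement, object_loads, tolerance=1.05):
    """Move objects off overloaded processors only.

    A processor is overloaded if its load exceeds tolerance * average.
    Underloaded processors are used as destinations but are never
    rebalanced among themselves.
    """
    load = {p: sum(object_loads[o] for o in objs) for p, objs in placement.items()}
    limit = tolerance * sum(load.values()) / len(load)
    for p in sorted(load, key=load.get, reverse=True):
        for obj in sorted(placement[p], key=lambda o: object_loads[o]):
            if load[p] <= limit:
                break
            dest = min(load, key=load.get)
            if load[dest] + object_loads[obj] <= limit:
                placement[p].remove(obj)
                placement[dest].append(obj)
                load[p] -= object_loads[obj]
                load[dest] += object_loads[obj]
    return placement


loads = [3.0, 2.0, 2.0, 3.0, 3.0, 2.0]       # hypothetical measured loads
fresh = greedy_assign(loads, 3)               # full remapping from scratch
stale = {0: [0, 1, 2], 1: [3], 2: [4, 5]}     # an existing, imbalanced mapping
refined = refine(stale, loads)                # only processor 0 gives up work
print(fresh)
print(refined)
```

The greedy pass may move almost every object, while the refinement pass migrates only what is needed to bring overloaded processors under the limit, which keeps migration cost low.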
![Page 20: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/20.jpg)
20
Benefits of Avoiding Barrier
Problem with barriers:
– Not the direct cost of the operation itself as much
– But it prevents the program from adjusting to small variations
• E.g. k phases, separated by barriers (or scalar reductions)
• Load is effectively balanced. But
– In each phase, there may be slight non-deterministic load imbalance
– Let $L_{i,j}$ be the load on the i’th processor in the j’th phase
With barrier: $\sum_{j=1}^{k} \max_i \{L_{i,j}\}$
Without: $\max_i \{\sum_{j=1}^{k} L_{i,j}\}$
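
The two costs on this slide, the sum over phases of the per-phase maximum (with barriers) and the maximum over processors of each processor's total (without), can be compared on a small example. The Python sketch below uses made-up integer loads; the "without barriers" figure is idealized, since it assumes a processor can start its next phase without waiting for the others.

```python
def time_with_barriers(load):
    """Sum over phases of the slowest processor in each phase."""
    n_phases = len(load[0])
    return sum(max(row[j] for row in load) for j in range(n_phases))


def time_without_barriers(load):
    """Slowest processor's total work across all phases."""
    return max(sum(row) for row in load)


# load[i][j]: load on processor i in phase j (hypothetical values).
# Every processor has the same total, but the slow one differs per phase.
load = [
    [12, 10, 8],
    [8, 12, 10],
    [10, 8, 12],
]
with_barriers = time_with_barriers(load)
without_barriers = time_without_barriers(load)
print(with_barriers, without_barriers)
```

Although the load is balanced over the whole run, the barriers force every phase to wait for that phase's slowest processor, so the small per-phase variations add up.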
![Page 21: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/21.jpg)
21
100 milliseconds
![Page 22: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/22.jpg)
22
Handling Stretches
Challenge
– NAMD still did not scale well to 3000 procs with 4 procs per node
– due to stretches: inexplicable increase in compute time or communication gaps at random (but few) points
– Stretches caused by: operating system, file system and resource management daemons interfering with the job
– Badly configured network API
• Messages waiting for the rendezvous of the previous message to be acknowledged, leading to stretches in the ISends
Managing stretches
– Use blocking receives
– Giving OS time when the job process is idle, to run daemons
– Fine tuning the network layer
![Page 23: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/23.jpg)
23
Stretched Computations
Jitter in computes up to 80 ms
– On 1000+ processors using 4 processors per node
– NAMD ATPase 3000 processors time steps of 12 ms
– Within that time: each processor sends and receives:
• Approximately 60-70 messages of 4-6 KB each
– OS context switch time is 10 ms
– OS and communication layer can have “hiccups”
• “Hiccups” termed as stretches
– Stretches can be a large performance impediment
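
A quick calculation with the figures above shows why stretches matter at this scale. The Python sketch below only restates the slide's numbers; the variable names are chosen for the example.

```python
step_ms = 12            # time per step (from the slide)
stretch_ms = 80         # worst observed jitter (from the slide)
context_switch_ms = 10  # OS context switch time (from the slide)
msgs_per_step = (60, 70)
msg_size_kb = (4, 6)

# Data sent and received per processor per step, low and high estimates.
traffic_kb = (msgs_per_step[0] * msg_size_kb[0], msgs_per_step[1] * msg_size_kb[1])

print(f"traffic per step: {traffic_kb[0]}-{traffic_kb[1]} KB")
print(f"one context switch costs {context_switch_ms / step_ms:.0%} of a step")
print(f"one {stretch_ms} ms stretch spans {stretch_ms / step_ms:.1f} steps")
```

A single context switch is comparable to a whole timestep, and one worst-case stretch is several timesteps long, so even rare interference on one processor can delay every processor that waits on its messages.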
![Page 24: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/24.jpg)
24
Stretch Removal
Histogram Views
Number of function executions vs. their granularity
Note: log scale on Y-axis
Before Optimizations
Over 16 large stretched calls
After Optimizations
About 5 large stretched calls, largest of them much smaller, and almost all calls take less than 3.2 ms
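
Counting stretched calls before and after an optimization can be sketched as a simple threshold test over per-call durations. In the Python example below, `find_stretches` is a hypothetical helper, the durations are made-up values, and using 3.2 ms as the cutoff is an assumption borrowed from the observation above that almost all calls take less than that.

```python
def find_stretches(durations_ms, threshold_ms=3.2):
    """Return (index, duration) for calls longer than the threshold."""
    return [(i, d) for i, d in enumerate(durations_ms) if d > threshold_ms]


# Hypothetical per-call durations in ms; most are well under 3.2 ms.
calls = [0.7, 0.9, 1.1, 0.8, 41.5, 0.6, 2.9, 12.0, 0.7]
stretched = find_stretches(calls)
print(len(stretched), "stretched calls; largest:", max(d for _, d in stretched))
```

Because the stretched calls are few compared with the normal ones, a log scale on the count axis is what makes them visible in the histogram view.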
![Page 25: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/25.jpg)
25
Activity Priorities
Identified a portion of CPAIMD that ran too early via the Time Profile tool.
![Page 26: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/26.jpg)
26
Serial Performance
The use of performance counters helped identify serial performance issues like cache performance.
Projections makes use of PAPI to measure performance counters.
![Page 27: 1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin](https://reader038.vdocument.in/reader038/viewer/2022110321/56649f425503460f94c62370/html5/thumbnails/27.jpg)
27
Challenges Ahead
Scalable Performance Data generation
– Meaningful restrictions on trace data generation.
– Data compression.
– Online analysis.
Scalable Performance Visualization
– Automatic identification of performance problems.