Statistical Performance Analysis for Scientific Applications
Presentation at the XSEDE14 Conference, Atlanta, GA
Fei Xing • Haihang You • Charng-Da Lu
July 15, 2014
Running Time Analysis
• Causes of slow runs on a supercomputer
– Improper memory usage
– Poor parallelism
– Too much I/O
– Program not optimized efficiently
– …
• Examine user’s code: profiling tools
• Profiling = physical exam for applications
– Communication – Fast Profiling library for MPI (FPMPI)
– Processor & memory – Performance Application Programming Interface (PAPI)
– Overall performance & Optimization opportunity – CrayPat
Profiling Reports
• Profiling tools produce comprehensive reports covering a wide spectrum of application performance
• Imagine, as a scientist and supercomputer user, you see…
• Question: how to make sense of this information in the report?
– Meaning of the variables
– What the numbers indicate
[Figure: excerpt of a profiling report with metrics such as I/O read time, I/O write time, MPI communication time, MPI synchronization time, MPI calls, memory usage, L1 cache accesses, L1 cache misses, TLB misses, MPI imbalance, MPI communication imbalance, and more]
Research Framework
• Select an HPC benchmark to create baseline kernels
• Use profiling tools to capture their peak performance
• Apply statistical approach to extract synthetic features that are easy to interpret
• Run real applications, and compare their performance with “role models”
[Figure courtesy of C.-D. Lu]
Gears for the Experiment
• Benchmarks – HPC Challenge (HPCC)
– Gauge supercomputers toward peak performance
– 7 representative kernels:
  • DGEMM, FFT, HPL, RandomAccess, PTRANS, Latency-Bandwidth, STREAM
  • HPL is used in the TOP500 ranking
– 3 parallelism regimes:
  • Serial / Single Processor
  • Embarrassingly Parallel
  • MPI Parallel
• Profiling tools – FPMPI and PAPI
• Testing environment – Kraken (Cray XT5)
Training Set Design
• 2,954 observations
– Various kernels, wide range of matrix sizes, different compute nodes
• 11 performance metrics – gathered from FPMPI and PAPI
– MPI communication time, MPI synchronization time, MPI calls, total MPI bytes, memory, FLOPS, total instructions, L2 data cache access, L1 data cache access, synchronization imbalance, communication imbalance
• Data preprocessing
– Convert some metrics to unit-less rates: divide by wall time (sketched in R below)
– Normalization
[Figure: training data matrix – rows are observations (e.g. HPL_1000, *_FFT_2000, …, M_RA_300,000), columns are the performance metrics (FLOPS, memory, …, MPI calls)]
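A minimal R sketch of this preprocessing step, assuming a hypothetical file hpcc_profiles.csv with one row per kernel run and a wall_time column (all file and column names are illustrative, not from the original slides):

```r
# Hypothetical raw profile table: one row per kernel run (e.g. HPL_1000),
# columns are FPMPI/PAPI counters plus wall-clock time ("wall_time").
raw <- read.csv("hpcc_profiles.csv", row.names = 1)

# Turn cumulative counters into rates by dividing each row by its wall time,
# so runs of different lengths become comparable.
rate_cols <- setdiff(colnames(raw), "wall_time")
rates <- sweep(raw[, rate_cols], 1, raw$wall_time, "/")

# Normalize every metric to zero mean and unit variance.
train <- scale(rates)
```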
Extract Synthetic Features
• Extract synthetic & accessible Performance Indices (PIs)
• Solution: Variable Clustering + Principal Component Analysis (PCA)
• PCA: decorrelate the data
• Problem with using PCA alone: variables with small loadings may over-influence the PC score (see the sketch below)
• Standardization & modified PCA do not work well
Variable Clustering
• Given a partition of X, $P_k = (C_1, \dots, C_k)$
• Centroid of cluster $C_i$: $y_i = \operatorname{argmax}_{u} \sum_{x_j \in C_i} r^2_{u, x_j}$
– $y_i$ is the 1st principal component of $C_i$
– $r_{x,y} = \operatorname{cov}(x,y) / (\sigma_x \sigma_y)$ is the Pearson correlation
• Homogeneity of $C_i$: $H(C_i) = \sum_{x_j \in C_i} r^2_{y_i, x_j}$
• Quality of clustering $P_k$: $H(P_k) = \sum_{i=1}^{k} H(C_i)$
• Optimal partition: $P_k^* = \operatorname{argmax}_{P_k} H(P_k)$
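A small R sketch of these quantities for a single cluster; the variable names in `vars` are hypothetical members of $C_i$, and `train` is the normalized metric matrix from the earlier sketch:

```r
# Hypothetical members of cluster C_i.
vars <- c("mpi_comm_time", "mpi_sync_time", "mpi_calls")

# Centroid y_i: scores of the 1st principal component of the cluster's variables.
y_i <- prcomp(train[, vars], scale. = TRUE)$x[, 1]

# Homogeneity H(C_i): sum of squared Pearson correlations between the
# centroid and each variable in the cluster.
H_Ci <- sum(cor(y_i, train[, vars])^2)
```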
Variable Clustering – Visualize This!
• Given a partition: $P_4 = (C_1, \dots, C_4)$
• Centroid of $C_k$: 1st PC of $C_k$
• Quality of $P_4$: $H(P_4) = H(C_1) + H(C_2) + H(C_3) + H(C_4)$
• Optimal partition: $P_k^* = \operatorname{argmax}_{P_k} H(P_k)$
Implementation
• Finding the theoretically optimal partition is computationally expensive
• Agglomerative hierarchical clustering
– Start with each variable as an individual cluster
– At each step, merge the closest pair of clusters until only one cluster remains
• Result can be visualized as a dendrogram
• ClustOfVar in R
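A hedged sketch of this workflow with the ClustOfVar package (hclustvar and cutreevar are its CRAN functions; cutting the tree at k = 3 mirrors the three PIs but is otherwise an assumption):

```r
library(ClustOfVar)

# Agglomerative hierarchical clustering of the 11 quantitative metrics.
tree <- hclustvar(X.quanti = train)
plot(tree)            # dendrogram of the variable clusters

# Cut the tree into 3 clusters; each cluster's synthetic variable
# (its 1st principal component) plays the role of one Performance Index.
part <- cutreevar(tree, k = 3)
part$cluster          # which metric belongs to which PI
head(part$scores)     # per-observation scores of PI1, PI2, PI3
```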
Simulation Output
[Figure: variable loadings grouped into the three Performance Indices – PI1: Communication, PI2: Memory, PI3: Computation]
PI1 vs PI2
• 2 distinct strata on memory
– Upper – multiple node runs, need extra memory buffers
– Lower – single node runs, shared memory
• High PI2 for HPL
[Scatter plot: PI1 (Communication) vs PI2 (Memory)]
PI1 vs PI3
• Similar PI3 pattern for HPL and DGEMM
– Computation intensive
– HPL utilizes the DGEMM routine extensively
• Similar values across all PIs for STREAM & RandomAccess
[Scatter plot: PI1 (Communication) vs PI3 (Computation)]
Applications
• 9 real-world scientific applications in weather forecasting, molecular dynamics and quantum physics
– Amber: molecular dynamics
– ExaML: molecular sequencing
– GADGET: cosmology
– Gromacs: molecular dynamics
– HOMME: climate modeling
– LAMMPS: molecular dynamics
– MILC: quantum chromodynamics
– NAMD: molecular dynamics
– WRF: weather research
[Figure: Voronoi diagram of the applications in PI1 (Communication) vs PI3 (Computation) space]
Conclusion and Future Work
We have
• Proposed a statistical approach to give users better insights into massive performance datasets;
• Created a performance scoring system using 3 PIs to capture high-dimensional performance space;
• Gave users accessible performance implications and improvement hints.
We will
• Test the method on other machines and systems;
• Define and develop a set of baseline kernels that better represent HPC workloads;
• Construct a user-friendly system incorporating statistical techniques to drive more advanced performance analysis for non-experts.