
Page 1: Statistical Performance Analysis for Scientific Applications

Statistical Performance Analysis for Scientific Applications

Presentation at the XSEDE14 Conference, Atlanta, GA

Fei Xing • Haihang You • Charng-Da Lu

July 15, 2014

Page 2: Statistical Performance Analysis for Scientific Applications


Running Time Analysis

• Causes of slow runs on a supercomputer
  – Improper memory usage
  – Poor parallelism
  – Too much I/O
  – Inefficiently optimized code
  – …

• Examine user’s code: profiling tools

• Profiling = physical exam for applications
  – Communication – Fast Profiling library for MPI (FPMPI)
  – Processor & memory – Performance Application Programming Interface (PAPI)
  – Overall performance & optimization opportunities – CrayPat

Page 3: Statistical Performance Analysis for Scientific Applications


Profiling Reports

• Profiling tools produce comprehensive reports covering a wide spectrum of application performance

• Imagine, as a scientist and supercomputer user, you see…

• Question: how to make sense of all this information in the report?
  – Meaning of the variables
  – Indication of the numbers

(Slide shows sample report metrics: I/O read time, I/O write time, MPI communication time, MPI synchronization time, MPI calls, L1 cache misses, L1 cache accesses, TLB misses, memory usage, MPI imbalance, MPI communication imbalance, and more.)

Page 4: Statistical Performance Analysis for Scientific Applications


Research Framework

• Select an HPC benchmark to create baseline kernels

• Use profiling tools to capture the peak performance

• Apply statistical approach to extract synthetic features that are easy to interpret

• Run real applications, and compare their performance with “role models”

(Framework illustration; courtesy of C.-D. Lu.)

Page 5: Statistical Performance Analysis for Scientific Applications


Gears for the Experiment

• Benchmarks – HPC Challenge (HPCC)
  – Gauges supercomputers toward peak performance
  – 7 representative kernels:
    • DGEMM, FFT, HPL, Random Access, PTRANS, Latency Bandwidth, Stream
    • HPL is used in the TOP500 ranking
  – 3 parallelism regimes:
    • Serial / Single Processor
    • Embarrassingly Parallel
    • MPI Parallel

• Profiling tools – FPMPI and PAPI

• Testing environment – Kraken (Cray XT5)

Page 6: Statistical Performance Analysis for Scientific Applications


HPCC

(Table of the HPCC kernels and their parallelism modes. Legend: 1 = serial/single processor, * = embarrassingly parallel, M = MPI parallel.)

Page 7: Statistical Performance Analysis for Scientific Applications


Training Set Design

• 2,954 observations
  – Various kernels, a wide range of matrix sizes, different compute nodes

• 11 performance metrics – gathered from FPMPI and PAPI
  – MPI communication time, MPI synchronization time, MPI calls, total MPI bytes, memory, FLOPS, total instructions, L2 data cache accesses, L1 data cache accesses, synchronization imbalance, communication imbalance

• Data preprocessing (see the sketch below)
  – Convert some metrics to unit-less rates: divide by wall time
  – Normalization

(Illustration of the training data matrix: one row per observation, e.g. HPL_1000, *_FFT_2000, M_RA_300,000; one column per performance metric, e.g. FLOPS, memory, …, MPI calls.)
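Here, as a rough illustration of the preprocessing just described, is a minimal R sketch. The data frame `raw`, its column names, and the choice of which metrics get rate-converted are hypothetical placeholders, not the authors' actual pipeline.

    # Minimal preprocessing sketch (hypothetical column names; `raw` holds one row
    # per observation with the profiled metrics plus wall-clock time).
    rate_cols <- c("mpi_comm_time", "mpi_sync_time", "mpi_calls", "mpi_bytes",
                   "total_instructions", "l2_dcache_access", "l1_dcache_access")

    rates <- as.matrix(raw[rate_cols])
    rates <- sweep(rates, 1, raw$walltime, "/")   # divide by wall time -> unit-less rates
    other <- as.matrix(raw[c("memory", "flops", "sync_imbalance", "comm_imbalance")])
    X     <- scale(cbind(rates, other))           # normalize every metric (z-scores)

The resulting matrix `X` (2,954 rows by 11 normalized metrics) is what the PCA and variable-clustering steps on the following slides operate on.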

Page 8: Statistical Performance Analysis for Scientific Applications


Extract Synthetic Features

• Extract synthetic & accessible Performance Indices (PIs)

• Solution: Variable Clustering + Principal Component Analysis (PCA)

• PCA: decorrelate the data

• Problem with using PCA alone: variables with small loadings may over-influence the PC score

• Standardization & modified PCA do not work well
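For reference, running plain PCA on the normalized metric matrix `X` from the previous slide might look like the sketch below; it is the baseline being criticized here, not the authors' final method (that is the variable clustering on the next slides).

    # Plain PCA on the normalized metric matrix X (decorrelates the 11 metrics)
    pca <- prcomp(X, center = TRUE, scale. = TRUE)

    summary(pca)                    # variance explained by each principal component
    round(pca$rotation[, 1:3], 2)   # loadings: every metric contributes to every PC,
                                    # which is what makes raw PC scores hard to interpret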

Page 9: Statistical Performance Analysis for Scientific Applications


Variable Clustering

• Given a partition of the variables X: $P_k = (C_1, \ldots, C_k)$

• Centroid of cluster $C_i$:
  $y_i = \arg\max_{u \in \mathbb{R}^n} \sum_{x_j \in C_i} r^2_{x_j, u}$
  – $r_{x,y} = \mathrm{cov}(x, y) / (\sigma_x \sigma_y)$ is the Pearson correlation
  – $y_i$ is the 1st principal component of $C_i$

• Homogeneity of $C_i$:
  $H(C_i) = \sum_{x_j \in C_i} r^2_{y_i, x_j}$

• Quality of the clustering $P_k$:
  $H(P_k) = \sum_{i=1}^{k} H(C_i)$

• Optimal partition:
  $P_k^* = \arg\max_{P_k} H(P_k)$
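As a concrete reading of these definitions, the homogeneity of a single cluster can be computed in a few lines of R; the input `Xc` (the columns of X assigned to one cluster) is an assumed name, not something defined on the slide.

    # Homogeneity H(C_i) of one cluster of variables, following the definitions above.
    # Xc: an n x p matrix whose columns are the variables assigned to cluster C_i.
    cluster_homogeneity <- function(Xc) {
      y <- prcomp(Xc, scale. = TRUE)$x[, 1]   # centroid y_i = scores on the 1st PC
      sum(cor(y, Xc)^2)                       # sum of squared Pearson correlations
    }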

Page 10: Statistical Performance Analysis for Scientific Applications


Variable Clustering – Visualize This!

• Given a partition: $P_4 = (C_1, \ldots, C_4)$

• Centroid of $C_k$: 1st PC of $C_k$

• Quality of $P_4$: $H(P_4) = H(C_1) + H(C_2) + H(C_3) + H(C_4)$

• Optimal partition: $P_k^* = \arg\max_{P_k} H(P_k)$

Page 11: Statistical Performance Analysis for Scientific Applications


Implementation

• Finding the theoretically optimal partition is computationally expensive

• Agglomerative hierarchical clustering
  – Start with each variable as an individual cluster
  – At each step, merge the closest pair of clusters, until only one cluster is left

• Result can be visualized as a dendrogram

• ClustOfVar in R
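A minimal sketch of that R workflow, assuming `X` is the normalized 2,954 x 11 metric matrix built earlier and that three clusters are requested (matching the three PIs on the next slide):

    library(ClustOfVar)

    # Agglomerative hierarchical clustering of the variables (metrics), not the observations
    tree <- hclustvar(X.quanti = X)
    plot(tree)                    # dendrogram of the 11 metrics

    # Cut the dendrogram into 3 clusters; each cluster centroid (its 1st PC) is one PI
    part <- cutreevar(tree, k = 3)
    part$var                      # cluster membership and squared loadings of each metric
    head(part$scores)             # per-observation synthetic scores: the three PIs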

Page 12: Statistical Performance Analysis for Scientific Applications


Simulation Output

(Figure: the 11 metrics grouped into three Performance Indices – PI1: Communication, PI2: Memory, PI3: Computation – each PI formed as a weighted combination of its metrics, with loadings 0.53, 0.52, 0.49, 0.46, 1.00, −0.15, −0.07, 0.81, −0.30, 0.45, −0.14.)

Page 13: Statistical Performance Analysis for Scientific Applications


PIs for Baseline Kernels

Page 14: Statistical Performance Analysis for Scientific Applications


PI1 vs PI2

• 2 distinct strata on memory
  – Upper – multi-node runs, which need extra memory buffers
  – Lower – single-node runs, shared memory

• High PI2 for HPL

(Scatter plot of the baseline kernels: x-axis PI1 Communication, y-axis PI2 Memory.)

Page 15: Statistical Performance Analysis for Scientific Applications


PI1 vs PI3

• Similar PI3 pattern for HPL and DGEMM
  – Both are computation intensive
  – HPL uses the DGEMM routine extensively

• Similar values of all PIs for Stream and Random Access

(Scatter plot of the baseline kernels: x-axis PI1 Communication, y-axis PI3 Computation.)

Page 16: Statistical Performance Analysis for Scientific Applications


(Figure courtesy of C.-D. Lu.)

Page 17: Statistical Performance Analysis for Scientific Applications


Applications

• 9 real-world scientific applications spanning weather and climate modeling, molecular dynamics, cosmology, and quantum physics
  – Amber: molecular dynamics
  – ExaML: phylogenetics (molecular sequence analysis)
  – GADGET: cosmology
  – Gromacs: molecular dynamics
  – HOMME: climate modeling
  – LAMMPS: molecular dynamics
  – MILC: quantum chromodynamics
  – NAMD: molecular dynamics
  – WRF: weather research

(Voronoi diagram of the applications in the PI1 Communication vs PI3 Computation plane; see the sketch below for one way to read it.)
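One way to read the Voronoi picture is that each application is matched to the nearest baseline kernel (its "role model") in PI space. The sketch below is a hypothetical illustration of that matching; the distance metric and the exact rule are assumptions, not something stated on the slide.

    # Hypothetical matching of an application to its nearest baseline kernel in PI space.
    # kernel_pis: matrix of kernel PIs (one row per kernel, rownames = kernel names)
    # app_pi:     numeric vector with the application's (PI1, PI2, PI3)
    nearest_kernel <- function(app_pi, kernel_pis) {
      d <- sqrt(rowSums(sweep(kernel_pis, 2, app_pi)^2))  # Euclidean distance to each kernel
      rownames(kernel_pis)[which.min(d)]
    }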

Page 18: Statistical Performance Analysis for Scientific Applications


Conclusion and Future Work

We have

• Proposed a statistical approach that gives users better insight into massive performance datasets;

• Created a performance scoring system that uses 3 PIs to capture the high-dimensional performance space;

• Gave users accessible performance implications and improvement hints.

We will

• Test the method on other machines and systems;

• Define and develop a set of baseline kernels that better represent HPC workloads;

• Construct a user-friendly system incorporating statistical techniques to drive more advanced performance analysis for non-experts.

Page 19: Statistical Performance Analysis for Scientific Applications


Thanks for your attention! Questions?