TRANSCRIPT
Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation
Ye Jin, Xiaosong Ma, Mingliang Liu, Qing Liu, Jeremy Logan, Norbert Podhorszki,
Jong Youl Choi, Scott Klasky
Systems and Applications More Complex
Powerful supercomputers:
• Large number of nodes
• Deeper memory and I/O stack
• Heterogeneous architecture
• System-specific interconnect
Large-scale codes from diverse scientific domains solving real-world problems
More Choices for HPC Platforms
[Figure: spectrum of HPC platform choices — single server, local cluster (10 Gb network), HPC center, HPC in cloud]
Performance Study Crucial
• Yet remains challenging
  – Single-platform analysis is resource-consuming
  – Not to mention cross-platform analysis
• Benchmarks more important than ever
  – Evaluate machines
  – Validate hardware/software design
  – Select from candidate platforms
• Realistic benchmarks hard to find
Benchmarks and Generation Tools
Type                Example                Pros                   Cons
Kernels             NPB, SPEC, Intel-MPI   Real, parametric       Simple, non-HPC, less flexible
Manually extracted  FlashIO, GTC-Bench     Realistic, parametric  Labor-intensive, easily obsolete
Trace-based         ScalaBenchGen          Automatic              Replay-based, non-parametric, platform-dependent
Specialized (I/O)   IOR, Skel              Parametric             I/O phase only
Automatic, full-application benchmark extraction?
Outline
• Motivation
• Our recent work related to automatic application benchmark extraction
  • APPrime framework (SIGMETRICS'15, [21])
    • Led by NCSU PhD student Ye Jin
    • Collaboration with ORNL
  • Cypress tool for communication trace compression (SC'14, [7])
• Closing remarks
Desired Features
• Based on real, large-scale applications
• Leveraging existing tracing tools
• Automatic source code generation
• Concise, configurable, portable benchmarks
• With relative performance retained
[Figure: application traces flow into our system, which emits a generated benchmark]
Sample Use Case 1: Cross-Platform Performance Estimation
• Estimate relative performance on candidate machines
[Figure: speed-up ratio, Titan supercomputer to Sith cluster (ORNL)]
• Relative performance is highly case-dependent, varying across applications, execution scales, and tasks (computation, communication, I/O)
Sample Use Case 2: I/O Method Selection
[Figure: parallel I/O with multi-level data staging — a simulation job on compute nodes writes through staging nodes (main memory, SSD), I/O nodes, and a SAN over the interconnection network]
• Lots of I/O options available: # of files, sync or async, # of staging nodes, local or remote SSD, I/O library, stripe width, I/O frequency, …
• Realistic I/O benchmarks allow users and I/O system designers to
  • consider the interplay between I/O and other activities
  • evaluate I/O options with portable, light-weight benchmarks
  • assess candidate I/O designs/configurations
APPrime Overview
• Automatic, whole-application benchmark generation
  – Input: parallel execution traces of application A on one platform
  – Output: "fake application" A', simulating A's behavior
    • Computation, communication, I/O, scaling
    • Portable, shorter source code using few libraries
• APPrime main idea
  – Get information from traces, but do not replay
  – Differentiate between regular and irregular behavior
    • Be exact with regular activity (loops)
    • Model any irregularity as a statistical distribution (histograms)
• Current status
  – Ready: overall framework, communication, I/O
  – To-do: computation kernel
Assumed Computation Model
• Iterative parallel applications have regular execution patterns
• In the form I(C*W)*F [1]
  – I: one-time initialization phase (head)
  – F: one-time finalization phase (tail)
  – C: timestep computation phase (with communication)
  – W: periodic I/O phase
[Figure: execution structure — I, then (C repeated x times, then W) repeated y times, then F; each C phase is a sequence of events separated by computation "bubbles"]
• APPrime automatically identifies phases from traces, without any involvement of the programmer/user
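As an illustration of this model (not APPrime's actual detection logic): once a trace is reduced to a compact string with one letter per MPI event, as shown under "Event Table to Trace String" below, the I(C*W)*F structure can be checked mechanically. A minimal C sketch with made-up letter assignments:

    /* A minimal illustrative sketch, not APPrime's algorithm: check the
     * I(C*W)*F structure with a POSIX regex over a compact trace string.
     * Letter assignments are invented: "ab" = init events, "ccd" = one
     * C-phase event sequence, "ef" = a W (I/O) phase, "gh" = finalization. */
    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        const char *trace = "abccdccdefccdccdefgh";
        regex_t re;
        /* head, then (C-phase events repeated, then I/O) repeated, then tail */
        regcomp(&re, "^ab((ccd)+ef)+gh$", REG_EXTENDED);
        printf("matches I(C*W)*F? %s\n",
               regexec(&re, trace, 0, NULL, 0) == 0 ? "yes" : "no");
        regfree(&re);
        return 0;
    }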
Complications from Real Large-scale Apps
• Challenges
  – Noise (irregular activities)
    • Found to be minor across all applications we studied
  – Multiple I/O phases
  – Heterogeneous C-phase communication behavior
    • Identical event sequence, different parameters
• Solutions
  – Extend C to C[0, a] D0|1 C[b, |C|]
    • Allow a minor noise phase D
    • Ignored in benchmark generation
  – Extend W to Wi
    • Multiple I/O phases, each with an individual (fixed) frequency
  – Use a Markov Chain Model (MCM) to simulate transitions between multiple C phases
APPrime Workflow
[Figure: the APPrime automatic benchmark generation framework. Input: DUMPI and ScalaTrace traces. In the Extractor, a parser factory selects a trace parser that builds per-process event tables, which are merged across tables; the phase identifier then labels the phases in each table as head (I), noise (D), periodic I/O (W), major loops (C), and tail (F). In the Generator, static phases (head, tail) are carried over directly, the I/O translator handles W phases, and the MCM builder turns C phases into MC states; the code generator emits the APPrime benchmark as output: a configuration parameter file plus source code.]
Trace Parsing: Trace to Event Table

Original ASCII DUMPI trace (excerpt):
• MPI_Bcast entering at walltime 102625.244, int count=1, MPI_Datatype datatype=4 (MPI_INT), int root=0, MPI_Comm comm=4 (user-defined-comm), MPI_Bcast returning at walltime 102625.244.
• MPI_Barrier entering at walltime 102625.245, MPI_Comm comm=5 (user-defined-comm), MPI_Barrier returning at walltime 102625.253.
• MPI_File_open entering at walltime 102627.269, MPI_Comm comm=5 (user-defined-comm), int amode=0 (CREATE), filename="simple.out", MPI_Info info=0 (MPI_INFO_NULL), MPI_File file=1 (user-file), MPI_File_open returning at walltime 102627.439.

Sample joint per-process event table (Phase ID and Phase type columns to be filled by the phase identifier):
MPI function name  Start   End     Data count  Root  Comm. rank  File access mode  Phase ID  Phase type  …
MPI_Bcast          …5.244  …5.245  1           0     4           N/A               (tbd)     (tbd)       …
MPI_Barrier        …5.245  …5.253  N/A         N/A   5           N/A               (tbd)     (tbd)       …
MPI_File_open      …7.269  …7.439  N/A         N/A   5           CREATE            (tbd)     (tbd)       …
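To make the table concrete, here is a minimal sketch of what one event-table record might look like in C; the struct and field names are illustrative assumptions, not APPrime's actual internals:

    #include <stdio.h>

    /* Illustrative event-table record; field names are assumptions. */
    typedef struct {
        char   func[32];     /* MPI function name, e.g. "MPI_Bcast"     */
        double start, end;   /* entering / returning walltimes          */
        int    data_count;   /* element count for transfers (-1 = N/A)  */
        int    root;         /* root rank for collectives (-1 = N/A)    */
        int    comm;         /* communicator rank/id                    */
        char   amode[16];    /* file access mode, "" if N/A             */
        int    phase_id;     /* filled in later by the phase identifier */
        char   phase_type;   /* 'I', 'C', 'W', 'D', or 'F'              */
    } event_t;

    int main(void) {
        /* The MPI_Bcast record from the trace excerpt above. */
        event_t e = { "MPI_Bcast", 102625.244, 102625.244,
                      1, 0, 4, "", -1, '?' };
        printf("%s [%.3f, %.3f] count=%d root=%d comm=%d\n",
               e.func, e.start, e.end, e.data_count, e.root, e.comm);
        return 0;
    }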
Event Table to Trace String
MPI function name  Start   End     Data count  Root  Comm. rank  File access mode  Phase ID  Phase type  …
MPI_Bcast          …5.244  …5.245  1           0     4           N/A               N/A       N/A         …
MPI_Barrier        …5.245  …5.253  N/A         N/A   5           N/A               N/A       N/A         …
MPI_File_open      …7.269  …7.439  N/A         N/A   5           CREATE            N/A       N/A         …

Event-to-letter mapping: MPI_Init => 'a', MPI_Barrier => 'c', MPI_Bcast => 'd', MPI_File_open => 'f', …, MPI_Finalize => 'h'
Compact trace string: ab…ccd…ccd…ef…ccd…ccd…ef…gh

• APPrime deploys a new string processing algorithm that
  • automatically identifies all phases
  • by searching for the partitioning that maximizes inter-iteration repetition
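A minimal sketch of the encoding step, with illustrative names (the tool's actual code is not shown in the talk): assign each distinct MPI function the next unused letter and emit one character per event, producing the compact string that the phase-identification search runs over.

    /* Illustrative only: encode an event sequence as a compact string. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_FUNCS 64

    /* Return the letter for a function name, assigning a new one on first sight. */
    static char func_to_char(const char *func, char names[][32], int *n) {
        for (int i = 0; i < *n; i++)
            if (strcmp(names[i], func) == 0)
                return (char)('a' + i);    /* already seen: reuse its letter */
        strcpy(names[*n], func);           /* new function: next free letter */
        return (char)('a' + (*n)++);
    }

    int main(void) {
        const char *events[] = { "MPI_Init", "MPI_Bcast", "MPI_Barrier",
                                 "MPI_Barrier", "MPI_Bcast", "MPI_File_open",
                                 "MPI_Finalize" };
        int n = sizeof events / sizeof events[0];
        char names[MAX_FUNCS][32], str[128];
        int n_names = 0;
        for (int i = 0; i < n; i++)
            str[i] = func_to_char(events[i], names, &n_names);
        str[n] = '\0';
        printf("compact trace string: %s\n", str);   /* prints "abccbde" */
        return 0;
    }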
Computation Gap across Timesteps

[Figure: the event table of one process's first timestep (C phase). Each timestep 1..n repeats the same event sequence (e.g. MPI_Bcast, MPI_Isend, …), separated by computation gaps ("bubbles" 1.1, 1.2, …, n.1, n.2, …). The bubble durations across timesteps are collected into histograms; the events themselves populate the per-process event table (function name, start/end walltimes, data count, root, communicator rank, file access mode, phase ID, phase type, …).]
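The per-bubble gap sizes are modeled as histograms. A minimal sketch with invented values and an assumed fixed-width binning scheme (not necessarily APPrime's actual one):

    /* Illustrative only: collect one bubble's duration across timesteps
     * into a fixed-width histogram for the generated benchmark to sample. */
    #include <stdio.h>

    #define NBINS 10

    typedef struct {
        double min, width;    /* bin 0 starts at min; each bin is `width` wide */
        int    count[NBINS];
    } histogram_t;

    void hist_add(histogram_t *h, double gap) {
        int bin = (int)((gap - h->min) / h->width);
        if (bin < 0) bin = 0;
        if (bin >= NBINS) bin = NBINS - 1;   /* clamp outliers to edge bins */
        h->count[bin]++;
    }

    int main(void) {
        /* Invented gaps between MPI_Bcast's return and MPI_Isend's entry. */
        double bubbles[] = { 0.013, 0.011, 0.012, 0.014, 0.011, 0.052 };
        histogram_t h = { 0.010, 0.005, {0} };
        for (int t = 0; t < 6; t++)
            hist_add(&h, bubbles[t]);
        for (int b = 0; b < NBINS; b++)
            printf("[%.3f, %.3f): %d\n",
                   h.min + b * h.width, h.min + (b + 1) * h.width, h.count[b]);
        return 0;
    }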
Inter-Process Event Table Merging

Merging per-process event tables: identical event sequences from different processes are combined, with differing parameters (here, the source rank) collected into sets.

Per-process event table, process i:
MPI function name  Data count  Type     Dest.  Src.  Comm. rank  …
MPI_Irecv          20          MPI_INT  N/A    4     4           …
MPI_Send           20          MPI_INT  4      N/A   4           …

Per-process event table, process j:
MPI function name  Data count  Type     Dest.  Src.  Comm. rank  …
MPI_Irecv          20          MPI_INT  N/A    8     4           …
MPI_Send           20          MPI_INT  4      N/A   4           …

Merged event table:
MPI function name  Data count  Type     Dest.  Src.       Comm. rank  …
MPI_Irecv          20          MPI_INT  N/A    {4, 8, …}  4           …
MPI_Send           20          MPI_INT  4      N/A        4           …
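A minimal sketch of the merge step, with illustrative names: rows that match on everything except (say) the source rank are collapsed, and the differing values accumulate into a set.

    /* Illustrative only: merge differing source ranks into a set. */
    #include <stdio.h>

    #define MAX_SRCS 16

    typedef struct {
        const char *func;
        int count;
        int srcs[MAX_SRCS];   /* set of source ranks seen across processes */
        int n_srcs;
    } merged_event_t;

    void merge_src(merged_event_t *e, int src) {
        for (int i = 0; i < e->n_srcs; i++)
            if (e->srcs[i] == src)
                return;                  /* already in the set */
        e->srcs[e->n_srcs++] = src;
    }

    int main(void) {
        merged_event_t irecv = { "MPI_Irecv", 20, {0}, 0 };
        merge_src(&irecv, 4);   /* from one process's table     */
        merge_src(&irecv, 8);   /* from another process's table */
        printf("%s count=%d srcs={", irecv.func, irecv.count);
        for (int i = 0; i < irecv.n_srcs; i++)
            printf("%s%d", i ? ", " : "", irecv.srcs[i]);
        printf("}\n");
        return 0;
    }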
Markov Chain Model for C-Phase States
Each distinct C-phase behavior becomes a Markov Chain (MC) state holding its merged event table:

MC State Rank 1 (from merged timestep m):
No.  Name       Count  Type     Dest.  Src.
1    MPI_Irecv  20     MPI_INT  N/A    {4, 8, …}
2    MPI_Send   20     MPI_INT  4      N/A
…    …          …      …        …      …

MC State Rank 2 (from merged timestep m+n):
No.  Name       Count  Type     Dest.  Src.
1    MPI_Irecv  80     MPI_INT  N/A    {1, 7, …}
2    MPI_Send   80     MPI_INT  1      N/A
…    …          …      …        …      …

Transition probability matrix:
         State 1  State 2
State 1  0.3      0.7
State 2  0.7      0.3
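A minimal sketch of state selection from the matrix above, using names like those in the generated code below (the generated trans_state also takes the timestep; this simplified illustration just samples the current state's row):

    /* Illustrative only: sample the next MC state from the transition matrix. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NSTATES 2

    /* Row = current state; column = probability of moving to that state. */
    static const double trans[NSTATES][NSTATES] = {
        { 0.3, 0.7 },
        { 0.7, 0.3 },
    };

    int trans_state(int state_rank) {
        double r = (double)rand() / RAND_MAX, cum = 0.0;
        for (int j = 0; j < NSTATES; j++) {
            cum += trans[state_rank][j];
            if (r < cum)
                return j;
        }
        return NSTATES - 1;   /* guard against rounding when cum ~ 1.0 */
    }

    int main(void) {
        int state = 0;
        for (int t = 0; t < 5; t++) {
            state = trans_state(state);
            printf("timestep %d -> state %d\n", t, state);
        }
        return 0;
    }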
Benchmark Code Generation

int main(int argc, char* argv[]) {
    apprime_init(argc, argv);
    init_phase();     // direct replay of I phase
    // major loop
    for (timestep = 0; timestep < total_timestep; timestep++) {
        run_state_for_C_phase(state_rank, event_tables);
        // update next state rank: select the next MC state for the C phase
        state_rank = trans_state(state_rank, timestep);
        // periodic I/O phases (Wi, here i = 1)
        if ((timestep + 1) % restart_period_1 == 0)
            W_phase_1();
        …
    }
    final_phase();    // direct replay of F phase
    apprime_finalize();
    return 0;
}
Evaluation
• Platforms: Titan and Sith at ORNL
• Workloads:
  • Real-world HPC applications:
    • Quantum turbulence code: BEC2
    • Gyrokinetic particle simulations: XGC and GTS
  • NAS benchmarks: BTIO, LU, SP, CG

Name   # of nodes  Cores per node  Mem. per node  OS            File system
Titan  18,688      16              32 GB          Cray XK7      Lustre
Sith   40          32              64 GB          Linux x86_64  Lustre
HPC Applications
Name      Domain                 Typical prod. run scale (# of cores)  Open source?  Status
XGC       Gyrokinetic            225,280                               No            Done
GTS       Gyrokinetic            262,144                               No            Done
BEC2      Unitary qubit          110,592                               No            Done
QMC-Pack  Electronic molecular   256 – 16,000                          No            Applicable
S3D       Molecular physics      96,000 – 180,000                      No            Applicable
AWP-ODC   Wave propagation       223,074                               No            Applicable
NAMD      Molecular dynamics     1,000 – 20,000                        No            Applicable
HFODD     Nuclear                299,008                               Yes           Applicable
LAMMPS    Molecular dynamics     12,500 – 130,000                      Yes           Applicable
SPEC      Multi-domain benchmarks  150 – 600                           Yes           Applicable
NPB       Multi-domain benchmarks  N/A                                 Yes           Done
Applications' Trace Features
App   # procs  # TSs  Trace size  Table size  # events in one state  String size  # unique funcs  # states  D%    TSV%  Profile size
BTIO  64       250    832MB       584MB       183                    44.4KB       16              1         0%    2.1%  2.2MB
BTIO  256      250    7.02GB      4.75GB      266                    91.5KB       16              1         0%    4.3%  8.1MB
CG    64       100    1.42GB      1.00GB      783                    77.8KB       11              1         0%    1.5%  3.4MB
CG    256      100    7.51GB      5.50GB      1478                   101KB        11              1         0%    1.8%  11MB
SP    64       500    1.44GB      960MB       139                    68.1KB       15              1         0%    1.4%  1.4MB
SP    256      500    11.7GB      7.43GB      278                    138KB        15              1         0%    3.5%  6.1MB
LU    64       300    18GB        12.3GB      1604                   471KB        11              1         0%    2.3%  11MB
LU    256      300    75GB        51.3GB      1604                   471KB        11              1         0%    3.8%  44MB
BEC2  64       100    142MB       101MB       74                     7.5KB        14              1         0%    1.8%  1.1MB
BEC2  256      200    1.08GB      800MB       74                     14.7KB       14              1         0%    2.7%  3.6MB
XGC   64       100    262MB       243MB       73                     11.5KB       28              2         0.1%  4.3%  1.0MB
XGC   256      200    2.1GB       1.64GB      103                    15.3KB       28              2         0.1%  5.8%  1.4MB
GTS   64       50     213MB       137MB       391                    11.6KB       38              2         0.3%  5.6%  1.9MB
GTS   256      100    1.83GB      1.15GB      391                    24.9KB       38              2         0.3%  5.9%  7.2MB
Results: A vs. A’
• Comparing target application A with APPrime-generated benchmark A'
  – A' is much more compact and easier to build
  – If A has multiple C-phase states, they take no more than dozens of timesteps to be discovered

Name  Lines of code (A)  Lines of code (A')  Max # of TS tested  Max # of TS required
BEC2  1.5K               856                 1,000               1
XGC   93.7K              7.7K                1,000               36
GTS   178.4K             13.7K               200                 2
Cross-Platform Relative Performance
[Figure: cross-platform relative performance results for BTIO, CG, SP, and LU]
Asynchronous I/O Configuration Assessment
[Figure: asynchronous I/O configuration assessment results for BEC2, GTS, and XGC]
Comparing with Other Profile-based Benchmark Generation Techniques
APPrime [21]
• Generated benchmark: large-scale iterative parallel benchmark
• Application-specific: yes
• Source of profile: own processing of execution traces
• Targets of profiling: recurrent event sequences; event parameter / inter-arrival distributions

BenchMaker [12]
• Generated benchmark: single-process (multi-threaded) benchmark
• Application-specific: yes
• Source of profile: user's input
• Targets of profiling: instruction mix; branch probabilities; instruction-level parallelism; locality

HBench [13]
• Generated benchmark: Java benchmark
• Application-specific: yes
• Source of profile: JVM profilers
• Targets of profiling: frequently invoked methods; function invocation counts; time cost
Other Related Work
• Communication trace collection: TAU [3], DUMPI [4], ScalaTrace [5]
• Trace reduction
  • Lossy: Xu's work [6], Cypress [7]
  • Lossless: ScalaTrace [5]
• Profiling: HPCToolkit [8], Scalasca Performance Toolset [9]
• Trace-based application analysis: ScalaExtrap [10], Casas' work [11]
• Benchmark generation
  • Trace-based: ScalaBenchGen [14]
  • Source-code slicing: FACT [15]
Ongoing Work
• Filling in computation kernel generation
  – Currently using histograms to model "bubble size"
  – Planned COMPrime tool
    • Recursive step on single-process computation kernels
    • Instruction mix, memory access (more challenging)
• Modeling scaling behavior
  – Take input traces of app A collected at different scales
    • Problem size, execution size
  – Can we simulate weak/strong scaling behavior with A'?
• Connecting with collaborative work on scalable tracing
• Release full benchmarks!
Thanks!
APPrime References
1. L. T. Yang, X. Ma, and F. Mueller. Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution. In Proceedings of the ACM/IEEE SC 2005 Conference (Supercomputing), 2005.
2. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. Int. J. High Perform. Comput. Appl., 24(3):335-360, 2010.
3. S. Shende and A. D. Malony. TAU: The TAU Parallel Performance System. International Journal of High Performance Computing Applications, 20(2), 2006.
4. J. P. Kenny, G. Hendry, B. Allan, and D. Zhang. DUMPI: The MPI Profiler from the SST Simulator Suite, 2011.
5. M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski. ScalaTrace: Scalable Compression and Replay of Communication Traces for High-Performance Computing. J. Parallel Distrib. Comput., 2009.
6. Q. Xu, J. Subhlok, R. Zheng, and S. Voss. Logicalization of Communication Traces from Parallel Execution. In IEEE IISWC, 2009.
7. J. Zhai, J. Hu, X. Tang, X. Ma, and W. Chen. Cypress: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression. In SC14, 2014.
8. L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. CCPE, 2010.
9. M. Geimer, F. Wolf, B. J. N. Wylie, E. Abraham, D. Becker, and B. Mohr. The Scalasca Performance Toolset Architecture. In CCPE, 2010.
10. X. Wu and F. Mueller. ScalaExtrap: Trace-based Communication Extrapolation for SPMD Programs. In ACM PPoPP, 2011.
11. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. IJHPCA, 2010.
12. J. Dujmovic. Automatic Generation of Benchmark and Test Workloads. In WOSP/SIPEW, 2010.
13. X. Zhang and M. Seltzer. HBench:Java: An Application-Specific Benchmarking Framework for Java Virtual Machines. In Proceedings of the ACM 2000 Conference on Java Grande (JAVA '00), 2000.
14. X. Wu, V. Deshpande, and F. Mueller. ScalaBenchGen: Auto-Generation of Communication Benchmarks Traces. In IEEE IPDPS, 2012.
15. J. Zhai, T. Sheng, J. He, W. Chen, and W. Zheng. FACT: Fast Communication Trace Collection for Parallel Applications Through Program Slicing. In SC09, 2009.
16. GTC-benchmark in NERSC-8 suite, 2013.
17. NASA. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html, 2003.
18. A. M. Joshi, L. Eeckhout, and L. K. John. The Return of Synthetic Benchmarks. In SPEC Benchmark Workshop, 2008.
19. J. Logan, S. Klasky, H. Abbasi, Q. Liu, G. Ostrouchov, M. Parashar, N. Podhorszki, Y. Tian, and M. Wolf. Understanding I/O Performance Using I/O Skeletal Applications. In Euro-Par, Springer-Verlag, 2012.
20. H. Shan and J. Shalf. Using IOR to Analyze the I/O Performance for HPC Platforms. In CUG, 2007.