TRANSCRIPT
Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation
Ye Jin, Xiaosong Ma, Mingliang Liu, Qing Liu, Jeremy Logan, Norbert Podhorszki,
Jong Youl Choi, Scott Klasky
Systems and Applications More Complex
Powerful supercomputers:
• Large number of nodes
• Deeper memory and I/O stack
• Heterogeneous architecture
• System-specific interconnect
Large-scale codes from diverse scientific domains solving real-world problems
More Choices for HPC Platforms
[Figure: spectrum of HPC platform choices — single server, local cluster (10 Gb network), HPC center, HPC in cloud]
Performance Study Crucial
• Yet remains challenging
  – Single-platform analysis is resource-consuming
  – Not to mention cross-platform analysis
• Benchmarks more important than ever
  – Evaluate machines
  – Validate hardware/software design
  – Select from candidate platforms
• Realistic benchmarks hard to find
Benchmarks and Generation Tools
Type                Example                Pros                   Cons
Kernels             NPB, SPEC, Intel-MPI   Real, parametric       Simple, non-HPC, less flexible
Manually extracted  FlashIO, GTC-Bench     Realistic, parametric  Labor-intensive, easily obsolete
Trace-based         ScalaBenchGen          Automatic              Replay-based, non-parametric, platform-dependent
Specialized (I/O)   IOR, Skel              Parametric             I/O phase only
Automatic, full-application benchmark extraction?
Outline
• Motivation
• Our recent work related to automatic application benchmark extraction
  • APPrime framework (SIGMETRICS'15, [21])
    • Led by NCSU PhD student Ye Jin
    • Collaboration with ORNL
  • Cypress tool for communication trace compression (SC'14, [7])
• Closing remarks
Desired Features
• Based on real, large-scale applications
• Leveraging existing tracing tools
• Automatic source code generation
• Concise, configurable, portable benchmarks
• With relative performance retained
[Figure: application traces flow into our system, which emits a generated benchmark]
Sample Use Case 1: Cross-Platform Performance Estimation
• Estimate relative performance on candidate machines
[Figure: speed-up ratio, Titan supercomputer to Sith cluster (ORNL)]
• Relative performance is highly case-dependent, varying across applications, execution scales, and tasks (computation, communication, I/O)
Sample Use Case 2: I/O Method Selection
[Figure: parallel I/O with multi-level data staging — a simulation job on compute nodes writes through staging nodes (main memory, SSD), I/O nodes, and a SAN over the interconnection network]
• Lots of I/O options available: # of files, sync or async, # of staging nodes, local or remote SSD, I/O library, stripe width, I/O frequency, …
• Realistic I/O benchmarks allow users and I/O system designers to
  • consider the interplay between I/O and other activities
  • evaluate I/O options with portable, light-weight benchmarks
  • assess candidate I/O designs/configurations
APPrime Overview
• Automatic, whole-application benchmark generation
  – Input: parallel execution traces of application A on one platform
  – Output: "fake application" A', simulating A's behavior
    • Computation, communication, I/O, scaling
    • Portable, shorter source code using few libraries
• APPrime main idea
  – Get information from traces, but do not replay
  – Differentiate between regular and irregular behavior
    • Be exact with regular activity (loops)
    • Model any irregularity as a statistical distribution (histograms)
• Current status
  – Ready: overall framework, communication, I/O
  – To-do: computation kernel
Assumed Computation Model
• Iterative parallel applications have regular execution patterns
• In the form I(C*W)*F [1]
  – I: one-time initialization phase (head)
  – F: one-time finalization phase (tail)
  – C: timestep computation phase (with communication)
  – W: periodic I/O phase
[Figure: execution structure — I, then (C repeated x times, then W) repeated y times, then F; each C phase is a sequence of events separated by computation "bubbles"]
• APPrime automatically identifies phases from traces, without any involvement of the programmer/user
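As an illustration of this model (not APPrime's actual detection logic): once a trace is reduced to a compact string with one letter per MPI event, as shown under "Event Table to Trace String" below, the I(C*W)*F structure can be checked mechanically. A minimal C sketch with made-up letter assignments:

    /* A minimal illustrative sketch, not APPrime's algorithm: check the
     * I(C*W)*F structure with a POSIX regex over a compact trace string.
     * Letter assignments are invented: "ab" = init events, "ccd" = one
     * C-phase event sequence, "ef" = a W (I/O) phase, "gh" = finalization. */
    #include <regex.h>
    #include <stdio.h>

    int main(void) {
        const char *trace = "abccdccdefccdccdefgh";
        regex_t re;
        /* head, then (C-phase events repeated, then I/O) repeated, then tail */
        regcomp(&re, "^ab((ccd)+ef)+gh$", REG_EXTENDED);
        printf("matches I(C*W)*F? %s\n",
               regexec(&re, trace, 0, NULL, 0) == 0 ? "yes" : "no");
        regfree(&re);
        return 0;
    }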
Complications from Real Large-scale Apps
• Challenges
  – Noise (irregular activities)
    • Found to be minor across all applications we studied
  – Multiple I/O phases
  – Heterogeneous C-phase communication behavior
    • Identical event sequence, different parameters
• Solutions
  – Extend C to C[0, a] D0|1 C[b, |C|]
    • Allow a minor noise phase D
    • Ignored in benchmark generation
  – Extend W to Wi
    • Multiple I/O phases, each with an individual (fixed) frequency
  – Use a Markov Chain Model (MCM) to simulate transitions between multiple C phases
APPrime Workflow
[Figure: the APPrime automatic benchmark generation framework. Input: DUMPI and ScalaTrace traces. In the Extractor, a parser factory selects a trace parser that builds per-process event tables, which are merged across tables; the phase identifier then labels the phases in each table as head (I), noise (D), periodic I/O (W), major loops (C), and tail (F). In the Generator, static phases (head, tail) are carried over directly, the I/O translator handles W phases, and the MCM builder turns C phases into MC states; the code generator emits the APPrime benchmark as output: a configuration parameter file plus source code.]
Trace Parsing: Trace to Event Table

Original ASCII DUMPI trace (excerpt):
• MPI_Bcast entering at walltime 102625.244, int count=1, MPI_Datatype datatype=4 (MPI_INT), int root=0, MPI_Comm comm=4 (user-defined-comm), MPI_Bcast returning at walltime 102625.244.
• MPI_Barrier entering at walltime 102625.245, MPI_Comm comm=5 (user-defined-comm), MPI_Barrier returning at walltime 102625.253.
• MPI_File_open entering at walltime 102627.269, MPI_Comm comm=5 (user-defined-comm), int amode=0 (CREATE), filename="simple.out", MPI_Info info=0 (MPI_INFO_NULL), MPI_File file=1 (user-file), MPI_File_open returning at walltime 102627.439.

Sample joint per-process event table (Phase ID and Phase type columns to be filled by the phase identifier):
MPI function name  Start   End     Data count  Root  Comm. rank  File access mode  Phase ID  Phase type  …
MPI_Bcast          …5.244  …5.245  1           0     4           N/A               (tbd)     (tbd)       …
MPI_Barrier        …5.245  …5.253  N/A         N/A   5           N/A               (tbd)     (tbd)       …
MPI_File_open      …7.269  …7.439  N/A         N/A   5           CREATE            (tbd)     (tbd)       …
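To make the table concrete, here is a minimal sketch of what one event-table record might look like in C; the struct and field names are illustrative assumptions, not APPrime's actual internals:

    #include <stdio.h>

    /* Illustrative event-table record; field names are assumptions. */
    typedef struct {
        char   func[32];     /* MPI function name, e.g. "MPI_Bcast"     */
        double start, end;   /* entering / returning walltimes          */
        int    data_count;   /* element count for transfers (-1 = N/A)  */
        int    root;         /* root rank for collectives (-1 = N/A)    */
        int    comm;         /* communicator rank/id                    */
        char   amode[16];    /* file access mode, "" if N/A             */
        int    phase_id;     /* filled in later by the phase identifier */
        char   phase_type;   /* 'I', 'C', 'W', 'D', or 'F'              */
    } event_t;

    int main(void) {
        /* The MPI_Bcast record from the trace excerpt above. */
        event_t e = { "MPI_Bcast", 102625.244, 102625.244,
                      1, 0, 4, "", -1, '?' };
        printf("%s [%.3f, %.3f] count=%d root=%d comm=%d\n",
               e.func, e.start, e.end, e.data_count, e.root, e.comm);
        return 0;
    }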
Event Table to Trace String
MPI function name  Start   End     Data count  Root  Comm. rank  File access mode  Phase ID  Phase type  …
MPI_Bcast          …5.244  …5.245  1           0     4           N/A               N/A       N/A         …
MPI_Barrier        …5.245  …5.253  N/A         N/A   5           N/A               N/A       N/A         …
MPI_File_open      …7.269  …7.439  N/A         N/A   5           CREATE            N/A       N/A         …

Event-to-letter mapping: MPI_Init => 'a', MPI_Barrier => 'c', MPI_Bcast => 'd', MPI_File_open => 'f', …, MPI_Finalize => 'h'
Compact trace string: ab…ccd…ccd…ef…ccd…ccd…ef…gh

• APPrime deploys a new string processing algorithm that
  • automatically identifies all phases
  • by searching for the partitioning that maximizes inter-iteration repetition
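A minimal sketch of the encoding step, with illustrative names (the tool's actual code is not shown in the talk): assign each distinct MPI function the next unused letter and emit one character per event, producing the compact string that the phase-identification search runs over.

    /* Illustrative only: encode an event sequence as a compact string. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_FUNCS 64

    /* Return the letter for a function name, assigning a new one on first sight. */
    static char func_to_char(const char *func, char names[][32], int *n) {
        for (int i = 0; i < *n; i++)
            if (strcmp(names[i], func) == 0)
                return (char)('a' + i);    /* already seen: reuse its letter */
        strcpy(names[*n], func);           /* new function: next free letter */
        return (char)('a' + (*n)++);
    }

    int main(void) {
        const char *events[] = { "MPI_Init", "MPI_Bcast", "MPI_Barrier",
                                 "MPI_Barrier", "MPI_Bcast", "MPI_File_open",
                                 "MPI_Finalize" };
        int n = sizeof events / sizeof events[0];
        char names[MAX_FUNCS][32], str[128];
        int n_names = 0;
        for (int i = 0; i < n; i++)
            str[i] = func_to_char(events[i], names, &n_names);
        str[n] = '\0';
        printf("compact trace string: %s\n", str);   /* prints "abccbde" */
        return 0;
    }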
Computation Gap across Timesteps

[Figure: the event table of one process's first timestep (C phase). Each timestep 1..n repeats the same event sequence (e.g. MPI_Bcast, MPI_Isend, …), separated by computation gaps ("bubbles" 1.1, 1.2, …, n.1, n.2, …). The bubble durations across timesteps are collected into histograms; the events themselves populate the per-process event table (function name, start/end walltimes, data count, root, communicator rank, file access mode, phase ID, phase type, …).]
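The per-bubble gap sizes are modeled as histograms. A minimal sketch with invented values and an assumed fixed-width binning scheme (not necessarily APPrime's actual one):

    /* Illustrative only: collect one bubble's duration across timesteps
     * into a fixed-width histogram for the generated benchmark to sample. */
    #include <stdio.h>

    #define NBINS 10

    typedef struct {
        double min, width;    /* bin 0 starts at min; each bin is `width` wide */
        int    count[NBINS];
    } histogram_t;

    void hist_add(histogram_t *h, double gap) {
        int bin = (int)((gap - h->min) / h->width);
        if (bin < 0) bin = 0;
        if (bin >= NBINS) bin = NBINS - 1;   /* clamp outliers to edge bins */
        h->count[bin]++;
    }

    int main(void) {
        /* Invented gaps between MPI_Bcast's return and MPI_Isend's entry. */
        double bubbles[] = { 0.013, 0.011, 0.012, 0.014, 0.011, 0.052 };
        histogram_t h = { 0.010, 0.005, {0} };
        for (int t = 0; t < 6; t++)
            hist_add(&h, bubbles[t]);
        for (int b = 0; b < NBINS; b++)
            printf("[%.3f, %.3f): %d\n",
                   h.min + b * h.width, h.min + (b + 1) * h.width, h.count[b]);
        return 0;
    }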
Inter-Process Event Table Merging

Merging per-process event tables: identical event sequences from different processes are combined, with differing parameters (here, the source rank) collected into sets.

Per-process event table, process i:
MPI function name  Data count  Type     Dest.  Src.  Comm. rank  …
MPI_Irecv          20          MPI_INT  N/A    4     4           …
MPI_Send           20          MPI_INT  4      N/A   4           …

Per-process event table, process j:
MPI function name  Data count  Type     Dest.  Src.  Comm. rank  …
MPI_Irecv          20          MPI_INT  N/A    8     4           …
MPI_Send           20          MPI_INT  4      N/A   4           …

Merged event table:
MPI function name  Data count  Type     Dest.  Src.       Comm. rank  …
MPI_Irecv          20          MPI_INT  N/A    {4, 8, …}  4           …
MPI_Send           20          MPI_INT  4      N/A        4           …
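A minimal sketch of the merge step, with illustrative names: rows that match on everything except (say) the source rank are collapsed, and the differing values accumulate into a set.

    /* Illustrative only: merge differing source ranks into a set. */
    #include <stdio.h>

    #define MAX_SRCS 16

    typedef struct {
        const char *func;
        int count;
        int srcs[MAX_SRCS];   /* set of source ranks seen across processes */
        int n_srcs;
    } merged_event_t;

    void merge_src(merged_event_t *e, int src) {
        for (int i = 0; i < e->n_srcs; i++)
            if (e->srcs[i] == src)
                return;                  /* already in the set */
        e->srcs[e->n_srcs++] = src;
    }

    int main(void) {
        merged_event_t irecv = { "MPI_Irecv", 20, {0}, 0 };
        merge_src(&irecv, 4);   /* from one process's table     */
        merge_src(&irecv, 8);   /* from another process's table */
        printf("%s count=%d srcs={", irecv.func, irecv.count);
        for (int i = 0; i < irecv.n_srcs; i++)
            printf("%s%d", i ? ", " : "", irecv.srcs[i]);
        printf("}\n");
        return 0;
    }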
Markov Chain Model for C-Phase States
Each distinct C-phase behavior becomes a Markov Chain (MC) state holding its merged event table:

MC State Rank 1 (from merged timestep m):
No.  Name       Count  Type     Dest.  Src.
1    MPI_Irecv  20     MPI_INT  N/A    {4, 8, …}
2    MPI_Send   20     MPI_INT  4      N/A
…    …          …      …        …      …

MC State Rank 2 (from merged timestep m+n):
No.  Name       Count  Type     Dest.  Src.
1    MPI_Irecv  80     MPI_INT  N/A    {1, 7, …}
2    MPI_Send   80     MPI_INT  1      N/A
…    …          …      …        …      …

Transition probability matrix:
         State 1  State 2
State 1  0.3      0.7
State 2  0.7      0.3
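A minimal sketch of state selection from the matrix above, using names like those in the generated code below (the generated trans_state also takes the timestep; this simplified illustration just samples the current state's row):

    /* Illustrative only: sample the next MC state from the transition matrix. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NSTATES 2

    /* Row = current state; column = probability of moving to that state. */
    static const double trans[NSTATES][NSTATES] = {
        { 0.3, 0.7 },
        { 0.7, 0.3 },
    };

    int trans_state(int state_rank) {
        double r = (double)rand() / RAND_MAX, cum = 0.0;
        for (int j = 0; j < NSTATES; j++) {
            cum += trans[state_rank][j];
            if (r < cum)
                return j;
        }
        return NSTATES - 1;   /* guard against rounding when cum ~ 1.0 */
    }

    int main(void) {
        int state = 0;
        for (int t = 0; t < 5; t++) {
            state = trans_state(state);
            printf("timestep %d -> state %d\n", t, state);
        }
        return 0;
    }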
Benchmark Code Generation

int main(int argc, char* argv[]) {
    apprime_init(argc, argv);
    init_phase();     // direct replay of I phase
    // major loop
    for (timestep = 0; timestep < total_timestep; timestep++) {
        run_state_for_C_phase(state_rank, event_tables);
        // update next state rank: select the next MC state for the C phase
        state_rank = trans_state(state_rank, timestep);
        // periodic I/O phases (Wi, here i = 1)
        if ((timestep + 1) % restart_period_1 == 0)
            W_phase_1();
        …
    }
    final_phase();    // direct replay of F phase
    apprime_finalize();
    return 0;
}
Evaluation
• Platforms: Titan and Sith at ORNL
• Workloads:
  • Real-world HPC applications:
    • Quantum turbulence code: BEC2
    • Gyrokinetic particle simulations: XGC and GTS
  • NAS benchmarks: BTIO, LU, SP, CG

Name   # of nodes  Cores per node  Mem. per node  OS            File system
Titan  18,688      16              32 GB          Cray XK7      Lustre
Sith   40          32              64 GB          Linux x86_64  Lustre
HPC Applications
Name      Domain                 Typical prod. run scale (# of cores)  Open source?  Status
XGC       Gyrokinetic            225,280                               No            Done
GTS       Gyrokinetic            262,144                               No            Done
BEC2      Unitary qubit          110,592                               No            Done
QMC-Pack  Electronic molecular   256 – 16,000                          No            Applicable
S3D       Molecular physics      96,000 – 180,000                      No            Applicable
AWP-ODC   Wave propagation       223,074                               No            Applicable
NAMD      Molecular dynamics     1,000 – 20,000                        No            Applicable
HFODD     Nuclear                299,008                               Yes           Applicable
LAMMPS    Molecular dynamics     12,500 – 130,000                      Yes           Applicable
SPEC      Multi-domain benchmarks  150 – 600                           Yes           Applicable
NPB       Multi-domain benchmarks  N/A                                 Yes           Done
Applications' Trace Features
App   # procs  # TSs  Trace size  Table size  # events in one state  String size  # unique funcs  # states  D%    TSV%  Profile size
BTIO  64       250    832MB       584MB       183                    44.4KB       16              1         0%    2.1%  2.2MB
BTIO  256      250    7.02GB      4.75GB      266                    91.5KB       16              1         0%    4.3%  8.1MB
CG    64       100    1.42GB      1.00GB      783                    77.8KB       11              1         0%    1.5%  3.4MB
CG    256      100    7.51GB      5.50GB      1478                   101KB        11              1         0%    1.8%  11MB
SP    64       500    1.44GB      960MB       139                    68.1KB       15              1         0%    1.4%  1.4MB
SP    256      500    11.7GB      7.43GB      278                    138KB        15              1         0%    3.5%  6.1MB
LU    64       300    18GB        12.3GB      1604                   471KB        11              1         0%    2.3%  11MB
LU    256      300    75GB        51.3GB      1604                   471KB        11              1         0%    3.8%  44MB
BEC2  64       100    142MB       101MB       74                     7.5KB        14              1         0%    1.8%  1.1MB
BEC2  256      200    1.08GB      800MB       74                     14.7KB       14              1         0%    2.7%  3.6MB
XGC   64       100    262MB       243MB       73                     11.5KB       28              2         0.1%  4.3%  1.0MB
XGC   256      200    2.1GB       1.64GB      103                    15.3KB       28              2         0.1%  5.8%  1.4MB
GTS   64       50     213MB       137MB       391                    11.6KB       38              2         0.3%  5.6%  1.9MB
GTS   256      100    1.83GB      1.15GB      391                    24.9KB       38              2         0.3%  5.9%  7.2MB
Results: A vs. A’
• Comparing target application A with APPrime-generated benchmark A'
  – A' is much more compact and easier to build
  – If A has multiple C-phase states, they take no more than dozens of timesteps to be discovered

Name  Lines of code (A)  Lines of code (A')  Max # of TS tested  Max # of TS required
BEC2  1.5K               856                 1,000               1
XGC   93.7K              7.7K                1,000               36
GTS   178.4K             13.7K               200                 2
Cross-Platform Relative Performance
[Figure: cross-platform relative performance results for BTIO, CG, SP, and LU]
Asynchronous I/O Configuration Assessment
[Figure: asynchronous I/O configuration assessment results for BEC2, GTS, and XGC]
Comparing with Other Profile-based Benchmark Generation Techniques
APPrime [21]
• Generated benchmark: large-scale iterative parallel benchmark
• Application-specific: yes
• Source of profile: own processing of execution traces
• Targets of profiling: recurrent event sequences; event parameter / inter-arrival distributions

BenchMaker [12]
• Generated benchmark: single-process (multi-threaded) benchmark
• Application-specific: yes
• Source of profile: user's input
• Targets of profiling: instruction mix; branch probabilities; instruction-level parallelism; locality

HBench [13]
• Generated benchmark: Java benchmark
• Application-specific: yes
• Source of profile: JVM profilers
• Targets of profiling: frequently invoked methods; function invocation counts; time cost
Other Related Work
• Communication trace collection: TAU [3], DUMPI [4], ScalaTrace [5]
• Trace reduction
  • Lossy: Xu's work [6], Cypress [7]
  • Lossless: ScalaTrace [5]
• Profiling: HPCToolkit [8], Scalasca Performance Toolset [9]
• Trace-based application analysis: ScalaExtrap [10], Casas' work [11]
• Benchmark generation
  • Trace-based: ScalaBenchGen [14]
  • Source-code slicing: FACT [15]
Ongoing Work
• Filling in computation kernel generation
  – Currently using histograms to model "bubble size"
  – Planned COMPrime tool
    • Recursive step on single-process computation kernels
    • Instruction mix, memory access (more challenging)
• Modeling scaling behavior
  – Take input traces of app A collected at different scales
    • Problem size, execution size
  – Can we simulate weak/strong scaling behavior with A'?
• Connecting with collaborative work on scalable tracing
• Release full benchmarks!
Thanks!
APPrime References
1. L. T. Yang, X. Ma, and F. Mueller. Cross-Platform Performance Prediction of Parallel Applications Using Partial Execution. In Proceedings of the ACM/IEEE SC 2005 Conference (Supercomputing), 2005.
2. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. Int. J. High Perform. Comput. Appl., 24(3):335-360, 2010.
3. S. Shende and A. D. Malony. TAU: The TAU Parallel Performance System. International Journal of High Performance Computing Applications, 20(2), 2006.
4. J. P. Kenny, G. Hendry, B. Allan, and D. Zhang. DUMPI: The MPI Profiler from the SST Simulator Suite, 2011.
5. M. Noeth, P. Ratn, F. Mueller, M. Schulz, and B. R. de Supinski. ScalaTrace: Scalable Compression and Replay of Communication Traces for High-Performance Computing. J. Parallel Distrib. Comput., 2009.
6. Q. Xu, J. Subhlok, R. Zheng, and S. Voss. Logicalization of Communication Traces from Parallel Execution. In IEEE IISWC, 2009.
7. J. Zhai, J. Hu, X. Tang, X. Ma, and W. Chen. Cypress: Combining Static and Dynamic Analysis for Top-Down Communication Trace Compression. In SC14, 2014.
8. L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs. CCPE, 2010.
9. M. Geimer, F. Wolf, B. J. N. Wylie, E. Abraham, D. Becker, and B. Mohr. The Scalasca Performance Toolset Architecture. In CCPE, 2010.
10. X. Wu and F. Mueller. ScalaExtrap: Trace-based Communication Extrapolation for SPMD Programs. In ACM PPoPP, 2011.
11. M. Casas, R. M. Badia, and J. Labarta. Automatic Phase Detection and Structure Extraction of MPI Applications. IJHPCA, 2010.
12. J. Dujmovic. Automatic Generation of Benchmark and Test Workloads. In WOSP/SIPEW, 2010.
13. X. Zhang and M. Seltzer. HBench:Java: An Application-Specific Benchmarking Framework for Java Virtual Machines. In Proceedings of the ACM 2000 Conference on Java Grande (JAVA '00), 2000.
14. X. Wu, V. Deshpande, and F. Mueller. ScalaBenchGen: Auto-Generation of Communication Benchmarks Traces. In IEEE IPDPS, 2012.
15. J. Zhai, T. Sheng, J. He, W. Chen, and W. Zheng. FACT: Fast Communication Trace Collection for Parallel Applications Through Program Slicing. In SC09, 2009.
16. GTC-benchmark in NERSC-8 suite, 2013.
17. NASA. NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html, 2003.
18. A. M. Joshi, L. Eeckhout, and L. K. John. The Return of Synthetic Benchmarks. In SPEC Benchmark Workshop, 2008.
19. J. Logan, S. Klasky, H. Abbasi, Q. Liu, G. Ostrouchov, M. Parashar, N. Podhorszki, Y. Tian, and M. Wolf. Understanding I/O Performance Using I/O Skeletal Applications. In Euro-Par, Springer-Verlag, 2012.
20. H. Shan and J. Shalf. Using IOR to Analyze the I/O Performance for HPC Platforms. In CUG, 2007.