
S3D: Performance Impact of Hybrid XT3/XT4
Sameer Shende
[email protected]


Acknowledgements

• Alan Morris [UO]
• Kevin Huck [UO]
• Allen D. Malony [UO]
• Kenneth Roche [ORNL]
• Bronis R. de Supinski [LLNL]
• John Mellor-Crummey [Rice]
• Nick Wright [SDSC]
• Jeff Larkin [Cray, Inc.]

The performance data presented here is available at:

http://www.cs.uoregon.edu/research/tau/s3d


TAU Parallel Performance System

http://www.cs.uoregon.edu/research/tau/
• Multi-level performance instrumentation (see the sketch after this list)
  • Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
• Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid
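
As a concrete illustration of what source instrumentation means at the lowest level, here is a minimal sketch using TAU's manual C instrumentation API; the routine name is hypothetical, and the S3D study below relies on automatic PDT-based instrumentation rather than hand-inserted timers.

    #include <TAU.h>   /* TAU's manual C instrumentation API (macros) */

    /* Hypothetical routine used only to illustrate a hand-inserted timer. */
    void compute_step(void)
    {
        TAU_PROFILE_TIMER(t, "compute_step", "", TAU_USER);
        TAU_PROFILE_START(t);
        /* ... numerical work would go here ... */
        TAU_PROFILE_STOP(t);
    }

    int main(int argc, char **argv)
    {
        TAU_PROFILE_INIT(argc, argv);
        TAU_PROFILE_SET_NODE(0);   /* non-MPI example; with TAU's MPI wrapper the rank is set automatically */
        compute_step();
        return 0;
    }

Compiled through a wrapper such as tau_cc.sh (used later in the build notes), this produces a profile that ParaProf can display.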


The Story So Far...
• Scalability study of S3D using TAU
  • MPI_Wait
  • I/O (WRITE_SAVEFILE)
  • Loop: ComputeSpeciesDiffFlux (630-656) [Rice, SDSC]
  • Loop: ReactionRateBounds (374-386) [exp]
• 3D scatter plots pointed to a single “slow” node before
• Identifying individual nodes by mapping ranks to nodes within TAU
  • Cray utilities: nodeinfo, xtshowmesh, xtshowcabs
• Ran a 6400 core simulation to identify XT3/XT4 partition performance issues (removed -feature=xt3)


Total Runtime Breakdown by Events - Time
[Chart: total runtime broken down by event; labeled events include MPI_Wait and WRITE_SAVEFILE]


Relative Efficiency


MPI Scaling


Relative Efficiency & Speedup for One Event


ParaProf’s Source Browser (8 core profile)


Case Study

• Harness testcase
• Platform: Jaguar, combined Cray XT3/XT4 at ORNL, 6400 processors
• Goal:
  • To evaluate the performance impact of combined XT3/XT4 nodes on S3D executions
  • Performance evaluation of MPI_Wait
  • Study the mapping of MPI ranks to nodes


TAU: ParaProf Profile


Overall Mean Profile: Exclusive Wallclock Time


Overall Inclusive Time


Mean MFLOPS observed over all ranks


Inclusive Total Instructions Executed


Total Instructions Executed (Exclusive)


Comparing Exclusive PAPI Counters, MFLOPS


3D Scatter Plots

• Plot four routines along the X, Y, Z, and color axes
• Each routine has a range (max, min)
• Each process (rank) has a unique position along the three axes and a unique color
• Allows us to examine the distribution of nodes (clusters); a small sketch of this mapping follows
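
A minimal sketch of the mapping such a plot implies, assuming hypothetical metric arrays: each rank's value for each of the four selected routines is rescaled by that routine's observed (min, max) range, giving a position in the unit cube plus a color value.

    #include <stddef.h>

    /* Rescale one value into [0,1] using that metric's observed range.
     * If every rank reports the same value, return 0 to avoid dividing by zero. */
    static double normalize(double v, double lo, double hi)
    {
        return (hi > lo) ? (v - lo) / (hi - lo) : 0.0;
    }

    /* Hypothetical layout: metric[m][r] is the value of the m-th selected routine for
     * MPI rank r. Metrics 0-2 become the X, Y, Z coordinates; metric 3 drives the color. */
    void to_scatter_coords(const double *metric[4], size_t nranks,
                           double *x, double *y, double *z, double *color)
    {
        double lo[4], hi[4];
        for (int m = 0; m < 4; m++) {
            lo[m] = hi[m] = metric[m][0];
            for (size_t r = 1; r < nranks; r++) {
                if (metric[m][r] < lo[m]) lo[m] = metric[m][r];
                if (metric[m][r] > hi[m]) hi[m] = metric[m][r];
            }
        }
        for (size_t r = 0; r < nranks; r++) {
            x[r]     = normalize(metric[0][r], lo[0], hi[0]);
            y[r]     = normalize(metric[1][r], lo[1], hi[1]);
            z[r]     = normalize(metric[2][r], lo[2], hi[2]);
            color[r] = normalize(metric[3][r], lo[3], hi[3]);
        }
    }

Ranks that land close together in this space share similar performance behavior, which is what the two clusters on the next slide show.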


Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!


3D Triangle Mesh Display

• Plot MPI rank, routine name, and exclusive time along the X, Y, and Z axes
• Color can be shown by a fourth metric
• Scalable view, suitable for a very large number of processors


MPI_Wait: 3D View


3D View: Zooming In... Jagged Edges!


3D View: Uh Oh!


Zoom, Change Color to L1 Data Cache Misses

• Loop in ComputeSpeciesDiffFlux (630-656) has high L1 data cache misses (red)
• It takes longer to execute on this “slice” of processors, as do other routines. Slower memory?


Changing Color to MFLOPS

• Loop in ComputeSpeciesDiffFlux (630-656) shows lower MFLOPS (dark blue)


Getting Back to MPI_Wait()

• Why does MPI_Wait take less time on these cores?

• What does the profile of MPI_Wait look like?


MPI_Wait - Sorted by Exclusive Time

• MPI_Wait takes 435.84 seconds on rank 3101
• It takes 59.6 s on rank 3233 and 29.2 s on rank 3200
• It takes 15.49 seconds on rank 0!
• How is rank 3101 different from rank 0?


Comparing Ranks 3101 and 0 (extremes)


Comparing Inclusive Times - Same for S3D


Comparing PAPI Floating Point Instructions

• PAPI_FP_INS counts are the same on both ranks, as expected


Comparing Performance - MFLOPS

• For the memory-intensive loop in ComputeSpeciesDiffFlux, rank 0 achieves only 65% of the MFLOPS of rank 3101 (114 vs. 174 MFLOPS)!


Comparing MFLOPS: Rank 3101 vs Rank 0

• Rank 0 appears to be “slower” than rank 3101
• Are there other nodes that are similarly slow, with shorter wait times?
• What does the MPI_Wait profile look like over all nodes?


MPI_Wait Profile

What is this rank?


MPI_Wait Profile Shifts at rank 114!

• Ranks 0 through 113 take less time in MPI_Wait than rank 114 and beyond...


Another Shift in MPI_Wait()

• This shift is observed in ranks 3200 through 3313
• Again 114 processors... (like ranks 0 through 113)
• Hmm...
• How do other routines perform on these ranks?
• What are the physical node IDs?


MPI_Wait

• While MPI_Wait takes less time on these CPUs, other routines take longer
• Points to a load imbalance!


Identifying Physical Processors using Metadata


MetaData for Ranks 3200 and 0

• Ranks 3200 and 0 both lie on the same physical node, nid03406!


Mapping Ranks from TAU to Physical Processors

• Ranks 0..113 lie on physical processors 3406..3551
• Ranks 3200..3313 are also on processors 3406..3551 (see the standalone sketch below)
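
For reference, a minimal standalone sketch of building such a rank-to-node mapping, assuming plain MPI and the /proc/cray_xt/nid file that Cray XT compute nodes expose; the study itself reads the node identity from TAU's metadata rather than from code like this.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Read this node's Cray node id (nid); returns -1 if the file cannot be read. */
    static int read_nid(void)
    {
        int nid = -1;
        FILE *f = fopen("/proc/cray_xt/nid", "r");
        if (f) {
            if (fscanf(f, "%d", &nid) != 1) nid = -1;
            fclose(f);
        }
        return nid;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int nid = read_nid();
        int *nids = (rank == 0) ? malloc((size_t)size * sizeof *nids) : NULL;
        MPI_Gather(&nid, 1, MPI_INT, nids, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            for (int r = 0; r < size; r++)   /* e.g. "rank 3200 -> nid03406" */
                printf("rank %d -> nid%05d\n", r, nids[r]);
            free(nids);
        }
        MPI_Finalize();
        return 0;
    }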


Results from Cray’s nodeinfo Utility

• Processors 3406..3551 (physical IDs) are located on the XT3 partition
• The XT3 partition has slower DDR-400 memory (5986 MB/s) and a slower SeaStar (SS1) interconnect (1109 MB/s)
• The XT4 partition has faster DDR2-667 memory modules (7147 MB/s) and a faster SeaStar2 (SS2) interconnect (2022 MB/s)
• That works out to roughly 19% more memory bandwidth and 82% more interconnect bandwidth on the XT4 nodes


Location of Physical Nodes in the Cabinets

• Using the Cray xtshowcabs and xtshowmesh utilities
• All nodes marked with job “c” came from our S3D job


xtshowcabs

• Nodes marked with a “c” are from our S3D run
• What does the mesh look like?


xtshowmesh (1 of 2)

• Nodes marked with a “c” are from our S3D run


xtshowmesh (2 of 2)

• Nodes marked with a “c” are from our S3D run


Conclusions
• Using a combination of XT3/XT4 nodes slowed down parts of S3D
• The application spends a considerable amount of time spinning/polling in MPI_Wait
• The load imbalance is probably caused by non-uniform nodes
• Conducted a performance characterization of S3D
  • This data will help derive communication models that explain the observed performance [John Mellor-Crummey, Rice]
  • Techniques to improve cache memory utilization in the loops identified by TAU will help overall performance [SDSC, Rice]
  • I/O characterization of S3D will help identify I/O scaling issues


S3D - Building with TAU
• Change the name of the compiler in build/make.XT3:
  ftn => tau_f90.sh
  cc  => tau_cc.sh
• Set compile-time environment variables:
  setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
  • Disables tracking of message communication statistics in TAU (MPI_Comm_compare() is not called inside TAU’s MPI wrapper)
  • Chooses callpath, PAPI counters, MPI profiling, and PDT for source instrumentation
  setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
  • The selective instrumentation file eliminates instrumentation in lightweight routines
  • Pre-processes Fortran source code with cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI counter selection in the job submission script (a standalone PAPI sketch follows this list):
  export TAU_THROTTLE=1
  export COUNTER1=GET_TIME_OF_DAY
  export COUNTER2=PAPI_FP_INS
  export COUNTER3=PAPI_L1_DCM
  export COUNTER4=PAPI_TOT_INS
  export COUNTER5=PAPI_L2_DCM
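
The COUNTERn variables select what TAU records alongside wallclock time; MFLOPS figures like the ones earlier in the deck are essentially PAPI_FP_INS divided by elapsed time. As a point of reference, here is a minimal standalone PAPI sketch, independent of TAU and with a hypothetical kernel standing in for the S3D loops, showing how those counters translate into an MFLOPS figure.

    #include <papi.h>
    #include <stdio.h>

    #define N 10000000

    int main(void)
    {
        static double a[N], b[N];                        /* hypothetical data, ~80 MB each */
        int events[2] = { PAPI_FP_INS, PAPI_L1_DCM };
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

        int evset = PAPI_NULL;
        PAPI_create_eventset(&evset);
        PAPI_add_events(evset, events, 2);

        long long t0 = PAPI_get_real_usec();
        PAPI_start(evset);

        /* Hypothetical memory-intensive loop standing in for the S3D kernels. */
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * b[i] + 1.0;

        PAPI_stop(evset, counts);
        double seconds = (PAPI_get_real_usec() - t0) * 1.0e-6;

        printf("a[0]=%g  PAPI_FP_INS=%lld  PAPI_L1_DCM=%lld\n", a[0], counts[0], counts[1]);
        printf("MFLOPS ~= %.1f\n", counts[0] / seconds / 1.0e6);
        return 0;
    }

With the multiplecounters/papi TAU configuration named above, the equivalent per-routine, per-rank values are collected automatically and browsed in ParaProf.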


Selective Instrumentation in TAU

% cat select.tau
BEGIN_EXCLUDE_LIST

MCADIF

GETRATES

TRANSPORT_M::MCAVIS_NEW

MCEDIF

MCACON

CKYTCP

THERMCHEM_M::MIXCP

THERMCHEM_M::MIXENTH

THERMCHEM_M::GIBBSENRG_ALL_DIMT

CKRHOY

MCEVAL4

THERMCHEM_M::HIS

THERMCHEM_M::CPS

THERMCHEM_M::ENTROPY

END_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION

loops routine="#"

END_INSTRUMENT_SECTION


Getting Access to TAU on Jaguar
• set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
• Choose stub makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  • Makefile.tau-mpi-pdt-pgi (flat profile)
  • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
• Binaries of S3D can be found in ~sameer/scratch/S3D-BINARIES
  • withtau (built with the papi, multiplecounters, mpi, pdt, pgi options)
  • without_tau


Concluding Discussion
• Performance tools must be used effectively
• More intelligent performance systems for productive use
  • Evolve to application-specific performance technology
  • Deal with scale by “full range” performance exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense, but should be more automatically controlled and efficiently used
• Develop next-generation tools and deliver them to the community
  • Open source, with support by ParaTools, Inc.
  • http://www.cs.uoregon.edu/research/tau


Support Acknowledgements

Department of Energy (DOE)

• Office of Science
• LLNL, LANL, ORNL
• ASC, PERI