S3D: Comparing Performance of XT3+XT4 with XT4

Sameer Shende, tau-team@cs.uoregon.edu

Post on 22-Dec-2015


Page 1:

S3D: Comparing Performance of XT3+XT4 with XT4

Sameer Shende

tau-team@cs.uoregon.edu

Page 2:

TAU Performance System, S3D Scalability Study

Acknowledgements

• Alan Morris [UO]
• Kevin Huck [UO]
• Allen D. Malony [UO]
• Kenneth Roche [ORNL]
• Bronis R. de Supinski [LLNL]
• John Mellor-Crummey [Rice]
• Nick Wright [SDSC]
• Jeff Larkin [Cray, Inc.]

The performance data presented here is available at:

http://www.cs.uoregon.edu/research/tau/s3d

Page 3:

TAU Parallel Performance System

• http://www.cs.uoregon.edu/research/tau/
• Multi-level performance instrumentation
  • Multi-language automatic source instrumentation
• Flexible and configurable performance measurement
• Widely-ported parallel performance profiling system
  • Computer system architectures and operating systems
  • Different programming languages and compilers
• Support for multiple parallel programming paradigms
  • Multi-threading, message passing, mixed-mode, hybrid

Page 4:

The Story So Far...

• Scalability study of S3D using TAU
• 3D scatter plots and the mapping of ranks to physical processors point to partitioning in XT3/XT4
• Memory and network on the XT3 partition cause the rest of the application to slow down
• Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly
• Ran a 6400-core simulation on an XT4 partition to compare with XT3+XT4 (used #PBS -lfeature=xt4)...

Page 5:

3D Scatter Plots

• Plot four routines along X, Y, Z, and Color axes
• Each routine has a range (max, min)
• Each process (rank) has a unique position along the three axes and a unique color
• Allows us to examine the distribution of nodes (clusters)
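The mapping from per-rank timings to scatter-plot coordinates can be sketched as below. This is only an illustration of the idea described on the slide, not TAU's or ParaProf's implementation: each routine's times are scaled linearly over its own (min, max) range, and a rank's four scaled values give its X, Y, Z, and color coordinates. The routine names other than MPI_Wait and all timing values are hypothetical.

```python
from typing import Dict, List, Tuple


def normalize(values: List[float]) -> List[float]:
    """Scale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All ranks took the same time: place them all at the origin of this axis.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def scatter_coordinates(
    times: Dict[str, List[float]],
    routines: Tuple[str, str, str, str],
) -> List[Tuple[float, ...]]:
    """Return one (x, y, z, color) tuple per rank for the four chosen routines."""
    columns = [normalize(times[name]) for name in routines]
    return list(zip(*columns))


# Hypothetical exclusive times (seconds) for four ranks; index = MPI rank.
times = {
    "MPI_Wait": [15.0, 400.0, 410.0, 20.0],
    "LOOP_A": [300.0, 100.0, 110.0, 290.0],
    "LOOP_B": [50.0, 10.0, 12.0, 48.0],
    "LOOP_C": [8.0, 2.0, 2.5, 7.5],
}
points = scatter_coordinates(times, ("MPI_Wait", "LOOP_A", "LOOP_B", "LOOP_C"))
for rank, point in enumerate(points):
    print(rank, point)
```

With data like this, ranks 0 and 3 land near one corner of the plot and ranks 1 and 2 near the opposite one, which is how two node populations would show up as two clusters.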

Page 6:

Scatter Plot: 6400 cores XT3/XT4 - 2 Clusters!

Previous work proved: Blue nodes are XT3, Red are XT4

Page 7:

3D Triangle Mesh Display

• Plot MPI rank, routine name, and exclusive time along X, Y and Z axes
• Color can be shown by a fourth metric
• Scalable view
• Suitable for very large number of processors

Page 8:

XT3+XT4: MPI_Wait

• Gap represents XT3 nodes

Page 9:

3D View: Large MPI_Wait times on most CPUs

• To improve performance, we must reduce MPI_Wait time on other cpus

Page 10:

3D View: XT3 Partition, Imbalance

On XT3: MPI_Wait takes less time, other routines take more time!

Page 11:

Getting Back to MPI_Wait()

• MPI_Wait takes less time on XT3 nodes
• Other routines take longer

Page 12:

XT3+XT4: MPI_Wait - Sorted by Exclusive Time

• MPI_Wait takes 435.84 seconds on rank 3101
• It takes 15.49 seconds on rank 0!
• Rank 3101 is on XT4, rank 0 is on XT3
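To put the two MPI_Wait figures side by side, a quick calculation using only the numbers quoted on this slide (the variable names are illustrative):

```python
# Exclusive MPI_Wait times quoted on the slide, in seconds.
wait_rank_3101 = 435.84  # rank on an XT4 node
wait_rank_0 = 15.49      # rank on an XT3 node

ratio = wait_rank_3101 / wait_rank_0
difference = wait_rank_3101 - wait_rank_0

print(f"Rank 3101 waits {ratio:.1f}x longer than rank 0")
print(f"Extra time spent waiting: {difference:.2f} s")
```

The XT4 rank spends roughly 28 times longer in MPI_Wait than the XT3 rank, which fits the picture of faster nodes idling while slower ones catch up.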

Page 13:

Comparing XT4 and XT3 ranks (Best vs worst)

Page 14:

Improving S3D Performance

• Hypothesis: Running S3D on a ‘pure’ XT4 system will help improve the performance significantly and reduce the time spent idling in MPI_Wait

Page 15:

XT4 Profile: Main Window

Page 16:

XT4: Mean Profile Sorted by Exclusive Time

• MPI_Wait has moved down!

Page 17:

XT4: Mean Profile Sorted by Inclusive Time

Page 18:

Comparing XT4 with XT3+XT4

• MPI_Wait on XT4 takes 26% of the time it took on the combined XT3+XT4 system!

Page 19:

Comparing Mean Inclusive Time

Page 20:

XT4: 3D View

• The “exp” loop [~1GFlop] takes most time now!

Page 21:

XT3+XT4: Scatter Plot (Before)

Page 22:

XT4 Scatter Plot (After)

• MPI_Wait takes from 78 to 121 s now!

Page 23:

Comparing Performance

• Hypothesis confirmed: XT4 is faster than XT3+XT4
  • Inclusive time down from 1935 to 1702 s
  • 12% improvement
  • Saved 24853.3 minutes (414 hours) of wallclock time!
• Reduction in MPI_Wait time is most significant
  • 390 s (mean) down to 104 s (mean)
• Lessons learned:
  • Slower XT3 nodes can have a significant impact on a large-scale S3D run
  • S3D harness testcase does not perform well on non-homogeneous nodes
  • We recommend running S3D on XT4 partition only!
    #PBS -lfeature=xt4
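The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the slides (1935 s, 1702 s, 6400 cores, 390 s, 104 s). One observation, offered as an interpretation rather than something the slides state: the 24853.3-minute figure is reproduced when the per-run saving is multiplied by the 6400 cores, so it appears to be time summed over all cores.

```python
CORES = 6400             # size of the run, from the earlier slides

before_s = 1935.0        # inclusive time on XT3+XT4, seconds
after_s = 1702.0         # inclusive time on XT4 only, seconds

saved_per_core_s = before_s - after_s
improvement = saved_per_core_s / before_s

# Saving summed over all cores, in minutes and hours.
aggregate_minutes = saved_per_core_s * CORES / 60
aggregate_hours = aggregate_minutes / 60

# Mean MPI_Wait time after vs. before, from the rounded values on the slide.
wait_ratio = 104.0 / 390.0

print(f"Improvement: {improvement:.1%}")
print(f"Aggregate saving: {aggregate_minutes:.1f} min ({aggregate_hours:.0f} h)")
print(f"MPI_Wait after/before: {wait_ratio:.1%}")
```

The rounded means give an MPI_Wait ratio of about 27%, in line with the 26% quoted earlier, which was presumably computed from unrounded values.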

Page 24:

Discussion

• Did we get optimal performance on XT4 nodes?
• Are the nodes performing at similar rates uniformly now?
• Let us see the std. deviation plot of all routines...
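The idea behind a standard-deviation view can be sketched as follows: for each routine, compute the spread of its time across ranks, then sort so the least uniform routines come first. This is a minimal illustration, not how ParaProf computes its display; COMPUTE_LOOP is a hypothetical routine name and all timing values are invented.

```python
import statistics
from typing import Dict, List, Tuple


def rank_by_stddev(profile: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (routine, std. deviation across ranks), largest spread first."""
    spread = [(name, statistics.pstdev(times)) for name, times in profile.items()]
    return sorted(spread, key=lambda item: item[1], reverse=True)


# Hypothetical per-rank exclusive times in seconds; index = MPI rank.
profile = {
    "MPI_Wait": [100.0, 104.0, 98.0, 102.0],
    "WRITE_SAVEFILE": [2.0, 40.0, 41.0, 39.0],
    "COMPUTE_LOOP": [50.0, 50.5, 49.5, 50.0],
}

ranking = rank_by_stddev(profile)
for name, stddev in ranking:
    print(f"{name:16s} {stddev:8.3f}")
```

A routine whose time is nearly identical on every rank sinks to the bottom of this list, while one where a single rank behaves differently rises to the top.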

Page 25:

XT4: Standard Deviation

• IO routines!

Page 26:

Scatter Plot: One CPU... WRITE_SAVEFILE

Page 27:

WRITE_SAVEFILE

• Rank 0 is quicker!

Page 28:

MPI_Barrier

Page 29:

I/O is not performed uniformly

Page 30:

I/O Becomes a Bottleneck: XT3, XT3+XT4...

MPI_Wait

WRITE_SAVEFILE

Page 31:

Conclusions

• Using pure XT4 improved performance by 12%
• Need to investigate I/O in XT4/Lustre further to achieve better performance...
• Discuss I/O issues with S3D developers

Page 32:

S3D - Building with TAU

• Change name of compiler in build/make.XT3
  • ftn => tau_f90.sh
  • cc => tau_cc.sh
• Set compile time environment variables
  • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
    • Disabled tracking message communication statistics in TAU
    • MPI_Comm_compare() is not called inside TAU’s MPI wrapper
    • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
  • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
    • Selective instrumentation file eliminates instrumentation in lightweight routines
    • Pre-process Fortran source code using cpp before compiling
• Set runtime environment variables for instrumentation control and PAPI counter selection in job submission script:
    export TAU_THROTTLE=1
    export COUNTER1=GET_TIME_OF_DAY
    export COUNTER2=PAPI_FP_INS
    export COUNTER3=PAPI_L1_DCM
    export COUNTER4=PAPI_TOT_INS
    export COUNTER5=PAPI_L2_DCM
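As an illustration of why these five counters are useful together, the sketch below derives a few common ratios from them. It is not part of TAU; the function name, the dictionary keys, and the sample values are all hypothetical, and the wall-clock time is assumed to have been converted to seconds. Note that PAPI_FP_INS counts floating-point instructions, so the rate below approximates MFLOP/s only where one instruction performs one operation.

```python
from typing import Dict


def derived_metrics(
    seconds: float,   # wall-clock time (COUNTER1), assumed already in seconds
    fp_ins: float,    # PAPI_FP_INS: floating-point instructions
    l1_dcm: float,    # PAPI_L1_DCM: level 1 data cache misses
    tot_ins: float,   # PAPI_TOT_INS: total instructions completed
    l2_dcm: float,    # PAPI_L2_DCM: level 2 data cache misses
) -> Dict[str, float]:
    """Combine raw counter values for one routine into simple derived ratios."""
    return {
        "mega_fp_ins_per_second": fp_ins / seconds / 1e6,
        "fp_fraction": fp_ins / tot_ins,
        "l1_misses_per_kilo_ins": 1000.0 * l1_dcm / tot_ins,
        "l2_misses_per_kilo_ins": 1000.0 * l2_dcm / tot_ins,
    }


# Hypothetical counter values for a single routine on a single rank.
metrics = derived_metrics(
    seconds=2.0, fp_ins=2.0e9, l1_dcm=4.0e7, tot_ins=8.0e9, l2_dcm=8.0e6
)
for name, value in metrics.items():
    print(f"{name:26s} {value:10.3f}")
```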

Page 33:

Selective Instrumentation in TAU

% cat select.tau

BEGIN_EXCLUDE_LIST

MCADIF

GETRATES

TRANSPORT_M::MCAVIS_NEW

MCEDIF

MCACON

CKYTCP

THERMCHEM_M::MIXCP

THERMCHEM_M::MIXENTH

THERMCHEM_M::GIBBSENRG_ALL_DIMT

CKRHOY

MCEVAL4

THERMCHEM_M::HIS

THERMCHEM_M::CPS

THERMCHEM_M::ENTROPY

END_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION

loops routine="#"

END_INSTRUMENT_SECTION
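To check which routines a file like this excludes, the exclude list can be pulled out with a few lines of Python. This is a minimal sketch, not TAU's own parser: it recognizes only the BEGIN_EXCLUDE_LIST and END_EXCLUDE_LIST markers shown above and ignores everything else, including the instrument section. The function name is hypothetical and the sample text is a shortened copy of the file above.

```python
from typing import List


def parse_exclude_list(text: str) -> List[str]:
    """Return the routine names between BEGIN_EXCLUDE_LIST and END_EXCLUDE_LIST."""
    excluded: List[str] = []
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == "BEGIN_EXCLUDE_LIST":
            inside = True
        elif line == "END_EXCLUDE_LIST":
            inside = False
        elif inside and line:
            excluded.append(line)
    return excluded


sample = """\
BEGIN_EXCLUDE_LIST
MCADIF
GETRATES
TRANSPORT_M::MCAVIS_NEW
END_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
"""

excluded = parse_exclude_list(sample)
print(excluded)
```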

Page 34:

Getting Access to TAU on Jaguar

• set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
• Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
  • Makefile.tau-mpi-pdt-pgi (flat profile)
  • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
  • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
• Binaries of S3D can be found in: ~sameer/scratch/S3D-BINARIES
  • withtau (papi, multiplecounters, mpi, pdt, pgi options)
  • without_tau

Page 35:

Concluding Discussion

• Performance tools must be used effectively
• More intelligent performance systems for productive use
  • Evolve to application-specific performance technology
  • Deal with scale by “full range” performance exploration
  • Autonomic and integrated tools
  • Knowledge-based and knowledge-driven process
• Performance observation methods do not necessarily need to change in a fundamental sense
  • More automatically controlled and efficiently used
• Develop next-generation tools and deliver to community
• Open source with support by ParaTools, Inc.
• http://www.cs.uoregon.edu/research/tau

Page 36:

Support Acknowledgements

• Department of Energy (DOE)
  • Office of Science
  • LLNL, LANL, ORNL, ASC
  • PERI