
Page 1:

© 2011 Pittsburgh Supercomputing Center

Getting the Most Out of the TeraGrid SGI Altix UV Systems

Mahin Mahmoodi

Raghu Reddy

TeraGrid 11 Conference

July 18, 2011

Salt Lake City

Page 2:

Outline

• Blacklight memory BW and latency w.r.t. processor-core mapping
• GRU environment variable
• Portable performance evaluation tools on Blacklight
  – Case study: PSC Hybrid Benchmark
  – PAPI
  – IPM
  – SCALASCA
  – TAU

Page 3:

Blacklight memory BW and latency with respect to processor-core mapping

Page 4:

Blacklight Per-Blade/Processor/Core Memory Layout

[Diagram: memory hierarchy of one Blacklight blade]
• Node: 1 blade + 1 HUB
• Blade: 2 processors (sockets) with 128 GB of local memory
• Processor (socket): 8 cores
• L1 cache: 64 KB per core
• L2 cache: 256 KB per core
• L3 (last-level cache): 24 MB per socket
• The two sockets connect to each other and to the HUB via QPI

Page 5:

Blacklight Node Pair Architecture

[Diagram: a Blacklight "node pair"]
• Each node: one UV Hub and two Intel Nehalem EX-8 sockets connected by QPI, with 64 GB of RAM per socket
• Two such nodes are linked by NUMAlink-5 to form a "node pair"

Page 6:

HPCC STREAM Benchmark

• Memory bandwidth is the rate at which data can be read from or stored into processor memory by the processor.
• STREAM measures sustainable main-memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
• Triad kernel (a DAXPY-like update): compute a = b + α·c, where b and c are vectors of random 64-bit floating-point values and α is a given scalar (a sketch of this kernel follows below).
• Problem size: STREAM is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are more indicative of the performance of very large, vector-style applications.
• Design purpose: it is designed to stress local memory bandwidth. The vectors may be allocated in an aligned manner such that no communication is required to perform the computation.
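For concreteness, a minimal stand-alone sketch of the Triad kernel — not the HPCC source; the array length, repetition count, and OpenMP threading here are illustrative assumptions — would be:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 20000000              /* ~160 MB per array: much larger than any cache */
#define NTIMES 10

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double alpha = 3.0;

    for (long j = 0; j < N; j++) { a[j] = 0.0; b[j] = 1.0; c[j] = 2.0; }

    double best = 1.0e30;
    for (int k = 0; k < NTIMES; k++) {
        double t = omp_get_wtime();
#pragma omp parallel for
        for (long j = 0; j < N; j++)
            a[j] = b[j] + alpha * c[j];          /* Triad: a = b + alpha*c */
        t = omp_get_wtime() - t;
        if (t < best) best = t;
    }
    /* three 8-byte words move per element: read b, read c, write a */
    printf("Triad best rate: %.1f MB/s\n", 3.0 * 8.0 * N / best / 1.0e6);
    free(a); free(b); free(c);
    return 0;
}

Compile with the OpenMP flag and pin the threads with omplace, as discussed on the following slides.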

Page 7:

Blacklight Memory Bandwidth w.r.t Process-core Mapping

• HPCC STREAM (Triad) used for the memory-BW measurement; values below are in GB/s

  Single Triad (one core active), per core:   5
  Star Triad (all cores active), per core:    2.37
  Star Triad per socket (8 × 2.37):           18.96
  Cores per socket:                           8
  Speedup per socket (18.96 / 5):             3.792

Page 8:

Effect of -openmp and omplace on STREAM Benchmark Bandwidth

• -openmp is the compilation flag
• omplace is the run-time placement command for an OpenMP code; it ensures that threads do not migrate across cores

  Function   -openmp   -openmp & omplace   (no flag)   omplace
             (MB/s)    (MB/s)              (MB/s)      (MB/s)
  Copy        840.06   4363.83             4205.21     4186.23
  Scale       728.38   3946.61             3957.36     3968.78
  Add         970.93   4934.30             5007.15     4977.13
  Triad       979.62   4998.90             5017.49     4995.37

Take-home message: if the code is compiled with OpenMP, be sure to use omplace.
Example: mpirun -np 16 omplace -nt 4 ./myhybrid

Page 9:

Modified STREAM Benchmark

• A single core runs the modified STREAM benchmark
• Notation: blk-stride-arraysize; units are words (1 word = 8 bytes), and "g" denotes giga-words

  Function     200M-200M-200M   8-8-200M   1-1-200M
               (MB/s)           (MB/s)     (MB/s)
  Strided CP   4200             5175       2145
  Random CP    4200             1050        288

Page 10:

Remote Memory Access

• The modified STREAM code is benchmarked
• Data is initialized on thread 0 and resides in thread 0's local memory
• The data is then accessed by thread <n> (remote access)
• Block = blk, Stride = S, Array size = n

  Accessing   BW (MB/s)           BW (MB/s)        BW (MB/s)
  thread      blk=200M, S=200M,   blk=8, S=8,      blk=1, S=8,
              n=200M              n=200M           n=200M
   0          1826.18             1624.53          557.39
   8          1410.17             1376.59          463.20
  16           594.83              641.88          187.24
  24           673.43              622.25          188.44
  32           541.75              534.93          156.57
  48           481.22              459.93          140.14

[Diagram: cores 0-7 and 8-15 attach to one HUB, cores 16-23 and 24-31 to a second HUB; sockets connect to their HUB via QPI]
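The slowdown for higher thread numbers is a NUMA placement effect: pages are allocated in the memory of the node whose thread first touches them. A minimal sketch of the idea behind this experiment — not the modified-STREAM source; it uses a read sweep instead of the copy kernel, and the array size and the assumption that threads are pinned with omplace are illustrative — is:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 200000000L                 /* 200M 8-byte words, as in the table above */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    int nthreads = omp_get_max_threads();

    /* First touch by the initial thread only: every page is placed in the
       memory local to the core running thread 0.                          */
    for (long i = 0; i < N; i++) a[i] = 1.0;

    /* Each thread, one at a time, streams through the (possibly remote) data. */
    for (int t = 0; t < nthreads; t++) {
#pragma omp parallel
        {
            if (omp_get_thread_num() == t) {
                double sum = 0.0, wt = omp_get_wtime();
                for (long i = 0; i < N; i++) sum += a[i];
                wt = omp_get_wtime() - wt;
                printf("accessing thread %3d: %8.1f MB/s (sum=%g)\n",
                       t, 8.0 * N / wt / 1.0e6, sum);
            }
        }
    }
    free(a);
    return 0;
}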

Page 11:

HPCC Ping-pong Benchmark

• Latency: the time required to send an 8-byte message from one process to another

• What does ping pong benchmark mean?

The ping pong benchmark is executed on two processes. From the client process a message (ping) is sent to the server process and then bounced back to the client (pong). MPI standard blocking send and receive is used. The ping-pong patterns are done in a loop. To achieve the communication time of one message, the total communication time is measured on the client process and divided by twice the loop length. Additional startup latencies are masked out by starting the measurement after one non-measured ping-pong. The benchmark in hpcc uses 8 byte messages and loop length = 8 for benchmarking the communication latency. The benchmark is repeated 5 times and the shortest latency is reported. To measure the communication bandwidth, 2,000,000 byte messages with loop length 1 are repeated twice.

• How is ping pong measured on more than 2 processors?

The ping-pong benchmark reports the maximum latency and minimum bandwidth for a number of non-simultaneous ping-pong tests. The ping-pongs are performed between as many as possible (there is an upper bound on the time it takes to complete this test) distinct exclusive pairs of processors.

Reference: http://icl.cs.utk.edu/hpcc/faq/index.html
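A minimal sketch of the ping-pong pattern described above — not the HPCC source; it hard-codes the 8-byte message and loop length of 8 used for the latency test and omits the five-fold repetition — might look like this (run on two processes, e.g. mpirun -np 2 ./pingpong):

#include <stdio.h>
#include <mpi.h>

static void pingpong(int rank, char *msg)   /* one blocking ping-pong between ranks 0 and 1 */
{
    if (rank == 0) {
        MPI_Send(msg, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(msg, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(msg, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(msg, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
}

int main(int argc, char **argv)
{
    const int LOOP = 8;                /* loop length of the 8-byte latency test */
    char msg[8] = {0};
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pingpong(rank, msg);               /* one non-measured ping-pong masks start-up latency */

    double t = MPI_Wtime();
    for (int i = 0; i < LOOP; i++)
        pingpong(rank, msg);
    t = MPI_Wtime() - t;

    if (rank == 0)                     /* one-way time = total time / (2 * loop length) */
        printf("latency: %.2f microseconds\n", t / (2.0 * LOOP) * 1.0e6);

    MPI_Finalize();
    return 0;
}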

Page 12:

Blacklight Latency with Respect to Process-core Mapping

• HPCC ping-pong used for the latency measurement
• Ranks send and receive 8-byte messages one at a time

  Cores   Msg length (bytes)   MPI latency (microseconds)
  1024    8                    1.6 - 2.0

Page 13:

GRU Environment Variable

Page 14:

Global Reference Unit (GRU) Hardware Overview

[Diagram: a UV HUB with 2 GRU chiplets, connected via QPI to two Nehalem EX sockets (4, 6, or 8 cores each) and their memory DIMMs, and via NUMAlink 5 to the rest of the system]

• The GRU is a coprocessor that resides in the HUB (node controller) of a UV system
• The GRU provides high-bandwidth, low-latency socket communication
• The SGI MPT library uses GRU features to optimize node communication

Page 15:

Run-time tuning with GRU in PSC Hybrid Benchmark

• Setting the GRU_RESOURCE_FACTOR variable at run time may improve the communication time.
• That is: 'setenv GRU_RESOURCE_FACTOR <n>', n = 2, 4, 6, 8
• All runs are on 64 cores
• (ranks, threads): (64, 1), (8, 8), (16, 4)

              No-GRU              GRU=2               GRU=4               GRU=6               GRU=8
              64R1T  8R8T  16R4T  64R1T  8R8T  16R4T  64R1T  8R8T  16R4T  64R1T  8R8T  16R4T  64R1T  8R8T  16R4T
  Walltime    174    184   170    167    143   134    169    141   130    169    140   131    169    140   136
  CommTime     93    101    88     87     60    53     89     57    48     89     56    50     88     57    55

Page 16:

Effect of GRU on HPCC Pingpong BW

• HPCC ping-pong is used in two runs
• The following environment variables are set in one of the runs:
  setenv MPI_GRU_CBS 0
  setenv GRU_RESOURCE_FACTOR 4
  setenv MPI_BUFFER_MAX 2048

  Cores   Msg (bytes)   BW (MB/s), No-GRU   BW (MB/s), GRU
  1024    2,000,000     1109.5              2663.6

Page 17:

Case study: PSC Hybrid Benchmark

Page 18:

A Case Study: PSC Hybrid Benchmark Code (Laplace Solver)

• The code uses MPI and OpenMP to parallelize the solution of a partial differential equation (PDE)
• It tests the MPI/OpenMP performance of the code on a NUMA system
• Computation: each process updates the entries of the part of the array it owns
• Communication: each process exchanges data with its two neighbors, only at block boundaries, to receive the values of neighboring points owned by another process
• No collective communication
• Communication is simplified by allocating an overlap (halo) area on each process for storing the values received from its neighbors

Page 19:

The Laplace Equation

• To solve the equation, we want to find T(x,y) at the grid points, subject to the following boundary conditions:
  – T is 0 along the top and left boundaries
  – T varies linearly from 0 to 100 along the right and bottom boundaries
• The solution method is known as the Point Jacobi iteration

[Diagram: square domain with T=0 on the top and left edges, rising linearly to T=100 toward the bottom-right corner]
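For reference (the slide does not state it explicitly), the equation being solved is the two-dimensional Laplace equation, ∂²T/∂x² + ∂²T/∂y² = 0; discretizing it with second-order central differences on a uniform grid yields the average-of-neighbors update shown on the next slide.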

Page 20:

The Point Jacobi Iteration

• In this iterative method, the value of each T(i,j) is replaced by the average of its four neighbors until the convergence criteria are met.
• T(i,j) = 0.25 * [T(i-1,j) + T(i+1,j) + T(i,j-1) + T(i,j+1)]

[Diagram: 5-point stencil — T(i,j) and its neighbors T(i-1,j), T(i+1,j), T(i,j-1), T(i,j+1)]

Page 21:

Data Decomposition in the PSC Laplace Benchmark

• A 1-D, row-wise block partitioning is used
• Each processor (PE) computes the Jacobi points in its own block and communicates with its neighbor(s) only at the block boundaries (a sketch of this pattern follows below)

[Diagram: the grid split row-wise into four blocks owned by PE0, PE1, PE2, PE3]
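The benchmark source itself is not reproduced here; the following is only a minimal sketch of the same pattern — 1-D row-block decomposition, halo (overlap) rows, non-blocking boundary exchange, and an OpenMP-threaded Jacobi update. The array sizes, the fixed iteration count, and the omission of the boundary values and convergence test are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define NROW 1000        /* rows owned by each rank (illustrative) */
#define NCOL 1000
#define ITER 100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* local block plus one halo row above and one below */
    double (*t)[NCOL]    = calloc(NROW + 2, sizeof *t);
    double (*tnew)[NCOL] = calloc(NROW + 2, sizeof *tnew);
    /* (boundary conditions and convergence test omitted for brevity) */

    for (int it = 0; it < ITER; it++) {
        MPI_Request req[4];

        /* exchange halo rows with the two neighbors */
        MPI_Irecv(t[0],        NCOL, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(t[NROW + 1], NCOL, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(t[1],        NCOL, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(t[NROW],     NCOL, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* Point Jacobi update of the interior, threaded with OpenMP */
#pragma omp parallel for
        for (int i = 1; i <= NROW; i++)
            for (int j = 1; j < NCOL - 1; j++)
                tnew[i][j] = 0.25 * (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]);

        memcpy(t, tnew, (NROW + 2) * sizeof *t);
    }

    if (rank == 0) printf("done after %d iterations\n", ITER);
    free(t); free(tnew);
    MPI_Finalize();
    return 0;
}

Compiled with the OpenMP flag and launched as on the earlier slides (mpirun -np <ranks> omplace -nt <threads> ./a.out), this reproduces the computation/communication structure described above.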

Page 22:

Portable performance evaluation tools on Blacklight

Page 23:

Portable Performance Evaluation Tools on Blacklight

Goals:
• Give an overview of the programming tools suite available on Blacklight
• Explain the functionality of the individual tools
• Teach how to use the tools effectively
  – Capabilities
  – Basic use
  – Hybrid profiling analysis
  – Reducing the profiling overhead
  – Common environment variables

Page 24:

Available Open Source Performance Evaluation Tools on Blacklight

• PAPI
• IPM
• SCALASCA
• TAU
• 'module avail <tool>' to view the available versions
• 'module load <tool>' to bring a tool into the environment
  e.g.: module load tau

Page 25:

What is PAPI?

• Middleware that provides a consistent programming interface to the hardware performance counters found in most major microprocessors.
• Countable hardware events:
  – PRESET: platform-neutral events
  – NATIVE: platform-dependent events
  – Derived: preset events can be derived from multiple native events
  – Multiplexed: events can be multiplexed if counters are limited

Page 26:

PAPI Utilities

• Utilities are available in the PAPI bin directory. Load the module first to append it to the PATH, or use the absolute path to the utility.

  Example:
  % module load papi
  % which papi_avail
  /usr/local/packages/PAPI/usr/4.1.3/bin/papi_avail

• Execute the utilities on compute nodes, as mmtimer is not available on the login nodes.
• Use '<utility> -h' for more information.

  Example:
  % papi_cost -h
  It computes min / max / mean / std. deviation for PAPI start/stop pairs, for PAPI reads, and for PAPI_accums.
  Usage: cost [options] [parameters] ...

Page 27:

PAPI Utilities Cont.

• Execute papi_avail for PAPI preset events

% papi_avail

……

Name Code Avail Deriv Description (Note)

PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses

PAPI_L2_DCM 0x80000002 Yes Yes Level 2 data cache misses

• Execute papi_native_avail for the available native events

  % papi_native_avail
  .......
  Event Code   Symbol                        | Long Description
  0x40000005   LAST_LEVEL_CACHE_REFERENCES   | This is an alias for LLC_REFERENCES

• Execute papi_event_chooser to select a compatible set of events that can be counted simultaneously.

  % papi_event_chooser
  Usage: papi_event_chooser NATIVE|PRESET evt1 evt2 ...

  % papi_event_chooser PRESET PAPI_FP_OPS PAPI_L1_DCM
  event_chooser.c PASSED

Page 28:

PAPI High-level Interface

• Meant for application programmers wanting coarse-grained measurements
• Calls the lower-level API
• Allows only PAPI preset events
• Easier to use and requires less setup (less additional code) than the low-level API
• Supports 8 calls in C or Fortran (listed below)
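For reference, the eight high-level calls (as commonly listed for the PAPI 4.x high-level API; the exact set in the installed version is an assumption — check papi.h or the PAPI documentation):

  PAPI_num_counters
  PAPI_flips, PAPI_flops, PAPI_ipc
  PAPI_start_counters, PAPI_stop_counters
  PAPI_read_counters, PAPI_accum_counters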

Page 29:

PAPI High-level Example

#include "papi.h” #define NUM_EVENTS 2 long_long values[NUM_EVENTS]; unsigned int

Events[NUM_EVENTS]={PAPI_TOT_INS,PAPI_TOT_CYC};

/* Start the counters */ PAPI_start_counters((int*)Events,NUM_EVENTS);

/* What we are monitoring… */ do_work();

/* Stop counters and store results in values */ retval = PAPI_stop_counters(values,NUM_EVENTS);

Page 30:

Low-level Interface

• Increased efficiency and functionality over the high level PAPI interface

• Obtain information about the executable, the hardware, and the memory environment

• Multiplexing
• Callbacks on counter overflow
• Profiling
• About 60 functions

Page 31:

PAPI Low-level Example

#include "papi.h”#define NUM_EVENTS 2int Events[NUM_EVENTS]={PAPI_FP_INS,PAPI_TOT_CYC};int EventSet;long_long values[NUM_EVENTS];/* Initialize the Library */retval = PAPI_library_init(PAPI_VER_CURRENT);/* Allocate space for the new eventset and do setup */retval = PAPI_create_eventset(&EventSet);/* Add Flops and total cycles to the eventset */retval = PAPI_add_events(EventSet,Events,NUM_EVENTS);/* Start the counters */retval = PAPI_start(EventSet);

do_work(); /* What we want to monitor*/

/*Stop counters and store results in values */retval = PAPI_stop(EventSet,values);

Page 32:

Example: FLOPS with PAPI Calls

program mflops_example
  implicit none
#include 'fpapi.h'
  integer :: i
  double precision :: a, b, c
  integer, parameter :: n = 1000000
  integer (kind=8) :: flpops = 0
  integer :: check
  real (kind=4) :: real_time = 0., proc_time = 0., mflops = 0.

  a = 1.e-8
  b = 2.e-7
  c = 3.e-6
  call PAPIF_flops(real_time, proc_time, flpops, mflops, check)
  print *, "first: ", flpops, proc_time, mflops, check
  do i = 1, n
    a = a + b * c
  end do
  call PAPIF_flops(real_time, proc_time, flpops, mflops, check)
  print *, "second: ", flpops, proc_time, mflops, check
  print *, 'sum = ', a
end program mflops_example

Compilation:
  % module load papi
  % ifort -fpp $PAPI_INC -o mflops mflops_example.f $PAPI_LIB

Execution:
  % module load papi
  % ./mflops

Output (flpops, proc_time, mflops, check):
  first:  0        0.0000000E+00  0.0000000E+00  0
  second: 1000009  1.4875773E-03  672.2400       0
  sum =   6.100000281642980E-007

Page 33:

IPM: Integrated Performance Monitoring

• Lightweight and easy to use
• Profiles only MPI codes (not serial, not OpenMP)
• Profiles only MPI routines (not computational routines)
• Accesses hardware performance counters using PAPI
• Lists message-size information
• Provides the communication topology
• Reports walltime, communication %, flops, total memory usage, MPI routines' load imbalance, and time breakdown
• IPM-1 and IPM-2 (pre-release) are installed on Blacklight
• Generates a text report and visual data (HTML-based)

Page 34:

How to Use IPM on Blacklight: Basics

Compilation
• module load ipm
• Link your code to the IPM library at compile time
  eg_1: icc test.c $PAPI_LIB $IPM_LIB -lmpi
  eg_2: ifort -openmp test.f90 $PAPI_LIB $IPM_LIB -lmpi

Execution
• Optionally, set the run-time environment variables
  Example:
  export IPM_REPORT=FULL
  export IPM_HPM=PAPI_FP_OPS,PAPI_L1_DCM   (a comma-separated list of PAPI counters)
• % module load ipm
• Execute the binary normally
  (This step generates an XML file for the visual data)

Profiling report
• The text report appears in the batch output after the execution completes
• For the HTML-based report, run 'ipm_parse -html <xml_file>', transfer the generated directory to your workstation, and click on index.html for the visual data

Page 35:

IPM Communication Statistics, PSC Hybrid Benchmark

Communication Event Statistics (100.00% detail, -5.4590e-03 error)

              Buffer Size   Ncalls    Total Time   Min Time    Max Time    %MPI    %Wall
  MPI_Wait    2097152       4999814   4907.002     4.764e-08   5.658e-01   76.10   7.98
  MPI_Irecv   2097152       2520000   1374.856     1.050e-06   5.639e-01   21.32   2.24
  MPI_Wait    192           40000      144.849     1.376e-07   3.014e-01    2.25   0.24
  MPI_Isend   2097152       2520000     17.616     2.788e-07   5.527e-01    0.27   0.03

Page 36:

IPM Profiling, Message Sizes

• Message size per MPI call: in 100% of the communication time, 2 MB messages are used in MPI_Wait and MPI_Irecv

Page 37:

IPM Profiling: Load Imbalance Information

Page 38:

SCALASCA

• Automated, profile-based performance analysis
• Automatic search for bottlenecks based on properties formalizing expert knowledge
  – MPI wait states
  – Processor-utilization hardware counters
• An automatic performance analysis toolset for scalable performance analysis of large-scale applications
  – Particularly focused on the MPI & OpenMP paradigms
  – Analysis of communication & synchronization overheads
• Automatic and manual instrumentation capabilities
• Runtime summarization and/or event trace analyses
• Automatic search of event traces for patterns of inefficiency

Page 39:

How to Use SCALASCA on Blacklight: Basics

• module load scalasca
• Run the scalasca command (% scalasca) without arguments for basic usage info
• 'scalasca -h' shows the quick reference guide (PDF document)

• Instrumentation
  – Prepend skin (or scalasca -instrument) to the compile/link commands
    Example: skin icc -openmp test.c -lmpi   (hybrid code)
• Measurement & analysis
  – Prepend scan (or scalasca -analyze) to the usual execution command
    (this step generates the epik measurement directory)
  – Example: scan -t mpirun -np 16 omplace -nt 4 ./exe   (the optional -t enables trace generation)
• Report examination
  – Run square (or scalasca -examine) on the generated epik measurement directory to examine the report interactively (visual data)
    Example: square epik_a.out_32x2_sum
  – or run 'cube3_score -s' on the epik directory for a text report

Page 40:

Distribution of Time for a Selected Call Tree by Process/Thread

[Screenshot: metric pane, call-tree pane, and process/thread pane]

Page 41:

Distribution of Load Imbalance for the work_sync Routine by Process/Thread

Color-coded profile of a 64-core run (8 threads per rank) on Blacklight

Page 42:

Global Computational Imbalance (not individual functions)

Page 43:

SCALASCA Metric On-line Description (right-click on a metric)

Page 44:

Instructions for a Scalasca Textual Report

% module load scalasca
• Run cube3_score with the -r flag on the cube file generated in the epik directory to see the text report

  Example:
  % cube3_score -r epik_homb_8x8_sum/epitome.cube

• Region classification:
  MPI (pure MPI functions)
  OMP (pure OpenMP regions)
  USR (user-level computational routines)
  COM (combined USR + MPI/OpenMP)
  ANY/ALL (aggregate of all region types)

  flt  type   max_tbc    time      %       region
       ANY    5788698    20951.46  100.00  (summary) ALL
       MPI    5760322     8876.37   42.37  (summary) MPI
       OMP      23384    12063.81   57.58  (summary) OMP
       COM       4896        3.35    0.02  (summary) COM
       USR         72        1.10    0.01  (summary) USR
       MPI    2000050       16.38    0.08  MPI_Isend
       MPI    1920024     7785.68   37.16  MPI_Wait
       MPI    1840000     1063.18    5.07  MPI_Irecv
       OMP       8800       56.31    0.27  !$omp parallel @homb.c:754
       OMP       4800     8102.48   38.67  !$omp for @homb.c:758
       COM       4800        3.26    0.02  work_sync
       OMP       4800     3620.97   17.28  !$omp ibarrier @homb.c:765
       OMP       4800        2.41    0.01  !$omp ibarrier @homb.c:773
       MPI        120       11.03    0.05  MPI_Barrier
       EPK         48        6.83    0.03  TRACING
       OMP         44        0.03    0.00  !$omp parallel @homb.c:465
       OMP         44      121.81    0.58  !$omp parallel @homb.c:557
       MPI         40        0.01    0.00  MPI_Gather
       MPI         40        0.00    0.00  MPI_Reduce
       USR         24        0.00    0.00  gtimes_report
       COM         24        0.00    0.00  timeUpdate
       MPI         24        0.05    0.00  MPI_Finalize
       OMP         24       23.46    0.11  !$omp ibarrier @homb.c:601
       OMP         24      136.24    0.65  !$omp for @homb.c:569
       COM         24        0.00    0.00  initializeMatrix
       USR         24        1.10    0.01  createMatrix

Page 45:

Scalasca Notable Run-time Environment Variables

• Set EPK_METRICS to a colon-separated list of PAPI counters
  Example: setenv EPK_METRICS PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM

• Set ELG_BUFFER_SIZE to avoid intermediate flushes to disk
  Example: setenv ELG_BUFFER_SIZE 10000000   (bytes)

  To size ELG_BUFFER_SIZE, run the following command on the epik directory:
  % scalasca -examine -s epik_homb_8x8_sum
  ...
  Estimated aggregate size of event trace (total_tbc): 41694664 bytes
  Estimated size of largest process trace (max_tbc):   5788698 bytes
  (Hint: When tracing, set ELG_BUFFER_SIZE > max_tbc to avoid intermediate flushes, or reduce the requirements using a file listing the names of USR regions to be filtered.)

• Set EPK_FILTER to the name of a file listing the routines to be filtered, to reduce the instrumentation and measurement overhead.
  Example: setenv EPK_FILTER routines_filt
  % cat routines_filt
  sumTrace
  gtimes_report
  statistics
  stdoutIO

Page 46:

Time Spent in a Selected OpenMP Region & Idle Threads

[Screenshot: the corresponding source code is shown; idle threads are greyed out]

Page 47:

TAU Parallel Performance Evaluation Toolset

• Portable to essentially all computing platforms
• Supported programming languages and paradigms: Fortran, C/C++, Java, Python, MPI, OpenMP, hybrid, multithreading
• Supported instrumentation methods:
  – Source-code instrumentation, object and binary code, library wrapping
• Levels of instrumentation:
  – Routine, loop, block, I/O BW & volume, memory tracking, CUDA, hardware counters, tracing
• Data analyzers: ParaProf, PerfExplorer, Vampir, Jumpshot
• Throttling of frequently called small subroutines to limit overhead
• Automatic and manual instrumentation
• Interface to databases (Oracle, MySQL, ...)

Page 48:

How to Use TAU on Blacklight: Basics

Step 0
% module avail tau   (shows the available TAU versions)
% module load tau

Step 1: Compilation
• Choose a TAU Makefile stub based on the kind of profiling you want. The available Makefile stubs are listed by:
  ls $TAU_ROOT_DIR/x86_64/lib/Makefile*
  e.g.: Makefile.tau-icpc-mpi-pdt-openmp-opari for an MPI+OpenMP code
• Optionally set TAU_OPTIONS to specify compilation-specific options
  – e.g.: setenv TAU_OPTIONS "-optVerbose -optKeepFiles"   (verbose output & keep the instrumented files)
  – export TAU_OPTIONS='-optTauSelectFile=select.tau -optVerbose'   (selective instrumentation)
• Use one of the TAU wrapper scripts to compile your code (tau_f90.sh, tau_cc.sh, or tau_cxx.sh)
  – e.g.: tau_cc.sh foo.c   (generates an instrumented binary)

Step 2: Execution
• Optionally, set TAU run-time environment variables to choose the metrics to be generated
  – e.g.: setenv TAU_CALLPATH 1   (for call-graph generation)
  – e.g.: setenv TAU_METRICS <list of PAPI counters>
• Run the instrumented binary from Step 1 normally (profile files will be generated)

Step 3: Data analysis
• Run pprof, where the profile files reside, for a text profile
• Run paraprof for visual data
• Run perfexplorer for multiple sets of profiles
• Run Jumpshot or Vampir to analyze trace files

Page 49:

Hybrid Code Profiled with TAU

Routines time breakdown per node/thread

Page 50:

Hybrid code Profiled with TAU Cont.

Routines exclusive time %, on node0 & thread0

Routines exclusive time %, on rank3 & thread4

Page 51:

TAU Profiling: Thread Load Imbalance in the Hybrid Code's MPI Routines

Page 52:

Reducing TAU Instrumentation & Measurement Overhead

• By default, TAU throttles routines that are called more than 100,000 times and take less than 10 microseconds per call.
  – TAU accumulates the timer for up to 100,000 calls, then stops measuring the routine and attributes its remaining time to the routine's parent.
• Tiny routines, or selected routines (selective instrumentation), can be excluded from instrumentation/measurement via TAU directives.
• Methods of selective instrumentation are discussed next.

Page 53:

Selective Instrumentation Routines in TAU

• Specify a list of routines to exclude or include (case sensitive) in a text file (e.g., select.tau)
• '#' is a wildcard in a routine name. It cannot appear in the first column.

  BEGIN_EXCLUDE_LIST
  Foo
  Bar
  D#EMM
  END_EXCLUDE_LIST

• Specify a list of routines to include for instrumentation:

  BEGIN_INCLUDE_LIST
  int main(int, char **)
  F1
  F3
  END_INCLUDE_LIST

• Specify either an include list or an exclude list!
• Use the text file name at the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'

Page 54:

Selective Instrumentation Files in TAU

• Optionally specify a list of files to exclude or include (case sensitive) in a text file
• '*' and '?' may be used as wildcard characters in a file name

  BEGIN_FILE_EXCLUDE_LIST
  f*.f90
  Foo?.cpp
  END_FILE_EXCLUDE_LIST

• Specify a list of files to include for instrumentation:

  BEGIN_FILE_INCLUDE_LIST
  main.cpp
  foo.f90
  END_FILE_INCLUDE_LIST

• Specify either an include list or an exclude list!
• Use the text file name at the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'
  (select.tau is the selective instrumentation file)

Page 55:

Instrumenting Code Sections in TAU

• User instrumentation commands are placed in an INSTRUMENT section
• '?' and '*' are used as wildcard characters for file names, '#' for routine names
• '\' is the escape character for quotes
• Routine entry/exit, arbitrary code insertion
• Outer-loop level instrumentation

  BEGIN_INSTRUMENT_SECTION
  loops file="foo.f90" routine="matrix#"
  memory file="foo.f90" routine="#"
  io routine="matrix#"
  [static/dynamic] phase routine="MULTIPLY"
  dynamic [phase/timer] name="foo" file="foo.cpp" line=22 to line=35
  file="foo.f90" line=123 code=" print *, \" Inside foo\""
  exit routine="int foo()" code="cout <<\"exiting foo\"<<endl;"
  END_INSTRUMENT_SECTION

• Use the text file name at the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'
  (select.tau is the selective instrumentation file)

Page 56:

TAU Commonly Used Run-time Environment Variables

• 'setenv TAU_CALLPATH 1' to obtain callpath profiling and a call graph
• 'setenv TAU_CALLPATH_DEPTH <n>' (n specifies the depth of the callpath)
• Set TAU_METRICS to a list of PAPI counters for hardware event counts
  – Example: setenv TAU_METRICS PAPI_FP_OPS:PAPI_NATIVE_<event>
• 'setenv TAU_TRACE 1' for trace generation
• 'setenv TAU_COMM_MATRIX 1' to generate the communication topology
• TAU_TRACK_MEMORY_LEAKS: setting it to 1 turns on leak detection (for use with tau_exec -memory)
• TAU_THROTTLE: set to 1 or 0 to turn throttling on or off
  – TAU_THROTTLE_NUMCALLS specifies the number of calls before testing for throttling (default 100000)
  – TAU_THROTTLE_PERCALL specifies the per-call threshold in microseconds (default 10)
  (A routine is throttled if it is called over 100000 times and takes less than 10 microseconds of inclusive time per call)

Page 57:

Which Performance Tool to Use?

• IPM: a low-overhead tool for MPI communication statistics, message sizes, and PAPI event counts
• TAU: advanced profile and trace capability for MPI, OpenMP, hybrid, Java, Python, etc. Selective instrumentation reduces the overhead.
• SCALASCA: an 'automatic' performance analysis tool for MPI and OpenMP routines. Filtering out the computational routines reduces the measurement overhead.

Page 58:

References

TAU
• http://www.cs.uoregon.edu/research/tau/tau-usersguide.pdf
• http://www.psc.edu/general/software/packages/tau/TAU-quickref.pdf
• http://www.cs.uoregon.edu/research/tau/docs/newguide/bk03ch02.html

PAPI
• http://icl.cs.utk.edu/papi/

SCALASCA
• http://www.scalasca.org/

IPM
• http://ipm-hpc.sourceforge.net/

Others:
• https://www.teragrid.org/web/user-support/tau
• http://www.psc.edu/general/software/packages/tau/
• http://www.psc.edu/general/software/packages/ipm/