
Page 1: Integrated MPI/OpenMP Performance Analysis

KAI Software Lab, Intel Corporation & Pallas, GmbH
Bob Kuhn, bob.kuhn@intel.com
Hans-Christian Hoppe, hoppe@pallas.com

Page 2: Outline

- Why integrated MPI/OpenMP programming?
- A performance tool for MPI/OpenMP programming (Phase 1)
- Integrated performance analysis capability for ASCI apps (Phase 2)

Page 3: Why Integrate MPI and OpenMP?

- Hardware trends
- Simple example: how is it done now?
- An FEA example
- ASCI examples

Page 4: Parallel Hardware Keeps Coming

Recent example: LLNL ASCI clusters (HPCWire, 8/31/01)

Parallel Capacity Resource (PCR) cluster:
- Three clusters totaling 472 Pentium 4s, the largest with 252
- Theoretical peak of 857 gigaFLOP/s
- Linux
- NetworX via SGI Federal

Parallel global file system cluster:
- 48 Pentium 4 processors in total
- 1,024 clients/servers
- Delivers I/O rates of over 32 GB/s
- Fail-over and a global lock manager
- Linux, open source
- NetworX via SGI Federal

Page 5: Parallelism Performance Analysis

[Chart: code performance versus effort for OpenMP, MPI, OpenMP performance tools, MPI/OpenMP performance tools, and debuggers/IDEs]

Page 6: Cost Effective Parallelism – Long Term

Wealth of parallelism experience, from single-person codes to large teams (TETON / CRETIN / LASNEX):

- Purpose: radiation transport / non-LTE physics / ICF simulations
- Age (years): ~5-10 / ~10 / ~25
- Size (lines): 20 K / 100 K / large
- Developers: 1-2 / 1 / many
- Complexity: low / moderate / high
- Parallel models: 1-level SMP; 1-level DMP; varied-level SMP/DMP; single-level SMP
- Complications: memory management; build process

Page 7: ASCI Ultrascale Tools Project

PathForward project: RTS – Parallel System Performance

Ten goals in three areas:
- Scalability: work with 10,000+ processors
- Integration: hardware monitors, object orientation, and the runtime environment
- Ease of use: dynamic instrumentation; be prescriptive, not just a data-management tool

Page 8: Architecture for Ultrascale Performance

1) Guide: source instrumentation
2) Vampirtrace: MPI/OpenMP instrumentation
3) Vampir: MPI analysis
4) GuideView: OpenMP analysis

[Diagram: tool chain connecting the application source, Guide, object files, the executable, the Guidetrace and Vampirtrace libraries, the trace file, and the Vampir and GuideView analysis tools]

Page 9: Phase One Goal – Integrated MPI/OpenMP

Phase One goals:
- Integrated MPI/OpenMP tracing, in the mode most compatible with ASCI systems
- Whole-program profiling: integrate the program profile with parallelism
- Increased scalability of performance analysis, up to 1000 processors

Page 10: Vampir – Integrated MPI/OpenMP

SWEEP3D run on 4 MPI tasks with 4 OpenMP threads each:
- Timeline shows OpenMP regions with a glyph
- Threaded activity during each OpenMP region

Page 11: GuideView – Integrated MPI/OpenMP & Profile

SWEEP3D run on 4 MPI tasks, each with 4 OpenMP threads:
- All OpenMP regions for a process are summarized into one bar
- Highlight (red arrow) shows the speedup curve for that set of threads
- Thread view shows the balance between MPI tasks and threads

Page 12: GuideView – Integrated MPI/OpenMP & Profile

- Sorting and filtering bring large amounts of information down to a manageable level
- The profile allows comparison of MPI, OpenMP, and application activity, both inclusive and exclusive

Page 13: Guide – Compiler Workhorse

- Compilation of OpenMP
- Automatic subroutine entry and exit instrumentation for Fortran and C/C++
- New compiler options:
  - -WGtrace: link with the Vampirtrace library
  - -WGprof: subroutine entry/exit profiling
  - -WGprof_leafprune: minimum size of procedures to retain in the profile
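
The slide does not show what the inserted instrumentation looks like. As a rough, hand-written illustration only (not actual Guide output), entry/exit instrumentation amounts to bracketing each routine with the VT_symdef/VT_begin/VT_end calls that appear on page 29; the symbol code below is an arbitrary, user-chosen value:

    // Prototypes as used on page 29 (assumed; normally provided by the
    // Vampirtrace header).
    extern "C" {
        int VT_symdef(int code, const char *symname, const char *activity);
        int VT_begin(int code);
        int VT_end(int code);
    }

    static const int SOLVER_CODE = 42;   // arbitrary user-chosen symbol code

    void register_solver_symbol()
    {
        // Associate the code with a readable name once, before tracing starts.
        VT_symdef(SOLVER_CODE, "solver_step", "Application");
    }

    void solver_step()
    {
        VT_begin(SOLVER_CODE);           // routine entry event
        // ... original body of the routine ...
        VT_end(SOLVER_CODE);             // routine exit event
    }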

Page 14: Vampirtrace – Profiling

Support for pruning of short routines:

[Diagram: a call tree with ROUTINE X ENTRY, ROUTINE Y ENTRY, ROUTINE Y EXIT. Routines shorter than the threshold Δt are pruned; ROUTINE X is marked as having its call-tree information summarized. All events that have not been pruned can then be written to the trace file. ROUTINE Z ENTRY: ROUTINE Z may still be shorter than Δt, so it cannot yet be written.]

Page 15: Scalability in Phase One

- Timeline scaling to 256 tasks/nodes
- Gathering of the tasks on a node into a group:
  - Filtering by node
  - Expanding each node
  - Message statistics by node

Page 16: Phase Two – Integrating Capabilities for ASCI Apps

Phase Two goals:
- Deployment to other platforms: Compaq, CPlant, SGI
- Thread safety
- Scalability: grouping, statistical analysis, integrated GuideView
- Hardware performance monitors
- Dynamic control of instrumentation
- Environmental awareness

Page 17: Thread Safety

Collect data from each thread:
- Thread-safe Vampirtrace library
- Per-thread profiling data
- In the previous release, only the master thread logged data

Improves the accuracy of the data.

Value to users:
- Enhances the integration between MPI and OpenMP
- Enhances visibility into the functional balance between threads

Page 18: Scalability: Grouping

Up to the end of FY00:
- Fixed hierarchy levels (system, nodes, CPUs)
- Fixed grouping of processes
- E.g., impossible to reflect communicators

Need more levels:
- Threads are a fourth group
- Systems with deeper hierarchies (30T)
- Reduce the number of on-screen entities for scalability

[Diagram: hierarchy from the whole system down through nodes (Node 1 ... Node n), quadboards, CPUs (CPU 1 ... CPU c), and threads (T_1 ... T_p, t_1 ... t_c)]

Page 19: Default Grouping

Default grouping:
- By nodes
- By processes
- By master threads
- All threads

Can be changed in the configuration file.

[Diagram: grouping tree with All Cluster at the root, Node 1 ... Node n and Process 0 ... Process n below it, threads T_0, T_1 ... T_p at the leaves, plus the aggregate groups All Processes, All Masters, and All Threads]

Page 20: Scalability: Grouping

- Filter-processes dialog: select groups via a combo box
- Display of groups: by aggregation or by representative
- Grouping applies to "timeline bars" and counter streams

Page 21: Scalability by Grouping

- Parallelism display showing all threads
- Parallelism display showing only the master threads, alternating between MPI and OpenMP parallelism

Page 22: Statistical Information Gathering

- Collects basic statistics at runtime
- Saves the statistics in an ASCII file
- View the statistics in your favorite spreadsheet
- Reduced overhead compared to tracing

[Diagram: the parallel executable writes either a (big) trace file or a (small) stats file; the stats file is run through a Perl filter and viewed in Excel or another spreadsheet]

Page 23: Statistical Information Gathering

- Can work independently of tracing
- Significantly lower overhead (memory, runtime)
- Restriction: statistics cover the whole application run

What / Organization / Data:
- Subroutines: per process; min/max/total time, number of calls
- Messages: per sender/receiver; min/max/total bytes, number of messages
- Parallel regions: per process

Page 24: Statistical Information Gathering

Record format:
<act>:<sym>:<proc>:<calls>:<minexcl>:<maxexcl>:<totalexcl>:<minincl>:<maxincl>:<totalincl>

INFO ACTSTATS Application:PK_2112_YBDRYS:0:16:3.539324e-04:5.249977e-04:7.470846e-03:3.539324e-04:5.249977e-04:7.470846e-03
INFO ACTSTATS Application:PK_2112_YBDRYS:1:16:3.600121e-04:5.509853e-04:7.577062e-03:3.600121e-04:5.509853e-04:7.577062e-03
INFO ACTSTATS Application:PK_2112_YBDRYS:2:16:3.390312e-04:5.350113e-04:7.542133e-03:3.390312e-04:5.350113e-04:7.542133e-03
INFO ACTSTATS Application:PK_2112_YBDRYS:3:16:3.429651e-04:5.450249e-04:7.494092e-03:3.429651e-04:5.450249e-04:7.494092e-03

[Chart: per-process totals (0.00E+00 to 3.50E+05) for routines including PK_814_CALCHYDZ, RDPARAM, and PK_562_CALCHYDY across processes 0-7]
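
Page 22 mentions a Perl filter that turns the stats file into spreadsheet input; that script is not shown here. As a rough sketch of the same kind of post-processing (illustrative only, written in C++ rather than Perl), the following splits ACTSTATS records on ':' and prints routine name, process, and total inclusive time as CSV:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main()
    {
        std::string line;
        while (std::getline(std::cin, line)) {
            std::size_t pos = line.find("ACTSTATS ");
            if (pos == std::string::npos) continue;

            // Split the record part on ':' according to the field layout above.
            std::stringstream record(line.substr(pos + 9));
            std::vector<std::string> field;
            std::string token;
            while (std::getline(record, token, ':'))
                field.push_back(token);

            // <act>:<sym>:<proc>:<calls>:...:<totalincl> is 10 fields.
            if (field.size() < 10) continue;
            std::cout << field[1] << "," << field[2] << "," << field[9] << "\n";
        }
        return 0;
    }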

Page 25: GuideView Integrated Inside Vampir

Creating an extension API in Vampir:
- Insert menu items
- Include new displays
- Access to trace data & statistics

[Diagram: the Vampir menus invoke the new GuideView, which accesses the trace data held in memory; the Vampir GUI engine provides control, and output is displayed through the Motif graphics library]

Page 26: New GuideView – Whole Program View

Goals:
- Improve MPI/OpenMP integration
- Improve scalability
- Integrate look and feel

Works like the old GuideView! Load time: fast!

Page 27: New GuideView – Region View

Looks like the old Region view turned on its side!

Scalability test:
- 16 MPI tasks
- 16 OpenMP threads
- 300 parallel regions

Page 28: Hardware Performance Monitors

1) The user can call the HPM API in the source code
2) The user can define events in a config file for Guide instrumentation
3) HPM counter events are also logged from the Guidetrace and Vampirtrace libraries
4) The underlying HPM library is PAPI

[Diagram: the Guide/Vampirtrace tool chain from page 8, extended with a config file input to Guide and the PAPI library beneath the Guidetrace and Vampirtrace libraries]

Page 29: PAPI – Hardware Performance Monitors

- Standardizes counter names across platforms
- Users define counter sets
- The user could instrument by hand, but better: counters are instrumented at OpenMP constructs and subroutines
- Counters the platform does not support cannot be used

    int main(int argc, char **argv)
    {
        int set_id;
        int outer = 1, inner = 2, other = 3;   /* user-chosen symbol codes */

        /* Create a new event set to measure L1 & L2 data cache misses. */
        set_id = VT_create_event_set("MySet");
        VT_add_event(set_id, PAPI_L1_DCM);
        VT_add_event(set_id, PAPI_L2_DCM);

        VT_symdef(outer, "OUTER", "USERSTATES");
        VT_symdef(inner, "INNER", "USERSTATES");
        VT_symdef(other, "OTHER", "USERSTATES");

        /* Activate the event set. */
        VT_change_hpm(set_id);

        /* Collect the events over two user-defined intervals. */
        VT_begin(outer);
        foo();
        VT_begin(inner);
        bar();
        VT_end(inner);
        foo();
        VT_end(outer);
        return 0;
    }

Page 30: Hardware Performance Example

- MPI tasks on the timeline
- Floating-point instructions correlated, but in a different window
- Or, per-MPI-task activity correlated in the same window

Page 31: Hardware Performance Can Be Rich

4 x 4 SWEEP3D run showing:
- L1 data cache misses
- Cycles stalled waiting for memory accesses

Page 32: Hardware Performance in GuideView

- HPM data is visible in all GuideView windows
- L1 data cache misses and cycles stalled on memory accesses in the per-MPI-task profile view

Page 33: Derived Hardware Counters

- Vampir and GuideView displays present derived counters
- In this menu you can arithmetically combine measured counters into derived counters
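
For example, a derived counter such as an L1 data-cache miss rate could be formed as PAPI_L1_DCM divided by PAPI_TOT_INS; this particular combination is an illustration, not one taken from the slide.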

Page 34: Environmental Counters

- Select rusage information, handled like HPM data
- The data appears in Vampir and GuideView like HPM data
- Time-varying OS counters:
  - A config variable sets the sampling frequency
  - Difficult to attribute to source code precisely

Parameter / Meaning:
- utime: user time used
- stime: system time used
- maxrss: max resident set size
- ixrss: shared memory size
- idrss: unshared data size
- minflt: page reclaims
- majflt: page faults
- nswap: swaps
- inblock: block input operations
- oublock: block output operations
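
These parameters are fields of the POSIX rusage structure, which the library presumably samples at the configured frequency. A minimal, illustrative sketch of reading the same values directly (not the library's internal code):

    #include <sys/resource.h>
    #include <cstdio>

    void sample_environmental_counters()
    {
        struct rusage ru;
        // Query resource usage for the calling process.
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            std::printf("maxrss=%ld minflt=%ld majflt=%ld inblock=%ld oublock=%ld\n",
                        (long)ru.ru_maxrss, (long)ru.ru_minflt, (long)ru.ru_majflt,
                        (long)ru.ru_inblock, (long)ru.ru_oublock);
        }
    }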

Page 35: Environmental Awareness

Type 1: Collects IBM MPI information
- Treated as a static (one-time) event in the trace file
- Over 50 parameters

Parameter / Meaning:
- MP_EUIDEVICE: adapter set to be used for message passing
- MP_EUILIB: communication subsystem library implementation
- MP_INFOLEVEL: level of message reporting
- MP_BUFFER_MEM: size of unexpected-message buffers
- MP_CSS_INTERRUPT: generate interrupts for arriving packets
- MP_EAGER_LIMIT: threshold for switching to the rendezvous protocol
- MP_USE_FLOW_CONTROL: enforce flow control for outgoing messages

Page 36: Dynamic Control of Instrumentation

1) In the source, the user puts VT_confsync() calls
2) At runtime, TotalView is attached and a breakpoint is inserted
3) From process #0, the user adjusts several instrumentation settings
4) The VTconfigchanged flag is set and the breakpoint is exited

The trace file reflects the change after the next VT_confsync().

[Diagram: the Guide/Vampirtrace tool chain from page 8, with TotalView attached to the running executable]
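
A minimal sketch of where such a call might be placed in an application; the slide names VT_confsync() but not its signature, so the no-argument prototype below is an assumption:

    extern "C" int VT_confsync(void);   // assumed prototype

    void timestep_loop(int nsteps)
    {
        for (int step = 0; step < nsteps; ++step) {
            // ... compute and communicate for this step ...

            // One synchronization point per step: instrumentation changes made
            // from TotalView take effect here, at a well-defined boundary.
            VT_confsync();
        }
    }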

Page 37: Dynamic Control of Instrumentation

Keyword / Description / Default value:
- LOGFILE-NAME: trace file name (default: <argv[0]>.bvt)
- LOGFILE-PREFIX: trace file path prefix (default: null string)
- ACTIVITY: trace activities (user defined) (default: * ON)
- SYMBOL: trace symbols (often subroutines) (default: * ON)
- COUNTER: trace counters (default: * ON)
- OPENMP: trace OpenMP regions (default: * ON)
- PCTRACE: record return address (default: OFF)
- SUM-MPITESTS: collapse MPI probe and test routines (default: ON)
- CLUSTER: trace cluster nodes (default: all enabled)
- PROCESS: trace processes (default: all enabled)
- ENVIRONMENT: record environment information (default: ON)
- MEM-MAXBLOCKS: maximum number of memory blocks (default: unlimited)
- MEM-OVERWRITE: overwrite in-core buffers (default: OFF)
- PRUNE-LIMIT: execution time threshold (default: no pruning)

Page 38: Structured Trace Files – Frames Manage Scalability

A frame can select:
- A section of the timeline
- A set of processors
- Instances of a subroutine
- OpenMP regions
- Messages or collectives

Page 39: Structured Trace Files Consist of Frames

Frames are defined in the source code:
- int VT_framedef( char *name, unsigned int type_mask, int *frame_handle )
- int VT_framestart( int frame_handle )
- int VT_framestop( int frame_handle )

type_mask defines the types of data collected:
- VT_FUNCTION
- VT_REGION
- VT_PAR_REGION
- VT_OPENMP
- VT_COUNTER
- VT_MESSAGE
- VT_COLL_OP
- VT_COMMUNICATION
- VT_ALL

Analysis of time frames will be available.
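
A brief usage sketch built only from the signatures and flag names listed above; the numeric flag values below are placeholders, and in practice the definitions come from the Vampirtrace header:

    // Placeholder values for two of the flags listed above (assumed).
    #define VT_OPENMP   (1u << 3)
    #define VT_MESSAGE  (1u << 5)

    extern "C" {
        int VT_framedef(char *name, unsigned int type_mask, int *frame_handle);
        int VT_framestart(int frame_handle);
        int VT_framestop(int frame_handle);
    }

    void traced_solver_phase()
    {
        static int solver_frame = -1;
        if (solver_frame < 0) {
            // Collect only OpenMP regions and messages within this frame.
            VT_framedef(const_cast<char *>("solver"),
                        VT_OPENMP | VT_MESSAGE, &solver_frame);
        }
        VT_framestart(solver_frame);
        // ... solver work: its OpenMP regions and MPI messages land in this frame ...
        VT_framestop(solver_frame);
    }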

Page 40: Structured Trace Files – Rapid Access by Frames

1) The structured trace file consists of an index file plus individual frames
2) Vampir thumbnail displays represent the frames
3) Selecting a thumbnail displays that frame in Vampir

Page 41: Object Oriented Performance Analysis

How to avoid SOOX (Scalability Object Oriented eXplosion): instrument with an API
- C++ templates and classes make it much easier
- Can be used with or without source
- Uses the TAU model

[Diagram: events such as MPI_Send, MPI_Recv, MPI_Finalize, Func A, Func Init, and Func X/Y/Z are grouped by Informers (I_A ... I_D) into InformerMappings (ImX, ImY, ImZ, ImQ), which map onto VT activities]

Page 42: Example of OO Informers

    class Matrix {
    public:
        InformerMapping im;

        Matrix(int rows, int columns) {
            if (rows * columns > 500)
                im.Rename("LargeMatrix");
            else
                im.Rename("Matrix");
        }

        void invert() {
            Informer(im, "invert", 12, 15, "Example.C");
            #pragma omp parallel
            { .... }
            MPI_send(...);
        }

        void compl() {
            Informer(im, "typeid(...)");
            ....
        }
    };

    void main(int argc, char **argv) {
        Matrix A(10,10), B(512,512), C(1000,1000);   // line 1
        B.im.Rename("MediumMatrix");                 // line 2
        A.invert();                                  // line 3
        B.compl();                                   // line 4
        C.invert();                                  // line 5
    }

What happens:
- Line 1 creates three Matrix instances: A (mapped to the "Matrix" bin), B (mapped to the "LargeMatrix" bin), and C (mapped to the "LargeMatrix" bin)
- Line 2 remaps B to the "MediumMatrix" bin
- Line 3: A.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the Matrix bin
- Line 4: B.compl() is traced; entry and exit events are collected and associated with "Matrix:void compl(void)" in the MediumMatrix bin
- Line 5: C.invert() is traced; entry and exit events are collected and associated with "Matrix:invert" in the LargeMatrix bin

Page 43: Vampir OO Timeline Shows Informer Bins

- InformerMappings: each bin is displayed as a Vampir activity; MPI is put into a separate activity with the same prefix
- Events are renamed with a 'mangled name' of the form InformerMapping:Informer:NormalEventName

Page 44: Vampir OO Profile Shows Informer Bins

- Time in classes: Queens
- MPI time in class: Queens

Page 45: OO GuideView Shows Regions in Bins

Time and counter data per thread, by bin.

Page 46: Parallel Performance Engineering

ASCI Ultrascale Performance Tools:
- Scalability
- Integration
- Ease of use

Read about what was presented:
- ftp://ftp.kai.com/private/Lab_notes_2001.doc.gz
- Contact: [email protected]

Thank you for your attention!