performance instrumentation and measurement for terascale...

10
Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra , Allen D. Malony , Shirley Moore , Philip Mucci , and Sameer Shende Innovative Computing Laboratory, University of Tennessee Knoxville, TN 37996-3450 USA dongarra,shirley,mucci @cs.utk.edu Computer Science Department, University of Oregon Eugene, OR 97403-1202 USA malony,sameer @cs.uoregon.edu Abstract. As computer systems grow in size and complexity, tool support is needed to facilitate the efficient mapping of large-scale applications onto these systems. To help achieve this mapping, performance analysis tools must provide robust performance observation capabilities at all levels of the system, as well as map low-level behavior to high-level program constructs. Instrumentation and measurement strategies, developed over the last several years, must evolve to- gether with performance analysis infrastructure to address the challenges of new scalable parallel systems. 1 Introduction Performance observation requirements for terascale systems are determined by the per- formance problem being addressed and the performance evaluation methodology being applied. Instrumentation of an application is necessary to capture performance data. Instrumentation may be inserted at various stages, from source code modifications to compile-time to link-time to modification of executable code either statically or dynam- ically during program execution. These instrumentation points have different mecha- nisms which vary in their ease of use, flexibility, level of detail, user control of what data can be collected, and intrusiveness. To provide insight into program behavior on large-scale systems and point the way toward program transformations that will improve performance, various performance data must be collected. Profiling data show the distribution of a metric across source- level constructs, such as routines, loops, and basic blocks. Most modern microproces- sors provide a rich set of hardware counters that capture cycle count, functional unit, memory, and operating system events. Profiling can be based one either time or various hardware-based metrics, such as cache misses, for example. Correlations between pro- files based on different events, as well as event-based ratios, provide derived informa- tion that can help to quickly identify and diagnose performance problems. In addition to profiling data, capturing event traces of program events, such as message communi- cation events, helps portray the temporal dynamics of application performance.

Upload: others

Post on 14-Mar-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

PerformanceInstrumentation and MeasurementforTerascaleSystems

JackDongarra�, Allen D. Malony

�, Shirley Moore

�,

Philip Mucci�, andSameerShende

�Innovative ComputingLaboratory, Universityof Tennessee

Knoxville, TN 37996-3450USA�dongarra,shirley,mucci � @cs.utk.edu�

ComputerScienceDepartment,Universityof OregonEugene,OR97403-1202USA�

malony,sameer � @cs.uoregon.edu

Abstract. As computersystemsgrow in size and complexity, tool supportisneededto facilitatethe efficient mappingof large-scaleapplicationsonto thesesystems.To helpachieve this mapping,performanceanalysistoolsmustproviderobust performanceobservation capabilitiesat all levels of the system,aswellasmaplow-level behavior to high-level programconstructs.Instrumentationandmeasurementstrategies,developedover the last several years,must evolve to-getherwith performanceanalysisinfrastructureto addressthechallengesof newscalableparallelsystems.

1 Intr oduction

Performanceobservationrequirementsfor terascalesystemsaredeterminedby theper-formanceproblembeingaddressedandtheperformanceevaluationmethodologybeingapplied.Instrumentationof an applicationis necessaryto captureperformancedata.Instrumentationmaybe insertedat variousstages,from sourcecodemodificationstocompile-timeto link-time to modificationof executablecodeeitherstaticallyor dynam-ically during programexecution.Theseinstrumentationpointshave differentmecha-nismswhich vary in their easeof use,flexibility , level of detail, usercontrol of whatdatacanbecollected,andintrusiveness.

To provide insightinto programbehavior on large-scalesystemsandpoint thewaytoward programtransformationsthat will improve performance,variousperformancedatamustbecollected.Profiling datashow thedistribution of a metricacrosssource-level constructs,suchasroutines,loops,andbasicblocks.Most modernmicroproces-sorsprovide a rich setof hardwarecountersthat capturecycle count,functionalunit,memory, andoperatingsystemevents.Profilingcanbebasedoneeithertimeor varioushardware-basedmetrics,suchascachemisses,for example.Correlationsbetweenpro-files basedon differentevents,aswell asevent-basedratios,provide derivedinforma-tion thatcanhelp to quickly identify anddiagnoseperformanceproblems.In additionto profiling data,capturingeventtracesof programevents,suchasmessagecommuni-cationevents,helpsportraythetemporaldynamicsof applicationperformance.

Page 2: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

For terascalesystems,a widerangeof performanceproblems,performanceevalua-tion methods,andprogrammingenvironmentsneedto besupported.A flexible andex-tensibleperformanceobservationframework canbestprovide thenecessaryflexibilityin experimentdesign.Researchproblemsto beaddressedby theframework includethefollowing: theappropriatelevel andlocationin theframework for implementingdiffer-ent instrumentationandmeasurementstrategies,how to make theframework modularandextensible,andtheappropriatecompromisebetweenthe level of detail andaccu-racy of theperformancedatacollectedandtheinstrumentationcost.

Theremainderof thepaperis organizedasfollows.Section2 describestheinstru-mentationmechanismsdesirabletosupportin suchaframeworkandSection3describestypesof measurements.Section4 explainshow the instrumentationandmeasurementstrategiesaresupportedin the PAPI cross-platformhardwarecounterinterfaceandintheTAU performanceobservationframework.Section5 presentsourconclusions.

2 Instrumentation

To observe applicationperformance,additionalinstructionsor probesaretypically in-sertedinto a program.This processis called instrumentation. Instrumentationcanbeinsertedatvariousstages,asdescribedbelow.

2.1 SourceCodeInstrumentation

Instrumentationatthesourcecodelevel allowstheprogrammerto communicatehigher-level domain-specificabstractionsto theperformancetool. A programmercancommu-nicatesucheventsby annotatingthesourcecodeat appropriatelocationswith instru-mentationcalls. Oncethe programundergoesa seriesof transformationsto generatetheexecutablecode,specifyingarbitrarypointsin thecodefor instrumentationandun-derstandingprogramsemanticsat thosepointsmaynotbepossible.Anotheradvantageof sourcecodeinstrumentationis thatonceaninstrumentationlibrary targetsonelan-guage,it canprovideportabilityacrossmultiplecompilersfor thatlanguage,aswell asacrossmultiple platforms.Drawbacksof sourcecodeinstrumentationincludepossiblechangesin instructionanddatacachebehavior, interactionswith optimizingcompilers,andruntimeoverheadof instrumentationlibrary calls.

Sourcecodeannotationscanbeinsertedmanuallyor automatically. Adding instru-mentationcalls in the sourcecodemanuallycanbe a tedioustaskthat introducesthepossibility of instrumentationerrorsproducingerroneousperformancedata.Someofthesedifficultieswith manualsourcecodeinstrumentationcanbeovercomeby usinga source-to-sourcepreprocessorto build anautomaticinstrumentationtool. ToolssuchasProgramDatabaseToolkit (PDT) [10] for C++, C andFortran90, canbe usedtoautomaticallyinstrumentsubroutines,coderegions,andstatements.

2.2 Library Level Instrumentation

Wrapperinterpositionlibrariesprovidea convenientmechanismfor addinginstrumen-tationcallsto libraries.For instance,theMPI ProfilingInterface[1] allowsatool devel-operto interfacewith MPI callsin aportablemannerwithoutmodifyingtheapplication

Page 3: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

sourcecodeor having accessto theproprietarysourcecodeof thelibrary implementa-tion. Theadvantageof library instrumentationis that it is relatively easyto enableandtheeventsgeneratedarecloselyassociatedwith thesemanticsof thelibrary routines.

2.3 Binary Instrumentation

Executableimagescanbe instrumentedusingbinarycode-rewriting techniques,oftenreferredto asbinary editing tools or executableediting tools.SystemssuchasPixie,ATOM [5], EEL [9], andPAT [6] includean objectcodeinstrumentorthat parsesanexecutableandrewritesit with addedinstrumentationcode.Theadvantageof binaryin-strumentationis thatthereisnoneedto re-compileanapplicationprogramandrewritingabinaryfile is mostlyindependentof theprogramminglanguage.Also, it is possibletospawn theinstrumentedparallelprogramthesamewayastheoriginalprogram,withoutany specialmodificationasarerequiredfor runtimeinstrumentation[12]. Furthermore,sinceanexecutableprogramis instrumented,compileroptimizationsdo not changeorinvalidatetheperformanceoptimization.

2.4 Dynamic Instrumentation

Dynamic instrumentationis a mechanismfor runtimecodepatchingthat modifiesaprogramduringexecution.DyninstAPI[3] providesanefficient,low-overheadinterfacethat is suitablefor performanceinstrumentation.A tool that usesthis API is calleda mutator and can insert codesnippetsinto a running program,which is called themutatee, withoutre-compiling,re-linking,or eventre-startingtheprogram.Themutatorcan either spawn an executableand instrumentit prior to its execution,or attachtoa running program.Dynamic instrumentationovercomessomelimitations of binaryinstrumentationby allowing instrumentationcodeto beaddedandremovedat runtime.Also, the instrumentationcanbe doneon a runningprograminsteadof requiringtheuserto re-executetheapplication.Thedisadvantageof dynamicinstrumentationis thattheinterfaceneedsto beawareof multipleobjectfile formats,binaryinterfaces(32/64bit), operatingsystemidiosyncrasies,aswell ascompilerspecificinformation(e.g.,tosupporttemplatenamede-manglingin C++from multipleC++compilers).To maintaincrosslanguage,crossplatform,crossfile format,crossbinary interfaceportability is achallengingtaskandrequiresa continuousportingeffort asnew computingplatformsandmulti-threadedprogrammingenvironmentsevolve.

3 Typesof Measurements

Decisionsaboutinstrumentationareconcernedwith the numberand type of perfor-manceeventsone wantsto observe during an application’s execution.Measurementdecisionsaddressthe typesandamountof performancedataneededfor performanceproblemsolving.Often thesedecisionsinvolve tradeoffs of the needfor performancedataversusthecostof obtainingit (i.e., themeasurementoverhead).Post-mortemper-formanceevaluationtools typically fall into two categories:profiling andtracing,al-thoughsomeprovide both capabilities.More recently, sometools provide real-time,ratherthanpost-mortem,performancemonitoring.

Page 4: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

3.1 Profiling

Profilingcharacterizesthebehavior of anapplicationin termsof aggregateperformancemetrics.Profilesare typically representedasa list of variousmetrics(suchas inclu-sive/exclusive wall-clock time) that areassociatedwith program-level semanticsenti-ties (suchasroutines,basicblocks,or statementsin theprogram).Time is a commonmetric,but any monotonicallyincreasingresourcefunctioncanbeused,suchascountsfrom hardwareperformancecounters.Profiling can be implementedby samplingorinstrumentationbasedapproaches.

3.2 Tracing

While profiling is usedto get aggregatesummariesof metricsin a compactform, itcannothighlight the time varyingaspectsof the execution.To studythe post-mortemspatialandtemporalaspectsof performancedata,event tracing,that is, theactivity ofcapturingeventsor actionsthattakeplaceduringprogramexecution,is moreappropri-ate.Eventtracingusuallyresultsin a log of theeventsthatcharacterizetheexecution.Eachevent in the log is anorderedtuple typically containinga time stamp,a location(e.g.,node,thread)anidentifierthatspecifiesthetypeof event(e.g.,routinetransition,user-definedevent,messagecommunication,etc.)andevent-specificinformation.For aparallelexecution,traceinformationgeneratedon differentprocessorsmaybemergedto produceasingletracefile. Themergingis usuallybasedonthetimestampwhichcanreflectlogical timeor physicaltime.

3.3 Real-timePerformanceMonitoring

Post-mortemanalysisof profiling dataor tracefiles hasthe disadvantagethat anal-ysis cannotbegin until after programexecutionhasfinished.Real-timeperformancemonitoringallows usersto evaluateprogramperformanceduringexecution.Real-timeperformancemonitoringis sometimescoupledwith applicationperformancesteering.

4 PAPI and TAU Instrumentation and MeasurementStrategies

Understandingthe performanceof parallel systemsis a complicatedtaskbecauseofthedifferentperformancelevelsinvolvedandtheneedto associateperformanceinfor-mationto theprogrammingandproblemabstractionsusedby applicationdevelopers.Terascalesystemsdo not make this taskany easier. We needto developperformanceanalysisstrategiesandtechniquethataresuccessfulboth in their accessibilityto usersandin their robustapplication.In this section,we describeour efforts in evolving thePAPI andTAU technologiesto terscaleuse.

4.1 PAPI

Most modernmicroprocessorsprovide hardwaresupportfor collectinghardwareper-formancecounterdata[2]. Performancemonitoringhardwareusuallyconsistsof a set

Page 5: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

Interface

Performance Counter Hardware

Operating System

PAPI High−Level

Kernel Extension

Interface

PAPI Machine DependentSubstrate

PAPI Low−Level

Fig.1. Layeredarchitectureof thePAPI implementation

of registersthatrecorddataabouttheprocessor’s function.Theseregistersrangefromsimpleeventcountersto moresophisticatedhardwarefor recordingdatasuchasdataand instructionaddressesfor an event, and pipeline or memorylatenciesfor an in-struction.Monitoring hardwareeventsfacilitatescorrelationbetweenthe structureofanapplication’ssource/objectcodeandtheefficiency of themappingof thatcodeto theunderlyingarchitecture.

Becauseof thewiderangeof performancemonitoringhardwareavailableondiffer-entprocessorsandthedifferentplatform-dependentinterfacesfor accessingthis hard-ware,thePAPI projectwasstartedwith thegoalof providing astandardcross-platforminterfacefor accessinghardwareperformancecounters[2]. PAPI proposesa standardsetof library routinesfor accessingthecountersaswell asastandardsetof eventsto bemeasured.Thelibrary interfaceconsistsof a high-level anda low-level interface.Thehigh-level interfaceprovidesasimplesetof routinesfor starting,reading,andstoppingthecountersfor a specifiedlist of events.The fully programmablelow-level interfaceprovidesadditionalfeaturesandoptionsandis intendedfor tool or applicationdevelop-erswith moresophisticatedneeds.

Thearchitectureof PAPI is shown in Figure1. Thegoalof thePAPI projectis toprovideafirm foundationthatsupportstheinstrumentationandmeasurementstrategiesdescribedin theprecedingsectionsandthatsupportsdevelopmentof end-userperfor-manceanalysistools for the full rangeof high-performancearchitecturesandparallelprogrammingmodels.For manualandpreprocessorsourcecodeinstrumentation,PAPIprovidesthehigh-level andlow-level routinesdescribedabove.ThePAPI flops callis aneasy-to-useroutinethatprovidestiming dataandthefloatingpointoperationcountfor thebracketedcode.Thelow-level routinestargetthemoredetailedinformationandfull rangeof optionsneededby tool developers.For examplethePAPI profil callimplementsSVR4-compatiblecodeprofiling basedon any hardwarecountermetric.Again, the codeto be profiledneedonly bebracketedby calls to the PAPI profilroutine.This routinecanbeusedby end-usertoolssuchasVProf 3 to collectprofilingdatawhichcanthenbecorrelatedwith applicationsourcecode.

Referenceimplementationsof PAPI areavailablefor a numberof platforms(e.g.,CrayT3E, SGI IRIX, IBM AIX Power, SunUltrasparcSolaris,Linux/x86,Linux/IA-64,HP/CompaqAlphaTru64Unix). Theimplementationfor agivenplatformattemptsto map as many of the standardPAPI eventsas possibleto the available platform-

3 http:/aros.ca.sandia.gov/ c̃ljanss/perf/vp rof/

Page 6: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

specificevents.The implementationalsoattemptsto useavailablehardwareandop-eratingsystemsupport– e.g.,for countermultiplexing, interrupton counteroverflow,andstatisticalprofiling.

Using PAPI on large-scaleapplicationcodes,suchas the EVH1 hydrodynamicscode,hasraisedissuesof scalabilityof the instrumentation.PAPI initially focusedonobtainingaggregatecountsof hardwareevents.However, theoverheadof library callsto readthe hardwarecounterscan be excessive if the routinesare called frequently– for example,on entry andexit of a small subroutineor basicblock within a tightloop. Unacceptableoverheadhascausedsometool developersto reducethe numberof calls throughstatisticalsamplingtechniques.On mostplatforms,the currentPAPIcodeimplementsstatisticalprofiling overaggregatecountingby generatinganinterrupton counteroverflow of a thresholdandsamplingtheprogramcounter. On out-of-orderprocessorstheprogramcountermayyield anaddressthatis severalinstructionsor evenbasicblocksremovedfrom thetrueaddressof the instructionthatcausedtheoverflowevent.ThePAPI projectis investigatinghardwaresupportfor sampling,sothattool de-veloperscanbe relievedof this burdenandmaximumaccuracy canbe achievedwithminimal overhead.With hardwaresampling,an in-flight instructionis selectedat ran-domandinformationaboutits stateis recorded– for example,thetypeof instruction,its address,whetherit hasincurreda cacheor TLB miss,variouspipelineand/ormem-ory latenciesincurred.Thesamplingresultsprovide a histogramof the profiling datawhichcorrelateseventfrequencieswith programlocations.

In addition,aggregateeventcountscanbeestimatedfrom samplingdatawith loweroverheadthandirectcounting.For example,thenew PAPI substratefor theHP/CompaqAlphaTru64UNIX platformis built on topof aprogramminginterfaceto DCPIcalledDADD (DynamicAccessto DCPI Data).DCPI identifiesthe exact addressof an in-struction,thusresultingin accuratetext addressesfor profiling data[4]. TestrunsofthePAPI calibrate utility on thesubstratehave shown thateventcountsconvergeto theexpectedvalue,givena long enoughrun time to obtainsufficient samples,whileincurringonly oneto two percentoverhead,ascomparedto up to 30 percenton othersubstratesthat usedirect counting.A similar capabilityexistson the ItaniumandIta-nium 2 platforms,whereEvent AddressRegisters(EARs) accuratelyidentify the in-structionanddataaddressesfor someevents.Futureversionsof PAPI will make useof suchhardwareassistedprofiling andwill provideanoptionfor estimatingaggregatecountsfrom samplingdata.

The dynaprof tool that is part of the most recentPAPI releaseusesdynamicinstrumentationto allow the userto either load an executableor attachto a runningexecutableand then dynamically insert instrumentationprobes[11]. Dynaprof usesDyninst API [3] on Linux/IA-32, SGI IRIX, and Sun Solarisplatforms,and DPCL4 on IBM AIX. The usercan list the internalstructureof the applicationin order toselectinstrumentationpoints.Dynaprofinsertsinstrumentationin the form of probes.Dynaprofprovidesa PAPI probefor collectinghardwarecounterdataanda wallclockprobefor measuringelapsedtime, both on a per-threadbasis.Usersmay optionallywrite their own probes.A probemay usewhatever output format is appropriate,forexamplea real-timedatafeed to a visualizationtool or a static datafile dumpedto

4 http://oss.software.ibm.com/developerwor ks/op ensour ce/dpc l/

Page 7: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

Fig.2. Real-timeperformanceanalysisusingPerfometer

disk at theendof the run. Futureplansareto developadditionalprobes,for examplefor VProf andTAU, andto improvesupportfor instrumentationandcontrolof parallelmessage-passingprograms.

PAPI hasbeenincorporatedinto a numberof profiling tools, includingSvPablo5,TAU andVProf. In supportof tracing,PAPI is alsobeingincorporatedinto version3 oftheVampirMPI analysistool 6. CollectingPAPI datafor variouseventsover intervalsof timeanddisplayingthisdataalongsidetheVampirtimelineview enablescorrelationof eventfrequencieswith messagepassingbehavior.

Real-timeperformancemonitoringis supportedby theperfometertool that is dis-tributedwith PAPI. By connectingthegraphicaldisplayto thebackendprocess(or pro-cesses)runninganapplicationcodethathasbeenlinkedwith theperfometerandPAPIlibraries,thetool providesa runtimetraceof a user-selectedPAPI metric,asshown inFigure2 for floating point operationsper second(FLOPS).The usermay changetheperformanceeventbeingmeasuredby clicking on theSelectMetric button.Theintentof perfometeris to provide a fastcoarse-grainedeasyway for a developerto find outwherea bottleneckexists in a program.In additionto real-timeanalysis,theperfome-ter library cansave a tracefile for lateroff-line analysis.Thedynaproftool describedabove includesa perfometerprobethatcanautomaticallyinsertcallsto theperfometersetupandcolor selectionroutinessothata runningapplicationcanbeattachedto andmonitoredin real-timewithout requiringany sourcecodechangesor recompilationorevenrestartingtheapplication.

5 http://www-pablo.cs.uiuc.edu/Project/SVP ablo/ SvPabl oOverv iew.ht m6 http://www.pallas.com/e/products/vampir/ index .htm

Page 8: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

Fig.3. ScalableSAMRAI ProfileDisplay

4.2 TAU PerformanceSystem

TheTAU (TuningandAnalysisUtilities) performancesystemis aportableprofilingandtracingtoolkit for parallelthreadedandor message-passingprogramswrittenin Fortran,C,C++,or Java,or acombinationof FortranandC.TheTAU architecturehasthreedis-tinct parts:instrumentation,measurement,andanalysis.The programcanundergo aseriesof transformationsthatinsertinstrumentationbeforeit executes.Instrumentationcanbeaddedat variousstages,from compile-timeto link-time to run-time,with eachstageimposingdifferentconstraintsandopportunitiesfor extractingprograminforma-tion. Moving from sourcecodeto binary instrumentationtechniquesshifts the focusfrom a languagespecificto a moreplatformspecificapproach.TAU canbeconfiguredto doeitherprofiling or tracingor to dobothsimultaneously.

Sourcecodecan be instrumentedby manuallyinsertingcalls to the TAU instru-mentationAPI, or by usingPDT [10] to insert instrumentationautomatically. PDT isa codeanalysisframework for developingsource-basedtools. It includescommercialgradefront endparsersfor Fortran77/90,C, andC++, aswell asa portableinterme-diatelanguageanalyzer, databaseformat,andaccessAPI. The TAU projecthasusedPDT to implementa source-to-sourceinstrumentor(tau instrumentor ) thatsup-portsautomaticinstrumentationof C, C++,andFortran77/90programs.TAU canalsouseDyninstAPI [3] to constructcalls to theTAU measurementlibrary andtheninsertthesecalls into theexecutablecode.In bothcases,a selective instrumentationlist thatspecifiesa list of routinesto beincludedor excludedfrom instrumentationcanbepro-vided. TAU usesPAPI to generateperformanceprofileswith hardwarecounterdata.It alsousesthe MPI profiling interfaceto generateprofile and/ortracedatafor MPIoperations.

Recently, theTAU projecthasfocussedon how to measureandanalyzelarg-scaleapplicationperformancedata.All of the instrumentationandmeasurementtechniquesdiscussedabove apply. In the caseof parallelprofiles,TAU suffers no limitations in

Page 9: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

F

Fig.4. PerformanceProfileVisualizationof 500UintahThreads

the ability to make low-overeheadperformancemeasurement.However, a significantamountof performancedatacanbegeneratedfor largeprocessorruns.TheTAU Para-Prof tool provide the userwith meansto navigatethroughtheprofile dataset.For ex-ample,we appliedParaProfto TAU dataobtainedduring the profiling of a SAMRAI[8] applicationrun on 512processornodes.Figure3 shows a view of exclusive wall-clocktimefor all events.Thedisplayis fully interactive,andcanbe“zoomed”in or outto show local detail.Evenso,someperformancecharacteristicscanstill bedifficult tocomprehendwhenpresentedwith somuchvisualdata.

We have also beenexperimentingwith three-dimensionaldisplaysof large-scaleperformancedata.For instance,Figure4 7 shows two visualizationsof parallelprofilesamplesfrom a Uintah [7] application.The left visualizationis for a 500 processorrun andshows the entireparallelprofile measurement.The performanceevents(i.e.,functions)arealongthe x-axis, the threadsarealongthe y-axis,andtheperformancemetric(in this case,theexclusive executiontime) is alongthez-axis.This full perfor-manceview enablestheuserto quickly identify major performancecontributors.TheMPI Recv() function is highlighted.Theright displayis of thesamedataset,but inthiscaseeachthreadis shown asa sphereata coordinatepointdeterminedby therela-tive exclusive executiontime of threesignificantevents.Thevisualizationgivesa wayto seeclusteringrelationships.

5 Conclusions

Terascalesystemsrequirea performanceobservation framework that supportsa widerangeof instrumentationandmeasurementstrategies.ThePAPI andTAU projectsareaddressingimportantresearchproblemsrelatedto constructionof sucha framework.Thewidespreadadoptionof PAPI by third-partytool developersdemonstratesthevalueof implementinglow-levelaccesstoarchitecture-specificperformancemonitoringhard-wareunderneatha portableinterface.Now tool developerscanprogramto a singlein-

7 Visualizationscourtesyof Kai Li, Universityof Oregon

Page 10: Performance Instrumentation and Measurement for Terascale ...icl.utk.edu/~mucci/latest/pubs/ICCS2003.pdf · Also, the instrumentation can be done on a running program instead of requiring

terface,allowing themto focus their efforts on high-level tool design.Similarly, theTAU framework providesportablemechanismsfor instrumentationandmeasurementof parallelsoftwareandsystems.

Terascalesystemsrequirescalablelow-overheadmeansof collectingrelevantper-formancedata.Statisticalsamplingmethods,suchasusedin thenew PAPI substratesfor theAlphaTru64UNIX andLinux/Itanium/Itanium2platforms,yield sufficientlyac-curateresultswhile incurringvery little overhead.Filteringandfeedbackschemessuchasthoseuseby TAU lower overheadwhile focusinginstrumentationwhereit is mostneeded.BothPAPI andTAU projectsaredevelopingonlinemonitoringcapabilitiesthatcanbe usedto control instrumentation,measurement,and runtimeperformancedataanalysis.This will be importantfor effective performancesteeringin highly parallelenvironments.

Together, thePAPI 8 andTAU 9 projectshave beguntheconstructionof a portableperformancetool infrastructurefor terascalesystemsdesignedfor interoperability, flex-ibility , andextensibility.

References

1. MPI: A messagepassinginterfacestandard.InternationalJournal of SupercomputingAp-plications, 8(3/4),1994.

2. S.Browne,J.Dongarra,N. Garner, G. Ho, andP. Mucci. A portableprogramminginterfacefor performanceevaluationon modernprocessors.InternationalJournal of High Perfor-manceComputingApplications, 14(3):189–204,2000.

3. B. BuckandJ.Hollingsworth. An API for runtimecodepatching.InternationalJournal ofHigh PerformanceComputingApplications, 14(4):317–329,2000.

4. J. Dean,C. Waldspurger, and W. Weihl. Transparent,low-overheadprofiling on modernprocessors.In WorkshoponProfileandFeedback-directedCompilation, October1998.

5. A. EustaceandA. Srivastava. ATOM: A flexible interfacefor building high performanceprogramanalysistools. In Proc.USENIXWinter 1995, pages303–314,1995.

6. J.GalarowicsandB. Mohr. AnalyzingmessagepassingprogramsontheCrayT3Ewith PATandVAMPIR. Technicalreport,ZAM Forschungszentrum:Juelich,Germany, 1998.

7. J.S.Germain,A. Morris,S.Parker, A. Malony, andS.Shende.Integratingperformanceanal-ysisin theuintahsoftwaredevelopmentcycle. In High PerformanceDistributedComputingConference, pages33–41,2000.

8. R. HornungandS. Kohn. Managingapplicationcomplexity in the samraiobject-orientedframework. ConcurrencyandComputation:PracticeandExperience, specialissueonSoft-wareArchitecturesfor ScientificApplications,2001.

9. J. LarusandT. Ball. Rewriting executablefiles to measureprogrambehavior. SoftwarePracticeandExperience, 24(2):197–218,1994.

10. K. Lindlan,J.Cuny, A. Malony, S.Shende,B. Mohr, R. Rivenburgh,andC. Rasmussen.Atool framework for staticanddynamicanalysisof object-orientedsoftwarewith templates.In Proc.SC2000, 2000.

11. P. Mucci. Dynaprof0.8user’s guide.Technicalreport,Nov. 2002.12. S.Shende,A. Malony, andR. Bell. Instrumentationandmeasurementstrategiesfor flexible

andportableempiricalperformanceevaluation.In InternationalConferenceonParallel andDistributedProcessingTechniquesandApplications(PDPTA’2001), 2001.

8 For furtherinformation:http://icl.cs.utk.edu/papi/9 For furtherinformation:http://www.cs.uoregon.edu/research/parac omp/ta u/