performance instrumentation and measurement for terascale...

PerformanceInstrumentation and MeasurementforTerascaleSystems

JackDongarra�, Allen D. Malony

�, Shirley Moore

�,

Philip Mucci�, andSameerShende

�

�Innovative ComputingLaboratory, Universityof Tennessee

Knoxville, TN 37996-3450USA�dongarra,shirley,mucci � @cs.utk.edu�

ComputerScienceDepartment,Universityof OregonEugene,OR97403-1202USA�

malony,sameer � @cs.uoregon.edu

Abstract. As computersystemsgrow in size and complexity, tool supportisneededto facilitatethe efficient mappingof large-scaleapplicationsonto thesesystems.To helpachieve this mapping,performanceanalysistoolsmustproviderobust performanceobservation capabilitiesat all levels of the system,aswellasmaplow-level behavior to high-level programconstructs.Instrumentationandmeasurementstrategies,developedover the last several years,must evolve to-getherwith performanceanalysisinfrastructureto addressthechallengesof newscalableparallelsystems.

1 Intr oduction

Performanceobservationrequirementsfor terascalesystemsaredeterminedby theper-formanceproblembeingaddressedandtheperformanceevaluationmethodologybeingapplied.Instrumentationof an applicationis necessaryto captureperformancedata.Instrumentationmaybe insertedat variousstages,from sourcecodemodificationstocompile-timeto link-time to modificationof executablecodeeitherstaticallyor dynam-ically during programexecution.Theseinstrumentationpointshave differentmecha-nismswhich vary in their easeof use,flexibility , level of detail, usercontrol of whatdatacanbecollected,andintrusiveness.

To provide insightinto programbehavior on large-scalesystemsandpoint thewaytoward programtransformationsthat will improve performance,variousperformancedatamustbecollected.Profiling datashow thedistribution of a metricacrosssource-level constructs,suchasroutines,loops,andbasicblocks.Most modernmicroproces-sorsprovide a rich setof hardwarecountersthat capturecycle count,functionalunit,memory, andoperatingsystemevents.Profilingcanbebasedoneeithertimeor varioushardware-basedmetrics,suchascachemisses,for example.Correlationsbetweenpro-files basedon differentevents,aswell asevent-basedratios,provide derivedinforma-tion thatcanhelp to quickly identify anddiagnoseperformanceproblems.In additionto profiling data,capturingeventtracesof programevents,suchasmessagecommuni-cationevents,helpsportraythetemporaldynamicsof applicationperformance.

For terascalesystems,a widerangeof performanceproblems,performanceevalua-tion methods,andprogrammingenvironmentsneedto besupported.A flexible andex-tensibleperformanceobservationframework canbestprovide thenecessaryflexibilityin experimentdesign.Researchproblemsto beaddressedby theframework includethefollowing: theappropriatelevel andlocationin theframework for implementingdiffer-ent instrumentationandmeasurementstrategies,how to make theframework modularandextensible,andtheappropriatecompromisebetweenthe level of detail andaccu-racy of theperformancedatacollectedandtheinstrumentationcost.

Theremainderof thepaperis organizedasfollows.Section2 describestheinstru-mentationmechanismsdesirabletosupportin suchaframeworkandSection3describestypesof measurements.Section4 explainshow the instrumentationandmeasurementstrategiesaresupportedin the PAPI cross-platformhardwarecounterinterfaceandintheTAU performanceobservationframework.Section5 presentsourconclusions.

2 Instrumentation

To observe applicationperformance,additionalinstructionsor probesaretypically in-sertedinto a program.This processis called instrumentation. Instrumentationcanbeinsertedatvariousstages,asdescribedbelow.

2.1 SourceCodeInstrumentation

Instrumentationatthesourcecodelevel allowstheprogrammerto communicatehigher-level domain-specificabstractionsto theperformancetool. A programmercancommu-nicatesucheventsby annotatingthesourcecodeat appropriatelocationswith instru-mentationcalls. Oncethe programundergoesa seriesof transformationsto generatetheexecutablecode,specifyingarbitrarypointsin thecodefor instrumentationandun-derstandingprogramsemanticsat thosepointsmaynotbepossible.Anotheradvantageof sourcecodeinstrumentationis thatonceaninstrumentationlibrary targetsonelan-guage,it canprovideportabilityacrossmultiplecompilersfor thatlanguage,aswell asacrossmultiple platforms.Drawbacksof sourcecodeinstrumentationincludepossiblechangesin instructionanddatacachebehavior, interactionswith optimizingcompilers,andruntimeoverheadof instrumentationlibrary calls.

Sourcecodeannotationscanbeinsertedmanuallyor automatically. Adding instru-mentationcalls in the sourcecodemanuallycanbe a tedioustaskthat introducesthepossibility of instrumentationerrorsproducingerroneousperformancedata.Someofthesedifficultieswith manualsourcecodeinstrumentationcanbeovercomeby usinga source-to-sourcepreprocessorto build anautomaticinstrumentationtool. ToolssuchasProgramDatabaseToolkit (PDT) [10] for C++, C andFortran90, canbe usedtoautomaticallyinstrumentsubroutines,coderegions,andstatements.

2.2 Library Level Instrumentation

Wrapperinterpositionlibrariesprovidea convenientmechanismfor addinginstrumen-tationcallsto libraries.For instance,theMPI ProfilingInterface[1] allowsatool devel-operto interfacewith MPI callsin aportablemannerwithoutmodifyingtheapplication

sourcecodeor having accessto theproprietarysourcecodeof thelibrary implementa-tion. Theadvantageof library instrumentationis that it is relatively easyto enableandtheeventsgeneratedarecloselyassociatedwith thesemanticsof thelibrary routines.

2.3 Binary Instrumentation

Executableimagescanbe instrumentedusingbinarycode-rewriting techniques,oftenreferredto asbinary editing tools or executableediting tools.SystemssuchasPixie,ATOM [5], EEL [9], andPAT [6] includean objectcodeinstrumentorthat parsesanexecutableandrewritesit with addedinstrumentationcode.Theadvantageof binaryin-strumentationis thatthereisnoneedto re-compileanapplicationprogramandrewritingabinaryfile is mostlyindependentof theprogramminglanguage.Also, it is possibletospawn theinstrumentedparallelprogramthesamewayastheoriginalprogram,withoutany specialmodificationasarerequiredfor runtimeinstrumentation[12]. Furthermore,sinceanexecutableprogramis instrumented,compileroptimizationsdo not changeorinvalidatetheperformanceoptimization.

2.4 Dynamic Instrumentation

Dynamic instrumentationis a mechanismfor runtimecodepatchingthat modifiesaprogramduringexecution.DyninstAPI[3] providesanefficient,low-overheadinterfacethat is suitablefor performanceinstrumentation.A tool that usesthis API is calleda mutator and can insert codesnippetsinto a running program,which is called themutatee, withoutre-compiling,re-linking,or eventre-startingtheprogram.Themutatorcan either spawn an executableand instrumentit prior to its execution,or attachtoa running program.Dynamic instrumentationovercomessomelimitations of binaryinstrumentationby allowing instrumentationcodeto beaddedandremovedat runtime.Also, the instrumentationcanbe doneon a runningprograminsteadof requiringtheuserto re-executetheapplication.Thedisadvantageof dynamicinstrumentationis thattheinterfaceneedsto beawareof multipleobjectfile formats,binaryinterfaces(32/64bit), operatingsystemidiosyncrasies,aswell ascompilerspecificinformation(e.g.,tosupporttemplatenamede-manglingin C++from multipleC++compilers).To maintaincrosslanguage,crossplatform,crossfile format,crossbinary interfaceportability is achallengingtaskandrequiresa continuousportingeffort asnew computingplatformsandmulti-threadedprogrammingenvironmentsevolve.

3 Typesof Measurements

Decisionsaboutinstrumentationareconcernedwith the numberand type of perfor-manceeventsone wantsto observe during an application’s execution.Measurementdecisionsaddressthe typesandamountof performancedataneededfor performanceproblemsolving.Often thesedecisionsinvolve tradeoffs of the needfor performancedataversusthecostof obtainingit (i.e., themeasurementoverhead).Post-mortemper-formanceevaluationtools typically fall into two categories:profiling andtracing,al-thoughsomeprovide both capabilities.More recently, sometools provide real-time,ratherthanpost-mortem,performancemonitoring.

3.1 Profiling

Profilingcharacterizesthebehavior of anapplicationin termsof aggregateperformancemetrics.Profilesare typically representedasa list of variousmetrics(suchas inclu-sive/exclusive wall-clock time) that areassociatedwith program-level semanticsenti-ties (suchasroutines,basicblocks,or statementsin theprogram).Time is a commonmetric,but any monotonicallyincreasingresourcefunctioncanbeused,suchascountsfrom hardwareperformancecounters.Profiling can be implementedby samplingorinstrumentationbasedapproaches.

3.2 Tracing

While profiling is usedto get aggregatesummariesof metricsin a compactform, itcannothighlight the time varyingaspectsof the execution.To studythe post-mortemspatialandtemporalaspectsof performancedata,event tracing,that is, theactivity ofcapturingeventsor actionsthattakeplaceduringprogramexecution,is moreappropri-ate.Eventtracingusuallyresultsin a log of theeventsthatcharacterizetheexecution.Eachevent in the log is anorderedtuple typically containinga time stamp,a location(e.g.,node,thread)anidentifierthatspecifiesthetypeof event(e.g.,routinetransition,user-definedevent,messagecommunication,etc.)andevent-specificinformation.For aparallelexecution,traceinformationgeneratedon differentprocessorsmaybemergedto produceasingletracefile. Themergingis usuallybasedonthetimestampwhichcanreflectlogical timeor physicaltime.

3.3 Real-timePerformanceMonitoring

Post-mortemanalysisof profiling dataor tracefiles hasthe disadvantagethat anal-ysis cannotbegin until after programexecutionhasfinished.Real-timeperformancemonitoringallows usersto evaluateprogramperformanceduringexecution.Real-timeperformancemonitoringis sometimescoupledwith applicationperformancesteering.

4 PAPI and TAU Instrumentation and MeasurementStrategies

Understandingthe performanceof parallel systemsis a complicatedtaskbecauseofthedifferentperformancelevelsinvolvedandtheneedto associateperformanceinfor-mationto theprogrammingandproblemabstractionsusedby applicationdevelopers.Terascalesystemsdo not make this taskany easier. We needto developperformanceanalysisstrategiesandtechniquethataresuccessfulboth in their accessibilityto usersandin their robustapplication.In this section,we describeour efforts in evolving thePAPI andTAU technologiesto terscaleuse.

4.1 PAPI

Most modernmicroprocessorsprovide hardwaresupportfor collectinghardwareper-formancecounterdata[2]. Performancemonitoringhardwareusuallyconsistsof a set

Interface

Performance Counter Hardware

Operating System

PAPI High−Level

Kernel Extension

Interface

PAPI Machine DependentSubstrate

PAPI Low−Level

Fig.1. Layeredarchitectureof thePAPI implementation

of registersthatrecorddataabouttheprocessor’s function.Theseregistersrangefromsimpleeventcountersto moresophisticatedhardwarefor recordingdatasuchasdataand instructionaddressesfor an event, and pipeline or memorylatenciesfor an in-struction.Monitoring hardwareeventsfacilitatescorrelationbetweenthe structureofanapplication’ssource/objectcodeandtheefficiency of themappingof thatcodeto theunderlyingarchitecture.

Becauseof thewiderangeof performancemonitoringhardwareavailableondiffer-entprocessorsandthedifferentplatform-dependentinterfacesfor accessingthis hard-ware,thePAPI projectwasstartedwith thegoalof providing astandardcross-platforminterfacefor accessinghardwareperformancecounters[2]. PAPI proposesa standardsetof library routinesfor accessingthecountersaswell asastandardsetof eventsto bemeasured.Thelibrary interfaceconsistsof a high-level anda low-level interface.Thehigh-level interfaceprovidesasimplesetof routinesfor starting,reading,andstoppingthecountersfor a specifiedlist of events.The fully programmablelow-level interfaceprovidesadditionalfeaturesandoptionsandis intendedfor tool or applicationdevelop-erswith moresophisticatedneeds.

Thearchitectureof PAPI is shown in Figure1. Thegoalof thePAPI projectis toprovideafirm foundationthatsupportstheinstrumentationandmeasurementstrategiesdescribedin theprecedingsectionsandthatsupportsdevelopmentof end-userperfor-manceanalysistools for the full rangeof high-performancearchitecturesandparallelprogrammingmodels.For manualandpreprocessorsourcecodeinstrumentation,PAPIprovidesthehigh-level andlow-level routinesdescribedabove.ThePAPI flops callis aneasy-to-useroutinethatprovidestiming dataandthefloatingpointoperationcountfor thebracketedcode.Thelow-level routinestargetthemoredetailedinformationandfull rangeof optionsneededby tool developers.For examplethePAPI profil callimplementsSVR4-compatiblecodeprofiling basedon any hardwarecountermetric.Again, the codeto be profiledneedonly bebracketedby calls to the PAPI profilroutine.This routinecanbeusedby end-usertoolssuchasVProf 3 to collectprofilingdatawhichcanthenbecorrelatedwith applicationsourcecode.

Referenceimplementationsof PAPI areavailablefor a numberof platforms(e.g.,CrayT3E, SGI IRIX, IBM AIX Power, SunUltrasparcSolaris,Linux/x86,Linux/IA-64,HP/CompaqAlphaTru64Unix). Theimplementationfor agivenplatformattemptsto map as many of the standardPAPI eventsas possibleto the available platform-

3 http:/aros.ca.sandia.gov/ c̃ljanss/perf/vp rof/

specificevents.The implementationalsoattemptsto useavailablehardwareandop-eratingsystemsupport– e.g.,for countermultiplexing, interrupton counteroverflow,andstatisticalprofiling.

Using PAPI on large-scaleapplicationcodes,suchas the EVH1 hydrodynamicscode,hasraisedissuesof scalabilityof the instrumentation.PAPI initially focusedonobtainingaggregatecountsof hardwareevents.However, theoverheadof library callsto readthe hardwarecounterscan be excessive if the routinesare called frequently– for example,on entry andexit of a small subroutineor basicblock within a tightloop. Unacceptableoverheadhascausedsometool developersto reducethe numberof calls throughstatisticalsamplingtechniques.On mostplatforms,the currentPAPIcodeimplementsstatisticalprofiling overaggregatecountingby generatinganinterrupton counteroverflow of a thresholdandsamplingtheprogramcounter. On out-of-orderprocessorstheprogramcountermayyield anaddressthatis severalinstructionsor evenbasicblocksremovedfrom thetrueaddressof the instructionthatcausedtheoverflowevent.ThePAPI projectis investigatinghardwaresupportfor sampling,sothattool de-veloperscanbe relievedof this burdenandmaximumaccuracy canbe achievedwithminimal overhead.With hardwaresampling,an in-flight instructionis selectedat ran-domandinformationaboutits stateis recorded– for example,thetypeof instruction,its address,whetherit hasincurreda cacheor TLB miss,variouspipelineand/ormem-ory latenciesincurred.Thesamplingresultsprovide a histogramof the profiling datawhichcorrelateseventfrequencieswith programlocations.

In addition,aggregateeventcountscanbeestimatedfrom samplingdatawith loweroverheadthandirectcounting.For example,thenew PAPI substratefor theHP/CompaqAlphaTru64UNIX platformis built on topof aprogramminginterfaceto DCPIcalledDADD (DynamicAccessto DCPI Data).DCPI identifiesthe exact addressof an in-struction,thusresultingin accuratetext addressesfor profiling data[4]. TestrunsofthePAPI calibrate utility on thesubstratehave shown thateventcountsconvergeto theexpectedvalue,givena long enoughrun time to obtainsufficient samples,whileincurringonly oneto two percentoverhead,ascomparedto up to 30 percenton othersubstratesthat usedirect counting.A similar capabilityexistson the ItaniumandIta-nium 2 platforms,whereEvent AddressRegisters(EARs) accuratelyidentify the in-structionanddataaddressesfor someevents.Futureversionsof PAPI will make useof suchhardwareassistedprofiling andwill provideanoptionfor estimatingaggregatecountsfrom samplingdata.

The dynaprof tool that is part of the most recentPAPI releaseusesdynamicinstrumentationto allow the userto either load an executableor attachto a runningexecutableand then dynamically insert instrumentationprobes[11]. Dynaprof usesDyninst API [3] on Linux/IA-32, SGI IRIX, and Sun Solarisplatforms,and DPCL4 on IBM AIX. The usercan list the internalstructureof the applicationin order toselectinstrumentationpoints.Dynaprofinsertsinstrumentationin the form of probes.Dynaprofprovidesa PAPI probefor collectinghardwarecounterdataanda wallclockprobefor measuringelapsedtime, both on a per-threadbasis.Usersmay optionallywrite their own probes.A probemay usewhatever output format is appropriate,forexamplea real-timedatafeed to a visualizationtool or a static datafile dumpedto

4 http://oss.software.ibm.com/developerwor ks/op ensour ce/dpc l/

Fig.2. Real-timeperformanceanalysisusingPerfometer

disk at theendof the run. Futureplansareto developadditionalprobes,for examplefor VProf andTAU, andto improvesupportfor instrumentationandcontrolof parallelmessage-passingprograms.

PAPI hasbeenincorporatedinto a numberof profiling tools, includingSvPablo5,TAU andVProf. In supportof tracing,PAPI is alsobeingincorporatedinto version3 oftheVampirMPI analysistool 6. CollectingPAPI datafor variouseventsover intervalsof timeanddisplayingthisdataalongsidetheVampirtimelineview enablescorrelationof eventfrequencieswith messagepassingbehavior.

Real-timeperformancemonitoringis supportedby theperfometertool that is dis-tributedwith PAPI. By connectingthegraphicaldisplayto thebackendprocess(or pro-cesses)runninganapplicationcodethathasbeenlinkedwith theperfometerandPAPIlibraries,thetool providesa runtimetraceof a user-selectedPAPI metric,asshown inFigure2 for floating point operationsper second(FLOPS).The usermay changetheperformanceeventbeingmeasuredby clicking on theSelectMetric button.Theintentof perfometeris to provide a fastcoarse-grainedeasyway for a developerto find outwherea bottleneckexists in a program.In additionto real-timeanalysis,theperfome-ter library cansave a tracefile for lateroff-line analysis.Thedynaproftool describedabove includesa perfometerprobethatcanautomaticallyinsertcallsto theperfometersetupandcolor selectionroutinessothata runningapplicationcanbeattachedto andmonitoredin real-timewithout requiringany sourcecodechangesor recompilationorevenrestartingtheapplication.

5 http://www-pablo.cs.uiuc.edu/Project/SVP ablo/ SvPabl oOverv iew.ht m6 http://www.pallas.com/e/products/vampir/ index .htm

Fig.3. ScalableSAMRAI ProfileDisplay

4.2 TAU PerformanceSystem

TheTAU (TuningandAnalysisUtilities) performancesystemis aportableprofilingandtracingtoolkit for parallelthreadedandor message-passingprogramswrittenin Fortran,C,C++,or Java,or acombinationof FortranandC.TheTAU architecturehasthreedis-tinct parts:instrumentation,measurement,andanalysis.The programcanundergo aseriesof transformationsthatinsertinstrumentationbeforeit executes.Instrumentationcanbeaddedat variousstages,from compile-timeto link-time to run-time,with eachstageimposingdifferentconstraintsandopportunitiesfor extractingprograminforma-tion. Moving from sourcecodeto binary instrumentationtechniquesshifts the focusfrom a languagespecificto a moreplatformspecificapproach.TAU canbeconfiguredto doeitherprofiling or tracingor to dobothsimultaneously.

Sourcecodecan be instrumentedby manuallyinsertingcalls to the TAU instru-mentationAPI, or by usingPDT [10] to insert instrumentationautomatically. PDT isa codeanalysisframework for developingsource-basedtools. It includescommercialgradefront endparsersfor Fortran77/90,C, andC++, aswell asa portableinterme-diatelanguageanalyzer, databaseformat,andaccessAPI. The TAU projecthasusedPDT to implementa source-to-sourceinstrumentor(tau instrumentor ) thatsup-portsautomaticinstrumentationof C, C++,andFortran77/90programs.TAU canalsouseDyninstAPI [3] to constructcalls to theTAU measurementlibrary andtheninsertthesecalls into theexecutablecode.In bothcases,a selective instrumentationlist thatspecifiesa list of routinesto beincludedor excludedfrom instrumentationcanbepro-vided. TAU usesPAPI to generateperformanceprofileswith hardwarecounterdata.It alsousesthe MPI profiling interfaceto generateprofile and/ortracedatafor MPIoperations.

Recently, theTAU projecthasfocussedon how to measureandanalyzelarg-scaleapplicationperformancedata.All of the instrumentationandmeasurementtechniquesdiscussedabove apply. In the caseof parallelprofiles,TAU suffers no limitations in

F

Fig.4. PerformanceProfileVisualizationof 500UintahThreads

the ability to make low-overeheadperformancemeasurement.However, a significantamountof performancedatacanbegeneratedfor largeprocessorruns.TheTAU Para-Prof tool provide the userwith meansto navigatethroughtheprofile dataset.For ex-ample,we appliedParaProfto TAU dataobtainedduring the profiling of a SAMRAI[8] applicationrun on 512processornodes.Figure3 shows a view of exclusive wall-clocktimefor all events.Thedisplayis fully interactive,andcanbe“zoomed”in or outto show local detail.Evenso,someperformancecharacteristicscanstill bedifficult tocomprehendwhenpresentedwith somuchvisualdata.

We have also beenexperimentingwith three-dimensionaldisplaysof large-scaleperformancedata.For instance,Figure4 7 shows two visualizationsof parallelprofilesamplesfrom a Uintah [7] application.The left visualizationis for a 500 processorrun andshows the entireparallelprofile measurement.The performanceevents(i.e.,functions)arealongthe x-axis, the threadsarealongthe y-axis,andtheperformancemetric(in this case,theexclusive executiontime) is alongthez-axis.This full perfor-manceview enablestheuserto quickly identify major performancecontributors.TheMPI Recv() function is highlighted.Theright displayis of thesamedataset,but inthiscaseeachthreadis shown asa sphereata coordinatepointdeterminedby therela-tive exclusive executiontime of threesignificantevents.Thevisualizationgivesa wayto seeclusteringrelationships.

5 Conclusions

Terascalesystemsrequirea performanceobservation framework that supportsa widerangeof instrumentationandmeasurementstrategies.ThePAPI andTAU projectsareaddressingimportantresearchproblemsrelatedto constructionof sucha framework.Thewidespreadadoptionof PAPI by third-partytool developersdemonstratesthevalueof implementinglow-levelaccesstoarchitecture-specificperformancemonitoringhard-wareunderneatha portableinterface.Now tool developerscanprogramto a singlein-

7 Visualizationscourtesyof Kai Li, Universityof Oregon

terface,allowing themto focus their efforts on high-level tool design.Similarly, theTAU framework providesportablemechanismsfor instrumentationandmeasurementof parallelsoftwareandsystems.

Terascalesystemsrequirescalablelow-overheadmeansof collectingrelevantper-formancedata.Statisticalsamplingmethods,suchasusedin thenew PAPI substratesfor theAlphaTru64UNIX andLinux/Itanium/Itanium2platforms,yield sufficientlyac-curateresultswhile incurringvery little overhead.Filteringandfeedbackschemessuchasthoseuseby TAU lower overheadwhile focusinginstrumentationwhereit is mostneeded.BothPAPI andTAU projectsaredevelopingonlinemonitoringcapabilitiesthatcanbe usedto control instrumentation,measurement,and runtimeperformancedataanalysis.This will be importantfor effective performancesteeringin highly parallelenvironments.

Together, thePAPI 8 andTAU 9 projectshave beguntheconstructionof a portableperformancetool infrastructurefor terascalesystemsdesignedfor interoperability, flex-ibility , andextensibility.

References

1. MPI: A messagepassinginterfacestandard.InternationalJournal of SupercomputingAp-plications, 8(3/4),1994.

2. S.Browne,J.Dongarra,N. Garner, G. Ho, andP. Mucci. A portableprogramminginterfacefor performanceevaluationon modernprocessors.InternationalJournal of High Perfor-manceComputingApplications, 14(3):189–204,2000.

3. B. BuckandJ.Hollingsworth. An API for runtimecodepatching.InternationalJournal ofHigh PerformanceComputingApplications, 14(4):317–329,2000.

4. J. Dean,C. Waldspurger, and W. Weihl. Transparent,low-overheadprofiling on modernprocessors.In WorkshoponProfileandFeedback-directedCompilation, October1998.

5. A. EustaceandA. Srivastava. ATOM: A flexible interfacefor building high performanceprogramanalysistools. In Proc.USENIXWinter 1995, pages303–314,1995.

6. J.GalarowicsandB. Mohr. AnalyzingmessagepassingprogramsontheCrayT3Ewith PATandVAMPIR. Technicalreport,ZAM Forschungszentrum:Juelich,Germany, 1998.

7. J.S.Germain,A. Morris,S.Parker, A. Malony, andS.Shende.Integratingperformanceanal-ysisin theuintahsoftwaredevelopmentcycle. In High PerformanceDistributedComputingConference, pages33–41,2000.

8. R. HornungandS. Kohn. Managingapplicationcomplexity in the samraiobject-orientedframework. ConcurrencyandComputation:PracticeandExperience, specialissueonSoft-wareArchitecturesfor ScientificApplications,2001.

9. J. LarusandT. Ball. Rewriting executablefiles to measureprogrambehavior. SoftwarePracticeandExperience, 24(2):197–218,1994.

10. K. Lindlan,J.Cuny, A. Malony, S.Shende,B. Mohr, R. Rivenburgh,andC. Rasmussen.Atool framework for staticanddynamicanalysisof object-orientedsoftwarewith templates.In Proc.SC2000, 2000.

11. P. Mucci. Dynaprof0.8user’s guide.Technicalreport,Nov. 2002.12. S.Shende,A. Malony, andR. Bell. Instrumentationandmeasurementstrategiesfor flexible

andportableempiricalperformanceevaluation.In InternationalConferenceonParallel andDistributedProcessingTechniquesandApplications(PDPTA’2001), 2001.

8 For furtherinformation:http://icl.cs.utk.edu/papi/9 For furtherinformation:http://www.cs.uoregon.edu/research/parac omp/ta u/

performance instrumentation and measurement for terascale...

Documents