recent advances in the performance api (papi) · recent advances in the performance api (papi)...

31
10 th Scalable Tools Workshop Anthony Danalis Lake Tahoe, California August 1-4, 2016 Recent Advances in the Performance API (PAPI) Collaborators: Heike Jagode Asim Yarkhan Jack Dongarra

Upload: vophuc

Post on 01-Nov-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

10thScalableToolsWorkshopAnthonyDanalis

LakeTahoe,CaliforniaAugust1-4,2016

RecentAdvancesinthePerformanceAPI(PAPI)

Collaborators:HeikeJagodeAsimYarkhanJackDongarra

Page 2: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPI•  Middlewarethatprovidesaconsistentinterfaceandmethodologyfortheperformancecounter

hardwarefoundinmostmajormicroprocessors

•  PAPIenablessoftwareengineerstosee,innearrealtime,therelationbetweenSWperformanceandHWevents

SUPPORTEDARCHITECTURES:

•  AMD•  CRAY:Aris,Gemini,power•  IBMBlueGeneSeries,Q:5D-Torus,I/Osystem,CNK,EMONpower/energy•  IBMPowerSeries•  IntelWestmere,Sandy|IvyBridge,Haswell,Broadwell,Skylake,KnightsCorner|Landing•  ARMCortexA8,A9,A15,ARM64•  NVidiaTesla,Kepler,NVML:CUDAsupportformultipleGPUs;PCSampling•  In\iniband•  IntelRAPL(power/energy);powercapping•  IntelKNC,KNLpower/energy

COMPONENTPAPI:

•  providesaccesstoacollectionofcomponentsthatexposeperformancemeasurementopportunitiesacrossthesystemasawhole,includingnetwork,theI/Osystem,theComputeNodeKernel,power/energy

2

Page 3: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPICPUComponents:KNCvs.KNL

3

PAPIComponents KnightsCorner KnightsLanding

perf_event:Linuxperf_eventCPUcoreevents

PMU'ssupported:perf,perf_raw,knc#ofNaBveEvents:140#ofPresetEvents:14#ofCounters:2

PMU'ssupported:perf,perf_raw,knl,ix86arch#ofNaBveEvents:182#ofPresetEvents:26#ofCounters:5

perf_event_uncore:Linuxperf_eventCPUuncoreandnorthbridgeevents

---

PMU’ssupported:e.g.Memory,on-dieinterconnect,IO,Memory-to-PCIeeventsupport#ofNaBveEvents:894

PAPIofferstwocomponentstotheCPUcounters:

Seenextslide

Page 4: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

4

PresetEvents Descrip(onPAPI_L1_DCM Level1datacachemissesPAPI_L1_ICM Level1instruc2oncachemissesPAPI_L1_TCM Level1cachemissesPAPI_L2_TCM Level2cachemissesPAPI_TLB_DM Datatransla2onlookasidebuffermissesPAPI_L1_LDM Level1loadmissesPAPI_L2_LDM Level2loadmissesPAPI_STL_ICY Cycleswithnoinstruc2onissuePAPI_BR_UCN Uncondi2onalbranchinstruc2onsPAPI_BR_CN Condi2onalbranchinstruc2onsPAPI_BR_TKN Condi2onalbranchinstruc2onstakenPAPI_BR_NTK Condi2onalbranchinstruc2onsnottakenPAPI_BR_MSP Condi2onalbranchinstruc2onsmispredictedPAPI_TOT_INS Instruc2onscompletedPAPI_LD_INS Loadinstruc2onsPAPI_ST_INS Storeinstruc2onsPAPI_BR_INS Branchinstruc2onsPAPI_RES_STL CyclesstalledonanyresourcePAPI_TOT_CYC TotalcyclesPAPI_LST_INS Load/storeinstruc2onscompletedPAPI_L1_DCA Level1datacacheaccessesPAPI_L1_ICH Level1instruc2oncachehitsPAPI_L1_ICA Level1instruc2oncacheaccessesPAPI_L2_TCH Level2totalcachehitsPAPI_L2_TCA Level2totalcacheaccessesPAPI_REF_CYC Referenceclockcycles

Listofthe26PAPIPresetEventsforKNL

Page 5: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPIPOWERComponents:KNCvs.KNL

5

PAPIComponents KNC KNL

powercap:ReadsRAPLresultsviaLinuxPOWERCAPinterface

---

#ofNaBveEvents:15(requiresnospecialpermissions)

rapl:RAPLresults(rawaccesstotheunderlyingMSRs)

---

#ofNaBveEvents:14(requiresrootprivilege)

micpower:Readingpowerinna2vemode

Powervaluesreportedin/sys/class/micras/power#ofNaBveEvents:16

---

host_micpower:Readingpowerinoffloadmode

PowervaluesexportedviaMicAccessAPI(MPSS)#ofNaBveEvents:16

---

PAPIoffersfourcomponentsforPower/Energymonitoring:

Page 6: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPIPOWERonKnightsLanding

6

Thefollowingpowerdomainsaresupported:•  PACKAGE: Processordie

•  DRAM(Memory): Directly-attachedDRAM

•  PP0(PowerPlane0):Processorscoressubsystem

SimpleVeriTicationTest:NaiveMMM(1024x1024):Scaled Energy Measurements:

PACKAGE_ENERGY 8216.792419 J (Average Power 84.5W)

DRAM_ENERGY 2539.645264 J (Average Power 26.1W)

Energy Measurement Counts:PACKAGE_ENERGY_CNT 134623927

DRAM_ENERGY_CNT 41609548

Scaled Fixed Values:

THERMAL_SPEC 215.000 W

MAXIMUM_POWER 258.000 W

MAXIMUM_TIME_WINDOW 0.046 s

Fixed Value Counts:THERMAL_SPEC_CNT 1720

MAXIMUM_POWER_CNT 2064

MAXIMUM_TIME_WINDOW_CNT 47

Page 7: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPIPOWERonKNL:Hessenberg(MKL’sdgehrd)Intel®XeonPhi™KnightsLanding,68cores(4HWthreads/core)

+memory-boundkernel(GEMVsandGEMMs)

+9computationswithdifferentmatrixsizes

powerusagemimicscomputationalintensity:

+Factorizationstartsonentirematrixàconsumesmostpower

+Asfactorizationprogresses,itoperatesonsmallermatricesàconsumeslesspower

7

0

50

100

150

200

0 50 100 150 200

1088 (6.3 G

Flops)

2112 (13.6 G

Flops)

3136 (21.4 G

Flops)

4160 (27.9 G

Flops)

5184 (34.4 G

Flops)

6082 (36.0 G

Flops)

7232 (46.1 G

Flops)

8256 (50.5 G

Flops)

9280 (56.1 G

Flops)

Po

we

r (w

att

s)

Time (seconds) passing as different size Hessenberg reductions are done on KNL (68x4 cores)

Accelerator Power Usage (Watts) (PACKAGE) Memory Power Usage (Watts) (DRAM)

Page 8: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPIFORPARSECPARALLELRUNTIMESCHEDULINGANDEXECUTIONCONTROLLER

8

Page 9: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

DataTlow-drivenProgrammingModels

•  Developforaportabilitylayer,notanarchitecture

•  Lettheruntimedealwiththehardwarecharacteristics

•  Task-scheduling:PaRSEC,StarSS,StarPU,Swift,Parallex,Quark,Kaapi,DuctTeip

9

Page 10: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PaRSECFeatures

10

Page 11: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPIandPaRSECParallelRuntimeSchedulingandExecutionController

PaRSEC:

•  Genericframeworkforarchitectureawareschedulingofmicro-tasksondistributedmany-coreheterogeneousarchitectures

à  Performancetoolsbecomemoreandmoreimportantfortask-baseddataTlowandexecutionsystems

à  AnalysisfeaturesthatshowtheconnectionofthedataTlowandtheexecutionproTile/traceisextremelybeneTicial

PAPIinPaRSEC:

•  IntegratedinPaRSEC’sPerformanceINStrumentationmodules

•  PINSmodulescanbeselectivelyloadedandusedbyusers

•  Enablesuserstomeasureperformancecounterdataforeachtask/nodeinaDAG(DirectedAcyclicGraph)

•  EverythingsupportedbyPAPIcanbemeasuredinPaRSECat“pertask”granularity

11

Page 12: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPIPowerperTask:PaRSEC

10

15

20

25

30

35

40

0 10 20 30 40 50 60 70

Aver

age

Pow

er (W

atts

)

Time (Seconds)

PACKAGE_ENERGY:PACKAGE0PACKAGE_ENERGY:PACKAGE1

PP0_ENERGY:PACKAGE0PP0_ENERGY:PACKAGE1

[email protected]=11,584--TileSize=724SandyBridgeEP2.60GHz,2sockets,runningon1(outof8)corepersocket

TotalPoweronSocket0,1(includeseverything)

PowerofcoresonlyonSocket0,1

12

Page 13: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPIPowerSampling:ScaLAPACK

0

5

10

15

20

25

30

35

40

0 20 40 60 80 100 120 140

Aver

age

Pow

er (W

atts

)

Time (Seconds)

PACKAGE_ENERGY:PACKAGE0PACKAGE_ENERGY:PACKAGE1

PP0_ENERGY:PACKAGE0PP0_ENERGY:PACKAGE1

[email protected]=11,584--TileSize=724SandyBridgeEP2.60GHz,2sockets,runningon1(outof8)corepersocket

TotalPoweronSocket0,1(includeseverything)

PowerofcoresonlyonSocket0,1

13

Page 14: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PaRSEC:30.8GFLOPsTotalEnergy:

0

5

10

15

20

25

30

35

40

0 20 40 60 80 100 120 140

Aver

age

Pow

er (W

atts

)

Time (Seconds)

PACKAGE_ENERGY:PACKAGE0PACKAGE_ENERGY:PACKAGE1

PP0_ENERGY:PACKAGE0PP0_ENERGY:PACKAGE1

PAPIPowerMeasurements

4.35kWs

ScaLAPACK:19.6GFLOPsTotalEnergy:

7.79kWs

AveragePowerUsagefor(p)dgeqrf--MatrixSize=11,584--TileSize=724SandyBridgeEP2.60GHz,2sockets,runningon1(outof8)corepersocket

14

Page 15: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPIPOWERCONTROLLINGREADINGANDWRITINGPOWER

15

Page 16: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

RAPL&msr-safe&libmsr•  RAPL–RunningAveragePowerLimit

•  IntelSandybridgeorbetter•  Modelsenergyatpackage,DRAMcontroller,CPUcore(PP0),graphicsuncore(PP1)

•  ProvidingwriteaccesstoMSRscanbeunsafe•  Canhavelargeeffectonmachine•  Touse,youneedtomakeMSRswriteable,capability-executable,static-only

executable,paranoidsettinginkernel!!!

•  msr-safeandlibmsrfromLLNL•  msr-safeprovidesasaferwhitelistcontrolledaccesstoMSRs

•  kernelmoduleprovides/dev/cpu/*/safe_msr•  libmsrisalibrarytosimplifyaccesstoMSRs

•  PAPIlibmsrcomponentinPAPItoWRITEvalues•  WrapstheRAPLpowercallsinlibmsr•  SetRAPLpowerlimitsoveratime-window(twowindows)

•  setlimitonsocket,low,high,time-window

•  CollaborationwithBarryRountree(LLNL)

16

Page 17: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

0

20

40

60

80

100

0 5 10 15 20 25 30

Watts

Unit

Work

Tim

e (

seco

nds)

Elapsed time (seconds)

Using PAPI libmsr component to read and set power caps 2x8 cores Xeon E5-2690 SandyBridge-EP at 2.9GHz

Set/Request Avg Power Cap (watts in 1 sec)Read Power Consumpution (watts)

Time for Unit Work (seconds on y2 axis)

Ini_alPowerconsump_on Timetakenforwork

a`erseangtolowestpower(y2axis)

Timeforworkdecreasesaspowerincreases

ControllingPowerwithPAPI-libmsr

Page 18: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

0

20

40

60

80

100

0 5 10 15 20 25 30

Watts

Unit

Work

Tim

e (

seco

nds)

Elapsed time (seconds)

Using PAPI libmsr component to read and set power caps 2x8 cores Xeon E5-2690 SandyBridge-EP at 2.9GHz

Set/Request Avg Power Cap (watts in 1 sec)Read Power Consumpution (watts)

Time for Unit Work (seconds on y2 axis)

Ini_alPowerconsump_on

Set/writePowerCap

Timetakenforworka`erseangtolowestpower(y2axis)

Timeforworkdecreasesaspowerincreases

ControllingPowerwithPAPI-libmsr

Page 19: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

0

20

40

60

80

100

0 5 10 15 20 25 30

Watts

Unit

Work

Tim

e (

seco

nds)

Elapsed time (seconds)

Using PAPI libmsr component to read and set power caps 2x8 cores Xeon E5-2690 SandyBridge-EP at 2.9GHz

Set/Request Avg Power Cap (watts in 1 sec)Read Power Consumpution (watts)

Time for Unit Work (seconds on y2 axis)

Ini_alPowerconsump_on

Set/writePowerCap

Timetakenforworka`erseangtolowestpower(y2axis)

Trytosetpowerabovemaximum

Timeforworkdecreasesaspowerincreases

ControllingPowerwithPAPI-libmsr

Page 20: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

20

UsageScenarios

•  Sampleusagescenarios

•  Ifweknowthatcomputationrequirementswilldecreaseduetocommunication(I/Obound)andthattheoverallexecutiontimewillnotsufferiftheCPUpoweriscappedtemporarily.

•  WecanschedulethecriticalpathoftheDAGonfastresources,decreasepowerconsumptiononsocketsrunningtherestoftheDAG.

Page 21: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

LUfactorizationandDAG

�����

����

�����

����

����

���������

����

����� �����

��������

����

�����

����

���������

���������

�����

Page 22: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

SOC 0 CPU 0SOC 0 CPU 1SOC 0 CPU 2SOC 0 CPU 3SOC 1 CPU 4SOC 1 CPU 5SOC 1 CPU 6SOC 1 CPU 7

Time (sec): 0.0

SOC 0 CPU 0SOC 0 CPU 1SOC 0 CPU 2SOC 0 CPU 3SOC 1 CPU 4SOC 1 CPU 5SOC 1 CPU 6SOC 1 CPU 7

Time (sec): 0.0

Bothsocket0and1runningatfullpower.Notethatthepanelfactoriza_on(red)islongandonthecri_calpath,sothereiswhitespacewherenotaskscanberunonsocket2.

Slowdownsocket1usingRAPLandlockcriBcalpathtosocket0.Thegemmtasks(green)takelonger,fillingoutthewhitespaceonsocket2.Thisoccurswithoutanyoveralllossin_meforthefullcomputa_on.

RED:GETRFBLUE/BROWN:LASWP/TRSMGREEN:GEMMBLUE:LASWP

Page 23: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

•  TiledLU(N=17920=224x80)usingaSandyBridgeEP(2sockets,4cores/socket)•  Demonstra_ngrunningthecri_calpathatahigherspeedthanothertasks.

•  Highpower:Bothsocketsrunningatfullpower.•  Lowpower:Secondsocketrunningalowpower;cri_calpath(panel)onfirstsocket.•  TotalenergyusedbyprocessorsinJoules:High4136Joules;Low:4001Joules

0

20

40

60

80

100

120

140

160

0 5 10 15 20 25 30 35 40

Pro

cess

or

Pow

er

Sock

et 0+

1 (

Watts)

Elapsed Time (sec)

High Power (Total Joules:2069+2067=4136)High Power Usage (Socket 0)High Power Usage (Socket 1)

Low Power (Total Joules:2634+1371=4001)Low Power Usage (Socket 0)Low Power Usage (Socket 1)

Page 24: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

DGEMMPower&Performance:KNCvs.KNL

24

Matrix size2k 4k 6k 8k 10k 12k 14k 16k 18k 20k

Gflo

p/s

0

200

400

600

800

1000

1200

1400

1600

1800

2000

2200

2400DGEMM Performance

Time (s)0 10 20 30 40 50 60 70 80

Ave

rage

pow

er (W

atts

)

0

40

80

120

160

200

240

280Accelerator Power Usage (PACKAGE)

Matrix size2k 4k 6k 8k 10k 12k 14k 16k 18k 20k

Gflo

p/s

0

200

400

600

800

1000

1200

1400

1600

1800

2000

2200

2400DGEMM Performance

Time (s)0 20 40 60 80 100 120

Ave

rage

pow

er (W

atts

)

0

40

80

120

160

200

240

280Accelerator Power Usage (PACKAGE)

KnightsCornerKnightsLanding60cores(4HWthreads/core) 68cores(1HWthread/core)

Page 25: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPICOUNTERINSPECTIONTOOLKIT

25

Page 26: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

CounterInspectionToolkitDe\inean“accuratemapping”betweenhigh-levelconceptsofperformancemetricsandtheunderlyinglow-levelhardwareevents.

Benchmarksandanalysesfor:

1.  Validatingnativeevents

2.  De\ininghigh-levelevents(“pre-de\ined”)

26

Page 27: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

RandomPointerChasing

Page 28: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

Benchmarktiming

0

2

4

6

8

10

12

14

16

18

20

22

24

26

1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192

Ave

rage A

ccess

Late

ncy

(n

s)

Buffer Size (KBytes)

Memory Hierarchy (ig)

Page 29: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

DeTininghighlevelevents(Presets)| LLC_MISSES |

| Alias for LAST_LEVEL_CACHE_MISSES |

--------------------------------------------------------------------------------

| LAST_LEVEL_CACHE_MISSES |

| This is an alias for L3_LAT_CACHE:MISS |

--------------------------------------------------------------------------------

| L3_LAT_CACHE |

| Core-originated cacheable demand requests to L3 |

| :MISS |

| Core-originated cacheable demand requests missed L3 |

| :REFERENCE |

| Core-originated cacheable demand requests that refer to L3 |

| :e=0 |

| edge level (may require counter-mask >= 1) |

| :i=0 |

| invert |

| :c=0 |

| counter-mask in range [0-255] |

| :t=0 |

| measure any thread |

| :u=0 |

| monitor at user level |

| :k=0 |

| monitor at kernel level |

29

Page 30: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

PAPI6(a.k.a.PAPI-EX)

System-widemeasurements:•  Sharedhardwarecountersupportiscomplex

•  limitedvendorandkernelsupport

CounterinspectionToolkit:•  kernelsthatstresson-core+sharedhardwarefeatures

DeeperIntegrationofPAPIfordata\low-basedprogrammingmodelsNewArchitectures:

•  XeonPhiKnightsLanding,CaviumThunderX,…

30

Page 31: Recent Advances in the Performance API (PAPI) · Recent Advances in the Performance API (PAPI) Collaborators: Heike ... • PAPI enables software engineers to ... Simple VeriTication

3rdPartyToolsapplyingPAPI

•  PaRSEC(UTK)http://icl.cs.utk.edu/parsec/

•  TAU(UOregon)http://www.cs.uoregon.edu/research/tau/

•  PerfSuite(NCSA)http://perfsuite.ncsa.uiuc.edu/

•  HPCToolkit(RiceUniversity)http://hpctoolkit.org/

•  KOJAKandSCALASCA(FZJuelich,UTK)http://icl.cs.utk.edu/kojak/

•  VampirTraceandVampir(TUDresden)http://www.vamir.eu

•  Open|Speedshop(SGI)http://oss.sgi.com/projects/openspeedshop/

•  SvPablo(UNCRenaissanceComputingInstitute)http://www.renci.org/research/pablo/

•  ompP(UTK)http://www.ompp-tool.com31