Steve Leak, NERSC User Engagement Group, July 2017
Introduction to Cori
NERSC Cori
• 34 double-width cabinets
• 9,688 KNL + 2,388 Haswell nodes on the Aries High-Speed Network
• 658,784 KNL cores + 76,416 Haswell cores
• Top500 #6 (June 2017)
Agenda
• Cori overview, logging in
• Run a simple job
• Building and running applications on Cori
  – Serial
  – Parallel (MPI)
  – Multithreaded (OpenMP)
• What affects performance?
  – Bottlenecks
  – Task placement and affinity
• Preparing for performance analysis
Cori overview, logging in
Before we start
• The training materials are available on github: https://git.io/v7OuW (and https://git.io/v7LeY)
  – And also on Cori, via:
    cd $SCRATCH
    module load training/csgf-2017
    git clone $TRAINING $SCRATCH/csgf-2017
• The slides are available at https://www.nersc.gov/users/training/events/csgf-2017-hpc-workshop
• This is a very hands-on oriented session
  – (so have laptops ready!)
• We will pause for Q&A at the end of each topic
Cori Overview
• Login and compute nodes are distinct
• Large, fast, parallel $SCRATCH filesystem for running jobs
• Smaller $HOME, configured for building code
• Burst buffer filesystem integrated, on the high-speed network
What’s so special about it?
• Your quad-core desktop CPU looks something like this:
• Compared to a Cori Haswell node: 16 cores x 2 sockets, 128 GB RAM
(Images courtesy of Google Image Search, Intel, EnterpriseTech.com, NextPlatform.com)
What’s so special about it?
• And then there are the KNL nodes (68 cores, MCDRAM, …)
What’s so special about it?
• But high-end CPUs don't make a supercomputer
  – High-speed interconnects between them
  – Lightweight compute-node OS
  – Very large (28,000 TB) fast parallel filesystem
• …and a different usage model
  – A subset of nodes is dedicated to a single task, run via the batch system (no interactive GUI/desktop)
What’s special about KNL?
• A different choice of compromise between the die space allocated to different parts of the CPU:
  – Instruction fetch-and-decode, hyperthreading, branching (important for e.g. compiling, GUI applications)
  – Arithmetic, vector and floating-point units (the actual FLOPS)
  – Memory access (keeps the execution engine busy)
What’s special about KNL?
• Xeon (e.g. Haswell)
  – Instruction fetch-and-decode: large and fast (power-hungry, hot)
  – Vector units: smaller space, shorter vectors
  – Large L3 cache
• Xeon Phi (KNL)
  – Instruction fetch-and-decode: small and slow (cool, efficient)
  – Vector units: big, wide vectors
  – Large L2 cache, no L3 (but MCDRAM)
Why?
• Exascale challenges: power and heat
  – CPU frequency plateaued ~15 years ago
  – Transistor density and feature size are reaching fundamental limits
  – Power consumption and heat dissipation are now the key constraints for supercomputing
  – …we can't get there from here!
• KNL emphasizes vectorization and parallelism at lower power
  – Targets scientific computing
Cori Overview
• Login and compute nodes are distinct
• Large, fast, parallel $SCRATCH filesystem for running jobs
• Smaller $HOME, configured for building code
• Burst buffer filesystem integrated, on the high-speed network
Connecting to Cori
• Best: point your browser at https://nxcloud01.nersc.gov and start an NX session
  – Or set up an NX player on your workstation by following the instructions at http://www.nersc.gov/users/connecting-to-nersc/using-nx/
• If you have a UNIX-like computer (or an NX session), you can contact NERSC directly with your built-in SSH client:
  1. Open a new terminal
  2. % ssh -Y -l <training_acct_username> cori.nersc.gov
• Many SSH clients exist for Windows
  – A very popular one is PuTTY: http://www.putty.org/
  – Advanced users might prefer to use SSH directly within mintty (from the Cygwin distribution)
X-forwarding
• Allows you to access GUI programs remotely
• We will need it this afternoon!
Example:
  localhost% ssh -Y -l elvis cori.nersc.gov
  …
  e/elvis> module load matlab
  e/elvis> matlab
  <MATLAB starts up>
Example Session

Prompt on local system:
  localhost:~elvis> ssh -Y -l <training_account_name> cori.nersc.gov

Notification of acceptable use:
  *****************************************************************
  *                      NOTICE TO USERS
  *                      ---------------
  * Lawrence Berkeley National Laboratory operates this computer
  * system under contract to the U.S. Department of Energy. This
  * computer system is the property of the United States Government
  * and is for authorized use only. Users (authorized or
  * unauthorized) have no explicit or implicit expectation of
  * privacy.
  * Any or all uses of this system and all files on this system
  * may be intercepted, monitored, recorded, copied, audited,
  * inspected, and disclosed to site, Department of Energy, and
  * law enforcement personnel, as well as authorized officials of
  * other agencies, both domestic and foreign. By using this
  * system, the user consents to such interception, monitoring,
  * recording, copying, auditing, inspection, and disclosure at
  * the discretion of authorized site or Department of Energy
  * personnel.
  * Unauthorized or improper use of this system may result in
  * administrative disciplinary action and civil and criminal
  * penalties. By continuing to use this system you indicate your
  * awareness of and consent to these terms and conditions of use.
  * LOG OFF IMMEDIATELY if you do not agree to the conditions
  * stated in this warning.
  *****************************************************************

Password prompt:
  Password: <enter your training account password here>
After logging in…
• On a login node
  – cori01, cori02, …
  – Shared by many users
  – Not necessarily the same one each time!
  – But the same access to filesystems
• No direct access to compute nodes
  – Only via the batch system (salloc, sbatch)
• Haswell (Xeon) architecture
Hands-on exercise
• First: Q&A?
• Exercise: (check README.md at https://git.io/v7LeY)
  1. Log in to Cori
  2. Navigate to $SCRATCH
  3. Load the module "training/csgf-2017"; this will set $TRAINING to the location of the training materials.
     • Copy (or git-clone) the training materials to your $SCRATCH, and browse the files (especially ex1-getting_started/README.md)
  4. Is X working? Try to start an xterm:
     cori$ xterm &
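A minimal sketch of the whole exercise as one terminal session (the account name is a placeholder):

  ssh -Y -l <training_acct_username> cori.nersc.gov      # 1. log in to Cori
  cd $SCRATCH                                            # 2. navigate to $SCRATCH
  module load training/csgf-2017                         # 3. sets $TRAINING
  git clone $TRAINING $SCRATCH/csgf-2017                 # copy the training materials
  less $SCRATCH/csgf-2017/ex1-getting_started/README.md  # browse the files
  xterm &                                                # 4. is X working?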
Agenda
• Cori overview, logging in
• Run a simple job
• Building and running applications on Cori
  – Serial
  – Parallel (MPI)
  – Multithreaded (OpenMP)
• What affects performance?
  – Bottlenecks
  – Task placement and affinity
• Preparing for performance analysis
Run a simple job
Running jobs – key points
• HPC work is via the batch system
  – Dedicated subset of compute resources
  – Login nodes are a shared resource for building code, editing scripts, etc. Use batch jobs for real work
• Key commands:
  – sbatch / salloc: submit a job
  – srun: start an (optionally MPI) application within a job
  – sqs: check the queue for my job status
• For today, we have a reservation
  – #SBATCH --reservation=csgftrain

www.nersc.gov/users/computational-systems/cori/running-jobs/
All of this is on the web!
How jobs work
Desktop / login node
• Time slicing
  – A core is shared by multiple tasks
  – Works when the computer is mostly waiting for you
HPC
• You are waiting for the computer
• A subset of pooled resources is dedicated to one job
How jobs work
• Start on a login node
  – Shared by many users, not for computational work
• Access compute nodes with sbatch or salloc
• Batch script
  – Copied to the queue
  – Has directives for SLURM, and shell commands to perform on the first compute node
• Access your other allocated nodes with srun
• stdout and stderr are saved to a file
  – (when running in batch mode)
Nodes, cores, CPUs, threads, tasks - some definitions
• Node is the basic unit of allocation at NERSC
  – Think "one host" or "one server"
  – Single memory space, multiple CPU cores (24 or 32 or 68...)
    • And a core might support hyperthreading
Nodes, cores, CPUs, threads, tasks - some definitions
(diagram: a core with and without hyperthreading)
Hyperthreading
• Fast time slicing
  – Good when arithmetic units frequently wait on memory
• A core holds the state of 2 (4 on KNL) processes; they share the arithmetic units
• SLURM views each hyperthread as a CPU
• But most HPC jobs perform best when not sharing a core!
• Usually best to reserve 2 (or 4) CPUs per core, as in the sketch below
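For example (a sketch; one MPI rank per core on a 68-core KNL node, and the binary name is illustrative):

  srun -n 68 -c 4 --cpu_bind=cores ./my_app   # -c 4 reserves all 4 hyperthread CPUs
                                              # of each core, so ranks don't share cores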
Slurm tasks
• First, a block diagram of how hyperthreads, cores, cache, and sockets relate within a (Haswell) node (diagram)
• A SLURM task is a reservation of CPUs and memory, up to one full node
  – A job has many tasks
  – 1 task typically corresponds to 1 MPI rank
    srun -n <ntasks> ...
• E.g.: 3 possible tasks on 2 nodes (diagram)
So what must I request for my job?
What the batch system needs to know:
• How many nodes (or CPUs or tasks) does this job need?
• For how long does it need them?
  – Wall-clock time limit
NERSC-specific extras:
• What type of CPU? (-C …)
  – KNL or Xeon (haswell/ivybridge)?
• Which filesystems will this job use? (-L …)
  – Usually SCRATCH
  #SBATCH -N 64         # request 64 nodes
  srun -N 32 ./my_app   # start ./my_app on 32 of them
                        # (default: 1 per node)
  srun -n 128 ./my_app  # start 128 instances of ./my_app, across my
                        # 64 nodes (default is to evenly distribute
                        # them in block fashion)

One MPI rank generally corresponds to one SLURM task.
Requesting nodes or tasks
(diagram: Tasks 0, 1, 2, 3, … 126, 127, each running ./my_app, distributed across Node 0, Node 1, … Node 63)
Requesting time
  #SBATCH -t 30        # 30 minutes
  #SBATCH -t 30:00     # 30 minutes
  #SBATCH -t 1:00:00   # 1 hour
  #SBATCH -t 1-0       # 1 day
  #SBATCH -t 1-12      # 1.5 days
• Wall-clock time, i.e. real elapsed time
• After this much time, SLURM can kill the job
Hands-on exercise: My first job
A SLURM job script has two sections:
1. Directives telling SLURM what you would like it to do with this job
2. The script itself: shell commands to run on the first compute node

The directives answer (see the annotated sketch below):
• How many nodes?
• For how long?
• $SCRATCH filesystem
• Xeon nodes on the current cluster (set by the craype-{haswell,ivybridge} module)
Note: you cannot use environment variables in directives, but every directive has an equivalent command-line option.
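A sketch of such a script, saved as myscript.q (the node count, time limit and application name are illustrative; the reservation is today's):

  #!/bin/bash -l                    # -l: make the starting environment like my login environment
  #SBATCH -N 2                      # how many nodes?
  #SBATCH -t 30:00                  # for how long?
  #SBATCH -L SCRATCH                # which filesystems? ($SCRATCH)
  #SBATCH -C haswell                # Xeon nodes on the current cluster
  #SBATCH --reservation=csgftrain   # today's reservation

  cd $SCRATCH                       # run from $SCRATCH
  srun -n 4 ./my_app.ex             # start 4 tasks across my nodes

Submit it with:
  cori$ sbatch myscript.q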
Hands-on exercise: My first job
A SLURM job script has two sections:
1. Directives telling SLURM what you would like it to do with this job
2. The script itself: shell commands to run on the first compute node

The script body (see the sketch above):
• Makes the starting environment like my login environment
• Runs from $SCRATCH
• Starts 4 tasks across my nodes
• "sbatch" submits the job script
Running jobs – key points
• HPC work is via the batch system
  – Dedicated subset of compute resources
  – Login nodes are a shared resource for building code, editing scripts, etc. Use batch jobs for real work
• Key commands:
  – sbatch / salloc: submit a job
  – srun: start an (optionally MPI) application within a job
  – sqs: check the queue for my job status
• Don't forget we have a reservation
  – #SBATCH --reservation=csgftrain

www.nersc.gov/users/computational-systems/cori/running-jobs/
Where is my job?
  elvis@nersc:~> sqs
  JOBID    ST  REASON  USER   NAME        NODES  USED  REQUESTED ...
  2774102  R   Prolog  elvis  myscript.q  2      0:00  30:00     ...
  SUBMIT               PARTITION  RANK_P  RANK_BF
  2016-11-18T11:24:20  debug      N/A     N/A

  elvis@nersc:~> ls -lt
  total 11280
  -rw-r----- 1 elvis elvis 132 Nov 18 11:24 slurm-2774102.out
  -rw-r----- 1 elvis elvis 208 Nov 18 11:24 myscript.q
Hands-on exercise
• First: Q&A?
• Exercise: running a simple job
  – (check README.md in ex2-running_jobs/)
Agenda
• Cori overview, logging in
• Run a simple job
• Building and running applications on Cori
  – Serial
  – Parallel (MPI)
  – Multithreaded (OpenMP)
• What affects performance?
  – Bottlenecks
  – Task placement and affinity
• Preparing for performance analysis
Building and running applications on Cori
Cray compiler wrappers, ftn, cc and CC
• Building code optimally for Cori requires a complex set of compiler options and libraries
  – e.g., static linking by default (important for performance at scale)
• The compiler wrappers ftn, cc and CC manage this complexity for you
  – Using environment variables set by the modules you have loaded
• They also provide MPI (so e.g. mpicc is not required)
Wait .. “environment modules” ?
• Software on Cori (and most HPC systems) is managed with "environment modules"
• Why?
  – Cori is a shared resource
  – Different people need different combinations of software, at different versions, with different dependencies (and for different jobs)
• Loading and unloading a module updates environment variables (e.g. $PATH, $LD_LIBRARY_PATH) to make a package available
Module Commands
module load <modulename>
  – Add the module to your environment
module unload <modulename>
  – Remove the module from your environment
module swap <module1> <module2>
  – Unload one module and replace it with another
    % module swap intel intel/16.0.3.210
    (replace the current default with a specific version)
module list
  – See what modules you have loaded right now
module show <modulename>
  – See what the module actually does
module help <modulename>
  – Get more information about the software
Key modules for compiling
• PrgEnv-intel / PrgEnv-cray / PrgEnv-gnu
  – Which underlying compiler the wrappers should invoke
• craype-haswell / craype-mic-knl
  – Remember, the login nodes are Haswell, but we are building for KNL!
    module swap craype-haswell craype-mic-knl
  – The wrappers manage cross-compiling
• Important for today!
What do compiler wrappers link by default?
• Depending on the modules loaded: MPI, LAPACK/BLAS/ScaLAPACK libraries, and more
Compiling code
• Very similar to regular Linux, but using CC/cc/ftn
• Do this bit once:
    module swap craype-haswell craype-mic-knl
• Then:
    ftn -c hack-a-kernel.f90
    ftn -o hack-a-kernel.ex hack-a-kernel.o
• Note that the module looks after the CPU target!
Compiling parallel code
• Compiler wrappers give you MPI "for free":
    CC -c hello-mpi.c++
    CC -o hello-mpi.ex hello-mpi.o
• (Cray MPICH, optimized for the Aries HSN)
• OpenMP: with PrgEnv-intel (the NERSC default):
    cc -qopenmp -c hello-omp.c
    cc -qopenmp -o hello-omp.ex hello-omp.o
MPI vs OpenMP
• MPI provides explicit communication between separate processes
  – Optionally on separate nodes, i.e. packets over a network
  – Most parallel development in the last 2 decades has used this approach
• OpenMP provides work-sharing and synchronization between threads in a single process
  – Threads share the same memory image
  – To make the most of a KNL node, most applications will need to use OpenMP (a hybrid launch is sketched below)
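As a sketch, a hybrid MPI+OpenMP launch on Cori Haswell might look like this (the counts and binary name are illustrative):

  export OMP_NUM_THREADS=16          # 16 OpenMP threads per MPI rank
  srun -N 2 -n 4 -c 32 ./hybrid.ex   # 2 ranks per node; -c 32 reserves 16 cores
                                     # (32 hyperthread CPUs) for each rank's threads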
MPI vs OpenMP
• An MPI application can have processes on more than one node
• An OpenMP application exists entirely within 1 node
(diagram: MPI rank 0 and MPI rank 1 on separate nodes; OpenMP thread 0 and OpenMP thread 1 within one node)
Why do we suddenly need OpenMP?
• Multi-level parallelism
  – At very large scale, the overheads of MPI (or any parallel approach) become excessively costly
  – Combining (nesting) parallel approaches allows us to operate each at a lower scale
    • Sweet spot for best overall efficiency
  – MPI -> OpenMP -> Vectorization
Why do we suddenly need OpenMP?
• Memory per core is trending downwards
  – Cori Haswell: 128 GB for 32 cores
  – Cori KNL: 96 GB for 68 cores (plus 16 GB MCDRAM for 68 cores)
  – Parallelism within the same memory footprint is necessary
Hands-on exercise
• First: Q&A?
• Exercise: building and running a simple serial, OpenMP and MPI application (the cycle is sketched below)
  – (check README.md in ex3-building_apps)
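A sketch of the whole cycle (source and binary names are illustrative; the exercise README has the real ones):

  module swap craype-haswell craype-mic-knl   # cross-compile for KNL
  cc -o serial.ex serial.c                    # serial
  cc -qopenmp -o hello-omp.ex hello-omp.c     # OpenMP
  CC -o hello-mpi.ex hello-mpi.c++            # MPI comes free with the wrappers

  # then, inside a KNL batch job:
  srun -n 1 ./serial.ex
  export OMP_NUM_THREADS=8
  srun -n 1 -c 32 --cpu_bind=cores ./hello-omp.ex   # 8 threads on 8 cores
  srun -n 8 ./hello-mpi.ex                          # 8 MPI ranks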
Agenda
• Cori overview, logging in
• Run a simple job
• Building and running applications on Cori
  – Serial
  – Parallel (MPI)
  – Multithreaded (OpenMP)
• What affects performance?
  – Bottlenecks
  – Task placement and affinity
• Preparing for performance analysis
What affects performance?
Performance bottlenecks
• Any point at which some component of the system or application is stalled, waiting on some other component, is a bottleneck
• E.g. waiting for an MPI call to complete
  – Load imbalance or communication overhead is a bottleneck
• BUT for today we are interested in KNL specifics
  – Bottlenecks within the node
Bottlenecks within the node – Affinity issues
• Firstly: am I running it right?
  – Cori KNL has 68 cores per node, arranged on a mesh of 34 tiles
  – Each tile has 2 cores and a shared L2 cache
  – Each core has 4 hyperthreads sharing 2 VPUs and an L1 cache
Bottlenecks within the node – Affinity issues
• Where are my threads?
  – Is each using a different core, at least?
• Linux does not always choose the best placement
  – Use srun options to ensure optimal thread/process placement
• Solution: use --cpu_bind:
    srun -n 64 -c 4 --cpu_bind=verbose,cores ./my_exec
    srun -n 128 -c 2 --cpu_bind=verbose,threads ./my_exec
• Controls what a task (MPI rank) is bound to
  – If no more than 1 MPI rank per core: --cpu_bind=cores
  – If more than 1 MPI rank per core: --cpu_bind=threads
Process (task) affinity
Thread affinity (OpenMP)
  export OMP_NUM_THREADS=2
  export OMP_PROC_BIND=spread   # or close
  export OMP_PLACES=cores       # or threads, or sockets
  srun -n 32 -c 8 --cpu_bind=verbose,cores ./my_exec

… If using hyperthreads, use OMP_PLACES=threads
Memory affinity
Linux's default behavior is to allocate memory on the closest NUMA node, if possible.
Not always optimal:
• On KNL nodes, DDR is "closer" than MCDRAM

  #SBATCH -C knl,quad,flat
  export OMP_NUM_THREADS=4
  srun -n16 -c16 --cpu_bind=cores --mem_bind=map_mem:1 ./a.out

• NUMA node 1 is MCDRAM in quad,flat mode
• "Mandatory" mapping: if using >16 GB, malloc will fail
NOTE: today's reservation is for "cache-mode" nodes, so MCDRAM is invisible
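If the mandatory mapping is too strict, one alternative (an assumption on my part, not from the slides) is a "preferred" binding via numactl, which falls back to DDR once the 16 GB of MCDRAM is exhausted:

  srun -n16 -c16 --cpu_bind=cores numactl -p 1 ./a.out   # prefer NUMA node 1 (MCDRAM),
                                                         # fall back to DDR when it is full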
Bottlenecks within the node – Affinity issues (summary)
  OMP_PROC_BIND=true
  OMP_PLACES=cores   # (or threads)
  srun -c 4 --cpu_bind=cores …
Bottlenecks within the node
• Performance tends to be dominated by:
  – Doing arithmetic operations (#FLOPS/cycle)
  – Moving data between arithmetic units and memory
• Bottlenecks are usually due to one of these parts waiting on the other
Bottlenecks within the node
• Bandwidth!
  – MCDRAM is very high bandwidth (~450 GB/s)
  – …but aggregate bandwidth is not the same as the bandwidth to each core!
Hands-on exercise
• First: Q&A?
• Exercise: getting the affinity settings right (a quick check is sketched below)
  – (check README.md in ex4-affinity)
  – We have two jobs in this exercise; the first is fast and sufficient to demonstrate the effect of affinity settings. The second will take longer, so we'll continue with the presentation without waiting for it to complete.
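A sketch of verifying placement before a long run (xthi.ex is an illustrative name; srun's verbose binding report works with any binary):

  export OMP_NUM_THREADS=4
  export OMP_PROC_BIND=spread
  export OMP_PLACES=cores
  srun -n 16 -c 16 --cpu_bind=verbose,cores ./xthi.ex   # "verbose": srun reports the CPU
                                                        # mask each task was bound to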
Agenda
• Cori overview, logging in
• Run a simple job
• Building and running applications on Cori
  – Serial
  – Parallel (MPI)
  – Multithreaded (OpenMP)
• What affects performance?
  – Bottlenecks
  – Task placement and affinity
• Preparing for performance analysis
Preparing for performance analysis
• A small, short, but representative test case for your application
  – Profiling tends to be costly:
    • Runtime overhead
    • Size of the data collected
  – BUT: it must cover the same paths through the code as the "real" example (or at least, the differences must be understood)
    • What could go wrong with this test case?
Profiling prerequisite
(diagram: a real run is setup, then many time steps, then output; the test case for profiling keeps the same setup and output but runs just 1 time step)
…But my application is just big!
• Start with a low-overhead profiling method
  – E.g. sampling-based (gprof, TAU, CrayPat, …)
  – Identify hotspots
• Only profile part of a run
  – Some tools (e.g. VTune) allow you to start the run "paused" and "resume collection" via an API call
• Only profile 1 MPI rank
  – Via srun options, e.g. run VTune on one rank and not the others (beyond today's scope, but sketched below)
• Luckily, our hack-a-kernel is already small! :)
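A hedged sketch of the one-rank approach (beyond today's scope; it uses srun's MPMD mode, and the file and application names are illustrative):

  # contents of profile.conf: rank 0 runs under VTune, ranks 1-63 run normally
  #   0     amplxe-cl -collect hotspots -r vtune_rank0 -- ./my_app
  #   1-63  ./my_app
  srun -n 64 --multi-prog ./profile.conf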
Today’s exercise
• Hackathon this afternoon: using VTune to analyze hack-a-kernel performance on KNL
• VTune: a powerful tool, but very information-dense
  – Best suited to node-level performance analysis (OpenMP, vectorization, memory bandwidth… not really MPI)
  – Many knobs to turn
  – Command-line interface: used to "collect"
    • Optionally can be used to "report" too
  – GUI interface: explore performance reports
Performance Analysis Gotchas
• Overhead!
  – Your job will (usually) take longer while profiling
• Performance tools often don't play nicely together!
  – They try to use the same resources
• Especially (at NERSC):
  – Darshan is loaded by default, and collects I/O performance data
  – You must unload it before using VTune (and most other performance tools)
Performance Analysis Gotchas
• Many tools require a dynamically-linked executable
  – Including VTune
  – NERSC/Cray: applications are statically linked by default; you must use "-dynamic" at compile time
• "Uncore" (e.g. memory-related) performance counters usually require special permissions (e.g. a kernel module)
  – NERSC/Slurm supports this with the #SBATCH --perf=vtune directive
(putting the pieces together is sketched below)
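Putting the gotchas together, a VTune-ready build and run might look like this (a sketch; the collection type and result-directory name are illustrative):

  module unload darshan                                # Darshan conflicts with VTune
  module load vtune
  ftn -dynamic -o hack-a-kernel.ex hack-a-kernel.f90   # VTune needs a dynamically-linked executable

  # in the job script:
  #SBATCH --perf=vtune                                 # enable uncore counters
  srun -n 1 amplxe-cl -collect memory-access -r vtune_results -- ./hack-a-kernel.ex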
Hands-on exercise
• This afternoon: we will optimize hack-a-kernel for KNL
• Exercise: first, we'll build it for VTune and run an experiment, so we have results ready to look at this afternoon
  – (check README.md in ex5-vtune)
Summary and Recap
• Now you can:
  – Log in
  – Build an application
  – Run an application, and be aware of how and where it is running
  – Prepare an application for performance analysis with VTune
• Q&A?