Steve Leak NERSC User Engagement Group July 2017 Introduction to Cori

Post on 18-Aug-2020


Page 1: Introduction to Cori - NERSC · NERSC Cori • 34 double-width cabinets • 9,688 KNL + 2,388 Haswell nodes on Aries High-Speed Network • 658,784 KNL cores + 76,416 Haswell cores

Steve Leak, NERSC User Engagement Group, July 2017

Introduction to Cori

NERSC Cori

•  34 double-width cabinets
•  9,688 KNL + 2,388 Haswell nodes on Aries High-Speed Network
•  658,784 KNL cores + 76,416 Haswell cores
•  Top500 #6 (June 2017)
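As a quick sanity check on the figures above, the core totals should equal the node counts multiplied by the cores per node. The per-node values used here (68 for KNL, 32 for Haswell) come from later slides in this deck; the variable names are illustrative only.

```shell
#!/bin/bash
# Node counts from the slide; cores per node from later slides in the deck.
knl_nodes=9688;     knl_cores_per_node=68
haswell_nodes=2388; haswell_cores_per_node=32   # 16 cores x 2 sockets

knl_cores=$(( knl_nodes * knl_cores_per_node ))
haswell_cores=$(( haswell_nodes * haswell_cores_per_node ))

echo "KNL cores:     $knl_cores"
echo "Haswell cores: $haswell_cores"
```

Both totals agree with the numbers quoted on the slide.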

Agenda

•  Cori overview, logging in
•  Run a simple job
•  Building and running applications on Cori
    –  Serial
    –  Parallel (MPI)
    –  Multithreaded (OpenMP)
•  What affects performance?
    –  Bottlenecks
    –  Task placement and affinity
•  Preparing for performance analysis

Cori overview, logging in

Before we start

•  The training materials are available on GitHub: https://git.io/v7OuW (and https://git.io/v7LeY)
    –  And also on Cori, via:

    cd $SCRATCH
    module load training/csgf-2017
    git clone $TRAINING $SCRATCH/csgf-2017

•  The slides are available at https://www.nersc.gov/users/training/events/csgf-2017-hpc-workshop
•  This is a very hands-on oriented session
    –  (so have laptops ready!)
•  We will pause for Q&A at the end of each topic

Cori Overview

•  Login and compute nodes are distinct
•  Large, fast, parallel $SCRATCH filesystem for running jobs
•  Smaller $HOME, configured for building code
•  Burst buffer filesystem integrated, on high-speed network

What’s so special about it?

•  Your quad-core desktop CPU looks something like this: [image]
•  Compared to a Cori Haswell node: [image]
    –  16 cores x 2 sockets, 128 GB RAM

Images courtesy of Google Image Search, Intel, EnterpriseTech.com, NextPlatform.com

What’s so special about it?

•  And then there are the KNL nodes (68 cores, MCDRAM, …)

What’s so special about it?

•  But high-end CPUs don't make a supercomputer
    –  High-speed interconnects between them
    –  Lightweight compute node OS
    –  Very large (28,000 TB), fast parallel filesystem
•  …and a different usage model
    –  Subset of nodes dedicated to a single task, run via batch system (no interactive GUI/desktop)

What’s special about KNL?

•  Different choice of compromise between die space allocated to different parts of the CPU:
    –  Instruction fetch-and-decode, hyperthreading, branching (important for e.g. compiling, GUI applications)
    –  Arithmetic, vector and floating-point units: the actual FLOPS
    –  Memory access: keeps the execution engine busy

What’s special about KNL?

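•  Xeon (e.g. Haswell)
    –  Cores: large and fast, but power hungry and hot
    –  Vector units: smaller space, shorter vectors
    –  Cache: large L3 cache
•  Xeon Phi (KNL)
    –  Cores: small and slow, but cool and efficient
    –  Vector units: big, wide vectors
    –  Cache: large L2 cache, no L3 (but MCDRAM)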

Why?

•  Exascale challenges: power and heat
    –  CPU frequency plateaued ~15 years ago
    –  Transistor density, feature size reaching fundamental limits
    –  Power consumption and heat dissipation are now the key constraints for supercomputing
    –  …we can't get there from here!
•  KNL emphasizes vectorization and parallelism at lower power
    –  Targets scientific computing

Connecting to Cori

•  Best: point your browser at https://nxcloud01.nersc.gov and start an NX session
    –  Or set up an NX player on your workstation by following the instructions at http://www.nersc.gov/users/connecting-to-nersc/using-nx/
•  If you have a UNIX-like computer (or an NX session), you can directly contact NERSC with your built-in SSH client
    1.  Open a new terminal
    2.  % ssh -Y -l <training_acct_username> cori.nersc.gov
•  Many SSH clients exist for Windows
    –  A very popular one is PuTTY: http://www.putty.org/
    –  Advanced users might prefer to use SSH directly within mintty (from the Cygwin distribution)

X-forwarding

•  Allows you to access GUI programs remotely
•  We will need it this afternoon!

Example:

    localhost% ssh -Y -l elvis cori.nersc.gov
    …
    e/elvis> module load matlab
    e/elvis> matlab
    <MATLAB starts up>

Example Session

localhost:~elvis> ssh -Y -l <training_account_name> cori.nersc.gov
*****************************************************************
*                                                               *
*                        NOTICE TO USERS                        *
*                        ---------------                        *
*                                                               *
* Lawrence Berkeley National Laboratory operates this           *
* computer system under contract to the U.S. Department of      *
* Energy. This computer system is the property of the United    *
* States Government and is for authorized use only. Users       *
* (authorized or unauthorized) have no explicit or implicit     *
* expectation of privacy.                                       *
*                                                               *
* Any or all uses of this system and all files on this system   *
* may be intercepted, monitored, recorded, copied, audited,     *
* inspected, and disclosed to site, Department of Energy, and   *
* law enforcement personnel, as well as authorized officials    *
* of other agencies, both domestic and foreign. By using        *
* this system, the user consents to such interception,          *
* monitoring, recording, copying, auditing, inspection, and     *
* disclosure at the discretion of authorized site or            *
* Department of Energy personnel.                               *
*                                                               *
* Unauthorized or improper use of this system may result in     *
* administrative disciplinary action and civil and criminal     *
* penalties. By continuing to use this system you indicate      *
* your awareness of and consent to these terms and conditions   *
* of use. LOG OFF IMMEDIATELY if you do not agree to the        *
* conditions stated in this warning.                            *
*                                                               *
*****************************************************************
Password: <enter your training account password here>

Callouts on the session:
–  First line: prompt on local system
–  Boxed text: notification of acceptable use
–  Last line: password prompt

After logging in…

•  On a login node
    –  cori01, cori02, …
    –  Shared by many users
    –  Not necessarily the same one each time!
    –  But same access to filesystems
•  No direct access to compute nodes
    –  Only via batch system (salloc, sbatch)
•  Haswell (Xeon) architecture

Hands-on exercise

•  First: Q&A?
•  Exercise: (check README.md at https://git.io/v7LeY)
    1.  Log in to Cori
    2.  Navigate to $SCRATCH
    3.  Load the module "training/csgf-2017"; this will set $TRAINING to the location of the training materials
        •  Copy (or git-clone) the training materials to your $SCRATCH, and browse the files (especially ex1-getting_started/README.md)
    4.  Is X working? Try to start an xterm:
        cori$ xterm &

Run a simple job

Running jobs – key points

•  HPC work is via batch system
    –  Dedicated subset of compute resources
    –  Login nodes are a shared resource for building code, editing scripts, etc. Use batch jobs for real work
•  Key commands:
    –  sbatch / salloc - submit a job
    –  srun - start an (optionally MPI) application within a job
    –  sqs - check the queue for my job status
•  For today, we have a reservation
    –  #SBATCH --reservation=csgftrain

www.nersc.gov/users/computational-systems/cori/running-jobs/

All of this is on the web!

How jobs work

Desktop / login node
•  Timeslicing
    –  Core shared by multiple tasks
    –  Works when the computer is mostly waiting for you

HPC
•  You are waiting for the computer
•  Subset of pooled resources dedicated to one job

How jobs work

•  Start on login node
    –  Shared by many users, not for computational work
•  Access compute nodes with sbatch or salloc
•  Batch script
    –  Copied to queue
    –  Has directives for SLURM, and shell commands to perform on first compute node
•  Access your other allocated nodes with srun
•  stdout, stderr saved to file
    –  (when running in batch mode)

Nodes, cores, CPUs, threads, tasks - some definitions

•  Node is the basic unit of allocation at NERSC
    –  Think "one host" or "one server"
    –  Single memory space, multiple CPU cores (24 or 32 or 68 ...)
•  And a core might support hyperthreading

Nodes, cores, CPUs, threads, tasks - some definitions

[Figure panels: With hyperthreading; Without hyperthreading]

Hyperthreading
•  Fast timeslicing
    –  Good when arithmetic units frequently wait on memory
•  Core holds state of 2 (4 on KNL) processes; they share arithmetic units
•  SLURM views each hyperthread as a CPU
•  But most HPC jobs perform best when not sharing a core!
•  Usually best to reserve 2 (or 4) CPUs/core

Slurm tasks

•  First, a block diagram of how hyperthreads, cores, cache, and sockets relate within a (Haswell) node

Slurm tasks

•  A SLURM task is a reservation of CPUs and memory, up to one full node
    –  A job has many tasks
    –  1 task typically corresponds to 1 MPI rank

    srun -n <ntasks> ..

•  E.g.: 3 possible tasks on 2 nodes

So what must I request for my job?

What the batch system needs to know:
•  How many nodes (or CPUs or tasks) does this job need?
•  For how long does it need them?
    –  Wallclock time limit

NERSC-specific extras:
•  What type of CPU? (-C …)
    –  KNL or Xeon (haswell/ivybridge)?
•  Which filesystems will this job use? (-L …)
    –  Usually SCRATCH

Requesting nodes or tasks

    #SBATCH -N 64          # request 64 nodes
    srun -N 32 ./my_app    # start ./my_app on 32 of them
                           # (default: 1 per node)
    srun -n 128 ./my_app   # start 128 instances of ./my_app,
                           # across my 64 nodes (default is
                           # to evenly distribute them in
                           # block fashion)

One MPI rank generally corresponds to one SLURM task

    Node 0:   Task 0: ./my_app     Task 1: ./my_app
    Node 1:   Task 2: ./my_app     Task 3: ./my_app
    …
    Node 63:  Task 126: ./my_app   Task 127: ./my_app
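The block placement shown in the diagram can be sketched with integer arithmetic. This is only an illustration that assumes the task count divides evenly across the nodes; `node_of_task` is a hypothetical helper, not a Slurm command, and Slurm's real behavior for uneven counts is not modeled here.

```shell
#!/bin/bash
# node_of_task TASK NTASKS NNODES
# Print the node index a task lands on under an even block distribution.
node_of_task() {
    local task=$1 ntasks=$2 nnodes=$3
    local per_node=$(( (ntasks + nnodes - 1) / nnodes ))   # ceiling division
    echo $(( task / per_node ))
}

# The same case as the slide: 128 tasks across 64 nodes.
for t in 0 1 2 3 126 127; do
    echo "Task $t -> Node $(node_of_task "$t" 128 64)"
done
```

With two tasks per node, consecutive pairs of ranks share a node, which is what "block fashion" means on the slide.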

Requesting time

    #SBATCH -t 30        # 30 minutes
    #SBATCH -t 30:00     # 30 minutes
    #SBATCH -t 1:00:00   # 1 hour
    #SBATCH -t 1-0       # 1 day
    #SBATCH -t 1-12      # 1.5 days

•  Wallclock time, i.e. real elapsed time
•  After this much time, SLURM can kill this job
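To make the time formats concrete, here is a small sketch that converts a time-limit string into seconds. `slurm_time_to_seconds` is a hypothetical helper written for this illustration; it covers the forms minutes, minutes:seconds, hours:minutes:seconds and days-hours[:minutes[:seconds]], and does no validation beyond that.

```shell
#!/bin/bash
# slurm_time_to_seconds SPEC
# Convert a Slurm-style time limit string to a number of seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0 rest
    local -a f
    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        rest=${spec#*-}
        IFS=: read -r -a f <<< "$rest"
        hours=${f[0]:-0}; minutes=${f[1]:-0}; seconds=${f[2]:-0}
    else
        IFS=: read -r -a f <<< "$spec"
        case ${#f[@]} in
            1) minutes=${f[0]} ;;                                  # minutes
            2) minutes=${f[0]}; seconds=${f[1]} ;;                 # minutes:seconds
            3) hours=${f[0]}; minutes=${f[1]}; seconds=${f[2]} ;;  # hours:minutes:seconds
            *) echo "unrecognized time spec: $spec" >&2; return 1 ;;
        esac
    fi
    # The 10# prefix forces base 10, so fields such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$hours * 3600 + 10#$minutes * 60 + 10#$seconds ))
}

for spec in 30 30:00 1:00:00 1-0 1-12; do
    printf '%-8s -> %s seconds\n' "$spec" "$(slurm_time_to_seconds "$spec")"
done
```

Note that a bare number is minutes, not seconds, which is why `30` and `30:00` request the same limit.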

Hands-on exercise: My first job

A SLURM job script has two sections:
1.  Directives telling SLURM what you would like it to do with this job
2.  The script itself: shell commands to run on the first compute node

Callouts on the example script:
–  How many nodes?
–  For how long?
–  $SCRATCH filesystem
–  Xeon nodes on current cluster (set by craype-{haswell,ivybridge} module)

Note: cannot use env vars in directives, but directives have an equivalent command-line option

Hands-on exercise: My first job

A SLURM job script has two sections:
1.  Directives telling SLURM what you would like it to do with this job
2.  The script itself: shell commands to run on the first compute node

Callouts on the example script:
–  Make starting environment like my login environment
–  Run from $SCRATCH
–  Start 4 tasks across my nodes
–  "sbatch" submits a job script
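The job script shown on the slide did not survive in this transcript, so the following is a reconstruction from the callouts, not the original. The node count and time limit are borrowed from the later sqs example, the `-l` login-shell flag is an assumption about how the login-like environment is obtained, and `./my_app` and the file name `my_first_job.sh` are placeholders. The snippet writes the script to a file and only syntax-checks it, so it can be tried without a batch system.

```shell
#!/bin/bash
# Write a sketch job script to a file, then syntax-check it without submitting.
cat > my_first_job.sh <<'EOF'
#!/bin/bash -l
#SBATCH -N 2                       # how many nodes?
#SBATCH -t 30:00                   # for how long? (30 minutes)
#SBATCH -L SCRATCH                 # this job uses the $SCRATCH filesystem
#SBATCH -C haswell                 # Xeon (Haswell) nodes
#SBATCH --reservation=csgftrain    # the reservation for this session

cd "$SCRATCH"                      # run from $SCRATCH
srun -n 4 ./my_app                 # start 4 tasks across my nodes
EOF

bash -n my_first_job.sh && echo "my_first_job.sh: syntax OK"
# On Cori, this would be submitted with:  sbatch my_first_job.sh
```

The `#SBATCH` lines are comments to the shell and directives to Slurm, which is why everything after them runs as ordinary shell on the first compute node.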

Running jobs – key points

•  HPC work is via batch system
    –  Dedicated subset of compute resources
    –  Login nodes are a shared resource for building code, editing scripts, etc. Use batch jobs for real work
•  Key commands:
    –  sbatch / salloc - submit a job
    –  srun - start an (optionally MPI) application within a job
    –  sqs - check the queue for my job status
•  Don't forget we have a reservation
    –  #SBATCH --reservation=csgftrain

www.nersc.gov/users/computational-systems/cori/running-jobs/

Where is my job?

    elvis@nersc:~> sqs
    JOBID    ST  REASON  USER   NAME        NODES  USED  REQUESTED  ...
    2774102  R   Prolog  elvis  myscript.q  2      0:00  30:00      ...

    ...  SUBMIT               PARTITION  RANK_P  RANK_BF
    ...  2016-11-18T11:24:20  debug      N/A     N/A

    elvis@nersc:~> ls -lt
    total 11280
    -rw-r----- 1 elvis elvis 132 Nov 18 11:24 slurm-2774102.out
    -rw-r----- 1 elvis elvis 208 Nov 18 11:24 myscript.q

Hands-on exercise

•  First: Q&A?
•  Exercise: running a simple job
    –  (check README.md in ex2-running_jobs/)

Building and running applications on Cori

Cray compiler wrappers, ftn, cc and CC

•  Building code optimally for Cori requires a complex set of compiler options and libraries
    –  e.g. static linking by default (important for performance at scale)
•  Compiler wrappers ftn, cc and CC manage this complexity for you
    –  Using environment variables set by the modules you have loaded
•  Also provide MPI (so e.g. mpicc is not required)

Wait .. “environment modules” ?

•  Software on Cori (and most HPC systems) is managed with "environment modules"
•  Why?
    –  Cori is a shared resource
    –  Different people need different combinations of software, at different versions, with different dependencies (and for different jobs)
•  Loading and unloading a module updates environment variables (e.g. $PATH, $LD_LIBRARY_PATH) to make a package available

Module Commands

module load <modulename>
    –  Add the module to your environment
module unload <modulename>
    –  Remove the module from your environment
module swap <module1> <module2>
    –  Unload one module and replace it with another
       % module swap intel intel/16.0.3.210
       (replace the current default with a specific version)
module list
    –  See what modules you have loaded right now
module show <modulename>
    –  See what the module actually does
module help <modulename>
    –  Get more information about the software

Key modules for compiling

•  PrgEnv-intel / PrgEnv-cray / PrgEnv-gnu
    –  Which underlying compiler the wrappers should invoke
•  craype-haswell / craype-mic-knl
    –  Remember, login nodes are Haswell, but we are building for KNL!
       module swap craype-haswell craype-mic-knl
    –  Wrappers manage cross-compiling

Important for today!

What do compiler wrappers link by default?

•  Depending on the modules loaded: MPI, LAPACK/BLAS/ScaLAPACK libraries, and more

Compiling code

•  Very similar to regular Linux, but using CC/cc/ftn
•  Do this bit once:
       module swap craype-haswell craype-mic-knl
•  Then:
       ftn -c hack-a-kernel.f90
       ftn -o hack-a-kernel.ex hack-a-kernel.o
•  Note that the module looks after the CPU target!

Compiling parallel code

•  Compiler wrappers give you MPI "for free"
       CC -c hello-mpi.c++
       CC -o hello-mpi.ex hello-mpi.o
•  (Cray MPICH, optimized for Aries HSN)
•  OpenMP: with PrgEnv-intel (NERSC default):
       cc -qopenmp -c hello-omp.c
       cc -qopenmp -o hello-omp.ex hello-omp.o

MPI vs OpenMP

•  MPI provides explicit communication between separate processes
    –  Optionally on separate nodes, i.e. packets over a network
    –  Most parallel development in the last 2 decades has used this approach
•  OpenMP provides work-sharing and synchronization between threads in a single process
    –  Threads share the same memory image
    –  To make the most of a KNL node, most applications will need to use OpenMP

MPI vs OpenMP

•  An MPI application can have processes on more than one node
•  An OpenMP application exists entirely within 1 node

[Figure labels: MPI rank 0; MPI rank 1; OpenMP thread 0; OpenMP thread 1]

Why do we suddenly need OpenMP?

•  Multi-level parallelism
    –  At very large scale, the overheads of MPI (or any parallel approach) become excessively costly
    –  Combining (nesting) parallel approaches allows us to operate each at lower scale
        •  Sweet spot for best overall efficiency
    –  MPI -> OpenMP -> Vectorization

Why do we suddenly need OpenMP?

•  Memory-per-core is trending downwards
    –  Cori Haswell: 128 GB for 32 cores
    –  Cori KNL: 96 GB for 68 cores (16 GB MCDRAM for 68 cores)
    –  Parallelism within same memory footprint is necessary
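The memory-per-core trend is easy to put numbers on. The sketch below simply divides the capacities quoted on the slide by the core counts; `mem_per_core` is a hypothetical helper, and the results are rounded to two decimals.

```shell
#!/bin/bash
# mem_per_core GB CORES
# Print memory per core in GB, rounded to two decimal places.
mem_per_core() {
    awk -v gb="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", gb / cores }'
}

echo "Haswell DDR: $(mem_per_core 128 32) GB/core"
echo "KNL DDR:     $(mem_per_core 96 68) GB/core"
echo "KNL MCDRAM:  $(mem_per_core 16 68) GB/core"
```

One MPI rank per core on KNL therefore has well under half the DDR that a rank gets on Haswell, which is the motivation for sharing memory between threads.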

Hands-on exercise

•  First: Q&A?
•  Exercise: building and running a simple serial, OpenMP and MPI application
    –  (check README.md in ex3-building_apps)

What affects performance?

Performance bottlenecks

•  Any point at which some component of the system or application is stalled, waiting on some other component, is a bottleneck
•  E.g. waiting for an MPI call to complete
    –  Load imbalance or communication overhead is a bottleneck
•  BUT for today we are interested in KNL specifics
    –  Bottlenecks within the node

Bottlenecks within the node – Affinity issues

•  Firstly: Am I running it right?
    –  Cori KNL has 68 cores per node, arranged on a mesh of 34 tiles
    –  Each tile has 2 cores and a shared L2 cache
    –  Each core has 4 hyperthreads sharing 2 VPUs and an L1 cache

Bottlenecks within the node – Affinity issues

•  Where are my threads?
    –  Is each using a different core at least?
•  Linux does not always choose best placement
    –  Use srun options to ensure optimal thread/process placement
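One way to answer "where are my threads?" for a process is to ask Linux which logical CPUs it is allowed to run on. The sketch below reads the `Cpus_allowed_list` field from `/proc/self/status`; it assumes a Linux system, and `allowed_cpus` is a hypothetical helper name.

```shell
#!/bin/bash
# allowed_cpus
# Print the list of logical CPUs this process may run on (Linux only).
allowed_cpus() {
    if [ -r /proc/self/status ]; then
        awk '/^Cpus_allowed_list:/ { print $2 }' /proc/self/status
    else
        echo "unknown"
    fi
}

echo "$(uname -n): allowed CPUs = $(allowed_cpus)"
```

Saved as a script and launched once per task under srun, with and without a binding option, this would show whether each task is confined to its own cores or free to roam across the node.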

Process (task) affinity

Solution: use --cpu_bind:

    srun -n 64 -c 4 --cpu_bind=verbose,cores ./my_exec
    srun -n 128 -c 2 --cpu_bind=verbose,threads ./my_exec

•  Controls what a task (MPI rank) is bound to
    –  If no more than 1 MPI rank per core: --cpu_bind=cores
    –  If more than 1 MPI rank per core: --cpu_bind=threads
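The `-c` values in the two srun lines can be derived rather than memorized. The sketch below assumes all tasks run on a single KNL node with 68 cores and 4 hyperthreads per core, and that each task should get whole cores when there are enough to go around; `cpus_per_task` is a hypothetical helper, not part of Slurm.

```shell
#!/bin/bash
# cpus_per_task TASKS_PER_NODE CORES_PER_NODE THREADS_PER_CORE
# Suggest a value for srun -c, counted in Slurm "CPUs" (hyperthreads).
cpus_per_task() {
    local tasks=$1 cores=$2 ht=$3
    if (( tasks <= cores )); then
        # Whole cores per task: every hyperthread of those cores is reserved.
        echo $(( (cores / tasks) * ht ))
    else
        # More than one task per core: share out the hyperthreads instead.
        echo $(( (cores * ht) / tasks ))
    fi
}

echo "KNL,  64 tasks/node: -c $(cpus_per_task 64 68 4)"
echo "KNL, 128 tasks/node: -c $(cpus_per_task 128 68 4)"
echo "KNL,  32 tasks/node: -c $(cpus_per_task 32 68 4)"
```

With 64 tasks, 4 of the 68 cores are left idle; that is a side effect of using whole cores per task in this sketch.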

Thread affinity (OpenMP)

    export OMP_NUM_THREADS=2
    export OMP_PROC_BIND=spread   # or close
    export OMP_PLACES=cores       # or threads, or sockets
    srun -n 32 -c 8 --cpu_bind=verbose,cores ./my_exec

...If using hyperthreads, use OMP_PLACES=threads

Memory affinity

Linux default behavior is to allocate to the closest NUMA node, if possible

Not always optimal:
•  KNL nodes: DDR is "closer" than MCDRAM

    #SBATCH -C knl,quad,flat
    export OMP_NUM_THREADS=4
    srun -n16 -c16 --cpu_bind=cores --mem_bind=map_mem:1 ./a.out

•  NUMA node 1 is MCDRAM in quad,flat mode
•  "Mandatory" mapping: if using >16 GB, malloc will fail

NOTE: today's reservation is for "cache-mode" nodes, so MCDRAM is invisible

Bottlenecks within the node – Affinity issues

    OMP_PROC_BIND=true
    OMP_PLACES=cores (threads)
    srun -c 4 --cpu_bind=cores …

Bottlenecks within the node

•  Performance tends to be dominated by:
    –  Doing arithmetic operations (#FLOPS/cycle)
    –  Moving data between arithmetic units and memory
•  Bottlenecks are usually due to one of these parts waiting on the other

Bottlenecks within the node

•  Bandwidth!
    –  MCDRAM is very high bandwidth (~450 GB/s)
    –  ...But aggregate bandwidth is not the same as bandwidth to each core!

Hands-on exercise

•  First: Q&A?
•  Exercise: Getting the affinity settings right
    –  (check README.md in ex4-affinity)
    –  We have two jobs in this exercise; the first is fast and sufficient to demonstrate the effect of affinity settings. The second will take longer, so we'll continue with the presentation without waiting for it to complete

Preparing for performance analysis

Profiling prerequisite

•  A small, short, but representative test case for your application
    –  Profiling tends to be costly
        •  Runtime overhead
        •  Size of data collected
    –  BUT: must cover same paths through code as "real" example (or at least, the differences must be understood)
        •  What could go wrong with this test case?

[Figure labels: Real run; Test case for profiling; setup; time-stepping; 1 timestep; output]

…But my application just is big!

•  Start with a low-overhead profiling method
    –  E.g. sampling-based (gprof, TAU, CrayPat, …)
    –  Identify hotspots
•  Only profile part of a run
    –  Some tools (e.g. Vtune) allow you to start the run "paused" and "resume collection" via an API call
•  Only profile 1 MPI rank
    –  Via srun options, e.g. run Vtune on one rank, not others (beyond today's scope)
•  Luckily, our hack-a-kernel is already small! :-)

Today’s exercise

•  Hackathon this afternoon: Using Vtune to analyze hack-a-kernel performance on KNL
•  Vtune: powerful tool, but very information-dense
    –  Best suited to node-level performance analysis (OpenMP, vectorization, memory bandwidth ... not really MPI)
    –  Many knobs to turn
    –  Command-line interface: used at "collect"
        •  Optionally can use to "report" too
    –  GUI interface: explore performance reports

Performance Analysis Gotchas

•  Overhead!
    –  Your job will (usually) take longer while profiling
•  Performance tools often don't play nicely together!
    –  Trying to use same resources
•  Especially (at NERSC):
    –  Darshan is loaded by default, collects I/O performance data
    –  Must unload it before using Vtune (and most other performance tools)

Performance Analysis Gotchas

•  Many tools require a dynamically-linked executable
    –  Including Vtune
    –  NERSC/Cray: applications are statically linked by default, must use "-dynamic" at compile time
•  "Uncore" (e.g. memory related) performance counters usually require special permissions (e.g. kernel module)
    –  NERSC/Slurm supports this with the #SBATCH --perf=vtune directive
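Pulling the gotchas together, a profiling-ready job script might look roughly like the sketch below. It uses only the pieces named on these slides (unloading darshan, the `-dynamic` flag, the `--perf=vtune` directive, the KNL target module); the node count, time limit, `-C knl` constraint and the file name `vtune_job.sh` are assumptions, and the Vtune collection command itself is deliberately left out because it is not shown in this deck. As before, the script is written to a file and only syntax-checked.

```shell
#!/bin/bash
# Write a sketch of a profiling-ready job script, then syntax-check it.
cat > vtune_job.sh <<'EOF'
#!/bin/bash -l
#SBATCH -N 1                       # assumed: a single node is enough here
#SBATCH -t 30                      # assumed: 30 minutes
#SBATCH -C knl                     # assumed constraint for KNL nodes
#SBATCH --perf=vtune               # request access to uncore counters

module unload darshan              # avoid clashing with the profiler
module swap craype-haswell craype-mic-knl

# Build step (could equally be done on a login node beforehand):
ftn -dynamic -o hack-a-kernel.ex hack-a-kernel.f90

# The Vtune "collect" command would go here; see the exercise README.
EOF

bash -n vtune_job.sh && echo "vtune_job.sh: syntax OK"
```

Keeping the module changes inside the script means the job does not depend on whatever happened to be loaded in the submitting shell.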

Hands-on exercise

•  This afternoon: we will optimize hack-a-kernel for KNL
•  Exercise: First, we'll build it for Vtune and run an experiment, so we have results ready to look at this afternoon
    –  (check README.md in ex5-vtune)

Summary and Recap

•  Now you can:
    –  Log in
    –  Build an application
    –  Run an application, and be aware of how and where it is running
    –  Prepare an application for performance analysis with Vtune
•  Q&A?
