Introduction to Parallel Computing (Mike Jarsulic, September 2018)


Page 1: Introduction to Parallel Computing

September 2018

Mike Jarsulic

Page 2: Introduction

Page 3: So Far...

Year   Users  Jobs   CPU Hours  Support Tickets
2012   36     *      1.2M       *
2013   71     *      1.9M       141
2014   84     739K   5.3M       391
2015   125    1.5M   5.2M       372
2016   153    1.6M   6.1M       363
2017   256    4.8M   14.7M      759
2018   296    3.0M   8.8M       631
TOTAL  *      11.7M  40.2M      2657

Page 4: However...

• The BSD research community does not understand how to utilize resources properly
• Example:
  • August 1st, 2018
  • 1,021 cores in use
  • 39 jobs had requested a total of 358 cores
  • All 39 jobs were single-core jobs
  • Thus, 319 cores were dedicated, but idle
• My takeaways:
  • The BSD research community does not really understand parallel computing!
  • I have not done a good job of training the BSD research community in this area!

Page 5: Office Hours

• Held every Tuesday and Thursday at Peck Pavilion N161
• Users come in to work one-on-one to better utilize our HPC resources
• Three common questions on parallel computing:
  1. Is this a worthwhile skill to learn? Aren't single-processor systems fast enough and always improving?
  2. Why can't processor manufacturers just make faster processors?
  3. Will you write me a program that will automatically convert my serial application into a parallel application?

Page 6: Obligatory Cost per Genome Slide

Page 7: Motivation

Page 8: Goals (of this presentation)

• Understand the basic terminology found in parallel computing
• Recognize the different types of parallel architectures
• Learn how parallel performance is measured
• Gain a high-level understanding of the different paradigms in parallel programming and what libraries are available to implement those paradigms
• Please note that you will not be a parallel programming rock star after this three-hour introduction

Page 9: Overall Goals

• Hopefully, there will be enough interest after this presentation to support a workshop series on parallel computing
• Possible topics for workshops are:
  • Scientific Programming
  • Parameterization
  • OpenMP
  • MPI
  • CUDA
  • OpenACC

Page 10: Today's Outline

• Terminology
• Parallel Architectures
• Measuring Parallel Performance
• Embarrassingly Parallel Workflows
• Shared Memory
• Distributed Memory
• Vectorization
• Hybrid Computing

Page 11: A Note on Sources

• Presentation given at SC17 by Stout and Jablonowski from the University of Michigan
• Introduction to Parallel Computing, 2nd Edition, by Grama, Gupta, et al.
• Introduction to Parallel Programming by Pacheco
• Using OpenMP by Chapman, Jost, and Van Der Pas
• Using MPI by Gropp, Lusk, and Skjellum
• Advanced MPI by Gropp, Hoefler, et al.
• CUDA for Engineers by Storti and Yurtoglu
• CUDA by Example by Sanders and Kandrot
• Parallel Programming with OpenACC by Farber
• The LLNL OpenMP and MPI tutorials

Page 12: A Note on Format

• Today's seminar will be recorded for an ITM/CTSA grant
• Questions
• Whiteboard
• Swearing

Page 13: Terminology

Page 14: Caveats

• There are no standardized definitions for terms in parallel computing; thus, resources outside of this presentation may use the terms in a different way
• Many algorithms or software applications that you will find in the real world do not fall neatly into the categories mentioned in this presentation, since they blend approaches in ways that are difficult to categorize.

Page 15: Basic Terminology

• Parallel Computing – solving a problem in which multiple tasks cooperate closely and make simultaneous use of multiple processors
• Embarrassingly Parallel – solving many similar, independent tasks; parameterization
• Multiprocessor – multiple processors (cores) on a single chip
• Symmetric Multiprocessing (SMP) – multiple processors sharing a single address space, OS instance, storage, etc. All processors are treated equally by the OS
• UMA – Uniform Memory Access; all processors share the memory space uniformly
• NUMA – Non-Uniform Memory Access; memory access time depends on the location of the memory relative to the processor

Page 16: More Basic Terminology

• Cluster Computing – a parallel system consisting of a combination of commodity units (processors or SMPs)
• Capacity Cluster – a cluster designed to run a variety of types of problems regardless of size; the goal is to perform as much research as possible
• Capability Cluster – the fastest, largest machines, designed to solve large problems; the goal is to demonstrate the capability of the parallel system
• High Performance Computing – solving large problems via a cluster of computers that consists of fast networks and massive storage

Page 17: Even More Basic Terminology

• Instruction-Level Parallelism (ILP) – improves processor performance by having multiple processor components simultaneously executing instructions
• Pipelining – breaking a task into steps performed by different processors with inputs streaming through, much like an assembly line
• Superscalar Execution – the ability for a processor to execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor
• Cache Coherence – synchronization between the local caches for each processor

Page 18: Crash Simulation Example

• Simplified model based on a crash simulation for the Ford Motor Company
• Illustrates various aspects common to many simulations and applications
• This example was provided by Q. Stout and C. Jablonowski of the University of Michigan

Page 19: Finite Element Representation

• Car is modeled by a triangulated surface (elements)
• The simulation models the movement of the elements, incorporating the forces on the elements to determine their new position
• In each time step, the movement of each element depends on its interaction with the other elements to which it is physically adjacent
• In a crash, elements may end up touching that were not touching initially
• The state of an element is its location, velocity, and information such as whether it is metal that is bending

Page 20: Car

Page 21: Serial Crash Simulation

1. For all elements
2.   Read State(element), Properties(element), Neighbor_list(element)
3. For time = 1 to end_of_simulation
4.   For element = 1 to num_elements
5.     Compute State(element) for the next time step, based on the previous state of the element and its neighbors and the properties of the element

Page 22: Simple Approach to Parallelization (Concepts)

• Distributed Memory – parallel system based on processors linked with a fast network; processors communicate via messages
• Owner Computes – distribute elements to processors; each processor updates the elements which it contains
• Single Program Multiple Data (SPMD) – all machines run the same program on independent data; dominant form of parallel computing

Page 23: Distributed Car

Page 24: Basic Parallel Version

Concurrently for all processors P:
1. For all elements assigned to P
2.   Read State(element), Properties(element), Neighbor_list(element)
3. For time = 1 to end_of_simulation
4.   For element = 1 to num_elements_in_P
5.     Compute State(element) for the next time step, based on the previous state of the element and its neighbors, and on the properties of the element

Page 25: Notes

• Most of the code is the same as, or similar to, serial code.
• The high-level structure remains the same: a sequence of steps
  • The sequence is a serial construct, but
  • now the steps are performed in parallel, but
  • calculations for individual elements are serial
• Question:
  • How does each processor keep track of adjacency info for neighbors in other processors?

Page 26: Distributed Car (ghost cells)

Page 27: Parallel Architectures

Page 28: Flynn's Taxonomies

• Hardware classifications
• {S,M}I{S,M}D
• Single Instruction (SI) – system in which all processors execute the same instruction
• Multiple Instruction (MI) – system in which different processors may execute different instructions
• Single Data (SD) – system in which all processors operate on the same data
• Multiple Data (MD) – system in which different processors may operate on different data
• M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948-960, 1972.

Page 29: Flynn's Taxonomies

• SISD – classic von Neumann architecture; serial computer
• MIMD – collections of autonomous processors that can execute multiple independent programs, each of which can have its own data stream
• SIMD – data is divided among the processors and each data item is subjected to the same sequence of instructions; GPUs, Advanced Vector Extensions (AVX)
• MISD – very rare; systolic arrays; smartphones carried by Chupacabras

Page 30: CRAY-1 Vector Machine (1976)

Page 31: Vector Machines Today

Page 32: Software Taxonomies

• Data Parallel (SIMD)
  • Parallelism that is a result of identical operations being applied concurrently on different data items; e.g., many matrix algorithms
  • Difficult to apply to complex problems
• Single Program, Multiple Data (SPMD)
  • A single application is run across multiple processes/threads on a MIMD architecture
  • Most processes execute the same code but do not work in lock-step
  • Dominant form of parallel programming

Page 33: SISD vs. SIMD

Page 34: MIMD Architectures (Shared Memory)

(Diagrams: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA))

Page 35: More MIMD Architectures

(Diagrams: Distributed Memory and Hybrid Memory)

Page 36: Shared Memory (SM)

• Attributes:
  • Global memory space
  • Each processor will utilize its own cache for a portion of global memory; consistency of the cache is maintained by hardware
• Advantages:
  • User-friendly programming techniques (OpenMP and OpenACC)
  • Low latency for data sharing between tasks
• Disadvantages:
  • The global memory space-to-CPU path may be a bottleneck
  • Non-Uniform Memory Access
  • The programmer is responsible for synchronization

Page 37: Distributed Memory (DM)

• Attributes:
  • Memory is shared amongst processors through message passing
• Advantages:
  • Memory scales based on the number of processors
  • Access to a processor's own memory is fast
  • Cost effective
• Disadvantages:
  • Error prone; programmers are responsible for the details of the communication
  • Complex data structures may be difficult to distribute

Page 38: Hardware/Software Models

• Software and hardware models do not need to match
• DM software on SM hardware:
  • Message Passing Interface (MPI) – designed for DM hardware but available on SM systems
• SM software on DM hardware:
  • Remote Memory Access (RMA), included within MPI-3
  • Partitioned Global Address Space (PGAS) languages
    • Unified Parallel C (extension to ISO C99)
    • Coarray Fortran (Fortran 2008)

Page 39: Difficulties

• Serialization causes bottlenecks
• Workload is not distributed
• Debugging is hard
• The serial approach doesn't parallelize

Page 40: Parallel Performance

Page 41: Definitions

• SerialTime(n) – time for a serial program to solve a problem with an input of size n
• ParallelTime(n,p) – time for a parallel program to solve the same problem with an input of size n using p processors

• SerialTime(n) ≤ ParallelTime(n,1)

• Speedup(n,p) = SerialTime(n) / ParallelTime(n,p)
• Work(n,p) = p * ParallelTime(n,p)
• Efficiency(n,p) = SerialTime(n) / [p * ParallelTime(n,p)]
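As a concrete illustration of these definitions (the numbers are hypothetical, not measurements from this deck), suppose a serial run takes 100 seconds and the parallel version takes 16 seconds on 8 processors:

Speedup(n,8) = 100 s / 16 s = 6.25
Work(n,8) = 8 * 16 s = 128 s
Efficiency(n,8) = 100 s / (8 * 16 s) ≈ 0.78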

Page 42: Definitions

• Expectations:
  • 0 < Speedup ≤ p
  • Serial Work ≤ Parallel Work < ∞
  • 0 < Efficiency ≤ 1
• Linear Speedup: Speedup = p, given some restriction on the relationship between n and p

Page 43: In The Real World

Page 44: Superlinear Speedup

• Superlinear Speedup: Speedup > p, thus Efficiency > 1
• Why?
  • The parallel computer has p times as much RAM, so a higher fraction of program memory is in RAM instead of on disk.
  • The parallel program used a different algorithm which was not possible with the serial program.

Page 45: Amdahl's Law

• Let f be the fraction of time spent on serial operations
• Let ParallelOps = 1 - f and assume linear speedup
• For p processors:
  • ParallelTime(p) ≥ SerialTime * [f + (ParallelOps / p)]
  • Speedup(p) ≤ 1 / (f + ParallelOps / p)

• Thus, Speedup(p) ≤ 1/f, no matter the number of processors used
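A worked instance of the bound (my own numbers, for illustration): if 5% of the run time is serial, then f = 0.05 and Speedup(p) ≤ 1 / (0.05 + 0.95/p). With p = 64 this gives Speedup ≤ 1 / (0.05 + 0.0148) ≈ 15.4, and even as p grows without bound the speedup can never exceed 1/0.05 = 20.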

Page 46: Maximum Possible Performance

Page 47: Pessimistic Interpretation

Page 48: Optimistic Interpretation

• We should be more optimistic than Amdahl's Law because:
  • Algorithms with much smaller values of f
  • More time spent in RAM than disk
  • Time spent in f is a decreasing fraction of the total time as problem size increases

Page 49: Common Program Structure

Page 50: Scaling Strategy

• Increase n as p increases
  • Amdahl considered strong scaling, where n is fixed
• Utilize weak scaling
  • The amount of data per processor is fixed
  • Efficiency can remain high if communication does not increase excessively

Page 51: Embarrassingly Parallel Jobs

Page 52: Parameterization

• Requirements
  • Independent tasks
  • Common applications
  • One job submission per task
• Techniques
  • Parameter sweeps (modify parameters mathematically for each task)
  • List-driven analysis (modify parameters based on a list of values)

Page 53: Example: List-Driven Analysis

• Problem: Calculate the theoretical performance for the CRI HPC clusters (historically)
• Rpeak = Nodes * CoresPerNode * ClockSpeed * InstructPerCycle

Page 54: List

Cluster     Nodes  CoresPerNode  ClockSpeed  InstructPerCycle
ibicluster  99     8             2.66e09     4
tarbell     40     64            2.20e09     8
gardner     126    28            2.00e09     16
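As a worked instance of the Rpeak formula using the gardner row above (the arithmetic is mine, not from the slides):

Rpeak(gardner) = 126 * 28 * 2.00e09 * 16 = 1.129e14 FLOPS ≈ 112.9 TFLOPS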

Page 55: Base

#PBS -N Rpeak.@Cluster
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb
#PBS -l walltime=20:00

./rpeak -n @Nodes -c @CoresPerNode \
        -s @ClockSpeed -i @InstructPerCycle

Page 56: Shared Memory Parallelism

Page 57: Shared Memory

• All processors can directly access all the memory in the system (though access times may be different due to NUMA)
• Software portability – applications can be developed on serial machines and run on parallel machines without any changes
• Latency hiding – ability to mask the access latency for memory access, I/O, and communications
• Scheduling – supports system-level dynamic mapping of tasks to processors
• Ease of programming – significantly easier to program (in my opinion) using a shared memory model over a distributed memory model

Page 58: Parallelization Techniques: pthreads

• POSIX threads
• Standard threading implementation for UNIX-like operating systems (Linux, Mac OS X)
• Library that can be linked with C programs
• Some compilers include a Fortran version of pthreads
• Knowledge of pthreads can easily be transferred to programming other widely used threading specifications (e.g., Java threads)

Page 59: Matrix-Vector Multiplication (pthreads)

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Global variable: accessible to all threads */
int thread_count;

void *Pth_mat_vect(void *rank);   /* Thread function */

int main(int argc, char *argv[]) {
    long thread;                  /* Use long in case of a 64-bit system */
    pthread_t *thread_handles;

    /* Get the number of threads from the command line */
    thread_count = strtol(argv[1], NULL, 10);
    thread_handles = malloc(thread_count * sizeof(pthread_t));

Page 60: Matrix-Vector Multiplication (pthreads)

    /* Create a thread for each rank */
    for (thread = 0; thread < thread_count; thread++)
        pthread_create(&thread_handles[thread], NULL,
                       Pth_mat_vect, (void *) thread);

    /* Join the results from each thread */
    for (thread = 0; thread < thread_count; thread++)
        pthread_join(thread_handles[thread], NULL);

    free(thread_handles);
    return 0;
}   /* main */

Page 61: Matrix-Vector Multiplication (pthreads)

/* m, n, A, x, and y are assumed to be globals defined elsewhere */
void *Pth_mat_vect(void *rank) {
    long my_rank = (long) rank;
    int i, j;
    int local_m = m / thread_count;
    int my_first_row = my_rank * local_m;
    int my_last_row = (my_rank + 1) * local_m - 1;

    for (i = my_first_row; i <= my_last_row; i++) {
        y[i] = 0.0;
        for (j = 0; j < n; j++)
            y[i] += A[i][j] * x[j];
    }
    return NULL;
}   /* Pth_mat_vect */

Page 62: Parallelization Techniques: OpenMP

• OpenMP is sort of an HPC standard for shared memory programming
• OpenMP version 4.5 was released in 2015 and includes accelerator support as an advanced feature
• API for thread-based parallelism
• Explicit programming model; the compiler interprets directives
• Based on a combination of compiler directives, library routines, and environment variables
• Uses the fork-join model of parallel execution
• Available in most Fortran and C compilers

Page 63: OpenMP Goals

• Standardization: standard among all shared memory architectures and hardware platforms
• Lean: simple and limited set of compiler directives for shared memory programming; often significant performance gains using just 4-6 directives in complex applications
• Ease of use: supports incremental parallelization of a serial program, not an all-or-nothing approach
• Portability: supports Fortran, C, and C++

Page 64: OpenMP Building Blocks

• Compiler directives (embedded in code)
  • Parallel regions (PARALLEL)
  • Parallel loops (PARALLEL DO)
  • Parallel workshare (PARALLEL WORKSHARE)
  • Parallel sections (PARALLEL SECTIONS)
  • Parallel tasks (PARALLEL TASK)
  • Serial sections (SINGLE)
  • Synchronization (BARRIER, CRITICAL, ATOMIC, ...)
  • Data structures (PRIVATE, SHARED, REDUCTION)
• Run-time library routines (called in code)
  • OMP_SET_NUM_THREADS
  • OMP_GET_NUM_THREADS
• UNIX environment variables (set before program execution)
  • OMP_NUM_THREADS
• A minimal C sketch combining all three kinds of building blocks appears below
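A minimal sketch, assuming a C compiler with OpenMP support (the array and loop are illustrative, not from the slides):

#include <omp.h>
#include <stdio.h>

int main(void) {
    double a[1000];

    /* Run-time library routine; setting OMP_NUM_THREADS=4 in the shell */
    /* environment before execution would have the same effect          */
    omp_set_num_threads(4);

    /* Compiler directive: a parallel region containing a parallel loop */
    #pragma omp parallel for shared(a)
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;

    printf("a[999] = %f\n", a[999]);
    return 0;
}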

Page 65: Fork-Join Model

• Parallel execution is achieved by generating threads which are executed in parallel
• The master thread executes in serial until the first parallel region is encountered
• Fork – the master thread creates a team of threads which are executed in parallel
• Join – when the team members complete the work, they synchronize and terminate; the master thread continues sequentially

Page 66: Fork-Join Model

Page 67: Work-Sharing Constructs

Page 68: Barriers

• Barriers may be needed for correctness
• Synchronization degrades performance

Page 69: OpenMP Notation

• OpenMP recognizes the following compiler directives, which start with:
  • !$OMP (in Fortran)
  • #pragma omp (in C/C++)

Page 70: Parallel Loops

• Each thread executes part of the loop
• The number of iterations is statically assigned to the threads upon entry to the loop
• The number of iterations cannot be changed during execution
• Implicit BARRIER at the end of the loop
• High efficiency

Page 71: Parallel Loop Example (Fortran)

!$OMP PARALLEL
!$OMP DO
DO i = 1, n
   a(i) = b(i) + c(i)
ENDDO
!$OMP END DO
!$OMP END PARALLEL

Page 72: Parallel Sections

• Multiple independent sections can be executed concurrently by separate threads
• Each parallel section is assigned to a specific thread, which executes the work from start to finish
• Implicit BARRIER at the end of each SECTION
• Nested parallel sections are possible, but impractical

Page 73: Parallel Sections Example (Fortran)

!$OMP PARALLEL SECTIONS
!$OMP SECTION
DO i = 1, n
   a(i) = b(i) + c(i)
ENDDO
!$OMP SECTION
DO i = 1, k
   d(i) = e(i) + f(i)
ENDDO
!$OMP END PARALLEL SECTIONS

Page 74: Parallel Tasks and Workshares

• Parallel Tasks
  • Unstructured parallelism
  • Irregular problems like:
    • Recursive algorithms
    • Unbounded loops
• Parallel Workshares
  • Fortran only
  • Used to parallelize assignments

(A small C task sketch appears below.)
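A minimal, hypothetical sketch of OpenMP tasks in C (not from the slides): one thread generates a task per work item, and any idle thread in the team may execute it.

#include <omp.h>
#include <stdio.h>

/* Stand-in for an irregular unit of work */
void process(int i) {
    printf("item %d handled by thread %d\n", i, omp_get_thread_num());
}

int main(void) {
    #pragma omp parallel      /* fork a team of threads            */
    #pragma omp single        /* a single thread generates tasks   */
    for (int i = 0; i < 8; i++) {
        #pragma omp task firstprivate(i)
        process(i);           /* tasks are picked up by idle threads */
    }                         /* implicit barrier: all tasks finish  */
    return 0;
}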

Page 75: Matrix-Vector Multiplication: OpenMP C

#include <stdio.h>
#include <stdlib.h>

void Omp_mat_vect(int m, int n, double *restrict a,
                  double *restrict b, double *restrict c) {
    int i, j;

    #pragma omp parallel for default(none) \
            shared(m, n, a, b, c) private(i, j)
    for (i = 0; i < m; i++) {
        a[i] = 0.0;
        for (j = 0; j < n; j++)
            a[i] += b[i*n + j] * c[j];
    }   /* End of omp parallel for */
}

Page 76: Matrix-Vector Multiplication: OpenMP Fortran

subroutine Pth_mat_vect(m, n, a, b, c)
   implicit none
   integer(kind=4) :: m, n
   real(kind=8) :: a(1:m), b(1:m,1:n), c(1:n)
   integer :: i, j

!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP SHARED(m, n, a, b, c) PRIVATE(i, j)
   do i = 1, m
      a(i) = 0.0
      do j = 1, n
         a(i) = a(i) + b(i,j) * c(j)
      enddo
   enddo
!$OMP END PARALLEL DO

   return
end subroutine Pth_mat_vect

Page 77: OpenMP Errors

• Variable scoping: which variables are shared and which are private
• Sequential I/O in a parallel region can lead to an unpredictable order
• False sharing: two or more processors access different variables in the same cache line (performance)
• Race conditions
• Deadlock

Page 78: Summation Example

What is wrong with the following code?

sum = 0.0
!$OMP PARALLEL DO
DO i = 1, n
   sum = sum + a(i)
ENDDO
!$OMP END PARALLEL DO

Page 79: Summation Example Correction

sum = 0.0
!$OMP PARALLEL DO REDUCTION(+:sum)
DO i = 1, n
   sum = sum + a(i)
ENDDO
!$OMP END PARALLEL DO

Page 80: Distributed Memory Parallelism

Page 81: Message Passing

• Communication on distributed memory systems is a significant aspect of performance and correctness
• Messages are relatively slow, with startup times (latency) taking thousands of cycles
• Once message passing has started, the additional time per byte (bandwidth) is relatively small

Page 82: Performance on Gardner

• Intel Xeon E5-2683 processor (Haswell)
• Processor speed: 2,000 cycles per microsecond (µsec)
• 16 FLOPs/cycle: 32,000 FLOPs per µsec
• MPI message latency = ~2.5 µsec, or 80,000 FLOPs
• MPI message bandwidth = ~7,000 bytes/µsec = 4.57 FLOPs/byte

Page 83: Reducing Latency

• Reduce the number of messages by mapping communicating entities onto the same processor
• Combine messages having the same sender and destination
• If processor A has data needed by processor B, have A send it to B, rather than waiting for B to request it. Processor A should send as soon as the data is ready; processor B should read it as late as possible to increase the probability that the data has arrived

Page 84: Other Issues With Message Passing

• Network congestion
• Deadlock
  • Blocking: a processor cannot proceed until the message is finished
  • With blocking communication, you may reach a point where no processor can proceed
  • Non-blocking communication: the easiest way to prevent deadlock; processors can send and proceed before the receive is finished

Page 85: Message Passing Interface (MPI)

• International standard
• MPI 1.0 released in 1994
• Current version is 3.1 (June 2015)
• MPI 4.0 standard is in the works
• Available on all parallel systems
• Interfaces in C/C++ and Fortran
• Bindings for MATLAB, Python, R, and Java
• Works on both distributed memory and shared memory hardware
• Hundreds of functions, but you will need ~6-10 to get started

Page 86: MPI Basics

• Two common MPI functions:
  • MPI_SEND() – to send a message
  • MPI_RECV() – to receive a message
• They function like write and read statements
• Both are blocking operations
• However, a system buffer is used that allows small messages to be non-blocking, while large messages will be blocking
• The system buffer is based on the MPI implementation, not the standard
• Blocking communication may lead to deadlocks

Page 87: Small Messages

Page 88: Large Messages

Page 89: Deadlocks

(Table columns: Source, Avoidance)

Page 90: Non-Blocking Communication

• Non-blocking operations
  • MPI_ISEND()
  • MPI_IRECV()
  • MPI_WAIT()
• The user can check for data at a later stage in the program without waiting
  • MPI_TEST()
• Non-blocking operations will perform better than blocking operations
• Possible to overlap communication with computation (a minimal C sketch appears below)
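A minimal sketch of the non-blocking pattern, assuming exactly two ranks (e.g., mpirun -np 2) exchanging one double; the values and tag are illustrative, not from the slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double mine = 3.14, theirs = 0.0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int partner = (rank == 0) ? 1 : 0;

    /* Post the receive without blocking, then send */
    MPI_Irecv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&mine, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);

    /* ... useful computation could overlap with the communication here ... */

    MPI_Wait(&req, &status);   /* block only when the data is actually needed */
    printf("rank %d received %f\n", rank, theirs);

    MPI_Finalize();
    return 0;
}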

Page 91: MPI Program Template

#include "mpi.h"

/* Initialize MPI */
MPI_Init(&argc, &argv);

/* Allow each processor to determine its role */
int num_processor, my_rank;
MPI_Comm_size(MPI_COMM_WORLD, &num_processor);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

/* Send and receive some stuff here */

/* Finalize */
MPI_Finalize();

Page 92: MPI Summation

Page 93: MPI Summation

/* Send and receive some stuff here */
if (my_rank == 0) {
    sum = 0.0;
    for (source = 1; source < num_procs; source++) {
        MPI_Recv(&value, 1, MPI_FLOAT, source, tag,
                 MPI_COMM_WORLD, &status);
        sum += value;
    }
} else {
    MPI_Send(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD);
}

Page 94: A More Efficient Receive

• In the previous example, rank 0 received the messages in order based on rank.
• Thus, if processor 2 was delayed, then processor 0 was also delayed
• Make the following modification:

MPI_Recv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);

• Now processor 0 can process messages as soon as they arrive

Page 95: MPI Reduction

• Like OpenMP, MPI also has reduction operations that can be used for tasks such as summation
• Reduction is a type of collective communication

MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

• Reduction operations:
  • MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD
  • MPI_LAND, MPI_LOR

(A small, complete example appears below.)
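A minimal, self-contained sketch (my own illustration, not from the slides) that replaces the manual send/receive summation with MPI_Reduce; each rank contributes one value and rank 0 receives the sum:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int my_rank, num_procs;
    float value, sum = 0.0f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    value = (float) my_rank;   /* each rank contributes something */

    /* Combine every rank's value with MPI_SUM; rank 0 gets the result */
    MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("sum over %d ranks = %f\n", num_procs, sum);

    MPI_Finalize();
    return 0;
}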

Page 96: Collective Communication

• The most frequently invoked MPI routines after send and receive
• Improve clarity, run faster than send and receive, and reduce the likelihood of making an error
• Examples
  • Broadcast (MPI_Bcast)
  • Scatter (MPI_Scatter)
  • Gather (MPI_Gather)
  • All Gather (MPI_Allgather)
• A short C sketch of broadcast and scatter appears below
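A hedged sketch of two of these collectives in C; the array size and values are made up for illustration, and it assumes the number of elements divides evenly by the number of ranks:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, nprocs, n = 8;
    int params[2] = {0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) { params[0] = n; params[1] = 42; }

    /* Broadcast: every rank gets a copy of params from rank 0 */
    MPI_Bcast(params, 2, MPI_INT, 0, MPI_COMM_WORLD);

    /* Scatter: rank 0 deals out equal chunks of data, one chunk per rank */
    int chunk = n / nprocs;
    double *data = NULL;
    double *mine = malloc(chunk * sizeof(double));
    if (rank == 0) {
        data = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) data[i] = (double) i;
    }
    MPI_Scatter(data, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    printf("rank %d: first local element = %f\n", rank, mine[0]);

    free(mine);
    free(data);
    MPI_Finalize();
    return 0;
}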

Page 97: Broadcast

Page 98: Scatter

Page 99: Gather

Page 100: All Gather

Page 101: MPI Data Types

• Primitive data types (correspond to those found in the underlying language)
• Fortran
  • MPI_CHARACTER
  • MPI_INTEGER
  • MPI_REAL, MPI_DOUBLE_PRECISION
  • MPI_LOGICAL
• C/C++ (include many more variants than Fortran)
  • MPI_CHAR
  • MPI_INT, MPI_UNSIGNED, MPI_LONG
  • MPI_FLOAT, MPI_DOUBLE
  • MPI_C_BOOL

Page 102: MPI Data Types

• Derived data types are also possible
  • Vector – data separated by a constant stride
  • Contiguous – data separated by a stride of 1
  • Struct – mixed types
  • Indexed – array of indices
• A small vector-type sketch appears below
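As a hedged illustration of a derived type (my example, not from the slides): build a vector type that picks one column out of a row-major 4x4 matrix of doubles, so the column can travel as a single message.

#include <mpi.h>

int main(int argc, char *argv[]) {
    double matrix[4][4];          /* row-major storage */
    MPI_Datatype column;

    MPI_Init(&argc, &argv);

    /* 4 blocks of 1 double each, separated by a stride of 4 doubles */
    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* One column can now be sent as a single unit, e.g.:              */
    /* MPI_Send(&matrix[0][1], 1, column, dest, tag, MPI_COMM_WORLD);  */
    /* (dest and tag are placeholders)                                 */

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}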

Page 103: MPI Vector

Page 104: MPI Contiguous

Page 105: MPI Struct

Page 106: MPI Indexed

Page 107: MPI Synchronization

• Implicit synchronization
  • Blocking communication
  • Collective communication
• Explicit synchronization
  • MPI_Wait
  • MPI_Waitany
  • MPI_Barrier
• Remember, synchronization can hinder performance

Page 108: Additional MPI Features

• Virtual topologies
• User-created communicators
• User-specified derived data types
• Parallel I/O
• One-sided communication (MPI_Put, MPI_Get)
• Non-blocking collective communication
• Remote Memory Access

Page 109: Vectorization

Page 110: Accelerators

• Principle: split an operation into independent parts and execute the parts concurrently in an accelerator (SIMD)

DO I = 1, 1000
   A(I) = B(I) + C(I)
ENDDO

• Accelerators like Graphics Processing Units are specialized for SIMD operations

Page 111: Popularity

• In 2012, 13% of the performance share of the Top 500 supercomputers was provided through accelerators
• Today, the performance share for accelerators is ~38%
• Trend: build diverse computing platforms with multiprocessors, SMP nodes, and accelerators
• Caveat: difficult to use heterogeneous hardware effectively

Page 112: Modern Accelerators

NVidia Volta
• 7.5 TFLOPs double precision
• 30 TFLOPs half precision
• 5120 cores
• 16 GB RAM
• 900 GB/s memory bandwidth

Intel Xeon Phi Knight's Landing
• 3 TFLOPs double precision
• 72 cores (288 threads)
• 16 GB RAM
• 384 GB/s memory bandwidth
• Includes both a scalar and a vector unit

Page 113: NVidia Architecture

• SIMT: Single Instruction, Multiple Thread (similar to SIMD)
• Thread blocks are executed by a warp of 32 cores
• Performance is dependent on memory alignment and eliminating shared memory access

Page 114: Programming GPUs

• Low-level programming
  • Proprietary – NVidia's CUDA
  • Portable – OpenCL
• High-level programming
  • OpenACC
  • OpenMP 4.5
  • Allows for a single code base that may be compiled on both multicore and GPUs
  • (a small OpenMP 4.5 offload sketch appears below)
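A hedged sketch of the high-level, directive-based style using OpenMP 4.5 target offload (my illustration; the slides themselves show OpenACC next). With a compiler that supports offloading, the loop can run on a GPU; otherwise it simply runs on the host:

#include <stdio.h>

int main(void) {
    const int n = 1000;
    float a[1000], b[1000], c[1000];

    for (int i = 0; i < n; i++) { b[i] = (float) i; c[i] = 2.0f * i; }

    /* Offload the loop to an accelerator if one is available;    */
    /* the map() clauses move data between host and device memory */
    #pragma omp target teams distribute parallel for \
            map(to: b[0:n], c[0:n]) map(from: a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    printf("a[999] = %f\n", a[999]);
    return 0;
}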

Page 115: CUDA Matrix Multiplication

/* n and threads_per_block are assumed to be defined elsewhere */
int main(void) {
    int *a, *b, *c;                 /* CPU copies */
    int *dev_a, *dev_b, *dev_c;     /* GPU copies */

    /* Allocate memory on the GPU */
    int arraysize = n * sizeof(int);
    cudaMalloc((void **) &dev_a, arraysize);
    cudaMalloc((void **) &dev_b, arraysize);
    cudaMalloc((void **) &dev_c, sizeof(int));

    /* use malloc for a, b, c; initialize a, b */

    /* Move a and b to the GPU */
    cudaMemcpy(dev_a, a, arraysize, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, arraysize, cudaMemcpyHostToDevice);

    /* Launch the GPU kernel */
    dot<<<n / threads_per_block, threads_per_block>>>(dev_a, dev_b, dev_c);

    /* Copy the result back to the CPU */
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    return 0;
}

Page 116: CUDA Matrix Multiplication

/* threads_per_block is assumed to be a compile-time constant */
__global__ void dot(int *a, int *b, int *c) {
    __shared__ int temp[threads_per_block];

    /* Convert the thread/block index to a global index */
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    temp[threadIdx.x] = a[index] * b[index];

    /* Avoid a race condition on the per-block sum */
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < threads_per_block; i++)
            sum += temp[i];
        /* add to the global sum */
        atomicAdd(c, sum);
    }
}

Page 117: OpenACC and OpenMP

• Share the same model of computation
• Host-centric – the host device offloads code regions and data to accelerators
• Device – has independent memory and multiple threads
• Mapping clause – defines the relationship between memory on the host device and memory on the accelerator

Page 118: OpenACC Matrix Multiplication

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    float *a = (float *) malloc(sizeof(float) * n * n);
    float *b = (float *) malloc(sizeof(float) * n * n);
    float *c = (float *) malloc(sizeof(float) * n * n);

    #pragma acc data copyin(a[0:n*n], b[0:n*n]), copy(c[0:n*n])
    {
        int i, j, k;

        #pragma acc kernels
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                a[i*n + j] = (float) i + j;
                b[i*n + j] = (float) i - j;
                c[i*n + j] = 0.0;
            }
        }

Page 119: OpenACC Matrix Multiplication

        #pragma acc kernels
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                for (k = 0; k < n; k++) {
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
                }
            }
        }
    }
}

Page 120: Speedup?

• Some problems can achieve a speedup of 10x-100x by offloading some of the work to a GPU (e.g., Linpack, matrix multiplication)
• In reality:
  • Only a small fraction of applications can utilize a GPU effectively
  • The average speedup of those applications is ~2x-4x
  • Optimizing your code using multicore technologies is probably a better use of your time

Page 121: Problem Areas

• Branching can cut your performance by 50%, since instructions need to be issued for both branches
• Memory bandwidth on GPUs is poor
  • Need to make sure your operands are used more than once
  • For the NVidia Volta, all operands need to be used 67x to achieve peak FLOPs

Page 122: Accelerator Trends

• Intel has canceled Knight's Landing and Knight's Hill; the future status of the Xeon Phi is unknown
• GPUs are becoming less rigidly SIMD and are including better memory bandwidth
• More programmers are moving from low-level programming like CUDA to high-level directives
• The PGI compiler was purchased by NVidia in 2013; support for OpenCL was cut
• OpenACC and OpenMP still have not merged
• The efficiency of GPUs is still problematic and the programming model is a moving target
• New motivation: deep learning

Page 123: Hybrid Computing

Page 124: Current Trend

• Accelerator-based machines are grabbing the high rankings on the Top 500, but clusters with standard cores are still most important economically
• Economics dictates a distributed memory system of shared memory boards with multicore commodity chips, perhaps with some form of accelerator distributed sparingly

Page 125: MPI+X

Page 126: Conclusion

Page 127: Topics for Possible Future Workshops

• Scientific Computing (C++ or Fortran or both)
• Parallel Computing Theory (e.g., load balancing, measuring performance)
• Embarrassingly Parallel
• pthreads
• OpenMP
• MPI
• OpenACC
• CUDA
• Parallel Debugging/Profiling