Introduction to Parallel Computing (Mike Jarsulic, September 2018)


Page 1: Introduction to Parallel Computing

September 2018

Mike Jarsulic

Page 2: Introduction

Page 3: So Far...

Year   Users  Jobs   CPU Hours  Support Tickets
2012   36     *      1.2M       *
2013   71     *      1.9M       141
2014   84     739K   5.3M       391
2015   125    1.5M   5.2M       372
2016   153    1.6M   6.1M       363
2017   256    4.8M   14.7M      759
2018   296    3.0M   8.8M       631
TOTAL  *      11.7M  40.2M      2657

Page 4: However...

• The BSD research community does not understand how to utilize resources properly
• Example:
  • August 1st, 2018
  • 1,021 cores in use
  • 39 jobs had requested a total of 358 cores
  • All 39 jobs were single-core jobs
  • Thus, 319 cores were dedicated, but idle
• My takeaways:
  • The BSD research community does not really understand parallel computing!
  • I have not done a good job of training the BSD research community in this area!

Page 5: Office Hours

• Held every Tuesday and Thursday at Peck Pavilion N161
• Users come in to work one-on-one to better utilize our HPC resources
• Three common questions on parallel computing:
  1. Is this a worthwhile skill to learn? Aren't single-processor systems fast enough and always improving?
  2. Why can't processor manufacturers just make faster processors?
  3. Will you write me a program that will automatically convert my serial application into a parallel application?

Page 6: Obligatory Cost per Genome Slide

Page 7: Motivation

Page 8: Goals (of this presentation)

• Understand the basic terminology found in parallel computing
• Recognize the different types of parallel architectures
• Learn how parallel performance is measured
• Gain a high-level understanding of the different paradigms in parallel programming and what libraries are available to implement those paradigms
• Please note that you will not be a parallel programming rock star after this three-hour introduction

Page 9: Overall Goals

• Hopefully, there will be enough interest after this presentation to support a workshop series on parallel computing
• Possible topics for workshops are:
  • Scientific Programming
  • Parameterization
  • OpenMP
  • MPI
  • CUDA
  • OpenACC

Page 10: Today's Outline

• Terminology
• Parallel Architectures
• Measuring Parallel Performance
• Embarrassingly Parallel Workflows
• Shared Memory
• Distributed Memory
• Vectorization
• Hybrid Computing

Page 11: A Note on Sources

• Presentation given at SC17 by Stout and Jablonowski from the University of Michigan
• Introduction to Parallel Computing, 2nd Edition, by Grama, Gupta, et al.
• Introduction to Parallel Programming by Pacheco
• Using OpenMP by Chapman, Jost, and Van Der Pas
• Using MPI by Gropp, Lusk, and Skjellum
• Advanced MPI by Gropp, Hoefler, et al.
• CUDA for Engineers by Storti and Yurtoglu
• CUDA by Example by Sanders and Kandrot
• Parallel Programming with OpenACC by Farber
• The LLNL OpenMP and MPI tutorials

Page 12: A Note on Format

• Today's seminar will be recorded for an ITM/CTSA grant
• Questions
• Whiteboard
• Swearing

Page 13: Terminology

Page 14: Caveats

• There are no standardized definitions for terms in parallel computing; thus, resources outside of this presentation may use the terms in a different way
• Many algorithms or software applications that you will find in the real world do not fall neatly into the categories mentioned in this presentation, since they blend approaches in ways that are difficult to categorize.

Page 15: Basic Terminology

• Parallel Computing – solving a problem in which multiple tasks cooperate closely and make simultaneous use of multiple processors
• Embarrassingly Parallel – solving many similar, independent tasks; parameterization
• Multiprocessor – multiple processors (cores) on a single chip
• Symmetric Multiprocessing (SMP) – multiple processors sharing a single address space, OS instance, storage, etc. All processors are treated equally by the OS
• UMA – Uniform Memory Access; all processors share the memory space uniformly
• NUMA – Non-Uniform Memory Access; memory access time depends on the location of the memory relative to the processor

Page 16: More Basic Terminology

• Cluster Computing – a parallel system consisting of a combination of commodity units (processors or SMPs)
• Capacity Cluster – a cluster designed to run a variety of types of problems regardless of size; the goal is to perform as much research as possible
• Capability Cluster – the fastest, largest machines, designed to solve large problems; the goal is to demonstrate the capability of the parallel system
• High Performance Computing – solving large problems via a cluster of computers that consists of fast networks and massive storage

Page 17: Even More Basic Terminology

• Instruction-Level Parallelism (ILP) – improves processor performance by having multiple processor components simultaneously executing instructions
• Pipelining – breaking a task into steps performed by different processors with inputs streaming through, much like an assembly line
• Superscalar Execution – the ability for a processor to execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor
• Cache Coherence – synchronization between the local caches for each processor

Page 18: Crash Simulation Example

• Simplified model based on a crash simulation for the Ford Motor Company
• Illustrates various aspects common to many simulations and applications
• This example was provided by Q. Stout and C. Jablonowski of the University of Michigan

Page 19: Finite Element Representation

• Car is modeled by a triangulated surface (elements)
• The simulation models the movement of the elements, incorporating the forces on the elements to determine their new position
• In each time step, the movement of each element depends on its interaction with the other elements to which it is physically adjacent
• In a crash, elements may end up touching that were not touching initially
• The state of an element is its location, velocity, and information such as whether it is metal that is bending

Page 20: Car

Page 21: Serial Crash Simulation

1. For all elements
2.   Read State(element), Properties(element), Neighbor_list(element)
3. For time = 1 to end_of_simulation
4.   For element = 1 to num_elements
5.     Compute State(element) for the next time step, based on the previous state of the element and its neighbors and the properties of the element

Page 22: Simple Approach to Parallelization (Concepts)

• Distributed Memory – parallel system based on processors linked with a fast network; processors communicate via messages
• Owner Computes – distribute elements to processors; each processor updates the elements which it contains
• Single Program Multiple Data (SPMD) – all machines run the same program on independent data; dominant form of parallel computing

Page 23: Distributed Car

Page 24: Basic Parallel Version

Concurrently for all processors P:
1. For all elements assigned to P
2.   Read State(element), Properties(element), Neighbor_list(element)
3. For time = 1 to end_of_simulation
4.   For element = 1 to num_elements_in_P
5.     Compute State(element) for the next time step, based on the previous state of the element and its neighbors, and on the properties of the element

Page 25: Notes

• Most of the code is the same as, or similar to, serial code.
• The high-level structure remains the same: a sequence of steps
  • The sequence is a serial construct, but
  • now the steps are performed in parallel, but
  • calculations for individual elements are serial
• Question:
  • How does each processor keep track of adjacency info for neighbors in other processors?

Page 26: Distributed Car (ghost cells)

Page 27: Parallel Architectures

Page 28: Flynn's Taxonomies

• Hardware classifications
• {S,M}I{S,M}D
• Single Instruction (SI) – system in which all processors execute the same instruction
• Multiple Instruction (MI) – system in which different processors may execute different instructions
• Single Data (SD) – system in which all processors operate on the same data
• Multiple Data (MD) – system in which different processors may operate on different data
• M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948-960, 1972.

Page 29: Flynn's Taxonomies

• SISD – classic von Neumann architecture; serial computer
• MIMD – collections of autonomous processors that can execute multiple independent programs, each of which can have its own data stream
• SIMD – data is divided among the processors and each data item is subjected to the same sequence of instructions; GPUs, Advanced Vector Extensions (AVX)
• MISD – very rare; systolic arrays; smartphones carried by Chupacabras

Page 30: CRAY-1 Vector Machine (1976)

Page 31: Vector Machines Today

Page 32: Software Taxonomies

• Data Parallel (SIMD)
  • Parallelism that is a result of identical operations being applied concurrently on different data items; e.g., many matrix algorithms
  • Difficult to apply to complex problems
• Single Program, Multiple Data (SPMD)
  • A single application is run across multiple processes/threads on a MIMD architecture
  • Most processes execute the same code but do not work in lock-step
  • Dominant form of parallel programming

Page 33: SISD vs. SIMD

Page 34: MIMD Architectures (Shared Memory)

(Diagrams: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA))

Page 35: More MIMD Architectures

(Diagrams: Distributed Memory and Hybrid Memory)

Page 36: Shared Memory (SM)

• Attributes:
  • Global memory space
  • Each processor will utilize its own cache for a portion of global memory; consistency of the cache is maintained by hardware
• Advantages:
  • User-friendly programming techniques (OpenMP and OpenACC)
  • Low latency for data sharing between tasks
• Disadvantages:
  • The global memory space-to-CPU path may be a bottleneck
  • Non-Uniform Memory Access
  • The programmer is responsible for synchronization

Page 37: Distributed Memory (DM)

• Attributes:
  • Memory is shared amongst processors through message passing
• Advantages:
  • Memory scales based on the number of processors
  • Access to a processor's own memory is fast
  • Cost effective
• Disadvantages:
  • Error prone; programmers are responsible for the details of the communication
  • Complex data structures may be difficult to distribute

Page 38: Hardware/Software Models

• Software and hardware models do not need to match
• DM software on SM hardware:
  • Message Passing Interface (MPI) – designed for DM hardware but available on SM systems
• SM software on DM hardware:
  • Remote Memory Access (RMA), included within MPI-3
  • Partitioned Global Address Space (PGAS) languages
    • Unified Parallel C (extension to ISO C99)
    • Coarray Fortran (Fortran 2008)

Page 39: Difficulties

• Serialization causes bottlenecks
• Workload is not distributed
• Debugging is hard
• The serial approach doesn't parallelize

Page 40: Parallel Performance

Page 41: Definitions

• SerialTime(n) – time for a serial program to solve a problem with an input of size n
• ParallelTime(n,p) – time for a parallel program to solve the same problem with an input of size n using p processors

• SerialTime(n) ≤ ParallelTime(n,1)

• Speedup(n,p) = SerialTime(n) / ParallelTime(n,p)
• Work(n,p) = p * ParallelTime(n,p)
• Efficiency(n,p) = SerialTime(n) / [p * ParallelTime(n,p)]
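As a concrete illustration of these definitions (the numbers are hypothetical, not measurements from this deck), suppose a serial run takes 100 seconds and the parallel version takes 16 seconds on 8 processors:

Speedup(n,8) = 100 s / 16 s = 6.25
Work(n,8) = 8 * 16 s = 128 s
Efficiency(n,8) = 100 s / (8 * 16 s) ≈ 0.78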

Page 42: Definitions

• Expectations:
  • 0 < Speedup ≤ p
  • Serial Work ≤ Parallel Work < ∞
  • 0 < Efficiency ≤ 1
• Linear Speedup: Speedup = p, given some restriction on the relationship between n and p

Page 43: In The Real World

Page 44: Superlinear Speedup

• Superlinear Speedup: Speedup > p, thus Efficiency > 1
• Why?
  • The parallel computer has p times as much RAM, so a higher fraction of program memory is in RAM instead of on disk.
  • The parallel program used a different algorithm which was not possible with the serial program.

Page 45: Amdahl's Law

• Let f be the fraction of time spent on serial operations
• Let ParallelOps = 1 - f and assume linear speedup
• For p processors:
  • ParallelTime(p) ≥ SerialTime * [f + (ParallelOps / p)]
  • Speedup(p) ≤ 1 / (f + ParallelOps / p)

• Thus, Speedup(p) ≤ 1/f, no matter the number of processors used
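A worked instance of the bound (my own numbers, for illustration): if 5% of the run time is serial, then f = 0.05 and Speedup(p) ≤ 1 / (0.05 + 0.95/p). With p = 64 this gives Speedup ≤ 1 / (0.05 + 0.0148) ≈ 15.4, and even as p grows without bound the speedup can never exceed 1/0.05 = 20.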

Page 46: Maximum Possible Performance

Page 47: Pessimistic Interpretation

Page 48: Optimistic Interpretation

• We should be more optimistic than Amdahl's Law because:
  • Algorithms with much smaller values of f
  • More time spent in RAM than disk
  • Time spent in f is a decreasing fraction of the total time as problem size increases

Page 49: Common Program Structure

Page 50: Scaling Strategy

• Increase n as p increases
  • Amdahl considered strong scaling, where n is fixed
• Utilize weak scaling
  • The amount of data per processor is fixed
  • Efficiency can remain high if communication does not increase excessively

Page 51: Embarrassingly Parallel Jobs

Page 52: Parameterization

• Requirements
  • Independent tasks
  • Common applications
  • One job submission per task
• Techniques
  • Parameter sweeps (modify parameters mathematically for each task)
  • List-driven analysis (modify parameters based on a list of values)

Page 53: Example: List-Driven Analysis

• Problem: Calculate the theoretical performance for the CRI HPC clusters (historically)
• Rpeak = Nodes * CoresPerNode * ClockSpeed * InstructPerCycle

Page 54: List

Cluster     Nodes  CoresPerNode  ClockSpeed  InstructPerCycle
ibicluster  99     8             2.66e09     4
tarbell     40     64            2.20e09     8
gardner     126    28            2.00e09     16
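As a worked instance of the Rpeak formula using the gardner row above (the arithmetic is mine, not from the slides):

Rpeak(gardner) = 126 * 28 * 2.00e09 * 16 = 1.129e14 FLOPS ≈ 112.9 TFLOPS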

Page 55: Base

#PBS -N Rpeak.@Cluster
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb
#PBS -l walltime=20:00

./rpeak -n @Nodes -c @CoresPerNode \
        -s @ClockSpeed -i @InstructPerCycle

Page 56: Shared Memory Parallelism

Page 57: Shared Memory

• All processors can directly access all the memory in the system (though access times may be different due to NUMA)
• Software portability – applications can be developed on serial machines and run on parallel machines without any changes
• Latency hiding – ability to mask the access latency for memory access, I/O, and communications
• Scheduling – supports system-level dynamic mapping of tasks to processors
• Ease of programming – significantly easier to program (in my opinion) using a shared memory model over a distributed memory model

Page 58: Parallelization Techniques: pthreads

• POSIX threads
• Standard threading implementation for UNIX-like operating systems (Linux, Mac OS X)
• Library that can be linked with C programs
• Some compilers include a Fortran version of pthreads
• Knowledge of pthreads can easily be transferred to programming other widely used threading specifications (e.g., Java threads)

Page 59: Matrix-Vector Multiplication (pthreads)

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Global variable: accessible to all threads */
int thread_count;

void *Pth_mat_vect(void *rank);   /* Thread function */

int main(int argc, char *argv[]) {
    long thread;                  /* Use long in case of a 64-bit system */
    pthread_t *thread_handles;

    /* Get the number of threads from the command line */
    thread_count = strtol(argv[1], NULL, 10);
    thread_handles = malloc(thread_count * sizeof(pthread_t));

Page 60: Matrix-Vector Multiplication (pthreads)

    /* Create a thread for each rank */
    for (thread = 0; thread < thread_count; thread++)
        pthread_create(&thread_handles[thread], NULL,
                       Pth_mat_vect, (void *) thread);

    /* Join the results from each thread */
    for (thread = 0; thread < thread_count; thread++)
        pthread_join(thread_handles[thread], NULL);

    free(thread_handles);
    return 0;
}   /* main */

Page 61: Matrix-Vector Multiplication (pthreads)

/* m, n, A, x, and y are assumed to be globals defined elsewhere */
void *Pth_mat_vect(void *rank) {
    long my_rank = (long) rank;
    int i, j;
    int local_m = m / thread_count;
    int my_first_row = my_rank * local_m;
    int my_last_row = (my_rank + 1) * local_m - 1;

    for (i = my_first_row; i <= my_last_row; i++) {
        y[i] = 0.0;
        for (j = 0; j < n; j++)
            y[i] += A[i][j] * x[j];
    }
    return NULL;
}   /* Pth_mat_vect */

Page 62: Parallelization Techniques: OpenMP

• OpenMP is sort of an HPC standard for shared memory programming
• OpenMP version 4.5 was released in 2015 and includes accelerator support as an advanced feature
• API for thread-based parallelism
• Explicit programming model; the compiler interprets directives
• Based on a combination of compiler directives, library routines, and environment variables
• Uses the fork-join model of parallel execution
• Available in most Fortran and C compilers

Page 63: OpenMP Goals

• Standardization: standard among all shared memory architectures and hardware platforms
• Lean: simple and limited set of compiler directives for shared memory programming; often significant performance gains using just 4-6 directives in complex applications
• Ease of use: supports incremental parallelization of a serial program, not an all-or-nothing approach
• Portability: supports Fortran, C, and C++

Page 64: OpenMP Building Blocks

• Compiler directives (embedded in code)
  • Parallel regions (PARALLEL)
  • Parallel loops (PARALLEL DO)
  • Parallel workshare (PARALLEL WORKSHARE)
  • Parallel sections (PARALLEL SECTIONS)
  • Parallel tasks (PARALLEL TASK)
  • Serial sections (SINGLE)
  • Synchronization (BARRIER, CRITICAL, ATOMIC, ...)
  • Data structures (PRIVATE, SHARED, REDUCTION)
• Run-time library routines (called in code)
  • OMP_SET_NUM_THREADS
  • OMP_GET_NUM_THREADS
• UNIX environment variables (set before program execution)
  • OMP_NUM_THREADS
• A minimal C sketch combining all three kinds of building blocks appears below
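A minimal sketch, assuming a C compiler with OpenMP support (the array and loop are illustrative, not from the slides):

#include <omp.h>
#include <stdio.h>

int main(void) {
    double a[1000];

    /* Run-time library routine; setting OMP_NUM_THREADS=4 in the shell */
    /* environment before execution would have the same effect          */
    omp_set_num_threads(4);

    /* Compiler directive: a parallel region containing a parallel loop */
    #pragma omp parallel for shared(a)
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;

    printf("a[999] = %f\n", a[999]);
    return 0;
}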

Page 65: Fork-Join Model

• Parallel execution is achieved by generating threads which are executed in parallel
• The master thread executes in serial until the first parallel region is encountered
• Fork – the master thread creates a team of threads which are executed in parallel
• Join – when the team members complete the work, they synchronize and terminate; the master thread continues sequentially

Page 66: Fork-Join Model

Page 67: Work-Sharing Constructs

Page 68: Barriers

• Barriers may be needed for correctness
• Synchronization degrades performance

Page 69: OpenMP Notation

• OpenMP recognizes the following compiler directives, which start with:
  • !$OMP (in Fortran)
  • #pragma omp (in C/C++)

Page 70: Parallel Loops

• Each thread executes part of the loop
• The number of iterations is statically assigned to the threads upon entry to the loop
• The number of iterations cannot be changed during execution
• Implicit BARRIER at the end of the loop
• High efficiency

Page 71: Parallel Loop Example (Fortran)

!$OMP PARALLEL
!$OMP DO
DO i = 1, n
   a(i) = b(i) + c(i)
ENDDO
!$OMP END DO
!$OMP END PARALLEL

Page 72: Parallel Sections

• Multiple independent sections can be executed concurrently by separate threads
• Each parallel section is assigned to a specific thread, which executes the work from start to finish
• Implicit BARRIER at the end of each SECTION
• Nested parallel sections are possible, but impractical

Page 73: Parallel Sections Example (Fortran)

!$OMP PARALLEL SECTIONS
!$OMP SECTION
DO i = 1, n
   a(i) = b(i) + c(i)
ENDDO
!$OMP SECTION
DO i = 1, k
   d(i) = e(i) + f(i)
ENDDO
!$OMP END PARALLEL SECTIONS

Page 74: Parallel Tasks and Workshares

• Parallel Tasks
  • Unstructured parallelism
  • Irregular problems like:
    • Recursive algorithms
    • Unbounded loops
• Parallel Workshares
  • Fortran only
  • Used to parallelize assignments

(A small C task sketch appears below.)
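A minimal, hypothetical sketch of OpenMP tasks in C (not from the slides): one thread generates a task per work item, and any idle thread in the team may execute it.

#include <omp.h>
#include <stdio.h>

/* Stand-in for an irregular unit of work */
void process(int i) {
    printf("item %d handled by thread %d\n", i, omp_get_thread_num());
}

int main(void) {
    #pragma omp parallel      /* fork a team of threads            */
    #pragma omp single        /* a single thread generates tasks   */
    for (int i = 0; i < 8; i++) {
        #pragma omp task firstprivate(i)
        process(i);           /* tasks are picked up by idle threads */
    }                         /* implicit barrier: all tasks finish  */
    return 0;
}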

Page 75: Matrix-Vector Multiplication: OpenMP C

#include <stdio.h>
#include <stdlib.h>

void Omp_mat_vect(int m, int n, double *restrict a,
                  double *restrict b, double *restrict c) {
    int i, j;

    #pragma omp parallel for default(none) \
            shared(m, n, a, b, c) private(i, j)
    for (i = 0; i < m; i++) {
        a[i] = 0.0;
        for (j = 0; j < n; j++)
            a[i] += b[i*n + j] * c[j];
    }   /* End of omp parallel for */
}

Page 76: Matrix-Vector Multiplication: OpenMP Fortran

subroutine Pth_mat_vect(m, n, a, b, c)
   implicit none
   integer(kind=4) :: m, n
   real(kind=8) :: a(1:m), b(1:m,1:n), c(1:n)
   integer :: i, j

!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP SHARED(m, n, a, b, c) PRIVATE(i, j)
   do i = 1, m
      a(i) = 0.0
      do j = 1, n
         a(i) = a(i) + b(i,j) * c(j)
      enddo
   enddo
!$OMP END PARALLEL DO

   return
end subroutine Pth_mat_vect

Page 77: OpenMP Errors

• Variable scoping: which variables are shared and which are private
• Sequential I/O in a parallel region can lead to an unpredictable order
• False sharing: two or more processors access different variables in the same cache line (performance)
• Race conditions
• Deadlock

Page 78: Summation Example

What is wrong with the following code?

sum = 0.0
!$OMP PARALLEL DO
DO i = 1, n
   sum = sum + a(i)
ENDDO
!$OMP END PARALLEL DO

Page 79: Summation Example Correction

sum = 0.0
!$OMP PARALLEL DO REDUCTION(+:sum)
DO i = 1, n
   sum = sum + a(i)
ENDDO
!$OMP END PARALLEL DO

Page 80: Distributed Memory Parallelism

Page 81: Message Passing

• Communication on distributed memory systems is a significant aspect of performance and correctness
• Messages are relatively slow, with startup times (latency) taking thousands of cycles
• Once message passing has started, the additional time per byte (bandwidth) is relatively small

Page 82: Performance on Gardner

• Intel Xeon E5-2683 processor (Haswell)
• Processor speed: 2,000 cycles per microsecond (µsec)
• 16 FLOPs/cycle: 32,000 FLOPs per µsec
• MPI message latency = ~2.5 µsec, or 80,000 FLOPs
• MPI message bandwidth = ~7,000 bytes/µsec = 4.57 FLOPs/byte

Page 83: Reducing Latency

• Reduce the number of messages by mapping communicating entities onto the same processor
• Combine messages having the same sender and destination
• If processor A has data needed by processor B, have A send it to B, rather than waiting for B to request it. Processor A should send as soon as the data is ready; processor B should read it as late as possible to increase the probability that the data has arrived

Page 84: Other Issues With Message Passing

• Network congestion
• Deadlock
  • Blocking: a processor cannot proceed until the message is finished
  • With blocking communication, you may reach a point where no processor can proceed
  • Non-blocking communication: the easiest way to prevent deadlock; processors can send and proceed before the receive is finished

Page 85: Message Passing Interface (MPI)

• International standard
• MPI 1.0 released in 1994
• Current version is 3.1 (June 2015)
• MPI 4.0 standard is in the works
• Available on all parallel systems
• Interfaces in C/C++ and Fortran
• Bindings for MATLAB, Python, R, and Java
• Works on both distributed memory and shared memory hardware
• Hundreds of functions, but you will need ~6-10 to get started

Page 86: MPI Basics

• Two common MPI functions:
  • MPI_SEND() – to send a message
  • MPI_RECV() – to receive a message
• They function like write and read statements
• Both are blocking operations
• However, a system buffer is used that allows small messages to be non-blocking, while large messages will be blocking
• The system buffer is based on the MPI implementation, not the standard
• Blocking communication may lead to deadlocks

Page 87: Small Messages

Page 88: Large Messages

Page 89: Deadlocks

(Table columns: Source, Avoidance)

Page 90: Non-Blocking Communication

• Non-blocking operations
  • MPI_ISEND()
  • MPI_IRECV()
  • MPI_WAIT()
• The user can check for data at a later stage in the program without waiting
  • MPI_TEST()
• Non-blocking operations will perform better than blocking operations
• Possible to overlap communication with computation (a minimal C sketch appears below)
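A minimal sketch of the non-blocking pattern, assuming exactly two ranks (e.g., mpirun -np 2) exchanging one double; the values and tag are illustrative, not from the slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    double mine = 3.14, theirs = 0.0;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int partner = (rank == 0) ? 1 : 0;

    /* Post the receive without blocking, then send */
    MPI_Irecv(&theirs, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req);
    MPI_Send(&mine, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);

    /* ... useful computation could overlap with the communication here ... */

    MPI_Wait(&req, &status);   /* block only when the data is actually needed */
    printf("rank %d received %f\n", rank, theirs);

    MPI_Finalize();
    return 0;
}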

Page 91: MPI Program Template

#include "mpi.h"

/* Initialize MPI */
MPI_Init(&argc, &argv);

/* Allow each processor to determine its role */
int num_processor, my_rank;
MPI_Comm_size(MPI_COMM_WORLD, &num_processor);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

/* Send and receive some stuff here */

/* Finalize */
MPI_Finalize();

Page 92: MPI Summation

Page 93: MPI Summation

/* Send and receive some stuff here */
if (my_rank == 0) {
    sum = 0.0;
    for (source = 1; source < num_procs; source++) {
        MPI_Recv(&value, 1, MPI_FLOAT, source, tag,
                 MPI_COMM_WORLD, &status);
        sum += value;
    }
} else {
    MPI_Send(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD);
}

Page 94: A More Efficient Receive

• In the previous example, rank 0 received the messages in order based on rank.
• Thus, if processor 2 was delayed, then processor 0 was also delayed
• Make the following modification:

MPI_Recv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag, MPI_COMM_WORLD, &status);

• Now processor 0 can process messages as soon as they arrive

Page 95: MPI Reduction

• Like OpenMP, MPI also has reduction operations that can be used for tasks such as summation
• Reduction is a type of collective communication

MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

• Reduction operations:
  • MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD
  • MPI_LAND, MPI_LOR

(A small, complete example appears below.)
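A minimal, self-contained sketch (my own illustration, not from the slides) that replaces the manual send/receive summation with MPI_Reduce; each rank contributes one value and rank 0 receives the sum:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int my_rank, num_procs;
    float value, sum = 0.0f;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

    value = (float) my_rank;   /* each rank contributes something */

    /* Combine every rank's value with MPI_SUM; rank 0 gets the result */
    MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("sum over %d ranks = %f\n", num_procs, sum);

    MPI_Finalize();
    return 0;
}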

Page 96: Collective Communication

• The most frequently invoked MPI routines after send and receive
• Improve clarity, run faster than send and receive, and reduce the likelihood of making an error
• Examples
  • Broadcast (MPI_Bcast)
  • Scatter (MPI_Scatter)
  • Gather (MPI_Gather)
  • All Gather (MPI_Allgather)
• A short C sketch of broadcast and scatter appears below
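A hedged sketch of two of these collectives in C; the array size and values are made up for illustration, and it assumes the number of elements divides evenly by the number of ranks:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, nprocs, n = 8;
    int params[2] = {0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) { params[0] = n; params[1] = 42; }

    /* Broadcast: every rank gets a copy of params from rank 0 */
    MPI_Bcast(params, 2, MPI_INT, 0, MPI_COMM_WORLD);

    /* Scatter: rank 0 deals out equal chunks of data, one chunk per rank */
    int chunk = n / nprocs;
    double *data = NULL;
    double *mine = malloc(chunk * sizeof(double));
    if (rank == 0) {
        data = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) data[i] = (double) i;
    }
    MPI_Scatter(data, chunk, MPI_DOUBLE, mine, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    printf("rank %d: first local element = %f\n", rank, mine[0]);

    free(mine);
    free(data);
    MPI_Finalize();
    return 0;
}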

Page 97: Broadcast

Page 98: Scatter

Page 99: Gather

Page 100: All Gather

Page 101: MPI Data Types

• Primitive data types (correspond to those found in the underlying language)
• Fortran
  • MPI_CHARACTER
  • MPI_INTEGER
  • MPI_REAL, MPI_DOUBLE_PRECISION
  • MPI_LOGICAL
• C/C++ (include many more variants than Fortran)
  • MPI_CHAR
  • MPI_INT, MPI_UNSIGNED, MPI_LONG
  • MPI_FLOAT, MPI_DOUBLE
  • MPI_C_BOOL

Page 102: MPI Data Types

• Derived data types are also possible
  • Vector – data separated by a constant stride
  • Contiguous – data separated by a stride of 1
  • Struct – mixed types
  • Indexed – array of indices
• A small vector-type sketch appears below
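As a hedged illustration of a derived type (my example, not from the slides): build a vector type that picks one column out of a row-major 4x4 matrix of doubles, so the column can travel as a single message.

#include <mpi.h>

int main(int argc, char *argv[]) {
    double matrix[4][4];          /* row-major storage */
    MPI_Datatype column;

    MPI_Init(&argc, &argv);

    /* 4 blocks of 1 double each, separated by a stride of 4 doubles */
    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* One column can now be sent as a single unit, e.g.:              */
    /* MPI_Send(&matrix[0][1], 1, column, dest, tag, MPI_COMM_WORLD);  */
    /* (dest and tag are placeholders)                                 */

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}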

Page 103: MPI Vector

Page 104: MPI Contiguous

Page 105: MPI Struct

Page 106: MPI Indexed

Page 107: MPI Synchronization

• Implicit synchronization
  • Blocking communication
  • Collective communication
• Explicit synchronization
  • MPI_Wait
  • MPI_Waitany
  • MPI_Barrier
• Remember, synchronization can hinder performance

Page 108: Additional MPI Features

• Virtual topologies
• User-created communicators
• User-specified derived data types
• Parallel I/O
• One-sided communication (MPI_Put, MPI_Get)
• Non-blocking collective communication
• Remote Memory Access

Page 109: Vectorization

Page 110: Accelerators

• Principle: split an operation into independent parts and execute the parts concurrently in an accelerator (SIMD)

DO I = 1, 1000
   A(I) = B(I) + C(I)
ENDDO

• Accelerators like Graphics Processing Units are specialized for SIMD operations

Page 111: Popularity

• In 2012, 13% of the performance share of the Top 500 supercomputers was provided through accelerators
• Today, the performance share for accelerators is ~38%
• Trend: build diverse computing platforms with multiprocessors, SMP nodes, and accelerators
• Caveat: difficult to use heterogeneous hardware effectively

Page 112: Modern Accelerators

NVidia Volta
• 7.5 TFLOPs double precision
• 30 TFLOPs half precision
• 5120 cores
• 16 GB RAM
• 900 GB/s memory bandwidth

Intel Xeon Phi Knight's Landing
• 3 TFLOPs double precision
• 72 cores (288 threads)
• 16 GB RAM
• 384 GB/s memory bandwidth
• Includes both a scalar and a vector unit

Page 113: NVidia Architecture

• SIMT: Single Instruction, Multiple Thread (similar to SIMD)
• Thread blocks are executed by a warp of 32 cores
• Performance is dependent on memory alignment and eliminating shared memory access

Page 114: Programming GPUs

• Low-level programming
  • Proprietary – NVidia's CUDA
  • Portable – OpenCL
• High-level programming
  • OpenACC
  • OpenMP 4.5
  • Allows for a single code base that may be compiled on both multicore and GPUs
  • (a small OpenMP 4.5 offload sketch appears below)
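A hedged sketch of the high-level, directive-based style using OpenMP 4.5 target offload (my illustration; the slides themselves show OpenACC next). With a compiler that supports offloading, the loop can run on a GPU; otherwise it simply runs on the host:

#include <stdio.h>

int main(void) {
    const int n = 1000;
    float a[1000], b[1000], c[1000];

    for (int i = 0; i < n; i++) { b[i] = (float) i; c[i] = 2.0f * i; }

    /* Offload the loop to an accelerator if one is available;    */
    /* the map() clauses move data between host and device memory */
    #pragma omp target teams distribute parallel for \
            map(to: b[0:n], c[0:n]) map(from: a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    printf("a[999] = %f\n", a[999]);
    return 0;
}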

Page 115: CUDA Matrix Multiplication

/* n and threads_per_block are assumed to be defined elsewhere */
int main(void) {
    int *a, *b, *c;                 /* CPU copies */
    int *dev_a, *dev_b, *dev_c;     /* GPU copies */

    /* Allocate memory on the GPU */
    int arraysize = n * sizeof(int);
    cudaMalloc((void **) &dev_a, arraysize);
    cudaMalloc((void **) &dev_b, arraysize);
    cudaMalloc((void **) &dev_c, sizeof(int));

    /* use malloc for a, b, c; initialize a, b */

    /* Move a and b to the GPU */
    cudaMemcpy(dev_a, a, arraysize, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, arraysize, cudaMemcpyHostToDevice);

    /* Launch the GPU kernel */
    dot<<<n / threads_per_block, threads_per_block>>>(dev_a, dev_b, dev_c);

    /* Copy the result back to the CPU */
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    return 0;
}

Page 116: CUDA Matrix Multiplication

/* threads_per_block is assumed to be a compile-time constant */
__global__ void dot(int *a, int *b, int *c) {
    __shared__ int temp[threads_per_block];

    /* Convert the thread/block index to a global index */
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    temp[threadIdx.x] = a[index] * b[index];

    /* Avoid a race condition on the per-block sum */
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < threads_per_block; i++)
            sum += temp[i];
        /* add to the global sum */
        atomicAdd(c, sum);
    }
}

Page 117: OpenACC and OpenMP

• Share the same model of computation
• Host-centric – the host device offloads code regions and data to accelerators
• Device – has independent memory and multiple threads
• Mapping clause – defines the relationship between memory on the host device and memory on the accelerator

Page 118: OpenACC Matrix Multiplication

int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    float *a = (float *) malloc(sizeof(float) * n * n);
    float *b = (float *) malloc(sizeof(float) * n * n);
    float *c = (float *) malloc(sizeof(float) * n * n);

    #pragma acc data copyin(a[0:n*n], b[0:n*n]), copy(c[0:n*n])
    {
        int i, j, k;

        #pragma acc kernels
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                a[i*n + j] = (float) i + j;
                b[i*n + j] = (float) i - j;
                c[i*n + j] = 0.0;
            }
        }

Page 119: OpenACC Matrix Multiplication

        #pragma acc kernels
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                for (k = 0; k < n; k++) {
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
                }
            }
        }
    }
}

Page 120: Speedup?

• Some problems can achieve a speedup of 10x-100x by offloading some of the work to a GPU (e.g., Linpack, matrix multiplication)
• In reality:
  • Only a small fraction of applications can utilize a GPU effectively
  • The average speedup of those applications is ~2x-4x
  • Optimizing your code using multicore technologies is probably a better use of your time

Page 121: Problem Areas

• Branching can cut your performance by 50%, since instructions need to be issued for both branches
• Memory bandwidth on GPUs is poor
  • Need to make sure your operands are used more than once
  • For the NVidia Volta, all operands need to be used 67x to achieve peak FLOPs

Page 122: Accelerator Trends

• Intel has canceled Knight's Landing and Knight's Hill; the future status of the Xeon Phi is unknown
• GPUs are becoming less rigidly SIMD and are including better memory bandwidth
• More programmers are moving from low-level programming like CUDA to high-level directives
• The PGI compiler was purchased by NVidia in 2013; support for OpenCL was cut
• OpenACC and OpenMP still have not merged
• The efficiency of GPUs is still problematic and the programming model is a moving target
• New motivation: deep learning

Page 123: Hybrid Computing

Page 124: Current Trend

• Accelerator-based machines are grabbing the high rankings on the Top 500, but clusters with standard cores are still most important economically
• Economics dictates a distributed memory system of shared memory boards with multicore commodity chips, perhaps with some form of accelerator distributed sparingly

Page 125: MPI+X

Page 126: Conclusion

Page 127: Topics for Possible Future Workshops

• Scientific Computing (C++ or Fortran or both)
• Parallel Computing Theory (e.g., load balancing, measuring performance)
• Embarrassingly Parallel
• pthreads
• OpenMP
• MPI
• OpenACC
• CUDA
• Parallel Debugging/Profiling