Introduction to Parallel Computing (cri.uchicago.edu/wp-content/uploads/2018/09/intro...)
TRANSCRIPT
Introduction to Parallel Computing
September 2018
Mike Jarsulic
Introduction
So Far...
Year   Users  Jobs   CPU Hours  Support Tickets
2012   36     *      1.2M       *
2013   71     *      1.9M       141
2014   84     739K   5.3M       391
2015   125    1.5M   5.2M       372
2016   153    1.6M   6.1M       363
2017   256    4.8M   14.7M      759
2018   296    3.0M   8.8M       631
TOTAL  *      11.7M  40.2M      2657
However...
• The BSD research community does not understand how to utilize resources properly
• Example:
  • August 1st, 2018
  • 1021 cores in use
  • 39 jobs had requested a total of 358 cores
  • All 39 jobs were single-core jobs
  • Thus, 319 cores were dedicated, but idle
• My takeaways:
  • The BSD research community does not really understand parallel computing!
  • I have not done a good job of training the BSD research community in this area!
Office Hours
• Held every Tuesday and Thursday at Peck Pavilion N161
• Users come in to work one-on-one to better utilize our HPC resources
• Three common questions on parallel computing:
  1. Is this a worthwhile skill to learn? Aren't single-processor systems fast enough and always improving?
  2. Why can't processor manufacturers just make faster processors?
  3. Will you write me a program that will automatically convert my serial application into a parallel application?
Obligatory Cost per Genome Slide
Motivation
Goals (of this presentation)
• Understand the basic terminology found in parallel computing
• Recognize the different types of parallel architectures
• Learn how parallel performance is measured
• Gain a high-level understanding of the different paradigms in parallel programming and what libraries are available to implement those paradigms
• Please note that you will not be a parallel programming rock star after this three-hour introduction
Overall Goals
• Hopefully, there will be enough interest after this presentation to support a workshop series on parallel computing
• Possible topics for workshops are:
  • Scientific Programming
  • Parameterization
  • OpenMP
  • MPI
  • CUDA
  • OpenACC
Today's Outline
• Terminology
• Parallel Architectures
• Measuring Parallel Performance
• Embarrassingly Parallel Workflows
• Shared Memory
• Distributed Memory
• Vectorization
• Hybrid Computing
A Note on Sources
• Presentation given at SC17 by Stout and Jablonowski from the University of Michigan
• Introduction to Parallel Computing, 2nd Edition by Grama, Gupta, et al.
• Introduction to Parallel Programming by Pacheco
• Using OpenMP by Chapman, Jost, and Van Der Pas
• Using MPI by Gropp, Lusk, and Skjellum
• Advanced MPI by Gropp, Hoefler, et al.
• CUDA for Engineers by Storti and Yurtoglu
• CUDA by Example by Sanders and Kandrot
• Parallel Programming with OpenACC by Farber
• The LLNL OpenMP and MPI tutorials
A Note on Format
• Today's seminar will be recorded for an ITM/CTSA grant
• Questions
• Whiteboard
• Swearing
Terminology
Caveats
• There are no standardized definitions for terms in parallel computing, thus resources outside of this presentation may use the terms in a different way
• Many algorithms or software applications that you will find in the real world do not fall neatly into the categories mentioned in this presentation, since they blend approaches in ways that are difficult to categorize.
Basic Terminology
• Parallel Computing – solving a problem in which multiple tasks cooperate closely and make simultaneous use of multiple processors
• Embarrassingly Parallel – solving many similar, independent tasks; parameterization
• Multiprocessor – multiple processors (cores) on a single chip
• Symmetric Multiprocessing (SMP) – multiple processors sharing a single address space, OS instance, storage, etc. All processors are treated equally by the OS
• UMA – Uniform Memory Access; all processors share the memory space uniformly
• NUMA – Non-Uniform Memory Access; memory access time depends on the location of the memory relative to the processor
More Basic Terminology
• Cluster Computing – a parallel system consisting of a combination of commodity units (processors or SMPs)
• Capacity Cluster – a cluster designed to run a variety of types of problems, regardless of size; the goal is to perform as much research as possible
• Capability Cluster – the fastest, largest machines, designed to solve large problems; the goal is to demonstrate the capability of the parallel system
• High Performance Computing – solving large problems via a cluster of computers that consists of fast networks and massive storage
Even More Basic Terminology
• Instruction-Level Parallelism (ILP) – improves processor performance by having multiple processor components simultaneously executing instructions
• Pipelining – breaking a task into steps performed by different processors, with inputs streaming through, much like an assembly line
• Superscalar Execution – the ability of a processor to execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor
• Cache Coherence – synchronization between the local caches for each processor
Crash Simulation Example
• Simplified model based on a crash simulation for the Ford Motor Company
• Illustrates various aspects common to many simulations and applications
• This example was provided by Q. Stout and C. Jablonowski of the University of Michigan
Finite Element Representation
• Car is modeled by a triangulated surface (elements)
• The simulation models the movement of the elements, incorporating the forces on the elements to determine their new position
• In each time step, the movement of each element depends on its interaction with the other elements to which it is physically adjacent
• In a crash, elements may end up touching that were not touching initially
• The state of an element is its location, velocity, and information such as whether it is metal that is bending
Car
Serial Crash Simulation
1. For all elements
2.   Read State(element), Properties(element), Neighbor_list(element)
3. For time = 1 to end_of_simulation
4.   For element = 1 to num_elements
5.     Compute State(element) for next time step, based on the previous state of the element and its neighbors, and the properties of the element
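The serial loop above can be sketched in C. The State and Properties structs, the averaging rule in compute_state, and the array sizes below are hypothetical stand-ins for the real crash-simulation data; only the loop structure follows the pseudocode (a second buffer keeps the update based on the *previous* state of the neighbors, as step 5 requires).

```c
#include <stddef.h>

#define NUM_ELEMENTS      4
#define END_OF_SIMULATION 10
#define MAX_NEIGHBORS     3

/* Hypothetical per-element data; the real simulation stores
 * location, velocity, and material information. */
typedef struct { double value; } State;
typedef struct { double stiffness; } Properties;

static State      state[NUM_ELEMENTS];
static State      next_state[NUM_ELEMENTS];
static Properties props[NUM_ELEMENTS];
static int        neighbor_list[NUM_ELEMENTS][MAX_NEIGHBORS];
static int        num_neighbors[NUM_ELEMENTS];

/* Step 5: new state depends on the element's previous state,
 * its neighbors' previous states, and the element's properties. */
static State compute_state(int e) {
    double sum = state[e].value;
    for (int k = 0; k < num_neighbors[e]; k++)
        sum += props[e].stiffness * state[neighbor_list[e][k]].value;
    State s = { sum / (num_neighbors[e] + 1) };
    return s;
}

void run_serial_simulation(void) {
    for (int t = 1; t <= END_OF_SIMULATION; t++) {   /* step 3 */
        for (int e = 0; e < NUM_ELEMENTS; e++)       /* step 4 */
            next_state[e] = compute_state(e);        /* step 5 */
        for (int e = 0; e < NUM_ELEMENTS; e++)
            state[e] = next_state[e];   /* commit the time step */
    }
}
```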
Simple Approach to Parallelization (Concepts)
• Distributed Memory – parallel system based on processors linked with a fast network; processors communicate via messages
• Owner Computes – distribute elements to processors; each processor updates the elements which it contains
• Single Program Multiple Data (SPMD) – all machines run the same program on independent data; dominant form of parallel computing
Distributed Car
Basic Parallel Version
Concurrently for all processors P:
1. For all elements assigned to P
2.   Read State(element), Properties(element), Neighbor_list(element)
3. For time = 1 to end_of_simulation
4.   For element = 1 to num_elements_in_P
5.     Compute State(element) for next time step, based on the previous state of the element and its neighbors, and on the properties of the element
Notes
• Most of the code is the same as, or similar to, serial code.
• High-level structure remains the same: a sequence of steps
  • The sequence is a serial construct, but
  • Now the steps are performed in parallel, but
  • Calculations for individual elements are serial
• Question:
  • How does each processor keep track of adjacency info for neighbors on other processors?
Distributed Car (ghost cells)
Parallel Architectures
Flynn's Taxonomies
• Hardware classifications
• {S,M}I{S,M}D
• Single Instruction (SI) – system in which all processors execute the same instruction
• Multiple Instruction (MI) – system in which different processors may execute different instructions
• Single Data (SD) – system in which all processors operate on the same data
• Multiple Data (MD) – system in which different processors may operate on different data
• M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948–960, 1972.
Flynn's Taxonomies
• SISD – classic von Neumann architecture; serial computer
• MIMD – collections of autonomous processors that can execute multiple independent programs, each of which can have its own data stream
• SIMD – data is divided among the processors and each data item is subjected to the same sequence of instructions; GPUs, Advanced Vector Extensions (AVX)
• MISD – very rare; systolic arrays; smartphones carried by Chupacabras
CRAY-1 Vector Machine (1976)
Vector Machines Today
Software Taxonomies
• Data Parallel (SIMD)
  • Parallelism that is a result of identical operations being applied concurrently on different data items; e.g., many matrix algorithms
  • Difficult to apply to complex problems
• Single Program, Multiple Data (SPMD)
  • A single application is run across multiple processes/threads on a MIMD architecture
  • Most processes execute the same code but do not work in lock-step
  • Dominant form of parallel programming
SISD vs. SIMD
MIMD Architectures (Shared Memory)
Uniform Memory Access (UMA)    Non-Uniform Memory Access (NUMA)
More MIMD Architectures
Distributed Memory    Hybrid Memory
Shared Memory (SM)
• Attributes:
  • Global memory space
  • Each processor will utilize its own cache for a portion of global memory; consistency of the cache is maintained by hardware
• Advantages:
  • User-friendly programming techniques (OpenMP and OpenACC)
  • Low latency for data sharing between tasks
• Disadvantages:
  • The global memory space-to-CPU path may be a bottleneck
  • Non-Uniform Memory Access
  • Programmer is responsible for synchronization
Distributed Memory (DM)
• Attributes:
  • Memory is shared amongst processors through message passing
• Advantages:
  • Memory scales based on the number of processors
  • Access to a processor's own memory is fast
  • Cost effective
• Disadvantages:
  • Error prone; programmers are responsible for the details of the communication
  • Complex data structures may be difficult to distribute
Hardware/Software Models
• Software and hardware models do not need to match
• DM software on SM hardware:
  • Message Passing Interface (MPI) – designed for DM hardware but available on SM systems
• SM software on DM hardware:
  • Remote Memory Access (RMA) included within MPI-3
  • Partitioned Global Address Space (PGAS) languages
    • Unified Parallel C (extension to ISO C99)
    • Coarray Fortran (Fortran 2008)
Difficulties
• Serialization causes bottlenecks
• Workload is not distributed
• Debugging is hard
• Serial approach doesn't parallelize
Parallel Performance
Definitions
• SerialTime(n) – time for a serial program to solve a problem with an input of size n
• ParallelTime(n, p) – time for a parallel program to solve the same problem with an input of size n using p processors
• SerialTime(n) ≤ ParallelTime(n, 1)
• Speedup(n, p) = SerialTime(n) / ParallelTime(n, p)
• Work(n, p) = p * ParallelTime(n, p)
• Efficiency(n, p) = SerialTime(n) / [p * ParallelTime(n, p)]
Definitions
• Expectations:
  • 0 < Speedup ≤ p
  • Serial Work ≤ Parallel Work < ∞
  • 0 < Efficiency ≤ 1
• Linear Speedup: Speedup = p, given some restriction on the relationship between n and p
In The Real World
Superlinear Speedup
• Superlinear Speedup: Speedup > p, thus Efficiency > 1
• Why?
  • The parallel computer has p times as much RAM, so a higher fraction of program memory is in RAM instead of on disk.
  • The parallel program used a different algorithm which was not possible with the serial program.
Amdahl's Law
• Let f be the fraction of time spent on serial operations
• Let ParallelOps = 1 – f and assume linear speedup
• For p processors:
  • ParallelTime(p) ≥ SerialTime * [f + (ParallelOps / p)]
  • Speedup(p) ≤ 1 / (f + ParallelOps / p)
• Thus, Speedup(p) ≤ 1 / f, no matter the number of processors used
Maximum Possible Performance
Pessimistic Interpretation
Optimistic Interpretation
• We should be more optimistic than Amdahl's Law suggests because:
  • Algorithms exist with much smaller values of f
  • More time is spent in RAM than on disk
  • Time spent in f is a decreasing fraction of the total time as problem size increases
Common Program Structure
Scaling Strategy
• Increase n as p increases
  • Amdahl considered strong scaling, where n is fixed
• Utilize weak scaling
  • The amount of data per processor is fixed
  • Efficiency can remain high if communication does not increase excessively
Embarrassingly Parallel Jobs
Parameterization
• Requirements:
  • Independent tasks
  • Common applications
  • One job submission per task
• Techniques:
  • Parameter sweeps (modify parameters mathematically for each task)
  • List-driven analysis (modify parameters based on a list of values)
Example: List-Driven Analysis
• Problem: calculate the theoretical performance for the CRI HPC clusters (historically)
• Rpeak = Nodes * CoresPerNode * ClockSpeed * InstructPerCycle
List
Cluster     Nodes  CoresPerNode  ClockSpeed  InstructPerCycle
ibicluster  99     8             2.66e09     4
tarbell     40     64            2.20e09     8
gardner     126    28            2.00e09     16
Base
#PBS -N Rpeak.@Cluster
#PBS -l nodes=1:ppn=1
#PBS -l mem=1gb
#PBS -l walltime=20:00

./rpeak -n @Nodes -c @CoresPerNode \
        -s @ClockSpeed -i @InstructPerCycle
Shared Memory Parallelism
Shared Memory
• All processors can directly access all the memory in the system (though access times may be different due to NUMA)
• Software portability – applications can be developed on serial machines and run on parallel machines without any changes
• Latency hiding – ability to mask the access latency for memory access, I/O, and communications
• Scheduling – supports system-level dynamic mapping of tasks to processors
• Ease of programming – significantly easier to program (in my opinion) using a shared memory model over a distributed memory model
Parallelization Techniques: pthreads
• POSIX threads
• Standard threading implementation for UNIX-like operating systems (Linux, Mac OS X)
• Library that can be linked with C programs
• Some compilers include a Fortran version of pthreads
• Knowledge of pthreads can easily be transferred to programming other widely used threading specifications (e.g., Java threads)
Matrix-Vector Multiplication (pthreads)
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Global variable: accessible to all threads */
int thread_count;

void *Pth_mat_vect(void *rank);  /* Thread function */

int main(int argc, char *argv[]) {
    long thread;  /* Use long in case of a 64-bit system */
    pthread_t *thread_handles;

    /* Get number of threads from the command line */
    thread_count = strtol(argv[1], NULL, 10);
    thread_handles = malloc(thread_count * sizeof(pthread_t));
Matrix-Vector Multiplication (pthreads)
    /* Create a thread for each rank */
    for (thread = 0; thread < thread_count; thread++)
        pthread_create(&thread_handles[thread], NULL,
                       Pth_mat_vect, (void *) thread);

    /* Join the results from each thread */
    for (thread = 0; thread < thread_count; thread++)
        pthread_join(thread_handles[thread], NULL);

    free(thread_handles);
    return 0;
}  /* main */
Matrix-Vector Multiplication (pthreads)
void *Pth_mat_vect(void *rank) {
    long my_rank = (long) rank;
    int i, j;
    int local_m = m / thread_count;
    int my_first_row = my_rank * local_m;
    int my_last_row = (my_rank + 1) * local_m - 1;

    for (i = my_first_row; i <= my_last_row; i++) {
        y[i] = 0.0;
        for (j = 0; j < n; j++)
            y[i] += A[i][j] * x[j];
    }
    return NULL;
}  /* Pth_mat_vect */
Parallelization Techniques: OpenMP
• OpenMPissortofanHPCstandardforsharedmemoryprogramming• OpenMPversion4.5releasedin2015andincludesacceleratorsupportasanadvancedfeature• APIforthread-basedparallelism• Explicitprogrammingmodel,compilerinterpretsdirecCves• BasedonacombinaConofcompilerdirecCves,libraryrouCnes,andenvironmentvariables• Usesthefork-joinmodelofparallelexecuCon• AvailableinmostFortranandCcompilers
OpenMP Goals

• Standardization: standard among all shared-memory architectures and hardware platforms
• Lean: a simple and limited set of compiler directives for shared-memory programming; often significant performance gains using just 4-6 directives in complex applications
• Ease of use: supports incremental parallelization of a serial program, not an all-or-nothing approach
• Portability: supports Fortran, C, and C++
OpenMP Building Blocks

• Compiler directives (embedded in code)
  • Parallel regions (PARALLEL)
  • Parallel loops (PARALLEL DO)
  • Parallel workshare (PARALLEL WORKSHARE)
  • Parallel sections (PARALLEL SECTIONS)
  • Parallel tasks (TASK)
  • Serial sections (SINGLE)
  • Synchronization (BARRIER, CRITICAL, ATOMIC, ...)
  • Data-scoping clauses (PRIVATE, SHARED, REDUCTION)
• Run-time library routines (called in code)
  • OMP_SET_NUM_THREADS
  • OMP_GET_NUM_THREADS
• UNIX environment variables (set before program execution)
  • OMP_NUM_THREADS
Fork-Join Model

• Parallel execution is achieved by generating threads which are executed in parallel
• The master thread executes serially until the first parallel region is encountered
• Fork: the master thread creates a team of threads which are executed in parallel
• Join: when the team members complete their work, they synchronize and terminate; the master thread continues sequentially
Work-Sharing Constructs
![Page 68: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/68.jpg)
Barriers
• Barriers may be needed for correctness
• Synchronization degrades performance
![Page 69: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/69.jpg)
OpenMP Nota)on
• OpenMPrecognizesthefollowingcompilerdirecCvesthatstartwith:• !$OMP(inFortran)• #pragmaomp(inC/C++)
![Page 70: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/70.jpg)
Parallel Loops
• Eachthreadexecutespartoftheloop• ThenumberofiteraConsisstaCcallyassignedtothethreadsuponentrytotheloop• NumberofiteraConscannotbechangedduringexecuCons• ImplicitBARRIERattheendoftheloop• Highefficiency
![Page 71: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/71.jpg)
Parallel Loop Example (Fortran)
!$OMPPARALLEL!$OMPDODOi=1,na(i)=b(i)+c(i)ENDDO!$OMPENDDO!$OMPENDPARALLEL
![Page 72: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/72.jpg)
Parallel Sec)ons
• MulCpleindependentsecConscanbeexecutedconcurrentlybyseparatethreads• EachparallelsecConisassignedtoaspecificthreadwhichexecutestheworkfromstarttofinish• ImplictBARRIERattheendofeachSECTION• NestedparallelsecConsarepossible,butimpracCcal
![Page 73: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/73.jpg)
Parallel Sec)ons Example (Fortran)
!$OMPPARALLELSECTIONS!$OMPSECTIONDOi=1,na(i)=b(i)+c(i)ENDDO!$OMPSECTIONDOi=1,kd(i)=e(i)+f(i)ENDDO!$OMPENDPARALLELSECTIONS
![Page 74: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/74.jpg)
Parallel Tasks and Workshares
• ParallelTasks• Unstructuredparallelism• Irregularproblemslike:
• Recursivealgorithms• Unboundedloops
• ParallelWorkshares• Fortranonly• Usedtoparallelizeassignments
![Page 75: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/75.jpg)
Matrix-Vector Mul)plica)on: OpenMP C
#include<stdio.h>#include<stdlib.h>voidOmp_mat_vect(intm,intn,double*restricta,double*restrictb,double*restrictc){inti,j;#pragmaompparallelfordefault(none)\shared(m,n,a,b,c)private(i,j)for(i=0;i<m;i++){a[i]=0.0;for(j=0;j<n;j++)a[i]+=b[i*n+j]*c[j];}/*Endofompparallelfor*/
![Page 76: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/76.jpg)
Matrix-Vector Mul)plica)on: OpenMP Fortran
subrouCnePth_mat_vect(m,n,a,b,c)implicitnoneinteger(kind=4)::m,nreal(kind=8)::a(1:m),b(1:m,1:n),c(1:n)integer::i,j!$OMPPARALLELDODEFAULT(NONE)&!$OMPSHARED(m,n,a,b,c)PRIVATE(i,j)doi=1,ma(i)=0.0doj=1,na(i)=a(i)+b(i,j)*c(j)enddoenddo!$OMPENDPARALLELDOreturnendsubrouCnePth_mat_vect
![Page 77: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/77.jpg)
OpenMP Errors
• VariableScoping:Whicharesharedandwhichareprivate• SequenCalI/Oinaparallelregioncanleadtoanunpredictableorder• Falsesharing:twoormoreprocessorsaccessdifferentvariablesinthesamecacheline(Performance)• RaceCondiCons• Deadlock
![Page 78: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/78.jpg)
Summa)on Example
Whatiswiththefollowingcodewrong?sum=0.0!$OMPPARALLELDODOi=1,nsum=sum+a(i)ENDDO!$OMPENDPARALLELDO
![Page 79: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/79.jpg)
Summa)on Example Correc)on
sum=0.0!$OMPPARALLELDOREDUCTION(+,sum)DOi=1,nsum=sum+a(i)ENDDO!$OMPENDPARALLELDO
![Page 80: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/80.jpg)
Distributed Memory Parallelism
![Page 81: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/81.jpg)
Message Passing
• Communication on distributed-memory systems is a significant aspect of both performance and correctness
• Messages are relatively slow, with startup times (latency) taking thousands of cycles
• Once message passing has started, the additional time per byte (bandwidth) is relatively small
![Page 82: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/82.jpg)
Performance on Gardner
• IntelXeonE5-2683Processor(Haswell)• Processorspeed:2,000cyclespermicrosecond(µsec)• 16FLOPs/cycle:32,000FLOPsperµsec• MPImessagelatency=~2.5µsecor80,000FLOPs• MPImessagebandwidth=~7,000bytes/µsec=4.57FLOPs/byte
![Page 83: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/83.jpg)
Reducing Latency
• ReducethenumberofmessagesbymappingcommunicaCngenCCesontothesameprocessor• CombinemessageshavingthesamesenderanddesCnaCon• IfprocessorAhasdataneededbyprocessorB,haveAsendittoB,ratherthanwaiCngforBtorequestit.ProcessorAshouldsendassoonasthedataisready,processorBshouldreaditaslateaspossibletoincreasetheprobabilitythatthedatahasarrived
![Page 84: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/84.jpg)
Other Issues With Message Passing
• NetworkCongesCon• Deadlock• Blocking:aprocessorcannotproceedunClthemessageisfinished• WithblockingcommunicaCon,youmayreachapointwherenoprocessorcanproceed• Non-blockingcommunicaCon:easiestwaytopreventdeadlock;processorscansendandproceedbeforereceiveisfinished
![Page 85: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/85.jpg)
Message Passing Interface (MPI)
• InternaConalStandard• MPI1.0releasedin1994• Currentversionis3.1(June2015)• MPI4.0standardintheworks• Availableonallparallelsystems• InterfacesinC/C++andFortran• BindingsforMATLAB,Python,R,andJava• Worksonbothdistributedmemoryandsharedmemoryhardware• HundredsoffuncCons,butyouwillneed~6-10togetstarted
![Page 86: Introducon to Parallel Compu)ngcri.uchicago.edu/wp-content/uploads/2018/09/Intro... · 2. Read State(element), ProperCes(element), Neighbor_list(element) 3. For Cme=1 to end_of_simulaon](https://reader034.vdocument.in/reader034/viewer/2022042418/5f3436c153014f01453b6b29/html5/thumbnails/86.jpg)
MPI Basics
• TwocommonMPIfuncCons:• MPI_SEND()-tosendamessage• MPI_RECV()-toreceiveamessage
• FuncConlikewriteandreadstatements• BothareblockingoperaCons• However,asystembufferisusedthatallowssmallmessagestobenon-blocking,butlargemessageswillbeblocking• ThesystembufferisbasedontheMPIimplementaCon,notthestandard• BlockingcommunicaConmayleadtodeadlocks
Small Messages
Large Messages
Deadlocks
(Diagram: deadlock source vs. avoidance)
Non-Blocking Communication

• Non-blocking operations:
  • MPI_Isend()
  • MPI_Irecv()
  • MPI_Wait()
• The user can check for data at a later stage in the program without waiting: MPI_Test()
• Non-blocking operations will perform better than blocking operations
• Possible to overlap communication with computation
MPI Program Template

```c
#include "mpi.h"

int main(int argc, char *argv[]) {
    /* Initialize MPI */
    MPI_Init(&argc, &argv);

    /* Allow each processor to determine its role */
    int num_processor, my_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &num_processor);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Send and receive some stuff here */

    /* Finalize */
    MPI_Finalize();
    return 0;
}
```
MPI Summation

```c
/* Send and receive some stuff here */
if (my_rank == 0) {
    sum = 0.0;
    for (source = 1; source < num_procs; source++) {
        MPI_Recv(&value, 1, MPI_FLOAT, source, tag,
                 MPI_COMM_WORLD, &status);
        sum += value;
    }
} else {
    MPI_Send(&value, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD);
}
```
A More Efficient Receive

• In the previous example, rank 0 received the messages in order based on ranks
• Thus, if processor 2 was delayed, then processor 0 was also delayed
• Make the following modification:

```c
MPI_Recv(&value, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag,
         MPI_COMM_WORLD, &status);
```

• Now processor 0 can process messages as soon as they arrive
MPI Reduction

• Like OpenMP, MPI also has reduction operations that can be used for tasks such as summation
• Reduction is a type of collective communication

```c
MPI_Reduce(&value, &sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
```

• Reduction operations:
  • MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD
  • MPI_LAND, MPI_LOR
Collective Communication

• The most frequently invoked MPI routines after send and receive
• They improve clarity, run faster than send and receive, and reduce the likelihood of making an error
• Examples:
  • Broadcast (MPI_Bcast)
  • Scatter (MPI_Scatter)
  • Gather (MPI_Gather)
  • All Gather (MPI_Allgather)
Broadcast
Scatter
Gather
All Gather
MPI Data Types

• Primitive data types (correspond to those found in the underlying language)
• Fortran:
  • MPI_CHARACTER
  • MPI_INTEGER
  • MPI_REAL, MPI_DOUBLE_PRECISION
  • MPI_LOGICAL
• C/C++ (includes many more variants than Fortran):
  • MPI_CHAR
  • MPI_INT, MPI_UNSIGNED, MPI_LONG
  • MPI_FLOAT, MPI_DOUBLE
  • MPI_C_BOOL
MPI Data Types

• Derived data types are also possible:
  • Vector - data separated by a constant stride
  • Contiguous - data separated by a stride of 1
  • Struct - mixed types
  • Indexed - an array of indices
MPI Vector
MPI Contiguous
MPI Struct
MPI Indexed
MPI Synchronization

• Implicit synchronization:
  • Blocking communication
  • Collective communication
• Explicit synchronization:
  • MPI_Wait
  • MPI_Waitany
  • MPI_Barrier
• Remember, synchronization can hinder performance
Additional MPI Features

• Virtual topologies
• User-created communicators
• User-specified derived data types
• Parallel I/O
• One-sided communication (MPI_Put, MPI_Get)
• Non-blocking collective communication
• Remote Memory Access
Vectorization
Accelerators

• Principle: split an operation into independent parts and execute the parts concurrently in an accelerator (SIMD)

```fortran
DO i = 1, 1000
   A(i) = B(i) + C(i)
END DO
```

• Accelerators like Graphics Processing Units are specialized for SIMD operations
Popularity

• In 2012, 13% of the performance share of the Top500 supercomputers was provided through accelerators
• Today, the performance share for accelerators is ~38%
• Trend: build diverse computing platforms with multiprocessors, SMP nodes, and accelerators
• Caveat: it is difficult to use heterogeneous hardware effectively
Modern Accelerators

NVidia Volta:
• 7.5 TFLOPs double precision
• 30 TFLOPs half precision
• 5,120 cores
• 16 GB RAM
• 900 GB/s memory bandwidth

Intel Xeon Phi Knights Landing:
• 3 TFLOPs double precision
• 72 cores (288 threads)
• 16 GB RAM
• 384 GB/s memory bandwidth
• Includes both a scalar and a vector unit
NVidia Architecture

• SIMT: Single Instruction, Multiple Thread (similar to SIMD)
• Thread blocks are executed in warps of 32 threads
• Performance depends on memory alignment and on eliminating shared-memory access
Programming GPUs

• Low-level programming:
  • Proprietary - NVidia's CUDA
  • Portable - OpenCL
• High-level programming:
  • OpenACC
  • OpenMP 4.5
  • Allows a single code base that may be compiled for both multicore CPUs and GPUs
CUDA Dot Product

```c
int main(void) {
    int *a, *b, *c;               /* CPU copies */
    int *dev_a, *dev_b, *dev_c;   /* GPU copies */
    int arraysize = n * sizeof(int);

    /* Allocate memory on the GPU */
    cudaMalloc((void **)&dev_a, arraysize);
    cudaMalloc((void **)&dev_b, arraysize);
    cudaMalloc((void **)&dev_c, sizeof(int));

    /* Use malloc for a, b, c; initialize a and b */

    /* Move a and b to the GPU */
    cudaMemcpy(dev_a, a, arraysize, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, arraysize, cudaMemcpyHostToDevice);

    /* Launch the GPU kernel */
    dot<<<n / threads_per_block, threads_per_block>>>(dev_a, dev_b, dev_c);

    /* Copy the result back to the CPU */
    cudaMemcpy(c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    return 0;
}
```
CUDA Dot Product

```c
__global__ void dot(int *a, int *b, int *c) {
    __shared__ int temp[threads_per_block];
    /* Convert the per-block thread index to a global index */
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];

    /* Avoid a race condition on the block-local sum */
    __syncthreads();

    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < threads_per_block; i++)
            sum += temp[i];
        atomicAdd(c, sum);   /* add to the global sum */
    }
}
```
OpenACC and OpenMP

• Share the same model of computation:
  • Host-centric - the host offloads code regions and data to accelerators
  • Device - has independent memory and multiple threads
  • Mapping clause - defines the relationship between memory on the host and memory on the accelerator
OpenACC Matrix Multiplication

```c
int main(int argc, char **argv) {
    int n = atoi(argv[1]);
    float *a = (float *)malloc(sizeof(float) * n * n);
    float *b = (float *)malloc(sizeof(float) * n * n);
    float *c = (float *)malloc(sizeof(float) * n * n);

    #pragma acc data copyin(a[0:n*n], b[0:n*n]), copy(c[0:n*n])
    {
        int i, j, k;
        #pragma acc kernels
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                a[i*n + j] = (float)(i + j);
                b[i*n + j] = (float)(i - j);
                c[i*n + j] = 0.0f;
            }
        }
```
OpenACC Matrix Multiplication

```c
        #pragma acc kernels
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                for (k = 0; k < n; k++) {
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
                }
            }
        }
    }
}
```
Speedup?

• Some problems can achieve a speedup of 10x-100x by offloading some of the work to a GPU (e.g., Linpack, matrix multiplication)
• In reality:
  • Only a small fraction of applications can utilize a GPU effectively
  • The average speedup of those applications is ~2x-4x
  • Optimizing your code using multicore technologies is probably a better use of your time
Problem Areas

• Branching can cut your performance by 50%, since instructions need to be issued for both branches
• Memory bandwidth on GPUs is poor relative to their compute rate
• You need to make sure your operands are used more than once
• For the NVidia Volta, every operand needs to be used ~67 times to achieve peak FLOPs
Accelerator Trends

• Intel has canceled Knights Landing and Knights Hill; the future status of Xeon Phi is unknown
• GPUs are becoming less rigidly SIMD and are including better memory bandwidth
• More programmers are moving from low-level programming like CUDA to high-level directives
• The PGI compiler was purchased by NVidia in 2013, which cut support for OpenCL
• OpenACC and OpenMP still have not merged
• The efficiency of GPUs is still problematic, and the programming model is a moving target
• New motivation: deep learning
Hybrid Computing
Current Trend

• Accelerator-based machines are grabbing the high rankings on the Top500, but clusters with standard cores are still the most important economically
• Economics dictates a distributed-memory system of shared-memory boards with multicore commodity chips, perhaps with some form of accelerator distributed sparingly
MPI+X
Conclusion
Topics for Possible Future Workshops
• Scientific Computing (C++, Fortran, or both)
• Parallel Computing Theory (e.g., load balancing, measuring performance)
• Embarrassingly Parallel Computing
• pthreads
• OpenMP
• MPI
• OpenACC
• CUDA
• Parallel Debugging/Profiling