Lecture 1: An Introduction to Parallel Computing
CSCE 569, Spring 2018
Department of Computer Science and Engineering
Yonghong Yan
[email protected]    http://cse.sc.edu/~yanyh
1
Course Information
• Meeting Time: 9:40 AM - 10:55 AM, Monday and Wednesday
• Class Room: 2A15, Swearingen Engineering Center, 301 Main St, Columbia, SC 29208
• Grade: 60% for four homeworks + 40% for two exams
• Instructor: Yonghong Yan
  – http://cse.sc.edu/~yanyh, [email protected]
  – Office: Room 2211, Storey Innovation Center (Horizon II), 550 Assembly St, Columbia, SC 29201
  – Tel: 803-777-7361
  – Office Hours: 11:00 AM - 12:30 PM (after class) or by appointment
• Public course website: http://passlab.github.io/CSCE569
• Homework submission: https://dropbox.cse.sc.edu
• See the syllabus or website for more details
2
Objectives
• Learn fundamentals of concurrent and parallel computing
  – Describe benefits and applications of parallel computing.
  – Explain architectures of multicore CPUs, GPUs, and HPC clusters
    • Including the key concepts in parallel computer architectures, e.g. shared memory systems, distributed systems, NUMA and cache coherence, interconnection
  – Understand principles of parallel and concurrent program design, e.g. decomposition of work, task and data parallelism, processor mapping, mutual exclusion, locks.
• Develop skills in writing and analyzing parallel programs
  – Write parallel programs using the OpenMP, CUDA, and MPI programming models.
  – Perform analysis of parallel programs and problems.
3
• Lots of materials on the Internet.
  – On the website, there is a "Resources" section that provides web page links, documents, and other materials for this course
Textbooks
4
• Required: Introduction to Parallel Computing (2nd Edition), PDF, Amazon; covers theory and an introduction to MPI and OpenMP; by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003
• Recommended: John Cheng, Max Grossman, and Ty McKercher, Professional CUDA C Programming, 1st Edition, 2014, PDF, Amazon.
• Reference book for OpenMP: Barbara Chapman, Gabriele Jost, and Ruud van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, 2007, PDF, Amazon.
• Reference book for MPI: choose from the Recommended Books for MPI
Homeworks and Exams
• Four homeworks: practice programming skills
  – Require both good and correct programming
    • Write organized programs that are easy to read
  – Report and discuss your findings in a report
    • Write good documentation
  – 60% total (10% + 10% + 20% + 20%)
• Exams: test fundamentals
  – Closed/open book (?)
  – 40% total
    • Midterm: 15%, March 7th, Wednesday, during class
      – The week before spring break.
    • Final exam: 25%, May 2nd, Wednesday, 9:00 AM - 11:30 AM
5
Machines for Development with OpenMP and MPI
• Linux machines in Swearingen 1D39 and 3D22
  – All CSCE students by default have access to these machines using their standard login credentials
    • Let me know if you, CSCE or not, cannot access them
  – Remote access is also available via SSH over port 222. The naming scheme is as follows:
    • l-1d39-01.cse.sc.edu through l-1d39-26.cse.sc.edu
    • l-3d22-01.cse.sc.edu through l-3d22-20.cse.sc.edu
• Users are restricted to 2 GB of data in their home folder (~/).
  – For more space, create a directory in /scratch on the login machine; however, that data is not shared and will only be available on that specific machine.
6
PuTTY SSH Connection on Windows
7
Host: l-1d39-08.cse.sc.edu    Port: 222
SSH Connection from Linux/Mac OS X Terminal
8
-X enables X-window forwarding so you can use the graphics display on your computer; for example, a command of the form ssh -X -p 222 [email protected] (your_username being your CSE login) should work. For Mac OS X, you need to have X server software installed; e.g. XQuartz (https://www.xquartz.org/) is the one I use.
Try in the Lab and from Remote
• Bring your laptop
9
Topics
• Introduction
• Programming on shared memory systems (Chapter 7)
  – OpenMP
  – Pthreads, mutual exclusion, locks, synchronizations
  – Cilk/Cilkplus (?)
• Principles of parallel algorithm design (Chapter 3)
• Analysis of parallel program executions (Chapter 5)
  – Performance metrics for parallel systems
    • Execution time, overhead, speedup, efficiency, cost
  – Scalability of parallel systems
  – Use of performance tools
10
Topics
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel (?)
• Parallel architectures and hardware
  – Parallel computer architectures
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to the offloading model in OpenMP (?)
• Parallel algorithms (Chapters 8, 9 & 10)
  – Dense linear algebra, stencil and image processing
11
Prerequisites
• Good reasoning and analytical skills
• Familiarity with and skills in C/C++ programming
  – macros, pointers, arrays, structs, unions, function pointers, etc.
• Familiarity with the Linux environment
  – SSH, Linux commands, vim/Emacs editor
• Basic knowledge of computer architecture and data structures
  – Memory hierarchy, cache, virtual addresses
  – Arrays and linked lists
• Talk with me if you have concerns
• Turn in the survey
12
Introduction: What Is and Why Parallel Computing
13
An Example: Grading
14
15 questions, 300 exams
From An Introduction to Parallel Programming, By Peter Pacheco, Morgan Kaufmann Publishers Inc, Copyright © 2010, Elsevier Inc. All rights Reserved
Three Teaching Assistants
• To grade 300 copies of the exam, each of which has 15 questions
15
TA #1    TA #2    TA #3
Division of Work – Data Parallelism
• Each does the same type of work (task), but works on different sheets (data)
16
TA #1: 100 exams
TA #2: 100 exams
TA #3: 100 exams
Division of Work – Task Parallelism
• Each does a different type of work (task), but works on the same sheets (data)
17
TA #1: Questions 1-5
TA #2: Questions 6-10
TA #3: Questions 11-15
Summary
• Data: 300 copies of the exam
• Task: grade a total of 300 * 15 questions
• Data parallelism
  – Distribute the 300 copies to the three TAs
  – They work independently
• Task parallelism
  – Distribute the 300 copies to the three TAs
  – Each grades 5 questions on 100 copies
  – Exchange copies
  – Grade 5 questions again
  – Exchange copies
  – Grade 5 questions
• The three TAs can work in parallel, so theoretically we can achieve a 3x speedup
18
Which approach could be faster?
Challenges
• Are the three TAs grading with the same performance?
  – One CPU may be slower than the other
  – They may not work on grading at the same time
• How do the TAs communicate?
  – Are they sitting at the same table, or does each take copies and grade from home? How do they share intermediate results (task parallelism)?
• Where are the solutions stored so they can refer to them when grading?
  – Remembering answers to 5 questions vs. to 15 questions
• Cache and memory issues
19
What is Parallel Computing?
• A form of computation*:
  – Large problems are divided into smaller ones
  – The smaller ones are carried out and solved simultaneously
• Uses more than one CPU or core concurrently for one program
  – Not conventional time-sharing: multiple programs switching between each other on one CPU
  – Or multiple programs each on its own CPU and not interacting
• Serial processing
  – Some programs, or parts of a program, are inherently serial
  – Most of our programs and desktop applications
* http://en.wikipedia.org/wiki/Parallel_computing
20
Why Parallel Computing?
• Save time (execution time) and money!
  – A parallel program can run faster if it runs concurrently instead of sequentially.
• Solve larger and more complex problems!
  – Utilize more computational resources
From "21st Century Grand Challenges | The White House", http://www.whitehouse.gov/administration/eop/ostp/grand-challenges
Grand challenges: http://en.wikipedia.org/wiki/Grand_Challenges
21
Picture from: Intro to Parallel Computing: https://computing.llnl.gov/tutorials/parallel_comp
High Performance Computing (HPC) and Parallel Computing
• HPC is what is really needed*
  – Parallel computing is so far the only way to get there!!
• Parallel computing makes sense!
• Applications that require HPC
  – Many problem domains are naturally parallelizable
  – Data cannot fit in the memory of one machine
• Computer systems
  – Physics limitation: they have to be built parallel
  – Parallel systems are widely accessible
    • A smartphone has 2 to 4 cores + a GPU now
22
* What is HPC: http://insidehpc.com/hpc-basic-training/what-is-hpc/
Supercomputer: http://en.wikipedia.org/wiki/Supercomputer
TOP500 (500 most powerful computer systems in the world): http://en.wikipedia.org/wiki/TOP500, http://top500.org/
HPC matters: http://sc14.supercomputing.org/media/social-media
We will discuss each of the two aspects today!
Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build the system.
• Limitations of experiments:
  – Too difficult -- build large wind tunnels.
  – Too expensive -- build a throw-away passenger jet.
  – Too slow -- wait for climate or galactic evolution.
  – Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high performance computer systems to simulate the phenomenon
    • Based on known physical laws and efficient numerical methods.
23
From slides of Kathy Yelick's 2007 course at Berkeley: http://www.cs.berkeley.edu/~yelick/cs267_sp07/
Applications: Science and Engineering
• Model many difficult problems by parallel computing
  – Atmosphere, Earth, Environment
  – Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
  – Bioscience, Biotechnology, Genetics
  – Chemistry, Molecular Sciences
  – Geology, Seismology
  – Mechanical Engineering - from prosthetics to spacecraft
  – Electrical Engineering, Circuit Design, Microelectronics
  – Computer Science, Mathematics
  – Defense, Weapons
24
Applications: Industrial and Commercial
• Processing large amounts of data in sophisticated ways
  – Databases, data mining
  – Oil exploration
  – Medical imaging and diagnosis
  – Pharmaceutical design
  – Financial and economic modeling
  – Management of national and multi-national corporations
  – Advanced graphics and virtual reality, particularly in the entertainment industry
  – Networked video and multi-media technologies
  – Collaborative work environments
  – Web search engines, web based business services
25
Economic Impact of HPC
• Airlines:
  – System-wide logistics optimization systems on parallel systems.
  – Savings: approx. $100 million per airline per year.
• Automotive design:
  – Major automotive companies use large systems (500+ CPUs) for:
    • CAD-CAM, crash testing, structural integrity and aerodynamics.
    • One company has a 500+ CPU parallel system.
  – Savings: approx. $1 billion per company per year.
• Semiconductor industry:
  – Semiconductor firms use large systems (500+ CPUs) for
    • device electronics simulation and logic validation
  – Savings: approx. $1 billion per company per year.
• Securities industry:
  – Savings: approx. $15 billion per year for U.S. home mortgages.
From slides of Kathy Yelick's 2007 course at Berkeley: http://www.cs.berkeley.edu/~yelick/cs267_sp07/
26
Inherent Parallelism of Applications
• Example: weather prediction and global climate modeling
27
Global Climate Modeling Problem
• Problem is to compute:
  – f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
• Approach:
  – Discretize the domain, e.g., a measurement point every 10 km
  – Devise an algorithm to predict weather at time t+dt given t
• Uses:
  – Predict major events, e.g., El Niño
  – Air quality forecasting
28
The Rise of Multicore Processors
29
Recent Multicore Processors
30
Recent Manycore GPU Processors
31
An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:
• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
• Hardware support throughout the design to enable new programming model capabilities

[Figure: Kepler GK110 full-chip block diagram]

Streaming Multiprocessor (SMX) Architecture
Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power efficient.

[Figure: SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).]

Kepler Memory Subsystem / L1, L2, ECC
Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache
In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

48 KB Read-Only Data Cache
In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

• ~3k cores
Units of Measure in HPC
• Flop: floating point operation (*, /, +, -, etc.)
• Flop/s: floating point operations per second, also written as FLOPS
• Bytes: size of data
  – A double precision floating point number is 8 bytes
• Typical sizes are millions, billions, trillions…
  – Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1,048,576 ~ 10^6 bytes
  – Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
  – Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
  – Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
  – Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
  – Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
• See www.top500.org for the units of the fastest machines, measured using the High Performance LINPACK (HPL) Benchmark
  – The fastest: Sunway TaihuLight, ~93 petaflop/s
  – The third (fastest in the US): DoE ORNL Titan, 17.59 petaflop/s
32
How to Measure and Calculate Performance (FLOPS)
33
https://passlab.github.io/CSCE569/resources/sum.c
• Calculate #FLOPs (2*N or 3*N)
  – Check the loop count (N) and the FLOPs per loop iteration (2 or 3).
• Measure the time to compute using a timer
  – elapsed and elapsed_2 are in seconds
• FLOPS = #FLOPs / Time
  – MFLOPS in the example (see the sketch below)
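The code at the URL above appears on the slide only as an image; the following is a minimal sketch of the same measurement idea, not the actual sum.c (the array initialization, the 2-FLOPs-per-iteration loop body, and the read_timer helper are illustrative assumptions):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  /* Wall-clock time in seconds. */
  double read_timer(void) {
      struct timeval t;
      gettimeofday(&t, NULL);
      return t.tv_sec + t.tv_usec * 1.0e-6;
  }

  int main(int argc, char *argv[]) {
      int N = (argc > 1) ? atoi(argv[1]) : 10000000;
      double *A = malloc(N * sizeof(double));
      for (int i = 0; i < N; i++) A[i] = i * 0.5;

      double elapsed = read_timer();
      double sum = 0.0;
      for (int i = 0; i < N; i++)
          sum += A[i] * 3.0;               /* 2 FLOPs per iteration: one * and one + */
      elapsed = read_timer() - elapsed;

      double mflops = (2.0 * N) / (elapsed * 1.0e6);   /* FLOPS = #FLOPs / time */
      printf("sum = %f, time = %f s, %.2f MFLOPS\n", sum, elapsed, mflops);
      free(A);
      return 0;
  }

Compile with something like gcc -O2 -std=c99 sum.c -o sum and run ./sum 100000000 to see how the measured MFLOPS changes with N.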
High Performance LINPACK (HPL) Benchmark Performance (Rmax) in Top500
• Measured using the High Performance LINPACK (HPL) Benchmark, which solves a dense system of linear equations → used to rank the machines
  – Ax = b
  – https://www.top500.org/project/linpack/
  – https://en.wikipedia.org/wiki/LINPACK_benchmarks
34
Top500 (www.top500.org), Nov 2017
35
HPC Peak Performance (Rpeak) Calculation
• Node performance in Gflop/s = (CPU speed in GHz) x (number of CPU cores) x (CPU instructions per cycle) x (number of CPUs per node)
  – CPU instructions per cycle (IPC) = #Flops per cycle
    • Because a pipelined CPU can do one instruction per cycle
    • 4 or 8 for most CPUs (Intel or AMD)
  – http://www.calcverter.com/calculation/CPU-peak-theoretical-performance.php
• HPC Peak (Rpeak) = #nodes * node performance in GFlops
36
CPU Peak Performance Examples
• Intel X5600 series CPUs and AMD 6100/6200/6300 series CPUs have 4 instructions per cycle; Intel E5-2600 series CPUs have 8 instructions per cycle
• Example 1: Dual-CPU server based on Intel X5675 (3.06 GHz, 6 cores) CPUs:
  – 3.06 x 6 x 4 x 2 = 146.88 GFLOPS
• Example 2: Dual-CPU server based on Intel E5-2670 (2.6 GHz, 8 cores) CPUs:
  – 2.6 x 8 x 8 x 2 = 332.8 GFLOPS
  – With 8 nodes: 332.8 GFLOPS x 8 = 2,662.4 GFLOPS = 2.66 TFLOPS
• Example 3: Dual-CPU server based on AMD 6176 (2.3 GHz, 12 cores) CPUs:
  – 2.3 x 12 x 4 x 2 = 220.8 GFLOPS
• Example 4: Dual-CPU server based on AMD 6274 (2.2 GHz, 16 cores) CPUs:
  – 2.2 x 16 x 4 x 2 = 281.6 GFLOPS
(A small code sketch of this calculation follows below.)
https://saiclearning.wordpress.com/2014/04/08/how-to-calculate-peak-theoretical-performance-of-a-cpu-based-hpc-system/
37
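To make the Rpeak formula concrete, here is a small hedged sketch (the function name node_peak_gflops is made up for illustration; the numbers plugged in are Example 2 above):

  #include <stdio.h>

  /* Theoretical node peak in GFLOPS:
     (GHz) x (cores per CPU) x (flops per cycle) x (CPUs per node) */
  double node_peak_gflops(double ghz, int cores, int flops_per_cycle, int cpus_per_node) {
      return ghz * cores * flops_per_cycle * cpus_per_node;
  }

  int main(void) {
      /* Example 2: dual Intel E5-2670, 2.6 GHz, 8 cores, 8 flops per cycle */
      double node = node_peak_gflops(2.6, 8, 8, 2);        /* 332.8 GFLOPS */
      int nodes = 8;
      printf("Node peak: %.1f GFLOPS\n", node);
      printf("Rpeak with %d nodes: %.1f GFLOPS = %.2f TFLOPS\n",
             nodes, node * nodes, node * nodes / 1000.0);   /* 2662.4 GFLOPS */
      return 0;
  }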
Performance (HPL) Development Over the Years of Top500 Machines
38
4 Kinds of Rankings of HPC/Supercomputers
1. Top500: according to the measured High Performance LINPACK (HPL) Benchmark performance
  – Not peak performance, not other applications
2. Ranking according to HPCG benchmark performance
3. Graph500: ranking according to graph processing capability
  – Shortest Path and Breadth First Search
  – https://graph500.org
4. Green500: ranking according to power efficiency (GFLOPS/Watt)
  – https://www.top500.org/green500/
  – The sublists in the following slides were generated from https://www.top500.org/statistics/sublist/
39
HPCG Ranking
• HPCG: High Performance Conjugate Gradients (HPCG) Benchmark (http://www.hpcg-benchmark.org/)
40
Graph500 (https://graph500.org)
• Ranking according to the capability of processing large-scale graphs (Shortest Path and Breadth First Search)
41
Green500: Power Efficiency (GFLOPS/Watt)
42
• Power efficiency = HPL performance / power
  – E.g. TaihuLight, #1 of Top500: 93,014.6 / 15,371 = 6.051 GFLOPS/Watt
• https://www.top500.org/green500/
Green500: Power Efficiency (GFLOPS/Watt)
43
• https://www.top500.org/green500/
Performance Efficiency
• HPC performance efficiency = actual measured performance (GFLOPS) / theoretical peak performance (GFLOPS)
  – E.g. #1 in Top500:
    • 93,014.6 / 125,435.9 = 74.2%
44
https://www.penguincomputing.com/company/blog/calculate-hpc-efficiency/
HPL Performance Efficiency of Top500 (2015 list)
• Mostly 40% - 90% (OK)
45
HPCG Efficiency of Top 70 of Top500 (2015 list)
• Mostly below 5%, and only some around 10%
46
Ranking Summary
• High Performance LINPACK (HPL) for Top500
  – Dense linear algebra (Ax = b), highly computation intensive
  – Ranks the Top500 by absolute computation capability
• HPCG: High Performance Conjugate Gradients (HPCG) Benchmark, an HPL alternative
  – Sparse matrix-vector multiplication, balanced memory and computation intensity
  – Ranks machines with regard to the combination of computation and memory performance
• Graph500: Shortest Path and Breadth First Search
  – Ranking according to the capability of processing large-scale graphs
  – Stresses network and memory systems
• Green500 of Top500 (HPL GFlops/Watt)
  – Power efficiency
47
Why is parallel computing, namely multicore, manycore and clusters, the only way, so far, for high performance?
48
Semiconductor Trend: "Moore's Law"
Gordon Moore, Founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch^2 in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised - circuit complexity doubles every two years
Image credit: Intel
49
Microprocessor Transistor Counts 1971-2011 & Moore's Law
50
https://en.wikipedia.org/wiki/Transistor_count
Moore's Law Trends
• More transistors = more opportunities for exploiting parallelism at the instruction level (ILP)
  – Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
  – Wider instruction issue, longer pipeline
  – More speculation
  – More and larger registers and caches
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
  – An easy job of getting better performance: buying faster processors (higher frequency)
• We have enjoyed this free lunch for several decades, however…
51
Problems of Traditional ILP Scaling
• Fundamental circuit limitations [1]
  – delays ⇑ as issue queues ⇑ and multi-port register files ⇑
  – increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
  – inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies
52
[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
ILP Impacts
53
Simulations of 8-issue Superscalar
54
Power/Heat Density Limits Frequency
55
• Some fundamental physical limits are being reached
We Will Have This…
56
Revolution Happened Already
• Chip density is continuing to increase ~2x every 2 years
  – Clock speed is not
  – The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
  – No free lunch
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
The Trends
[Figure: performance of the leading machines from 1950 to 2010, rising from about 1 KFlop/s to 1 PFlop/s through the scalar, super scalar, vector, and parallel eras (EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM BG/L), alongside 2X transistors/chip every 1.5 years.]
Milestones (floating point operations per second, Flop/s):
1941: 1; 1945: 100; 1949: 1,000 (1 KiloFlop/s, KFlop/s); 1951: 10,000; 1961: 100,000; 1964: 1,000,000 (1 MegaFlop/s, MFlop/s); 1968: 10,000,000; 1975: 100,000,000; 1987: 1,000,000,000 (1 GigaFlop/s, GFlop/s); 1992: 10,000,000,000; 1993: 100,000,000,000; 1997: 1,000,000,000,000 (1 TeraFlop/s, TFlop/s); 2000: 10,000,000,000,000; 2005: 131,000,000,000,000 (131 TFlop/s)
Now It's Up To Programmers
• Adding more processors doesn't help much if programmers aren't aware of them…
  – … or don't know how to use them.
• Serial programs don't benefit from this approach (in most cases).
59
Concluding Remarks
• The laws of physics have brought us to the doorstep of multicore technology
  – The worst or the best time to major in computer science
    • IEEE Rebooting Computing (http://rebootingcomputing.ieee.org/)
• Serial programs typically don't benefit from multiple cores.
• Automatic parallelization of serial programs isn't the most efficient approach to using multicore computers.
  – Proven not to be a viable approach
• Learning to write parallel programs involves
  – learning how to coordinate the cores.
• Parallel programs are usually very complex and therefore require sound programming techniques and development.
60
References
• Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory
  – https://computing.llnl.gov/tutorials/parallel_comp
• Some slides are adapted from notes of John Mellor-Crummey's class at Rice University and Kathy Yelick's class at Berkeley.
• Examples are from the chapter 01 slides of the book "An Introduction to Parallel Programming" by Peter Pacheco
  – Note the copyright notice
• Latest HPC news
  – http://www.hpcwire.com
• World-wide premier conference for supercomputing
  – http://www.supercomputing.org/, the week before Thanksgiving week
61
62
Vision and Wisdom by Experts
• "I think there is a world market for maybe five computers."
  – Thomas Watson, chairman of IBM, 1943.
• "There is no reason for any individual to have a computer in their home."
  – Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• "640K [of memory] ought to be enough for anybody."
  – Bill Gates, chairman of Microsoft, 1981.
• "On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
  – Ken Kennedy, CRPC Director, 1994
http://highscalability.com/blog/2014/12/31/linus-the-whole-parallel-computing-is-the-future-is-a-bunch.html
A Simple Example
• Compute n values and add them together.
• Serial solution:
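The serial code on the slide is shown as a figure from Pacheco's book; below is a minimal runnable sketch of the same idea, in which Compute_next_value is only a stand-in for whatever per-value computation the application really performs:

  #include <stdio.h>

  /* Placeholder for the application's real computation of the i-th value. */
  double Compute_next_value(int i) { return (double)(i % 10); }

  int main(void) {
      int n = 24;
      double sum = 0.0;
      for (int i = 0; i < n; i++) {
          double x = Compute_next_value(i);
          sum += x;
      }
      printf("sum = %f\n", sum);
      return 0;
  }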
63
Example (cont.)
• We have p cores, p much smaller than n.
• Each core performs a partial sum of approximately n/p values.
Each core uses its own private variables and executes this block of code independently of the other cores.
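The block of code is also shown on the slide as a figure; a sketch of what each core would execute, following Pacheco's notation (my_first_i and my_last_i bound this core's share of roughly n/p values, and my_sum is private to the core):

  /* Executed independently by every core. */
  double my_sum = 0.0;
  for (int my_i = my_first_i; my_i < my_last_i; my_i++) {
      double my_x = Compute_next_value(my_i);
      my_sum += my_x;
  }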
64
Example (cont.)
• After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.
• E.g., with 8 cores and n = 24, the calls to Compute_next_value return:
1, 4, 3, 9, 2, 8, 5, 1, 1, 5, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9
65
Example (cont.)
• Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated "master" core, which adds up the final result.
66
Example (cont.)
67
SPMD: All cores run the same program, but perform differently depending on who they are.
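The SPMD code on the slide is an image; the following is a hedged MPI-flavored sketch of the same pattern (it assumes mpi.h is included, my_rank and p come from MPI_Comm_rank and MPI_Comm_size, and my_sum is the partial sum from the previous sketch; the book presents this with generic send/receive pseudocode instead):

  /* Naive global sum: every core sends its my_sum to core 0 (the "master"),
     which receives and adds the values one at a time.  SPMD: all cores run
     this same code and branch on their own rank. */
  double global_sum = my_sum;
  if (my_rank == 0) {
      for (int q = 1; q < p; q++) {
          double value;
          MPI_Recv(&value, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          global_sum += value;
      }
  } else {
      MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
  }
  /* Only core 0's global_sum now holds the total of all p partial sums. */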
Example (cont.)
Core:    0   1   2   3   4   5   6   7
my_sum:  8  19   7  15   7  13  12  14
Global sum: 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95
After the master forms the global sum:
Core:    0   1   2   3   4   5   6   7
my_sum: 95  19   7  15   7  13  12  14
68
But wait! There's a much better way to compute the global sum.
69
Better Parallel Algorithm
• Don't make the master core do all the work.
• Share it among the other cores.
• Pair the cores so that core 0 adds its result with core 1's result.
• Core 2 adds its result with core 3's result, etc.
• Work with odd and even numbered pairs of cores.
70
Better Parallel Algorithm (cont.)
• Repeat the process, now with only the evenly ranked cores.
• Core 0 adds the result from core 2.
• Core 4 adds the result from core 6, etc.
• Now cores divisible by 4 repeat the process, and so forth, until core 0 has the final result (see the sketch below).
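A hedged sketch of this tree-structured sum, written as a drop-in replacement for the master's receive loop in the earlier MPI-flavored sketch (same assumptions about my_rank, p, and my_sum; in practice MPI_Reduce implements exactly this kind of pattern):

  double global_sum = my_sum;
  for (int step = 1; step < p; step *= 2) {
      if (my_rank % (2 * step) == 0) {
          int partner = my_rank + step;           /* receive from the core one step away */
          if (partner < p) {
              double value;
              MPI_Recv(&value, 1, MPI_DOUBLE, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              global_sum += value;
          }
      } else {
          MPI_Send(&global_sum, 1, MPI_DOUBLE, my_rank - step, 0, MPI_COMM_WORLD);
          break;                                  /* this core has passed its sum up the tree */
      }
  }
  /* Core 0 ends up with the total after about log2(p) steps. */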
71
Multiple cores forming a global sum
72
Analysis
• In the first example, the master core performs 7 receives and 7 additions.
• In the second example, the master core performs 3 receives and 3 additions.
• The improvement is more than a factor of 2!
73
Analysis (cont.)
• The difference is more dramatic with a larger number of cores.
• If we have 1000 cores:
  – The first example would require the master to perform 999 receives and 999 additions.
  – The second example would only require 10 receives and 10 additions, since the tree-structured sum takes only ceil(log2(1000)) = 10 steps.
• That's an improvement of almost a factor of 100!
74