Lecture 1: An Introduction to Parallel Computing
CSCE 569, Spring 2018
Department of Computer Science and Engineering
Yonghong Yan
[email protected]    http://cse.sc.edu/~yanyh
1
Course Information
• Meeting Time: 9:40 AM - 10:55 AM, Monday and Wednesday
• Class Room: 2A15, Swearingen Engineering Center, 301 Main St, Columbia, SC 29208
• Grade: 60% for four homeworks + 40% for two exams
• Instructor: Yonghong Yan
  – http://cse.sc.edu/~yanyh, [email protected]
  – Office: Room 2211, Storey Innovation Center (Horizon II), 550 Assembly St, Columbia, SC 29201
  – Tel: 803-777-7361
  – Office Hours: 11:00 AM - 12:30 PM (after class) or by appointment
• Public course website: http://passlab.github.io/CSCE569
• Homework submission: https://dropbox.cse.sc.edu
• See the syllabus or website for more details
2
Objectives
• Learn fundamentals of concurrent and parallel computing
  – Describe benefits and applications of parallel computing.
  – Explain architectures of multicore CPUs, GPUs, and HPC clusters
    • Including the key concepts in parallel computer architectures, e.g. shared memory systems, distributed systems, NUMA and cache coherence, interconnection
  – Understand principles of parallel and concurrent program design, e.g. decomposition of work, task and data parallelism, processor mapping, mutual exclusion, locks.
• Develop skills in writing and analyzing parallel programs
  – Write parallel programs using the OpenMP, CUDA, and MPI programming models.
  – Perform analysis of parallel programs and problems.
3
• Lots of materials on the Internet.
  – On the website, there is a "Resources" section that provides web page links, documents, and other materials for this course
Textbooks
4
• Required: Introduction to Parallel Computing (2nd Edition), PDF, Amazon; covers theory and an introduction to MPI and OpenMP; by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Addison-Wesley, 2003
• Recommended: John Cheng, Max Grossman, and Ty McKercher, Professional CUDA C Programming, 1st Edition, 2014, PDF, Amazon.
• Reference book for OpenMP: Barbara Chapman, Gabriele Jost, and Ruud van der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, 2007, PDF, Amazon.
• Reference book for MPI: choose from the Recommended Books for MPI
Homeworks and Exams
• Four homeworks: practice programming skills
  – Require both good and correct programming
    • Write organized programs that are easy to read
  – Report and discuss your findings in a report
    • Write good documentation
  – 60% total (10% + 10% + 20% + 20%)
• Exams: test fundamentals
  – Closed/open book (?)
  – 40% total
    • Midterm: 15%, March 7th, Wednesday, during class
      – The week before spring break.
    • Final exam: 25%, May 2nd, Wednesday, 9:00 AM - 11:30 AM
5
Machines for Development with OpenMP and MPI
• Linux machines in Swearingen 1D39 and 3D22
  – All CSCE students by default have access to these machines using their standard login credentials
    • Let me know if you, CSCE or not, cannot access them
  – Remote access is also available via SSH over port 222. The naming scheme is as follows:
    • l-1d39-01.cse.sc.edu through l-1d39-26.cse.sc.edu
    • l-3d22-01.cse.sc.edu through l-3d22-20.cse.sc.edu
• Users are restricted to 2 GB of data in their home folder (~/).
  – For more space, create a directory in /scratch on the login machine; however, that data is not shared and will only be available on that specific machine.
6
PuTTY SSH Connection on Windows
7
Host: l-1d39-08.cse.sc.edu    Port: 222
SSH Connection from Linux/Mac OS X Terminal
8
-X enables X-window forwarding so you can use the graphics display on your computer; for example, a command of the form ssh -X -p 222 [email protected] (your_username being your CSE login) should work. For Mac OS X, you need to have X server software installed; e.g. XQuartz (https://www.xquartz.org/) is the one I use.
Try in the Lab and from Remote
• Bring your laptop
9
Topics
• Introduction
• Programming on shared memory systems (Chapter 7)
  – OpenMP
  – Pthreads, mutual exclusion, locks, synchronizations
  – Cilk/Cilkplus (?)
• Principles of parallel algorithm design (Chapter 3)
• Analysis of parallel program executions (Chapter 5)
  – Performance metrics for parallel systems
    • Execution time, overhead, speedup, efficiency, cost
  – Scalability of parallel systems
  – Use of performance tools
10
Topics
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel (?)
• Parallel architectures and hardware
  – Parallel computer architectures
  – Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
  – GPU architectures
  – CUDA programming
  – Introduction to the offloading model in OpenMP (?)
• Parallel algorithms (Chapters 8, 9 & 10)
  – Dense linear algebra, stencil and image processing
11
Prerequisites
• Good reasoning and analytical skills
• Familiarity with and skills in C/C++ programming
  – macros, pointers, arrays, structs, unions, function pointers, etc.
• Familiarity with the Linux environment
  – SSH, Linux commands, vim/Emacs editor
• Basic knowledge of computer architecture and data structures
  – Memory hierarchy, cache, virtual addresses
  – Arrays and linked lists
• Talk with me if you have concerns
• Turn in the survey
12
Introduction: What Is and Why Parallel Computing
13
An Example: Grading
14
15 questions, 300 exams
From An Introduction to Parallel Programming, By Peter Pacheco, Morgan Kaufmann Publishers Inc, Copyright © 2010, Elsevier Inc. All rights Reserved
Three Teaching Assistants
• To grade 300 copies of the exam, each of which has 15 questions
15
TA #1    TA #2    TA #3
Division of Work – Data Parallelism
• Each does the same type of work (task), but works on different sheets (data)
16
TA #1: 100 exams
TA #2: 100 exams
TA #3: 100 exams
Division of Work – Task Parallelism
• Each does a different type of work (task), but works on the same sheets (data)
17
TA #1: Questions 1-5
TA #2: Questions 6-10
TA #3: Questions 11-15
Summary
• Data: 300 copies of the exam
• Task: grade a total of 300 * 15 questions
• Data parallelism
  – Distribute the 300 copies to the three TAs
  – They work independently
• Task parallelism
  – Distribute the 300 copies to the three TAs
  – Each grades 5 questions on 100 copies
  – Exchange copies
  – Grade 5 questions again
  – Exchange copies
  – Grade 5 questions
• The three TAs can work in parallel, so theoretically we can achieve a 3x speedup
18
Which approach could be faster?
Challenges
• Are the three TAs grading with the same performance?
  – One CPU may be slower than the other
  – They may not work on grading at the same time
• How do the TAs communicate?
  – Are they sitting at the same table, or does each take copies and grade from home? How do they share intermediate results (task parallelism)?
• Where are the solutions stored so they can refer to them when grading?
  – Remembering answers to 5 questions vs. to 15 questions
• Cache and memory issues
19
What is Parallel Computing?
• A form of computation*:
  – Large problems are divided into smaller ones
  – The smaller ones are carried out and solved simultaneously
• Uses more than one CPU or core concurrently for one program
  – Not conventional time-sharing: multiple programs switching between each other on one CPU
  – Or multiple programs each on its own CPU and not interacting
• Serial processing
  – Some programs, or parts of a program, are inherently serial
  – Most of our programs and desktop applications
* http://en.wikipedia.org/wiki/Parallel_computing
20
Why Parallel Computing?
• Save time (execution time) and money!
  – A parallel program can run faster if it runs concurrently instead of sequentially.
• Solve larger and more complex problems!
  – Utilize more computational resources
From "21st Century Grand Challenges | The White House", http://www.whitehouse.gov/administration/eop/ostp/grand-challenges
Grand challenges: http://en.wikipedia.org/wiki/Grand_Challenges
21
Picture from: Intro to Parallel Computing: https://computing.llnl.gov/tutorials/parallel_comp
High Performance Computing (HPC) and Parallel Computing
• HPC is what is really needed*
  – Parallel computing is so far the only way to get there!!
• Parallel computing makes sense!
• Applications that require HPC
  – Many problem domains are naturally parallelizable
  – Data cannot fit in the memory of one machine
• Computer systems
  – Physics limitation: they have to be built parallel
  – Parallel systems are widely accessible
    • A smartphone has 2 to 4 cores + a GPU now
22
* What is HPC: http://insidehpc.com/hpc-basic-training/what-is-hpc/
Supercomputer: http://en.wikipedia.org/wiki/Supercomputer
TOP500 (500 most powerful computer systems in the world): http://en.wikipedia.org/wiki/TOP500, http://top500.org/
HPC matters: http://sc14.supercomputing.org/media/social-media
We will discuss each of the two aspects today!
Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build the system.
• Limitations of experiments:
  – Too difficult -- build large wind tunnels.
  – Too expensive -- build a throw-away passenger jet.
  – Too slow -- wait for climate or galactic evolution.
  – Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high performance computer systems to simulate the phenomenon
    • Based on known physical laws and efficient numerical methods.
23
From slides of Kathy Yelick's 2007 course at Berkeley: http://www.cs.berkeley.edu/~yelick/cs267_sp07/
Applications: Science and Engineering
• Model many difficult problems by parallel computing
  – Atmosphere, Earth, Environment
  – Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
  – Bioscience, Biotechnology, Genetics
  – Chemistry, Molecular Sciences
  – Geology, Seismology
  – Mechanical Engineering - from prosthetics to spacecraft
  – Electrical Engineering, Circuit Design, Microelectronics
  – Computer Science, Mathematics
  – Defense, Weapons
24
Applications: Industrial and Commercial
• Processing large amounts of data in sophisticated ways
  – Databases, data mining
  – Oil exploration
  – Medical imaging and diagnosis
  – Pharmaceutical design
  – Financial and economic modeling
  – Management of national and multi-national corporations
  – Advanced graphics and virtual reality, particularly in the entertainment industry
  – Networked video and multi-media technologies
  – Collaborative work environments
  – Web search engines, web based business services
25
Economic Impact of HPC
• Airlines:
  – System-wide logistics optimization systems on parallel systems.
  – Savings: approx. $100 million per airline per year.
• Automotive design:
  – Major automotive companies use large systems (500+ CPUs) for:
    • CAD-CAM, crash testing, structural integrity and aerodynamics.
    • One company has a 500+ CPU parallel system.
  – Savings: approx. $1 billion per company per year.
• Semiconductor industry:
  – Semiconductor firms use large systems (500+ CPUs) for
    • device electronics simulation and logic validation
  – Savings: approx. $1 billion per company per year.
• Securities industry:
  – Savings: approx. $15 billion per year for U.S. home mortgages.
From slides of Kathy Yelick's 2007 course at Berkeley: http://www.cs.berkeley.edu/~yelick/cs267_sp07/
26
Inherent Parallelism of Applications
• Example: weather prediction and global climate modeling
27
Global Climate Modeling Problem
• Problem is to compute:
  – f(latitude, longitude, elevation, time) → temperature, pressure, humidity, wind velocity
• Approach:
  – Discretize the domain, e.g., a measurement point every 10 km
  – Devise an algorithm to predict weather at time t+dt given t
• Uses:
  – Predict major events, e.g., El Niño
  – Air quality forecasting
28
The Rise of Multicore Processors
29
Recent Multicore Processors
30
Recent Manycore GPU Processors
31
An Overview of the GK110 Kepler Architecture
Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:
• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
• Hardware support throughout the design to enable new programming model capabilities

[Figure: Kepler GK110 full-chip block diagram]

Streaming Multiprocessor (SMX) Architecture
Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power efficient.

[Figure: SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).]

Kepler Memory Subsystem / L1, L2, ECC
Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache
In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

48 KB Read-Only Data Cache
In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

• ~3k cores
Units of Measure in HPC
• Flop: floating point operation (*, /, +, -, etc.)
• Flop/s: floating point operations per second, also written as FLOPS
• Bytes: size of data
  – A double precision floating point number is 8 bytes
• Typical sizes are millions, billions, trillions…
  – Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1,048,576 ~ 10^6 bytes
  – Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
  – Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
  – Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
  – Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
  – Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
• See www.top500.org for the units of the fastest machines, measured using the High Performance LINPACK (HPL) Benchmark
  – The fastest: Sunway TaihuLight, ~93 petaflop/s
  – The third (fastest in the US): DoE ORNL Titan, 17.59 petaflop/s
32
How to Measure and Calculate Performance (FLOPS)
33
https://passlab.github.io/CSCE569/resources/sum.c
• Calculate #FLOPs (2*N or 3*N)
  – Check the loop count (N) and the FLOPs per loop iteration (2 or 3).
• Measure the time to compute using a timer
  – elapsed and elapsed_2 are in seconds
• FLOPS = #FLOPs / Time
  – MFLOPS in the example (see the sketch below)
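The code at the URL above appears on the slide only as an image; the following is a minimal sketch of the same measurement idea, not the actual sum.c (the array initialization, the 2-FLOPs-per-iteration loop body, and the read_timer helper are illustrative assumptions):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  /* Wall-clock time in seconds. */
  double read_timer(void) {
      struct timeval t;
      gettimeofday(&t, NULL);
      return t.tv_sec + t.tv_usec * 1.0e-6;
  }

  int main(int argc, char *argv[]) {
      int N = (argc > 1) ? atoi(argv[1]) : 10000000;
      double *A = malloc(N * sizeof(double));
      for (int i = 0; i < N; i++) A[i] = i * 0.5;

      double elapsed = read_timer();
      double sum = 0.0;
      for (int i = 0; i < N; i++)
          sum += A[i] * 3.0;               /* 2 FLOPs per iteration: one * and one + */
      elapsed = read_timer() - elapsed;

      double mflops = (2.0 * N) / (elapsed * 1.0e6);   /* FLOPS = #FLOPs / time */
      printf("sum = %f, time = %f s, %.2f MFLOPS\n", sum, elapsed, mflops);
      free(A);
      return 0;
  }

Compile with something like gcc -O2 -std=c99 sum.c -o sum and run ./sum 100000000 to see how the measured MFLOPS changes with N.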
High Performance LINPACK (HPL) Benchmark Performance (Rmax) in Top500
• Measured using the High Performance LINPACK (HPL) Benchmark, which solves a dense system of linear equations → used to rank the machines
  – Ax = b
  – https://www.top500.org/project/linpack/
  – https://en.wikipedia.org/wiki/LINPACK_benchmarks
34
Top500 (www.top500.org), Nov 2017
35
HPC Peak Performance (Rpeak) Calculation
• Node performance in Gflop/s = (CPU speed in GHz) x (number of CPU cores) x (CPU instructions per cycle) x (number of CPUs per node)
  – CPU instructions per cycle (IPC) = #Flops per cycle
    • Because a pipelined CPU can do one instruction per cycle
    • 4 or 8 for most CPUs (Intel or AMD)
  – http://www.calcverter.com/calculation/CPU-peak-theoretical-performance.php
• HPC Peak (Rpeak) = #nodes * node performance in GFlops
36
CPU Peak Performance Examples
• Intel X5600 series CPUs and AMD 6100/6200/6300 series CPUs have 4 instructions per cycle; Intel E5-2600 series CPUs have 8 instructions per cycle
• Example 1: Dual-CPU server based on Intel X5675 (3.06 GHz, 6 cores) CPUs:
  – 3.06 x 6 x 4 x 2 = 146.88 GFLOPS
• Example 2: Dual-CPU server based on Intel E5-2670 (2.6 GHz, 8 cores) CPUs:
  – 2.6 x 8 x 8 x 2 = 332.8 GFLOPS
  – With 8 nodes: 332.8 GFLOPS x 8 = 2,662.4 GFLOPS = 2.66 TFLOPS
• Example 3: Dual-CPU server based on AMD 6176 (2.3 GHz, 12 cores) CPUs:
  – 2.3 x 12 x 4 x 2 = 220.8 GFLOPS
• Example 4: Dual-CPU server based on AMD 6274 (2.2 GHz, 16 cores) CPUs:
  – 2.2 x 16 x 4 x 2 = 281.6 GFLOPS
(A small code sketch of this calculation follows below.)
https://saiclearning.wordpress.com/2014/04/08/how-to-calculate-peak-theoretical-performance-of-a-cpu-based-hpc-system/
37
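To make the Rpeak formula concrete, here is a small hedged sketch (the function name node_peak_gflops is made up for illustration; the numbers plugged in are Example 2 above):

  #include <stdio.h>

  /* Theoretical node peak in GFLOPS:
     (GHz) x (cores per CPU) x (flops per cycle) x (CPUs per node) */
  double node_peak_gflops(double ghz, int cores, int flops_per_cycle, int cpus_per_node) {
      return ghz * cores * flops_per_cycle * cpus_per_node;
  }

  int main(void) {
      /* Example 2: dual Intel E5-2670, 2.6 GHz, 8 cores, 8 flops per cycle */
      double node = node_peak_gflops(2.6, 8, 8, 2);        /* 332.8 GFLOPS */
      int nodes = 8;
      printf("Node peak: %.1f GFLOPS\n", node);
      printf("Rpeak with %d nodes: %.1f GFLOPS = %.2f TFLOPS\n",
             nodes, node * nodes, node * nodes / 1000.0);   /* 2662.4 GFLOPS */
      return 0;
  }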
Performance (HPL) Development Over the Years of Top500 Machines
38
4 Kinds of Rankings of HPC/Supercomputers
1. Top500: according to the measured High Performance LINPACK (HPL) Benchmark performance
  – Not peak performance, not other applications
2. Ranking according to HPCG benchmark performance
3. Graph500: ranking according to graph processing capability
  – Shortest Path and Breadth First Search
  – https://graph500.org
4. Green500: ranking according to power efficiency (GFLOPS/Watt)
  – https://www.top500.org/green500/
  – The sublists in the following slides were generated from https://www.top500.org/statistics/sublist/
39
HPCG Ranking
• HPCG: High Performance Conjugate Gradients (HPCG) Benchmark (http://www.hpcg-benchmark.org/)
40
Graph500 (https://graph500.org)
• Ranking according to the capability of processing large-scale graphs (Shortest Path and Breadth First Search)
41
Green500: Power Efficiency (GFLOPS/Watt)
42
• Power efficiency = HPL performance / power
  – E.g. TaihuLight, #1 of Top500: 93,014.6 / 15,371 = 6.051 GFLOPS/Watt
• https://www.top500.org/green500/
Green500: Power Efficiency (GFLOPS/Watt)
43
• https://www.top500.org/green500/
Performance Efficiency
• HPC performance efficiency = actual measured performance (GFLOPS) / theoretical peak performance (GFLOPS)
  – E.g. #1 in Top500:
    • 93,014.6 / 125,435.9 = 74.2%
44
https://www.penguincomputing.com/company/blog/calculate-hpc-efficiency/
HPL Performance Efficiency of Top500 (2015 list)
• Mostly 40% - 90% (OK)
45
HPCG Efficiency of Top 70 of Top500 (2015 list)
• Mostly below 5%, and only some around 10%
46
Ranking Summary
• High Performance LINPACK (HPL) for Top500
  – Dense linear algebra (Ax = b), highly computation intensive
  – Ranks the Top500 by absolute computation capability
• HPCG: High Performance Conjugate Gradients (HPCG) Benchmark, an HPL alternative
  – Sparse matrix-vector multiplication, balanced memory and computation intensity
  – Ranks machines with regard to the combination of computation and memory performance
• Graph500: Shortest Path and Breadth First Search
  – Ranking according to the capability of processing large-scale graphs
  – Stresses network and memory systems
• Green500 of Top500 (HPL GFlops/Watt)
  – Power efficiency
47
Why is parallel computing, namely multicore, manycore and clusters, the only way, so far, for high performance?
48
Semiconductor Trend: "Moore's Law"
Gordon Moore, Founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch^2 in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised - circuit complexity doubles every two years
Image credit: Intel
49
Microprocessor Transistor Counts 1971-2011 & Moore's Law
50
https://en.wikipedia.org/wiki/Transistor_count
Moore's Law Trends
• More transistors = more opportunities for exploiting parallelism at the instruction level (ILP)
  – Pipeline, superscalar, VLIW (Very Long Instruction Word), SIMD (Single Instruction Multiple Data) or vector, speculation, branch prediction
• General path of scaling
  – Wider instruction issue, longer pipeline
  – More speculation
  – More and larger registers and caches
• Increasing circuit density ~= increasing frequency ~= increasing performance
• Transparent to users
  – An easy job of getting better performance: buying faster processors (higher frequency)
• We have enjoyed this free lunch for several decades, however…
51
Problems of Traditional ILP Scaling
• Fundamental circuit limitations [1]
  – delays ⇑ as issue queues ⇑ and multi-port register files ⇑
  – increasing delays limit performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
  – inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies
52
[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
ILP Impacts
53
Simulations of 8-issue Superscalar
54
Power/Heat Density Limits Frequency
55
• Some fundamental physical limits are being reached
We Will Have This…
56
Revolution Happened Already
• Chip density is continuing to increase ~2x every 2 years
  – Clock speed is not
  – The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) to be found
• Parallelism must be exposed to and managed by software
  – No free lunch
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
The Trends
[Figure: performance of the leading machines from 1950 to 2010, rising from about 1 KFlop/s to 1 PFlop/s through the scalar, super scalar, vector, and parallel eras (EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, IBM 360/195, CDC 7600, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM BG/L), alongside 2X transistors/chip every 1.5 years.]
Milestones (floating point operations per second, Flop/s):
1941: 1; 1945: 100; 1949: 1,000 (1 KiloFlop/s, KFlop/s); 1951: 10,000; 1961: 100,000; 1964: 1,000,000 (1 MegaFlop/s, MFlop/s); 1968: 10,000,000; 1975: 100,000,000; 1987: 1,000,000,000 (1 GigaFlop/s, GFlop/s); 1992: 10,000,000,000; 1993: 100,000,000,000; 1997: 1,000,000,000,000 (1 TeraFlop/s, TFlop/s); 2000: 10,000,000,000,000; 2005: 131,000,000,000,000 (131 TFlop/s)
Now It's Up To Programmers
• Adding more processors doesn't help much if programmers aren't aware of them…
  – … or don't know how to use them.
• Serial programs don't benefit from this approach (in most cases).
59
Concluding Remarks
• The laws of physics have brought us to the doorstep of multicore technology
  – The worst or the best time to major in computer science
    • IEEE Rebooting Computing (http://rebootingcomputing.ieee.org/)
• Serial programs typically don't benefit from multiple cores.
• Automatic parallelization of serial programs isn't the most efficient approach to using multicore computers.
  – Proven not to be a viable approach
• Learning to write parallel programs involves
  – learning how to coordinate the cores.
• Parallel programs are usually very complex and therefore require sound programming techniques and development.
60
References
• Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory
  – https://computing.llnl.gov/tutorials/parallel_comp
• Some slides are adapted from notes of John Mellor-Crummey's class at Rice University and Kathy Yelick's class at Berkeley.
• Examples are from the chapter 01 slides of the book "An Introduction to Parallel Programming" by Peter Pacheco
  – Note the copyright notice
• Latest HPC news
  – http://www.hpcwire.com
• World-wide premier conference for supercomputing
  – http://www.supercomputing.org/, the week before Thanksgiving week
61
62
Vision and Wisdom by Experts
• "I think there is a world market for maybe five computers."
  – Thomas Watson, chairman of IBM, 1943.
• "There is no reason for any individual to have a computer in their home."
  – Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• "640K [of memory] ought to be enough for anybody."
  – Bill Gates, chairman of Microsoft, 1981.
• "On several recent occasions, I have been asked whether parallel computing will soon be relegated to the trash heap reserved for promising technologies that never quite make it."
  – Ken Kennedy, CRPC Director, 1994
http://highscalability.com/blog/2014/12/31/linus-the-whole-parallel-computing-is-the-future-is-a-bunch.html
A Simple Example
• Compute n values and add them together.
• Serial solution:
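The serial code on the slide is shown as a figure from Pacheco's book; below is a minimal runnable sketch of the same idea, in which Compute_next_value is only a stand-in for whatever per-value computation the application really performs:

  #include <stdio.h>

  /* Placeholder for the application's real computation of the i-th value. */
  double Compute_next_value(int i) { return (double)(i % 10); }

  int main(void) {
      int n = 24;
      double sum = 0.0;
      for (int i = 0; i < n; i++) {
          double x = Compute_next_value(i);
          sum += x;
      }
      printf("sum = %f\n", sum);
      return 0;
  }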
63
Example (cont.)
• We have p cores, p much smaller than n.
• Each core performs a partial sum of approximately n/p values.
Each core uses its own private variables and executes this block of code independently of the other cores.
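The block of code is also shown on the slide as a figure; a sketch of what each core would execute, following Pacheco's notation (my_first_i and my_last_i bound this core's share of roughly n/p values, and my_sum is private to the core):

  /* Executed independently by every core. */
  double my_sum = 0.0;
  for (int my_i = my_first_i; my_i < my_last_i; my_i++) {
      double my_x = Compute_next_value(my_i);
      my_sum += my_x;
  }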
64
Example (cont.)
• After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value.
• E.g., with 8 cores and n = 24, the calls to Compute_next_value return:
1, 4, 3, 9, 2, 8, 5, 1, 1, 5, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5, 1, 2, 3, 9
65
Example (cont.)
• Once all the cores are done computing their private my_sum, they form a global sum by sending their results to a designated "master" core, which adds up the final result.
66
Example (cont.)
67
SPMD: All cores run the same program, but perform differently depending on who they are.
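The SPMD code on the slide is an image; the following is a hedged MPI-flavored sketch of the same pattern (it assumes mpi.h is included, my_rank and p come from MPI_Comm_rank and MPI_Comm_size, and my_sum is the partial sum from the previous sketch; the book presents this with generic send/receive pseudocode instead):

  /* Naive global sum: every core sends its my_sum to core 0 (the "master"),
     which receives and adds the values one at a time.  SPMD: all cores run
     this same code and branch on their own rank. */
  double global_sum = my_sum;
  if (my_rank == 0) {
      for (int q = 1; q < p; q++) {
          double value;
          MPI_Recv(&value, 1, MPI_DOUBLE, q, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          global_sum += value;
      }
  } else {
      MPI_Send(&my_sum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
  }
  /* Only core 0's global_sum now holds the total of all p partial sums. */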
Example (cont.)
Core:    0   1   2   3   4   5   6   7
my_sum:  8  19   7  15   7  13  12  14
Global sum: 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95
After the master forms the global sum:
Core:    0   1   2   3   4   5   6   7
my_sum: 95  19   7  15   7  13  12  14
68
But wait! There's a much better way to compute the global sum.
69
Better Parallel Algorithm
• Don't make the master core do all the work.
• Share it among the other cores.
• Pair the cores so that core 0 adds its result with core 1's result.
• Core 2 adds its result with core 3's result, etc.
• Work with odd and even numbered pairs of cores.
70
Better Parallel Algorithm (cont.)
• Repeat the process, now with only the evenly ranked cores.
• Core 0 adds the result from core 2.
• Core 4 adds the result from core 6, etc.
• Now cores divisible by 4 repeat the process, and so forth, until core 0 has the final result (see the sketch below).
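A hedged sketch of this tree-structured sum, written as a drop-in replacement for the master's receive loop in the earlier MPI-flavored sketch (same assumptions about my_rank, p, and my_sum; in practice MPI_Reduce implements exactly this kind of pattern):

  double global_sum = my_sum;
  for (int step = 1; step < p; step *= 2) {
      if (my_rank % (2 * step) == 0) {
          int partner = my_rank + step;           /* receive from the core one step away */
          if (partner < p) {
              double value;
              MPI_Recv(&value, 1, MPI_DOUBLE, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              global_sum += value;
          }
      } else {
          MPI_Send(&global_sum, 1, MPI_DOUBLE, my_rank - step, 0, MPI_COMM_WORLD);
          break;                                  /* this core has passed its sum up the tree */
      }
  }
  /* Core 0 ends up with the total after about log2(p) steps. */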
71
Multiple cores forming a global sum
72
Analysis
• In the first example, the master core performs 7 receives and 7 additions.
• In the second example, the master core performs 3 receives and 3 additions.
• The improvement is more than a factor of 2!
73
Analysis (cont.)
• The difference is more dramatic with a larger number of cores.
• If we have 1000 cores:
  – The first example would require the master to perform 999 receives and 999 additions.
  – The second example would only require 10 receives and 10 additions, since the tree-structured sum takes only ceil(log2(1000)) = 10 steps.
• That's an improvement of almost a factor of 100!
74