cs 61c: great ideas in computer architecture lecture 18...

63
CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMD Krste Asanović & Randy Katz http://inst.eecs.berkeley.edu/~cs61c

Upload: others

Post on 26-Apr-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

CS61C:GreatIdeasinComputerArchitecture

Lecture18:ParallelProcessing– SIMD

KrsteAsanović &RandyKatz

http://inst.eecs.berkeley.edu/~cs61c

Page 2: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

61CSurvey

Itwouldbenicetohaveareviewlectureeveryonceinawhile,actuallyshowingus

howthingsfitinthebiggerpicture

CS61c Lecture18:ParallelProcessing- SIMD 2

Page 3: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 3

Page 4: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

61CTopicssofar…• Whatwelearned:

1. Binarynumbers2. C3. Pointers4. Assemblylanguage5. Datapath architecture6. Pipelining7. Caches8. Performanceevaluation9. Floatingpoint

• Whatdoesthisbuyus?− Promise:executionspeed− Let’scheck!

CS61c Lecture18:ParallelProcessing- SIMD 4

Page 5: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

ReferenceProblem•Matrixmultiplication

−Basicoperationinmanyengineering,data,andimagingprocessingtasks

− Imagefiltering,noisereduction,…−Manycloselyrelatedoperations

§ E.g.stereovision(project4)•dgemm

−double-precisionfloating-pointmatrixmultiplication

CS61c Lecture18:ParallelProcessing- SIMD 5

Page 6: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

ApplicationExample:DeepLearning• Imageclassification(cats…)•Pick“best”vacationphotos•Machinetranslation•Cleanupaccent•Fingerprintverification•Automaticgameplaying

CS61c Lecture18:ParallelProcessing- SIMD 6

Page 7: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Matrices

CS61c Lecture18:ParallelProcessing- SIMD 7

𝑖

𝑗N-1

N-1

0

0

• SquarematrixofdimensionNxN

Page 8: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

MatrixMultiplication

CS61c 8

𝑖

𝑗

𝑘

𝑘C=A*BCij =Σk (Aik *Bkj)

A

B

C

Page 9: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Reference:Python• MatrixmultiplicationinPython

CS61c Lecture18:ParallelProcessing- SIMD 9

N Python[Mflops]32 5.4160 5.5480 5.4960 5.3

• 1MFLOP=1Millionfloating-pointoperationspersecond(fadd,fmul)

• dgemm(N…)takes2*N3 flops

Page 10: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

C• c=a*b• a,b,careNxNmatrices

CS61c Lecture18:ParallelProcessing- SIMD 10

Page 11: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

TimingProgramExecution

CS61c Lecture18:ParallelProcessing- SIMD 11

Page 12: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

CversusPython

CS61c Lecture18:ParallelProcessing- SIMD 12

N C[GFLOPS] Python[GFLOPS]32 1.30 0.0054160 1.30 0.0055480 1.32 0.0054960 0.91 0.0053

Whichclassgivesyouthiskindofpower?Wecouldstophere…butwhy?Let’sdobetter!

240x!

Page 13: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 13

Page 14: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

WhyParallelProcessing?• CPUClockRatesarenolongerincreasing

−Technical&economicchallenges§ Advancedcoolingtechnologytooexpensiveorimpracticalformostapplications

§ Energycostsareprohibitive

• Parallelprocessingisonlypathtohigherspeed−Compareairlines:

§ Maximumspeedlimitedbyspeedofsoundandeconomics§ Usemoreandlargerairplanestoincreasethroughput§ Andsmallerseats…

CS61c Lecture18:ParallelProcessing- SIMD 14

Page 15: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

UsingParallelismforPerformance• Twobasicways:

−Multiprogramming§ runmultipleindependentprogramsinparallel§ “Easy”

−Parallelcomputing§ runoneprogramfaster§ “Hard”

•We’llfocusonparallelcomputinginthenextfewlectures

15CS61c Lecture18:ParallelProcessing- SIMD

Page 16: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

New-SchoolMachineStructures(It’sabitmorecomplicated!)

• ParallelRequestsAssignedtocomputere.g.,Search“Katz”• ParallelThreadsAssignedtocoree.g.,Lookup,Ads• ParallelInstructions>[email protected].,5pipelinedinstructions• ParallelData>[email protected].,Addof4pairsofwords• HardwaredescriptionsAllgates@onetime• ProgrammingLanguages

16

SmartPhone

WarehouseScale

Computer

SoftwareHardware

HarnessParallelism&AchieveHighPerformance

LogicGates

Core Core…Memory(Cache)Input/Output

Computer

CacheMemory

CoreInstructionUnit(s) Functional

Unit(s)A3+B3A2+B2A1+B1A0+B0

Today’sLecture

Page 17: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Single-Instruction/Single-DataStream(SISD)

• Sequentialcomputerthatexploitsnoparallelismineithertheinstructionordatastreams.ExamplesofSISDarchitecturearetraditionaluniprocessormachines

E.g.ourtrustedRISC-Vpipeline

17

ProcessingUnit

CS61c Lecture18:ParallelProcessing- SIMD

Thisiswhatwediduptonowin61C

Page 18: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Single-Instruction/Multiple-DataStream(SIMDor“sim-dee”)

• SIMDcomputerexploitsmultipledatastreamsagainstasingleinstructionstreamtooperationsthatmaybenaturallyparallelized,e.g.,IntelSIMDinstructionextensionsorNVIDIAGraphicsProcessingUnit(GPU)

18CS61c Lecture18:ParallelProcessing- SIMD

Today’stopic.

Page 19: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Multiple-Instruction/Multiple-DataStreams(MIMDor“mim-dee”)

• Multipleautonomousprocessorssimultaneouslyexecutingdifferentinstructionsondifferentdata.• MIMDarchitecturesincludemulticoreandWarehouse-ScaleComputers

19

InstructionPool

PU

PU

PU

PU

DataPoo

l

CS61c Lecture18:ParallelProcessing- SIMD

TopicofLecture19andbeyond.

Page 20: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Multiple-Instruction/Single-DataStream(MISD)

• Multiple-Instruction,Single-Datastreamcomputerthatexploitsmultipleinstructionstreamsagainstasingledatastream.• Historicalsignificance

20CS61c Lecture18:ParallelProcessing- SIMD

Thishasfewapplications.Notcoveredin61C.

Page 21: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Flynn*Taxonomy,1966

• SIMDandMIMDarecurrentlythemostcommonparallelisminarchitectures– usuallybothinsamesystem!

• Mostcommonparallelprocessingprogrammingstyle:SingleProgramMultipleData(“SPMD”)− SingleprogramthatrunsonallprocessorsofaMIMD− Cross-processorexecutioncoordinationusingsynchronizationprimitives

21CS61c Lecture18:ParallelProcessing- SIMD

Page 22: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 22

Page 23: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

SIMD– “SingleInstructionMultipleData”

23CS61c Lecture18:ParallelProcessing- SIMD

Page 24: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

SIMDApplications&Implementations• Applications

− Scientificcomputing§ Matlab,NumPy

− Graphicsandvideoprocessing§ Photoshop,…

− BigData§ Deeplearning

− Gaming− …

• Implementations− x86− ARM− RISC-Vvectorextensions

CS61c Lecture18:ParallelProcessing- SIMD 24

Page 25: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

25

FirstSIMDExtensions:MITLincolnLabsTX-2,1957

CS61c

Page 26: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

x86SIMDEvolution

CS61c Lecture18:ParallelProcessing- SIMD 26

http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf

• Newinstructions• New,wider,moreregisters• Moreparallelism

Page 27: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

LaptopCPUSpecs$ sysctl -a | grep cpuhw.physicalcpu: 2hw.logicalcpu: 4

machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz

machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0RDRAND F16C

machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS

CS61c Lecture18:ParallelProcessing- SIMD 27

Page 28: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

SIMDRegisters

CS61c Lecture18:ParallelProcessing- SIMD 28

Page 29: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

SIMDDataTypes

CS61c Lecture18:ParallelProcessing- SIMD 29

(NowalsoAVX-512available)

Page 30: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

SIMDVectorMode

CS61c Lecture18:ParallelProcessing- SIMD 30

Page 31: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 31

Page 32: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Problem• Today’scompilers(largely)donotgenerateSIMDcode• Backtoassembly…• x86(notusingRISC-Vasnovectorhardwareyet)

−Over1000instructionstolearn…−GreenBook

• Canweusethecompilertogenerateallnon-SIMDinstructions?

CS61c Lecture18:ParallelProcessing- SIMD 32

Page 33: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

x86IntrinsicsAVXDataTypes

CS61c Lecture18:ParallelProcessing- SIMD 33

Intrinsics: Directaccesstoregisters&assemblyfromCRegister

Page 34: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

IntrinsicsAVXCodeNomenclature

CS61c Lecture18:ParallelProcessing- SIMD 34

Page 35: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

https://software.intel.com/sites/landingpage/IntrinsicsGuide/CS61c Lecture18:ParallelProcessing- SIMD 35

4parallelmultiplies

2instructionsperclockcycle(CPI=0.5)

assemblyinstruction

x86SIMD“Intrinsics”

Page 36: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

RawDouble-PrecisionThroughput

Characteristic Value

CPU i7-5557U

Clockrate(sustained) 3.1GHz

Instructions perclock(mul_pd) 2

Parallel multipliesperinstruction 4

PeakdoubleFLOPS 24.8GFLOPS

CS61c Lecture18:ParallelProcessing- SIMD 36

Actualperformanceislowerbecauseofoverheadhttps://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/

Page 37: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

VectorizedMatrixMultiplication

CS61c 37

𝑖

𝑗

𝑘

𝑘InnerLoop:

fori …;i+=4forj...

i+=4

Page 38: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

“Vectorized”dgemm

CS61c Lecture18:ParallelProcessing- SIMD 38

Page 39: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Performance

NGflops

scalar avx32 1.30 4.56160 1.30 5.47480 1.32 5.27960 0.91 3.64

CS61c Lecture18:ParallelProcessing- SIMD 39

• 4xfaster• Butstill<<theoretical25GFLOPS!

Page 40: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 40

Page 41: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

AtriptoLA

Get toSFO&check-in SFOà LAX Getto destination3hours 1hour 3 hours

CS61c Lecture18:ParallelProcessing- SIMD 41

Commercialairline:

Supersonicaircraft:

Get toSFO&check-in SFOà LAX Getto destination3hours 6min 3 hours

Totaltime:7hours

Totaltime:6.1hoursSpeedup:

Flyingtime Sflight =60/6=10xTriptime Strip =7/6.1=1.15x

Page 42: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Amdahl’sLaw• GetenhancementE foryournewPC− E.g.floating-pointrocketbooster• E− Speedsupsometask(e.g.arithmetic)byfactorSE− F isfractionofprogramthatusesthis”task”

CS61c Lecture18:ParallelProcessing- SIMD 42

1-F F

1-F F/ SE

ExecutionTime:

Speedup:

T0 (noE)TE (withE)

nospeedup speedupsection

Speedup=TO /TE =1/((1-F)+F/SE)

Page 43: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

BigIdea:Amdahl’sLaw

43

Partnotspedup Partspedup

Example: Theexecutiontimeofhalf ofaprogramcanbeacceleratedbyafactorof2.Whatistheprogramspeed-upoverall?

CS61c Lecture18:ParallelProcessing- SIMD

Speedup=TO /TE =1/((1-F)+F/SE)

TO /TE =1/((1-F)+F/SE)=1/((1- 0.5)+0.5/2)=1.33<<2

Page 44: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Maximum“Achievable”Speed-Up

44

Question: Whatisareasonable#ofparallelprocessorstospeedupanalgorithmwithF=95%?(i.e.19/20th canbespedup)

a)Maximumspeedup: Smax =1/(1-0.95)=20butneedsSE=∞!

b) Reasonable“engineering”compromise:Equaltimeinsequentialandparallelcode:(1-F)=F/SE

SE =F/(1-F)=0.95/0.05=20S=1/(0.05+0.05)=10

CS61c

Speedup=1/((1-F)+F/SE)(SE->∞)=1/(1-F)

Page 45: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

45

Iftheportionoftheprogramthatcanbeparallelizedissmall,thenthespeedupislimited

Inthisregion,thesequentialportionlimitstheperformance

500processorsfor19x

20processorsfor10x

CS61c

Page 46: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

StrongandWeakScaling• Togetgoodspeeduponaparallelprocessorwhilekeepingtheproblemsizefixedisharderthangettinggoodspeedupbyincreasingthesizeoftheproblem.− Strongscaling:whenspeedupcanbeachievedonaparallelprocessorwithoutincreasingthesizeoftheproblem

− Weakscaling:whenspeedupisachievedonaparallelprocessorbyincreasingthesizeoftheproblemproportionallytotheincreaseinthenumberofprocessors

• Loadbalancingisanotherimportantfactor:everyprocessordoingsameamountofwork− Justoneunitwithtwicetheloadofotherscutsspeedupalmostinhalf

46CS61c Lecture18:ParallelProcessing- SIMD

Page 47: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

PeerInstruction

47

Supposeaprogramspends80%ofitstimeinasquare-rootroutine.Howmuchmustyouspeedupsquareroottomaketheprogramrun5timesfaster?

CS61c Lecture18:ParallelProcessing- SIMD

Speedup=1/((1-F)+F/SE)RED: 5

GREEN: 20ORANGE: 500

Noneofabove

Page 48: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Administrivia• Midterm2isTuesdayduringclasstime!

− ReviewSession:Saturday10-12pm@HearstAnnexA1−Onedouble-sidedcribsheet;we’llprovidetheRISCVcard− BringyourID!We’llcheckit− RoomassignmentspostedonPiazza

• Project2pre-labdueFriday• HW4due11/3,butadvisabletofinishbythemidterm• Project2-2due11/6,butalsogoodtofinishbeforeMT2

48CS61c Lecture19:ThreadLevalParallelProcessing

Page 49: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 49

Page 50: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Amdahl’sLawappliedtodgemm• Measureddgemm performance

− Peak 5.5GFLOPS− Largematrices 3.6GFLOPS− Processor 24.8GFLOPS

• Whyarewenotgetting(closeto)25GFLOPS?− Somethingelse(notfloating-pointALU)islimitingperformance!− Butwhat?Possibleculprits:

§ Cache§ Hazards§ Let’slookatboth!

CS61c Lecture18:ParallelProcessing- SIMD 50

Page 51: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

PipelineHazards– dgemm

CS61c Lecture18:ParallelProcessing- SIMD 51

Page 52: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

LoopUnrolling

CS61c Lecture18:ParallelProcessing- SIMD 52

Compilerdoestheunrolling

Howdoyouverifythatthegeneratedcodeisactuallyunrolled?

4registers

Page 53: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Performance

NGflops

scalar avx unroll32 1.30 4.56 12.95160 1.30 5.47 19.70480 1.32 5.27 14.50960 0.91 3.64 6.91

CS61c Lecture18:ParallelProcessing- SIMD 53

Page 54: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 54

Page 55: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

FPUversusMemoryAccess• Howmanyfloating-pointoperationsdoesmatrixmultiplytake?− F=2xN3 (N3 multiplies,N3 adds)

• Howmanymemoryload/stores?− M=3xN2 (forA,B,C)

• Manymorefloating-pointoperationsthanmemoryaccesses− q=F/M=2/3*N− Good,sincearithmeticisfasterthanmemoryaccess− Let’scheckthecode…

CS61c Lecture18:ParallelProcessing- SIMD 55

Page 56: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Butmemoryisaccessedrepeatedly

• q=F/M=1!(2loadsand2floating-pointoperations)

CS61c Lecture18:ParallelProcessing- SIMD 56

Innerloop:

Page 57: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

CS61c Lecture18:ParallelProcessing- SIMD 57

Second-LevelCache(SRAM)

TypicalMemoryHierarchyControl

Datapath

SecondaryMemory(Disk

OrFlash)

On-ChipComponents

RegFile

MainMemory(DRAM)Data

CacheInstrCache

Speed(cycles):½’s 1’s 10’s 100’s-10001,000,000’sSize(bytes): 100’s 10K’sM’sG’sT’s

• Wherearetheoperands(A,B,C)stored?• WhathappensasNincreases?• Idea:arrangethatmostaccessesaretofastcache!

Cost/bit:highest lowest

Third-LevelCache(SRAM)

Page 58: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Sub-MatrixMultiplicationor:BeatingAmdahl’sLaw

CS61c 58

Page 59: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Blocking• Idea:

− Rearrangecodetousevaluesloadedincachemanytimes−Only“few”accessestoslowmainmemory(DRAM)perfloatingpointoperation

−à throughputlimitedbyFPhardwareandcache,notslowDRAM

− P&H,RISC-Veditionp.465

CS61c Lecture18:ParallelProcessing- SIMD 59

Page 60: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

MemoryAccessBlocking

CS61c Lecture18:ParallelProcessing- SIMD 60

Page 61: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Performance

NGflops

scalar avx unroll blocking32 1.30 4.56 12.95 13.80160 1.30 5.47 19.70 21.79480 1.32 5.27 14.50 20.17960 0.91 3.64 6.91 15.82

CS61c Lecture18:ParallelProcessing- SIMD 61

Page 62: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

Agenda• 61C– thebigpicture• Parallelprocessing• Singleinstruction,multipledata• SIMDmatrixmultiplication• Amdahl’slaw• Loopunrolling• Memoryaccessstrategy- blocking• AndinConclusion,…CS61c Lecture18:ParallelProcessing- SIMD 62

Page 63: CS 61C: Great Ideas in Computer Architecture Lecture 18 ...inst.eecs.berkeley.edu/~cs61c/fa17/lec/18/L18 SIMD (1up).pdf · Great Ideas in Computer Architecture Lecture 18: Parallel

AndinConclusion,…• ApproachestoParallelism

− SISD,SIMD,MIMD(nextlecture)• SIMD

− Oneinstructionoperatesonmultipleoperandssimultaneously• Example:matrixmultiplication

− Floatingpointheavyà exploitMoore’slawtomakefast• Amdahl’sLaw:

− Serialsectionslimitspeedup− Cache

§ Blocking− Hazards

§ Loopunrolling

63CS61c Lecture18:ParallelProcessing- SIMD