![Page 1: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/1.jpg)
CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Bernhard Boser & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c
![Page 2: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/2.jpg)
61C Survey
It would be nice to have a review lecture every once in a while, actually showing us how things fit in the bigger picture
CS61c Lecture 18: Parallel Processing - SIMD
![Page 3: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/3.jpg)
Agenda
• 61C – the big picture
• Parallel processing
• Single instruction, multiple data
• SIMD matrix multiplication
• Amdahl's law
• Loop unrolling
• Memory access strategy - blocking
• And in Conclusion, …
![Page 4: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/4.jpg)
61C Topics so far …
• What we learned:
  1. Binary numbers
  2. C
  3. Pointers
  4. Assembly language
  5. Datapath architecture
  6. Pipelining
  7. Caches
  8. Performance evaluation
  9. Floating point
• What does this buy us?
  − Promise: execution speed
  − Let's check!
![Page 5: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/5.jpg)
Reference Problem
• Matrix multiplication
  − Basic operation in many engineering, data, and imaging processing tasks
  − Image filtering, noise reduction, …
  − Many closely related operations
    § E.g. stereo vision (project 4)
• dgemm
  − double precision floating point matrix multiplication
![Page 6: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/6.jpg)
Application Example: Deep Learning
• Image classification (cats …)
• Pick "best" vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing
![Page 7: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/7.jpg)
Matrices
• Square (or rectangular) N x N array of numbers
  − Dimension N

$$C = A \cdot B$$

$$c_{ij} = \sum_{k} a_{ik} b_{kj}$$

[Figure: matrix with row index $i$ and column index $j$, each running from 0 to N-1, with element $c_{ij}$ marked]
![Page 8: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/8.jpg)
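Matrix Multiplication

$$C = A \cdot B, \qquad c_{ij} = \sum_{k} a_{ik} b_{kj}$$

[Figure: matrices labeled with row index $i$, column index $j$, and summation index $k$]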
![Page 9: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/9.jpg)
Reference: Python
• Matrix multiplication in Python

| N | Python [Mflops] |
|---|---|
| 32 | 5.4 |
| 160 | 5.5 |
| 480 | 5.4 |
| 960 | 5.3 |

• 1 Mflop = 1 million floating point operations per second (fadd, fmul)
• dgemm(N …) takes $2 N^3$ flops
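The slide's own Python listing is not part of this transcript. As a rough sketch of the kind of code such a measurement might use, the block below multiplies two N x N matrices with a plain triple loop and converts the elapsed time into Mflops using the $2 N^3$ flop count. The function names, the flat row-major storage, and the fill values are assumptions made for illustration.

```python
import time

def dgemm(n, a, b, c):
    """Compute c += a * b for n x n matrices stored as flat row-major lists."""
    for i in range(n):
        for j in range(n):
            cij = c[i * n + j]
            for k in range(n):
                cij += a[i * n + k] * b[k * n + j]  # 1 multiply + 1 add
            c[i * n + j] = cij

def measure_mflops(n):
    """Time one dgemm call and convert its 2*n^3 flops into Mflops."""
    a = [float(x % 7) for x in range(n * n)]  # arbitrary fill values
    b = [float(x % 5) for x in range(n * n)]
    c = [0.0] * (n * n)
    start = time.perf_counter()
    dgemm(n, a, b, c)
    elapsed = time.perf_counter() - start
    return 2 * n**3 / elapsed / 1e6

if __name__ == "__main__":
    for n in (32, 64):
        print(n, round(measure_mflops(n), 1), "Mflops")
```

The absolute numbers depend on the machine and the Python version, so they need not match the table.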
![Page 10: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/10.jpg)
C
• c = a x b
• a, b, c are N x N matrices
![Page 11: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/11.jpg)
Timing Program Execution
![Page 12: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/12.jpg)
C versus Python

| N | C [Gflops] | Python [Gflops] |
|---|---|---|
| 32 | 1.30 | 0.0054 |
| 160 | 1.30 | 0.0055 |
| 480 | 1.32 | 0.0054 |
| 960 | 0.91 | 0.0053 |

240x!
Which class gives you this kind of power? We could stop here … but why? Let's do better!
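The "240x" figure can be checked directly from the measured values. The short sketch below divides the C rate by the Python rate for each N; the dictionary simply restates the numbers from the table, and the variable names are illustrative.

```python
# Gflops values restated from the table: N -> (C, Python)
measured = {
    32: (1.30, 0.0054),
    160: (1.30, 0.0055),
    480: (1.32, 0.0054),
    960: (0.91, 0.0053),
}

def speedup(c_gflops, python_gflops):
    """Ratio of the C rate to the Python rate."""
    return c_gflops / python_gflops

ratios = {n: speedup(c, py) for n, (c, py) in measured.items()}
for n, r in ratios.items():
    print(f"N={n}: C is about {r:.0f}x faster than Python")
```

For N = 32 the ratio comes out near 240, and it drops for N = 960, where the C rate itself falls to 0.91 Gflops.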
![Page 14: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/14.jpg)
Why Parallel Processing?
• CPU clock rates are no longer increasing
  − Technical & economic challenges
    § Advanced cooling technology too expensive or impractical for most applications
    § Energy costs are prohibitive
• Parallel processing is only path to higher speed
  − Compare airlines:
    § Maximum speed limited by speed of sound and economics
    § Use more and larger airplanes to increase throughput
    § And smaller seats …
![Page 15: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/15.jpg)
Using Parallelism for Performance
• Two basic ways:
  − Multiprogramming
    § run multiple independent programs in parallel
    § "Easy"
  − Parallel computing
    § run one program faster
    § "Hard"
• We'll focus on parallel computing in the next few lectures
![Page 16: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/16.jpg)
New-School Machine Structures (It's a bit more complicated!)
• Parallel Requests: assigned to computer, e.g., Search "Katz"
• Parallel Threads: assigned to core, e.g., Lookup, Ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., Add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages
[Figure: software/hardware stack titled "Harness Parallelism & Achieve High Performance", showing Smart Phone and Warehouse Scale Computer; Computer (Core … Core, Memory (Cache), Input/Output); Core (Instruction Unit(s), Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3); Cache Memory; Logic Gates; with a "Today's Lecture" callout]
![Page 17: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/17.jpg)
Single-Instruction/Single-Data Stream (SISD)
• Sequential computer that exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines
  − E.g. our trusted MIPS
[Figure: a single Processing Unit]
This is what we did up to now in 61C
![Page 18: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/18.jpg)
Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")
• SIMD computer exploits multiple data streams against a single instruction stream to perform operations that may be naturally parallelized, e.g., Intel SIMD instruction extensions or NVIDIA Graphics Processing Unit (GPU)
Today's topic.
![Page 19: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/19.jpg)
Multiple-Instruction/Multiple-Data Streams (MIMD or "mim-dee")
• Multiple autonomous processors simultaneously executing different instructions on different data.
• MIMD architectures include multicore and Warehouse-Scale Computers
[Figure: Instruction Pool and Data Pool connected to four processing units (PU)]
Topic of Lecture 19 and beyond.
![Page 20: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/20.jpg)
Multiple-Instruction/Single-Data Stream (MISD)
• Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
• Historical significance
This has few applications. Not covered in 61C.
![Page 21: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/21.jpg)
Flynn* Taxonomy, 1966
• SIMD and MIMD are currently the most common parallelism in architectures – usually both in same system!
• Most common parallel processing programming style: Single Program Multiple Data ("SPMD")
  − Single program that runs on all processors of a MIMD
  − Cross-processor execution coordination using synchronization primitives
![Page 23: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/23.jpg)
SIMD – "Single Instruction Multiple Data"
![Page 24: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/24.jpg)
SIMD Applications & Implementations
• Applications
  − Scientific computing
    § Matlab, NumPy
  − Graphics and video processing
    § Photoshop, …
  − Big Data
    § Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − …
![Page 25: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/25.jpg)
First SIMD Extensions: MIT Lincoln Labs TX-2, 1957
![Page 26: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/26.jpg)
x86 SIMD Evolution
http://svmoore.pbworks.com/w/file/fetch/70583970/VectorOps.pdf
• New instructions
• New, wider, more registers
• More parallelism
![Page 27: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/27.jpg)
CPU Specs (Bernhard's Laptop)
$ sysctl -a | grep cpu
hw.physicalcpu: 2
hw.logicalcpu: 4
machdep.cpu.brand_string: Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS
![Page 28: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/28.jpg)
SIMD Registers
![Page 29: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/29.jpg)
SIMD Data Types
![Page 30: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/30.jpg)
SIMD Vector Mode
![Page 32: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/32.jpg)
Problem
• Today's compilers (largely) do not generate SIMD code
• Back to assembly …
• x86
  − Over 1000 instructions to learn …
  − Green Book
• Can we use the compiler to generate all non-SIMD instructions?
![Page 33: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/33.jpg)
x86 Intrinsics AVX Data Types
Intrinsics: Direct access to registers & assembly from C
[Figure: AVX data types, with a "Register" label]
![Page 34: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/34.jpg)
Intrinsics AVX Code Nomenclature
![Page 35: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/35.jpg)
x86 SIMD "Intrinsics"
https://software.intel.com/sites/landingpage/IntrinsicsGuide/
[Figure: intrinsics guide entry with callouts:]
• assembly instruction
• 4 parallel multiplies
• 2 instructions per clock cycle (CPI = 0.5)
![Page 36: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/36.jpg)
Raw Double Precision Throughput (Bernhard's Powerbook Pro)

| Characteristic | Value |
|---|---|
| CPU | i7-5557U |
| Clock rate (sustained) | 3.1 GHz |
| Instructions per clock (mul_pd) | 2 |
| Parallel multiplies per instruction | 4 |
| Peak double flops | 24.8 Gflops |

Actual performance is lower because of overhead
https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
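The peak figure follows from multiplying the three characteristics listed for this CPU. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def peak_gflops(clock_ghz, instructions_per_clock, multiplies_per_instruction):
    """Peak rate = clock rate x instructions per clock x operations per instruction."""
    return clock_ghz * instructions_per_clock * multiplies_per_instruction

# Values from the slide: 3.1 GHz, 2 mul_pd per clock, 4 doubles per instruction
peak = peak_gflops(3.1, 2, 4)
print(f"{peak:.1f} Gflops")
```

This is an upper bound on multiply throughput only; it ignores loads, stores, and loop overhead.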
![Page 37: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/37.jpg)
Vectorized Matrix Multiplication
[Figure: matrices labeled with indices $i$, $j$, and $k$; $i$ advances in steps of 4 (i += 4)]
Inner Loop:
for i …; i+=4
  for j ...
![Page 38: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/38.jpg)
"Vectorized" dgemm
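The vectorized listing itself is not in this transcript. The sketch below emulates the idea in plain Python so that the loop structure is visible: each "register" is a list of 4 doubles, and the helpers `load`, `broadcast`, `mul`, and `add` are hypothetical stand-ins for AVX intrinsics such as `_mm256_load_pd`, `_mm256_broadcast_sd`, `_mm256_mul_pd`, and `_mm256_add_pd`. Column-major storage and the requirement that n is a multiple of 4 are assumptions. It shows the data flow only and gains no speed in Python.

```python
LANES = 4  # one 256-bit AVX register holds 4 doubles

def load(m, offset):
    """Emulate loading 4 consecutive doubles into a register."""
    return m[offset:offset + LANES]

def broadcast(m, offset):
    """Emulate copying one double into all 4 lanes."""
    return [m[offset]] * LANES

def mul(x, y):
    return [p * q for p, q in zip(x, y)]

def add(x, y):
    return [p + q for p, q in zip(x, y)]

def dgemm_simd(n, a, b, c):
    """c += a * b for column-major flat lists; element (r, s) lives at r + s*n."""
    assert n % LANES == 0
    for i in range(0, n, LANES):          # 4 rows of C per step (i += 4)
        for j in range(n):
            c0 = load(c, i + j * n)       # C[i..i+3][j]
            for k in range(n):
                # 4 multiplies and 4 adds per emulated instruction pair
                c0 = add(c0, mul(load(a, i + k * n), broadcast(b, k + j * n)))
            c[i + j * n:i + j * n + LANES] = c0
```

Each pass through the inner loop updates four elements of C at once, which is where the hoped-for 4x over the scalar loop comes from.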
![Page 39: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/39.jpg)
Performance

| N | scalar [Gflops] | avx [Gflops] |
|---|---|---|
| 32 | 1.30 | 4.56 |
| 160 | 1.30 | 5.47 |
| 480 | 1.32 | 5.27 |
| 960 | 0.91 | 3.64 |

• 4x faster
• But still << theoretical 25 Gflops!
![Page 40: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/40.jpg)
We are flying …
• Survey:
• But … there is so much material to cover!
  − Solution: targeted reading
  − Weekly homework with integrated reading & lecture review
![Page 42: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/42.jpg)
A trip to LA

| | Get to SFO & check-in | SFO → LAX | Get to destination | Total time |
|---|---|---|---|---|
| Commercial airline | 3 hours | 1 hour | 3 hours | 7 hours |
| Supersonic aircraft | 3 hours | 6 min | 3 hours | 6.1 hours |

Speedup:
• Flying time: $S_{flight} = 60/6 = 10\times$
• Trip time: $S_{trip} = 7/6.1 = 1.15\times$
![Page 43: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/43.jpg)
Amdahl's Law
• Get enhancement E for your new PC
  − E.g. floating point rocket booster
• E
  − Speeds up some task (e.g. arithmetic) by factor $S_E$
  − F is fraction of program that uses this "task"

Execution time (no-speedup section plus speedup section):
• $T_0$ (no E): $(1 - F) + F$
• $T_E$ (with E): $(1 - F) + F / S_E$

Speedup:

$$S = \frac{T_0}{T_E} = \frac{1}{(1 - F) + \frac{F}{S_E}}$$
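As a quick numerical check, the speedup formula can be written as a one-line function. The sketch below applies it to the trip to LA, treating the 1 hour of flying out of 7 hours total as the enhanced fraction; the function and variable names are illustrative.

```python
def amdahl_speedup(f, s_e):
    """Overall speedup when a fraction f of the run time is sped up by factor s_e."""
    return 1.0 / ((1.0 - f) + f / s_e)

# Trip to LA: 1 of the 7 hours is flying, and flying becomes 10x faster.
trip = amdahl_speedup(1 / 7, 10)
print(f"{trip:.2f}x")
```

The result agrees with dividing the two total trip times, 7 / 6.1.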
![Page 44: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/44.jpg)
Big Idea: Amdahl's Law

$$S = \frac{T_0}{T_E} = \frac{1}{\underbrace{(1 - F)}_{\text{part not sped up}} + \underbrace{\frac{F}{S_E}}_{\text{part sped up}}}$$

Example: The execution time of half of a program can be accelerated by a factor of 2. What is the program speed-up overall?

$$S = \frac{T_0}{T_E} = \frac{1}{(1 - 0.5) + \frac{0.5}{2}} = 1.33 \ll 2$$
![Page 45: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/45.jpg)
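Maximum "Achievable" Speed-Up
Question: What is a reasonable # of parallel processors to speed up an algorithm with F = 95%? (i.e. 19/20th can be sped up)

a) Maximum speedup:

$$S_{max} = \left. \frac{1}{(1 - F) + \frac{F}{S_E}} \right|_{S_E \to \infty} = \frac{1}{1 - F}$$

$F = 95\% \implies S_{max} = 20$, but $S_E \to \infty$ !?

b) Reasonable "engineering" compromise: equal time in sequential and parallel code

$$1 - F = \frac{F}{S_E} \implies S_E = \frac{F}{1 - F} = \frac{0.95}{0.05} = 19$$

Then $S = \frac{1}{(1 - F) + \frac{F}{S_E}} = \frac{1}{0.05 + 0.05} = 10$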
![Page 46: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/46.jpg)
[Figure: Amdahl's law speedup plot, annotated:]
• If the portion of the program that can be parallelized is small, then the speedup is limited
• In this region, the sequential portion limits the performance
• 500 processors for 19x
• 20 processors for 10x
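The numbers quoted for F = 95% can be reproduced with a few lines. This sketch assumes, as the slides do, that the parallel part speeds up by exactly the number of processors; the names are illustrative.

```python
def amdahl_speedup(f, s_e):
    """Overall speedup when a fraction f of the run time is sped up by factor s_e."""
    return 1.0 / ((1.0 - f) + f / s_e)

F = 0.95
limit = 1.0 / (1.0 - F)         # speedup as the number of processors grows without bound
balanced_s_e = F / (1.0 - F)    # equal time in the sequential and parallel parts

print(f"limit: {limit:.1f}x, balanced S_E: {balanced_s_e:.0f}")
for processors in (19, 20, 500):
    print(processors, f"{amdahl_speedup(F, processors):.2f}x")
```

Going from about 20 processors to 500 roughly doubles the speedup, which is why the balanced point is a reasonable compromise.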
![Page 47: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/47.jpg)
Strong and Weak Scaling
• To get good speedup on a parallel processor while keeping the problem size fixed is harder than getting good speedup by increasing the size of the problem.
  − Strong scaling: when speedup can be achieved on a parallel processor without increasing the size of the problem
  − Weak scaling: when speedup is achieved on a parallel processor by increasing the size of the problem proportionally to the increase in the number of processors
• Load balancing is another important factor: every processor doing same amount of work
  − Just one unit with twice the load of others cuts speedup almost in half
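The load-balancing remark can be illustrated with a small model in which the run time is set by the most heavily loaded processor. The choice of 40 processors and the function name are assumptions made for this example.

```python
def speedup_with_loads(loads):
    """Speedup over a single processor: total work divided by the largest share."""
    return sum(loads) / max(loads)

p = 40
balanced = [1.0] * p                 # every processor gets one unit of work

x = p / (p + 1)                      # same total work, but one processor gets 2x
skewed = [x] * (p - 1) + [2 * x]

print(speedup_with_loads(balanced))          # ideal case
print(round(speedup_with_loads(skewed), 2))  # one overloaded unit
```

With the same total work, the single overloaded unit brings the speedup from 40 down to about 20.5.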
![Page 48: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/48.jpg)
Clickers/Peer Instruction
Suppose a program spends 80% of its time in a square root routine. How much must you speed up square root to make the program run 5 times faster?

$$S = \frac{T_0}{T_E} = \frac{1}{(1 - F) + \frac{F}{S_E}}$$

| Answer | $S_E$ |
|---|---|
| A | 5 |
| B | 16 |
| C | 20 |
| D | 100 |
| E | None of the above |
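One way to approach this kind of question is to solve Amdahl's law for $S_E$. The sketch below does that with exact fractions, since 0.8 is not exactly representable in binary floating point and the boundary case matters here. The function name is hypothetical.

```python
from fractions import Fraction

def required_enhancement(f, s):
    """Return the S_E needed for overall speedup s, or None if no finite S_E suffices.

    Pass f and s as exact values (Fraction or int), not floats.
    """
    f, s = Fraction(f), Fraction(s)
    slack = 1 / s - (1 - f)   # time left for the enhanced part after the serial part
    if slack <= 0:
        return None           # the serial part alone already uses the whole budget
    return f / slack

print(required_enhancement(Fraction(4, 5), 4))  # target of 4x overall
print(required_enhancement(Fraction(4, 5), 5))  # target of 5x overall
```

Under these assumptions a 4x overall speedup needs $S_E = 16$, while 5x equals the limit $1/(1 - F)$ and cannot be reached with any finite $S_E$.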
![Page 50: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/50.jpg)
Administrivia
• MT2 is
  − Tuesday, November 1,
  − 3:30-5pm
  − see web for room assignments
• TA Review Session:
  § Sunday 10/30, 3:30 – 5 PM in 10 Evans
  § See Piazza
![Page 51: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/51.jpg)
MT2 Topics
• Covers lecture material up to 10/20
  − Caches
  − not floating point
• Combinatorial logic including synthesis and truth tables
• FSMs
• Timing and timing diagrams
• Pipelining
• Datapath, hazards, stalls
• Performance (e.g. CPI, instructions per second, latency)
• Caches
• All topics covered in MT1
  − Focus is new material, but do not be surprised by e.g. MIPS assembly
![Page 53: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/53.jpg)
Amdahl's Law applied to dgemm
• Measured dgemm performance
  − Peak 5.5 Gflops
  − Large matrices 3.6 Gflops
  − Processor 24.8 Gflops
• Why are we not getting (close to) 25 Gflops?
  − Something else (not floating point ALU) is limiting performance!
  − But what? Possible culprits:
    § Cache
    § Hazards
    § Let's look at both!
![Page 54: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/54.jpg)
Pipeline Hazards – dgemm
![Page 55: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/55.jpg)
Loop Unrolling
[Figure: unrolled dgemm code with callouts:]
• Compiler does the unrolling
• How do you verify that the generated code is actually unrolled?
• 4 registers
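The unrolled listing is not in this transcript. The sketch below shows one plausible shape of the transformation, emulated in plain Python: four independent 4-wide accumulators (presumably the "4 registers" of the callout) are updated in the inner loop, so consecutive multiply-adds do not depend on each other. `LANES`, `UNROLL`, column-major storage, and the function name are assumptions, and Python itself gains nothing from this; on real hardware the benefit comes from keeping the pipeline busy.

```python
LANES = 4    # doubles per emulated AVX register
UNROLL = 4   # number of independent accumulators

def dgemm_unrolled(n, a, b, c):
    """c += a * b for column-major flat lists; element (r, s) lives at r + s*n."""
    step = LANES * UNROLL
    assert n % step == 0
    for i in range(0, n, step):
        for j in range(n):
            # Load UNROLL separate accumulators covering 16 rows of column j of C.
            acc = [c[i + x * LANES + j * n:i + (x + 1) * LANES + j * n]
                   for x in range(UNROLL)]
            for k in range(n):
                bkj = b[k + j * n]                 # broadcast value
                for x in range(UNROLL):            # a compiler would replicate this body
                    base = i + x * LANES + k * n
                    acc[x] = [cv + av * bkj
                              for cv, av in zip(acc[x], a[base:base + LANES])]
            for x in range(UNROLL):
                start = i + x * LANES + j * n
                c[start:start + LANES] = acc[x]
```

The four accumulator updates inside the `k` loop are independent of one another, which is the property that lets a pipelined floating point unit overlap them.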
![Page 56: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/56.jpg)
Performance

| N | scalar [Gflops] | avx [Gflops] | unroll [Gflops] |
|---|---|---|---|
| 32 | 1.30 | 4.56 | 12.95 |
| 160 | 1.30 | 5.47 | 19.70 |
| 480 | 1.32 | 5.27 | 14.50 |
| 960 | 0.91 | 3.64 | 6.91 |
![Page 58: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/58.jpg)
FPU versus Memory Access
• How many floating point operations does matrix multiply take?
  − $F = 2 \times N^3$ ($N^3$ multiplies, $N^3$ adds)
• How many memory load/stores?
  − $M = 3 \times N^2$ (for A, B, C)
• Many more floating point operations than memory accesses
  − $q = F/M = 2/3 \cdot N$
  − Good, since arithmetic is faster than memory access
  − Let's check the code …
![Page 59: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/59.jpg)
But memory is accessed repeatedly
Inner loop:
• q = F/M = 1! (2 loads and 2 floating point operations)
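The slide's inner-loop listing is not in this transcript. As a rough sketch, the block below contrasts the ideal ratio $q = 2N^3 / 3N^2$ with what a textbook inner loop over $k$ actually does: two loads and two floating point operations per iteration. The function names and the explicit counters are illustrative.

```python
def ideal_intensity(n):
    """q = F / M with F = 2*n^3 flops and M = 3*n^2 memory accesses."""
    return (2 * n**3) / (3 * n**2)

def naive_inner_loop_intensity(n):
    """Count loads and flops in a textbook inner loop: cij += a[i][k] * b[k][j]."""
    a_row = [1.0] * n
    b_col = [1.0] * n
    loads = flops = 0
    cij = 0.0
    for k in range(n):
        x = a_row[k]      # load 1
        y = b_col[k]      # load 2
        loads += 2
        cij += x * y      # 1 multiply + 1 add
        flops += 2
    return flops / loads

print(ideal_intensity(960), naive_inner_loop_intensity(960))
```

For N = 960 the ideal ratio is 640 flops per memory access, while the straightforward loop achieves 1, so memory traffic rather than arithmetic sets the pace.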
![Page 60: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/60.jpg)
Typical Memory Hierarchy
[Figure: On-Chip Components (Control, Datapath, RegFile, Instr Cache, Data Cache), Second-Level Cache (SRAM), Third-Level Cache (SRAM), Main Memory (DRAM), Secondary Memory (Disk or Flash)]
Speed (cycles): ½'s, 1's, 10's, 100's-1000, 1,000,000's
Size (bytes): 100's, 10K's, M's, G's, T's
Cost/bit: highest to lowest
• Where are the operands (A, B, C) stored?
• What happens as N increases?
• Idea: arrange that most accesses are to fast cache!
![Page 61: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/61.jpg)
Sub-Matrix Multiplication or: Beating Amdahl's Law
![Page 62: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/62.jpg)
Blocking
• Idea:
  − Rearrange code to use values loaded in cache many times
  − Only "few" accesses to slow main memory (DRAM) per floating point operation
  − → throughput limited by FP hardware and cache, not slow DRAM
  − P&H p. 556
![Page 63: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/63.jpg)
Memory Access Blocking
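The blocked listing is not in this transcript. The sketch below shows one common way to structure blocking: the matrices are processed in `BLOCKSIZE` x `BLOCKSIZE` sub-matrices so that the values touched by the inner loops can stay in cache while they are reused. `BLOCKSIZE = 4`, the names `do_block` and `dgemm_blocked`, and column-major storage are assumptions for illustration; a real block size would be tuned to the cache.

```python
BLOCKSIZE = 4  # illustrative; real code would size this so three blocks fit in cache

def do_block(n, si, sj, sk, a, b, c):
    """Multiply one pair of sub-matrices and accumulate into a block of c."""
    for i in range(si, si + BLOCKSIZE):
        for j in range(sj, sj + BLOCKSIZE):
            cij = c[i + j * n]
            for k in range(sk, sk + BLOCKSIZE):
                cij += a[i + k * n] * b[k + j * n]
            c[i + j * n] = cij

def dgemm_blocked(n, a, b, c):
    """c += a * b for column-major flat lists, one block at a time."""
    assert n % BLOCKSIZE == 0
    for sj in range(0, n, BLOCKSIZE):
        for si in range(0, n, BLOCKSIZE):
            for sk in range(0, n, BLOCKSIZE):
                do_block(n, si, sj, sk, a, b, c)
```

The arithmetic is the same as in the unblocked triple loop; only the order of the memory accesses changes.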
![Page 64: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/64.jpg)
Performance

| N | scalar [Gflops] | avx [Gflops] | unroll [Gflops] | blocking [Gflops] |
|---|---|---|---|---|
| 32 | 1.30 | 4.56 | 12.95 | 13.80 |
| 160 | 1.30 | 5.47 | 19.70 | 21.79 |
| 480 | 1.32 | 5.27 | 14.50 | 20.17 |
| 960 | 0.91 | 3.64 | 6.91 | 15.82 |
![Page 66: CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel …cs61c/fa16/lec/18/L18.pdf · 2016-10-28 · Agenda • 61C –the big picture • Parallel processing • Single](https://reader034.vdocument.in/reader034/viewer/2022050403/5f8086c5d738b629fa6018db/html5/thumbnails/66.jpg)
And in Conclusion, …
• Approaches to Parallelism
  − SISD, SIMD, MIMD (next lecture)
• SIMD
  − One instruction operates on multiple operands simultaneously
• Example: matrix multiplication
  − Floating point heavy → exploit Moore's law to make fast
• Amdahl's Law:
  − Serial sections limit speedup
  − Cache
    § Blocking
  − Hazards
    § Loop unrolling