![Page 1: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/1.jpg)
HAOHUAN FU
Dr. Haohuan Fu is an associate professor in the Ministry of Education Key
Laboratory for Earth System Modeling, and Center of Earth System Science in
Tsinghua University. He is also the associate director of the National
Supercomputing Center in Wuxi. His research interests include design
methodologies for highly efficient and highly scalable simulation applications
that can take advantage of emerging multi-core, many-core, and
reconfigurable architectures, and make full utilization of current Peta-Flops
and future Exa-Flops supercomputers; and intelligent data management,
analysis, and data mining platforms that combine the statistical methods and
machine learning technologies.
swDNN: A Library for Accelerating Deep Learning Applications
on Sunway TaihuLight
![Page 2: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/2.jpg)
swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight Supercomputer
Haohuan FuNational Supercomputing Center in Wuxi
Tsinghua University
![Page 3: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/3.jpg)
Deep Leaning from Big PictureDataScience(“BigData”)
DataAnalysis
DataAnalysis
MachineLearning
SVMK-MeansClustering
DeeplearningDeepneuralNets
ConvolutionalNeuralNets
RecommenderSystemsCollaborativeFilteringRegressionBayesian
NetworksDecisionTreesRandomForestsSemanticAnalysis
DataManagement
QueriesIndexing
DistributedStorage
![Page 4: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/4.jpg)
AmountofData
Performance
NewAImodel(deeplearning)
MostlearningAlgorithm
Deep Leaning vs Traditional Machine Learning
![Page 5: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/5.jpg)
FromAndrewNg FromNVIDIA
What is Deep Leaning good at?
![Page 6: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/6.jpg)
Ecosystem of Deep Learning
![Page 7: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/7.jpg)
Ecosystem of Deep Learning
![Page 8: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/8.jpg)
ComplexityofTask
SizeofM
odel
SizeofModelAm
ountofData
ComplexityofTask
ComputationRequired
Challenges to Deep Learning
![Page 9: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/9.jpg)
• LargeModels (Memory)• MoreFloatPointOperations(Computing)
8layers1.4Gflop/Image2GBGPU memory
~16%Error
152layers22.6Gflop/Image56GBGPUmemory
~3.5%Error
2012AlexNet
2015ResNet
2ExaFLOP25M|7,000hrs~16%Error
20ExaFLOP100M|12,000hrs
~5%Error
2014DeepSpeech1
2015DeepSpeech2
Images Speech
16xComputing28xMemory
10xComputing4x MemoryforParameters
ComputingCapacityTflopsà Pflops
Challenges to Deep Learning
![Page 10: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/10.jpg)
“Investmentsincomputersystems— andIthinkthebleeding-edgeofAI,anddeeplearningspecifically,isshiftingtoHPC— cancutdownthetimetorunanexperimentfromaweektoadayandsometimesevenfaster.”
Bigdata+Deeplearning+Highperformancecomputing=IntelligenceGTC’14:DeepLearningMeetsHeterogeneousComputing
High Performance Deep Learning
ProfessorofStanford
![Page 11: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/11.jpg)
http://www.theverge.com/
Big Sur:Facebook’sSupercomputerforDeeplearning40Pflops
High Performance Deep Learning
![Page 12: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/12.jpg)
• Scale up:leveraginghardwarepowerinsideasinglemachine• Useacceleratorswithhighcomputingcapacity,likeGPU,TPU,XeonPhi,etc.• DependencyGraphincreasesoperationlevelparallelism
• Scaleout:usingmultiplemachinesinalargeclustertoincreasetheavailablecomputingpower• DataParallelism• ModelParallelism
Towards High Performance Deep Learning
![Page 13: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/13.jpg)
• Scale up:leveraginghardwarepowerinsideasinglemachine• Useacceleratorswithhighcomputingcapacity,likeGPU,TPU,XeonPhi,etc.• DependencyGraphincreasesoperationlevelparallelism
• Scaleout:usingmultiplemachinesinalargeclustertoincreasetheavailablecomputingpower• DataParallelism• ModelParallelism
Towards High Performance Deep Learning
260 core inone chip
40,960 nodesin one system
![Page 14: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/14.jpg)
Sunway-I:- CMA service, 1998- commercial chip- 0.384 Tflops- 48th of TOP500
Sunway BlueLight:- NSCC-Jinan, 2011- 16-core processor- 1 Pflops- 14th of TOP500
Sunway TaihuLight:- NSCC-Wuxi, 2016- 260-core processor- 125 Pflops- 1st of TOP500
The Sunway Machine Family
![Page 15: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/15.jpg)
System TaihuLight Tianhe-2 Titan Sequoia Cori
Peak Performance (PFlops) 125.4 54.9 27.1 20.1 27.9
Total Memory (TB) 1310 1024 710 1572 879
Linpack Performance (PFlops) 93.0(74%) 33.9(62%) 17.6(65%) 17.2(85.3) 14.0(50%)
Rank of Top500 1 2 3 4 5
Performance/Power (Mflops/W) 6051.3 1901.5 2142.8 2176.6 3266.8
Rank of Green500 4 135 100 90 26
GTEPS 23755.7 2061.48 ### 23751 ###
Rank of Graph500 2 8 ### 3 ###
HPCG (Pflops) 0.3712 0.5801 0.3223 0.3304 0.3554
Rank of HPCG 4 2 7 6 5
Sunway TaihuLight V.S. Other Systems
14
![Page 16: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/16.jpg)
Execution Pipelines
SIMD width Clock I-Cache size per CPE
LDM sizeper CPE
Memory Bandwidth
2 256bit 1.45GHz 16KB 64KB 136GB/s
Architecture of SW26010
![Page 17: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/17.jpg)
• BasicneuronlayersinDNN• Fully-connectedlayer(BLAS);Poolinglayer• Activationfunction;BatchNormalization• *ConvolutionalLayer
• Convolutionallayersoccupyover90%time.
Scale up:Library for Deep Learning
![Page 18: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/18.jpg)
𝑹𝑩𝑾𝑴𝑬𝑴→𝑳𝑫𝑴 =8𝐵𝑦𝑡𝑒
2𝑓𝑙𝑜𝑝/8×1.45𝐺𝐻𝑧 = 139.2𝐺𝐵𝑝𝑠
CPUUtility= ABCD.E
E= 0.32%
8by8CPEmesh
Registers
LDM
PeakPerformanceperCG/012 = 742.4829:;<
46.4GB/s�64
4~32GB/s
8GB/s
ConsideringExecutionEfficiency(EE)/012 = 742.4829:;< · OO
ConsideringLDM− REGExchange
/012 = 742.4829:;< · OO · VWX 1,46.4 · 64
[\]^_`→bcd
e
PeakPerformanceperCG/012 = 742.4829:;<
ConsideringExecutionEfficiency(EE)/012 = 742.4829:;< · OO
ConsideringDirectMemoryAccess/012 = 742.4829:;< · OO ·
VWX 1,g
bhijkj→lkm
e
DirectMemoryAccess REG-LDM-MEM
ConsideringMEM− LDMExchange/012 = 742.4829:;< · OO
·VWX 1,46.4 · 64
[\]^_`→bcd
e
· VWX 1,n\]^_`→bcd
[\]^_`→bcd
eMainMemory
A Performance Model for SW26010
![Page 19: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/19.jpg)
• LDMblocking:loopsplittingandloopscheduling• WecannotapplyblockingtoeverydimensionduetothelimitedsizeofLDM.• TheDMAbandwidthbetweenmemoryandLDMwillbeaffectedbythesizeofleadingdimensionofloops.
• ToincreasedatareuseinLDM,weshouldarrangetheDMAoperationsatthe most outerloops.
TableI.MeasuredDMABandwidths(GBps)on1CG
LDM-related optimizations
![Page 20: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/20.jpg)
𝑅BWKLM→NOK =𝑁Q + 𝑏TU𝑏V 𝐷𝑆2bTU𝑏V𝑁Q/𝑇
=
1𝑏[U𝑏V
+ 1𝑁Q
𝐷𝑆
2/𝑇
2𝑁\×𝑁Q×𝑏𝐶𝑜×𝑏VCachedCalculation
𝑁\×𝑏𝐶𝑜×𝑏V
𝑁\×𝑁Q
Inputelements
Filterelements
DMAgetData
LDM-related optimizations
![Page 21: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/21.jpg)
𝑅BWKLM→NOK =𝐵 + 𝐾_𝑁Q 𝐷𝑆2𝐾_𝐵𝑁Q/𝑇
=
1𝐾_𝑁Q
+ 1𝐵 𝐷𝑆
2/𝑇
𝑁\×𝐵×𝐾_
𝑁\×𝑁Q
Inputelements
Filterelements
DMAgetData
2𝑁\×𝑁Q×𝐵×𝐾_CachedCalculation
LDM-related optimizations
![Page 22: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/22.jpg)
• GEMMoperationinLDM• HowtodesigndatalayoutondistributedLDMspaceof64CPE?• Howtosharedatabetween64CPEwhenComputing?• HowtoloaddatafromLDMtoREGensureRBW<46.4GB/s?
• Solution• RegisterCommunication• Vectorization
Register-Related Optimization
![Page 23: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/23.jpg)
• DataLayouts• Dividerowandcolumninto8parts• EachCPEmaintains1/8×1/8matrix
• DataSharing• RegisterCommunications
• 14cycleslatency• DatafetchwithProducerConsumermode
Register-Related Optimization
![Page 24: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/24.jpg)
• Vectorization :ReduceRBWbetweenLDM-REG
𝑅𝐵𝑊NOK→abc =𝑟𝑏V + 𝑟𝑏eU 𝐷𝑆2𝑟𝑏V𝑟𝑏eU/𝑇
v𝑟𝑏V
𝑟𝑏eQ
SIMDLoad+Extend
SIMDLoad
𝑅𝐵𝑊NOK→abc =𝑟𝑏V + 4𝑟𝑏eU 𝐷𝑆2𝑟𝑏V𝑟𝑏eU/𝑇
𝑅𝐵𝑊NOK→abc =4 + 4×4 ×8𝐵𝑦𝑡𝑒
2×4×4/(1.45𝐺𝐻𝑧×8) = 23.2𝐺𝐵/𝑠 < 46.4𝑟𝑏eU = 𝑟𝑏V = 4
v
Register-Related Optimization
datablockedinregisters
datablockedforvmaddoperation
![Page 25: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/25.jpg)
Register-Related Optimization
𝑅𝐵𝑊NOK→abc =𝑟𝑏V + 4𝑟𝑏eU 𝐷𝑆2𝑟𝑏V𝑟𝑏eU/𝑇 𝑅𝐵𝑊NOK→abc =
4 + 4×4 ×8𝐵𝑦𝑡𝑒2×4×4/(1.45𝐺𝐻𝑧×8) = 23.2𝐺𝐵/𝑠 < 46.4
𝑟𝑏eU = 𝑟𝑏V = 4
![Page 26: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/26.jpg)
Principles:
• ReduceReadAfterWrite(RAW) Hazard:
postponeissuingofdependentinstructions
• IncreaseInstructionLevelParallelism:
Overlap InstructionexecutionsonP0andP1pipelines
EE=16/26=61.5% EE=16/17=94.1%Instruction Reordering
![Page 27: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/27.jpg)
Double-precisionperformanceresultsofourconvolutionkernelswithdifferent(Ni,No)rangingfrom(64,64)to(384,384),comparedwiththeK40mGPUresultswithcuDNNv5.(B=128,outputimage=64× 64,filter=3× 3)
• Performance of Convolutional Layers– Convolution performance above 1.6 Tflops– Speedup ranging from 1.91x to 9.75x compared with
cudnnv5.1.
Performance
![Page 28: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/28.jpg)
• Performance is irrelevant to kernel size – MorestablethancuDNN
Performance
![Page 29: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/29.jpg)
• Scale up:leveraginghardwarepowerinsideasinglemachine• Useacceleratorswithhighcomputingcapacity,likeGPU,TPU,XeonPhi,etc.• DependencyGraphincreasesoperationlevelparallelism
• Scaleout:usingmultiplemachinesinalargeclustertoincreasetheavailablecomputingpower• DataParallelism• ModelParallelism
Towards High Performance Deep Learning
![Page 30: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/30.jpg)
• SWCaffe :ADeepLearningFrameworkonSunway• Support RNN,CNN• IntegratedwithswDNN,swBLAS• ScalewithMPI• SynchronousParameterServer
CGNumber 1 2 4 8 16Sec/100iter 1989.79 1048.34 538.197 263.679 156.89
batchsize=1024Layer={conv,pool,conv,pool,ip, ip,softmax}channel={128,-,128,-,500,10,10}Filter={5*5,-,3*3,-, -,-,-}Data: Mnist
Distributed DNN Framework on Sunway
ParameterConfiguration
ScalabilityonSunway
![Page 31: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/31.jpg)
无人驾驶 对弈 语音 视觉
App
Lib
hardware
Fram
ework
![Page 32: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/32.jpg)
无人驾驶 对弈 语音 视觉
App
Lib
hardware
Fram
ework
SWCaffe
![Page 33: swDNN: A Library for Accelerating Deep Learning ... A Library for Accelerating De… · and future Exa-Flops supercomputers; and intelligent data management, analysis, and data mining](https://reader034.vdocument.in/reader034/viewer/2022043013/5fae8060c121413ca1597951/html5/thumbnails/33.jpg)