TRANSCRIPT
HAOHUAN FU
Dr. Haohuan Fu is an associate professor in the Ministry of Education Key Laboratory for Earth System Modeling and the Center for Earth System Science at Tsinghua University. He is also the associate director of the National Supercomputing Center in Wuxi. His research interests include design methodologies for highly efficient and highly scalable simulation applications that take advantage of emerging multi-core, many-core, and reconfigurable architectures and make full use of current peta-flops and future exa-flops supercomputers, as well as intelligent data management, analysis, and data-mining platforms that combine statistical methods and machine learning technologies.
swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight Supercomputer
Haohuan Fu
National Supercomputing Center in Wuxi
Tsinghua University
Deep Learning from the Big Picture

Data Science ("Big Data") spans:
- Data Analysis
  - Machine Learning: SVM, K-Means Clustering
  - Deep Learning: deep neural nets, convolutional neural nets
  - Recommender Systems: collaborative filtering
  - Regression, Bayesian networks, decision trees, random forests, semantic analysis
- Data Management: queries, indexing, distributed storage

[Figure: performance vs. amount of data. New AI models (deep learning) keep improving as data grows, while most learning algorithms plateau.]
Deep Learning vs. Traditional Machine Learning
(From Andrew Ng; from NVIDIA)
What is Deep Learning good at?
Ecosystem of Deep Learning

[Figure: as the complexity of the task grows, so do the size of the model, the amount of data, and the computation required.]
Challenges to Deep Learning

- Large models (memory); more floating-point operations (computing)

Images:
- 2012 AlexNet: 8 layers, 1.4 Gflop/image, 2 GB GPU memory, ~16% error
- 2015 ResNet: 152 layers, 22.6 Gflop/image, 56 GB GPU memory, ~3.5% error
- 16x computing, 28x memory

Speech:
- 2014 DeepSpeech 1: 2 ExaFLOP, 25M parameters, 7,000 hrs of data, ~16% error
- 2015 DeepSpeech 2: 20 ExaFLOP, 100M parameters, 12,000 hrs of data, ~5% error
- 10x computing, 4x memory for parameters

Computing capacity: Tflops → Pflops
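The growth factors quoted above (16x computing, 28x memory for the vision models) follow directly from the per-model figures; a quick check in Python, using only the numbers from the slide:

```python
# Per-image cost of two landmark vision models, as quoted on the slide.
alexnet_2012 = {"layers": 8,   "gflop_per_image": 1.4,  "gpu_memory_gb": 2}
resnet_2015  = {"layers": 152, "gflop_per_image": 22.6, "gpu_memory_gb": 56}

compute_growth = resnet_2015["gflop_per_image"] / alexnet_2012["gflop_per_image"]
memory_growth = resnet_2015["gpu_memory_gb"] / alexnet_2012["gpu_memory_gb"]

# Roughly 16x computing and 28x memory in three years.
print(f"computing: {compute_growth:.1f}x, memory: {memory_growth:.0f}x")
```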
Challenges to Deep Learning

"Investments in computer systems — and I think the bleeding-edge of AI, and deep learning specifically, is shifting to HPC — can cut down the time to run an experiment from a week to a day and sometimes even faster." (a Stanford professor, via http://www.theverge.com/)

High Performance Deep Learning

Big data + Deep learning + High performance computing = Intelligence (GTC'14: Deep Learning Meets Heterogeneous Computing)

Big Sur: Facebook's supercomputer for deep learning, 40 Pflops
Towards High Performance Deep Learning

- Scale up: leveraging hardware power inside a single machine
  - Use accelerators with high computing capacity, like GPU, TPU, Xeon Phi, etc.
  - Dependency graphs increase operation-level parallelism
- Scale out: using multiple machines in a large cluster to increase the available computing power
  - Data parallelism
  - Model parallelism
The Sunway Machine Family

- Sunway-I: CMA service, 1998; commercial chip; 0.384 Tflops; 48th on the TOP500
- Sunway BlueLight: NSCC-Jinan, 2011; 16-core processor; 1 Pflops; 14th on the TOP500
- Sunway TaihuLight: NSCC-Wuxi, 2016; 260-core processor (260 cores in one chip); 40,960 nodes in one system; 125 Pflops; 1st on the TOP500
Sunway TaihuLight vs. Other Systems

System                        TaihuLight   Tianhe-2    Titan       Sequoia       Cori
Peak Performance (Pflops)     125.4        54.9        27.1        20.1          27.9
Total Memory (TB)             1310         1024        710         1572          879
Linpack Performance (Pflops)  93.0 (74%)   33.9 (62%)  17.6 (65%)  17.2 (85.3%)  14.0 (50%)
TOP500 Rank                   1            2           3           4             5
Performance/Power (Mflops/W)  6051.3       1901.5      2142.8      2176.6        3266.8
Green500 Rank                 4            135         100         90            26
GTEPS                         23755.7      2061.48     ###         23751         ###
Graph500 Rank                 2            8           ###         3             ###
HPCG (Pflops)                 0.3712       0.5801      0.3223      0.3304        0.3554
HPCG Rank                     4            2           7           6             5
Architecture of SW26010

Execution Pipelines  SIMD Width  Clock     I-Cache per CPE  LDM per CPE  Memory Bandwidth
2                    256 bit     1.45 GHz  16 KB            64 KB        136 GB/s
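The 742.4 Gflops per-core-group peak used by the performance model later in the talk can be derived from this table; a minimal sketch, assuming 64 CPEs per core group and fused multiply-add (2 flops per SIMD lane per cycle):

```python
CPES_PER_CG = 64          # 8x8 CPE mesh per core group
SIMD_DOUBLES = 256 // 64  # a 256-bit SIMD register holds 4 doubles
FLOPS_PER_LANE = 2        # fused multiply-add: 2 flops per lane per cycle
CLOCK_GHZ = 1.45

# Peak double-precision rate of one core group, in Gflops.
peak_gflops = CPES_PER_CG * SIMD_DOUBLES * FLOPS_PER_LANE * CLOCK_GHZ
print(peak_gflops)
```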
Scale up: a Library for Deep Learning

- Basic neuron layers in a DNN: fully-connected layer (BLAS); pooling layer; activation function; batch normalization; *convolutional layer
- Convolutional layers occupy over 90% of the time.
A Performance Model for SW26010

Memory hierarchy of one core group (8-by-8 CPE mesh):
- Registers ↔ LDM: 46.4 GB/s per CPE (× 64 CPEs)
- LDM ↔ main memory via DMA (REG-LDM-MEM): 4~32 GB/s
- Registers ↔ main memory via direct memory access (gload/gstore): 8 GB/s

If every operand streams straight from main memory (8 bytes loaded per 2 flops at peak speed), the required bandwidth is R_BW(MEM→LDM) = 139.2 GBps, so the achievable memory bandwidth limits CPU utilization to roughly 0.32%.

The model is refined step by step:
- Peak performance per CG: Perf = 742.4 Gflops
- Considering execution efficiency (EE): Perf = 742.4 Gflops · EE
- Considering the LDM-REG exchange: Perf = 742.4 Gflops · EE · min(1, (46.4 × 64) / R_BW(LDM→REG))
- Considering the MEM-LDM exchange: Perf = 742.4 Gflops · EE · min(1, (46.4 × 64) / R_BW(LDM→REG)) · min(1, B_MEM→LDM / R_BW(MEM→LDM))
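The refinement steps above compose into a single expression; a minimal sketch in Python that mirrors the slide's formula (the bandwidth arguments are the kernel's required and the machine's achievable figures; the function name and signature are mine, not a swDNN API):

```python
PEAK_GFLOPS_PER_CG = 742.4
LDM_REG_BW = 46.4 * 64  # GB/s: 46.4 GB/s per CPE across 64 CPEs

def predicted_perf(ee, rbw_ldm_reg, b_mem_ldm, rbw_mem_ldm):
    """Predicted Gflops per core group.

    ee           -- execution efficiency of the instruction stream (0..1)
    rbw_ldm_reg  -- required LDM->register bandwidth of the kernel (GB/s)
    b_mem_ldm    -- achievable memory->LDM (DMA) bandwidth (GB/s)
    rbw_mem_ldm  -- required memory->LDM bandwidth of the kernel (GB/s)
    """
    ldm_reg_factor = min(1.0, LDM_REG_BW / rbw_ldm_reg)
    mem_ldm_factor = min(1.0, b_mem_ldm / rbw_mem_ldm)
    return PEAK_GFLOPS_PER_CG * ee * ldm_reg_factor * mem_ldm_factor

# A kernel that is not bandwidth-bound runs at EE * peak:
print(predicted_perf(0.94, LDM_REG_BW, 32.0, 16.0))
```

Each `min(1, ...)` factor caps performance at 1.0 when the corresponding level of the hierarchy is fast enough, so only the binding bottleneck reduces the prediction.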
LDM-related Optimizations

- LDM blocking: loop splitting and loop scheduling
  - We cannot apply blocking to every dimension due to the limited size of the LDM.
  - The DMA bandwidth between memory and LDM is affected by the size of the leading dimension of the loops.
- To increase data reuse in the LDM, we should arrange the DMA operations at the outermost loops.

[Table I: measured DMA bandwidths (GBps) on 1 CG]
LDM-related Optimizations

Blocking over output channels (b_Co) and batch (b_B): per tile, the DMA fetches N_r × N_i input elements and N_r × b_Co × b_B filter elements, while the cached calculation performs 2 × N_r × N_i × b_Co × b_B flops. The required bandwidth is

R_BW(MEM→LDM) = (N_i + b_Co · b_B) · DS / (2 · b_Co · b_B · N_i / T) = (1 / (b_Co · b_B) + 1 / N_i) · DS / (2 / T)
LDM-related Optimizations

Blocking over batch (B) and the kernel dimension (K_r): per tile, the DMA fetches N_r × N_i input elements and N_r × B × K_r filter elements, while the cached calculation performs 2 × N_r × N_i × B × K_r flops. The required bandwidth is

R_BW(MEM→LDM) = (B + K_r · N_i) · DS / (2 · K_r · B · N_i / T) = (1 / (K_r · N_i) + 1 / B) · DS / (2 / T)
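Both blocked variants have the same shape: the required DMA bandwidth falls as the blocking factors grow. A sketch that evaluates the first formula, assuming DS = 8 bytes (double precision) and taking T as a per-CPE rate of 8 flops/cycle at 1.45 GHz (my assumption for the throughput term):

```python
FLOP_RATE_GFLOPS = 8 * 1.45  # assumed per-CPE rate: 8 flops/cycle at 1.45 GHz
DS_BYTES = 8                 # double precision

def required_dma_bw(n_i, b_co, b_b):
    """Required MEM->LDM bandwidth (GB/s) for one blocked tile:
    (n_i + b_co*b_b) elements fetched per 2*b_co*b_b*n_i flops of cached work."""
    data_bytes = (n_i + b_co * b_b) * DS_BYTES
    time_ns = 2 * b_co * b_b * n_i / FLOP_RATE_GFLOPS  # bytes/ns == GB/s
    return data_bytes / time_ns

# Larger blocks reuse each fetched element more often:
for b in (2, 4, 8):
    print(b, round(required_dma_bw(64, b, b), 2))
```

The printed values shrink as the block size grows, which is why the slide pushes blocking factors as large as the 64 KB LDM allows.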
Register-Related Optimization

- GEMM operation in the LDM
  - How to design the data layout on the distributed LDM space of the 64 CPEs?
  - How to share data among the 64 CPEs when computing?
  - How to load data from LDM to registers while keeping R_BW < 46.4 GB/s?
- Solution
  - Register communication
  - Vectorization
Register-Related Optimization

- Data layout
  - Divide rows and columns into 8 parts
  - Each CPE maintains a 1/8 × 1/8 block of the matrix
- Data sharing via register communication
  - 14 cycles of latency
  - Data fetched in producer-consumer mode
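The 1/8 × 1/8 layout can be written down concretely; a hypothetical sketch (the `owner` helper is mine for illustration, not part of swDNN) mapping a matrix element to the CPE that holds it in the 8-by-8 mesh:

```python
MESH = 8  # 8x8 CPE mesh per core group

def owner(i, j, n):
    """Return the (row, col) of the CPE holding element (i, j) of an n x n
    matrix split into 8 x 8 contiguous blocks (each CPE keeps one
    1/8 x 1/8 tile in its LDM)."""
    block = n // MESH
    return (i // block, j // block)

# For a 512 x 512 matrix, each CPE owns a 64 x 64 tile:
print(owner(0, 0, 512), owner(100, 500, 512))
```

With this layout, a GEMM step needs row and column broadcasts across the mesh, which is exactly what the low-latency register communication above provides.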
Register-Related Optimization

- Vectorization: reduce R_BW between LDM and registers

With plain SIMD loads (register blocking by rb_B and rb_Co):

R_BW(LDM→REG) = (rb_B + rb_Co) · DS / (2 · rb_B · rb_Co / T)

With SIMD load + extend for the filter elements:

R_BW(LDM→REG) = (rb_B + 4 · rb_Co) · DS / (2 · rb_B · rb_Co / T)

With rb_Co = rb_B = 4:

R_BW(LDM→REG) = (4 + 4 × 4) × 8 Byte / (2 × 4 × 4 / (1.45 GHz × 8)) = 23.2 GB/s < 46.4 GB/s

[Figure: data blocked in registers; data blocked for the vmadd operation.]
Register-Related Optimization

Instruction reordering. Principles:
- Reduce Read-After-Write (RAW) hazards: postpone the issuing of dependent instructions.
- Increase instruction-level parallelism: overlap instruction execution on the P0 and P1 pipelines.

Reordering improves execution efficiency from EE = 16/26 = 61.5% to EE = 16/17 = 94.1%.
Performance

[Figure: double-precision performance of our convolution kernels with (Ni, No) ranging from (64, 64) to (384, 384), compared with K40m GPU results using cuDNN v5 (B = 128, output image = 64 × 64, filter = 3 × 3).]

- Performance of convolutional layers
  - Convolution performance above 1.6 Tflops
  - Speedup ranging from 1.91x to 9.75x compared with cuDNN v5.1
- Performance is insensitive to kernel size
  - More stable than cuDNN
Towards High Performance Deep Learning

- Scale up: leveraging hardware power inside a single machine
  - Use accelerators with high computing capacity, like GPU, TPU, Xeon Phi, etc.
  - Dependency graphs increase operation-level parallelism
- Scale out: using multiple machines in a large cluster to increase the available computing power
  - Data parallelism
  - Model parallelism
Distributed DNN Framework on Sunway

- SWCaffe: a deep learning framework on Sunway
  - Supports RNN and CNN
  - Integrated with swDNN and swBLAS
  - Scales with MPI
  - Synchronous parameter server

Scalability on Sunway:

CG Number      1        2        4        8        16
Sec/100 iter   1989.79  1048.34  538.197  263.679  156.89

Parameter configuration: batch size = 1024; layers = {conv, pool, conv, pool, ip, ip, softmax}; channels = {128, -, 128, -, 500, 10, 10}; filters = {5×5, -, 3×3, -, -, -, -}; dataset: MNIST.
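The speedup and parallel efficiency implied by the scalability table can be computed directly; a quick sketch, taking the 1-CG run as the baseline:

```python
# Sec per 100 iterations from the scalability table, keyed by CG count.
sec_per_100_iter = {1: 1989.79, 2: 1048.34, 4: 538.197, 8: 263.679, 16: 156.89}

baseline = sec_per_100_iter[1]
for cgs, sec in sec_per_100_iter.items():
    speedup = baseline / sec
    efficiency = speedup / cgs
    print(f"{cgs:2d} CGs: speedup {speedup:5.2f}x, efficiency {efficiency:.0%}")
```

At 16 CGs the speedup is around 12.7x, i.e. roughly 79% parallel efficiency with the synchronous parameter server.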
The software stack on Sunway, from bottom to top: hardware; Lib; Framework (SWCaffe); App. Applications include autonomous driving, game playing, speech, and vision.