swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight


Page 1:

HAOHUAN FU

Dr. Haohuan Fu is an associate professor in the Ministry of Education Key Laboratory for Earth System Modeling and the Center for Earth System Science at Tsinghua University. He is also the associate director of the National Supercomputing Center in Wuxi. His research interests include design methodologies for highly efficient and highly scalable simulation applications that can take advantage of emerging multi-core, many-core, and reconfigurable architectures and make full use of current Peta-Flops and future Exa-Flops supercomputers, as well as intelligent data management, analysis, and data mining platforms that combine statistical methods and machine learning technologies.

swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight

Page 2:

swDNN: A Library for Accelerating Deep Learning Applications on the Sunway TaihuLight Supercomputer

Haohuan Fu
National Supercomputing Center in Wuxi
Tsinghua University

Page 3:

Deep Learning from the Big Picture

Data Science ("Big Data"):
- Data Analysis
  - Machine Learning: SVM; K-Means Clustering; Recommender Systems; Collaborative Filtering; Regression; Bayesian Networks; Decision Trees; Random Forests; Semantic Analysis
    - Deep Learning: Deep Neural Nets; Convolutional Neural Nets
- Data Management: Queries; Indexing; Distributed Storage

Page 4:

[Figure: performance vs. amount of data. New AI models (deep learning) keep improving as data grows, while most learning algorithms plateau.]

Deep Learning vs. Traditional Machine Learning

Page 5:

From Andrew Ng | From NVIDIA

What is Deep Learning good at?

Page 6:

Ecosystem of Deep Learning

Page 7:

Ecosystem of Deep Learning

Page 8:

[Figure: the required computation grows with the complexity of the task, the size of the model, and the amount of data.]

Challenges to Deep Learning

Page 9:

• Large models (memory)
• More floating-point operations (computing)

Images:
- 2012 AlexNet: 8 layers, 1.4 Gflop/image, 2 GB GPU memory, ~16% error
- 2015 ResNet: 152 layers, 22.6 Gflop/image, 56 GB GPU memory, ~3.5% error
- 16x computing, 28x memory

Speech:
- 2014 DeepSpeech 1: 2 ExaFLOP, 25M | 7,000 hrs, ~16% error
- 2015 DeepSpeech 2: 20 ExaFLOP, 100M | 12,000 hrs, ~5% error
- 10x computing, 4x memory for parameters

Computing capacity: Tflops → Pflops

Challenges to Deep Learning
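The growth factors quoted above follow directly from the slide's own numbers; a quick check:

```python
# Growth factors between the two model generations quoted on this slide.
print(22.6 / 1.4)   # ~16x computing: AlexNet 1.4 Gflop -> ResNet 22.6 Gflop
print(56 / 2)       # 28x memory: 2 GB -> 56 GB of GPU memory
print(20 / 2)       # 10x computing: DeepSpeech 1 -> DeepSpeech 2 (ExaFLOP)
print(100 / 25)     # 4x memory for parameters: 25M -> 100M
```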

Page 10:

“Investments in computer systems — and I think the bleeding edge of AI, and deep learning specifically, is shifting to HPC — can cut down the time to run an experiment from a week to a day and sometimes even faster.”

Big data + Deep learning + High-performance computing = Intelligence
GTC'14: Deep Learning Meets Heterogeneous Computing

High Performance Deep Learning

(Professor at Stanford)

Page 11:

http://www.theverge.com/

Big Sur: Facebook's supercomputer for deep learning, 40 Pflops

High Performance Deep Learning

Page 12:

• Scale up: leveraging hardware power inside a single machine
  • Use accelerators with high computing capacity, like GPU, TPU, Xeon Phi, etc.
  • Dependency graphs increase operation-level parallelism

• Scale out: using multiple machines in a large cluster to increase the available computing power
  • Data parallelism
  • Model parallelism

Towards High Performance Deep Learning

Page 13:

• Scale up: leveraging hardware power inside a single machine
  • Use accelerators with high computing capacity, like GPU, TPU, Xeon Phi, etc.
  • Dependency graphs increase operation-level parallelism

• Scale out: using multiple machines in a large cluster to increase the available computing power
  • Data parallelism
  • Model parallelism

Towards High Performance Deep Learning

260 cores in one chip
40,960 nodes in one system

Page 14:

Sunway-I: CMA service, 1998; commercial chip; 0.384 Tflops; 48th on the TOP500

Sunway BlueLight: NSCC-Jinan, 2011; 16-core processor; 1 Pflops; 14th on the TOP500

Sunway TaihuLight: NSCC-Wuxi, 2016; 260-core processor; 125 Pflops; 1st on the TOP500

The Sunway Machine Family

Page 15:

System | TaihuLight | Tianhe-2 | Titan | Sequoia | Cori
Peak Performance (PFlops) | 125.4 | 54.9 | 27.1 | 20.1 | 27.9
Total Memory (TB) | 1310 | 1024 | 710 | 1572 | 879
Linpack Performance (PFlops) | 93.0 (74%) | 33.9 (62%) | 17.6 (65%) | 17.2 (85.3%) | 14.0 (50%)
TOP500 Rank | 1 | 2 | 3 | 4 | 5
Performance/Power (Mflops/W) | 6051.3 | 1901.5 | 2142.8 | 2176.6 | 3266.8
Green500 Rank | 4 | 135 | 100 | 90 | 26
GTEPS | 23755.7 | 2061.48 | n/a | 23751 | n/a
Graph500 Rank | 2 | 8 | n/a | 3 | n/a
HPCG (Pflops) | 0.3712 | 0.5801 | 0.3223 | 0.3304 | 0.3554
HPCG Rank | 4 | 2 | 7 | 6 | 5

Sunway TaihuLight vs. Other Systems

Page 16:

Execution Pipelines | SIMD Width | Clock | I-Cache per CPE | LDM per CPE | Memory Bandwidth
2 | 256 bit | 1.45 GHz | 16 KB | 64 KB | 136 GB/s

Architecture of SW26010
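From these parameters one can recover the per-CG peak figure used on later slides. A back-of-envelope check, assuming each CPE retires one 4-wide double-precision fused multiply-add per cycle (this assumption is mine, not stated in the table):

```python
# Back-of-envelope check of the per-CG peak used on later slides.
SIMD_LANES = 256 // 64           # 256-bit SIMD over 64-bit doubles
FLOP_PER_CYCLE = SIMD_LANES * 2  # a fused multiply-add is 2 flops per lane
CLOCK_GHZ = 1.45
CPES_PER_CG = 64

print(CPES_PER_CG * FLOP_PER_CYCLE * CLOCK_GHZ)  # 742.4 Gflops per CG
```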

Page 17:

• Basic neuron layers in DNNs
  • Fully-connected layer (BLAS); pooling layer
  • Activation function; batch normalization
  • *Convolutional layer

• Convolutional layers account for over 90% of the execution time.

Scale up: Library for Deep Learning

Page 18:

[Figure: memory hierarchy of one CG: an 8-by-8 CPE mesh; Registers ↔ LDM at 46.4 GB/s per CPE (× 64); LDM ↔ Main Memory via DMA at 4~32 GB/s; direct memory access (REG-LDM-MEM) at 8 GB/s.]

If every operand has to be fetched from main memory, the required bandwidth is

$RBW_{MEM\to LDM} = \frac{8\,\mathrm{Byte}}{2\,\mathrm{flop}/(8 \times 1.45\,\mathrm{GHz})} = 139.2\,\mathrm{GB/s}$

far beyond what the memory system delivers, so the resulting CPU utilization is only 0.32%.

The model refines peak performance step by step:

Peak performance per CG: $Perf = 742.4\,\mathrm{Gflops}$ (64 CPEs × 8 flop/cycle × 1.45 GHz)

Considering execution efficiency (EE): $Perf = 742.4\,\mathrm{Gflops} \cdot EE$

Considering the LDM-REG exchange: $Perf = 742.4\,\mathrm{Gflops} \cdot EE \cdot \min\!\left(1, \frac{46.4 \times 64}{RBW_{LDM\to REG}}\right)$

Considering the MEM-LDM exchange (DMA): $Perf = 742.4\,\mathrm{Gflops} \cdot EE \cdot \min\!\left(1, \frac{46.4 \times 64}{RBW_{LDM\to REG}}\right) \cdot \min\!\left(1, \frac{BW_{MEM\to LDM}}{RBW_{MEM\to LDM}}\right)$

A Performance Model for SW26010
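As a concrete reading of the model, here is a minimal Python sketch of this roofline-style composition. The function, its arguments, and the example bandwidth numbers are illustrative assumptions, not swDNN APIs:

```python
# A minimal sketch of the slide's layered performance model for one core
# group (CG). Names are illustrative, not part of swDNN.

PEAK_GFLOPS = 742.4          # 64 CPEs x 8 flop/cycle x 1.45 GHz
LDM_BW_GBPS = 46.4 * 64      # aggregate LDM<->register bandwidth (GB/s)

def cg_performance(ee, rbw_ldm_reg, rbw_mem_ldm, dma_bw_gbps):
    """Attainable Gflops: peak, discounted by each bandwidth bottleneck.

    ee           -- execution efficiency of the instruction stream (0..1)
    rbw_ldm_reg  -- the kernel's required LDM->register bandwidth (GB/s)
    rbw_mem_ldm  -- the kernel's required memory->LDM bandwidth (GB/s)
    dma_bw_gbps  -- achievable DMA bandwidth between memory and LDM (GB/s)
    """
    ldm_factor = min(1.0, LDM_BW_GBPS / rbw_ldm_reg)
    dma_factor = min(1.0, dma_bw_gbps / rbw_mem_ldm)
    return PEAK_GFLOPS * ee * ldm_factor * dma_factor

# A kernel that streams every operand from main memory needs
# 742.4 / 2 * 8 = 2969.6 GB/s of DMA bandwidth; at 8 GB/s it reaches only
# ~2 Gflops, well under 1% of peak, the same order as the slide's 0.32%.
print(cg_performance(ee=1.0, rbw_ldm_reg=LDM_BW_GBPS,
                     rbw_mem_ldm=2969.6, dma_bw_gbps=8.0))
```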

Page 19:

• LDM blocking: loop splitting and loop scheduling
  • We cannot apply blocking to every dimension due to the limited size of the LDM.
  • The DMA bandwidth between memory and LDM is affected by the size of the leading dimension of the loops.

• To increase data reuse in the LDM, we should place the DMA operations at the outermost loops, as in the sketch after this list.

Table I. Measured DMA bandwidths (GBps) on 1 CG

LDM-related optimizations
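A minimal sketch of the blocking idea on a toy 1-D kernel, with array copies standing in for DMA gets; the names and sizes are illustrative, not swDNN code:

```python
import numpy as np

# Minimal sketch of LDM blocking: the outermost loop issues one DMA-like
# transfer per block, and the inner loop reuses the block from "LDM".

LDM_BYTES = 64 * 1024        # LDM capacity per CPE

def blocked_dot(x, w, b_n):
    assert 2 * b_n * x.itemsize <= LDM_BYTES, "both blocks must fit in LDM"
    total = 0.0
    for n0 in range(0, len(x), b_n):   # outer loop: one DMA get per block
        x_ldm = x[n0:n0 + b_n].copy()  # .copy() stands in for the DMA get
        w_ldm = w[n0:n0 + b_n].copy()
        for i in range(len(x_ldm)):    # inner loop reuses the LDM block
            total += x_ldm[i] * w_ldm[i]
    return total

x = np.random.rand(1 << 16)
w = np.random.rand(1 << 16)
print(np.isclose(blocked_dot(x, w, b_n=2048), float(x @ w)))  # True
```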

Page 20:

$RBW_{MEM\to LDM} = \frac{(N_o + b_{Co}\,b_B)\,DS}{2\,b_{Co}\,b_B\,N_o/T} = \left(\frac{1}{b_{Co}\,b_B} + \frac{1}{N_o}\right)\frac{DS}{2/T}$

DMA get data: input elements $N_i \times b_{Co} \times b_B$; filter elements $N_i \times N_o$. Cached calculation: $2\,N_i \times N_o \times b_{Co} \times b_B$ flops. ($DS$ is the element size in Bytes; $T$ is the compute throughput.)

LDM-related optimizations

Page 21:

$RBW_{MEM\to LDM} = \frac{(B + K\,N_o)\,DS}{2\,K\,B\,N_o/T} = \left(\frac{1}{K\,N_o} + \frac{1}{B}\right)\frac{DS}{2/T}$

DMA get data: input elements $N_i \times B$; filter elements $N_i \times N_o \times K$. Cached calculation: $2\,N_i \times N_o \times B \times K$ flops.

LDM-related optimizations
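Both ratios are easy to evaluate for candidate block sizes. A hedged sketch, with $DS = 8$ Bytes (double precision) and $T = 742.4$ Gflops per CG; the helper names are illustrative, not swDNN APIs:

```python
# Evaluating the two RBW formulas above for candidate block sizes.
DS_BYTES = 8.0
T_GFLOPS = 742.4

def rbw(data_elems, flops):
    """Required DMA bandwidth in GB/s: bytes moved per unit compute time."""
    return data_elems * DS_BYTES / (flops / T_GFLOPS)

def scheme_blocked_co_b(ni, no, b_co, b_b):      # the first formula
    return rbw(ni * b_co * b_b + ni * no, 2 * ni * no * b_co * b_b)

def scheme_blocked_b_k(ni, no, b, k):            # the second formula
    return rbw(ni * b + ni * no * k, 2 * ni * no * b * k)

print(scheme_blocked_co_b(256, 256, 8, 8))       # ~58 GB/s
print(scheme_blocked_co_b(256, 256, 16, 16))     # ~23.2 GB/s: larger blocks
                                                 # lower the DMA requirement
```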

Page 22:

• GEMM operation in the LDM
  • How to design the data layout on the distributed LDM space of the 64 CPEs?
  • How to share data between the 64 CPEs when computing?
  • How to load data from LDM to registers while keeping RBW < 46.4 GB/s?

• Solution
  • Register communication
  • Vectorization

Register-Related Optimization

Page 23:

• Data layout
  • Divide the rows and columns into 8 parts
  • Each CPE maintains a 1/8 × 1/8 sub-matrix

• Data sharing
  • Register communication
    • 14-cycle latency
    • Data fetched in producer-consumer mode (a toy model follows below)

Register-Related Optimization
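A hedged toy model of this layout and the row/column sharing pattern, with NumPy indexing standing in for register communication; the function and tile helper are illustrative, not swDNN code:

```python
import numpy as np

# Tiles of A, B, and C are distributed over an 8 x 8 mesh; at step k the
# CPE at (i, j) consumes A's (i, k) tile from its row and B's (k, j) tile
# from its column. On SW26010 those tiles would arrive via register
# communication in producer-consumer mode.

P = 8  # mesh dimension of the CPE cluster

def mesh_gemm(a, b):
    n = a.shape[0]
    bs = n // P                            # each CPE owns a 1/8 x 1/8 tile
    tile = lambda m, i, j: m[i*bs:(i+1)*bs, j*bs:(j+1)*bs]
    c = np.zeros_like(a)
    for i in range(P):
        for j in range(P):                 # the work of CPE (i, j)
            acc = np.zeros((bs, bs))
            for k in range(P):
                # row broadcast of A tiles, column broadcast of B tiles
                acc += tile(a, i, k) @ tile(b, k, j)
            c[i*bs:(i+1)*bs, j*bs:(j+1)*bs] = acc
    return c

a = np.random.rand(64, 64)
b = np.random.rand(64, 64)
assert np.allclose(mesh_gemm(a, b), a @ b)  # matches a plain GEMM
```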

Page 24:

• Vectorization: reduce the RBW between LDM and registers

SIMD load: $RBW_{LDM\to REG} = \frac{(rb_B + rb_{Co})\,DS}{2\,rb_B\,rb_{Co}/T}$

SIMD load + extend: $RBW_{LDM\to REG} = \frac{(rb_B + 4\,rb_{Co})\,DS}{2\,rb_B\,rb_{Co}/T}$

With $rb_{Co} = rb_B = 4$:

$RBW_{LDM\to REG} = \frac{(4 + 4\times4)\times 8\,\mathrm{Byte}}{2\times4\times4/(1.45\,\mathrm{GHz}\times8)} = 23.2\,\mathrm{GB/s} < 46.4\,\mathrm{GB/s}$

[Figure: data blocked in registers; data blocked for the vmadd operation.]

Register-Related Optimization

Page 25:

Register-Related Optimization

SIMD load + extend with $rb_{Co} = rb_B = 4$:

$RBW_{LDM\to REG} = \frac{(rb_B + 4\,rb_{Co})\,DS}{2\,rb_B\,rb_{Co}/T} = \frac{(4 + 4\times4)\times 8\,\mathrm{Byte}}{2\times4\times4/(1.45\,\mathrm{GHz}\times8)} = 23.2\,\mathrm{GB/s} < 46.4\,\mathrm{GB/s}$

Page 26:

Principles:

• Reduce read-after-write (RAW) hazards: postpone the issuing of dependent instructions

• Increase instruction-level parallelism: overlap instruction execution on the P0 and P1 pipelines

EE = 16/26 = 61.5% before reordering; EE = 16/17 = 94.1% after (the same 16 instructions finish in 17 cycles instead of 26). A toy model follows below.

Instruction Reordering
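A toy Python model of the scheduling effect: interleaving two independent dependence chains hides RAW stalls on a single-issue pipeline. The 6-cycle latency below is an assumed illustration, not SW26010's actual FMA latency:

```python
# Toy model: a single-issue in-order pipeline where each multiply-add has a
# result latency of LATENCY cycles. A dependent chain stalls on every RAW
# hazard; interleaving two independent chains hides most of the latency.

LATENCY = 6

def cycles(schedule):
    """Total cycles to run `schedule`, a list of (name, deps) operations."""
    done = {}                                       # name -> cycle result is ready
    t = 0                                           # next issue cycle
    for name, deps in schedule:
        start = max([t] + [done[d] for d in deps])  # stall until deps retire
        done[name] = start + LATENCY
        t = start + 1                               # one issue slot per cycle
    return max(done.values())

chain_a = [(f"a{i}", [f"a{i-1}"] if i else []) for i in range(8)]
chain_b = [(f"b{i}", [f"b{i-1}"] if i else []) for i in range(8)]

back_to_back = chain_a + chain_b
interleaved = [op for pair in zip(chain_a, chain_b) for op in pair]

print(cycles(back_to_back), cycles(interleaved))  # e.g. 92 vs 49 cycles
```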

Page 27:

Double-precision performance results of our convolution kernels with different (Ni, No) ranging from (64, 64) to (384, 384), compared with K40m GPU results with cuDNN v5 (B = 128, output image = 64 × 64, filter = 3 × 3).

• Performance of convolutional layers
  - Convolution performance above 1.6 Tflops
  - Speedups ranging from 1.91x to 9.75x compared with cuDNN v5.1

Performance

Page 28:

• Performance is insensitive to the kernel size
  - More stable than cuDNN

Performance

Page 29:

• Scale up: leveraging hardware power inside a single machine
  • Use accelerators with high computing capacity, like GPU, TPU, Xeon Phi, etc.
  • Dependency graphs increase operation-level parallelism

• Scale out: using multiple machines in a large cluster to increase the available computing power
  • Data parallelism
  • Model parallelism

Towards High Performance Deep Learning

Page 30:

• SWCaffe: a deep learning framework on Sunway
  • Supports RNNs and CNNs
  • Integrated with swDNN and swBLAS
  • Scales with MPI
  • Synchronous parameter server

Parameter configuration: batch size = 1024; layers = {conv, pool, conv, pool, ip, ip, softmax}; channels = {128, -, 128, -, 500, 10, 10}; filters = {5×5, -, 3×3, -, -, -, -}; data: MNIST

Scalability on Sunway:

CG Number | 1 | 2 | 4 | 8 | 16
Sec/100 iter | 1989.79 | 1048.34 | 538.197 | 263.679 | 156.89

Distributed DNN Framework on Sunway
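A quick check of the speedups implied by the scalability table (the dictionary below just restates the slide's numbers):

```python
# Speedup and parallel efficiency relative to one CG.
sec_per_100_iter = {1: 1989.79, 2: 1048.34, 4: 538.197, 8: 263.679, 16: 156.89}

base = sec_per_100_iter[1]
for cg, t in sec_per_100_iter.items():
    speedup = base / t
    print(f"{cg:2d} CGs: {speedup:5.2f}x speedup, {speedup / cg:.0%} efficiency")
# 16 CGs give a ~12.7x speedup, i.e. roughly 79% parallel efficiency.
```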

Page 31:

Autonomous driving | Game playing | Speech | Vision

[Figure: the deep learning stack on Sunway: App at the top, then Framework, Lib, and hardware.]

Page 32:

Autonomous driving | Game playing | Speech | Vision

[Figure: the same stack, with SWCaffe filling the framework layer between the applications and the libraries and hardware.]
