TRANSCRIPT
HAOHUAN FU
Dr. Haohuan Fu is an associate professor in the Ministry of Education Key Laboratory for Earth System Modeling and the Center for Earth System Science at Tsinghua University. He is also the associate director of the National Supercomputing Center in Wuxi. His research interests include design methodologies for highly efficient and highly scalable simulation applications that take advantage of emerging multi-core, many-core, and reconfigurable architectures and make full use of current peta-flops and future exa-flops supercomputers, as well as intelligent data management, analysis, and data-mining platforms that combine statistical methods and machine learning technologies.
swDNN: A Library for Accelerating Deep Learning Applications on Sunway TaihuLight Supercomputer
Haohuan Fu
National Supercomputing Center in Wuxi
Tsinghua University
Deep Learning from the Big Picture

Data Science ("Big Data") spans:
- Data Analysis
  - Machine Learning: SVM, K-Means Clustering
  - Deep Learning: deep neural nets, convolutional neural nets
  - Recommender Systems: collaborative filtering
  - Regression, Bayesian networks, decision trees, random forests, semantic analysis
- Data Management: queries, indexing, distributed storage

[Figure: performance vs. amount of data. New AI models (deep learning) keep improving as data grows, while most learning algorithms plateau.]
Deep Learning vs. Traditional Machine Learning
(From Andrew Ng; from NVIDIA)
What is Deep Learning good at?
Ecosystem of Deep Learning

[Figure: as the complexity of the task grows, so do the size of the model, the amount of data, and the computation required.]
Challenges to Deep Learning

- Large models (memory); more floating-point operations (computing)

Images:
- 2012 AlexNet: 8 layers, 1.4 Gflop/image, 2 GB GPU memory, ~16% error
- 2015 ResNet: 152 layers, 22.6 Gflop/image, 56 GB GPU memory, ~3.5% error
- 16x computing, 28x memory

Speech:
- 2014 DeepSpeech 1: 2 ExaFLOP, 25M parameters, 7,000 hrs of data, ~16% error
- 2015 DeepSpeech 2: 20 ExaFLOP, 100M parameters, 12,000 hrs of data, ~5% error
- 10x computing, 4x memory for parameters

Computing capacity: Tflops → Pflops
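The growth factors quoted above (16x computing, 28x memory for the vision models) follow directly from the per-model figures; a quick check in Python, using only the numbers from the slide:

```python
# Per-image cost of two landmark vision models, as quoted on the slide.
alexnet_2012 = {"layers": 8,   "gflop_per_image": 1.4,  "gpu_memory_gb": 2}
resnet_2015  = {"layers": 152, "gflop_per_image": 22.6, "gpu_memory_gb": 56}

compute_growth = resnet_2015["gflop_per_image"] / alexnet_2012["gflop_per_image"]
memory_growth = resnet_2015["gpu_memory_gb"] / alexnet_2012["gpu_memory_gb"]

# Roughly 16x computing and 28x memory in three years.
print(f"computing: {compute_growth:.1f}x, memory: {memory_growth:.0f}x")
```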
Challenges to Deep Learning

"Investments in computer systems — and I think the bleeding-edge of AI, and deep learning specifically, is shifting to HPC — can cut down the time to run an experiment from a week to a day and sometimes even faster." (a Stanford professor, via http://www.theverge.com/)

High Performance Deep Learning

Big data + Deep learning + High performance computing = Intelligence (GTC'14: Deep Learning Meets Heterogeneous Computing)

Big Sur: Facebook's supercomputer for deep learning, 40 Pflops
Towards High Performance Deep Learning

- Scale up: leveraging hardware power inside a single machine
  - Use accelerators with high computing capacity, like GPU, TPU, Xeon Phi, etc.
  - Dependency graphs increase operation-level parallelism
- Scale out: using multiple machines in a large cluster to increase the available computing power
  - Data parallelism
  - Model parallelism
The Sunway Machine Family

- Sunway-I: CMA service, 1998; commercial chip; 0.384 Tflops; 48th on the TOP500
- Sunway BlueLight: NSCC-Jinan, 2011; 16-core processor; 1 Pflops; 14th on the TOP500
- Sunway TaihuLight: NSCC-Wuxi, 2016; 260-core processor (260 cores in one chip); 40,960 nodes in one system; 125 Pflops; 1st on the TOP500
Sunway TaihuLight vs. Other Systems

System                        TaihuLight   Tianhe-2    Titan       Sequoia       Cori
Peak Performance (Pflops)     125.4        54.9        27.1        20.1          27.9
Total Memory (TB)             1310         1024        710         1572          879
Linpack Performance (Pflops)  93.0 (74%)   33.9 (62%)  17.6 (65%)  17.2 (85.3%)  14.0 (50%)
TOP500 Rank                   1            2           3           4             5
Performance/Power (Mflops/W)  6051.3       1901.5      2142.8      2176.6        3266.8
Green500 Rank                 4            135         100         90            26
GTEPS                         23755.7      2061.48     ###         23751         ###
Graph500 Rank                 2            8           ###         3             ###
HPCG (Pflops)                 0.3712       0.5801      0.3223      0.3304        0.3554
HPCG Rank                     4            2           7           6             5
Architecture of SW26010

Execution Pipelines  SIMD Width  Clock     I-Cache per CPE  LDM per CPE  Memory Bandwidth
2                    256 bit     1.45 GHz  16 KB            64 KB        136 GB/s
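The 742.4 Gflops per-core-group peak used by the performance model later in the talk can be derived from this table; a minimal sketch, assuming 64 CPEs per core group and fused multiply-add (2 flops per SIMD lane per cycle):

```python
CPES_PER_CG = 64          # 8x8 CPE mesh per core group
SIMD_DOUBLES = 256 // 64  # a 256-bit SIMD register holds 4 doubles
FLOPS_PER_LANE = 2        # fused multiply-add: 2 flops per lane per cycle
CLOCK_GHZ = 1.45

# Peak double-precision rate of one core group, in Gflops.
peak_gflops = CPES_PER_CG * SIMD_DOUBLES * FLOPS_PER_LANE * CLOCK_GHZ
print(peak_gflops)
```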
Scale up: a Library for Deep Learning

- Basic neuron layers in a DNN: fully-connected layer (BLAS); pooling layer; activation function; batch normalization; *convolutional layer
- Convolutional layers occupy over 90% of the time.
A Performance Model for SW26010

Memory hierarchy of one core group (8-by-8 CPE mesh):
- Registers ↔ LDM: 46.4 GB/s per CPE (× 64 CPEs)
- LDM ↔ main memory via DMA (REG-LDM-MEM): 4~32 GB/s
- Registers ↔ main memory via direct memory access (gload/gstore): 8 GB/s

If every operand streams straight from main memory (8 bytes loaded per 2 flops at peak speed), the required bandwidth is R_BW(MEM→LDM) = 139.2 GBps, so the achievable memory bandwidth limits CPU utilization to roughly 0.32%.

The model is refined step by step:
- Peak performance per CG: Perf = 742.4 Gflops
- Considering execution efficiency (EE): Perf = 742.4 Gflops · EE
- Considering the LDM-REG exchange: Perf = 742.4 Gflops · EE · min(1, (46.4 × 64) / R_BW(LDM→REG))
- Considering the MEM-LDM exchange: Perf = 742.4 Gflops · EE · min(1, (46.4 × 64) / R_BW(LDM→REG)) · min(1, B_MEM→LDM / R_BW(MEM→LDM))
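The refinement steps above compose into a single expression; a minimal sketch in Python that mirrors the slide's formula (the bandwidth arguments are the kernel's required and the machine's achievable figures; the function name and signature are mine, not a swDNN API):

```python
PEAK_GFLOPS_PER_CG = 742.4
LDM_REG_BW = 46.4 * 64  # GB/s: 46.4 GB/s per CPE across 64 CPEs

def predicted_perf(ee, rbw_ldm_reg, b_mem_ldm, rbw_mem_ldm):
    """Predicted Gflops per core group.

    ee           -- execution efficiency of the instruction stream (0..1)
    rbw_ldm_reg  -- required LDM->register bandwidth of the kernel (GB/s)
    b_mem_ldm    -- achievable memory->LDM (DMA) bandwidth (GB/s)
    rbw_mem_ldm  -- required memory->LDM bandwidth of the kernel (GB/s)
    """
    ldm_reg_factor = min(1.0, LDM_REG_BW / rbw_ldm_reg)
    mem_ldm_factor = min(1.0, b_mem_ldm / rbw_mem_ldm)
    return PEAK_GFLOPS_PER_CG * ee * ldm_reg_factor * mem_ldm_factor

# A kernel that is not bandwidth-bound runs at EE * peak:
print(predicted_perf(0.94, LDM_REG_BW, 32.0, 16.0))
```

Each `min(1, ...)` factor caps performance at 1.0 when the corresponding level of the hierarchy is fast enough, so only the binding bottleneck reduces the prediction.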
LDM-related Optimizations

- LDM blocking: loop splitting and loop scheduling
  - We cannot apply blocking to every dimension due to the limited size of the LDM.
  - The DMA bandwidth between memory and LDM is affected by the size of the leading dimension of the loops.
- To increase data reuse in the LDM, we should arrange the DMA operations at the outermost loops.

[Table I: measured DMA bandwidths (GBps) on 1 CG]
LDM-related Optimizations

Blocking over output channels (b_Co) and batch (b_B): per tile, the DMA fetches N_r × N_i input elements and N_r × b_Co × b_B filter elements, while the cached calculation performs 2 × N_r × N_i × b_Co × b_B flops. The required bandwidth is

R_BW(MEM→LDM) = (N_i + b_Co · b_B) · DS / (2 · b_Co · b_B · N_i / T) = (1 / (b_Co · b_B) + 1 / N_i) · DS / (2 / T)
LDM-related Optimizations

Blocking over batch (B) and the kernel dimension (K_r): per tile, the DMA fetches N_r × N_i input elements and N_r × B × K_r filter elements, while the cached calculation performs 2 × N_r × N_i × B × K_r flops. The required bandwidth is

R_BW(MEM→LDM) = (B + K_r · N_i) · DS / (2 · K_r · B · N_i / T) = (1 / (K_r · N_i) + 1 / B) · DS / (2 / T)
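Both blocked variants have the same shape: the required DMA bandwidth falls as the blocking factors grow. A sketch that evaluates the first formula, assuming DS = 8 bytes (double precision) and taking T as a per-CPE rate of 8 flops/cycle at 1.45 GHz (my assumption for the throughput term):

```python
FLOP_RATE_GFLOPS = 8 * 1.45  # assumed per-CPE rate: 8 flops/cycle at 1.45 GHz
DS_BYTES = 8                 # double precision

def required_dma_bw(n_i, b_co, b_b):
    """Required MEM->LDM bandwidth (GB/s) for one blocked tile:
    (n_i + b_co*b_b) elements fetched per 2*b_co*b_b*n_i flops of cached work."""
    data_bytes = (n_i + b_co * b_b) * DS_BYTES
    time_ns = 2 * b_co * b_b * n_i / FLOP_RATE_GFLOPS  # bytes/ns == GB/s
    return data_bytes / time_ns

# Larger blocks reuse each fetched element more often:
for b in (2, 4, 8):
    print(b, round(required_dma_bw(64, b, b), 2))
```

The printed values shrink as the block size grows, which is why the slide pushes blocking factors as large as the 64 KB LDM allows.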
Register-Related Optimization

- GEMM operation in the LDM
  - How to design the data layout on the distributed LDM space of the 64 CPEs?
  - How to share data among the 64 CPEs when computing?
  - How to load data from LDM to registers while keeping R_BW < 46.4 GB/s?
- Solution
  - Register communication
  - Vectorization
Register-Related Optimization

- Data layout
  - Divide rows and columns into 8 parts
  - Each CPE maintains a 1/8 × 1/8 block of the matrix
- Data sharing via register communication
  - 14 cycles of latency
  - Data fetched in producer-consumer mode
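The 1/8 × 1/8 layout can be written down concretely; a hypothetical sketch (the `owner` helper is mine for illustration, not part of swDNN) mapping a matrix element to the CPE that holds it in the 8-by-8 mesh:

```python
MESH = 8  # 8x8 CPE mesh per core group

def owner(i, j, n):
    """Return the (row, col) of the CPE holding element (i, j) of an n x n
    matrix split into 8 x 8 contiguous blocks (each CPE keeps one
    1/8 x 1/8 tile in its LDM)."""
    block = n // MESH
    return (i // block, j // block)

# For a 512 x 512 matrix, each CPE owns a 64 x 64 tile:
print(owner(0, 0, 512), owner(100, 500, 512))
```

With this layout, a GEMM step needs row and column broadcasts across the mesh, which is exactly what the low-latency register communication above provides.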
Register-Related Optimization

- Vectorization: reduce R_BW between LDM and registers

With plain SIMD loads (register blocking by rb_B and rb_Co):

R_BW(LDM→REG) = (rb_B + rb_Co) · DS / (2 · rb_B · rb_Co / T)

With SIMD load + extend for the filter elements:

R_BW(LDM→REG) = (rb_B + 4 · rb_Co) · DS / (2 · rb_B · rb_Co / T)

With rb_Co = rb_B = 4:

R_BW(LDM→REG) = (4 + 4 × 4) × 8 Byte / (2 × 4 × 4 / (1.45 GHz × 8)) = 23.2 GB/s < 46.4 GB/s

[Figure: data blocked in registers; data blocked for the vmadd operation.]
Register-Related Optimization

Instruction reordering. Principles:
- Reduce Read-After-Write (RAW) hazards: postpone the issuing of dependent instructions.
- Increase instruction-level parallelism: overlap instruction execution on the P0 and P1 pipelines.

Reordering improves execution efficiency from EE = 16/26 = 61.5% to EE = 16/17 = 94.1%.
Performance

[Figure: double-precision performance of our convolution kernels with (Ni, No) ranging from (64, 64) to (384, 384), compared with K40m GPU results using cuDNN v5 (B = 128, output image = 64 × 64, filter = 3 × 3).]

- Performance of convolutional layers
  - Convolution performance above 1.6 Tflops
  - Speedup ranging from 1.91x to 9.75x compared with cuDNN v5.1
- Performance is insensitive to kernel size
  - More stable than cuDNN
Towards High Performance Deep Learning

- Scale up: leveraging hardware power inside a single machine
  - Use accelerators with high computing capacity, like GPU, TPU, Xeon Phi, etc.
  - Dependency graphs increase operation-level parallelism
- Scale out: using multiple machines in a large cluster to increase the available computing power
  - Data parallelism
  - Model parallelism
Distributed DNN Framework on Sunway

- SWCaffe: a deep learning framework on Sunway
  - Supports RNN and CNN
  - Integrated with swDNN and swBLAS
  - Scales with MPI
  - Synchronous parameter server

Scalability on Sunway:

CG Number      1        2        4        8        16
Sec/100 iter   1989.79  1048.34  538.197  263.679  156.89

Parameter configuration: batch size = 1024; layers = {conv, pool, conv, pool, ip, ip, softmax}; channels = {128, -, 128, -, 500, 10, 10}; filters = {5×5, -, 3×3, -, -, -, -}; dataset: MNIST.
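The speedup and parallel efficiency implied by the scalability table can be computed directly; a quick sketch, taking the 1-CG run as the baseline:

```python
# Sec per 100 iterations from the scalability table, keyed by CG count.
sec_per_100_iter = {1: 1989.79, 2: 1048.34, 4: 538.197, 8: 263.679, 16: 156.89}

baseline = sec_per_100_iter[1]
for cgs, sec in sec_per_100_iter.items():
    speedup = baseline / sec
    efficiency = speedup / cgs
    print(f"{cgs:2d} CGs: speedup {speedup:5.2f}x, efficiency {efficiency:.0%}")
```

At 16 CGs the speedup is around 12.7x, i.e. roughly 79% parallel efficiency with the synchronous parameter server.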
The software stack on Sunway, from bottom to top: hardware; Lib; Framework (SWCaffe); App. Applications include autonomous driving, game playing, speech, and vision.