Deep Learning on SHARCNET: From CPU to GPU cluster Fei Mao SHARCNET E-mail: [email protected]


Page 1:

Deep Learning on SHARCNET: From CPU to GPU cluster

Fei Mao SHARCNET E-mail: [email protected]

 

Page 2:

List of Topics:
• Deep Learning basics
  - Typical models, computational needs

• HPC and DL
  - Why GPU cluster?

• Theoretical performance: CPU vs. GPU
• Demos and benchmarks on DL tools with a real-world problem

  - Why & how to go deeper and bigger?
• Monk hardware topology
• Intra-node comm (GPUDirect P2P)
• Inter-node comm (CUDA-aware MPI)

Page 3:

Deep Learning basics

• Multiple-layer networks:
  • RBMs (Restricted Boltzmann Machines), DBN (Deep Belief Network)

• Convolutional neural network (CNN)

Page 4:

Computational needs

• RBM:
  • Sampling: MCMC, RNG
  • Updating weights: minibatch GEMM

• CNN:
  • Convolutional layers (90-95% of compute): cuDNN, FFT (cuFFT, YFFT)
  • Fully connected layers (5-10%): GEMM (see the cuBLAS sketch below)
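To make the GEMM point concrete, here is a minimal sketch (not from the slides; the layer sizes and batch size are made up, and error checks are omitted) of the single cuBLAS call that evaluates a fully connected layer for a whole minibatch. Compile with nvcc and link cuBLAS (-lcublas).

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Hypothetical layer sizes: 4096 inputs, 1024 outputs, minibatch of 128.
    const int n_out = 1024, n_in = 4096, batch = 128;
    float *d_W, *d_X, *d_Y;
    cudaMalloc(&d_W, sizeof(float) * n_out * n_in);   // weights, column-major n_out x n_in
    cudaMalloc(&d_X, sizeof(float) * n_in * batch);   // input activations, n_in x batch
    cudaMalloc(&d_Y, sizeof(float) * n_out * batch);  // outputs, n_out x batch
    // (weights and activations would be initialized or copied in here)

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Y = alpha * W * X + beta * Y : the whole minibatch in one GEMM call
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n_out, batch, n_in,
                &alpha, d_W, n_out, d_X, n_in,
                &beta,  d_Y, n_out);
    cudaDeviceSynchronize();
    printf("fully connected layer: (%d x %d) * (%d x %d) done\n", n_out, n_in, n_in, batch);

    cublasDestroy(handle);
    cudaFree(d_W); cudaFree(d_X); cudaFree(d_Y);
    return 0;
}

Doing the whole minibatch as one large matrix product is exactly the access pattern the cuBLAS GEMM benchmark later in the deck measures.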

Page 5:

Why use a GPU?

cuRAND: Up to 70x Faster vs. Intel MKL

• cuRAND 6.5 on K40c, ECC ON, double-precision input and output data on device
• MKL 11.0.4 on Intel IvyBridge single socket 12-core E5-2697 v2 @ 2.70GHz
• Performance may vary based on OS version and motherboard configuration

[Chart: GSamples/sec for Sobol32 and MRG32k3a generators under uniform, normal, and log-normal distributions, cuRAND vs. MKL]

Page 6:

Why use a GPU?

cuBLAS: ZGEMM 6x Faster than MKL

• cuBLAS 6.5 on K40m, ECC ON, input and output data on device
• MKL 11.0.4 on Intel IvyBridge single socket 12-core E5-2697 v2 @ 2.70GHz
• Performance may vary based on OS version and motherboard configuration

[Chart: GFLOPS vs. matrix dimension (m=n=k, up to 4000), cuBLAS vs. MKL]

Page 7:

Why use a GPU?

FFTs up to 10x Faster than MKL

• MKL 10.1r1 on Intel Quad Core i7-940 1333, 2.93 GHz
• cuFFT 4.0 on Tesla C2070, ECC on
• Performance measured for ~16M total elements, split into batches of transforms of the size on the x-axis

[Charts: GFLOPS vs. log2(size), cuFFT single precision and double precision (Radix-2/3/5/7) vs. MKL]

1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs

Page 8:

Why use a GPU?

Using Caffe with cuDNN

• Accelerate Caffe layer types by 1.2-3x. Example: AlexNet layer 2 forward: 1.9x faster convolution, 2.7x faster pooling
• Integrated into Caffe dev branch today! (targeting official release with Caffe 1.0)

Overall AlexNet training time, baseline Caffe compared to Caffe accelerated by cuDNN on K40:
• Caffe (CPU*): 1x
• Caffe (GPU): 11x
• Caffe (cuDNN): 14x

*CPU is 24-core E5-2697 v2 @ 2.4GHz with Intel MKL 11.1.3

Page 9:

Deep Learning tools

• Theano:
  • A Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently

  • Transparent use of a GPU (defined in ~/.theanorc)
  • Efficient symbolic differentiation
  • Speed and stability optimizations

• Theano demo and benchmark:
  • Deep Generative Stochastic Networks (GSN)

Page 10:

Running Theano on SHARCNET

1. Load python module:
  • On monk: module load python/intel/2.7.8

2. Export PYTHONPATH for other packages:
  • export PYTHONPATH=/yourpath/python_packages/lib/python2.7/site-packages:$PYTHONPATH

3. Edit ".theanorc" under /home/username:
  • Add "device = gpu" under [global]

4. Run/sqsub: python run_gsn.py

Page 11:

Theano GPU Speedup

[Chart: speedup (0-18x), CPU (whole node) vs. single GPU, on Angel (GTX 750 Ti), Monk (M2070), and a 20-core node with K20]

Page 12:

Deep Learning tools

• Caffe:
  • A deep learning framework developed with cleanliness, readability, and speed in mind.

  • Fast GPU implementation with cuDNN
  • Design your CNN without coding

• Caffe demo and benchmark:
  • Training CaffeNet/AlexNet on ImageNet!

Page 13:

Running Caffe on SHARCNET

1. Load gcc module:

  • Unload intel, mkl, openmpi and then load gcc/4.8.2
2. Export PATH and LD_LIBRARY_PATH:

  • Monk uses CUDA 6 (modified code)
  • Angel uses CUDA 6.5 with cuDNN
  • Dependencies: /work/feimao/software_installs/

3. Create a LEVELDB file from images:
  • LMDB is the default but doesn't work on SHARCNET

4. Modify train_val.prototxt with the leveldb file to read, batch size, etc.

5. Modify solver.prototxt for lr (learning rate), momentum, etc.
6. Run/sqsub: ./build/tools/caffe train --solver=…

Page 14:

CaffeNet training time (s)

[Chart: training time in seconds (0-80) on Monk, Angel, and K20, with and without cuDNN]

Training on ImageNet with batch size 128, 40 iterations = 128*40 = 5120 images

Page 15:

Why go deeper and larger

LeNet-5: 60k Handwritten digits

AlexNet: 1.2M ImageNet images (15M for pretraining), 7 CNNs, 15.3% top-5 error

GoogLeNet: 6.67% top-5 error

Page 16:

Going deeper is not easy!

• Theano and Caffe support a single GPU ONLY!
• Long time to train on a big dataset:

  - AlexNet: on a K40 machine, every 20 iterations (5120 images) takes 26.5 seconds to run; 160 hours for 90 epochs!

  - GoogLeNet: on a K40, 5120 images take 67.5 seconds; more than a month for 250 epochs!

• How to train huge networks in reasonable time?

Page 17:

GPU cluster is the cure!

• In 2012, Google used a CPU cluster of 1000 nodes (16,000 cores) to train a network with 1.8B parameters.

• In 2013, a 16-node GPU cluster (COTS HPC, 64 GPUs) could train a model with 11.2B parameters!

• In 2015, Baidu used a 36-node GPU cluster (144 GPUs) to train a network for ImageNet with a larger image size (512x512) and more cropped samples. Top-5 error: 5.98%

Page 18:

Developing on GPU cluster

• Hardware topology (Monk):

[Diagram: each node has two GPUs (GPU0, GPU1) attached to the CPU through a PCI-E switch ("sw"); nodes are connected over an Infiniband (IB) network]

• Infiniband network: ~5 GB/s
• PCI-E switch: ~6 GB/s
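Before the communication examples on the next slides, here is a minimal sketch (not from the slides) that asks the CUDA runtime which GPU pairs in a node can use GPUDirect P2P; GPUs that share a PCI-E switch, as on Monk, will typically report peer access.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            // 1 if GPU i can read/write GPU j's memory directly over PCI-E
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : P2P %s\n", i, j, canAccess ? "yes" : "no");
        }
    }
    return 0;
}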

Page 19:

Developing on GPU cluster

• Intra-node comm (GPUDirect P2P)

[Diagram: GPU0 and GPU1 in one node, behind the PCI-E switch, with the node attached to the IB network]

• w/o P2P: data goes through CPU memory
  • cudaSetDevice(0)
  • cudaMalloc(d_0)
  • cudaMemcpy(d_0 to h_0)
  • cudaSetDevice(1)
  • cudaMalloc(d_1)
  • cudaMemcpy(h_0 to d_1)

• w/ P2P: data goes directly over the PCI-E bus (a runnable sketch follows below)
  • cudaSetDevice(0)
  • cudaMalloc(d_0)
  • cudaSetDevice(1)
  • cudaMalloc(d_1)
  • cudaMemcpyPeer(d_0 to d_1)
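The bullets above are slide pseudocode; the following is a minimal runnable sketch of the "with P2P" path, with a hypothetical 64 MB buffer standing in for a block of parameters or gradients (error checks omitted).

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;   // 64 MB, for illustration only
    float *d_0, *d_1;

    cudaSetDevice(0);
    cudaMalloc(&d_0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&d_1, bytes);

    // Enable peer access in both directions so the copy stays on the PCI-E bus
    // instead of bouncing through a host buffer.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    // Arguments: dst pointer, dst device, src pointer, src device, byte count.
    cudaMemcpyPeer(d_1, 1, d_0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(d_1);
    cudaSetDevice(0);
    cudaFree(d_0);
    return 0;
}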

Page 20:

Developing on GPU cluster

• Multi-GPU implementation of Theano:

  - Parallel data loading
  - Data parallelism of AlexNet
  - Using PyCUDA with GPUDirect P2P support
  - ~1.7x speed-up with 2 GPUs
  - https://github.com/uoguelph-mlrg/theano_alexnet

Page 21:

Developing on GPU cluster

• Inter-node comm (CUDA-aware MPI)

[Diagram: GPUs on different nodes, each behind its node's PCI-E switch, communicating over the IB network]

• w/o CUDA-aware MPI: data goes to CPU memory and then onto the IB network

  • cudaMalloc(d_0 on proc_0)
  • cudaMalloc(d_1 on proc_1)
  • cudaMemcpy(d_0 to h_0)
  • MPI_Send(from h_0 to h_1)
  • cudaMemcpy(h_1 to d_1)

• w/ CUDA-aware MPI: GPU data goes to the IB network directly through the PCI-E bus (a runnable sketch follows below)

  • cudaMalloc(d_0 on proc_0)
  • cudaMalloc(d_1 on proc_1)
  • MPI_Send(from d_0 to d_1)
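As a minimal sketch rather than the presenter's actual code: with a CUDA-aware MPI build, device pointers can be handed straight to MPI_Send/MPI_Recv and the library moves the data over the IB network without an explicit staging copy to host memory. The buffer size and rank-to-GPU mapping below are illustrative; run with at least two ranks.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaSetDevice(rank % 2);            // assumes up to two GPUs per node, as on monk

    const int count = 1 << 20;          // 1M floats of gradients, for illustration
    float *d_buf;
    cudaMalloc(&d_buf, count * sizeof(float));

    if (rank == 0) {
        // The device pointer goes straight into MPI; no cudaMemcpy to a host buffer first.
        MPI_Send(d_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}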

Page 22:

Conclusion:  

Go deeper and deeper and deeper… on a GPU cluster!