A Scalable Implementation of Deep Learning on Spark - Alexander Ulanov
A Scalable Implementation of Deep Learning on Spark
Alexander Ulanov 1
Joint work with Xiangrui Meng 2 and Bert Greevenbosch 3, with help from Guoqiang Li 4 and Andrey Simanovsky 1
1 Hewlett-Packard Labs  2 Databricks  3 Huawei & Jules Energy  4 Spark community
Outline
• Artificial neural network basics
• Implementation of Multilayer Perceptron (MLP) in Spark
• Optimization & parallelization
• Experiments
• Future work
Artificial neural network
Basics
• Statistical model that approximates a function of multiple inputs
• Consists of interconnected “neurons” which exchange messages
– A “neuron” produces an output by applying a transformation function to its inputs
• A network with more than 3 layers of neurons is called “deep”, an instance of deep learning

Layer types & learning
• A layer type is defined by its transformation function
– Affine: y = W·x + b; Sigmoid: y = 1/(1 + e^(-x)); Convolution; Softmax; etc. (a forward-pass sketch follows the figure below)
• Multilayer perceptron (MLP) – a network with several pairs of Affine & Sigmoid layers
• Model parameters – the weights that “neurons” use for transformations
• Parameters are iteratively estimated with the backpropagation algorithm

Multilayer perceptron
• Speech recognition (phoneme classification), computer vision
• Introduced in Spark 1.5.0
[Figure: network diagram – input x, hidden layer, output y]
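To make the layer formulas above concrete, here is a minimal forward-pass sketch (illustrative only, not the MLlib internals), using the Breeze linear algebra library that Spark MLlib builds on:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

// One Affine + Sigmoid pair: y = sigmoid(W*x + b)
case class AffineSigmoidLayer(weights: DenseMatrix[Double], bias: DenseVector[Double]) {
  def forward(x: DenseVector[Double]): DenseVector[Double] = sigmoid(weights * x + bias)
}

// An MLP prediction is just the layers applied in sequence
def predict(layers: Seq[AffineSigmoidLayer], input: DenseVector[Double]): DenseVector[Double] =
  layers.foldLeft(input)((x, layer) => layer.forward(x))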
Example of MLP in Spark
Handwritten digit recognition
• Dataset: MNIST [LeCun et al. 1998]
• 28x28 greyscale images of handwritten digits 0-9
• MLP with 784 inputs, 10 outputs, and two hidden layers of 300 and 100 neurons
Scala:
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(784, 300, 100, 10))
  .setBlockSize(128)
val model = mlp.fit(digits)
[Figure: network diagram – 784 inputs, 1st hidden layer (300 neurons), 2nd hidden layer (100 neurons), output layer (10 neurons)]
Python:
digits = sqlContext.read.format("libsvm").load("/data/mnist")
mlp = MultilayerPerceptronClassifier(layers=[784, 300, 100, 10], blockSize=128)
model = mlp.fit(digits)
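A natural follow-up (not shown on the slide) is to score the trained model. MulticlassClassificationEvaluator is the standard spark.ml evaluator; the test-set path below is hypothetical:

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load a held-out test split the same way as the training data (path is an assumption)
val test = sqlContext.read.format("libsvm").load("/data/mnist-test")
val predictions = model.transform(test) // adds a "prediction" column
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Test accuracy: " + evaluator.evaluate(predictions))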
Pipeline with PCA+MLP in Spark
Scala:
val digits: DataFrame = sqlContext.read.format("libsvm").load("/data/mnist")
val pca = new PCA()
  .setInputCol("features")
  .setK(20)
  .setOutputCol("features20")
val mlp = new MultilayerPerceptronClassifier()
  .setFeaturesCol("features20")
  .setLayers(Array(20, 50, 10))
  .setBlockSize(128)
val pipeline = new Pipeline()
  .setStages(Array(pca, mlp))
val model = pipeline.fit(digits)
Python:
digits = sqlContext.read.format("libsvm").load("/data/mnist8m")
pca = PCA(inputCol="features", k=20, outputCol="features20")
mlp = MultilayerPerceptronClassifier(featuresCol="features20", layers=[20, 50, 10], blockSize=128)
pipeline = Pipeline(stages=[pca, mlp])
model = pipeline.fit(digits)
MLP implementation in Spark
Requirements
• Conform to Spark APIs
• Extensible interface (deep learning API)
• Efficient and scalable (single node & cluster)

Why conform to Spark APIs?
• Spark can call any Java, Python or Scala library, even ones not designed for Spark
– But this results in expensive data movement from Spark RDDs to the library
– And it prevents use in Spark ML Pipelines

Extensible interface
• Our implementation processes each layer as a black box, with backpropagation in a general form (see the sketch below)
– This allows new layers and features to be introduced later
• CNN, Autoencoder and RBM are currently under development by the community
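A minimal sketch of that black-box contract (names are illustrative, not the exact internal Spark API):

import breeze.linalg.DenseMatrix

// Each layer only has to propagate activations forward and gradients backward;
// backpropagation then works uniformly on any stack of such layers.
trait Layer {
  def forward(input: DenseMatrix[Double]): DenseMatrix[Double]
  // Given the layer's input and the gradient w.r.t. its output, return the gradient
  // w.r.t. its input (gradients w.r.t. the layer's own weights accumulate internally).
  def backward(input: DenseMatrix[Double], outputDelta: DenseMatrix[Double]): DenseMatrix[Double]
}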
Efficiency
Batch processing
• A layer's affine transformation can be represented in vector form: y = W·x + b
– y – output of the layer, vector of size n
– W – the n×m matrix of layer weights; b – bias, vector of size n
– x – input to the layer, vector of size m
• Vector-matrix multiplications are not as efficient as matrix-matrix
– Stack the input vectors into a batch and perform one matrix-matrix multiplication: Y = W·X + B
– X is m×k, Y is n×k, B is n×k, and each column of B contains a copy of b
• We implemented batch processing in matrix form
– Enabled the use of optimized native BLAS libraries
– Memory is reused to limit GC overhead
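As an illustration of what the matrix form buys, the whole batched affine step Y = W·X + B collapses into a single BLAS dgemm call. This sketch calls netlib-java directly; the helper and its layout choices are ours, not Spark's code:

import com.github.fommil.netlib.BLAS

// Y = W*X + B for one layer and one batch; column-major arrays.
// n = output size, m = input size, k = batch size.
def affineBatch(w: Array[Double], x: Array[Double], b: Array[Double],
                n: Int, m: Int, k: Int): Array[Double] = {
  // Pre-fill Y with one copy of the bias b per column (this is the matrix B)
  val y = Array.tabulate(n * k)(i => b(i % n))
  // Y := 1.0 * W * X + 1.0 * Y, one matrix-matrix multiply for the whole batch
  BLAS.getInstance().dgemm("N", "N", n, k, m, 1.0, w, n, x, m, 1.0, y, n)
  y
}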
[Figure: dgemm performance in seconds (log scale) for matrix products from (1x1)*(1x1) up to (10000x10000)*(10000x10000); series: netlib-NVBLAS, netlib-MKL, netlib-OpenBLAS, netlib-f2jblas]
Single node BLAS
BLAS in Spark
• BLAS – Basic Linear Algebra Subprograms
• Hardware-optimized native implementations in C & Fortran
– CPU: MKL, OpenBLAS, etc.
– GPU: NVBLAS (F-BLAS interface to CUDA)
• Used in Spark through netlib-java

Experiments
• Huge benefit from native BLAS vs pure Java f2jblas
• GPU is faster (2x) only for large matrices
– When the computation takes longer than the copy to/from the GPU
• More details:
– https://github.com/avulanov/scala-blas
– “linalg: Matrix Computations in Apache Spark”, Reza et al., 2015
CPU: 2x Xeon X5650 @ 2.67GHz, 32GB RAM; GPU: Tesla M2050 3GB, 575MHz, 448 CUDA cores
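To reproduce such comparisons it helps to know which backend netlib-java actually loaded; it is chosen at JVM startup and can be steered with the standard netlib-java system properties:

import com.github.fommil.netlib.BLAS

// Prints e.g. com.github.fommil.netlib.F2jBLAS (pure Java fallback) or NativeSystemBLAS
println(BLAS.getInstance().getClass.getName)
// To force the system-installed native BLAS (OpenBLAS, MKL, ...), start the JVM with:
//   -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.NativeSystemBLAS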
Scalability
Parallelization
• On each iteration k, each node:
– 1. Gets the current parameters w^k from the master
– 2. Computes a gradient ∇_i^k F on its data partition
– 3. Sends the gradient to the master
– 4. The master computes w^(k+1) based on the gradients
• Gradient type
– Batch – process all data on each iteration
– Stochastic – a single random point
– Mini-batch – a random batch
• How many workers to use?
– Fewer workers – less compute power
– More workers – more communication
(a gradient-aggregation sketch in Spark terms follows the figure below)
[Figure: master/executor diagram of one iteration – (1) the master sends w^k to executors 1…N, (2) each executor i computes ∇_i^k F(data_i) on its partitions, (3) executors send the gradients to the master, (4) the master computes w^(k+1) := Y(∇_i^k F) and goes to step 1]
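In Spark terms, steps 2 and 3 map naturally onto treeAggregate. A minimal sketch under that assumption (illustrative, not the exact MLlib implementation):

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

def elementwiseSum(a: Array[Double], b: Array[Double]): Array[Double] =
  a.zip(b).map { case (u, v) => u + v }

// Partial gradients per partition (step 2), summed towards the driver (step 3).
// In practice w would be broadcast to the executors first (step 1).
def batchGradient(data: RDD[LabeledPoint], w: Array[Double],
                  pointGradient: (Array[Double], LabeledPoint) => Array[Double]): Array[Double] =
  data.treeAggregate(new Array[Double](w.length))(
    (g, p) => elementwiseSum(g, pointGradient(w, p)), // gradient within a partition
    (g1, g2) => elementwiseSum(g1, g2)                // merge across executors
  )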
Communication and computation trade-off
Parallelization of batch gradient
• There are d data points, f features and c classes
– Assume we want to train logistic regression; it has w = f·c parameters
• Communication: n workers get/receive the w 64-bit parameters over a network with bandwidth B and software overhead s per message. Using all-reduce:
– t_cm = 2·(64·w/B + s) = 128·w/B + 2·s
• Computation: each worker has F FLOPS and processes 1/n of the data, which needs m operations per data point:
– t_cp = d·m/(n·F)
• What is the optimal number of workers?
– Adding workers pays off while per-worker computation still dominates communication, t_cp ≥ t_cm, which gives
– n = max(⌊d·m/(F·(128·w/B + 2·s))⌋, 1), where w is the number of model parameters and m the number of floating point operations per data point
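The heuristic doubles as a back-of-the-envelope calculator. The usage line plugs in the numbers from the MLP experiment later in this deck; the ~0.69 FLOP-per-weight-per-sample figure is taken from that slide's worked example:

// d data points, m FLOP per point, w parameters, F FLOPS, B bandwidth (bit/s), s overhead (s)
def optimalWorkers(d: Double, m: Double, w: Double, F: Double, B: Double, s: Double): Int =
  math.max(math.floor(d * m / (F * (128 * w / B + 2 * s))).toInt, 1)

// 60K samples, 12M weights, ~0.69*12M FLOP per sample, 64 GFLOPS, 950 Mbit/s, 0.1 s overhead
optimalWorkers(60e3, 0.69 * 12e6, 12e6, 64e9, 950e6, 0.1) // = 4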
Analysis of the trade-off
Optimal number of workers for batch gradient
• Parallelism in a cluster: n = max(⌊d·m/(F·(128·w/B + 2·s))⌋, 1)
• Analysis
– More FLOPS F means a lower degree of batch gradient parallelism in a cluster
– More operations m, i.e. more features and classes (or a deep network), means a higher degree
– A smaller overhead s for sending/receiving a message means a higher degree
• Example: MNIST8M handwritten digit recognition dataset
– 8.1M documents, 784 features, 10 classes, logistic regression
– 32 GFLOPS double precision CPU, 1Gbit network, overhead ~0.1s
[Figure: Spark MLP vs Caffe MLP – seconds per iteration vs number of workers, with the communication cost indicated; series: MLP (total), MLP (compute), Caffe CPU, Caffe GPU]
Scalability testing
Setup
• MNIST character recognition, 60K samples
• 6-layer MLP (784, 2500, 2000, 1500, 1000, 500, 10)
• 12M parameters
• CPU: Xeon X5650 @ 2.67GHz
• GPU: Tesla M2050 3GB, 575MHz
• Caffe (deep learning framework from Berkeley): 1 node
• Spark: 1 master + 5 workers

Results per iteration
• Single node (both tools in double precision)
– 1.6x slower than Caffe CPU (Scala vs C++)
• Scalability
– 5 nodes give a 4.7x speedup, beat Caffe CPU, and come close to Caffe GPU
Applying the parallelization heuristic to this setup:
n = max(⌊60K · 12M · 0.69 / (64G · (128 · 12M/950M + 2 · 0.1))⌋, 1) = 4
Conclusions & future work
Conclusions
• Scalable multilayer perceptron is available in Spark 1.5.0
• Extensible internal API for artificial neural networks
– Further contributions are welcome!
• Native BLAS (and GPU) speeds up Spark
• Heuristics for parallelization of batch gradient

Work in progress [SPARK-5575]
• Autoencoder(s)
• Restricted Boltzmann Machines
• Drop-out
• Convolutional neural networks

Future work
• SGD & parameter server
Thank you