distributed machine learning on spark - stanford...
TRANSCRIPT
![Page 1: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/1.jpg)
Reza Zadeh
Distributed Machine Learning"on Spark
@Reza_Zadeh | http://reza-zadeh.com
![Page 2: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/2.jpg)
Outline Data flow vs. traditional network programming Spark computing engine Optimization Example
Matrix Computations MLlib + {Streaming, GraphX, SQL} Future of MLlib
![Page 3: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/3.jpg)
Data Flow Models Restrict the programming interface so that the system can do more automatically Express jobs as graphs of high-level operators » System picks how to split each operator into tasks
and where to run each task » Run parts twice fault recovery
Biggest example: MapReduce Map
Map
Map
Reduce
Reduce
![Page 4: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/4.jpg)
Spark Computing Engine Extends a programming language with a distributed collection data-structure » “Resilient distributed datasets” (RDD)
Open source at Apache » Most active community in big data, with 50+
companies contributing
Clean APIs in Java, Scala, Python Community: SparkR
![Page 5: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/5.jpg)
Key Idea Resilient Distributed Datasets (RDDs) » Collections of objects across a cluster with user
controlled partitioning & storage (memory, disk, ...) » Built via parallel transformations (map, filter, …) » The world only lets you make make RDDs such that
they can be:
Automatically rebuilt on failure
![Page 6: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/6.jpg)
MLlib History MLlib is a Spark subproject providing machine learning primitives
Initial contribution from AMPLab, UC Berkeley
Shipped with Spark since Sept 2013
![Page 7: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/7.jpg)
MLlib: Available algorithms classification: logistic regression, linear SVM,"naïve Bayes, least squares, classification tree regression: generalized linear models (GLMs), regression tree collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) clustering: k-means|| decomposition: SVD, PCA optimization: stochastic gradient descent, L-BFGS
![Page 8: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/8.jpg)
Optimization At least two large classes of optimization problems humans can solve: » Convex Programs
» Spectral Problems
![Page 9: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/9.jpg)
Optimization Example
![Page 10: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/10.jpg)
Logistic Regression data = spark.textFile(...).map(readPoint).cache() w = numpy.random.rand(D) for i in range(iterations): gradient = data.map(lambda p: (1 / (1 + exp(-‐p.y * w.dot(p.x)))) * p.y * p.x ).reduce(lambda a, b: a + b) w -‐= gradient print “Final w: %s” % w
![Page 11: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/11.jpg)
Logistic Regression Results
0 500
1000 1500 2000 2500 3000 3500 4000
1 5 10 20 30
Runn
ing T
ime
(s)
Number of Iterations
Hadoop Spark
110 s / iteration
first iteration 80 s further iterations 1 s
100 GB of data on 50 m1.xlarge EC2 machines
![Page 12: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/12.jpg)
Behavior with Less RAM 68
.8
58.1
40.7
29.7
11.5
0
20
40
60
80
100
0% 25% 50% 75% 100%
Itera
tion
time
(s)
% of working set in memory
![Page 13: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/13.jpg)
Distributing Matrix Computations
![Page 14: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/14.jpg)
Distributing Matrices How to distribute a matrix across machines? » By Entries (CoordinateMatrix) » By Rows (RowMatrix)
» By Blocks (BlockMatrix) All of Linear Algebra to be rebuilt using these partitioning schemes
As of version 1.3
![Page 15: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/15.jpg)
Distributing Matrices Even the simplest operations require thinking about communication e.g. multiplication
How many different matrix multiplies needed? » At least one per pair of {Coordinate, Row,
Block, LocalDense, LocalSparse} = 10 » More because multiplies not commutative
![Page 16: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/16.jpg)
Singular Value Decomposition on Spark
![Page 17: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/17.jpg)
Singular Value Decomposition
![Page 18: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/18.jpg)
Singular Value Decomposition Two cases » Tall and Skinny » Short and Fat (not really)
» Roughly Square SVD method on RowMatrix takes care of which one to call.
![Page 19: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/19.jpg)
Tall and Skinny SVD
![Page 20: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/20.jpg)
Tall and Skinny SVD
Gets us V and the singular values
Gets us U by one matrix multiplication
![Page 21: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/21.jpg)
Square SVD ARPACK: Very mature Fortran77 package for computing eigenvalue decompositions"
JNI interface available via netlib-java"
Distributed using Spark – how?
![Page 22: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/22.jpg)
Square SVD via ARPACK Only interfaces with distributed matrix via matrix-vector multiplies
The result of matrix-vector multiply is small. The multiplication can be distributed.
![Page 23: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/23.jpg)
Square SVD
With 68 executors and 8GB memory in each, looking for the top 5 singular vectors
![Page 24: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/24.jpg)
Communication-Efficient
All pairs similarity on Spark (DIMSUM)
![Page 25: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/25.jpg)
All pairs Similarity All pairs of cosine scores between n vectors » Don’t want to brute force (n choose 2) m » Essentially computes
"Compute via DIMSUM
» Dimension Independent Similarity Computation using MapReduce
![Page 26: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/26.jpg)
Intuition Sample columns that have many non-zeros with lower probability. "
On the flip side, columns that have fewer non-zeros are sampled with higher probability." Results provably correct and independent of larger dimension, m.
![Page 27: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/27.jpg)
Spark implementation
![Page 28: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/28.jpg)
MLlib + {Streaming, GraphX, SQL}
![Page 29: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/29.jpg)
A General Platform
Spark Core
Spark Streaming"
real-time
Spark SQL structured
GraphX graph
MLlib machine learning
…
Standard libraries included with Spark
![Page 30: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/30.jpg)
Benefit for Users Same engine performs data extraction, model training and interactive queries
… DFS read
DFS write pa
rse DFS read
DFS write tra
in DFS read
DFS write qu
ery
DFS
DFS read pa
rse
train
quer
y
Separate engines
Spark
![Page 31: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/31.jpg)
MLlib + Streaming As of Spark 1.1, you can train linear models in a streaming fashion, k-means as of 1.2
Model weights are updated via SGD, thus amenable to streaming
More work needed for decision trees
![Page 32: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/32.jpg)
MLlib + SQL
points = context.sql(“select latitude, longitude from tweets”) !
model = KMeans.train(points, 10) !!
DataFrames coming in Spark 1.3! (March 2015)
![Page 33: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/33.jpg)
MLlib + GraphX
![Page 34: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/34.jpg)
Future of MLlib
![Page 35: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/35.jpg)
Research Goal: General Distributed Optimization
Distribute CVX by backing CVXPY with
PySpark
Easy-‐to-‐express distributable convex
programs
Need to know less math to optimize
complicated objectives
![Page 36: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/36.jpg)
Most active open source community in big data
200+ developers, 50+ companies contributing
Spark Community
Giraph Storm
0
50
100
150
Contributors in past year
![Page 37: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/37.jpg)
Continuing Growth
source: ohloh.net
Contributors per month to Spark
![Page 38: Distributed Machine Learning on Spark - Stanford …stanford.edu/~rezab/slides/strata2015.pdfdistributed collection data-structure » “Resilient distributed datasets” (RDD) Open](https://reader030.vdocument.in/reader030/viewer/2022040410/5ecd7e9934d01e3af214ac48/html5/thumbnails/38.jpg)
Spark and ML Spark has all its roots in research, so we hope to keep incorporating new ideas!