Introduction to Distributed Optimization
Reza Zadeh
@Reza_Zadeh | http://reza-zadeh.com
Key Idea: Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user-controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, …)
» The world only lets you make RDDs such that they can be:
  Automatically rebuilt on failure
Life of a Spark Program
1) Create some input RDDs from external data, or parallelize a collection in your driver program.
2) Lazily transform them to define new RDDs using transformations like filter() or map().
3) Ask Spark to cache() any intermediate RDDs that will need to be reused.
4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
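These four steps look like the following minimal sketch (assuming a local SparkContext and a hypothetical input path):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "intro_to_dao")

# 1) Create an input RDD from external data (path is hypothetical)
lines = sc.textFile("input.txt")

# 2) Lazily define new RDDs; nothing executes yet
words = lines.flatMap(lambda line: line.split())
long_words = words.filter(lambda w: len(w) > 5)

# 3) Cache an intermediate RDD that later actions will reuse
long_words.cache()

# 4) Actions trigger the actual parallel computation
print(long_words.count())
print(long_words.take(10))
```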
Example Transformations
map()                     intersection()   cartesian()
flatMap()                 distinct()       pipe()
filter()                  groupByKey()     coalesce()
mapPartitions()           reduceByKey()    repartition()
mapPartitionsWithIndex()  sortByKey()      partitionBy()
sample()                  join()           ...
union()                   cogroup()        ...
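A quick taste of a few of these on small test data (assuming the `sc` from the sketch above; collected results may come back in any order):

```python
a = sc.parallelize([1, 2, 2, 3])
b = sc.parallelize([3, 4])

a.distinct().collect()        # [1, 2, 3]
a.union(b).collect()          # [1, 2, 2, 3, 3, 4]
a.intersection(b).collect()   # [3]
a.cartesian(b).count()        # 8 pairs
```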
Example Actions
reduce()           takeOrdered()
collect()          saveAsTextFile()
count()            saveAsSequenceFile()
first()            saveAsObjectFile()
take()             countByKey()
takeSample()       foreach()
saveToCassandra()  ...
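Unlike transformations, actions return values to the driver (or write to storage) and force evaluation:

```python
nums = sc.parallelize([3, 1, 2, 1])

nums.count()                              # 4
nums.takeOrdered(2)                       # [1, 1]
nums.reduce(lambda a, b: a + b)           # 7
nums.map(lambda x: (x, 1)).countByKey()   # {3: 1, 1: 2, 2: 1} (a defaultdict)
```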
PairRDDFunctions
Operations for RDDs of tuples (Scala has nice tuple support).
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
groupByKey
Avoid using it – use reduceByKey.
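A minimal sketch of why: reduceByKey combines values on each partition before the shuffle, while groupByKey ships every individual value across the network first.

```python
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# Preferred: per-partition combining happens before the shuffle
pairs.reduceByKey(lambda a, b: a + b).collect()   # [('a', 2), ('b', 1)]

# Avoid: all values cross the network, then get summed at the reducer
pairs.groupByKey().mapValues(sum).collect()       # same result, more traffic
```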
Guide for RDD operations
https://spark.apache.org/docs/latest/programming-guide.html
Browse through this.
Communication Costs
MLlib: Available algorithms
classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
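For comparison with the hand-rolled gradient descent later in these notes, a single call into the RDD-based MLlib API looks roughly like this (a sketch; `points` is assumed to be an RDD of LabeledPoint and `test_features` a feature vector, both hypothetical):

```python
from pyspark.mllib.classification import LogisticRegressionWithSGD

# Train on an RDD[LabeledPoint] via distributed stochastic gradient descent
model = LogisticRegressionWithSGD.train(points, iterations=100)

# Predict the class of a single feature vector
model.predict(test_features)
```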
Optimization
At least two large classes of optimization problems humans can solve:
» Convex (e.g., logistic regression)
» Spectral (e.g., PCA/SVD)
Optimization Example: Gradient Descent
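For reference, this is the standard gradient descent update with step size $\alpha$ (a standard formula, stated here because the logistic regression code below implements it with $\alpha = 1$):

```latex
w_{t+1} = w_t - \alpha \, \nabla f(w_t)
```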
ML Objectives
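Most of the MLlib objectives above share a data-separable form; this is the standard formulation, stated here for context:

```latex
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \ell(w;\, x_i, y_i) \;+\; \lambda\, R(w)
```

Because the loss is a sum over examples, its gradient is also a sum of per-example terms, which is exactly what a map over the RDD followed by a reduce computes.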
Scaling
1) Data size
2) Model size
3) Number of models
Logistic Regression

import numpy
from math import exp

data = sc.textFile(...).map(readPoint).cache()  # cache(): later passes reuse parsed points in memory

w = numpy.random.rand(D)  # random initial weight vector

for i in range(iterations):
    # Per-point gradient of the logistic loss, summed across the cluster:
    # (sigmoid(y * w.x) - 1) * y * x
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w: %s" % w)
Separable Updates
Can be generalized for:
» Unconstrained optimization
» Smooth or non-smooth objectives
» L-BFGS, Conjugate Gradient, Accelerated Gradient methods, … (see the sketch below)
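A sketch of why separability matters: any optimizer that only needs function values and full gradients can reuse the same map-and-reduce step. Here `data`, `loss_point`, `grad_point`, and `w0` are hypothetical stand-ins, with SciPy's L-BFGS-B driving the iterations:

```python
from scipy.optimize import minimize

def full_loss(w):
    # Separable sum: each point contributes its loss independently
    return data.map(lambda p: loss_point(w, p)).reduce(lambda a, b: a + b)

def full_gradient(w):
    # The gradient is the same map-and-reduce over per-point gradients
    return data.map(lambda p: grad_point(w, p)).reduce(lambda a, b: a + b)

result = minimize(full_loss, w0, jac=full_gradient, method="L-BFGS-B")
```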
Logistic Regression Results
[Plot: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
100 GB of data on 50 m1.xlarge EC2 machines
Behavior with Less RAM
[Plot: iteration time (s) vs. % of working set in memory]
% of working set in memory:  0%    25%   50%   75%   100%
Iteration time (s):          68.8  58.1  40.7  29.7  11.5