Scalable Machine Learning with Apache Spark
Evan Casey
@ev_ancasey
Who am I?
● Engineer at Tapad
● HackNY 2014 Fellow
● Things I work on:
  ○ Scala
  ○ Distributed systems
  ○ Hadoop/Spark
Overview
● Apache Spark
  ○ Dataflow model
  ○ Spark vs Hadoop MapReduce
  ○ Programming with Spark
● Machine Learning with Spark
  ○ MLlib overview
  ○ Gradient descent example
  ○ Distributed implementation on Apache Spark
  ○ Lessons learned
Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
  ○ Interactive analytics
  ○ Graph processing
  ○ Stream processing
  ○ Scalable ML
Why Spark?
● Up to 100x faster than Hadoop MapReduce for in-memory workloads
● Built on top of Akka
● Expressive APIs in Scala, Java, and Python
● Active open-source community
Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance: lost RDD partitions are recomputed from their lineage rather than replicated
● Similar programming style to Scalding/Cascading
Programming Model
● Resilient Distributed Dataset (RDD)
  ○ textFile, parallelize
● Parallel operations
  ○ map, groupBy, filter, join, etc.
● Optimizations
  ○ Caching, shared variables
Wordcount Example

import org.apache.spark.SparkContext

val sc = new SparkContext()
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
// counts.cache
// sc.broadcast(counts)
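The commented-out lines point at the optimizations from the Programming Model slide. A minimal sketch of caching plus a broadcast variable, reusing sc from above (the stopword set and input path are hypothetical):

// Ship a small lookup table to every executor once, instead of with each task
val stopwords = sc.broadcast(Set("a", "an", "the"))

val words = sc.textFile("hdfs://...")            // hypothetical input path
  .flatMap(_.split(" "))
  .filter(w => !stopwords.value.contains(w))
  .cache()                                       // keep the filtered RDD in memory

val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
val total = words.count()                        // second action reuses the cached RDD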
Machine Learning in Spark
Algorithms:
- classification: logistic regression (sketch below), linear SVM, naive Bayes, random forests
- regression: generalized linear models, regression tree
- collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)
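For a concrete taste of the classification list, a minimal logistic regression sketch against the RDD-based MLlib API (Spark 1.x style); the input path and the "label feature1 feature2 ..." line format are assumptions:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Each line: "<label> <feature1> <feature2> ..."
val training = sc.textFile("hdfs://...").map { line =>
  val parts = line.split(' ').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()

// Train with 100 iterations of stochastic gradient descent
val model = LogisticRegressionWithSGD.train(training, 100)

// Score one point from the training set
val prediction = model.predict(training.first().features)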
K-Means Clustering

import org.apache.spark.mllib.clustering.KMeans

// Load and parse the data
val data = sc.textFile("hdfs://...")
val parsedData = data.map(_.split(' ').map(_.toDouble)).cache()

// Cluster the data into two classes
val numIterations = 20
val clusters = KMeans.train(parsedData, 2, numIterations)

// Compute the sum of squared errors
val cost = clusters.computeCost(parsedData)
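To use the fitted model, per-point cluster assignments come from predict. A small follow-on sketch, assuming the same early-MLlib API as the slide, where points are plain Array[Double]:

// Assign every parsed point to its nearest centroid (0 or 1)
val assignments = parsedData.map(point => clusters.predict(point))

// Count how many points fell into each cluster
assignments.countByValue().foreach(println)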
Gradient Descent Example
val file = sc.textFile("hdfs://...")
val points = file.map(parsePoint).cache()
var w = Vector.zeros(d)

for (i <- 1 to numIterations) {
  // Compute the logistic-loss gradient over all points in parallel
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  // Update the weight vector on the driver
  w -= alpha * gradient
}
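The loop leans on a few names the slide leaves undefined (parsePoint, d, alpha, numIterations) and on the pre-1.0 org.apache.spark.util.Vector helper. A hedged sketch of those pieces; the feature count, learning rate, and input line format are hypothetical:

import org.apache.spark.util.Vector     // old vector helper used by early Spark examples
import org.apache.spark.util.Vector._   // companion implicits, e.g. letting a Double scale a Vector (alpha * gradient)
import scala.math.exp

case class DataPoint(x: Vector, y: Double)

val d = 10                  // number of features (hypothetical)
val alpha = 0.1             // learning rate (hypothetical)
val numIterations = 100

// Each input line: "<label> <feature1> ... <featureD>"
def parsePoint(line: String): DataPoint = {
  val nums = line.split(' ').map(_.toDouble)
  DataPoint(new Vector(nums.tail), nums.head)
}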
About Tapad
● 350k QPS
● Ingest multiple TBs daily
● Kafka, Scalding, Spark, Zookeeper, Aerospike
● We’re hiring! :)
Thanks!
@ev_ancasey
Questions?