Scalable Machine Learning with Apache Spark Evan Casey @ev_ancasey

Upload: evan-casey

Post on 01-Jul-2015


DESCRIPTION

Introduction to Machine Learning with MLlib and Apache Spark

TRANSCRIPT

Page 1: Machine Learning with Apache Spark - HackNY Masters

Scalable Machine Learning with Apache Spark
Evan Casey
@ev_ancasey

Page 2:

Who am I?
● Engineer at Tapad
● HackNY 2014 Fellow
● Things I work on:
  ○ Scala
  ○ Distributed systems
  ○ Hadoop/Spark

Page 3:

Overview
● Apache Spark
  ○ Dataflow model
  ○ Spark vs Hadoop MapReduce
  ○ Programming with Spark
● Machine Learning with Spark
  ○ MLlib overview
  ○ Gradient descent example
  ○ Distributed implementation on Apache Spark
  ○ Lessons learned

Page 4:

Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
  ○ Interactive analytics
  ○ Graph processing
  ○ Stream processing
  ○ Scalable ML

Page 5:

Why Spark?
● Up to 100x faster than Hadoop
● Built on top of Akka
● Expressive APIs in Scala, Java, and Python
● Active open-source community

Page 6:

Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading

Page 7:

Programming Model
● Resilient Distributed Dataset (RDD)
  ○ textFile, parallelize
● Parallel operations
  ○ map, groupBy, filter, join, etc.
● Optimizations
  ○ Caching, shared variables

Page 8:

Wordcount Example

```scala
val sc = new SparkContext()
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
// counts.cache
// sc.broadcast(counts)
```
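For intuition, the same flatMap → map → reduce-by-key dataflow can be sketched on an ordinary local Scala collection (a stand-in for the RDD API, not Spark itself; `groupBy` plays the role of the shuffle behind `reduceByKey`):

```scala
// Local word count mirroring the Spark dataflow above.
val lines = Seq("to be or not", "to be")

val counts: Map[String, Int] = lines
  .flatMap(line => line.split(" "))  // split each line into words
  .map(word => (word, 1))            // pair each word with a count of 1
  .groupBy(_._1)                     // local analog of the shuffle in reduceByKey
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// counts == Map(to -> 2, be -> 2, or -> 1, not -> 1)
```

The difference in Spark is that each stage runs partition-by-partition across the cluster, and `reduceByKey` combines counts locally before shuffling.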

Page 9:

Machine Learning in Spark
Algorithms:
- classification: logistic regression, linear SVM, naive Bayes, random forests
- regression: generalized linear models, regression trees
- collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)

Page 10:

K-Means Clustering

```scala
val data = sc.textFile("hdfs://...")
val parsedData = data.map(_.split(" ").map(_.toDouble)).cache()

// Cluster the data into two classes
val clusters = KMeans.train(parsedData, 2, numIterations = 20)

// Compute the sum of squared errors
val cost = clusters.computeCost(parsedData)
```
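The cost that `computeCost` reports is the sum of squared distances from each point to its nearest cluster center. A minimal local sketch of that computation, in plain Scala with made-up points and two fixed (rather than trained) centers:

```scala
// Hypothetical 2-D points and two fixed cluster centers.
val points  = Seq(Array(0.0, 0.0), Array(0.1, 0.0), Array(9.0, 9.0))
val centers = Seq(Array(0.0, 0.0), Array(9.0, 9.0))

// Squared Euclidean distance between two vectors.
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Sum of squared distances to each point's nearest center --
// the quantity the "cost" above measures.
val cost = points.map(p => centers.map(c => squaredDist(p, c)).min).sum
// cost is approximately 0.01 (only the second point is off its center)
```

In Spark the same map-then-sum runs as a distributed reduction over the cached RDD.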

Page 11:

Gradient Descent Example

```scala
val file = sc.textFile("hdfs://...")
val points = file.map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
```
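The slide leaves `parsePoint` undefined. A sketch of what it might look like, assuming a line format of "label f1 f2 ..." with labels in {-1, +1}, plus the same logistic-loss gradient written against a local collection (plain Scala, no Spark):

```scala
// Hypothetical labeled point: feature vector x, label y in {-1, +1}.
case class DataPoint(x: Array[Double], y: Double)

// Assumed input format: "label f1 f2 ...", space-separated.
def parsePoint(line: String): DataPoint = {
  val tokens = line.split(" ").map(_.toDouble)
  DataPoint(tokens.tail, tokens.head)
}

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (u, v) => u * v }.sum

// One logistic-loss gradient over local data, mirroring the
// map/reduce body in the slide's loop.
def gradient(points: Seq[DataPoint], w: Array[Double]): Array[Double] =
  points.map { p =>
    val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x))) - 1.0) * p.y
    p.x.map(_ * scale)           // per-point gradient contribution
  }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
```

With `w` at zero, each point contributes `-0.5 * y * x`, which is what the slide's expression reduces to when the dot product is zero.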

Page 12:

About Tapad
● 350k QPS
● Ingest multiple TBs daily
● Kafka, Scalding, Spark, Zookeeper, Aerospike
● We’re hiring! :)

Page 13:

Thanks!
@ev_ancasey

Questions?