Machine Learning with Apache Spark - HackNY Masters


DESCRIPTION

Introduction to Machine Learning with MLlib and Apache Spark

TRANSCRIPT

Scalable Machine Learning with Apache Spark
Evan Casey
@ev_ancasey

Who am I?
● Engineer at Tapad
● HackNY 2014 Fellow
● Things I work on:
  ○ Scala
  ○ Distributed systems
  ○ Hadoop/Spark

Overview
● Apache Spark
  ○ Dataflow model
  ○ Spark vs Hadoop MapReduce
  ○ Programming with Spark
● Machine Learning with Spark
  ○ MLlib overview
  ○ Gradient descent example
  ○ Distributed implementation on Apache Spark
  ○ Lessons learned

Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
  ○ Interactive analytics
  ○ Graph processing
  ○ Stream processing
  ○ Scalable ML

Why Spark?
● Up to 100x faster than Hadoop MapReduce for in-memory workloads
● Built on top of Akka
● Expressive APIs in Scala, Java, and Python
● Active open-source community

Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs (see the caching sketch below)
● Novel approach to fault tolerance
● Similar programming style to Scalding/Cascading
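
To make the multi-stage point concrete, here is a minimal sketch (not from the slides; the dataset and filter strings are hypothetical). An RDD cached after its first pass stays in executor memory, so later stages reread it from RAM instead of from HDFS, as a second MapReduce job would.

val logs = sc.textFile("hdfs://...").cache()   // materialized in memory on the first action

// Stage 1: triggers the read from HDFS and populates the cache
val errors = logs.filter(_.contains("ERROR")).count()

// Stage 2: reuses the cached partitions, no second pass over HDFS
val warnings = logs.filter(_.contains("WARN")).count()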

Programming Model
● Resilient Distributed Dataset (RDD)
  ○ textFile, parallelize
● Parallel operations
  ○ map, groupBy, filter, join, etc.
● Optimizations
  ○ Caching, shared variables

Wordcount Example

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sc = new SparkContext()
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
// counts.cache
// sc.broadcast(counts)
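
The same RDD API covers the other parallel operations and optimizations listed on the previous slide (filter, join, caching, broadcast variables). A minimal sketch, assuming the counts RDD from the word count above; the stop-word set and the derived RDD names are hypothetical:

// Broadcast a small lookup table to every executor instead of shipping it with each task
val stopWords = sc.broadcast(Set("the", "a", "of"))

// filter: drop stop words; the predicate runs in parallel on each partition
val filtered = counts.filter { case (word, _) => !stopWords.value.contains(word) }

// join: combine two keyed RDDs on their keys
val wordLengths = filtered.keys.map(word => (word, word.length))
val joined = filtered.join(wordLengths)   // RDD[(String, (Int, Int))]

// cache keeps the result in memory for reuse across later actions
joined.cache()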

Machine Learning in Spark
Algorithms:
- classification: logistic regression, linear SVM, naive Bayes, random forests (example below)
- regression: generalized linear models, regression tree
- collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)
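
As an illustration of the classification API (not in the original slides), here is a minimal logistic regression sketch against the Spark 1.x MLlib API. The input path, the whitespace-separated "label feature1 feature2 ..." line format, and the test point are assumptions:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a labeled feature vector and keep the training set in memory
val training = sc.textFile("hdfs://...")
  .map { line =>
    val parts = line.split(' ').map(_.toDouble)
    LabeledPoint(parts.head, Vectors.dense(parts.tail))
  }
  .cache()

// Train with 100 iterations of stochastic gradient descent
val model = LogisticRegressionWithSGD.train(training, 100)

// Predict the class of a new point
val prediction = model.predict(Vectors.dense(0.5, 1.2))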

K-Means Clustering

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs://...")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes with 20 iterations
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Compute the within-cluster sum of squared errors
val cost = clusters.computeCost(parsedData)
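
A possible follow-up (not in the original slides), assuming the Vector-based MLlib API and a hypothetical two-dimensional point: the trained model can also assign new points to clusters.

val clusterId = clusters.predict(Vectors.dense(1.0, 2.0))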

Gradient Descent Example

val file = sc.textFile("hdfs://...")
val points = file.map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= alpha * gradient
}
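
For context (this explanation is not in the original slides): each iteration computes the full batch gradient of the logistic loss, gradient = sum over points p of (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x. The map runs in parallel across partitions, reduce sums the per-point contributions, and w is updated with step size alpha. Because points is cached, every iteration rereads the data from memory rather than from HDFS, which is the in-memory, multi-stage advantage over Hadoop MapReduce highlighted earlier.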

About Tapad
● 350k QPS
● Ingest multiple TBs daily
● Kafka, Scalding, Spark, Zookeeper, Aerospike
● We’re hiring! :)

Thanks!
@ev_ancasey

Questions?
