MLlib: Spark's Machine Learning Library

Page 1: MLlib: Spark's Machine Learning Library

About Me

•  Postdoc in AMPLab
•  Led initial development of MLlib
•  Technical Advisor for Databricks
•  Assistant Professor at UCLA
•  Research interests include scalability and ease-of-use issues in statistical machine learning

Page 2: MLlib: Spark's Machine Learning Library

MLlib: Spark’s Machine Learning Library

Ameet Talwalkar
AMP Camp 5

November 20, 2014

Page 3: MLlib: Spark's Machine Learning Library

History and Overview
Example Applications
Ongoing Development

Page 4: MLlib: Spark's Machine Learning Library

MLbase and MLlib

MLlib: Spark’s core ML library

MLI, Pipelines: APIs to simplify ML development

•  Tables, Matrices, Optimization, ML Pipelines

MLOpt: Declarative layer to automate hyperparameter tuning

[Diagram: the MLbase stack. MLOpt sits atop MLI and Pipelines, which sit atop MLlib, all built on Apache Spark. MLOpt, MLI, and Pipelines are experimental testbeds; MLlib is production code.]

MLbase aims to simplify development and deployment of scalable ML pipelines.

Page 5: MLlib: Spark's Machine Learning Library

History of MLlib

Initial Release

•  Developed by MLbase team in AMPLab (11 contributors)

•  Scala, Java

•  Shipped with Spark v0.8 (Sep 2013)

15 months later…

•  80+ contributors from various organizations

•  Scala, Java, Python

•  Latest release part of Spark v1.1 (Sep 2014)

Page 6: MLlib: Spark's Machine Learning Library

What’s in MLlib?

Collaborative Filtering for Recommendation:
•  Alternating Least Squares

Prediction:
•  Lasso
•  Ridge Regression
•  Logistic Regression
•  Decision Trees
•  Naïve Bayes
•  Support Vector Machines

Clustering:
•  K-Means

Optimization:
•  Gradient descent
•  L-BFGS

Many Utilities:
•  Random data generation
•  Linear algebra
•  Feature transformations
•  Statistics: testing, correlation
•  Evaluation metrics

Page 7: MLlib: Spark's Machine Learning Library

Benefits of MLlib

•  Part of Spark
•  Integrated data analysis workflow
•  Free performance gains

[Diagram: the Spark stack: Spark SQL, Spark Streaming, MLlib, and GraphX built on Apache Spark]

Page 8: MLlib: Spark's Machine Learning Library

Benefits of MLlib

•  Part of Spark
•  Integrated data analysis workflow
•  Free performance gains
•  Scalable
•  Python, Scala, Java APIs
•  Broad coverage of applications & algorithms
•  Rapid improvements in speed & robustness

Page 9: MLlib: Spark's Machine Learning Library

History and Overview

Example Applications:
•  Use Cases
•  Distributed ML Challenges
•  Code Examples

Ongoing Development

Page 10: MLlib: Spark's Machine Learning Library

Clustering with K-Means

Given data points, find meaningful clusters.

Pages 11-14: MLlib: Spark's Machine Learning Library

Clustering with K-Means

K-Means alternates between two steps until convergence: assign each point to its nearest cluster center, then recompute each center from the points assigned to it. (Pages 11-14 show successive iterations of this loop.)

Page 15: MLlib: Spark's Machine Learning Library

Clustering with K-Means

•  Smart initialization
•  Limited communication (# clusters << # instances)
•  Data distributed by instance (point/row)

Page 16: MLlib: Spark's Machine Learning Library

K-Means: Scala

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map { x =>
  Vectors.dense(x.split(' ').map(_.toDouble))
}.cache()

// Cluster data into 5 classes using KMeans.
val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

// Evaluate clustering error.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)
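Not shown on the slide: the fitted model can also assign new points to clusters. A minimal sketch (the sample vector is made up and assumes 3-dimensional data):

// Index of the nearest cluster center for a new point
val newPoint = Vectors.dense(0.5, 0.5, 0.5)
val clusterIndex = clusters.predict(newPoint)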

Page 17: MLlib: Spark's Machine Learning Library

K-Means: Python

from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse data.
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')])).cache()

# Cluster data into 5 classes using KMeans.
clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

# Evaluate clustering error (squared distance to the nearest center).
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sum([x**2 for x in (point - center)])

cost = parsedData.map(lambda point: error(point)) \
                 .reduce(lambda x, y: x + y)
print("Sum of squared errors = " + str(cost))

Page 18: MLlib: Spark's Machine Learning Library

Recommendation

Goal: Recommend movies to users

Page 19: MLlib: Spark's Machine Learning Library

Recommendation

Goal: Recommend movies to users

Challenges:
•  Defining similarity
•  Dimensionality (millions of users / items)
•  Sparsity

Approach: collaborative filtering

Page 20: MLlib: Spark's Machine Learning Library

Recommendation

Solution: Assume ratings are determined by a small number of factors.

[Diagram: the ratings matrix is approximated by the product of a tall user-factor matrix and a wide movie-factor matrix]

25M users, 100K movies → 2.5 trillion ratings
With 10 factors/user → 250M parameters
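A quick sanity check of these counts (the arithmetic, not from the slide):

25,000,000 users × 100,000 movies = 2.5 × 10^12 entries (2.5 trillion)
(25,000,000 + 100,000) rows × 10 factors each ≈ 2.5 × 10^8 parameters (250M)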

Page 21: MLlib: Spark's Machine Learning Library

Recommendation with Alternating Least Squares (ALS)

[Diagram: ratings matrix = user factors × movie factors, with unknown entries marked "?"]

Algorithm: alternating update of user/movie factors

Page 22: MLlib: Spark's Machine Learning Library

Recommendation with Alternating Least Squares (ALS)

[Diagram: ratings matrix = user factors × movie factors]

Algorithm: alternating update of user/movie factors
•  Can update factors in parallel
•  Must be careful about communication

Page 23: MLlib: Spark's Machine Learning Library

Recommendation with Alternating Least Squares (ALS)

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
// (rank = 10 latent factors, 20 iterations, regularization 0.01)
val model = ALS.train(ratings, rank = 10, iterations = 20, lambda = 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
  (user, product)
}
val predictions = model.predict(usersProducts)
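The slide stops at predict. A hedged sketch of how the evaluation might continue, keying both predictions and actual ratings by (user, product) and comparing them:

val predictionsKV = predictions.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictionsKV)

// Mean squared error between actual and predicted ratings
val mse = ratesAndPreds.map { case (_, (actual, predicted)) =>
  val err = actual - predicted
  err * err
}.mean()
println("Mean Squared Error = " + mse)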

Page 24: MLlib: Spark's Machine Learning Library

ALS: Today’s ML Exercise

•  Load 1M/10M ratings from MovieLens •  Specify YOUR ratings on examples •  Split examples into training/validation •  Fit a model (Python or Scala) •  Improve model via parameter tuning •  Get YOUR recommendations
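A minimal sketch of the split-and-tune steps in Scala (the split weights and parameter grid are illustrative, not from the exercise):

// Hold out 20% of the ratings for validation
val Array(training, validation) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)

// Fit models over a small hyperparameter grid
val models = for (rank <- Seq(8, 12); lambda <- Seq(0.01, 0.1))
  yield ALS.train(training, rank, 20, lambda)

// Score each model on `validation` (e.g., by MSE as sketched earlier)
// and keep the best one for generating recommendations.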

Page 25: MLlib: Spark's Machine Learning Library

History and Overview Example Applications Ongoing Development

Performance New APIs

Page 26: MLlib: Spark's Machine Learning Library

Performance

On a dataset with 660M users, 2.4M items, and 3.5B ratings, MLlib runs in 40 minutes on 50 nodes.

[Chart: runtime (minutes, 0 to 50) vs. number of ratings (0M to 800M) for ALS on Amazon Reviews on 16 nodes; MLlib vs. Mahout]

Page 27: MLlib: Spark's Machine Learning Library

Performance

Steady performance gains: ~3X speedups on average from Spark 1.0 to 1.1

[Chart: speedup (Spark 1.0 vs. 1.1) for ALS, Decision Trees, K-Means, Logistic Regression, and Ridge Regression]

Page 28: MLlib: Spark's Machine Learning Library

Algorithms

In Spark 1.2:
•  Random Forests: ensembles of Decision Trees
•  Boosting

Under development:
•  Topic modeling
•  Many others!

Page 29: MLlib: Spark's Machine Learning Library

ML Pipelines

Typical ML workflow:

[Diagram: Training Data → Feature Extraction → Model Training → Model Testing, with Test Data also feeding Model Testing]

Page 30: MLlib: Spark's Machine Learning Library

ML Pipelines

Typical ML workflow is complex:

[Diagram: the same workflow with a second branch: Other Training Data → Feature Extraction → Other Model Training, alongside Training Data → Feature Extraction → Model Training → Model Testing ← Test Data]

Page 31: MLlib: Spark's Machine Learning Library

ML Pipelines

Pipelines in 1.2 (alpha release):

•  Easy workflow construction
•  Standardized interface for model tuning
•  Testing & failing early

Inspired by the MLbase / Pipelines project (see Evan's talk); a collaboration with Databricks. MLbase / MLOpt aims to autotune these pipelines.
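For flavor, a minimal sketch of what the 1.2 alpha pipeline API looks like; `training` (a SchemaRDD with "text" and "label" columns) is assumed, and parameter values are illustrative:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Chain feature extraction and model training into one pipeline
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fitting runs all stages in order and returns a reusable PipelineModel
val model = pipeline.fit(training)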

Page 32: MLlib: Spark's Machine Learning Library

Datasets

Further Integration with SparkSQL

ML pipelines require Datasets

•  Handle many data types (features)

•  Keep metadata about features

•  Select subsets of features for different parts of pipeline

•  Join groups of features

ML Dataset = SchemaRDD

Inspired by MLbase / MLI API
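A hedged illustration of the Dataset-as-SchemaRDD idea (class, table, and column names are made up; API as of Spark 1.2):

import org.apache.spark.sql.SQLContext

// The case class defines the schema: a label plus a text feature
case class LabeledDoc(label: Double, text: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD

val docs = sc.parallelize(Seq(
  LabeledDoc(1.0, "spark is great"),
  LabeledDoc(0.0, "this is a sentence")))
docs.registerTempTable("docs")

// Select a subset of columns (features) for one stage of a pipeline
val textOnly = sqlContext.sql("SELECT text FROM docs")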

Page 33: MLlib: Spark's Machine Learning Library

Resources

MLlib Programming Guide: spark.apache.org/docs/latest/mllib-guide.html

Databricks training info: databricks.com/spark-training

Spark user lists & community: spark.apache.org/community.html

edX MOOC on Scalable Machine Learning: www.edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066

4-day BIDS minicourse in January