Apache® Spark™ MLlib 2.x: Migrating ML Workloads to DataFrames
TRANSCRIPT
Webinar Logistics
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and Apache Spark Committer & PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
About the speaker: Jules S. Damji
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems.
Databricks: Founded by the creators of Apache Spark in 2013.
75% of the Spark code contributed in 2014 came from Databricks.
Created Databricks on top of Spark to make big data simple.
…
Apache Spark Engine
Spark Core
Spark Streaming, Spark SQL, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, & R APIs
Standard libraries
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap during 2.x
A bit of MLlib history
Spark 0.8: RDD-based API. Fast, scale-out ML.
Challenges:
• Expressing complex workflows
• Integrating with DataFrames
• Developing Java, Python & R APIs
Spark 1.2: DataFrame-based API (a.k.a. “Spark ML”)
Major improvements:
• ML Pipelines with automated tuning
• Native DataFrame integration
• Standard API across languages
See Xiangrui Meng’s original design & prototype in SPARK-3530.
MLlib trajectory
[Chart: commits per release from v0.8 through v2.0 (roughly 200 to 1000 commits/release), with markers for the Scala/Java API (the primary API for MLlib), the Python API, the R API, and the DataFrame-based API for MLlib.]
DataFrame-based API for MLlib
DataFrames are the standard ML dataset type.
Uniform APIs for algorithms, hyperparameters, etc.
Pipelines provide utilities for constructing ML workflows + automating hyperparameter tuning.
Learn more about ML Pipelines:
• http://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2
• http://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
DataFrame-based API for MLlib
In 2.0, the DataFrame-based API became the primary MLlib API.
• Voted by community
• org.apache.spark.ml, pyspark.ml
The RDD-based API is in maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib
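For orientation, here is a minimal Scala sketch of where the two APIs live; the estimator class names are illustrative examples, not an exhaustive mapping.

  // DataFrame-based API (primary in 2.x): org.apache.spark.ml / pyspark.ml
  import org.apache.spark.ml.classification.LogisticRegression

  // RDD-based API (maintenance mode): org.apache.spark.mllib / pyspark.mllib
  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS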
Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap during 2.x
Why migrate to DataFrames? (1) DataFrames
DataFrames & Datasets are the new “core” API for Spark:
• Data sources & ETL
• Latest performance improvements (Catalyst & Tungsten)
• Structured Streaming
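As a small illustration of the data-source and ETL side, here is a hedged sketch in Scala; the SparkSession named spark, the file paths, and the column names are all assumed for the example.

  import org.apache.spark.sql.functions._
  import spark.implicits._

  // Hypothetical input path and columns.
  val events = spark.read.json("/data/events.json")

  val cleaned = events
    .filter($"userId".isNotNull)                              // drop rows missing a user ID
    .withColumn("dayOfWeek", date_format($"timestamp", "E"))  // derive a feature column

  cleaned.write.mode("overwrite").parquet("/data/events_cleaned")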
Why migrate to DataFrames? (2) Language APIs
Standardized across Scala, Java, Python, and R:
• Python & R match Scala/Java performance
• Cross-language persistence (saving/loading models)
Why migrate to DataFrames? (3) Pipelines
Specify complex ML workflows:
• Chain together Transformers, Estimators, & Models
• Automated hyperparameter tuning
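As a concrete sketch of such a workflow (not the demo notebook itself), the following Scala snippet chains a Tokenizer, HashingTF, and LogisticRegression into a Pipeline and tunes it with CrossValidator; the training DataFrame and its "text"/"label" columns are assumed.

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  // Assume `training` is a DataFrame with "text" and "label" columns.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)

  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  // Automated hyperparameter tuning over a small grid.
  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1000, 10000))
    .addGrid(lr.regParam, Array(0.01, 0.1))
    .build()

  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  val cvModel = cv.fit(training)   // CrossValidatorModel; cvModel.bestModel is the fitted PipelineModel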
Demo: migration
Convert a notebook from the RDD-based API to the DataFrame-based API.
Key points:
• Work with single models or complex Pipelines
• Incremental migration
• Many benefits: simpler APIs, SQL integration, tuning
• A few gotchas (linear algebra types)
Warning: Demo for experts!
Demo recap: migration process
Separate the 2 migrations:
• Spark 1.6 → 2.0
• RDDs → DataFrames

Migrate ML APIs: spark.mllib → spark.ml
• Gotcha: a few naming changes (from standardizing algorithm APIs)
  • Certain Param and model methods
  • run() → fit()
• Tips:
  • Use explainParams()
  • Compare the API docs if you hit issues!

Migrate data APIs: RDDs → DataFrames
• Tip: Get familiar with conversion syntax in both directions (see the sketch below).
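A minimal before/after sketch in Scala of the renames called out above, assuming a SparkSession named spark and a hypothetical LIBSVM-format file; it is not the demo notebook itself.

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.util.MLUtils
  import org.apache.spark.ml.classification.LogisticRegression

  // Before: RDD-based API (spark.mllib), training via run()
  val labeledRDD = MLUtils.loadLibSVMFile(spark.sparkContext, "/data/sample_libsvm_data.txt")
  val oldModel = new LogisticRegressionWithLBFGS().run(labeledRDD)

  // After: DataFrame-based API (spark.ml), training via fit()
  // The "libsvm" data source yields a DataFrame with "label" and "features" columns.
  val trainingDF = spark.read.format("libsvm").load("/data/sample_libsvm_data.txt")
  val newModel = new LogisticRegression().fit(trainingDF)

  // Conversion syntax in both directions:
  import spark.implicits._
  val asRDD = trainingDF.rdd                        // DataFrame -> RDD[Row]
  val asDF = spark.sparkContext
    .parallelize(Seq((0.0, "a"), (1.0, "b")))
    .toDF("label", "text")                          // RDD of tuples -> DataFrame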
Demo recap: migration process
Debugging runtime errors
• Gotcha: Lazy evaluation in Pipelines means bugs appear later than expected.
  • Tip: Check intermediate results.
• Gotcha: Vector (and Matrix) types differ between spark.mllib and spark.ml.
  • Relevant for the Spark 1.6 → 2.0 migration
  • Tip: Watch for buried errors: MatchError and mentions of “vector”
  • Tip: Use helper methods for conversion (see the sketch below):
    • org.apache.spark.mllib.linalg.Vector.asML
    • org.apache.spark.mllib.linalg.Vectors.fromML
    • http://spark.apache.org/docs/latest/ml-guide.html#migration-guide
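A minimal sketch of those conversion helpers in Scala; the vector values are illustrative.

  import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
  import org.apache.spark.ml.linalg.{Vector => NewVector}

  val oldVec: OldVector = OldVectors.dense(1.0, 2.0, 3.0)   // spark.mllib type
  val newVec: NewVector = oldVec.asML                       // spark.mllib -> spark.ml
  val backAgain: OldVector = OldVectors.fromML(newVec)      // spark.ml -> spark.mllib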
Future benefits of migration
Currently: ML training is implemented on RDDs.
Goal: Port the implementation to DataFrames and benefit from DataFrame optimizations (Catalyst, Tungsten).
[Diagram: MLlib and Spark SQL (SQL, DataFrames, Datasets) sit on top of Spark Core (RDDs).]
Future benefits of migration
Status: First published implementation is in GraphFrames (a Spark package for graph processing).
Ongoing work: DataFrame improvements for iterative algorithms: checkpointing, improved caching, and more.
[Diagram: MLlib and Spark SQL (SQL, DataFrames, Datasets) sit on top of Spark Core (RDDs).]
Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap during 2.x
Why ML persistence?
Data Science: Prototype in Python/R and create the model.
Software Engineering: Re-implement the model for production (Java), then deploy it.
Why ML persistence?
Data Science: Prototype in Python/R and create a Pipeline:
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
Software Engineering: Re-implement the Pipeline for production (Java), then deploy it.
Problems with this hand-off:
• Extra implementation work
• Different code paths
• Synchronization overhead
With ML persistence...
Data Science: Prototype in Python/R, create a Pipeline, and persist the model or Pipeline: model.save(“s3n://...”)
Software Engineering: Load the Pipeline in Scala/Java with Model.load(“s3n://…”), then deploy it in production.
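A hedged sketch of the save/load round trip in Scala; the S3 path and training file are hypothetical, and the same pattern applies to a full PipelineModel.

  import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

  // Assume a SparkSession `spark` and a hypothetical LIBSVM-format training file.
  val training = spark.read.format("libsvm").load("/data/sample_libsvm_data.txt")
  val model = new LogisticRegression().fit(training)

  // Persist the fitted model (path is hypothetical).
  model.write.overwrite().save("s3n://my-bucket/models/lr-model")

  // Load it back, e.g. from a Scala/Java production job, and score new data.
  val reloaded = LogisticRegressionModel.load("s3n://my-bucket/models/lr-model")
  val scored = reloaded.transform(training)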
ML persistence status
[Diagram: an example workflow with text preprocessing, feature generation, a random forest, and model tuning; the unfitted Pipeline is the “recipe”, and the fitted Model is the “result”.]
ML persistence status
Near-complete coverage in all Spark language APIs:
• Scala & Java: complete
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format:
• JSON for metadata
• Parquet for model data (coefficients, etc.)
Demo: ML persistence
• Can persist single models & complex workflows
• Easy to move models across Spark deployments
• Share models across teams & languages
ML persistence: pending issues
Python tuning: not yet implemented
• CrossValidator, TrainValidationSplit
R format: incompatible with Python/Java/Scala
• Issue: R wrappers are all special Pipelines.
• Working towards a fix
• Workaround: Load the underlying PipelineModel from a subfolder in the saved model directory.
Backwards compatibility: WIP in SPARK-15573
ML persistence blog post: http://databricks.com/blog/2016/05/31
Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap during 2.x
Goals for MLlib in 2.x
Major initiatives:
• ML persistence: saving & loading models & Pipelines
• Complete feature parity for the DataFrame-based API. Missing items:
  • Frequent Pattern Mining
  • Certain methods in models
  • Developer APIs
For an overview of MLlib in 2.0, see http://spark-summit.org/2016/events/apache-spark-mllib-20-preview-data-science-and-production
Other important improvements
• Generalized Linear Models
• Python & R API parity
• Speed & scalability improvements

Coming in 2.1
• Multiclass logistic regression (SPARK-7159)
• Locality sensitive hashing (SPARK-5992)
• More ML in SparkR (SPARK-16442): ALS, Isotonic Regression, Multilayer Perceptron Classifier, Random Forest, Gaussian Mixture Model, LDA, Multiclass Logistic Regression, Gradient Boosted Trees
• Various speed & scalability improvements: Random Forest, Naive Bayes, LDA, Gaussian Mixture, and others

Spark 2.1 status: Release candidates are under QA.
For the release schedule, see http://spark.apache.org/versioning-policy.html
Get started
Get involved in the community:
• Events & news: https://sparkhub.databricks.com/
• User mailing list: http://spark.apache.org/community.html
Get involved in development:
• Dev mailing list: http://spark.apache.org/community.html
• JIRA: http://issues.apache.org/jira/browse/SPARK
• Contribute: http://spark.apache.org/contributing.html
Try out Apache Spark for free on Databricks Community Edition! http://databricks.com/try
Many thanks to the Apache Spark community!
https://spark-summit.org/east-2017/
Thank you! Twitter: @jkbatcmu