Apache® Spark™ MLlib 2.x: Migrating ML Workloads to DataFrames
TRANSCRIPT
Webinar Logistics
About the speaker: Joseph Bradley
Joseph Bradley is a Software Engineer and Apache Spark Committer & PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
About the speaker: Jules S. Damji
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems.
Databricks: Founded by the creators of Apache Spark in 2013.
75% of the Spark code contributed in 2014 came from Databricks.
Created Databricks on top of Spark to make big data simple.
…
Apache Spark Engine
Spark Core
Spark Streaming, Spark SQL, MLlib, GraphX
Unified engine across diverse workloads & environments
Scale out, fault tolerant
Python, Java, Scala, & R APIs
Standard libraries
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap during 2.x
A bit of MLlib history
Spark 0.8: RDD-based API. Fast, scale-out ML.
Challenges:
• Expressing complex workflows
• Integrating with DataFrames
• Developing Java, Python & R APIs
Spark 1.2: DataFrame-based API (a.k.a. “Spark ML”)
Major improvements:
• ML Pipelines with automated tuning
• Native DataFrame integration
• Standard API across languages
See Xiangrui Meng’s original design & prototype in SPARK-3530.
MLlib trajectory
[Chart: commits per release from v0.8 through v2.0 (roughly 200 to 1000 commits/release), with markers for the Scala/Java API (the primary API for MLlib), the Python API, the R API, and the DataFrame-based API for MLlib.]
DataFrame-based API for MLlib
DataFrames are the standard ML dataset type.
Uniform APIs for algorithms, hyperparameters, etc.
Pipelines provide utilities for constructing ML workflows + automating hyperparameter tuning.
Learn more about ML Pipelines:
• http://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2
• http://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
DataFrame-based API for MLlib
In 2.0, the DataFrame-based API became the primary MLlib API.
• Voted by community
• org.apache.spark.ml, pyspark.ml
The RDD-based API is in maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib
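For orientation, here is a minimal Scala sketch of where the two APIs live; the estimator class names are illustrative examples, not an exhaustive mapping.

  // DataFrame-based API (primary in 2.x): org.apache.spark.ml / pyspark.ml
  import org.apache.spark.ml.classification.LogisticRegression

  // RDD-based API (maintenance mode): org.apache.spark.mllib / pyspark.mllib
  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS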
Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap during 2.x
Why migrate to DataFrames? (1) DataFrames
DataFrames & Datasets are the new “core” API for Spark:
• Data sources & ETL
• Latest performance improvements (Catalyst & Tungsten)
• Structured Streaming
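As a small illustration of the data-source and ETL side, here is a hedged sketch in Scala; the SparkSession named spark, the file paths, and the column names are all assumed for the example.

  import org.apache.spark.sql.functions._
  import spark.implicits._

  // Hypothetical input path and columns.
  val events = spark.read.json("/data/events.json")

  val cleaned = events
    .filter($"userId".isNotNull)                              // drop rows missing a user ID
    .withColumn("dayOfWeek", date_format($"timestamp", "E"))  // derive a feature column

  cleaned.write.mode("overwrite").parquet("/data/events_cleaned")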
Why migrate to DataFrames? (2) Language APIs
Standardized across Scala, Java, Python, and R:
• Python & R match Scala/Java performance
• Cross-language persistence (saving/loading models)
Why migrate to DataFrames? (3) Pipelines
Specify complex ML workflows:
• Chain together Transformers, Estimators, & Models
• Automated hyperparameter tuning
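As a concrete sketch of such a workflow (not the demo notebook itself), the following Scala snippet chains a Tokenizer, HashingTF, and LogisticRegression into a Pipeline and tunes it with CrossValidator; the training DataFrame and its "text"/"label" columns are assumed.

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  // Assume `training` is a DataFrame with "text" and "label" columns.
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)

  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

  // Automated hyperparameter tuning over a small grid.
  val paramGrid = new ParamGridBuilder()
    .addGrid(hashingTF.numFeatures, Array(1000, 10000))
    .addGrid(lr.regParam, Array(0.01, 0.1))
    .build()

  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

  val cvModel = cv.fit(training)   // CrossValidatorModel; cvModel.bestModel is the fitted PipelineModel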
Demo: migration
Convert a notebook from the RDD-based API to the DataFrame-based API.
Key points:
• Work with single models or complex Pipelines
• Incremental migration
• Many benefits: simpler APIs, SQL integration, tuning
• A few gotchas (linear algebra types)
Warning: Demo for experts!
Demo recap: migration process
Separate the 2 migrations:
• Spark 1.6 → 2.0
• RDDs → DataFrames

Migrate ML APIs: spark.mllib → spark.ml
• Gotcha: a few naming changes (from standardizing algorithm APIs)
  • Certain Param and model methods
  • run() → fit()
• Tips:
  • Use explainParams()
  • Compare the API docs if you hit issues!

Migrate data APIs: RDDs → DataFrames
• Tip: Get familiar with conversion syntax in both directions (see the sketch below).
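A minimal before/after sketch in Scala of the renames called out above, assuming a SparkSession named spark and a hypothetical LIBSVM-format file; it is not the demo notebook itself.

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.util.MLUtils
  import org.apache.spark.ml.classification.LogisticRegression

  // Before: RDD-based API (spark.mllib), training via run()
  val labeledRDD = MLUtils.loadLibSVMFile(spark.sparkContext, "/data/sample_libsvm_data.txt")
  val oldModel = new LogisticRegressionWithLBFGS().run(labeledRDD)

  // After: DataFrame-based API (spark.ml), training via fit()
  // The "libsvm" data source yields a DataFrame with "label" and "features" columns.
  val trainingDF = spark.read.format("libsvm").load("/data/sample_libsvm_data.txt")
  val newModel = new LogisticRegression().fit(trainingDF)

  // Conversion syntax in both directions:
  import spark.implicits._
  val asRDD = trainingDF.rdd                        // DataFrame -> RDD[Row]
  val asDF = spark.sparkContext
    .parallelize(Seq((0.0, "a"), (1.0, "b")))
    .toDF("label", "text")                          // RDD of tuples -> DataFrame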
Demo recap: migration process
Debugging runtime errors
• Gotcha: Lazy evaluation in Pipelines means bugs appear later than expected.
  • Tip: Check intermediate results.
• Gotcha: Vector (and Matrix) types differ between spark.mllib and spark.ml.
  • Relevant for the Spark 1.6 → 2.0 migration
  • Tip: Watch for buried errors: MatchError and mentions of “vector”
  • Tip: Use helper methods for conversion (see the sketch below):
    • org.apache.spark.mllib.linalg.Vector.asML
    • org.apache.spark.mllib.linalg.Vectors.fromML
    • http://spark.apache.org/docs/latest/ml-guide.html#migration-guide
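A minimal sketch of those conversion helpers in Scala; the vector values are illustrative.

  import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
  import org.apache.spark.ml.linalg.{Vector => NewVector}

  val oldVec: OldVector = OldVectors.dense(1.0, 2.0, 3.0)   // spark.mllib type
  val newVec: NewVector = oldVec.asML                       // spark.mllib -> spark.ml
  val backAgain: OldVector = OldVectors.fromML(newVec)      // spark.ml -> spark.mllib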
Future benefits of migration
Currently: ML training is implemented on RDDs.
Goal: Port the implementation to DataFrames and benefit from DataFrame optimizations (Catalyst, Tungsten).
[Diagram: MLlib and Spark SQL (SQL, DataFrames, Datasets) sit on top of Spark Core (RDDs).]
Future benefits of migration
Status: First published implementation is in GraphFrames (a Spark package for graph processing).
Ongoing work: DataFrame improvements for iterative algorithms: checkpointing, improved caching, and more.
[Diagram: MLlib and Spark SQL (SQL, DataFrames, Datasets) sit on top of Spark Core (RDDs).]
Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap during 2.x
Why ML persistence?
Data Science: Prototype in Python/R and create the model.
Software Engineering: Re-implement the model for production (Java), then deploy it.
Why ML persistence?
Data Science: Prototype in Python/R and create a Pipeline:
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
Software Engineering: Re-implement the Pipeline for production (Java), then deploy it.
Problems with this hand-off:
• Extra implementation work
• Different code paths
• Synchronization overhead
With ML persistence...
Data Science: Prototype in Python/R, create a Pipeline, and persist the model or Pipeline: model.save(“s3n://...”)
Software Engineering: Load the Pipeline in Scala/Java with Model.load(“s3n://…”), then deploy it in production.
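A hedged sketch of the save/load round trip in Scala; the S3 path and training file are hypothetical, and the same pattern applies to a full PipelineModel.

  import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}

  // Assume a SparkSession `spark` and a hypothetical LIBSVM-format training file.
  val training = spark.read.format("libsvm").load("/data/sample_libsvm_data.txt")
  val model = new LogisticRegression().fit(training)

  // Persist the fitted model (path is hypothetical).
  model.write.overwrite().save("s3n://my-bucket/models/lr-model")

  // Load it back, e.g. from a Scala/Java production job, and score new data.
  val reloaded = LogisticRegressionModel.load("s3n://my-bucket/models/lr-model")
  val scored = reloaded.transform(training)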
ML persistence status
[Diagram: an example workflow with text preprocessing, feature generation, a random forest, and model tuning; the unfitted Pipeline is the “recipe”, and the fitted Model is the “result”.]
ML persistence status
Near-complete coverage in all Spark language APIs:
• Scala & Java: complete
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format:
• JSON for metadata
• Parquet for model data (coefficients, etc.)
Demo: ML persistence
• Can persist single models & complex workflows
• Easy to move models across Spark deployments
• Share models across teams & languages
ML persistence: pending issues
Python tuning: not yet implemented
• CrossValidator, TrainValidationSplit
R format: incompatible with Python/Java/Scala
• Issue: R wrappers are all special Pipelines.
• Working towards a fix
• Workaround: Load the underlying PipelineModel from a subfolder in the saved model directory.
Backwards compatibility: WIP in SPARK-15573
ML persistence blog post: http://databricks.com/blog/2016/05/31
Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap during 2.x
Goals for MLlib in 2.x
Major initiatives:
• ML persistence: saving & loading models & Pipelines
• Complete feature parity for the DataFrame-based API. Missing items:
  • Frequent Pattern Mining
  • Certain methods in models
  • Developer APIs
For an overview of MLlib in 2.0, see http://spark-summit.org/2016/events/apache-spark-mllib-20-preview-data-science-and-production
Other important improvements
• Generalized Linear Models
• Python & R API parity
• Speed & scalability improvements

Coming in 2.1
• Multiclass logistic regression (SPARK-7159)
• Locality sensitive hashing (SPARK-5992)
• More ML in SparkR (SPARK-16442): ALS, Isotonic Regression, Multilayer Perceptron Classifier, Random Forest, Gaussian Mixture Model, LDA, Multiclass Logistic Regression, Gradient Boosted Trees
• Various speed & scalability improvements: Random Forest, Naive Bayes, LDA, Gaussian Mixture, and others

Spark 2.1 status: Release candidates are under QA.
For the release schedule, see http://spark.apache.org/versioning-policy.html
Get started
Get involved in the community:
• Events & news: https://sparkhub.databricks.com/
• User mailing list: http://spark.apache.org/community.html
Get involved in development:
• Dev mailing list: http://spark.apache.org/community.html
• JIRA: http://issues.apache.org/jira/browse/SPARK
• Contribute: http://spark.apache.org/contributing.html
Try out Apache Spark for free on Databricks Community Edition! http://databricks.com/try
Many thanks to the Apache Spark community!
https://spark-summit.org/east-2017/
Thank you! Twitter: @jkbatcmu