Spark + H2O = Machine Learning at scale
Mateusz Dymczyk Software Engineer
Machine Learning with Spark Tokyo 30.06.2016
Agenda
• Spark introduction
• H2O introduction
• Spark + H2O = Sparkling Water
• Demos
Spark
What is Spark?
• Fast and general engine for large-scale data processing
• APIs in Java, Scala, Python and R
• Batch and streaming APIs
• Based on immutable data structures
*http://spark.apache.org/
Architecture
*http://spark.apache.org/docs/latest/cluster-overview.html
Why Spark?
• In-memory computation (fast)
• Ability to cache (intermediate) results in memory (or on disk)
• Easy API
• Plenty of out-of-the-box libraries
*http://spark.apache.org/docs/latest/mllib-guide.html
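The caching point above can be illustrated with a minimal sketch. It assumes Spark is on the classpath and runs in local mode; the object name `CacheSketch` and the toy data are hypothetical and not part of the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object, not from the original slides.
object CacheSketch {
  // Returns (sum of squares, number of even squares) for 1..n.
  def run(sc: SparkContext, n: Int): (Double, Long) = {
    // cache() only marks the RDD for caching; nothing is computed yet.
    val squares = sc.parallelize(1 to n).map(x => x.toLong * x).cache()
    // First action: computes the squares and keeps the partitions in memory.
    val total = squares.sum()
    // Second action: can reuse the cached partitions instead of recomputing the map.
    val evens = squares.filter(_ % 2 == 0).count()
    (total, evens)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc, 1000)) finally sc.stop()
  }
}
```

Without the `cache()` call both actions would still return the same values; the only difference is that the `map` step would be recomputed for the second action.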
MLlib
• Spark’s machine learning library
• Supports:
  • basic statistics
  • classification and regression
  • clustering
  • dimensionality reduction
  • evaluations
  • …
*http://spark.apache.org/docs/latest/mllib-guide.html
Linear regression demo
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Input data format:
// V1,V2,V3,R
// 1,1,1,0.1
// 1,0,1,0.5

val sc: SparkContext = initContext()
val data = sc.textFile(...)
val parsedData: RDD[LabeledPoint] = data.map { line =>
  // parsing
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction: Double = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()

*http://spark.apache.org/docs/latest/mllib-linear-methods.html
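The demo leaves the parsing step out. As a rough sketch of what it might look like, the plain-Scala example below parses one row of the `V1,V2,V3,R` format and computes the same mean squared error as the demo's last line. It assumes the last column `R` is the response; the names `ParsingSketch`, `Example`, `parseLine` and `mse` are hypothetical and not from the talk. In the Spark version each parsed row would become a `LabeledPoint(label, Vectors.dense(features))`.

```scala
// Plain Scala, no Spark required. All names here are illustrative assumptions.
object ParsingSketch {
  final case class Example(label: Double, features: Vector[Double])

  // Parse one "V1,V2,V3,R" row; the last column is assumed to be the response.
  def parseLine(line: String): Example = {
    val cols = line.split(',').map(_.trim.toDouble)
    Example(label = cols.last, features = cols.init.toVector)
  }

  // Mean squared error over (actual, predicted) pairs.
  def mse(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (v, p) => math.pow(v - p, 2) }.sum / pairs.size

  def main(args: Array[String]): Unit = {
    val lines = Seq("V1,V2,V3,R", "1,1,1,0.1", "1,0,1,0.5")
    val examples = lines.drop(1).map(parseLine) // skip the header row
    // Stand-in "model" that always predicts the mean label.
    val meanLabel = examples.map(_.label).sum / examples.size
    val baseline = mse(examples.map(e => (e.label, meanLabel)))
    println(s"parsed=$examples, baseline MSE=$baseline")
  }
}
```

The constant predictor is only there to give `mse` something to evaluate; a trained model's `predict` would take its place.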
But…
• Are the implementations fast enough?
• Are the implementations accurate enough?
• What about other algorithms (e.g. where’s my Deep Learning?!)
• What about visualisations?
*http://spark.apache.org/docs/latest/mllib-guide.html
H2O
What is H2O?
Math platform
• Open source
• Set of math and predictive algorithms
  • GLM, Random Forest, GBM, Deep Learning etc.
API
• Written in high-performance Java, with a native Java API
• Drivers for R, Python, Excel, Tableau
• REST API
Big data focused
• Highly parallel and distributed implementation
• Fast in-memory computation on highly compressed data
• Allows you to use all your data without sampling
• Based on mutable data structures
FlowUI
• Notebook-style open source interface for H2O
• Allows you to combine code execution, text, mathematics, plots, and rich media in a single document
Why H2O?
• Speed and accuracy
• Algorithms/functionality not present in MLlib
• Access to FlowUI
• Possibility to generate dependency-free (Java) models
• Option to checkpoint models (though not all) and continue learning in the future
Sparkling Water
What is Sparkling Water?
• Framework integrating Spark and H2O
• Transparent use of H2O data structures and algorithms with Spark API and vice versa
Common use-cases
Modelling
[Diagram: Data Source, ETL, Modelling, Predictions. Modelling covers Deep Learning, GBM, DRF, GLM, PCA, Ensembles etc.]
ETL
[Diagram: Data Source, ETL, Modelling, Predictions]
Stream Processing
[Diagram: Data Source, ETL, Modelling, Predictions, plus a Data Stream handled by Spark Streaming/Storm/Flink etc.]
Demo #1: Sparkling Shell
REQUIREMENTS
• Windows/Linux/MacOS
• Java 1.7+
• Spark 1.3+
• SPARK_HOME set
INSTALLATION
1. download from http://www.h2o.ai/download
2. set the MASTER environment variable
3. unzip
4. run bin/sparkling-shell
DEV FLOW
1. create a script file containing application code
2. run with bin/sparkling-shell -i script_name.script.scala
OR
1. run bin/sparkling-shell and simply use the REPL
import org.apache.spark.h2o._
// sc - SparkContext already provided by the shell
val h2oContext = new H2OContext(sc).start()
import h2oContext._
// Application logic
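As a hedged sketch of what the application logic could start with, the snippet below round-trips a small Spark DataFrame through an H2O frame inside the Sparkling Shell. It assumes a Sparkling Water 1.6.x shell where `sc` and `sqlContext` are already provided, and it relies on `asH2OFrame` and `asDataFrame` as they exist on `H2OContext` in that release line; signatures differ between versions, so check the documentation for yours. The toy rows and column names are invented for illustration.

```scala
import org.apache.spark.h2o._

// sc and sqlContext are assumed to be provided by the shell.
val h2oContext = new H2OContext(sc).start()
import sqlContext.implicits._

// Toy data, invented for this sketch.
val df = sc.parallelize(Seq(
  (1.0, 1.0, 1.0, 0.1),
  (1.0, 0.0, 1.0, 0.5)
)).toDF("V1", "V2", "V3", "R")

// Spark DataFrame -> H2O frame, usable by H2O algorithms and visible in FlowUI.
val h2oFrame = h2oContext.asH2OFrame(df)
println(s"H2O frame rows: ${h2oFrame.numRows()}")

// H2O frame -> Spark DataFrame, usable again with the Spark API.
val roundTrip = h2oContext.asDataFrame(h2oFrame)(sqlContext)
roundTrip.show()
```

This is the "vice versa" part of Sparkling Water in practice: ETL stays in Spark, modelling happens on the H2O side, and results can come back as a DataFrame.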
Airline delay classification
ETL
• load data from CSVs
• use Spark APIs to filter and join data
Modelling
• Model using H2O’s GBM
Predictions
• Model predicting flight delays
*https://github.com/h2oai/sparkling-water/tree/master/examples/scripts
Gradient Boosting Machines
• Classification and regression predictive modelling
• Ensemble of multiple weak models (usually decision trees)
• Iteratively fits each new model to the residuals of the current ensemble (gradient boosting)
• Stochastic
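The residual-fitting idea can be shown with a toy example. This is a minimal plain-Scala sketch of gradient boosting for squared-error regression on a single feature, using decision stumps as the weak models; it is not H2O's GBM implementation, it leaves out the stochastic (sampling) part, and every name in it is hypothetical. For squared error, the residuals are the negative gradient of the loss, which is why fitting them amounts to a gradient step.

```scala
// Toy gradient boosting with decision stumps. Illustrative only, not H2O's GBM.
object GbmSketch {
  // Weak model: predict `left` when x <= threshold, otherwise `right`.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(xs: Seq[Double]): Double =
    if (xs.isEmpty) 0.0 else xs.sum / xs.size

  // Try every observed x as a threshold; keep the stump with the lowest squared error.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val data = xs.zip(targets)
    val stumps = xs.distinct.map { t =>
      val (l, r) = data.partition(_._1 <= t)
      Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
    }
    stumps.minBy(s => data.map { case (x, y) => math.pow(y - s.predict(x), 2) }.sum)
  }

  // Ensemble: constant base prediction plus shrunken stump corrections.
  final case class Model(base: Double, learningRate: Double, stumps: Vector[Stump]) {
    def predict(x: Double): Double =
      base + learningRate * stumps.map(_.predict(x)).sum
  }

  def fit(xs: Seq[Double], ys: Seq[Double], rounds: Int, learningRate: Double): Model = {
    val start = Model(mean(ys), learningRate, Vector.empty)
    (1 to rounds).foldLeft(start) { (model, _) =>
      // Residuals of the current ensemble become the targets of the next stump.
      val residuals = xs.zip(ys).map { case (x, y) => y - model.predict(x) }
      model.copy(stumps = model.stumps :+ fitStump(xs, residuals))
    }
  }

  def mse(model: Model, xs: Seq[Double], ys: Seq[Double]): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - model.predict(x), 2) })

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    val ys = xs.map(x => x * x)
    val before = mse(fit(xs, ys, rounds = 0, learningRate = 0.5), xs, ys)
    val after = mse(fit(xs, ys, rounds = 50, learningRate = 0.5), xs, ys)
    println(f"training MSE with 0 rounds: $before%.3f, with 50 rounds: $after%.3f")
  }
}
```

The learning rate shrinks each stump's contribution, so more rounds are needed but each step is less likely to overfit; production implementations add tree depth, row and column sampling, and other loss functions on top of this loop.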
Demo #2 FlowUI
Demo #3 Standalone app
REQUIREMENTS
• git
• editor of choice (IntelliJ/Eclipse support)
BOOTSTRAP
1. git clone https://github.com/h2oai/h2o-droplets.git
2. cd h2o-droplets/sparkling-water-droplet
3. if using IntelliJ or Eclipse:
   – ./gradlew idea
   – ./gradlew eclipse
   – import project in the IDE
4. develop your app
DEPLOYMENT
1. build: ./gradlew build shadowJar
2. submit with:
$SPARK_HOME/bin/spark-submit \
  --class water.droplets.SWTokyoDemo \
  --master local[*] \
  --packages ai.h2o:sparkling-water-core_2.10:1.6.5 \
  build/libs/sparkling-water-droplet-app.jar
Open Source
• Github:
https://github.com/h2oai/sparkling-water
• JIRA:
http://jira.h2o.ai
• Google groups:
https://groups.google.com/forum/?hl=en#!forum/h2ostream
More Info
• Documentation and booklets:
http://www.h2o.ai/docs/
• H2O.ai blog:
http://h2o.ai/blog
• H2O.ai YouTube channel:
https://www.youtube.com/user/0xdata
@h2oai
http://www.h2o.ai
Q&A