spark + h20 = machine learning at scale

Spark + H2O = Machine Learning at scale

Mateusz Dymczyk Software Engineer

Machine Learning with Spark Tokyo 30.06.2016

Agenda

• Spark introduction • H2O introduction • Spark + H2O = Sparkling Water • Demos

What is Spark?

• Fast and general engine for large-scale data processing. • API in Java, Scala, Python and R • Batch and streaming APIs • Based on immutable data structure

*http://spark.apache.org/

Architecture

*http://spark.apache.org/docs/latest/cluster-overview.html

http://spark.apache.org/docs/latest/cluster-overview.html

Why Spark?

• In-memory computation (fast) • Ability to cache (intermediate) results in memory (or on

disk) • Easy API • Plenty of out-of-the box libraries

*http://spark.apache.org/docs/latest/mllib-guide.html

http://spark.apache.org/docs/latest/mllib-guide.html

MLlib

• Spark’s machine learning library • Supports: • basic statistics • classification and regression • clustering • dimensionality reduction • evaluations • … *http://spark.apache.org/docs/latest/mllib-guide.html


Linear regression demo// imports //V1,V2,V3,R //1,1,1,0.1 //1,0,1,0.5

val sc: SparkContext = initContext() val data = sc.textFile(...) val parsedData: RDD[LabeledPoint] = data.map { line => // parsing }.cache()

// Building the model val numIterations = 100 val stepSize = 0.00000001 val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error val valuesAndPreds = parsedData.map { point => val prediction: Double = model.predict(point.features) (point.label, prediction) }

val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()

*http://spark.apache.org/docs/latest/mllib-linear-methods.html

http://spark.apache.org/docs/latest/mllib-linear-methods.html

But…

• Are the implementations fast enough? • Are the implementations accurate enough? • What about other algorithms (i.e. where’s my

DeepLearning!)? • What about visualisations?

*http://spark.apache.org/docs/latest/mllib-guide.html


Math platform

What is H2O?

• Open source • Set of math and predictive algorithms

• GLM, Random Forest, GBM, Deep Learning etc.

• Written in high performance Java - native Java API • Drivers for R, Python, Excel, Tableau • REST API

Math platform

API

What is H2O?



• Written in high performance Java - native Java API • Drivers for R, Python, Excel, Tableau • REST API

• Highly paralleled and distributed implementation • Fast in-memory computation on highly compressed data • Allows you to use all your data without sampling • Based on mutable data structures

Math platform

API

Big data focused

What is H2O?



FlowUI

• Notebook style open source interface for H2O

• Allows you to combine code execution, text, mathematics, plots, and rich media in a single document

Why H2O?

• Speed and accuracy

• Algorithms/functionality not present in MLlib

• Access to FlowUI

• Possibility to generate dependency free (Java) models

• Option to checkpoint models (though not all) and continue learning in the future

Sparkl ing Water

What is Sparkl ing Water?

• Framework integrating Spark and H2O • Transparent use of H2O data structures and algorithms

with Spark API and vice versa

Common use-cases

Modeling

ETL

Data Source

Modell ing Predict ions

Deep learning, GBM, DRF, GLM, PCA, Ensembles

etc.

ETL

ETL

Data Source

Modell ing Predict ions

Stream Processing

ETL

Data Source

Modell ing

Predict ions

Data Stream

Spark Streaming/ Storm/Flink etc.

Demo #1 Sparkl ing Shell

REQUIREMENTS • Windows/Linux/MacOS • Java 1.7+ • Spark 1.3+ • SPARK_HOME set

INSTALLATION 1. http://www.h2o.ai/download 2. set MASTER env 3. unzip 4. run bin/sparkling-shell

DEV FLOW 1. create a script file containing application code

2. run with bin/sparkling-shell -i script_name.script.scala OR

1. run bin/sparkling-shell and simply use the REPL

import org.apache.spark.h2o._

// sc - SparkContext already provided by the shell

val h2oContext = new H2OContext(sc).start() import h2oContext._

// Application logic

Air l ine delay classif ication

Model predicting flight

delays

ETL Modell ing Predict ions

• load data from CSVs

• use Spark APIs to filter and join data

Model using H2O’s GBM

*https://github.com/h2oai/sparkling-water/tree/master/examples/scripts

https://github.com/h2oai/sparkling-water/tree/master/examples/scripts

Gradient Boosting Machines

• Classification and regression predictive modelling • Ensemble of multiple weak models (usually decision trees) • Iteratively solves residuals (gradient boosted) • Stochastic

Demo #2 FlowUI

Demo #3 Standalone app

REQUIREMENTS

• git • editor of choice (IntelliJ/eclipse support)

BOOTSTRAP

1. git clone https://github.com/h2oai/h2o-droplets.git 2. cd h2o-droplets/sparkling-water-droplet 3. if using IntelliJ or Eclipse: – ./gradlew idea – ./gradlew eclipse – import project in the IDE

4. develop your app

https://github.com/h2oai/h2o-droplets.git

DEPLOYMENT

1. build ./gradlew build shadowJar 2. submit with:

$SPARK_HOME/bin/spark-submit \ --class water.droplets.SWTokyoDemo \ --master local[*] \ --packages ai.h2o:sparkling-water-core_2.10:1.6.5 \ build/libs/sparkling-water-droplet-app.jar

Open Source

• Github:

https://github.com/h2oai/sparkling-water

• JIRA:

http://jira.h2o.ai

• Google groups:

https://groups.google.com/forum/?hl=en#!forum/h2ostream

https://github.com/h2oai/sparkling-water

https://groups.google.com/forum/?hl=en#!forum/h2ostream

More Info

• Documentation and booklets:

http://www.h2o.ai/docs/

• H2O.ai blog:

http://h2o.ai/blog

• H2O.ai YouTube channel:

https://www.youtube.com/user/0xdata

@h2oai

http://www.h2o.ai

http://h2o.ai/blog

https://www.youtube.com/user/0xdata

http://www.h2o.ai

Thank you!

@mdymczyk

Mateusz Dymczyk

[email protected]

mailto:[email protected]?subject=

spark + h20 = machine learning at scale

Data & Analytics