Spark + H2O = Machine Learning at scale
Mateusz Dymczyk, Software Engineer
Machine Learning with Spark Tokyo, 30.06.2016


TRANSCRIPT

Page 1: Spark + H2O = Machine Learning at scale

Spark + H2O = Machine Learning at scale

Mateusz Dymczyk Software Engineer

Machine Learning with Spark Tokyo 30.06.2016

Page 2

Agenda

• Spark introduction
• H2O introduction
• Spark + H2O = Sparkling Water
• Demos

Page 3

Spark

Page 4

What is Spark?

• Fast and general engine for large-scale data processing
• APIs in Java, Scala, Python and R
• Batch and streaming APIs
• Based on immutable data structures

*http://spark.apache.org/

Page 5

Architecture

*http://spark.apache.org/docs/latest/cluster-overview.html

Page 6

Why Spark?

• In-memory computation (fast)
• Ability to cache (intermediate) results in memory (or on disk)
• Easy API
• Plenty of out-of-the-box libraries

*http://spark.apache.org/docs/latest/mllib-guide.html

Page 7

MLlib

• Spark’s machine learning library
• Supports:
  • basic statistics
  • classification and regression
  • clustering
  • dimensionality reduction
  • evaluations
  • …

*http://spark.apache.org/docs/latest/mllib-guide.html

Page 8

Linear regression demo

// imports
// V1,V2,V3,R
// 1,1,1,0.1
// 1,0,1,0.5

val sc: SparkContext = initContext()
val data = sc.textFile(...)
val parsedData: RDD[LabeledPoint] = data.map { line =>
  // parsing
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction: Double = model.predict(point.features)
  (point.label, prediction)
}

val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()

*http://spark.apache.org/docs/latest/mllib-linear-methods.html


Page 12

But…

• Are the implementations fast enough?
• Are the implementations accurate enough?
• What about other algorithms (e.g. where’s my Deep Learning)?
• What about visualisations?

*http://spark.apache.org/docs/latest/mllib-guide.html

Page 13

H2O

Page 14–16

What is H2O?

• Math platform
  • Open source
  • Set of math and predictive algorithms: GLM, Random Forest, GBM, Deep Learning etc.
• API
  • Written in high-performance Java, with a native Java API
  • Drivers for R, Python, Excel, Tableau
  • REST API
• Big data focused
  • Highly parallel and distributed implementation
  • Fast in-memory computation on highly compressed data
  • Allows you to use all your data without sampling
  • Based on mutable data structures

Page 17–20

FlowUI

• Notebook style open source interface for H2O

• Allows you to combine code execution, text, mathematics, plots, and rich media in a single document

Page 21

Why H2O?

• Speed and accuracy

• Algorithms/functionality not present in MLlib

• Access to FlowUI

• Possibility to generate dependency free (Java) models

• Option to checkpoint models (though not all) and continue learning in the future

Page 22

Sparkling Water

Page 23

What is Sparkling Water?

• Framework integrating Spark and H2O
• Transparent use of H2O data structures and algorithms with the Spark API, and vice versa
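A minimal sketch of that transparency, assuming the 1.6-era Sparkling Water API used later in this deck (the input path and column contents are hypothetical): `H2OContext` provides conversions between Spark DataFrames/RDDs and H2O frames in both directions.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.h2o._

val sc: SparkContext = ???            // provided by spark-shell / sparkling-shell
val sqlContext = new SQLContext(sc)

// Start H2O services on top of the running Spark cluster
val h2oContext = new H2OContext(sc).start()
import h2oContext._

// Spark -> H2O: publish a DataFrame as an H2OFrame
val df = sqlContext.read.json("data.json")   // hypothetical input
val h2oFrame: H2OFrame = h2oContext.asH2OFrame(df)

// H2O -> Spark: read an H2OFrame back as a DataFrame
val df2 = h2oContext.asDataFrame(h2oFrame)
```

The `import h2oContext._` line also brings implicit conversions into scope, so in many places a DataFrame can be passed directly where an H2OFrame is expected.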

Page 24–26
Page 27

Common use-cases

Page 28

Modeling

Data Source → ETL → Modelling → Predictions

Modelling with Deep Learning, GBM, DRF, GLM, PCA, Ensembles etc.

Page 29

ETL

Data Source → ETL → Modelling → Predictions

Page 30

Stream Processing

Data Source → ETL → Modelling → Predictions

Data Stream (Spark Streaming/Storm/Flink etc.) → Predictions

Page 31

Demo #1: Sparkling Shell

Page 32

REQUIREMENTS
• Windows/Linux/MacOS
• Java 1.7+
• Spark 1.3+
• SPARK_HOME set

INSTALLATION
1. http://www.h2o.ai/download
2. set the MASTER env variable
3. unzip
4. run bin/sparkling-shell

Page 33

DEV FLOW
1. create a script file containing application code
2. run it with bin/sparkling-shell -i script_name.script.scala

OR

1. run bin/sparkling-shell and simply use the REPL

import org.apache.spark.h2o._

// sc - SparkContext already provided by the shell

val h2oContext = new H2OContext(sc).start()
import h2oContext._

// Application logic

Page 34

Airline delay classification

Model predicting flight delays

ETL → Modelling → Predictions

ETL:
• load data from CSVs
• use Spark APIs to filter and join data

Modelling: build the model using H2O’s GBM

*https://github.com/h2oai/sparkling-water/tree/master/examples/scripts

Page 35

Gradient Boosting Machines

• Classification and regression predictive modelling
• Ensemble of multiple weak models (usually decision trees)
• Iteratively fits the residuals (gradient boosting)
• Stochastic
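As a sketch of what the GBM step looks like through H2O’s native Java/Scala API (class names as in H2O 3.x; the training frame and the `IsDepDelayed` response column are assumptions modelled on the airline demo):

```scala
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters
import water.fvec.H2OFrame

val trainFrame: H2OFrame = ???             // e.g. airline data prepared with Spark

// Configure the boosted-tree ensemble
val params = new GBMParameters()
params._train = trainFrame._key
params._response_column = "IsDepDelayed"   // hypothetical response column
params._ntrees = 100                       // number of weak learners

// Train: each iteration fits a new tree to the current residuals
val model = new GBM(params).trainModel().get()
```

The blocking `trainModel().get()` call returns the fitted model, which can then be inspected in FlowUI or used for scoring.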

Page 36

Demo #2: FlowUI

Page 37

Demo #3: Standalone app

Page 38

REQUIREMENTS

• git
• editor of choice (IntelliJ/Eclipse support)

Page 39

BOOTSTRAP

1. git clone https://github.com/h2oai/h2o-droplets.git
2. cd h2o-droplets/sparkling-water-droplet
3. if using IntelliJ or Eclipse:
   – ./gradlew idea
   – ./gradlew eclipse
   – import the project in the IDE
4. develop your app

Page 40

DEPLOYMENT

1. build: ./gradlew build shadowJar
2. submit with:

$SPARK_HOME/bin/spark-submit \
  --class water.droplets.SWTokyoDemo \
  --master local[*] \
  --packages ai.h2o:sparkling-water-core_2.10:1.6.5 \
  build/libs/sparkling-water-droplet-app.jar
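For orientation, the standalone class behind that spark-submit invocation could look roughly like this. This is only a sketch of the droplet structure, not the actual shipped code; the object name matches the --class argument above.

```scala
package water.droplets

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.h2o._

object SWTokyoDemo {
  def main(args: Array[String]): Unit = {
    // The master comes from spark-submit (--master local[*])
    val conf = new SparkConf().setAppName("SWTokyoDemo")
    val sc = new SparkContext(conf)

    // Start H2O services on top of Spark
    val h2oContext = new H2OContext(sc).start()
    import h2oContext._

    // Application logic (ETL with Spark, modelling with H2O) goes here

    sc.stop()
  }
}
```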

Page 41

Open Source

• Github:

https://github.com/h2oai/sparkling-water

• JIRA:

http://jira.h2o.ai

• Google groups:

https://groups.google.com/forum/?hl=en#!forum/h2ostream

Page 42

More Info

• Documentation and booklets:

http://www.h2o.ai/docs/

• H2O.ai blog:

http://h2o.ai/blog

• H2O.ai YouTube channel:

https://www.youtube.com/user/0xdata

@h2oai

http://www.h2o.ai

Page 43

Thank you!

@mdymczyk

Mateusz Dymczyk

[email protected]

Page 44

Q&A