introduction to fast scalable machine learning with h2o - plano

30
H 2 O.ai Machine Intelligence What is H2O? Open source in-memory prediction engine Math Platform Parallelized and distributed algorithms making the most use out of multithreaded systems GLM, Random Forest, GBM, PCA, etc. Easy to use and adopt API Written in Java – perfect for Java Programmers REST API (JSON) – drives H2O from R, Python, Excel, Tableau More data? Or better models? BOTH Big Data Use all of your data – model without down sampling Run a simple GLM or a more complex GBM to find the best fit for the data More Data + Better Models = Better Predictions

Upload: srisatish-ambati

Post on 06-Jan-2017

1.429 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

What is H2O? Open source in-memory prediction engineMath Platform

• Parallelized and distributed algorithms making the most use out of multithreaded systems

• GLM, Random Forest, GBM, PCA, etc.

Easy to use and adoptAPI• Written in Java – perfect for Java Programmers• REST API (JSON) – drives H2O from R, Python, Excel, Tableau

More data? Or better models? BOTHBig Data• Use all of your data – model without down sampling• Run a simple GLM or a more complex GBM to find the best fit for the data• More Data + Better Models = Better Predictions

Page 2: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

H2O Production Analytics Workflow

HDFS

S3

NFS

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Local

Page 3: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

Ensembles

Deep Neural Networks

Algorithms on H2O

• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Naïve Bayes • Distributed Random Forest:

Classification or regression models• Gradient Boosting Machine: Produces

an ensemble of decision trees with increasing refined approximations

• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Supervised Learning

Statistical Analysis

Page 4: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

Dimensionality Reduction

Anomaly Detection

Algorithms on H2O

• K-means: Partitions observations into k clusters/groups of the same spatial size

• Principal Component Analysis: Linearly transforms correlated variables to independent components

• Generalized Low Rank Models*: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning

Unsupervised Learning

Clustering

Page 5: Introduction to Fast Scalable Machine Learning with H2O - Plano

JavaScript R Python Excel/Tableau

Network

Rapids Expression Evaluation Engine Scala Customer

Algorithm

Customer AlgorithmParse

GLMGBMRF

Deep LearningK-Means

PCA

In-H2O Prediction Engine

H2O Software Stack

Fluid Vector Frame

Distributed K/V Store

Non-blocking Hash Map

Job

MRTask

Fork/Join

Flow

Customer Algorithm

Spark Hadoop Standalone H2O

Page 6: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

cientific Advisory CouncilStephen BoydProfessor of EE EngineeringStanford University

Rob TibshiraniProfessor of Health Researchand Policy, and StatisticsStanford University

Trevor HastieProfessor of StatisticsStanford University

Page 7: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O and R

Page 8: Introduction to Fast Scalable Machine Learning with H2O - Plano

Reading Data from HDFS into H2O with R

STEP 1

R user

h2o_df = h2o.importFile(“hdfs://path/to/data.csv”)

Page 9: Introduction to Fast Scalable Machine Learning with H2O - Plano

Reading Data from HDFS into H2O with R

H2OH2O

H2O

data.csv

HTTP REST API request to H2Ohas HDFS path

H2O ClusterInitiate distributed

ingest

HDFSRequest

data from HDFS

STEP 22.2

2.3

2.4

R

h2o.importFile()

2.1R function

call

Page 10: Introduction to Fast Scalable Machine Learning with H2O - Plano

Reading Data from HDFS into H2O with R

H2OH2O

H2O

R

HDFS

STEP 3

Cluster IPCluster Port

Pointer to Data

Return pointer to data in

REST API JSON Response

HDFS provides

data

3.3

3.43.1h2o_df object

created in R

data.csv

h2o_dfH2O

Frame

3.2Distributed

H2OFrame in DKV

H2O Cluster

Page 11: Introduction to Fast Scalable Machine Learning with H2O - Plano

R Script Starting H2O GLM

HTTP

REST/JSON

.h2o.startModelJob()POST /3/ModelBuilders/glm

h2o.glm()

R script

Standard R process

TCP/IP

HTTP

REST/JSON

/3/ModelBuilders/glm endpoint

Job

GLM algorithm

GLM tasks

Fork/Join framework

K/V store framework

H2O process

Network layer

REST layer

H2O - algos

H2O - core

User process

H2O process

Legend

Page 12: Introduction to Fast Scalable Machine Learning with H2O - Plano

R Script Retrieving H2O GLM Result

HTTP

REST/JSON

h2o.getModel()GET /3/Models/glm_model_id

h2o.glm()

R script

Standard R process

TCP/IP

HTTP

REST/JSON

/3/Models endpoint

Fork/Join framework

K/V store framework

H2O process

Network layer

REST layer

H2O - algos

H2O - core

User process

H2O process

Legend

Page 13: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O in Big Data Environments

Page 14: Introduction to Fast Scalable Machine Learning with H2O - Plano

Hadoop (and YARN)• You can launch H2O directly on Hadoop:

$ hadoop jar h2odriver.jar … -nodes 3 –mapperXmx 50g

• H2O uses Hadoop MapReduce to get CPU and Memory on the cluster, not to manage work– H2O mappers stay at 0% progress forever

• Until you shut down the H2O job yourself– All mappers (3 in this case) must be running at the same time– The mappers communicate with each other

• Form an H2O cluster on-the-spot within your Hadoop environment– No Hadoop reducers(!)

• Special YARN memory settings for large mappers– yarn.nodemanager.resource.memory-mb– yarn.scheduler.maximum-allocation-mb

• CPU resources controlled via –nthreads h2o command line argument

Page 15: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O on YARN DeploymentHadoop Gateway Node

hadoop jar h2odriver.jar …

YARN RM

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NM NM NM NMYARN NodeManagers (NM)

H2O H2O

Page 16: Introduction to Fast Scalable Machine Learning with H2O - Plano

Now You Have an H2O ClusterHadoop Gateway Node

hadoop jar h2odriver.jar …

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NMYARN NodeManagers (NM)

H2O H2O

YARN RM

Page 17: Introduction to Fast Scalable Machine Learning with H2O - Plano

Read Data from HDFS OnceHadoop Gateway Node

hadoop jar h2odriver.jar …

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NMYARN NodeManagers (NM)

H2O H2O

YARN RM

Page 18: Introduction to Fast Scalable Machine Learning with H2O - Plano

Build Models in-MemoryHadoop Gateway Node

hadoop jar h2odriver.jar …Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NMYARN NodeManagers (NM)

H2O H2O

YARN RM

Page 19: Introduction to Fast Scalable Machine Learning with H2O - Plano

Sparkling Water

Page 20: Introduction to Fast Scalable Machine Learning with H2O - Plano

Sparkling Water Application Life Cycle

Sparkling App

jar file

SparkMaster

JVM

spark-submit

SparkWorker

JVM

SparkWorker

JVM

SparkWorker

JVM

(1)

(2)

(3)

Sparkling Water Cluster

Spark Executor JVM

H2O(4)

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Page 21: Introduction to Fast Scalable Machine Learning with H2O - Plano

Sparkling Water Data Distribution

H2O

H2O

H2O

Sparkling Water Cluster

Spark Executor JVMData

Source(e.g.

HDFS) (1)

(2)

(3)

H2ORDD

Spark Executor JVM

Spark Executor JVM

SparkRDD

Page 22: Introduction to Fast Scalable Machine Learning with H2O - Plano

How H2O Processes Data

Page 23: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Data TaxonomyVector

H2O.aiMachine Intelligence

Page 24: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Data TaxonomyVector The vector may be very large

(billions of rows)

- Stored as a compressed column (often 4x)

- Access as Java primitives with on-the-fly decompression

- Support fast Random access

- Modifiable with Java memory semanticsH2O.aiMachine Intelligence

Page 25: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Data TaxonomyVector Large vectors must be distributed

over multiple JVMs

- Vector is split into chunks- Chunk is a unit of parallel access- Each chunk ~ 1000 elements- Per-chunk compression- Homed to a single node- Can be spilled to disk- GC very cheap

H2O.aiMachine Intelligence

Page 26: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Data Taxonomy age gender zip_code ID

A row of data is always stored in a single JVM

Distributed data frame

- Similar to R data frame

- Adding and removingcolumns is cheap

H2O.aiMachine Intelligence

Page 27: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Fork/Join

JVMtask

JVMtask

JVMtask

JVMtask

JVMtask

Task is distributed in a tree pattern

- Results are reduced at each inner node

- Returns with a singleresult when all subtasks done

H2O.aiMachine Intelligence

Page 28: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Fork/Join

JVMtask

JVM

task

task task

tasktaskchunk

chunk chunk

On each node the task is parallelized using Fork/Join

H2O.aiMachine Intelligence

Page 29: Introduction to Fast Scalable Machine Learning with H2O - Plano

Deploying H2OPOJO Models

Page 30: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O on Storm

(Spout)

(Bolt)

H2OScoringPOJO

Real-time data

Predictions

DataSource

(e.g. HDFS)

H2O

Emit POJO Model

Real-time StreamModeling workflow