Introduction to Fast Scalable Machine Learning with H2O - Plano
TRANSCRIPT
H2O.ai Machine Intelligence
What is H2O? Open-source in-memory prediction engine
Math Platform
• Parallelized and distributed algorithms that make full use of multithreaded systems
• GLM, Random Forest, GBM, PCA, etc.
Easy to use and adopt
API
• Written in Java – perfect for Java programmers
• REST API (JSON) – drives H2O from R, Python, Excel, Tableau
More data? Or better models? BOTH
Big Data
• Use all of your data – model without down-sampling
• Run a simple GLM or a more complex GBM to find the best fit for the data
• More Data + Better Models = Better Predictions
H2O Production Analytics Workflow
HDFS
S3
NFS
Distributed In-Memory
Load Data
Lossless Compression
H2O Compute Engine
Production Scoring Environment
Exploratory & Descriptive Analysis
Feature Engineering & Selection
Supervised & Unsupervised Modeling
Model Evaluation & Selection
Predict
Data & Model Storage
Model Export: Plain Old Java Object
Your Imagination
Data Prep Export: Plain Old Java Object
Local
Ensembles
Deep Neural Networks
Algorithms on H2O
• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson, and Tweedie
• Naïve Bayes
• Distributed Random Forest: classification or regression models
• Gradient Boosting Machine: produces an ensemble of decision trees with increasingly refined approximations
• Deep Learning: creates multi-layer feed-forward neural networks, starting with an input layer followed by multiple layers of nonlinear transformations
Supervised Learning
Statistical Analysis
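The "increasingly refined approximations" in GBM come from fitting each new tree to the residuals of the ensemble built so far. A toy sketch of that idea, using depth-1 stumps on 1-D data (illustrative only, not H2O's distributed GBM; all function names here are made up):

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 tree: pick the split that minimizes squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gbm(xs, ys, rounds=10, rate=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    stumps = []
    def predict(x):
        return sum(rate * s(x) for s in stumps)
    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs, ys = [1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0]
model = gbm(xs, ys)
print([round(model(x), 2) for x in xs])  # → [1.0, 1.0, 3.0, 3.0]
```

Each round shrinks the remaining error by the learning rate, so the ensemble's predictions approach the targets geometrically.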
Dimensionality Reduction
Anomaly Detection
Algorithms on H2O
• K-means: Partitions observations into k clusters/groups of the same spatial size
• Principal Component Analysis: Linearly transforms correlated variables to independent components
• Generalized Low Rank Models*: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
• Autoencoders: find outliers via nonlinear dimensionality reduction with deep learning
Unsupervised Learning
Clustering
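The partitioning that K-means performs can be sketched with a minimal, stdlib-only version of Lloyd's algorithm on 1-D points (an illustration of the algorithm itself, not H2O's distributed implementation):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm on 1-D data (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups, around 0.5 and 9.5.
data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
print(kmeans_1d(data, k=2))  # → [0.5, 9.5]
```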
H2O Software Stack
Clients (over the network): JavaScript, R, Python, Excel/Tableau, Scala
In-H2O Prediction Engine: Rapids Expression Evaluation Engine, Customer Algorithms, Parse, GLM, GBM, RF, Deep Learning, K-Means, PCA
Core: Fluid Vector Frame, Distributed K/V Store, Non-blocking Hash Map, Job, MRTask, Fork/Join, Flow, Customer Algorithm
Runs on: Spark, Hadoop, Standalone H2O
Scientific Advisory Council
Stephen Boyd, Professor of Electrical Engineering, Stanford University
Rob Tibshirani, Professor of Health Research and Policy, and Statistics, Stanford University
Trevor Hastie, Professor of Statistics, Stanford University
H2O and R
Reading Data from HDFS into H2O with R
STEP 1
R user
h2o_df = h2o.importFile("hdfs://path/to/data.csv")
Reading Data from HDFS into H2O with R
STEP 2
2.1 R function call: h2o.importFile()
2.2 HTTP REST API request to H2O; the request carries the HDFS path
2.3 The H2O cluster initiates a distributed ingest of data.csv
2.4 H2O requests the data from HDFS
Reading Data from HDFS into H2O with R
STEP 3
3.1 HDFS provides the data (data.csv)
3.2 A distributed H2OFrame is created in the DKV (distributed key/value store)
3.3 The REST API JSON response returns a pointer to the data (cluster IP, cluster port, pointer to data)
3.4 The h2o_df object is created in R
R Script Starting H2O GLM
R script (user process, standard R): h2o.glm() calls .h2o.startModelJob(), which issues POST /3/ModelBuilders/glm as HTTP REST/JSON over TCP/IP
H2O process: network layer → REST layer (/3/ModelBuilders/glm endpoint) → Job → GLM algorithm → GLM tasks → Fork/Join framework → K/V store framework
Legend: the R script runs in the user process; the endpoint, Job, and algorithm run in the H2O process (REST layer, H2O-algos, H2O-core)
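Because every client action is just an HTTP call, the request behind h2o.glm() can be sketched without any H2O library at all. The endpoint path comes from the slide; the parameter names and the cluster address below are illustrative placeholders:

```python
from urllib.parse import urlencode

# Placeholder cluster address; a real cluster would be e.g. localhost:54321.
base = "http://127.0.0.1:54321"
endpoint = "/3/ModelBuilders/glm"

# Illustrative GLM parameters, form-encoded as a REST layer would expect.
params = urlencode({
    "training_frame": "h2o_df",
    "response_column": "label",
    "family": "binomial",
})

request_line = f"POST {endpoint} HTTP/1.1"
print(request_line)  # → POST /3/ModelBuilders/glm HTTP/1.1
print(params)
```

The R client never computes anything itself; it only builds requests like this and parses the JSON responses.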
R Script Retrieving H2O GLM Result
R script (user process, standard R): h2o.glm() calls h2o.getModel(), which issues GET /3/Models/glm_model_id as HTTP REST/JSON over TCP/IP
H2O process: network layer → REST layer (/3/Models endpoint) → K/V store framework (via the Fork/Join framework) to fetch the model
Legend: the R script runs in the user process; the endpoint and model lookup run in the H2O process (REST layer, H2O-algos, H2O-core)
H2O in Big Data Environments
Hadoop (and YARN)
• You can launch H2O directly on Hadoop:
$ hadoop jar h2odriver.jar … -nodes 3 -mapperXmx 50g
• H2O uses Hadoop MapReduce to get CPU and memory on the cluster, not to manage work
  – H2O mappers stay at 0% progress forever, until you shut down the H2O job yourself
  – All mappers (3 in this case) must be running at the same time
  – The mappers communicate with each other
• The mappers form an H2O cluster on-the-spot within your Hadoop environment (no Hadoop reducers!)
• Special YARN memory settings are needed for large mappers:
  – yarn.nodemanager.resource.memory-mb
  – yarn.scheduler.maximum-allocation-mb
• CPU resources are controlled via the -nthreads H2O command-line argument
H2O on YARN Deployment
Hadoop Gateway Node: hadoop jar h2odriver.jar …
Hadoop Cluster: YARN ResourceManager (RM), YARN NodeManagers (NM) on the worker nodes, HDFS data nodes
H2O mappers run as YARN containers on the YARN worker nodes
Now You Have an H2O Cluster
(Same diagram: the H2O mappers, running in YARN containers on the worker nodes, now form an H2O cluster)
Read Data from HDFS Once
(Same diagram: each H2O mapper reads its portion of the data from the HDFS data nodes, once)
Build Models In-Memory
(Same diagram: models are built in memory inside the H2O mappers; HDFS is no longer involved)
Sparkling Water
Sparkling Water Application Life Cycle
(1) spark-submit submits the Sparkling App jar file to the Spark Master JVM
(2) The Spark Master schedules the application onto the Spark Worker JVMs
(3) Spark Executor JVMs start on the workers, forming the Sparkling Water cluster
(4) An H2O instance starts inside each Spark Executor JVM
Sparkling Water Data Distribution
(1) Data is read from a data source (e.g. HDFS) into a Spark RDD in the Spark Executor JVMs
(2) The Spark RDD is converted into an H2O RDD
(3) The resulting data is distributed across the H2O instances in the Sparkling Water cluster
How H2O Processes Data
Distributed Data Taxonomy: Vector
Distributed Data Taxonomy: Vector
The vector may be very large (billions of rows)
- Stored as a compressed column (often 4x compression)
- Accessed as Java primitives with on-the-fly decompression
- Supports fast random access
- Modifiable with Java memory semantics
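The compressed-column idea (pick the narrowest encoding that holds the column, decode on access) can be sketched in Python; this is illustrative only, as H2O's chunk compression schemes are richer than this. Storing values that fit in 2-byte shorts instead of 8-byte doubles is exactly the kind of "often 4x" saving mentioned above:

```python
from array import array

def compress_column(values):
    """Pick the narrowest integer array type that holds all values."""
    lo, hi = min(values), max(values)
    if -128 <= lo and hi <= 127:
        typecode = "b"   # 1 byte per element (8x smaller than a double)
    elif -32768 <= lo and hi <= 32767:
        typecode = "h"   # 2 bytes per element (4x smaller than a double)
    else:
        typecode = "l"   # fall back to a wide integer
    return array(typecode, values)

col = compress_column([3, -7, 1000, 42])   # fits in 2-byte shorts
print(col.typecode, col.itemsize)          # → h 2
print(list(col))                           # decoded on access → [3, -7, 1000, 42]
```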
Distributed Data Taxonomy: Vector
Large vectors must be distributed over multiple JVMs
- The vector is split into chunks
- A chunk is the unit of parallel access
- Each chunk holds ~1000 elements
- Compression is per-chunk
- Each chunk is homed to a single node
- Chunks can be spilled to disk
- GC is very cheap
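Chunking as the unit of parallel access can be sketched in plain Python. The chunk size matches the ~1000 elements from the slide, but the threading here is only a stand-in for H2O's node-local parallelism:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000  # each chunk holds ~1000 elements, as on the slide

def chunks(vec, size=CHUNK_SIZE):
    """Split a vector into fixed-size chunks, the unit of parallel access."""
    return [vec[i:i + size] for i in range(0, len(vec), size)]

vec = list(range(10_000))
parts = chunks(vec)
print(len(parts))  # → 10

# Each chunk can be processed independently and in parallel.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(sum, parts))
print(sum(partial_sums) == sum(vec))  # → True
```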
Distributed Data Taxonomy: Frame
Columns: ID, age, gender, zip_code
A row of data is always stored in a single JVM
Distributed data frame
- Similar to an R data frame
- Adding and removing columns is cheap
Distributed Fork/Join
A task is distributed across the JVMs in a tree pattern
- Results are reduced at each inner node
- Returns a single result when all subtasks are done
Distributed Fork/Join
On each node, the task is further parallelized across that node's chunks using Fork/Join
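The tree pattern described on these slides (map at the leaves, reduce at each inner node) can be sketched as a recursive fork/join over a list of chunks. This illustrates the pattern only, not H2O's actual MRTask implementation on the Java Fork/Join framework:

```python
def tree_reduce(chunks, map_fn, reduce_fn):
    """Fork: split the chunk list in half; join: reduce the two results."""
    if len(chunks) == 1:
        return map_fn(chunks[0])              # leaf: map over one chunk
    mid = len(chunks) // 2
    left = tree_reduce(chunks[:mid], map_fn, reduce_fn)
    right = tree_reduce(chunks[mid:], map_fn, reduce_fn)
    return reduce_fn(left, right)             # inner node: reduce results

# Example: a distributed sum over 4 chunks of a vector.
chunk_list = [[1, 2], [3, 4], [5, 6], [7, 8]]
total = tree_reduce(chunk_list, map_fn=sum, reduce_fn=lambda a, b: a + b)
print(total)  # → 36
```

Because the reduction happens pairwise up the tree, no single node ever has to gather all the partial results at once.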
Deploying H2O POJO Models
H2O on Storm
Modeling workflow: H2O reads data from a data source (e.g. HDFS), builds a model, and emits a POJO model
Real-time stream: real-time data arrives via a Spout, the H2O scoring POJO runs inside a Bolt, and predictions are emitted
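An exported POJO embeds the fitted model as dependency-free code, which is why a Storm bolt can score in real time with no H2O cluster attached. For a binomial GLM, scoring is just the linear predictor pushed through the logistic link; a sketch of that arithmetic with made-up coefficients:

```python
import math

# Hypothetical fitted GLM coefficients (illustrative values, not a real model).
COEFFICIENTS = {"age": 0.03, "income": 0.00001, "intercept": -2.0}

def score(row):
    """Binomial GLM scoring: linear predictor through the logistic link."""
    eta = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * value for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-eta))  # probability of the positive class

p = score({"age": 40, "income": 80_000})
print(round(p, 3))  # → 0.5
```

A generated POJO does the same computation in plain Java, so it drops into any JVM process, Storm bolt included.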