Introduction to Fast Scalable Machine Learning with H2O - Plano
TRANSCRIPT
H2O.ai Machine Intelligence
What is H2O? Open-source in-memory prediction engine
Math Platform
• Parallelized and distributed algorithms that make full use of multithreaded systems
• GLM, Random Forest, GBM, PCA, etc.
Easy to use and adopt
API
• Written in Java – perfect for Java programmers
• REST API (JSON) – drives H2O from R, Python, Excel, Tableau
More data? Or better models? BOTH
Big Data
• Use all of your data – model without down-sampling
• Run a simple GLM or a more complex GBM to find the best fit for the data
• More Data + Better Models = Better Predictions
H2O Production Analytics Workflow
HDFS
S3
NFS
Distributed In-Memory
Load Data
Lossless Compression
H2O Compute Engine
Production Scoring Environment
Exploratory & Descriptive Analysis
Feature Engineering & Selection
Supervised & Unsupervised Modeling
Model Evaluation & Selection
Predict
Data & Model Storage
Model Export: Plain Old Java Object
Your Imagination
Data Prep Export: Plain Old Java Object
Local
Ensembles
Deep Neural Networks
Algorithms on H2O
• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson, and Tweedie
• Naïve Bayes
• Distributed Random Forest: classification or regression models
• Gradient Boosting Machine: produces an ensemble of decision trees with increasingly refined approximations
• Deep Learning: creates multi-layer feed-forward neural networks, starting with an input layer followed by multiple layers of nonlinear transformations
Supervised Learning
Statistical Analysis
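The "increasingly refined approximations" in GBM come from fitting each new tree to the residuals of the ensemble built so far. A toy sketch of that idea, using depth-1 stumps on 1-D data (illustrative only, not H2O's distributed GBM; all function names here are made up):

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 tree: pick the split that minimizes squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gbm(xs, ys, rounds=10, rate=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    stumps = []
    def predict(x):
        return sum(rate * s(x) for s in stumps)
    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs, ys = [1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0]
model = gbm(xs, ys)
print([round(model(x), 2) for x in xs])  # → [1.0, 1.0, 3.0, 3.0]
```

Each round shrinks the remaining error by the learning rate, so the ensemble's predictions approach the targets geometrically.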
Dimensionality Reduction
Anomaly Detection
Algorithms on H2O
• K-means: Partitions observations into k clusters/groups of the same spatial size
• Principal Component Analysis: Linearly transforms correlated variables to independent components
• Generalized Low Rank Models*: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
• Autoencoders: find outliers via nonlinear dimensionality reduction with deep learning
Unsupervised Learning
Clustering
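The partitioning that K-means performs can be sketched with a minimal, stdlib-only version of Lloyd's algorithm on 1-D points (an illustration of the algorithm itself, not H2O's distributed implementation):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm on 1-D data (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated groups, around 0.5 and 9.5.
data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
print(kmeans_1d(data, k=2))  # → [0.5, 9.5]
```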
H2O Software Stack
Clients (over the network): JavaScript, R, Python, Excel/Tableau, Scala
In-H2O Prediction Engine: Rapids Expression Evaluation Engine, Customer Algorithms, Parse, GLM, GBM, RF, Deep Learning, K-Means, PCA
Core: Fluid Vector Frame, Distributed K/V Store, Non-blocking Hash Map, Job, MRTask, Fork/Join, Flow, Customer Algorithm
Runs on: Spark, Hadoop, Standalone H2O
Scientific Advisory Council
Stephen Boyd, Professor of Electrical Engineering, Stanford University
Rob Tibshirani, Professor of Health Research and Policy, and Statistics, Stanford University
Trevor Hastie, Professor of Statistics, Stanford University
H2O and R
Reading Data from HDFS into H2O with R
STEP 1
R user
h2o_df = h2o.importFile("hdfs://path/to/data.csv")
Reading Data from HDFS into H2O with R
STEP 2
2.1 R function call: h2o.importFile()
2.2 HTTP REST API request to H2O; the request carries the HDFS path
2.3 The H2O cluster initiates a distributed ingest of data.csv
2.4 H2O requests the data from HDFS
Reading Data from HDFS into H2O with R
STEP 3
3.1 HDFS provides the data (data.csv)
3.2 A distributed H2OFrame is created in the DKV (distributed key/value store)
3.3 The REST API JSON response returns a pointer to the data (cluster IP, cluster port, pointer to data)
3.4 The h2o_df object is created in R
R Script Starting H2O GLM
R script (user process, standard R): h2o.glm() calls .h2o.startModelJob(), which issues POST /3/ModelBuilders/glm as HTTP REST/JSON over TCP/IP
H2O process: network layer → REST layer (/3/ModelBuilders/glm endpoint) → Job → GLM algorithm → GLM tasks → Fork/Join framework → K/V store framework
Legend: the R script runs in the user process; the endpoint, Job, and algorithm run in the H2O process (REST layer, H2O-algos, H2O-core)
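Because every client action is just an HTTP call, the request behind h2o.glm() can be sketched without any H2O library at all. The endpoint path comes from the slide; the parameter names and the cluster address below are illustrative placeholders:

```python
from urllib.parse import urlencode

# Placeholder cluster address; a real cluster would be e.g. localhost:54321.
base = "http://127.0.0.1:54321"
endpoint = "/3/ModelBuilders/glm"

# Illustrative GLM parameters, form-encoded as a REST layer would expect.
params = urlencode({
    "training_frame": "h2o_df",
    "response_column": "label",
    "family": "binomial",
})

request_line = f"POST {endpoint} HTTP/1.1"
print(request_line)  # → POST /3/ModelBuilders/glm HTTP/1.1
print(params)
```

The R client never computes anything itself; it only builds requests like this and parses the JSON responses.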
R Script Retrieving H2O GLM Result
R script (user process, standard R): h2o.glm() calls h2o.getModel(), which issues GET /3/Models/glm_model_id as HTTP REST/JSON over TCP/IP
H2O process: network layer → REST layer (/3/Models endpoint) → K/V store framework (via the Fork/Join framework) to fetch the model
Legend: the R script runs in the user process; the endpoint and model lookup run in the H2O process (REST layer, H2O-algos, H2O-core)
H2O in Big Data Environments
Hadoop (and YARN)
• You can launch H2O directly on Hadoop:
$ hadoop jar h2odriver.jar … -nodes 3 -mapperXmx 50g
• H2O uses Hadoop MapReduce to get CPU and memory on the cluster, not to manage work
  – H2O mappers stay at 0% progress forever, until you shut down the H2O job yourself
  – All mappers (3 in this case) must be running at the same time
  – The mappers communicate with each other
• The mappers form an H2O cluster on-the-spot within your Hadoop environment (no Hadoop reducers!)
• Special YARN memory settings are needed for large mappers:
  – yarn.nodemanager.resource.memory-mb
  – yarn.scheduler.maximum-allocation-mb
• CPU resources are controlled via the -nthreads H2O command-line argument
H2O on YARN Deployment
Hadoop Gateway Node: hadoop jar h2odriver.jar …
Hadoop Cluster: YARN ResourceManager (RM), YARN NodeManagers (NM) on the worker nodes, HDFS data nodes
H2O mappers run as YARN containers on the YARN worker nodes
Now You Have an H2O Cluster
(Same diagram: the H2O mappers, running in YARN containers on the worker nodes, now form an H2O cluster)
Read Data from HDFS Once
(Same diagram: each H2O mapper reads its portion of the data from the HDFS data nodes, once)
Build Models In-Memory
(Same diagram: models are built in memory inside the H2O mappers; HDFS is no longer involved)
Sparkling Water
Sparkling Water Application Life Cycle
(1) spark-submit submits the Sparkling App jar file to the Spark Master JVM
(2) The Spark Master schedules the application onto the Spark Worker JVMs
(3) Spark Executor JVMs start on the workers, forming the Sparkling Water cluster
(4) An H2O instance starts inside each Spark Executor JVM
Sparkling Water Data Distribution
(1) Data is read from a data source (e.g. HDFS) into a Spark RDD in the Spark Executor JVMs
(2) The Spark RDD is converted into an H2O RDD
(3) The resulting data is distributed across the H2O instances in the Sparkling Water cluster
How H2O Processes Data
Distributed Data Taxonomy: Vector
Distributed Data Taxonomy: Vector
The vector may be very large (billions of rows)
- Stored as a compressed column (often 4x compression)
- Accessed as Java primitives with on-the-fly decompression
- Supports fast random access
- Modifiable with Java memory semantics
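The compressed-column idea (pick the narrowest encoding that holds the column, decode on access) can be sketched in Python; this is illustrative only, as H2O's chunk compression schemes are richer than this. Storing values that fit in 2-byte shorts instead of 8-byte doubles is exactly the kind of "often 4x" saving mentioned above:

```python
from array import array

def compress_column(values):
    """Pick the narrowest integer array type that holds all values."""
    lo, hi = min(values), max(values)
    if -128 <= lo and hi <= 127:
        typecode = "b"   # 1 byte per element (8x smaller than a double)
    elif -32768 <= lo and hi <= 32767:
        typecode = "h"   # 2 bytes per element (4x smaller than a double)
    else:
        typecode = "l"   # fall back to a wide integer
    return array(typecode, values)

col = compress_column([3, -7, 1000, 42])   # fits in 2-byte shorts
print(col.typecode, col.itemsize)          # → h 2
print(list(col))                           # decoded on access → [3, -7, 1000, 42]
```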
Distributed Data Taxonomy: Vector
Large vectors must be distributed over multiple JVMs
- The vector is split into chunks
- A chunk is the unit of parallel access
- Each chunk holds ~1000 elements
- Compression is per-chunk
- Each chunk is homed to a single node
- Chunks can be spilled to disk
- GC is very cheap
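Chunking as the unit of parallel access can be sketched in plain Python. The chunk size matches the ~1000 elements from the slide, but the threading here is only a stand-in for H2O's node-local parallelism:

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000  # each chunk holds ~1000 elements, as on the slide

def chunks(vec, size=CHUNK_SIZE):
    """Split a vector into fixed-size chunks, the unit of parallel access."""
    return [vec[i:i + size] for i in range(0, len(vec), size)]

vec = list(range(10_000))
parts = chunks(vec)
print(len(parts))  # → 10

# Each chunk can be processed independently and in parallel.
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(sum, parts))
print(sum(partial_sums) == sum(vec))  # → True
```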
Distributed Data Taxonomy: Frame
Columns: ID, age, gender, zip_code
A row of data is always stored in a single JVM
Distributed data frame
- Similar to an R data frame
- Adding and removing columns is cheap
Distributed Fork/Join
A task is distributed across the JVMs in a tree pattern
- Results are reduced at each inner node
- Returns a single result when all subtasks are done
Distributed Fork/Join
On each node, the task is further parallelized across that node's chunks using Fork/Join
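The tree pattern described on these slides (map at the leaves, reduce at each inner node) can be sketched as a recursive fork/join over a list of chunks. This illustrates the pattern only, not H2O's actual MRTask implementation on the Java Fork/Join framework:

```python
def tree_reduce(chunks, map_fn, reduce_fn):
    """Fork: split the chunk list in half; join: reduce the two results."""
    if len(chunks) == 1:
        return map_fn(chunks[0])              # leaf: map over one chunk
    mid = len(chunks) // 2
    left = tree_reduce(chunks[:mid], map_fn, reduce_fn)
    right = tree_reduce(chunks[mid:], map_fn, reduce_fn)
    return reduce_fn(left, right)             # inner node: reduce results

# Example: a distributed sum over 4 chunks of a vector.
chunk_list = [[1, 2], [3, 4], [5, 6], [7, 8]]
total = tree_reduce(chunk_list, map_fn=sum, reduce_fn=lambda a, b: a + b)
print(total)  # → 36
```

Because the reduction happens pairwise up the tree, no single node ever has to gather all the partial results at once.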
Deploying H2O POJO Models
H2O on Storm
Modeling workflow: H2O reads data from a data source (e.g. HDFS), builds a model, and emits a POJO model
Real-time stream: real-time data arrives via a Spout, the H2O scoring POJO runs inside a Bolt, and predictions are emitted
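An exported POJO embeds the fitted model as dependency-free code, which is why a Storm bolt can score in real time with no H2O cluster attached. For a binomial GLM, scoring is just the linear predictor pushed through the logistic link; a sketch of that arithmetic with made-up coefficients:

```python
import math

# Hypothetical fitted GLM coefficients (illustrative values, not a real model).
COEFFICIENTS = {"age": 0.03, "income": 0.00001, "intercept": -2.0}

def score(row):
    """Binomial GLM scoring: linear predictor through the logistic link."""
    eta = COEFFICIENTS["intercept"] + sum(
        COEFFICIENTS[name] * value for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-eta))  # probability of the positive class

p = score({"age": 40, "income": 80_000})
print(round(p, 3))  # → 0.5
```

A generated POJO does the same computation in plain Java, so it drops into any JVM process, Storm bolt included.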