TRANSCRIPT
Introduction to Machine Learning with H2O and Python
Jo-fai (Joe) Chow
Data Scientist
@matlabulous
H2O Tutorial at Analyx, 20th April 2017
About Me
• Civil (Water) Engineer
  • 2010 – 2015
    • Consultant (UK): Utilities, Asset Management, Constrained Optimization
    • Industrial PhD (UK): Infrastructure Design Optimization, Machine Learning + Water Engineering
    • Discovered H2O in 2014
• Data Scientist
  • 2015: Virgin Media (UK), Domino Data Lab (Silicon Valley, US)
  • 2016 – Present: H2O.ai (Silicon Valley, US)
Side Project #1 – Crime Data Visualization
• https://github.com/woobe/rApps/tree/master/crimemap
• http://insidebigdata.com/2013/11/30/visualization-week-crimemap/
Side Project #2 – Data Visualization Contest
• https://github.com/woobe/rugsmaps
• http://blog.revolutionanalytics.com/2014/08/winner-for-revolution-analytics-user-group-map-contest.html
About Me
• R + H2O + Domino for Kaggle: Guest Blog Post for Domino & H2O (2014)
• The Long Story: bit.ly/joe_kaggle_story
Agenda
• About H2O.ai
  • Company
  • Machine Learning Platform
• Tutorial
  • H2O Python Module
  • Download & Install
  • Step-by-Step Examples:
    • Basic Data Import / Manipulation
    • Regression & Classification (Basics)
    • Regression & Classification (Advanced)
    • Using H2O in the Cloud
Background Information
For beginners
As if I am working on Kaggle competitions
Short Break
Company Overview
• Founded: 2011; venture-backed, debuted in 2012
• Products:
  • H2O: Open Source In-Memory AI Prediction Engine
  • Sparkling Water
  • Steam
• Mission: Operationalize Data Science, and provide a platform for users to build beautiful data products
• Team: 70 employees
  • Distributed Systems Engineers doing Machine Learning
  • World-class visualization designers
• Headquarters: Mountain View, CA
H2O Community Growth: Tremendous Momentum Globally
[Chart: # H2O Users, Jan 2015 – Oct 2016, growing to 65,000+]
• 65,000+ users globally (Sept 2016)
• 65,000+ users from ~8,000 companies in 140 countries (top 5 countries shown in the original slide)
* Data from Google Analytics embedded in the end user product
[Chart: # Companies Using H2O, Jan 2015 – Oct 2016, reaching ~8,000+ (Sept 2016); growth annotations: +127% and +60%]
H2O for Academic Research
http://www.sciencedirect.com/science/article/pii/S0377221716308657
https://arxiv.org/abs/1509.01199
High Level Architecture
[Diagram: data sources (HDFS, S3, NFS, Local, SQL) are loaded with loss-less compression into the distributed in-memory H2O Compute Engine; within the engine: Exploratory & Descriptive Analysis → Feature Engineering & Selection → Supervised & Unsupervised Modeling → Model Evaluation & Selection → Predict; outputs: Data & Model Storage, plus Model Export and Data Prep Export as Plain Old Java Objects for a Production Scoring Environment and "Your Imagination"]
Import Data from Multiple Sources
Fast, Scalable & Distributed Compute Engine Written in Java
Algorithms Overview

Supervised Learning
• Statistical Analysis
  • Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
  • Naïve Bayes
• Ensembles
  • Distributed Random Forest: classification or regression models
  • Gradient Boosting Machine: produces an ensemble of decision trees with increasingly refined approximations
• Deep Neural Networks
  • Deep Learning: creates multi-layer feed-forward neural networks, starting with an input layer followed by multiple layers of nonlinear transformations

Unsupervised Learning
• Clustering
  • K-means: partitions observations into k clusters/groups of the same spatial size; automatically detects the optimal k
• Dimensionality Reduction
  • Principal Component Analysis: linearly transforms correlated variables into independent components
  • Generalized Low Rank Models: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing values
• Anomaly Detection
  • Autoencoders: find outliers via nonlinear dimensionality reduction using deep learning
Multiple Interfaces
Export Standalone Models for Production
Learning Objectives
• Start and connect to a local H2O cluster from Python.
• Import data from Python data frames, local files or web.
• Perform basic data transformation and exploration.
• Train regression and classification models using various H2O machine learning algorithms.
• Evaluate models and make predictions.
• Improve performance by tuning and stacking.
• Connect to H2O cluster in the cloud.
Local H2O Cluster
• Import the H2O module
• Start a local H2O cluster; nthreads = -1 means using ALL CPU resources
Improving Model Performance (Step-by-Step)

Model Settings                           | MSE (CV) | MSE (Test)
GBM with default settings                | N/A      | 0.4551
GBM with manual settings                 | N/A      | 0.4433
Manual settings + cross-validation       | 0.4502   | 0.4433
Manual + CV + early stopping             | 0.4429   | 0.4287
CV + early stopping + full grid search   | 0.4378   | 0.4196
CV + early stopping + random grid search | 0.4227   | 0.4047
Stacking models from random grid search  | N/A      | 0.3969

Lower Mean Square Error = Better Performance
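As a reminder of the metric being compared: mean squared error is the average of the squared residuals, and the table's last row amounts to roughly a 12.8% reduction versus the default GBM. A small self-contained sketch (table values copied from above):

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy check: residuals are (0, 0, 1), so MSE = 1/3.
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))  # 0.333...

# Relative improvement from the default GBM to the stacked ensemble.
baseline, stacked = 0.4551, 0.3969
print(f"{(baseline - stacked) / baseline:.1%}")  # 12.8%
```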
Grid Search

Combination | Parameter 1 | Parameter 2
1           | 0.7         | 0.7
2           | 0.7         | 0.8
3           | 0.7         | 0.9
4           | 0.8         | 0.7
5           | 0.8         | 0.8
6           | 0.8         | 0.9
7           | 0.9         | 0.7
8           | 0.9         | 0.8
9           | 0.9         | 0.9
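The nine combinations above are simply the Cartesian product of the two value lists. This is the enumeration a full grid search walks exhaustively (in h2o, `H2OGridSearch` over a `hyper_params` dict), while a random grid search samples only a subset of it. A pure-Python illustration:

```python
from itertools import product

param_1 = [0.7, 0.8, 0.9]
param_2 = [0.7, 0.8, 0.9]

# Full (Cartesian) grid: every pairing of the two parameter values.
grid = list(product(param_1, param_2))
print(len(grid))           # 9 combinations, matching the table
print(grid[0], grid[-1])   # (0.7, 0.7) (0.9, 0.9)

# A random grid search would instead sample from `grid`,
# trading exhaustiveness for a much smaller search budget.
```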
https://github.com/h2oai/h2o-meetups/blob/master/2017_02_23_Metis_SF_Sacked_Ensembles_Deep_Water/stacked_ensembles_in_h2o_feb2017.pdf
Lowest MSE = Best Performance
API for Stacked Ensembles
Use the three models from previous steps
• Our Friends at [logo]
• Find us at Poznan R Meetup
  • Today at 6:15 pm
  • Uniwersytet Ekonomiczny w Poznaniu, Centrum Edukacyjne Usług Elektronicznych

Thanks!
• Code, Slides & Documents
  • bit.ly/h2o_meetups
  • docs.h2o.ai
• Contact
  • [email protected]
  • @matlabulous
  • github.com/woobe
• Please search/ask questions on Stack Overflow using the tag `h2o` (not H2 zero)