Introduction to Machine Learning with H2O
Jo-fai (Joe) Chow
Data Scientist
@matlabulous
Data Science Milan
Politecnico di Milano
10th October, 2016
About Me: Civil Engineer → Data Scientist
• 2005 - 2015
• Water Engineer
  o Consultant for Utilities
  o Industrial PhD
    • Water Engineering + Machine Learning
• Discovered H2O in 2014!
• 2015 - Present
• Data Scientist
  o Virgin Media (UK)
  o Domino Data Lab (US)
  o H2O.ai (US)
Why? Long story – see bit.ly/joe_h2o_talk2
Agenda
• First Talk (25 mins)
  o About H2O.ai
  o Demo
    • A Simple Classification Task
    • H2O’s Web Interface
  o Why H2O?
    • Our Community
    • Our Customers
  o What’s Next?
    • New H2O Features
• Second Talk (25 mins)
  o H2O for IoT
    • Predictive Maintenance
    • Anomaly Detection
    • H2O’s R Interface
• Third Talk (25 mins)
  o Deep Water
  o Demo
    • H2O + mxnet on GPU
    • H2O’s Python Interface
About H2O.ai
• H2O.ai, the Company
  o Team: 80 (70 shown)
  o Founded in 2012
  o HQ: Mountain View, California
• H2O, the Platform
  o Open Source (Apache 2.0)
  o Algorithms written in Java
    • Fast, distributed and scalable
  o Multiple interfaces to suit different users
    • Web, R, Python, Java, Scala, REST/JSON
  o Works with desktop/laptop, cloud, Spark and Hadoop
Joe
Scientific Advisory Council
Current Algorithm Overview
Joe’s Strata Hadoop London Talk: bit.ly/joe_h2o_talk4
Today’s Demos
Joe’s LondonR Talk: bit.ly/joe_h2o_talk3
H2O Overview
H2O’s Mission
Making Machine Learning Accessible to Everyone
Photo credit: Virgin Media
H2O Web Interface Demo
A Typical Machine Learning Task
• Demo
  o Dataset – MNIST
    • LeCun et al. (1999)
    • Hand-written Digits
  o Import & Explore Data
  o Build & Evaluate Models
  o Make Predictions
Photo credit: http://www.opendeep.org/v0.0.5/docs/tutorial-classifying-handwritten-mnist-images
MNIST Hand-Written Digits
• 784 Inputs
  o 28 x 28 = 784 pixels
• 1 Output
  o 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9
  o Classification
• Files
  o Train (60k Records)
  o Test (10k)
• Links
  o https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz
  o https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz
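A minimal sketch of how one MNIST record maps to the 784 inputs and 1 output described above. It uses plain Python rather than H2O. The helper names (`split_record`, `to_grid`) are hypothetical, and the position of the label column is an assumption, so it is exposed as a parameter.

```python
IMAGE_SIDE = 28
N_PIXELS = IMAGE_SIDE * IMAGE_SIDE  # 28 x 28 = 784 inputs


def split_record(values, label_last=True):
    """Split one flat record (784 pixels + 1 label) into (pixels, label).

    Assumption: the label sits in the last column; pass label_last=False
    if the file stores it first.
    """
    if len(values) != N_PIXELS + 1:
        raise ValueError(f"expected {N_PIXELS + 1} values, got {len(values)}")
    if label_last:
        pixels, label = values[:-1], values[-1]
    else:
        label, pixels = values[0], values[1:]
    return list(pixels), int(label)


def to_grid(pixels):
    """Reshape 784 flat pixel values into 28 rows of 28 pixels."""
    return [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]


# Synthetic record for illustration only: a blank image labelled 7.
record = [0] * N_PIXELS + [7]
pixels, label = split_record(record)
grid = to_grid(pixels)
```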
Photo credit: https://ml4a.github.io/ml4a/neural_networks/
H2O Flow (Web Interface) Demo
• Download and unzip jar from www.h2o.ai
• In terminal:
  o java -jar h2o.jar
• Web browser:
  o localhost:54321
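A small sketch for checking that the web interface is reachable after running `java -jar h2o.jar`. It uses only the Python standard library. The helper names (`flow_url`, `flow_is_up`) are hypothetical and not part of H2O; the host and port are the ones given on the slide.

```python
import http.client
import urllib.error
import urllib.request

DEFAULT_PORT = 54321  # port shown on the slide


def flow_url(host="localhost", port=DEFAULT_PORT):
    """Build the address to open in the web browser."""
    return f"http://{host}:{port}"


def flow_is_up(url, timeout=2.0):
    """Return True if an HTTP server answers with status 200 at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, etc.: treat as "not running".
        return False


if __name__ == "__main__":
    url = flow_url()
    state = "reachable" if flow_is_up(url) else "not reachable (is h2o.jar running?)"
    print(url, state)
```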
H2O Live Demo
More H2O Flow Examples
Other H2O Interfaces
• R
• Python
• docs.h2o.ai
Key Resources
More Advanced Topics
• Advanced Features
  o Hyperparameter Tuning
  o Model Stacking
  o Saving/Loading Models
  o Export Plain Old Java Object (POJO)
• Key Resources
  o docs.h2o.ai
• Joe’s Previous H2O Talks
  o bit.ly/joe_h2o_talk3
  o bit.ly/h2o_budapest_1
  o bit.ly/h2o_paris_1
Why H2O?
Szilard Pafka – Chief Data Scientist at Epoch
• Szilard’s talks / blog posts about H2O:
  o ML Benchmark
  o Intro to ML with H2O
  o H2O Scoring
  o Tweets
Szilard Pafka – Why H2O?
• Szilard’s Summary Slide
H2O for Kaggle
H2O Community Support
Google forum – h2ostream
Please try community.h2o.ai
#AroundTheWorldWithH2Oai
Strata Hadoop, London
PyData, Amsterdam
useR! 2016, Stanford
satRdays, Budapest
London Kaggle Meetup
Chelsea FC
Paris ML Meetup
Big Data London
#AroundTheWorldWithH2Oai
Data Science Milan
Thank you
H2O Usage in Italy
www.h2o.ai/community
www.h2o.ai/customers
H2O in Action
Thank you
Data Science Milan – May 19, 2016
Bringing Deep Learning into production - Paolo Platter, AgileLab
http://www.slideshare.net/ds_mi/bringing-deep-learning-into-production-paolo-platter-agilelab
What’s Next?
H2O is Evolving
• H2O Open Tour NYC YouTube Playlist
  o Advanced data munging
  o Visual ML
  o Deep Water (3rd talk)
  o Sparkling Water
    • PySparkling & RSparkling
  o Steam
Next time?
H2O’s Mission
Making Machine Learning Accessible to Everyone
Photo credit: Virgin Media
End of First Talk – Thanks!
• Data Science Milan
• Gianmario Spacagna
• Politecnico di Milano
• Resources
  o bit.ly/h2o_milan_1
  o www.h2o.ai
  o docs.h2o.ai
• Contact
  o [email protected]
  o @matlabulous
  o github.com/woobe
Extra Slides (H2O Flow Demo Screenshots – just in case)
Upload the file without decompressing it first
Change the data type of “label” from “Numeric” to “Enum” (categorical)
Note: Size in Memory
Click on individual labels to explore data
Split the full dataset into training (80% = 48k records) and
validation (20% = 12k) – a common machine learning
practice
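The 80/20 split above can be sketched in plain Python as a shuffled index split. This illustrates the idea only and is not how H2O Flow implements it; the function name `split_indices` and the seed value are hypothetical.

```python
import random


def split_indices(n_rows, train_ratio=0.8, seed=1234):
    """Shuffle row indices and cut them into training and validation parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed for a repeatable split
    cut = int(round(n_rows * train_ratio))
    return indices[:cut], indices[cut:]


# 60k training records -> 48k for training, 12k for validation.
train_idx, valid_idx = split_indices(60_000)
```

Every record lands in exactly one of the two parts, so the validation set stays unseen during training.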
Click and select parameters
for model training
Users have full access to all available parameters
– fine-tune model training process
For example, I am using rectifier with dropout as the activation,
training the model for 20 epochs with class balancing,
and leaving other settings as default.
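For readers who prefer code to the web interface, the same settings can be sketched with H2O's Python interface. This is not part of the demo shown on the slides. The column name `label`, the seed and the function name `train_mnist_model` are assumptions, and the function needs the `h2o` package plus Java to run.

```python
# Settings from the slide, expressed as H2O Deep Learning parameters.
DL_PARAMS = {
    "activation": "RectifierWithDropout",  # rectifier with dropout
    "epochs": 20,                          # train for 20 epochs
    "balance_classes": True,               # class balancing
}


def train_mnist_model(train_path, label_col="label", seed=1234):
    """Train a deep learning model on a MNIST CSV (sketch, see assumptions)."""
    # Imported lazily so the settings above can be inspected without H2O.
    import h2o
    from h2o.estimators.deeplearning import H2ODeepLearningEstimator

    h2o.init()  # start or connect to a local H2O instance
    frame = h2o.import_file(train_path)
    # Assumption: the target column is named "label"; make it categorical.
    frame[label_col] = frame[label_col].asfactor()
    train, valid = frame.split_frame(ratios=[0.8], seed=seed)
    predictors = [c for c in frame.columns if c != label_col]

    model = H2ODeepLearningEstimator(**DL_PARAMS)  # other settings: defaults
    model.train(x=predictors, y=label_col,
                training_frame=train, validation_frame=valid)
    return model
```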
Training the model with estimated remaining time
– users can stop the process early if they want to
Performance (logloss) on validation set
Performance (logloss) on training set
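As a reminder of what the logloss metric measures, here is a minimal plain-Python sketch of multi-class logloss: the average negative log of the probability assigned to the true class. The function name and the clipping constant `eps` are illustrative choices, not H2O's implementation.

```python
import math


def logloss(probabilities, labels, eps=1e-15):
    """Mean negative log-probability of the true class (lower is better)."""
    total = 0.0
    for probs, label in zip(probabilities, labels):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(labels)


# Synthetic two-class example for illustration only.
example_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
example_labels = [0, 1, 0]
example_loss = logloss(example_probs, example_labels)
```

A confident correct prediction contributes almost nothing to the loss, while a 50/50 guess contributes log(2).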
Confusion Matrix on Training Set (48k Records)
About 2% Error
Confusion Matrix on Validation Set (12k Records)
About 4% Error
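The error percentages above are read off the confusion matrix as the share of records away from the diagonal. For example, a 4% error on 12k records is roughly 480 misclassified digits. A minimal plain-Python sketch with hypothetical function names and a tiny synthetic example:

```python
def confusion_matrix(actual, predicted, n_classes=10):
    """Count records per (actual class, predicted class) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1  # rows: actual, columns: predicted
    return matrix


def error_rate(matrix):
    """Share of records that fall outside the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# Synthetic three-class example: one of five records is misclassified.
actual = [0, 1, 2, 2, 1]
predicted = [0, 1, 2, 1, 1]
cm = confusion_matrix(actual, predicted, n_classes=3)
overall_error = error_rate(cm)
```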
Using the model for prediction on test set
Confusion Matrix on Test Set (10k Records)
About 4% Error (similar to validation)
Full prediction outputs including individual
probabilities and predicted label
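A sketch of how the predicted label relates to the individual probabilities: the label is the class with the highest probability. The column names `predict` and `p0` to `p9` are an assumption about the output layout, and the helper names are hypothetical.

```python
def predict_label(probabilities):
    """Return the index of the most probable class (first one on ties)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def prediction_row(probabilities):
    """Build one output row: predicted label plus per-class probabilities."""
    row = {"predict": predict_label(probabilities)}
    for digit, p in enumerate(probabilities):
        row[f"p{digit}"] = p  # assumed column naming: p0 ... p9
    return row


# Synthetic probabilities for illustration: the model favours digit 3.
probs = [0.01] * 10
probs[3] = 0.91
row = prediction_row(probs)
```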