Becoming a Data-Driven Organization with Machine Learning
DESCRIPTION
Does your organization collect data? Lots of data? Does it make use of all the data it has collected? In this session you will learn what you can do with machine learning and what the building blocks are for an application that uses it. The session shows how to go from the data you have collected to predictions for customers, and how valuable insights into your data can be gleaned while building the code that makes those predictions.
TRANSCRIPT
Unless otherwise indicated, these slides are © 2013-2014 Pivotal Software, Inc. and licensed under a
Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
Becoming a Data Driven Organization with Machine Learning
By Peter Harrington
Goals for this talk
• Introduce Machine Learning (ML)
• Talk about how we can take ML outside of the
• Share some experience from the trenches
• Simplicity
Agenda
• Introduce Myself
• Define Data Driven and ML
• Common tasks in ML
• Sample of some ML algos with examples
• *Interpretable ML
• *ML & Agile Development
• *What is a “Data Scientist”?
About me
• Author of Machine Learning in Action
My employer
• Provide Customers with a list of WHO is using WHAT
product.
• Customers are willing to pay us for this data.
• Collect data from numerous document sources where
companies are talking about themselves.
• Our product is data
What we do
• Natural Language Processing
• Knowledge Graph of Business Information
• 1.5B documents
• Update results daily
• Try to keep infrastructure costs down
Spam
• We use lots and lots of Java
• We are hiring
• Santa Barbara, California
• Sunnyvale, California
• Apply on our website: www.hgdata.com
“Data Driven”
• “Data driven means that progress in an activity is
compelled by data, rather than by intuition or personal
experience.” --Wikipedia
• This talk shows how you can use some techniques from
Machine Learning to help make data-driven decisions in
your workplace, or to help your applications make
decisions.
What is Machine Learning?
• Some tools to allow a machine to learn from data.
Example
[Figure: scatter plot of Height (inches) against Weight (Pounds), with one point labeled (50, 62)]
Example (continued)
[Figure: the same scatter plot of Height (inches) against Weight (Pounds), with the point (50, 62) labeled and a fitted line y = 0.9922x + 12.472]
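A line such as the one in the figure can be obtained with ordinary least squares. The sketch below is only an illustration of that idea: the helper name `fit_line` and the sample (weight, height) pairs are made up for this example and are not the data behind the slide, so the coefficients it prints will differ from the ones shown.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (weight in pounds, height in inches) pairs, for illustration only.
weights = [20, 30, 40, 50, 60]
heights = [32, 43, 51, 62, 73]

slope, intercept = fit_line(weights, heights)
predicted = slope * 45 + intercept  # predict the height for an unseen weight

print(f"height = {slope:.3f} * weight + {intercept:.3f}")
print(f"predicted height at 45 lbs: {predicted:.1f} in")
```

The "learning" here is nothing more than estimating two numbers from the data; once they are known, a prediction is a single multiply and add.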
What is Machine Learning?
• Some tools to allow a machine to learn from data.
• Tools that can make decisions from non-deterministic
data
Common Tasks in Machine Learning
• Supervised Learning
• Predicting a numerical value; this is called regression.
• Predicting categories (spam or not spam, for example); this is
called classification.
• Unsupervised Learning
• Clustering
• Association rule mining: {men, diapers} → {beer}
• Topic modeling
• Semi-Supervised Learning
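The association rule in the list above reads as "baskets that contain men and diapers tend to also contain beer". Two standard measures for such a rule are its support (how often all the items occur together) and its confidence (how often the consequent appears when the antecedent does). The sketch below computes both; the function names and the shopping baskets are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions.

    Assumes the antecedent occurs at least once.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Made-up shopping baskets, for illustration only.
baskets = [
    {"men", "diapers", "beer"},
    {"men", "diapers", "beer", "chips"},
    {"men", "diapers", "milk"},
    {"women", "diapers", "milk"},
    {"men", "beer"},
]

rule_support = support(baskets, {"men", "diapers", "beer"})
rule_confidence = confidence(baskets, {"men", "diapers"}, {"beer"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Association rule mining algorithms search for rules whose support and confidence exceed chosen thresholds; this sketch only scores one rule that is already given.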
Where is ML being used today?
• Face detection
• Handwriting detection in mail
• Voice recognition (Siri, Sync)
• Answering questions (IBM Watson, Google)
• Forecasting weather
• Stock Trading
• Recommending things when you shop
• Spam detection online: email, forums
• Law: forecasting results, extracting info from docs
Where else?
• Spacecraft
• Self driving cars
• Identifying whales
• Predicting strokes
• Fighting financial fraud
Why is ML useful?
• You do not need to be a domain expert to make a
prediction/forecast.
• Example 1: a mathematician out-predicts a law professor on
Supreme Court rulings.
• Example 2: an economist out-predicts wine snobs at picking
the best vintages.
Both of the above examples are from the book “Super Crunchers” by
Ian Ayres.
• We are not trying to escape study of these fields, rather
we are often asked to study fields that few have studied
before.
Classification Example
Decision Tree Example
Data wrangling
• The data doesn’t always come as easily as in these toy
examples.
• 50-90% of our time is spent getting the data into the
system
• Reasons why we need to do this
• Wrong format
• Not being recorded
• Political reasons
Regression Example
Ridge Regression Example
Semi-Supervised Learning
Image taken from: http://bioinformatics.oxfordjournals.org/content/24/6/783/F1.expansion.html
How can you do this?
• Collect some data
• Put the data into some existing package in your
language of choice.
• Take the resulting model and put it into your application.
Understand the model.
[Diagram labels: Data, ML code, Model: H = 0.9*W + 12.4]
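As a rough sketch of the last step, taking the resulting model and putting it into your application, the trained model can be as small as two coefficients wrapped in a function. The code below reads the slide's model as H = 0.9*W + 12.4; the class `LinearModel` and the function `estimate_height` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    """A fitted line: prediction = slope * x + intercept."""
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept


# Coefficients taken from the slide's model, read as H = 0.9*W + 12.4.
height_model = LinearModel(slope=0.9, intercept=12.4)


def estimate_height(weight_lbs: float) -> float:
    """Hypothetical application function that uses the trained model."""
    return height_model.predict(weight_lbs)


print(f"estimated height at 50 lbs: {estimate_height(50):.1f} in")
```

Keeping the coefficients visible like this also helps with the "understand the model" point: anyone reading the application code can see what the prediction depends on.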
Better Approach
• Introduce Machine Learning (ML)
[Diagram labels: Data (Training Set, Test Set), ML code, Model: H = 0.9*W + 12.4, Test Code, Error: 1.5 lbs]
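One way to sketch the training set / test set idea from the diagram: hold back part of the data, fit the model on the rest, and measure the error only on the held-back part. Everything below is an assumption made for illustration: the data is synthetic (generated around the line H = 0.9*W + 12.4 with random noise), the 80/20 split is an arbitrary choice, and the helper names are invented.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, xs, ys):
    """Average absolute difference between predictions and true values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Synthetic (weight, height) data: a known line plus Gaussian noise.
rng = random.Random(0)  # fixed seed so the run is repeatable
data = [(w, 0.9 * w + 12.4 + rng.gauss(0, 1.5)) for w in range(10, 60)]
rng.shuffle(data)

# Hold back 20% of the data as a test set the model never sees in training.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

model = fit_line([w for w, _ in train], [h for _, h in train])
error = mean_absolute_error(model, [w for w, _ in test], [h for _, h in test])
print(f"held-out mean absolute error: {error:.2f} in")
```

The error on the test set is the number to watch, because error measured on the training data tends to be optimistic.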
Agile Development and ML
Requirements
• Traditional Development: Negotiate with customer
• Agile Development: Working prototypes to constantly refine requirements
• Agile Research: Establish target accuracy, sufficient data
Implementation
• Traditional Development: Comprehensive plan
• Agile Development: Small teams, quick implementation cycle
• Agile Research: Rapid research cycle: focus on data or algorithm improvements
Measurement
• Traditional Development: Validate that the software meets specs with tests
• Agile Development: Iterate with customer to evaluate if requirements are met
• Agile Research: Accuracy metrics dominate, headroom analysis used to guide next sprint
Interpretable ML
Interpretable ML
• “Black box” models are not easy to interpret, and may
be poorly received.
• Some models like decision trees are easy to interpret
but may not have the best performance.
• Check out: Decision Lists and Sparse Integer Lists if this
is something you are interested in.
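To make the interpretability point concrete, a small decision tree can be written out as nested conditions that a non-specialist can read and challenge. The tree below is hand-written for illustration, not learned from data; the features (`sender_known`, `contains_link`, `exclamation_count`) and the threshold are hypothetical.

```python
def classify_email(sender_known, contains_link, exclamation_count):
    """Hand-written stand-in for a small learned decision tree.

    Returns the predicted label and the list of tests that led to it.
    """
    path = []
    if sender_known:
        path.append("sender is known")
        return "not spam", path
    path.append("sender is unknown")
    if contains_link:
        path.append("message contains a link")
        return "spam", path
    path.append("message has no link")
    if exclamation_count > 3:
        path.append("more than 3 exclamation marks")
        return "spam", path
    path.append("3 or fewer exclamation marks")
    return "not spam", path


label, reasons = classify_email(
    sender_known=False, contains_link=True, exclamation_count=0
)
print(label, "because", " and ".join(reasons))
```

Every prediction comes with the exact path of tests that produced it, which is the kind of explanation a "black box" model does not offer directly.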
What is a Data Scientist?
A. A cross between a statistician and a developer?
B. A developer who knows ML?
C. A buzzword we use to attract developers?
D. All of the above?
Thanks again for coming!
• Questions?
True AI vs. Modern AI
• True AI seeks to understand how the human brain works
by creating an artificial version.
• Modern AI is a collection of hard problems.