black boxes and unicorns // jeremy achin, datarobot [firstmark's data driven]
TRANSCRIPT
Black Boxes and Unicorns
Jeremy Achin | Data Scientist & CEO | DataRobot
Jeremy Achin?
DataRobot Company History
Timeline, 2012 2H through 2015 2H:
● June ’12: Founded
● June ’13: Seed Funding, $3.3M
● July ’14: Series A, $21M
● 2015 2H: Bigger & Better Announcements Coming Soon!
DataRobot: better predictive models faster
https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
Leo Breiman (classification & regression trees, random forest, and my personal hero)
2001: Statistical Modeling: The Two Cultures
● An attack on statisticians who rely solely on regression models
● Argued we should be using the techniques that obtain the best results
● Even a carefully built regression model is just one of many possible representations of the underlying reality
“If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data [regression] models and adopt a more diverse set of tools.”
https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726
14 Years Later
Excellent progress in recent years but...
● still armies of people taking months to manually build regression models (especially in larger companies)
● non-regression methods still thought of as “black box”
Black Box (n) /blak bäks/
A phrase people use when they’re scared of technology they don’t understand and want to keep doing the same thing they’ve been doing for the last twenty years.
What do we really need to know about a predictive model?
1. Overall Performance on Out-of-Sample (Validation) Data
2. Predicted vs Actual by Variable
3. How a model’s predictions change as values of input variables change
None of these depend on the specific algorithm you are using. Even #3!
Overall Out-of-Sample Performance
Mean Absolute Error
Weighted Mean Absolute Error
Root Mean Squared Error
Root Mean Squared Logarithmic Error
Mean F Score
Mean Consequential Error
Mean Average Precision
Multi-class Log Loss
Hamming Loss
Mean Utility
Continuous Ranked Probability Score
AUC
Average Precision (column-wise)
Gini
Average Among Top P
Mean Average Precision (row-wise)
Normalized Discounted Cumulative Gain@k
Mean Average Precision@n
Levenshtein Distance
Average Precision
Absolute Error
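A minimal sketch of how a few of these metrics can be computed on held-out data, using only the model's predictions and the actual outcomes. The function names and the sample numbers are illustrative assumptions, not taken from the talk; nothing here depends on which algorithm produced the predictions.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def log_loss(actual, predicted, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)

# Hypothetical validation outcomes and predicted probabilities from two models.
validation_actual = [1, 0, 0, 1, 0]
regression_preds = [0.6, 0.4, 0.3, 0.5, 0.2]
other_model_preds = [0.8, 0.2, 0.1, 0.7, 0.1]

for name, preds in [("regression", regression_preds), ("other model", other_model_preds)]:
    print(name, "log loss:", round(log_loss(validation_actual, preds), 3))
```

Lower is better for all three metrics, so two models built with entirely different techniques can be compared on the same validation rows.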
Hospital Readmission Model Assessment and Interpretation
[Chart: Hospital Readmission Rate (y-axis) by Number of Prior Visits to Hospital (x-axis), showing Actual Hospital Readmission Rate and Predicted Hospital Readmission Rate]
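A predicted-vs-actual-by-variable view like the hospital readmission chart can be sketched as follows: group the validation rows by one input variable, then compare the average actual outcome with the average prediction in each group. The helper name `predicted_vs_actual` and the sample rows are hypothetical and do not come from the talk's data.

```python
from collections import defaultdict

def predicted_vs_actual(values, actual, predicted):
    """Group rows by one input variable.

    Returns {value: (count, mean actual, mean predicted)}, sorted by value.
    """
    totals = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum of actual, sum of predicted
    for value, y, p in zip(values, actual, predicted):
        bucket = totals[value]
        bucket[0] += 1
        bucket[1] += y
        bucket[2] += p
    return {value: (n, sum_y / n, sum_p / n)
            for value, (n, sum_y, sum_p) in sorted(totals.items())}

# Hypothetical validation rows: prior visits, readmitted (1) or not (0), model score.
prior_visits = [0, 0, 0, 0, 1, 1, 2, 2]
readmitted = [0, 0, 0, 1, 0, 1, 1, 1]
model_scores = [0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 0.8]

table = predicted_vs_actual(prior_visits, readmitted, model_scores)
for visits, (n, actual_rate, predicted_rate) in table.items():
    print(f"{visits} prior visits: n={n} actual={actual_rate:.2f} predicted={predicted_rate:.2f}")
```

The function only needs the model's scores, so it works the same way for a regression model or any other technique.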
Partial Dependence
Section 10.13.2, Partial Dependence Plots (p. 369)
https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
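A minimal sketch of the partial dependence calculation: for each candidate value of one variable, set that variable to the value in every row, score the modified rows, and average the predictions. The model is treated as an opaque function. The `black_box` function, the feature names, and the rows below are made up for illustration.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

def black_box(row):
    """Hypothetical stand-in for any fitted model's prediction function."""
    if row["age"] >= 65:
        return min(0.9, 0.10 + 0.10 * row["prior_visits"])
    return min(0.9, 0.05 + 0.05 * row["prior_visits"])

# Hypothetical rows; a real analysis would use the training or validation data.
rows = [
    {"age": 40, "prior_visits": 0},
    {"age": 70, "prior_visits": 3},
    {"age": 55, "prior_visits": 1},
    {"age": 80, "prior_visits": 2},
]

curve = partial_dependence(black_box, rows, "prior_visits", grid=range(5))
for value, average_prediction in curve:
    print(value, round(average_prediction, 3))
```

Because `partial_dependence` only calls `predict`, it never needs to look inside the model.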
Compliance (n) /kəmˈplīəns/
A word people use as a last resort to defend the status quo after they realize that their 100-variable regression model is an arbitrary representation of reality that is less accurate, robust, and interpretable than modern alternatives.
Arbitrary Representations of Reality
Three statisticians sitting at a bar...
One more round?
ftp://ftp.nhtsa.dot.gov/GES/GES12/
● 153,077 Police-reported accidents
● 58 Variables
Goal: Try to Predict Probability of a Fatality
Statistician #1
Model Performance (Log Loss): 0.469
Variable Name            Regression Coefficient
Restraint Misuse          0.509
Roll Over                 0.355
Alcohol Involved          0.089
Is Driver                -0.694
"Looks like as long as we use seat belts and don't roll over, we’ll survive. Having alcohol in the system doesn’t make much of a difference. Also, being the driver is safe, so I'm driving home."

Statistician #2
Model Performance (Log Loss): 0.467
Variable Name            Regression Coefficient
Alcohol Involved          1.866
Age                       0.008
Restraint Misuse          0.000
Hour of Accident         -0.019
"Hmmm... looks like drinking and driving leads to fatal crashes. Probably shouldn't have another round. Also, the later the better, so let's just wait here until midnight."

Statistician #3
Model Performance (Log Loss): 0.422
Variable Name            Regression Coefficient
Opening Door In Motion    0.449
Is Police Officer        -0.412
Booster Seat Used        -0.787
Lap And Shoulder Belt    -1.897
"No, no, no, we just need to wear lap and shoulder belts with our booster seats, and be police officers. Look at those coefficients! Furthermore, my model is better, so I'm right."
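A minimal sketch of why regression coefficients can tell contradictory stories: the coefficient on one variable depends on which other variables are in the model. The counts below are invented for illustration (they are not the NHTSA GES data), and `fit_logistic` is a hypothetical helper that fits an unregularized logistic regression by plain gradient descent. In this toy data, late-night accidents are no more deadly once alcohol is accounted for, but they involve alcohol more often.

```python
import math

def fit_logistic(cells, n_iter=5000, lr=1.0):
    """Fit logistic regression on grouped data by gradient descent.

    cells: list of (features, n_accidents, n_fatal).
    Returns [intercept, coefficient_1, ...].
    """
    n_features = len(cells[0][0])
    weights = [0.0] * (n_features + 1)
    total = sum(n for _, n, _ in cells)
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for features, n, fatal in cells:
            x = (1.0,) + tuple(features)          # prepend the intercept term
            z = sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))        # predicted fatality probability
            for j, xi in enumerate(x):
                grad[j] += (n * p - fatal) * xi   # gradient of the summed log loss
        weights = [w - lr * g / total for w, g in zip(weights, grad)]
    return weights

# Invented counts: ((alcohol_involved, late_night), accidents, fatal accidents).
# Fatality rate is 10% without alcohol and 40% with it, regardless of hour.
cells = [
    ((0, 0), 400, 40),
    ((0, 1), 100, 10),
    ((1, 0), 100, 40),
    ((1, 1), 400, 160),
]

late_only = fit_logistic([((late,), n, fatal) for (alcohol, late), n, fatal in cells])
both = fit_logistic(cells)

print("late-night coefficient, alcohol omitted: ", round(late_only[1], 3))
print("late-night coefficient, alcohol included:", round(both[2], 3))
print("alcohol coefficient:                     ", round(both[1], 3))
```

With alcohol left out, the late-night coefficient comes out clearly positive (roughly 1.0); with alcohol included it shrinks to essentially zero. Both models are fitted to the same accidents, so reading either coefficient as "the effect" of the hour would be arbitrary.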
The Killer Potato
Obligatory Data Scientist Definition Slide
Data Science
Maths & Stats
● Foundational Statistics
● Internals of Algorithms
● Practical Knowledge and Experience
Hacking Skills
● Programming
○ Get Data
○ Manipulate Data
○ Explore Data
○ Build Models
○ Implement Models
Domain Knowledge
● Understand the Business Problem
● Understanding of the Data
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
The current path to becoming a Data Scientist
A Better Way
AUTOMATED USING MODERN TOOLS AND COMPUTATIONAL POWER
Takeaways
● There are technique-agnostic ways to assess and interpret predictive models.
● The shortage of Data Scientists will be solved by a combination of pragmatic education and levels of automation currently not thought possible.
Three quick tips for entrepreneurs
Watch out for Lean Startup & MVP Zealots
Minimum viable product (MVP): get the smallest functional product into the market ASAP to de-risk the investment.
Minimum viable product (MVP) is the product with the highest return on investment versus risk.
Be Paranoid and Don’t Rely on Hope.
Choose the Right Investors & Advisors
Chris Lynch
Harry Weller
Jason Seats
Jit Saxena
Kevin Dick
Ray Tacoma
Brad Gillespie
© DataRobot, Inc. All rights reserved. Confidential