production machine learning_infrastructure
DESCRIPTION
Slides from Josh Wills' talk on building machine learning infrastructure at Data Day Texas 2014.
TRANSCRIPT
![Page 1: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/1.jpg)
1
From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera
![Page 2: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/2.jpg)
2
What is a Data Scientist?
![Page 3: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/3.jpg)
3
One Definition…
![Page 4: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/4.jpg)
…versus Another
4
![Page 5: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/5.jpg)
The Two Kinds of Data Scientists
• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, etc.
• The Factory
  • Software engineers who were in the wrong place at the wrong time
5
![Page 6: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/6.jpg)
6
Data Science In The Lab
![Page 7: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/7.jpg)
7
Data Science as Statistics
![Page 8: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/8.jpg)
8
Investigative Analytics
![Page 9: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/9.jpg)
9
Tools for Investigative Analytics
![Page 10: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/10.jpg)
10
Inputs and Outputs
![Page 11: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/11.jpg)
11
On Actionable Insights
![Page 12: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/12.jpg)
12
Data Science In The Factory
![Page 13: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/13.jpg)
13
Building Data Products
![Page 14: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/14.jpg)
14
A Shift In Perspective
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
![Page 15: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/15.jpg)
15
Data Science as Decision Engineering
![Page 16: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/16.jpg)
16
All* Products Become Data Products
![Page 17: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/17.jpg)
17
Sounds Great. So Who Is Doing This?
![Page 18: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/18.jpg)
18
From The Lab To The Factory
![Page 19: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/19.jpg)
19
The Art of Machine Learning
![Page 20: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/20.jpg)
20
A New Kind of Statistics
![Page 21: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/21.jpg)
21
DevOps for Data Science
![Page 22: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/22.jpg)
22
The Model: Information Retrieval
![Page 23: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/23.jpg)
23
From the Lab to the Factory: First Steps
![Page 24: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/24.jpg)
24
Step 1: Choose a Good Problem
![Page 25: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/25.jpg)
25
Step 2: DTSTCPWTM
![Page 26: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/26.jpg)
26
Step 3: Log Everything
![Page 27: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/27.jpg)
27
Step 4: Hire (More) Data Scientists
![Page 28: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/28.jpg)
28
Things We’re Working On
![Page 29: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/29.jpg)
29
The Data Science Workflow
![Page 30: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/30.jpg)
30
Identifying the Bottlenecks
![Page 31: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/31.jpg)
31
Myrrix
![Page 32: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/32.jpg)
Oryx: Simple and Scalable ML
32
![Page 33: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/33.jpg)
33
Generational Thinking
![Page 34: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/34.jpg)
34
Working on the Gaps
![Page 35: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/35.jpg)
35
Space Exploration
![Page 36: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/36.jpg)
36
The Limits of Our Models
![Page 37: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/37.jpg)
37
Gertrude: Experimenting with ML
• Multivariate Testing
  • Define and explore a space of parameters
• Overlapping Experiments
  • Tang et al. (2010)
  • Runs multiple independent experiments on every request
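The idea of running several independent experiments on every request can be sketched as follows. This is a minimal illustration, not Gertrude's actual API: the layer names, arm names, bucket count, and the `bucket` and `assign` helpers are all hypothetical. It assumes each layer hashes the request ID with its own salt, so assignment in one layer is effectively independent of assignment in another.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed bucket count, chosen for illustration

# Hypothetical layers. Each layer splits the bucket range [0, NUM_BUCKETS)
# into arms given as (arm name, first bucket, one past the last bucket).
LAYERS = {
    "ranking": [("control", 0, 500), ("new_model", 500, 1000)],
    "ui": [("control", 0, 900), ("big_button", 900, 1000)],
}


def bucket(request_id: str, layer: str) -> int:
    """Hash the request ID together with the layer name into a bucket."""
    digest = hashlib.sha256(f"{layer}:{request_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign(request_id: str) -> dict:
    """Return the arm chosen in every layer for one request."""
    assignment = {}
    for layer, arms in LAYERS.items():
        b = bucket(request_id, layer)
        for name, lo, hi in arms:
            if lo <= b < hi:
                assignment[layer] = name
                break
    return assignment
```

Because the layer name is part of the hash input, the same request lands in unrelated buckets across layers, while repeated calls with the same request ID always return the same assignment.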
![Page 38: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/38.jpg)
38
Simple Conditional Logic
• Declare experiment flags in compiled code
  • Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
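One way to picture the split between flags declared in code and rules kept in a config file is the sketch below. The flag names, the JSON rule format, and the `flag_value` helper are assumptions made for illustration; the slides do not show Gertrude's real config syntax. Each rule lists request attributes that must match and the value to use when they do; if no rule matches, the default declared in code applies.

```python
import json

# Flags declared in code, with their default values (hypothetical names).
FLAG_DEFAULTS = {"num_results": 10, "use_new_ranker": False}

# Hypothetical config file contents: per-flag rules, checked in order.
CONFIG_TEXT = """
{
  "num_results": [
    {"when": {"country": "US", "experiment": "more_results"}, "value": 20}
  ],
  "use_new_ranker": [
    {"when": {"experiment": "new_ranker"}, "value": true}
  ]
}
"""

RULES = json.loads(CONFIG_TEXT)


def flag_value(name, context, rules):
    """Return the value of a flag for one request.

    `context` holds the request's attributes, including the experiment
    it was diverted into. The first rule whose conditions all match wins.
    """
    for rule in rules.get(name, []):
        if all(context.get(key) == want for key, want in rule["when"].items()):
            return rule["value"]
    return FLAG_DEFAULTS[name]
```

For example, `flag_value("num_results", {"country": "US", "experiment": "more_results"}, RULES)` picks up the rule's value, while a request from any other country falls back to the declared default.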
![Page 39: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/39.jpg)
39
Separate Data Push from Code Push
• Validate config files and push updates to servers
  • Zookeeper via Curator
  • File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
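A rough sketch of the file-based variant of this idea: a server polls a config file, validates the new contents, and swaps them in only if validation passes. The `KNOWN_FLAGS` table, the flat JSON format, and the `ConfigWatcher` class are hypothetical, and the ZooKeeper path mentioned on the slide is not shown.

```python
import hashlib
import json

# Flags the compiled code knows about, with expected types (hypothetical).
KNOWN_FLAGS = {"num_results": int, "use_new_ranker": bool}


def validate(config):
    """Reject configs that name unknown flags or use the wrong value type."""
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    for name, value in config.items():
        if name not in KNOWN_FLAGS:
            raise ValueError(f"unknown flag: {name}")
        if type(value) is not KNOWN_FLAGS[name]:
            raise ValueError(f"wrong type for flag: {name}")
    return config


class ConfigWatcher:
    """Holds the live config and reloads it when the file contents change."""

    def __init__(self, path):
        self.path = path
        self.last_digest = None
        self.config = {}

    def poll(self):
        """Return True if a new, valid config was loaded."""
        with open(self.path, "rb") as f:
            raw = f.read()
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self.last_digest:
            return False  # nothing changed since the last poll
        self.last_digest = digest
        try:
            candidate = validate(json.loads(raw))
        except ValueError:
            return False  # bad push: keep serving the previous config
        self.config = candidate
        return True
```

The point of the design is that a bad config push never replaces a good one, and no code deployment is needed to change flag values: a server would call `poll()` on a timer and read `config` when handling requests.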
![Page 40: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/40.jpg)
40
Computational Hypothesis Testing
![Page 41: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/41.jpg)
41
The Experiments Dashboard
![Page 42: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/42.jpg)
42
A Few Links I Love
• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
![Page 43: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/43.jpg)
43
One More Thing
![Page 44: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/44.jpg)
44
A Day In The Life of a Data Scientist
![Page 45: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/45.jpg)
45
On Functional Programming
![Page 46: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/46.jpg)
46
On Lineage
![Page 47: Production machine learning_infrastructure](https://reader038.vdocument.in/reader038/viewer/2022110303/54c6321b4a7959a0338b458d/html5/thumbnails/47.jpg)
Josh Wills, Director of Data Science, Cloudera
@josh_wills
Thank you!