1
From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera
2
What is a Data Scientist?
3
One Definition…
…versus Another
4
The Two Kinds of Data Scientists
• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, etc.
• The Factory
  • Software engineers who were in the wrong place at the wrong time
5
6
Data Science In The Lab
7
Data Science as Statistics
8
Investigative Analytics
9
Tools for Investigative Analytics
10
Inputs and Outputs
11
On Actionable Insights
12
Data Science In The Factory
13
Building Data Products
14
A Shift In Perspective
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
15
Data Science as Decision Engineering
16
All* Products Become Data Products
17
Sounds Great. So Who Is Doing This?
18
From The Lab To The Factory
19
The Art of Machine Learning
20
A New Kind of Statistics
21
DevOps for Data Science
22
The Model: Information Retrieval
23
From the Lab to the Factory: First Steps
24
Step 1: Choose a Good Problem
25
Step 2: DTSTCPWTM
26
Step 3: Log Everything
27
Step 4: Hire (More) Data Scientists
28
Things We’re Working On
29
The Data Science Workflow
30
Identifying the Bottlenecks
31
Myrrix
Oryx: Simple and Scalable ML
32
33
Generational Thinking
34
Working on the Gaps
35
Space Exploration
36
The Limits of Our Models
37
Gertrude: Experimenting with ML
• Multivariate Testing
  • Define and explore a space of parameters
• Overlapping Experiments
  • Tang et al. (2010)
  • Runs multiple independent experiments on every request
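A minimal sketch of the overlapping-experiments idea, not Gertrude's actual implementation: each experiment layer hashes the request identifier with its own salt, so one request is assigned independently in every layer. The names `Layer`, `bucket`, `divert`, and the layer and arm labels are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """A hypothetical experiment layer holding one independent experiment."""
    name: str
    arms: tuple  # e.g. ("control", "treatment")


def bucket(request_id: str, salt: str, num_buckets: int) -> int:
    """Deterministically map a request to a bucket, salted per layer."""
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def divert(request_id: str, layers: list) -> dict:
    """Assign the request to exactly one arm in every layer."""
    return {
        layer.name: layer.arms[bucket(request_id, layer.name, len(layer.arms))]
        for layer in layers
    }


# Two illustrative layers; every request takes part in both experiments.
layers = [
    Layer("ranking_model", ("control", "new_model")),
    Layer("page_size", ("10", "20", "50")),
]
assignment = divert("request-123", layers)
```

Because the salt differs per layer, assignment in one layer is not tied to assignment in another, and the same request identifier always lands in the same arms.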
38
Simple Conditional Logic
• Declare experiment flags in compiled code
  • Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
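The split between flags declared in code and rules kept in config could look roughly like the following. This is an illustrative sketch only: the rule format, the `Flag` class, and the `flag_value` helper are assumptions, not Gertrude's API or config syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    """A hypothetical experiment flag with a compiled-in default."""
    name: str
    default: object


# Flags are declared in code.
NUM_RESULTS = Flag("num_results", 10)
USE_NEW_RANKER = Flag("use_new_ranker", False)

# Rules would normally be loaded from a config file; the shape is assumed.
CONFIG = {
    "num_results": [{"if_experiment": "big_page", "value": 20}],
    "use_new_ranker": [{"if_experiment": "ranker_v2", "value": True}],
}


def flag_value(flag: Flag, active_experiments: set, config: dict):
    """Return the first matching rule's value, else the compiled default."""
    for rule in config.get(flag.name, []):
        if rule["if_experiment"] in active_experiments:
            return rule["value"]
    return flag.default


# A request diverted into "big_page" sees 20 results; others see the default.
in_experiment = flag_value(NUM_RESULTS, {"big_page"}, CONFIG)
not_in_experiment = flag_value(NUM_RESULTS, set(), CONFIG)
```

Keeping the default in code means a request that matches no rule still gets a well-defined value.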
39
Separate Data Push from Code Push
• Validate config files and push updates to servers
  • Zookeeper via Curator
  • File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
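A rough sketch of the validate-then-swap step on the server side, assuming configs arrive as JSON text. The delivery mechanism (Zookeeper or a watched file) is not shown, and `ConfigHolder`, `validate`, and the rule shape are hypothetical names for illustration.

```python
import json
import threading


def validate(config):
    """Check the assumed shape: flag name -> list of {if_experiment, value}."""
    if not isinstance(config, dict):
        raise ValueError("config must be a mapping of flag names to rules")
    for flag_name, rules in config.items():
        if not isinstance(rules, list):
            raise ValueError(f"rules for {flag_name!r} must be a list")
        for rule in rules:
            if not isinstance(rule, dict) or not {"if_experiment", "value"} <= set(rule):
                raise ValueError(f"bad rule for {flag_name!r}: {rule!r}")
    return config


class ConfigHolder:
    """Holds the live config; a bad push never replaces a good one."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._config = validate(initial)

    def current(self):
        with self._lock:
            return self._config

    def update(self, raw_text: str) -> bool:
        """Parse and validate new config text, then swap it in."""
        try:
            new_config = validate(json.loads(raw_text))
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            return False
        with self._lock:
            self._config = new_config
        return True


holder = ConfigHolder({})
accepted = holder.update('{"num_results": [{"if_experiment": "big_page", "value": 20}]}')
rejected = holder.update("not json")
```

Since only config data changes, experiments can be adjusted without redeploying the compiled code that declares the flags.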
40
Computational Hypothesis Testing
41
The Experiments Dashboard
42
A Few Links I Love
• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
43
One More Thing
44
A Day In The Life of a Data Scientist
45
On Functional Programming
46
On Lineage
Josh Wills, Director of Data Science, Cloudera
@josh_wills
Thank you!