Cloudera User Group - From the Lab to the Factory
DESCRIPTION
This is the presentation that Cloudera's senior director of data science, Josh Wills, delivered at the Cloudera User Group (CUG) Chicago meeting on 12/3/13 and NYC meeting on 12/5/13.
TRANSCRIPT
1
From The Lab to the Factory
Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science
Cloudera
One Other Thing About Me
2
Data Science: Another Definition
3
Data Scientists Build Data Products.
4
A Shift In Perspective
Analytics in the Lab
• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine
Analytics in the Factory
• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions
5
All* Products Become Data Products
6
Identifying the Bottlenecks
7
Oryx: Model Building and Serving
• Algorithms
• ALS Recommenders
• K-Means Parallel
• Random Decision Forests (RDF)
• Batch model building via MapReduce*
• Server for real-time scoring and updates
• PMML 4.1 Models
8
Oryx Design
9
Generational Thinking
10
The Limits of Our Models
11
Space Exploration
12
Data Science Needs DevOps
13
Introducing Gertrude
• Multivariate Testing
• Define and explore a space of parameters
• Overlapping Experiments
• Tang et al. (2010)
• Runs multiple independent experiments on every request
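The idea of running multiple independent experiments on every request can be sketched roughly as follows. This is not Gertrude's code or API; it is a minimal Python illustration of layered diversion in the spirit of Tang et al. (2010), and the layer names, experiment names, bucket count, and hashing scheme are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed bucket count, chosen for illustration


def bucket(unit_id: str, layer: str) -> int:
    """Hash the diversion unit together with the layer name, so that
    assignments in different layers are independent of each other."""
    digest = hashlib.sha256(f"{layer}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


# Hypothetical layers. Each experiment owns a half-open bucket range
# [lo, hi) within its layer; ranges inside one layer do not overlap.
LAYERS = {
    "ranking": {"rank_control": (0, 500), "rank_new_model": (500, 1000)},
    "ui": {"ui_control": (0, 900), "ui_big_button": (900, 1000)},
}


def divert(unit_id: str) -> dict:
    """Return one experiment per layer for this request's diversion unit."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(unit_id, layer)
        for name, (lo, hi) in experiments.items():
            if lo <= b < hi:
                assignment[layer] = name
                break
    return assignment
```

Because the layer name is part of the hash input, a user who lands in the treatment group of one layer is no more or less likely to land in the treatment group of another, which is what allows the experiments to overlap on the same traffic.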
14
Simple Conditional Logic
• Declare experiment flags in compiled code
• Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
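A rough sketch of what rule-based flag calculation could look like is shown below. The JSON layout, the flag name `num_results`, and the `flag_value` helper are hypothetical and are not Gertrude's actual config format; the sketch only illustrates the pattern of a flag with a default value plus simple rules that are evaluated per request.

```python
import json

# Hypothetical config file contents: each flag has a default and an
# ordered list of rules; a rule matches when every key in "when"
# equals the corresponding request attribute.
CONFIG_TEXT = """
{
  "flags": {
    "num_results": {
      "default": 10,
      "rules": [
        {"when": {"device": "mobile"}, "value": 5},
        {"when": {"country": "US", "logged_in": true}, "value": 20}
      ]
    }
  }
}
"""


def flag_value(config: dict, flag_name: str, request: dict):
    """Return the value of a flag for one request: first matching rule
    wins, otherwise the flag's default."""
    spec = config["flags"][flag_name]
    for rule in spec["rules"]:
        if all(request.get(key) == want for key, want in rule["when"].items()):
            return rule["value"]
    return spec["default"]


CONFIG = json.loads(CONFIG_TEXT)
```

For example, `flag_value(CONFIG, "num_results", {"device": "mobile"})` takes the first rule, while a request that matches no rule falls back to the default. Keeping the rules as plain data is what lets them change without recompiling the code that declares the flag.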
15
Separate Data Push from Code Push
• Validate config files and push updates to servers
• Zookeeper via Curator
• File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
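The file-based variant of this flow might be sketched as below. This is an assumption-laden illustration, not Gertrude's implementation: the `ConfigHolder` class, the `validate` rules, and the JSON layout are invented for the example, and the Zookeeper path is not covered. The point it shows is that a server validates a new config before swapping it in and keeps serving the previous one if validation fails.

```python
import json
import threading


def validate(config) -> None:
    """Reject configs that would break flag calculation (assumed rules)."""
    if not isinstance(config, dict) or not isinstance(config.get("flags"), dict):
        raise ValueError("config must be an object with a 'flags' object")
    for name, spec in config["flags"].items():
        if not isinstance(spec, dict) or "default" not in spec:
            raise ValueError(f"flag {name!r} has no default")


class ConfigHolder:
    """Holds the last valid config; intended to be polled by one thread
    and read by many request-handling threads."""

    def __init__(self, path: str):
        self._path = path
        self._lock = threading.Lock()
        self._raw = None      # text of the last config that was accepted
        self._config = None   # parsed form of that config

    def poll(self) -> bool:
        """Reload if the file changed. Returns True only when a new,
        valid config was swapped in."""
        with open(self._path, encoding="utf-8") as f:
            raw = f.read()
        if raw == self._raw:
            return False
        try:
            config = json.loads(raw)
            validate(config)
        except ValueError:  # also covers json.JSONDecodeError
            return False    # keep serving the previous config
        with self._lock:
            self._raw, self._config = raw, config
        return True

    def current(self):
        with self._lock:
            return self._config
```

A server would call `poll()` on a timer and read `current()` on each request, so a config push never requires a code deploy or a restart.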
16
The Experiments Dashboard
17
A Few Links I Love
• http://research.google.com/pubs/pub36500.html
• The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
• Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
• Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
18
Josh Wills, Director of Data Science, Cloudera @josh_wills
Thank you!