Cloudera User Group - From the Lab to the Factory

Upload: clouderausergroups

Post on 07-Jul-2015

DESCRIPTION

This is the presentation that Cloudera's senior director of data science, Josh Wills, delivered at the Cloudera User Group (CUG) Chicago meeting on 12/3/13 and the NYC meeting on 12/5/13.

TRANSCRIPT

Page 1: Cloudera User Group - From the Lab to the Factory

From The Lab to the Factory

Building A Production Machine Learning Infrastructure

Josh Wills, Senior Director of Data Science

Cloudera

Page 2: Cloudera User Group - From the Lab to the Factory

One Other Thing About Me

Page 3: Cloudera User Group - From the Lab to the Factory

Data Science: Another Definition

Page 4: Cloudera User Group - From the Lab to the Factory

Data Scientists Build Data Products.

Page 5: Cloudera User Group - From the Lab to the Factory

A Shift In Perspective

Analytics in the Lab

• Question-driven

• Interactive

• Ad-hoc, post-hoc

• Fixed data

• Focus on speed and flexibility

• Output is embedded into a report or in-database scoring engine

Analytics in the Factory

• Metric-driven

• Automated

• Systematic

• Fluid data

• Focus on transparency and reliability

• Output is a production system that makes customer-facing decisions

Page 6: Cloudera User Group - From the Lab to the Factory

All* Products Become Data Products

Page 7: Cloudera User Group - From the Lab to the Factory

Identifying the Bottlenecks

Page 8: Cloudera User Group - From the Lab to the Factory

Oryx: Model Building and Serving

• Algorithms

  • ALS Recommenders

  • K-Means Parallel

  • RDF (random decision forests)

• Batch model building via MapReduce*

• Server for real-time scoring and updates

• PMML 4.1 Models
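The split between batch model building and a server that scores and updates in real time can be illustrated with a small clustering sketch. This is not Oryx code or its API: the names `Cluster`, `build_batch`, and `score_and_update` are hypothetical, the batch step is plain Lloyd's iteration run in-process rather than on MapReduce, and the starting centers are supplied by the caller instead of being chosen by the K-Means Parallel initialization named on the slide.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Cluster:
    center: List[float]
    count: int


def nearest(clusters: Sequence[Cluster], point: Sequence[float]) -> int:
    """Index of the cluster whose center is closest to the point."""
    def sq_dist(cluster: Cluster) -> float:
        return sum((a - b) ** 2 for a, b in zip(cluster.center, point))
    return min(range(len(clusters)), key=lambda i: sq_dist(clusters[i]))


def build_batch(points, initial_centers, iterations: int = 10) -> List[Cluster]:
    """Batch step: Lloyd's iterations over the full, fixed data set."""
    clusters = [Cluster(list(c), 0) for c in initial_centers]
    for _ in range(iterations):
        groups = [[] for _ in clusters]
        for p in points:
            groups[nearest(clusters, p)].append(p)
        for cluster, group in zip(clusters, groups):
            if group:  # keep the old center if no point was assigned
                cluster.center = [sum(col) / len(group) for col in zip(*group)]
            cluster.count = len(group)
    return clusters


def score_and_update(clusters: List[Cluster], point: Sequence[float]) -> int:
    """Serving step: score one point, then fold it into a running mean."""
    i = nearest(clusters, point)
    cluster = clusters[i]
    cluster.count += 1
    cluster.center = [m + (x - m) / cluster.count
                      for m, x in zip(cluster.center, point)]
    return i


model = build_batch([(0, 0), (0, 2), (10, 10), (10, 12)],
                    initial_centers=[(0, 0), (10, 10)])
assigned = score_and_update(model, (0, 4))
```

The batch result is a snapshot that a later batch run can replace wholesale, while the serving step only makes small incremental corrections between runs.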

Page 9: Cloudera User Group - From the Lab to the Factory

Oryx Design

Page 10: Cloudera User Group - From the Lab to the Factory

Generational Thinking

Page 11: Cloudera User Group - From the Lab to the Factory

The Limits of Our Models

Page 12: Cloudera User Group - From the Lab to the Factory

Space Exploration

Page 13: Cloudera User Group - From the Lab to the Factory

Data Science Needs DevOps

Page 14: Cloudera User Group - From the Lab to the Factory

Introducing Gertrude

• Multivariate Testing

  • Define and explore a space of parameters

• Overlapping Experiments

  • Tang et al. (2010)

  • Runs multiple independent experiments on every request
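The overlapping-experiments idea can be sketched roughly as follows: parameters are grouped into layers, and each request is hashed into a bucket separately for every layer, so one request can take part in one experiment per layer. This is an illustration of the general technique only, not Gertrude's API. The layer names, experiment names, bucket ranges, and the helpers `bucket` and `assign` are all hypothetical.

```python
import hashlib

NUM_BUCKETS = 1000

# Hypothetical layers: each layer owns a disjoint set of parameters, and
# each experiment claims a slice of that layer's buckets.
LAYERS = {
    "ranking": [("rank_v2", range(0, 100)), ("rank_v3", range(100, 200))],
    "ui": [("blue_button", range(0, 500))],
}


def bucket(unit_id: str, layer: str) -> int:
    """Hash the unit id with a per-layer salt so layers divert independently."""
    digest = hashlib.sha256(f"{layer}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def assign(unit_id: str) -> dict:
    """Return the experiment (or None) this unit falls into for every layer."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(unit_id, layer)
        assignment[layer] = next(
            (name for name, buckets in experiments if b in buckets), None
        )
    return assignment


first = assign("user-42")
second = assign("user-42")
```

Because the hash depends only on the layer and the unit id, the same unit always lands in the same experiments, and membership in one layer says nothing about membership in another.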

Page 15: Cloudera User Group - From the Lab to the Factory

Simple Conditional Logic

• Declare experiment flags in compiled code

  • Settings that can vary per request

• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
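A minimal sketch of this pattern, assuming a flag is declared in code with a default and that the config lists ordered rules where the first match wins. The `Flag` class, the `flag_value` helper, the rule layout, and the request fields are hypothetical and do not reflect Gertrude's actual config format.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Flag:
    """A setting declared in code; its value may vary per request."""
    name: str
    default: Any


MAX_RESULTS = Flag("max_results", 10)

# Hypothetical config contents: rules are checked in order, first match wins.
CONFIG: Dict[str, List[dict]] = {
    "max_results": [
        {"when": {"experiment": "more_results"}, "value": 20},
        {"when": {"country": "US"}, "value": 15},
    ]
}


def flag_value(flag: Flag, request: dict, config: Dict[str, List[dict]]) -> Any:
    """Return the first matching rule's value, else the compiled-in default."""
    for rule in config.get(flag.name, []):
        if all(request.get(key) == want for key, want in rule["when"].items()):
            return rule["value"]
    return flag.default


in_experiment = flag_value(
    MAX_RESULTS, {"experiment": "more_results", "country": "US"}, CONFIG)
us_only = flag_value(MAX_RESULTS, {"country": "US"}, CONFIG)
fallback = flag_value(MAX_RESULTS, {}, CONFIG)
```

The code only ever asks for the value of a flag, so changing behavior means editing rules in the config rather than recompiling.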

Page 16: Cloudera User Group - From the Lab to the Factory

Separate Data Push from Code Push

• Validate config files and push updates to servers

  • ZooKeeper via Curator

  • File-based

• Servers pick up new configs, load them, and update experiment space and flag value calculations
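The file-based variant of this flow can be sketched as below: the pusher validates a config and swaps it into place atomically, and the server polls the file and reloads it when it changes. The JSON layout, the `validate` and `push` helpers, and the `ConfigWatcher` class are assumptions made for illustration. The ZooKeeper route is not shown.

```python
import json
import os
import tempfile


def validate(config: dict) -> None:
    """Reject configs that servers could not load safely."""
    flags = config.get("flags")
    if not isinstance(flags, dict):
        raise ValueError("config must contain a 'flags' mapping")
    for name, value in flags.items():
        if not isinstance(value, (bool, int, float, str)):
            raise ValueError(f"flag {name!r} has an unsupported value type")


def push(config: dict, path: str) -> None:
    """Validate, write to a temp file, then rename over the live file."""
    validate(config)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(config, handle)
    os.replace(tmp_path, path)  # readers never observe a half-written file


class ConfigWatcher:
    """Server side: reload the flags whenever the file on disk changes."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.flags: dict = {}
        self._stamp = None

    def poll(self) -> bool:
        stat = os.stat(self.path)
        stamp = (stat.st_mtime_ns, stat.st_size, stat.st_ino)
        if stamp == self._stamp:
            return False
        with open(self.path) as handle:
            config = json.load(handle)
        validate(config)  # keep the old flags if the new file is invalid
        self.flags = config["flags"]
        self._stamp = stamp
        return True


workdir = tempfile.mkdtemp()
config_path = os.path.join(workdir, "flags.json")
push({"flags": {"max_results": 10}}, config_path)
watcher = ConfigWatcher(config_path)
loaded_first = watcher.poll()
loaded_again = watcher.poll()
push({"flags": {"max_results": 200, "layout": "grid"}}, config_path)
loaded_update = watcher.poll()
```

A server would call `poll()` on a timer. Since only data changes, a bad config can be rolled back by pushing the previous file, with no code deployment involved.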

Page 17: Cloudera User Group - From the Lab to the Factory

The Experiments Dashboard

Page 18: Cloudera User Group - From the Lab to the Factory

A Few Links I Love

• http://research.google.com/pubs/pub36500.html

  • The original paper on the overlapping experiments infrastructure at Google

• http://www.exp-platform.com/

  • Collection of all of Microsoft’s papers and presentations on their experimentation platform

• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/

  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies

Page 19: Cloudera User Group - From the Lab to the Factory

Josh Wills, Director of Data Science, Cloudera @josh_wills

Thank you!