Production Machine Learning Infrastructure

From The Lab to the Factory: Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science, Cloudera

Upload: joshwills

Posted on 26-Jan-2015

DESCRIPTION

Slides from Josh Wills' talk on building machine learning infrastructure at Data Day Texas 2014.

TRANSCRIPT

Page 1: Production machine learning_infrastructure

From The Lab to the Factory: Building A Production Machine Learning Infrastructure
Josh Wills, Senior Director of Data Science, Cloudera

Page 2: Production machine learning_infrastructure

What is a Data Scientist?

Page 3: Production machine learning_infrastructure

One Definition…

Page 4: Production machine learning_infrastructure

…versus Another

Page 5: Production machine learning_infrastructure

The Two Kinds of Data Scientists

• The Lab
  • Statisticians who got really good at programming
  • Neuroscientists, geneticists, etc.
• The Factory
  • Software engineers who were in the wrong place at the wrong time

Page 6: Production machine learning_infrastructure

Data Science In The Lab

Page 7: Production machine learning_infrastructure

Data Science as Statistics

Page 8: Production machine learning_infrastructure

Investigative Analytics

Page 9: Production machine learning_infrastructure

Tools for Investigative Analytics

Page 10: Production machine learning_infrastructure

Inputs and Outputs

Page 11: Production machine learning_infrastructure

On Actionable Insights

Page 12: Production machine learning_infrastructure

Data Science In The Factory

Page 13: Production machine learning_infrastructure

Building Data Products

Page 14: Production machine learning_infrastructure

A Shift In Perspective

Analytics in the Lab

• Question-driven
• Interactive
• Ad-hoc, post-hoc
• Fixed data
• Focus on speed and flexibility
• Output is embedded into a report or in-database scoring engine

Analytics in the Factory

• Metric-driven
• Automated
• Systematic
• Fluid data
• Focus on transparency and reliability
• Output is a production system that makes customer-facing decisions

Page 15: Production machine learning_infrastructure

Data Science as Decision Engineering

Page 16: Production machine learning_infrastructure

All* Products Become Data Products

Page 17: Production machine learning_infrastructure

Sounds Great. So Who Is Doing This?

Page 18: Production machine learning_infrastructure

From The Lab To The Factory

Page 19: Production machine learning_infrastructure

The Art of Machine Learning

Page 20: Production machine learning_infrastructure

A New Kind of Statistics

Page 21: Production machine learning_infrastructure

DevOps for Data Science

Page 22: Production machine learning_infrastructure

The Model: Information Retrieval

Page 23: Production machine learning_infrastructure

From the Lab to the Factory: First Steps

Page 24: Production machine learning_infrastructure

Step 1: Choose a Good Problem

Page 25: Production machine learning_infrastructure

Step 2: DTSTCPWTM

Page 26: Production machine learning_infrastructure

Step 3: Log Everything

Page 27: Production machine learning_infrastructure

Step 4: Hire (More) Data Scientists

Page 28: Production machine learning_infrastructure

Things We’re Working On

Page 29: Production machine learning_infrastructure

The Data Science Workflow

Page 30: Production machine learning_infrastructure

Identifying the Bottlenecks

Page 31: Production machine learning_infrastructure

Myrrix

Page 32: Production machine learning_infrastructure

Oryx: Simple and Scalable ML

Page 33: Production machine learning_infrastructure

Generational Thinking

Page 34: Production machine learning_infrastructure

Working on the Gaps

Page 35: Production machine learning_infrastructure

Space Exploration

Page 36: Production machine learning_infrastructure

The Limits of Our Models

Page 37: Production machine learning_infrastructure

Gertrude: Experimenting with ML

• Multivariate Testing
  • Define and explore a space of parameters
• Overlapping Experiments
  • Tang et al. (2010)
  • Runs multiple independent experiments on every request
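
One way to picture "multiple independent experiments on every request" is layered diversion: each layer hashes the request with its own salt, so bucket assignments in one layer are effectively independent of those in another. The sketch below is only an illustration under that assumption and is not Gertrude's API; the layer names, experiment names, bucket count, and the helpers `bucket` and `assign` are all hypothetical.

```python
import hashlib

NUM_BUCKETS = 1000  # assumed bucket count, chosen only for illustration

# Hypothetical layers: each maps an experiment name to a half-open
# bucket range [lo, hi). Ranges inside one layer must not overlap.
LAYERS = {
    "ranking": {"new_model": (0, 100), "control": (100, 200)},
    "ui": {"big_button": (0, 500)},
}


def bucket(layer: str, request_id: str) -> int:
    """Hash the request id, salted with the layer name, into a bucket."""
    digest = hashlib.sha256(f"{layer}:{request_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def assign(request_id: str) -> dict:
    """Return at most one experiment per layer for this request."""
    assignment = {}
    for layer, experiments in LAYERS.items():
        b = bucket(layer, request_id)
        for name, (lo, hi) in experiments.items():
            if lo <= b < hi:
                assignment[layer] = name
                break
    return assignment
```

Calling `assign("some-request-id")` returns a dictionary such as one keyed by `"ranking"` and `"ui"`, or an empty dictionary when the request falls outside every experiment's range. Because the hash is deterministic, the same request id always lands in the same experiments.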

Page 38: Production machine learning_infrastructure

Simple Conditional Logic

• Declare experiment flags in compiled code
  • Settings that can vary per request
• Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
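
The split between flags declared in code and rules kept in a config file might look roughly like the following. This is a minimal sketch, not Gertrude's actual format: the `Flag` class, the flag names, the `when_experiment` rule key, and `flag_value` are hypothetical, and the config is shown as an in-memory dictionary rather than a parsed file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    """A per-request setting declared in code, with a safe default."""
    name: str
    default: object


# Flags live in code (hypothetical names).
RANKER_WEIGHT = Flag("ranker_weight", 0.5)
SHOW_BANNER = Flag("show_banner", False)

# Rules live in config: for each flag, an ordered list of simple rules.
# The first rule whose experiment is active for the request wins.
CONFIG = {
    "ranker_weight": [{"when_experiment": "new_model", "value": 0.8}],
    "show_banner": [{"when_experiment": "big_button", "value": True}],
}


def flag_value(flag: Flag, active_experiments: set, config: dict = CONFIG):
    """Compute a flag's value for one request from the config rules."""
    for rule in config.get(flag.name, []):
        if rule["when_experiment"] in active_experiments:
            return rule["value"]
    return flag.default
```

With these assumed rules, `flag_value(RANKER_WEIGHT, {"new_model"})` yields the overridden weight, while a request in no experiment falls back to the default declared in code. Keeping defaults in code means a missing or empty config still produces valid behavior.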

Page 39: Production machine learning_infrastructure

Separate Data Push from Code Push

• Validate config files and push updates to servers
  • Zookeeper via Curator
  • File-based
• Servers pick up new configs, load them, and update experiment space and flag value calculations
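
For the file-based option, a server could poll the config file, validate any new version, and swap it in only if it passes. The sketch below assumes a JSON file mapping flag names to lists of rules; the `validate` function, the `ConfigStore` class, and the file layout are hypothetical, and the ZooKeeper/Curator path is not shown.

```python
import json
import os


def validate(config) -> None:
    """Raise ValueError unless config maps flag names to lists of rules."""
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    for flag, rules in config.items():
        if not isinstance(rules, list):
            raise ValueError(f"rules for {flag!r} must be a list")
        for rule in rules:
            if not isinstance(rule, dict) or "value" not in rule:
                raise ValueError(f"bad rule for {flag!r}: {rule!r}")


class ConfigStore:
    """Holds the last known-good config and reloads it when the file changes."""

    def __init__(self, path: str):
        self.path = path
        self.config = {}
        self._mtime_ns = None

    def maybe_reload(self) -> bool:
        """Return True if a new, valid config was loaded and swapped in."""
        try:
            mtime_ns = os.stat(self.path).st_mtime_ns
            if mtime_ns == self._mtime_ns:
                return False  # file unchanged since the last check
            self._mtime_ns = mtime_ns
            with open(self.path, encoding="utf-8") as f:
                candidate = json.load(f)
            validate(candidate)
        except (OSError, ValueError):
            return False  # keep serving the previous config
        self.config = candidate
        return True
```

A server would call `maybe_reload()` on a timer or before handling a batch of requests. Because a bad file never replaces the current config, a data push can be rolled out and rolled back without redeploying code.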

Page 40: Production machine learning_infrastructure

Computational Hypothesis Testing

Page 41: Production machine learning_infrastructure

The Experiments Dashboard

Page 42: Production machine learning_infrastructure

A Few Links I Love

• http://research.google.com/pubs/pub36500.html
  • The original paper on the overlapping experiments infrastructure at Google
• http://www.exp-platform.com/
  • Collection of all of Microsoft’s papers and presentations on their experimentation platform
• http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/
  • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies

Page 43: Production machine learning_infrastructure

One More Thing

Page 44: Production machine learning_infrastructure

A Day In The Life of a Data Scientist

Page 45: Production machine learning_infrastructure

On Functional Programming

Page 46: Production machine learning_infrastructure

On Lineage

Page 47: Production machine learning_infrastructure

Josh Wills, Director of Data Science, Cloudera
@josh_wills

Thank you!