Automating and Productionizing Machine Learning Pipelines for Real-Time Scoring with Apache Spark
David Crespi, Data Scientist
Jared Piedt, Software Engineer

Post on 22-May-2020

TRANSCRIPT

Page 1

Automating and Productionizing Machine Learning Pipelines for Real-Time Scoring with Apache Spark

David Crespi, Data Scientist
Jared Piedt, Software Engineer

Page 2

Overview of Red Ventures

History
Founded as Red F in 2000
Red Ventures launched in 2004
General Atlantic & Silver Lake minority strategic investors

By the Numbers
3,500+ Employees
Locations
• USA - 13 Locations
• Brazil - Sao Paulo
• United Kingdom - London
1 Culture

Page 3

Our Use Case – Real-Time Predictions

Requirements
1. Speed
2. Consistency

Page 4

Data Science Process
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

Page 5

MLlib

Spark SQL & DataFrames

Page 6

Data Science Process
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

Page 7

Old Data Architecture

Page 8

Old Data Architecture
Diagram labels: DW?

Page 9

Old Data Architecture
Diagram labels: DW, Complex ETL

Page 10

Old Data Architecture
Diagram labels: DW, Complex ETL, Training Data

Page 11

Old Data Architecture
Diagram labels: DW, App, Complex ETL, Training Data, Scoring Data

Page 12

Pain Points
• Duplication of business logic
• Data drift

Page 13

Goals
1. Immutable data
2. Write business logic once
3. Make data available in real time

Page 14

Event-Driven Architecture
Diagram labels: Data Pipeline, Web, Chat, Server, IVR

Page 15

New Data Architecture
Diagram labels: Data Pipeline, Amazon S3, Key-Value Store, Business Logic, Training Data, Scoring Data

Page 16

Projections
Diagram labels: { id: 4 } { id: 3 } { id: 2 } { id: 1 } and { id: }

Page 17

Credit Card Recommendation

User Id | Keyword           | Page View Count | Card Shown | Clicked
a       | best travel cards | 2               | Travel     | 1
b       | credit cards      | 3               | Cash Back  | 0
c       | top credit cards  | 1               | Cash Back  | 1
d       | credit cards      | 1               | Travel     | 0

Page 18

Diagram labels: events E1, E2, E3 on a time axis; reduce; r

Page 19

Diagram labels: events E1, E2, E3 on a time axis; reduce; r

Page 20

Diagram labels: events E1, E2, E3 on a time axis; reduce; r

Page 21

Diagram labels: events E1, E2, E3 on a time axis; reduce; r

Page 22

Credit Card Recommendation

User Id | Keyword            | Page View Count | Card Shown | Clicked
a       | best travel cards  | 2               | Travel     | 1
b       | credit cards       | 3               | Cash Back  | 0
c       | top credit cards   | 1               | Cash Back  | 1
d       | credit cards       | 1               | Travel     | 0
z       | airline miles card | 1               | Travel     | 1
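
The reduce slides above show events E1, E2, E3 arriving over time and being folded into a single value r. A minimal sketch of that idea in plain Python follows, assuming r corresponds to one row of the Credit Card Recommendation table. The event kinds, the field names, and the `reducer` and `project` helpers are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Optional


@dataclass(frozen=True)
class Event:
    """An immutable event; the kinds used below are assumed for illustration."""
    user_id: str
    kind: str          # assumed kinds: "search", "page_view", "card_shown", "click"
    value: str = ""


@dataclass(frozen=True)
class Projection:
    """Current state for one user, shaped like a row of the table above."""
    user_id: str
    keyword: Optional[str] = None
    page_view_count: int = 0
    card_shown: Optional[str] = None
    clicked: int = 0


def reducer(state: Projection, event: Event) -> Projection:
    """Fold one event into the state and return a new state (no mutation)."""
    if event.kind == "search":
        return replace(state, keyword=event.value)
    if event.kind == "page_view":
        return replace(state, page_view_count=state.page_view_count + 1)
    if event.kind == "card_shown":
        return replace(state, card_shown=event.value)
    if event.kind == "click":
        return replace(state, clicked=1)
    return state  # unknown event kinds leave the state unchanged


def project(user_id, events):
    """Replay events in time order to rebuild the projection."""
    return reduce(reducer, events, Projection(user_id=user_id))


events_z = [
    Event("z", "search", "airline miles card"),
    Event("z", "page_view"),
    Event("z", "card_shown", "Travel"),
    Event("z", "click"),
]
row_z = project("z", events_z)
```

Because the same `reducer` can replay stored history or consume new events as they arrive, one definition of the logic can serve both training data and scoring data, which is in line with the "write business logic once" goal stated earlier.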

Page 23

Data Science Process
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

Page 24

ML Pipeline
Diagram labels: Transformer, Estimator

Page 25

Spark: Estimators and Transformers
Diagram labels: Transformer; Estimator, Transformer

Page 26

Spark: Estimators and Transformers
Diagram labels: PipelineStage; Transformer; Estimator, Transformer

Page 27

Transformers

Comment                   | Resolved
I need help with internet | 1
Setting up my TV          | 0
My internet won’t work    | 1
Internet is slow          | 0
Can’t connect             | 1
Netflix not working       | 0

Page 28

Transformers

Input:

Comment                   | Resolved
I need help with internet | 1
Setting up my TV          | 0
My internet won’t work    | 1
Internet is slow          | 0
Can’t connect             | 1
Netflix not working       | 0

Output:

Comment                   | Resolved | Internet Comment
I need help with internet | 1        | 1
Setting up my TV          | 0        | 0
My internet won’t work    | 1        | 1
Internet is slow          | 0        | 1
Can’t connect             | 1        | 0
Netflix not working       | 0        | 0
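
The Transformers slide above shows a stateless step: each row gains a new 0/1 column and nothing is learned from the data. A minimal sketch in plain Python (not the Spark API), assuming the new column flags comments that mention the word "internet", which matches the values on the slide. The class name `ContainsWordTransformer` and its parameters are hypothetical.

```python
class ContainsWordTransformer:
    """Stateless stage: adds a 0/1 column flagging rows whose text contains a word."""

    def __init__(self, input_col, output_col, word):
        self.input_col = input_col
        self.output_col = output_col
        self.word = word.lower()

    def transform(self, rows):
        # Return new rows instead of mutating the input rows.
        return [
            {**row, self.output_col: int(self.word in row[self.input_col].lower())}
            for row in rows
        ]


comments = [
    {"Comment": "I need help with internet", "Resolved": 1},
    {"Comment": "Setting up my TV", "Resolved": 0},
    {"Comment": "My internet won’t work", "Resolved": 1},
    {"Comment": "Internet is slow", "Resolved": 0},
    {"Comment": "Can’t connect", "Resolved": 1},
    {"Comment": "Netflix not working", "Resolved": 0},
]

flagger = ContainsWordTransformer("Comment", "Internet Comment", "internet")
flagged = flagger.transform(comments)
```

Since `transform` depends only on its constructor arguments, the same object can be applied to training rows and to a single row at scoring time.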

Page 29

Estimators

Home Square Footage | Sold
1,200               | 1
2,100               | 0
3,000               | 1
NULL                | 0
1,350               | 1
1,725               | 0

Page 30

Estimators

Home Square Footage | Sold
1,200               | 1
2,100               | 0
3,000               | 1
NULL                | 0
1,350               | 1
1,725               | 0

Page 31

Estimators

Home Square Footage | Sold
1,200               | 1
2,100               | 0
3,000               | 1
NULL                | 0
1,350               | 1
1,725               | 0

Imputer
Fill value = 1,875

Page 32

Estimators

Input:

Home Square Footage | Sold
1,200               | 1
2,100               | 0
3,000               | 1
NULL                | 0
1,350               | 1
1,725               | 0

Output:

Home Square Footage | Sold | Home Square Footage Imputed
1,200               | 1    | 1,200
2,100               | 0    | 2,100
3,000               | 1    | 3,000
NULL                | 0    | 1,875
1,350               | 1    | 1,350
1,725               | 0    | 1,725
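
The fill value of 1,875 on the Estimators slides equals the mean of the five non-null square footages: (1,200 + 2,100 + 3,000 + 1,350 + 1,725) / 5 = 1,875. The sketch below illustrates the Estimator contract in plain Python (not the Spark API), assuming mean imputation: `fit` learns the fill value from training data and returns a model, and that model is a Transformer. The class names `MeanImputer` and `MeanImputerModel` are hypothetical.

```python
from statistics import mean


class MeanImputerModel:
    """Fitted result: a Transformer that carries the learned fill value."""

    def __init__(self, input_col, output_col, fill_value):
        self.input_col = input_col
        self.output_col = output_col
        self.fill_value = fill_value

    def transform(self, rows):
        # Copy the input column, replacing missing values with the fill value.
        return [
            {
                **row,
                self.output_col: self.fill_value
                if row[self.input_col] is None
                else row[self.input_col],
            }
            for row in rows
        ]


class MeanImputer:
    """Estimator: has fit(), which must see data before anything can be transformed."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        observed = [r[self.input_col] for r in rows if r[self.input_col] is not None]
        return MeanImputerModel(self.input_col, self.output_col, mean(observed))


homes = [
    {"Home Square Footage": 1200, "Sold": 1},
    {"Home Square Footage": 2100, "Sold": 0},
    {"Home Square Footage": 3000, "Sold": 1},
    {"Home Square Footage": None, "Sold": 0},
    {"Home Square Footage": 1350, "Sold": 1},
    {"Home Square Footage": 1725, "Sold": 0},
]

imputer = MeanImputer("Home Square Footage", "Home Square Footage Imputed")
imputer_model = imputer.fit(homes)
imputed = imputer_model.transform(homes)
```

Keeping the learned value inside the model means the fill value computed at training time is reused unchanged at scoring time, instead of being recomputed from whatever data happens to arrive.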

Page 33

Spark: Estimators and Transformers
Diagram labels:
Pipeline: Transformer, Transformer, Estimator
PipelineModel: Transformer, Transformer, Transformer

Page 34

How do ML algorithms fit in?

Page 35

Spark: Estimators and Transformers
Diagram labels:
Pipeline: Transformer, Transformer, Estimator
PipelineModel: Transformer, Transformer, Transformer
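
One way to read the diagram above: fitting a Pipeline replaces each Estimator with the Transformer it produces, so the resulting PipelineModel contains only Transformers, and an ML algorithm is simply an Estimator placed last. The sketch below mimics that contract in plain Python rather than calling Spark. `InternetFlag`, `GroupRateEstimator`, and the toy "mean label per group" model are hypothetical stand-ins chosen to keep the example small.

```python
from collections import defaultdict


class InternetFlag:
    """Transformer: stateless, exposes only transform()."""

    def transform(self, rows):
        return [{**r, "internet": int("internet" in r["comment"].lower())} for r in rows]


class GroupRateModel:
    """Fitted model: a Transformer that appends a prediction column."""

    def __init__(self, rates, default):
        self.rates = rates
        self.default = default

    def transform(self, rows):
        return [{**r, "prediction": self.rates.get(r["internet"], self.default)} for r in rows]


class GroupRateEstimator:
    """Toy stand-in for an ML algorithm: learns the mean label per flag value."""

    def fit(self, rows):
        totals = defaultdict(lambda: [0, 0])  # flag value -> [label sum, row count]
        for r in rows:
            totals[r["internet"]][0] += r["resolved"]
            totals[r["internet"]][1] += 1
        rates = {flag: s / n for flag, (s, n) in totals.items()}
        default = sum(r["resolved"] for r in rows) / len(rows)
        return GroupRateModel(rates, default)


class PipelineModel:
    """Result of fitting: every stage is a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Ordered stages; fit() turns each Estimator into its fitted Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            fitted.append(model)
            rows = model.transform(rows)  # later stages see earlier stages' output
        return PipelineModel(fitted)


training = [
    {"comment": "I need help with internet", "resolved": 1},
    {"comment": "Setting up my TV", "resolved": 0},
    {"comment": "My internet won’t work", "resolved": 1},
    {"comment": "Internet is slow", "resolved": 0},
    {"comment": "Can’t connect", "resolved": 1},
    {"comment": "Netflix not working", "resolved": 0},
]

pipeline_model = Pipeline([InternetFlag(), GroupRateEstimator()]).fit(training)
scored = pipeline_model.transform([{"comment": "Internet keeps dropping"}])
```

The scoring row carries no label: once fitted, the whole chain needs only `transform`, which is what makes it usable for prediction on new records.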

Page 36

Generalizing Data Science
Diagram labels:
Training Data: Response, All Features
Training Data: Response, Raw Text Features, Categorical Features, Numeric Features

Page 37

We fit our pipeline… now what?

Page 38

Data Science Process
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

Page 39

Real-time scoring paradigm
Diagram labels: ?, Prediction API, Production Applications
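
A rough sketch of the request path this slide implies: a production application sends one record to a prediction API, which applies an already-fitted pipeline and returns a score. It assumes JSON request and response bodies and a model object exposing `transform`. `StubPipelineModel`, `make_handler`, and the field names are hypothetical, and no Spark or MLeap API is used.

```python
import json


class StubPipelineModel:
    """Hypothetical stand-in for a fitted pipeline; scores 1.0 for travel keywords."""

    def transform(self, rows):
        return [
            {**r, "prediction": 1.0 if "travel" in r["keyword"].lower() else 0.0}
            for r in rows
        ]


def make_handler(model):
    """Bind a model that was loaded once at startup to a per-request handler."""

    def handle(body: str) -> str:
        record = json.loads(body)              # one record per request
        scored = model.transform([record])[0]  # score it in-process
        return json.dumps(
            {"user_id": record.get("user_id"), "prediction": scored["prediction"]}
        )

    return handle


handle = make_handler(StubPipelineModel())
response = handle('{"user_id": "a", "keyword": "best travel cards"}')
```

Loading the model once and reusing it for every request keeps per-request work to a single `transform` call, which matters for the speed requirement stated at the start of the talk.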

Page 40

Model evaluation in real-time – with Spark

Page 41

Model evaluation in real-time – with MLeap

Page 42

Diagram labels: Data collection, ML pipeline training, Model deployment

Page 43

Recap
1. Data Collection
2. Machine Learning Pipelines
3. Model Deployment

Page 44

Questions