BUILDING BETTER CREDIT MODELS THROUGH DEPLOYABLE ANALYTICS IN R Robert Krzyzanowski Director of Data Engineering at Avant


Post on 22-Dec-2015

Our Initial Problem

• Gap between complicated, non-deployable models and the need for production-ready solutions.

Netflix never implemented the algorithm that won the Netflix $1 million challenge.

• Frustration in developing models in one language (R, Python, etc.) and productionizing them in another (C++, Java, etc.).

• Result: advanced and complicated models are rarely used in production.

Development = Deployment

Industry Standard: separate processes in different groups, leading to translation errors.

Software Engineer
• Get real-time data
• Validate deployed models
• Implement sources

Statistician
• Refit models
• Write new analytical methods
• Model card creation

Data Scientist: Development = Production

• Performs both Software Engineer and Statistician functions
• No separate process to re-code model for production
• Days vs. months
• Agile development

The Problem of Data Preparation

Problem:

"Data scientists spend 50 to 80 percent of their time collecting and preparing digital data."
- New York Times, 08/18/2014

"Good feature engineering is often more important for classifier performance than model selection."
- Google Research Paper (2015)

The Problem of Data Preparation

Solution:

Re-define machine learning.

Why Data Preparation?

Definition: A machine learning model is a trained statistical predictor applied to a cleaned-up data set.

Wrong.

Why Data Preparation?

Definition: A machine learning model is

1. A trained data preparation applied to raw production data.

2. A trained statistical predictor applied to the results of the trained data preparation.
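The two-part definition can be illustrated with a small base-R sketch. Everything here is an assumption made for illustration: `train_model()` is a hypothetical helper, not the framework's API, the data are made up, and mean imputation plus `stats::glm` stand in for whatever preparation and predictor a real model would use.

```r
# Hypothetical sketch: a "model" bundles a trained data preparation
# with a trained predictor. Names and data are illustrative assumptions.
train_model <- function(raw, response, feature) {
  # 1. Train the data preparation: learn the imputation value once.
  fill <- mean(raw[[feature]], na.rm = TRUE)
  prepare <- function(df) {
    df[[feature]][is.na(df[[feature]])] <- fill
    df
  }
  # 2. Train the predictor on the output of the trained preparation.
  fit <- stats::glm(stats::reformulate(feature, response),
                    data = prepare(raw), family = stats::binomial())
  # Prediction starts from raw data and replays the trained preparation.
  list(
    fill = fill,
    predict = function(new_raw) {
      unname(stats::predict(fit, newdata = prepare(new_raw), type = "response"))
    }
  )
}

raw <- data.frame(
  default   = c(0, 1, 1, 0, 1, 0, 0, 1),
  inquiries = c(0, 1, 3, NA, 4, 2, 1, NA)
)
model <- train_model(raw, "default", "inquiries")

# A single raw production row with a missing value can be scored directly.
p <- model$predict(data.frame(inquiries = NA_real_))
```

The point of the design is that `predict` never expects cleaned input: the imputation value learned in training travels with the fitted predictor.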

Proof of Definition

Two Models:

Direct Mail Response in Eastern States
Variable state levels: New York, Massachusetts, New Jersey

Direct Mail Response in Western States
Variable state levels: California, Oregon, Washington

Need two data pipelines to restore categorical levels.
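Why the levels have to be restored per model can be seen in a few lines of base R. This is a sketch under assumptions: `make_restore_levels()` is a hypothetical helper, and the level vectors simply copy the states listed above.

```r
# Levels observed when each (hypothetical) model was trained.
east_levels <- c("Massachusetts", "New Jersey", "New York")
west_levels <- c("California", "Oregon", "Washington")

# Build a predict-time step that re-creates the factor with the trained levels.
make_restore_levels <- function(trained_levels) {
  function(x) factor(as.character(x), levels = trained_levels)
}

restore_east <- make_restore_levels(east_levels)
restore_west <- make_restore_levels(west_levels)

new_customer <- "New York"           # a single raw production value

levels(factor(new_customer))         # only the one level that happens to be present
levels(restore_east(new_customer))   # all three levels the Eastern model was trained on
restore_west(new_customer)           # NA: a level the Western model never saw
```

A factor built from one production row knows only that row's value, so each model has to carry its own trained level set.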

Proof of Definition

Two Models:

Model A – Seed 100
Impute variable "inquiries" with mean of 0.2

Model A – Seed 101
Impute variable "inquiries" with mean of 0.3

Need two data pipelines to replace NA with mean in production.
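The imputation case can be sketched the same way. The helper `train_imputer()` and both training vectors are made up for illustration; they are chosen only so that the learned means match the 0.2 and 0.3 on the slide.

```r
# Train an imputer: learn the mean once, then reuse it at predict time.
train_imputer <- function(x) {
  trained_mean <- mean(x, na.rm = TRUE)
  list(
    mean = trained_mean,
    apply = function(new_x) {
      new_x[is.na(new_x)] <- trained_mean
      new_x
    }
  )
}

# Two made-up training samples, standing in for two seeds of the same model.
train_seed_100 <- c(0, 0, 0, 0, 1, NA)
train_seed_101 <- c(0, 1, 0, 0, 1, 0, 0, 1, 0, 0, NA)

imputer_100 <- train_imputer(train_seed_100)
imputer_101 <- train_imputer(train_seed_101)

# A single production row has no column mean to compute,
# so each model must replay its own stored value.
imputer_100$apply(NA_real_)
imputer_101$apply(c(2, NA))
```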

How to Apply the Definition

1. Define a framework and grammar so data scientists can clean data without ever having to repeat the process when "productionizing".

2. Define a mapping from the development framework to a production system: then production-ready machine learning is free and identical to the experimental environment.

3. Data scientists now both write the data preparation for raw production data and apply the statistical classifier.

How to Apply the Definition

To master data science you must master data preparation.

- Confucius

A Harder Example

Goal: Impute some variables and discretize some variables, so the resulting data preparation is production-ready.

    list(
      "Impute variables"     = list(impute, c("var1", "var2")),
      "Discretize variables" = list(discretize ~ restore_levels, is.numeric)
    )

New real-time customers must be scored in < 1 second on EC2.
Cannot discretize a 1-row data.frame (new customer).

Must be careful to use a different function to achieve the same behavior.

A Harder Example

    # Train-time function: bin the variable and remember the cutoffs.
    discretize <- function(variable) {
      variable <- arules::discretize(variable)
      # If levels are [0, 3), [3, 6), [6, 10),
      # cutoffs will be 3, 6, Inf
      cutoffs <- c(as.numeric(
        gsub("^[^0-9]*([0-9]+).*$", "\\1", levels(variable)[-1])), Inf)
      # "[0, 3)" = 3, "[3, 6)" = 6, "[6, 10)" = Inf
      input$cutoffs <- setNames(cutoffs, levels(variable))
      variable
    }

    # Predict-time function: assign each value to the first stored
    # cutoff it falls below, using the levels learned in training.
    restore_levels <- function(variable) {
      factor(vapply(variable, function(val) {
        names(input$cutoffs)[which.max(val < input$cutoffs)]
      }, character(1)), levels = names(input$cutoffs))
    }

(can replace with Rcpp version to make it faster)
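For readers without the `arules` package, the same train/predict split can be sketched in base R. This is not the slide's implementation: `train_discretizer()` is a hypothetical helper that swaps in `stats::quantile` and `cut`, and it keeps the learned breaks in a closure instead of the shared `input` environment.

```r
# Train step: choose breaks from the training data and remember them.
train_discretizer <- function(x, n_bins = 3) {
  inner <- unique(stats::quantile(x, probs = seq_len(n_bins - 1) / n_bins,
                                  names = FALSE))
  breaks <- c(-Inf, inner, Inf)
  # Predict step: reuse the stored breaks; works for any number of rows.
  function(new_x) cut(new_x, breaks = breaks, right = FALSE)
}

discretize_var <- train_discretizer(1:10)

discretize_var(1:10)        # the training data, binned
discretize_var(5)           # a single new value lands in the middle bin
discretize_var(c(-2, 11))   # values outside the training range still get a bin
```

Using open-ended outer breaks is a design choice of this sketch so that production values outside the training range do not become NA.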

A Harder Example

    input <- new.env()

    var <- discretize(1:10)
    # input$cutoffs is:
    # "[0, 3)" = 3, "[3, 6)" = 6, "[6, 10)" = Inf
     [1] [0, 3)  [0, 3)  [3, 6)  [3, 6)  [3, 6)
     [6] [6, 10) [6, 10) [6, 10) [6, 10) [6, 10)
    Levels: [0, 3) [3, 6) [6, 10)

    discretize(5)
    Error: 'breaks' are not unique

    restore_levels(5)
    [1] [3, 6)
    Levels: [0, 3) [3, 6) [6, 10)

    restore_levels(0:11)
     [1] [0, 3)  [0, 3)  [0, 3)  [3, 6)  [3, 6)  [3, 6)
     [7] [6, 10) [6, 10) [6, 10) [6, 10) [6, 10) [6, 10)
    Levels: [0, 3) [3, 6) [6, 10)

A Harder Example

Key Point:

Identical mathematical operations require different logic in train versus predict.

You must train your data preparation just like you train your model.
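One way to picture this train versus predict duality is a step object that runs one function the first time and another afterwards. The sketch below is an assumption about how such a step could be built; `make_step()` and the shared `input` environment passed as an argument are illustrative, not the framework's actual interface.

```r
# Hypothetical helper: pair a train-time function with a predict-time function.
# Both receive an environment `input` holding state learned during training.
make_step <- function(train_fn, predict_fn = train_fn) {
  input <- new.env()
  trained <- FALSE
  function(x) {
    if (!trained) {
      out <- train_fn(x, input)
      trained <<- TRUE
      out
    } else {
      predict_fn(x, input)
    }
  }
}

# Same mathematical operation (centering), different logic in each mode.
center <- make_step(
  train_fn   = function(x, input) { input$mu <- mean(x); x - input$mu },
  predict_fn = function(x, input) x - input$mu
)

first  <- center(c(1, 2, 3))  # trains: stores the mean 2, returns -1, 0, 1
second <- center(10)          # predicts: returns 10 - 2 = 8, not 10 - 10 = 0
```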

A Harder Example

    list(
      import = indicator ~ data_source1 + data_source2,
      data = list(
        "Impute variables"     = list(impute, c("var1", "var2")),
        "Discretize variables" = list(discretize ~ restore_levels, is.numeric)
      ),
      model = list("glmnet", link = "binomial", alpha = 0.5),
      export = list(s3 = "location/of/model")
    )

Final toy model

The Result

• Our most complex models went from 1,000s of lines of code to 100.

• Zero-code deployment: development = production.

• Modularity: can re-use credit model data preparation and methods for the lead conversion model, or direct mail for the collections model.

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity

• delinquency(...): 0/1 or continuous-valued dependent variable
• transunion: 612 variables
• tu313: 312 variables
• tradeline: 217 variables

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity

R formula object

4 Avant R packages using database connections + caching layers instantly translate this R formula to a live data source yielding a data.frame with 100,000s – 1,000,000s of rows and 1,000s of columns.

Adding new data sources = Easy
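How a formula could be turned into a data.frame is not shown in the slides, so the following is only a guess at the general idea. All names are hypothetical: `sources`, `import_data()`, the toy `delinquency()` indicator, and the `ids` argument are stand-ins, and in-memory data.frames replace the database connections and caching layers.

```r
# Hypothetical stand-ins for live data sources; each returns columns keyed by id.
sources <- list(
  transunion = function(ids) data.frame(id = ids, tu_score = 600 + ids),
  bank       = function(ids) data.frame(id = ids, balance = 100 * ids)
)

# Hypothetical stand-in for an indicator query producing the dependent variable.
delinquency <- function(ids, days_delinquent = 7) {
  data.frame(id = ids, dep_var = as.integer(ids %% 2 == 0))
}

import_data <- function(formula, ids) {
  indicator_call <- formula[[2]]            # left-hand side: an unevaluated call
  source_names <- all.vars(formula[[3]])    # right-hand side: data source names
  indicator_call$ids <- ids                 # inject the ids argument into the call
  pieces <- c(
    list(eval(indicator_call, envir = environment(formula))),
    lapply(source_names, function(s) sources[[s]](ids))
  )
  # Join the dependent variable and every source on the shared key.
  Reduce(function(a, b) merge(a, b, by = "id"), pieces)
}

df <- import_data(delinquency(days_delinquent = 7) ~ transunion + bank, ids = 1:4)
```

Under this reading, adding a data source means registering one more function in `sources` and adding its name to the right-hand side of the formula.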

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
        "Make validation set" = list(
          make_validation_set, seed = seed, trainpct = trainpct
        )

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
        "Make validation set" = list(make_validation_set, seed = …

1 line of code per data preparation step encourages DRY (Don't Repeat Yourself) code and easier testing.

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
          "Make validation set"         = list(make_validation_set, seed = …
        , "Set numeric cols"            = list(coltrans(as.numeric), numer…
        , "Remove rows w > 90% missing" = list(select_rows ~ NULL,…

Parsing engine translates this grammar to mean "do this in training but not in live production".

Data scientist forced to ensure data preparation works in production while developing model.

Avoid later headaches and angry girlfriends when model breaks at midnight.
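The `train ~ predict` shorthand, including the `select_rows ~ NULL` case above, could be parsed roughly as follows. This is a sketch of the idea only: `parse_step()`, `run_step()`, and `drop_sparse_rows()` are hypothetical names, and the real parsing engine is not shown in the slides.

```r
# Hypothetical parser for the `train ~ predict` shorthand.
parse_step <- function(spec) {
  if (inherits(spec, "formula")) {
    env <- environment(spec)
    list(train = eval(spec[[2]], env), predict = eval(spec[[3]], env))
  } else {
    list(train = spec, predict = spec)    # same function in both modes
  }
}

run_step <- function(step, data, mode = c("train", "predict")) {
  fn <- step[[match.arg(mode)]]
  if (is.null(fn)) data else fn(data)     # NULL means "skip in this mode"
}

# Illustrative train-only step: drop rows where more than 90% of values are missing.
drop_sparse_rows <- function(df) df[rowMeans(is.na(df)) <= 0.9, , drop = FALSE]

step <- parse_step(drop_sparse_rows ~ NULL)

df <- data.frame(a = c(1, NA, 3), b = c(2, NA, NA))
trained_rows   <- run_step(step, df, "train")     # the all-NA row is removed
predicted_rows <- run_step(step, df, "predict")   # every row is kept for scoring
```

Skipping row removal in production matters because a new customer must always receive a score, however sparse the record.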

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
          "Make validation set"         = list(make_validation_set, seed = …
        , "Set numeric cols"            = list(coltrans(as.numeric), numer…
        , "Remove rows w > 90% missing" = list(select_rows ~ NULL, …
        , "Convert dates to num"        = list(parse_datetime, date_vars)
        , "Handle mixed factor-numeric" = list(parse_mixed_vars, mix…
        , "Handle TU NA codes"          = list(value_replace, lapply(c('T',…

Train versus predict duality is present on final model object.

Model will predict on raw (not clean) production data, either 1-row or 1,000,000-row data.frames, in interactive R mode or on a deployed Amazon EC2 instance, without extra code.

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
          "Make validation set"         = list(make_validation_set, seed = …
        , "Set numeric cols"            = list(coltrans(as.numeric), numer…
        , "Remove rows w > 90% missing" = list(select_rows ~ NULL, …
        , "Convert dates to num"        = list(parse_datetime, date_vars)
        , "Handle mixed factor-numeric" = list(parse_mixed_vars, mix…
        , "Handle TU NA codes"          = list(value_replace, lapply(c('T',…
        , "Change NAs with new level"   = list(value_replace, is.factor…
        , "Sweep up remaining levels"   = list(group_minor, min = 0.05…

Data scientist can re-run data steps and examine or visualize data to interactively build and debug data preparation. No need to start from scratch or keep duplicate copies of data. R memory problem = Solved

    > run("data/Sweep")

Special notation defined by framework

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
          "Make validation set"              = list(make_validation_set, seed = …
        , "Set numeric cols"                 = list(coltrans(as.numeric), numer…
        , "Remove rows w > 90% missing"      = list(select_rows ~ NULL, …
        , "Convert dates to num"             = list(parse_datetime, date_vars)
        , "Handle mixed factor-numeric"      = list(parse_mixed_vars, mix…
        , "Handle TU NA codes"               = list(value_replace, lapply(c('T',…
        , "Change NAs with new level"        = list(value_replace, is.factor…
        , "Sweep up remaining levels"        = list(group_minor, min = 0.05…
        , "Restore categorical variables"    = list(restore_factors, is…
        , "Remove 0-variance columns"        = list(drop_variables, funct…
        , "Remove highly correlated columns" = list(drop_highly_corr…
        , "Sure independence screening"      = list(SIS, exclude = "dep_…
        , . . .

Rapid Experimentation

    list(
      import = fraud(num_installments = 2, collection_window = 30) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
          "Make validation set"              = list(make_validation_set, seed = …
        , "Set numeric cols"                 = list(coltrans(as.numeric), numer…
        , "Remove rows w > 90% missing"      = list(select_rows ~ NULL, …
        , "Convert dates to num"             = list(parse_datetime, date_vars)
        , "Handle mixed factor-numeric"      = list(parse_mixed_vars, mix…
        , "Handle TU NA codes"               = list(value_replace, lapply(c('T',…
        , "Change NAs with new level"        = list(value_replace, is.factor…
        , "Sweep up remaining levels"        = list(group_minor, min = 0.05…
        , "Restore categorical variables"    = list(restore_factors, is…
        , "Remove 0-variance columns"        = list(drop_variables, funct…
        , "Remove highly correlated columns" = list(drop_highly_corr…
        , "Sure independence screening"      = list(SIS, exclude = "dep_…
        , . . .

Answering new business questions is easy: change indicator query.

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,

      data = list(
          "Make validation set"              = list(make_validation_set, seed = …
        , "Set numeric cols"                 = list(coltrans(as.numeric), numer…
        , "Remove rows w > 90% missing"      = list(select_rows ~ NULL, …
        , "Convert dates to num"             = list(parse_datetime, date_vars)
       #, "Handle mixed factor-numeric"      = list(parse_mixed_vars, mix…
        , "Handle TU NA codes"               = list(value_replace, lapply(c('T',…
        , "Change NAs with new level"        = list(value_replace, is.factor…
        , "Sweep up remaining levels"        = list(group_minor, min = 0.05…
        , "Restore categorical variables"    = list(restore_factors, is…
        , "Remove 0-variance columns"        = list(drop_variables, funct…
        , "Remove highly correlated columns" = list(drop_highly_corr…
        , "Sure independence screening"      = list(SIS, exclude = "dep_…
        , . . .

Comment out or hyper-parameterize data preparation and re-run the full model to test the effect of feature engineering on final model performance.
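The effect of dropping a step can be measured with a very small runner. This sketch uses made-up data and hypothetical names (`run_steps()`, `rss()`), with `stats::lm` and residual sum of squares standing in for the real model and its performance metric.

```r
# Hypothetical runner: apply named preparation steps in order.
run_steps <- function(steps, df) Reduce(function(d, step) step(d), steps, df)

steps <- list(
  "Fill NAs with 0"  = function(df) { df$x[is.na(df$x)] <- 0; df },
  "Add squared term" = function(df) { df$x2 <- df$x^2; df }
)

raw <- data.frame(x = c(1, NA, 3), y = c(2, 1, 10))

# Refit after preparing the data with a given set of steps and
# report the residual sum of squares as a stand-in for model quality.
rss <- function(steps) {
  fit <- stats::lm(y ~ ., data = run_steps(steps, raw))
  sum(stats::residuals(fit)^2)
}

rss_all     <- rss(steps)       # every preparation step applied
rss_without <- rss(steps[-2])   # same pipeline with one step "commented out"
```

Because the steps are just list entries, removing or parameterizing one is a one-line change followed by a full re-run.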

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("glmnet", link = "bernoulli", alpha = 0.5)
    )

Any of 6,500+ CRAN, BioConductor, or in-house stats packages can be incorporated with a light-weight wrapper.
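What a light-weight wrapper might look like is not spelled out in the slides, so the following is an assumption: each wrapper supplies a `train` and a `predict` function, and a spec such as `list("lm")` selects one by name. The names `wrappers` and `fit_model()` are hypothetical, base `stats::glm` and `stats::lm` stand in for packages like glmnet, and the built-in `mtcars` data is used only so the example runs.

```r
# Hypothetical registry of model wrappers: each supplies train and predict.
wrappers <- list(
  glm_logistic = list(
    train = function(df, ...) {
      stats::glm(dep_var ~ ., data = df, family = stats::binomial(), ...)
    },
    predict = function(fit, df) {
      unname(stats::predict(fit, newdata = df, type = "response"))
    }
  ),
  lm = list(
    train   = function(df, ...) stats::lm(dep_var ~ ., data = df, ...),
    predict = function(fit, df) unname(stats::predict(fit, newdata = df))
  )
)

# Resolve a spec such as list("lm") into a function with a uniform interface.
fit_model <- function(spec, df) {
  wrapper <- wrappers[[spec[[1]]]]
  fit <- do.call(wrapper$train, c(list(df), spec[-1]))
  function(new_df) wrapper$predict(fit, new_df)
}

df <- data.frame(dep_var = mtcars$am, wt = mtcars$wt, hp = mtcars$hp)

predict_logit <- fit_model(list("glm_logistic"), df)
predict_lm    <- fit_model(list("lm"), df)

p_logit <- predict_logit(df)        # probabilities for every training row
p_lm    <- predict_lm(df[1, ])      # a single-row prediction works the same way
```

Swapping the statistical method then only changes the first element of the spec, which mirrors how the `model = list(...)` entries vary across these slides.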

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("mars", nk = knots, degree = deg)
    )

Any of 6,500+ CRAN, BioConductor, or in-house stats packages can be incorporated with a light-weight wrapper.

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("gbm", trees = 20000, shrinkage = 0.001)
    )

Can use parallel, snow, etc. to parallelize models that require many cores/nodes.

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("glmnet", link = "bernoulli", alpha = 0.5)
    )

Parametrization of full modeling process is configurable.

Different grammar exists for ensemble models, sequential models, etc.

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("glmnet", link = "bernoulli", alpha = 0.5)
    )

    list(
      import = fraud(num_installments = 3, collection_window = 30) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("gbm", trees = 20000, shrinkage = 0.001)
    )

Train models in parallel to solve different business questions.

Rapid Experimentation

    list(
      import = delinquency(num_installments = 3, days_delinquent = 7) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("glmnet", link = "bernoulli", alpha = 0.5)
    )

    list(
      import = fraud(num_installments = 3, collection_window = 30) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("gbm", trees = 20000, shrinkage = 0.001)
    )

    list(
      import = collection(delinquency_window = 60) ~
        transunion + tu313 + tradeline + frontend +
        bank + application + iovation + clarity,
      data = list(…),
      model = list("mars", nk = knots, degree = deg)
    )

    list(
      import = conversion(response_window = 3) ~ transunion,
      data = list(…),
      model = list("ensemble", lapply(seq(0, 1, by = 0.1), function(alpha) {
        list("glmnet", link = "bernoulli", alpha = alpha)
      }))
    )

Framework of 25 R packages.

Deploy models same-day on raw production data with no extra code.

Provides smooth transition from R to "Big Data".

Contact: [email protected]

peterhurford/batchman
kirillseva/ruigi
michaelochurch/fixedwidth-hs
robertzk/cachemeifyoucan