BUILDING BETTER CREDIT MODELS THROUGH DEPLOYABLE ANALYTICS IN R
Robert Krzyzanowski, Director of Data Engineering at Avant
Our Initial Problem
• Gap between complicated, non-deployable models and the need for production-ready solutions. Netflix never implemented the algorithm that won the Netflix $1 million challenge.
• Frustration in developing models in one language (R, Python, etc.) and productionizing them in another (C++, Java, etc.).
• Result: advanced and complicated models are rarely used in production.
Development = Deployment

Industry Standard
Software Engineer: get real-time data, validate deployed models, implement sources
Statistician: refit models, write new analytical methods, model card creation
Separate processes in different groups, leading to translation errors

Data Scientist
Performs both Software Engineer and Statistician functions
No separate process to re-code the model for production
Agile development
Development = Production

Days vs. Months
The Problem of Data Preparation
Problem:
"Data scientists spend 50 to 80 percent of their time collecting and preparing digital data."
- New York Times, 08/18/2014
"Good feature engineering is often more important for classifier performance than model selection."
- Google Research Paper (2015)
Solution:
Re-define machine learning.
Why Data Preparation?
Definition: A machine learning model is a trained statistical predictor applied to a cleaned-up data set.
Wrong.
Definition: A machine learning model is
1. A trained data preparation applied to raw production data.
2. A trained statistical predictor applied to the results of the trained data preparation.
Proof of Definition
Two Models:
Direct Mail Response in Eastern States
Variable state levels: New York, Massachusetts, New Jersey
Direct Mail Response in Western States
Variable state levels: California, Oregon, Washington
Need two data pipelines to restore categorical levels
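A minimal sketch in base R of why the categorical levels have to be remembered from training, assuming a hypothetical `state` column. The names `train_levels` and `restore_state_levels` are illustrative only and are not part of the framework described in these slides.

```r
# Training data for the (hypothetical) Eastern-states model.
train_state <- factor(c("New York", "Massachusetts", "New Jersey", "New York"))

# "Train" the data preparation: remember the levels seen in training.
train_levels <- levels(train_state)

# "Predict" side: rebuild the factor with the stored levels.
restore_state_levels <- function(state, known_levels) {
  # Values not seen in training become NA instead of creating new levels.
  factor(as.character(state), levels = known_levels)
}

# A new customer arrives as a raw string in a 1-row data.frame.
new_customer <- data.frame(state = "New Jersey", stringsAsFactors = FALSE)
scored_state <- restore_state_levels(new_customer$state, train_levels)

levels(scored_state)  # the same three levels as in training
```

A Western-states model would store a different `train_levels` vector, which is why each model carries its own data pipeline.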
Two Models:
Model A – Seed 100
Impute variable “inquiries” with mean of 0.2
Model A – Seed 101
Impute variable “inquiries” with mean of 0.3
Need two data pipelines to replace NA with mean in production
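The imputation case can be sketched the same way. This is a base R illustration under stated assumptions, not the framework's own code: `train_impute`, `predict_impute`, and the sample values are made up, with the values chosen so the trained mean comes out at 0.2.

```r
# Train step: compute the mean on the training data and keep it.
train_impute <- function(x) {
  list(mean = mean(x, na.rm = TRUE))
}

# Predict step: reuse the stored mean; never recompute it from new data.
predict_impute <- function(x, state) {
  x[is.na(x)] <- state$mean
  x
}

inquiries_train <- c(0, 1, NA, 0, 0, NA, 0)  # illustrative values only
state <- train_impute(inquiries_train)       # mean of 0, 1, 0, 0, 0

# A single new customer with a missing value: there is nothing to average
# in production, so the trained mean is the only available replacement.
imputed <- predict_impute(NA_real_, state)
```

A model trained on a different sample (for example with another seed) stores a different mean, so its production pipeline differs even though the code is the same.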
How to Apply the Definition
1. Define a framework and grammar so data scientists can clean data without ever having to repeat the process when “productionizing”.
2. Define a mapping from the development framework to a production system: then production-ready machine learning is free and identical to the experimental environment.
3. Data scientists now both write the data preparation for raw production data and apply the statistical classifier.
To master data science you must master data preparation.
- Confucius
A Harder Example
Goal: Impute some variables and discretize some variables, so the resulting data preparation is production-ready.
list(
  "Impute variables"     = list(impute, c("var1", "var2")),
  "Discretize variables" = list(discretize ~ restore_levels, is.numeric)
)
New real-time customers must be scored in < 1 second on EC2.
Cannot discretize a 1-row data.frame (new customer).
Must be careful to use a different function to achieve the same behavior.
discretize <- function(variable) {
  variable <- arules::discretize(variable)
  # If levels are [0, 3), [3, 6), [6, 10),
  # cutoffs will be 3, 6, Inf
  cutoffs <- c(as.numeric(
    gsub("^[^0-9]*([0-9]+).*$", "\\1", levels(variable)[-1])), Inf)
  # Named vector: c("[0, 3)" = 3, "[3, 6)" = 6, "[6, 10)" = Inf)
  input$cutoffs <- setNames(cutoffs, levels(variable))
  variable
}

restore_levels <- function(variable) {
  # For each value, pick the first level whose upper cutoff exceeds it.
  factor(vapply(variable, function(val) {
    names(input$cutoffs)[which.max(val < input$cutoffs)]
  }, character(1)), levels = names(input$cutoffs))
}
(can replace with Rcpp version to make it faster)
input <- new.env()
var <- discretize(1:10)
# input$cutoffs is:
# c("[0, 3)" = 3, "[3, 6)" = 6, "[6, 10)" = Inf)
 [1] [0, 3)  [0, 3)  [3, 6)  [3, 6)  [3, 6)
 [6] [6, 10) [6, 10) [6, 10) [6, 10) [6, 10)
Levels: [0, 3) [3, 6) [6, 10)

discretize(5)
Error: 'breaks' are not unique

restore_levels(5)
[1] [3, 6)
Levels: [0, 3) [3, 6) [6, 10)

restore_levels(0:11)
 [1] [0, 3)  [0, 3)  [0, 3)  [3, 6)  [3, 6)  [3, 6)
 [7] [6, 10) [6, 10) [6, 10) [6, 10) [6, 10) [6, 10)
Levels: [0, 3) [3, 6) [6, 10)
Key Point:
Identical mathematical operations require different logic in train versus predict.
You must train your data preparation just like you train your model.
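One way to picture the `train ~ predict` grammar is a small wrapper that runs the left-hand function the first time and the right-hand function afterwards. The sketch below is an assumption about how such a step could be built in base R: `make_step`, `bin_train`, and `bin_predict` are hypothetical names, and `cut` with equal-width bins stands in for `arules::discretize`.

```r
# Build a step from a formula of the form `train_fn ~ predict_fn`.
# Hypothetical helper; not the framework's actual API.
make_step <- function(spec, env = environment(spec)) {
  train_fn   <- eval(spec[[2]], env)  # left-hand side of the formula
  predict_fn <- eval(spec[[3]], env)  # right-hand side (may be NULL)
  state   <- new.env()                # holds whatever training learns
  trained <- FALSE
  function(x) {
    if (!trained) {
      trained <<- TRUE
      train_fn(x, state)
    } else if (is.null(predict_fn)) {
      x                               # `fn ~ NULL`: skip the step after training
    } else {
      predict_fn(x, state)
    }
  }
}

bin_train <- function(x, state) {
  # Learn three equal-width bins; open the ends so unseen extremes still fit.
  breaks <- seq(min(x), max(x), length.out = 4)
  breaks[c(1, 4)] <- c(-Inf, Inf)
  state$breaks <- breaks
  cut(x, breaks, right = FALSE)
}

bin_predict <- function(x, state) {
  cut(x, state$breaks, right = FALSE)  # works on a single value
}

step <- make_step(bin_train ~ bin_predict)
train_bins <- step(1:10)  # first call trains and stores the breaks
new_bin    <- step(5)     # later calls reuse them, even for one row
```

The single new value lands in the same set of levels as the training data, which is the behavior `restore_levels` provides in the slides.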
Final toy model:
list(
  import = indicator ~ data_source1 + data_source2,
  data = list(
    "Impute variables"     = list(impute, c("var1", "var2")),
    "Discretize variables" = list(discretize ~ restore_levels, is.numeric)
  ),
  model  = list("glmnet", link = "binomial", alpha = 0.5),
  export = list(s3 = "location/of/model")
)
The Result
Our most complex models went from 1,000s of lines of code to 100.
Zero-code deployment: development = production
Modularity: can re-use credit model data preparation and methods for a lead conversion model, or direct mail for a collections model
Rapid Experimentation
list(
  import = delinquency(num_installments = 3, days_delinquent = 7) ~
    transunion + tu313 + tradeline + frontend + bank + application + iovation + clarity
delinquency(...): 0/1 or continuous-valued dependent variable
transunion: 612 variables
tu313: 312 variables
tradeline: 217 variables
The import entry is an R formula object.
4 Avant R packages using database connections + caching layers instantly translate this R formula to a live data source yielding a data.frame with 100,000s – 1,000,000s of rows and 1,000s of columns.
Adding new data sources = easy
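The slides do not show how the Avant packages translate the formula, so the following is only a sketch of the first step such a translation might take: splitting the formula into the indicator call and the data source names with base R's `terms`. `parse_import` is a hypothetical helper.

```r
# Split an import formula into its indicator call and data source names.
# Hypothetical helper; the real packages are not shown in the slides.
parse_import <- function(f) {
  stopifnot(inherits(f, "formula"), length(f) == 3)
  list(
    indicator = f[[2]],                        # unevaluated call on the left
    sources   = attr(terms(f), "term.labels")  # one entry per `+` term
  )
}

spec <- parse_import(
  delinquency(num_installments = 3, days_delinquent = 7) ~
    transunion + tu313 + tradeline
)

spec$sources                               # names of the data sources
indicator_args <- as.list(spec$indicator)[-1]  # the indicator's named arguments
```

Because the formula is never evaluated, adding a data source is just another `+ name` term for whatever loader the framework maps that name to.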
list(
  import = delinquency(num_installments = 3, days_delinquent = 7) ~
    transunion + tu313 + tradeline + frontend + bank + application + iovation + clarity,
  data = list(
      "Make validation set"              = list(make_validation_set, seed = seed, trainpct = trainpct)
    , "Set numeric cols"                 = list(coltrans(as.numeric), numer…
    , "Remove rows w > 90% missing"      = list(select_rows ~ NULL, …
    , "Convert dates to num"             = list(parse_datetime, date_vars)
    , "Handle mixed factor-numeric"      = list(parse_mixed_vars, mix…
    , "Handle TU NA codes"               = list(value_replace, lapply(c('T',…
    , "Change NAs with new level"        = list(value_replace, is.factor…
    , "Sweep up remaining levels"        = list(group_minor, min = 0.05…
    , "Restore categorical variables"    = list(restore_factors, is…
    , "Remove 0-variance columns"        = list(drop_variables, funct…
    , "Remove highly correlated columns" = list(drop_highly_corr…
    , "Sure independence screening"      = list(SIS, exclude = "dep_…
    , . . .
1 line of code per data preparation step encourages DRY (Don’t Repeat Yourself) code and easier testing.
select_rows ~ NULL: the parsing engine translates this grammar to mean “do this in training but not in live production”.
The data scientist is forced to ensure data preparation works in production while developing the model.
Avoid later headaches and angry girlfriends when the model breaks at midnight.
Train versus predict duality is present on the final model object.
The model will predict on raw (not clean) production data, either 1-row or 1,000,000-row data.frames, in interactive R mode or on a deployed Amazon EC2 instance, without extra code.
The data scientist can re-run data steps and examine or visualize data to interactively build and debug data preparation. No need to start from scratch or keep duplicate copies of data. R memory problem = solved.
> run("data/Sweep")    (special notation defined by the framework)
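The internals of the `run("data/...")` notation are not shown, so the sketch below only illustrates the idea of re-running named steps up to a chosen point and inspecting the intermediate data. `run_steps`, the three toy steps, and the sample data are all hypothetical.

```r
# Run named preparation steps in order, stopping after the first step whose
# name starts with `upto`. Hypothetical stand-in for the framework's notation.
run_steps <- function(data, steps, upto = NULL) {
  for (name in names(steps)) {
    data <- steps[[name]](data)
    if (!is.null(upto) && startsWith(name, upto)) break
  }
  data
}

# Keep only columns that take more than one distinct value.
drop_constant <- function(df) {
  keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
  df[, keep, drop = FALSE]
}

steps <- list(
  "Set numeric cols"   = function(df) { df$x <- as.numeric(df$x); df },
  "Impute with zero"   = function(df) { df$x[is.na(df$x)] <- 0; df },
  "Drop constant cols" = drop_constant
)

raw <- data.frame(x = c("1", NA, "3"), k = 1, stringsAsFactors = FALSE)
partial <- run_steps(raw, steps, upto = "Impute")  # stop and inspect
full    <- run_steps(raw, steps)                   # run every step
```

Keeping each step as a named one-argument function is what makes the partial run possible: the name is the only handle the caller needs.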
list(
  import = fraud(num_installments = 2, collection_window = 30) ~
    transunion + tu313 + tradeline + frontend + bank + application + iovation + clarity,
  data = list(…)    (same data preparation steps as for the delinquency model)
  , . . .
Answering new business questions is easy: change the indicator query.
list(
  import = delinquency(num_installments = 3, days_delinquent = 7) ~
    transunion + tu313 + tradeline + frontend + bank + application + iovation + clarity,
  data = list(
    …
    , "Convert dates to num"        = list(parse_datetime, date_vars)
    #, "Handle mixed factor-numeric" = list(parse_mixed_vars, mix…
    , "Handle TU NA codes"          = list(value_replace, lapply(c('T',…
    , . . .
Comment out or hyper-parameterize data preparation and re-run the full model to test the effect of feature engineering on final model performance.
list(
  import = delinquency(num_installments = 3, days_delinquent = 7) ~
    transunion + tu313 + tradeline + frontend + bank + application + iovation + clarity,
  data = list(…),
  model = list("glmnet", link = "bernoulli", alpha = 0.5)
)
Any of 6,500+ CRAN, Bioconductor, or in-house stats packages can be incorporated with a lightweight wrapper. Only the model entry changes:
  model = list("mars", nk = knots, degree = deg)
  model = list("gbm", trees = 20000, shrinkage = 0.001)
Can use parallel, snow, etc. to parallelize models that require many cores/nodes.
Parametrization of the full modeling process is configurable.
Different grammar exists for ensemble models, sequential models, etc.
Train models in parallel to solve different business questions:
list(
  import = delinquency(num_installments = 3, days_delinquent = 7) ~
    transunion + tu313 + tradeline + frontend + bank + application + iovation + clarity,
  data = list(…),
  model = list("glmnet", link = "bernoulli", alpha = 0.5)
)
list(
  import = fraud(num_installments = 3, collection_window = 30) ~
    transunion + tu313 + tradeline + frontend + bank + application + iovation + clarity,
  data = list(…),
  model = list("gbm", trees = 20000, shrinkage = 0.001)
)
list(
  import = collection(delinquency_window = 60) ~
    transunion + tu313 + tradeline + frontend + bank + application + iovation + clarity,
  data = list(…),
  model = list("mars", nk = knots, degree = deg)
)
list(
  import = conversion(response_window = 3) ~ transunion,
  data = list(…),
  model = list("ensemble", lapply(seq(0, 1, by = 0.1), function(alpha) {
    list("glmnet", link = "bernoulli", alpha = alpha)
  }))
)
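To make the ensemble entry concrete, this short sketch shows what the `lapply` call in that specification expands to in plain R. It only builds the list of sub-model specifications; how the framework trains and combines them is not shown in the slides and is not assumed here.

```r
# The ensemble's lapply() yields one glmnet specification per alpha value.
alphas <- seq(0, 1, by = 0.1)
submodels <- lapply(alphas, function(alpha) {
  list("glmnet", link = "bernoulli", alpha = alpha)
})

n_submodels  <- length(submodels)  # one specification per alpha
alpha_values <- vapply(submodels, function(m) m$alpha, numeric(1))
```

If the wrapper passes `alpha` through to glmnet unchanged (an assumption), the grid runs from ridge (alpha = 0) to lasso (alpha = 1) in steps of 0.1.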
Framework of 25 R packages.
Deploy models same-day on raw production data with no extra code.
Provides smooth transition from R to “Big Data”.
Contact: [email protected]
peterhurford/batchman kirillseva/ruigi michaelochurch/fixedwidth-hs robertzk/cachemeifyoucan