TRANSCRIPT
Allstate Research and Planning Center. Proprietary and confidential information. Do not reproduce. Prepared for the purpose of Allstate management discussion only.
Start Small Before Going Big
STRATA NYC October 23, 2012
Steve Yun – Principal Predictive Modeler, Allstate Insurance
Joseph Rickert – Technical Marketing Manager, Revolution Analytics
As an insurance company, our ability to analyze data is our major competitive advantage.
Data is our competitive advantage
But we struggle with large data sets
SAS PROC GENMOD took over 5 hours to return results for a Poisson model with 150 million observations and 70 degrees of freedom.
It's difficult to be productive on a tight schedule when it takes over 5 hours just to fit one candidate model.
So we installed a small Hadoop cluster somewhere in the Midwest
And hoped that it would improve performance
Some assembly required
We couldn't find an open source MapReduce GLM implementation, so we wrote our own
We are a diverse group, but...
...we are reluctant software engineers
Physicist
Actuary
Statistician
Anthropologist
Database Developer
Anthropologist
So we chose rmr and rhdfs to implement our models
We use R and SAS daily, Python occasionally, and Java rarely. It made sense to start with what we already knew: R. Using R let us build on its existing linear algebra routines and other functions.
• Uses Tableau for visualization
• Uses SAS for data analysis
• Uses R for data analysis
• Uses Ab Initio for database development
The mapper and reducer are simple
Mapper:
    if first iteration:
        mu  = y + epsilon
        eta = linkfun(mu)
    else:
        eta = x %*% beta
        mu  = linkinv(eta)
    w   = mu_eta(eta)^2 / variance(mu)
    XWX = t(x) %*% (w * x)
    z   = eta + (y - mu) / mu_eta(eta)
    XWz = t(x) %*% (w * z)
    emit(1, concat(XWX, XWz))
Reducer:
    for each value:
        XWX_total += XWX
        XWz_total += XWz
    beta = solve(XWX_total, XWz_total)

Repeat until the convergence criterion is satisfied.
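The same map/reduce split can be sketched in Python with NumPy (our illustration, not the rmr code itself; it assumes a Poisson model with log link, so linkinv = exp, mu_eta = exp, and variance(mu) = mu):

```python
import numpy as np

def poisson_map(x, y, beta, first_iter, epsilon=0.1):
    """Mapper: one data block's contribution to XWX and XWz.
    Assumes a Poisson GLM with log link."""
    if first_iter:
        mu = y + epsilon              # safe starting values away from zero
        eta = np.log(mu)
    else:
        eta = x @ beta
        mu = np.exp(eta)
    mu_eta = mu                        # d(mu)/d(eta) = exp(eta) under log link
    w = mu_eta**2 / mu                 # IRLS working weights
    z = eta + (y - mu) / mu_eta        # IRLS working response
    XWX = x.T @ (w[:, None] * x)
    XWz = x.T @ (w * z)
    return XWX, XWz

def poisson_reduce(parts):
    """Reducer: sum the per-block pieces and solve for the next beta."""
    XWX = sum(p[0] for p in parts)
    XWz = sum(p[1] for p in parts)
    return np.linalg.solve(XWX, XWz)

# Toy driver: simulate data, split into two "blocks", iterate to convergence.
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(1000), rng.normal(size=1000)])
y = rng.poisson(np.exp(0.5 + 0.3 * x[:, 1]))
beta = None
for _ in range(25):
    blocks = zip(np.array_split(x, 2), np.array_split(y, 2))
    parts = [poisson_map(xb, yb, beta, beta is None) for xb, yb in blocks]
    new_beta = poisson_reduce(parts)
    if beta is not None and np.max(np.abs(new_beta - beta)) < 1e-8:
        beta = new_beta
        break
    beta = new_beta
```

Because each mapper emits only a small (p x p) cross-product matrix and a p-vector, the reducer's work is tiny no matter how many rows the data has; all the cost is in the map pass.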
Let's examine a single observation
3,3,1,2005,1998,AR,AR.41,AR.41.,2,-0.7542818,-1.6461267,-1.1010908,-1.6794449,-0.9714867,-1.4057974,-0.8370476,-1.1768577,F,-0.2315299,-0.2661168,-0.2723372,-0.25141,89,0,1,B,?,A,A,A,C,C,A,B,A,E,D
What the data looks like: the raw CSV row above mixes the dependent variable, numeric variables, and categorical variables.

What we want the data to look like: the dependent variable paired with a numeric design vector, with each categorical variable expanded into dummy variables:

  [2]  [3, 3, 41, -0.7542818, -1.6461267, 1, 0, 0, 0, 1]
A lot of plumbing is required
• We needed to pass in which independent variables we wanted to use
  o MapReduce reads in the entire data row; we needed to split it into a vector and remove the elements we don't want
  o And convert the strings into numerics, or convert the categorical variables into dummy variables
     - We need to know the variable types: numeric or categorical
     - We need to know the levels of the categorical variables so that we can create the dummy variables
• We needed to identify the dependent variable
  o Our data sets contain several possible dependent variables; we have to tell our implementation which one to use
• We needed to handle missing values
  o If a row has missing values, we have to omit it
• We need to read the results of each MapReduce iteration, convert them into the XWX and XWz matrices, and solve
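The row-level plumbing can be sketched as follows (a hypothetical illustration: the column positions, the "?" missing-value marker, and the factor levels are made up for the example, not Allstate's actual layout):

```python
# Known levels per categorical variable; the first level is the reference
# and gets no dummy column (standard treatment coding).
FACTOR_LEVELS = {"color": ["A", "B", "C"]}   # illustrative assumption

def parse_row(line, y_col, numeric_cols, factor_cols):
    """Turn one raw CSV line into (y, design-vector row), or None to omit it."""
    fields = line.strip().split(",")
    if "?" in fields:
        return None                          # omit rows with missing values
    y = float(fields[y_col])
    row = [float(fields[i]) for i in numeric_cols]
    for name, i in factor_cols.items():
        levels = FACTOR_LEVELS[name]
        row += [1.0 if fields[i] == levels[k] else 0.0
                for k in range(1, len(levels))]   # dummies for non-reference levels
    return y, row

# Usage on a toy row: y in column 0, numerics in columns 1-2, a factor in column 3
y, row = parse_row("2,0.5,-1.1,B", 0, [1, 2], {"color": 3})
```

Each mapper runs this on every line it receives, so the variable types and factor levels must be shipped to the mappers up front.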
Now that we have everything in place: Error
Error in solve.default(XWX, XWz) :
  system is computationally singular: reciprocal condition number = 1.15541e-27
Timing stopped at: 88.546 7.428 5290.394
• If each iteration is going to take 1.5 hours, and it takes 5-10 iterations to converge, we may want to rethink using MapReduce for the model selection task.
• It took almost 1.5 hours before we realized there was a problem with the design matrix.
• The error gives little information about which variables are causing the problem.
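A cheap pre-flight check on a small sample of the design matrix would have caught this before the long job started. This is our suggested diagnostic, not something from the deck; the column names are illustrative:

```python
import numpy as np

def check_design(x, names):
    """Check a sampled design matrix for rank deficiency before a long job.
    Returns (rank, reciprocal condition number of X'X, suspect columns)."""
    rank = np.linalg.matrix_rank(x)
    rcond = 1.0 / np.linalg.cond(x.T @ x)
    dependent = []
    if rank < x.shape[1]:
        # Columns whose removal leaves the rank unchanged are part of a
        # linearly dependent set -- likely culprits (e.g., redundant dummies).
        for j in range(x.shape[1]):
            reduced = np.delete(x, j, axis=1)
            if np.linalg.matrix_rank(reduced) == rank:
                dependent.append(names[j])
    return rank, rcond, dependent

# Toy example: the third column duplicates the second, so X'X is singular
x = np.column_stack([np.ones(50), np.arange(50.0), np.arange(50.0)])
rank, rcond, bad = check_design(x, ["intercept", "v1", "v1_copy"])
```

Running this on a few thousand sampled rows takes seconds and names the offending columns, instead of a bare "computationally singular" after 1.5 hours.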
What are our options?
• Buy a very large server and install R
• Use small random samples
• Use Revolution R Enterprise
Our large R server could not load the entire data set
We started loading the data on Monday morning and killed the load process on Wednesday afternoon.
So we sampled down the data set
We randomly partitioned the data set into 10 subsets.
Even after partitioning the data set into 10 subsets, it still took over an hour to read in the CSV file and convert the variables into the proper data types.
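A random partition like this can be done in one streaming pass, so no subset ever needs the full file in memory. A minimal sketch (our illustration):

```python
import random

def partition(lines, k=10, seed=42):
    """Randomly assign each row to one of k subsets in a single pass."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(k)]
    for line in lines:
        buckets[rng.randrange(k)].append(line)
    return buckets

# Toy usage: 100,000 rows split into 10 roughly equal random subsets
rows = [f"row{i}" for i in range(100_000)]
buckets = partition(rows)
```

In practice the buckets would be written to k output files rather than held in lists, but the one-pass random assignment is the point.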
We used the glmnet package to select our variables
Somewhere around the 25th to 30th step, the improvement in deviance begins to diminish.
• We used L1 regularization because it drops variables completely
• We used the glmnet package from CRAN because we didn't have the time and resources to write our own implementation in MapReduce
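The "drops variables completely" behavior is the key property of the L1 path. The deck used R's glmnet; as a stand-in illustration, scikit-learn's `lasso_path` (Gaussian response, simulated data) shows the same effect: at a strong penalty almost every coefficient is exactly zero, and variables enter one by one as the penalty relaxes:

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Simulated data: 20 candidate variables, only the first 2 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=500)

# Coefficients along a decreasing grid of L1 penalties
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# Early in the path (strong penalty) only the true signals survive;
# at the weakest penalty many noise variables creep back in.
n_active_strong = int(np.sum(coefs[:, 5] != 0))
n_active_weak = int(np.sum(coefs[:, -1] != 0))
```

Reading off where the deviance improvement flattens (around the 25th-30th step in our runs) gives the variable subset to carry forward.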
We used the variables selected by glmnet in a Poisson GLM model
We fit the model on a 10% subset, then evaluated the model on a different subset
Each GLM fit still takes over half an hour
We probably should have sampled the data set down even further to speed up each iteration
Once we identified the model we want, we went back to Hadoop
[1]  "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[2]  "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[3]  "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
...
[40] "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[41] "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[42] "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
Time difference of 8.83211 hours
There were no errors and no problems with the design matrix, but it still took almost 9 hours to finish
We tried preprocessing the data in Hadoop to improve performance

From: Drake, Don (Ranstadt)
Sent: Thursday, September 27, 2012 2:15 PM
To: Yun, Steve
Cc: Slusar, Mark; Anderson, Fred; Barry, Raymond
Subject: RE: Lab Cluster (lxe9700)

Steve,
Thanks for the quick reply, I've killed the jobs. Ray, can you restart JobTracker, the cluster is now idle.
Thanks.
-Don
_____________________________________________
From: Yun, Steve
Sent: Thursday, September 27, 2012 4:12 PM
To: Drake, Don (Ranstadt)
Cc: Slusar, Mark; Anderson, Fred
Subject: RE: Lab Cluster (lxe9700)

Feel free to kill the jobs. I'm about to kill them anyway.
_____________________________________________
From: Drake, Don (Ranstadt)
Sent: Thursday, September 27, 2012 2:10 PM
To: Yun, Steve
Cc: Slusar, Mark; Anderson, Fred
Subject: Lab Cluster (lxe9700)

Steve,
I see you have a few really long running jobs on the lab Hadoop cluster. I have been running smaller jobs much more frequently without an issue over the past few months. I was curious if you still needed these jobs to finish.
Also, we have a configuration change on the Hadoop cluster that requires a JobTracker restart. If we did that now, your jobs would be killed and not restarted. Please let me know when you think your jobs will be done so we can do the quick restart.
Thanks.
-Don
Maybe we should have used Revolution R
Allstate does not use Revolution R, but Joseph Rickert from Revolution will tell you what it can do.
Revolution Confidential
RevoScaleR PEMAs: Parallel External Memory Algorithms
[Diagram: an XDF file is split into blocks (Block 1, Block 2, ..., Block i, Block i+1, Block i+2). Blocks are read and intermediate results computed in parallel over successive passes (1st, 2nd, 3rd), iterating as necessary, with per-block results combined after the last block.]
R based algorithms:
• Work on blocks of data
• Inherently parallel and distributed
• Do not require all data to be in memory at one time
• Can deal with distributed and streaming data
RevoScaleR Big Data Analytics: Servers & Distributed Clusters

[Diagram: master nodes distribute the big data across compute nodes, each working on its own data partition.]

Data Step, Statistical Summary, Tables/Cubes, Covariance, Linear & Logistic Regression, GLM, K-means clustering, ...
RevoScaleR Poisson Model Benchmarks
Platform   File    Vars in Model   Avg. Elapsed Time (Minutes)
5 nodes    Large   40              5.7
1 node     Large   40              54.0
1 node     Small   40              2.3

Hardware: Linux LSF cluster, 5 nodes; each node has 4 cores (Intel Xeon E31240 @ 3.30 GHz) and 16 GB RAM.

Time to import large file (minutes)
All variables                                  90.0
Just model variables (distributed import)       9.8

File    Size (GB)   Rows          Cols
Large   85.0        145,814,612   139
Small   8.6         14,796,180    139
Summary
Model computation times (full dataset)
Software         Platform                                 Time to fit
SAS              16-core Sun server                       5 hours
rmr / MapReduce  10-node (8 cores/node) Hadoop cluster    > 10 hours
Open source R    250-GB server                            Impossible (> 3 days)
RevoScaleR       5-node (4 cores/node) LSF cluster        5.7 minutes
Lessons Learned
• Take advantage of existing tools
  – R has many libraries and functions to tackle model construction and selection challenges
  – Sample down and stand on the shoulders of giants
• Once you have selected the models to use, then "Go Big"
  – Determine what must be implemented and optimized in Hadoop
  – Look at other tools for model building