TRANSCRIPT
Allstate Research and Planning Center. Proprietary and confidential information. Do not reproduce. Prepared for the purpose of Allstate management discussion only.
Start Small Before Going Big
STRATA NYC October 23, 2012
Steve Yun – Principal Predictive Modeler, Allstate Insurance
Joseph Rickert – Technical Marketing Manager, Revolution Analytics
As an insurance company, our ability to analyze data is our major competitive advantage.
Data is our competitive advantage
But we struggle with large data sets
SAS PROC GENMOD took over 5 hours to return results for a Poisson model with 150 million observations and 70 degrees of freedom.
It's difficult to be productive on a tight schedule when it takes over 5 hours just to fit one candidate model.
So we installed a small Hadoop cluster somewhere in the Midwest
And hoped that it would improve performance
Some assembly required
We couldn't find an open source MapReduce GLM implementation, so we wrote our own
We are a diverse group, but...
...we are reluctant software engineers
Physicist
Actuary
Statistician
Anthropologist
Database Developer
Anthropologist
So we chose rmr and rhdfs to implement our models
We use R and SAS daily, Python occasionally, and Java rarely. It made sense to start with what we already knew: R. Using R let us build on its existing linear algebra routines and other functions.
• Uses Tableau for visualization
• Uses SAS for data analysis
• Uses R for data analysis
• Uses Ab Initio for database development
The mapper and reducer are simple
Mapper:
    if first iteration:
        mu  = y + epsilon
        eta = linkfun(mu)
    else:
        eta = x %*% beta
        mu  = linkinv(eta)
    w   = mu_eta(eta)^2 / variance(mu)
    XWX = t(x) %*% (w * x)
    z   = eta + (y - mu) / mu_eta(eta)
    XWz = t(x) %*% (w * z)
    emit(1, concat(XWX, XWz))
Reducer:
    for each value:
        XWX_total += XWX
        XWz_total += XWz
    beta = solve(XWX_total, XWz_total)

Repeat until the convergence criterion is satisfied.
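The same map/reduce split can be sketched in Python with NumPy (our illustration, not the rmr code itself; it assumes a Poisson model with log link, so linkinv = exp, mu_eta = exp, and variance(mu) = mu):

```python
import numpy as np

def poisson_map(x, y, beta, first_iter, epsilon=0.1):
    """Mapper: one data block's contribution to XWX and XWz.
    Assumes a Poisson GLM with log link."""
    if first_iter:
        mu = y + epsilon              # safe starting values away from zero
        eta = np.log(mu)
    else:
        eta = x @ beta
        mu = np.exp(eta)
    mu_eta = mu                        # d(mu)/d(eta) = exp(eta) under log link
    w = mu_eta**2 / mu                 # IRLS working weights
    z = eta + (y - mu) / mu_eta        # IRLS working response
    XWX = x.T @ (w[:, None] * x)
    XWz = x.T @ (w * z)
    return XWX, XWz

def poisson_reduce(parts):
    """Reducer: sum the per-block pieces and solve for the next beta."""
    XWX = sum(p[0] for p in parts)
    XWz = sum(p[1] for p in parts)
    return np.linalg.solve(XWX, XWz)

# Toy driver: simulate data, split into two "blocks", iterate to convergence.
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(1000), rng.normal(size=1000)])
y = rng.poisson(np.exp(0.5 + 0.3 * x[:, 1]))
beta = None
for _ in range(25):
    blocks = zip(np.array_split(x, 2), np.array_split(y, 2))
    parts = [poisson_map(xb, yb, beta, beta is None) for xb, yb in blocks]
    new_beta = poisson_reduce(parts)
    if beta is not None and np.max(np.abs(new_beta - beta)) < 1e-8:
        beta = new_beta
        break
    beta = new_beta
```

Because each mapper emits only a small (p x p) cross-product matrix and a p-vector, the reducer's work is tiny no matter how many rows the data has; all the cost is in the map pass.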
Let's examine a single observation
3,3,1,2005,1998,AR,AR.41,AR.41.,2,-0.7542818,-1.6461267,-1.1010908,-1.6794449,-0.9714867,-1.4057974,-0.8370476,-1.1768577,F,-0.2315299,-0.2661168,-0.2723372,-0.25141,89,0,1,B,?,A,A,A,C,C,A,B,A,E,D
What the data looks like: the raw CSV row above mixes the dependent variable, numeric variables, and categorical variables.

What we want the data to look like: the dependent variable paired with a numeric design vector, with each categorical variable expanded into dummy variables:

  [2]  [3, 3, 41, -0.7542818, -1.6461267, 1, 0, 0, 0, 1]
A lot of plumbing is required
• We needed to pass in which independent variables we wanted to use
  o MapReduce reads in the entire data row; we needed to split it into a vector and remove the elements we don't want
  o And convert the strings into numerics, or convert the categorical variables into dummy variables
     - We need to know the variable types: numeric or categorical
     - We need to know the levels of the categorical variables so that we can create the dummy variables
• We needed to identify the dependent variable
  o Our data sets contain several possible dependent variables; we have to tell our implementation which one to use
• We needed to handle missing values
  o If a row has missing values, we have to omit it
• We need to read the results of each MapReduce iteration, convert them into the XWX and XWz matrices, and solve
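The row-level plumbing can be sketched as follows (a hypothetical illustration: the column positions, the "?" missing-value marker, and the factor levels are made up for the example, not Allstate's actual layout):

```python
# Known levels per categorical variable; the first level is the reference
# and gets no dummy column (standard treatment coding).
FACTOR_LEVELS = {"color": ["A", "B", "C"]}   # illustrative assumption

def parse_row(line, y_col, numeric_cols, factor_cols):
    """Turn one raw CSV line into (y, design-vector row), or None to omit it."""
    fields = line.strip().split(",")
    if "?" in fields:
        return None                          # omit rows with missing values
    y = float(fields[y_col])
    row = [float(fields[i]) for i in numeric_cols]
    for name, i in factor_cols.items():
        levels = FACTOR_LEVELS[name]
        row += [1.0 if fields[i] == levels[k] else 0.0
                for k in range(1, len(levels))]   # dummies for non-reference levels
    return y, row

# Usage on a toy row: y in column 0, numerics in columns 1-2, a factor in column 3
y, row = parse_row("2,0.5,-1.1,B", 0, [1, 2], {"color": 3})
```

Each mapper runs this on every line it receives, so the variable types and factor levels must be shipped to the mappers up front.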
Now that we have everything in place: Error
Error in solve.default(XWX, XWz) :
  system is computationally singular: reciprocal condition number = 1.15541e-27
Timing stopped at: 88.546 7.428 5290.394
• If each iteration is going to take 1.5 hours, and it takes 5-10 iterations to converge, we may want to rethink using MapReduce for the model selection task.
• It took almost 1.5 hours before we realized there was a problem with the design matrix.
• The error gives little information about which variables are causing the problem.
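A cheap pre-flight check on a small sample of the design matrix would have caught this before the long job started. This is our suggested diagnostic, not something from the deck; the column names are illustrative:

```python
import numpy as np

def check_design(x, names):
    """Check a sampled design matrix for rank deficiency before a long job.
    Returns (rank, reciprocal condition number of X'X, suspect columns)."""
    rank = np.linalg.matrix_rank(x)
    rcond = 1.0 / np.linalg.cond(x.T @ x)
    dependent = []
    if rank < x.shape[1]:
        # Columns whose removal leaves the rank unchanged are part of a
        # linearly dependent set -- likely culprits (e.g., redundant dummies).
        for j in range(x.shape[1]):
            reduced = np.delete(x, j, axis=1)
            if np.linalg.matrix_rank(reduced) == rank:
                dependent.append(names[j])
    return rank, rcond, dependent

# Toy example: the third column duplicates the second, so X'X is singular
x = np.column_stack([np.ones(50), np.arange(50.0), np.arange(50.0)])
rank, rcond, bad = check_design(x, ["intercept", "v1", "v1_copy"])
```

Running this on a few thousand sampled rows takes seconds and names the offending columns, instead of a bare "computationally singular" after 1.5 hours.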
What are our options?
• Buy a very large server and install R
• Use small random samples
• Use Revolution R Enterprise
Our large R server could not load the entire data set
We started loading the data on Monday morning and killed the load process on Wednesday afternoon.
So we sampled down the data set
We randomly partitioned the data set into 10 subsets.
Even after partitioning the data set into 10 subsets, it still took over an hour to read in the CSV file and convert the variables into the proper data types.
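A random partition like this can be done in one streaming pass, so no subset ever needs the full file in memory. A minimal sketch (our illustration):

```python
import random

def partition(lines, k=10, seed=42):
    """Randomly assign each row to one of k subsets in a single pass."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(k)]
    for line in lines:
        buckets[rng.randrange(k)].append(line)
    return buckets

# Toy usage: 100,000 rows split into 10 roughly equal random subsets
rows = [f"row{i}" for i in range(100_000)]
buckets = partition(rows)
```

In practice the buckets would be written to k output files rather than held in lists, but the one-pass random assignment is the point.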
We used the glmnet package to select our variables
Somewhere around the 25th to 30th step, the improvement in deviance begins to diminish.
• We used L1 regularization because it drops variables completely
• We used the glmnet package from CRAN because we didn't have the time and resources to write our own implementation in MapReduce
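The "drops variables completely" behavior is the key property of the L1 path. The deck used R's glmnet; as a stand-in illustration, scikit-learn's `lasso_path` (Gaussian response, simulated data) shows the same effect: at a strong penalty almost every coefficient is exactly zero, and variables enter one by one as the penalty relaxes:

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Simulated data: 20 candidate variables, only the first 2 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=500)

# Coefficients along a decreasing grid of L1 penalties
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# Early in the path (strong penalty) only the true signals survive;
# at the weakest penalty many noise variables creep back in.
n_active_strong = int(np.sum(coefs[:, 5] != 0))
n_active_weak = int(np.sum(coefs[:, -1] != 0))
```

Reading off where the deviance improvement flattens (around the 25th-30th step in our runs) gives the variable subset to carry forward.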
We used the variables selected by glmnet in a Poisson GLM model
We fit the model on a 10% subset, then evaluated the model on a different subset
Each GLM fit still takes over half an hour
We probably should have sampled the data set down even further to speed up each iteration
Once we identified the model we want, we went back to Hadoop
[1]  "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[2]  "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[3]  "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
...
[40] "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[41] "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[42] "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
Time difference of 8.83211 hours
There were no errors and no problems with the design matrix, but it still took almost 9 hours to finish
We tried preprocessing the data in Hadoop to improve performance

From: Drake, Don (Ranstadt)
Sent: Thursday, September 27, 2012 2:15 PM
To: Yun, Steve
Cc: Slusar, Mark; Anderson, Fred; Barry, Raymond
Subject: RE: Lab Cluster (lxe9700)

Steve,
Thanks for the quick reply, I've killed the jobs. Ray, can you restart JobTracker, the cluster is now idle.
Thanks.
-Don
_____________________________________________
From: Yun, Steve
Sent: Thursday, September 27, 2012 4:12 PM
To: Drake, Don (Ranstadt)
Cc: Slusar, Mark; Anderson, Fred
Subject: RE: Lab Cluster (lxe9700)

Feel free to kill the jobs. I'm about to kill them anyway.
_____________________________________________
From: Drake, Don (Ranstadt)
Sent: Thursday, September 27, 2012 2:10 PM
To: Yun, Steve
Cc: Slusar, Mark; Anderson, Fred
Subject: Lab Cluster (lxe9700)

Steve,
I see you have a few really long running jobs on the lab Hadoop cluster. I have been running smaller jobs much more frequently without an issue over the past few months. I was curious if you still needed these jobs to finish.
Also, we have a configuration change on the Hadoop cluster that requires a JobTracker restart. If we did that now, your jobs would be killed and not restarted. Please let me know when you think your jobs will be done so we can do the quick restart.
Thanks.
-Don
Maybe we should have used Revolution R
Allstate does not use Revolution R, but Joseph Rickert from Revolution will tell you what it can do.
Revolution Confidential
RevoScaleR PEMAs: Parallel External Memory Algorithms
[Diagram: an XDF file is split into blocks (Block 1, Block 2, ..., Block i, Block i+1, Block i+2). Blocks are read and intermediate results computed in parallel over successive passes (1st, 2nd, 3rd), iterating as necessary, with per-block results combined after the last block.]
R based algorithms:
• Work on blocks of data
• Inherently parallel and distributed
• Do not require all data to be in memory at one time
• Can deal with distributed and streaming data
RevoScaleR Big Data Analytics: Servers & Distributed Clusters

[Diagram: master nodes distribute the big data across compute nodes, each working on its own data partition.]

Data Step, Statistical Summary, Tables/Cubes, Covariance, Linear & Logistic Regression, GLM, K-means clustering, ...
RevoScaleR Poisson Model Benchmarks
Platform   File    Vars in Model   Avg. Elapsed Time (Minutes)
5 nodes    Large   40              5.7
1 node     Large   40              54.0
1 node     Small   40              2.3

Hardware: Linux LSF cluster, 5 nodes; each node has 4 cores (Intel Xeon E31240 @ 3.30 GHz) and 16 GB RAM.

Time to import large file (minutes)
All variables                                  90.0
Just model variables (distributed import)       9.8

File    Size (GB)   Rows          Cols
Large   85.0        145,814,612   139
Small   8.6         14,796,180    139
Summary
Model computation times (full dataset)
Software         Platform                                 Time to fit
SAS              16-core Sun server                       5 hours
rmr / MapReduce  10-node (8 cores/node) Hadoop cluster    > 10 hours
Open source R    250-GB server                            Impossible (> 3 days)
RevoScaleR       5-node (4 cores/node) LSF cluster        5.7 minutes
Lessons Learned
• Take advantage of existing tools
  – R has many libraries and functions to tackle model construction and selection challenges
  – Sample down and stand on the shoulders of giants
• Once you have selected the models to use, then "Go Big"
  – Determine what must be implemented and optimized in Hadoop
  – Look at other tools for model building