Start Small Before Going Big
STRATA NYC, October 23, 2012
Steve Yun – Principal Predictive Modeler, Allstate Insurance
Joseph Rickert – Technical Marketing Manager, Revolution Analytics


TRANSCRIPT

Page 1:

Allstate Research and Planning Center. Proprietary and confidential information. Do not reproduce. Prepared for the purpose of Allstate management discussion only.

Start Small Before Going Big

STRATA NYC October 23, 2012

Steve Yun – Principal Predictive Modeler, Allstate Insurance
Joseph Rickert – Technical Marketing Manager, Revolution Analytics

Page 2:

As an insurance company, our ability to analyze data is our major competitive advantage.

Data is our competitive advantage

Page 3:

But we struggle with large data sets

SAS Proc Genmod took over 5 hours to return results for a Poisson model with 150 million observations and 70 degrees of freedom.

It's difficult to be productive on a tight schedule if it takes over 5 hours just to fit one candidate model.

Page 4:

So we installed a small Hadoop cluster somewhere in the Midwest

And hoped that it would improve performance

Page 5:

Some assembly required

We couldn't find an open source MapReduce GLM implementation, so we wrote our own

Page 6:

We are a diverse group, but...

...we are reluctant software engineers

Physicist, Actuary, Statistician, Anthropologist, Database Developer

Page 7:

So we chose rmr and rhdfs to implement our models

We use R and SAS daily, Python occasionally, and Java rarely. It made sense to start with what we already know: R. Using R allowed us to draw on its existing linear algebra and statistical functions.

Uses Tableau for Visualization

Uses SAS for data analysis

Uses R for data analysis

Uses AbInitio for database development

Page 8:

The mapper and reducer are simple

Mapper:

if first iteration:
    mu = y + epsilon
    eta = linkfunction(mu)
else:
    eta = sum(beta * t(x))
    mu = linkinv(eta)
w = mu_eta(eta)^2 / variance(mu)
XWX = t(x) * x * w
z = eta + (y - mu) / mu_eta(eta)
XWz = t(x) * w * z
emit(1, concat(XWX, XWz))

Reducer:

for each incoming value (XWX_i, XWz_i):
    XWX += XWX_i
    XWz += XWz_i
beta = solve(XWX, XWz)

Repeat until the convergence criterion is satisfied.
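The mapper/reducer pseudocode above is one iteration of iteratively reweighted least squares (IRLS). Below is a minimal single-machine sketch in Python (not the authors' rmr code, and with a tiny illustrative data set) of a Poisson GLM with a log link: each observation's XWX/XWz contribution corresponds to what the mapper emits, and the summation plus solve is the reducer step.

```python
import math

def solve2(A, b):
    # Solve a 2x2 linear system by Cramer's rule (stand-in for solve(XWX, XWz)).
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def irls_poisson(xs, ys, iters=50, tol=1e-12, eps=0.1):
    """Poisson regression (log link) fit by IRLS, mirroring the pseudocode."""
    beta = None
    for it in range(iters):
        XWX = [[0.0, 0.0], [0.0, 0.0]]
        XWz = [0.0, 0.0]
        for x, y in zip(xs, ys):
            row = (1.0, x)                 # design row: intercept + one covariate
            if it == 0:
                mu = y + eps               # first iteration: start from the data
                eta = math.log(mu)         # link function (log)
            else:
                eta = beta[0] * row[0] + beta[1] * row[1]
                mu = math.exp(eta)         # inverse link
            # For the log link, mu_eta(eta) = mu and variance(mu) = mu,
            # so the IRLS weight w = mu_eta(eta)^2 / variance(mu) = mu.
            w = mu
            z = eta + (y - mu) / mu        # working response
            for i in range(2):             # the "mapper" contribution per row
                for j in range(2):
                    XWX[i][j] += row[i] * row[j] * w
                XWz[i] += row[i] * w * z
        new_beta = solve2(XWX, XWz)        # the "reducer" solve step
        if beta and max(abs(a - b) for a, b in zip(new_beta, beta)) < tol:
            return new_beta
        beta = new_beta
    return beta

# Toy data with an exact fit: y = exp(0 + log(2) * x)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 2.0, 4.0, 8.0]
beta = irls_poisson(xs, ys)   # converges to intercept 0, slope log(2)
```

On a cluster, the per-row loop runs inside the mappers and only the small XWX/XWz matrices travel to the reducer, which is why the approach scales to 150 million rows.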

Page 9:

Let's examine a single observation

3,3,1,2005,1998,AR,AR.41,AR.41.,2,-0.7542818,-1.6461267,-1.1010908,-1.6794449,-0.9714867,-1.4057974,-0.8370476,-1.1768577,F,-0.2315299,-0.2661168,-0.2723372,-0.25141,89,0,1,B,?,A,A,A,C,C,A,B,A,E,D

What the data looks like: the raw row above, mixing the dependent variable, numeric variables, and categorical variables.

What we want the data to look like: the dependent variable, followed by the selected numeric variables and the dummy variables expanded from the categoricals:

[2] [3, 3, 41, -0.7542818, -1.6461267, 1, 0, 0, 0, 1]
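Expanding a categorical field into dummy variables requires knowing the full set of levels up front, so that every mapper encodes rows identically. A small illustrative sketch (the level set and the drop-first-level convention are assumptions for the example, not Allstate's actual coding):

```python
def dummy_encode(value, levels):
    """One-hot encode `value` against `levels`, dropping the first level
    as the reference so the design matrix keeps full rank."""
    if value not in levels:
        raise ValueError("unseen level: %r" % value)
    return [1 if value == lev else 0 for lev in levels[1:]]

levels = ["A", "B", "C", "D", "E"]   # must be known before the job runs
encoded = dummy_encode("B", levels)  # one indicator per non-reference level
```

If a mapper encountered a level the driver did not know about, the dummy columns would no longer line up across rows, which is why the levels have to be collected in a preprocessing pass.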

Page 10:

A lot of plumbing is required

• We needed to be able to pass which independent variables we wanted to use
o MapReduce read in the entire data row; we needed to split it into a vector and remove the elements we don't want
o And convert the strings into numerics, or convert the categorical variables into dummy variables
– Need to know the variable types: numeric or categorical
– Need to know the levels of the categorical variables so that we can create the dummy variables
• We needed to identify which was the dependent variable
o We have several possible dependent variables in our data sets. We have to be able to tell our implementation which variable to use as the dependent variable
• We needed to handle missing values
o If there are missing values, we have to omit the row
• We needed to read the results of the MapReduce iteration, convert the result into the XWX and XWz matrices, and solve
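The plumbing steps above (select columns, coerce types, expand categoricals, drop rows with missing values) amount to a single row-preparation function. A sketch in Python; the column positions, the "?" missing-value marker, and the level sets are illustrative assumptions:

```python
def prepare_row(line, y_col, num_cols, cat_cols, levels, missing="?"):
    """Turn one raw CSV line into (y, feature_vector), or None if any
    selected field is missing (the row is then omitted)."""
    fields = line.split(",")
    selected = [fields[y_col]] + [fields[i] for i in num_cols] + \
               [fields[i] for i in cat_cols]
    if any(f == missing or f == "" for f in selected):
        return None                            # omit rows with missing values
    y = float(fields[y_col])                   # the chosen dependent variable
    x = [float(fields[i]) for i in num_cols]   # numeric columns as-is
    for i in cat_cols:                         # categoricals -> dummy variables
        lev = levels[i]
        x.extend(1.0 if fields[i] == l else 0.0 for l in lev[1:])
    return y, x

levels = {3: ["A", "B", "C"]}   # known levels per categorical column
good = prepare_row("2,0.5,-1.2,B", y_col=0, num_cols=[1, 2],
                   cat_cols=[3], levels=levels)
bad = prepare_row("2,?,-1.2,B", y_col=0, num_cols=[1, 2],
                  cat_cols=[3], levels=levels)   # dropped: missing value
```

Passing `y_col`, `num_cols`, `cat_cols`, and `levels` as job configuration is what lets the same mapper fit different candidate models without editing code.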

Page 11:

Now that we have everything in place: Error

Error in solve.default(XWX, XWz) : system is computationally singular: reciprocal condition number = 1.15541e-27
Timing stopped at: 88.546 7.428 5290.394

• If each iteration is going to take 1.5 hours, and it takes 5–10 iterations to converge, I may want to rethink using MapReduce for the model selection task.

Almost 1.5 hours before realizing that there was a problem with the design matrix.

Not a lot of information about which variables are causing the problem.
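One way to get better diagnostics than "computationally singular" is to scan the cross-product matrix for linearly dependent columns before calling solve, so the offending variable can be named. A small sketch using pivoted Gaussian elimination (illustrative, not the code the authors ran; for a symmetric X'WX a near-zero pivot means that column adds no new information):

```python
def dependent_columns(A, tol=1e-9):
    """Return indices of columns of the symmetric matrix A = X'WX that are
    (numerically) linear combinations of earlier columns."""
    A = [row[:] for row in A]                 # work on a copy
    n = len(A)
    scale = max(abs(A[i][i]) for i in range(n)) or 1.0
    bad = []
    for k in range(n):
        if abs(A[k][k]) < tol * scale:        # no usable pivot: dependent column
            bad.append(k)
            continue
        for i in range(k + 1, n):             # eliminate column k below the pivot
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
    return bad

# X'WX built from a design matrix whose third column duplicates the second:
XWX = [[3.0, 6.0, 6.0], [6.0, 14.0, 14.0], [6.0, 14.0, 14.0]]
cols = dependent_columns(XWX)   # flags the duplicated column by index
```

Running a check like this on the first reducer output, before the solve, would have surfaced the design-matrix problem in the first iteration instead of after 1.5 hours.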

Page 12:

What are our options?

Buy a very large server and install R

Error in solve.default(XWX, XWz) : system is computationally singular: reciprocal condition number = 1.15541e-27
Timing stopped at: 88.546 7.428 5290.394

Use Small Random Samples

Use Revolution R Enterprise

Page 13:

Our large R server could not load the entire data set

Started to load the data on Monday morning; killed the data load process on Wednesday afternoon.

Page 14:

So we sampled down the data set

We randomly partitioned the data set into 10 subsets.

Even after partitioning the data set into 10 subsets, it still took over an hour to read in the CSV file and convert the variables into the proper data types.
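Random partitioning into k subsets can be done in one streaming pass, assigning each row a bucket with a seeded RNG so the split is reproducible. A sketch (the in-memory lists and k = 10 are illustrative; at Allstate's scale the buckets would be written to separate files):

```python
import random

def partition_rows(rows, k=10, seed=42):
    """Randomly assign each row to one of k subsets in a single pass.
    A fixed seed makes the partition reproducible across runs."""
    rng = random.Random(seed)
    subsets = [[] for _ in range(k)]
    for row in rows:
        subsets[rng.randrange(k)].append(row)
    return subsets

rows = ["row-%d" % i for i in range(1000)]
subsets = partition_rows(rows)   # ten subsets of roughly 100 rows each
```

Reproducibility matters here: fitting on subset 1 and evaluating on subset 2 is only a fair comparison if the assignment doesn't change between runs.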

Page 15:

We used the GLMNET package to select our variables

Somewhere around the 25th to 30th step, the improvement in deviance begins to diminish.

• We used L1 regularization because it drops variables completely
• We used the GLMNET package from CRAN because I didn't have the time and resources to write my own implementation in MapReduce
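L1 regularization zeroes out weak coefficients entirely, which is what makes it usable for variable selection. glmnet fits the penalized Poisson likelihood by coordinate descent; the toy Python sketch below shows the same soft-thresholding mechanism on a squared-error problem with an orthogonal design (an illustrative simplification, not a glmnet replacement):

```python
def soft_threshold(z, g):
    """The lasso shrinkage operator: shrink toward zero, clip at zero."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso(X, y, lam, iters=100):
    """Coordinate-descent lasso for columns standardized so that each
    column of X has sum of squares equal to n."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # correlation of column j with the partial residual
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * beta[k]
                      for k in range(p) if k != j)) for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam)
    return beta

# Orthogonal toy design: feature 0 is strong, feature 1 is weak noise.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [3.1, 2.9, -2.9, -3.1]
beta = lasso(X, y, lam=1.0)   # the weak feature is dropped to exactly zero
```

The exact zero (rather than a small coefficient) is the point: increasing the penalty walks variables out of the model one by one, giving the selection path whose deviance improvement flattens around step 25–30.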

Page 16:

We used the variables selected by GLMNET in a Poisson GLM model

We fit the model on a 10% subset, then evaluated the model on a different subset

Page 17:

Each GLM fit still takes over half an hour

We probably should have sampled the data set down even further to speed up each iteration

Page 18:

Once we identified the model we wanted, we went back to Hadoop

[1] "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[2] "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[3] "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
...
[40] "beta.hat:x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[41] "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
[42] "beta.hat:-x.xxxxxxxxxxxxxxx:distance:1.13903242562306e-06:iter:7"
Time difference of 8.83211 hours

There were no errors and no problems with the design matrix, but it still took almost 9 hours to finish

Page 19:

We tried preprocessing the data in Hadoop to improve performance

From: Drake, Don (Ranstadt)
Sent: Thursday, September 27, 2012 2:15 PM
To: Yun, Steve
Cc: Slusar, Mark; Anderson, Fred; Barry, Raymond
Subject: RE: Lab Cluster (lxe9700)

Steve,
Thanks for the quick reply, I've killed the jobs. Ray, can you restart JobTracker, the cluster is now idle.
Thanks.
-Don
_____________________________________________
From: Yun, Steve
Sent: Thursday, September 27, 2012 4:12 PM
To: Drake, Don (Ranstadt)
Cc: Slusar, Mark; Anderson, Fred
Subject: RE: Lab Cluster (lxe9700)

Feel free to kill the jobs. I'm about to kill them anyway.
_____________________________________________
From: Drake, Don (Ranstadt)
Sent: Thursday, September 27, 2012 2:10 PM
To: Yun, Steve
Cc: Slusar, Mark; Anderson, Fred
Subject: Lab Cluster (lxe9700)

Steve,
I see you have a few really long running jobs on the lab hadoop cluster. I have been running smaller jobs much more frequently without an issue over the past few months. I was curious if you still needed these jobs to finish?
Also, we have a configuration change on the hadoop cluster that requires a JobTracker restart. If we did that now, your jobs would be killed and not restarted. Please let me know when you think your jobs will be done so we can do the quick restart.
Thanks.
-Don

Page 20:

Maybe we should have used Revolution R

Allstate does not use Revolution R, but Joseph Rickert from Revolution will tell you what it can do.

Page 21:

Revolution Confidential

RevoScaleR PEMAs: Parallel External Memory Algorithms


[Diagram: an XDF file is read block by block; blocks are processed in parallel, each block's intermediate results are combined, and the algorithm makes a 2nd and 3rd pass over the data as necessary.]

R-based algorithms:
• Work on blocks of data
• Inherently parallel and distributed
• Do not require all data to be in memory at one time
• Can deal with distributed and streaming data
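The pattern behind these external-memory algorithms is to split the computation into a per-block step that produces small sufficient statistics and a combine step that merges them, so only one block is in memory at a time. A Python sketch of the idea applied to a simple linear regression (illustrative only; RevoScaleR itself is an R product and its internals are not shown here):

```python
def blocks(data, block_size):
    """Yield the data one block at a time (stand-in for reading an XDF file)."""
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]

def block_result(block):
    """Per-block intermediate result: sufficient statistics for a
    regression of y on x. This is the parallelizable step."""
    n = len(block)
    sx = sum(x for x, _ in block)
    sy = sum(y for _, y in block)
    sxx = sum(x * x for x, _ in block)
    sxy = sum(x * y for x, y in block)
    return (n, sx, sy, sxx, sxy)

def combine(results):
    """Merge the block results and finish the computation."""
    n, sx, sy, sxx, sxy = (sum(t) for t in zip(*results))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Exact linear data: y = 2x + 1, processed 100 rows at a time.
data = [(float(i), 2.0 * i + 1.0) for i in range(1000)]
fit = combine(block_result(b) for b in blocks(data, block_size=100))
```

Because the block results are tiny and additive, the same code structure works whether the blocks come from one disk, several cluster nodes, or a stream.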

Page 22:

RevoScaleR Big Data Analytics: Servers & Distributed Clusters

[Diagram: big data is split into data partitions; master nodes distribute work across compute nodes, each holding a data partition.]

Data Step, Statistical Summary, Tables/Cubes, Covariance, Linear & Logistic Regression, GLM, K-means clustering, …

Page 23:

RevoScaleR Poisson Model Benchmarks

Platform | File | Vars in Model | Avg. Elapsed Time (Minutes)
5 nodes | Large | 40 | 5.7
1 node | Large | 40 | 54.0
1 node | Small | 40 | 2.3

Linux LSF cluster, 5 nodes. Each node: 4 cores, Intel® Xeon® E31240 @ 3.30GHz, 16 GB RAM.

Time to import large file (minutes):
All variables: 90.0
Just model variables (distributed import): 9.8

File | Size (GB) | Rows | Cols
Large | 85.0 | 145,814,612 | 139
Small | 8.6 | 14,796,180 | 139

Page 24:

Page 25:

Summary

Model computation times (full dataset)

Software | Platform | Time to fit
SAS | 16-core Sun server | 5 hours
rmr / MapReduce | 10-node (8 cores / node) Hadoop cluster | > 10 hours
Open source R | 250 GB server | Impossible (> 3 days)
RevoScaleR | 5-node (4 cores / node) LSF cluster | 5.7 minutes

Page 26:

Lessons Learned

• Take advantage of existing tools
– R has many libraries and functions to tackle model construction and selection challenges
– Sample down and stand on the shoulders of giants
• Once you have selected the models to use, then "Go Big"
– Determine what must be implemented and optimized in Hadoop
– Look at other tools for model building