caravan insurance data mining prediction models

Caravan Insurance Data Mining Assignment
K6225 Knowledge Discovery and Data Mining
By Sesagiri Raamkumar Aravind (G1101761F) and Thangavelu Muthu Kumaar (G1101765E)

Upload: muthu-kumaar

Post on 04-Nov-2014



Table of Contents

1.0 Objective
2.0 Summary of Final Results
3.0 Exercise Lifecycle
3.1 Understanding the objective of the exercise and its expectations
3.2 Understanding the data dictionary of the data set
3.3 Assigning appropriate measure values (Set/Range) for data fields
3.4 Constructing first level models with Training dataset
3.4.1 Logistic Regression
3.4.2 Decision Trees
3.4.3 Neural Networks
3.5 Running the first level Models with Test data
3.6 Performing bivariate analysis on training dataset
3.7 Creating interaction variables based on results of Step 6
3.8 Balancing the training data
3.9 Constructing second level models with Training dataset
3.10 Running the second level Models with Test dataset
3.11 Constructing third level models by adding new interaction variables
3.12 Running the third level models with Test dataset
3.13 Final Results Interpretation


1.0 Objective

The objective of this data mining exercise is to find the best possible model to predict whether a customer will opt for caravan insurance (a mobile home policy). The techniques used are logistic regression, decision trees and neural networks.

2.0 Summary of Final Results

The models built using Logistic Regression and Decision Tree achieved the highest accuracy, ahead of the models built using Neural Network. The best model had an accuracy of 94%. The most interesting finding of the exercise is that the base models (built on the data as originally provided), without any interaction variables or balancing, gave the best results. As expected, most models had higher accuracy on the training dataset than on the test dataset. Cross-validation, such as 10-fold cross-validation, was not performed in this exercise; it could have given a more reliable picture of the results.

3.0 Exercise Lifecycle

The lifecycle of the complete data mining exercise comprises the following steps:

1. Understanding the objective of the exercise and its expectations

2. Understanding the data dictionary of the data set

3. Assigning appropriate measure values (Set/Range) for data fields

4. Constructing first level Models with Training dataset

5. Running the first level Models with Test dataset

6. Performing bivariate analysis

7. Balancing the training data

8. Constructing second level Models with Training dataset

9. Dataset modification of Training dataset

10. Running the second level Models with Test dataset

11. Creating interaction variables based on results of Step 6

12. Constructing third level Models with Training dataset

13. Running the third level Models with Test dataset

14. Final Results Interpretation


3.1 Understanding the objective of the exercise and its expectations

The first and foremost step in a data mining exercise is to understand the business objective. The business wants to use its existing customer signatures to build a predictive model of mobile home policy uptake. The model and the inferences drawn from it will be a precursor to a potential marketing campaign that targets specific customer groups. The data mining techniques in the scope of this exercise are logistic regression, decision trees and neural networks.

3.2 Understanding the data dictionary of the data set

The data dictionary consists of 86 variables, with an equal mix of socio-demographic and product ownership data. A few ordinal variables need to be changed to numeric variables for build efficiency. The socio-demographic variables are captured at zip-code level.

3.3 Assigning appropriate measure values (Set/Range) for data fields

In Clementine, the measure of the variables below was manually changed to ‘Range’; all other fields kept their automatically assigned measures:

MAANTHUI Number of houses

MGEMOMV Avg size household

MGEMLEEF Avg age

MGODRK Roman catholic

PWAPART Contribution private third party insurance

There is an academic view that socio-demographic variables should be converted to ‘Range’ variables so that their values can be conveniently plotted on the logistic regression curve. The authors initially retained the variables as ‘Set’ variables in order to test this postulation at a later stage.

3.4 Constructing first level models with Training dataset

The authors planned to arrive at the best model using a three-level approach. The models built in the first level are crude models constructed directly on the dataset, without any new interaction variables or data balancing. These models serve as the first benchmark against which subsequent improvements are gauged. Models were built using the Logistic, C5.0 and Neural Net nodes.

3.4.1 Logistic Regression

No changes were made to the Logistic Regression node, as all of its attributes appeared to be optimal.

3.4.2 Decision Trees

The following changes were made to the C5 node in Expert mode:

Pruning Severity was set to 5.

‘Minimum records per child branch’ was changed from 2 to 5, as this was found to be the optimal number. A value of 1 impaired the results, and the same could be said for values greater than 2.

The ‘Use Boosting’ option was enabled so that more classifiers are created. The value was set to 15 for the first level and changed to 5 for the second and third levels.

Fig 7: C5 Model Attributes

3.4.3 Neural Networks

For the neural network, the RBFN method was selected first, but the model did not produce better results. The final method selected was ‘Quick’. The number of hidden layers was set to 3 so that more transformations could take place. The learning rate was initially increased marginally to check for performance improvements, on the assumption that training was converging towards the global minimum of the network's error curve. As the marginal increase of the alpha learning rate did not produce significant results, it was increased sharply to 0.9 in order to escape a possible local minimum. The final values are available in the screenshot below.

Fig 8: Neural Network Attributes

3.5 Running the first level Models with Test data

The trained first level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. Logistic Regression and Decision Tree both had the best accuracy rate of 94%. The Nagelkerke R-square value with the training dataset was 16.7%. These results are maintained as the first level benchmark. Screenshots are provided below.

Fig 1: First Level Models Analysis Node Results
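For reference, the Nagelkerke R-square quoted above is usually defined as a rescaled Cox and Snell statistic. The symbols below are notation introduced here rather than taken from the report: L_0 is the likelihood of the intercept-only model, L_M the likelihood of the fitted model, and n the number of training records.

```latex
R^2_{\mathrm{CS}} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n},
\qquad
R^2_{\mathrm{N}} = \frac{R^2_{\mathrm{CS}}}{1 - L_0^{\,2/n}}
```

The division by the maximum attainable Cox and Snell value is what allows the statistic to reach 1 for a perfectly fitting model.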


Fig 2: First Level Gain and Lift Chart
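As a rough illustration of what a gain and lift chart summarises, the sketch below ranks records by model score and measures how many positive cases are captured in each cumulative slice. It is not the Evaluation node's implementation; the function name, the toy labels and the toy scores are all hypothetical.

```python
def cumulative_gain_and_lift(y_true, scores, n_bins=10):
    """Return (gains, lifts), one value per cumulative bin of ranked records."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("at least one positive case is required")
    # Rank records from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [y_true[i] for i in order]
    n = len(ranked)
    gains, lifts = [], []
    for b in range(1, n_bins + 1):
        cutoff = round(n * b / n_bins)           # records contacted so far
        gain = sum(ranked[:cutoff]) / total_pos  # share of positives captured
        gains.append(gain)
        lifts.append(gain / (cutoff / n))        # gain relative to random targeting
    return gains, lifts


# Toy data (hypothetical): 2 positives among 10 scored records.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
gains, lifts = cumulative_gain_and_lift(y_true, scores, n_bins=5)
```

A lift above 1 in the first bins means the model concentrates likely buyers at the top of the ranking, which is what matters for a targeted marketing campaign.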

3.6 Performing bivariate analysis on training dataset

This step marks the start of the second level model building process. Bivariate analysis in Clementine can be done using the Web node, which represents the relationships between the values of variables using thick and thin lines. The authors performed the analysis using both the normal web and the directed web options in the Web node. The directed web had the CARAVAN variable as its target, and all the other variables were placed in the dependent section. This analysis was not helpful: relationships appeared between many different values of the independent variables and CARAVAN, so no significant inferences could be made. However, the normal web analysis indicated a strong relationship between customer type and customer subtype, which makes the pair a potential candidate for an interaction variable.
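The idea behind a web plot can be approximated by counting how often pairs of values occur together; the thicker lines correspond to the larger counts. The sketch below is only an approximation of that idea, not the Web node itself. The field names MOSHOOFD (customer main type) and MOSTYPE (customer subtype) are assumed from the public data dictionary, and the records are made up.

```python
from collections import Counter


def cooccurrence_counts(records, field_a, field_b):
    """Count how often each (value_a, value_b) pair appears together."""
    return Counter((r[field_a], r[field_b]) for r in records)


# Hypothetical records; the values carry no meaning beyond illustration.
records = [
    {"MOSHOOFD": 8, "MOSTYPE": 33},
    {"MOSHOOFD": 8, "MOSTYPE": 33},
    {"MOSHOOFD": 8, "MOSTYPE": 33},
    {"MOSHOOFD": 3, "MOSTYPE": 10},
    {"MOSHOOFD": 3, "MOSTYPE": 11},
]
counts = cooccurrence_counts(records, "MOSHOOFD", "MOSTYPE")
strongest_pair, strongest_count = counts.most_common(1)[0]
```

Pairs that dominate the counts, like the first one here, are the ones a web plot would draw with a thick line.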

3.7 Creating interaction variables based on results of Step 6

The indication from the last step was implemented in this step by creating two interaction variables. The first interaction variable, Derive1 (aka customer lifestyle reflector), combines the parent variables Customer Type and Customer Subtype. The second interaction variable, Derive3 (aka Combined Age-Income Factor), combines the parent variables Avg age and Avg Income. This second variable was created based on the authors' intuition that it would help build a better model. Screenshots are provided below for reference.

Fig 3: Derived Variables
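The report does not state the formulas behind the derived fields, so the sketch below is only one plausible way to build them: a concatenated category for Derive1 and a product for Derive3. Both formulas are assumptions, as are the field names MOSHOOFD, MOSTYPE and MINKGEM, which are taken from the public data dictionary; MGEMLEEF is the Avg age field listed in section 3.3.

```python
def add_interaction_fields(record):
    """Return a copy of the record with two illustrative derived fields."""
    out = dict(record)
    # Assumed formula: combine main type and subtype into a single category.
    out["Derive1"] = f'{record["MOSHOOFD"]}_{record["MOSTYPE"]}'
    # Assumed formula: multiply the average age and average income codes.
    out["Derive3"] = record["MGEMLEEF"] * record["MINKGEM"]
    return out


# Hypothetical customer signature.
customer = {"MOSHOOFD": 8, "MOSTYPE": 33, "MGEMLEEF": 3, "MINKGEM": 4}
enriched = add_interaction_fields(customer)
```

Treating the combined type as a category keeps each type and subtype pairing distinct, whereas the product collapses age and income into one ordinal scale.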

3.8 Balancing the training data

The training dataset contains very few positive cases, i.e. records with CARAVAN=1. Models constructed using this dataset may therefore not be the best predictors of positive cases. Clementine provides a feature called Balancing, which creates more signatures based on conditions and so increases the overall share of positive cases in the dataset. The authors chose a factor of 6, which brought the class shares to a somewhat better proportion (72%:28%).

Fig 4: Balancing
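The effect of a balancing factor on class shares can be checked with a small calculation. The sketch below is a simplified stand-in for the balancing feature that simply repeats each positive record; the function names are hypothetical, and the sample of 6 positives per 100 records is an assumption chosen only because it roughly reproduces the 72%:28% split reported above.

```python
def balance_by_duplication(records, target_field, target_value, factor):
    """Repeat every record whose target equals target_value `factor` times."""
    balanced = []
    for record in records:
        copies = factor if record[target_field] == target_value else 1
        balanced.extend([record] * copies)
    return balanced


def class_share(records, target_field, target_value):
    """Fraction of records whose target equals target_value."""
    hits = sum(1 for r in records if r[target_field] == target_value)
    return hits / len(records)


# Hypothetical sample: 6 positives per 100 records (not a figure from the report).
sample = [{"CARAVAN": 1}] * 6 + [{"CARAVAN": 0}] * 94
balanced = balance_by_duplication(sample, "CARAVAN", 1, factor=6)
share_before = class_share(sample, "CARAVAN", 1)
share_after = class_share(balanced, "CARAVAN", 1)
```

Under these assumptions the positive share rises from 6% to 36 out of 130 records, or about 28%. Only the training data is altered; the test dataset keeps its original class mix.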


3.9 Constructing second level models with Training dataset

The second level models were built with the balanced dataset. The attributes of the nodes were kept as in the first level, except for the C5 node, in which the boosting value was changed to 5 because the software did not have enough memory to run with a value of 15.

3.10 Running the second level Models with Test dataset

The trained second level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. The Decision Tree model came out with the highest accuracy, at 90.48%. These results were maintained as the second level benchmark. Screenshots are provided below.

Fig 5: Second Level Models Analysis Node Results


Fig 6: Second Level Models Gain and Lift Charts

3.11 Constructing third level models by adding new interaction variables

The third level model building step differs from the second level in terms of data fields: the two new interaction variables, Derive 1 and Derive 2, were created. No additional balancing was done.

3.12 Running the third level models with Test dataset

The trained third level models were run with the test dataset, and the results of the different modelling techniques were compared using the Analysis and Evaluation nodes. The Neural Network model gave the best accuracy rate, at 90.1%.

Fig 9: Third Level Models Analysis Node Results


Fig 10: Third Level Models Gain and Lift Charts

3.13 Final Results Interpretation

The table below compares the output of the Analysis node from all three levels. There is no marked improvement from one level to the next; test accuracy in fact falls after the first level. It has been inferred that balancing the training dataset before building the model does not produce a better model.

In Level 1 (base dataset): the highest accuracy is generated by both Decision Tree and Logistic Regression.

In Level 2 (model built with balanced dataset): the highest accuracy is generated by Decision Tree.

In Level 3 (model built with balanced dataset and interaction variables): the highest accuracy is generated by Neural Network.

Technique            Factor                   1st level   2nd level   3rd level
Logistic Regression  Test dataset accuracy    94.00%      87.50%      87.50%
Decision Tree        Test dataset accuracy    94.00%      90.48%      89.75%
Neural Network       Test dataset accuracy    92.05%      90.12%      90.10%
Combined             Agreement with CARAVAN   94.52%      95.11%      94.80%

Table 1: Level Comparison
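The level-by-level comparison can be reproduced from the test accuracies in Table 1. The sketch below uses only those reported figures; the variable and function names are hypothetical.

```python
# Test dataset accuracy (%) per technique for levels 1, 2 and 3, from Table 1.
accuracy = {
    "Logistic Regression": [94.00, 87.50, 87.50],
    "Decision Tree": [94.00, 90.48, 89.75],
    "Neural Network": [92.05, 90.12, 90.10],
}


def best_per_level(acc):
    """Return, for each level, the top accuracy and the techniques achieving it."""
    n_levels = len(next(iter(acc.values())))
    result = []
    for level in range(n_levels):
        top = max(values[level] for values in acc.values())
        winners = sorted(name for name, values in acc.items() if values[level] == top)
        result.append((top, winners))
    return result


best = best_per_level(accuracy)
# Change in the best test accuracy between the first and the third level.
change_level1_to_level3 = round(best[2][0] - best[0][0], 2)
```

The best test accuracy falls by 3.9 percentage points between the first and third levels, which is consistent with the conclusion that balancing and the added interaction variables did not improve the models.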