Caravan Insurance Data Mining Assignment
K6225 Knowledge Discovery and Data Mining
By,
Sesagiri Raamkumar Aravind(G1101761F)
Thangavelu Muthu Kumaar(G1101765E)
Table of Contents
1.0 Objective
2.0 Summary of Final Results
3.0 Exercise Lifecycle
3.1 Understanding the objective of the exercise and its expectations
3.2 Understanding the data dictionary of the data set
3.3 Assigning appropriate measure values (Set/Range) for data fields
3.4 Constructing first level models with Training dataset
3.4.1 Logistic Regression
3.4.2 Decision Trees
3.4.3 Neural Networks
3.5 Running the first level Models with Test data
3.6 Performing bivariate analysis on training dataset
3.7 Creating interaction variables based on results of Step 6
3.8 Balancing the training data
3.9 Constructing second level models with Training dataset
3.10 Running the second level Models with Test dataset
3.11 Constructing third level models by adding new interaction variables
3.12 Running the third level models with Test dataset
3.13 Final Results Interpretation
1.0 Objective
The objective of this data mining exercise is to find the best possible model for predicting whether a customer will opt for caravan insurance (a mobile home policy). The techniques used are logistic regression, decision trees and neural networks.
2.0 Summary of Final Results
The models built using Logistic Regression and Decision Tree came out with the highest accuracy in comparison with the models built using Neural Networks. The best models had an accuracy of 94%. The most interesting finding of the exercise is that the base model, built on the data as originally provided without any interaction variables or balancing, gave the best results. As expected, most models had higher accuracy on the training dataset, and the accuracy rate dropped when they were run on the test dataset. Cross-validation techniques such as 10-fold cross-validation were not used in this exercise; they could have differentiated the results further.
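As a rough illustration of the 10-fold cross-validation that was not performed, the fold split could be sketched as follows. This is a stdlib Python sketch; the record count of 5822 is an assumption (the standard caravan training file), and the model fitting inside each fold is omitted.

```python
# Stdlib sketch of splitting a dataset into 10 cross-validation folds.
# 5822 records is an assumption (the standard caravan training file);
# the model fitting inside each fold is omitted.

def k_fold_indices(n_records, k=10):
    """Yield (train_idx, test_idx) pairs covering k roughly equal folds."""
    sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    indices = list(range(n_records))
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(5822, k=10))
```

Each model would then be trained on the nine training folds and scored on the held-out fold, and the ten accuracy figures averaged.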
3.0 Exercise Lifecycle
The lifecycle of the complete data mining exercise comprises the following steps:
1. Understanding the objective of the exercise and its expectations
2. Understanding the data dictionary of the data set
3. Assigning appropriate measure values (Set/Range) for data fields
4. Constructing first level Models with Training dataset
5. Running the first level Models with Test dataset
6. Performing bivariate analysis
7. Balancing the training data
8. Constructing second level Models with Training dataset
9. Dataset modification of Training dataset
10. Running the second level Models with Test dataset
11. Creating interaction variables based on results of Step 6
12. Constructing third level Models with Training dataset
13. Running the third level Models with Test dataset
14. Final Results Interpretation
3.1 Understanding the objective of the exercise and its expectations
The first and foremost step in a data mining exercise is to understand the business objective. The
business wants to use their existing customer signatures to build a predictive model for predicting the
number of mobile home policies. The model construction and its inference will be a precursor for a
potential marketing campaign to target specific customer groups. The data mining techniques that are
in the scope of this exercise are logistic regression, decision trees and neural networks.
3.2 Understanding the data dictionary of the data set
The data dictionary consists of 86 variables with an equal mix of socio-demographic and product ownership data. There are a few ordinal variables that need to be changed to numeric variables for build efficiency. The socio-demographic variables are captured at zip-code level.
3.3 Assigning appropriate measure values (Set/Range) for data fields
The measures of the variables below were manually changed to ‘Range’ in Clementine, in addition to the automatically assigned measures:
MAANTHUI - Number of houses
MGEMOMV - Avg size household
MGEMLEEF - Avg age
MGODRK - Roman Catholic
PWAPART - Contribution private third party insurance
There is an academic insight that socio-demographic variables should be converted to ‘Range’ variables so that their values can be plotted conveniently on a logistic regression curve. The authors initially retained these variables as ‘Set’ variables in order to test this postulation at a later stage.
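The re-typing can be sketched in plain Python. Only the field names come from the list above; the sample record and its values are made up.

```python
# Stdlib sketch of re-typing the listed Set fields as numeric Range
# fields. Only the field names come from the list above; the sample
# record values are made up.

RANGE_FIELDS = ["MAANTHUI", "MGEMOMV", "MGEMLEEF", "MGODRK", "PWAPART"]

def to_range(record):
    """Return a copy of the record with the Range fields cast to int."""
    fixed = dict(record)
    for field in RANGE_FIELDS:
        fixed[field] = int(fixed[field])  # ordinal code -> numeric value
    return fixed

sample = {"MAANTHUI": "1", "MGEMOMV": "3", "MGEMLEEF": "2",
          "MGODRK": "0", "PWAPART": "5", "MOSTYPE": "33"}
converted = to_range(sample)
```

Fields outside the list, such as MOSTYPE here, keep their original Set representation.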
3.4 Constructing first level models with Training dataset
The authors planned to arrive at the best model using a three-level approach. The models built in the first level are crude models constructed directly on the dataset, without any new interaction variables or data balancing. These models serve as the first benchmark for gauging subsequent improvements. Models were built using the Logistic, C5.0 and Neural Net nodes.
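The nodes above are Clementine features; as a hedged stand-in, the same three techniques can be sketched with scikit-learn on synthetic data. The data, shapes and most settings here are illustrative assumptions, not the authors' actual configuration.

```python
# Hedged scikit-learn stand-ins for the Clementine Logistic, C5.0 and
# Neural Net nodes. The data is synthetic; only the broad settings
# (min records per branch = 5, three hidden layers) echo the text.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # stand-in for the predictors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in for CARAVAN

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(min_samples_leaf=5),
    "neural_net": MLPClassifier(hidden_layer_sizes=(10, 10, 10),
                                max_iter=2000, random_state=0),
}
for model in models.values():
    model.fit(X, y)

train_accuracy = {name: m.score(X, y) for name, m in models.items()}
```

Training-set accuracy computed this way is the analogue of the first benchmark described above; test-set accuracy is what the later sections compare.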
3.4.1 Logistic Regression
No changes were made to the Logistic node, as all attribute settings appeared optimal.
3.4.2 Decision Trees
The changes made to the C5.0 node under Expert mode are:
Pruning severity was set to 5.
‘Minimum records per child branch’ was changed from 2 to 5, which was found to be the optimal value; a value of 1 impaired the results, as did the other values tested.
The ‘Use Boosting’ option was enabled so that more classifiers are created. The value was set to 15 for the first level and changed to 5 for the second and third levels.
Fig 7: C5 Model Attributes
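C5.0's boosting is a close relative of AdaBoost, so the ‘Use Boosting’ idea can be sketched with scikit-learn. The data and target below are made up; only the 15 boosting rounds mirror the first-level setting.

```python
# Sketch of the 'Use Boosting' option: C5.0's boosting is a close
# relative of AdaBoost. 15 rounds mirrors the first-level setting;
# the data and target are synthetic.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # made-up target

# Each boosting round fits another weak classifier on reweighted data.
boosted = AdaBoostClassifier(n_estimators=15, random_state=0).fit(X, y)
boosted_accuracy = boosted.score(X, y)
```

Boosting may stop early if a round fits perfectly, which is why the software's memory cost grows with the round count, as the second level later shows.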
3.4.3 Neural Networks
For the neural networks, the RBFN method was selected first, but the model did not produce better results. The final method selected was ‘Quick’. The number of hidden layers was set to 3 so that more transformations can take place. The learning rates were initially increased marginally to check for performance improvements, on the assumption that the results were converging towards a local depression in the learning curve of the networks. As a marginal increase in the alpha learning rate did not produce significant results, it was increased dramatically to 0.9 to overcome the suspected local depression. The final values are shown in the screenshot below.
Fig 8: Neural Network Attributes
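The learning-rate experiment can be sketched with a scikit-learn MLP as a loose analogue of the Clementine node. The data, layer sizes and iteration counts are assumptions; only the 0.9 step size and the three hidden layers come from the text.

```python
# Sketch of the learning-rate experiment: a small SGD-trained network
# fitted twice, once with a marginal step size and once with the
# aggressive 0.9 used to jump out of a suspected local minimum.
# Data, layer sizes and iteration counts are assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X.sum(axis=1) > 0).astype(int)

losses = {}
for lr in (0.01, 0.9):
    net = MLPClassifier(hidden_layer_sizes=(5, 5, 5),   # three hidden layers
                        solver="sgd", learning_rate_init=lr,
                        max_iter=500, random_state=0)
    net.fit(X, y)
    losses[lr] = net.loss_   # final training log-loss
```

Comparing the final losses of the two runs is the simplest way to see whether the aggressive step size helped or overshot.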
3.5 Running the first level Models with Test data
The trained first level models were run with the test dataset, and the results of the different modelling techniques were compared with the Analysis and Evaluation nodes. Logistic Regression and Decision Tree both had the best accuracy rate of 94%. The Nagelkerke R-square value with the training dataset was 16.7%. These results were kept as the first level benchmark. Screenshots are provided below.
Fig1: First Level Models Analysis Node Results
Fig 2: First Level Gain and Lift Chart
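The 94% accuracy figure deserves a caveat that the gain and lift charts help to expose. With roughly 6% positive cases (the share implied by the balancing discussion later), a model that never predicts CARAVAN=1 also scores about 94%. A stdlib sketch with toy counts, not the real test file:

```python
# Stdlib sketch of the Analysis-node accuracy figure: the fraction of
# test records where predicted CARAVAN matches the actual flag. The
# counts are toy values chosen to show why 94% needs care: with ~6%
# positives, a model that never predicts 1 still scores 94%.

def accuracy(actual, predicted):
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

actual    = [0] * 94 + [1] * 6     # ~6% positive cases
predicted = [0] * 100              # always predict CARAVAN=0
rate = accuracy(actual, predicted)  # 0.94
```

This is why the gain and lift charts, which rank records by predicted probability, are a better lens on such an imbalanced target than raw accuracy.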
3.6 Performing bivariate analysis on training dataset
This step marks the start of the second level model building process. Bivariate analysis in Clementine can be done using the Web node, which represents the relationships between the values of variables using thick and thin lines. The authors performed the analysis using both the normal web and the directed web options of the node. The directed web used the CARAVAN variable as the target, with all the other variables placed in the dependent section. This analysis was not helpful, as the relationships were spread across different values of the independent variables and CARAVAN, so no significant inferences could be made. However, the normal web analysis indicated a strong relationship between customer type and customer subtype, a potential candidate for an interaction variable.
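At its core, the relationship the Web node surfaced amounts to counting co-occurrences of two Set variables: a strong link shows up as counts piled into a few cells. A stdlib sketch on made-up records (the real values would come from the type and subtype fields of the dataset):

```python
# Stdlib sketch of the bivariate co-occurrence counting behind the Web
# node: a strong type/subtype link shows up as counts concentrated in
# a few cells. The records below are made up.
from collections import Counter

def contingency(pairs):
    """Count co-occurrences of (customer_type, customer_subtype) values."""
    return Counter(pairs)

records = [("family", "young family"), ("family", "young family"),
           ("family", "large family"), ("senior", "retired"),
           ("senior", "retired"), ("senior", "religious senior")]
table = contingency(records)
```

The thick lines of the Web node correspond to the largest counts in such a table.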
3.7 Creating interaction variables based on results of Step 6
The indication from the last step was implemented here by creating two interaction variables. The first interaction variable, Derive1 (a customer lifestyle reflector), combines the parent variables Customer Type and Customer Subtype. The second interaction variable, Derive3 (a combined age-income factor), combines the parent variables Avg age and Avg income. The latter was created based on the authors' intuition that it would help build a better model. Screenshots are provided below for reference.
Fig 3: Derived Variables
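The two derived fields can be sketched in Python. The field names MOSHOOFD (customer main type), MOSTYPE (customer subtype) and MINKGEM (average income) are assumed from the public caravan data dictionary; MGEMLEEF (avg age) appears earlier in this report. The example row is made up.

```python
# Sketch of the two derived interaction fields. MOSHOOFD, MOSTYPE and
# MINKGEM are assumed field names from the public caravan data
# dictionary; the example row is made up.

def add_interactions(record):
    out = dict(record)
    out["Derive1"] = f'{record["MOSHOOFD"]}_{record["MOSTYPE"]}'  # type x subtype
    out["Derive3"] = record["MGEMLEEF"] * record["MINKGEM"]       # age x income
    return out

row = {"MOSHOOFD": 8, "MOSTYPE": 33, "MGEMLEEF": 3, "MINKGEM": 4}
derived = add_interactions(row)
```

Concatenating the two Set codes gives a categorical interaction, while multiplying the two ordinal codes gives a numeric one; both mirror what a Clementine Derive node produces.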
3.8 Balancing the training data
It was noticed that the training dataset is not highly representative of positive cases, i.e. CARAVAN=1. Therefore, models constructed using this dataset may not be the best predictors of positive cases. Clementine provides a feature called Balancing to create more records matching given conditions, increasing the overall share of positive cases in the dataset. The authors chose a factor of 6, which gives the dataset a somewhat better value share (72%:28%).
Fig 4: Balancing
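The Balance node's effect can be sketched in plain Python: replicating positive records by a factor of 6 on toy counts with a ~6% positive share reproduces roughly the 72%:28% split quoted above.

```python
# Stdlib sketch of the Balance node: replicate CARAVAN=1 records by a
# factor of 6. The toy counts assume ~6% positives, which yields a
# 94:36 record mix, i.e. roughly the 72%:28% share quoted above.

def balance(records, factor=6):
    out = []
    for rec in records:
        out.extend([rec] * (factor if rec["CARAVAN"] == 1 else 1))
    return out

data = [{"CARAVAN": 1}] * 6 + [{"CARAVAN": 0}] * 94
balanced = balance(data)
positive_share = sum(r["CARAVAN"] for r in balanced) / len(balanced)
```

Note that replication adds no new information; it only reweights the positive cases, which is why the later levels show no accuracy gain from it.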
3.9 Constructing second level models with Training dataset
The second level models were built with the balanced dataset. The attributes of the nodes were kept from the first level, except for the C5.0 node, in which the boosting value was changed to 5 because the software did not have enough memory to run with a value of 15.
3.10 Running the second level Models with Test dataset
The trained second level models were run with the test dataset, and the results of the different modelling techniques were compared with the Analysis and Evaluation nodes. The Decision Tree model came out with the highest accuracy of 90.48%. These results were kept as the second level benchmark. Screenshots are provided below.
Fig 5: Second Level Models Analysis Node Results
Fig 6: Second Level Models Gain and Lift Charts
3.11 Constructing third level models by adding new interaction variables
The third level model building step differs from the second level in its data fields: the two new interaction variables, Derive1 and Derive3, were added. No additional balancing was done.
3.12 Running the third level models with Test dataset
The trained third level models were run with the test dataset, and the results of the different modelling techniques were compared with the Analysis and Evaluation nodes. The Neural Network model gave the best accuracy rate, at 90.10%.
Fig 9: Third Level Models Analysis Node Results
Fig 10: Third Level Models Gain and Lift Charts
3.13 Final Results Interpretation
The table below compares the output of the Analysis node from all three levels. There is no marked improvement from level to level; it can be inferred that building the model after balancing the training dataset does not produce a better model.
In Level 1 (base dataset): the highest accuracy is achieved by both Decision Tree and Logistic Regression.
In Level 2 (models built with the balanced dataset): the highest accuracy is achieved by Decision Tree.
In Level 3 (models built with the balanced dataset and interaction variables): the highest accuracy is achieved by Neural Network.
Technique            Factor                   1st level  2nd level  3rd level
Logistic Regression  Test dataset accuracy    94.00%     87.50%     87.50%
Decision Tree        Test dataset accuracy    94.00%     90.48%     89.75%
Neural Network       Test dataset accuracy    92.05%     90.12%     90.10%
Combined             Agreement with CARAVAN   94.52%     95.11%     94.80%
Table 1: Level Comparison