competition16
TRANSCRIPT
Objective
• To predict which policy (and the price quoted for it) a customer is most likely to purchase.
• The data provided is historical data from an insurance company, covering both the session history and the purchase history of its customers.
Data Understanding
• Class imbalance between Policy 4 and the other classes is the major problem with the dataset.
• The dataset heavily features Policy 1 and Policy 3.
• The imbalance is severe: Policy 3 has the maximum count (25,294 records) while Policy 4 has the minimum (3,925 records).
Approach
• Analyzed the shopping patterns of customers by examining the Train.csv dataset.
• Removed duplicates and outliers (computed the standard deviation for each attribute and excluded data points falling outside the standard-deviation range).
• Normalized the data using Python.
• Problem statement consists of 2 parts:
– predicting the policy (Classification)
– predicting the cost of the policy (Regression)
• Two models were trained and tested, using two different algorithms, in Microsoft Azure Machine Learning Studio.
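The duplicate and outlier removal described above can be sketched in plain Python; the 2-sigma threshold, list-of-rows layout, and function name are assumptions for illustration, not the team's actual code:

```python
import statistics

def remove_duplicates_and_outliers(rows, n_sigma=2.0):
    """Drop duplicate rows, then drop any row whose value in some column
    lies more than n_sigma standard deviations from that column's mean."""
    # De-duplicate while preserving order.
    unique = list(dict.fromkeys(tuple(r) for r in rows))

    n_cols = len(unique[0])
    means = [statistics.mean(r[c] for r in unique) for c in range(n_cols)]
    stdevs = [statistics.pstdev(r[c] for r in unique) for c in range(n_cols)]

    def within_range(row):
        # A column with zero spread cannot flag outliers.
        return all(
            stdevs[c] == 0 or abs(row[c] - means[c]) <= n_sigma * stdevs[c]
            for c in range(n_cols)
        )

    return [list(r) for r in unique if within_range(r)]
```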
Data Preparation
Findings– Every unique customer_id (67,663 in total) has at least 3 unique
shopping_pt values, i.e. 1, 2, 3. This pattern was extracted from the Train.csv file.
– This information was combined with each customer's session history up to three shopping points, and anomalies such as duplication and non-uniformity were removed.
Data Normalization– Attributes with a high value range, such as location, were normalized
for better results.
– We used a normalize_features(feature_set) function in Python for normalization.
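The deck does not show the body of normalize_features; a minimal sketch, assuming min-max scaling of each column to [0, 1]:

```python
def normalize_features(feature_set):
    """Min-max scale each column of a list-of-rows feature set to [0, 1].

    The team's actual normalize_features is not shown in the deck;
    this is one plausible implementation.
    """
    cols = list(zip(*feature_set))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [
            # Constant columns map to 0.0 to avoid division by zero.
            (v - lo) / (hi - lo) if hi != lo else 0.0
            for v, lo, hi in zip(row, mins, maxs)
        ]
        for row in feature_set
    ]
```

Min-max scaling keeps high-range attributes such as location from dominating the learning algorithms.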
Feature Selection– Based on the Pearson correlation, the “Filter-Based Feature Selection”
module in Azure was employed to cut down irrelevant features.
Feature Selection
• Considered the Pearson correlation (the ‘r’ value), which indicates the
strength of the correlation between any two features.
• Projected the top 14 features to train and test the models. The features not used
are “record_type”, “homeowner”, “group_size”, “married_couple” and “C_previous”, which have the lowest Pearson correlation values.
• We tweaked the features and retrained our model on different combinations, but after evaluating the results, the Pearson-correlation-based selection gave the best performance.
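The per-feature ‘r’ value can be recomputed directly; a stdlib-only sketch of ranking features by absolute Pearson correlation against the target (the dictionary-of-columns layout and feature names are illustrative assumptions):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_by_correlation(feature_columns, target):
    """Sort feature names by |r| against the target, strongest first."""
    return sorted(
        feature_columns,
        key=lambda name: abs(pearson_r(feature_columns[name], target)),
        reverse=True,
    )
```

Keeping the top 14 names from such a ranking mirrors what the Filter-Based Feature Selection module does with its Pearson scoring method.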
Synthetic Minority Over-Sampling Technique (SMOTE)
• SMOTE is a technique for oversampling the minority class in our multi-class classification problem.
• Through this, the immense gap between the instance counts of the four policy classes was reduced.
• SMOTE is a common data manipulation technique for increasing the number of cases to create a more balanced dataset.
• Since policy 4 has almost seven times fewer instances than policy 3, we increased the SMOTE sampling to 300%, which improved the accuracy of the classification model by 15%.
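The Azure SMOTE module is used as a drop-in, but conceptually it interpolates new minority samples between existing ones. A naive sketch (the neighbor count k, the seed, and the sampling details are simplifications of the real algorithm):

```python
import random

def smote(minority, percent=300, k=2, seed=0):
    """Naive SMOTE sketch: for each synthetic sample, pick a minority
    point and one of its k nearest neighbors, then interpolate at a
    random position along the segment between them."""
    rng = random.Random(seed)
    n_new = len(minority) * percent // 100

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist2(p, base),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # position along the base -> neighbor segment
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic
```

At 300%, each original minority case yields three synthetic cases on average, shrinking the roughly seven-fold gap between policy 4 and policy 3.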
Policy Prediction(Classification)
Building Model• Implemented 2 different algorithms on our training set; after
evaluating their performance, Multiclass Decision Forest turned out to be the better of the two.
• The Decision Forest produced better performance and was more effective at handling the class imbalance in the data.
• The “Tune Model Hyperparameters” module helped evaluate the performance of our model for different combinations of parameter values.
• Through this we concluded that our model works best when the decision trees are few in number but high in depth.
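The “Tune Model Hyperparameters” sweep can be pictured as a plain grid search; `evaluate` here is a hypothetical callback standing in for training and validating a Multiclass Decision Forest at the given settings:

```python
import itertools

def grid_search(evaluate, param_grid):
    """Try every combination in param_grid and return the best-scoring one.

    evaluate(params) -> score is a hypothetical callback that would train
    and validate the model; higher is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In the team's case, such a sweep favoured forests with few but deep trees.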
Cost Prediction (Regression)
• Model-1– Used the Boosted Decision Tree Regression module to create an ensemble of
regression trees using boosting.
– Boosting means that every tree depends on the trees that precede it and learns by fitting their residuals.
• Model-2– Used Neural Network Regression, a customizable neural network algorithm, to
create a regression model.
Cost Prediction (Regression)
Building Model– The “Root Mean Squared Error” for Neural Network Regression came out
to 36.85, while for Boosted Decision Tree Regression it was 30, which clearly shows Boosted Decision Tree Regression works better for our dataset.
– With the help of “Tune Model Hyperparameters”, the “Coefficient of Determination” reached approximately 0.50 and the “Root Mean Squared Error” dropped to roughly 23.46.
– We found the best parameter values for Boosted Decision Tree Regression to be a maximum of 20 leaf nodes, a maximum of 20 trees, and a learning rate of 0.2.
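The two metrics reported above are standard and easy to recompute; a stdlib-only sketch:

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of Determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```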