8/14/2019 CEQUITY Analytics Variable Reduction
Variable Reduction
Doesn't it become tricky to resolve accurate relationships between variables when there are thousands of them? More so when each can be used to create segmentation models.
More often than not, some variables are strongly correlated with one another. Including these highly correlated variables in the modeling process increases the time the statistician spends finding a segmentation model that meets business needs. To speed up the modeling process, the predictor variables should first be grouped into similar clusters. A few variables can then be selected from each cluster; this way the analyst can quickly reduce the number of variables and shorten the modeling cycle.
Enhancing Analytical Modeling for Large Data Sets
Today, technology lets us store huge volumes of data at little or no additional cost compared to earlier days. In today's businesses we keep information in different tables in a suitable structure. A financial business, for instance, may have account data, transaction data, customer demographic data, payment data, inbound/outbound call data, campaign data, account history data and so on. For analytical purposes we collate all this information to create a single customer view, which may contain a huge number of variables. The challenge is to identify which few of them to use for modeling. In high-dimensional data sets, identifying irrelevant variables is more difficult than identifying redundant variables, so it is suggested that the redundant variables be removed first and the irrelevant ones looked for afterwards. There are several ways of identifying irrelevant variables, and the technique for identifying less important variables can differ depending on the specific analysis to be done with the final data. Below we discuss the steps to follow to reduce the variables.
Step One. Reduce variables on the basis of missing percentage
A variable with a high proportion of missing information contributes very little predictive power to a statistical model. Sometimes the missing value needs to be imputed on a judgmental basis: for example, a missing value for last month's purchase amount can indicate that the customer made no purchase in the last month, so the missing field should be replaced by zero. But in many cases a missing value is genuinely missing. In that situation imputation is not advisable when the percentage missing is very high; instead, we should remove the variables that have a high proportion of missing observations.
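This step can be sketched in a few lines of pandas; the 50% threshold, the column names, and the toy data below are illustrative assumptions, not the author's setup:

```python
import numpy as np
import pandas as pd

def drop_high_missing(df, threshold=0.5):
    """Drop columns whose fraction of missing values exceeds threshold."""
    missing_frac = df.isna().mean()          # per-column fraction of NaNs
    keep = missing_frac[missing_frac <= threshold].index
    return df[keep]

# Toy data: 'spend' is 75% missing, 'age' is complete.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "spend": [np.nan, np.nan, np.nan, 120.0],
})
reduced = drop_high_missing(df, threshold=0.5)
print(list(reduced.columns))  # ['age']
```

The judgmental imputation described above (e.g. replacing a missing purchase amount with zero) would be applied with `df["spend"].fillna(0)` before this filter, for the fields where "missing" has a known business meaning.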
Step Two. Variable reduction on the basis of percentage of equal value
There might be fields with the same value for every observation. In this situation the variable's standard deviation is zero, and we can remove such variables since they cannot contribute anything to the model. There may also be variables for which almost all (say more than 98%) of the records hold the same value. These should not be used either, as they cannot contribute much to the model. Calculating percentiles along with the minimum and maximum value of the variable helps identify such cases.
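A minimal pandas sketch of the same check, using the share of the most frequent value rather than percentiles (the 98% cutoff follows the text; the data and names are made up):

```python
import pandas as pd

def drop_near_constant(df, threshold=0.98):
    """Drop columns where the most common value covers more than threshold of rows."""
    keep = []
    for col in df.columns:
        top_frac = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_frac <= threshold:
            keep.append(col)
    return df[keep]

df = pd.DataFrame({
    "flag": [1] * 99 + [0],   # 99% identical -> dropped
    "balance": range(100),    # varies -> kept
})
reduced = drop_near_constant(df, threshold=0.98)
print(list(reduced.columns))  # ['balance']
```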
Step Three. Variable reduction among the correlated variables
It is not desirable to use a set of correlated predictor variables in cluster analysis, in any type of regression analysis, or in forecasting techniques. When we do subjective segmentation using a clustering technique, we can identify correlated variables by:
Factor analysis technique
Cluster of variable technique
Method of correlation*
(*for predictive analysis)
Factor Analysis Technique
Let us consider that we have N predictors. Do a factor analysis with M factors, where M is significantly less than N. For a specific factor we get a loading value for each variable, and the loading is high for the variables that have a strong influence on that factor. The variables that are highly correlated with each other will all get a high absolute loading on the same factor. Select one variable for the model with a high loading out of those loading heavily on the first factor. Select the second variable for the model by looking at the loadings on the second factor in the same way. Continue up to the kth factor to identify the variable corresponding to each factor. Number of k (
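The per-factor selection loop described above can be sketched with scikit-learn's `FactorAnalysis`; the synthetic data, the choice of two factors, and the correlation structure here are illustrative assumptions, not the author's setup:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# 200 customers, 6 predictors: columns 0-2 share one latent driver,
# columns 3-4 share another, column 5 is independent noise.
base1 = rng.normal(size=200)
base2 = rng.normal(size=200)
X = np.column_stack([
    base1 + 0.1 * rng.normal(size=200),
    base1 + 0.1 * rng.normal(size=200),
    base1 + 0.1 * rng.normal(size=200),
    base2 + 0.1 * rng.normal(size=200),
    base2 + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)
loadings = fa.components_  # shape (2, 6): one row of loadings per factor

selected = []
for factor in loadings:
    # Pick the not-yet-chosen variable with the highest absolute loading.
    order = np.argsort(-np.abs(factor))
    pick = next(i for i in order if i not in selected)
    selected.append(pick)
print(selected)  # one representative variable per factor
```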
Cluster of Variable Technique
In this technique the variables are split into two groups at each step on the basis of their correlation with each other. The variables within the same group have higher correlation with each other than with variables in the other group. We can impose a condition on whether a specific subgroup needs to be split further, based on a cut-off for the second-highest eigenvalue of the subgroup; the cut-off is typically chosen between 0.5 and 0.8. Once final convergence happens, you can select one or two variables from each child group for modeling.
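A simple proxy for this splitting procedure is hierarchical clustering on the distance 1 − |correlation|, so that highly correlated variables land in the same group (dedicated variable-clustering tools such as SAS PROC VARCLUS use the eigenvalue-based splits described above; the data and names here are made up):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
base1, base2 = rng.normal(size=300), rng.normal(size=300)
# Two balance variables driven by one factor, two call variables by another.
df = pd.DataFrame({
    "bal_avg":   base1 + 0.2 * rng.normal(size=300),
    "bal_max":   base1 + 0.2 * rng.normal(size=300),
    "calls_in":  base2 + 0.2 * rng.normal(size=300),
    "calls_out": base2 + 0.2 * rng.normal(size=300),
})

# Distance = 1 - |correlation|: highly correlated variables are "close".
dist = 1 - df.corr().abs()
Z = linkage(squareform(dist.values, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

clusters = {}
for var, lab in zip(df.columns, labels):
    clusters.setdefault(lab, []).append(var)
print(clusters)  # balance variables in one group, call variables in the other
```

From each resulting group, one or two representatives (e.g. the variable most correlated with the group average) would then go into the model.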
Method of Correlation
This technique is very useful in prediction problems where we have one response variable and a set of predictors. We can first use either of the above two methods to reduce the number of predictors in stage one, and then use this method for further reduction. Let us consider a response variable Y and predictors X1, X2, …, Xn. Calculate the correlation matrix for all predictors including Y. We impose a condition on the correlation value: we keep only one of two predictors if their correlation is higher than some specific value, say r. If rij, the correlation between Xi and Xj, is greater than r, we keep Xi if ryi > ryj, where ryi is the correlation between Y and Xi. In practice r generally ranges from 0.75 to 0.9.
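A minimal pandas sketch of this pairwise rule; the threshold r = 0.8 and the toy predictors are assumptions used only to demonstrate the logic:

```python
import numpy as np
import pandas as pd

def reduce_by_correlation(X, y, r=0.8):
    """For each pair with |corr| > r, keep the one more correlated with y."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    dropped = set()
    cols = list(X.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > r:
                # Drop whichever of the pair is weaker against the response.
                dropped.add(b if target_corr[a] >= target_corr[b] else a)
    return [c for c in cols if c not in dropped]

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.3 * rng.normal(size=500),  # near-duplicate of x1
    "x3": rng.normal(size=500),
})
y = pd.Series(2 * x1 + rng.normal(size=500))
kept = reduce_by_correlation(X, y, r=0.8)
print(kept)  # one of the x1/x2 pair is dropped; x3 survives
```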
Note: If you feel that you still have too many variables and need to reduce them before building the actual model, you can do this on the basis of the VIF value of each predictor, obtained while performing the regression of Y on the predictors. Remove variables one at a time, each time dropping the variable whose VIF is higher than 2.5.
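A hand-rolled sketch of this VIF-based pruning using only NumPy and pandas (each predictor's VIF is 1/(1 − R²) from regressing it on the remaining predictors; the 2.5 cutoff comes from the note above, and the toy data are made up):

```python
import numpy as np
import pandas as pd

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), from regressing column j on the other columns."""
    out = {}
    vals = X.values
    for j, col in enumerate(X.columns):
        yj = vals[:, j]
        others = np.delete(vals, j, axis=1)
        A = np.column_stack([np.ones(len(yj)), others])  # intercept + others
        beta, *_ = np.linalg.lstsq(A, yj, rcond=None)
        resid = yj - A @ beta
        r2 = 1 - resid.var() / yj.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

def prune_by_vif(X, cutoff=2.5):
    """Drop the worst predictor one at a time until all VIFs are <= cutoff."""
    X = X.copy()
    while X.shape[1] > 1:
        v = vif(X)
        if v.max() <= cutoff:
            break
        X = X.drop(columns=v.idxmax())
    return X

rng = np.random.default_rng(3)
x1 = rng.normal(size=400)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.2 * rng.normal(size=400),  # collinear with x1 -> high VIF
    "x3": rng.normal(size=400),
})
print(list(prune_by_vif(X, cutoff=2.5).columns))  # one of x1/x2 is removed
```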
Reach us at 105-106, 1st Floor, Anand Estate, 189-A, Sane Guruji Marg, Mahalaxmi, Mumbai-400 011, India
Phone: +91 22-43453800 Fax: +91 22-43453840
For more case studies, white papers and presentations log on to www.cequitysolutions.com
Or write to [email protected]
For the latest thinking in Analytical Marketing, check out our blog at blog.cequitysolutions.com
Data mining methods simplify the extraction of key insights from a huge database. They offer the possibility of starting the analysis from any given point in it. However, without proper methods and techniques we may never be able to do so. Variable reduction greatly helps both in handling huge data and in reducing model development time, and all of this is accomplished without sacrificing the quality of the model. Identifying the right technique becomes all the easier with a better understanding of the data.
With techniques like these we, at Cequity, are able to combine data and technology and build actionable analytical marketing services to accelerate ROI-driven, real-time customer-engaged marketing. Touch base with us to learn more.
Cequity Solutions Pvt. Ltd.