8/14/2019 CEQUITY Analytics Variable Reduction
Variable Reduction
Doesn't it become tricky to resolve accurate relationships between variables when there are thousands of them? More so when each can be used to create segmentation models.
More often than not, some variables are strongly correlated with one another. Including these highly correlated variables in the modeling process increases the time the statistician spends finding a segmentation model that meets business needs. To speed up the modeling process, the predictor variables should first be grouped into similar clusters. A few variables can then be selected from each cluster; this way the analyst can quickly reduce the number of variables and shorten the modeling cycle.
Enhancing Analytical Modeling for Large Data Sets
Today, technology lets us store huge volumes of data at little or no additional cost compared to earlier days. In today's businesses we keep information in different tables in a suitable structure. A financial business, for instance, may have account data, transaction data, customer demographic data, payment data, inbound/outbound call data, campaign data, account history data and so on. For analytical purposes we collate all this information to create a single customer view, which may contain a huge number of variables. The challenge is to identify which few of them to use for modeling. In high-dimensional data sets, identifying irrelevant variables is more difficult than identifying redundant variables, so it is suggested that the redundant variables be removed first and the irrelevant ones looked for afterwards. There are several ways of identifying irrelevant variables, and the technique for identifying less important variables can differ depending on the specific analysis to be done with the final data. Below we discuss the steps to follow to reduce the variables.
Step One. Reduce variables on the basis of missing percentage
A variable with a high proportion of missing information contributes very little predictive power to a statistical model. Sometimes the missing value needs to be imputed on a judgmental basis: for example, a missing value for last month's purchase amount can indicate that the customer made no purchase in the last month, so the missing field should be replaced by zero. But in many cases a missing value is genuinely missing. In that situation imputation is not advisable when the percentage missing is very high; instead, we should remove the variables that have a high proportion of missing observations.
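This step can be sketched in a few lines of pandas; the 50% threshold, the column names, and the toy data below are illustrative assumptions, not the author's setup:

```python
import numpy as np
import pandas as pd

def drop_high_missing(df, threshold=0.5):
    """Drop columns whose fraction of missing values exceeds threshold."""
    missing_frac = df.isna().mean()          # per-column fraction of NaNs
    keep = missing_frac[missing_frac <= threshold].index
    return df[keep]

# Toy data: 'spend' is 75% missing, 'age' is complete.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "spend": [np.nan, np.nan, np.nan, 120.0],
})
reduced = drop_high_missing(df, threshold=0.5)
print(list(reduced.columns))  # ['age']
```

The judgmental imputation described above (e.g. replacing a missing purchase amount with zero) would be applied with `df["spend"].fillna(0)` before this filter, for the fields where "missing" has a known business meaning.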
Step Two. Variable reduction on the basis of percentage of equal value
There might be fields with the same value for every observation. In this situation the variable's standard deviation is zero, and we can remove such variables since they cannot contribute anything to the model. There may also be variables for which almost all (say more than 98%) of the records hold the same value. These should not be used either, as they cannot contribute much to the model. Calculating percentiles along with the minimum and maximum value of the variable helps identify such cases.
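A minimal pandas sketch of the same check, using the share of the most frequent value rather than percentiles (the 98% cutoff follows the text; the data and names are made up):

```python
import pandas as pd

def drop_near_constant(df, threshold=0.98):
    """Drop columns where the most common value covers more than threshold of rows."""
    keep = []
    for col in df.columns:
        top_frac = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_frac <= threshold:
            keep.append(col)
    return df[keep]

df = pd.DataFrame({
    "flag": [1] * 99 + [0],   # 99% identical -> dropped
    "balance": range(100),    # varies -> kept
})
reduced = drop_near_constant(df, threshold=0.98)
print(list(reduced.columns))  # ['balance']
```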
Step Three. Variable reduction among the correlated variables
It is not desirable to use a set of correlated predictor variables in cluster analysis, in any type of regression analysis, or in forecasting techniques. When we do subjective segmentation using a clustering technique, we can identify correlated variables by:
Factor analysis technique
Cluster of variable technique
Method of correlation*
(*for predictive analysis)
Factor Analysis Technique
Let us consider that we have N predictors. Do a factor analysis with M factors, where M is significantly less than N. For a specific factor we get a loading value for each variable, and the loading is high for the variables that have a strong influence on that factor. The variables that are highly correlated with each other will all get a high absolute loading on the same factor. Select one variable for the model with a high loading out of those loading heavily on the first factor. Select the second variable for the model by looking at the loadings on the second factor in the same way. Continue up to the kth factor to identify the variable corresponding to each factor. Number of k (
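The per-factor selection loop described above can be sketched with scikit-learn's `FactorAnalysis`; the synthetic data, the choice of two factors, and the correlation structure here are illustrative assumptions, not the author's setup:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# 200 customers, 6 predictors: columns 0-2 share one latent driver,
# columns 3-4 share another, column 5 is independent noise.
base1 = rng.normal(size=200)
base2 = rng.normal(size=200)
X = np.column_stack([
    base1 + 0.1 * rng.normal(size=200),
    base1 + 0.1 * rng.normal(size=200),
    base1 + 0.1 * rng.normal(size=200),
    base2 + 0.1 * rng.normal(size=200),
    base2 + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)
loadings = fa.components_  # shape (2, 6): one row of loadings per factor

selected = []
for factor in loadings:
    # Pick the not-yet-chosen variable with the highest absolute loading.
    order = np.argsort(-np.abs(factor))
    pick = next(i for i in order if i not in selected)
    selected.append(pick)
print(selected)  # one representative variable per factor
```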
Cluster of Variable Technique
In this technique the variables are split into two groups at each step on the basis of their correlation with each other. The variables within the same group have higher correlation with each other than with variables in the other group. We can impose a condition on whether a specific subgroup needs to be split further, based on a cut-off for the second-highest eigenvalue of the subgroup; the cut-off is typically chosen between 0.5 and 0.8. Once final convergence happens, you can select one or two variables from each child group for modeling.
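A simple proxy for this splitting procedure is hierarchical clustering on the distance 1 − |correlation|, so that highly correlated variables land in the same group (dedicated variable-clustering tools such as SAS PROC VARCLUS use the eigenvalue-based splits described above; the data and names here are made up):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
base1, base2 = rng.normal(size=300), rng.normal(size=300)
# Two balance variables driven by one factor, two call variables by another.
df = pd.DataFrame({
    "bal_avg":   base1 + 0.2 * rng.normal(size=300),
    "bal_max":   base1 + 0.2 * rng.normal(size=300),
    "calls_in":  base2 + 0.2 * rng.normal(size=300),
    "calls_out": base2 + 0.2 * rng.normal(size=300),
})

# Distance = 1 - |correlation|: highly correlated variables are "close".
dist = 1 - df.corr().abs()
Z = linkage(squareform(dist.values, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")

clusters = {}
for var, lab in zip(df.columns, labels):
    clusters.setdefault(lab, []).append(var)
print(clusters)  # balance variables in one group, call variables in the other
```

From each resulting group, one or two representatives (e.g. the variable most correlated with the group average) would then go into the model.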
Method of Correlation
This technique is very useful in prediction problems where we have one response variable and a set of predictors. We can first use either of the above two methods to reduce the number of predictors in stage one, and then use this method for further reduction. Let us consider a response variable Y and predictors X1, X2, …, Xn. Calculate the correlation matrix for all predictors including Y. We impose a condition on the correlation value: we keep only one of two predictors if their correlation is higher than some specific value, say r. If rij, the correlation between Xi and Xj, is greater than r, we keep Xi if ryi > ryj, where ryi is the correlation between Y and Xi. In practice r generally ranges from 0.75 to 0.9.
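A minimal pandas sketch of this pairwise rule; the threshold r = 0.8 and the toy predictors are assumptions used only to demonstrate the logic:

```python
import numpy as np
import pandas as pd

def reduce_by_correlation(X, y, r=0.8):
    """For each pair with |corr| > r, keep the one more correlated with y."""
    corr = X.corr().abs()
    target_corr = X.corrwith(y).abs()
    dropped = set()
    cols = list(X.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            a, b = cols[i], cols[j]
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > r:
                # Drop whichever of the pair is weaker against the response.
                dropped.add(b if target_corr[a] >= target_corr[b] else a)
    return [c for c in cols if c not in dropped]

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.3 * rng.normal(size=500),  # near-duplicate of x1
    "x3": rng.normal(size=500),
})
y = pd.Series(2 * x1 + rng.normal(size=500))
kept = reduce_by_correlation(X, y, r=0.8)
print(kept)  # one of the x1/x2 pair is dropped; x3 survives
```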
Note: If you feel that you still have too many variables and need to reduce them before building the actual model, you can do this on the basis of the VIF value of each predictor, obtained while performing the regression of Y on the predictors. Remove variables one at a time, each time dropping the variable whose VIF is higher than 2.5.
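A hand-rolled sketch of this VIF-based pruning using only NumPy and pandas (each predictor's VIF is 1/(1 − R²) from regressing it on the remaining predictors; the 2.5 cutoff comes from the note above, and the toy data are made up):

```python
import numpy as np
import pandas as pd

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), from regressing column j on the other columns."""
    out = {}
    vals = X.values
    for j, col in enumerate(X.columns):
        yj = vals[:, j]
        others = np.delete(vals, j, axis=1)
        A = np.column_stack([np.ones(len(yj)), others])  # intercept + others
        beta, *_ = np.linalg.lstsq(A, yj, rcond=None)
        resid = yj - A @ beta
        r2 = 1 - resid.var() / yj.var()
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

def prune_by_vif(X, cutoff=2.5):
    """Drop the worst predictor one at a time until all VIFs are <= cutoff."""
    X = X.copy()
    while X.shape[1] > 1:
        v = vif(X)
        if v.max() <= cutoff:
            break
        X = X.drop(columns=v.idxmax())
    return X

rng = np.random.default_rng(3)
x1 = rng.normal(size=400)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + 0.2 * rng.normal(size=400),  # collinear with x1 -> high VIF
    "x3": rng.normal(size=400),
})
print(list(prune_by_vif(X, cutoff=2.5).columns))  # one of x1/x2 is removed
```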
Reach us at 105-106, 1st Floor, Anand Estate, 189-A, Sane Guruji Marg, Mahalaxmi, Mumbai-400 011, India
Phone: +91 22-43453800 Fax: +91 22-43453840
For more case studies, white papers and presentations log on to www.cequitysolutions.com
Or write to [email protected]
For the latest thinking in Analytical Marketing, check out our blog at blog.cequitysolutions.com
Data mining methods simplify the extraction of key insights from a huge database. They offer the possibility of starting the analysis from any given point in it. However, without proper methods and techniques we may never be able to do so. Variable reduction greatly helps both in handling huge data and in reducing model development time, and all of this is accomplished without sacrificing the quality of the model. Identifying the right technique becomes all the easier with a better understanding of the data.
With techniques like these we, at Cequity, are able to combine data and technology and build actionable analytical marketing services to accelerate ROI-driven, real-time customer-engaged marketing. Touch base with us to learn more.
Cequity Solutions Pvt. Ltd.