Cequity Analytics: Variable Reduction

Upload: cequity-solutions

Posted on 30-May-2018


  • 8/14/2019 CEQUITY Analytics Variable Reduction

    1/3

Variable reduction

Doesn't it become tricky resolving accurate relationships between variables when there are thousands of them? More so when each can be used to create segmentation models.

More often than not, some variables are strongly correlated with one another. Including these highly correlated variables in the modeling process increases the amount of time the statistician spends finding a segmentation model that meets business needs. To speed up the modeling process, the predictor variables should be grouped into similar clusters. A few variables can then be selected from each cluster; this way the analyst can quickly reduce the number of variables and speed up the modeling process.

Enhancing Analytical Modeling for Large Data Sets with Variable Reduction


Today, technology helps us store huge volumes of data at little or no additional cost compared to earlier days. In today's business we keep information in different tables in a suitable structure. A financial business, for instance, may have account data, transaction data, customer demographic data, payment data, inbound and outbound call data, campaign data, account history data, and so on. For analytical purposes we collate all this information to create a single view of the customer, which may contain a huge number of variables. The challenge is to identify which few of them to use for modeling. In high-dimensional data sets, identifying irrelevant variables is more difficult than identifying redundant variables, so it is suggested that the redundant variables be removed first and the irrelevant ones looked for afterwards. There are several ways of identifying irrelevant variables, and the technique for identifying less important variables can differ depending on the specific analysis to be done with the final data. We discuss below the steps that should be followed to reduce the variables.

Step One. Reduce variables on the basis of missing percentage

A variable with a high percentage of missing information contributes very little predictive power to a statistical model. Sometimes the missing value needs to be imputed on a judgmental basis. For example, a missing value for amount purchased in the last month can indicate that the customer made no purchase in that month, so the missing field should be replaced by zero. But there are many cases where a missing value is genuinely missing. In this situation it is not advisable to impute the value when the percentage of missing values is very high; instead, we should remove the variables that have a high proportion of missing observations.
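A minimal sketch of this step in pandas (the 50% threshold and column names here are illustrative, not from the original):

```python
import pandas as pd
import numpy as np

def drop_high_missing(df, threshold=0.5):
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_frac = df.isna().mean()
    keep = missing_frac[missing_frac <= threshold].index
    return df[keep]

# Toy data: `income` is 75% missing; purchase amount is judgmentally imputable.
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [50000, np.nan, np.nan, np.nan],
    "last_month_purchase": [np.nan, 120.0, np.nan, 80.0],
})

# Judgmental imputation first: a missing purchase amount means no purchase.
df["last_month_purchase"] = df["last_month_purchase"].fillna(0)

reduced = drop_high_missing(df, threshold=0.5)
print(list(reduced.columns))  # → ['age', 'last_month_purchase']
```

Note that the imputation happens before the missing-percentage check, so judgmentally recoverable fields are not thrown away.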

Step Two. Variable reduction on the basis of percentage of equal value

There may be fields with the same value for every observation; for these, the variable's standard deviation is zero. We can remove this set of variables, since they cannot contribute anything to the model. There may also be variables for which almost all (say more than 98%) of the records carry the same value. We should not use these variables either, as they cannot contribute much to the model. Comparing percentiles with the minimum and maximum value of the variable helps identify such variables.
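This step can be sketched as follows, using the share of the most frequent value as the check (the 98% cut-off comes from the text; the function and column names are illustrative):

```python
import pandas as pd

def drop_near_constant(df, max_share=0.98):
    """Drop columns where one value accounts for more than `max_share` of rows."""
    keep = []
    for col in df.columns:
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share <= max_share:
            keep.append(col)
    return df[keep]

df = pd.DataFrame({
    "country_code": ["IN"] * 100,      # constant: standard deviation is zero
    "flag": [0] * 99 + [1],            # 99% of records share one value
    "balance": range(100),             # genuinely varying
})

reduced = drop_near_constant(df)
print(list(reduced.columns))  # → ['balance']
```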

Step Three. Variable reduction among the correlated variables

It is not desirable to use a set of correlated predictor variables in cluster analysis, in any type of regression analysis, or in forecasting. When we do subjective segmentation using a clustering technique, we can identify correlated variables by:

Factor analysis technique

Cluster of variable technique

Method of correlation*

(*for predictive analysis)

Factor Analysis Technique

Suppose we have N predictors. Do a factor analysis for M factors, where M is significantly less than N. For a specific factor we get a loading value for each variable; the loading will be high for those variables that have a high influence on that factor. A set of variables that are highly correlated with each other will all get high absolute loadings on the same factor. Select one variable for the model with a high loading out of those that load highly on this factor. Select the second variable for the model by looking at the loadings on the second factor in the same way. Continue up to the kth factor to identify the variable corresponding to that factor. Number of k (
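A sketch of the selection rule using scikit-learn's FactorAnalysis (the source names no tool, so this library choice and the synthetic two-factor data are assumptions):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500
f1 = rng.normal(size=n)   # latent factor behind x1, x2
f2 = rng.normal(size=n)   # latent factor behind x3, x4
X = np.column_stack([
    f1 + 0.1 * rng.normal(size=n),
    f1 + 0.1 * rng.normal(size=n),
    f2 + 0.1 * rng.normal(size=n),
    f2 + 0.1 * rng.normal(size=n),
])
names = ["x1", "x2", "x3", "x4"]

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

# components_[i, j] is the loading of variable j on factor i:
# for each factor, keep the variable with the highest absolute loading.
selected = [names[np.argmax(np.abs(loadings))] for loadings in fa.components_]
print(selected)  # one representative variable per factor
```

With two factors fitted to four variables built from two latent drivers, each factor picks out one variable from its correlated pair.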


Cluster of Variable Technique

This technique splits the variables into two groups on the basis of their correlation with each other at each step of the splitting. Variables within the same group have higher correlation with each other than with variables in other groups. We can impose a condition on whether a specific subgroup needs to be split further, based on a cut-off for the second-highest eigenvalue of the subgroup; the cut-off for the eigenvalue is typically chosen between 0.5 and 0.8. Once the final convergence happens, you can select one or two variables from each child group for the purpose of modeling.
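The classic implementation of this eigenvalue-driven splitting is SAS PROC VARCLUS; as a stand-in sketch, hierarchical clustering of variables on a correlation distance captures the same idea of grouping correlated variables and keeping one representative per group (the distance measure, cut level, and data here are my assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)
# Two tight groups of variables: {a1, a2} and {b1, b2}
X = np.column_stack([a, a + 0.05 * rng.normal(size=n),
                     b, b + 0.05 * rng.normal(size=n)])
names = ["a1", "a2", "b1", "b2"]

# Distance between variables: 1 - |correlation|
corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

# Cut the variable tree into two child groups
labels = fcluster(Z, t=2, criterion="maxclust")

# Pick one representative variable from each child group
reps = [names[np.where(labels == g)[0][0]] for g in np.unique(labels)]
print(sorted(reps))  # → ['a1', 'b1']
```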

Method of Correlation

This technique is very useful in prediction problems, where we have one response variable and a set of predictors. We can first use either of the two methods above to reduce the number of predictors in stage one, and then use this method for further reduction. Let us say the response variable is Y and the predictors are X1, X2, ..., Xn. Calculate the correlation matrix for all the predictors, including Y. We then impose a condition on the correlation value: if rij, the correlation between Xi and Xj, is greater than some specific value r, we take only one of the two predictors, keeping Xi if ryi > ryj, where ryi is the correlation between Y and Xi. In practice we generally use r in the range 0.75 to 0.9.

Note: If you feel you still have too many variables and need to reduce them before building the actual model, you can do so on the basis of the VIF value of each predictor, obtained from a regression of Y on the predictors. Remove variables one by one, each time dropping the variable whose VIF is higher than 2.5.
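As a sketch of this pruning loop, the VIFs can be read off the diagonal of the inverse of the predictor correlation matrix, which avoids running a regression per variable (the helper names and data are illustrative; statsmodels' variance_inflation_factor is a common ready-made alternative):

```python
import numpy as np
import pandas as pd

def vif_series(X):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    inv_corr = np.linalg.inv(np.corrcoef(X.values, rowvar=False))
    return pd.Series(np.diag(inv_corr), index=X.columns)

def prune_by_vif(X, threshold=2.5):
    """Iteratively drop the predictor with the highest VIF above `threshold`."""
    X = X.copy()
    while len(X.columns) > 1:
        vifs = vif_series(X)
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            break
        X = X.drop(columns=worst)
    return X

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1: huge VIF
x3 = rng.normal(size=n)
X = pd.DataFrame({"X1": x1, "X2": x2, "X3": x3})

cols = list(prune_by_vif(X).columns)
print(cols)  # X3 survives along with one member of the collinear pair
```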

Reach us at 105-106, 1st Floor, Anand Estate, 189-A, Sane Guruji Marg, Mahalaxmi, Mumbai-400 011, India. Phone: +91 22-43453800 Fax: +91 22-43453840

    For more case studies, white papers and presentations log on to www.cequitysolutions.com

    Or Write to [email protected]

For the latest thinking in Analytical Marketing, check out our blog at blog.cequitysolutions.com

Data mining methods simplify the extraction of key insights from a huge database. They offer the possibility of starting the analysis from any given point in it. However, without proper methods and techniques we may never be able to do so. Variable reduction techniques greatly help both in handling huge data and in reducing model development time, and all of this is accomplished without sacrificing the quality of the model. Identifying the right technique becomes all the easier with a better understanding of the data.

With techniques like these we, at Cequity, are able to combine data and technology to build actionable analytical marketing services that accelerate ROI-driven, real-time, customer-engaged marketing. Touch base with us to learn more.

    Cequity Solutions Pvt. Ltd.