Lecture 9 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Upload: brendan-snow

Post on 29-Dec-2015


Lecture 9

MARK2039

Summer 2006

George Brown College

Wednesday 9-12

2

Assignment 7

• An acquisition campaign with no targeting was conducted in January. The available information is as follows:

– Mail files containing name and address

– Responder files containing name and address

– 2001 Stats Can Census data available at the enumeration area

– A conversion table which maps enumeration areas to postal codes

• How would you use the above information to better target prospects to become new customers?

• Describe how the analytical file would be created.
– Answered in the last class and included in the worknotes

3

Types of Predictive Models-Assignment 7

• You have been asked to create programs that better target existing customers for insurance products. You have the following information:

Customer file      Transaction file
Account ID         Account ID
postal code        amount
start date         date
birth date         type
income
household size
behaviour score

• What would you do, and how would you create the analytical file?

• In last lecture’s study notes

4

Types of Predictive Models

• You have been asked to target customers who will not only purchase insurance but will also purchase the largest premiums.

• What type of model would be built here?

• In last lecture’s study notes

5

Creating the Analytical File-Geocoding

• Geocoding is the process that assigns a latitude-longitude coordinate to an address. Once a latitude-longitude coordinate is assigned, the address can be displayed on a map or used in a spatial search.

• Data miners often use these coordinates to calculate such things as “distance to the nearest store”.

6

Demographic Analysis

[Diagram labels: Population Count, Age Distribution, Average Age, Store Location, GeoProfile]

7

Creating the Analytical File-What is Geocoding?

• Let’s look at a sample of what some data might look like:

Postal Code   Latitude   Longitude
A1A5A2        5          10
B5V1A2        7          20
M6B2A2        10         30
T4B1A2        6          40
V4H2B5        11         50

How do we use this data to create meaningful variables?
– Use the latitude and longitude metrics, then apply the Pythagorean theorem to calculate the distance between two postal codes.
– Ex: distance between A1A5A2 and B5V1A2:
  Distance = square root of [(7-5)**2 + (20-10)**2] = square root of 104 = 10.20 degrees
– The number above then has to be converted to kilometres or miles.
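The distance calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates are the sample values from the slide's table, and the names `COORDS`, `flat_distance` and `nearest` are hypothetical helpers rather than anything from the course material. The straight-line formula treats latitude and longitude as flat grid units, as the slide does.

```python
import math

# Sample coordinates copied from the slide's table (illustrative values only).
COORDS = {
    "A1A5A2": (5, 10),
    "B5V1A2": (7, 20),
    "M6B2A2": (10, 30),
    "T4B1A2": (6, 40),
    "V4H2B5": (11, 50),
}


def flat_distance(code_a, code_b, coords=COORDS):
    """Pythagorean distance, in degrees, between two postal codes."""
    lat_a, lon_a = coords[code_a]
    lat_b, lon_b = coords[code_b]
    return math.hypot(lat_b - lat_a, lon_b - lon_a)


def nearest(code, candidates, coords=COORDS):
    """Return the candidate postal code closest to `code`.

    Hypothetical helper for a "distance to the nearest store" variable.
    """
    return min(candidates, key=lambda c: flat_distance(code, c, coords))


distance = flat_distance("A1A5A2", "B5V1A2")
print(round(distance, 2))  # 10.2
```

The result is still in degrees, so a conversion to kilometres or miles would be applied afterwards, as the slide notes.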

8

Creating the Analytical File-What is Geocoding

• Example:
– A retailer has the following information:
  • Name and address of its customers
  • Address of its stores
  • Stats Can information
– As a marketer, how would you intelligently use this information?

9

Correlation Coefficient

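The correlation coefficient measures the strength and direction of the linear relationship between a given variable and the response variable. Its square is the share of the variation in the response variable that is explained by the variation of that variable.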

Gender vs. Response        Household Size vs. Response

Gender    Response         Household Size    Response
Male      1                1                 0
Female    0                2                 1
Male      1                3                 0
Female    0                4                 1
Male      1                5                 0
Female    0                6+                1

11

Correlation Analysis

• The male gender variable has a perfect correlation of +1.

• The female gender variable has a perfect correlation of -1.

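• Household size has no consistent relationship with response (response alternates between 0 and 1 as household size increases), so its correlation coefficient is much closer to 0 than to +1 or -1.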
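As a rough check, the three correlations can be computed directly from the six sample rows. The sketch below makes two assumptions that are not in the slides: gender is coded as 0/1 indicator variables, and the "6+" household band is coded as 6. The `pearson` function is a plain implementation of the standard Pearson formula, written out here so that no extra library is needed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Gender vs. response, rows copied from the slide.
is_male = [1, 0, 1, 0, 1, 0]
is_female = [1 - m for m in is_male]
response_by_gender = [1, 0, 1, 0, 1, 0]

# Household size vs. response; "6+" is coded as 6 (assumption).
household_size = [1, 2, 3, 4, 5, 6]
response_by_household = [0, 1, 0, 1, 0, 1]

r_male = pearson(is_male, response_by_gender)
r_female = pearson(is_female, response_by_gender)
r_household = pearson(household_size, response_by_household)

print(r_male, r_female, round(r_household, 2))
```

The two gender indicators come out at exactly +1 and -1. With only six rows, the household size value is small (about 0.29) rather than exactly zero, which still reflects the lack of any consistent trend.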

12

Correlation Results

• Correlation results show the level of confidence that a given variable has a relationship with the modelled behaviour, i.e. response.

Example of output:

                          Age       Tenure   # of Products Purchased
Correlation coefficient   -0.0673   0.055    0.045
Confidence level          99.5%     98%      97%

Example of output:

                          # of Promotions Since Last Purchase   Income    Household Size
Correlation coefficient   -0.031                                -0.0045   0.001
Confidence level          96%                                   50%       20%

13

Examples-Correlation-Response Model

• Listed below is an example of a correlation matrix:

Variable                           Correlation Coeff.   Stat. Sign.
Income                             0.50                 99%
Age                                -0.45                95%
Product spend in last 12 months    -0.48                97%
Live in Quebec                     -0.01                15%
Tenure                             -0.47                96%
# in household                     0.02                 20%
# of months since last promoted    -0.55                99%
# of months since last purchase    0.02                 20%
Pay with credit card               0.49                 98%
Gender is male                     -0.46                95%

Answer the following:
• Is each variable relevant?
• What is the relationship or impact of each variable with response?
• What is the strongest variable and what is the weakest variable?

Answers:
– Income: relevant and positive impact
– Age: relevant and negative impact
– Product spend in last 12 months: relevant and negative
– Live in Quebec: not relevant and negative
– Tenure: relevant and negative
– # in household: not relevant and positive
– # of months since last promoted: relevant and negative
– # of months since last purchase: not relevant and positive
– Pay with credit card: relevant and positive
– Gender is male: relevant and negative

Strongest variable: # of months since last promoted

Weakest variable: Live in Quebec
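The reading of the matrix above can be sketched as a small script. The 95% cut-off for "relevant" is an assumption chosen because it reproduces the answers given; the slides do not state a threshold. The names `CORRELATIONS`, `SIGNIFICANCE_CUTOFF` and `summarize` are hypothetical.

```python
# (variable, correlation coefficient, statistical significance in %),
# copied from the slide's correlation matrix.
CORRELATIONS = [
    ("Income", 0.50, 99),
    ("Age", -0.45, 95),
    ("Product spend in last 12 months", -0.48, 97),
    ("Live in Quebec", -0.01, 15),
    ("Tenure", -0.47, 96),
    ("# in household", 0.02, 20),
    ("# of months since last promoted", -0.55, 99),
    ("# of months since last purchase", 0.02, 20),
    ("Pay with credit card", 0.49, 98),
    ("Gender is male", -0.46, 95),
]

SIGNIFICANCE_CUTOFF = 95  # assumed cut-off; not stated in the slides


def summarize(rows, cutoff=SIGNIFICANCE_CUTOFF):
    """Label each variable as relevant or not, with its direction of impact."""
    summary = {}
    for name, coeff, significance in rows:
        relevant = significance >= cutoff
        impact = "positive" if coeff > 0 else "negative"
        summary[name] = (relevant, impact)
    return summary


summary = summarize(CORRELATIONS)
# Strength is judged on the absolute value of the coefficient.
strongest = max(CORRELATIONS, key=lambda row: abs(row[1]))[0]
weakest = min(CORRELATIONS, key=lambda row: abs(row[1]))[0]

for name, (relevant, impact) in summary.items():
    print(f"{name}: {'relevant' if relevant else 'not relevant'}, {impact}")
print("Strongest:", strongest)
print("Weakest:", weakest)
```

Strength is ranked on the absolute value of the coefficient, so a strongly negative variable such as # of months since last promoted ranks above a moderately positive one.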

14

Exploratory Data Analysis Reports(EDA)

• After looking at the correlation reports, we also need to create EDA reports which help to better understand the relationship of a given variable with the desired marketing behaviour.

• It helps the business people and marketers to get inside the so-called black box of modelling.

15

Exploratory Data Analysis Reports(EDA)

Income          # of Observations   Response Rate
0-20000         10000               1%
20000-40000     10000               1.50%
40000-60000     10000               2%
60000-80000     10000               3%
80000+          10000               4%
Total/Average   50000               2.30%

Age             # of Observations   Response Rate
0-20            10000               3.50%
20-40           10000               2.89%
40-60           10000               2.25%
60-70           10000               1.65%
70+             10000               1.25%
Average                             2.30%
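The bottom row of each EDA report can be reproduced by weighting each band's response rate by its number of observations. The sketch below uses the figures from the two tables; `overall_response_rate` is a hypothetical helper name.

```python
# (band, number of observations, response rate in %), copied from the slide.
income_bands = [
    ("0-20000", 10000, 1.0),
    ("20000-40000", 10000, 1.5),
    ("40000-60000", 10000, 2.0),
    ("60000-80000", 10000, 3.0),
    ("80000+", 10000, 4.0),
]
age_bands = [
    ("0-20", 10000, 3.50),
    ("20-40", 10000, 2.89),
    ("40-60", 10000, 2.25),
    ("60-70", 10000, 1.65),
    ("70+", 10000, 1.25),
]


def overall_response_rate(bands):
    """Return (total observations, observation-weighted response rate in %)."""
    total_obs = sum(n for _, n, _ in bands)
    responders = sum(n * rate / 100 for _, n, rate in bands)
    return total_obs, 100 * responders / total_obs


income_total, income_rate = overall_response_rate(income_bands)
age_total, age_rate = overall_response_rate(age_bands)
print(income_total, round(income_rate, 2))
print(age_total, round(age_rate, 2))
```

The income bands give exactly 2.30%. The age bands give about 2.31%, which the slide shows as 2.30%.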

16

Exploratory Data Analysis Reports(EDA)

• Let’s take a look at an example of a binary variable:

Male      # of Observations   Response Rate
Yes       50000               2.00%
No        50000               2.60%
Average   100000              2.30%

On the next page are some examples of EDA reports of variables that are not statistically significant according to the correlation matrix.

17

Exploratory Data Analysis Reports(EDA)

• EDAs of variables that are not statistically significant

# of months since last purchase   # of Observations   Response Rate
0-2 months                        10000               2.15%
3-6 months                        10000               2.10%
6-12 months                       10000               2.45%
12-18 months                      10000               2.40%
18+ months                        10000               2.35%
Average                           50000               2.30%

Live in Quebec   # of Observations   Response Rate
yes              10000               2.02%
no               40000               2.38%
Average          50000               2.30%

18

More examples of correlation

• Previous analysis has indicated the following trends

Spending   # of Customers   Response Rate
0-100      1000             1%
100-200    1000             0.80%
200-300    1000             1.20%
300-400    1000             0.90%
400+       1000             0.95%

Tenure     # of Customers   Response Rate
< 1 year   1000             3%
1-2 yrs    1000             2.00%
2-3 yrs    1000             1.00%
3-4 yrs    1000             0.75%
4 yrs+     1000             0.30%

• Would the correlations be closer to 1, -1, or 0 here for both variables?
– Spending: closer to 0 here
– Tenure: closer to –1 here

19

More examples of correlation

Spending   # of Customers   Credit Risk
0-100      1000             1%
100-200    1000             2.00%
200-300    1000             3.00%
300-400    1000             4.00%
400+       1000             5.00%

Tenure     # of Customers   Credit Risk
< 1 year   1000             3%
1-2 yrs    1000             2.50%
2-3 yrs    1000             3.30%
3-4 yrs    1000             2.70%

• Would the correlations be closer to 1, -1, or 0 here for both variables?
– Spending: closer to +1 here
– Tenure: closer to 0 here (credit risk shows no consistent trend across the tenure bands)

• What is the learning here vs. the previous slide?
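One way to sanity-check these answers is to correlate the band order with the rate in each band, for both the response tables on the previous slide and the credit risk tables on this one. This is a simplification and an assumption of this sketch: bands are coded 1, 2, 3, ... in order, and the correlation is taken across bands rather than across individual customers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def band_correlation(rates):
    """Correlate band order (1, 2, 3, ...) with the rate observed in each band."""
    ranks = list(range(1, len(rates) + 1))
    return pearson(ranks, rates)


# Rates in %, copied from the two slides in band order.
spending_vs_response = [1.0, 0.8, 1.2, 0.9, 0.95]
tenure_vs_response = [3.0, 2.0, 1.0, 0.75, 0.3]
spending_vs_credit_risk = [1.0, 2.0, 3.0, 4.0, 5.0]
tenure_vs_credit_risk = [3.0, 2.5, 3.3, 2.7]

results = {
    "spending vs response": band_correlation(spending_vs_response),
    "tenure vs response": band_correlation(tenure_vs_response),
    "spending vs credit risk": band_correlation(spending_vs_credit_risk),
    "tenure vs credit risk": band_correlation(tenure_vs_credit_risk),
}
for label, r in results.items():
    print(f"{label}: {r:+.2f}")
```

On these figures, spending is uncorrelated with response but rises in step with credit risk, while tenure is strongly negative against response and close to zero against credit risk on the four bands shown. The same variable can therefore behave very differently depending on the behaviour being modelled.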

20

Exploratory Data Analysis Reports

Age        % of Customers   Response Rate   Response Index
under 25   25%              6%              171
25-35      25%              4.50%           128
35-50      25%              2.50%           71
50+        25%              1%              28
Average    100%             3.50%           100

What does this tell us?
– Younger customers are more likely to respond.

Household Size   % of Customers   Response Rate   Response Index
1 person         25%              4%              114
2 persons        25%              3%              86
3 persons        25%              4%              114
4+ persons       25%              3%              86
Average          100%             3.50%           100

What does this tell us?
– No trend exists here.
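The response index in these reports appears to be each band's response rate divided by the overall average rate, scaled so that the average equals 100. A minimal sketch, assuming that definition and ordinary rounding (`response_index` is a hypothetical helper name):

```python
AVERAGE_RATE = 3.5  # overall response rate in %, from the slide


def response_index(rate, average_rate=AVERAGE_RATE):
    """Index a band's response rate against the average (average = 100)."""
    return round(rate / average_rate * 100)


# Response rates in %, copied from the slide.
age_rates = {"under 25": 6.0, "25-35": 4.5, "35-50": 2.5, "50+": 1.0}
household_rates = {
    "1 person": 4.0,
    "2 persons": 3.0,
    "3 persons": 4.0,
    "4+ persons": 3.0,
}

age_index = {band: response_index(rate) for band, rate in age_rates.items()}
household_index = {
    band: response_index(rate) for band, rate in household_rates.items()
}
print(age_index)
print(household_index)
```

With ordinary rounding, two of the age bands come out one point higher than the slide (129 and 29 rather than 128 and 28). The other values match.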

21

Exploratory Data Analysis Reports

Income      % of Customers   Response Rate   Income > 40K
under 20K   16%              1.50%           0
20-30K      16%              2.50%           0
30-40K      16%              2.00%           0
40-55K      16%              6%              1
55-80K      16%              5%              1
80K+        16%              4%              1
Average     100%             3.50%

What does this mean?
– Clearly, there is more of a binary than a linear relationship here. Would create a binary variable on income >= 40K.

# of Months Since   % of        Response   Response   Months Since Last
Last Promotion      Customers   Rate       Index      Promotion (index variable)
1                   16%         2.50%      0.71       0.62
2                   16%         1.50%      0.43       0.62
3                   16%         3.75%      1.07       1.00
4                   16%         3.25%      0.93       1.00
5                   16%         6.00%      1.71       1.43
6                   16%         4.00%      1.14       1.43
Average             100%        3.50%      1.00

What does this mean?
– Not quite binary, but not perfect in a linear sense either. Would create index variables here.
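The two derived variables described above could be created along these lines. This is a sketch under assumptions: the index values 0.62, 1.00 and 1.43 are taken from the slide and assumed to apply to months 1-2, 3-4 and 5-6 respectively, months outside 1 to 6 are clamped to the nearest band, and all function and field names are hypothetical.

```python
INCOME_THRESHOLD = 40_000  # from the slide: binary variable on income >= 40K

# Index values shown on the slide; the pairing of months is an assumption.
PROMOTION_INDEX = {1: 0.62, 2: 0.62, 3: 1.00, 4: 1.00, 5: 1.43, 6: 1.43}


def income_flag(income):
    """Binary variable: 1 if income is at least 40K, otherwise 0."""
    return 1 if income >= INCOME_THRESHOLD else 0


def promotion_index(months_since_last_promotion):
    """Index variable for months since last promotion.

    The slide only covers months 1 to 6, so values outside that
    range are clamped to the nearest band (assumption).
    """
    months = min(max(months_since_last_promotion, 1), 6)
    return PROMOTION_INDEX[months]


def derive_variables(customer):
    """Return a copy of a customer record with the two derived variables added."""
    derived = dict(customer)
    derived["income_ge_40k"] = income_flag(customer["income"])
    derived["promotion_index"] = promotion_index(customer["months_since_promo"])
    return derived


example = derive_variables({"income": 52_000, "months_since_promo": 5})
print(example)
```

The binary flag captures the jump in response at 40K, while the index variable replaces the raw month count with the relative response level of its band.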

22

Creating the Final Model

• Why couldn’t we just use the results of the correlation analysis to create the model, and create index values for each significant variable?
– Age
– Tenure
– # of products purchased
– # of promotions since last purchase

Think statistics here.

Independent (predictor) variables are often correlated with one another. This is known as multicollinearity, and it must be accounted for when building any model. The correlation between independent variables will have an impact on the actual weight or coefficient of each variable within any parametric model equation (i.e. one with weights or coefficients associated with each parameter).

23

Creating the Final Model

• Need to account for interaction here

                 Age    Education   Live in Quebec   Spend
Age              1      0.3         -0.3             -0.7
Education        0.3    1           -0.6             0.9
Live in Quebec   -0.3   -0.6        1                0.2
Spend            -0.7   0.9         0.2              1

            Correlation Coefficient with Response   Confidence Level
Age         0.4                                     99%
Education   0.4                                     99%
Quebec      0.3                                     97%
Spend       0.2                                     95%

Let’s take a look at some equations
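Before looking at equations, the predictor-to-predictor matrix can be scanned for pairs that are strongly correlated with each other. The sketch below flags any pair with an absolute correlation of at least 0.7. That cut-off is an assumption for illustration, not a rule from the lecture, and the names used are hypothetical.

```python
from itertools import combinations

VARIABLES = ["Age", "Education", "Live in Quebec", "Spend"]

# Predictor-to-predictor correlations, copied from the slide.
CORR = [
    [1.0, 0.3, -0.3, -0.7],
    [0.3, 1.0, -0.6, 0.9],
    [-0.3, -0.6, 1.0, 0.2],
    [-0.7, 0.9, 0.2, 1.0],
]

THRESHOLD = 0.7  # assumed cut-off for "highly correlated"


def highly_correlated_pairs(names, matrix, threshold=THRESHOLD):
    """List predictor pairs whose absolute correlation meets the threshold."""
    pairs = []
    for i, j in combinations(range(len(names)), 2):
        if abs(matrix[i][j]) >= threshold:
            pairs.append((names[i], names[j], matrix[i][j]))
    return pairs


flagged = highly_correlated_pairs(VARIABLES, CORR)
for first, second, value in flagged:
    print(f"{first} / {second}: {value:+.1f}")
```

On this matrix the flagged pairs are Age with Spend and Education with Spend, which are the pairs most likely to distort each other's coefficients in a model equation.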

24

The Data Mining Process: Application of Data Mining Techniques-Creating the Final Model

Problems with Multicollinearity
• Example: Years of Education and Income on Response Rate

• Regression equation: Response = .50 + .00001*income - .03*yrs. of education

Correlation with Response   Years of Education   Income
Correlation coefficient     0.11                 0.12
Confidence level            99%                  99.50%

What is the problem here and what do you do?

Income and education are highly correlated, causing education to flip its sign within the model equation: its correlation with response is positive, but its coefficient is negative. I would either replace it with some other variable that does not reduce the model power too much, or create an interaction variable between the two (i.e. Education X Income).

25

Continuing to build the model

• Multivariate analytical techniques such as multiple regression, logistic regression, etc. may be employed to produce the final model.

• Final equation: Predicted Response Rate = A - B1*Age + B2*Tenure

• Corr. coeff. is +.5 for age and +.55 for tenure

• What is the problem here?
– Age has flipped: its correlation with response is positive, but its coefficient in the equation is negative.

• What other diagnostics would you undertake to better understand the situation?
– Examine the correlation coefficient between these two variables and compare this result to the other independent variable correlations. The magnitude of the age and tenure correlation should be much greater than the other independent variable correlations.

26

Continuing to build the model

Variable          Correlation
Spend             0.6
Live in Ontario   0.5
Number in House   -0.3

Response = A + (.05 X Spend) - (.03 X Live in Ontario) - (.01 X Number in House)

Variable        Correlation
# of products   0.6
Credit Score    0.4
Tenure          -0.2

Response = A - (.03 X Number of Products) + (.08 X Credit Score) - (.01 X Tenure)
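A quick way to work through these two examples is to compare the sign of each variable's correlation with the sign of its coefficient in the equation. The check below is a sketch with hypothetical names (`sign`, `flipped_variables`). It only reports a mismatch and does not diagnose its cause.

```python
def sign(value):
    """Return 1, -1 or 0 according to the sign of the value."""
    return (value > 0) - (value < 0)


def flipped_variables(correlations, coefficients):
    """Variables whose equation coefficient has the opposite sign to their correlation."""
    return [
        name
        for name, corr in correlations.items()
        if sign(corr) != sign(coefficients[name])
    ]


# First example from the slide.
correlations_1 = {"Spend": 0.6, "Live in Ontario": 0.5, "Number in House": -0.3}
coefficients_1 = {"Spend": 0.05, "Live in Ontario": -0.03, "Number in House": -0.01}

# Second example from the slide.
correlations_2 = {"# of products": 0.6, "Credit Score": 0.4, "Tenure": -0.2}
coefficients_2 = {"# of products": -0.03, "Credit Score": 0.08, "Tenure": -0.01}

flipped_1 = flipped_variables(correlations_1, coefficients_1)
flipped_2 = flipped_variables(correlations_2, coefficients_2)
print(flipped_1)
print(flipped_2)
```

In the first example Live in Ontario is flagged, and in the second # of products. A flagged variable is a prompt to inspect its correlation with the other predictors, as described on the previous slide.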

27

Continuing to build the model

• After observing the correlation results and EDAs, what can we begin to do at this point?
– Derive new variables: EDAs
– Derive new variables: multicollinearity
– Derive new variables: Factor Analysis
– Derive new variables: CHAID (will explore later)

Reference material:
– Factor Analysis: look up in any statistics handbook
– Regression: look up in the textbook under Regression and Statistics Regression

28

Continuing to build the model

• Running further statistical routines, we are able to develop a final model. The marketer or business person should receive a report that looks as follows:

Model Variable   Impact on Response   % Contribution to Model
Income           Positive             45%
Tenure           Negative             25%
Product Spend    Positive             20%
Gender is Male   Negative             10%

For those of you that have statistics training, how is the % contribution to model derived?
– Look at the partial R2 of each variable and calculate as follows: % contribution = partial R2 / total R2

29

Continuing to Build the Model

Variable    Parameter Estimate   F Value   Pr > F
Intercept   0.04331              69.06     <.0001
var 1       0.01321              18.16     <.0001
var 2       -0.00586             39.8      <.0001
var 3       -0.01181             12.07     0.0005
var 4       0.01584              13.97     0.0002
var 5       -0.01496             13.93     0.0002
var 6       -0.03684             4.31      0.038

Variable Entered   Partial R-Square   Model R-Square
var 4              0.0036             0.0036
var 3              0.0034             0.007
var 1              0.0016             0.0086
var 2              0.0007             0.0092
var 6              0.0009             0.0102
var 5              0.0003             0.0105

30

Continuing to Build the Model

Variable Impact Strength

var 4 + 34.29%

var 3 - 32.38%

var 1 + 15.24%

var 2 - 6.67%

var 6 - 8.57%

var 5 - 2.86%

What would be the final equation in terms of the sign?

The equation should have the same signs as seen above in the Impact column.
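The Impact and Strength columns can be reproduced from the regression output on the previous slide: the sign comes from each parameter estimate, and the strength is the variable's partial R-square divided by the final model R-square. A minimal sketch using those published figures (the dictionary and variable names are hypothetical):

```python
# Parameter estimates and partial R-squares copied from the regression output.
ESTIMATES = {
    "var 1": 0.01321,
    "var 2": -0.00586,
    "var 3": -0.01181,
    "var 4": 0.01584,
    "var 5": -0.01496,
    "var 6": -0.03684,
}
PARTIAL_R2 = {
    "var 4": 0.0036,
    "var 3": 0.0034,
    "var 1": 0.0016,
    "var 2": 0.0007,
    "var 6": 0.0009,
    "var 5": 0.0003,
}
TOTAL_R2 = 0.0105  # final Model R-Square after the last variable is entered

report = {}
for name, partial in PARTIAL_R2.items():
    impact = "+" if ESTIMATES[name] > 0 else "-"
    strength = round(100 * partial / TOTAL_R2, 2)  # % contribution to model
    report[name] = (impact, strength)

for name, (impact, strength) in report.items():
    print(f"{name}  {impact}  {strength:.2f}%")
```

Each strength is the share of the model's total explained variation attributed to that variable, so the six values add up to roughly 100%.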

31

Continuing to build the model

• What would you do here?

Model Variable    Impact on Response   Contribution to Model
Live in Quebec    positive             85%
Income            negative             7%
Behaviour Score   negative             5%
# of promotions   negative             3%

I would conduct upfront segmentation, as Live in Quebec is overwhelmingly strong and essentially indicates that we have a one-variable model. Create two segments, Live in Quebec and Rest of Canada, and perhaps develop models for each of these segments.

32

Continuing to build the model

Model Variable   Impact on Response   % Contribution to Model
Income           Positive             45%
Tenure           Negative             25%
Product Spend    Positive             20%
Gender is Male   Negative             10%

• Suppose we have the following equation:

Response = +.09
  +.05 X Income
  +.06 X Tenure
  +.08 X Product Spend
  -.04 X Male

• What is the problem here?

• Problem: the tenure sign is inconsistent between the report and the actual equation. Double-check the actual equation and coefficient signs.