Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12


Page 1: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Lecture 10

MARK2039

Summer 2006

George Brown College

Wednesday 9-12

Page 2: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Assignment 8: Geocoding example

• Example:
  – A retailer has the following information:
    • Name and address of its customers
    • Address of its stores
    • Stats Can information
  – As a marketer, how would you intelligently use this information?
    • Get postal codes of customers and stores
    • Get geocodes (latitude and longitude of each postal code)
    • Calculate the distance between each customer and the nearest store
    • Create a trading area around each store to determine the relevant customers for that store
    • Identify the best stores and compare the demographics of the best stores vs. the remaining stores
    • Use the above learning to promote non-performing stores whose customer demographic makeup is similar to that of the best stores
    • Use the above info to determine where to open up or perhaps close stores
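The distance and trading-area steps above can be sketched in a few lines. This is only an illustration: the haversine formula is one common way to turn latitude and longitude into distance, and the store names, customer IDs, coordinates, and 5 km radius are all hypothetical values, not data from the assignment.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_store(customer, stores):
    """Return (store_name, distance_km) for the store closest to the customer."""
    distances = (
        (name, haversine_km(customer[0], customer[1], lat, lon))
        for name, (lat, lon) in stores.items()
    )
    return min(distances, key=lambda pair: pair[1])


def trading_area(customers, stores, store_name, radius_km):
    """Customers whose nearest store is store_name and who live within radius_km."""
    selected = []
    for cust_id, location in customers.items():
        name, dist = nearest_store(location, stores)
        if name == store_name and dist <= radius_km:
            selected.append(cust_id)
    return selected


# Hypothetical geocodes (latitude, longitude) looked up from postal codes.
stores = {"Downtown": (43.65, -79.38), "North": (43.77, -79.41)}
customers = {"C1": (43.66, -79.39), "C2": (43.78, -79.40), "C3": (44.30, -79.40)}

for cust_id, location in customers.items():
    name, dist = nearest_store(location, stores)
    print(f"{cust_id}: nearest store {name}, {dist:.1f} km")

print(trading_area(customers, stores, "Downtown", radius_km=5.0))
```

Once each customer is assigned to a trading area, the demographic comparison between the best stores and the remaining stores can be run on those customer groups.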

Page 3: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Assignment 8

• Why do we look at correlation analysis as our first statistical exercise in the data mining process?

• It allows us to use statistics as an initial prescreening tool for eliminating variables from the data mining exercise.

Page 4: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Assignment 8

• Give me an example of a correlation table of 5 variables where two variables are significant and three variables are not significant. Provide correlation values that support your results.

Variable          Corr. Coefficient   Stat. Sign.
Spend              0.43               99%
Age               -0.35               95%
Income             0.05               20%
Household Size     0.04               15%
Live in Quebec    -0.03               10%

Spend      Response Rate
<100       1%
100-200    2%
200-300    3%
300-400    4%
400+       5%

Age        Response Rate
<20        5%
20-40      4%
40-50      3%
50-60      2%
60+        1%

Income     Response Rate
<30M       1%
30M-40M    2%
40M-50M    1%
50M-60M    2%
60M+       1%

Household Size   Response Rate
1                1%
2                2%
3                1%
4                2%
4+               1%

Live in Quebec   Response Rate
Yes              1.99%
No               2.01%

Page 5: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Recapping from last week

• Geocoding
  – What are the key things to think of?
    • Look at the answer to the Assignment 8 geocoding example. Geocoding gives us numbers to calculate the distance between two postal codes.

• More material on correlation analysis

• How do EDA reports tie into the correlation analysis?
  – They are trend-like reports which demonstrate why a given variable has a strong relationship with the objective function.

• How should we present the final results of a model?

Model Variable     Impact on Response   Contribution to Model
Live in Quebec     positive             85%
Income             negative             7%
Behaviour Score    negative             5%
# of promotions    negative             3%

How is the above derived? From the partial R² of each variable divided by the total R² of the equation.
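As a rough sketch of that derivation: each variable's contribution is its partial R² divided by the total R². The partial R² values below are hypothetical, chosen only so that they reproduce the percentages in the table, and the total R² is assumed to be the sum of the partial values.

```python
def contributions(partial_r2):
    """Share of the total R-squared attributable to each variable."""
    total_r2 = sum(partial_r2.values())  # assumption: total R2 = sum of partial R2
    return {name: value / total_r2 for name, value in partial_r2.items()}


# Hypothetical partial R-squared values (not from the lecture).
partial_r2 = {
    "Live in Quebec": 0.170,
    "Income": 0.014,
    "Behaviour Score": 0.010,
    "# of promotions": 0.006,
}

shares = contributions(partial_r2)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```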

Page 6: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Notion of Lift

• What is lift: the performance of a group relative to the performance of the benchmark, expressed as an index where 100 means no difference

• Examples:

Type of Activity                      Untargeted/Benchmark   Targeted/Challenger   Lift
Acquisition Campaign Response Rate    1%                     2%                    200
Retention Campaign Churn Rate         15%                    25%                   167
Credit Card Loss Rate                 5%                     8%                    160
Product Affinity Rate                 10%                    30%                   300

The targeted group represents those names as determined by a data mining tool such as a predictive model.
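A minimal sketch of the lift calculation, using the benchmark and targeted rates from the examples above. The function name is illustrative, and results are rounded to the nearest whole number.

```python
def lift(benchmark_rate, targeted_rate):
    """Lift index: targeted performance relative to the benchmark, times 100."""
    return 100 * targeted_rate / benchmark_rate


examples = [
    ("Acquisition Campaign Response Rate", 0.01, 0.02),
    ("Retention Campaign Churn Rate", 0.15, 0.25),
    ("Credit Card Loss Rate", 0.05, 0.08),
    ("Product Affinity Rate", 0.10, 0.30),
]

for activity, benchmark, targeted in examples:
    print(f"{activity}: lift {round(lift(benchmark, targeted))}")
```

A value below 100 means the targeted group performed worse than the benchmark.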

Page 7: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Notion of Lift

• Examples of cases where lift is below 100

Type of Activity                      Untargeted/Benchmark   Targeted/Challenger   Lift
Acquisition Campaign Response Rate    1%                     0.5%                  50
Retention Campaign Churn Rate         15%                    10%                   67
Credit Card Loss Rate                 5%                     2%                    40
Product Affinity Rate                 10%                    6%                    60

Page 8: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Validating the Model: Example of a Gains Chart

• Listed below are the hard numbers that might comprise a lift curve

Revenue per order is $60. Cost of 1 mail piece is $.855. The benefits of modelling are the promotion costs saved by mailing fewer names, at a higher response rate, to achieve a given # of orders.

% of List           Validation   Cum.     Cum. %    Cum.   Interval   Benefits
(Ranked by          Mail         Resp.    of all    Lift   ROI
Model Score)        Quantity     Rate     Resp.
0-10%               20000        3.50%    23.33%    233    146%       $22800
10-20%              40000        3.00%    40%       200    75%        $34200
20-30%              60000        2.75%    55%       183    58%        $42750
30-40%              80000        2.50%    67%       167    23%        $45600
40-50%              100000       2.25%    75%       150    -12.3%     $42750
...
90-100%             200000       1.50%    100%      100    -58%       $0

How might this be plotted? In class we saw this as a straight, decreasing line when plotting the interval response rate against the deciles. If we plot the cum. % of responders instead, the shape is a parabola-type curve, with a larger parabola representing a better model. Likewise, a steeper slope when plotting the interval response rate against the deciles represents a stronger model.

Page 9: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Validating the Model: Calculating the metrics on the gains charts.

• Cum. % of Responders in top 10%:
  – Total responders: 200000 x 1.5% = 3000
  – # of responders in top 10%: 20000 x 3.5% = 700
  – Cum. % in top 10%: 700/3000 = 23%

• Cum. Lift in top 10%:
  – Average response rate: 1.5%
  – Cum. response rate in top 10%: 3.5%
  – Cum. lift: (3.5%/1.5%) x 100 = 233
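The two metrics above can be reproduced for every row of the gains chart with a short sketch. The function names are illustrative; the quantities and rates are the ones given in the gains chart example.

```python
TOTAL_MAILED = 200_000
AVG_RESP_RATE = 0.015


def cum_pct_of_responders(cum_mailed, cum_resp_rate):
    """Responders captured so far as a share of all responders on the list."""
    return (cum_mailed * cum_resp_rate) / (TOTAL_MAILED * AVG_RESP_RATE)


def cum_lift(cum_resp_rate):
    """Cumulative response rate indexed against the average response rate."""
    return 100 * cum_resp_rate / AVG_RESP_RATE


# (cumulative mail quantity, cumulative response rate) from the gains chart
gains_rows = [
    (20_000, 0.0350),
    (40_000, 0.0300),
    (60_000, 0.0275),
    (80_000, 0.0250),
    (100_000, 0.0225),
]

for mailed, rate in gains_rows:
    pct = cum_pct_of_responders(mailed, rate)
    print(f"{mailed:>7}: cum. % of responders {pct:.2%}, cum. lift {round(cum_lift(rate))}")
```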

Page 10: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Calculating the metrics on the gains charts.

• Interval ROI in 10%-20%:
  – # of persons mailed: 20000
  – # of responders in 10%-20%: (40% - 23.33%) x 3000 = 500
  – Net revenue: (500 x 60) - (.855 x 20000) = 12900
  – Costs: .855 x 20000 = 17100
  – ROI: 12900/17100 = 75%

• Calculating the Benefits column at 30%:
  – Mail costs to achieve 1650 responders without modelling: ((.0275 x 60000)/.015) x .855 = 94050
  – Mail costs with modelling: 60000 x .855 = 51300
  – Benefits: 94050 - 51300 = $42750
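Both calculations can be written as small functions. This is a sketch that follows the arithmetic above; the function names are illustrative, and the $60 revenue per order and $.855 cost per piece come from the gains chart example.

```python
REVENUE_PER_ORDER = 60.0
COST_PER_PIECE = 0.855


def interval_roi(interval_mailed, interval_responders):
    """Net revenue of one interval divided by its mailing cost."""
    cost = interval_mailed * COST_PER_PIECE
    net_revenue = interval_responders * REVENUE_PER_ORDER - cost
    return net_revenue / cost


def modelling_benefit(cum_mailed, cum_resp_rate, avg_resp_rate):
    """Mailing cost saved by reaching the same responders with fewer names."""
    responders = cum_mailed * cum_resp_rate
    names_needed_without_model = responders / avg_resp_rate
    return (names_needed_without_model - cum_mailed) * COST_PER_PIECE


print(f"Interval ROI, 10%-20%: {interval_roi(20_000, 500):.0%}")
print(f"Benefits at 30%: ${modelling_benefit(60_000, 0.0275, 0.015):,.0f}")
```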

Page 11: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Gains Chart Examples

Assume a mail cost of $1.00 per piece and a revenue per order of $50.00.

Please fill in the blanks (Interval Resp. Rate, Interval Lift, Benefits, Interval ROI) for the first 4 rows.

Cum. # of Names Mailed   Cum. Response Rate
10000                    2.50%
20000                    2.25%
30000                    2.10%
40000                    1.80%
...
100000                   1%

Answers:

Qty      Cum. Resp. Rate   Interval Resp. Rate   Interval Lift   Interval ROI   Benefits
10000    2.50%             2.50%                 250             25%            $15,000
20000    2.25%             2.00%                 200             0%             $25,000
30000    2.10%             1.80%                 180             -10%           $33,000
40000    1.80%             0.90%                 90              -55%           $32,000

Interval Resp. Rate, first row: 10,000 x 0.025 = 250 responders, i.e. 2.5%
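The filled-in columns can be checked with a short sketch that derives each interval from the cumulative figures. The function and variable names are illustrative; the $1.00 cost, $50.00 revenue, and 1% average response rate are the values stated in the exercise.

```python
COST_PER_PIECE = 1.00
REVENUE_PER_ORDER = 50.00
AVG_RESP_RATE = 0.01  # cumulative rate at 100000 names

# (cumulative names mailed, cumulative response rate)
rows = [(10_000, 0.0250), (20_000, 0.0225), (30_000, 0.0210), (40_000, 0.0180)]


def fill_in(rows):
    """Return (interval rate, interval lift, interval ROI, benefits) per row."""
    results = []
    prev_mailed, prev_responders = 0, 0.0
    for cum_mailed, cum_rate in rows:
        cum_responders = cum_mailed * cum_rate
        interval_mailed = cum_mailed - prev_mailed
        interval_responders = cum_responders - prev_responders
        interval_rate = interval_responders / interval_mailed
        interval_lift = 100 * interval_rate / AVG_RESP_RATE
        interval_cost = interval_mailed * COST_PER_PIECE
        roi = (interval_responders * REVENUE_PER_ORDER - interval_cost) / interval_cost
        # Benefits: cost of the names needed at the average rate, minus names mailed.
        benefits = (cum_responders / AVG_RESP_RATE - cum_mailed) * COST_PER_PIECE
        results.append((interval_rate, interval_lift, roi, benefits))
        prev_mailed, prev_responders = cum_mailed, cum_responders
    return results


answers = fill_in(rows)
for (cum_mailed, _), (rate, lift_index, roi, benefits) in zip(rows, answers):
    print(f"{cum_mailed}: {rate:.2%}  lift {lift_index:.0f}  ROI {roi:.0%}  ${benefits:,.0f}")
```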

Page 12: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Lift Curve with Zero Model Effectiveness

% of List    Cum. Response Rate   Cum. % of Responders
0-20%        0.95%                19.50%
20%-40%      1.05%                41%
40%-60%      1.02%                61%
60%-80%      0.99%                79%
80-100%      1.00%                100%

What does this look like if we plot it on a lift curve?

A straight line rather than a parabola if we plot the cum. % of responders.

Page 13: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Gains Chart Examples

Interval response rate by model:

Cum. # of Names Mailed   Model 1   Model 2   Model 3   Model 4
10000                    2.50%     2%        1.75%     1.10%
20000                    1.75%     1.60%     1.60%     0.90%
30000                    1.45%     1.30%     1.55%     0.80%
40000                    1.25%     1.10%     1.10%     1.15%
50000                    1.00%     1%        0.90%     1.05%
60000                    0.90%     0.90%     1%        0.85%
70000                    0.75%     0.80%     0.50%     0.90%
80000                    0.55%     0.70%     0.60%     1.02%
90000                    0.35%     0.60%     0.45%     0.90%
100000                   0.30%     0.55%     0.50%     1.03%
Average                  1%        1%        1%        1%

What is the best model? Model 1.

What is the worst model? Model 4.

What are the Model 3 results telling you? We have some rank ordering all the way down to 70000 names, and then the model flattens out. We may need a separate strategy for this bottom segment.
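One hedged way to support these answers numerically is to compare the four models on two simple indicators: the lift of the first interval, and the number of times the interval response rate rises from one interval to the next (a break in rank ordering). Both indicators are illustrative choices, not measures defined in the lecture; the rates are those in the table above.

```python
AVERAGE_PCT = 1.0  # average response rate, in percent

# Interval response rates (percent) per 10000 names, from the table above.
models = {
    "Model 1": [2.50, 1.75, 1.45, 1.25, 1.00, 0.90, 0.75, 0.55, 0.35, 0.30],
    "Model 2": [2.00, 1.60, 1.30, 1.10, 1.00, 0.90, 0.80, 0.70, 0.60, 0.55],
    "Model 3": [1.75, 1.60, 1.55, 1.10, 0.90, 1.00, 0.50, 0.60, 0.45, 0.50],
    "Model 4": [1.10, 0.90, 0.80, 1.15, 1.05, 0.85, 0.90, 1.02, 0.90, 1.03],
}


def top_interval_lift(rates_pct):
    """Lift of the first (best-scored) interval against the average."""
    return 100 * rates_pct[0] / AVERAGE_PCT


def rank_order_breaks(rates_pct):
    """Count intervals whose response rate is higher than the interval before."""
    return sum(1 for prev, cur in zip(rates_pct, rates_pct[1:]) if cur > prev)


for name, rates in models.items():
    print(f"{name}: top lift {round(top_interval_lift(rates))}, "
          f"rank-order breaks {rank_order_breaks(rates)}")
```

Model 1 has the highest top-interval lift with no breaks, while Model 4 is close to the average everywhere and breaks rank order most often.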

Page 14: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Gains Chart Examples

Interval response rate by model:

Cum. # of Names Mailed   Model 1   Model 2   Model 3   Model 4
10000                    2.50%     2%        1.75%     1.10%
20000                    1.75%     1.60%     1.60%     0.90%
30000                    1.45%     1.30%     1.55%     0.80%
40000                    1.25%     1.10%     1.10%     1.15%
50000                    1.00%     1%        0.90%     1.05%
60000                    0.90%     0.90%     1%        0.85%
70000                    0.75%     0.80%     0.50%     0.90%
80000                    0.55%     0.70%     0.60%     1.02%
90000                    0.35%     0.60%     0.45%     0.90%
100000                   0.30%     0.55%     0.50%     1.03%
Average                  1%        1%        1%        1%

In each response model case, answer the following questions:

• Where would your cutoff be with a budget of $80000 and a cost per piece of $2.00?
  • 40000 names

• Where would your cutoff be if you needed to attain a forecasted order quantity of 350?
  • Between 10000 and 20000 names for models 1 and 2, between 20000 and 30000 for model 3, and between 30000 and 40000 for model 4

• Where would your optimum cutoff be, presuming that neither budget nor forecasted order quantities were constraints?
  • 50000 for models 1 and 2, and 60000 for model 3. It does not matter for model 4.
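The three cutoff questions can be sketched as follows. The budget and order-quantity rules follow directly from the questions. The optimum-cutoff rule is an assumption: the lecture does not state it, but mailing down to the deepest interval whose response rate is still at or above the 1% average reproduces the answers given for models 1 to 3.

```python
INTERVAL_SIZE = 10_000

# Interval response rates (percent) per 10000 names, from the table above.
models = {
    "Model 1": [2.50, 1.75, 1.45, 1.25, 1.00, 0.90, 0.75, 0.55, 0.35, 0.30],
    "Model 2": [2.00, 1.60, 1.30, 1.10, 1.00, 0.90, 0.80, 0.70, 0.60, 0.55],
    "Model 3": [1.75, 1.60, 1.55, 1.10, 0.90, 1.00, 0.50, 0.60, 0.45, 0.50],
    "Model 4": [1.10, 0.90, 0.80, 1.15, 1.05, 0.85, 0.90, 1.02, 0.90, 1.03],
}


def budget_cutoff(budget, cost_per_piece):
    """Number of names the budget can pay for."""
    return int(budget // cost_per_piece)


def order_cutoff_range(rates_pct, target_orders):
    """Interval (from, to) in which cumulative orders first reach the target."""
    cum_orders = 0.0
    for i, rate in enumerate(rates_pct):
        cum_orders += INTERVAL_SIZE * rate / 100
        if cum_orders >= target_orders:
            return (i * INTERVAL_SIZE, (i + 1) * INTERVAL_SIZE)
    return None  # target not reachable with this list


def optimum_cutoff(rates_pct, threshold_pct=1.0):
    """Assumed rule: deepest interval still at or above the threshold rate."""
    cutoff = 0
    for i, rate in enumerate(rates_pct):
        if rate >= threshold_pct:
            cutoff = (i + 1) * INTERVAL_SIZE
    return cutoff


print("Budget cutoff:", budget_cutoff(80_000, 2.00))
for name, rates in models.items():
    print(name, order_cutoff_range(rates, 350), optimum_cutoff(rates))
```

For Model 4 the assumed rule returns the whole list, which is consistent with the answer that the cutoff does not matter when the model shows no rank ordering.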

Page 15: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Gains Chart Examples

Avg. Response Rate: 1.75%

Cum. Names Mailed   Interval Response Rate
5000                4%
10000               3.50%
15000               3.10%
20000               2.50%
…
100000              0.40%

• Calculate the following:
  - Interval names mailed
  - Cum. response rate

• Assuming a cost per name of $1.50 and revenue per responder of $75, calculate the interval ROI and the modelling benefits for each interval.
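One possible way to work through the first four rows is sketched below. It assumes the benefits figure is calculated cumulatively at each depth, the same way as the Benefits column in the earlier gains chart example; the function and variable names are illustrative.

```python
COST_PER_NAME = 1.50
REVENUE_PER_RESPONDER = 75.00
AVG_RESP_RATE = 0.0175

# (cumulative names mailed, interval response rate) for the first four rows
rows = [(5_000, 0.040), (10_000, 0.035), (15_000, 0.031), (20_000, 0.025)]


def work_through(rows):
    """Return (interval names, cum. response rate, interval ROI, benefits) per row."""
    results = []
    prev_cum_names, cum_responders = 0, 0.0
    for cum_names, interval_rate in rows:
        interval_names = cum_names - prev_cum_names
        interval_responders = interval_names * interval_rate
        cum_responders += interval_responders
        cum_rate = cum_responders / cum_names
        interval_cost = interval_names * COST_PER_NAME
        roi = (interval_responders * REVENUE_PER_RESPONDER - interval_cost) / interval_cost
        # Assumption: benefits are cumulative, as in the earlier Benefits column.
        benefits = (cum_responders / AVG_RESP_RATE - cum_names) * COST_PER_NAME
        results.append((interval_names, cum_rate, roi, benefits))
        prev_cum_names = cum_names
    return results


worked = work_through(rows)
for (cum_names, _), (names, cum_rate, roi, benefits) in zip(rows, worked):
    print(f"{cum_names}: interval names {names}, cum. rate {cum_rate:.2%}, "
          f"ROI {roi:.0%}, benefits ${benefits:,.0f}")
```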

Page 16: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

Tracking of Models

• Two models are used in two campaigns. In campaign A, the overall response rate is 3.5%, which is above the breakeven response rate of 2%. In campaign B, the overall response rate is 1.2%, which is below the breakeven response rate of 2%. Yet the model in campaign B is more effective. Explain why.

[Figure: response rate by decile (1 to 10) for the two campaigns, with overall response rates of 3.50% and 1.20%]

The model is rank ordering names quite well for campaign B (1.2% overall), while the better campaign overall (3.5%) exhibits no rank ordering of response rate between deciles.

Page 17: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

CHAID

• "CHAID" is an acronym for Chi-square Automatic Interaction Detection

• Produces a decision-tree-like report
  – Branches and nodes

• Non-parametric approach
  – Output of the routine is a segment or group as opposed to a score

• Uses chi-square statistics to determine statistically significant breaks

• Conceptual interpretation: (Observed - Expected)/Expected. The chi-square statistic itself sums (Observed - Expected)²/Expected over all cells.

Page 18: Lecture 10 MARK2039 Summer 2006 George Brown College Wednesday 9-12

CHAID

Total: 100% of mail file, 4% response rate
  <2 years tenure: 50% of mail file, 2% response rate
    Income <50K: 20% of mail file, 2.4% response rate
    Income >50K: 30% of mail file, 1.75% response rate
  >2 years tenure: 50% of mail file, 6% response rate
    <40 yrs. old: 20% of mail file, 4.9% response rate
    >40 yrs. old: 30% of mail file, 6.7% response rate

What criteria determine the end nodes?
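As a sketch of how a chi-square statistic can flag a significant break, the code below tests the first split of the tree (tenure under vs. over 2 years). The mail file size of 10,000 names is a hypothetical assumption, since the tree only gives percentages. This is a plain Pearson chi-square on one split, not a full CHAID routine, which also handles category merging and adjusts for multiple comparisons.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


MAIL_FILE_SIZE = 10_000  # hypothetical; not given in the lecture


def node_counts(share_of_file, response_rate):
    """[responders, non-responders] for a node of the tree."""
    names = MAIL_FILE_SIZE * share_of_file
    responders = names * response_rate
    return [responders, names - responders]


# First split of the tree: tenure <2 years (2%) vs. >2 years (6%).
tenure_split = [node_counts(0.50, 0.02), node_counts(0.50, 0.06)]
statistic = chi_square(tenure_split)

CRITICAL_VALUE_5PCT_DF1 = 3.841  # chi-square critical value, 1 degree of freedom
print(f"Chi-square: {statistic:.2f}")
print("Significant break" if statistic > CRITICAL_VALUE_5PCT_DF1 else "Not significant")
```

Under this assumption the tenure split is far above the critical value. A node becomes an end node when no candidate split of it passes such a test, or when the node is too small to split further.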