Targeted Marketing, KDD Cup and Customer Modeling
Post on 19-Dec-2015
![Page 2: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/2.jpg)
Outline
Direct Marketing
Review: Evaluation: Lift, Gains
KDD Cup 1997
Lift and Benefit estimation
Privacy and Data Mining
![Page 3: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/3.jpg)
Direct Marketing Paradigm
Find most likely prospects to contact
Not everybody needs to be contacted
Number of targets is usually much smaller than number of prospects
Typical applications:
retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition
...
![Page 4: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/4.jpg)
Direct Marketing Evaluation
Accuracy on the entire dataset is not the right measure
Approach:
develop a target model
score all prospects and rank them by decreasing score
select top P% of prospects for action
Evaluate performance on top P% using Gains and Lift
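The approach above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any KDD Cup entrant: the function names and the toy `scores` and `labels` lists are assumptions made for the example. It ranks prospects by decreasing score, selects the top P, and reports cumulative percent hits (CPH, i.e. Gains) and Lift.

```python
def cumulative_pct_hits(scores, labels, pct):
    """CPH(pct): share of all targets found in the top `pct` of the ranked list."""
    # Rank prospects by decreasing model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_selected = round(len(ranked) * pct)
    hits = sum(label for _, label in ranked[:n_selected])
    return hits / sum(labels)


def lift(scores, labels, pct):
    """Lift(pct) = CPH(pct) / pct; a random list has lift 1."""
    return cumulative_pct_hits(scores, labels, pct) / pct


# Hypothetical toy data: 10 prospects, 3 of them targets (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

for pct in (0.2, 0.3, 1.0):
    print(pct, cumulative_pct_hits(scores, labels, pct), lift(scores, labels, pct))
```

Selecting the whole list always gives CPH = 1 and Lift = 1, which is why the evaluation focuses on the top P% rather than on accuracy over the entire dataset.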
![Page 5: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/5.jpg)
CPH (Gains): Random List vs Model-ranked list
[Figure: gains chart, Cumulative % Hits (y-axis, 0 to 100) vs Pct list (x-axis, 5 to 95), with Random and Model curves]
5% of random list have 5% of targets, but 5% of model-ranked list have 21% of targets: CPH(5%, model) = 21%.
![Page 6: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/6.jpg)
Lift Curve
[Figure: lift curve, Lift (y-axis, 0 to 5) vs percent of the list (x-axis, 5 to 95)]
Lift(P) = CPH(P) / P
P -- percent of the list
Lift (at 5%) = 21% / 5% = 4.2 times better than random
![Page 7: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/7.jpg)
KDD-CUP 1997
Task: given data on past responders to fund-raising, predict most likely responders for new campaign
Population of 750K prospects
10K responded to a broad campaign mailing (1.4% response rate)
Analysis file included a stratified (non-random) sample of 10K responders and 26K non-responders (28.7% response rate)
75% used for learning; 25% used for validation
target variable removed from the validation data set
![Page 8: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/8.jpg)
KDD-CUP 1997 Data Set
321 fields/variables with ‘sanitized’ names and labels:
Demographic information
Credit history
Promotion history
Significant effort on data preprocessing:
leaker detection and removal
![Page 9: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/9.jpg)
KDD-CUP Participant Statistics
45 companies/institutions participated:
23 research prototypes
22 commercial tools
16 contestants turned in their results:
9 research prototypes
7 commercial tools
![Page 10: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/10.jpg)
KDD-CUP Algorithm Statistics
| Algorithm | # of Entries | Ave. Score |
|---|---|---|
| Rules | 2 | 87 |
| k-NN | 1 | 85 |
| Bayesian | 3 | 83 |
| Multiple/Hybrid | 4 | 79 |
| Other | 2 | 68 |
| Decision Tree | 4 | 44 |
Of the 16 software/tools… (Score as % of best)
![Page 11: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/11.jpg)
KDD Cup 97 Evaluation
Best Gains at 40%:
Urban Science
BNB
MineSet
Best Gains at 10%:
BNB
Urban Science
MineSet
![Page 12: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/12.jpg)
KDD-CUP 1997 Awards
The GOLD MINER award is jointly shared by two contestants this year:
1) Charles Elkan, Ph.D., from University of California, San Diego, with his software BNB, Boosted Naive Bayesian Classifier
1) Urban Science Applications, Inc., with their software gain, Direct Marketing Selection System
The BRONZE MINER award went to the runner-up:
3) Silicon Graphics, Inc., with their software MineSet
![Page 13: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/13.jpg)
KDD-CUP Results Discussion
Top finishers very close
Naïve Bayes algorithm was used by 2 of the top 3 contestants (BNB and MineSet)
BNB and MineSet did little data preprocessing
MineSet used a total of 6 variables in their final model
Urban Science implemented a tremendous amount of automated data preprocessing and exploratory data analysis and developed more than 50 models in an automated fashion to get to their results
![Page 14: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/14.jpg)
KDD Cup 1997: Top 3 results
Top 3 finishers are very close
![Page 15: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/15.jpg)
KDD Cup 1997 – worst results
Note that the worst result (C6) was actually worse than random.
Competitor names were kept anonymous, apart from the top 3 winners
![Page 16: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/16.jpg)
Better Model Evaluation?
Comparing Gains at 10% and 40% is ad-hoc
Are there more principled methods?
Area Under the Curve (AUC) of Gains Chart
Lift Quality
Ultimately, financial measures: Campaign Benefits
![Page 17: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/17.jpg)
Model Evaluation: AUC
Area Under the Curve (AUC) is defined as the difference between the Gains and Random curves
[Figure: gains chart, Cum % Hits (y-axis) vs Selection (x-axis)]
![Page 18: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/18.jpg)
Model Evaluation: Lift Quality
See Measuring Lift Quality in Database Marketing, Piatetsky-Shapiro and Steingold, SIGKDD Explorations, December 2000.
LQ = (AUC(Model) – AUC(Random)) / (AUC(Perfect) – AUC(Random))
![Page 19: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/19.jpg)
Lift Quality (Lquality)
For a perfect model, Lquality = 100%
For a random model, Lquality = 0
For KDD Cup 97:
Lquality(Urban Science) = 43.3%
Lquality(Elkan) = 42.7%
However, small differences in Lquality are not significant
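As a rough illustration of the Lift Quality ratio, the sketch below integrates a gains curve with the trapezoidal rule. It rests on stated assumptions rather than on the cited paper's exact procedure: the gains curve is given as (P, CPH) fractions that include (0, 0) and (1, 1), the random curve is the diagonal (area 0.5), and a perfect model reaches 100% of hits at P = T (area 1 - T/2). The function names and the sample curve are hypothetical.

```python
def area_under_gains(points):
    """Trapezoidal area under a gains curve given as (P, CPH) fractions."""
    pts = sorted(points)
    area = 0.0
    for (p0, c0), (p1, c1) in zip(pts, pts[1:]):
        area += (p1 - p0) * (c0 + c1) / 2
    return area


def lift_quality(points, target_rate):
    """(AUC(Model) - AUC(Random)) / (AUC(Perfect) - AUC(Random))."""
    auc_model = area_under_gains(points)
    auc_random = 0.5                      # assumption: diagonal gains curve
    auc_perfect = 1 - target_rate / 2     # assumption: all targets in the first T
    return (auc_model - auc_random) / (auc_perfect - auc_random)


# Hypothetical gains curve, not taken from the KDD Cup results.
model_curve = [(0.0, 0.0), (0.1, 0.3), (0.4, 0.7), (1.0, 1.0)]
print(lift_quality(model_curve, target_rate=0.014))
```

Under these assumptions a random curve scores 0 and a perfect curve scores 1, matching the two reference points listed above.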
![Page 20: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/20.jpg)
Estimating Profit: Campaign Parameters
Direct Mail example:
N -- number of prospects, e.g. 750,000
T -- fraction of targets, e.g. 0.014
B -- benefit of hitting a target, e.g. $20
Note: this is a simplification; actual benefit will vary
C -- cost of contacting a prospect, e.g. $0.68
P -- percentage selected for contact, e.g. 10%
Lift(P) -- model lift at P, e.g. 3
![Page 21: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/21.jpg)
Contacting Top P of Model-Sorted List
Using previous example, let selection be P = 10% and Lift(P) = 3
Selection size = N · P, e.g. 75,000
Random has N · P · T targets in the first P of the list, e.g. 1,050
Q: How many targets are in model P-selection?
Model has more by a factor Lift(P), or N · P · T · Lift(P) targets in the selection, e.g. 3,150
Benefit of contacting the selection is N · P · T · Lift(P) · B, e.g. $63,000
Cost of contacting N · P is N · P · C, e.g. $51,000
![Page 22: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/22.jpg)
Profit of Contacting Top P
Profit(P) = Benefit(P) – Cost(P)
= N · P · T · Lift(P) · B – N · P · C
= N · P · (T · Lift(P) · B – C), e.g. $12,000
Q: When is Profit Positive?
When T · Lift(P) · B > C, or
Lift(P) > C / (T · B), e.g. 2.4
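The arithmetic of the direct mail example can be checked with a short script. The function and argument names below are assumptions chosen for this sketch; the numbers are the example values from the slides.

```python
def campaign_estimate(n, t, b, c, p, lift):
    """Return (benefit, cost, profit) of contacting the top p of a model-sorted list."""
    selection_size = n * p                    # N * P prospects contacted
    targets_hit = selection_size * t * lift   # N * P * T * Lift(P)
    benefit = targets_hit * b
    cost = selection_size * c
    return benefit, cost, benefit - cost


def break_even_lift(t, b, c):
    """Lift(P) must exceed C / (T * B) for the profit to be positive."""
    return c / (t * b)


benefit, cost, profit = campaign_estimate(
    n=750_000, t=0.014, b=20.0, c=0.68, p=0.10, lift=3.0
)
print(round(benefit), round(cost), round(profit))    # 63000 51000 12000
print(round(break_even_lift(0.014, 20.0, 0.68), 2))  # 2.43
```

With a lift of 3 the campaign clears the break-even lift of about 2.4, so the estimated profit is positive.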
![Page 23: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/23.jpg)
Finding Optimal Cutoff
[Figure: Est Payoff (y-axis, -60 to 60) plotted against P (x-axis, 10 to 100)]
Use the formula to estimate benefit for each P
Find optimal P
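One way to carry out this search is to evaluate Profit(P) = N · P · (T · Lift(P) · B – C) at each candidate cutoff and keep the best one. In the sketch below the lift table is hypothetical (made up for illustration, not taken from the chart), and `best_cutoff` is an assumed helper name.

```python
def best_cutoff(lift_by_pct, n, t, b, c):
    """Return (P, profit) for the cutoff with the highest estimated profit.

    lift_by_pct maps each candidate P (as a fraction) to the model's Lift(P).
    """
    profits = {p: n * p * (t * lift * b - c) for p, lift in lift_by_pct.items()}
    best_p = max(profits, key=profits.get)
    return best_p, profits[best_p]


# Hypothetical lift values for a model whose lift decreases with P.
lift_by_pct = {0.05: 4.2, 0.10: 3.0, 0.20: 2.2, 0.30: 1.9, 0.50: 1.5, 1.00: 1.0}

best_p, best_profit = best_cutoff(lift_by_pct, n=750_000, t=0.014, b=20.0, c=0.68)
print(best_p, round(best_profit))   # 0.05 18600
```

In this made-up table every cutoff at 20% or deeper loses money because its lift falls below the break-even value C / (T · B).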
![Page 24: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/24.jpg)
*Feasibility Assessment
Expected Profit(P) depends on known Cost C,
Benefit B,
Target Rate T
and unknown Lift(P)
To compute Lift(P) we need to get all the data, load it, clean it, ask for correct data, build models, ...
![Page 25: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/25.jpg)
*Can Expected Lift be estimated?
only from N and T?
In theory -- no, but in many practical applications,
?!?! surprisingly yes ?!?!
![Page 26: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/26.jpg)
*Empirical Observations about Lift
For good models, Lift(P) is usually monotonically decreasing with P
Lift at fixed P (e.g. 0.05) is usually higher for lower T
Special point P = T:
for a perfect predictor, all targets are in the first T of the list, for a maximum lift of 1/T
What can we expect compared to 1/T?
![Page 27: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/27.jpg)
*Meta Analysis of Lift
26 attrition & cross-sell problems from finance and telecom domains
N ranges from 1,000 to 150,000
T ranges from 1% to 22%
No clear relation to N, but there is dependence on T
![Page 28: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/28.jpg)
*Results: Lift(T) vs 1/T
Best Model (R² = 0.86)
log10(Lift(T)) = -0.05 + 0.52 log10(1/T)
Approximately
Lift(T) ~ T^(-0.5) = sqrt(1/T)
Tried several linear and log-linear fits
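To see how close the square-root approximation stays to the fitted regression, the sketch below evaluates both at a few target rates inside the range studied. The function names are assumptions for this example; the coefficients are the ones quoted above.

```python
import math


def lift_from_fit(t):
    """Lift(T) from the fitted model log10(Lift(T)) = -0.05 + 0.52 * log10(1/T)."""
    return 10 ** (-0.05 + 0.52 * math.log10(1 / t))


def lift_sqrt_approx(t):
    """The simplified approximation Lift(T) ~ sqrt(1/T)."""
    return math.sqrt(1 / t)


for t in (0.01, 0.05, 0.10, 0.22):
    print(t, round(lift_from_fit(t), 2), round(lift_sqrt_approx(t), 2))
```

Across these target rates the two estimates differ by less than half a lift point.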
![Page 29: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/29.jpg)
*Actual Lift(T) vs sqrt(1/T) for All Problems
[Figure: Actual lift(T) and Est. lift(T), Lift (y-axis, 0 to 14) vs 100*T% (x-axis, 0 to 25)]
Error = Actual Lift - sqrt(1/T)
Avg(Error) = -0.08
St. Dev(Error) = 1.0
![Page 30: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/30.jpg)
*GPS Lift(T) Rule of Thumb
For targeted marketing campaigns,
where 0.01 < T < 0.25,
Lift(T) = sqrt(1/T) ± 1
Exceptions for
truly predictable or random behaviors
poor models
information leakers
![Page 31: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/31.jpg)
*Estimating Entire Curve
Cumulative Percent Hits
CPH(P) = Lift(P) * P
CPH is easier to model than Lift
Several regressions for all CPH curves
Best results with regression
log10(CPH(P)) = a + b log10(P)
Average R² = 0.97
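The regression form above is an ordinary least-squares line in log-log space. The sketch below fits a and b to a set of (P, CPH) points in plain Python. The `fit_log_log` helper and the synthetic points, generated from CPH = sqrt(P), are assumptions for illustration and not the data behind the reported R².

```python
import math


def fit_log_log(points):
    """Least-squares fit of log10(CPH) = a + b * log10(P); returns (a, b)."""
    xs = [math.log10(p) for p, _ in points]
    ys = [math.log10(cph) for _, cph in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic gains points lying exactly on CPH = sqrt(P).
points = [(p, math.sqrt(p)) for p in (0.05, 0.1, 0.2, 0.4, 0.8)]
a, b = fit_log_log(points)
print(round(a, 6), round(b, 6))
```

On these synthetic points the fit recovers an intercept of about 0 and a slope of about 0.5, the exponent of a square-root gains curve.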
![Page 32: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/32.jpg)
*CPH Curve Estimate
Approximately
CPH(P) ~ sqrt(P)
bounds:
P^0.6 < CPH(P) < P^0.4
![Page 33: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/33.jpg)
*Lift Curve Estimate
Since Lift(P) = CPH(P)/P
Lift(P) ~ 1/sqrt(P)
bounds:
(1/P)^0.4 < Lift(P) < (1/P)^0.6
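A quick numeric check of the estimate and its bounds, with P expressed as a fraction of the list. The helper names are assumptions for this sketch.

```python
def lift_estimate(p):
    """Typical lift at selection fraction p: Lift(P) ~ 1/sqrt(P)."""
    return (1 / p) ** 0.5


def lift_bounds(p):
    """Lower and upper bounds: (1/P)^0.4 < Lift(P) < (1/P)^0.6."""
    return (1 / p) ** 0.4, (1 / p) ** 0.6


for p in (0.05, 0.10, 0.25, 0.50):
    low, high = lift_bounds(p)
    print(p, round(low, 2), round(lift_estimate(p), 2), round(high, 2))
```

At P = 5% the bounds are roughly 3.3 and 6.0, so the lift of 4.2 from the earlier lift-curve example falls inside them.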
![Page 34: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/34.jpg)
*More on Estimating Lift and Profitability
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, Proc. KDD-99, ACM. www.KDnuggets.com/gpspubs/
![Page 35: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/35.jpg)
KDD Cup 1998
Data from Paralyzed Veterans of America (charity)
Goal: select mailing with the highest profit
Winners: Urban Science, SAS, Quadstone
see full results and winners’ presentations at www.kdnuggets.com/meetings/kdd98
![Page 36: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/36.jpg)
KDD-CUP-98 Analysis Universe
Paralyzed Veterans of America (PVA), a not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease, generously provided the data set
PVA’s June 97 fund raising mailing, sent to 3.5 million donors, was selected as the competition data
Within this universe, a group of 200K “Lapsed” donors was of particular interest to PVA. “Lapsed” donors are individuals who made their last donation to PVA 13 to 24 months prior to the mailing
![Page 37: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/37.jpg)
KDD Cup-98 Example
Evaluation: Expected profit maximization with a mailing cost of $0.68
Sum of (actual donation - $0.68) for all records with predicted/expected donation > $0.68
Participant with the highest actual sum wins
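The scoring rule can be written as a one-line sum. In the sketch below the record layout (predicted donation, actual donation) and the sample records are assumptions made for illustration; only the $0.68 mailing cost and the selection rule come from the slide.

```python
MAILING_COST = 0.68


def campaign_score(records, cost=MAILING_COST):
    """Sum of (actual donation - cost) over records whose predicted donation exceeds cost.

    records: iterable of (predicted_donation, actual_donation) pairs.
    """
    return sum(actual - cost for predicted, actual in records if predicted > cost)


# Hypothetical records: only the first and third are mailed.
records = [(5.00, 10.00), (0.50, 20.00), (1.00, 0.00), (0.20, 0.00)]
print(round(campaign_score(records), 2))   # 8.64
```

The second record shows the cost of a missed donor: a $20 donation earns nothing because the predicted donation was below the mailing cost, while the third record loses $0.68 on a mailing that got no response.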
![Page 38: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/38.jpg)
KDD Cup Cost Matrix
| | Predicted Donation: Yes | Predicted Donation: No |
|---|---|---|
| Actual Donation: Yes | DonationAmt - 0.68 | 0 |
| Actual Donation: No | -0.68 | 0 |
![Page 39: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/39.jpg)
KDD Cup 1998 Results
| Model | Selected | Result | Rank |
|---|---|---|---|
| GainSmarts | 56,330 | $14,712 | 1 |
| SAS | 55,838 | $14,662 | 2 |
| Quadstone | 57,836 | $13,954 | 3 |
| … | … | … | … |
| *ALL* | 96,367 | $10,560 | 13 |
| … | … | … | … |
| #20 | 42,270 | $1,706 | 20 |
| #21 | 1,551 | $-54 | 21 |

Selected: how many were selected by the model
Result: the total profit (donations - cost) of the model
*ALL* - selecting all
![Page 40: Targeted Marketing, KDD Cup and Customer Modeling](https://reader038.vdocument.in/reader038/viewer/2022103123/56649d405503460f94a1a69b/html5/thumbnails/40.jpg)
Summary
KDD Cup 1997 case study
Model Evaluation: AUC and Lift Quality
Estimating Campaign Profit
*Feasibility Assessment
GPS Rule of Thumb for Typical Lift Curve
KDD Cup 1998