TRANSCRIPT
Bagging and Boosting Classification Trees to Predict Churn.
Insights from the US Telecom Industry
Forthcoming, Journal of Marketing Research
Joint work with Christophe Croux
Aurélie Lemmens
The 2002 Churn Tournament organised by Teradata Center for CRM
at Duke University
Churn means defecting from a company, i.e., taking one's business elsewhere
Customer database from an anonymous U.S. wireless telecom
company
Challenge: predicting churn in order to develop targeted retention
strategies (Bolton et al. 2000, Ganesh et al. 2000, Shaffer and Zhang
2002)
Details can be found in Neslin et al. (2004)
The Context
The US Wireless Telecom market (2004)
182.1 million subscribers
Leader in market share: Cingular Wireless
26.9% total market volume
turnover US$19.4 billion / net income US$201 million
Other major players: AT&T, Verizon, Sprint and Nextel
Mergers & Acquisitions: Cingular with AT&T Wireless, and Sprint
with Nextel
The Context (cont’d)
Churn
High churn rates: 2.6% a month
Causes: increased competition, lack of differentiation,
market saturation
Cost: $300 to $700 to replace a lost customer, in terms of
sales support, marketing, advertising, etc.
Targeted retention strategies
The Context (cont’d)
Formulation of the Churn Problem
Churn as a classification issue:
Classify a customer $i$, characterized by $K$ variables
$x_i = (x_{i1}, x_{i2}, \dots, x_{iK})$, as
Churner: $y_i = +1$
Non-churner: $y_i = -1$
Churn is the binary response variable to predict: $y_i = f(x_i)$
Which binary choice model $f(\cdot)$ should be used?
Classification Models in Marketing
Simple binary logit choice model (e.g. Andrews et al. 2002)
Models allowing for the heterogeneity in consumers’ response:
Finite mixture model (e.g. Wedel and Kamakura 2000)
Hierarchical Bayes model (e.g. Yang and Allenby 2003)
Non-parametric choice models:
Decision trees, neural nets (e.g. Thieme et al. 2000; West et al. 1997)
Bagging (Breiman 1996), Boosting (Freund and Schapire 1996),
Stochastic gradient boosting (Friedman 2002)
Mostly ignored in the marketing literature
S.G.B. won the Tournament (Cardell, from Salford Systems)
Decision Trees for Churn
Example:
[Figure: an example classification tree for churn. The root splits on the change in consumption; deeper splits use customer care calls (< 3 vs. ≥ 3), age (< 26, 26 to 55, ≥ 55), and handset price (< $150 vs. ≥ $150); each leaf predicts churn (Yes) or no churn (No).]
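As an aside, a tree like this is easy to grow with off-the-shelf software. Below is a minimal Python sketch using scikit-learn; the data and feature names are made up for illustration and this is not the authors' code:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data mimicking the four predictors in the example tree
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(0, 200, n),       # change in consumption (minutes of use)
    rng.poisson(1.5, n),         # customer care calls
    rng.integers(18, 80, n),     # age
    rng.uniform(20, 300, n),     # handset price
])
y = rng.choice([-1, 1], size=n)  # churn coded +1 / -1, as in the talk

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[
    "change_in_use", "care_calls", "age", "handset_price"]))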
Bagging and Boosting
Machine Learning Algorithms
Principle: classifier aggregation (Breiman, 1996)
Tree-based methods (e.g. Currim et al. 1988)
Bagging: Bootstrap AGGregatING
[Diagram: the calibration sample $Z = \{(x_i, y_i)\}$, $i = 1, \dots, N$, is resampled into bootstrap samples $Z_1^*, Z_2^*, \dots, Z_B^*$; a base classifier (e.g. a tree) fitted on each yields score functions $\hat{f}_1^*(x), \hat{f}_2^*(x), \dots, \hat{f}_B^*(x)$, which are then aggregated.]
Churn propensity score: $\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$
Churn classification: $\hat{c}_{\text{bag}}(x) = \operatorname{sign}\bigl(\hat{f}_{\text{bag}}(x)\bigr)$
Bagging
Let the calibration sample be $Z = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_N, y_N)\}$.
Draw $B$ bootstrap samples $Z_b^*$, $b = 1, 2, \dots, B$.
From each $Z_b^*$, a base classifier (e.g. tree) is estimated,
giving $B$ score functions: $\hat{f}_1^*(x), \dots, \hat{f}_b^*(x), \dots, \hat{f}_B^*(x)$.
The final classifier is obtained by averaging the scores:
$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$
The classification rule is carried out via
$\hat{c}_{\text{bag}}(x) = \operatorname{sign}\bigl(\hat{f}_{\text{bag}}(x)\bigr)$
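A minimal Python sketch of this procedure (my own illustration, not the authors' implementation; X and y are assumed numpy arrays with y in {-1, +1}). Regression trees fitted on the ±1 labels serve as the base score functions, and their average gives the bagged score:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def f_bag(X, y, X_new, B=100, seed=0):
    # X: (N, K) calibration predictors; y: labels in {-1, +1}
    rng = np.random.default_rng(seed)
    N = len(y)
    score = np.zeros(len(X_new))
    for b in range(B):
        idx = rng.integers(0, N, size=N)       # bootstrap sample Z*_b
        base = DecisionTreeRegressor(max_depth=5, random_state=b)
        base.fit(X[idx], y[idx])               # base score function f*_b
        score += base.predict(X_new)
    return score / B                           # averaged score f_bag(x)

# Classification rule: c_bag(x) = sign(f_bag(x))
# y_pred = np.sign(f_bag(X_cal, y_cal, X_val))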
Stochastic Gradient Boosting
Winner of the Teradata Churn Modeling Tournament
(Cardell, Golovnya and Steinberg, Salford Systems).
Data adaptively resampled:
• Previously misclassified observations get higher weights
• Previously well-classified observations get lower weights
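One common implementation of Friedman's (2002) stochastic gradient boosting is scikit-learn's GradientBoostingClassifier with subsample < 1; a minimal sketch with illustrative settings, not the tournament entry (X_cal, y_cal, X_val are assumed to exist, y in {-1, +1}):

from sklearn.ensemble import GradientBoostingClassifier

sgb = GradientBoostingClassifier(
    n_estimators=100,   # boosting iterations
    learning_rate=0.1,
    max_depth=3,        # small trees as base learners
    subsample=0.5,      # random subsampling each iteration: the "stochastic" part
    random_state=0,
)
# sgb.fit(X_cal, y_cal)
# churn_score = sgb.decision_function(X_val)   # higher = riskier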
Data
Calibration sample: balanced sample, N = 51,306, equal proportion of churners = 50%.
Validation hold-out sample: proportional sample, N = 100,462, real-life proportion of churners = 1.8%.
Both samples contain, for each customer $i$, the churn outcome $y_i = \pm 1$ and 46 predictors $x_i = (x_{i1}, \dots, x_{i46})$:
• Behavioral predictors, e.g. the average monthly minutes of use
• Company-interaction variables, e.g. mean unrounded minutes of customer care calls
• Customer demographics, e.g. the number of adults in the household
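A minimal sketch of how such a balanced calibration sample could be drawn from a proportional database by undersampling non-churners (an assumed helper, not the tournament's actual sampling scheme):

import numpy as np

def balanced_sample(X, y, seed=0):
    # y in {-1, +1}; churners (+1) are the rare class (about 1.8% here)
    rng = np.random.default_rng(seed)
    churners = np.flatnonzero(y == 1)
    stayers = np.flatnonzero(y == -1)
    kept = rng.choice(stayers, size=len(churners), replace=False)
    idx = rng.permutation(np.concatenate([churners, kept]))
    return X[idx], y[idx]        # ~50% churners, ~50% non-churners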
Research Questions
Do bagging and boosting provide better results than
other benchmarks?
What are the financial gains to be expected from this improvement?
What are the most relevant churn drivers or triggers that marketers
could watch for?
How can estimated scores obtained from a balanced calibration
sample be corrected when predicting rare events like churn?
Comparing Error Rates…
Model*                         Validated Error Rate**
Binary logit model             0.400
Bagging (tree-based)           0.374
Stochastic gradient boosting   0.460
* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample
Bias due to Balanced Sampling
Overestimation of the number of churners
Several bias correction methods exist (see e.g. Cosslett 1993;
Donkers et al. 2003; Franses and Paap 2001, p.73-75; Imbens and Lancaster
1996; King and Zeng 2001a,b; Scott and Wild 1997).
However, most are dedicated to traditional models (e.g. logit).
We discuss two corrections for bagging and boosting.
The Bias Correction Methods
The weighting correction:
Based on marketers’ prior beliefs about the churn rate, i.e. the
proportion of churners among their customers, we attach weights
to observations of a balanced calibration sample.
The intercept correction:
Take a non-zero cut-off value τB such that the proportion of
predicted churners in the calibration sample equals the actual a
priori proportion of churners.
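A minimal sketch of the two corrections as described above, under assumptions: the model is fit on a 50/50 balanced calibration sample, and pi_true is the marketers' prior churn rate (here 1.8%). Not the authors' implementation:

import numpy as np

def weighting_weights(y, pi_true=0.018, pi_balanced=0.5):
    # Weighting correction: give balanced-sample observations the importance
    # they would have under the true class proportions; pass as sample_weight.
    w_churn = pi_true / pi_balanced
    w_stay = (1 - pi_true) / (1 - pi_balanced)
    return np.where(y == 1, w_churn, w_stay)

def intercept_cutoff(cal_scores, pi_true=0.018):
    # Intercept correction: cut-off tau_B such that the share of predicted
    # churners in the calibration sample equals the a priori churn rate.
    return np.quantile(cal_scores, 1 - pi_true)

# Corrected classification rule: sign(score - tau_B)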
Bagging with the Intercept Correction
Let the calibration sample be $Z = \{(x_1, y_1), \dots, (x_i, y_i), \dots, (x_N, y_N)\}$.
Draw $B$ bootstrap samples $Z_b^*$, $b = 1, 2, \dots, B$.
From each $Z_b^*$, a base classifier (e.g. tree) is estimated,
giving $B$ score functions: $\hat{f}_1^*(x), \dots, \hat{f}_b^*(x), \dots, \hat{f}_B^*(x)$.
The final classifier is obtained by averaging the scores:
$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b^*(x)$
The classification is now carried out via
$\hat{c}_{\text{bag}}(x) = \operatorname{sign}\bigl(\hat{f}_{\text{bag}}(x) - \tau_B\bigr)$
Assessing the Best Bias Correction…
                               Validated Error Rates** by Bias Correction
Model*                         No correction   Intercept   Weighting
Binary logit model             0.400           0.035       0.018
Bagging (tree-based)           0.374           0.034       0.025
Stochastic gradient boosting   0.460           0.034       0.018
* Model estimated on the balanced calibration sample
** Error rates computed on the hold-out proportional validation sample
The Top-Decile Lift
Focuses on the most critical group of customers regarding their
churn risk, the top 10% riskiest customers: the ideal segment for
targeting a retention marketing campaign.
Top-decile lift $= \hat{\pi}_{10\%} \,/\, \hat{\pi}$
with $\hat{\pi}_{10\%}$ = the proportion of churners in this risky segment,
and $\hat{\pi}$ = the proportion of churners in the whole validation set.
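Computed on held-out data, the measure is a few lines of Python; a sketch under the assumption that higher scores mean higher churn risk and y codes churn as +1:

import numpy as np

def top_decile_lift(scores, y, frac=0.10):
    cutoff = np.quantile(scores, 1 - frac)   # boundary of the top 10% riskiest
    top = scores >= cutoff
    pi_top = np.mean(y[top] == 1)            # churn rate in the risky segment
    pi_all = np.mean(y == 1)                 # churn rate in the whole sample
    return pi_top / pi_all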
Financial Gains: Neslin et al. (2004)
N : customer base of the company
α : percentage of targeted customers (here, 10%)
ΔTop decile : increase in top-decile lift
γ : success rate of the incentive among the churners
LVC : lifetime value of a customer (Gupta, Lehmann and Stuart 2004)
δ : incentive cost per customer
ψ : success rate of the incentive among the non-churners.
$\text{Gain} = N\,\alpha\,\Delta\text{Top decile}\;\hat{\pi}\,\bigl[\gamma(\text{LVC} - \delta) + \delta\psi\bigr]$, with $\hat{\pi}$ the baseline proportion of churners
Top-Decile Lift with Intercept Correction
[Figure: validated top-decile lift* as a function of the number of iterations (0 to 100) for bagging, stochastic gradient boosting, and the binary logit model; the lift axis runs from 1.6 to 2.6. The ensemble methods improve on the logit model by +26%.]
* Model estimated on the balanced sample, and lift computed on the validation sample.
Validated** Top-Decile Lift
Model*                         No / Intercept correction   Weighting correction
Binary logit model             1.775                       1.764
Bagging (tree-based)           2.246                       1.549
Stochastic gradient boosting   2.290                       1.632
* Model estimated on the balanced calibration sample
** Lifts computed on the hold-out proportional validation sample
Financial Gains
If we consider
N : customer base of 5,000,000 customers
α : 10% of targeted customers
γ : 30% success rate of the incentive among the churners
LVC : $2,500 lifetime value of a customer
δ : $50 incentive cost per customer
ψ : 50% success rate of the incentive among the non-churners
π̂ : 1.8% baseline proportion of churners (the real-life proportion in the validation sample)
$\text{Gain} = N\,\alpha\,\Delta\text{Top decile}\;\hat{\pi}\,\bigl[\gamma(\text{LVC} - \delta) + \delta\psi\bigr]$
Financial Gains
Additional financial gains that we may expect from a retention marketing campaign targeted using the scores predicted by bagging instead of the logit model:
ΔTop decile : 0.471 (= 2.246 − 1.775)
Gain = + $3,214,800
Additional financial gains that we may expect from a retention marketing campaign targeted using the scores predicted by bagging instead of a random selection:
ΔTop decile : 1.246 (= 2.246 − 1.000)
Gain = + $8,550,000
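These figures can be checked against the reconstructed gain formula; a short sketch (the slide's dollar amounts correspond to lift differences rounded to 0.47 and 1.25, so the output matches up to rounding):

N, alpha, pi = 5_000_000, 0.10, 0.018
gamma, LVC, delta, psi = 0.30, 2_500, 50, 0.50

def gain(d_lift):
    # Gain = N * alpha * dTopDecile * pi * [gamma*(LVC - delta) + delta*psi]
    return N * alpha * d_lift * pi * (gamma * (LVC - delta) + delta * psi)

print(gain(2.246 - 1.775))   # bagging vs. logit   -> ~ $3.2 million
print(gain(2.246 - 1.000))   # bagging vs. random  -> ~ $8.5 million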
Most Important Churn Triggers
Bagging: Partial Dependence Plots
[Figure: partial dependence of the probability to churn on the change in monthly minutes of use (x from −1000 to 2000; probability roughly 48% to 62%) and on equipment days (x from 0 to 1500; probability roughly 44% to 56%).]
[Figure: a further partial dependence plot of the probability to churn, varying roughly between 49% and 51%.]
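A partial dependence curve averages the model's churn probability over the sample while one predictor is held fixed at each grid value; a minimal sketch for any fitted classifier exposing predict_proba (the model name, column index, and grid are illustrative, not from the paper):

import numpy as np

def partial_dependence(model, X, col, grid):
    curve = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, col] = v                    # fix the predictor at value v
        curve.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(curve)                   # average churn probability per grid value

# grid = np.linspace(-1000, 2000, 50)        # e.g. change in monthly minutes of use
# pdp = partial_dependence(fitted_model, X_val, col=0, grid=grid)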
Conclusions: Main Findings
1. Bagging and S.G. boosting are substantially better
classifiers than the binary logit choice model:
Improvement of 26% for the top-decile lift,
Good diagnostic measures offering face validity,
Interesting insights about potential churn drivers,
Bagging is conceptually simple and easy to implement.
2. Intercept correction constitutes an appropriate bias
correction for bagging when using a balanced sampling
scheme.
Thanks for your attention
From Profit to Financial Gains
$\text{Gain} = \text{Profit}_1 - \text{Profit}_2 = N\alpha(\hat{\lambda}_1 - \hat{\lambda}_2)\bigl[\gamma(\text{LVC} - \delta) + \delta\psi\bigr] = N\alpha\,\Delta\text{Top decile}\;\hat{\pi}\,\bigl[\gamma(\text{LVC} - \delta) + \delta\psi\bigr]$
with, for targeting rule $i = 1, 2$,
$\text{Profit}_i = N\alpha\bigl[\gamma\hat{\lambda}_i\,\text{LVC} - \delta\bigl(\gamma\hat{\lambda}_i + \psi(1 - \hat{\lambda}_i)\bigr) - c\bigr]$
where $\gamma\hat{\lambda}_i\,\text{LVC}$ is the LVC of a churner who does not churn (i.e. is retained), $\delta(\gamma\hat{\lambda}_i + \psi(1 - \hat{\lambda}_i))$ is the incentive cost for the churners retained + non-churners targeted, and $c$ is the contact cost,
and $\text{Top decile}_i = \hat{\lambda}_i / \hat{\pi}$, with $\hat{\lambda}_i$ the proportion of churners among the targeted customers.
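A quick symbolic check (sympy) that the profit difference indeed collapses to the gain expression above; symbols follow the definitions on this slide:

import sympy as sp

N, a, g, LVC, d, psi, c, l1, l2 = sp.symbols(
    "N alpha gamma LVC delta psi c lambda1 lambda2")

def profit(lam):
    # Profit_i = N*alpha*[gamma*lam*LVC - delta*(gamma*lam + psi*(1-lam)) - c]
    return N * a * (g * lam * LVC - d * (g * lam + psi * (1 - lam)) - c)

target = N * a * (l1 - l2) * (g * (LVC - d) + d * psi)
print(sp.simplify(profit(l1) - profit(l2) - target))   # -> 0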