An introduction to modeling non-contractual customer purchasing with the BTYD package in R.
Predicting Customer Behavior with R: Part 1
Matthew Baggott, Ph.D., University of Chicago
Goal for Today’s Workshop
• Use R and the BTYD package to make a Pareto/Negative Binomial Distribution model of customer purchasing
• Understand our assumptions and how we could refine them
Goal for Today's Workshop
From this (a raw transaction log):

cust  date        sales
1     1997-01-01  29.33
1     1997-01-18  29.73
1     1997-08-02  14.96
1     1997-12-12  26.48
2     1997-01-01  63.34
2     1997-01-13  11.77
3     1997-01-01   6.79
4     1997-01-01  13.97
5     1997-01-01  23.94
6     1997-01-01  35.99
6     1997-01-11  32.99
6     1997-06-23  91.92
6     1997-07-22  47.08
6     1997-07-26  71.96
6     1997-10-25  78.47
6     1997-12-06  83.47
6     1998-01-18  84.46

To this: a fitted model of customer purchasing.
• Tutorial assumes working knowledge of R (but feel free to ask questions)
• Main R packages used: BTYD, plyr, ggplot2, reshape2, lubridate
• BTYD vignette covers some of the same ground
• R script to carry out today's analysis is at: gist.github.com/mattbaggott/5113177
Why Model?
• Help separate:
– active customers,
– inactive customers who should be re-engaged, and
– unprofitable customers
• Forecast future business profits and needs
• Forecast future business profits and needs
Annual Customer ‘Defection’ Rates are High
Industry                         Defection Rate
Internet service providers       22%
U.S. long distance (telephone)   30%
German mobile telephone market   25%
Clothing catalogs                25%
Residential tree and lawn care   32%
Newspaper subscriptions          66%

(Griffin and Lowenstein, 2001)
Why Not Model?
• Wübben & Wangenheim (2008) found simple rules of thumb often beat basic models
• Simple calculations can be faster and clearer:
Long Term Value = (Avg Monthly Revenue per Customer * Gross Margin per Customer) / Monthly Churn Rate
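As a quick sketch with made-up numbers (the figures below are illustrative, not from any data set), that rule of thumb is a one-liner in R:

```r
# Hypothetical inputs -- replace with your own figures
avg.monthly.revenue <- 50    # avg monthly revenue per customer ($)
gross.margin        <- 0.30  # gross margin per customer (proportion)
monthly.churn       <- 0.05  # monthly churn rate

# Long-term value per the rule of thumb above
ltv <- (avg.monthly.revenue * gross.margin) / monthly.churn
ltv  # 300 under these made-up inputs
```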
But it’s All Models
• Our choice is not model vs. no model
• Our choice is formal, scalable models vs. informal, manual models
• We can and should compare, refine, & combine simple rules and complex models
RFM Family of Models
• Models use three variables:
– Recency of purchases
– Frequency of purchases
– Monetary value of purchases
• Used for non-contractual purchasing
• Data needed: dates and amounts of purchases for individual customers
Simple RFM model of Purchasing
1. A probabilistic purchasing process for active customers, modeled as a Poisson process with rate λ
2. A probabilistic dropout process of active customers becoming inactive, modeled as an exponential distribution with dropout rate γ
Simple RFM model of Purchasing
3. Purchasing rates follow a gamma distribution across customers with shape and scale parameters: r and α
4. Dropout rates follow a gamma distribution across customers with shape and scale parameters s and β
5. Transaction rate λ and dropout rate γ vary independently across customers
6. Customers are considered in isolation (no indirect value, no influencing each other)
Purchasing as a Poisson process
• Single parameter indicating the constant probability of some event
• Each event is independent: one does not make another more or less likely
• Other Poisson processes: e-mail arrival, radioactive decay, wars per year
(Are these realistic?)
[Figure: frequency of wars per year, from Hayes, 2002]
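For intuition, here is a sketch of Poisson purchasing for a handful of simulated customers; the weekly rate is a made-up placeholder, not a fitted value:

```r
set.seed(42)
lambda <- 0.05   # hypothetical: 0.05 expected purchases per week
weeks  <- 52

# Yearly purchase counts for 10 simulated customers who all
# share this constant rate; each draw is independent
n.purchases <- rpois(10, lambda * weeks)
n.purchases

# The Poisson assumption also fixes the spread: mean and
# variance of the yearly count are both lambda * weeks (2.6 here)
```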
Dropout rates
• Latent variable: without subscriptions, not directly observed
• 'Right censored' (we don't know the future)
• Fancy survival / hazard models are possible (such as Cox regression)
• Here, we use a simple exponential distribution with constant dropout rate γ > 0:

f(t) = γe^(−γt)
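Under this exponential model, the probability a customer is still active at time t is S(t) = e^(−γt). A quick sketch with an assumed (made-up) rate:

```r
gamma.rate <- 0.02        # hypothetical weekly dropout rate (made up)
t <- c(4, 26, 52)         # weeks after first purchase

f.t <- gamma.rate * exp(-gamma.rate * t)  # density of dropping out at t
S.t <- exp(-gamma.rate * t)               # probability still active at t
round(S.t, 3)
# 0.923 0.595 0.353
```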
Gamma distributions
• Family of continuous probability distributions with two parameters, shape and scale (or rate)
• Often used to model how rate parameters vary across a population, as we do here with the Poisson and exponential rates
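Putting the pieces together, we can simulate a cohort under these assumptions: gamma-distributed individual rates feeding Poisson purchasing and exponential dropout. All parameter values below are illustrative placeholders, not fitted estimates:

```r
set.seed(7)
n <- 1000; T.obs <- 39          # customers; weeks observed
r <- 0.55; alpha <- 10.6        # purchase-rate heterogeneity (illustrative)
s <- 0.61; beta <- 11.7         # dropout-rate heterogeneity (illustrative)

lambda.i <- rgamma(n, shape = r, rate = alpha)  # each customer's purchase rate
gamma.i  <- rgamma(n, shape = s, rate = beta)   # each customer's dropout rate
tau.i    <- rexp(n, rate = gamma.i)             # time until each goes inactive
active   <- pmin(tau.i, T.obs)                  # observed active period
x        <- rpois(n, lambda.i * active)         # repeat purchases while active
summary(x)  # most simulated customers buy rarely; a few buy often
```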
Model is of repeat customers
• Customers are only customers after they make their first purchase
• Frequency is not defined for the first purchase
• We will convert the purchase event log into a repeat-purchase event log with dc.SplitUpElogForRepeatTrans() or as part of dc.ElogToCbsCbt()
CDNOW data set
• We will use data from online retailer CDNOW, included in the BTYD package
• 10% of the cohort of customers who made their first transactions in the first quarter of 1997
• 6,919 purchases by 2,357 customers over a 78-week period
• Not too big; we won't need to wait long
Install/load packages

InstallCandidates <- c("ggplot2", "BTYD", "reshape2", "plyr", "lubridate")
# check if pkgs are already present
toInstall <- InstallCandidates[!InstallCandidates %in% library()$results[,1]]
if (length(toInstall) != 0) {
  install.packages(toInstall, repos = "http://cran.r-project.org")
}
# load pkgs
lapply(InstallCandidates, library, character.only = TRUE)
Load data

cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD")
elog <- read.csv(cdnowElog)  # read data
head(elog)                   # take a look
elog <- elog[, c(2, 3, 5)]   # we need these columns
names(elog) <- c("cust", "date", "sales")  # model funcs expect these names
# format date
elog$date <- as.Date(as.character(elog$date), format = "%Y%m%d")
Aggregate by cust, dates
• Our model is concerned with inter-purchase intervals.
• We only have dates (w/o times) and there may be multiple purchases on a day.
• We merge all transactions that occurred on the same day:
elog <- dc.MergeTransactionsOnSameDate(elog)
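dc.MergeTransactionsOnSameDate() sums each customer's sales within a day. Conceptually it does something like this base-R sketch (a toy illustration, not the package's actual implementation):

```r
# toy event log with two same-day purchases by customer 1
elog.toy <- data.frame(cust  = c(1, 1, 2),
                       date  = as.Date(c("1997-01-01", "1997-01-01",
                                         "1997-01-01")),
                       sales = c(10, 5, 7))
merged <- aggregate(sales ~ cust + date, data = elog.toy, FUN = sum)
merged  # customer 1's two purchases collapse into one row with sales = 15
```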
Plot data

ggplot(elog, aes(x = date, y = sales, group = cust)) +
  geom_line(alpha = 0.1) +
  scale_x_date() +
  scale_y_log10() +
  ggtitle("Sales for individual customers") +
  ylab("Sales ($, US)") + xlab("") +
  theme_minimal()
(An ugly plot, but it could have revealed data issues.)
A more useful plot

purchaseFreq <- ddply(elog, .(cust), summarize,
                      daysBetween = as.numeric(diff(date)))
windows()  # opens a plot device on Windows; use dev.new() elsewhere
ggplot(purchaseFreq, aes(x = daysBetween)) +
  geom_histogram(fill = "orange") +
  xlab("Time between purchases (days)") +
  theme_minimal()
Divide data into train and test

(end.of.cal.period <- min(elog$date) +
   as.numeric((max(elog$date) - min(elog$date)) / 2))
# split data into train (calibration) and test (holdout) and make matrices
data <- dc.ElogToCbsCbt(elog, per = "week",
                        T.cal = end.of.cal.period,
                        merge.same.date = TRUE,  # already did this
                        statistic = "freq")      # which CBT to return
# take a look
str(data)
> str(data)
List of 3
 $ cal      :List of 2                        # calibration period matrices
  ..$ cbs: num [1:2357, 1:3] 2 1 0 0 0 7 1 0 2 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2357] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:3] "x" "t.x" "T.cal"
  ..$ cbt: num [1:2357, 1:266] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2357] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:266] "1997-01-08" "1997-01-09" "1997-01-10" "1997-01-11" ...
 $ holdout  :List of 2                        # holdout period matrices
  ..$ cbt: num [1:2357, 1:272] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2357] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:272] "1997-10-01" "1997-10-02" "1997-10-03" "1997-10-04" ...
  ..$ cbs: num [1:2357, 1:2] 1 0 0 0 0 8 0 2 2 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2357] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "x.star" "T.star"
 $ cust.data:'data.frame': 2357 obs. of 5 variables:   # customer info
  ..$ cust       : int [1:2357] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ birth.per  : Date[1:2357], format: "1997-01-01" ...
  ..$ first.sales: num [1:2357] 29.33 63.34 6.79 13.97 23.94 ...
  ..$ last.date  : Date[1:2357], format: "1997-08-02" ...
  ..$ last.sales : num [1:2357] 14.96 11.77 6.79 13.97 23.94 ...
Extract cbs matrix
• cbs is short for "customer-by-sufficient-statistic" matrix, with the sufficient stats being:
– frequency
– recency (time of last transaction) and
– total time observed

cal2.cbs <- as.matrix(data[[1]][[1]])  # first item in list, first item in it
str(cal2.cbs)
Estimate parameters for model
• Purchase shape and scale params: r and α
• Dropout shape and scale params: s and β
# initial estimate
(params2 <- pnbd.EstimateParameters(cal2.cbs))
# 0.5528797 10.5838911 0.6250764 12.2011828
# look at log likelihood
(LL <- pnbd.cbs.LL(params2, cal2.cbs))
# -9598.711
Estimate parameters for model

# make a series of estimates, see if they converge
p.matrix <- c(params2, LL)
for (i in 1:20) {
  params2 <- pnbd.EstimateParameters(cal2.cbs, params2)
  LL <- pnbd.cbs.LL(params2, cal2.cbs)
  p.matrix.row <- c(params2, LL)
  p.matrix <- rbind(p.matrix, p.matrix.row)
}
# examine
p.matrix
# use final set of values
(params2 <- p.matrix[dim(p.matrix)[1], 1:4])
Plot iso-likelihood for param pairs

# parameter names for a more descriptive result
param.names <- c("r", "alpha", "s", "beta")
LL <- pnbd.cbs.LL(params2, cal2.cbs)
dc.PlotLogLikelihoodContours(pnbd.cbs.LL, params2,
                             cal.cbs = cal2.cbs,
                             n.divs = 5,
                             num.contour.lines = 7,
                             zoom.percent = 0.3,
                             allow.neg.params = FALSE,
                             param.names = param.names)
Plot iso-likelihood for param pairs
[Figure: log-likelihood contour plots for each parameter pair: r and alpha, r and s, r and beta, alpha and s, alpha and beta, s and beta]
Plot population estimates

# par to make two plots side by side
par(mfrow = c(1, 2))
# Plot the estimated distribution of
# customers' propensities to purchase
pnbd.PlotTransactionRateHeterogeneity(params2,
                                      lim = NULL)  # lim is upper xlim
# Plot estimated distribution of
# customers' propensities to drop out
pnbd.PlotDropoutRateHeterogeneity(params2)
# set par to normal
par(mfrow = c(1, 1))
Plot population estimates
[Figure: "Heterogeneity in Transaction Rate" (Mean: 0.0522, Var: 0.0049) and "Heterogeneity in Dropout Rate" (Mean: 0.0512, Var: 0.0042), each a density over rates from 0.00 to 0.30]
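As a sanity check, the mean and variance printed on those plots follow directly from the gamma parameters: mean = shape/rate and variance = shape/rate^2. Using the purchase-rate values from the fit above (rounded):

```r
r <- 0.553; alpha <- 10.584   # purchase-rate params (rounded from the fit)
c(mean = r / alpha, var = r / alpha^2)
# mean ~ 0.0522, var ~ 0.0049 -- matching the transaction-rate plot
```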
Examine individual predictions

# predicted num. transactions a new customer
# will make in 52 weeks
pnbd.Expectation(params2, t = 52)

# expected characteristics for customer 1516,
# conditional on their purchasing during calibration
cal2.cbs["1516", ]
x     <- cal2.cbs["1516", "x"]      # x is frequency
t.x   <- cal2.cbs["1516", "t.x"]    # t.x is time of last buy
T.cal <- cal2.cbs["1516", "T.cal"]  # T.cal is time observed
# estimate their transactions in a T.star duration
pnbd.ConditionalExpectedTransactions(params2,
                                     T.star = 52,  # weeks
                                     x, t.x, T.cal)
# [1] 25.24912
Probability a customer is 'alive'

x    # freq of purchase
t.x  # week of last purchase
T.cal <- 39  # week of end of cal, i.e. present
pnbd.PAlive(params2, x, t.x, T.cal)

# To visualize the distribution of P(Alive)
# across customers:
params3 <- pnbd.EstimateParameters(cal2.cbs)
p.alives <- pnbd.PAlive(params3,
                        cal2.cbs[, "x"],
                        cal2.cbs[, "t.x"],
                        cal2.cbs[, "T.cal"])
Plot P(Alive)

ggplot(as.data.frame(p.alives), aes(x = p.alives)) +
  geom_histogram(colour = "grey", fill = "orange") +
  ylab("Number of Customers") +
  xlab("Probability Customer is 'Live'") +
  theme_minimal()
[Figure: histogram of the number of customers by probability the customer is 'live'; most customers cluster near 0 or 1]
Plot Observed, Model Transactions

# plot actual & expected customers binned by
# num of repeat transactions
pnbd.PlotFrequencyInCalibration(params2, cal2.cbs,
                                censor = 10,
                                title = "Model vs. Reality during Calibration")
[Figure: "Model vs. Reality during Calibration" -- actual vs. model counts of customers by calibration-period transactions, binned 0 through 10+]
Compare calibration to holdout
• Note of caution: potential overfitting
– Our gamma distributions are based on the specific customers we had during calibration.
– How would our parameters and predictions change with different customers?
– We will address this in Part 2
Get holdout results, duration

# get holdout transactions from dataframe data,
# add in as x.star
x.star <- data[[2]][[2]][, 1]
cal2.cbs <- cbind(cal2.cbs, x.star)
str(cal2.cbs)

holdoutdates <- attributes(data[[2]][[1]])[[2]][[2]]
holdoutlength <- round(as.numeric(max(as.Date(holdoutdates)) -
                                  min(as.Date(holdoutdates))) / 7)
Plot frequency comparison

# plot predicted vs seen conditional freqs
T.star <- holdoutlength
censor <- 10  # Bin all order numbers here and above
comp <- pnbd.PlotFreqVsConditionalExpectedFrequency(params2,
                                                    T.star,
                                                    cal2.cbs,
                                                    x.star,
                                                    censor)
[Figure: "Conditional Expectation" -- actual vs. model holdout-period transactions by calibration-period transactions, binned 0 through 10+]
Examine accompanying matrix
• Bin sizes for that plot can be seen in the comp matrix:

rownames(comp) <- c("act", "exp", "bin")
comp

          freq.0      freq.1     freq.2     freq.3    freq.4    freq.5
act    0.2367116   0.6970387   1.392523   1.560000  2.532258  2.947368
exp    0.1367795   0.5921279   1.181825   1.693969  2.372472  2.876888
bin 1411.0000000 439.0000000 214.000000 100.000000 62.000000 38.000000
       freq.6    freq.7   freq.8   freq.9  freq.10+
act  3.862069  4.913043 3.714286 8.400000  7.793103
exp  3.776675  4.167163 5.698026 5.487862  8.369321
bin 29.000000 23.000000 7.000000 5.000000 29.000000
Compare Weekly transactions

# get data without first transaction: removes those who buy 1x
removedFirst.elog <- dc.SplitUpElogForRepeatTrans(elog)$repeat.trans.elog
removedFirst.cbt <- dc.CreateFreqCBT(removedFirst.elog)
# get all data, so we have customers who buy 1x
allCust.cbt <- dc.CreateFreqCBT(elog)
# add 1x customers into matrix
tot.cbt <- dc.MergeCustomers(data.correct = allCust.cbt,
                             data.to.correct = removedFirst.cbt)
lengthInDays <- as.numeric(max(as.Date(colnames(tot.cbt))) -
                           min(as.Date(colnames(tot.cbt))))
origin <- min(as.Date(colnames(tot.cbt)))
Compare Weekly transactions

tot.cbt.df <- melt(tot.cbt, varnames = c("cust", "date"),
                   value.name = "Freq")
tot.cbt.df$date <- as.Date(tot.cbt.df$date)
tot.cbt.df$week <- as.numeric(1 + floor((tot.cbt.df$date - origin + 1) / 7))

transactByDay  <- ddply(tot.cbt.df, .(date), summarize, sum(Freq))
transactByWeek <- ddply(tot.cbt.df, .(week), summarize, sum(Freq))
names(transactByWeek) <- c("week", "Transactions")
names(transactByDay)  <- c("date", "Transactions")

T.cal <- cal2.cbs[, "T.cal"]
T.tot <- 78  # end of holdout
comparisonByWeek <- pnbd.PlotTrackingInc(params2, T.cal, T.tot,
    actual.inc.tracking.data = transactByWeek$Transactions)
Compare Weekly transactions
[Figure: actual vs. model weekly repeat transactions across the 78 weeks]
Formal Measures of Accuracy

# root mean squared error
rmse <- function(est, act) { return(sqrt(mean((est - act)^2))) }
# mean squared logarithmic error
msle <- function(est, act) { return(mean((log1p(est) - log1p(act))^2)) }

predicted <- pnbd.ConditionalExpectedTransactions(params2,
                 T.star = 38,  # weeks
                 x = cal2.cbs[, "x"],
                 t.x = cal2.cbs[, "t.x"],
                 T.cal = cal2.cbs[, "T.cal"])

cal2.cbs[, "x.star"]  # actual transactions for each person

rmse(act = cal2.cbs[, "x.star"], est = predicted)
msle(act = cal2.cbs[, "x.star"], est = predicted)
These error measures are not very meaningful without a comparison model.
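As a sketch of what such a comparison could look like (toy numbers, not the CDNOW data), here is a naive baseline that assumes each customer simply keeps buying at their calibration-period rate:

```r
# toy calibration/holdout values for four customers (made up)
x      <- c(2, 0, 7, 1)       # calibration repeat purchases
T.cal  <- c(30, 38, 35, 20)   # weeks observed in calibration
x.star <- c(1, 0, 8, 0)       # holdout repeat purchases
T.star <- 38                  # holdout length in weeks

rmse <- function(est, act) sqrt(mean((est - act)^2))

# naive baseline: project the calibration rate into the holdout window
naive.est <- x / T.cal * T.star
rmse(est = naive.est, act = x.star)  # compare against the model's rmse
```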
Next Week:
• Compare results to a simple model
• Estimate of expenditure / customer value
• Use info about clumpiness of purchase patterns (as in Platzer 2008)
• Use info about seasonality of purchasing, with the forecast package
• Improve model predictions with machine learning techniques:
– Cross-validation to avoid over-fitting
– Combining model predictions
References
• Griffin and Lowenstein (2001). Customer Winback: How to Recapture Lost Customers—And Keep Them Loyal. San Francisco: Jossey-Bass.
• Platzer (2008). "Stochastic Models of Noncontractual Consumer Relationships." Master of Science in Business Administration thesis, Vienna University of Economics and Business Administration, Austria.
• Schmittlein, Morrison, and Colombo (1987). "Counting Your Customers: Who Are They and What Will They Do Next?" Management Science, 33(1), 1-24.
• Wang, Gao, and Li (2010). "Empirical Analysis of Customer Behaviors in Chinese E-commerce." Journal of Networks, 5(10), 1177-1184.
• Wübben and Wangenheim (2008). "Instant Customer Base Analysis: Managerial Heuristics Often 'Get It Right'." Journal of Marketing, 72(3), 82-93.
• Zhang, Bradlow, and Small (2012). "New Measures of Clumpiness for Incidence Data."
Purchase rate often depends on type of purchase
[Figure: 1.1 million purchases on 360buy.com, from Wang, Gao, & Li 2010]