An introduction to modeling non-contractual customer purchasing with the BTYD package in R.
Predicting Customer Behavior with R: Part 1
Matthew Baggott, Ph.D., University of Chicago
Goal for Today’s Workshop
• Use R and the BTYD package to make a Pareto/Negative Binomial Distribution model of customer purchasing
• Understand our assumptions and how we could refine them
Goal for Today's Workshop
From this (a raw transaction log):

cust  date        sales
1     1997-01-01  29.33
1     1997-01-18  29.73
1     1997-08-02  14.96
1     1997-12-12  26.48
2     1997-01-01  63.34
2     1997-01-13  11.77
3     1997-01-01   6.79
4     1997-01-01  13.97
5     1997-01-01  23.94
6     1997-01-01  35.99
6     1997-01-11  32.99
6     1997-06-23  91.92
6     1997-07-22  47.08
6     1997-07-26  71.96
6     1997-10-25  78.47
6     1997-12-06  83.47
6     1998-01-18  84.46

To this: a fitted model of customer purchasing.
• Tutorial assumes working knowledge of R (but feel free to ask questions)
• Main R packages used: BTYD, plyr, ggplot2, reshape2, lubridate
• BTYD vignette covers some of the same ground
• R script to carry out today's analysis is at: gist.github.com/mattbaggott/5113177
Why Model?
• Help separate:
– active customers,
– inactive customers who should be re-engaged, and
– unprofitable customers
• Forecast future business profits and needs
• Forecast future business profits and needs
Annual Customer ‘Defection’ Rates are High
Industry                         Defection Rate
Internet service providers       22%
U.S. long distance (telephone)   30%
German mobile telephone market   25%
Clothing catalogs                25%
Residential tree and lawn care   32%
Newspaper subscriptions          66%

(Griffin and Lowenstein, 2001)
Why Not Model?
• Wübben & Wangenheim (2008) found simple rules of thumb often beat basic models
• Simple calculations can be faster and clearer:
Long Term Value = (Avg Monthly Revenue per Customer * Gross Margin per Customer) / Monthly Churn Rate
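As a quick sketch with made-up numbers (the figures below are illustrative, not from any data set), that rule of thumb is a one-liner in R:

```r
# Hypothetical inputs -- replace with your own figures
avg.monthly.revenue <- 50    # avg monthly revenue per customer ($)
gross.margin        <- 0.30  # gross margin per customer (proportion)
monthly.churn       <- 0.05  # monthly churn rate

# Long-term value per the rule of thumb above
ltv <- (avg.monthly.revenue * gross.margin) / monthly.churn
ltv  # 300 under these made-up inputs
```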
But it’s All Models
• Our choice is not model vs. no model
• Our choice is formal, scalable models vs. informal, manual models
• We can and should compare, refine, & combine simple rules and complex models
RFM Family of Models
• Models use three variables:
– Recency of purchases
– Frequency of purchases
– Monetary value of purchases
• Used for non-contractual purchasing
• Data needed: dates and amounts of purchases for individual customers
Simple RFM model of Purchasing
1. A probabilistic purchasing process for active customers, modeled as a Poisson process with rate λ
2. A probabilistic dropout process of active customers becoming inactive, modeled as an exponential distribution with dropout rate γ
Simple RFM model of Purchasing
3. Purchasing rates follow a gamma distribution across customers with shape and scale parameters: r and α
4. Dropout rates follow a gamma distribution across customers with shape and scale parameters s and β
5. Transaction rate λ and dropout rate γ vary independently across customers
6. Customers are considered in isolation (no indirect value, no influencing each other)
Purchasing as a Poisson process
• Single parameter indicating the constant probability of some event
• Each event is independent: one does not make another more or less likely
• Other Poisson processes: e-mail arrival, radioactive decay, wars per year
(Are these realistic?)
[Figure: frequency of wars per year, from Hayes, 2002]
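For intuition, here is a sketch of Poisson purchasing for a handful of simulated customers; the weekly rate is a made-up placeholder, not a fitted value:

```r
set.seed(42)
lambda <- 0.05   # hypothetical: 0.05 expected purchases per week
weeks  <- 52

# Yearly purchase counts for 10 simulated customers who all
# share this constant rate; each draw is independent
n.purchases <- rpois(10, lambda * weeks)
n.purchases

# The Poisson assumption also fixes the spread: mean and
# variance of the yearly count are both lambda * weeks (2.6 here)
```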
Dropout rates
• Latent variable: without subscriptions, not directly observed
• 'Right censored' (we don't know the future)
• Fancy survival / hazard models are possible (such as Cox regression)
• Here, we use a simple exponential distribution with constant dropout rate γ > 0:

f(t) = γe^(−γt)
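Under this exponential model, the probability a customer is still active at time t is S(t) = e^(−γt). A quick sketch with an assumed (made-up) rate:

```r
gamma.rate <- 0.02        # hypothetical weekly dropout rate (made up)
t <- c(4, 26, 52)         # weeks after first purchase

f.t <- gamma.rate * exp(-gamma.rate * t)  # density of dropping out at t
S.t <- exp(-gamma.rate * t)               # probability still active at t
round(S.t, 3)
# 0.923 0.595 0.353
```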
Gamma distributions
• Family of continuous probability distributions with two parameters, shape and scale (or rate)
• Often used to model how rate parameters vary across a population, as we do here with the Poisson and exponential rates
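Putting the pieces together, we can simulate a cohort under these assumptions: gamma-distributed individual rates feeding Poisson purchasing and exponential dropout. All parameter values below are illustrative placeholders, not fitted estimates:

```r
set.seed(7)
n <- 1000; T.obs <- 39          # customers; weeks observed
r <- 0.55; alpha <- 10.6        # purchase-rate heterogeneity (illustrative)
s <- 0.61; beta <- 11.7         # dropout-rate heterogeneity (illustrative)

lambda.i <- rgamma(n, shape = r, rate = alpha)  # each customer's purchase rate
gamma.i  <- rgamma(n, shape = s, rate = beta)   # each customer's dropout rate
tau.i    <- rexp(n, rate = gamma.i)             # time until each goes inactive
active   <- pmin(tau.i, T.obs)                  # observed active period
x        <- rpois(n, lambda.i * active)         # repeat purchases while active
summary(x)  # most simulated customers buy rarely; a few buy often
```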
Model is of repeat customers
• Customers are only customers after they make their first purchase
• Frequency is not defined for the first purchase
• We will convert the purchase event log into a repeat-purchase event log with dc.SplitUpElogForRepeatTrans() or as part of dc.ElogToCbsCbt()
CDNOW data set
• We will use data from online retailer CDNOW, included in the BTYD package
• 10% of the cohort of customers who made their first transactions in the first quarter of 1997
• 6,919 purchases by 2,357 customers over a 78-week period
• Not too big; we won't need to wait long
Install/load packages

InstallCandidates <- c("ggplot2", "BTYD", "reshape2", "plyr", "lubridate")
# check if pkgs are already present
toInstall <- InstallCandidates[!InstallCandidates %in% library()$results[,1]]
if (length(toInstall) != 0) {
  install.packages(toInstall, repos = "http://cran.r-project.org")
}
# load pkgs
lapply(InstallCandidates, library, character.only = TRUE)
Load data

cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD")
elog <- read.csv(cdnowElog)  # read data
head(elog)                   # take a look
elog <- elog[, c(2, 3, 5)]   # we need these columns
names(elog) <- c("cust", "date", "sales")  # model funcs expect these names
# format date
elog$date <- as.Date(as.character(elog$date), format = "%Y%m%d")
Aggregate by cust, dates
• Our model is concerned with inter-purchase intervals.
• We only have dates (w/o times) and there may be multiple purchases on a day.
• We merge all transactions that occurred on the same day:
elog <- dc.MergeTransactionsOnSameDate(elog)
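dc.MergeTransactionsOnSameDate() sums each customer's sales within a day. Conceptually it does something like this base-R sketch (a toy illustration, not the package's actual implementation):

```r
# toy event log with two same-day purchases by customer 1
elog.toy <- data.frame(cust  = c(1, 1, 2),
                       date  = as.Date(c("1997-01-01", "1997-01-01",
                                         "1997-01-01")),
                       sales = c(10, 5, 7))
merged <- aggregate(sales ~ cust + date, data = elog.toy, FUN = sum)
merged  # customer 1's two purchases collapse into one row with sales = 15
```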
Plot data

ggplot(elog, aes(x = date, y = sales, group = cust)) +
  geom_line(alpha = 0.1) +
  scale_x_date() +
  scale_y_log10() +
  ggtitle("Sales for individual customers") +
  ylab("Sales ($, US)") + xlab("") +
  theme_minimal()
(An ugly plot, but it could have revealed data issues.)
A more useful plot

purchaseFreq <- ddply(elog, .(cust), summarize,
                      daysBetween = as.numeric(diff(date)))
windows()  # opens a plot device on Windows; use dev.new() elsewhere
ggplot(purchaseFreq, aes(x = daysBetween)) +
  geom_histogram(fill = "orange") +
  xlab("Time between purchases (days)") +
  theme_minimal()
Divide data into train and test

(end.of.cal.period <- min(elog$date) +
   as.numeric((max(elog$date) - min(elog$date)) / 2))
# split data into train (calibration) and test (holdout) and make matrices
data <- dc.ElogToCbsCbt(elog, per = "week",
                        T.cal = end.of.cal.period,
                        merge.same.date = TRUE,  # already did this
                        statistic = "freq")      # which CBT to return
# take a look
str(data)
> str(data)
List of 3
 $ cal      :List of 2                        # calibration period matrices
  ..$ cbs: num [1:2357, 1:3] 2 1 0 0 0 7 1 0 2 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2357] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:3] "x" "t.x" "T.cal"
  ..$ cbt: num [1:2357, 1:266] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2357] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:266] "1997-01-08" "1997-01-09" "1997-01-10" "1997-01-11" ...
 $ holdout  :List of 2                        # holdout period matrices
  ..$ cbt: num [1:2357, 1:272] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2357] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:272] "1997-10-01" "1997-10-02" "1997-10-03" "1997-10-04" ...
  ..$ cbs: num [1:2357, 1:2] 1 0 0 0 0 8 0 2 2 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2357] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "x.star" "T.star"
 $ cust.data:'data.frame': 2357 obs. of 5 variables:   # customer info
  ..$ cust       : int [1:2357] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ birth.per  : Date[1:2357], format: "1997-01-01" ...
  ..$ first.sales: num [1:2357] 29.33 63.34 6.79 13.97 23.94 ...
  ..$ last.date  : Date[1:2357], format: "1997-08-02" ...
  ..$ last.sales : num [1:2357] 14.96 11.77 6.79 13.97 23.94 ...
Extract cbs matrix
• cbs is short for "customer-by-sufficient-statistic" matrix, with the sufficient stats being:
– frequency
– recency (time of last transaction) and
– total time observed

cal2.cbs <- as.matrix(data[[1]][[1]])  # first item in list, first item in it
str(cal2.cbs)
Estimate parameters for model
• Purchase shape and scale params: r and α
• Dropout shape and scale params: s and β
# initial estimate
(params2 <- pnbd.EstimateParameters(cal2.cbs))
# 0.5528797 10.5838911 0.6250764 12.2011828
# look at log likelihood
(LL <- pnbd.cbs.LL(params2, cal2.cbs))
# -9598.711
Estimate parameters for model

# make a series of estimates, see if they converge
p.matrix <- c(params2, LL)
for (i in 1:20) {
  params2 <- pnbd.EstimateParameters(cal2.cbs, params2)
  LL <- pnbd.cbs.LL(params2, cal2.cbs)
  p.matrix.row <- c(params2, LL)
  p.matrix <- rbind(p.matrix, p.matrix.row)
}
# examine
p.matrix
# use final set of values
(params2 <- p.matrix[dim(p.matrix)[1], 1:4])
Plot iso-likelihood for param pairs

# parameter names for a more descriptive result
param.names <- c("r", "alpha", "s", "beta")
LL <- pnbd.cbs.LL(params2, cal2.cbs)
dc.PlotLogLikelihoodContours(pnbd.cbs.LL, params2,
                             cal.cbs = cal2.cbs,
                             n.divs = 5,
                             num.contour.lines = 7,
                             zoom.percent = 0.3,
                             allow.neg.params = FALSE,
                             param.names = param.names)
Plot iso-likelihood for param pairs
[Figure: log-likelihood contour plots for each parameter pair: r and alpha, r and s, r and beta, alpha and s, alpha and beta, s and beta]
Plot population estimates

# par to make two plots side by side
par(mfrow = c(1, 2))
# Plot the estimated distribution of
# customers' propensities to purchase
pnbd.PlotTransactionRateHeterogeneity(params2,
                                      lim = NULL)  # lim is upper xlim
# Plot estimated distribution of
# customers' propensities to drop out
pnbd.PlotDropoutRateHeterogeneity(params2)
# set par to normal
par(mfrow = c(1, 1))
Plot population estimates
[Figure: "Heterogeneity in Transaction Rate" (Mean: 0.0522, Var: 0.0049) and "Heterogeneity in Dropout Rate" (Mean: 0.0512, Var: 0.0042), each a density over rates from 0.00 to 0.30]
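As a sanity check, the mean and variance printed on those plots follow directly from the gamma parameters: mean = shape/rate and variance = shape/rate^2. Using the purchase-rate values from the fit above (rounded):

```r
r <- 0.553; alpha <- 10.584   # purchase-rate params (rounded from the fit)
c(mean = r / alpha, var = r / alpha^2)
# mean ~ 0.0522, var ~ 0.0049 -- matching the transaction-rate plot
```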
Examine individual predictions

# predicted num. transactions a new customer
# will make in 52 weeks
pnbd.Expectation(params2, t = 52)

# expected characteristics for customer 1516,
# conditional on their purchasing during calibration
cal2.cbs["1516", ]
x     <- cal2.cbs["1516", "x"]      # x is frequency
t.x   <- cal2.cbs["1516", "t.x"]    # t.x is time of last buy
T.cal <- cal2.cbs["1516", "T.cal"]  # T.cal is time observed
# estimate their transactions in a T.star duration
pnbd.ConditionalExpectedTransactions(params2,
                                     T.star = 52,  # weeks
                                     x, t.x, T.cal)
# [1] 25.24912
Probability a customer is 'alive'

x    # freq of purchase
t.x  # week of last purchase
T.cal <- 39  # week of end of cal, i.e. present
pnbd.PAlive(params2, x, t.x, T.cal)

# To visualize the distribution of P(Alive)
# across customers:
params3 <- pnbd.EstimateParameters(cal2.cbs)
p.alives <- pnbd.PAlive(params3,
                        cal2.cbs[, "x"],
                        cal2.cbs[, "t.x"],
                        cal2.cbs[, "T.cal"])
Plot P(Alive)

ggplot(as.data.frame(p.alives), aes(x = p.alives)) +
  geom_histogram(colour = "grey", fill = "orange") +
  ylab("Number of Customers") +
  xlab("Probability Customer is 'Live'") +
  theme_minimal()
[Figure: histogram of the number of customers by probability the customer is 'live'; most customers cluster near 0 or 1]
Plot Observed, Model Transactions

# plot actual & expected customers binned by
# num of repeat transactions
pnbd.PlotFrequencyInCalibration(params2, cal2.cbs,
                                censor = 10,
                                title = "Model vs. Reality during Calibration")
[Figure: "Model vs. Reality during Calibration" -- actual vs. model counts of customers by calibration-period transactions, binned 0 through 10+]
Compare calibration to holdout
• Note of caution: potential overfitting
– Our gamma distributions are based on the specific customers we had during calibration.
– How would our parameters and predictions change with different customers?
– We will address this in Part 2
Get holdout results, duration

# get holdout transactions from dataframe data,
# add in as x.star
x.star <- data[[2]][[2]][, 1]
cal2.cbs <- cbind(cal2.cbs, x.star)
str(cal2.cbs)

holdoutdates <- attributes(data[[2]][[1]])[[2]][[2]]
holdoutlength <- round(as.numeric(max(as.Date(holdoutdates)) -
                                  min(as.Date(holdoutdates))) / 7)
Plot frequency comparison

# plot predicted vs seen conditional freqs
T.star <- holdoutlength
censor <- 10  # Bin all order numbers here and above
comp <- pnbd.PlotFreqVsConditionalExpectedFrequency(params2,
                                                    T.star,
                                                    cal2.cbs,
                                                    x.star,
                                                    censor)
[Figure: "Conditional Expectation" -- actual vs. model holdout-period transactions by calibration-period transactions, binned 0 through 10+]
Examine accompanying matrix
• Bin sizes for that plot can be seen in the comp matrix:

rownames(comp) <- c("act", "exp", "bin")
comp

          freq.0      freq.1     freq.2     freq.3    freq.4    freq.5
act    0.2367116   0.6970387   1.392523   1.560000  2.532258  2.947368
exp    0.1367795   0.5921279   1.181825   1.693969  2.372472  2.876888
bin 1411.0000000 439.0000000 214.000000 100.000000 62.000000 38.000000
       freq.6    freq.7   freq.8   freq.9  freq.10+
act  3.862069  4.913043 3.714286 8.400000  7.793103
exp  3.776675  4.167163 5.698026 5.487862  8.369321
bin 29.000000 23.000000 7.000000 5.000000 29.000000
Compare Weekly transactions

# get data without first transaction: removes those who buy 1x
removedFirst.elog <- dc.SplitUpElogForRepeatTrans(elog)$repeat.trans.elog
removedFirst.cbt <- dc.CreateFreqCBT(removedFirst.elog)
# get all data, so we have customers who buy 1x
allCust.cbt <- dc.CreateFreqCBT(elog)
# add 1x customers into matrix
tot.cbt <- dc.MergeCustomers(data.correct = allCust.cbt,
                             data.to.correct = removedFirst.cbt)
lengthInDays <- as.numeric(max(as.Date(colnames(tot.cbt))) -
                           min(as.Date(colnames(tot.cbt))))
origin <- min(as.Date(colnames(tot.cbt)))
Compare Weekly transactions

tot.cbt.df <- melt(tot.cbt, varnames = c("cust", "date"),
                   value.name = "Freq")
tot.cbt.df$date <- as.Date(tot.cbt.df$date)
tot.cbt.df$week <- as.numeric(1 + floor((tot.cbt.df$date - origin + 1) / 7))

transactByDay  <- ddply(tot.cbt.df, .(date), summarize, sum(Freq))
transactByWeek <- ddply(tot.cbt.df, .(week), summarize, sum(Freq))
names(transactByWeek) <- c("week", "Transactions")
names(transactByDay)  <- c("date", "Transactions")

T.cal <- cal2.cbs[, "T.cal"]
T.tot <- 78  # end of holdout
comparisonByWeek <- pnbd.PlotTrackingInc(params2, T.cal, T.tot,
    actual.inc.tracking.data = transactByWeek$Transactions)
Compare Weekly transactions
[Figure: actual vs. model weekly repeat transactions across the 78 weeks]
Formal Measures of Accuracy

# root mean squared error
rmse <- function(est, act) { return(sqrt(mean((est - act)^2))) }
# mean squared logarithmic error
msle <- function(est, act) { return(mean((log1p(est) - log1p(act))^2)) }

predicted <- pnbd.ConditionalExpectedTransactions(params2,
                 T.star = 38,  # weeks
                 x = cal2.cbs[, "x"],
                 t.x = cal2.cbs[, "t.x"],
                 T.cal = cal2.cbs[, "T.cal"])

cal2.cbs[, "x.star"]  # actual transactions for each person

rmse(act = cal2.cbs[, "x.star"], est = predicted)
msle(act = cal2.cbs[, "x.star"], est = predicted)
These error measures are not very meaningful without a comparison model.
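As a sketch of what such a comparison could look like (toy numbers, not the CDNOW data), here is a naive baseline that assumes each customer simply keeps buying at their calibration-period rate:

```r
# toy calibration/holdout values for four customers (made up)
x      <- c(2, 0, 7, 1)       # calibration repeat purchases
T.cal  <- c(30, 38, 35, 20)   # weeks observed in calibration
x.star <- c(1, 0, 8, 0)       # holdout repeat purchases
T.star <- 38                  # holdout length in weeks

rmse <- function(est, act) sqrt(mean((est - act)^2))

# naive baseline: project the calibration rate into the holdout window
naive.est <- x / T.cal * T.star
rmse(est = naive.est, act = x.star)  # compare against the model's rmse
```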
Next Week:
• Compare results to a simple model
• Estimate of expenditure / customer value
• Use info about clumpiness of purchase patterns (as in Platzer 2008)
• Use info about seasonality of purchasing, with the forecast package
• Improve model predictions with machine learning techniques:
– Cross-validation to avoid over-fitting
– Combining model predictions
References
• Griffin and Lowenstein (2001). Customer Winback: How to Recapture Lost Customers—And Keep Them Loyal. San Francisco: Jossey-Bass.
• Platzer (2008). "Stochastic Models of Noncontractual Consumer Relationships." Master of Science in Business Administration thesis, Vienna University of Economics and Business Administration, Austria.
• Schmittlein, Morrison, and Colombo (1987). "Counting Your Customers: Who Are They and What Will They Do Next?" Management Science, 33(1), 1-24.
• Wang, Gao, and Li (2010). "Empirical Analysis of Customer Behaviors in Chinese E-commerce." Journal of Networks, 5(10), 1177-1184.
• Wübben and Wangenheim (2008). "Instant Customer Base Analysis: Managerial Heuristics Often 'Get It Right'." Journal of Marketing, 72(3), 82-93.
• Zhang, Bradlow, and Small (2012). "New Measures of Clumpiness for Incidence Data."
Purchase rate often depends on type of purchase
[Figure: 1.1 million purchases on 360buy.com, from Wang, Gao, & Li 2010]