Thompson Sampling for learning in online decision making
Shipra Agrawal, IEOR and Data Science Institute, Columbia University
Applications: movie recommendations, online retail, content search
Goal
• Maximize revenue / customer satisfaction
• Customer “buys” or “likes” or “clicks on” at least one of the products (preferably the most expensive one)

Limitations
• Limited display space, customer attention
• Limited prior knowledge of customer preferences

Challenges
1. Learn the “likeability” of products
2. Maximize the revenue or clicks

ARE THE TWO TASKS ALIGNED?
How it works
• Recommend product(s)
• Observe the customer’s response
Example recommendation category: “Dominated by strong female lead”
EXPLORE AND EXPLOIT
• Explore for more informative data
• Exploit for immediate clicks
Stuck at second best: need to explore.
RANDOMLY EXPLORE FOR EVERY POSSIBLE TYPE OF CUSTOMER?
Personalization
Millions of products
RANDOMLY EXPLORE FOR EVERY POSSIBLE TYPE OF PRODUCT?
Trends change, cold start: a short period for collecting and utilizing data
EXPLORE, BUT ONLY AS MUCH AS REQUIRED
The multi-armed bandit problem (Thompson 1933; Robbins 1952)
Multiple rigged slot machines in a casino. Which one to put money on?
• Try each one out
WHEN TO STOP TRYING (EXPLORATION) AND START PLAYING (EXPLOITATION)?
Online decisions: at every time step t = 1, …, T, pull one arm out of N arms
Bandit feedback: only the reward of the pulled arm can be observed
Stochastic feedback: for each arm i, rewards are generated i.i.d. from a fixed but unknown distribution with support [0,1] and mean μ_i
Maximize expected reward in time T: E[∑_t r_t] = ∑_i μ_i E[k_i(T)], where k_i(T) is the number of plays of arm i up to time T
Minimize expected regret in time T: the optimal arm is the one with expected reward μ∗ = max_i μ_i; the expected regret for playing arm i is Δ_i = μ∗ − μ_i; the expected regret in any time T is Regret(T) = ∑_i Δ_i E[k_i(T)]
Anytime algorithm: the time horizon T is not known
Natural and efficient heuristic:
• Maintain a belief about the effectiveness (mean reward) of each arm
• Observe feedback, update the belief of the pulled arm i in a Bayesian manner
• Pull an arm with its posterior probability of being the best arm

It does NOT simply choose the arm most likely to be effective: it gives the benefit of the doubt to arms that are less explored. “Optimal” benefit of the doubt [Agrawal and Goyal, COLT 2012, AISTATS 2013]
Bernoulli i.i.d. rewards: playing arm i produces reward 1 with unknown probability μ_i, and 0 otherwise
Maintain Beta posteriors on each μ_i. Starting prior? Use the very non-informative prior Beta(1,1)
Beta prior, Bernoulli likelihood → Beta posterior. Posterior for arm i at time t: Beta(S_i(t) + 1, F_i(t) + 1), where S_i(t) and F_i(t) count the successes and failures of arm i so far
At any time t, play every arm with its posterior probability of being the best arm

Algorithm (a sketch in Python follows below):
Start with the uniform prior Beta(1,1) for each arm i
At time t = 1, 2, …
• Posterior for arm i is Beta(S_i(t) + 1, F_i(t) + 1)
• Sample θ_i(t) from the posterior of each arm i
• Play arm i_t = arg max_i θ_i(t)
• Observe reward r_t ∈ {0, 1}, equal to 1 with the pulled arm’s success probability
• Update the success and failure counts of arm i_t
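A minimal sketch of this loop in Python (the simulated environment, the function name, `true_means`, the horizon `T`, and the seed are illustrative assumptions, not part of the original algorithm statement):

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, T, seed=0):
    """Bernoulli Thompson Sampling with Beta(1,1) priors.

    true_means are the unknown success probabilities; they are used
    only to simulate customer feedback.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.zeros(n_arms)  # S_i(t)
    failures = np.zeros(n_arms)   # F_i(t)
    total_reward = 0
    for _ in range(T):
        # Sample theta_i ~ Beta(S_i + 1, F_i + 1) for every arm
        theta = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(theta))                   # largest posterior sample wins
        reward = int(rng.random() < true_means[arm])  # simulated Bernoulli feedback
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: three arms; the third one is best
print(thompson_sampling_bernoulli([0.3, 0.5, 0.6], T=10_000))
```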
A Bayesian algorithm for a frequentist setting!
Optimal instance-dependent bounds for Bernoulli rewards:
Regret(T) ≤ (1 + ε) ∑_i (Δ_i ln T) / d(μ_i ‖ μ∗) + O(N/ε²),
where d(μ_i ‖ μ∗) is the KL divergence between Bernoulli distributions with means μ_i and μ∗.
Matches the asymptotic lower bound for any algorithm [Lai, Robbins 1985]. The popular UCB algorithm achieves this only after careful tuning [Bayes-UCB, Kaufmann et al. 2012]
Near-optimal worst-case bounds: Regret(T) = O(√(NT ln T))
Lower bound: Ω(√(NT))
Only assumption: Bernoulli likelihood
Suppose the reward for arm i is i.i.d. N(μ_i, 1). Starting prior: N(0, 1).
Gaussian prior, Gaussian likelihood → Gaussian posterior: the posterior for arm i at time t is N(μ̂_i(t), 1/(k_i(t) + 1)),
where μ̂_i(t) is the empirical mean of the k_i(t) observations of arm i.
Algorithm:
• Sample θ_i(t) from the posterior N(μ̂_i(t), 1/(k_i(t) + 1)) for each arm i
• Play arm i_t = arg max_i θ_i(t)
• Observe the reward, update the empirical mean of arm i_t
Now apply this algorithm for any reward distribution!
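A sketch of this Gaussian version in Python, runnable against any reward distribution as just suggested (the function name and the `pull_arm` callback standing in for the real feedback loop are assumptions for illustration):

```python
import numpy as np

def thompson_sampling_gaussian(pull_arm, n_arms, T, seed=0):
    """Thompson Sampling with Gaussian posteriors N(mean_i, 1/(k_i + 1)).

    pull_arm(i) should return the observed reward of arm i.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)  # k_i(t): number of pulls of arm i
    means = np.zeros(n_arms)   # empirical mean reward of arm i
    for _ in range(T):
        # Posterior std is 1/sqrt(k_i + 1): it shrinks as an arm is pulled more
        theta = rng.normal(means, 1.0 / np.sqrt(counts + 1))
        arm = int(np.argmax(theta))
        r = pull_arm(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
    return means, counts
```

Note that nothing in the loop requires the rewards themselves to be Gaussian; only the posterior is.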
Near-optimal instance-dependent bounds: Regret(T) = O(∑_i ln T / Δ_i)
Matches the best available bounds for UCB for general reward distributions
Near-optimal worst-case bounds: Regret(T) = O(√(NT ln N))
Matches the lower bound within logarithmic factors
Only assumption: bounded or sub-Gaussian noise
Two arms with means μ_1 > μ_2 and gap Δ = μ_1 − μ_2. Every time arm 2 is pulled, Δ regret is incurred. Bounding the number of pulls of arm 2 by O(ln T / Δ²) gives the regret bound. How many pulls of arm 2 are actually needed?
After n = O(ln T / Δ²) pulls of arm 2 and arm 1:
• the empirical means are well separated: the estimation error is smaller than Δ/2 with high probability
• the Beta posteriors are well separated: the posterior standard deviation ≃ 1/√n, small compared to Δ
The two arms can be distinguished! No more pulls of arm 2.
Harder case: ln T / Δ² pulls of arm 2, but few pulls of arm 1. Arm 1’s posterior is then wide, with standard deviation large compared to Δ, so its sample exceeds arm 2’s well-concentrated sample with constant probability.
Arm 1 will therefore be played roughly once every constant number of steps in this situation. It will take at most a constant number of steps (extra pulls of arm 2) to get out of this situation. The total number of pulls of arm 2 is thus at most O(ln T / Δ²).
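A quick numerical sanity check of this intuition (a sketch: the means, horizon, seed, and function name are illustrative, and the predicted count is only an order-of-magnitude estimate):

```python
import numpy as np

def suboptimal_pulls(mu1=0.7, mu2=0.5, T=100_000, seed=0):
    """Count pulls of the suboptimal arm under Bernoulli Thompson Sampling."""
    rng = np.random.default_rng(seed)
    S = np.zeros(2)
    F = np.zeros(2)
    pulls_of_arm2 = 0
    for _ in range(T):
        theta = rng.beta(S + 1, F + 1)
        arm = int(np.argmax(theta))
        r = int(rng.random() < (mu1, mu2)[arm])
        S[arm] += r
        F[arm] += 1 - r
        pulls_of_arm2 += arm  # index 1 is the suboptimal arm
    return pulls_of_arm2

# With Delta = 0.2 and T = 1e5, ln(T)/Delta^2 is roughly 300, so a count
# of that order (rather than of order T) is the expected outcome.
print(suboptimal_pulls())
```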
Summary: the variance of the posterior enables exploration. Optimal bounds (up to optimal constants) require a more careful use of the posterior structure.
Scalability: large numbers of products and customer types. Utilize similarity? Content-based recommendation:
• Customers and products described by their features
• Similar features mean similar preferences
• Parametric models mapping customer and product features to customer preferences
Contextual bandits:
Exploration-exploitation to learn the parametric models
N arms, with N possibly very large. A d-dimensional context (feature vector) x_{i,t} for every arm i and time t. Linear parametric model:
• Unknown parameter μ ∈ R^d; the expected reward for arm i at time t is x_{i,t} ⋅ μ
• The algorithm picks x_t ∈ {x_{1,t}, …, x_{N,t}} and observes reward r_t with E[r_t] = x_t ⋅ μ
• The optimal arm depends on the context: x_t∗ = arg max_{x_{i,t}} x_{i,t} ⋅ μ
Goal: minimize regret, Regret(T) = ∑_t (x_t∗ ⋅ μ − x_t ⋅ μ)
Least squares solution of the set of t − 1 equations x_s ⋅ μ ≈ r_s, s = 1, …, t − 1:
μ̂_t = B_t⁻¹ ∑_s x_s r_s, where B_t = I + ∑_s x_s x_s′. B_t⁻¹ is the covariance matrix of this estimator.
[A., Goyal 2013]
Starting prior on μ: N(0, I). Reward distribution given x_t, μ: N(x_t ⋅ μ, 1). The posterior on μ at time t is then N(μ̂_t, B_t⁻¹).
Algorithm (sketched below): at step t,
• Sample μ̃_t from N(μ̂_t, B_t⁻¹)
• Pull the arm with feature x_t such that x_t ⋅ μ̃_t = max_i x_{i,t} ⋅ μ̃_t
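A sketch of this linear Thompson Sampling loop in Python (the function name, the `contexts_at` and `pull` callbacks, and the variance scaling `v` are illustrative assumptions; the paper derives a specific scaling for the posterior used in its analysis):

```python
import numpy as np

def linear_thompson_sampling(contexts_at, pull, d, T, v=1.0, seed=0):
    """Thompson Sampling with a linear payoff model.

    contexts_at(t) -> array of shape (N, d): arm feature vectors at time t
    pull(x)        -> float: noisy reward for the chosen feature vector x
    """
    rng = np.random.default_rng(seed)
    B = np.eye(d)     # B_t = I + sum_s x_s x_s'
    f = np.zeros(d)   # sum_s x_s r_s
    for t in range(T):
        mu_hat = np.linalg.solve(B, f)                   # least-squares estimate
        cov = v ** 2 * np.linalg.inv(B)                  # posterior covariance
        mu_tilde = rng.multivariate_normal(mu_hat, cov)  # posterior sample
        X = contexts_at(t)
        x = X[int(np.argmax(X @ mu_tilde))]              # arm maximizing sampled reward
        r = pull(x)
        B += np.outer(x, x)                              # rank-one update of B_t
        f += r * x
    return np.linalg.solve(B, f)                         # final estimate of mu
```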
Apply this algorithm for any likelihood, keeping the starting prior N(0, I)!
With probability 1 − δ, Regret(T) = Õ(d^{3/2}√T)
Any likelihood, unknown prior; the only assumption is bounded or sub-Gaussian noise. No dependence on the number of arms.
Lower bound: Ω(d√T). For UCB, the best bound is Õ(d√T) [Dani et al. 2008, Abbasi-Yadkori et al. 2011]. The best earlier bound for a polynomial-time algorithm was Õ(d^{3/2}√T) [Dani et al. 2008].
Known likelihood: exponential families (with Jeffreys prior) [Korda et al. 2013]
Known prior (Bayesian regret): near-optimal regret bounds for any prior [Russo and Van Roy 2013, 2014], [Bubeck and Liu 2013]
Extensions for many variations of MAB: side information, delayed feedback, sleeping bandits, sparse bandits, spectral bandits
Assortment selection as a multi-armed bandit: arms are products, and limited display space allows only K products at a time. Challenge: the customer’s response to one product is influenced by the other products in the assortment, so arms are no longer independent.
Multinomial logit (MNL) choice model: the probability of choosing product i (feature vector x_i) from assortment S is
p_i(S) = exp(x_i ⋅ θ) / (1 + ∑_{j∈S} exp(x_j ⋅ θ))
The log ratio log(p_i(S) / p_0(S)), where p_0 is the no-purchase probability, is linear in the features.
1-dimensional case [A., Avadhanula, Goyal, Zeevi, EC 2016]:
p_i(S) = v_i / (1 + ∑_{j∈S} v_j)
The log ratio is constant.
Independence of irrelevant alternatives.
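For concreteness, here is a small simulator of the 1-dimensional MNL choice model above (the function name and the parameter array `v` are assumptions, used only to generate choices):

```python
import numpy as np

def mnl_choice(v, S, rng):
    """Sample one customer choice from assortment S under the MNL model:
    P(choose i) = v_i / (1 + sum_{j in S} v_j); the leftover mass is no-purchase.
    Returns a product index from S, or None for no-purchase."""
    weights = np.array([v[j] for j in S], dtype=float)
    denom = 1.0 + weights.sum()
    probs = np.append(weights / denom, 1.0 / denom)  # last entry: no-purchase
    k = rng.choice(len(S) + 1, p=probs)
    return None if k == len(S) else S[k]
```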
N products, unknown parameters v_1, v_2, …, v_N. At every step t, recommend an assortment S_t of size at most K, observe the customer’s choice (or no purchase) and the resulting revenue, and update the parameter estimates. Goal: optimize the total expected revenue ∑_t E[R(S_t)], or minimize regret compared to the optimal assortment S∗ = argmax_{|S| ≤ K} ∑_{i∈S} r_i p_i(S).
[A., Avadhanula, Goyal, Zeevi, EC 2016]
Censored feedback: the feedback for a product is affected by the other products in the assortment, and the number of possible assortments is combinatorially large.
• Getting an unbiased estimate: offer an assortment repeatedly until a no-purchase occurs; the number of times product i is purchased is an unbiased estimate of its parameter v_i (see the sketch below)
Then, use standard UCB or Thompson Sampling techniques
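A sketch of one such estimation epoch, reusing the `mnl_choice` simulator above (the epoch structure follows the description on this slide; the function name is illustrative):

```python
def estimation_epoch(v, S, rng):
    """Offer assortment S repeatedly until a no-purchase occurs.
    Under the MNL model, the number of purchases of product i in one
    epoch has expectation v_i, giving an unbiased estimate of it."""
    counts = {i: 0 for i in S}
    while True:
        choice = mnl_choice(v, S, rng)
        if choice is None:  # the epoch ends at the first no-purchase
            return counts
        counts[choice] += 1
```

Averaging these per-epoch counts over many epochs yields the estimates that feed into the UCB or Thompson Sampling machinery.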
UCB: Õ(√(NT)) regret for the 1-dimensional parameter case
Assumes the no-purchase outcome is the most probable one. Parameter independent; no dependence on K.
Further regret bounds [ongoing work]: the parameter c is a lower bound on the gradient of the choice probability with respect to any product parameter.
Thompson Sampling: ongoing work, with significantly more attractive empirical results
Budget/supply constraints, nonlinear utilities [A. and Devanur, EC 2014], [A. and Devanur, SODA 2015], [A., Devanur, Li, 2016], [A. and Devanur, 2016]
Exploring when your recommendations may not be followed: incentivizing selfish users to explore