
Page 1: Thompson Sampling

for learning in online decision making

Shipra Agrawal, IEOR and Data Science Institute, Columbia University

Page 2:

Movie Recommendations Online Retail Content Search

Page 3:

Goal
• Maximize revenue / customer satisfaction
• Customer "buys", "likes", or "clicks on" at least one of the products (preferably the most expensive one)

Limitations
• Limited display space, customer attention
• Limited prior knowledge of customer preferences

Challenges
1. Learn the "likeability" of products
2. Maximize the revenue or clicks
ARE THE TWO TASKS ALIGNED?

How it works:
• Recommend product(s)
• Observe the customer's response

Page 4:

Dominated by strong female lead

EXPLORE AND EXPLOIT
Explore for more informative data

Exploit for immediate clicks

Stuck at second best, need to explore

Page 5:

RANDOMLY EXPLORE FOR EVERY POSSIBLE TYPE OF CUSTOMER?

Personalization

Page 6:

Millions of products

RANDOMLY EXPLORE FOR EVERY POSSIBLE TYPE OF PRODUCT?

Page 7:

Trends change, cold start
Short period for collecting and utilizing data

EXPLORE, BUT ONLY AS MUCH AS REQUIRED

Page 8:

The multi-armed bandit problem (Thompson 1933; Robbins 1952)

Multiple rigged slot machines in a casino. Which one to put money on?
• Try each one out

WHEN TO STOP TRYING (EXPLORATION) AND START PLAYING (EXPLOITATION)?

Page 9:

Online decisions: at every time step $t = 1, \ldots, T$, pull one arm out of $N$ arms

Bandit feedback: only the reward of the pulled arm can be observed

Stochastic feedback: for each arm $i$, the reward is generated i.i.d. from a fixed but unknown distribution with support in $[0,1]$ and mean $\mu_i$
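To make the setup concrete, here is a minimal simulation sketch of this feedback model in Python; the class name, arm means, and seed are illustrative assumptions, not from the slides.

```python
import numpy as np

class BernoulliBandit:
    """Stochastic bandit environment: arm i yields an i.i.d. Bernoulli(mu_i) reward."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)          # fixed but unknown to the learner
        self.rng = np.random.default_rng(seed)

    def pull(self, i):
        # Bandit feedback: only the reward of the pulled arm is observed.
        return float(self.rng.random() < self.means[i])

# Example: N = 3 arms with (hidden) means 0.3, 0.5, 0.7; rewards lie in [0, 1]
env = BernoulliBandit([0.3, 0.5, 0.7])
r = env.pull(2)
```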

Page 10:

Maximize expected reward in time $T$: $\mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right] = \sum_i \mu_i \,\mathbb{E}[n_i(T)]$, where $n_i(T)$ is the number of pulls of arm $i$ in $T$ steps

Minimize expected regret in time $T$:
• The optimal arm is the arm with expected reward $\mu^* = \max_i \mu_i$
• Expected regret for playing arm $i$: $\Delta_i = \mu^* - \mu_i$
• Expected regret in any time $T$: $\mathbb{E}[\text{Regret}(T)] = \sum_i \Delta_i \,\mathbb{E}[n_i(T)]$

Anytime algorithm: the time horizon $T$ is not known
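A quick worked example (arm means chosen only for illustration): with two arms of means $\mu_1 = 0.7$ and $\mu_2 = 0.5$, the optimal arm is arm 1, $\mu^* = 0.7$, and $\Delta_2 = 0.2$, so
$$\mathbb{E}[\text{Regret}(T)] = \Delta_2\,\mathbb{E}[n_2(T)] = 0.2\,\mathbb{E}[n_2(T)];$$
an algorithm that pulls arm 2 fifty times in expectation over the horizon incurs expected regret 10.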

Page 11:

Natural and efficient heuristic:
• Maintain a belief about the effectiveness (mean reward) of each arm
• Observe feedback, update the belief of the pulled arm $i$ in a Bayesian manner
• Pull each arm with its posterior probability of being the best arm

Does NOT choose the arm currently most likely to be effective; gives the benefit of the doubt to those less explored

"Optimal" benefit of the doubt [Agrawal and Goyal, COLT 2012; AISTATS 2013]

Page 12:

Bernoulli i.i.d. rewards: playing arm $i$ produces reward 1 with unknown probability $\mu_i$, 0 otherwise

Maintain Beta posteriors on $\mu_i$. Starting prior? Use a very non-informative prior, Beta(1,1) (uniform)

Beta prior, Bernoulli likelihood → Beta posterior. Posterior for arm $i$ at time $t$: Beta$(S_i(t)+1,\; F_i(t)+1)$, where $S_i(t)$ and $F_i(t)$ count the successes and failures observed for arm $i$ so far

At any time $t$, play every arm with its posterior probability of being the best arm

Page 13:

Start with uniform prior Beta(1,1) for each arm $i$
At time $t = 1, 2, \ldots$:
• Posterior for arm $i$ is Beta$(S_i(t)+1,\; F_i(t)+1)$
• Sample $\theta_i(t)$ from the posterior for each $i$
• Play arm $i_t = \arg\max_i \theta_i(t)$
• Observe reward $r_t$ (0 or 1, equal to 1 with probability $\mu_{i_t}$)
• Update the success and failure counts for arm $i_t$

Bayesian algorithm for a frequentist setting!
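A minimal runnable sketch of this Beta-Bernoulli Thompson Sampling loop in Python; the arm means, horizon, and seed are illustrative assumptions, not values from the slides.

```python
import numpy as np

def thompson_sampling_bernoulli(means, T, seed=0):
    """Beta-Bernoulli Thompson Sampling starting from uniform Beta(1,1) priors."""
    rng = np.random.default_rng(seed)
    n_arms = len(means)
    successes = np.zeros(n_arms)   # S_i(t)
    failures = np.zeros(n_arms)    # F_i(t)
    total_reward = 0.0
    for t in range(T):
        # Sample theta_i(t) from the Beta(S_i + 1, F_i + 1) posterior of each arm.
        theta = rng.beta(successes + 1, failures + 1)
        i = int(np.argmax(theta))             # play the arm with the largest sample
        r = float(rng.random() < means[i])    # Bernoulli reward with mean mu_i
        successes[i] += r                     # update only the pulled arm's posterior
        failures[i] += 1 - r
        total_reward += r
    return total_reward, successes + failures   # total reward and per-arm pull counts

# Illustrative run: 3 arms, horizon 10,000; most pulls should go to the 0.7 arm.
reward, pulls = thompson_sampling_bernoulli([0.3, 0.5, 0.7], T=10_000)
```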

Page 14:

Optimal instance-dependent bounds for Bernoulli rewards:
Regret$(T) \le (1+\epsilon)\,\ln T \sum_{i \ne i^*} \frac{\Delta_i}{d(\mu_i \,\|\, \mu^*)} + O\!\left(\frac{N}{\epsilon^2}\right)$, where $d(\mu_i \,\|\, \mu^*)$ is the KL divergence between Bernoulli($\mu_i$) and Bernoulli($\mu^*$)
• Matches the asymptotic lower bound for any algorithm [Lai and Robbins 1985]
• The popular UCB algorithm achieves this only after careful tuning [Bayes-UCB, Kaufmann et al. 2012]

Near-optimal worst-case-instance bounds:
Regret$(T) = O(\sqrt{NT \ln T})$
• Lower bound: $\Omega(\sqrt{NT})$

Only assumption: Bernoulli likelihood

Page 15:

Suppose the reward for arm $i$ is i.i.d. $\mathcal{N}(\mu_i, 1)$. Starting prior: $\mathcal{N}(0, 1)$
Gaussian prior, Gaussian likelihood → Gaussian posterior: $\mathcal{N}\!\left(\hat{\mu}_i(t), \frac{1}{n_i(t)+1}\right)$, where $\hat{\mu}_i(t)$ is the empirical mean of the $n_i(t)$ observations for arm $i$

Algorithm:
• Sample $\theta_i(t)$ from the posterior $\mathcal{N}\!\left(\hat{\mu}_i(t), \frac{1}{n_i(t)+1}\right)$ for each arm $i$
• Play arm $\arg\max_i \theta_i(t)$
• Observe the reward, update the empirical mean for the played arm

Now apply this algorithm for any reward distribution!
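A minimal sketch of this Gaussian-posterior variant in Python; `pull_arm`, the arm means, and the horizon are illustrative assumptions. The point of the slide is that the same loop can be fed any bounded or sub-Gaussian rewards, not only Gaussian ones.

```python
import numpy as np

def thompson_sampling_gaussian(pull_arm, n_arms, T, seed=0):
    """Thompson Sampling with N(0,1) priors and a unit-variance Gaussian likelihood.
    pull_arm(i) returns a reward for arm i (any bounded / sub-Gaussian distribution)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)    # n_i(t): number of pulls of arm i so far
    means = np.zeros(n_arms)     # empirical means mu_hat_i(t)
    for t in range(T):
        # Posterior of arm i is N(mu_hat_i(t), 1 / (n_i(t) + 1)).
        theta = rng.normal(means, 1.0 / np.sqrt(counts + 1.0))
        i = int(np.argmax(theta))
        r = pull_arm(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # running empirical mean
    return counts, means

# Illustrative run with Bernoulli rewards (the algorithm never uses this knowledge).
env_rng = np.random.default_rng(1)
mu = [0.3, 0.5, 0.7]
counts, est = thompson_sampling_gaussian(lambda i: float(env_rng.random() < mu[i]),
                                         n_arms=3, T=10_000)
```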

Page 16:

Near-optimal instance-dependent bounds:
Regret$(T) = O\!\left(\sum_{i \ne i^*} \frac{\ln T}{\Delta_i}\right)$
• Matches the best available bounds for UCB for general reward distributions

Near-optimal worst-case-instance bounds:
Regret$(T) = O(\sqrt{NT \ln N})$
• Matches the lower bound within logarithmic factors

Only assumption: bounded or sub-Gaussian reward noise

Page 17:

Two arms, $\mu_1 > \mu_2$, gap $\Delta = \mu_1 - \mu_2$
• Every time arm 2 is pulled, $\Delta$ regret is incurred
• Bound the number of pulls of arm 2 by $O\!\left(\frac{\ln T}{\Delta^2}\right)$ to get the regret bound
• How many pulls of arm 2 are actually needed?

Page 18:

After $n = O\!\left(\frac{\ln T}{\Delta^2}\right)$ pulls each of arm 2 and arm 1:
• Empirical means are well separated (estimation error below $\Delta$ w.h.p.)
• Beta posteriors are well separated (standard deviation $\simeq \frac{1}{\sqrt{n}} \lesssim \Delta$)
• The two arms can be distinguished! No more arm 2 pulls.

Page 19:

Many pulls of arm 2, but few pulls of arm 1: arm 1's posterior is still wide

Page 20:

• Arm 1 will be played roughly every constant number of steps in this situation
• It will take at most a constant number of steps (extra pulls of arm 2) to get out of this situation
• Total number of pulls of arm 2 is at most $O\!\left(\frac{\ln T}{\Delta^2}\right)$

Summary: the variance of the posterior enables exploration
Optimal bounds (up to optimal constants) require more careful use of the posterior structure
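The claimed logarithmic growth can be checked empirically. A small simulation sketch (two Bernoulli arms with gap Δ = 0.2, an illustrative choice) counts the pulls of arm 2 as the horizon grows; they should scale roughly like ln T / Δ², not linearly in T.

```python
import numpy as np

def arm2_pulls(T, mu=(0.7, 0.5), seed=0):
    """Run two-armed Beta-Bernoulli Thompson Sampling; return the number of arm-2 pulls."""
    rng = np.random.default_rng(seed)
    s = np.zeros(2)
    f = np.zeros(2)
    pulls2 = 0
    for _ in range(T):
        theta = rng.beta(s + 1, f + 1)
        i = int(np.argmax(theta))
        r = float(rng.random() < mu[i])
        s[i] += r
        f[i] += 1 - r
        pulls2 += int(i == 1)
    return pulls2

for T in (1_000, 10_000, 100_000):
    print(T, arm2_pulls(T))   # arm-2 pulls grow slowly (roughly logarithmically) with T
```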

Page 21:

Scalability: large number of products and customer types. Utilize similarity?

Content-based recommendation:
• Customers and products described by their features
• Similar features mean similar preferences
• Parametric models mapping customer and product features to customer preferences

Contextual bandits: exploration-exploitation to learn the parametric models

Page 22:

$N$ arms, possibly very large $N$
A $d$-dimensional context (feature vector) $x_{i,t}$ for every arm $i$ and time $t$

Linear parametric model:
• Unknown parameter $\mu \in \mathbb{R}^d$
• Expected reward for arm $i$ at time $t$ is $x_{i,t} \cdot \mu$

The algorithm picks $x_t \in \{x_{1,t}, \ldots, x_{N,t}\}$, observes reward $r_t$ with $\mathbb{E}[r_t] = x_t \cdot \mu$
The optimal arm depends on the context: $x_t^* = \arg\max_{x_{i,t}} \; x_{i,t} \cdot \mu$

Goal: minimize regret, $\text{Regret}(T) = \sum_{t=1}^{T} \left( x_t^* \cdot \mu - x_t \cdot \mu \right)$

Page 23:

Least-squares solution of the set of $t-1$ equations $x_\tau \cdot \tilde{\mu} \approx r_\tau$, $\tau = 1, \ldots, t-1$:
$\hat{\mu}_t \simeq B_t^{-1} \left( \sum_{\tau} x_\tau r_\tau \right)$, where $B_t = I + \sum_{\tau} x_\tau x_\tau^\top$; $B_t^{-1}$ is the covariance matrix of this estimator

[A., Goyal 2013]: with a $\mathcal{N}(0, I)$ starting prior on $\mu$ and reward distribution $\mathcal{N}(x_t \cdot \mu, 1)$ given $x_t, \mu$, the posterior on $\mu$ at time $t$ is $\mathcal{N}(\hat{\mu}_t, B_t^{-1})$

Page 24:

Algorithm: at step $t$,
• Sample $\tilde{\mu}_t$ from $\mathcal{N}(\hat{\mu}_t, B_t^{-1})$
• Pull the arm with feature $x_t$ where $x_t = \arg\max_{x_{i,t}} \; x_{i,t} \cdot \tilde{\mu}_t$

Apply this algorithm for any likelihood, with starting prior $\mathcal{N}(0, I)$!
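A minimal sketch of this linear Thompson Sampling loop in Python; the dimension, number of arms, noise level, and context generator are illustrative assumptions. It follows the slide's simplified posterior $\mathcal{N}(\hat{\mu}_t, B_t^{-1})$; the formal analysis inflates the sampling covariance by an extra factor, which is omitted here.

```python
import numpy as np

def linear_thompson_sampling(get_contexts, get_reward, d, T, seed=0):
    """Thompson Sampling for the linear contextual bandit.
    get_contexts(t) -> (N, d) array of arm features x_{i,t}
    get_reward(x)   -> noisy reward with mean x . mu (mu unknown to the learner)"""
    rng = np.random.default_rng(seed)
    B = np.eye(d)        # B_t = I + sum_tau x_tau x_tau^T
    xr = np.zeros(d)     # sum_tau x_tau r_tau
    for t in range(T):
        X = get_contexts(t)
        B_inv = np.linalg.inv(B)
        mu_hat = B_inv @ xr                                 # regularized least-squares estimate
        mu_tilde = rng.multivariate_normal(mu_hat, B_inv)   # sample from N(mu_hat, B_t^{-1})
        i = int(np.argmax(X @ mu_tilde))                    # arm maximizing x . mu_tilde
        x = X[i]
        r = get_reward(x)
        B += np.outer(x, x)                                 # update design matrix
        xr += r * x
    return B, xr

# Illustrative run: d = 5, N = 20 arms with random contexts and a hidden parameter mu_star.
d, N = 5, 20
env_rng = np.random.default_rng(1)
mu_star = env_rng.normal(size=d) / np.sqrt(d)
linear_thompson_sampling(
    get_contexts=lambda t: env_rng.normal(size=(N, d)) / np.sqrt(d),
    get_reward=lambda x: float(x @ mu_star + env_rng.normal(scale=0.1)),
    d=d, T=2_000)
```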

Page 25:

With probability $1 - \delta$, regret $= \tilde{O}\!\left(d^{3/2}\sqrt{T}\right)$

• Any likelihood, unknown prior; only assumes bounded or sub-Gaussian noise
• No dependence on the number of arms $N$
• Lower bound: $\Omega(d\sqrt{T})$
• For UCB, best bound $\tilde{O}(d\sqrt{T})$ [Dani et al. 2008, Abbasi-Yadkori et al. 2011]
• Best earlier bound for a polynomial-time algorithm: $\tilde{O}\!\left(d^{3/2}\sqrt{T}\right)$ [Dani et al. 2008]

Page 26:

Page 27:

Known likelihood: exponential families (with Jeffreys prior) [Korda et al. 2013]

Known prior (Bayesian regret): near-optimal regret bounds for any prior [Russo and Van Roy 2013, 2014], [Bubeck and Liu 2013]

Extensions for many variations of MAB: side information, delayed feedback, sleeping bandits, sparse bandits, spectral bandits

Page 28:

Assortment selection as a multi-armed bandit:
• Arms are products
• Limited display space: K products at a time
• Challenge: the customer's response to one product is influenced by the other products in the assortment; arms are no longer independent

Page 29:

Page 30:

Page 31:

Multinomial logit (MNL) choice model: probability of choosing product $i$ (feature vector $x_i$) in assortment $S$:
$p_i(S) = \dfrac{e^{\theta \cdot x_i}}{1 + \sum_{j \in S} e^{\theta \cdot x_j}}$
Log ratio is linear in the features

1-dimensional case [A., Avadhanula, Goyal, Zeevi, EC 2016]:
$p_i(S) = \dfrac{v_i}{1 + \sum_{j \in S} v_j}$
Log ratio is constant

Independence of irrelevant alternatives
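A small sketch of the 1-dimensional MNL choice model in Python (the product parameters and assortment are illustrative): given an offered assortment S, the customer picks product i with probability $v_i / (1 + \sum_{j \in S} v_j)$ and makes no purchase with the remaining probability.

```python
import numpy as np

def mnl_choice(v, S, rng):
    """Sample one customer choice from assortment S under the MNL model.
    v[i] is product i's preference parameter; returns a product index, or None for no purchase."""
    S = list(S)
    weights = np.array([1.0] + [v[i] for i in S])   # slot 0 is the no-purchase option
    probs = weights / weights.sum()                 # p_i(S) = v_i / (1 + sum_{j in S} v_j)
    k = rng.choice(len(weights), p=probs)
    return None if k == 0 else S[k - 1]

rng = np.random.default_rng(0)
v = [0.8, 0.5, 0.3, 0.2]          # illustrative parameters
choice = mnl_choice(v, S=[0, 2, 3], rng=rng)
```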

Page 32:

$N$ products, unknown parameters $v_1, v_2, \ldots, v_N$
At every step $t$: recommend an assortment $S_t$ of size at most $K$, observe the customer's choice $c_t$ and revenue $r_{c_t}$, update the parameter estimates
Goal: optimize total revenue $\sum_t \mathbb{E}[r_{c_t}]$, or minimize regret compared to the optimal assortment $S^* = \arg\max_{S} \sum_{i \in S} r_i \, p_i(S)$

[A., Avadhanula, Goyal, Zeevi, EC 2016]

Page 33:

Censored feedback: the feedback for a product is affected by the other products in the assortment. Combinatorially many possible assortments.

• Getting an unbiased estimate: offer an assortment repeatedly until a no-purchase occurs. The number of times product $i$ is purchased in this epoch is an unbiased estimate of its parameter $v_i$.

Then, use standard UCB or Thompson Sampling techniques.
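A sketch of this epoch-based estimation idea in Python, using the same illustrative MNL environment as above: keep offering the same assortment until a no-purchase occurs; the per-product purchase counts in that epoch have expectation exactly $v_i$.

```python
import numpy as np

def epoch_estimate(v, S, rng):
    """Offer assortment S repeatedly until a no-purchase; return per-product purchase counts.
    Under the MNL model, E[count_i] = v_i, so each epoch gives an unbiased estimate of v_i."""
    counts = {i: 0 for i in S}
    while True:
        weights = np.array([1.0] + [v[i] for i in S])   # slot 0 = no purchase
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == 0:                # no-purchase ends the epoch
            return counts
        counts[S[k - 1]] += 1

# Averaging epoch counts over many epochs recovers the parameters (illustrative values).
rng = np.random.default_rng(0)
v = [0.8, 0.5, 0.3, 0.2]
S = [0, 2, 3]
epochs = [epoch_estimate(v, S, rng) for _ in range(5_000)]
est = {i: np.mean([e[i] for e in epochs]) for i in S}   # approx {0: 0.8, 2: 0.3, 3: 0.2}
```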

Page 34:

UCB: $\tilde{O}(\sqrt{NT})$ regret for the 1-dimensional parameter case
• Assumes the no-purchase probability is the highest
• Parameter independent; no dependence on $K$

Regret bound in terms of a parameter $c$ [ongoing work], where $c$ is a lower bound on the gradient of the choice probability with respect to any product parameter

Thompson Sampling: ongoing work, significantly more attractive empirical results

Page 35:

Budget/supply constraints, nonlinear utilities [A. and Devanur, EC 2014], [A. and Devanur, SODA 2015], [A., Devanur, Li, 2016], [A. and Devanur, 2016]

Exploring when your recommendations may not be followed: incentivizing selfish users to explore