Thompson Sampling for learning in online decision making
Shipra Agrawal, IEOR and Data Science Institute, Columbia University
Applications: movie recommendations, online retail, content search
Goal
• Maximize revenue / customer satisfaction
• Customer “buys” or “likes” or “clicks on” at least one of the products (preferably the most expensive one)

Limitations
• Limited display space, customer attention
• Limited prior knowledge of customer preferences

Challenges
1. Learn the “likeability” of products
2. Maximize the revenue or clicks

ARE THE TWO TASKS ALIGNED?
How it works
• Recommend product(s)
• Observe the customer’s response
Example recommendation category: “Dominated by strong female lead”
EXPLORE AND EXPLOIT
• Explore for more informative data
• Exploit for immediate clicks
Stuck at second best: need to explore.
RANDOMLY EXPLORE FOR EVERY POSSIBLE TYPE OF CUSTOMER?
Personalization
Millions of products
RANDOMLY EXPLORE FOR EVERY POSSIBLE TYPE OF PRODUCT?
Trends change, cold start: a short period for collecting and utilizing data
EXPLORE, BUT ONLY AS MUCH AS REQUIRED
The multi-armed bandit problem (Thompson 1933; Robbins 1952)
Multiple rigged slot machines in a casino. Which one to put money on?
• Try each one out
WHEN TO STOP TRYING (EXPLORATION) AND START PLAYING (EXPLOITATION)?
Online decisions: at every time step t = 1, …, T, pull one arm out of N arms
Bandit feedback: only the reward of the pulled arm can be observed
Stochastic feedback: for each arm i, rewards are generated i.i.d. from a fixed but unknown distribution with support [0,1] and mean μ_i
Maximize expected reward in time T: E[∑_t r_t] = ∑_i μ_i E[k_i(T)], where k_i(T) is the number of plays of arm i up to time T
Minimize expected regret in time T: the optimal arm is the one with expected reward μ∗ = max_i μ_i; the expected regret for playing arm i is Δ_i = μ∗ − μ_i; the expected regret in any time T is Regret(T) = ∑_i Δ_i E[k_i(T)]
Anytime algorithm: the time horizon T is not known
Natural and efficient heuristic:
• Maintain a belief about the effectiveness (mean reward) of each arm
• Observe feedback, update the belief of the pulled arm i in a Bayesian manner
• Pull an arm with its posterior probability of being the best arm

It does NOT simply choose the arm most likely to be effective: it gives the benefit of the doubt to arms that are less explored. “Optimal” benefit of the doubt [Agrawal and Goyal, COLT 2012, AISTATS 2013]
Bernoulli i.i.d. rewards: playing arm i produces reward 1 with unknown probability μ_i, and 0 otherwise
Maintain Beta posteriors on each μ_i. Starting prior? Use the very non-informative prior Beta(1,1)
Beta prior, Bernoulli likelihood → Beta posterior. Posterior for arm i at time t: Beta(S_i(t) + 1, F_i(t) + 1), where S_i(t) and F_i(t) count the successes and failures of arm i so far
At any time t, play every arm with its posterior probability of being the best arm

Algorithm (a sketch in Python follows below):
Start with the uniform prior Beta(1,1) for each arm i
At time t = 1, 2, …
• Posterior for arm i is Beta(S_i(t) + 1, F_i(t) + 1)
• Sample θ_i(t) from the posterior of each arm i
• Play arm i_t = arg max_i θ_i(t)
• Observe reward r_t ∈ {0, 1}, equal to 1 with the pulled arm’s success probability
• Update the success and failure counts of arm i_t
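A minimal sketch of this loop in Python (the simulated environment, the function name, `true_means`, the horizon `T`, and the seed are illustrative assumptions, not part of the original algorithm statement):

```python
import numpy as np

def thompson_sampling_bernoulli(true_means, T, seed=0):
    """Bernoulli Thompson Sampling with Beta(1,1) priors.

    true_means are the unknown success probabilities; they are used
    only to simulate customer feedback.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.zeros(n_arms)  # S_i(t)
    failures = np.zeros(n_arms)   # F_i(t)
    total_reward = 0
    for _ in range(T):
        # Sample theta_i ~ Beta(S_i + 1, F_i + 1) for every arm
        theta = rng.beta(successes + 1, failures + 1)
        arm = int(np.argmax(theta))                   # largest posterior sample wins
        reward = int(rng.random() < true_means[arm])  # simulated Bernoulli feedback
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward

# Example: three arms; the third one is best
print(thompson_sampling_bernoulli([0.3, 0.5, 0.6], T=10_000))
```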
A Bayesian algorithm for a frequentist setting!
Optimal instance-dependent bounds for Bernoulli rewards:
Regret(T) ≤ (1 + ε) ∑_i (Δ_i ln T) / d(μ_i ‖ μ∗) + O(N/ε²),
where d(μ_i ‖ μ∗) is the KL divergence between Bernoulli distributions with means μ_i and μ∗.
Matches the asymptotic lower bound for any algorithm [Lai, Robbins 1985]. The popular UCB algorithm achieves this only after careful tuning [Bayes-UCB, Kaufmann et al. 2012]
Near-optimal worst-case bounds: Regret(T) = O(√(NT ln T))
Lower bound: Ω(√(NT))
Only assumption: Bernoulli likelihood
Suppose the reward for arm i is i.i.d. N(μ_i, 1). Starting prior: N(0, 1).
Gaussian prior, Gaussian likelihood → Gaussian posterior: the posterior for arm i at time t is N(μ̂_i(t), 1/(k_i(t) + 1)),
where μ̂_i(t) is the empirical mean of the k_i(t) observations of arm i.
Algorithm:
• Sample θ_i(t) from the posterior N(μ̂_i(t), 1/(k_i(t) + 1)) for each arm i
• Play arm i_t = arg max_i θ_i(t)
• Observe the reward, update the empirical mean of arm i_t
Now apply this algorithm for any reward distribution!
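A sketch of this Gaussian version in Python, runnable against any reward distribution as just suggested (the function name and the `pull_arm` callback standing in for the real feedback loop are assumptions for illustration):

```python
import numpy as np

def thompson_sampling_gaussian(pull_arm, n_arms, T, seed=0):
    """Thompson Sampling with Gaussian posteriors N(mean_i, 1/(k_i + 1)).

    pull_arm(i) should return the observed reward of arm i.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)  # k_i(t): number of pulls of arm i
    means = np.zeros(n_arms)   # empirical mean reward of arm i
    for _ in range(T):
        # Posterior std is 1/sqrt(k_i + 1): it shrinks as an arm is pulled more
        theta = rng.normal(means, 1.0 / np.sqrt(counts + 1))
        arm = int(np.argmax(theta))
        r = pull_arm(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
    return means, counts
```

Note that nothing in the loop requires the rewards themselves to be Gaussian; only the posterior is.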
Near-optimal instance-dependent bounds: Regret(T) = O(∑_i ln T / Δ_i)
Matches the best available bounds for UCB for general reward distributions
Near-optimal worst-case bounds: Regret(T) = O(√(NT ln N))
Matches the lower bound within logarithmic factors
Only assumption: bounded or sub-Gaussian noise
Two arms with means μ_1 > μ_2 and gap Δ = μ_1 − μ_2. Every time arm 2 is pulled, Δ regret is incurred. Bounding the number of pulls of arm 2 by O(ln T / Δ²) gives the regret bound. How many pulls of arm 2 are actually needed?
After n = O(ln T / Δ²) pulls of arm 2 and arm 1:
• the empirical means are well separated: the estimation error is smaller than Δ/2 with high probability
• the Beta posteriors are well separated: the posterior standard deviation ≃ 1/√n, small compared to Δ
The two arms can be distinguished! No more pulls of arm 2.
Harder case: ln T / Δ² pulls of arm 2, but few pulls of arm 1. Arm 1’s posterior is then wide, with standard deviation large compared to Δ, so its sample exceeds arm 2’s well-concentrated sample with constant probability.
Arm 1 will therefore be played roughly once every constant number of steps in this situation. It will take at most a constant number of steps (extra pulls of arm 2) to get out of this situation. The total number of pulls of arm 2 is thus at most O(ln T / Δ²).
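A quick numerical sanity check of this intuition (a sketch: the means, horizon, seed, and function name are illustrative, and the predicted count is only an order-of-magnitude estimate):

```python
import numpy as np

def suboptimal_pulls(mu1=0.7, mu2=0.5, T=100_000, seed=0):
    """Count pulls of the suboptimal arm under Bernoulli Thompson Sampling."""
    rng = np.random.default_rng(seed)
    S = np.zeros(2)
    F = np.zeros(2)
    pulls_of_arm2 = 0
    for _ in range(T):
        theta = rng.beta(S + 1, F + 1)
        arm = int(np.argmax(theta))
        r = int(rng.random() < (mu1, mu2)[arm])
        S[arm] += r
        F[arm] += 1 - r
        pulls_of_arm2 += arm  # index 1 is the suboptimal arm
    return pulls_of_arm2

# With Delta = 0.2 and T = 1e5, ln(T)/Delta^2 is roughly 300, so a count
# of that order (rather than of order T) is the expected outcome.
print(suboptimal_pulls())
```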
Summary: the variance of the posterior enables exploration. Optimal bounds (up to optimal constants) require a more careful use of the posterior structure.
Scalability: large numbers of products and customer types. Utilize similarity? Content-based recommendation:
• Customers and products described by their features
• Similar features mean similar preferences
• Parametric models mapping customer and product features to customer preferences
Contextual bandits:
Exploration-exploitation to learn the parametric models
N arms, with N possibly very large. A d-dimensional context (feature vector) x_{i,t} for every arm i and time t. Linear parametric model:
• Unknown parameter μ ∈ R^d; the expected reward for arm i at time t is x_{i,t} ⋅ μ
• The algorithm picks x_t ∈ {x_{1,t}, …, x_{N,t}} and observes reward r_t with E[r_t] = x_t ⋅ μ
• The optimal arm depends on the context: x_t∗ = arg max_{x_{i,t}} x_{i,t} ⋅ μ
Goal: minimize regret, Regret(T) = ∑_t (x_t∗ ⋅ μ − x_t ⋅ μ)
Least squares solution of the set of t − 1 equations x_s ⋅ μ ≈ r_s, s = 1, …, t − 1:
μ̂_t = B_t⁻¹ ∑_s x_s r_s, where B_t = I + ∑_s x_s x_s′. B_t⁻¹ is the covariance matrix of this estimator.
[A., Goyal 2013]
Starting prior on μ: N(0, I). Reward distribution given x_t, μ: N(x_t ⋅ μ, 1). The posterior on μ at time t is then N(μ̂_t, B_t⁻¹).
Algorithm (sketched below): at step t,
• Sample μ̃_t from N(μ̂_t, B_t⁻¹)
• Pull the arm with feature x_t such that x_t ⋅ μ̃_t = max_i x_{i,t} ⋅ μ̃_t
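A sketch of this linear Thompson Sampling loop in Python (the function name, the `contexts_at` and `pull` callbacks, and the variance scaling `v` are illustrative assumptions; the paper derives a specific scaling for the posterior used in its analysis):

```python
import numpy as np

def linear_thompson_sampling(contexts_at, pull, d, T, v=1.0, seed=0):
    """Thompson Sampling with a linear payoff model.

    contexts_at(t) -> array of shape (N, d): arm feature vectors at time t
    pull(x)        -> float: noisy reward for the chosen feature vector x
    """
    rng = np.random.default_rng(seed)
    B = np.eye(d)     # B_t = I + sum_s x_s x_s'
    f = np.zeros(d)   # sum_s x_s r_s
    for t in range(T):
        mu_hat = np.linalg.solve(B, f)                   # least-squares estimate
        cov = v ** 2 * np.linalg.inv(B)                  # posterior covariance
        mu_tilde = rng.multivariate_normal(mu_hat, cov)  # posterior sample
        X = contexts_at(t)
        x = X[int(np.argmax(X @ mu_tilde))]              # arm maximizing sampled reward
        r = pull(x)
        B += np.outer(x, x)                              # rank-one update of B_t
        f += r * x
    return np.linalg.solve(B, f)                         # final estimate of mu
```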
Apply this algorithm for any likelihood, keeping the starting prior N(0, I)!
With probability 1 − δ, Regret(T) = Õ(d^{3/2}√T)
Any likelihood, unknown prior; the only assumption is bounded or sub-Gaussian noise. No dependence on the number of arms.
Lower bound: Ω(d√T). For UCB, the best bound is Õ(d√T) [Dani et al. 2008, Abbasi-Yadkori et al. 2011]. The best earlier bound for a polynomial-time algorithm was Õ(d^{3/2}√T) [Dani et al. 2008].
Known likelihood: exponential families (with Jeffreys prior) [Korda et al. 2013]
Known prior (Bayesian regret): near-optimal regret bounds for any prior [Russo and Van Roy 2013, 2014], [Bubeck and Liu 2013]
Extensions for many variations of MAB: side information, delayed feedback, sleeping bandits, sparse bandits, spectral bandits
Assortment selection as a multi-armed bandit: arms are products, and limited display space allows only K products at a time. Challenge: the customer’s response to one product is influenced by the other products in the assortment, so arms are no longer independent.
Multinomial logit (MNL) choice model: the probability of choosing product i (feature vector x_i) from assortment S is
p_i(S) = exp(x_i ⋅ θ) / (1 + ∑_{j∈S} exp(x_j ⋅ θ))
The log ratio log(p_i(S) / p_0(S)), where p_0 is the no-purchase probability, is linear in the features.
1-dimensional case [A., Avadhanula, Goyal, Zeevi, EC 2016]:
p_i(S) = v_i / (1 + ∑_{j∈S} v_j)
The log ratio is constant.
Independence of irrelevant alternatives.
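For concreteness, here is a small simulator of the 1-dimensional MNL choice model above (the function name and the parameter array `v` are assumptions, used only to generate choices):

```python
import numpy as np

def mnl_choice(v, S, rng):
    """Sample one customer choice from assortment S under the MNL model:
    P(choose i) = v_i / (1 + sum_{j in S} v_j); the leftover mass is no-purchase.
    Returns a product index from S, or None for no-purchase."""
    weights = np.array([v[j] for j in S], dtype=float)
    denom = 1.0 + weights.sum()
    probs = np.append(weights / denom, 1.0 / denom)  # last entry: no-purchase
    k = rng.choice(len(S) + 1, p=probs)
    return None if k == len(S) else S[k]
```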
N products, unknown parameters v_1, v_2, …, v_N. At every step t, recommend an assortment S_t of size at most K, observe the customer’s choice (or no purchase) and the resulting revenue, and update the parameter estimates. Goal: optimize the total expected revenue ∑_t E[R(S_t)], or minimize regret compared to the optimal assortment S∗ = argmax_{|S| ≤ K} ∑_{i∈S} r_i p_i(S).
[A., Avadhanula, Goyal, Zeevi, EC 2016]
Censored feedback: the feedback for a product is affected by the other products in the assortment, and the number of possible assortments is combinatorially large.
• Getting an unbiased estimate: offer an assortment repeatedly until a no-purchase occurs; the number of times product i is purchased is an unbiased estimate of its parameter v_i (see the sketch below)
Then, use standard UCB or Thompson Sampling techniques
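A sketch of one such estimation epoch, reusing the `mnl_choice` simulator above (the epoch structure follows the description on this slide; the function name is illustrative):

```python
def estimation_epoch(v, S, rng):
    """Offer assortment S repeatedly until a no-purchase occurs.
    Under the MNL model, the number of purchases of product i in one
    epoch has expectation v_i, giving an unbiased estimate of it."""
    counts = {i: 0 for i in S}
    while True:
        choice = mnl_choice(v, S, rng)
        if choice is None:  # the epoch ends at the first no-purchase
            return counts
        counts[choice] += 1
```

Averaging these per-epoch counts over many epochs yields the estimates that feed into the UCB or Thompson Sampling machinery.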
UCB: Õ(√(NT)) regret for the 1-dimensional parameter case
Assumes the no-purchase outcome is the most probable one. Parameter independent; no dependence on K.
Further regret bounds [ongoing work]: the parameter c is a lower bound on the gradient of the choice probability with respect to any product parameter.
Thompson Sampling: ongoing work, with significantly more attractive empirical results
Budget/supply constraints, nonlinear utilities [A. and Devanur, EC 2014], [A. and Devanur, SODA 2015], [A., Devanur, Li, 2016], [A. and Devanur, 2016]
Exploring when your recommendations may not be followed: incentivizing selfish users to explore