
Page 1

Multi-armed Bandits on the Web: Successes, Lessons and Challenges

Lihong Li, Microsoft Research

08/24/2014, 2nd Workshop on User Engagement Optimization (KDD'14)

Page 2

Big Trap: Correlation ≠ Causation

[Diagram: Big Data → (Statistics, ML, DM, …) → Knowledge → Action → Utility → Bigger Data. Knowledge examples: ads' click rates, content adoption, spam patterns, supply forecasting, … The data-to-knowledge step is labeled "correlation"; the knowledge-to-action step is labeled "causation".]

Page 3

Somewhat Toy-ish Example

[Screenshot: bing.com]

Page 4

WWII Example

• Statistics collected during WWII: bullet holes on bomber planes that came back from missions
• Decision making: where to add armor?
• Abraham Wald: the opposite! Returning planes show where hits are survivable; planes hit in the unmarked spots never made it back.

Page 5

Reinforcement Learning

[The diagram from Page 2 again: Big Data → (Statistics, ML, DM, …) → Knowledge → Action → Utility → Bigger Data, correlation vs. causation, now with "Reinforcement Learning" labeling the knowledge-to-action step.]

Page 6

Machine Learning for Decision Making

[Diagram: the learner interacts with the World, receiving an Observation, taking an Action, and obtaining a Reward.]

Do actions affect the world's state?
• NO → Multi-armed Bandits
• YES → Reinforcement Learning

Page 7

Outline

• Multi-armed Bandit Algorithms
• Offline Evaluation
• Concluding Remarks

Page 8

Contextual Bandit [Barto & co'85, Langford & co'08]

For $t = 1, 2, 3, \ldots$
• Observe arms $A_t$ and "context" $x_t \in \mathbb{R}^d$
• Select $a_t \in A_t$
• Receive reward $r_{t,a_t} \in [0,1]$

Formally, want to maximize $\sum_t r_{t,a_t}$.

Generalizes classic K-armed bandits (without context).

Stochastic vs. adversarial settings.
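To make the interaction protocol concrete, here is a minimal sketch in Python; the environment/policy interface (observe, reward, select, update) is a hypothetical stand-in, not something from the slides:

```python
def run_bandit(env, policy, T):
    """Generic contextual-bandit interaction loop.

    env.observe()  -> (arms A_t, context x_t)   # hypothetical interface
    env.reward(a)  -> r_{t,a} in [0, 1]
    policy.select / policy.update implement the learner.
    """
    total_reward = 0.0
    for t in range(1, T + 1):
        arms, x = env.observe()      # observe arms A_t and context x_t
        a = policy.select(arms, x)   # select a_t in A_t
        r = env.reward(a)            # receive reward r_{t,a_t}
        policy.update(x, a, r)       # learn from the bandit feedback
        total_reward += r
    return total_reward              # the sum the learner tries to maximize
```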

Page 9

Motivating Applications

• Clinical trials
• Resource allocation
• Queuing & scheduling
• …
• Web (more recently): recommendation, advertising, search

Page 10

Case 1: Personalized News Recommendation

[Screenshot: www.yahoo.com front page]

$A_t$: available articles at time $t$
$x_t$: user features (age, gender, interests, …)
$a_t$: the displayed article at time $t$
$r_{t,a_t}$: 1 for a click, 0 for no click

Average reward is the click-through rate (CTR).

Page 11

Standard Multi-armed Bandit [R'52, LR'85]

No contextual information is available → potentially lower rewards.

[Figure: three candidate articles, each with an estimated CTR.]

CTR estimate: $\hat{r}_a = \#\text{clicks}_a / \#\text{displays}_a$

ε-greedy: choose article $\arg\max_a \hat{r}_a$ with probability $1-\epsilon$, and a random article with probability $\epsilon$.

UCB1 (Upper Confidence Bound) [ACF'02]: choose article $\arg\max_a \big(\hat{r}_a + \sqrt{2\ln t / n_a}\big)$, an exploration bonus that decays as arm $a$'s display count $n_a$ grows.
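A minimal sketch of the two selection rules over per-arm click counters (the array-based interface and the try-each-arm-once warm start are my own conventions; the slide gives only the formulas):

```python
import numpy as np

def epsilon_greedy(clicks, displays, epsilon, rng):
    """Pick an arm: empirical-CTR argmax w.p. 1-epsilon, random w.p. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(clicks)))     # explore: random article
    ctr = clicks / np.maximum(displays, 1)        # empirical CTR estimates
    return int(np.argmax(ctr))                    # exploit: best estimate

def ucb1(clicks, displays, t):
    """Pick an arm by UCB1: empirical CTR plus a decaying bonus [ACF'02]."""
    if np.any(displays == 0):
        return int(np.argmin(displays))           # show every article once first
    ctr = clicks / displays
    bonus = np.sqrt(2.0 * np.log(t) / displays)   # shrinks as n_a grows
    return int(np.argmax(ctr + bonus))
```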

Page 12

LinUCB: UCB for Linear Models [LCLS'10]

• Linear model assumption: $E[r_{t,a} \mid x_{t,a}] = x_{t,a}^\top \theta_a$

• Standard least-squares ridge regression:
  $\hat{\theta}_a = (D_a^\top D_a + I)^{-1} D_a^\top c_a$, where $D_a$ stacks the contexts observed for arm $a$ as rows and $c_a$ stacks the corresponding rewards.

• Quantifying prediction uncertainty: with high probability,
  $|x^\top \hat{\theta}_a - x^\top \theta_a| \le \alpha \sqrt{x^\top A_a^{-1} x}$, where $A_a = D_a^\top D_a + I$. The left-hand side is the prediction error; the right-hand side measures how similar $x$ is to previously seen contexts.

Page 13

LinUCB: Optimism in the Face of Uncertainty

LinUCB chooses $a_t^* = \arg\max_a \big( x^\top \hat{\theta}_a + \alpha \sqrt{x^\top A_a^{-1} x} \big)$: the first term to exploit, the second to explore.

Recall UCB1: $\arg\max_a (\hat{r}_a + \text{bonus}_a)$.

A variant of LinUCB achieves an $\tilde{O}(\sqrt{dT})$ regret bound, with a matching lower bound [CLRS'11].

LinRel [Auer 2002] is similarly motivated but more complicated.
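A compact sketch of the disjoint (per-arm) version of LinUCB; for readability it inverts $A_a$ on every call rather than maintaining the inverse incrementally, and the class interface is my own:

```python
import numpy as np

class LinUCB:
    """Per-arm LinUCB with ridge regression, in the spirit of [LCLS'10]."""

    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_arms)]    # A_a = D_a^T D_a + I
        self.b = [np.zeros(d) for _ in range(n_arms)]  # b_a = D_a^T c_a

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # ridge estimate of theta_a
            scores.append(x @ theta + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))                  # exploit + explore

    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)                    # accumulate context stats
        self.b[a] += r * x                             # accumulate rewards
```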

Page 14

LinUCB for News Recommendation [LCLS'10]

• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered

Page 15

LinUCB Variants

• Hybrid linear models for multi-task learning [LCLS'10]: beneficial when data is sparse
• Generalized linear models [LCLMW'12]: greater flexibility in modeling rewards (linear regression, logistic regression, probit regression, …)
• Sparse linear models [Abbasi-Yadkori & co'12]: lower regret when the coefficient vectors $\{\theta_a\}$ are sparse

Page 16

Case 2: Online Advertising

Context: query, user info, …
Action: displayed ads
Reward: revenue

Page 17

Limitations of UCB in Online Advertising

• How to take advantage of prior information to avoid unnecessary exploration?
• How to handle the long delay between taking an action and observing its reward?
• How to enable complex models?

Page 18

Thompson Sampling

• Old heuristic: "probability matching" (1933)
  • $\Pr(a \mid x) = \Pr(a \text{ is optimal for } x \mid \text{prior, data})$
  • Highly effective in practice [Scott'10] [CL'11]

• Inspired lots of theoretical study in the last 2 years
  • Non-contextual bandits [Agrawal, Goyal, Kaufmann, …]
  • Linear bandits [Agrawal & Goyal'13]
  • Generalized Thompson Sampling [L'13]
  • Bayes risk analyses [Russo & Van Roy]
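For Bernoulli rewards such as clicks, probability matching has a one-line implementation: sample from each arm's posterior and play the argmax. A minimal sketch, assuming Beta(1, 1) priors (my choice, not from the slides):

```python
import numpy as np

def thompson_step(successes, failures, rng):
    """One Thompson-sampling decision for Bernoulli rewards (e.g., clicks).

    Keeps a Beta(1 + successes, 1 + failures) posterior per arm; sampling
    then taking the argmax plays each arm with exactly the posterior
    probability that it is optimal ("probability matching").
    """
    samples = rng.beta(1.0 + successes, 1.0 + failures)  # one draw per arm
    return int(np.argmax(samples))

# After displaying arm a and observing click r in {0, 1}:
#   successes[a] += r; failures[a] += 1 - r
```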

Page 19

Thompson Sampling for Advertising [CL'12]

Page 20

Model-agnostic Algorithms

• EXP4 [Auer et al'95] [BLLRS'11]
  • Optimal $\tilde{O}(\sqrt{KT \ln N})$ regret bound
  • Works even when contexts and rewards are generated adversarially
  • Computationally expensive in general

• ILOVETOCONBANDITS [AHKLLS'14]
  • Optimal $\tilde{O}(\sqrt{KT \ln N})$ regret bound
  • Computationally efficient
  • Promising empirical results

Page 21

Outline

• Multi-armed Bandit Algorithms
• Offline Evaluation
• Concluding Remarks

Page 22

Policy Evaluation

Given a policy $\pi: x \mapsto a$, want to estimate its value: $V(\pi) = E[r_{\pi(x)}]$.

Online evaluation
• Run $\pi$ on live users and average the observed rewards (as in A/B tests)
• Reliable but expensive

Offline evaluation
• Estimate $V(\pi)$ from a historical data set $D = \{(x, a, r_a)\}$
• Fast and cheap (cf. benchmark data sets for supervised learning)
• Counterfactual rewards: the log carries no information for evaluating $\pi$ on $x$ whenever $\pi(x) \ne a$

Page 23

Common Approach in Practice

Build a reward simulator $\hat{r}(x, a) \approx E[r \mid x, a]$ from logged data (via classification, regression, or density estimation), then evaluate the policy against the simulator. This (difficult) modeling step is often biased → unreliable evaluation.

In contrast, our approach
• avoids explicit user modeling → simple
• gives unbiased evaluation results → reliable

Page 24

Our Approach: Unbiased Offline Evaluation

Randomized data collection: at step $t$,
• Observe the current context $x_t$
• Randomly choose $a_t \in A_t$ according to probabilities $(p_{t,1}, \ldots, p_{t,K})$ and receive $r_t$

End result: "exploration data" $D = \{(x, a, r_a, p_a)\}$

Key properties:
• Unbiasedness: $E\Big[\frac{1}{|D|}\sum_{(x,a,r_a,p_a)\in D}\frac{r_a\cdot 1(\pi(x)=a)}{p_a}\Big] = V(\pi)$
• Estimation error $= O(1/\sqrt{|D|})$

Related to causal inference (Neyman-Rubin) and off-policy learning [Precup & co].
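The estimator itself is a few lines of code. A minimal sketch, assuming the logged propensities $p_a$ are recorded and strictly positive (inverse propensity scoring):

```python
import numpy as np

def ips_value(policy, data):
    """Unbiased offline estimate of V(policy) from exploration data.

    data: iterable of (x, a, r, p) tuples, where action a was chosen
    with probability p > 0 at logging time.
    policy(x) -> the action the evaluated policy takes on context x.
    """
    terms = [r * (policy(x) == a) / p for (x, a, r, p) in data]
    return float(np.mean(terms))  # E[estimate] = V(policy), error O(1/sqrt(|D|))
```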

Page 25

Case 1: News Recommendation [LCLW'11]

• Experiments run in 2009
• Fixed an article-selection policy $\pi$
• Ran $\pi$ on live users to measure its online click rate (the ground truth)
• Used exploration data to estimate $\pi$'s click rate offline (the offline estimate)

Are they close?

Page 26

Unbiasedness

[Scatter plot: estimated nCTR vs. recorded online nCTR, illustrating the unbiasedness of the offline estimates.]

Page 27

Estimation Error

Page 28

Case 2: Spelling Correction in Bing

Page 29

Accuracy of Offline Evaluator [LCKG'14]

[Bar chart: position-specific click-through rate at result positions 1–8, comparing online measurements with offline estimates.]

Page 30

Quantifying Uncertainty in Offline Evaluation

[Figure: offline estimates compared against production measurements.]

Page 31

Case 3: Web Search Ranking

Search as a bandit:
• Context: query
• Action: ranked list
• Reward: search success or not

Challenges:
• Exponentially many actions → large estimation variance
• Collecting enough randomized data can be too expensive

Page 32

Trading off Bias and Variance [LKZ'14]

• Use the natural exploration (uncontrolled diversity) of Bing to simulate randomized data collection
  → Nearly unbiased [SLLK'11]
  → Can use an unlimited amount of data → lower variance

• Use approximate matching of actions
  → May introduce some amount of bias
  → Can dramatically reduce variance

Page 33

Predicting the Success of a New Ranking Function

Page 34

Metric Correlation Based on Query Segments

Page 35

Advanced Offline Evaluation Techniques

• Doubly robust estimation [DLL'11]: combine a learned reward model with the propensity-weighted correction; the estimate is unbiased if either component is accurate

• Extends to evaluating learning algorithms themselves (e.g., LinUCB) [DELL'12], using adaptive importance sampling

Increasingly popular at industry leaders.
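A sketch of the doubly robust estimator's shape, in the spirit of [DLL'11]; reward_model is a hypothetical learned regressor, and the exact formulation in the paper may differ in details:

```python
import numpy as np

def doubly_robust_value(policy, data, reward_model):
    """Doubly robust off-policy value estimate.

    data: (x, a, r, p) tuples from randomized logging.
    reward_model(x, a) -> estimate of E[r | x, a]   (hypothetical helper).
    Unbiased if either the propensities or the reward model is correct.
    """
    terms = []
    for x, a, r, p in data:
        a_pi = policy(x)
        direct = reward_model(x, a_pi)              # model-based prediction
        if a_pi == a:                               # IPS correction for model error
            direct += (r - reward_model(x, a)) / p
        terms.append(direct)
    return float(np.mean(terms))
```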

Page 36

Outline

• Multi-armed Bandit Algorithms
• Offline Evaluation
• Concluding Remarks

Page 37

Conclusions

• Contextual bandits are a natural and versatile model
  • Better decision making → causality
  • Rich information enables better user understanding and decision making

• Additional challenges not seen in traditional bandits
  • Delayed rewards, decision making with constraints, …
  • Dueling bandits [Yue & co]
  • Gangs of bandits [Cesa-Bianchi & co]
  • …

• Large amounts of data make offline evaluation feasible
  • Precision of offline evaluation can be validated by experiments
  • What is the optimal estimator? [LMS'14]

• Next big step: full RL (non-myopic decision making)