Machine Learning in the Bandit Setting
Algorithms, Evaluation, and Case Studies
Lihong Li
Machine Learning, Yahoo! Research
SEWM, 2012-05-25
[Diagram: DATA → (Statistics, ML, DM, …) → KNOWLEDGE → ACTION → UTILITY → MORE DATA; closing this loop is Reinforcement Learning]
Outline
Introduction
Basic Solutions
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Yahoo!-User Interaction
[Diagram: CONTEXT (gender, age, …) → serving-strategy POLICY → ACTION (ads, news, ranking, …) → REWARD (click, conversion, revenue, …)]
Goal: maximize total REWARD by optimizing the POLICY
Today Module @ Yahoo! Front Page
A small pool of articles chosen by editors
“Featured Article”
Objectives and Challenges
• Objectives
  • (informally) choose the most interesting articles for individual users
  • (formally) maximize click-through rate (CTR)
• Challenges
  • dynamic content pool → fast learning
  • sparse user visits → transfer interests among users
  • partial user feedback → efficient explore/exploit
Challenge: Explore/Exploit
• Observation: only displayed articles get user click feedback
• EXPLOIT (choose good articles) and EXPLORE (choose novel articles) both draw on article CTR estimates
• How to trade off?
  … with dynamic article pools
  … while considering user interests
Insufficient Exploration Example
• Arm 1 always pays $5/round
• Arm 2 pays $100 a quarter of the time (so $25/round on average)

| Round | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Observed payoff | $5 | $5 | $0 | $0 | $0 | $5 | $5 | $5 |
| It turns out… | $100 | $100 | $100 | $0 | $0 | $5 | $5 | $5 |
Contextual Bandit Formulation
Multi-armed contextual bandit [LZ'08]. In Today Module:
• A_t: available articles at time t
• x_t: user features (age, gender, interests, …)
• a_t: the displayed article at time t
• r_{t,a_t}: 1 for click, 0 for no-click

Formally, we want to maximize ∑_{t=1}^{T} r_{t,a_t}.

At each step:
1. Observe K arms A_t and “context” x_t ∈ R^d
2. Select a_t ∈ A_t
3. Receive reward r_{t,a_t} ∈ [0, 1]
4. t ← t + 1
Another Example – Display Ads
• A_t: eligible ads in the current page view
• x_t: page/user features
• a_t: the displayed ad(s)
• r_{t,a_t}: $ amount if clicked/converted, 0 otherwise
Yet Another Example – Ranking
• A_t: possible document rankings for query q_t
• x_t: query/document features
• a_t: the displayed ranking for query q_t
• r_{t,a_t}: 1 if the session succeeds, 0 otherwise
Related Work
• Standard information retrieval and collaborative filtering
  • Also concerned with (personalized) recommendation
  • But with (almost) static users/items
    → training often done in batch/offline mode
    → no need for online exploration
• Full reinforcement learning
  • General: includes bandit problems as special cases
  • Needs to tackle “temporal credit assignment”
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Prior Bandit Algorithms
• Regret minimization (focus of this talk): Herbert Robbins, Tze Leung Lai
• Bayesian optimal solution: John Gittins
Traditional K-armed Bandits
Assumption: CTR (click-through rate) not affected by user features
CTR estimates: μ̂_a = #clicks / #impressions, so CTR_a ≈ μ̂_a for each arm a

“ε-greedy”:
  with prob 1 − ε: choose article argmax_a μ̂_a
  with prob ε: choose a random article

“UCB1”:
  choose article argmax_a { μ̂_a + α/√N_a }
  (the more “a” has been displayed, the less uncertainty in CTR_a)

No contexts → no personalization
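As a concrete sketch (not from the talk), the two selection rules above fit in a few lines of Python; the value of `alpha`, the CTR bookkeeping, and the toy simulation are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(clicks, impressions, eps=0.1):
    """With prob 1-eps exploit the best CTR estimate; with prob eps explore."""
    if rng.random() < eps:
        return int(rng.integers(len(clicks)))      # explore: random article
    ctr = clicks / np.maximum(impressions, 1)      # mu_a = #clicks / #impressions
    return int(np.argmax(ctr))

def ucb1(clicks, impressions, alpha=1.0):
    """Choose argmax_a { mu_a + alpha / sqrt(N_a) }: the bonus shrinks as
    arm a is displayed more often, i.e., as uncertainty in CTR_a decreases."""
    n = np.maximum(impressions, 1)
    return int(np.argmax(clicks / n + alpha / np.sqrt(n)))

# Toy run: arm 1 (true CTR 0.25) should end up dominating arm 0 (true CTR 0.05).
true_ctr = np.array([0.05, 0.25])
clicks = np.zeros(2)
impressions = np.zeros(2)
for _ in range(5000):
    a = ucb1(clicks, impressions)
    impressions[a] += 1
    clicks[a] += float(rng.random() < true_ctr[a])
```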
Contextual Bandit Algorithms
• EXP4 [ACFS'02], EXP4.P [BLLRS'11], elimination [ADKLS'12]
  • Strong theoretical guarantees
  • But computationally expensive
• Epoch-greedy [LZ'08]
  • Similar to ε-greedy
  • Simple, general, and less expensive
  • But not the most effective
• This talk: algorithms with compact, parametric models
  • Both efficient and effective
  • Extension of UCB1 to linear models
  • … and to generalized linear models
  • A randomized algorithm with Thompson sampling
LinUCB: UCB for Linear Models
• Linear model assumption: E[r_a | x] = x^T θ_a
• Standard least-squares ridge regression:
  θ̂_a = (D_a^T D_a + I)^{−1} D_a^T c_a = A_a^{−1} D_a^T c_a,
  where the rows of D_a are the past contexts x_1^T, x_2^T, … shown with article a, c_a stacks the corresponding rewards r_1, r_2, …, and A_a := D_a^T D_a + I
• Reward prediction for a new user x: x^T θ̂_a ≈ x^T θ_a
• Whether to explore requires quantifying parameter uncertainty; with high probability the prediction error satisfies
  |x^T θ̂_a − x^T θ_a| ≤ α · √(x^T A_a^{−1} x),
  where √(x^T A_a^{−1} x) measures how “dissimilar” x is to previous users.
LinUCB: UCB for Linear Models (II)
With high probability: |x^T θ̂_a − x^T θ_a| ≤ α · √(x^T A_a^{−1} x)
LinUCB always selects an arm with the highest UCB:
  a* = argmax_a { x^T θ̂_a + α · √(x^T A_a^{−1} x) }
  (the first term exploits; the second explores)
Recall UCB1: a* = argmax_a { μ̂_a + α/√N_a }
LinRel [Auer 2002] works similarly but in a more complicated way.
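A minimal LinUCB sketch (an illustration under the slide's ridge-regression update, not the production system; `alpha` and the class/function names are my own):

```python
import numpy as np

class LinUCBArm:
    """Per-arm state for LinUCB: a ridge-regression model plus the
    confidence ellipsoid used for the exploration bonus."""
    def __init__(self, d, alpha=1.0):
        self.A = np.eye(d)          # A_a = D_a^T D_a + I
        self.b = np.zeros(d)        # D_a^T c_a
        self.alpha = alpha

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b                 # ridge estimate of theta_a
        return x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x

def linucb_select(arms, x):
    """Play the arm with the highest upper confidence bound."""
    return int(np.argmax([arm.ucb(x) for arm in arms]))
```

For d in the tens (as in the Today Module experiments), inverting `A` at every step is cheap; incremental inverse updates (Sherman-Morrison) would avoid even that.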
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Evaluation of Bandit Algorithms
Goal: estimate the average reward of running π with i.i.d. contexts x
• Static π:  V(π) := E_x[ r(x, π(x)) ]
• Adaptive π:  V(π, T) := (1/T) · E[ ∑_{t=1}^{T} r(x_t, π(x_t, h_t)) ]
Gold standard:
• Run π in the real system and see how well it works
• … but expensive and risky
Offline Evaluation
• Benefits
  • Cheap and risk-free!
  • Avoids frequent bucket tests
  • Replicable / fair comparisons
• Common in non-interactive learning problems (e.g., classification)
  • Benchmark data organized as (input, label) pairs
• … but not straightforward for interactive learning problems
  • Data in bandits usually consist of (context, arm, reward) triples
  • No reward signal for any other arm′ ≠ arm
Common/Prior Evaluation Approaches
Data: {(x_1, a_1, r_1), …, (x_L, a_L, r_L)}
  → (classification / regression / density estimation) →
Reward simulator: r̂(x, a) ≈ E[r | x, a] → bandit algorithm π
• This (difficult) modeling step is often biased → unreliable evaluation
In contrast, our approach
• avoids explicit user modeling → simple
• gives unbiased evaluation results → reliable
Our Evaluation Method: “Replay”
Want to estimate V(π) := E_x[ r(x, π(x)) ]
Data: {(x_1, a_1, r_1), …, (x_L, a_L, r_L)}
Key requirement for data collection: a_i ~ unif(A)
For i = 1, 2, …, L:
  reveal x_i
  the bandit algorithm chooses â_i = π(x_i)
  reveal r_i only if â_i = a_i (a “match”)
Finally, output V̂ = (K/L) · ∑_{i=1}^{L} r_i · I(â_i = a_i)
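The replay loop above is nearly a one-liner; this sketch (names are mine, not from the talk) assumes the log was collected with uniformly random arms:

```python
def replay_estimate(logged, policy, K):
    """Replay evaluation: logged is a list of (x, a, r) triples with the arm a
    drawn uniformly from K arms. Only 'matched' events, where policy(x) == a,
    contribute, and V_hat = (K / L) * (sum of matched rewards)."""
    L = len(logged)
    matched_reward = sum(r for (x, a, r) in logged if policy(x) == a)
    return K * matched_reward / L
```

For example, on a uniformly-logged toy log where arm a pays 1 exactly when a equals the context, the identity policy evaluates to 1.0 and the opposite policy to 0.0.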
Theoretical Guarantees
Thm 1 (unbiasedness): V(π) = E[V̂]
  So V̂ on average reflects real, online performance.
Thm 2 (error bound): |V(π) − V̂| = O(√(K/L))
  So the estimation error → 0 with more data: accuracy is guaranteed with a large volume of data.
Case Study in Today Module [LCLW'11]
Data:
› Large volume of real user traffic in Today Module
Policies being evaluated:
› EMP [ACE'09]
› SEMP/CEMP: personalized EMP variants
› Use the policies' online bucket CTR as “truth”
Random bucket data for evaluation:
› 40M visits, K ≈ 20 on average
› Use it to offline-evaluate the policies' CTR
Are they close?
Unbiasedness (Article nCTR)
[Scatter plot: estimated nCTR vs. recorded online nCTR per article]
The offline estimate is indeed unbiased!
Unbiasedness (Daily nCTR)
[Plot: estimated vs. recorded online nCTR over ten days in November 2009]
The offline estimate is indeed unbiased!
Estimation Error
[Plot: nCTR estimation error vs. number of data L, decaying like 1/√L]
Recall our theoretical error bound:
Thm 2 (error bound): |V(π) − V̂| = O(√(K/L))
Unbiased Offline Evaluation: Recap
What we have shown:
› a principled method for benchmark data collection
› which allows reliable/unbiased evaluation
› of any bandit algorithm
Analogue: UCI, Caltech101, … datasets for supervised learning
The first such benchmark was released by Yahoo!:
http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
2nd and 3rd versions available for the PASCAL2 Challenge
› ICML 2012 workshop
Outline
Introduction
Basic Solutions
› Algorithms
› Evaluation
› Experiments
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Experiment Setup: Architecture
• Traffic split: 5% “Learning Bucket” (where E/E happens) and 95% “Deployment Bucket” (exploitation only)
• Model updated every 5 minutes
• Main metric: overall normalized CTR in the deployment bucket
  • nCTR = CTR × secretNumber (to protect sensitive business information)
Experiment Setup: Data
• May 1, 2009 data for parameter tuning
• May 3-9, 2009 data for performance evaluation (33M visits)
• About 20 candidate articles per user visit
• Dimension reduction on user features [CBP+'09]: 6 features
• Model: E[r_{t,a} | x_{t,a}] = x_{t,a}^T θ_a
• Data available from Yahoo! Research's Webscope program:
  http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
CTR in Deployment Bucket [LCLS'10]
[Bar chart of deployment-bucket CTR, including a “cheating” policy and a no-feature baseline]
• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered
Article CTR Lift
[Scatter plot of per-article CTR lift: no-context vs. linear-model; “+” marks ε-greedy, “o” marks UCB]
Outline
Introduction
Basic Solutions
Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory
Advanced Offline Evaluation
Conclusions
LinUCB for Hybrid Linear Models
Previous assumption: E[r_a | x] = x^T θ_a
New assumption: E[r_a | x] = x^T θ_a + z_a^T β
  • θ_a captures article-specific information (e.g., Californian males like this article)
  • β captures information shared by all articles (e.g., teens like articles about Harry Potter)
Advantage: learns faster when there are few data
Challenge: seems to require unbounded computational complexity
Good news! An efficient implementation is possible via block matrix manipulations
Overall CTR in Deployment Bucket
[Bar chart; annotated: advantage of the hybrid model]
• UCB-type algorithms do better than their ε-greedy counterparts
• CTR improved significantly when features/contexts are considered
• The hybrid model is better when data are scarce
Outline
Introduction
Basic Solutions
Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory
Advanced Offline Evaluation
Conclusions
Extensions to GLMs
Linear models are unnatural for binary events: E[r_a | x] = x^T θ_a
Generalized linear models (GLMs): E[r_a | x] = g^{−1}(x^T θ_a), where the “inverse link function” g^{−1}: R → [0, 1]
• Logistic regression: E[r_a | x] = 1 / (1 + exp(−x^T θ_a))  (the logistic function of x^T θ_a)
• Probit regression: E[r_a | x] = Φ(x^T θ_a)  (Φ: CDF of the standard Gaussian)
Model Fitting in GLMs
• Maintain a Gaussian posterior N(μ_a, Σ_a) over the parameter θ_a
• Apply Bayes' formula with new data (x, r):
  p(θ_a) ∝ N(θ_a; μ_a, Σ_a) · (1 + exp(−(2r − 1) · x^T θ_a))^{−1}
  (current posterior × likelihood)
• A Laplace approximation gives the new posterior N(μ_a′, Σ_a′)
UCB Heuristics for GLMs
€
E ra | xa[ ] ≤
xaT μa + α xa
TΣaxa linear
1+ α exp xaTΣaxa −1( )
1+ exp −xaT μa( )
logistic
Φ xaT μa + α xa
TΣaxa( ) probit
⎧
⎨
⎪ ⎪ ⎪ ⎪
⎩
⎪ ⎪ ⎪ ⎪
• Use posterior N(ma, Sa) to derive (approximate) upper confidence bounds [LCLMW’12]
![Page 42: Machine Learning in the Bandit Setting Algorithms, Evaluation, and Case Studies Lihong Li Machine Learning Yahoo! Research SEWM 2012-05-25](https://reader035.vdocument.in/reader035/viewer/2022062222/56649ed15503460f94be0b00/html5/thumbnails/42.jpg)
Experiment Setup
• One week of data from June 2009 (34M user visits)
• About 20 candidate articles per user visit
• Features: 20 features by PCA on raw binary user features
• Model updated every 5 minutes
• Main metric: overall (normalized) CTR in the deployment bucket
• Traffic split: 5% “Learning Bucket” (where E/E happens) and 95% “Deployment Bucket” (exploitation only)
GLM Comparisons
[Bar charts: linear, logistic, and probit models under ε-greedy and UCB exploration]
Obs #1: active exploration is necessary
Obs #2: logistic/probit > linear
Obs #3: UCB > ε-greedy
Outline
Introduction
Basic Solutions
Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory
Advanced Offline Evaluation
Conclusions
Limitations of UCB Exploration
• Exploration can be too much
  • may explore the whole space exhaustively
  • difficult to use prior knowledge
• Exploration is deterministic
• Poor performance when rewards are delayed
• Deriving an (approximate) UCB is not always easy
Thompson Sampling (1933)
Algorithmic idea: “probability matching”
  Pr(a | x) = Pr(a is optimal for x)
• Randomized action selection (by definition)
• More robust to reward delay
• Straightforward to implement [CL'12]:
  • Maintain the parameter posterior Pr(θ_a | D)
  • Draw random models θ̃_a ~ Pr(θ_a | D)
  • Act accordingly: a(x) = argmax_a f(x, a; θ̃_a)
• Easily combined with other (non-)parametric models
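For linear models with per-arm Gaussian posteriors, the three steps above take only a few lines (a minimal sketch with illustrative names, not the deployed code):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_select(posteriors, x):
    """Thompson sampling: for each arm with Gaussian posterior N(mu_a, Sigma_a)
    over a linear model, draw theta_a ~ Pr(theta_a | D) and act greedily on
    the sampled models, i.e., argmax_a f(x, a; theta_a) = x @ theta_a."""
    scores = []
    for mu, Sigma in posteriors:
        theta = rng.multivariate_normal(mu, Sigma)  # random model draw
        scores.append(x @ theta)
    return int(np.argmax(scores))
```

Because the action is a fresh random draw each time, the selection is randomized exactly in proportion to the posterior probability that an arm is best, which is what makes the method robust to delayed reward feedback.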
Thompson Sampling
One-week data from Today Module on Yahoo!'s front page
Logistic regression with Gaussian posteriors
Obs #1: TS is uniformly competitive
Obs #2: TS is more robust to reward delay
Outline
Introduction
Basic Solutions
Advanced algorithms
› Hybrid linear models
› Generalized linear models
› Thompson sampling
› Theory
Advanced Offline Evaluation
Conclusions
Regret-based Competitive Analysis
Regret(T) = E[ ∑_{t=1}^{T} r_{t,a_t*} ] − E[ ∑_{t=1}^{T} r_{t,a_t} ]
  (the first term is the best we could do if we knew all θ_a; the second is what the algorithm achieves)
An algorithm “learns” if Regret(T) = O(T^α) with α < 1
An algorithm “learns fast” if α is small
Regret Bounds
• LinUCB [CLRS'11]: O(√(KdT)), with a matching lower bound
  • The average reward converges to the optimum at the rate O(√(Kd/T))
  • Example: K = 20, d = 50, T = 10M → √(Kd/T) = 0.01
• Generalized LinUCB: still open
  • A variant [FCGSz'11]: O(d√T)
• Thompson sampling
  • A variant [L'12]: O(K^{1/3} T^{2/3})
Outline
Introduction
Basic Solutions
Advanced algorithms
Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique
Conclusions
Extensions
Uniformly random data are sometimes a luxury…
› system/cost constraints, user-experience considerations, …
A randomized log suffices (by importance weighting):
  V(π) = E_{(x,r)~D}[ r_{π(x)} ] ≈ (1/|S|) · ∑_{(x,a,r_a)∈S} r_a · I(π(x) = a) / max{p̂(a|x), τ}
  (τ controls the bias/variance trade-off [SLLK'11])
Variance reduction with the “doubly robust” technique [DLL'11]
Better bias/variance trade-off by soft rejection sampling [DDLL'12]
Offline Evaluation with Non-Uniform Data
Key idea: importance reweighting
  V(π) = E_{(x,r)~D}[ r_{π(x)} ]
       = E_{(x,r)~D}[ ∑_a r_a · I(π(x) = a) ]
       = E_{(x,r)~D}[ ∑_a (r_a · I(π(x) = a) / p(a|x)) · p(a|x) ]
       = E_{(x,r)~D, a~p}[ r_a · I(π(x) = a) / p(a|x) ]
Can use a weighted empirical average with estimated p̂(a|x):
  V̂ = (1/|S|) · ∑_{(x,a,r_a)∈S} r_a · I(π(x) = a) / max{p̂(a|x), τ} ≈ V(π)
  (τ controls the bias/variance trade-off [SLLK'11])
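The clipped importance-weighted estimator above can be sketched as follows (function and variable names are my own):

```python
def ips_estimate(logged, policy, tau=0.05):
    """Clipped inverse-propensity (importance-weighted) estimate of V(pi)
    from a randomized log of (x, a, r, p) tuples, where p ~= p(a|x) is the
    logged propensity; tau controls the bias/variance trade-off."""
    vals = [r * float(policy(x) == a) / max(p, tau) for (x, a, r, p) in logged]
    return sum(vals) / len(vals)
```

With uniform logging over K = 2 arms (p = 0.5), this reduces to the replay estimator from the basic-solutions part of the talk.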
Results in Today Module Data [SLLK’11]
Outline
Introduction
Basic Solutions
Advanced algorithms
Advanced Offline Evaluation
› Importance weighting
› Doubly robust technique
Conclusions
Doubly Robust Estimation
Importance-weighted formula:
  V(π) = E_{(x,r)~D, a~p}[ r_a · I(π(x) = a) / p(a|x) ] ≈ (1/|S|) · ∑_{(x,a,r_a)∈S} r_a · I(π(x) = a) / max{p̂(a|x), τ}
  The estimate has high variance if p(a|x) is small.
Doubly robust technique:
  V̂_DR = (1/|S|) · ∑_{(x,a,r_a)∈S} [ (r_a − r̂_a) · I(π(x) = a) / max{p̂(a|x), τ} + r̂_a ]
  Unbiased if either r̂_a or p̂ is correct.
The DR estimate usually decreases variance [DLL'11].
Multiclass Classification
K-class classification as a K-armed bandit
Training data:
› In the usual (non-bandit) setting: D = {(x_i, c_i)}_{i=1,…,m}
› In the bandit setting: D = {(x_i, a_i, p_i, r_{i,a_i})}_{i=1,…,m}
Loss encoding: (x, c) ⇒ (x, r_1, r_2, …, r_K), where r_a = 0 if a = c and 1 otherwise
[Diagram: an m × K loss matrix with r_{ij} in the (i, j) entry; in the usual setting each row's losses are fully observed, in the bandit setting only one entry per row is observed]
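The slide's conversion from fully-labeled examples to bandit triples can be sketched as follows (the uniform logging policy and all names are my assumptions for illustration):

```python
import numpy as np

def to_bandit_data(X, y, K, seed=0):
    """Turn fully-labeled multiclass data (x_i, c_i) into bandit data
    (x_i, a_i, p_i, r_i): for each example, log one uniformly-random arm,
    its propensity 1/K, and only that arm's loss (0 if a_i == c_i, else 1)."""
    rng = np.random.default_rng(seed)
    D = []
    for x, c in zip(X, y):
        a = int(rng.integers(K))
        D.append((x, a, 1.0 / K, 0 if a == c else 1))
    return D
```

The hidden K − 1 losses per row are exactly what the offline estimators above must account for.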
Experimental Results on UCI Datasets
› Split data 50/50 for training (fully labeled) and testing (partially labeled)
› Train π on the training data, evaluate π on the test data
› Repeated 500 times
Outline
Introduction
Basic Solutions
Advanced algorithms
Advanced Offline Evaluation
Conclusions
Conclusions
• Contextual bandits as a principled formulation for
  • news article recommendation
  • Internet advertising
  • web search
  • …
• An offline evaluation method for bandit algorithms
  • unbiased
  • accurate compared to online bucket results
• Encouraging results in significant applications
  • strong performance of UCB/TS exploration
Future Work
• Offline evaluation
  • Better use of non-uniform data
  • Extension to full reinforcement learning
  • Use of prior knowledge
• Variants of bandits
  • Bandits with budgets
  • Bandits with many arms
  • Bandits with multiple objectives
  • Bandits with submodular rewards
  • Bandits with delayed reward observations
  • …
References
Offline policy evaluation:
› [LCLW] Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. WSDM, 2011.
› [SLLK] Learning from logged implicit exploration data. NIPS, 2010.
› [DLL] Doubly robust policy evaluation and learning. ICML, 2011.
› [DDLL] Sample-efficient nonstationary-policy evaluation for contextual bandits. Under review.
Bandit algorithms:
› [LCLS] A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
› [CLRS] Contextual bandits with linear payoff functions. AISTATS, 2011.
› [BLLRS] Contextual bandit algorithms with supervised learning guarantees. AISTATS, 2011.
› [CL] An empirical evaluation of Thompson sampling. NIPS, 2011.
› [LCLMW] Unbiased offline evaluation of contextual bandit algorithms with generalized linear models. JMLR W&CP, 2012.
Thank You!