Motivating Examples Modeling Optimal Solution and Upper Bound Conclusion
Bayesian Contextual Multi-armed Bandits
Xiaoting Zhao, joint work with Peter I. Frazier
School of Operations Research and Information Engineering, Cornell University
October 22, 2012
Outline
1 Motivating Examples
2 Modeling
3 Optimal Solution and Upper Bound
4 Conclusion
Motivation: Recommender System
Learning through exploration in arXiv
Operates at large scale:
- 750,000 eprints of scientific articles
- 1.5 billion user accesses
Learn quickly and react fast
Exploration vs. exploitation
Goal: interactively combine content and use the observed feedback to improve future content choices.
Motivation: Optimal Experimental Design
Oil drilling:
- Drill test wells to learn about the potential for oil or natural gas
- Each drilling test can cost millions of dollars
- Computational challenge: the size of the search space
Goal: choose the price, temperature, pressure, and other parameters that produce the best results for a business simulator.
Modeling: Contextual Multi-armed Bandits
At each time step n = 1, 2, ...:
- Given context Z_n ∈ R^d, drawn i.i.d. from a known distribution P_Z (query keywords, user features, social media, medical records, etc.)
- Choose an action X_n ∈ {1, ..., K} (arms/items: articles, advertisements, movies, treatments, etc.)
- Latent item features θ_{X_n}: item features, item relevance, etc.
- Receive a partial reward Y_n with noise ε_n ∈ R (CTRs, generated revenue, ratings, relevance scores, user engagement)
Linear model (powerful and analytically tractable):
Y_n = θ_{X_n} · Z_n + ε_n ∈ R
Maximize the discounted total expected reward:
sup_{π ∈ Π} E^π [ Σ_{n=0}^∞ γ^n Y_n ]
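The linear reward model above is straightforward to simulate; a minimal sketch (the dimensions, arm count, noise level, and the standard-normal context distribution are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma = 3, 4, 0.1          # context dim, number of arms, noise std (illustrative)
theta = rng.normal(size=(K, d))  # latent item features, unknown to the learner

def step(x, z):
    """Pull arm x under context z: return noisy linear reward Y = theta_x . z + eps."""
    return theta[x] @ z + rng.normal(0.0, sigma)

z = rng.normal(size=d)           # context Z_n ~ P_Z (standard normal here)
y = step(2, z)                   # observed partial reward for arm 2
```

The learner only ever sees (Z_n, X_n, Y_n); theta stays hidden inside the environment.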
Intuitive Example
Figure: Example of Z_n and θ_x in a two-dimensional preference space (axes: preference 1 and preference 2), showing the priors on θ_1, θ_2, θ_3, an arriving context Z_n, and the posterior on θ_1.
Literature Review
Non-Bayesian methods:
- The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits (Langford and Zhang, 2007)
- A Contextual-Bandit Approach to Personalized News Article Recommendation (Langford et al., 2010)
- An Optimal High Probability Algorithm for the Contextual Bandit Problem (Beygelzimer et al., 2010)
- Efficient Optimal Learning for Contextual Bandits (Dudik et al., 2010)
- PAC-Bayesian Analysis of Contextual Bandits (Seldin et al., 2011)
Bayesian approaches:
- Explore/Exploit Schemes for Web Content Optimization (Agarwal, Chen, and Elango, 2009)
- Collaborative Topic Modeling for Recommending Scientific Articles (Wang and Blei, 2011)
Modeling: Contextual Multi-armed Bandits
Prior distribution on θ_x:
θ_x ∼ N(μ_{x0}, Σ_{x0})
Observation:
Y_n = θ_{X_n} · Z_n + ε_n ∈ R, with noise ε_n ∼ N(0, σ²)
The decision X_n at time n is a function of:
- the historical observations H_{n-1} = (X_m, Y_m, Z_m : 1 ≤ m ≤ n − 1)
- the new context Z_n
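With a Gaussian prior and Gaussian noise, the posterior on θ_x after each observation stays Gaussian and has a closed-form update. A minimal sketch of that conjugate update (variable names and the worked numbers are mine, not from the talk):

```python
import numpy as np

def posterior_update(mu0, Sigma0, z, y, sigma):
    """One Bayesian update of theta_x ~ N(mu0, Sigma0) after observing
    Y = theta_x . z + eps with eps ~ N(0, sigma^2)."""
    prec0 = np.linalg.inv(Sigma0)                  # prior precision
    prec = prec0 + np.outer(z, z) / sigma**2       # posterior precision: rank-one update
    Sigma = np.linalg.inv(prec)
    mu = Sigma @ (prec0 @ mu0 + z * y / sigma**2)  # posterior mean
    return mu, Sigma

# Observing y = 2 along the first coordinate pulls mu toward it and
# shrinks the variance in that direction only.
mu, Sigma = posterior_update(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), 2.0, 1.0)
```

Only the component of θ_x along the observed context direction z is informed; the orthogonal component keeps its prior, which is exactly the structure the later basis decomposition exploits.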
Optimality Equation
Goal: find the optimal policy π* such that
V* = sup_{π ∈ Π} E^π [ Σ_{n=1}^∞ γ^n Z_n · θ_{X_n} ],
where a policy π is a sequence π = (π_n : n = 1, 2, ...) with
π_n : ({1, ..., K} × R × R^d)^{n−1} × R^d → {1, ..., K},
X_n = π_n(H_{n−1}, Z_n).
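The objective can be estimated for any candidate policy by truncated Monte Carlo rollouts. A sketch under illustrative assumptions (the oracle policy, standard-normal contexts, and all constants are mine, not the talk's heuristic):

```python
import numpy as np

def discounted_reward(policy, theta, gamma=0.9, horizon=200, n_runs=500, seed=1):
    """Monte Carlo estimate of E[sum_n gamma^n Z_n . theta_{X_n}],
    truncated at a finite horizon."""
    rng = np.random.default_rng(seed)
    K, d = theta.shape
    total = 0.0
    for _ in range(n_runs):
        for n in range(horizon):
            z = rng.normal(size=d)       # Z_n ~ P_Z (standard normal here)
            x = policy(z)
            total += gamma**n * (theta[x] @ z)
    return total / n_runs

theta = np.array([[1.0, 0.0], [0.0, 1.0]])
oracle = lambda z: int(np.argmax(theta @ z))  # knows theta: best arm per context
v = discounted_reward(oracle, theta)
```

With gamma = 0.9, truncating at horizon 200 discards only a gamma^200 tail, so the finite sum is a close stand-in for the infinite-horizon objective.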
Simple Case: Z_n = ones(d) with d = 1 (i.e., Z_n ≡ 1)
Classical multi-armed bandits:
- A row of slot machines to play
- Unknown expected payoffs
V* = sup_{π ∈ Π} E^π [ Σ_{n=1}^∞ γ^n R_n ]
Gittins Index Policy (Gittins 1974)
Assign an index ν_i(x_i) to each state x_i of each arm i:
ν_i(x_i) = sup_{τ > 0} E[ Σ_{n=0}^{τ−1} γ^n R(X(n)) | X(0) = x_i ] / E[ Σ_{n=0}^{τ−1} γ^n ]
The index policy is simple and provides the optimal tradeoff:
i* = arg max_i ν_i(x_i)
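The Gittins index can be computed numerically via its equivalent "retirement" formulation: the index is the constant retirement reward at which playing the arm and retiring are equally attractive. A sketch for a Bernoulli arm with a Beta posterior (a standard textbook setting chosen for computability; the talk's model is Gaussian, and the truncation depth and discount here are my choices):

```python
GAMMA = 0.9   # discount factor (illustrative)
DEPTH = 60    # truncation depth; gamma**60 is negligible

def gittins_bernoulli(a, b, tol=1e-4):
    """Approximate the Gittins index of a Bernoulli arm with Beta(a, b)
    posterior, by bisecting on the retirement reward lam at which
    playing on and retiring forever are equally good."""
    def value(a, b, depth, lam, memo):
        retire = lam / (1 - GAMMA)       # value of retiring forever at reward lam
        if depth == 0:
            return retire
        key = (a, b, depth)
        if key not in memo:
            p = a / (a + b)              # posterior mean payoff
            play = p + GAMMA * (p * value(a + 1, b, depth - 1, lam, memo)
                                + (1 - p) * value(a, b + 1, depth - 1, lam, memo))
            memo[key] = max(retire, play)
        return memo[key]

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2
        if value(a, b, DEPTH, lam, {}) > lam / (1 - GAMMA) + 1e-9:
            lo = lam                     # playing still beats retiring: index > lam
        else:
            hi = lam
    return (lo + hi) / 2

idx = gittins_bernoulli(1, 1)            # index of a fresh Beta(1, 1) arm
```

Note the index of the uniform-prior arm exceeds its posterior mean 0.5: the excess is the value of the information gained by playing.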
General Case: Contextual Multi-armed Bandits
V* = sup_{π ∈ Π} E^π [ Σ_{n=1}^∞ γ^n Z_n · θ_{X_n} ]
Challenges:
- Random context Z_n: the Gittins index policy fails to generalize
- Curses of dimensionality: state space and action space
- Only partial feedback is revealed
(Figure: graph inserted from Yoshua Bengio's homepage)
Main Ideas of the Heuristic Policy
- Decompose the space into d independent sub-spaces [1]
- Inspiration from Whittle's work on restless multi-armed bandits:
  - Relax the constraint on non-operating arms to an average-activity constraint
  - Apply Lagrangian relaxation
[1] Dimensionality Reduction Techniques for Face Recognition, Sharath et al., 2011
Upper Bound: Heuristic Policy
Decompose into d sub-problems. Let φ_1, ..., φ_d be an orthonormal basis of R^d, i.e., φ_j ∈ R^d with
φ_j · φ_{j′} = 1 if j = j′, and 0 if j ≠ j′.
Write the arriving context Z_n as
Z_n = Σ_{i=1}^d Z_{ni} φ_i, with Z_{ni} = φ_i · Z_n.
Find the optimal policy for each sub-problem, a 2-dimensional state-space DP:
sup_{π′ ∈ Π′} E^{π′} [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_{X_ni} | φ_{i′} · θ_j, ∀ i′ ≠ i, ∀ X_ni ∈ {1, ..., K} ]
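The orthonormal decomposition of the context is mechanical to check numerically; a small sketch (the randomly generated basis is illustrative, any orthonormal basis works):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Orthonormal basis: the Q factor of a random matrix; columns phi[:, i] are phi_i.
phi, _ = np.linalg.qr(rng.normal(size=(d, d)))
z = rng.normal(size=d)          # arriving context Z_n
coords = phi.T @ z              # coordinates Z_{ni} = phi_i . Z_n
z_rebuilt = phi @ coords        # Z_n = sum_i Z_{ni} phi_i
```

Each sub-problem then sees only its own scalar coordinate Z_{ni}, which is what makes the per-dimension DP two-dimensional.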
Proof of the Upper Bound
sup_{π ∈ Π} E^π_P [ Σ_{n=1}^∞ γ^n Y_n ]
≤ sup_{π ∈ Π′} Σ_{i=1}^d E^π_Q [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_{X_ni} ]
≤ Σ_{i=1}^d sup_{π ∈ Π′} E^π_Q [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_{X_ni} ]
≤ Σ_{i=1}^d E [ sup_{π ∈ Π′} E^π_Q [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_x | φ_{i′} · θ_j, ∀ i′ ≠ i, ∀ x ] ].
Solving Each Sub-problem
V_i(·) = sup_{π′ ∈ Π′} E^{π′} [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_{X_ni} | φ_{i′} · θ_j, ∀ i′ ≠ i, ∀ j ]
Gittins construction:
- Non-operating arms stay static
- But Z_n can evolve even for passive arms
Restless multi-armed bandits (Whittle 1988):
- Both active and passive arms can evolve
- Whittle index policy: optimal under a relaxed, average-activity constraint
Whittle Index Policy Relaxation
W*_{ij} = sup_{π′′′ ∈ Π′′′} E^{π′′′} [ Σ_{n=1}^∞ Σ_x γ^n Z_{ni} θ_x U_{nix} | φ_{i′} · θ_j, ∀ i′ ≠ i, x = 1, ..., K ]
subject to: U_{nix} ∈ {0, 1} and
E^π [ Σ_{n=1}^∞ γ^n ( Σ_x U_{nix} ) Z_{ni} / E[Z_i] ] ≤ Σ_{n=1}^∞ γ^n = 1 / (1 − γ)
Lagrangian Relaxation
Relax the constraint with a Lagrange multiplier ν:
L(ν, (μ_0, σ²_0, Z_{ni})) = Σ_x sup_π E^π [ Σ_{n=1}^∞ γ^n Z_{ni} ( θ_x − ν / E[Z_i] ) U_{nix} ] + ν / (1 − γ)
subject to: U_{nix} ∈ {0, 1}
Note: ν acts as a subsidy for resting an arm.
Weak Duality
Dynamics of the state (μ_{n+1,ij}, σ²_{n+1,ij}, Z_{n+1,i}):
(μ_{nij}, σ²_{nij}, Z_{n+1,i}) if U_{nij} = 0,
(μ_{nij} + σ̃(σ²_{nij}) S_n, (σ^{−2}_{nij} + Z²_{ni}/λ²)^{−1}, Z_{n+1,i}) if U_{nij} = 1,
where S_n ∼ N(0, 1) and σ̃(s²) = sqrt( s² − (s^{−2} + Z²_{ni}/λ²)^{−1} ).
Use a bisection algorithm to find ν*:
L*(μ_0, σ²_0, Z_{ni}) = inf_{ν ≥ 0} L(ν, (μ_0, σ²_0, Z_{ni}))
Lagrangian relaxation optimality: an upper bound for the restless multi-armed bandit,
W* ≤ L*.
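The bisection step for ν* can be sketched generically: the Lagrangian dual is convex in ν, so bisecting on the sign of its slope finds the infimum. The quadratic below is a toy stand-in for L(ν, ·), not the actual Lagrangian:

```python
def minimize_convex(f, lo=0.0, hi=100.0, tol=1e-6):
    """Bisection on the finite-difference slope of a convex one-dimensional
    function: the minimizer is where the slope changes sign."""
    h = 1e-7
    while hi - lo > tol:
        mid = (lo + hi) / 2
        slope = (f(mid + h) - f(mid - h)) / (2 * h)
        if slope > 0:
            hi = mid   # minimum lies to the left
        else:
            lo = mid
    return (lo + hi) / 2

# Toy convex stand-in for the dual, with minimizer nu* = 2 and L* = 3.
L = lambda nu: (nu - 2.0) ** 2 + 3.0
nu_star = minimize_convex(L)
```

In the actual computation, each evaluation of L(ν, ·) requires solving the decoupled per-arm problems at that subsidy level, so keeping the number of bisection iterations small matters.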
Numerical Results: Bayesian Exploration vs. Exploitation on arXiv Data
Exploration: topic model with a Bayesian multi-armed bandit analysis:
j*_explore = arg max_j E[Z_i^T θ_j] + ν(Var[Z_i^T θ_j], γ_i, σ)
Exploitation: recommend the documents with the largest posterior mean:
j*_exploit = arg max_j E[Z_i^T θ_j]
Figure: relevant items presented vs. number of items presented (0 to 50), comparing "with exploration" against "no exploration".
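The two scoring rules differ only by an uncertainty bonus. A minimal sketch, where the bonus term ν(·) is taken to be a simple multiple of the posterior standard deviation; that functional form, and all the numbers, are my assumptions, not the talk's exact ν:

```python
import numpy as np

def pick(z, mus, Sigmas, bonus=0.0):
    """Score item j by E[z . theta_j] + bonus * sqrt(Var[z . theta_j]) under
    posteriors theta_j ~ N(mus[j], Sigmas[j]); bonus=0 is pure exploitation."""
    means = np.array([z @ mu for mu in mus])
    stds = np.array([np.sqrt(z @ S @ z) for S in Sigmas])
    return int(np.argmax(means + bonus * stds))

z = np.array([1.0, 0.0])
mus = [np.array([0.5, 0.0]), np.array([0.5, 0.0])]   # equal posterior means
Sigmas = [0.1 * np.eye(2), 2.0 * np.eye(2)]          # item 1 far more uncertain
j_exploit = pick(z, mus, Sigmas)             # ties on the mean -> argmax picks item 0
j_explore = pick(z, mus, Sigmas, bonus=1.0)  # uncertainty bonus favors item 1
```

With equal posterior means, exploitation is indifferent while the exploring rule deliberately presents the less-understood item to gather feedback.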
Conclusion: Contextual Multi-armed Bandits
- Many real-life applications
- A dynamic system with incomplete, large, and noisy data
- Optimal learning toward personalized matching
Solving the generalized contextual multi-armed bandit:
- Exploration/learning vs. exploitation/performance
- Relax the problem to ease computation
- Whittle index policy to derive an upper bound
Future directions:
- Implement the exploration algorithm in arXiv
- Compare with the heuristic policy to assess its closeness to optimality
References
- Agarwal, D. and Chen, B. (2009). Regression-based latent factor models.
- Dayanik, S., Powell, W., and Yamazaki, K. (2008). Index policies for discounted bandit problems with availability constraints.
- Gittins, J.C. and Jones, D.M. (1974). A dynamic allocation index for the sequential design of experiments.
- Salakhutdinov, R. and Mnih, A. (2008). Probabilistic matrix factorization.
- Wang, C. and Blei, D. (2011). Collaborative topic modeling for recommending scientific articles.
- Whittle, P. (1980). Multi-armed bandits and the Gittins index.
- Whittle, P. (1988). Restless bandits: Activity allocation in a changing world.
Thank You!