Motivating Examples Modeling Optimal Solution and Upper Bound Conclusion
Bayesian Contextual Multi-armed Bandits
Xiaoting Zhao, joint work with Peter I. Frazier
School of Operations Research and Information Engineering, Cornell University
October 22, 2012
Outline
1 Motivating Examples
2 Modeling
3 Optimal Solution and Upper Bound
4 Conclusion
Motivation: Recommender System
Learning through exploration in arXiv
Operates at large scale:
- 750,000 eprints of scientific articles
- 1.5 billion user accesses
Learn quickly and react fast
Exploration vs. exploitation
Goal: interactively combine content and use the observed feedback to improve future content choices.
Motivation: Optimal Experimental Design
Oil drilling:
- Drill test wells to learn about the potential for oil or natural gas
- Each drilling test can cost millions of dollars
- Computational challenge: the size of the search space
Goal: choose the price, temperature, pressure, and other parameters that produce the best results for a business simulator.
Modeling: Contextual Multi-armed Bandits
At each time step n = 1, 2, ...:
- Given context Z_n ∈ R^d, drawn i.i.d. from a known distribution P_Z (query keywords, user features, social media, medical records, etc.)
- Choose an action X_n ∈ {1, ..., K} (arms/items: articles, advertisements, movies, treatments, etc.)
- Latent item features θ_{X_n}: item features, item relevance, etc.
- Receive a partial reward Y_n with noise ε_n ∈ R (CTRs, generated revenue, ratings, relevance scores, user engagement)
Linear model (powerful and analytically tractable):
Y_n = θ_{X_n} · Z_n + ε_n ∈ R
Maximize the discounted total expected reward:
sup_{π ∈ Π} E^π [ Σ_{n=0}^∞ γ^n Y_n ]
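The linear reward model above is straightforward to simulate; a minimal sketch (the dimensions, arm count, noise level, and the standard-normal context distribution are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma = 3, 4, 0.1          # context dim, number of arms, noise std (illustrative)
theta = rng.normal(size=(K, d))  # latent item features, unknown to the learner

def step(x, z):
    """Pull arm x under context z: return noisy linear reward Y = theta_x . z + eps."""
    return theta[x] @ z + rng.normal(0.0, sigma)

z = rng.normal(size=d)           # context Z_n ~ P_Z (standard normal here)
y = step(2, z)                   # observed partial reward for arm 2
```

The learner only ever sees (Z_n, X_n, Y_n); theta stays hidden inside the environment.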
Intuitive Example
Figure: Example of Z_n and θ_x in a two-dimensional preference space (axes: preference 1 and preference 2), showing the priors on θ_1, θ_2, θ_3, an arriving context Z_n, and the posterior on θ_1.
Literature Review
Non-Bayesian methods:
- The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits (Langford and Zhang, 2007)
- A Contextual-Bandit Approach to Personalized News Article Recommendation (Langford et al., 2010)
- An Optimal High Probability Algorithm for the Contextual Bandit Problem (Beygelzimer et al., 2010)
- Efficient Optimal Learning for Contextual Bandits (Dudik et al., 2010)
- PAC-Bayesian Analysis of Contextual Bandits (Seldin et al., 2011)
Bayesian approaches:
- Explore/Exploit Schemes for Web Content Optimization (Agarwal, Chen, and Elango, 2009)
- Collaborative Topic Modeling for Recommending Scientific Articles (Wang and Blei, 2011)
Modeling: Contextual Multi-armed Bandits
Prior distribution on θ_x:
θ_x ∼ N(μ_{x0}, Σ_{x0})
Observation:
Y_n = θ_{X_n} · Z_n + ε_n ∈ R, with noise ε_n ∼ N(0, σ²)
The decision X_n at time n is a function of:
- the historical observations H_{n-1} = (X_m, Y_m, Z_m : 1 ≤ m ≤ n − 1)
- the new context Z_n
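With a Gaussian prior and Gaussian noise, the posterior on θ_x after each observation stays Gaussian and has a closed-form update. A minimal sketch of that conjugate update (variable names and the worked numbers are mine, not from the talk):

```python
import numpy as np

def posterior_update(mu0, Sigma0, z, y, sigma):
    """One Bayesian update of theta_x ~ N(mu0, Sigma0) after observing
    Y = theta_x . z + eps with eps ~ N(0, sigma^2)."""
    prec0 = np.linalg.inv(Sigma0)                  # prior precision
    prec = prec0 + np.outer(z, z) / sigma**2       # posterior precision: rank-one update
    Sigma = np.linalg.inv(prec)
    mu = Sigma @ (prec0 @ mu0 + z * y / sigma**2)  # posterior mean
    return mu, Sigma

# Observing y = 2 along the first coordinate pulls mu toward it and
# shrinks the variance in that direction only.
mu, Sigma = posterior_update(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), 2.0, 1.0)
```

Only the component of θ_x along the observed context direction z is informed; the orthogonal component keeps its prior, which is exactly the structure the later basis decomposition exploits.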
Optimality Equation
Goal: find the optimal policy π* such that
V* = sup_{π ∈ Π} E^π [ Σ_{n=1}^∞ γ^n Z_n · θ_{X_n} ],
where a policy π is a sequence π = (π_n : n = 1, 2, ...) with
π_n : ({1, ..., K} × R × R^d)^{n−1} × R^d → {1, ..., K},
X_n = π_n(H_{n−1}, Z_n).
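The objective can be estimated for any candidate policy by truncated Monte Carlo rollouts. A sketch under illustrative assumptions (the oracle policy, standard-normal contexts, and all constants are mine, not the talk's heuristic):

```python
import numpy as np

def discounted_reward(policy, theta, gamma=0.9, horizon=200, n_runs=500, seed=1):
    """Monte Carlo estimate of E[sum_n gamma^n Z_n . theta_{X_n}],
    truncated at a finite horizon."""
    rng = np.random.default_rng(seed)
    K, d = theta.shape
    total = 0.0
    for _ in range(n_runs):
        for n in range(horizon):
            z = rng.normal(size=d)       # Z_n ~ P_Z (standard normal here)
            x = policy(z)
            total += gamma**n * (theta[x] @ z)
    return total / n_runs

theta = np.array([[1.0, 0.0], [0.0, 1.0]])
oracle = lambda z: int(np.argmax(theta @ z))  # knows theta: best arm per context
v = discounted_reward(oracle, theta)
```

With gamma = 0.9, truncating at horizon 200 discards only a gamma^200 tail, so the finite sum is a close stand-in for the infinite-horizon objective.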
Simple Case: Z_n = ones(d) with d = 1 (i.e., Z_n ≡ 1)
Classical multi-armed bandits:
- A row of slot machines to play
- Unknown expected payoffs
V* = sup_{π ∈ Π} E^π [ Σ_{n=1}^∞ γ^n R_n ]
Gittins Index Policy (Gittins 1974)
Assign an index ν_i(x_i) to each state x_i of each arm i:
ν_i(x_i) = sup_{τ > 0} E[ Σ_{n=0}^{τ−1} γ^n R(X(n)) | X(0) = x_i ] / E[ Σ_{n=0}^{τ−1} γ^n ]
The index policy is simple and provides the optimal tradeoff:
i* = arg max_i ν_i(x_i)
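The Gittins index can be computed numerically via its equivalent "retirement" formulation: the index is the constant retirement reward at which playing the arm and retiring are equally attractive. A sketch for a Bernoulli arm with a Beta posterior (a standard textbook setting chosen for computability; the talk's model is Gaussian, and the truncation depth and discount here are my choices):

```python
GAMMA = 0.9   # discount factor (illustrative)
DEPTH = 60    # truncation depth; gamma**60 is negligible

def gittins_bernoulli(a, b, tol=1e-4):
    """Approximate the Gittins index of a Bernoulli arm with Beta(a, b)
    posterior, by bisecting on the retirement reward lam at which
    playing on and retiring forever are equally good."""
    def value(a, b, depth, lam, memo):
        retire = lam / (1 - GAMMA)       # value of retiring forever at reward lam
        if depth == 0:
            return retire
        key = (a, b, depth)
        if key not in memo:
            p = a / (a + b)              # posterior mean payoff
            play = p + GAMMA * (p * value(a + 1, b, depth - 1, lam, memo)
                                + (1 - p) * value(a, b + 1, depth - 1, lam, memo))
            memo[key] = max(retire, play)
        return memo[key]

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        lam = (lo + hi) / 2
        if value(a, b, DEPTH, lam, {}) > lam / (1 - GAMMA) + 1e-9:
            lo = lam                     # playing still beats retiring: index > lam
        else:
            hi = lam
    return (lo + hi) / 2

idx = gittins_bernoulli(1, 1)            # index of a fresh Beta(1, 1) arm
```

Note the index of the uniform-prior arm exceeds its posterior mean 0.5: the excess is the value of the information gained by playing.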
General Case: Contextual Multi-armed Bandits
V* = sup_{π ∈ Π} E^π [ Σ_{n=1}^∞ γ^n Z_n · θ_{X_n} ]
Challenges:
- Random context Z_n: the Gittins index policy fails to generalize
- Curses of dimensionality: state space and action space
- Only partial feedback is revealed
(Figure: graph inserted from Yoshua Bengio's homepage)
Main Ideas of the Heuristic Policy
- Decompose the space into d independent sub-spaces [1]
- Inspiration from Whittle's work on restless multi-armed bandits:
  - Relax the constraint on non-operating arms to an average-activity constraint
  - Apply Lagrangian relaxation
[1] Dimensionality Reduction Techniques for Face Recognition, Sharath et al., 2011
Upper Bound: Heuristic Policy
Decompose into d sub-problems. Let φ_1, ..., φ_d be an orthonormal basis of R^d, i.e., φ_j ∈ R^d with
φ_j · φ_{j′} = 1 if j = j′, and 0 if j ≠ j′.
Write the arriving context Z_n as
Z_n = Σ_{i=1}^d Z_{ni} φ_i, with Z_{ni} = φ_i · Z_n.
Find the optimal policy for each sub-problem, a 2-dimensional state-space DP:
sup_{π′ ∈ Π′} E^{π′} [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_{X_ni} | φ_{i′} · θ_j, ∀ i′ ≠ i, ∀ X_ni ∈ {1, ..., K} ]
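The orthonormal decomposition of the context is mechanical to check numerically; a small sketch (the randomly generated basis is illustrative, any orthonormal basis works):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Orthonormal basis: the Q factor of a random matrix; columns phi[:, i] are phi_i.
phi, _ = np.linalg.qr(rng.normal(size=(d, d)))
z = rng.normal(size=d)          # arriving context Z_n
coords = phi.T @ z              # coordinates Z_{ni} = phi_i . Z_n
z_rebuilt = phi @ coords        # Z_n = sum_i Z_{ni} phi_i
```

Each sub-problem then sees only its own scalar coordinate Z_{ni}, which is what makes the per-dimension DP two-dimensional.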
Proof of the Upper Bound
sup_{π ∈ Π} E^π_P [ Σ_{n=1}^∞ γ^n Y_n ]
≤ sup_{π ∈ Π′} Σ_{i=1}^d E^π_Q [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_{X_ni} ]
≤ Σ_{i=1}^d sup_{π ∈ Π′} E^π_Q [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_{X_ni} ]
≤ Σ_{i=1}^d E [ sup_{π ∈ Π′} E^π_Q [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_x | φ_{i′} · θ_j, ∀ i′ ≠ i, ∀ x ] ].
Solving Each Sub-problem
V_i(·) = sup_{π′ ∈ Π′} E^{π′} [ Σ_{n=1}^∞ γ^n Z_{ni} φ_i · θ_{X_ni} | φ_{i′} · θ_j, ∀ i′ ≠ i, ∀ j ]
Gittins construction:
- Non-operating arms stay static
- But Z_n can evolve even for passive arms
Restless multi-armed bandits (Whittle 1988):
- Both active and passive arms can evolve
- Whittle index policy: optimal under a relaxed, average-activity constraint
Whittle Index Policy Relaxation
W*_{ij} = sup_{π′′′ ∈ Π′′′} E^{π′′′} [ Σ_{n=1}^∞ Σ_x γ^n Z_{ni} θ_x U_{nix} | φ_{i′} · θ_j, ∀ i′ ≠ i, x = 1, ..., K ]
subject to: U_{nix} ∈ {0, 1} and
E^π [ Σ_{n=1}^∞ γ^n ( Σ_x U_{nix} ) Z_{ni} / E[Z_i] ] ≤ Σ_{n=1}^∞ γ^n = 1 / (1 − γ)
Lagrangian Relaxation
Relax the constraint with a Lagrange multiplier ν:
L(ν, (μ_0, σ²_0, Z_{ni})) = Σ_x sup_π E^π [ Σ_{n=1}^∞ γ^n Z_{ni} ( θ_x − ν / E[Z_i] ) U_{nix} ] + ν / (1 − γ)
subject to: U_{nix} ∈ {0, 1}
Note: ν acts as a subsidy for resting an arm.
Weak Duality
Dynamics of the state (μ_{n+1,ij}, σ²_{n+1,ij}, Z_{n+1,i}):
(μ_{nij}, σ²_{nij}, Z_{n+1,i}) if U_{nij} = 0,
(μ_{nij} + σ̃(σ²_{nij}) S_n, (σ^{−2}_{nij} + Z²_{ni}/λ²)^{−1}, Z_{n+1,i}) if U_{nij} = 1,
where S_n ∼ N(0, 1) and σ̃(s²) = sqrt( s² − (s^{−2} + Z²_{ni}/λ²)^{−1} ).
Use a bisection algorithm to find ν*:
L*(μ_0, σ²_0, Z_{ni}) = inf_{ν ≥ 0} L(ν, (μ_0, σ²_0, Z_{ni}))
Lagrangian relaxation optimality: an upper bound for the restless multi-armed bandit,
W* ≤ L*.
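The bisection step for ν* can be sketched generically: the Lagrangian dual is convex in ν, so bisecting on the sign of its slope finds the infimum. The quadratic below is a toy stand-in for L(ν, ·), not the actual Lagrangian:

```python
def minimize_convex(f, lo=0.0, hi=100.0, tol=1e-6):
    """Bisection on the finite-difference slope of a convex one-dimensional
    function: the minimizer is where the slope changes sign."""
    h = 1e-7
    while hi - lo > tol:
        mid = (lo + hi) / 2
        slope = (f(mid + h) - f(mid - h)) / (2 * h)
        if slope > 0:
            hi = mid   # minimum lies to the left
        else:
            lo = mid
    return (lo + hi) / 2

# Toy convex stand-in for the dual, with minimizer nu* = 2 and L* = 3.
L = lambda nu: (nu - 2.0) ** 2 + 3.0
nu_star = minimize_convex(L)
```

In the actual computation, each evaluation of L(ν, ·) requires solving the decoupled per-arm problems at that subsidy level, so keeping the number of bisection iterations small matters.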
Numerical Results: Bayesian Exploration vs. Exploitation on arXiv Data
Exploration: topic model with a Bayesian multi-armed bandit analysis:
j*_explore = arg max_j E[Z_i^T θ_j] + ν(Var[Z_i^T θ_j], γ_i, σ)
Exploitation: recommend the documents with the largest posterior mean:
j*_exploit = arg max_j E[Z_i^T θ_j]
Figure: relevant items presented vs. number of items presented (0 to 50), comparing "with exploration" against "no exploration".
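The two scoring rules differ only by an uncertainty bonus. A minimal sketch, where the bonus term ν(·) is taken to be a simple multiple of the posterior standard deviation; that functional form, and all the numbers, are my assumptions, not the talk's exact ν:

```python
import numpy as np

def pick(z, mus, Sigmas, bonus=0.0):
    """Score item j by E[z . theta_j] + bonus * sqrt(Var[z . theta_j]) under
    posteriors theta_j ~ N(mus[j], Sigmas[j]); bonus=0 is pure exploitation."""
    means = np.array([z @ mu for mu in mus])
    stds = np.array([np.sqrt(z @ S @ z) for S in Sigmas])
    return int(np.argmax(means + bonus * stds))

z = np.array([1.0, 0.0])
mus = [np.array([0.5, 0.0]), np.array([0.5, 0.0])]   # equal posterior means
Sigmas = [0.1 * np.eye(2), 2.0 * np.eye(2)]          # item 1 far more uncertain
j_exploit = pick(z, mus, Sigmas)             # ties on the mean -> argmax picks item 0
j_explore = pick(z, mus, Sigmas, bonus=1.0)  # uncertainty bonus favors item 1
```

With equal posterior means, exploitation is indifferent while the exploring rule deliberately presents the less-understood item to gather feedback.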
Conclusion: Contextual Multi-armed Bandits
- Many real-life applications
- A dynamic system with incomplete, large, and noisy data
- Optimal learning toward personalized matching
Solving the generalized contextual multi-armed bandit:
- Exploration/learning vs. exploitation/performance
- Relax the problem to ease computation
- Whittle index policy to derive an upper bound
Future directions:
- Implement the exploration algorithm in arXiv
- Compare with the heuristic policy to assess its closeness to optimality
References
- Agarwal, D. and Chen, B. (2009). Regression-based latent factor models.
- Dayanik, S., Powell, W., and Yamazaki, K. (2008). Index policies for discounted bandit problems with availability constraints.
- Gittins, J.C. and Jones, D.M. (1974). A dynamic allocation index for the sequential design of experiments.
- Salakhutdinov, R. and Mnih, A. (2008). Probabilistic matrix factorization.
- Wang, C. and Blei, D. (2011). Collaborative topic modeling for recommending scientific articles.
- Whittle, P. (1980). Multi-armed bandits and the Gittins index.
- Whittle, P. (1988). Restless bandits: Activity allocation in a changing world.
Thank You!