Multi-Armed Bandits: Intro, examples and tricks

Dr Ilias Flaounas, Senior Data Scientist at Atlassian
Data Science Sydney meetup, 22 March 2016

TRANSCRIPT

Page 1: Multi-Armed Bandits: Intro, examples and tricks

Multi-Armed Bandits: Intro, examples and tricks

Dr Ilias Flaounas, Senior Data Scientist at Atlassian

Data Science Sydney meetup, 22 March 2016

Page 2: Multi-Armed Bandits: Intro, examples and tricks

Motivation

Increase awareness of some very useful but lesser-known techniques

Demo some current work at Atlassian

Connect it with some research from my past

Hopefully, there will be something useful for everybody — apologies for the few equations and loose notation

Page 4: Multi-Armed Bandits: Intro, examples and tricks

http://www.nancydixonblog.com/2012/05/-why-knowledge-management-didnt-save-general-motors-addressing-complex-issues-by-convening-conversat.html

Page 5: Multi-Armed Bandits: Intro, examples and tricks

Each arm's estimated value is the average of the rewards observed when pulling it:

\mu_A = (r_{A,1} + r_{A,4} + r_{A,5} + r_{A,7}) / n_A

\mu_B = r_{B,3} / n_B

\mu_C = (r_{C,2} + r_{C,6} + r_{C,8}) / n_C
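As a concrete illustration (not on the slide), this bookkeeping is just a running average per arm. A minimal Python sketch; the 0/1 reward values are hypothetical, only the pull order A, C, B, A, A, C, A, C comes from the slide:

```python
from collections import defaultdict

totals = defaultdict(float)  # sum of rewards per arm
counts = defaultdict(int)    # number of pulls per arm

# (arm, reward) pairs; pull order from the slide, reward values made up
observations = [("A", 1), ("C", 0), ("B", 1), ("A", 0),
                ("A", 1), ("C", 1), ("A", 1), ("C", 0)]

for arm, reward in observations:
    totals[arm] += reward
    counts[arm] += 1

means = {arm: totals[arm] / counts[arm] for arm in counts}
print(means)  # {'A': 0.75, 'C': 0.333..., 'B': 1.0}
```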

Page 6: Multi-Armed Bandits: Intro, examples and tricks

Many solutions…

1. ε-greedy: the best arm is selected for a proportion 1 − ε of the trials and a random arm for a proportion ε.

2. ε-greedy with a variable ε.

3. Pure exploration first, then pure exploitation.

4. …

5. Thompson sampling: draw from the estimated Beta distribution of each arm and play the arm with the largest draw.

6. Upper Confidence Bound (UCB).
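A minimal sketch of two of these policies, for Bernoulli rewards; the ε default and the uniform Beta(1, 1) priors are assumptions, not from the slides:

```python
import random

def epsilon_greedy(means, epsilon=0.1):
    """Play the empirically best arm with probability 1 - epsilon,
    otherwise explore a uniformly random arm."""
    arms = list(means)
    if random.random() < epsilon:
        return random.choice(arms)       # explore
    return max(arms, key=means.get)      # exploit

def thompson_sample(successes, failures):
    """Draw from each arm's Beta posterior and play the largest draw."""
    draws = {arm: random.betavariate(successes[arm] + 1, failures[arm] + 1)
             for arm in successes}
    return max(draws, key=draws.get)
```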

Page 15: Multi-Armed Bandits: Intro, examples and tricks

Disadvantages

• Reaching significance for non-winning arms takes longer
• Unclear stopping criteria
• Hard to order non-winning arms and assess their impact reliably

Advantages

• Reaching significance for the winning arm is faster
• The best arm can change over time
• There are no false positives in the long term

Page 16: Multi-Armed Bandits: Intro, examples and tricks

Optimizely recently introduced MABs, rebranded as “Traffic auto-allocation”

Page 17: Multi-Armed Bandits: Intro, examples and tricks

Let’s add some context

What happens if we want to assess 100 variations?

How about 1,000 or 10,000 variations?

Page 18: Multi-Armed Bandits: Intro, examples and tricks

Contextual Multi-Armed Bandits

Each arm is described by a set of experiment parameters, e.g., price, #users, product, bundles, colour of UI elements…

A -> {x_{A,1}, x_{A,2}, x_{A,3}, …}
B -> {x_{B,1}, x_{B,2}, x_{B,3}, …}
C -> {x_{C,1}, x_{C,2}, x_{C,3}, …}

The reward of each arm is a function of its features:

r_{A,t} = f(x_{A,1}, x_{A,2}, x_{A,3}, …)
r_{B,t} = f(x_{B,1}, x_{B,2}, x_{B,3}, …)
r_{C,t} = f(x_{C,1}, x_{C,2}, x_{C,3}, …)
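In code, the context is just a feature vector per arm; the parameter values below (price, number of users, a UI-colour flag) are hypothetical stand-ins for the examples on the slide:

```python
import numpy as np

# Hypothetical feature vectors: [price, #users (thousands), blue-UI flag]
contexts = {
    "A": np.array([9.99, 1.2, 1.0]),
    "B": np.array([4.99, 0.8, 0.0]),
    "C": np.array([9.99, 0.8, 1.0]),
}
# The contextual bandit assumes r_{a,t} = f(x_a) for an unknown f,
# and learns f from the (context, reward) pairs it observes.
```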

Page 19: Multi-Armed Bandits: Intro, examples and tricks

Contextual Multi-Armed Bandits

We introduce a notion of proximity or similarity between arms:

A -> {x_{A,1}, x_{A,2}, x_{A,3}, …}
B -> {x_{B,1}, x_{B,2}, x_{B,3}, …}

Page 20: Multi-Armed Bandits: Intro, examples and tricks

LinUCB

L. Li, W. Chu, J. Langford, R. E. Schapire, “A Contextual-Bandit Approach to Personalized News Article Recommendation”, WWW, 2010.

The UCB is some expectation plus some confidence level:

\mu_a(t) + \sigma_a(t)

We assume there is some unknown vector \theta^*, the same for each arm, for which:

E[r_{a,t} \mid x_{a,t}] = x_{a,t}^T \theta^*

Page 21: Multi-Armed Bandits: Intro, examples and tricks

Using least squares:

\hat{\theta}_t := C_t^{-1} X_t^T y_t

where

X_t := [x_{a(1),1}, x_{a(2),2}, \ldots, x_{a(t),t}]^T
y_t := [r_{a(1),1}, r_{a(2),2}, \ldots, r_{a(t),t}]^T
C_t := X_t^T X_t

Substituting into E[r_{a,t} \mid x_{a,t}] = x_{a,t}^T \theta^* gives the estimate:

\hat{\mu}_a(t) := x_{a,t}^T \hat{\theta}_t = x_{a,t}^T C_t^{-1} X_t^T y_t

Page 22: Multi-Armed Bandits: Intro, examples and tricks

The upper confidence bound is some expectation plus some confidence level:

\mu_a(t) + \sigma_a(t)

where

\hat{\mu}_a := x_{a,t}^T C_t^{-1} X_t^T y_t
\hat{\sigma}(t) := \sqrt{x_{a,t}^T C_t^{-1} x_{a,t}}
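Putting the pieces together, a minimal sketch of one LinUCB decision; the ridge term lam (which keeps C_t invertible) and the exploration weight alpha are assumptions, not on the slides:

```python
import numpy as np

def linucb_choose(contexts, X, y, alpha=1.0, lam=1e-3):
    """One LinUCB step: least squares for theta_hat, then pick the arm
    with the largest mu_hat + alpha * sigma_hat.

    contexts: dict arm -> feature vector x_{a,t}
    X: (t, d) matrix of past contexts; y: (t,) vector of past rewards
    """
    d = X.shape[1]
    C_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))  # regularised C_t^{-1}
    theta = C_inv @ X.T @ y                           # theta_hat
    scores = {}
    for arm, x in contexts.items():
        mu = x @ theta                                # mu_hat_a
        sigma = np.sqrt(x @ C_inv @ x)                # sigma_hat
        scores[arm] = mu + alpha * sigma              # upper confidence bound
    return max(scores, key=scores.get)
```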

Page 23: Multi-Armed Bandits: Intro, examples and tricks

L. Li, W. Chu, J. Langford, R. E. Schapire, “A Contextual-Bandit Approach to Personalized News Article Recommendation”, WWW, 2010.

Page 24: Multi-Armed Bandits: Intro, examples and tricks

Product onboarding…

Which arm would you pull?

Page 25: Multi-Armed Bandits: Intro, examples and tricks

• How can we locate the city of Bristol from tweets?

• 10K candidate locations organised in a 100x100 grid

• At every step we get tweets from one location and count mentions of “Bristol”

• Challenge: find the target in sub-linear time complexity!

Page 26: Multi-Armed Bandits: Intro, examples and tricks

Linear methods fail on this problem.

How can we go non-linear?

Page 27: Multi-Armed Bandits: Intro, examples and tricks

John Shawe-Taylor & Nello Cristianini, “Kernel Methods for Pattern Analysis”, Cambridge University Press, 2004.

The kernel trick! (No, it’s not just for SVMs.)

Page 28: Multi-Armed Bandits: Intro, examples and tricks

LinUCB:

\hat{\mu}_a(t) := x_{a,t}^T \hat{\theta}_t
\hat{\sigma}(t) := \sqrt{x_{a,t}^T C_t^{-1} x_{a,t}}
C_t := X_t^T X_t

KernelUCB:

\hat{\mu}_a(t) = k_{x,t}^T K_t^{-1} y_t
\hat{\sigma}_a(t) = \sqrt{k_{x,t}^T K_t^{-2} k_{x,t}}
K_t = X_t X_t^T

M. Valko, N. Korda, R. Munos, I. Flaounas, N. Cristianini, “Finite-Time Analysis of Kernelised Contextual Bandits”, UAI, 2013.
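A minimal sketch of the KernelUCB score under the same conventions, assuming an RBF kernel and a small regulariser lam on K_t (the slide uses K_t^{-1} and K_t^{-2} directly); the exploration weight eta is also an assumption here:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernelucb_score(x, X_hist, y_hist, eta=1.0, lam=1e-3, gamma=1.0):
    """UCB score of a candidate context x given past contexts and rewards."""
    K = np.array([[rbf(u, v, gamma) for v in X_hist] for u in X_hist])
    K_inv = np.linalg.inv(K + lam * np.eye(len(X_hist)))  # regularised K_t^{-1}
    k_x = np.array([rbf(x, v, gamma) for v in X_hist])    # k_{x,t}
    mu = k_x @ K_inv @ y_hist                             # mu_hat_a(t)
    sigma = np.sqrt(k_x @ K_inv @ K_inv @ k_x)            # sqrt(k^T K^-2 k)
    return mu + eta * sigma
```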

Page 29: Multi-Armed Bandits: Intro, examples and tricks

• The last few steps of the algorithm before it locates Bristol.

• KernelUCB with RBF kernel converges after ~300 iterations (instead of >>10K).

Page 30: Multi-Armed Bandits: Intro, examples and tricks

Target is the red dot. We locate it using KernelUCB with RBF kernel.

KernelUCB code: http://www.complacs.org/pmwiki.php/CompLACS/KernelUCB

Page 31: Multi-Armed Bandits: Intro, examples and tricks

What if we have a high-dimensional space?

Hashing trick

Implementation in Vowpal Wabbit, by J. Langford, et al.
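For intuition, a minimal sketch of feature hashing in the spirit of Vowpal Wabbit; the dimension D and the md5 choice are arbitrary assumptions (VW uses its own, faster hash function):

```python
import hashlib
import numpy as np

def hash_features(features, D=2**10):
    """Map a {name: value} dict into a fixed D-dimensional vector,
    so the model size stays constant however many names appear."""
    v = np.zeros(D)
    for name, value in features.items():
        idx = int(hashlib.md5(name.encode()).hexdigest(), 16) % D
        v[idx] += value   # hash collisions are tolerated by design
    return v

x = hash_features({"price=9.99": 1.0, "colour=blue": 1.0, "users": 1.2})
```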

Page 33: Multi-Armed Bandits: Intro, examples and tricks

References

M. Valko, N. Korda, R. Munos, I. Flaounas, N. Cristianini, “Finite-Time Analysis of Kernelised Contextual Bandits”, UAI, 2013.

L. Li, W. Chu, J. Langford, R. E. Schapire, “A Contextual-Bandit Approach to Personalized News Article Recommendation”, WWW, 2010.

J. Shawe-Taylor & N. Cristianini, “Kernel Methods for Pattern Analysis”, Cambridge University Press, 2004.

Implementation of KernelUCB in the CompLACS toolkit: http://www.complacs.org/pmwiki.php/CompLACS/KernelUCB

https://en.wikipedia.org/wiki/Multi-armed_bandit

https://github.com/JohnLangford/vowpal_wabbit/wiki/Contextual-Bandit-Example

Page 34: Multi-Armed Bandits: Intro, examples and tricks

Thank you - We are hiring!

Dr Ilias Flaounas, Senior Data Scientist, <first>.<last>@atlassian.com