Bandit Problems Methodology and Theory Model Combining Numerical Studies Conclusion
Treatment Allocations Based on Multi-Armed Bandit Strategies
Wei Qian and Yuhong Yang
Applied Economics and Statistics, University of Delaware
School of Statistics, University of Minnesota
Innovative Statistics and Machine Learning for Precision Medicine
September 15, 2017
1 Bandit Problems
2 Methodology and Theory
3 Model Combining
4 Numerical Studies
5 Conclusion
Standard Multi-Armed Bandit Problem
There is a wall of slot machines.
Each machine has a certain winning probability of paying $1.
Chances of winning are unknown to the game player.
At each time, one and only one machine can be played, and the immediate result is observed.
Goal: maximize the total number of wins over N times of plays.
Exploration-Exploitation Tradeoff
Exploration: pull each arm enough times to learn the true reward probabilities.
Exploitation: use the existing information and play the “best” arm.
Motivation: Ethical Clinical Studies
Slot machines: different treatments to a certain disease
Survival probability: unknown to the doctor
Goal: sequentially assign treatments to patients to maximize the survival rate
A Real Example: ECMO Trial
ECMO for treating newborns with persistent pulmonary hypertension?
Ethical dilemma of using a conventional randomized controlled trial
– current patients versus future patients
– two hats on a participating doctor
A solution is response-adaptive design. L. J. Wei’s randomized version of the play-the-winner rule was used in a study.
The ECMO trial has generated a lot of discussion. See, e.g., two Statistical Science papers in 1989 and 1991.
Motivation: Online Services
Web applications are generating massive data streams.
Online recommendation systems
– recommend articles to online newspaper readers.
– recommend products to customers of online retailers.
Motivation: Bandit Problem For Online Services
Slot machines: multiple articles
Each internet visit: one and only one article delivered
Clicking probability: unknown to the internet company
Goal: sequentially choose an article for internet users to maximize the total number of clicks or the click-through rate (CTR)
Bandit Problem With Covariates
Standard bandit problem assumes constant winning probabilities.
In practice, winning probability can be dependent on covariates.
Personalized medical service
Treatment effects (e.g., survival probability) can be associated with patients’ prognostic factors.
Personalized Web Service
Personalized online advertising, article recommendation
An internet user’s interest in an ad or an article can be associated with some user information.
Multi-Armed Bandit with Covariate (MABC) for Precision Medicine
An example scenario:
A few FDA approved drugs are available on the market for treating acertain disease
Currently, doctors may choose among the available drugs based on limited information and a reading of scattered publications, if any
Why not use the MABC framework for better medical practice?
Two-Armed Bandit Problem with Covariates
Two treatments (news articles): A and B
Patient (user) covariate x ∈ [0, 1]
Recovering (clicking) probability: fA(x), fB(x)
[Figure: example clicking probabilities fA(x) and fB(x) plotted against x ∈ [0, 1].]
Problem Setup: Two-Armed Bandit with Covariates
Problem Setup:
Given a bandit problem with two arms: treatments A and B
Unknown recovering probabilities given covariate x ∈ [0, 1]^d: fA(x), fB(x)
Covariates Xn, i.i.d. from continuous distribution PX
At each time n,
1 observe patient covariate Xn ∼ PX ;
2 Based on previous observations and Xn, apply a sequential allocation algorithm to choose the treatment In ∈ {A, B};
3 observe the result YIn,n ∼ Bernoulli(fIn(Xn)). Recover: YIn,n = 1; otherwise: YIn,n = 0.
Question: how to design the sequential allocation algorithm?
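The setup above can be sketched as a tiny simulation loop. This is a minimal sketch: the uniform covariate distribution and the dictionary of arm functions are illustrative assumptions, not part of the talk.

```python
import random

def one_round(f, allocate, rng=random):
    """One round of the two-armed bandit with covariates:
    draw X_n ~ P_X (Uniform[0, 1] here, an illustrative choice),
    choose an arm I_n in {"A", "B"}, observe Y ~ Bernoulli(f_{I_n}(X_n))."""
    x = rng.random()                          # X_n ~ P_X
    arm = allocate(x)                         # sequential allocation rule
    y = 1 if rng.random() < f[arm](x) else 0  # Bernoulli reward
    return x, arm, y

# Example: arm A always recovers, and the rule always picks A.
f = {"A": lambda x: 1.0, "B": lambda x: 0.0}
x, arm, y = one_round(f, allocate=lambda x: "A")
```

Any allocation algorithm discussed later plugs in as the `allocate` callable.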
A Measure of Performance: Regret
Given patient covariate x,
– “optimal” strategy: give the treatment I*(x) := argmax_{i ∈ {A,B}} fi(x)
– “optimal” recovering probability: f*(x) := max_{i ∈ {A,B}} fi(x)
Suppose at time n, the patient covariate Xn is observed.
– “optimal” choice: I*(Xn)
– the algorithm chooses treatment In.
regret_n = f*(Xn) − fIn(Xn).
To measure the overall performance, consider the cumulative regret
$$R_N := \sum_{n=1}^{N} \left( f^*(X_n) - f_{I_n}(X_n) \right)$$
An algorithm is strongly consistent if R_N = o(N) almost surely.
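As a concrete check of the definition, the cumulative regret of a fixed allocation sequence can be computed directly. The two linear curves below are hypothetical stand-ins for fA and fB, not the talk's examples.

```python
def f_A(x):  # hypothetical recovering probability for arm A
    return 0.3 + 0.4 * x

def f_B(x):  # hypothetical recovering probability for arm B
    return 0.6 - 0.2 * x

def cumulative_regret(xs, choices):
    """R_N = sum over n of f*(X_n) - f_{I_n}(X_n), with f* = max(f_A, f_B)."""
    f = {"A": f_A, "B": f_B}
    return sum(max(f_A(x), f_B(x)) - f[i](x) for x, i in zip(xs, choices))

# Always playing A is suboptimal only where f_B > f_A (x < 0.5 here),
# so only the first covariate contributes regret.
r = cumulative_regret([0.0, 0.5, 1.0], ["A", "A", "A"])
```

Here `r` is about 0.3: the single round at x = 0 pays f* − fA = 0.6 − 0.3.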
Model Assumptions of fA and fB
Parametric framework
– Woodroofe, 1979; Auer, 2002; Li et al., 2010; Goldenshluger and Zeevi, 2009, 2013; Bastani and Bayati, 2016
– Linear models
Nonparametric framework
– Yang and Zhu, 2002; Rigollet and Zeevi, 2010; Perchet and Rigollet, 2013
Algorithms
Two articles A and B with clicking probabilities fA(x) and fB(x)
1 Deliver each article an equal number of times (e.g., each is delivered n0 = 20 times): I1 = A, I2 = B, . . . , I2n0−1 = A, I2n0 = B.
2 For the next internet visit (n = 2n0 + 1), observe the internet user covariate Xn.
3 Estimate fA and fB using previous data to obtain fA,n and fB,n.
4 Find the more promising option: in = argmax_{i ∈ {A,B}} fi,n(Xn). Deliver the article with the randomization scheme:
$$I_n = \begin{cases} i_n, & \text{with probability } 1 - \pi_n, \\ i, & \text{with probability } \pi_n, \; i \ne i_n. \end{cases}$$
Observe the result YIn,n.
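Step 4's randomization amounts to an ε-greedy style draw. A minimal sketch, assuming a dictionary of current estimates keyed by the two arm labels:

```python
import random

def choose_arm(estimates, pi_n, rng=random):
    """Play the estimated-best arm with probability 1 - pi_n,
    and the other arm with probability pi_n (two arms "A" and "B")."""
    best = max(estimates, key=estimates.get)
    other = "B" if best == "A" else "A"
    return best if rng.random() >= pi_n else other

# With pi_n = 0 the rule is purely greedy:
arm = choose_arm({"A": 0.2, "B": 0.7}, pi_n=0.0)  # -> "B"
```

Decreasing pi_n over time shifts the rule from exploration toward exploitation.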
Kernel Estimation
Given article A, at each time point n, define
JA,n = {j : Ij = A, 1 ≤ j ≤ n− 1}
Nadaraya-Watson estimator of fA(x):
$$f_{A,n}(x) = \frac{\sum_{j \in J_{A,n}} Y_{A,j}\, K\!\left(\frac{x - X_j}{h_n}\right)}{\sum_{j \in J_{A,n}} K\!\left(\frac{x - X_j}{h_n}\right)}$$
kernel function K(u) : R^d → R; bandwidth hn
Epanechnikov quadratic kernel:
$$K(u) = \frac{3}{4} \left(1 - \|u\|^2\right) I(\|u\| \le 1)$$
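The estimator above fits in a few lines for d = 1. A sketch with illustrative data and bandwidth; the fallback value used when no observation falls in the window is an assumption, not part of the talk:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov quadratic kernel K(u) = (3/4)(1 - u^2) on |u| <= 1."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def nw_estimate(x, X_arm, Y_arm, h):
    """Nadaraya-Watson estimate of f_i(x) from one arm's past pairs (X_j, Y_j)."""
    w = epanechnikov((x - np.asarray(X_arm, dtype=float)) / h)
    s = w.sum()
    # Fall back to 0.5 when no past covariate lies within the bandwidth.
    return float(w @ np.asarray(Y_arm, dtype=float) / s) if s > 0 else 0.5

# Both points near x = 0.15 have Y = 1, so the local average is 1:
est = nw_estimate(0.15, X_arm=[0.1, 0.2, 0.9], Y_arm=[1, 1, 0], h=0.15)
```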
A UCB-Type Kernel Estimator
Upper Confidence Bound (UCB) kernel estimator
$$f_{A,n}(x) = \frac{\sum_{j \in J_{A,n}} Y_{A,j}\, K\!\left(\frac{x - X_j}{h_n}\right)}{\sum_{j \in J_{A,n}} K\!\left(\frac{x - X_j}{h_n}\right)} + U_{A,n}(x)$$
A “standard error” quantity:
$$U_{A,n}(x) = c \sqrt{\frac{(\log N) \sum_{j \in J_{A,n}} K^2\!\left(\frac{x - X_j}{h_n}\right)}{\left(\sum_{j \in J_{A,n}} K\!\left(\frac{x - X_j}{h_n}\right)\right)^2}}$$
Under the uniform kernel K(u) = I(‖u‖∞ ≤ 1), with $N_{A,n}(x) = \sum_{j \in J_{A,n}} I(\|X_j - x\|_\infty \le h_n)$,
$$U_{A,n}(x) = c \sqrt{\frac{\log N}{N_{A,n}(x)}}$$
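Under the uniform kernel the UCB-type estimate reduces to a windowed average plus the bonus c·sqrt(log N / N_{A,n}(x)). A sketch for d = 1, with an illustrative constant c and an assumed "optimism" fallback for unexplored neighborhoods:

```python
import numpy as np

def ucb_uniform(x, X_arm, Y_arm, h, N, c=1.0):
    """UCB-type kernel estimate with the uniform kernel K(u) = I(|u| <= 1):
    local mean of Y plus the bonus c * sqrt(log N / N_local)."""
    in_window = np.abs(np.asarray(X_arm, dtype=float) - x) <= h
    n_local = int(in_window.sum())
    if n_local == 0:
        return float("inf")  # unexplored neighborhood: maximal optimism
    mean = float(np.asarray(Y_arm, dtype=float)[in_window].mean())
    return mean + c * float(np.sqrt(np.log(N) / n_local))

# Two points fall in the window; with N = e we have log N = 1,
# so the bonus is sqrt(1/2) on top of the local mean 0.5:
u = ucb_uniform(0.5, X_arm=[0.45, 0.55, 0.9], Y_arm=[1, 0, 1], h=0.1, N=np.e)
```

The bonus shrinks as more observations of that arm accumulate near x, which is what drives the exploration.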
Algorithm Illustration
Deliver each article 20 times. X1 = 0.93, article A
[Figure: clicking probability vs x. Time n = 1, nA = 1, nB = 0.]
Algorithm Illustration
Deliver each article 20 times. X2 = 0.88, article B
[Figure: clicking probability vs x. Time n = 2, nA = 1, nB = 1.]
Algorithm Illustration
Deliver each article 20 times.
[Figure: clicking probability vs x. Time n = 40, nA = 20, nB = 20.]
Algorithm Illustration
X41 = 0.52. Estimate fA(X41) and fB(X41) by kernel estimation.
Algorithm Illustration
Estimate fA(X41)
Algorithm Illustration
Estimate fA(X41): consider a window [X41 − h, X41 + h].
Similar information may give similar clicking probability.
Algorithm Illustration
Estimate fA(X41): consider a window [X41 − h, X41 + h].
fA(X41) = 0
Algorithm Illustration
Estimate fB(X41): consider a window [X41 − h, X41 + h].
fB(X41) = 0.7996
Algorithm Illustration
Article B looks more promising: fA(X41) < fB(X41).
With πn = 20%: P(I41 = B | H41) = 80%, P(I41 = A | H41) = 20%
Algorithm Illustration
Continue the process with decreasing hn and πn to the end.
[Figure: clicking probability vs x. Time n = 800, nA = 349, nB = 451.]
Challenges and Contributions
Partial information in bandit problem
Breakdown of i.i.d. assumptions:
Existing consistency results for kernel estimation under i.i.d. or weak dependence assumptions do not apply
Technical tools to develop new arguments
– Martingale theories
– Hoeffding-type inequalities
– “Chaining” methods
Strong consistency and finite-time analysis
Dimension reduction and model combination
Asymptotic Performance
Theorem (Qian and Yang, JMLR, 2016a)
If the fi (i ∈ {A, B}) are uniformly continuous, and hn and πn are chosen to satisfy hn → 0, πn → 0, and $n h_n^{2d} \pi_n^4 / (\log n)^3 \to \infty$, then the Nadaraya-Watson estimators are uniformly strongly consistent; that is, for each i ∈ {A, B},
$$\sup_{x \in [0,1]^d} \left| f_{i,n}(x) - f_i(x) \right| \to 0 \quad \text{a.s. as } n \to \infty.$$
Estimation uniform strong consistency implies that RN = o(N) almost surely.
Equivalently,
$$\frac{\sum_{n=1}^{N} Y_{I_n,n}}{\sum_{n=1}^{N} Y^*_n} \to 1 \quad \text{a.s. as } N \to \infty$$
Finite-Time Regret Analysis
Modulus of continuity: $\omega(h; f) = \sup_{\|x_1 - x_2\| \le h} |f(x_1) - f(x_2)|$
Hölder continuity: $\omega(h; f_i) \le \rho h^\kappa$ (0 < κ ≤ 1)
Theorem (Qian and Yang, JMLR, 2016a)
There exists nδ ≪ N such that with probability larger than 1 − 2δ,
$$R_N < C_1 n_\delta + \sum_{n=n_\delta}^{N} \left( 2 \max_{i \in \{A,B\}} \omega(h_n; f_i) + \sqrt{\frac{C_2 \log N}{n h_n^d \pi_n}} + \pi_n \right) + C_3 \sqrt{N \log \frac{1}{\delta}}.$$
Upper bound of f*(Xn) − fIn(Xn)
– Estimation bias: ω(hn; fi)
– Estimation variance: C2 log(N)/(n hn^d πn)
– Exploration price: πn
Upper bound of f*(Xn) − fIn(Xn)
– Nonparametric estimation: bias-variance tradeoff
– Bandit problem: exploration-exploitation tradeoff
Finite-Time Regret Upper Bounds
Under Hölder continuity, when using the kernel UCB-type estimator,
$$\mathbb{E} R_N < C N^{1 - \frac{1}{2 + d/\kappa}} (\log N)^c.$$
– Larger d and smaller κ give a larger power index.
– Matches the minimax rate of Perchet and Rigollet (2013) up to a logarithmic factor.
Adaptive performance (Qian and Yang, EJS, 2016b): a near-minimax rate can be achieved without knowing κ a priori (0 < c∗ ≤ κ ≤ 1).
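The dependence of the rate on d and κ is easy to tabulate; a small illustration:

```python
def regret_exponent(d, kappa):
    """Power of N in the bound E[R_N] < C * N^(1 - 1/(2 + d/kappa)) * (log N)^c."""
    return 1.0 - 1.0 / (2.0 + d / kappa)

# d = 1 with Lipschitz smoothness (kappa = 1) gives exponent 2/3;
# higher dimension or rougher f pushes the exponent toward 1 (slower rate).
e_easy = regret_exponent(1, 1.0)
e_hard = regret_exponent(5, 0.5)
```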
Model Combining
Different regression methods
– kernel estimation, histogram, K-nearest neighbors
– linear regression
Model combining: weighted average of different statistical models
AFTER (Yang, 2004): combines different forecasting procedures
Data-driven algorithm with robust performance
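A generic exponential-weighting update conveys the flavor of combining. This is a simplified stand-in, not the exact AFTER recipe (which also uses estimated variances); the learning rate eta and the candidate names are illustrative assumptions.

```python
import math

def update_weights(weights, preds, y, eta=1.0):
    """Exponentially downweight candidates by squared error on observation y,
    then renormalize so the weights sum to one."""
    new = {k: w * math.exp(-eta * (preds[k] - y) ** 2) for k, w in weights.items()}
    total = sum(new.values())
    return {k: v / total for k, v in new.items()}

def combine(weights, preds):
    """Combined prediction: weighted average of the candidate predictions."""
    return sum(weights[k] * preds[k] for k in weights)

w = {"kernel_h1": 0.5, "linear": 0.5}
preds = {"kernel_h1": 1.0, "linear": 0.0}
w = update_weights(w, preds, y=1.0)  # kernel_h1 was right, so it gains weight
```

Candidates that keep predicting well accumulate weight, so the combined forecaster tracks the best candidate without knowing it in advance.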
Model Combining – Illustration
[Figure: clicking probabilities fA(x) and fB(x) plotted against x.]
fA(x) = 0.7e^(−30(x−0.2)²) + 0.7e^(−30(x−0.8)²)
fB(x) = 0.65 − 0.3x
Time horizon N = 800, πn = 1/log² n
Model Combining
1 Nadaraya-Watson estimation (h1 and h2)
2 Linear regression
Model Combining – Adaptive Performance
Per-round regret rn = Rn/n
[Figure: per-round regret rn versus n for the combined algorithm, Nadaraya-Watson-h1, Nadaraya-Watson-h2, and linear regression.]
Yahoo! Front Page Today Module Dataset
46 million internet visit events with user responses and five user covariates over ten days.
Contains a pool of about 10 editor-picked news articles.
The raw data file is about 8 GB per day.
Algorithms are implemented efficiently in C++.
Potentially adaptable to online applications.
Evaluation Results
Algorithms evaluated by click-through rate (CTR).
– Complete random
– Naive simple average (no covariates)
– LinUCB (Chapelle and Li, 2011): Bayesian logistic regression based algorithm
– Model combining: kernel estimation (h1 = n^(−1/6), h2 = n^(−1/8), h3 = n^(−1/10)) and naive simple average
                      random   Naive   LinUCB   Combining
avg. normalized CTR    1.00    1.189    1.225      1.237
std. dev.                –     0.005    0.041      0.018
Conclusion
Precision medicine demands “online” learning for optimal treatment results
MABC provides a framework for designing effective treatment allocation rules in a way that integrates learning from experimentation with maximizing the benefits to the patients along the process
Many theoretical and practical issues need to be addressed
Some References
Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002), “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, 47, 235-256.
Lai, T. L. and Robbins, H. (1985), “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, 6, 4-22.
Perchet, V. and Rigollet, P. (2013), “The multi-armed bandit problem with covariates,” The Annals of Statistics, 41, 693-721.
Qian, W. and Yang, Y. (2016a), “Kernel estimation and model combination in a bandit problem with covariates,” Journal of Machine Learning Research, 17, 1-37.
Qian, W. and Yang, Y. (2016b), “Randomized allocation with arm elimination in a bandit problem with covariates,” Electronic Journal of Statistics, 10, 242-270.
Robbins, H. (1952), “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, 58, 527-535.
Woodroofe, M. (1979), “A one-armed bandit problem with a concomitant variable,” Journal of the American Statistical Association, 74, 799-806.
Yang, Y. (2004), “Combining forecasting procedures: some theoretical results,” Econometric Theory, 20, 176-222.
Yang, Y. and Zhu, D. (2002), “Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates,” The Annals of Statistics, 30, 100-121.
Yahoo! Academic Relations (2011), Yahoo! front page today module user click log dataset, version 1.0. (Available from http://webscope.sandbox.yahoo.com.)