Sequential Off-line Learning with Knowledge Gradients
Peter Frazier, Warren Powell, Savas Dayanik
Department of Operations Research and Financial Engineering, Princeton University
2
Overview
• Problem Formulation
• Knowledge Gradient Policy
• Theoretical Results
• Numerical Results
3
Measurement Phase
[Figure: N opportunities to do experiments across alternatives 1, 2, ..., M; each measurement returns a noisy experimental outcome (e.g. +1, +.2, −1, −.2).]
4
Implementation Phase
[Figure: after the measurement phase, one of alternatives 1, 2, ..., M is chosen for implementation and yields a reward.]
5
Taxonomy of Sampling Problems

                          Reward Structure
    Sampling        On-line        Off-line
    Sequential                        X
    Batch
6
Example 1
• One common experimental design is to spread measurements equally across the alternatives:

x^n = n (mod M) + 1

[Figure: prior quality estimates plotted by alternative.]
7
Example 1
x^n = n (mod M) + 1    (Round Robin Exploration)
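As a minimal sketch, round-robin exploration can be written as a one-line rule (the function name is our own):

```python
def round_robin(n, M):
    """Alternative (1-indexed) to measure at time n = 0, 1, ..., N-1."""
    # Spread measurements equally: x^n = (n mod M) + 1
    return n % M + 1

# With M = 3 alternatives, the first six measurements cycle 1, 2, 3, 1, 2, 3.
schedule = [round_robin(n, 3) for n in range(6)]
```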
8
Example 2
• How might we improve round-robin exploration for use with this prior?

x^n = argmax_x σ^n_x
9
Example 2
x^n = argmax_x σ^n_x    (Largest Variance Exploration)
10
Example 3
Exploitation: x^n = argmax_x μ^n_x
11
Model
• x^n is the alternative tested at time n.
• Measuring alternative x^n yields y^{n+1} = Y_{x^n} + ε^{n+1}.
• At time n, Y_x ∼ N(μ^n_x, (σ^n_x)²), independent across x.
• The error ε^{n+1} is independent N(0, (σ_ε)²).
• At time N, choose an alternative to maximize E[Y_{x^N}] = E[max_x μ^N_x].
12
State Transition
• At time n we measure alternative x^n and update our estimate of Y_{x^n} based on the measurement y^{n+1}:

μ^{n+1}_x = [(σ^n_x)^{−2} μ^n_x + (σ_ε)^{−2} y^{n+1}] / [(σ^n_x)^{−2} + (σ_ε)^{−2}]    for x = x^n
(σ^{n+1}_x)^{−2} = (σ^n_x)^{−2} + (σ_ε)^{−2}    for x = x^n

• Estimates of the other Y_x do not change:

μ^{n+1}_x = μ^n_x,  σ^{n+1}_x = σ^n_x    for x ≠ x^n
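The precision-weighted update above can be sketched as follows (the function name is our own):

```python
def update(mu, sigma, x, y, sigma_eps):
    """Return updated (mu, sigma) lists after observing y^{n+1} for alternative x.

    Only the measured alternative changes; all other estimates are copied through.
    """
    mu, sigma = list(mu), list(sigma)
    prec_prior = sigma[x] ** -2           # (sigma^n_x)^{-2}
    prec_noise = sigma_eps ** -2          # (sigma_eps)^{-2}
    prec_post = prec_prior + prec_noise   # (sigma^{n+1}_x)^{-2}
    # Posterior mean is the precision-weighted average of prior mean and observation.
    mu[x] = (prec_prior * mu[x] + prec_noise * y) / prec_post
    sigma[x] = prec_post ** -0.5
    return mu, sigma
```

For example, starting from μ = 0, σ = 1 and observing y = 2 with σ_ε = 1 gives posterior mean 1 and posterior standard deviation 1/√2.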
13
• Conditioned on time n, μ^{n+1}_x is a normal random variable with mean μ^n_x and variance (σ̃^{n+1}_x)² satisfying

(σ^{n+1}_x)² = (σ^n_x)² − (σ̃^{n+1}_x)²

where (σ^n_x)² is the uncertainty about Y_x before the measurement, (σ^{n+1}_x)² is the uncertainty about Y_x after the measurement, and (σ̃^{n+1}_x)² is the variance of the change in the best estimate of Y_x due to the measurement.
• The value of the optimal policy satisfies Bellman's equation:

V^N(μ^N, σ^N) = max_x μ^N_x
V^n(μ^n, σ^n) = max_{x^n} E[V^{n+1}(μ^{n+1}, σ^{n+1})]
14
Utility of Information
Consider our "utility of information" U(μ^n, σ^n) := max_x μ^n_x, and the random change in utility due to a measurement at time n:

ΔU^{n+1} := U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)
15
Knowledge Gradient Definition
• The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility,
x^n = argmax_x E_n[ΔU^{n+1}]
16
Knowledge Gradient
• We may compute the knowledge gradient policy via

E_n[ΔU^{n+1}] := E_n[U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)]
             = E_n[(max_x μ^{n+1}_x) − max_x μ^n_x]
             = E_n[(μ^{n+1}_{x^n} ∨ max_{x≠x^n} μ^n_x) − max_x μ^n_x],

which is the expectation of the maximum of a normal and a constant.
17
Knowledge Gradient
The computation becomes

x^n = argmax_x σ̃(σ^n_x) f(ζ^n_x)

where Φ is the normal cdf, φ is the normal pdf, and

σ̃(s) := s² / √(s² + (σ_ε)²)
ζ^n_x := −|μ^n_x − max_{x'≠x} μ^n_{x'}| / σ̃(σ^n_x)
f(z) := z Φ(z) + φ(z)
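Putting the pieces together, the knowledge gradient choice can be sketched as below (function and variable names are our own; the normal cdf is built from the standard library's `erf`):

```python
import math

def kg_policy(mu, sigma, sigma_eps):
    """Sketch of the knowledge-gradient rule argmax_x sigma~(sigma^n_x) f(zeta^n_x).

    Returns (0-indexed choice, list of KG scores).
    """
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # normal cdf
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal pdf
    f = lambda z: z * Phi(z) + phi(z)
    scores = []
    for x in range(len(mu)):
        s = sigma[x]
        sigma_tilde = s * s / math.sqrt(s * s + sigma_eps ** 2)
        if sigma_tilde == 0.0:
            scores.append(0.0)  # a perfectly known alternative has zero KG value
            continue
        best_other = max(mu[xp] for xp in range(len(mu)) if xp != x)
        zeta = -abs(mu[x] - best_other) / sigma_tilde
        scores.append(sigma_tilde * f(zeta))
    return max(range(len(mu)), key=lambda x: scores[x]), scores
```

On the deck's Numerical Example 1 (μ = 0 for all three alternatives, σ² = (4, 2, 1), σ_ε = 1) this rule picks alternative 1, with scores ≈ 0.714, 0.461, 0.282.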
18
Optimality Results
• If our measurement budget allows only one measurement (N=1), the knowledge gradient policy is optimal.
19
Optimality Results
• The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity.
• This is really a convergence result.
20
Optimality Results
• The knowledge gradient policy has sub-optimality bounded by

V^n(μ^n, σ^n) − V^{KG,n}(μ^n, σ^n) ≤ (2π)^{−1/2} (N − n − 1) max_x σ̃(σ^n_x),

where V^{KG,n} gives the value of the knowledge gradient policy and V^n the value of the optimal policy.
21
Optimality Results
• If there are exactly 2 alternatives (M=2), the knowledge gradient policy is optimal.
• In this case, the optimal policy reduces to x^n = argmax_x σ^n_x.
22
Optimality Results
• If there is no measurement noise and alternatives may be reordered so that

μ^0_1 ≥ μ^0_2 ≥ ... ≥ μ^0_M  and  σ^0_1 ≥ σ^0_2 ≥ ... ≥ σ^0_M,

then the knowledge gradient policy is optimal.
23
Numerical Experiments
• 100 randomly generated problems
• M ∼ Uniform{1, ..., 100}
• N ∼ Uniform{M, 3M, 10M}
• μ^0_x ∼ Uniform[−1, 1]
• (σ^0_x)² = 1 with probability 0.9, = 10^{−3} with probability 0.1
• σ_ε = 1
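The experimental setup above can be sketched as a problem generator (the function name is our own):

```python
import random

def random_problem(rng=random):
    """Draw one random test problem following the slide's distributions."""
    M = rng.randint(1, 100)              # M ~ Uniform{1, ..., 100}
    N = rng.choice([M, 3 * M, 10 * M])   # N ~ Uniform{M, 3M, 10M}
    mu0 = [rng.uniform(-1.0, 1.0) for _ in range(M)]       # prior means
    var0 = [1.0 if rng.random() < 0.9 else 1e-3 for _ in range(M)]  # prior variances
    sigma_eps = 1.0                      # measurement noise std. dev.
    return M, N, mu0, var0, sigma_eps
```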
24
Numerical Experiments
25
Interval Estimation
• Compare alternatives via a linear combination of mean and standard deviation:

x^n = argmax_x μ^n_x + z_{α/2} σ^n_x

• The parameter z_{α/2} controls the tradeoff between exploration and exploitation.
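Interval estimation is a one-line upper-confidence rule; a minimal sketch (the function name is our own):

```python
def ie_policy(mu, sigma, z):
    """Interval estimation: pick the alternative with the largest mu + z * sigma."""
    return max(range(len(mu)), key=lambda x: mu[x] + z * sigma[x])

# An uncertain alternative with a lower mean can win: bounds here are 4.0 vs 1.2.
choice = ie_policy([0.0, 1.0], [2.0, 0.1], 2.0)
```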
26
KG / IE Comparison
27
KG / IE Comparison
[Figure: Value of KG − Value of IE across test problems.]
28
IE and “Sticking”
μ = [2.5, 0, ..., 0],  σ = [0, 1, ..., 1]
Alternative 1 is known perfectly.
x^n = argmax_x μ^n_x + z_{α/2} σ^n_x,  with z_{α/2} = 2.5
29
IE and “Sticking”
V^{IE} = 2.5, while V = E[max(2.5, Z_1, ..., Z_M)] ↑ ∞ as M ↑ ∞: interval estimation sticks on the perfectly known alternative, so its value stays at 2.5 even as the achievable value grows without bound.
30
Thank You. Any Questions?
31
Numerical Example 1
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 1

where σ̃_x = σ²_x / √(σ²_x + (σ_ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, and ζ_x = −Δ_x / σ̃_x:

x    μ_x    σ²_x    σ̃_x     Δ_x    ζ_x      σ̃_x f(ζ_x)
1    0      4       1.789    0      0        0.714
2    0      2       1.155    0      0        0.461
3    0      1       0.707    0      0        0.282
32
Numerical Example 2
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

where σ̃_x = σ²_x / √(σ²_x + (σ_ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, and ζ_x = −Δ_x / σ̃_x:

x    μ_x    σ²_x    σ̃_x     Δ_x    ζ_x      σ̃_x f(ζ_x)
1    1      0       0        1      −∞       0
2    0      1       0.707    1      −1.414   0.025
3    −1     1       0.707    2      −2.828   0.001
33
Numerical Example 3
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

where σ̃_x = σ²_x / √(σ²_x + (σ_ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, and ζ_x = −Δ_x / σ̃_x:

x    μ_x    σ²_x    σ̃_x     Δ_x    ζ_x      σ̃_x f(ζ_x)
1    1      1       0.707    1      −1.41    0.0251
2    0      4       1.789    1      −0.56    0.3223
3    −1     1       0.707    2      −2.83    0.0005
34
Knowledge Gradient Example
35
Interval Estimation Example
x^n = argmax_x μ^n_x + 3σ^n_x
36
Boltzmann Exploration
Parameterized by a declining sequence of temperatures (T^0, ..., T^{N−1}):

P^n{x^n = x} = exp(μ^n_x / T^n) / Σ_{x'=1}^{M} exp(μ^n_{x'} / T^n)
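A minimal sketch of Boltzmann exploration (function names are our own; the max is subtracted before exponentiating purely for numerical stability, which does not change the probabilities):

```python
import math
import random

def boltzmann_probs(mu, T):
    """Probability of measuring each alternative: proportional to exp(mu_x / T)."""
    shifted = [m - max(mu) for m in mu]        # stabilize exp without changing ratios
    w = [math.exp(m / T) for m in shifted]
    Z = sum(w)
    return [wi / Z for wi in w]

def boltzmann_sample(mu, T, rng=random):
    """Draw the alternative x^n to measure at temperature T^n."""
    return rng.choices(range(len(mu)), weights=boltzmann_probs(mu, T))[0]
```

As T shrinks along the schedule, the distribution concentrates on the alternative with the largest estimated mean, moving the policy from exploration toward exploitation.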