Sequential Off-line Learning with Knowledge Gradients
Peter Frazier, Warren Powell, Savas Dayanik
Department of Operations Research and Financial Engineering, Princeton University
2
Overview
• Problem Formulation
• Knowledge Gradient Policy
• Theoretical Results
• Numerical Results
3
Measurement Phase
[Figure: N opportunities to do experiments across alternatives 1, 2, ..., M; each measurement returns a noisy experimental outcome (e.g. +1, +.2, −1, −.2).]
4
Implementation Phase
[Figure: after the measurement phase, one of alternatives 1, 2, ..., M is chosen for implementation and yields a reward.]
5
Taxonomy of Sampling Problems

                          Reward Structure
    Sampling        On-line        Off-line
    Sequential                        X
    Batch
6
Example 1
• One common experimental design is to spread measurements equally across the alternatives:

x^n = n (mod M) + 1

[Figure: prior quality estimates plotted by alternative.]
7
Example 1
x^n = n (mod M) + 1    (Round Robin Exploration)
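As a minimal sketch, round-robin exploration can be written as a one-line rule (the function name is our own):

```python
def round_robin(n, M):
    """Alternative (1-indexed) to measure at time n = 0, 1, ..., N-1."""
    # Spread measurements equally: x^n = (n mod M) + 1
    return n % M + 1

# With M = 3 alternatives, the first six measurements cycle 1, 2, 3, 1, 2, 3.
schedule = [round_robin(n, 3) for n in range(6)]
```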
8
Example 2
• How might we improve round-robin exploration for use with this prior?

x^n = argmax_x σ^n_x
9
Example 2
x^n = argmax_x σ^n_x    (Largest Variance Exploration)
10
Example 3
Exploitation: x^n = argmax_x μ^n_x
11
Model
• x^n is the alternative tested at time n.
• Measuring alternative x^n yields y^{n+1} = Y_{x^n} + ε^{n+1}.
• At time n, Y_x ∼ N(μ^n_x, (σ^n_x)²), independent across x.
• The error ε^{n+1} is independent N(0, (σ_ε)²).
• At time N, choose an alternative to maximize E[Y_{x^N}] = E[max_x μ^N_x].
12
State Transition
• At time n we measure alternative x^n and update our estimate of Y_{x^n} based on the measurement y^{n+1}:

μ^{n+1}_x = [(σ^n_x)^{−2} μ^n_x + (σ_ε)^{−2} y^{n+1}] / [(σ^n_x)^{−2} + (σ_ε)^{−2}]    for x = x^n
(σ^{n+1}_x)^{−2} = (σ^n_x)^{−2} + (σ_ε)^{−2}    for x = x^n

• Estimates of the other Y_x do not change:

μ^{n+1}_x = μ^n_x,  σ^{n+1}_x = σ^n_x    for x ≠ x^n
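The precision-weighted update above can be sketched as follows (the function name is our own):

```python
def update(mu, sigma, x, y, sigma_eps):
    """Return updated (mu, sigma) lists after observing y^{n+1} for alternative x.

    Only the measured alternative changes; all other estimates are copied through.
    """
    mu, sigma = list(mu), list(sigma)
    prec_prior = sigma[x] ** -2           # (sigma^n_x)^{-2}
    prec_noise = sigma_eps ** -2          # (sigma_eps)^{-2}
    prec_post = prec_prior + prec_noise   # (sigma^{n+1}_x)^{-2}
    # Posterior mean is the precision-weighted average of prior mean and observation.
    mu[x] = (prec_prior * mu[x] + prec_noise * y) / prec_post
    sigma[x] = prec_post ** -0.5
    return mu, sigma
```

For example, starting from μ = 0, σ = 1 and observing y = 2 with σ_ε = 1 gives posterior mean 1 and posterior standard deviation 1/√2.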
13
• Conditioned on time n, μ^{n+1}_x is a normal random variable with mean μ^n_x and variance (σ̃^{n+1}_x)² satisfying

(σ^{n+1}_x)² = (σ^n_x)² − (σ̃^{n+1}_x)²

where (σ^n_x)² is the uncertainty about Y_x before the measurement, (σ^{n+1}_x)² is the uncertainty about Y_x after the measurement, and (σ̃^{n+1}_x)² is the variance of the change in the best estimate of Y_x due to the measurement.
• The value of the optimal policy satisfies Bellman's equation:

V^N(μ^N, σ^N) = max_x μ^N_x
V^n(μ^n, σ^n) = max_{x^n} E[V^{n+1}(μ^{n+1}, σ^{n+1})]
14
Utility of Information
Consider our "utility of information" U(μ^n, σ^n) := max_x μ^n_x, and the random change in utility due to a measurement at time n:

ΔU^{n+1} := U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)
15
Knowledge Gradient Definition
• The knowledge gradient policy chooses the measurement that maximizes this expected increase in utility,
x^n = argmax_x E_n[ΔU^{n+1}]
16
Knowledge Gradient
• We may compute the knowledge gradient policy via

E_n[ΔU^{n+1}] := E_n[U(μ^{n+1}, σ^{n+1}) − U(μ^n, σ^n)]
             = E_n[(max_x μ^{n+1}_x) − max_x μ^n_x]
             = E_n[(μ^{n+1}_{x^n} ∨ max_{x≠x^n} μ^n_x) − max_x μ^n_x],

which is the expectation of the maximum of a normal and a constant.
17
Knowledge Gradient
The computation becomes

x^n = argmax_x σ̃(σ^n_x) f(ζ^n_x)

where Φ is the normal cdf, φ is the normal pdf, and

σ̃(s) := s² / √(s² + (σ_ε)²)
ζ^n_x := −|μ^n_x − max_{x'≠x} μ^n_{x'}| / σ̃(σ^n_x)
f(z) := z Φ(z) + φ(z)
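Putting the pieces together, the knowledge gradient choice can be sketched as below (function and variable names are our own; the normal cdf is built from the standard library's `erf`):

```python
import math

def kg_policy(mu, sigma, sigma_eps):
    """Sketch of the knowledge-gradient rule argmax_x sigma~(sigma^n_x) f(zeta^n_x).

    Returns (0-indexed choice, list of KG scores).
    """
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # normal cdf
    phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # normal pdf
    f = lambda z: z * Phi(z) + phi(z)
    scores = []
    for x in range(len(mu)):
        s = sigma[x]
        sigma_tilde = s * s / math.sqrt(s * s + sigma_eps ** 2)
        if sigma_tilde == 0.0:
            scores.append(0.0)  # a perfectly known alternative has zero KG value
            continue
        best_other = max(mu[xp] for xp in range(len(mu)) if xp != x)
        zeta = -abs(mu[x] - best_other) / sigma_tilde
        scores.append(sigma_tilde * f(zeta))
    return max(range(len(mu)), key=lambda x: scores[x]), scores
```

On the deck's Numerical Example 1 (μ = 0 for all three alternatives, σ² = (4, 2, 1), σ_ε = 1) this rule picks alternative 1, with scores ≈ 0.714, 0.461, 0.282.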
18
Optimality Results
• If our measurement budget allows only one measurement (N=1), the knowledge gradient policy is optimal.
19
Optimality Results
• The knowledge gradient policy is optimal in the limit as the measurement budget N grows to infinity.
• This is really a convergence result.
20
Optimality Results
• The knowledge gradient policy has sub-optimality bounded by

V^n(μ^n, σ^n) − V^{KG,n}(μ^n, σ^n) ≤ (2π)^{−1/2} (N − n − 1) max_x σ̃(σ^n_x),

where V^{KG,n} gives the value of the knowledge gradient policy and V^n the value of the optimal policy.
21
Optimality Results
• If there are exactly 2 alternatives (M=2), the knowledge gradient policy is optimal.
• In this case, the optimal policy reduces to x^n = argmax_x σ^n_x.
22
Optimality Results
• If there is no measurement noise and alternatives may be reordered so that

μ^0_1 ≥ μ^0_2 ≥ ... ≥ μ^0_M  and  σ^0_1 ≥ σ^0_2 ≥ ... ≥ σ^0_M,

then the knowledge gradient policy is optimal.
23
Numerical Experiments
• 100 randomly generated problems
• M ∼ Uniform{1, ..., 100}
• N ∼ Uniform{M, 3M, 10M}
• μ^0_x ∼ Uniform[−1, 1]
• (σ^0_x)² = 1 with probability 0.9, = 10^{−3} with probability 0.1
• σ_ε = 1
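The experimental setup above can be sketched as a problem generator (the function name is our own):

```python
import random

def random_problem(rng=random):
    """Draw one random test problem following the slide's distributions."""
    M = rng.randint(1, 100)              # M ~ Uniform{1, ..., 100}
    N = rng.choice([M, 3 * M, 10 * M])   # N ~ Uniform{M, 3M, 10M}
    mu0 = [rng.uniform(-1.0, 1.0) for _ in range(M)]       # prior means
    var0 = [1.0 if rng.random() < 0.9 else 1e-3 for _ in range(M)]  # prior variances
    sigma_eps = 1.0                      # measurement noise std. dev.
    return M, N, mu0, var0, sigma_eps
```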
24
Numerical Experiments
25
Interval Estimation
• Compare alternatives via a linear combination of mean and standard deviation:

x^n = argmax_x μ^n_x + z_{α/2} σ^n_x

• The parameter z_{α/2} controls the tradeoff between exploration and exploitation.
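Interval estimation is a one-line upper-confidence rule; a minimal sketch (the function name is our own):

```python
def ie_policy(mu, sigma, z):
    """Interval estimation: pick the alternative with the largest mu + z * sigma."""
    return max(range(len(mu)), key=lambda x: mu[x] + z * sigma[x])

# An uncertain alternative with a lower mean can win: bounds here are 4.0 vs 1.2.
choice = ie_policy([0.0, 1.0], [2.0, 0.1], 2.0)
```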
26
KG / IE Comparison
27
KG / IE Comparison
[Figure: Value of KG − Value of IE across test problems.]
28
IE and “Sticking”
μ = [2.5, 0, ..., 0],  σ = [0, 1, ..., 1]
Alternative 1 is known perfectly.
x^n = argmax_x μ^n_x + z_{α/2} σ^n_x,  with z_{α/2} = 2.5
29
IE and “Sticking”
V^{IE} = 2.5, while V = E[max(2.5, Z_1, ..., Z_M)] ↑ ∞ as M ↑ ∞: interval estimation sticks on the perfectly known alternative, so its value stays at 2.5 even as the achievable value grows without bound.
30
Thank You. Any Questions?
31
Numerical Example 1
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 1

where σ̃_x = σ²_x / √(σ²_x + (σ_ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, and ζ_x = −Δ_x / σ̃_x:

x    μ_x    σ²_x    σ̃_x     Δ_x    ζ_x      σ̃_x f(ζ_x)
1    0      4       1.789    0      0        0.714
2    0      2       1.155    0      0        0.461
3    0      1       0.707    0      0        0.282
32
Numerical Example 2
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

where σ̃_x = σ²_x / √(σ²_x + (σ_ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, and ζ_x = −Δ_x / σ̃_x:

x    μ_x    σ²_x    σ̃_x     Δ_x    ζ_x      σ̃_x f(ζ_x)
1    1      0       0        1      −∞       0
2    0      1       0.707    1      −1.414   0.025
3    −1     1       0.707    2      −2.828   0.001
33
Numerical Example 3
x^n = argmax_x σ̃^n_x f(ζ^n_x) = 2

where σ̃_x = σ²_x / √(σ²_x + (σ_ε)²), Δ_x = |μ_x − max_{x'≠x} μ_{x'}|, and ζ_x = −Δ_x / σ̃_x:

x    μ_x    σ²_x    σ̃_x     Δ_x    ζ_x      σ̃_x f(ζ_x)
1    1      1       0.707    1      −1.41    0.0251
2    0      4       1.789    1      −0.56    0.3223
3    −1     1       0.707    2      −2.83    0.0005
34
Knowledge Gradient Example
35
Interval Estimation Example
x^n = argmax_x μ^n_x + 3σ^n_x
36
Boltzmann Exploration
Parameterized by a declining sequence of temperatures (T^0, ..., T^{N−1}):

P^n{x^n = x} = exp(μ^n_x / T^n) / Σ_{x'=1}^{M} exp(μ^n_{x'} / T^n)
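A minimal sketch of Boltzmann exploration (function names are our own; the max is subtracted before exponentiating purely for numerical stability, which does not change the probabilities):

```python
import math
import random

def boltzmann_probs(mu, T):
    """Probability of measuring each alternative: proportional to exp(mu_x / T)."""
    shifted = [m - max(mu) for m in mu]        # stabilize exp without changing ratios
    w = [math.exp(m / T) for m in shifted]
    Z = sum(w)
    return [wi / Z for wi in w]

def boltzmann_sample(mu, T, rng=random):
    """Draw the alternative x^n to measure at temperature T^n."""
    return rng.choices(range(len(mu)), weights=boltzmann_probs(mu, T))[0]
```

As T shrinks along the schedule, the distribution concentrates on the alternative with the largest estimated mean, moving the policy from exploration toward exploitation.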