
Lecture 22

Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

COS 402 – Machine Learning and Artificial Intelligence

Fall 2016

How to balance exploration and exploitation in reinforcement learning

• Exploration: try out each action/option to find the best one; gather more information for long-term benefit.

• Exploitation: take the action/option currently believed to give the best reward/payoff; get the maximum immediate reward given current information.

• Question: what are some exploration-exploitation problems in the real world?

  – Select a restaurant for dinner

  – …

Balancing exploration and exploitation

• Exploration: try out each action/option to find the best one; gather more information for long-term benefit.

• Exploitation: take the action/option currently believed to give the best reward/payoff; get the maximum immediate reward given current information.

• Question: what are some exploration-exploitation problems in the real world?

  – Select a restaurant for dinner

  – Medical treatment in clinical trials

  – Online advertisement

  – Oil drilling

  – …

Picture is taken from Wikipedia

What's in the picture?

Multi-armed bandit problem

• A gambler faces a row of slot machines. At each time step, he chooses one of the slot machines to play and receives a reward. The goal is to maximize his return.

A row of slot machines in Las Vegas

Multi-armed bandit problem

• Stochastic bandits

  – Problem formulation

  – Algorithms

• Adversarial bandits

  – Problem formulation

  – Algorithms

• Contextual bandits

Multi-armed bandit problem

• Stochastic bandits:

  – K possible arms/actions: 1 ≤ i ≤ K

  – Rewards xi(t) at each arm i are drawn i.i.d. with an expectation/mean ui, unknown to the agent/gambler

  – xi(t) is a bounded real-valued reward

  – Goal: maximize the return (the cumulative reward), or minimize the expected regret:

    • Regret = u*·T − E[ Σ_{t=1}^{T} x_{i_t}(t) ], where

    • u* = max_i ui, the expectation of the best action

Multi-armed bandit problem: algorithms?

• Stochastic bandits:

  – Example: 10-armed bandits

  – Question: what is your strategy? Which arm to pull at each time step t?

Multi-armed bandit problem: algorithms

• Stochastic bandits: 10-armed bandits

• Question: what is your strategy? Which arm to pull at each time step t?

• 1. Greedy method:

  – At time step t, estimate a value for each action:

    • Qt(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)

  – Select the action with the maximum value:

    • At = argmax_a Qt(a)

  – Weaknesses?

Multi-armed bandit problem: algorithms

• 1. Greedy method:

  – At time step t, estimate a value for each action:

    • Qt(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)

  – Select the action with the maximum value:

    • At = argmax_a Qt(a)

• Weaknesses of the greedy method:

  – It always exploits current knowledge; there is no exploration.

  – It can get stuck with a suboptimal action.

  (A minimal code sketch of the greedy rule follows below.)
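
A minimal sketch of the greedy rule in Python, assuming the Gaussian-reward testbed described later in the lecture; the function and variable names (run_greedy, Q, N) are illustrative, not from the slides.

import numpy as np

def run_greedy(true_means, steps=2000, seed=0):
    """Always pick the action with the highest current estimate Qt(a)."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    Q = np.zeros(K)          # Qt(a): sample-average estimate for each action
    N = np.zeros(K)          # number of times each action has been taken
    rewards = np.zeros(steps)
    for t in range(steps):
        a = int(np.argmax(Q))                 # pure exploitation, no exploration
        r = rng.normal(true_means[a], 1.0)    # reward x_a(t) ~ Gaussian(u_a, 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental sample average
        rewards[t] = r
    return rewards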

Multi-armed bandit problem: algorithms

• 2. ε-greedy methods:

  – At time step t, estimate a value for each action:

    • Qt(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)

  – With probability 1 − ε, select the action with the maximum value:

    • At = argmax_a Qt(a)

  – With probability ε, select an action randomly from all the actions with equal probability.

  (A minimal sketch of the selection rule follows below.)
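
A minimal sketch of the ε-greedy selection rule in Python; epsilon_greedy_action is an illustrative helper name, not from the slides.

import numpy as np

def epsilon_greedy_action(Q, epsilon, rng):
    """With probability 1 - epsilon exploit argmax Qt(a); with probability epsilon explore uniformly."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform over all K actions
    return int(np.argmax(Q))               # exploit: action with the highest estimate

# Example: rng = np.random.default_rng(0); a = epsilon_greedy_action(np.zeros(10), 0.1, rng)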

Experiments:

• Set-up (one run):

  – 10-armed bandits

  – Draw ui from Gaussian(0,1), i = 1,…,10

    • ui is the expectation/mean of the rewards for action i

  – Reward of action i at time t: xi(t)

    • xi(t) ~ Gaussian(ui, 1)

  – Play 2000 rounds/steps

  – Record the average return at each time step

    • averaged over 1000 runs

  (A sketch of this testbed appears below.)
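
A sketch of this testbed in Python, following the set-up above (10 arms, ui ~ Gaussian(0,1), xi(t) ~ Gaussian(ui,1), 2000 steps, averaged over 1000 runs); the helper names bandit_run and average_reward_curve are illustrative.

import numpy as np

def bandit_run(epsilon, steps=2000, K=10, rng=None):
    """One run: draw fresh arm means, then play epsilon-greedy for `steps` steps."""
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.normal(0.0, 1.0, K)            # true means u_i ~ Gaussian(0, 1)
    Q, N = np.zeros(K), np.zeros(K)
    rewards = np.zeros(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(K))       # explore
        else:
            a = int(np.argmax(Q))          # exploit
        r = rng.normal(u[a], 1.0)          # x_a(t) ~ Gaussian(u_a, 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]          # incremental sample average
        rewards[t] = r
    return rewards

def average_reward_curve(epsilon, runs=1000, steps=2000):
    """Average the per-step reward over independent runs, as in the plots."""
    rng = np.random.default_rng(0)
    return np.mean([bandit_run(epsilon, steps, rng=rng) for _ in range(runs)], axis=0)

# Compare, e.g., epsilon = 0 (greedy), 0.1 and 0.01.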

Experimental results: average over 1000 runs

Observations? What will happen if we run more steps?

Observations:
• The greedy method improves faster at the very beginning but levels off at a lower level.
• The ε-greedy methods continue to explore and eventually perform better.
• The ε = 0.01 method improves more slowly, but eventually performs better than the ε = 0.1 method.

Improve the Greedy method with optimistic initialization

Observations:

Greedy with optimistic initialization

Observations:
• Big initial Q values force the greedy method to explore more in the beginning.
• There is no exploration afterwards.

Improve ε-greedy with a decreasing ε over time

Decreasing ε over time:
• ε_t = ε_0 · α / (t + α)
• Improves faster in the beginning and also outperforms the fixed-ε greedy methods in the long run.

Weaknesses of ε-greedy methods:

• ε-greedy methods:

  – At time step t, estimate a value Qt(a) for each action.

  – With probability 1 − ε, select the action with the maximum value:

    • At = argmax_a Qt(a)

  – With probability ε, select an action randomly from all the actions with equal probability.

• Weaknesses:

Weaknesses of ε-greedy methods:

• ε-greedy methods:

  – At time step t, estimate a value Qt(a) for each action.

  – With probability 1 − ε, select the action with the maximum value:

    • At = argmax_a Qt(a)

  – With probability ε, select an action randomly from all the actions with equal probability.

• Weaknesses:

  – They select an action to explore at random; they do not explore more "promising" actions.

  – They do not consider confidence intervals: if an action has already been taken many times, there is no need to keep exploring it.

Upper Confidence Bound algorithm

• Take each action once.

• At any time t > K,

  – At = argmax_a ( Qt(a) + √( 2·ln t / Na(t) ) ), where

    • Qt(a) is the average reward obtained from action a,

    • Na(t) is the number of times action a has been taken.

Upper Confidence Bounds

• At = argmax_a ( Qt(a) + √( 2·ln t / Na(t) ) )

• √( 2·ln t / Na(t) ) is the confidence interval for the average reward:

  – Small Na(t) → large confidence interval (Qt(a) is more uncertain)

  – Large Na(t) → small confidence interval (Qt(a) is more accurate)

• Select the action maximizing the upper confidence bound:

  – Explore actions that are more uncertain; exploit actions with high observed average rewards.

  – UCB balances exploration and exploitation.

  – As t → infinity, it selects the optimal action.

  (A minimal code sketch follows below.)
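
A minimal sketch of UCB1 selection in Python, following the rule above (take each action once, then maximize Qt(a) plus the confidence term); ucb_action is an illustrative name.

import numpy as np

def ucb_action(Q, N, t):
    """UCB1 choice: argmax_a Qt(a) + sqrt(2 ln t / Na(t)), after trying every arm once."""
    untried = np.where(N == 0)[0]
    if untried.size > 0:
        return int(untried[0])                # take each action once before using the bound
    bonus = np.sqrt(2.0 * np.log(t) / N)      # confidence-interval width for each arm
    return int(np.argmax(Q + bonus))

# Inside a bandit loop (t counted from 1):
#   a = ucb_action(Q, N, t); observe reward r; N[a] += 1; Q[a] += (r - Q[a]) / N[a]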

How good is UCB? Experimental results

Observations:

How good is UCB? Experimental results

Observations: after the initial 10 steps, UCB improves quickly and outperforms the ε-greedy methods.

How good is UCB? Theoretical guarantees

• The expected regret is bounded by

    8 · Σ_{i: ui < u*} ( ln t / (u* − ui) )  +  (1 + π²/3) · Σ_{i=1}^{K} (u* − ui)

• UCB achieves the optimal regret up to a multiplicative constant.

• The expected regret grows as O(ln t).

How is the UCB derived?

• Hoeffding's Inequality Theorem:

  – Let X1, …, Xt be i.i.d. random variables in [0,1], and let X̄t be the sample mean. Then

    P[ E[X̄t] ≥ X̄t + u ] ≤ e^(−2tu²)

• Applied to the bandit rewards:

  – P[ Q(a) ≥ Qt(a) + Vt(a) ] ≤ e^(−2·Na(t)·Vt(a)²)

  – 2·Vt(a) is the size of the confidence interval for Q(a)

  – p = e^(−2·Na(t)·Vt(a)²) is the probability of making an error

How is the UCB derived?

• Applied to the bandit rewards:

  – P[ Q(a) > Qt(a) + Vt(a) ] ≤ e^(−2·Na(t)·Vt(a)²)

• Let p = e^(−2·Na(t)·Vt(a)²) = t^(−4).

• Then Vt(a) = √( 2·ln t / Na(t) )  (the algebra is written out below).

• Variations of UCB1 (the basic one presented here):

  – UCB1_tuned, UCB3, UCBv, etc.

  – Provide and prove different upper bounds for the expected regret.

  – Add a parameter to control the confidence level.

Multi-armed bandit problem

• Stochastic bandits

  – Problem formulation

  – Algorithms:

    • Greedy, ε-greedy methods, UCB

    • Boltzmann exploration (Softmax)

• Adversarial bandits

  – Problem formulation

  – Algorithms

• Contextual bandits

Multi-armed bandit problem

• Boltzmann exploration (Softmax)

  – Pick an action with a probability that increases with its average reward.

  – Actions with greater average rewards are picked with higher probability.

• The algorithm:

  – Given initial empirical means u1(0), …, uK(0)

  – pi(t+1) = e^(ui(t)/τ) / Σ_{j=1}^{K} e^(uj(t)/τ), i = 1, …, K

  – What is the use of τ?

  – What happens as τ → infinity?

Multi-armed bandit problem

• Boltzmann exploration (Softmax)

  – Pick an action with a probability that increases with its average reward.

  – Actions with greater average rewards are picked with higher probability.

• The algorithm:

  – Given initial empirical means u1(0), …, uK(0)

  – pi(t+1) = e^(ui(t)/τ) / Σ_{j=1}^{K} e^(uj(t)/τ), i = 1, …, K

  – The temperature τ controls the choice.

  – As τ → infinity, actions are selected uniformly.

  (A minimal sketch of the rule follows below.)

Multi-armed bandit problem

• Stochastic bandits

  – Problem formulation

  – Algorithms:

    • ε-greedy methods, UCB

    • Boltzmann exploration (Softmax)

• Adversarial bandits

  – Problem formulation

  – Algorithms

• Contextual bandits

Non-stochastic/adversarial multi-armed bandit problem

• Setting:

  – K possible arms/actions: 1 ≤ i ≤ K

  – The rewards are given by a sequence of reward vectors x(1), x(2), …, where x(t) = (x1(t), x2(t), …, xK(t))

  – xi(t) ∈ [0,1]: the reward received if action i is chosen at time step t (this can be generalized to the range [a, b])

  – Goal: maximize the return (the cumulative reward)

• Example: K = 3, and an adversary controls the rewards.

  k    t=1   t=2   t=3   t=4   …
  1    0.2   0.7   0.4   0.3
  2    0.5   0.1   0.6   0.2
  3    0.8   0.4   0.5   0.9

Adversarial multi-armed bandit problem

• Weak regret = Gmax(T) − GA(T)

• Gmax(T) = max_j Σ_{t=1}^{T} xj(t), the return of the single globally best action at time horizon T

• GA(T) = Σ_{t=1}^{T} x_{i_t}(t), the return at time horizon T of algorithm A choosing actions i1, i2, …

• Example: Gmax(4) = ?, GA(4) = ?

  (* indicates the reward received by algorithm A at that time step)

  k \ t   t=1    t=2    t=3    t=4   …
  1       0.2    0.7    0.4    0.3*
  2       0.5*   0.1    0.6*   0.2
  3       0.8    0.4*   0.5    0.9

Adversarial multi-armed bandit problem

• Example: Gmax(4) = ?, GA(4) = ?

  – Gmax(4) = max_j Σ_{t=1}^{4} xj(t) = max { G1(4), G2(4), G3(4) } = max { 1.6, 1.4, 2.6 } = 2.6

  – GA(4) = Σ_{t=1}^{4} x_{i_t}(t) = 0.5 + 0.4 + 0.6 + 0.3 = 1.8

  (A small numeric check of these values appears below the table.)

• To evaluate a randomized algorithm A, bound its expected weak regret: Gmax(T) − E[GA(T)]

  k \ t   t=1    t=2    t=3    t=4   …
  1       0.2    0.7    0.4    0.3*
  2       0.5*   0.1    0.6*   0.2
  3       0.8    0.4*   0.5    0.9
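
A small numeric check of this example in Python, using the rewards from the table above (the starred entries are the ones algorithm A received):

# Rewards from the example table, indexed by arm.
rewards = {
    1: [0.2, 0.7, 0.4, 0.3],
    2: [0.5, 0.1, 0.6, 0.2],
    3: [0.8, 0.4, 0.5, 0.9],
}
G = {k: sum(v) for k, v in rewards.items()}     # per-arm returns G_k(4)
G_max = max(G.values())                         # return of the single best action
G_A = 0.5 + 0.4 + 0.6 + 0.3                     # starred rewards: arms 2, 3, 2, 1 -> 1.8
weak_regret = G_max - G_A
print(G, G_max, G_A, weak_regret)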

Exp3: exponential-weight algorithm for exploration and exploitation

• Parameter: γ ∈ (0,1]

• Initialize:

  – wi(1) = 1 for i = 1, …, K

• For t = 1, 2, …

  – Set pi(t) = (1 − γ) · wi(t) / Σ_{j=1}^{K} wj(t) + γ · (1/K), i = 1, …, K

  – Draw it randomly according to p1(t), …, pK(t)

  – Receive reward x_{i_t}(t) ∈ [0,1]

  – For j = 1, …, K

    • x̂j(t) = xj(t) / pj(t) if j = it

    • x̂j(t) = 0 otherwise

    • wj(t+1) = wj(t) · exp( γ · x̂j(t) / K )

Questions: What happens at time step 1? At time step 2? What happens as γ → 1? Why divide by pj(t)?
(A minimal code sketch appears below.)
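
A minimal sketch of Exp3 in Python, following the pseudocode above; exp3, the reward_fn argument, and the weight-rescaling line are illustrative additions, not from the slides.

import numpy as np

def exp3(K, T, gamma, reward_fn, seed=0):
    """Exp3: exponential weights mixed with gamma-uniform exploration.

    reward_fn(arm, t) must return a reward in [0, 1]; it may be adversarial."""
    rng = np.random.default_rng(seed)
    w = np.ones(K)                                     # w_i(1) = 1
    total = 0.0
    for t in range(1, T + 1):
        p = (1.0 - gamma) * w / w.sum() + gamma / K    # mixture of exp-weights and uniform
        i = int(rng.choice(K, p=p))                    # draw i_t according to p(t)
        x = reward_fn(i, t)                            # observe x_{i_t}(t) in [0, 1]
        total += x
        x_hat = x / p[i]                               # importance-weighted estimate; other arms get 0
        w[i] *= np.exp(gamma * x_hat / K)              # exponential weight update
        w /= w.max()                                   # optional rescaling; leaves p(t) unchanged
    return total

# Example: exp3(K=3, T=1000, gamma=0.1, reward_fn=lambda arm, t: 0.9 if arm == 2 else 0.2)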

Exp3: Balancing exploration and exploitation

• pi(t) = (1 − γ) · wi(t) / Σ_{j=1}^{K} wj(t) + γ · (1/K), i = 1, …, K

• Balancing exploration and exploitation in Exp3:

  – The distribution p(t) is a mixture of the uniform distribution and a distribution that assigns to each action a probability mass exponential in the estimated cumulative reward for that action.

  – The uniform component encourages exploration.

  – The exponential-weight component encourages exploitation.

  – The parameter γ controls the amount of exploration.

  – Dividing by pj(t), i.e. x̂j(t) = xj(t)/pj(t), compensates the rewards of actions that are unlikely to be chosen.

Exp3: How good is it?

• Theorem: For any K > 0 and for any γ ∈ (0,1],

    Gmax − E[GExp3] ≤ (e − 1)·γ·Gmax + K·ln(K)/γ

  holds for any assignment of rewards and for any T > 0.

• Corollary: For any T > 0, assume that g ≥ Gmax and that algorithm Exp3 is run with the parameter

    γ = min { 1, √( K·ln K / ((e − 1)·g) ) }.

  Then

    Gmax − E[GExp3] ≤ 2·√( (e − 1)·g·K·ln K ) ≤ 2.63·√( g·K·ln K )

A family of Exp3 algorithms

• Exp3, Exp3.1, Exp3.P, Exp4, …

• Better performance:

  – tighter upper bounds,

  – small weak regret with high probability, etc.

• Changes relative to Exp3:

  – how γ is chosen,

  – how the weights w are initialized and updated.

Summary: Exploration & exploitation in Reinforcement learning

• Stochastic bandits

  – Problem formulation

  – Algorithms: ε-greedy methods, UCB, Boltzmann exploration (Softmax)

• Adversarial bandits

  – Problem formulation

  – Algorithms: Exp3

• Contextual bandits

  – At each time step t, the agent observes a feature/context vector.

  – It uses both the context vector and past rewards to make its decision.

  – It learns how context vectors and rewards relate to each other.


References:

• "Reinforcement Learning: An Introduction" by Sutton and Barto
  – https://webdocs.cs.ualberta.ca/~sutton/book/the-book-2nd.html

• "The non-stochastic multi-armed bandit problem" by Auer, Cesa-Bianchi, Freund, and Schapire
  – https://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf

• "Multi-armed bandits" by Michael
  – http://blog.thedataincubator.com/2016/07/multi-armed-bandits-2/

• …

Take-home questions

• 1. What is the difference between MABs (multi-armed bandits) and MDPs (Markov Decision Processes)?

• 2. How are they related?
