Lecture 22
Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3
COS 402 – Machine Learning and Artificial Intelligence
Fall 2016
Balancing exploration and exploitation
• Exploration: try out each action/option to find the best one; gather more information for long-term benefit.
• Exploitation: take the action/option believed to give the best reward/payoff; get the maximum immediate reward given current information.
• Question: what are some exploration-exploitation problems in the real world?
– Selecting a restaurant for dinner
– Medical treatment in clinical trials
– Online advertisement
– Oil drilling
– …
The picture is taken from Wikipedia.
What's in the picture?
Multi-armed bandit problem
• A gambler faces a row of slot machines. At each time step, he chooses one of the slot machines to play and receives a reward. The goal is to maximize his total return.
A row of slot machines in Las Vegas
Multi-armed bandit problem
• Stochastic bandits
– Problem formulation
– Algorithms
• Adversarial bandits
– Problem formulation
– Algorithms
• Contextual bandits
Multi-armed bandit problem
• Stochastic bandits:
– K possible arms/actions: 1 ≤ i ≤ K
– Rewards x_i(t) at each arm i are drawn i.i.d., with an expectation/mean u_i that is unknown to the agent/gambler
– x_i(t) is a bounded real-valued reward
– Goal: maximize the return (the cumulative reward), or equivalently minimize the expected regret:
• Regret = u*·T − E[ Σ_{t=1}^{T} x_{i_t}(t) ], where
• u* = max_i u_i is the expected reward of the best action
(A small numeric example of this definition follows.)
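To make the regret definition concrete, here is a minimal sketch (assuming NumPy; the arm means and the pull sequence are made up for illustration) that computes the expected regret of a fixed sequence of pulls.

```python
import numpy as np

# Hypothetical example: 3 arms with true mean rewards u_i (unknown to the gambler).
mu = np.array([0.2, 0.5, 0.8])
mu_star = mu.max()                 # u* = max_i u_i

# Suppose some algorithm pulled these arms (0-indexed) over T = 6 steps.
pulls = [0, 2, 1, 2, 2, 2]
T = len(pulls)

# Expected regret = u* * T - sum of the true means of the arms actually pulled.
expected_regret = mu_star * T - mu[pulls].sum()
print(expected_regret)             # 0.9 = (0.8 - 0.2) + (0.8 - 0.5)
```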
Multi-armed bandit problem: algorithms?
• Stochastic bandits:
– Example: 10-armed bandits
– Question: what is your strategy? Which arm to pull at each time step t?
Multi-armed bandit problem: algorithms
• Stochastic bandits: 10-armed bandits
• Question: what is your strategy? Which arm to pull at each time step t?
• 1. Greedy method:
– At time step t, estimate a value for each action:
• Q_t(a) = (sum of rewards obtained when a was taken prior to t) / (number of times a was taken prior to t)
– Select the action with the maximum value:
• A_t = argmax_a Q_t(a)
– Weaknesses?
Multi-armed bandit problem: algorithms
• 1. Greedy method:
– At time step t, estimate a value for each action:
• Q_t(a) = (sum of rewards obtained when a was taken prior to t) / (number of times a was taken prior to t)
– Select the action with the maximum value:
• A_t = argmax_a Q_t(a)
• Weaknesses of the greedy method:
– It always exploits current knowledge; there is no exploration.
– It can get stuck with a suboptimal action.
Multi-armed bandit problem: algorithms
• 2. ε-Greedy methods:
– At time step t, estimate a value for each action:
• Q_t(a) = (sum of rewards obtained when a was taken prior to t) / (number of times a was taken prior to t)
– With probability 1 − ε, select the action with the maximum value:
• A_t = argmax_a Q_t(a)
– With probability ε, select an action uniformly at random from all the actions.
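A minimal sketch of the ε-greedy rule on this slide, assuming NumPy and sample-average value estimates; the function and variable names are ours, not from the lecture.

```python
import numpy as np

def epsilon_greedy_action(Q, epsilon, rng):
    """With prob. epsilon explore uniformly; otherwise pick the action with max Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit (ties broken by lowest index)

def update_estimate(Q, N, a, reward):
    """Sample-average update: Q_t(a) = mean of rewards observed so far for action a."""
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]
```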
Experiments:
• Set-up (one run):
– 10-armed bandits
– Draw u_i from Gaussian(0, 1), i = 1, …, 10
• the expectation/mean of rewards for action i
– Rewards of action i at time t: x_i(t)
• x_i(t) ~ Gaussian(u_i, 1)
– Play 2000 rounds/steps
– Record the average return at each time step
• averaged over 1000 runs
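A sketch of the 10-armed testbed just described (u_i ~ Gaussian(0, 1), rewards ~ Gaussian(u_i, 1)), run with ε-greedy; NumPy is assumed, and the run/step counts follow the slide.

```python
import numpy as np

def run_testbed(epsilon, steps=2000, runs=1000, K=10, seed=0):
    """Average reward at each step for epsilon-greedy on the 10-armed testbed."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        mu = rng.normal(0.0, 1.0, size=K)      # true means u_i ~ Gaussian(0, 1)
        Q = np.zeros(K)                         # value estimates
        N = np.zeros(K)                         # pull counts
        for t in range(steps):
            if rng.random() < epsilon:
                a = int(rng.integers(K))        # explore
            else:
                a = int(np.argmax(Q))           # exploit
            r = rng.normal(mu[a], 1.0)          # reward x_a(t) ~ Gaussian(u_a, 1)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]           # sample-average update
            avg_reward[t] += r
    return avg_reward / runs

# e.g. curves = {eps: run_testbed(eps) for eps in (0.0, 0.01, 0.1)}
```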
Experimental results: average over 1000 runs
Observations? What will happen if we run more steps?
Observations:
• The greedy method improves faster at the very beginning, but levels off at a lower level.
• The ε-greedy methods continue to explore and eventually perform better.
• The ε = 0.01 method improves more slowly, but eventually performs better than the ε = 0.1 method.
Improve the Greedy method with optimistic initialization
Observations:
Greedy with optimistic initialization
Observations:
• Big initial Q values force the greedy method to explore more in the beginning.
• There is no exploration afterwards.
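A minimal illustration of optimistic initialization for the pure greedy method; the initial value 5.0 is an arbitrary "big" choice for this sketch, not a number from the slides.

```python
import numpy as np

K = 10
Q = np.full(K, 5.0)     # optimistic initial estimates (much larger than any true mean)
N = np.zeros(K)

# Pure greedy selection: every untried arm still looks best, so each arm gets
# sampled early on; once the estimates settle, there is no further exploration.
def greedy_action(Q):
    return int(np.argmax(Q))
```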
Improve ε-greedy by decreasing ε over time
Decreasing ε over time:
• e.g. ε_t = ε_0 · α / (t + α)
• Improves faster in the beginning, and also outperforms the fixed ε-greedy methods in the long run.
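The exact decay constants on the original slide are partly garbled in this transcript; a schedule of the stated form, with illustrative ε_0 and α, might look like this:

```python
def epsilon_at(t, epsilon0=0.5, alpha=100.0):
    """Decreasing schedule eps_t = eps0 * alpha / (t + alpha); eps0 and alpha are illustrative."""
    return epsilon0 * alpha / (t + alpha)

# eps starts near eps0 and decays toward 0 as t grows, so exploration fades out.
```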
Weaknesses of ε-Greedy methods:
• ε-Greedy methods:
– At time step t, estimate a value Q_t(a) for each action
– With probability 1 − ε, select the action with the maximum value:
• A_t = argmax_a Q_t(a)
– With probability ε, select an action uniformly at random from all the actions.
• Weaknesses:
– They explore by selecting an action uniformly at random, and do not explore the more "promising" actions more often.
– They do not consider confidence intervals: if an action has already been taken many times, there is little need to keep exploring it.
Upper Confidence Bound algorithm
• Take each action once.
• At any time t > K,
– A_t = argmax_a ( Q_t(a) + sqrt( 2 ln t / N_t(a) ) ), where
• Q_t(a) is the average reward obtained from action a,
• N_t(a) is the number of times action a has been taken.
Upper Confidence Bounds
• A_t = argmax_a ( Q_t(a) + sqrt( 2 ln t / N_t(a) ) )
• sqrt( 2 ln t / N_t(a) ) is the confidence interval for the average reward:
– small N_t(a) -> large confidence interval (Q_t(a) is more uncertain)
– large N_t(a) -> small confidence interval (Q_t(a) is more accurate)
• Select the action maximizing the upper confidence bound:
– explore actions that are more uncertain, exploit actions with high average rewards obtained so far.
– UCB balances exploration and exploitation.
– As t -> infinity, it selects the optimal action.
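A minimal UCB1 action-selection sketch following the two slides above (take each action once, then maximize Q plus the confidence bonus); NumPy is assumed and t is the 1-indexed time step.

```python
import numpy as np

def ucb1_action(Q, N, t):
    """UCB1: try any unplayed arm first, then maximize Q_t(a) + sqrt(2 ln t / N_t(a))."""
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = np.sqrt(2.0 * np.log(t) / N)
    return int(np.argmax(Q + bonus))
```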
How good is UCB? Experimental results
Observations: after the initial 10 steps (one pull of each arm), UCB improves quickly and outperforms the ε-greedy methods.
How good is UCB? Theoretical guarantees
• The expected regret is bounded by
[ 8 · Σ_{i: u_i < u*} ( ln t / (u* − u_i) ) ] + ( 1 + π²/3 ) · Σ_{j=1}^{K} ( u* − u_j )
• UCB achieves the optimal regret up to a multiplicative constant.
• The expected regret grows as O(ln t).
How is the UCB derived?
• Hoeffding's Inequality Theorem:
– Let X_1, …, X_t be i.i.d. random variables in [0, 1], and let X̄_t be the sample mean. Then
P[ E[X̄_t] ≥ X̄_t + u ] ≤ e^{−2tu²}
• Applying this to the bandit rewards:
– P[ Q(a) ≥ Q_t(a) + V_t(a) ] ≤ e^{−2 N_t(a) V_t(a)²}
– 2·V_t(a) is the size of the confidence interval for Q(a)
– p = e^{−2 N_t(a) V_t(a)²} is the probability of making an error
How is the UCB derived?
• Applying Hoeffding's inequality to the bandit rewards:
– P[ Q(a) > Q_t(a) + V_t(a) ] ≤ e^{−2 N_t(a) V_t(a)²}
• Let p = e^{−2 N_t(a) V_t(a)²} = t^{−4};
• then V_t(a) = sqrt( 2 ln t / N_t(a) ) (this step is worked out below).
• Variations of UCB1 (the basic version shown here):
– UCB1_tuned, UCB3, UCBv, etc.
– They provide and prove different upper bounds for the expected regret.
– They add a parameter to control the confidence level.
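For completeness, the step from the error probability t^{−4} back to the bonus term, written out; it only solves the equation on the previous slide for V_t(a).

```latex
e^{-2 N_t(a)\,V_t(a)^2} = t^{-4}
\;\Longrightarrow\;
2 N_t(a)\,V_t(a)^2 = 4\ln t
\;\Longrightarrow\;
V_t(a) = \sqrt{\frac{2\ln t}{N_t(a)}}
```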
Multi-armed bandit problem
• Stochastic bandits
– Problem formulation
– Algorithms:
• Greedy, ε-greedy methods, UCB
• Boltzmann exploration (Softmax)
• Adversarial bandits
– Problem formulation
– Algorithms
• Contextual bandits
Multi-armed bandit problem
• Boltzmann exploration (Softmax)
– Pick an action with a probability proportional to the exponential of its average reward.
– Actions with greater average rewards are picked with higher probability.
• The algorithm:
– Given initial empirical means u_1(0), …, u_K(0),
– p_i(t+1) = e^{u_i(t)/τ} / Σ_{j=1}^{K} e^{u_j(t)/τ}, i = 1, …, K
– What is the use of τ?
– What happens as τ → infinity?
Multi-armed bandit problem
• Boltzmann exploration (Softmax)
– Pick an action with a probability proportional to the exponential of its average reward.
– Actions with greater average rewards are picked with higher probability.
• The algorithm:
– Given initial empirical means u_1(0), …, u_K(0),
– p_i(t+1) = e^{u_i(t)/τ} / Σ_{j=1}^{K} e^{u_j(t)/τ}, i = 1, …, K
– The temperature τ controls the choice.
– As τ → infinity, actions are selected uniformly at random.
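A minimal Boltzmann/softmax selection sketch matching the formula above; NumPy is assumed, and the maximum is subtracted before exponentiating purely for numerical stability (it leaves the probabilities unchanged).

```python
import numpy as np

def boltzmann_probs(u, tau):
    """p_i proportional to exp(u_i / tau) over the empirical mean rewards u."""
    u = np.asarray(u, dtype=float)
    z = (u - u.max()) / tau            # shift by the max: same distribution, no overflow
    w = np.exp(z)
    return w / w.sum()

def boltzmann_action(u, tau, rng):
    p = boltzmann_probs(u, tau)
    return int(rng.choice(len(p), p=p))
```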
Multi-armed bandit problem
• Stochastic bandits
– Problem formulation
– Algorithms:
• ε-greedy methods, UCB
• Boltzmann exploration (Softmax)
• Adversarial bandits
– Problem formulation
– Algorithms
• Contextual bandits
Non-stochastic/adversarial multi-armed bandit problem
• Setting:
– K possible arms/actions: 1 ≤ i ≤ K
– The rewards are given by an arbitrary assignment: a sequence x(1), x(2), … of vectors x(t) = (x_1(t), x_2(t), …, x_K(t))
– x_i(t) ∈ [0, 1] is the reward received if action i is chosen at time step t (this can be generalized to rewards in a range [a, b])
– Goal: maximize the return (the cumulative reward)
• Example. K = 3, and an adversary controls the rewards.
k t=1 t=2 t=3 t=4 โฆ
1 0.2 0.7 0.4 0.3
2 0.5 0.1 0.6 0.2
3 0.8 0.4 0.5 0.9
Adversarial multi-armed bandit problem
• Weak regret = Gmax(T) − G_A(T)
• Gmax(T) = max_i Σ_{t=1}^{T} x_i(t), the return of the single globally best action at time horizon T
• G_A(T) = Σ_{t=1}^{T} x_{i_t}(t), the return at time horizon T of algorithm A choosing actions i_1, i_2, …
• Example. Gmax(4) = ?, G_A(4) = ?
– (* indicates the reward received by algorithm A at time step t)
K/T t=1 t=2 t=3 t=4 โฆ
1 0.2 0.7 0.4 0.3*
2 0.5* 0.1 0.6* 0.2
3 0.8 0.4* 0.5 0.9
Adversarial multi-armed bandit problem
• Example. Gmax(4) = ?, G_A(4) = ?
– Gmax(4) = max_i Σ_{t=1}^{4} x_i(t) = max { G_1(4), G_2(4), G_3(4) } = max { 1.6, 1.4, 2.6 } = 2.6
– G_A(4) = Σ_{t=1}^{4} x_{i_t}(t) = 0.5 + 0.4 + 0.6 + 0.3 = 1.8
• To evaluate a randomized algorithm A, bound its expected weak regret Gmax(T) − E[G_A(T)]. (The sketch after the table below recomputes this example.)
K/T t=1 t=2 t=3 t=4 โฆ
1 0.2 0.7 0.4 0.3*
2 0.5* 0.1 0.6* 0.2
3 0.8 0.4* 0.5 0.9
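The worked example above, recomputed as a small sketch (NumPy assumed; arms are 0-indexed here).

```python
import numpy as np

# Reward table from the example: rows = arms 1..3, columns = t = 1..4.
x = np.array([
    [0.2, 0.7, 0.4, 0.3],
    [0.5, 0.1, 0.6, 0.2],
    [0.8, 0.4, 0.5, 0.9],
])

G = x.sum(axis=1)                        # per-arm returns: [1.6, 1.4, 2.6]
G_max = G.max()                          # 2.6, the single best action in hindsight
chosen = [1, 2, 1, 0]                    # the starred arms (0-indexed): 2, 3, 2, 1
G_A = x[chosen, np.arange(4)].sum()      # 0.5 + 0.4 + 0.6 + 0.3 = 1.8
weak_regret = G_max - G_A                # 0.8
```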
Exp3: exponential-weight algorithm for exploration and exploitation
• Parameter: γ ∈ (0, 1]
• Initialize:
– w_i(1) = 1 for i = 1, …, K
• For t = 1, 2, …
– Set p_i(t) = (1 − γ) · w_i(t) / Σ_{j=1}^{K} w_j(t) + γ/K, for i = 1, …, K
– Draw i_t randomly according to p_1(t), …, p_K(t)
– Receive reward x_{i_t}(t) ∈ [0, 1]
– For j = 1, …, K
• x̂_j(t) = x_j(t) / p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise
• w_j(t+1) = w_j(t) · exp( γ · x̂_j(t) / K )
Questions: what happens at time step 1? At time step 2? What happens as γ → 1? Why divide by p_j(t) in x̂_j(t)?
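A direct sketch of the Exp3 pseudocode above; `get_reward(i, t)` is a stand-in for the adversary's reward assignment and must return a value in [0, 1]. NumPy is assumed.

```python
import numpy as np

def exp3(K, T, gamma, get_reward, seed=0):
    """Run Exp3 for T steps; returns the total reward collected."""
    rng = np.random.default_rng(seed)
    w = np.ones(K)                               # w_i(1) = 1
    total = 0.0
    for t in range(1, T + 1):
        p = (1.0 - gamma) * w / w.sum() + gamma / K
        i = int(rng.choice(K, p=p))              # draw i_t from p(t)
        x = get_reward(i, t)                     # observe x_{i_t}(t) in [0, 1]
        total += x
        xhat = x / p[i]                          # importance-weighted estimate; 0 for j != i_t
        w[i] *= np.exp(gamma * xhat / K)         # only the chosen arm's weight changes
    return total

# e.g. pass get_reward=lambda i, t: reward_table[i][t - 1] for the example table above.
```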
Exp3: Balancing exploration and exploitation
• p_i(t) = (1 − γ) · w_i(t) / Σ_{j=1}^{K} w_j(t) + γ/K, for i = 1, …, K
• Balancing exploration and exploitation in Exp3:
– The distribution p(t) is a mixture of the uniform distribution and a distribution that assigns to each action a probability mass exponential in the estimated cumulative reward for that action.
– The uniform component encourages exploration.
– The exponential-weight component encourages exploitation.
– The parameter γ controls the amount of exploration.
– The importance weighting x̂_j(t) = x_j(t) / p_j(t) compensates the rewards of actions that are unlikely to be chosen.
Exp3: How good is it?
• Theorem: for any K > 0 and any γ ∈ (0, 1],
Gmax − E[G_Exp3] ≤ (e − 1) · γ · Gmax + K·ln(K) / γ
holds for any assignment of rewards and for any T > 0.
• Corollary: for any T > 0, assume that g ≥ Gmax and that algorithm Exp3 is run with the parameter
γ = min { 1, sqrt( K·ln(K) / ((e − 1)·g) ) }.
Then Gmax − E[G_Exp3] ≤ 2·sqrt( (e − 1)·g·K·ln(K) ) ≤ 2.63·sqrt( g·K·ln(K) ).
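A tiny helper that evaluates the corollary's tuned γ and regret bound for concrete numbers; the values of K and g in the comment are made up for illustration.

```python
import math

def exp3_tuning(K, g):
    """gamma = min{1, sqrt(K ln K / ((e-1) g))} and the bound 2 sqrt((e-1) g K ln K)."""
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1.0) * g)))
    bound = 2.0 * math.sqrt((math.e - 1.0) * g * K * math.log(K))
    return gamma, bound

# e.g. K = 10 arms and g = 1000 give gamma ≈ 0.12 and an expected weak regret bound ≈ 398.
```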
A family of Exp3 algorithms
• Exp3, Exp3.1, Exp3.P, Exp4, …
• Better performance:
– tighter upper bounds,
– small weak regret with high probability, etc.
• Changes relative to Exp3:
– how γ is chosen,
– how the weights w are initialized and updated.
Summary: Exploration & Exploitation in Reinforcement Learning
• Stochastic bandits
– Problem formulation
– Algorithms: ε-greedy methods, UCB, Boltzmann exploration (Softmax)
• Adversarial bandits
– Problem formulation
– Algorithms: Exp3
• Contextual bandits
– At each time step t, the learner observes a feature/context vector.
– It uses both the context vector and the rewards observed in the past to make a decision.
– It learns how context vectors and rewards relate to each other.
References:
• "Reinforcement Learning: An Introduction" by Sutton and Barto – https://webdocs.cs.ualberta.ca/~sutton/book/the-book-2nd.html
• "The non-stochastic multi-armed bandit problem" by Auer, Cesa-Bianchi, Freund, and Schapire – https://cseweb.ucsd.edu/~yfreund/papers/bandits.pdf
• "Multi-armed bandits" by Michael – http://blog.thedataincubator.com/2016/07/multi-armed-bandits-2/
• …
Take-home questions:
• 1. What is the difference between MABs (multi-armed bandits) and MDPs (Markov Decision Processes)?
• 2. How are they related?