CPSC540 - University of British Columbia

CPSC540: Bayesian optimization, bandits and Thompson sampling (Nando de Freitas, February 2013)


TRANSCRIPT

Page 1: CPSC540 - University of British Columbia

CPSC540

Bayesian optimization, bandits and Thompson sampling

Nando de Freitas, February 2013

Page 2: CPSC540 - University of British Columbia

Multi-armed bandit problem

money!

Page 3: CPSC540 - University of British Columbia

Multi-armed bandit problem

Actions

Reward(s)

Sequence of trials

• Trade-off between Exploration and Exploitation
• Regret = Reward of best action – Player reward
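To make the trade-off and the regret definition concrete, here is a minimal sketch of a Bernoulli bandit played with an epsilon-greedy strategy; the arm payout probabilities and the value of epsilon are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Bernoulli bandit: each arm pays 1 with an (unknown) probability.
true_means = np.array([0.3, 0.5, 0.7])   # assumed payout probabilities
n_arms, n_trials, epsilon = len(true_means), 1000, 0.1

counts = np.zeros(n_arms)        # pulls per arm
values = np.zeros(n_arms)        # running mean reward per arm
total_reward = 0.0

for t in range(n_trials):
    # Explore with probability epsilon, otherwise exploit the best estimate.
    if rng.random() < epsilon:
        a = int(rng.integers(n_arms))
    else:
        a = int(np.argmax(values))
    r = float(rng.random() < true_means[a])   # Bernoulli reward
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]  # incremental mean update
    total_reward += r

# Regret = reward of always playing the best arm - reward actually collected.
regret = n_trials * true_means.max() - total_reward
print(f"collected {total_reward:.0f}, regret = {regret:.1f}")
```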

Page 4: CPSC540 - University of British Columbia
Page 5: CPSC540 - University of British Columbia

[Figure: an acquisition function plotted over the parameter being tuned]

Page 6: CPSC540 - University of British Columbia

Exploration-exploitation tradeoff

Recall the expressions for GP prediction: the posterior mean mu(x) = k(x, X) [K + sigma_n^2 I]^{-1} y and the posterior variance sigma^2(x) = k(x, x) - k(x, X) [K + sigma_n^2 I]^{-1} k(X, x).

We should choose the next point x where the mean is high (exploitation) and the variance is high (exploration).

We could balance this tradeoff with an acquisition function that combines the two, for example by choosing the x that maximizes mu(x) + kappa * sigma(x).
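As a sketch of how the prediction and the trade-off fit together, the following code computes the standard GP posterior mean and variance with a squared-exponential kernel and ranks candidate points with the mean-plus-kappa-times-standard-deviation rule; the kernel hyperparameters, noise level, kappa, and the toy objective are assumptions for illustration.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=0.2, signal_var=1.0):
    # Squared-exponential kernel k(a, b) = s^2 exp(-|a - b|^2 / (2 l^2)).
    d2 = (A[:, None] - B[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / length_scale ** 2)

def gp_predict(X, y, Xstar, noise=1e-4):
    # Standard GP predictive equations:
    #   mu(x)      = k(x, X) [K + noise I]^{-1} y
    #   sigma^2(x) = k(x, x) - k(x, X) [K + noise I]^{-1} k(X, x)
    K = sq_exp_kernel(X, X) + noise * np.eye(len(X))
    Ks = sq_exp_kernel(Xstar, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = sq_exp_kernel(Xstar, Xstar).diagonal() - np.einsum(
        "ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

def ucb_acquisition(mu, var, kappa=2.0):
    # Favour points with high mean (exploitation) or high variance (exploration).
    return mu + kappa * np.sqrt(var)

# Toy usage: pick the next evaluation point on a 1-D grid.
f = lambda x: np.sin(3 * x)                  # hypothetical objective
X = np.array([0.1, 0.4, 0.9]); y = f(X)      # observations so far
grid = np.linspace(0, 1, 200)
mu, var = gp_predict(X, y, grid)
x_next = grid[np.argmax(ucb_acquisition(mu, var))]
print("next point to evaluate:", x_next)
```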

Page 7: CPSC540 - University of British Columbia

Acquisition functions

Page 8: CPSC540 - University of British Columbia

An acquisition function: Probability of Improvement
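A minimal sketch of this acquisition in Python, assuming the usual form PI(x) = Phi((mu(x) - f_best - xi) / sigma(x)), where Phi is the standard normal CDF and f_best is the best value observed so far; the jitter parameter xi is an assumption commonly added to encourage exploration.

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    # PI(x) = P(f(x) >= f_best + xi) = Phi((mu(x) - f_best - xi) / sigma(x)).
    sigma = np.maximum(sigma, 1e-12)   # avoid division by zero
    return norm.cdf((mu - f_best - xi) / sigma)
```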

Page 9: CPSC540 - University of British Columbia

People as Bayesian reasoners

Page 10: CPSC540 - University of British Columbia

Bayes and decision theory

Utilitarian view: we need models to make the right decisions under uncertainty. Inference and decision making are intertwined.

[Diagram: a learned posterior P(x|data) is combined with a cost/reward model u(x,a)]

P(x=healthy|data) = 0.9

P(x=cancer|data) = 0.1

We choose the action that maximizes the expected utility or, equivalently, minimizes the expected cost.

EU(a=no treatment) =

EU(a=treatment) =

EU(a) = Σ_x u(x,a) P(x|data)
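A small worked example of this rule for the diagnosis slide above; the posterior values are from the slide, while the utility table u(x,a) is entirely hypothetical.

```python
# Hypothetical utility table u(x, a); only the posterior P(x | data) is from the slide.
posterior = {"healthy": 0.9, "cancer": 0.1}
utility = {                      # assumed values for illustration
    ("healthy", "no treatment"): 100,
    ("cancer",  "no treatment"): -1000,
    ("healthy", "treatment"):     50,
    ("cancer",  "treatment"):    -200,
}

def expected_utility(action):
    # EU(a) = sum_x u(x, a) P(x | data)
    return sum(utility[(x, action)] * p for x, p in posterior.items())

for a in ("no treatment", "treatment"):
    print(a, expected_utility(a))
# With these numbers, treatment wins: 0.9*50 + 0.1*(-200) = 25
# versus no treatment: 0.9*100 + 0.1*(-1000) = -10.
```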

Page 11: CPSC540 - University of British Columbia

An expected utility criterion

At iteration n+1, choose the point that minimizes the expected distance to the objective evaluated at its maximum x*: x_{n+1} = argmin_x E[ ||f(x) - f(x*)|| | data ].

We don’t know the true objective at the maximum. To overcome this, Mockus proposed the following acquisition function: the expected improvement over the best value observed so far, EI(x) = E[ max(0, f(x) - f^+) | data ].

Page 12: CPSC540 - University of British Columbia

Expected improvement

For this acquisition, we can obtain an analytical expression under the Gaussian posterior: EI(x) = (mu(x) - f^+) Φ(z) + sigma(x) φ(z), where z = (mu(x) - f^+) / sigma(x), Φ is the standard normal CDF and φ its density.
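A short Python sketch of this closed form, assuming the maximization convention with best observed value f_best; the xi jitter term is an optional assumption often added in practice.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # EI(x) = E[max(0, f(x) - f_best - xi)] under the Gaussian posterior:
    #   EI = (mu - f_best - xi) * Phi(z) + sigma * phi(z),
    #   z  = (mu - f_best - xi) / sigma
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    improve = mu - f_best - xi
    z = np.where(sigma > 0, improve / np.maximum(sigma, 1e-12), 0.0)
    ei = improve * norm.cdf(z) + sigma * norm.pdf(z)
    # When sigma is zero, the improvement is deterministic.
    return np.where(sigma > 0, ei, np.maximum(improve, 0.0))
```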

Page 13: CPSC540 - University of British Columbia

A third criterion: GP-UCB

Define the regret and cumulative regret as follows: r_t = f(x*) - f(x_t) and R_T = Σ_{t=1}^{T} r_t.

The GP-UCB criterion is as follows: choose x_t = argmax_x mu_{t-1}(x) + sqrt(beta_t) * sigma_{t-1}(x).

Beta is set using a simple concentration bound:

[Srinivas et al, 2010]
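A sketch of the GP-UCB rule in Python. The beta_t schedule below follows the finite-candidate-set concentration bound of Srinivas et al. (2010), beta_t = 2 log(|D| t^2 pi^2 / (6 delta)); the confidence level delta is an assumption, and mu and var are the GP posterior quantities from earlier.

```python
import numpy as np

def gp_ucb_beta(t, n_candidates, delta=0.1):
    # Concentration-bound schedule for a finite candidate set D (t >= 1):
    #   beta_t = 2 log(|D| t^2 pi^2 / (6 delta))
    return 2.0 * np.log(n_candidates * (t ** 2) * (np.pi ** 2) / (6.0 * delta))

def gp_ucb_acquisition(mu, var, t, n_candidates, delta=0.1):
    # Choose x_t = argmax mu(x) + sqrt(beta_t) * sigma(x).
    return mu + np.sqrt(gp_ucb_beta(t, n_candidates, delta)) * np.sqrt(var)
```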

Page 14: CPSC540 - University of British Columbia

A fourth criterion: Thompson sampling
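Thompson sampling is easiest to see for a Bernoulli bandit with Beta posteriors: sample one plausible mean per arm from its posterior, play the arm whose sample is best, and update. The arm probabilities and the uniform Beta(1, 1) priors below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors.
true_means = np.array([0.3, 0.5, 0.7])     # assumed, unknown to the player
n_arms, n_trials = len(true_means), 1000
alpha = np.ones(n_arms)                     # 1 + successes per arm
beta = np.ones(n_arms)                      # 1 + failures per arm

for t in range(n_trials):
    # Sample one plausible mean per arm from its posterior, play the best sample.
    theta = rng.beta(alpha, beta)
    a = int(np.argmax(theta))
    r = float(rng.random() < true_means[a])
    alpha[a] += r
    beta[a] += 1.0 - r

print("posterior mean estimates:", alpha / (alpha + beta))
```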

Page 15: CPSC540 - University of British Columbia

Acquisition functions

Page 16: CPSC540 - University of British Columbia
Page 17: CPSC540 - University of British Columbia

Portfolios of acquisition functions help

Page 18: CPSC540 - University of British Columbia

Why Bayesian Optimization works

Page 19: CPSC540 - University of British Columbia

Intelligent user interfaces

Page 20: CPSC540 - University of British Columbia

Example: Tuning NP hard problem solvers

Page 21: CPSC540 - University of British Columbia

Why random tuning works sometimes

Page 22: CPSC540 - University of British Columbia

Example: Tuning random forests

Page 23: CPSC540 - University of British Columbia

Example: Tuning hybrid Monte Carlo

Page 24: CPSC540 - University of British Columbia


Page 25: CPSC540 - University of British Columbia

The games industry, rich in sophisticated large-scale simulators, is a great environment for the design and study of automatic decision making systems.

Page 26: CPSC540 - University of British Columbia

Hierarchical policy example

– High-level: model-based learning for deciding when to navigate, park, pick up and drop off passengers.

– Mid-level: active path learning for navigating a topological map.

– Low-level: active policy optimizer to learn control of continuous non-linear vehicle dynamics.

Page 27: CPSC540 - University of British Columbia

Active Path Finding in the Middle Level

• The mid-level Navigate policy generates a sequence of waypoints on a topological map to navigate from a location to a destination. The value function V(θ) represents the path length from the current node to the target.

Page 28: CPSC540 - University of British Columbia

Low-Level: Trajectory following

[Diagram: trajectory-following errors Vx, Vy, Yerr, Ωerr measured relative to the reference trajectory]

TORCS: 3D game engine that implements complex vehicle dynamics complete with manual and automatic transmission, engine, clutch, tire, and suspension models.

Page 29: CPSC540 - University of British Columbia

Hierarchical systems apply to many robot tasks and are key to building large systems.

We used TORCS: a 3D game engine that implements complex vehicle dynamics complete with manual and automatic transmission, engine, clutch, tire, and suspension models.

Page 30: CPSC540 - University of British Columbia

Gaze planning

Digits Experiment:

Face Experiment:


Page 31: CPSC540 - University of British Columbia

Next lecture

In the next lecture, we embark on our quest to learn all about random forests. We will begin by learning about decision trees.