

Search and Planning for Inference and Learning in Computer Vision

Iasonas Kokkinos, Sinisa Todorovic and Matt (Tianfu) Wu

Markov Decision Processes & Reinforcement Learning

Sinisa Todorovic and Iasonas Kokkinos, June 7, 2015

Multi-Armed Bandit Problem

• A gambler faces K slot machines ("armed bandits")
• Each machine provides a random reward from an unknown distribution specific to that machine
• Problem: in which order to play the machines to maximize the sum of rewards over a sequence of lever pulls

[Figure: from a single state s, actions a1, a2, …, ak yield rewards R(s,a1), R(s,a2), …, R(s,ak)]

Robbins, 1952
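The slides pose the bandit problem but do not commit to a particular strategy. The sketch below uses epsilon-greedy, one simple baseline, with hypothetical Gaussian reward distributions; the function name and parameters are illustrative, not from the slides.

```python
import random

def epsilon_greedy_bandit(true_means, n_pulls=1000, eps=0.1, seed=0):
    """Play K bandit arms with Gaussian rewards using an epsilon-greedy strategy."""
    rng = random.Random(seed)
    K = len(true_means)
    counts = [0] * K        # how often each arm has been pulled
    values = [0.0] * K      # running average reward per arm (the player's estimate)
    total = 0.0
    for _ in range(n_pulls):
        if rng.random() < eps:                      # explore: random arm
            a = rng.randrange(K)
        else:                                       # exploit: best arm so far
            a = max(range(K), key=lambda i: values[i])
        r = rng.gauss(true_means[a], 1.0)           # reward from the arm's (unknown) distribution
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]    # incremental mean update
        total += r
    return total, values

# Example: three arms whose true mean rewards are hidden from the player
print(epsilon_greedy_bandit([0.2, 0.5, 0.9]))
```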

Outline

• Stochastic Process
• Markov Property
• Markov Chain
• Markov Decision Process
• Reinforcement Learning

Discrete Stochastic Process

• A collection of indexed random variables with well-defined ordering

• Characterized by probabilities that the variables take given values, called states

Andrey Markov

Stochastic Process Example

• Classic: Random Walk
  – Start at state X0 at time t0
  – At time ti, move a step Zi, where P(Zi = -1) = p and P(Zi = 1) = 1 - p
  – At time ti, the state is Xi = X0 + Z1 + … + Zi

[Figure: random walk example, http://en.wikipedia.org/wiki/Image:Random_Walk_example.png]
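A minimal simulation of the walk defined above; the function name and seed handling are illustrative assumptions.

```python
import random

def random_walk(n_steps, p=0.5, x0=0, seed=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z_i = -1) = p and P(Z_i = +1) = 1 - p."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(n_steps):
        z = -1 if rng.random() < p else 1
        x += z
        path.append(x)
    return path

print(random_walk(10))   # the states X_0, X_1, ..., X_10 of one sampled walk
```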

Markov Property

• Also thought of as the “memoryless” property
• The probability that Xn+1 takes any given value depends only on Xn, not on the earlier history:
  P(Xn+1 = x | Xn, Xn-1, …, X0) = P(Xn+1 = x | Xn)

Markov Chain

• Discrete-time stochastic process with the Markov property

• Example: Google’s PageRank: the likelihood of random link-following ending up on a given page

http://en.wikipedia.org/wiki/PageRank
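A minimal sketch of the PageRank idea as a Markov chain, using power iteration on a tiny hypothetical three-page web; the damping factor, iteration count, and toy link structure are assumptions, not from the slides.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration on the PageRank Markov chain.

    links[i] is the list of pages that page i links to (a hypothetical toy web).
    """
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i, outgoing in enumerate(links):
            if outgoing:
                share = damping * rank[i] / len(outgoing)
                for j in outgoing:
                    new[j] += share
            else:  # dangling page: spread its rank uniformly
                for j in range(n):
                    new[j] += damping * rank[i] / n
        rank = new
    return rank

# Toy web: page 0 -> 1, page 1 -> 0 and 2, page 2 -> 0
print(pagerank([[1], [0, 2], [0]]))
```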

Markov Decision Process (MDP)

• Discrete-time stochastic control process
• Extension of Markov chains
• Differences:
  – Addition of actions (choice)
  – Addition of rewards (motivation)
• If the actions are fixed, an MDP reduces to a Markov chain

Description of MDPs

• Tuple (S, A, P(·,·), R(·))
  – S: state space
  – A: action space
  – Pa(s, s') = Pr(st+1 = s' | st = s, at = a)
  – R(s): immediate reward at state s
• Goal: maximize a cumulative function of the rewards (the utility function)
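One way to write such a tuple down in code, as a minimal sketch; the two-state MDP, its rewards, and the helper function are hypothetical examples, not taken from the slides.

```python
# Hypothetical two-state, two-action MDP written as the tuple (S, A, P, R).
S = ["s0", "s1"]
A = ["stay", "go"]

# P[a][s][s_next] = Pr(s_{t+1} = s_next | s_t = s, a_t = a)
P = {
    "stay": {"s0": {"s0": 1.0}, "s1": {"s1": 1.0}},
    "go":   {"s0": {"s1": 0.9, "s0": 0.1}, "s1": {"s0": 0.9, "s1": 0.1}},
}

# R[s] = immediate reward received in state s
R = {"s0": 0.0, "s1": 1.0}

def expected_next_reward(s, a):
    """Expected immediate reward of the successor state after taking action a in s."""
    return sum(p * R[s2] for s2, p in P[a][s].items())

print(expected_next_reward("s0", "go"))   # 0.9 * 1.0 + 0.1 * 0.0 = 0.9
```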

Example MDP

[Figure: an example MDP drawn as a graph of state nodes and action nodes]

Solution to an MDP = Policy π

• Given a state, the policy selects the optimal action, regardless of history

Value function
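The value-function equation shown on this slide did not survive extraction. A standard form, assuming the discounted-return setting and the R(s), Pa(s, s') notation defined earlier, is:

```latex
% Value of state s under policy \pi, with discount factor \gamma, and its Bellman form
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s,\ \pi\right]
           = R(s) + \gamma \sum_{s'} P_{\pi(s)}(s, s')\, V^{\pi}(s')
```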

Learning Policy

• Value Iteration

• Policy Iteration (a sketch follows this list)

• Modified Policy Iteration

• Prioritized Sweeping
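A minimal policy-iteration sketch, reusing the hypothetical MDP encoding from the earlier example (P[a][s] maps successor states to probabilities, R[s] is the immediate reward); the fixed number of evaluation sweeps is a simplification.

```python
def policy_iteration(states, actions, P, R, gamma=0.9, eval_sweeps=50):
    """Policy iteration: alternate policy evaluation and greedy policy improvement.

    MDP encoding as in the earlier sketch: P[a][s] = {s_next: prob}, R[s] = reward.
    """
    policy = {s: actions[0] for s in states}              # arbitrary initial policy
    while True:
        # Policy evaluation (iterative; a fixed number of sweeps keeps the sketch simple)
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[policy[s]][s].items())
                 for s in states}
        # Policy improvement: act greedily with respect to V
        improved = {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in P[a][s].items()))
                    for s in states}
        if improved == policy:                            # policy is stable: done
            return policy, V
        policy = improved

# With the hypothetical two-state MDP from the earlier sketch,
# policy_iteration(S, A, P, R) should return the policy {'s0': 'go', 's1': 'stay'}.
```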

Value Iteration

k    Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1     0        0       10       10
2     0        4.5     14.5     19
3     2.03     8.55    18.55    24.18
4     4.76    11.79    19.26    29.23
5     7.45    15.30    20.81    31.82
6    10.23    17.67    22.72    33.68
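The transition model behind the PU/PF/RU/RF numbers above is not reproduced in this transcript, so the following is only a generic value-iteration sketch in the same encoding as the earlier examples; the two-state MDP mentioned in the comment is the hypothetical one from before.

```python
def value_iteration(states, actions, P, R, gamma=0.9, iters=100):
    """Value iteration: V_{k+1}(s) = R(s) + gamma * max_a sum_{s'} P_a(s, s') V_k(s').

    MDP encoding as in the earlier sketches: P[a][s] = {s_next: prob}, R[s] = reward.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in P[a][s].items())
                                   for a in actions)
             for s in states}
    return V

# With the hypothetical two-state MDP from above, value_iteration(S, A, P, R)
# converges to roughly {'s0': 8.9, 's1': 10.0} for gamma = 0.9.
```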

Why So Interesting?

• Straightforward if the transition probabilities are known, but...

• If the transition probabilities are unknown, then this problem is reinforcement learning.

A Typical Agent

• In reinforcement learning (RL), an agent observes a state and takes an action.

• Afterwards, the agent receives a reward.

Mission: Optimize Reward

• Rewards are calculated in the environment
• Used to teach the agent how to reach a goal state
• Must signal what we ultimately want achieved, not necessarily subgoals
• May be discounted over time
• In general, we seek to maximize the expected return
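For reference, the expected return mentioned in the last bullet, written in its standard discounted form; the discount factor γ and the R_{t+1} indexing are conventional assumptions, not notation taken from the slides.

```latex
% Discounted return from time t, with discount factor \gamma \in [0, 1]
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```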

Monte Carlo Methods

• Instead of the state-value function Vπ(s), compute the action-value function Qπ(s, a)

• Qπ(s, a): expected reward when starting in state s, taking action a, and thereafter following policy π
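A minimal Monte Carlo sketch for estimating Qπ(s, a) from sampled episodes; the sample_episode helper, the every-visit variant, and the undiscounted return are assumptions for illustration.

```python
from collections import defaultdict
import random

def mc_q_estimate(sample_episode, n_episodes=1000):
    """Every-visit Monte Carlo estimate of Q^pi(s, a) from sampled episodes.

    sample_episode() is a hypothetical helper that follows policy pi and returns
    a finished episode as a list of (state, action, reward) triples.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode()
        G = 0.0
        # Walk the episode backwards so G is the (undiscounted) return from each step onward
        for s, a, r in reversed(episode):
            G += r
            returns_sum[(s, a)] += G
            returns_cnt[(s, a)] += 1
    return {sa: returns_sum[sa] / returns_cnt[sa] for sa in returns_sum}

# Tiny demo with a fake one-step episode generator: Q("s", "a") should land near 1.0
print(mc_q_estimate(lambda: [("s", "a", random.gauss(1.0, 0.5))]))
```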

Monte-Carlo Tree Search

• Builds a tree rooted at the current state by repeated Monte-Carlo simulation of a “rollout policy”

• Key idea: use statistics of previous trajectories to expand the tree in the most promising direction

• Requires no heuristic function, unlike A* and branch-and-bound methods

Kocsis & Szepesvari, 2006; Browne et al., 2012

Monte-Carlo Tree Search

[Figure: the MCTS loop]
• Selection: select the best state so far
• Expansion: take an action and move to a new state
• Simulation: roll out from the new state
• Backpropagation: propagate the total reward of the simulation back up the tree

Repeated until the maximum tree depth is reached

Monte-Carlo Tree Search

• During construction, each tree node s stores:
  – state-visitation count n(s)
  – action counts n(s, a)
  – action values Q(s, a)
• Repeat until time is up:
  1. Select action a
  2. Update the statistics of each node s on the trajectory:
     • Increment n(s) and n(s, a) for the selected action a
     • Update Q(s, a) by the total reward of the simulation


Monte-Carlo Tree Search

Select the action maximizing Q(s, a) + c · sqrt( ln n(s) / n(s, a) )   (exploitation + exploration)

Theoretically, it is guaranteed to converge to the optimal solution if run long enough.

Practically, it often shows good anytime behavior.

Kocsis & Szepesvari, 2006; Browne et al., 2012
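Putting the pieces together, a condensed UCT sketch; the simulator interface (step, is_terminal), the every-visited-state-becomes-a-node simplification, the exploration constant, and the demo environment are all assumptions for illustration rather than the algorithm exactly as presented on the slides.

```python
import math, random

def uct_search(root, actions, step, is_terminal, n_iters=500, c=1.4, max_depth=20, seed=0):
    """Condensed UCT sketch in the spirit of Kocsis & Szepesvari (2006).

    Assumed, hypothetical simulator interface:
        step(state, action) -> (next_state, reward)   # one stochastic transition
        is_terminal(state)  -> bool
    Node statistics are those listed on the slides: n(s), n(s, a), Q(s, a).
    Simplification: every state visited during a simulation is added to the tree.
    """
    rng = random.Random(seed)
    n_s, n_sa, Q = {}, {}, {}

    def ucb(s, a):
        # exploitation Q(s,a) plus exploration c * sqrt(ln n(s) / n(s,a))
        if n_sa[(s, a)] == 0:
            return float("inf")
        return Q[(s, a)] + c * math.sqrt(math.log(n_s[s]) / n_sa[(s, a)])

    for _ in range(n_iters):
        s, path, total, depth = root, [], 0.0, 0
        while not is_terminal(s) and depth < max_depth:
            if s not in n_s:                          # expand: initialize node statistics
                n_s[s] = 0
                for a in actions:
                    n_sa[(s, a)], Q[(s, a)] = 0, 0.0
            untried = [a for a in actions if n_sa[(s, a)] == 0]
            a = rng.choice(untried) if untried else max(actions, key=lambda act: ucb(s, act))
            s_next, r = step(s, a)                    # simulate one step
            path.append((s, a))
            total += r
            s, depth = s_next, depth + 1
        for s_i, a_i in path:                         # backpropagate the total reward
            n_s[s_i] += 1
            n_sa[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (total - Q[(s_i, a_i)]) / n_sa[(s_i, a_i)]

    return max(actions, key=lambda a: n_sa.get((root, a), 0))   # most-visited root action

# Tiny demo: a chain -1, 0, 1, ..., 4; state 4 gives +1, state -1 gives -1, both terminal.
def demo_step(s, a):
    s2 = s + 1 if a == "right" else s - 1
    return s2, (1.0 if s2 == 4 else -1.0 if s2 == -1 else 0.0)

print(uct_search(0, ["left", "right"], demo_step, lambda s: s in (-1, 4)))  # should recommend "right"
```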


Acknowledgements

NSF IIS 1302700

DARPA MSEE FA 8650-11-1-7149
