Transcript
Page 1


Search and Planning for Inference and Learning

in Computer Vision 

Iasonas Kokkinos, Sinisa Todorovic and Matt (Tianfu) Wu

Page 2

Markov Decision Processes & Reinforcement Learning

Sinisa Todorovic and Iasonas Kokkinos, June 7, 2015

Page 3

Multi-Armed Bandit Problem

• A gambler faces K slot machines ("armed bandits")
• Each machine provides a random reward from an unknown distribution specific to that machine
• Problem: in which order to play the machines so as to maximize the sum of rewards over a sequence of lever pulls

[Figure: a single state s with arms a1, a2, …, aK and corresponding rewards R(s,a1), R(s,a2), …, R(s,aK)]

Robbins 1952
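As a concrete illustration (not part of the original slides), here is a minimal epsilon-greedy bandit simulation; the arm reward distributions and the epsilon value are arbitrary assumptions chosen only to make the sketch runnable.

```python
import random

def epsilon_greedy_bandit(true_means, n_pulls=10000, epsilon=0.1):
    """Play a K-armed bandit with epsilon-greedy action selection.

    true_means: hypothetical mean reward of each arm (unknown to the player).
    Returns the total reward collected and the per-arm value estimates.
    """
    K = len(true_means)
    counts = [0] * K          # n(a): how often each arm was pulled
    estimates = [0.0] * K     # Q(a): running average reward per arm
    total_reward = 0.0

    for _ in range(n_pulls):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if random.random() < epsilon:
            a = random.randrange(K)
        else:
            a = max(range(K), key=lambda i: estimates[i])

        # Sample a noisy reward from the chosen machine.
        reward = random.gauss(true_means[a], 1.0)
        total_reward += reward

        # Incremental update of the running average for arm a.
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]

    return total_reward, estimates

# Example: 3 machines with (hypothetical) mean payoffs unknown to the player.
print(epsilon_greedy_bandit([0.2, 0.5, 0.8]))
```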

Page 4

Outline

• Stochastic Process
• Markov Property
• Markov Chain
• Markov Decision Process
• Reinforcement Learning

Page 5

Discrete Stochastic Process

• A collection of indexed random variables with well-defined ordering

• Characterized by probabilities that the variables take given values, called states

Andrey Markov

Page 6

Stochastic Process Example

• Classic: Random Walk
  – Start at state X0 at time t0
  – At time ti, take a step Zi where P(Zi = -1) = p and P(Zi = 1) = 1 - p
  – At time ti, the state is Xi = X0 + Z1 + … + Zi

http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
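A short sketch of the random walk just described, using the step probabilities p and 1 - p from the slide:

```python
import random

def random_walk(n_steps, p=0.5, x0=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z_i = -1) = p, P(Z_i = +1) = 1 - p."""
    states = [x0]
    for _ in range(n_steps):
        z = -1 if random.random() < p else 1
        states.append(states[-1] + z)
    return states

print(random_walk(10))
```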

Page 7

Markov Property

• Also thought of as the “memoryless” property

• The probability that Xn+1 takes any given value depends only on Xn, not on the earlier states X1, …, Xn-1
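Stated as a formula (the standard statement of the property, reconstructed here since no equation survived in the transcript):

```latex
P(X_{n+1} = x \mid X_1 = x_1, \ldots, X_n = x_n) \;=\; P(X_{n+1} = x \mid X_n = x_n)
```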

Page 8

Markov Chain

• Discrete-time stochastic process with the Markov property

• Example: Google’s PageRank, the likelihood that random link-following ends up on a given page

http://en.wikipedia.org/wiki/PageRank
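A toy illustration of the idea (a hypothetical 3-page web; the damping factor 0.85 is the commonly cited PageRank default, used here only as an assumption): repeatedly applying the transition matrix of the link-following Markov chain converges to the probability of ending up on each page.

```python
def pagerank(links, damping=0.85, n_iter=100):
    """links[i] = list of pages that page i links to (hypothetical toy graph)."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(n_iter):
        new_rank = [(1.0 - damping) / n] * n
        for i, outgoing in enumerate(links):
            if outgoing:
                share = damping * rank[i] / len(outgoing)
                for j in outgoing:
                    new_rank[j] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
        rank = new_rank
    return rank

# Page 0 links to pages 1 and 2, page 1 links to 2, page 2 links back to 0.
print(pagerank([[1, 2], [2], [0]]))
```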

Page 9

Markov Decision Process (MDP)

• Discrete-time stochastic control process
• Extension of Markov chains
• Differences:
  – Addition of actions (choice)
  – Addition of rewards (motivation)

• If the actions are fixed, an MDP reduces to a Markov chain

Page 10

Description of MDPs

• Tuple (S, A, Pa(·,·), R(·))
  – S: state space
  – A: action space
  – Pa(s, s’) = Pr(st+1 = s’ | st = s, at = a)
  – R(s): immediate reward at state s

• Goal: maximize a cumulative function of the rewards (the utility function)
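One way to write the tuple down in code; a minimal sketch whose field names are illustrative, not from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                     # S: state space
    actions: List[str]                    # A: action space
    P: Dict[Tuple[str, str, str], float]  # P[(s, a, s')] = Pr(s_{t+1}=s' | s_t=s, a_t=a)
    R: Dict[str, float]                   # R[s]: immediate reward at state s
    gamma: float = 0.9                    # discount factor for the cumulative reward
```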

Page 11

Example MDP

[Figure: example MDP graph with state nodes and action nodes]

Page 12

Solution to an MDP = Policy π

• Given a state, selects the optimal action regardless of history

Value function
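The value function referred to here is, in its standard form (reconstructed, since the equation itself did not survive transcription), the expected discounted sum of rewards when following policy π from state s, together with its Bellman recursion in terms of the R(s) and Pa(s, s’) defined on page 10:

```latex
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s,\ \pi\right]
\;=\; R(s) + \gamma \sum_{s'} P_{\pi(s)}(s, s')\, V^{\pi}(s')
```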

Page 13

Learning Policy

• Value Iteration

• Policy Iteration

• Modified Policy Iteration

• Prioritized Sweeping

Page 14

Value Iteration

k    Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1    0        0        10       10
2    0        4.5      14.5     19
3    2.03     8.55     18.55    24.18
4    4.76     11.79    19.26    29.23
5    7.45     15.30    20.81    31.82
6    10.23    17.67    22.72    33.68
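A minimal value iteration sketch that produces a table like the one above; the specific MDP behind these numbers (states PU, PF, RU, RF and their transitions) is not reproduced in the transcript, so the transition and reward arguments are placeholders to be filled in.

```python
def value_iteration(states, actions, P, R, gamma=0.9, n_iter=6):
    """Iterate V_{k+1}(s) = R(s) + gamma * max_a sum_{s'} P[(s,a,s')] * V_k(s').

    P[(s, a, s')] is the transition probability, R[s] the immediate reward.
    Returns the list of value tables V_1 ... V_{n_iter}, one dict per iteration.
    """
    V = {s: 0.0 for s in states}
    history = []
    for _ in range(n_iter):
        # Synchronous backup: the comprehension reads the old V while building the new one.
        V = {
            s: R[s] + gamma * max(
                sum(P.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                for a in actions
            )
            for s in V
        }
        history.append(V)
    return history
```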

Page 15

Why So Interesting?

• Straightforward if the transition probabilities are known, but...

• If the transition probabilities are unknown, then this problem is reinforcement learning.

Page 16

A Typical Agent

• In reinforcement learning (RL), an agent observes a state and takes an action.

• Afterwards, the agent receives a reward.

Page 17

Mission: Optimize Reward

• Rewards are calculated in the environment
• Used to teach the agent how to reach a goal state
• Must signal what we ultimately want achieved, not necessarily subgoals
• May be discounted over time
• In general, seek to maximize the expected return
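In symbols (the standard definition of the discounted return, added here for reference):

```latex
G_t \;=\; r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \cdots
\;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad 0 \le \gamma \le 1
```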

Page 18

Monte Carlo Methods

• Instead of the state-value function Vπ(s)

• Compute the action-value function Qπ(s, a)

• Qπ(s, a): Expected reward when starting in state s, taking action a, and thereafter following policy π
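A rough sketch of how Qπ(s, a) can be estimated by Monte Carlo: run episodes under π and average the returns observed after each (s, a) pair. The episode-generation function is an assumed placeholder, not part of the slides.

```python
from collections import defaultdict

def mc_q_estimate(generate_episode, policy, gamma=0.9, n_episodes=1000):
    """Every-visit Monte Carlo estimate of Q_pi(s, a).

    generate_episode(policy) is assumed to return a list of (state, action, reward)
    triples for one episode played with the given policy.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(policy)
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each step onward.
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[(state, action)] += G
            returns_cnt[(state, action)] += 1
    return {sa: returns_sum[sa] / returns_cnt[sa] for sa in returns_sum}
```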

Page 19

Monte-Carlo Tree Search

• Builds a tree rooted at current state by repeated Monte-Carlo simulation of a “rollout policy”

• Key Idea: Use statistics of previous trajectories to expand the tree in most promising direction

• Needs no heuristic function, unlike A* and branch-and-bound methods

Kocsis & Szepesvari, 2006; Browne et al., 2012

Page 20

Monte-Carlo Tree Search

[Figure: the four MCTS phases: selection (select the best state so far), expansion (take an action and move to a new state), simulation, and backpropagation of the total reward of the simulation. Repeated until the maximum tree depth is reached.]

Page 21

Monte-Carlo Tree Search

• During construction, each tree node s stores:
  – state-visitation count n(s)
  – action counts n(s,a)
  – action values Q(s,a)

• Repeat until time is up:
  1. Select action a
  2. Update statistics of each node s on the trajectory:
     • Increment n(s) and n(s,a) for the selected action a
     • Update Q(s,a) with the total reward of the simulation
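A compact sketch (not from the slides) of the node statistics and update just described; the action-selection rule used here is the UCT rule of Kocsis & Szepesvari cited on the next page, and the surrounding simulator/rollout interface is assumed.

```python
import math
from collections import defaultdict

class MCTSNode:
    """Per-state statistics stored during tree construction."""

    def __init__(self, actions):
        self.n = 0                       # n(s): state-visitation count
        self.n_a = defaultdict(int)      # n(s,a): action counts
        self.Q = defaultdict(float)      # Q(s,a): mean action values
        self.actions = actions

    def select_action(self, c=1.4):
        # UCT: exploit high Q(s,a), but explore rarely tried actions.
        def uct(a):
            if self.n_a[a] == 0:
                return float("inf")
            return self.Q[a] + c * math.sqrt(math.log(self.n) / self.n_a[a])
        return max(self.actions, key=uct)

    def update(self, action, total_reward):
        # Increment n(s) and n(s,a); move Q(s,a) toward the simulation's total reward.
        self.n += 1
        self.n_a[action] += 1
        self.Q[action] += (total_reward - self.Q[action]) / self.n_a[action]
```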

Page 23

Monte-Carlo Tree Search

UCT action selection: a* = argmax_a [ Q(s,a) + c · sqrt( ln n(s) / n(s,a) ) ]
(the first term favors exploitation, the second exploration)

Theoretically, guaranteed to converge to optimal solutions if run long enough.

Practically, it often shows good anytime behavior.

Kocsis & Szepesvari, 2006; Browne et al., 2012

Page 24


Acknowledgements

NSF IIS 1302700

DARPA MSEE FA 8650-11-1-7149
