Search and Planning for Inference and Learning in Computer Vision
Iasonas Kokkinos, Sinisa Todorovic and Matt (Tianfu) Wu

TRANSCRIPT

Page 1:


Search and Planning for Inference and Learning in Computer Vision

Iasonas Kokkinos, Sinisa Todorovic and Matt (Tianfu) Wu

Page 2:

Markov Decision Processes & Reinforcement Learning

Sinisa Todorovic and Iasonas Kokkinos, June 7, 2015

Page 3:

Multi-Armed Bandit Problem

• A gambler faces K slot machines ("armed bandits")
• Each machine provides a random reward from an unknown distribution specific to that machine
• Problem: in which order to play the machines to maximize the sum of rewards over a sequence of lever pulls

[Figure: a single state s with arms a1, a2, …, aK and rewards R(s, a1), R(s, a2), …, R(s, aK)]

Robbins, 1952
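A minimal simulation sketch of the bandit setting, not from the slides: K Bernoulli arms played with an epsilon-greedy rule. The arm probabilities and the epsilon value are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(arm_probs, pulls=10000, epsilon=0.1):
    """Play K Bernoulli bandit arms with an epsilon-greedy rule."""
    K = len(arm_probs)
    counts = [0] * K      # n(a): times each arm was pulled
    values = [0.0] * K    # Q(a): running mean reward per arm
    total_reward = 0.0
    for _ in range(pulls):
        if random.random() < epsilon:                 # explore
            a = random.randrange(K)
        else:                                         # exploit
            a = max(range(K), key=lambda i: values[i])
        r = 1.0 if random.random() < arm_probs[a] else 0.0
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]      # incremental mean
        total_reward += r
    return total_reward, values

# Hypothetical arms: the third machine is best, but the gambler does not know that.
reward, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.7])
print(reward, estimates)
```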

Page 4:

Outline

• Stochastic Process
• Markov Property
• Markov Chain
• Markov Decision Process
• Reinforcement Learning

Page 5:

Discrete Stochastic Process

• A collection of indexed random variables with well-defined ordering

• Characterized by probabilities that the variables take given values, called states

Andrey Markov

Page 6:

Stochastic Process Example

• Classic: Random Walk

– Start at state X0 at time t0

– At time ti, move a step Zi, where P(Zi = -1) = p and P(Zi = +1) = 1 - p

– At time ti, the state is Xi = X0 + Z1 + … + Zi

http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
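A short sketch of the random walk just described, with the step probability p treated as a parameter (assumed 0.5 here):

```python
import random

def random_walk(steps, p=0.5, x0=0):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z = -1) = p, P(Z = +1) = 1 - p."""
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1
        x += z
        path.append(x)
    return path

print(random_walk(10))
```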

Page 7:

Markov Property

• Also thought of as the “memoryless” property

• The probability that Xn+1 takes any given value depends only on Xn, not on the earlier history

Page 8:

Markov Chain

• Discrete-time stochastic process with the Markov property

• Example: Google’s PageRank, the likelihood that random link-following ends up on a given page

http://en.wikipedia.org/wiki/PageRank
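A toy power-iteration sketch of the PageRank idea: the stationary distribution of the random surfer's Markov chain. The three-page link graph and the damping factor are illustrative assumptions, not Google's actual data, and every page is assumed to have at least one outlink.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power iteration on the random-surfer Markov chain.

    links[i] is the list of pages that page i links to
    (assumed non-empty: no dangling nodes in this toy graph).
    """
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n   # teleportation mass
        for i, outs in enumerate(links):
            share = damping * rank[i] / len(outs)
            for j in outs:                # spread rank along outlinks
                new[j] += share
        rank = new
    return rank

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
print(pagerank([[1, 2], [2], [0]]))
```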

Page 9:

Markov Decision Process (MDP)

• Discrete-time stochastic control process
• Extension of Markov chains
• Differences:
– Addition of actions (choice)
– Addition of rewards (motivation)

• If the actions are fixed, an MDP reduces to a Markov chain

Page 10:

Description of MDPs

• Tuple (S, A, P(.,.), R(.))
– S -> state space
– A -> action space
– Pa(s, s’) = Pr(st+1 = s’ | st = s, at = a)
– R(s) = immediate reward at state s

• Goal: maximize a cumulative function of the rewards (the utility function)
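The tuple can be written down directly as plain data structures. This tiny two-state sketch (the state and action names are made up) just illustrates the shapes of S, A, Pa(s, s') and R(s):

```python
S = ["s0", "s1"]              # state space
A = ["stay", "move"]          # action space
# Pa(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a), stored as P[a][s][s']
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.7, "s1": 0.3}},
}
R = {"s0": 0.0, "s1": 1.0}    # immediate reward at each state

# Sanity check: each transition distribution sums to 1.
for a in A:
    for s in S:
        assert abs(sum(P[a][s].values()) - 1.0) < 1e-9
```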

Page 11:

Example MDP

[Figure: example MDP graph with state nodes and action nodes]

Page 12:

Solution to an MDP = Policy π

• Given a state, the policy selects the optimal action regardless of history

Value function: Vπ(s) = E[ R(s0) + γ R(s1) + γ^2 R(s2) + … | s0 = s, at = π(st) ]

Page 13:

Learning Policy

• Value Iteration (a minimal code sketch follows this list)

• Policy Iteration

• Modified Policy Iteration

• Prioritized Sweeping
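A minimal value-iteration sketch: repeat the Bellman backup V(s) <- R(s) + γ · max_a Σ_s' Pa(s, s') V(s') until the values stop changing. The discount γ = 0.9 and the toy two-state MDP (same shapes as the Page 10 sketch) are assumptions for illustration.

```python
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-6):
    """Bellman backups V(s) <- R(s) + gamma * max_a sum_s' P[a][s][s'] V(s')."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {
            s: R[s] + gamma * max(
                sum(P[a][s][s2] * V[s2] for s2 in S) for a in A
            )
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new

# Toy two-state MDP, repeated here so the sketch is self-contained:
S = ["s0", "s1"]
A = ["stay", "move"]
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.7, "s1": 0.3}},
}
R = {"s0": 0.0, "s1": 1.0}
print(value_iteration(S, A, P, R))
```

The greedy policy can then be read off by taking, in each state, the action with the largest one-step lookahead value.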

Page 14:

Value Iteration

k    Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1    0        0        10       10
2    0        4.5      14.5     19
3    2.03     8.55     18.55    24.18
4    4.76     11.79    19.26    29.23
5    7.45     15.30    20.81    31.82
6    10.23    17.67    22.72    33.68

Page 15:

Why So Interesting?

• Straightforward if the transition probabilities are known, but...

• If the transition probabilities are unknown, then this problem is reinforcement learning.

Page 16:

A Typical Agent

• In reinforcement learning (RL), an agent observes a state and takes an action.

• Afterwards, the agent receives a reward.

Page 17:

Mission: Optimize Reward

• Rewards are calculated in the environment
• Used to teach the agent how to reach a goal state
• Must signal what we ultimately want achieved, not necessarily subgoals
• May be discounted over time
• In general, seek to maximize the expected return
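The discounted return the last bullets refer to is the geometrically weighted sum G = r0 + γ r1 + γ^2 r2 + …; a one-function sketch, where the reward sequence and γ are made-up inputs:

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... (computed back-to-front)."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([0, 0, 1, 5]))  # later rewards count for less
```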

Page 18:

Monte Carlo Methods

• Instead of estimating the state-value function Vπ(s),

• compute the action-value function Qπ(s, a)

• Qπ(s, a): Expected reward when starting in state s, taking action a, and thereafter following policy π
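A sketch of the Monte-Carlo idea: estimate Qπ(s, a) by averaging the discounted returns of many sampled episodes that start with (s, a) and thereafter follow π. The environment interface (the `step` and `policy` callables) is hypothetical, standing in for dynamics that are unknown to the agent.

```python
def mc_q_estimate(s, a, step, policy, episodes=1000, horizon=50, gamma=0.9):
    """Average discounted returns over sampled episodes starting with (s, a).

    step(state, action) -> (next_state, reward) samples the unknown dynamics;
    policy(state) -> action is the policy pi followed after the first action.
    Both are hypothetical callables supplied by the environment.
    """
    total = 0.0
    for _ in range(episodes):
        state, action, G, discount = s, a, 0.0, 1.0
        for _ in range(horizon):
            state, r = step(state, action)
            G += discount * r
            discount *= gamma
            action = policy(state)
        total += G
    return total / episodes
```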

Page 19:

Monte-Carlo Tree Search

• Builds a tree rooted at current state by repeated Monte-Carlo simulation of a “rollout policy”

• Key Idea: Use statistics of previous trajectories to expand the tree in the most promising direction

• No heuristic function is needed, unlike in A* and branch-and-bound methods

Kocsis & Szepesvári, 2006; Browne et al., 2012

Page 20:

Monte-Carlo Tree Search

• Selection: select the best state so far
• Expansion: take an action and move to a new state
• Simulation: run the rollout policy from the new state
• Backpropagation: propagate the total reward of the simulation back up the tree

Repeated until the maximum tree depth is reached

Page 21:

Monte-Carlo Tree Search

• During construction each tree node s stores:
– state-visitation count n(s)
– action counts n(s, a)
– action values Q(s, a)

• Repeat until time is up:
1. Select action a
2. Update statistics of each node s on the trajectory:
• Increment n(s) and n(s, a) for the selected action a
• Update Q(s, a) by the total reward of the simulation


Page 23:

Monte-Carlo Tree Search

UCT selection rule: a* = argmax_a [ Q(s, a) + c * sqrt( ln n(s) / n(s, a) ) ]; the first term favors exploitation, the second exploration.

Theoretically, guaranteed to converge to optimal solutions if run long enough.

Practically, it often shows good anytime behavior.

Kocsis & Szepesvári, 2006; Browne et al., 2012
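A compact sketch of the UCT rule above, using the n(s), n(s, a), Q(s, a) statistics from Page 21; the exploration constant c is an assumed tuning parameter, and the running-mean update is one common way to realize "update Q(s, a) by the total reward of the simulation".

```python
import math

def uct_select(actions, n_s, n_sa, Q, c=1.4):
    """argmax_a [ Q(s,a) + c * sqrt( ln n(s) / n(s,a) ) ]."""
    def score(a):
        if n_sa[a] == 0:
            return float("inf")   # try every action at least once
        return Q[a] + c * math.sqrt(math.log(n_s) / n_sa[a])
    return max(actions, key=score)

def backpropagate(n_sa, Q, a, total_reward):
    """Backpropagation step at one node: update count and running-mean value."""
    n_sa[a] += 1
    Q[a] += (total_reward - Q[a]) / n_sa[a]
```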

Page 24:


Acknowledgements

NSF IIS 1302700

DARPA MSEE FA 8650-11-1-7149