Machine Learning – Lecture 16: Approximate Inference
Perceptual and Sensory Augmented Computing, Machine Learning, Summer’10
07.07.2010
Bastian Leibe, RWTH Aachen
http://www.mmp.rwth-aachen.de
[email protected]


Page 1:

Machine Learning – Lecture 16

Approximate Inference

07.07.2010

Bastian Leibe, RWTH Aachen, http://www.mmp.rwth-aachen.de

[email protected]

Page 2:

Announcements

• Exercise 5 is available
  - Junction tree algorithm
  - st-mincut
  - Graph cuts
  - Sampling
  - MCMC

Page 3:

Course Outline

• Fundamentals (2 weeks)
  - Bayes Decision Theory
  - Probability Density Estimation
• Discriminative Approaches (4 weeks)
  - Lin. Discriminants, SVMs, Boosting
  - Dec. Trees, Random Forests, Model Sel.
• Generative Models (5 weeks)
  - Bayesian Networks + Applications
  - Markov Random Fields + Applications
  - Exact Inference
  - Approximate Inference
• Unifying Perspective (1 week)

Page 4:

Recap: MRF Structure for Images

• Basic structure
  (Figure: grid of nodes for the "true" image content, each linked to a noisy observation.)

• Two components
  Observation model
  – How likely is it that node x_i has label L_i given observation y_i?
  – This relationship is usually learned from training data.

  Neighborhood relations
  – Simplest case: 4-neighborhood.
  – Serve as smoothing terms: they discourage neighboring pixels from having different labels.
  – These can either be learned or be set to fixed "penalties".

Page 5:

Recap: How to Set the Potentials?

• Unary potentials
  E.g. a color model, modeled with a Mixture of Gaussians:

    \phi(x_i, y_i; \theta_\phi) = \log \sum_k \theta_\phi(x_i, k) \, p(k \mid x_i) \, \mathcal{N}(y_i; \bar{y}_k, \Sigma_k)

  Learn the color distributions for each label.

  (Figure: the unary potentials \phi(x_p = 1, y_p) and \phi(x_p = 0, y_p) plotted as functions of the observation y_p.)
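To make this concrete, here is a small numeric sketch (not from the lecture) of such a mixture-of-Gaussians unary potential for two labels, with 1-D observations and made-up mixture parameters; the per-component weights are assumed to already absorb the factor p(k | x_i).

```python
# Minimal sketch of a GMM-based unary potential: phi(label, y) is the
# log-likelihood of observation y under the color model of that label.
# All mixture parameters below are made-up illustration values.
import numpy as np

def gaussian(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def unary_potential(label, y, models):
    weights, mus, variances = models[label]
    return np.log(sum(w * gaussian(y, m, v)
                      for w, m, v in zip(weights, mus, variances)))

# Hypothetical learned color models: label 0 = background, label 1 = foreground.
models = {
    0: ([0.7, 0.3], [0.2, 0.4], [0.01, 0.02]),
    1: ([0.5, 0.5], [0.7, 0.9], [0.02, 0.01]),
}
print(unary_potential(0, 0.25, models), unary_potential(1, 0.25, models))
```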

Page 6:

Recap: How to Set the Potentials?

• Pairwise potentials
  Potts model:

    \psi(x_i, x_j; \theta_\psi) = \theta_\psi \, \delta(x_i \neq x_j)

  – Simplest discontinuity-preserving model.
  – Discontinuities between any pair of labels are penalized equally.
  – Useful when labels are unordered or the number of labels is small.

  Extension: "contrast-sensitive Potts model"

    \psi(x_i, x_j, g_{ij}(y); \theta_\psi) = \theta_\psi \, g_{ij}(y) \, \delta(x_i \neq x_j)

  where

    g_{ij}(y) = \exp\left( - \frac{(y_i - y_j)^2}{2 \, \overline{(y_i - y_j)^2}} \right)

  (the bar denotes an average over the image).

  – Discourages label changes except in places where there is also a large change in the observations.
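One plausible reading of the contrast term (an assumption, not spelled out on the slide) is that the average in the denominator is the mean squared intensity difference over neighboring pixel pairs. A small numpy sketch under that assumption, for horizontal neighbors of a grayscale image:

```python
# Minimal sketch of contrast-sensitive Potts weights g_ij(y) for horizontal
# neighbor pairs; g_ij is close to 1 in flat regions and small across strong
# edges, so label changes are cheaper where the image changes a lot.
import numpy as np

def contrast_weights(y):
    diff2 = (y[:, :-1] - y[:, 1:]) ** 2   # squared differences of neighbors
    avg = diff2.mean()                    # assumed normalizer: average contrast over the image
    return np.exp(-diff2 / (2.0 * avg))

rng = np.random.default_rng(0)
y = rng.random((4, 5))                    # toy grayscale image
g = contrast_weights(y)
print(g.shape, float(g.min()), float(g.max()))
```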

Page 7:

Recap: Graph Cuts for Binary Problems

(Figure: s-t graph for binary segmentation. Pixels are connected to their neighbors by n-links with weights w_pq and to the terminals s and t by t-links with weights D_p(s) and D_p(t); a cut separates the two terminals.)

  D_p(s) = \exp\left( -\|I_p - I_s\|^2 / 2\sigma^2 \right)
  D_p(t) = \exp\left( -\|I_p - I_t\|^2 / 2\sigma^2 \right)

EM-style optimization: the "expected" intensities I_s and I_t of object and background can be re-estimated.

[Boykov & Jolly, ICCV'01]
Slide credit: Yuri Boykov

Page 8:

Recap: s-t-Mincut Equivalent to Maxflow

(Figure: small example flow network from Source to Sink through nodes v1 and v2 with edge capacities; initial flow = 0.)

Augmenting-path based algorithms:
1. Find a path from source to sink with positive capacity.
2. Push the maximum possible flow through this path.
3. Repeat until no path can be found.

The algorithms assume non-negative capacities.

Slide credit: Pushmeet Kohli
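As an illustration of the augmenting-path scheme, here is a minimal Edmonds-Karp style max-flow sketch (BFS for the augmenting path). It is a generic solver on a small capacity matrix, not the specialized solvers used for vision graphs (e.g. Boykov-Kolmogorov), and the example capacities are made up.

```python
# Minimal augmenting-path max-flow (Edmonds-Karp): repeatedly find a path with
# positive residual capacity via BFS, push the bottleneck flow, stop when no
# path remains. The max-flow value equals the s-t min-cut capacity.
from collections import deque

def max_flow(capacity, source, sink):
    n = len(capacity)
    residual = [row[:] for row in capacity]   # residual capacities
    total = 0
    while True:
        # 1. Find a source-sink path with positive residual capacity (BFS).
        parent = [-1] * n
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:                # 3. No augmenting path left: done.
            return total
        # 2. Push the maximum possible flow (bottleneck) along the path.
        bottleneck = float('inf')
        v = sink
        while v != source:
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u
        v = sink
        while v != source:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck      # reverse edge lets flow be undone
            v = u
        total += bottleneck

# Toy 4-node network (0 = source, 3 = sink) with hypothetical capacities.
caps = [[0, 5, 9, 0],
        [0, 0, 2, 4],
        [0, 1, 0, 2],
        [0, 0, 0, 0]]
print(max_flow(caps, 0, 3))
```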

Page 9:

Recap: When Can s-t Graph Cuts Be Applied?

• s-t graph cuts can only globally minimize binary energies that are submodular.

    E(L) = \sum_p E_p(L_p) + \sum_{(p,q) \in N} E(L_p, L_q), \qquad L_p \in \{s, t\}

  The regional term E_p(L_p) is encoded by t-links, the boundary term E(L_p, L_q) by n-links.

  E(L) can be minimized by s-t graph cuts if the pairwise terms satisfy the submodularity ("convexity") condition

    E(s, s) + E(t, t) \leq E(s, t) + E(t, s)

• Submodularity is the discrete equivalent of convexity.
  It implies that every local energy minimum is a global minimum.
  The solution will be globally optimal.

[Boros & Hammer, 2002; Kolmogorov & Zabih, 2004]

Page 10:

Recap: Simple Binary Image Denoising Model

• MRF structure
  (Figure: grid of nodes for the "true" image content, connected by "smoothness constraints"; each node is linked to a noisy observation via the observation process. Image source: C. Bishop, 2006.)

  Example: simple energy function ("Potts model")

    E(x, y) = h \sum_i x_i + \beta \sum_{\{i,j\}} \delta(x_i \neq x_j) - \eta \sum_i x_i y_i, \qquad x_i, y_i \in \{-1, 1\}

  with a prior term, a smoothness term, and an observation term.

  – Smoothness term: fixed penalty \beta if neighboring labels disagree.
  – Observation term: fixed penalty \eta if label and observation disagree.
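A small numpy sketch of evaluating this energy for a given labeling, assuming x and y are arrays with entries in {-1, +1} and a 4-neighborhood; the parameter values are arbitrary illustration choices.

```python
# E(x, y) = h*sum_i x_i + beta*sum_{i,j} delta(x_i != x_j) - eta*sum_i x_i*y_i
import numpy as np

def denoising_energy(x, y, h=0.0, beta=1.0, eta=2.1):
    prior = h * x.sum()
    # smoothness: count disagreements between horizontal and vertical neighbors
    smooth = beta * ((x[:, :-1] != x[:, 1:]).sum() + (x[:-1, :] != x[1:, :]).sum())
    observation = -eta * (x * y).sum()
    return prior + smooth + observation

rng = np.random.default_rng(0)
y = np.where(rng.random((8, 8)) < 0.5, -1, 1)   # toy noisy observation in {-1, +1}
print(denoising_energy(y.copy(), y))             # labeling x = y
print(denoising_energy(-y, y))                   # flipping all labels has higher energy
```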

Page 11:

Converting an MRF into an s-t Graph

• Conversion:
  (Figure: each pixel node x_i gets t-links to Source and Sink; the two sides correspond to the labels L_i = 1 and L_i = -1, with t-link weights h - \eta y_i and -h + \eta y_i respectively.)

• Energy:

    E(x, y) = h \sum_i x_i + \beta \sum_{\{i,j\}} \delta(x_i \neq x_j) - \eta \sum_i x_i y_i

  Unary potentials are straightforward to set.
  – How?
  – Just insert x_i = 1 and x_i = -1 into the unary terms above: this gives the two t-link weights h - \eta y_i and -h + \eta y_i.

Page 12:

Converting an MRF into an s-t Graph

• Conversion:
  (Figure: as before, each pixel node has t-links to Source (L_i = 1) and Sink (L_i = -1) with weights h - \eta y_i and -h + \eta y_i; in addition, neighboring nodes x_i and x_j are connected by an n-link of weight \beta.)

• Energy:

    E(x, y) = h \sum_i x_i + \beta \sum_{\{i,j\}} \delta(x_i \neq x_j) - \eta \sum_i x_i y_i

  Unary potentials are straightforward to set. How?
  Pairwise potentials are more tricky, since we don't know x_i!
  – Trick: the pairwise energy only has an influence if x_i \neq x_j.
  – (Only!) in this case, the cut will go through the edge {x_i, x_j}.

Page 13:

Dealing with Non-Binary Cases

• The limitation to binary energies is often a nuisance.
  E.g. binary segmentation only…

• We would also like to solve multi-label problems.
  The bad news: the problem is NP-hard with 3 or more labels!

• There exist approximation algorithms which extend graph cuts to the multi-label case:
  - α-expansion
  - αβ-swap

• They are no longer guaranteed to return the globally optimal result.
  But α-expansion has a guaranteed approximation quality (2-approx) and converges in a few iterations.

Page 14:

α-Expansion Move

• Basic idea: break the multi-way cut computation into a sequence of binary s-t cuts.
  (Figure: the label α expands against the region currently occupied by the other labels.)

  No longer a globally optimal result, but guaranteed approximation quality, and it typically converges in a few iterations.

Slide credit: Yuri Boykov

Page 15:

α-Expansion Algorithm

1. Start with any initial solution.
2. For each label "α" in any (e.g. random) order:
   1. Compute the optimal α-expansion move (s-t graph cuts).
   2. Decline the move if there is no energy decrease.
3. Stop when no expansion move would decrease the energy.

Slide credit: Yuri Boykov

Page 16:

Example: Stereo Vision

(Figure: original pair of "stereo" images, the computed depth map, and the ground-truth depth map.)

Slide credit: Yuri Boykov

Page 17:

α-Expansion Moves

• In each α-expansion a given label "α" grabs space from the other labels.

  (Figure: initial solution followed by a sequence of expansion moves, one per label.)

  For each move, we choose the expansion that gives the largest decrease in the energy: a binary optimization problem.

Slide credit: Yuri Boykov

Page 18:

GraphCut Applications: “GrabCut”

(Figure: example image with user segmentation cues and additional segmentation cues marked by brush strokes.)

• Interactive image segmentation [Boykov & Jolly, ICCV'01]
  - Rough region cues are sufficient.
  - The segmentation boundary can be extracted from edges.

• Procedure
  The user marks foreground and background regions with a brush.
  This is used to create an initial segmentation, which can then be corrected by additional brush strokes.

Slide credit: Matthieu Bray

Page 19:

Iterated Graph Cuts

(Figure: GrabCut iterations 1-4, showing the segmentation result, the energy after each iteration, and the evolving foreground/background color models (Mixtures of Gaussians in R-G color space).)

Slide credit: Carsten Rother

Page 20:

GrabCut: Example Results


Page 21:

Applications: Interactive 3D Segmentation

Slide credit: Yuri Boykov. [Y. Boykov, V. Kolmogorov, ICCV'03]

Page 22:

Topics of This Lecture

• Approximate Inference
  - Variational methods
  - Sampling approaches
• Sampling approaches
  - Sampling from a distribution
  - Ancestral sampling
  - Rejection sampling
  - Importance sampling
• Markov Chain Monte Carlo
  - Markov chains
  - Metropolis algorithm
  - Metropolis-Hastings algorithm
  - Gibbs sampling

Page 23:

Approximate Inference

• Exact Bayesian inference is often intractable.
  It is often infeasible to evaluate the posterior distribution or to compute expectations w.r.t. the distribution,
  – e.g. because the dimensionality of the latent space is too high,
  – or because the posterior distribution has too complex a form.

  Problems with continuous variables
  – Required integrations may not have closed-form solutions.

  Problems with discrete variables
  – Marginalization involves summing over all possible configurations of the hidden variables.
  – There may be exponentially many such states.

  We need to resort to approximation schemes.

Page 24:

Two Classes of Approximation Schemes

• Deterministic approximations (variational methods)
  Based on analytical approximations to the posterior distribution,
  – e.g. by assuming that it factorizes in a certain form,
  – or that it has a certain parametric form (e.g. a Gaussian).
  Can never generate exact results, but are often scalable to large applications.

• Stochastic approximations (sampling methods)
  Given infinite computational resources, they can generate exact results.
  The approximation arises from the use of a finite amount of processor time.
  They enable the use of Bayesian techniques across many domains.
  But: computationally demanding, often limited to small-scale problems.

Page 25:

Topics of This Lecture

• Approximate Inference
  - Variational methods
  - Sampling approaches
• Sampling approaches
  - Sampling from a distribution
  - Ancestral sampling
  - Rejection sampling
  - Importance sampling
• Markov Chain Monte Carlo
  - Markov chains
  - Metropolis algorithm
  - Metropolis-Hastings algorithm
  - Gibbs sampling

Page 26:

Sampling Idea

• Objective: evaluate the expectation of a function f(z) w.r.t. a probability distribution p(z):

    \mathbb{E}[f] = \int f(z) \, p(z) \, dz

• Sampling idea
  Draw L independent samples z^{(l)}, l = 1, …, L, from p(z).
  This allows the expectation to be approximated by a finite sum:

    \hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})

  As long as the samples z^{(l)} are drawn independently from p(z), \mathbb{E}[\hat{f}] = \mathbb{E}[f]:
  an unbiased estimate, independent of the dimensionality of z!

Slide adapted from Bernt Schiele
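A minimal sketch of this estimator, assuming p(z) is a standard normal (so we can sample from it directly) and f(z) = z^2, for which the true expectation is 1:

```python
# Monte Carlo estimate f_hat = (1/L) * sum_l f(z^(l)) with z^(l) ~ p(z).
import numpy as np

rng = np.random.default_rng(0)
L = 10_000
z = rng.standard_normal(L)      # L independent samples from p(z) = N(0, 1)
f_hat = np.mean(z ** 2)         # finite-sum approximation of E[f]
print(f_hat)                    # close to the true value E[z^2] = 1
```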

Page 27:

Sampling – Challenges

• Problem 1: The samples might not be independent.
  The effective sample size might be much smaller than the apparent sample size.

• Problem 2: If f(z) is small in regions where p(z) is large, and vice versa, the expectation may be dominated by regions of small probability.
  Large sample sizes are then necessary to achieve sufficient accuracy.

Page 28:

Parametric Density Model

• Example: a simple multivariate (D-dimensional) Gaussian model

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

  This is a "generative" model in the sense that we can generate samples x according to the distribution.

Slide adapted from Bernt Schiele

Page 29:

Sampling from a Gaussian

• Given: a 1-dim. Gaussian pdf (probability density function) p(x | \mu, \sigma^2) and the corresponding cumulative distribution:

    F_{\mu,\sigma^2}(x) = \int_{-\infty}^{x} p(\tilde{x} \mid \mu, \sigma^2) \, d\tilde{x}

• To draw samples from a Gaussian, we can invert the cumulative distribution function:

    u \sim \text{Uniform}(0, 1) \;\Rightarrow\; F^{-1}_{\mu,\sigma^2}(u) \sim p(x \mid \mu, \sigma^2)

  (Figure: the pdf p(x | \mu, \sigma^2) and its cumulative distribution F_{\mu,\sigma^2}(x).)

Slide credit: Bernt Schiele

Page 30:

Sampling from a pdf (Transformation method)

• In general, assume we are given the pdf p(x) and the corresponding cumulative distribution:

    F(x) = \int_{-\infty}^{x} p(z) \, dz

• To draw samples from this pdf, we can invert the cumulative distribution function:

    u \sim \text{Uniform}(0, 1) \;\Rightarrow\; F^{-1}(u) \sim p(x)

Slide credit: Bernt Schiele
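A small sketch of the transformation method, using an exponential distribution as the target because its cumulative distribution F(x) = 1 - exp(-lam*x) can be inverted in closed form (the target and rate are illustration choices, not from the slides):

```python
# Inverse-CDF (transformation) sampling for p(x) = lam * exp(-lam * x), x >= 0.
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(0.0, 1.0, size=100_000)   # u ~ Uniform(0, 1)
x = -np.log(1.0 - u) / lam                # x = F^{-1}(u), so x ~ p(x)
print(x.mean())                           # close to the true mean 1/lam = 0.5
```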

Page 31:

Ancestral Sampling

• Generalization of this idea to directed graphical models.
  The joint probability factorizes into conditional probabilities:

    p(x) = \prod_n p(x_n \mid \text{pa}_n)

• Ancestral sampling
  Assume the variables are ordered such that there are no links from any node to a lower-numbered node.
  Start with the lowest-numbered node and draw a sample from its distribution:

    \hat{x}_1 \sim p(x_1)

  Cycle through each of the nodes in order and draw samples from the conditional distribution (with the parent variables set to their sampled values):

    \hat{x}_n \sim p(x_n \mid \text{pa}_n)
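A minimal sketch of ancestral sampling on a hypothetical three-variable chain x1 -> x2 -> x3 of binary variables; the conditional probability tables are made-up numbers:

```python
# Sample each node in topological order, conditioning on its sampled parent.
import numpy as np

rng = np.random.default_rng(0)

def sample_chain():
    x1 = rng.random() < 0.6                      # x1 ~ p(x1)
    x2 = rng.random() < (0.9 if x1 else 0.2)     # x2 ~ p(x2 | x1)
    x3 = rng.random() < (0.7 if x2 else 0.1)     # x3 ~ p(x3 | x2)
    return int(x1), int(x2), int(x3)

samples = [sample_chain() for _ in range(10_000)]
print(np.mean([s[2] for s in samples]))          # Monte Carlo estimate of p(x3 = 1)
```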

Page 32:

Discussion

• Transformation method
  Limited applicability, as we need to invert the indefinite integral of the required distribution p(z).

• More general approaches
  - Rejection sampling
  - Importance sampling

Slide credit: Bernt Schiele

Page 33:

Rejection Sampling

• Assumptions
  Sampling directly from p(z) is difficult.
  But we can easily evaluate p(z) (up to some normalization factor Z_p):

    p(z) = \frac{1}{Z_p} \tilde{p}(z)

• Idea
  We need some simpler distribution q(z) (called the proposal distribution) from which we can draw samples.
  Choose a constant k such that:

    \forall z: \; k \, q(z) \geq \tilde{p}(z)

Slide credit: Bernt Schiele

Page 34:

Rejection Sampling

• Sampling procedure
  Generate a number z_0 from q(z).
  Generate a number u_0 from the uniform distribution over [0, k q(z_0)].
  If u_0 > \tilde{p}(z_0), reject the sample, otherwise accept it.
  – A sample is rejected if it lies in the grey shaded area (between \tilde{p}(z) and k q(z) in the accompanying figure).
  – The remaining pairs (u_0, z_0) have uniform distribution under the curve \tilde{p}(z).

• Discussion
  The original values of z are generated from the distribution q(z).
  Samples are accepted with probability \tilde{p}(z) / (k q(z)):

    p(\text{accept}) = \int \frac{\tilde{p}(z)}{k q(z)} \, q(z) \, dz = \frac{1}{k} \int \tilde{p}(z) \, dz

  k should be as small as possible!

Slide credit: Bernt Schiele
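A small sketch of this procedure, assuming a made-up unnormalized target p_tilde(z) (a two-component Gaussian mixture) and a Gaussian proposal q(z), with k chosen large enough that k*q(z) >= p_tilde(z) everywhere:

```python
# Rejection sampling: draw z0 ~ q, u0 ~ Uniform[0, k*q(z0)], accept if u0 <= p_tilde(z0).
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):                 # unnormalized target (illustration only)
    return np.exp(-0.5 * (z - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (z + 2.0) ** 2)

def q(z):                       # proposal density N(0, 3^2)
    return np.exp(-0.5 * (z / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

k = 15.0                        # large enough that k*q(z) >= p_tilde(z) for this pair

accepted = []
while len(accepted) < 10_000:
    z0 = rng.normal(0.0, 3.0)                  # z0 from q(z)
    u0 = rng.uniform(0.0, k * q(z0))           # u0 uniform over [0, k*q(z0)]
    if u0 <= p_tilde(z0):                      # reject if u0 > p_tilde(z0)
        accepted.append(z0)

print(np.mean(accepted))        # sample mean under the normalized target
```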

Page 35:

Rejection Sampling – Discussion

• Limitation: high-dimensional spaces
  For rejection sampling to be of practical value, we require that k q(z) be close to the required distribution, so that the rate of rejection is minimal.

• Artificial example
  Assume that p(z) is Gaussian with covariance matrix \sigma_p^2 I.
  Assume that q(z) is Gaussian with covariance matrix \sigma_q^2 I.
  Obviously: \sigma_q^2 \geq \sigma_p^2.
  In D dimensions: k = (\sigma_q / \sigma_p)^D.
  – Assume \sigma_q is just 1% larger than \sigma_p.
  – D = 1000 \Rightarrow k = 1.01^{1000} \geq 20{,}000
  – And p(\text{accept}) \leq \frac{1}{20{,}000}.

  It is often impractical to find good proposal distributions for high dimensions!

Slide credit: Bernt Schiele

Page 36:

Importance Sampling

• Approach
  Approximate expectations directly (but this does not enable drawing samples from p(z) directly).
  Goal:

    \mathbb{E}[f] = \int f(z) \, p(z) \, dz

• Simplistic strategy: grid sampling
  Discretize z-space into a uniform grid.
  Evaluate the integrand as a sum of the form

    \mathbb{E}[f] \approx \sum_{l=1}^{L} f(z^{(l)}) \, p(z^{(l)})

  But: the number of terms grows exponentially with the number of dimensions!

Slide credit: Bernt Schiele

Page 37:

Importance Sampling

• Idea
  Use a proposal distribution q(z) from which it is easy to draw samples.
  Express expectations in the form of a finite sum over samples {z^{(l)}} drawn from q(z):

    \mathbb{E}[f] = \int f(z) \frac{p(z)}{q(z)} q(z) \, dz \approx \frac{1}{L} \sum_{l=1}^{L} r_l \, f(z^{(l)})

  with importance weights

    r_l = \frac{p(z^{(l)})}{q(z^{(l)})}

Slide credit: Bernt Schiele

Page 38:

Importance Sampling

• Typical setting:
  p(z) can only be evaluated up to an unknown normalization constant:

    p(z) = \tilde{p}(z) / Z_p

  q(z) can also be treated in a similar fashion:

    q(z) = \tilde{q}(z) / Z_q

  Then

    \mathbb{E}[f] \approx \frac{Z_q}{Z_p} \frac{1}{L} \sum_{l=1}^{L} \tilde{r}_l \, f(z^{(l)})

  with:

    \tilde{r}_l = \frac{\tilde{p}(z^{(l)})}{\tilde{q}(z^{(l)})}

Slide credit: Bernt Schiele

Page 39:

Importance Sampling

• The ratio of normalization constants can be evaluated:

    \frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(z) \, dz = \int \frac{\tilde{p}(z)}{\tilde{q}(z)} q(z) \, dz \simeq \frac{1}{L} \sum_{l=1}^{L} \tilde{r}_l

• and therefore

    \mathbb{E}[f] \approx \sum_{l=1}^{L} w_l \, f(z^{(l)})

• with

    w_l = \frac{\tilde{r}_l}{\sum_m \tilde{r}_m} = \frac{\tilde{p}(z^{(l)}) / \tilde{q}(z^{(l)})}{\sum_m \tilde{p}(z^{(m)}) / \tilde{q}(z^{(m)})}

Slide credit: Bernt Schiele
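Putting the pieces together, a minimal sketch of self-normalized importance sampling with the same kind of made-up unnormalized target and Gaussian proposal; here we estimate E[f(z)] for f(z) = z:

```python
# Importance sampling with self-normalized weights w_l = r_l / sum_m r_m.
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):                 # unnormalized target (illustration only)
    return np.exp(-0.5 * (z - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (z + 2.0) ** 2)

def q(z):                       # proposal density N(0, 3^2)
    return np.exp(-0.5 * (z / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

L = 100_000
z = rng.normal(0.0, 3.0, size=L)     # samples z^(l) ~ q(z)
r = p_tilde(z) / q(z)                # unnormalized importance weights
w = r / r.sum()                      # normalized weights
print(np.sum(w * z))                 # estimate of E[z] under the target
```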

Page 40:

Importance Sampling – Discussion

• Observations
  The success of importance sampling depends crucially on how well the sampling distribution q(z) matches the desired distribution p(z).
  Often, p(z) f(z) is strongly varying and has a significant proportion of its mass concentrated over small regions of z-space.
  The weights r_l may then be dominated by a few weights having large values.
  Practical issue: if none of the samples falls in the regions where p(z) f(z) is large…
  – The results may be arbitrarily in error.
  – And there will be no diagnostic indication (no large variance in the r_l)!
  Key requirement for the sampling distribution q(z):
  – It should not be small or zero in regions where p(z) is significant!

Slide credit: Bernt Schiele

Page 41:

Topics of This Lecture

• Approximate Inference
  - Variational methods
  - Sampling approaches
• Sampling approaches
  - Sampling from a distribution
  - Ancestral sampling
  - Rejection sampling
  - Importance sampling
• Markov Chain Monte Carlo
  - Markov chains
  - Metropolis algorithm
  - Metropolis-Hastings algorithm
  - Gibbs sampling

Page 42:

Independent Sampling vs. Markov Chains

• So far
  We've considered two methods, rejection sampling and importance sampling, which were both based on independent samples from q(z).
  However, for many problems of practical interest, it is difficult or impossible to find a q(z) with the necessary properties.

• Different approach
  We abandon the idea of independent sampling.
  Instead, we rely on a Markov chain to generate dependent samples from the target distribution.
  Independence would be a nice thing, but it is not necessary for the Monte Carlo estimate to be valid.

Slide credit: Zoubin Ghahramani

Page 43:

MCMC – Markov Chain Monte Carlo

• Overview
  Allows sampling from a large class of distributions.
  Scales well with the dimensionality of the sample space.

• Idea
  We maintain a record of the current state z^{(\tau)}.
  The proposal distribution depends on the current state: q(z | z^{(\tau)}).
  The sequence of samples forms a Markov chain z^{(1)}, z^{(2)}, …

• Setting
  We can evaluate p(z) (up to some normalizing factor Z_p):

    p(z) = \frac{\tilde{p}(z)}{Z_p}

  At each time step, we generate a candidate sample from the proposal distribution and accept the sample according to a criterion.

Slide credit: Bernt Schiele

Page 44:

MCMC – Metropolis Algorithm

• Metropolis algorithm [Metropolis et al., 1953]
  The proposal distribution is symmetric:

    q(z_A \mid z_B) = q(z_B \mid z_A)

  The new candidate sample z^* is accepted with probability

    A(z^*, z^{(\tau)}) = \min\left( 1, \frac{\tilde{p}(z^*)}{\tilde{p}(z^{(\tau)})} \right)

• Implementation
  Choose a random number u uniformly from the unit interval (0, 1).
  Accept the sample if A(z^*, z^{(\tau)}) > u.

• Note
  New candidate samples are always accepted if \tilde{p}(z^*) \geq \tilde{p}(z^{(\tau)}),
  – i.e. when the new sample has higher probability than the previous one.
  The algorithm sometimes accepts a state with lower probability.

Slide credit: Bernt Schiele
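A minimal sketch of the Metropolis algorithm with a symmetric Gaussian proposal; the target (a standard normal, used unnormalized) and the step size are illustration choices, not from the lecture:

```python
# Metropolis: propose z* ~ N(z, step^2), accept with prob. min(1, p_tilde(z*)/p_tilde(z)).
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(z):
    return np.exp(-0.5 * z ** 2)     # unnormalized standard normal

z = 0.0                              # current state z^(tau)
chain = []
for _ in range(50_000):
    z_star = z + rng.normal(0.0, 0.5)            # symmetric proposal
    A = min(1.0, p_tilde(z_star) / p_tilde(z))   # acceptance probability
    if A > rng.uniform():
        z = z_star                               # accept ...
    chain.append(z)                              # ... otherwise keep a copy of the old state

chain = np.array(chain[1000:])                   # discard burn-in
print(chain.mean(), chain.var())                 # roughly 0 and 1
```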

Page 45:

MCMC – Metropolis Algorithm

• Two cases
  If the new sample is accepted: z^{(\tau+1)} = z^*.
  Otherwise: z^{(\tau+1)} = z^{(\tau)}.

  This is in contrast to rejection sampling, where rejected samples are simply discarded.
  It leads to multiple copies of the same sample!

Slide credit: Bernt Schiele

Page 46:

MCMC – Metropolis Algorithm

• Property
  When q(z_A | z_B) > 0 for all z, the distribution of z^{(\tau)} tends to p(z) as \tau \to \infty.

• Note
  The sequence z^{(1)}, z^{(2)}, … is not a set of independent samples from p(z), as successive samples are highly correlated.
  We can obtain (largely) independent samples by retaining only every M-th sample.

• Example: sampling from a Gaussian
  (Figure: Metropolis sampling of a 2D Gaussian with a Gaussian proposal of \sigma = 0.2; green: accepted samples, red: rejected samples.)

Slide credit: Bernt Schiele

Page 47:

Markov Chains

• Question
  How can we show that z^{(\tau)} tends to p(z) as \tau \to \infty?

• Markov chains
  First-order Markov chain:

    p(z^{(m+1)} \mid z^{(1)}, \dots, z^{(m)}) = p(z^{(m+1)} \mid z^{(m)})

  Marginal probability:

    p(z^{(m+1)}) = \sum_{z^{(m)}} p(z^{(m+1)} \mid z^{(m)}) \, p(z^{(m)})

Slide credit: Bernt Schiele

Page 48:

Markov Chains – Properties

• Invariant distribution
  A distribution is said to be invariant (or stationary) w.r.t. a Markov chain if each step in the chain leaves that distribution invariant.
  Transition probabilities:

    T(z^{(m)}, z^{(m+1)}) = p(z^{(m+1)} \mid z^{(m)})

  The distribution p^*(z) is invariant if:

    p^*(z) = \sum_{z'} T(z', z) \, p^*(z')

• Detailed balance
  A sufficient (but not necessary) condition to ensure that a distribution is invariant:

    p^*(z) \, T(z, z') = p^*(z') \, T(z', z)

  A Markov chain which respects detailed balance is reversible.

Slide credit: Bernt Schiele
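A small numeric check of these definitions for a 3-state chain: the transition matrix T below (made-up numbers) satisfies detailed balance w.r.t. p*, and p* is then indeed invariant:

```python
# Verify detailed balance p*_i T_ij == p*_j T_ji and invariance p* @ T == p*.
import numpy as np

p_star = np.array([0.2, 0.3, 0.5])
T = np.array([[0.50, 0.25, 0.25],
              [1/6,  0.50, 1/3 ],
              [0.10, 0.20, 0.70]])

M = p_star[:, None] * T                         # M[i, j] = p*_i * T(i -> j)
print(np.allclose(T.sum(axis=1), 1.0))          # rows of T sum to 1
print(np.allclose(M, M.T))                      # detailed balance holds
print(np.allclose(p_star @ T, p_star))          # hence p* is invariant
```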

Page 49:

MCMC – Metropolis-Hastings Algorithm

• Metropolis-Hastings algorithm
  Generalization: the proposal distribution is not required to be symmetric.
  The new candidate sample z^* is accepted with probability

    A_k(z^*, z^{(\tau)}) = \min\left( 1, \frac{\tilde{p}(z^*) \, q_k(z^{(\tau)} \mid z^*)}{\tilde{p}(z^{(\tau)}) \, q_k(z^* \mid z^{(\tau)})} \right)

  where k labels the members of the set of possible transitions considered.

• Note
  When the proposal distributions are symmetric, Metropolis-Hastings reduces to the standard Metropolis algorithm.

Slide credit: Bernt Schiele

Page 50:

MCMC – Metropolis-Hastings Algorithm

• Properties
  We can show that p(z) is an invariant distribution of the Markov chain defined by the Metropolis-Hastings algorithm.
  We show detailed balance:

    p(z) \, q_k(z' \mid z) \, A_k(z', z) = \min\{ p(z) \, q_k(z' \mid z), \; p(z') \, q_k(z \mid z') \}
                                         = \min\{ p(z') \, q_k(z \mid z'), \; p(z) \, q_k(z' \mid z) \}
                                         = p(z') \, q_k(z \mid z') \, A_k(z, z')

Slide credit: Bernt Schiele

Page 51:

MCMC – Metropolis-Hastings Algorithm

• Schematic illustration
  For continuous state spaces, a common choice of proposal distribution is a Gaussian centered on the current state.
  What should the variance of the proposal distribution be?
  – Large variance: the rejection rate will be high for complex problems.
  – The scale \rho of the proposal distribution should be as large as possible without incurring high rejection rates.
  \rho should be of the same order as the smallest length scale \sigma_{\min}.
  This causes the system to explore the distribution by means of a random walk.
  – Undesired behavior: the number of steps needed to arrive at a state that is independent of the original state is of order (\sigma_{\max} / \sigma_{\min})^2.
  – Strong correlations can slow down the Metropolis algorithm!

Page 52:

Gibbs Sampling

• Approach
  An MCMC algorithm that is simple and widely applicable.
  May be seen as a special case of Metropolis-Hastings.

• Idea
  Sample variable-wise: replace z_i by a value drawn from the distribution p(z_i | z_{\setminus i}).
  – This means we update one coordinate at a time.
  Repeat the procedure either by cycling through all variables or by choosing the next variable at random.

Slide credit: Bernt Schiele

Page 53:

Gibbs Sampling

• Example
  Assume a distribution p(z_1, z_2, z_3).
  Replace z_1^{(\tau)} with a new value drawn from

    z_1^{(\tau+1)} \sim p(z_1 \mid z_2^{(\tau)}, z_3^{(\tau)})

  Replace z_2^{(\tau)} with a new value drawn from

    z_2^{(\tau+1)} \sim p(z_2 \mid z_1^{(\tau+1)}, z_3^{(\tau)})

  Replace z_3^{(\tau)} with a new value drawn from

    z_3^{(\tau+1)} \sim p(z_3 \mid z_1^{(\tau+1)}, z_2^{(\tau+1)})

  And so on…

Slide credit: Bernt Schiele

Page 54:

Gibbs Sampling

• Properties
  When the proposal for step k is the conditional q_k(z^* | z) = p(z_k^* | z_{\setminus k}), the Metropolis-Hastings acceptance probability becomes

    A(z^*, z) = \frac{p(z^*) \, q_k(z \mid z^*)}{p(z) \, q_k(z^* \mid z)}
              = \frac{p(z_k^* \mid z_{\setminus k}^*) \, p(z_{\setminus k}^*) \, p(z_k \mid z_{\setminus k}^*)}{p(z_k \mid z_{\setminus k}) \, p(z_{\setminus k}) \, p(z_k^* \mid z_{\setminus k})} = 1

  since z_{\setminus k}^* = z_{\setminus k}.
  I.e. we get an algorithm which always accepts!

  If you can compute (and sample from) the conditionals, you can apply Gibbs sampling.
  The algorithm is completely parameter free.
  It can also be applied to subsets of variables.

Slide credit: Zoubin Ghahramani

Page 55:

Gibbs Sampling

• Example
  (Figure: 20 iterations of Gibbs sampling on a bivariate Gaussian.)

  Note: strong correlations can slow down Gibbs sampling.

Slide credit: Zoubin Ghahramani
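A minimal sketch of Gibbs sampling for a zero-mean bivariate Gaussian with correlation rho, using the exact Gaussian conditionals; rho and the iteration count are illustration values:

```python
# Alternate draws from p(z1 | z2) = N(rho*z2, 1-rho^2) and p(z2 | z1) = N(rho*z1, 1-rho^2).
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9
cond_std = np.sqrt(1.0 - rho ** 2)

z1, z2 = 0.0, 0.0
samples = []
for _ in range(20_000):
    z1 = rng.normal(rho * z2, cond_std)   # z1 ~ p(z1 | z2)
    z2 = rng.normal(rho * z1, cond_std)   # z2 ~ p(z2 | z1)
    samples.append((z1, z2))

samples = np.array(samples[1000:])        # discard burn-in
print(np.corrcoef(samples.T)[0, 1])       # close to rho; large rho means slow mixing
```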

Page 56:

Summary: Approximate Inference

• Exact Bayesian inference is often intractable.

• Rejection and importance sampling
  - Generate independent samples.
  - Impractical in high-dimensional state spaces.

• Markov Chain Monte Carlo (MCMC)
  - Simple & effective (even though typically computationally expensive).
  - Scales well with the dimensionality of the state space.
  - Issues of convergence have to be considered carefully.

• Gibbs sampling
  - Used extensively in practice.
  - Parameter free.
  - Requires sampling from the conditional distributions.

Page 57:

References and Further Reading

• Sampling methods for approximate inference are described in detail in Chapter 11 of Bishop’s book.

• Another good introduction to Monte Carlo methods can be found in Chapter 29 of MacKay’s book (also available online: http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html)

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

David MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.