TRANSCRIPT
Computer Vision Lab. SNU
Young Ki Baik
An Introduction to MCMC for Machine Learning (Markov Chain Monte Carlo)
References
• Andrieu et al., "An Introduction to MCMC for Machine Learning," Machine Learning, 2003.
• David MacKay, "Introduction to Monte Carlo Methods."
• Zhu, Dellaert and Tu, "Markov Chain Monte Carlo for Computer Vision," a tutorial at ICCV 2005. http://civs.stat.ucla.edu/MCMC/MCMC_tutorial.htm
• Various MCMC presentations on the web.
Contents
• MCMC
• Metropolis-Hastings algorithm
• Mixtures and cycles of MCMC kernels
• Auxiliary variable samplers
• Adaptive MCMC
• Other applications of MCMC
• Convergence problems and tricks of MCMC
• Remaining problems
• Conclusion
MCMC
• Problem of MC (Monte Carlo)
  • Assembling the entire distribution for MC is usually hard:
    • Complicated energy landscapes
    • High-dimensional systems
    • Extraordinarily difficult normalization
• Solution: MCMC
  • Build up the distribution from a Markov chain.
  • Choose local transition probabilities that generate the distribution of interest (ensure detailed balance).
  • Each random variable is chosen based on the previous variable in the chain.
  • "Walk" along the Markov chain until convergence is reached.
• Result: Normalization is not required, and the calculations are local.
MCMC
• What is a Markov chain?
  • A Markov chain is a mathematical model for a stochastic system that generates random variables X1, X2, ..., Xt.
  • The distribution of the next random variable depends only on the current random variable:
    p(x_{t+1} | x_1, x_2, ..., x_t) = p(x_{t+1} | x_t)
  • The entire chain represents a stationary probability distribution.
MCMC
• What is Markov Chain Monte Carlo?
  • MCMC is a general-purpose technique for generating fair samples from a probability distribution in a high-dimensional space, using random numbers (dice) drawn from a uniform distribution over a certain range:
    x_t ~ p(x)  (Markov chain states)
    z_t ~ unif[a, b]  (independent trials of dice)
MCMC
• MCMC as a general-purpose computing technique
  • Task 1: Simulation: draw fair (typical) samples from a probability distribution which governs a system: x ~ p(x).
  • Task 2: Integration/computing in very high dimensions, i.e. to compute
    E[f(x)] = ∫ f(x) p(x) dx
  • Task 3: Optimization with an annealing scheme: x* = argmax p(x).
  • Task 4: Learning: unsupervised learning with hidden variables (simulated from the posterior) or MLE learning of parameters p(x; Θ) needs simulations as well.
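Task 2 can be sketched directly: once fair samples from p(x) are available, the integral is approximated by a sample average. A minimal illustration (the choice of p as a standard normal and f(x) = x², for which E[f] = 1, is an assumption for the example):

```python
import random

random.seed(0)

def mc_expectation(f, sampler, n=100_000):
    # Monte Carlo estimate of E[f(x)] = integral of f(x) p(x) dx:
    # average f over n fair samples drawn from p.
    return sum(f(sampler()) for _ in range(n)) / n

# p(x) = standard normal, f(x) = x^2, so the true value is Var(x) = 1.
estimate = mc_expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
print(estimate)  # close to 1.0
```

The whole point of MCMC is to supply the `sampler()` above when p(x) cannot be sampled directly.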
MCMC
• Some notation
  • x^(i): sample; i: index of the state; x = {x_1, x_2, ..., x_s}: state space.
  • The stochastic process x^(i) is called a Markov chain if
    p(x^(i) | x^(i-1), ..., x^(1)) = T(x^(i) | x^(i-1)).
  • The chain is homogeneous if T = T(x^(i) | x^(i-1)) remains invariant for all i, with Σ_{x^(i)} T(x^(i) | x^(i-1)) = 1 for any i.
  • The evolution of the chain thus depends solely on the current state and a fixed transition matrix.
MCMC
• Example
  • Transition graph for a Markov chain with three states (s = 3); the edge probabilities are 1, 0.1, 0.9, 0.6 and 0.4.
  • Transition matrix:

    T = [ 0    1    0
          0    0.1  0.9
          0.6  0.4  0  ]

  • For the initial distribution x_1 = (0.5, 0.2, 0.3):
    x_1 T ≈ (0.2, 0.6, 0.2), and x_1 T^t → (0.2, 0.4, 0.4) ≈ p(x) as t → ∞ (values rounded to one decimal).
  • This stability result plays a fundamental role in MCMC.
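The convergence of x_1 T^t can be checked numerically; a minimal NumPy sketch (the unrounded limit of this chain is (13.5, 25, 22.5)/61 ≈ (0.221, 0.410, 0.369), which rounds to the (0.2, 0.4, 0.4) on the slide):

```python
import numpy as np

# Transition matrix of the three-state example.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

x = np.array([0.5, 0.2, 0.3])  # initial distribution x_1

for _ in range(100):           # repeatedly apply T: x_1 T^t
    x = x @ T

print(x)  # approx. (0.221, 0.410, 0.369), i.e. (0.2, 0.4, 0.4) after rounding
```

Any valid starting distribution converges to the same limit, which is the stationary distribution p(x) of the chain.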
MCMC
• Convergence properties
  • For any starting point, the chain will converge to the invariant distribution p(x), as long as T is a stochastic transition matrix that obeys the following properties:
  1) Irreducibility: every state must be (eventually) reachable from every other state.
  2) Aperiodicity: this stops the chain from oscillating between different states.
  3) Reversibility (detailed balance): this ensures the system remains in its stationary distribution:
     p(x^(i)) T(x^(i+1) | x^(i)) = p(x^(i+1)) T(x^(i) | x^(i+1))
  • Summing (or integrating) both sides over x^(i) gives the invariance condition:
     p(x^(i+1)) = Σ_{x^(i)} p(x^(i)) T(x^(i+1) | x^(i))        (discrete)
     p(x^(i+1)) = ∫ p(x^(i)) K(x^(i+1) | x^(i)) dx^(i)          (continuous; K is the kernel, or proposal distribution)
MCMC
• Eigen-analysis
  • From spectral theory, p(x) is the left eigenvector of the matrix T with corresponding eigenvalue 1 (pT = p).
  • The second-largest eigenvalue (in absolute value) determines the rate of convergence of the chain, and should be as small as possible.
  • Example: T^T = E D E^{-1} with

    T^T = [ 0.1  0.5  0.6
            0.6  0.2  0.3
            0.3  0.3  0.1 ]

    E ≈ [ 0.6396   0.7071   0.2673
          0.6396  -0.7071  -0.8018
          0.4264   0.0000   0.5345 ]

    D = diag(1.0, -0.4, -0.2)

  • The eigenvalue λ1 = 1 corresponds to the stationary distribution: normalizing the eigenvector e1 gives p = e1 / sum(e1) = (0.375, 0.375, 0.25).
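This eigen-analysis can be reproduced with NumPy: the left eigenvectors of T are the right eigenvectors of T^T, so the column of E for eigenvalue 1, once normalized to sum to 1, is the stationary distribution.

```python
import numpy as np

T = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.6, 0.3, 0.1]])

# Left eigenvectors of T are right eigenvectors of T transposed.
vals, vecs = np.linalg.eig(T.T)

i = np.argmax(vals.real)        # index of eigenvalue 1
e1 = np.abs(vecs[:, i].real)    # eigenvector for eigenvalue 1 (sign-normalized)
p = e1 / e1.sum()               # stationary distribution

print(sorted(vals.real))        # eigenvalues: -0.4, -0.2, 1.0
print(p)                        # stationary distribution (0.375, 0.375, 0.25)
```

The second-largest eigenvalue magnitude here is 0.4, so the distance to stationarity shrinks roughly by a factor 0.4 per step.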
Metropolis-Hastings algorithm
• The MH algorithm
  • The most popular MCMC method
  • Invariant distribution p(x)
  • Proposal distribution q(x*|x)
  • Candidate value x*
  • Acceptance probability A(x, x*)
  • Kernel K_MH

  1. Initialize x^(0).
  2. For i = 0 to N-1:
     - Sample u ~ U[0, 1].
     - Sample x* ~ q(x*|x^(i)).
     - If u < A(x^(i), x*) = min{ 1, [p(x*) q(x^(i)|x*)] / [p(x^(i)) q(x*|x^(i))] }
          x^(i+1) = x*
       else
          x^(i+1) = x^(i)

  • The MH kernel is
    K_MH(x^(i+1)|x^(i)) = q(x^(i+1)|x^(i)) A(x^(i), x^(i+1)) + δ_{x^(i)}(x^(i+1)) r(x^(i)),
    where r(x^(i)) = ∫ q(x*|x^(i)) (1 - A(x^(i), x*)) dx* is the term associated with rejection.
  • By construction, K_MH satisfies the detailed balance condition
    p(x^(i)) K_MH(x^(i+1)|x^(i)) = p(x^(i+1)) K_MH(x^(i)|x^(i+1)),
    so p(x) is its invariant distribution.
Metropolis-Hastings algorithm
• Results of running the MH algorithm
  • Target distribution: p(x) ∝ 0.3 exp(-0.2 x²) + 0.7 exp(-0.2 (x - 10)²)
  • Proposal distribution: q(x*|x^(i)) = N(x^(i), 100), i.e. a Gaussian centered at the current state with standard deviation 10.
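A minimal MH sampler for this bimodal target, using the unnormalized density (MH never needs the normalizing constant) and the Gaussian random-walk proposal above; since the proposal is symmetric, the q terms cancel in the acceptance ratio:

```python
import math, random

random.seed(1)

def p_tilde(x):
    # Unnormalized target: 0.3 exp(-0.2 x^2) + 0.7 exp(-0.2 (x - 10)^2)
    return 0.3 * math.exp(-0.2 * x * x) + 0.7 * math.exp(-0.2 * (x - 10) ** 2)

def metropolis_hastings(n_samples, sigma=10.0, x0=0.0):
    x, samples = x0, []
    for _ in range(n_samples):
        x_star = random.gauss(x, sigma)             # proposal q(x*|x) = N(x, sigma^2)
        a = min(1.0, p_tilde(x_star) / p_tilde(x))  # symmetric q cancels in the ratio
        if random.random() < a:                     # accept with probability A(x, x*)
            x = x_star
        samples.append(x)
    return samples

samples = metropolis_hastings(50_000)
mean_x = sum(samples) / len(samples)
print(mean_x)  # near the mixture mean 0.3*0 + 0.7*10 = 7
```

Both modes (near 0 and near 10) are visited, and sample averages converge to expectations under p(x).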
Metropolis-Hastings algorithm
• Different choices of the proposal standard deviation σ*
  • MH requires careful design of the proposal distribution.
  • If σ* is too narrow, only one mode of p(x) might be visited.
  • If σ* is too wide, the rejection rate can be high.
  • If all the modes are visited while the acceptance probability is high, the chain is said to "mix" well.
Mixture and cycles of MCMC kernels
• Mixture and cycle
  • It is possible to combine several samplers into mixtures and cycles of the individual samplers.
  • If the transition kernels K1 and K2 each have invariant distribution p, then the cycle hybrid kernel K1 K2 and the mixture hybrid kernel ν K1 + (1 - ν) K2, for 0 ≤ ν ≤ 1, are also transition kernels with invariant distribution p.
Mixture and cycles of MCMC kernels
• Mixtures of kernels
  • Incorporate global proposals to explore vast regions of the state space and local proposals to discover finer details of the target distribution.
  • Useful for target distributions with many narrow peaks (cf. the reversible jump MCMC algorithm).
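A sketch of a mixture of kernels: at each step, with probability ν a wide "global" proposal is used, otherwise a narrow "local" one. Both are symmetric MH kernels with the same invariant distribution, so the mixture ν K1 + (1 - ν) K2 is too. The target (two narrow peaks at 0 and 20) and the setting ν = 0.2 are illustrative assumptions:

```python
import math, random

random.seed(2)

def p_tilde(x):
    # Bimodal target with two well-separated narrow peaks (at 0 and 20).
    return math.exp(-8.0 * x * x) + math.exp(-8.0 * (x - 20.0) ** 2)

def mixture_mh(n_samples, nu=0.2, x0=0.0):
    x, samples = x0, []
    for _ in range(n_samples):
        # Mixture of kernels: global (wide) proposal with prob. nu, else local.
        sigma = 20.0 if random.random() < nu else 0.5
        x_star = random.gauss(x, sigma)
        if random.random() < min(1.0, p_tilde(x_star) / p_tilde(x)):
            x = x_star
        samples.append(x)
    return samples

samples = mixture_mh(50_000)
near_0 = sum(1 for s in samples if abs(s) < 2) / len(samples)
near_20 = sum(1 for s in samples if abs(s - 20) < 2) / len(samples)
print(near_0, near_20)  # both modes are visited, roughly half the mass each
```

With only the local kernel the chain would stay trapped in one peak; the occasional global proposal lets it jump between peaks while the local proposal resolves each peak's shape.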
Mixture and cycles of MCMC kernels
• Cycles of kernels
  • Split a multivariate state vector into components (blocks) that can be updated separately.
  • Blocking highly correlated variables improves mixing (cf. the Gibbs sampling algorithm).
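Gibbs sampling is the classic cycle of kernels: each component is updated from its full conditional, in a fixed order. A minimal sketch for a bivariate normal target (the correlation ρ = 0.8 is an illustrative assumption; for this target each full conditional is itself Gaussian):

```python
import math, random

random.seed(3)

def gibbs_bivariate_normal(n_samples, rho=0.8):
    # Target: (x, y) ~ N(0, [[1, rho], [rho, 1]]).
    # Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y.
    sd = math.sqrt(1.0 - rho * rho)
    x = y = 0.0
    samples = []
    for _ in range(n_samples):
        x = random.gauss(rho * y, sd)   # kernel K1: update block x
        y = random.gauss(rho * x, sd)   # kernel K2: update block y
        samples.append((x, y))          # cycle K1 K2 leaves the target invariant
    return samples

samples = gibbs_bivariate_normal(50_000)
mean_xy = sum(x * y for x, y in samples) / len(samples)
print(mean_xy)  # estimates E[xy] = rho = 0.8
```

Every proposal is accepted (full conditionals are exact), which is why blocking correlated variables into joint updates is attractive.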
Auxiliary variable samplers
• Auxiliary variables
  • It is often easier to sample from an augmented distribution p(x, u), where u is an auxiliary variable.
  • Marginal samples of x are obtained by sampling (x, u) and ignoring the samples of u.
  • Hybrid Monte Carlo (HMC): uses gradient information.
  • Slice sampling.
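A sketch of the auxiliary-variable idea via 1-D slice sampling: introduce u so that (x, u) is uniform under the curve of the unnormalized density; alternately sample u | x and x | u, then discard u. This simple version draws x uniformly on a fixed bracket and rejects until it lands inside the slice, which assumes the support is effectively contained in [-10, 10]:

```python
import math, random

random.seed(4)

def p_tilde(x):
    # Unnormalized standard normal.
    return math.exp(-0.5 * x * x)

def slice_sample(n_samples, x0=0.0, lo=-10.0, hi=10.0):
    x, samples = x0, []
    for _ in range(n_samples):
        u = random.uniform(0.0, p_tilde(x))   # auxiliary variable: height under curve
        while True:                           # x | u uniform on the slice
            x_new = random.uniform(lo, hi)    # {x : p_tilde(x) > u}, by rejection
            if p_tilde(x_new) > u:
                x = x_new
                break
        samples.append(x)                     # keep x, ignore u
    return samples

samples = slice_sample(20_000)
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean ** 2
print(mean, var)  # approx. 0 and 1 for the standard normal
```

The marginal of x under the uniform joint is exactly p(x), so no acceptance ratio ever needs to be computed.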
Adaptive MCMC
• Adaptive selection of the proposal distribution
  • The variance of the proposal distribution is important.
  • The goal is to automate the choice of the proposal distribution as much as possible.
• Problem
  • Adaptation can disturb the stationary distribution.
  • Gelfand and Sahu (1994): the stationary distribution can be disturbed despite the fact that each participating kernel has the same stationary distribution.
• Avoidance
  • Carry out adaptation only for an initial fixed number of steps.
  • Run parallel chains.
  • And so on... These remedies are inefficient; much more research is required.
Other applications of MCMC
• Simulated annealing for global optimization
  • To find the global maximum of p(x):
    x̂ = argmax_{x^(i), i = 1, ..., N} p(x^(i))
• Monte Carlo EM
  • To find a fast approximation for the E-step.
• Sequential Monte Carlo methods and particle filters
  • To carry out on-line approximation of probability distributions using samples, via parallel sampling.
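A sketch of simulated annealing: run MH on the annealed target p(x)^(1/T_i) while the temperature T_i decreases, so the chain gradually concentrates on the global maximum. The target is the bimodal density from the MH slide (global maximum at x = 10); the cooling schedule, proposal width, and floor temperature are illustrative assumptions:

```python
import math, random

random.seed(5)

def p_tilde(x):
    # Bimodal target; the mode at x = 10 (weight 0.7) is the global maximum.
    return 0.3 * math.exp(-0.2 * x * x) + 0.7 * math.exp(-0.2 * (x - 10) ** 2)

def simulated_annealing(n_steps=20_000, x0=0.0):
    x = x0
    for i in range(n_steps):
        temp = max(0.05, 0.9995 ** i)          # geometric cooling schedule T_i
        x_star = random.gauss(x, 10.0)         # wide symmetric proposal
        # MH acceptance on the annealed target p(x)^(1/T_i).
        a = min(1.0, (p_tilde(x_star) / p_tilde(x)) ** (1.0 / temp))
        if random.random() < a:
            x = x_star
    return x

x_hat = simulated_annealing()
print(x_hat)  # ends near the global maximum x = 10
```

At high temperature the chain moves freely between modes; as T_i → 0 the annealed density becomes sharply peaked at the global maximum, freezing the chain there.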
Convergence problem and tricks of MCMC
• Convergence problem
  • Determining the length of the Markov chain is a difficult task.
• Tricks
  • Initial-sample problem (starting biases)
    • Discard an initial set of samples (burn-in).
    • Set the initial sample value manually.
  • Markov chain tests
    • Apply several graphical and statistical tests to assess whether the chain has stabilized; these do not provide entirely satisfactory diagnostics.
  • The study of the convergence problem continues.
[Figure: chain output over 1,000 iterations; vertical axis 0 to 25]
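The burn-in trick amounts to discarding the first portion of the chain, where samples still reflect the (possibly bad) starting point. A minimal sketch (the target, the deliberately distant start x0 = 100, and the burn-in length are illustrative assumptions):

```python
import math, random

random.seed(6)

def mh_chain(n_steps, x0=100.0, sigma=1.0):
    # MH chain on a standard-normal target, deliberately started far away.
    x, chain = x0, []
    for _ in range(n_steps):
        x_star = random.gauss(x, sigma)
        # Acceptance ratio for the unnormalized target exp(-x^2 / 2).
        if random.random() < min(1.0, math.exp(0.5 * (x * x - x_star * x_star))):
            x = x_star
        chain.append(x)
    return chain

chain = mh_chain(5_000)
burn_in = 1_000
kept = chain[burn_in:]                   # discard the initial biased samples
raw_mean = sum(chain) / len(chain)
corrected = sum(kept) / len(kept)
print(raw_mean, corrected)  # raw mean is biased upward; burned-in mean is near 0
```

The early samples trace the transient from x0 = 100 down to the region where p(x) has mass; averaging over them would bias any estimate, which is exactly what discarding the burn-in avoids.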
Remaining problems
• Large-dimensional models
  • The combination of the sampling algorithm with either gradient-based optimization or exact methods.
• Massive data sets
  • A few solutions based on importance sampling have been proposed.
• Many and varied applications...
  • There is still great room for innovation in this area.
Conclusion
• MCMC
  • Markov Chain Monte Carlo methods cover a variety of different fields and applications.
  • There are great opportunities for combining existing sub-optimal algorithms with MCMC in many machine learning problems.
  • Some areas already benefiting from sampling methods include:
    • Tracking, restoration, segmentation
    • Probabilistic graphical models
    • Classification
    • Data association for localization
    • Classical mixture models