TRANSCRIPT
Computer Vision Lab. SNU
Young Ki Baik
An Introduction to MCMC for Machine Learning (Markov Chain Monte Carlo)
References
• Andrieu et al., "An Introduction to MCMC for Machine Learning," Machine Learning, 2003.
• David MacKay, "Introduction to Monte Carlo Methods."
• Zhu, Dellaert and Tu, "Markov Chain Monte Carlo for Computer Vision," a tutorial at ICCV 2005. http://civs.stat.ucla.edu/MCMC/MCMC_tutorial.htm
• Various MCMC presentations on the web.
Contents
• MCMC
• Metropolis-Hastings algorithm
• Mixtures and cycles of MCMC kernels
• Auxiliary variable samplers
• Adaptive MCMC
• Other applications of MCMC
• Convergence problems and tricks of MCMC
• Remaining problems
• Conclusion
MCMC
• Problem of MC (Monte Carlo)
  • Assembling the entire distribution for MC is usually hard:
    • Complicated energy landscapes
    • High-dimensional systems
    • Extraordinarily difficult normalization
• Solution: MCMC
  • Build up the distribution from a Markov chain.
  • Choose local transition probabilities that generate the distribution of interest (ensure detailed balance).
  • Each random variable is chosen based on the previous variable in the chain.
  • "Walk" along the Markov chain until convergence is reached.
• Result: Normalization is not required, and the calculations are local.
MCMC
• What is a Markov chain?
  • A Markov chain is a mathematical model for a stochastic system that generates random variables X1, X2, ..., Xt.
  • The distribution of the next random variable depends only on the current random variable:
    p(x_{t+1} | x_1, x_2, ..., x_t) = p(x_{t+1} | x_t)
  • The entire chain represents a stationary probability distribution.
MCMC
• What is Markov Chain Monte Carlo?
  • MCMC is a general-purpose technique for generating fair samples from a probability distribution in a high-dimensional space, using random numbers (dice) drawn from a uniform distribution over a certain range:
    x_t ~ p(x)  (Markov chain states)
    z_t ~ unif[a, b]  (independent trials of dice)
MCMC
• MCMC as a general-purpose computing technique
  • Task 1: Simulation: draw fair (typical) samples from a probability distribution which governs a system: x ~ p(x).
  • Task 2: Integration/computing in very high dimensions, i.e. to compute
    E[f(x)] = ∫ f(x) p(x) dx
  • Task 3: Optimization with an annealing scheme: x* = argmax p(x).
  • Task 4: Learning: unsupervised learning with hidden variables (simulated from the posterior) or MLE learning of parameters p(x; Θ) needs simulations as well.
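Task 2 can be sketched directly: once fair samples from p(x) are available, the integral is approximated by a sample average. A minimal illustration (the choice of p as a standard normal and f(x) = x², for which E[f] = 1, is an assumption for the example):

```python
import random

random.seed(0)

def mc_expectation(f, sampler, n=100_000):
    # Monte Carlo estimate of E[f(x)] = integral of f(x) p(x) dx:
    # average f over n fair samples drawn from p.
    return sum(f(sampler()) for _ in range(n)) / n

# p(x) = standard normal, f(x) = x^2, so the true value is Var(x) = 1.
estimate = mc_expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))
print(estimate)  # close to 1.0
```

The whole point of MCMC is to supply the `sampler()` above when p(x) cannot be sampled directly.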
MCMC
• Some notation
  • x^(i): sample; i: index of the state; x = {x_1, x_2, ..., x_s}: state space.
  • The stochastic process x^(i) is called a Markov chain if
    p(x^(i) | x^(i-1), ..., x^(1)) = T(x^(i) | x^(i-1)).
  • The chain is homogeneous if T = T(x^(i) | x^(i-1)) remains invariant for all i, with Σ_{x^(i)} T(x^(i) | x^(i-1)) = 1 for any i.
  • The evolution of the chain thus depends solely on the current state and a fixed transition matrix.
MCMC
• Example
  • Transition graph for a Markov chain with three states (s = 3); the edge probabilities are 1, 0.1, 0.9, 0.6 and 0.4.
  • Transition matrix:

    T = [ 0    1    0
          0    0.1  0.9
          0.6  0.4  0  ]

  • For the initial distribution x_1 = (0.5, 0.2, 0.3):
    x_1 T ≈ (0.2, 0.6, 0.2), and x_1 T^t → (0.2, 0.4, 0.4) ≈ p(x) as t → ∞ (values rounded to one decimal).
  • This stability result plays a fundamental role in MCMC.
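The convergence of x_1 T^t can be checked numerically; a minimal NumPy sketch (the unrounded limit of this chain is (13.5, 25, 22.5)/61 ≈ (0.221, 0.410, 0.369), which rounds to the (0.2, 0.4, 0.4) on the slide):

```python
import numpy as np

# Transition matrix of the three-state example.
T = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])

x = np.array([0.5, 0.2, 0.3])  # initial distribution x_1

for _ in range(100):           # repeatedly apply T: x_1 T^t
    x = x @ T

print(x)  # approx. (0.221, 0.410, 0.369), i.e. (0.2, 0.4, 0.4) after rounding
```

Any valid starting distribution converges to the same limit, which is the stationary distribution p(x) of the chain.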
MCMC
• Convergence properties
  • For any starting point, the chain will converge to the invariant distribution p(x), as long as T is a stochastic transition matrix that obeys the following properties:
  1) Irreducibility: every state must be (eventually) reachable from every other state.
  2) Aperiodicity: this stops the chain from oscillating between different states.
  3) Reversibility (detailed balance): this ensures the system remains in its stationary distribution:
     p(x^(i)) T(x^(i+1) | x^(i)) = p(x^(i+1)) T(x^(i) | x^(i+1))
  • Summing (or integrating) both sides over x^(i) gives the invariance condition:
     p(x^(i+1)) = Σ_{x^(i)} p(x^(i)) T(x^(i+1) | x^(i))        (discrete)
     p(x^(i+1)) = ∫ p(x^(i)) K(x^(i+1) | x^(i)) dx^(i)          (continuous; K is the kernel, or proposal distribution)
MCMC
• Eigen-analysis
  • From spectral theory, p(x) is the left eigenvector of the matrix T with corresponding eigenvalue 1 (pT = p).
  • The second-largest eigenvalue (in absolute value) determines the rate of convergence of the chain, and should be as small as possible.
  • Example: T^T = E D E^{-1} with

    T^T = [ 0.1  0.5  0.6
            0.6  0.2  0.3
            0.3  0.3  0.1 ]

    E ≈ [ 0.6396   0.7071   0.2673
          0.6396  -0.7071  -0.8018
          0.4264   0.0000   0.5345 ]

    D = diag(1.0, -0.4, -0.2)

  • The eigenvalue λ1 = 1 corresponds to the stationary distribution: normalizing the eigenvector e1 gives p = e1 / sum(e1) = (0.375, 0.375, 0.25).
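This eigen-analysis can be reproduced with NumPy: the left eigenvectors of T are the right eigenvectors of T^T, so the column of E for eigenvalue 1, once normalized to sum to 1, is the stationary distribution.

```python
import numpy as np

T = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.6, 0.3, 0.1]])

# Left eigenvectors of T are right eigenvectors of T transposed.
vals, vecs = np.linalg.eig(T.T)

i = np.argmax(vals.real)        # index of eigenvalue 1
e1 = np.abs(vecs[:, i].real)    # eigenvector for eigenvalue 1 (sign-normalized)
p = e1 / e1.sum()               # stationary distribution

print(sorted(vals.real))        # eigenvalues: -0.4, -0.2, 1.0
print(p)                        # stationary distribution (0.375, 0.375, 0.25)
```

The second-largest eigenvalue magnitude here is 0.4, so the distance to stationarity shrinks roughly by a factor 0.4 per step.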
Metropolis-Hastings algorithm
• The MH algorithm
  • The most popular MCMC method
  • Invariant distribution p(x)
  • Proposal distribution q(x*|x)
  • Candidate value x*
  • Acceptance probability A(x, x*)
  • Kernel K_MH

  1. Initialize x^(0).
  2. For i = 0 to N-1:
     - Sample u ~ U[0, 1].
     - Sample x* ~ q(x*|x^(i)).
     - If u < A(x^(i), x*) = min{ 1, [p(x*) q(x^(i)|x*)] / [p(x^(i)) q(x*|x^(i))] }
          x^(i+1) = x*
       else
          x^(i+1) = x^(i)

  • The MH kernel is
    K_MH(x^(i+1)|x^(i)) = q(x^(i+1)|x^(i)) A(x^(i), x^(i+1)) + δ_{x^(i)}(x^(i+1)) r(x^(i)),
    where r(x^(i)) = ∫ q(x*|x^(i)) (1 - A(x^(i), x*)) dx* is the term associated with rejection.
  • By construction, K_MH satisfies the detailed balance condition
    p(x^(i)) K_MH(x^(i+1)|x^(i)) = p(x^(i+1)) K_MH(x^(i)|x^(i+1)),
    so p(x) is its invariant distribution.
Metropolis-Hastings algorithm
• Results of running the MH algorithm
  • Target distribution: p(x) ∝ 0.3 exp(-0.2 x²) + 0.7 exp(-0.2 (x - 10)²)
  • Proposal distribution: q(x*|x^(i)) = N(x^(i), 100), i.e. a Gaussian centered at the current state with standard deviation 10.
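A minimal MH sampler for this bimodal target, using the unnormalized density (MH never needs the normalizing constant) and the Gaussian random-walk proposal above; since the proposal is symmetric, the q terms cancel in the acceptance ratio:

```python
import math, random

random.seed(1)

def p_tilde(x):
    # Unnormalized target: 0.3 exp(-0.2 x^2) + 0.7 exp(-0.2 (x - 10)^2)
    return 0.3 * math.exp(-0.2 * x * x) + 0.7 * math.exp(-0.2 * (x - 10) ** 2)

def metropolis_hastings(n_samples, sigma=10.0, x0=0.0):
    x, samples = x0, []
    for _ in range(n_samples):
        x_star = random.gauss(x, sigma)             # proposal q(x*|x) = N(x, sigma^2)
        a = min(1.0, p_tilde(x_star) / p_tilde(x))  # symmetric q cancels in the ratio
        if random.random() < a:                     # accept with probability A(x, x*)
            x = x_star
        samples.append(x)
    return samples

samples = metropolis_hastings(50_000)
mean_x = sum(samples) / len(samples)
print(mean_x)  # near the mixture mean 0.3*0 + 0.7*10 = 7
```

Both modes (near 0 and near 10) are visited, and sample averages converge to expectations under p(x).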
Metropolis-Hastings algorithm
• Different choices of the proposal standard deviation σ*
  • MH requires careful design of the proposal distribution.
  • If σ* is too narrow, only one mode of p(x) might be visited.
  • If σ* is too wide, the rejection rate can be high.
  • If all the modes are visited while the acceptance probability is high, the chain is said to "mix" well.
Mixture and cycles of MCMC kernels
• Mixture and cycle
  • It is possible to combine several samplers into mixtures and cycles of the individual samplers.
  • If the transition kernels K1 and K2 each have invariant distribution p, then the cycle hybrid kernel K1 K2 and the mixture hybrid kernel ν K1 + (1 - ν) K2, for 0 ≤ ν ≤ 1, are also transition kernels with invariant distribution p.
Mixture and cycles of MCMC kernels
• Mixtures of kernels
  • Incorporate global proposals to explore vast regions of the state space and local proposals to discover finer details of the target distribution.
  • Useful for target distributions with many narrow peaks (cf. the reversible jump MCMC algorithm).
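A sketch of a mixture of kernels: at each step, with probability ν a wide "global" proposal is used, otherwise a narrow "local" one. Both are symmetric MH kernels with the same invariant distribution, so the mixture ν K1 + (1 - ν) K2 is too. The target (two narrow peaks at 0 and 20) and the setting ν = 0.2 are illustrative assumptions:

```python
import math, random

random.seed(2)

def p_tilde(x):
    # Bimodal target with two well-separated narrow peaks (at 0 and 20).
    return math.exp(-8.0 * x * x) + math.exp(-8.0 * (x - 20.0) ** 2)

def mixture_mh(n_samples, nu=0.2, x0=0.0):
    x, samples = x0, []
    for _ in range(n_samples):
        # Mixture of kernels: global (wide) proposal with prob. nu, else local.
        sigma = 20.0 if random.random() < nu else 0.5
        x_star = random.gauss(x, sigma)
        if random.random() < min(1.0, p_tilde(x_star) / p_tilde(x)):
            x = x_star
        samples.append(x)
    return samples

samples = mixture_mh(50_000)
near_0 = sum(1 for s in samples if abs(s) < 2) / len(samples)
near_20 = sum(1 for s in samples if abs(s - 20) < 2) / len(samples)
print(near_0, near_20)  # both modes are visited, roughly half the mass each
```

With only the local kernel the chain would stay trapped in one peak; the occasional global proposal lets it jump between peaks while the local proposal resolves each peak's shape.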
Mixture and cycles of MCMC kernels
• Cycles of kernels
  • Split a multivariate state vector into components (blocks) that can be updated separately.
  • Blocking highly correlated variables improves mixing (cf. the Gibbs sampling algorithm).
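Gibbs sampling is the classic cycle of kernels: each component is updated from its full conditional, in a fixed order. A minimal sketch for a bivariate normal target (the correlation ρ = 0.8 is an illustrative assumption; for this target each full conditional is itself Gaussian):

```python
import math, random

random.seed(3)

def gibbs_bivariate_normal(n_samples, rho=0.8):
    # Target: (x, y) ~ N(0, [[1, rho], [rho, 1]]).
    # Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y.
    sd = math.sqrt(1.0 - rho * rho)
    x = y = 0.0
    samples = []
    for _ in range(n_samples):
        x = random.gauss(rho * y, sd)   # kernel K1: update block x
        y = random.gauss(rho * x, sd)   # kernel K2: update block y
        samples.append((x, y))          # cycle K1 K2 leaves the target invariant
    return samples

samples = gibbs_bivariate_normal(50_000)
mean_xy = sum(x * y for x, y in samples) / len(samples)
print(mean_xy)  # estimates E[xy] = rho = 0.8
```

Every proposal is accepted (full conditionals are exact), which is why blocking correlated variables into joint updates is attractive.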
Auxiliary variable samplers
• Auxiliary variables
  • It is often easier to sample from an augmented distribution p(x, u), where u is an auxiliary variable.
  • Marginal samples of x are obtained by sampling (x, u) and ignoring the samples of u.
  • Hybrid Monte Carlo (HMC): uses gradient information.
  • Slice sampling.
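A sketch of the auxiliary-variable idea via 1-D slice sampling: introduce u so that (x, u) is uniform under the curve of the unnormalized density; alternately sample u | x and x | u, then discard u. This simple version draws x uniformly on a fixed bracket and rejects until it lands inside the slice, which assumes the support is effectively contained in [-10, 10]:

```python
import math, random

random.seed(4)

def p_tilde(x):
    # Unnormalized standard normal.
    return math.exp(-0.5 * x * x)

def slice_sample(n_samples, x0=0.0, lo=-10.0, hi=10.0):
    x, samples = x0, []
    for _ in range(n_samples):
        u = random.uniform(0.0, p_tilde(x))   # auxiliary variable: height under curve
        while True:                           # x | u uniform on the slice
            x_new = random.uniform(lo, hi)    # {x : p_tilde(x) > u}, by rejection
            if p_tilde(x_new) > u:
                x = x_new
                break
        samples.append(x)                     # keep x, ignore u
    return samples

samples = slice_sample(20_000)
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean ** 2
print(mean, var)  # approx. 0 and 1 for the standard normal
```

The marginal of x under the uniform joint is exactly p(x), so no acceptance ratio ever needs to be computed.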
Adaptive MCMC
• Adaptive selection of the proposal distribution
  • The variance of the proposal distribution is important.
  • The goal is to automate the choice of the proposal distribution as much as possible.
• Problem
  • Adaptation can disturb the stationary distribution.
  • Gelfand and Sahu (1994): the stationary distribution can be disturbed despite the fact that each participating kernel has the same stationary distribution.
• Avoidance
  • Carry out adaptation only for an initial fixed number of steps.
  • Run parallel chains.
  • And so on... These remedies are inefficient; much more research is required.
Other applications of MCMC
• Simulated annealing for global optimization
  • To find the global maximum of p(x):
    x̂ = argmax_{x^(i), i = 1, ..., N} p(x^(i))
• Monte Carlo EM
  • To find a fast approximation for the E-step.
• Sequential Monte Carlo methods and particle filters
  • To carry out on-line approximation of probability distributions using samples, via parallel sampling.
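A sketch of simulated annealing: run MH on the annealed target p(x)^(1/T_i) while the temperature T_i decreases, so the chain gradually concentrates on the global maximum. The target is the bimodal density from the MH slide (global maximum at x = 10); the cooling schedule, proposal width, and floor temperature are illustrative assumptions:

```python
import math, random

random.seed(5)

def p_tilde(x):
    # Bimodal target; the mode at x = 10 (weight 0.7) is the global maximum.
    return 0.3 * math.exp(-0.2 * x * x) + 0.7 * math.exp(-0.2 * (x - 10) ** 2)

def simulated_annealing(n_steps=20_000, x0=0.0):
    x = x0
    for i in range(n_steps):
        temp = max(0.05, 0.9995 ** i)          # geometric cooling schedule T_i
        x_star = random.gauss(x, 10.0)         # wide symmetric proposal
        # MH acceptance on the annealed target p(x)^(1/T_i).
        a = min(1.0, (p_tilde(x_star) / p_tilde(x)) ** (1.0 / temp))
        if random.random() < a:
            x = x_star
    return x

x_hat = simulated_annealing()
print(x_hat)  # ends near the global maximum x = 10
```

At high temperature the chain moves freely between modes; as T_i → 0 the annealed density becomes sharply peaked at the global maximum, freezing the chain there.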
Convergence problem and tricks of MCMC
• Convergence problem
  • Determining the length of the Markov chain is a difficult task.
• Tricks
  • Initial-sample problem (starting biases)
    • Discard an initial set of samples (burn-in).
    • Set the initial sample value manually.
  • Markov chain tests
    • Apply several graphical and statistical tests to assess whether the chain has stabilized; these do not provide entirely satisfactory diagnostics.
  • The study of the convergence problem continues.
[Figure: chain output over 1,000 iterations; vertical axis 0 to 25]
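The burn-in trick amounts to discarding the first portion of the chain, where samples still reflect the (possibly bad) starting point. A minimal sketch (the target, the deliberately distant start x0 = 100, and the burn-in length are illustrative assumptions):

```python
import math, random

random.seed(6)

def mh_chain(n_steps, x0=100.0, sigma=1.0):
    # MH chain on a standard-normal target, deliberately started far away.
    x, chain = x0, []
    for _ in range(n_steps):
        x_star = random.gauss(x, sigma)
        # Acceptance ratio for the unnormalized target exp(-x^2 / 2).
        if random.random() < min(1.0, math.exp(0.5 * (x * x - x_star * x_star))):
            x = x_star
        chain.append(x)
    return chain

chain = mh_chain(5_000)
burn_in = 1_000
kept = chain[burn_in:]                   # discard the initial biased samples
raw_mean = sum(chain) / len(chain)
corrected = sum(kept) / len(kept)
print(raw_mean, corrected)  # raw mean is biased upward; burned-in mean is near 0
```

The early samples trace the transient from x0 = 100 down to the region where p(x) has mass; averaging over them would bias any estimate, which is exactly what discarding the burn-in avoids.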
Remaining problems
• Large-dimensional models
  • The combination of the sampling algorithm with either gradient-based optimization or exact methods.
• Massive data sets
  • A few solutions based on importance sampling have been proposed.
• Many and varied applications...
  • There is still great room for innovation in this area.
Conclusion
• MCMC
  • Markov Chain Monte Carlo methods cover a variety of different fields and applications.
  • There are great opportunities for combining existing sub-optimal algorithms with MCMC in many machine learning problems.
  • Some areas already benefiting from sampling methods include:
    • Tracking, restoration, segmentation
    • Probabilistic graphical models
    • Classification
    • Data association for localization
    • Classical mixture models