Part 2: Unsupervised Learning

Machine Learning Techniques for Computer Vision

Microsoft Research Cambridge

ECCV 2004, Prague

Christopher M. Bishop

Machine Learning Techniques for Computer Vision (ECCV 2004)


Overview of Part 2

• Mixture models
• EM
• Variational Inference
• Bayesian model complexity
• Continuous latent variables



The Gaussian Distribution

• Multivariate Gaussian

• Maximum likelihood

– mean
– covariance
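For reference, the standard forms of the density and of its maximum likelihood estimates are written out here. The symbols (x for a D-dimensional data vector, μ and Σ for mean and covariance, N for the number of data points) are assumed notation, not taken from the slide.

```latex
\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}
    \exp\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}}
    \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}

\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n,
\qquad
\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}
  (\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})
  (\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})^{\mathsf{T}}
```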



Gaussian Mixtures

• Linear super-position of Gaussians

• Normalization and positivity require
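In the usual notation (assumed here: K components with mixing coefficients π_k, means μ_k and covariances Σ_k), the superposition and the constraints on the mixing coefficients can be written as:

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,
  \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k),
\qquad
0 \le \pi_k \le 1,
\qquad
\sum_{k=1}^{K} \pi_k = 1
```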



Example: Mixture of 3 Gaussians



Maximum Likelihood for the GMM

• Log likelihood function

• Sum over components appears inside the log
– no closed-form ML solution
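A minimal sketch of evaluating this log likelihood, restricted to one-dimensional data for brevity. The function names and the 1-D restriction are illustrative assumptions; the sum over components is computed with the log-sum-exp trick so that very small densities do not underflow.

```python
import math

def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian with mean mu and variance var."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def gmm_log_likelihood(data, weights, means, variances):
    """Sum over data points of log( sum_k pi_k N(x | mu_k, var_k) )."""
    total = 0.0
    for x in data:
        # The sum over components sits inside the log, so work in log space.
        terms = [math.log(w) + log_gauss(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total

# Example call with made-up data and parameters (hypothetical values).
print(gmm_log_likelihood([0.1, 0.4, 3.9], [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]))
```

Because the logarithm acts on the whole sum rather than on each component, setting the derivatives to zero does not decouple the components, which is why an iterative scheme is needed.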



EM Algorithm – Informal Derivation




• M step equations




• E step equation




• Can interpret the mixing coefficients as prior probabilities

• Corresponding posterior probabilities (responsibilities)
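The E and M steps can be sketched for one-dimensional data as follows. This is an illustrative sketch under stated assumptions, not the slide's own code: the function names are hypothetical, the data are 1-D, and there is no safeguard against a component's variance shrinking to zero.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def e_step(data, weights, means, variances):
    """Responsibilities: posterior probability of each component per point."""
    resp = []
    for x in data:
        joint = [w * gauss(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        norm = sum(joint)  # assumed non-zero for this sketch
        resp.append([j / norm for j in joint])
    return resp

def m_step(data, resp):
    """Re-estimate mixing coefficients, means and variances."""
    n_points, n_comp = len(data), len(resp[0])
    weights, means, variances = [], [], []
    for k in range(n_comp):
        nk = sum(r[k] for r in resp)  # effective number of points
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
        weights.append(nk / n_points)
        means.append(mu)
        variances.append(var)
    return weights, means, variances

def fit_gmm(data, weights, means, variances, n_iter=50):
    """Alternate E and M steps for a fixed number of iterations."""
    for _ in range(n_iter):
        resp = e_step(data, weights, means, variances)
        weights, means, variances = m_step(data, resp)
    return weights, means, variances
```

A call such as `fit_gmm([0.9, 1.0, 1.1, 4.9, 5.0, 5.1], [0.5, 0.5], [0.0, 6.0], [1.0, 1.0])` fits two components to two well-separated clusters.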



Old Faithful Data Set

[Figure: Old Faithful data; axes: duration of eruption (minutes), time between eruptions (minutes)]



Latent Variable View of EM

• To sample from a Gaussian mixture:
– first pick one of the components with probability given by its mixing coefficient
– then draw a sample from that component
– repeat these two steps for each new data point
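The two-step sampling procedure can be sketched as below, for a one-dimensional mixture. The function name, the use of standard deviations as parameters, and the fixed seed are illustrative assumptions.

```python
import random

def sample_gmm(n, weights, means, stds, seed=0):
    """Draw n (component, value) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: pick a component with probability equal to its weight.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: draw a value from the chosen Gaussian component.
        x = rng.gauss(means[k], stds[k])
        samples.append((k, x))
    return samples

# Hypothetical parameters: two components with unequal weights.
draws = sample_gmm(5, [0.3, 0.7], [0.0, 5.0], [1.0, 0.5])
```

The returned component index plays the role of the colour of each point; discarding it leaves only the observed values.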




• Goal: given a data set, find the parameters of the mixture
• Suppose we knew the colours

– maximum likelihood would involve fitting each component to the corresponding cluster

• Problem: the colours are latent (hidden) variables



Incomplete and Complete Data

[Figure panels: complete, incomplete]



Latent Variable Viewpoint



Latent Variable Viewpoint

• Binary latent variables describing which component generated each data point

• Conditional distribution of observed variable

• Prior distribution of latent variables

• Marginalizing over the latent variables we obtain
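In a common notation (assumed here: z is a 1-of-K binary vector with exactly one element z_k equal to 1), the three distributions listed above take the form:

```latex
p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k},
\qquad
p(\mathbf{x}\mid\mathbf{z}) = \prod_{k=1}^{K}
  \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)^{z_k}

p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\,p(\mathbf{x}\mid\mathbf{z})
  = \sum_{k=1}^{K} \pi_k\,
    \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)
```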



Graphical Representation of GMM




• Suppose we knew the values for the latent variables
– maximize the complete-data log likelihood
– trivial closed-form solution: fit each component to the corresponding set of data points
• We don’t know the values of the latent variables
– however, for given parameter values we can compute the expected values of the latent variables
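Written out in an assumed notation (z_nk is 1 if component k generated data point x_n and 0 otherwise; γ_nk denotes its expected value), the complete-data log likelihood and the expected values of the latent variables are:

```latex
\ln p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma})
  = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}
    \left\{\ln\pi_k + \ln\mathcal{N}(\mathbf{x}_n\mid
    \boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\right\}

\mathbb{E}[z_{nk}] = \gamma_{nk}
  = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}
         {\sum_{j=1}^{K}\pi_j\,\mathcal{N}(\mathbf{x}_n\mid
          \boldsymbol{\mu}_j,\boldsymbol{\Sigma}_j)}
```

Here the logarithm acts directly on each Gaussian, so the sum over components no longer sits inside the log.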



Posterior Probabilities (colour coded)



Over-fitting in Gaussian Mixture Models

• Infinities in likelihood function when a component ‘collapses’ onto a data point:

with

• Also, maximum likelihood cannot determine the number K of components
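The collapse can be illustrated numerically. In this sketch (hypothetical data and parameter values, 1-D for simplicity) one component is centred exactly on the first data point and its variance is shrunk, while a second, broad component covers the remaining points; the log likelihood keeps increasing as the variance goes to zero.

```python
import math

def log_gauss(x, mu, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def collapsing_log_likelihood(data, var_collapsing):
    """Two equally weighted components: one sits on data[0], one is broad."""
    total = 0.0
    for x in data:
        a = math.log(0.5) + log_gauss(x, data[0], var_collapsing)
        b = math.log(0.5) + log_gauss(x, 0.0, 4.0)  # fixed broad component
        peak = max(a, b)
        total += peak + math.log(math.exp(a - peak) + math.exp(b - peak))
    return total

data = [-1.3, -0.4, 0.2, 0.9, 1.7]  # made-up values
for var in (1e-2, 1e-4, 1e-6, 1e-8):
    print(var, collapsing_log_likelihood(data, var))
```

The broad component keeps the likelihood of the other points finite, so nothing counteracts the growing contribution from the point under the collapsing component.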



Cross Validation

• Can select model complexity using an independent validation data set

• If data is scarce use cross-validation:
– partition data into S subsets
– train on S−1 subsets
– test on remainder
– repeat and average
• Disadvantages
– computationally expensive
– can only determine one or two complexity parameters
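The S-fold procedure can be sketched generically as follows. The `fit` and `score` callables are hypothetical placeholders for whatever model and evaluation measure are in use, and shuffling of the data before partitioning is omitted for brevity.

```python
def s_fold_splits(n_items, n_folds):
    """Yield (train_indices, test_indices) for each of the S folds."""
    indices = list(range(n_items))
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        train = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, folds[held_out]

def cross_validate(data, n_folds, fit, score):
    """Train on S-1 subsets, test on the remainder, repeat and average."""
    scores = []
    for train_idx, test_idx in s_fold_splits(len(data), n_folds):
        model = fit([data[i] for i in train_idx])
        scores.append(score(model, [data[i] for i in test_idx]))
    return sum(scores) / len(scores)
```

Each candidate complexity setting (for example each value of K) requires S complete training runs, which is the source of the computational cost noted above.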



Bayesian Mixture of Gaussians

• Parameters and latent variables appear on equal footing
• Conjugate priors



Data Set Size

• Problem 1: learn a function from 100 (slightly) noisy examples
– data set is computationally small but statistically large
• Problem 2: learn to recognize 1,000 everyday objects from 5,000,000 natural images
– data set is computationally large but statistically small
• Bayesian inference
– computationally more demanding than ML or MAP (but see discussion of Gaussian mixtures later)
– significant benefit for statistically small data sets



Variational Inference

• Exact Bayesian inference intractable
• Markov chain Monte Carlo
– computationally expensive
– issues of convergence
• Variational Inference
– broadly applicable deterministic approximation
– treat all latent variables and parameters as a single set of unknowns
– approximate the true posterior over them using a simpler distribution q
– minimize Kullback-Leibler divergence



General View of Variational Inference

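• For an arbitrary distribution q, the log marginal likelihood is the sum of a lower bound and the Kullback-Leibler divergence between q and the true posterior
• Maximizing the bound over q would give the true posterior
– this is intractable by definition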
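The standard decomposition, log evidence = lower bound + KL(q || posterior), can be checked numerically on a toy model with a single discrete latent variable. The joint probabilities and the choice of q below are made-up values for illustration, and the function names are hypothetical.

```python
import math

def lower_bound(q, joint):
    """L(q) = sum_z q(z) ln( p(x, z) / q(z) ) for a fixed observation x."""
    return sum(qz * math.log(pz / qz) for qz, pz in zip(q, joint) if qz > 0)

def kl_to_posterior(q, joint):
    """KL( q(z) || p(z | x) ), with the posterior obtained by normalizing."""
    evidence = sum(joint)
    return sum(qz * math.log(qz / (pz / evidence))
               for qz, pz in zip(q, joint) if qz > 0)

joint = [0.10, 0.25, 0.05]   # p(x, z) for three values of z (made up)
q = [0.2, 0.5, 0.3]          # an arbitrary approximating distribution
log_evidence = math.log(sum(joint))
gap = log_evidence - lower_bound(q, joint)
print(gap, kl_to_posterior(q, joint))  # the two numbers should agree
```

Since the KL term is non-negative, the bound never exceeds the log evidence, and it is tight exactly when q equals the posterior.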



Variational Lower Bound



Factorized Approximation

• Goal: choose a family of q distributions which are:
– sufficiently flexible to give good approximation
– sufficiently simple to remain tractable
• Here we consider factorized distributions
• No further assumptions are required!
• Optimal solution for one factor, keeping the remainder fixed
– coupled solutions so initialize then cyclically update
– message passing view (Winn and Bishop, 2004)
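In an assumed notation (X for the observed data, Z for all unknowns partitioned into groups Z_i, each with its own factor q_i), the factorization and the standard form of the optimal factor are:

```latex
q(\mathbf{Z}) = \prod_{i} q_i(\mathbf{Z}_i)

\ln q_j^{\star}(\mathbf{Z}_j)
  = \mathbb{E}_{i \neq j}\left[\ln p(\mathbf{X},\mathbf{Z})\right]
    + \text{const}
```

The expectation is taken with respect to all the other factors, which is why the solutions are coupled and must be updated in turn.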



Lower Bound

• Can also be evaluated
• Useful for maths/code verification
• Also useful for model comparison



Illustration: Univariate Gaussian

• Likelihood function

• Conjugate prior
• Factorized variational distribution
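A sketch of the cyclic updates for this example, under stated assumptions: the unknowns are the mean mu and precision tau of the Gaussian, the conjugate prior is taken to be p(mu | tau) = N(mu | mu0, 1/(lam0 tau)) with p(tau) = Gamma(tau | a0, b0), and the factorization is q(mu, tau) = q(mu) q(tau). The prior hyperparameter values and the function name are hypothetical.

```python
def vb_univariate_gaussian(data, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0,
                           n_iter=50):
    """Return (mu_n, lam_n, a_n, b_n) with q(mu) = N(mu | mu_n, 1/lam_n)
    and q(tau) = Gamma(tau | a_n, b_n)."""
    n = len(data)
    xbar = sum(data) / n
    # The mean of q(mu) and the shape of q(tau) do not change between sweeps.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0
    e_tau = a0 / b0  # initial guess for E[tau]
    for _ in range(n_iter):
        # Update q(mu) given the current E[tau].
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) given q(mu); E[(x - mu)^2] = (x - mu_n)^2 + 1/lam_n.
        e_data = sum((x - mu_n) ** 2 for x in data) + n / lam_n
        e_prior = lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n)
        b_n = b0 + 0.5 * (e_data + e_prior)
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

print(vb_univariate_gaussian([1.0, 2.0, 3.0, 4.0]))
```

Each sweep updates one factor with the other held fixed, which corresponds to the sequence initial configuration, after updating, converged solution.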



Initial Configuration



After Updating



Converged Solution



Variational Mixture of Gaussians

• Assume factorized posterior distribution

• No other approximations needed!



Variational Equations for GMM



Lower Bound for GMM



VIBES

Bishop, Spiegelhalter and Winn (2002)



ML Limit

• If instead we choose

we recover the maximum likelihood EM algorithm



Bound vs. K for Old Faithful Data



Bayesian Model Complexity



Sparse Bayes for Gaussian Mixture

• Corduneanu and Bishop (2001)
• Start with large value of K
– treat mixing coefficients as parameters
– maximize marginal likelihood
– prunes out excess components



Summary: Variational Gaussian Mixtures

• Simple modification of maximum likelihood EM code
• Small computational overhead compared to EM
• No singularities
• Automatic model order selection



Continuous Latent Variables

• Conventional PCA
– data covariance matrix
– eigenvector decomposition
• Minimizes sum-of-squares projection error
– not a probabilistic model
– how should we choose L?
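The two steps of conventional PCA can be sketched in plain Python. Power iteration is used here only as a simple stand-in for a full eigenvector decomposition; it recovers just the leading principal direction and assumes the starting vector is not orthogonal to it. Function names and data are illustrative.

```python
import math

def covariance(data):
    """Return (mean, covariance matrix) of a list of equal-length vectors."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    cov = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / n
            for j in range(d)] for i in range(d)]
    return mean, cov

def leading_eigenvector(matrix, n_iter=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

# Made-up 2-D points lying on the line y = 2x.
mean, cov = covariance([(-1.0, -2.0), (0.0, 0.0), (1.0, 2.0)])
eigenvalue, direction = leading_eigenvector(cov)
```

Nothing in this procedure indicates how many directions L to keep, which is the question a probabilistic formulation can address.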



Probabilistic PCA

• Tipping and Bishop (1998)
• L-dimensional continuous latent space

• D-dimensional data space

PCA

factor analysis



Probabilistic PCA

• Marginal distribution

• Advantages
– exact ML solution
– computationally efficient EM algorithm
– captures dominant correlations with few parameters
– mixtures of PPCA
– Bayesian PCA
– building block for more complex models
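For reference, the probabilistic PCA model of Tipping and Bishop is usually written as below. The symbols are assumed notation: z is the L-dimensional latent vector, W a D × L matrix, μ the data mean and σ² an isotropic noise variance. Factor analysis has the same structure but with a general diagonal noise covariance in place of σ²I.

```latex
p(\mathbf{z}) = \mathcal{N}(\mathbf{z}\mid\mathbf{0},\mathbf{I}_L),
\qquad
p(\mathbf{x}\mid\mathbf{z}) = \mathcal{N}(\mathbf{x}\mid
  \mathbf{W}\mathbf{z}+\boldsymbol{\mu},\,\sigma^2\mathbf{I}_D)

p(\mathbf{x}) = \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\,
  \mathbf{W}\mathbf{W}^{\mathsf{T}}+\sigma^2\mathbf{I}_D)
```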



EM for PCA



Bayesian PCA

• Bishop (1998)
• Gaussian prior over columns of the latent-to-data projection matrix

• Automatic relevance determination (ARD)

[Figure panels: ML PCA, Bayesian PCA]



Non-linear Manifolds

• Example: images of a rigid object



Bayesian Mixture of BPCA Models



Flexible Sprites

• Jojic and Frey (2001)
• Automatic decomposition of video sequence into
– background model
– ordered set of masks (one per object per frame)
– foreground model (one per object per frame)



Transformed Component Analysis

• Generative model

• Now include transformations (translations)

• Extend to L layers
• Inference intractable so use variational framework



Bayesian Constellation Model

• Li, Fergus and Perona (2003)
• Object recognition from small training sets
• Variational treatment of fully Bayesian model



Bayesian Constellation Model



Summary of Part 2

• Discrete and continuous latent variables
– EM algorithm
• Build complex models from simple components
– represented graphically
– incorporates prior knowledge
• Variational inference
– Bayesian model comparison
