Lars Kasper, December 15th 2010
PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 12: CONTINUOUS LATENT VARIABLES
Relation To Other Topics
• Last weeks: Approximate Inference
• Today: Back to
  • data preprocessing
  • data representation / feature extraction
  • “model-free” analysis
  • dimensionality reduction
  • the matrix
• Link: We also have a (particularly easy) model of the underlying state of the world whose parameters we want to infer from the data
Take-home TLAs (Three-letter acronyms)
Although termed “continuous latent variables”, we mainly deal with
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Factor analysis
General motivation/theme: “What is interesting about my data – but hidden (latent)? …And what is just noise?”
Importance Sampling ;-)

Publications concerning fMRI and (PCA or ICA or factor analysis)
Source: ISI Web of Knowledge, Dec 13th, 2010

Year  Publications  Share
1996      2    0.1918 %
1997      3    0.2876 %
1998      7    0.6711 %
1999     17    1.6299 %
2000     33    3.1640 %
2001     41    3.9310 %
2002     54    5.1774 %
2003     53    5.0815 %
2004     77    7.3826 %
2005     85    8.1496 %
2006     98    9.3960 %
2007    115   11.0259 %
2008    139   13.3269 %
2009    160   15.3404 %
2010    157   15.0527 %
Importance Sampling: fMRI
• Used for fMRI analysis, e.g. in the software package FSL: “MELODIC”
• MELODIC tutorial: 2nd principal component (eigenimage) and corresponding time series of a visual block stimulation
Motivation: Low intrinsic dimensionality
• Generating hand-written digit samples by translating and rotating one example 100 times
• High dimensional data (100 × 100 pixels)
• Low degrees of freedom (1 rotation angle, 2 translations)
Roadmap for today
Standard PCA (heuristic)
• Dimensionality reduction
• Maximum variance
• Minimum error

Probabilistic PCA (Maximum Likelihood)
• Generative probabilistic model
• ML equivalence to standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: factor analysis
• Relaxing Gaussianity: ICA
• Relaxing linearity: kernel PCA
Heuristic PCA: Projection View
How do we simplify or compress our data (make it low-dimensional) without losing actual information? Dimensionality reduction by projecting onto a linear subspace.
2D-data
Projected on 1D-line
Heuristic PCA: Dimensionality Reduction
High dimensional data
• Data points 𝒙ₙ ∈ ℝᴰ

Projection onto a low-dimensional subspace
• Dimension M < D
• Projected data points 𝒙̃ₙ

Advantages:
• Reduced amount of data
• Might be easier to reveal structure within the data (pattern recognition, data visualization)
Heuristic PCA: Maximum Variance View
• We want to reduce the dimensionality of our data space via a linear projection.
• But we still want to keep the projected samples as different as possible.
• A good measure for this difference is the data covariance, expressed by the matrix

  S = (1/N) Σₙ₌₁ᴺ (𝒙ₙ − 𝒙̄)(𝒙ₙ − 𝒙̄)ᵀ,   𝒙̄ – mean of all data points, N – number of data points

• Note: This expresses the covariance between different data dimensions, not between data points.
• We now aim to maximize the variance of the projected data in the projection space spanned by the basis vectors 𝒖ᵢ.
Maximum Variance View: The Maths
• Maximum variance formulation of a 1D projection with projection vector 𝒖₁:

  maximize 𝒖₁ᵀS𝒖₁

• Constrained optimization (Lagrange multiplier λ₁ enforcing 𝒖₁ᵀ𝒖₁ = 1):

  𝒖₁ᵀS𝒖₁ + λ₁(1 − 𝒖₁ᵀ𝒖₁)

• Leads to the best projector being an eigenvector of S, the data covariance matrix:

  S𝒖₁ = λ₁𝒖₁

• with maximum projected variance equal to the maximum eigenvalue:

  𝒖₁ᵀS𝒖₁ = λ₁
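This claim is easy to check numerically. The following minimal NumPy sketch (not from the slides; the 2-D synthetic data set is an illustrative assumption) verifies that the top eigenvector of the data covariance matrix attains the maximum projected variance, equal to the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data with a dominant direction (illustrative choice).
X = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=2000)

# Data covariance matrix S (dimensions x dimensions, not points x points).
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)

# The constrained maximizer u1 is the eigenvector of S with largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
u1, lam1 = eigvecs[:, -1], eigvals[-1]

# Projected variance along u1 equals lambda_1 and beats any random direction.
var_u1 = np.var(Xc @ u1)
for _ in range(100):
    v = rng.standard_normal(2)
    v /= np.linalg.norm(v)
    assert np.var(Xc @ v) <= var_u1 + 1e-9
print(np.isclose(var_u1, lam1))             # True
```

Note that `np.linalg.eigh` is the right tool here because S is symmetric; it returns eigenvalues in ascending order, so the principal component is the last column.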
Heuristic PCA: Conclusion
By induction we obtain the general PCA result for maximizing the variance of the data in the projected dimensions:

The projection vectors 𝒖₁, …, 𝒖ₘ shall be the eigenvectors corresponding to the M largest eigenvalues of the data covariance matrix S. These vectors are called the principal components.
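The general recipe, projecting onto the top-M eigenvectors, can be sketched as follows (not from the slides; the 50-D data set with a 3-D intrinsic subspace is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

# 50-D data that really lives near a 3-D subspace (illustrative construction).
Z = rng.standard_normal((500, 3))
A = rng.standard_normal((3, 50))
X = Z @ A + 0.05 * rng.standard_normal((500, 50))

Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

M = 3
U = eigvecs[:, :M]            # principal components: top-M eigenvectors
X_proj = Xc @ U               # N x M low-dimensional representation

explained = eigvals[:M].sum() / eigvals.sum()
print(round(explained, 3))    # close to 1: three components capture the data
```

The explained-variance ratio is a common heuristic for choosing M when the intrinsic dimensionality is not known in advance.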
Heuristic PCA: Minimum error formulation
• By projecting, we want to lose as little information as possible, i.e. keep the projected data points as similar to the raw data as possible.
• Therefore we minimize the mean squared error

  J = (1/N) Σₙ₌₁ᴺ ‖𝒙ₙ − 𝒙̃ₙ‖²

• with respect to the projection vectors 𝒖ᵢ.
• This leads to the same result as in the maximum variance formulation: the 𝒖ᵢ shall be the eigenvectors corresponding to the largest eigenvalues of the data covariance matrix S.
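The minimum-error view also says that the residual error equals the sum of the discarded eigenvalues, which the following short NumPy sketch confirms (not from the slides; the 5-D Gaussian data set is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(S)         # ascending order
U = eigvecs[:, -2:]                          # keep top M = 2 components

X_rec = Xc @ U @ U.T                         # project, then reconstruct
mse = np.mean(np.sum((Xc - X_rec) ** 2, axis=1))

# Minimum-error view: the mean squared reconstruction error equals the
# sum of the discarded eigenvalues of S.
print(np.isclose(mse, eigvals[:-2].sum()))   # True
```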
Example: Eigenimages
Eigenimages II
Christopher DeCoro http://www.cs.princeton.edu/cdecoro/eigenfaces/
Dimensionality Reduction
Roadmap for today
Standard PCA (heuristic)
• Dimensionality reduction
• Maximum variance
• Minimum error

Probabilistic PCA (Maximum Likelihood)
• Generative probabilistic model
• ML equivalence to standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: factor analysis
• Relaxing Gaussianity: ICA
• Relaxing linearity: kernel PCA
Probabilistic PCA: A synthesizer’s view
𝒙 = W𝒛 + 𝝁 + 𝝐

• p(𝒛) = N(𝒛 | 𝟎, I) – standardised normal distribution
  • independent latent variables with zero mean & unit variance
• p(𝝐) = N(𝝐 | 𝟎, σ²I) – a spherical Gaussian
  • i.e. identical independent noise in each of the data dimensions
• Prior predictive or marginal distribution of data points:

  p(𝒙) = N(𝒙 | 𝝁, C) with C = WWᵀ + σ²I
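The “synthesizer’s view” can be exercised directly: sample from the generative model and check that the empirical covariance of 𝒙 approaches WWᵀ + σ²I. A minimal NumPy sketch (not from the slides; W, 𝝁 and σ are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)
D, M, N = 4, 2, 200_000
W = rng.standard_normal((D, M))
mu = np.array([1.0, -2.0, 0.5, 3.0])
sigma = 0.3

Z = rng.standard_normal((N, M))             # p(z) = N(0, I)
eps = sigma * rng.standard_normal((N, D))   # spherical Gaussian noise
X = Z @ W.T + mu + eps                      # x = W z + mu + eps

# Marginal covariance of x should approach C = W W^T + sigma^2 I.
C = W @ W.T + sigma**2 * np.eye(D)
C_hat = np.cov(X, rowvar=False)
print(np.max(np.abs(C_hat - C)))            # small (sampling error only)
```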
Probabilistic PCA: ML-solution
W_ML = U_M (L_M − σ²I)^(1/2) R

• Same as in heuristic PCA: U_M – matrix of the first M eigenvectors, L_M – diagonal matrix of eigenvalues
• R – only specified up to a rotation in latent space
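The closed-form ML solution can be computed directly from the sample eigendecomposition; a NumPy sketch (not from the slides; the synthetic 5-D data set with 2 latent dimensions is an illustrative assumption, and R = I is one valid choice of the arbitrary rotation):

```python
import numpy as np

rng = np.random.default_rng(4)
# Data drawn from a 2-D latent PPCA model embedded in 5 dimensions.
W_true = rng.standard_normal((5, 2))
X = rng.standard_normal((5000, 2)) @ W_true.T + 0.2 * rng.standard_normal((5000, 5))

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending

M = 2
sigma2_ml = eigvals[M:].mean()                       # mean discarded variance
L = np.diag(eigvals[:M] - sigma2_ml)
W_ml = eigvecs[:, :M] @ np.sqrt(L)                   # R = I (any rotation works)

# The fitted marginal covariance reproduces the top of the sample spectrum.
C_ml = W_ml @ W_ml.T + sigma2_ml * np.eye(5)
print(np.allclose(np.linalg.eigvalsh(C_ml)[-M:], eigvals[:M][::-1]))  # True
```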
Recap: The EM-algorithm
• The Expectation-Maximization algorithm determines the Maximum Likelihood solution for our model parameters iteratively
• Advantageous compared to direct eigenvector decomposition if M ≪ D, i.e. if we have considerably fewer latent variables than data dimensions
• Projection onto a very low dimensional space, e.g. for data visualization in 2 or 3 dimensions
EM-Algorithm: Expectation Step
• We consider the complete-data likelihood p(X, Z | μ, W, σ²)
• Maximizing the marginal likelihood p(X | μ, W, σ²) instead would need an integration over latent space
• E-Step: The posterior distribution of the latent variables, p(Z | X, θ_old), is updated and used to calculate the expected value of the complete-data log likelihood with respect to this posterior,
• keeping the current estimates of the parameters θ = (μ, W, σ²) fixed
EM-Algorithm: Maximization Step
• M-Step: The calculated expectation is now maximized with respect to the model parameters W and σ²,
• keeping the estimated posterior distribution of Z fixed from the E-Step
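Both steps can be written down in a few lines of NumPy for probabilistic PCA. This is a sketch, not from the slides: the update formulas are the standard EM equations for PPCA (posterior moments in the E-step, closed-form W and σ² in the M-step), and the synthetic data set is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(5)
# Synthetic data from a 2-D latent model in 5 dimensions (illustrative).
X = rng.standard_normal((2000, 2)) @ rng.standard_normal((2, 5)) \
    + 0.3 * rng.standard_normal((2000, 5))
Xc = X - X.mean(axis=0)
N, D = Xc.shape
M = 2

W = rng.standard_normal((D, M))      # random initialisation
sigma2 = 1.0

for _ in range(200):
    # E-step: posterior moments of the latent variables, parameters fixed.
    Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
    Ez = Xc @ W @ Minv                        # E[z_n], shape N x M
    SumEzz = N * sigma2 * Minv + Ez.T @ Ez    # sum_n E[z_n z_n^T]

    # M-step: maximise the expected complete-data log likelihood, posterior fixed.
    W = (Xc.T @ Ez) @ np.linalg.inv(SumEzz)
    sigma2 = (np.sum(Xc**2) - 2 * np.sum(Ez * (Xc @ W))
              + np.trace(SumEzz @ W.T @ W)) / (N * D)

# Compare with the closed-form ML noise level (mean of discarded eigenvalues).
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / N)   # ascending
print(round(sigma2, 3), round(eigvals[:D - M].mean(), 3))
```

No D×D eigendecomposition appears in the loop, which is exactly the advantage claimed above when M ≪ D.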
EM-algorithm for ML-PCA
Green dots: data points, always fixed.
Expectation: red rod (W) is fixed, cyan connections of blue springs move, obeying spring forces.
Maximization: cyan connections are fixed, red rod moves, obeying spring forces.
Roadmap for today
Standard PCA (heuristic)
• Dimensionality reduction
• Maximum variance
• Minimum error

Probabilistic PCA (Maximum Likelihood)
• Generative probabilistic model
• ML equivalence to standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: factor analysis
• Relaxing Gaussianity: ICA
• Relaxing linearity: kernel PCA
Bayesian PCA – Finding the real dimension
Maximum Likelihood vs. Bayesian PCA

𝒙 = W𝒛 + 𝝁 + 𝝐

• Introducing hyperparameters αᵢ (one precision per latent dimension) and marginalizing over W:

  p(W | α) ∝ ∏ᵢ exp(−½ αᵢ 𝒘ᵢᵀ𝒘ᵢ)

• Estimating the αᵢ from the data drives superfluous columns of W to zero (αᵢ → ∞)
• Estimated projection matrix for a latent variable model fitted to synthetic data generated from a latent model of lower intrinsic dimensionality: only the true number of columns survives
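The mechanism can be illustrated with the ARD re-estimation formula αᵢ = D / ‖𝒘ᵢ‖². The sketch below is not the full iterative Bayesian scheme: it applies this formula once to an ML-fitted W on synthetic 10-D data of intrinsic dimensionality 3 (an illustrative assumption), which already separates real from spurious latent dimensions:

```python
import numpy as np

rng = np.random.default_rng(6)
# 10-D data with intrinsic dimensionality 3 (illustrative).
X = rng.standard_normal((5000, 3)) @ rng.standard_normal((3, 10)) \
    + 0.1 * rng.standard_normal((5000, 10))
Xc = X - X.mean(axis=0)
D, M = 10, 9                                  # deliberately too many columns

eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
sigma2 = eigvals[M:].mean()
W = eigvecs[:, :M] @ np.diag(np.sqrt(np.maximum(eigvals[:M] - sigma2, 0)))

# ARD re-estimation: alpha_i = D / ||w_i||^2. Spurious columns get a huge
# alpha (their prior precision diverges, switching the dimension off).
alpha = D / np.sum(W**2, axis=0)
print(int(np.sum(alpha < 100 * alpha.min())))   # surviving dimensions (3 here)
```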
Roadmap for today
Standard PCA (heuristic)
• Dimensionality reduction
• Maximum variance
• Minimum error

Probabilistic PCA (Maximum Likelihood)
• Generative probabilistic model
• ML equivalence to standard PCA

Bayesian PCA
• Automatic determination of latent space dimension

Generalizations
• Relaxing equal data noise amplitude: factor analysis
• Relaxing Gaussianity: ICA
• Relaxing linearity: kernel PCA
Factor Analysis: A non-spherical PCA
𝒙 = W𝒛 + 𝝁 + 𝝐 with p(𝝐) = N(𝝐 | 𝟎, Ψ), Ψ diagonal

• Noise is still independent and Gaussian, but with an individual variance per data dimension
• Controversy: Do the factors (dimensions of 𝒛) have an interpretable meaning?
• Problem: posterior invariant w.r.t. rotations of W
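Factor analysis can be fitted with the same EM machinery as probabilistic PCA, only with a diagonal noise update. A NumPy sketch (not from the slides; the synthetic data set with deliberately unequal noise levels is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
# 5-D data with 2 latent factors and *different* noise levels per dimension.
psi_true = np.array([0.05, 0.1, 0.2, 0.4, 0.8])
W_true = rng.standard_normal((5, 2))
X = (rng.standard_normal((5000, 2)) @ W_true.T
     + rng.standard_normal((5000, 5)) * np.sqrt(psi_true))
Xc = X - X.mean(axis=0)
N, D = Xc.shape
M = 2
S = Xc.T @ Xc / N

W = rng.standard_normal((D, M))
psi = np.ones(D)

for _ in range(500):
    # E-step (as in probabilistic PCA, but with diagonal noise Psi).
    G = np.linalg.inv(np.eye(M) + (W / psi[:, None]).T @ W)
    Ez = Xc @ (W / psi[:, None]) @ G          # E[z_n]
    SumEzz = N * G + Ez.T @ Ez                # sum_n E[z_n z_n^T]

    # M-step: new W, then a *diagonal* noise covariance update.
    W = (Xc.T @ Ez) @ np.linalg.inv(SumEzz)
    psi = np.diag(S - W @ (Ez.T @ Xc) / N)

print(np.round(psi, 2))   # close to psi_true: per-dimension noise recovered
```

The only structural change from PPCA is that Ψ keeps one variance per data dimension instead of collapsing them all to σ².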
Independent Component Analysis (ICA)
𝒙 = W𝒛 with p(𝒛) = ∏ⱼ p(zⱼ)

• Still a linear model of independent components
• No data noise component, for dim(latent space) = dim(data space)
• Explicitly non-Gaussian
  • otherwise, no separation of the mixing coefficients in W from the latent variables would be possible (rotational symmetry of the Gaussian)
• Maximization of non-Gaussianity/independence
  • different criteria, e.g. kurtosis, skewness
  • minimization of mutual information
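One standard way to maximize non-Gaussianity is the FastICA fixed-point iteration. The sketch below is an illustrative NumPy implementation, not from the slides: two uniform (sub-Gaussian) sources are mixed, the data are whitened, and a tanh contrast with symmetric decorrelation recovers the sources up to sign, scale and permutation:

```python
import numpy as np

rng = np.random.default_rng(8)
# Two independent non-Gaussian (uniform) sources, linearly mixed.
sources = rng.uniform(-1, 1, size=(5000, 2))
A = np.array([[1.0, 0.5], [0.3, 1.0]])
X = sources @ A.T

# Whiten: zero mean, identity covariance (standard ICA preprocessing).
Xc = X - X.mean(axis=0)
d, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
Xw = Xc @ E @ np.diag(1 / np.sqrt(d)) @ E.T

# Symmetric FastICA with the tanh non-linearity (one possible contrast).
W = rng.standard_normal((2, 2))
for _ in range(200):
    g = np.tanh(Xw @ W.T)
    g_prime = 1 - g**2
    W_new = (g.T @ Xw) / len(Xw) - np.diag(g_prime.mean(axis=0)) @ W
    # Symmetric decorrelation: W <- (W W^T)^(-1/2) W
    d2, E2 = np.linalg.eigh(W_new @ W_new.T)
    W = E2 @ np.diag(1 / np.sqrt(d2)) @ E2.T @ W_new

Y = Xw @ W.T                                  # recovered components
# Each recovered component should match one source up to sign and scale.
corr = np.abs(np.corrcoef(Y.T, sources.T)[:2, 2:])
print(np.round(corr, 2))                      # near a permutation matrix
```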
ICA vs PCA
• ICA rewards bi-modality (non-Gaussianity) of the projected distribution
• PCA rewards maximum variance of the projected data
PCA: 1st principal component
ICA: 1st independent component
Unsupervised method: no class labels!
Summary
Parameter estimation
• Heuristic quadratic cost function (minimum-error projection)
• Probabilistic (Maximum Likelihood projection matrix)
• Bayesian (hyperparameters of projection vectors)

Generative probabilistic process in latent space
• Standardized normal distribution (PCA)
• Standardized normal distribution (factor analysis)
• Independent probabilistic process for each dimension (ICA)

Noise in data space
• Spherical Gaussian (PCA)
• Gaussian (factor analysis)
• None (ICA)

Feature mapping (latent to data space)
• Linear: PCA, ICA, factor analysis
• Nonlinear: kernel PCA
Relation To Other Topics
• Today
  • data preprocessing
    • whitening via covariance ⇒ identity
  • data representation / feature extraction
  • “model-free” analysis
    • well: NO! We have seen the model assumptions in probabilistic PCA
  • dimensionality reduction
    • via projection onto the basis vectors carrying the most variance / leaving the smallest error
    • at least for linear models, not for kernel PCA
• The matrix
Kernel PCA
• Instead of the sample covariance matrix

  C = (1/N) Σₙ₌₁ᴺ 𝒙ₙ𝒙ₙᵀ,

  we now consider a covariance matrix in a feature space:

  C̃ = (1/N) Σₙ₌₁ᴺ Φ(𝒙ₙ)Φ(𝒙ₙ)ᵀ

• As always, the kernel trick of not computing in the high-dimensional feature space works, because the covariance matrix only needs scalar products of the mapped data points, k(𝒙ₙ, 𝒙ₘ) = Φ(𝒙ₙ)ᵀΦ(𝒙ₘ)
Kernel PCA – Example: Gaussian kernel
• Kernel PCA does not enable dimensionality reduction in data space
  • the image of the data space under Φ is a manifold in feature space, not a linear subspace
  • the PCA projects onto subspaces in feature space with elements 𝒗 = Σₙ aₙΦ(𝒙ₙ)
  • these elements typically do not lie on the manifold, so their pre-images Φ⁻¹(𝒗) will not be in data space
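The mechanics of kernel PCA, building the kernel matrix, centring it in feature space, and projecting via the kernel, can be sketched in NumPy. This is an illustrative example, not from the slides: two concentric noisy rings (not linearly separable in data space, a standard demonstration) and a Gaussian kernel of unit width:

```python
import numpy as np

rng = np.random.default_rng(9)
# Two concentric rings: no linear 1-D projection separates them (illustrative).
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])
X = np.c_[r * np.cos(theta), r * np.sin(theta)] \
    + 0.05 * rng.standard_normal((200, 2))

# Gaussian kernel matrix K_nm = exp(-||x_n - x_m||^2 / 2).
sq = np.sum((X[:, None] - X[None]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)

# Centre the kernel matrix in feature space: K' = K - 1K - K1 + 1K1.
N = len(X)
ones = np.full((N, N), 1 / N)
Kc = K - ones @ K - K @ ones + ones @ K @ ones

# Eigendecomposition of the centred kernel matrix; projections onto the
# first kernel principal component are computed through K alone.
eigvals, eigvecs = np.linalg.eigh(Kc)
a1 = eigvecs[:, -1] / np.sqrt(eigvals[-1])   # unit norm in feature space
proj = Kc @ a1

# Fraction of points on the "correct" side of zero for each ring.
inner, outer = proj[:100], proj[100:]
acc = max(((inner > 0).mean() + (outer <= 0).mean()) / 2,
          ((inner <= 0).mean() + (outer > 0).mean()) / 2)
print(round(acc, 2))   # the two rings land largely on opposite sides
```

The feature space is never materialised: everything runs through the N×N kernel matrix, which is exactly the point of the kernel trick above.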