Independent Component Analysis
CAP5610: Machine Learning
Instructor: Guo-Jun QI
Review: Principal Component Analysis
• PCA aims to find a set of principal components that span a subspace.
• Projecting data onto this subspace yields the minimum reconstruction error.
• The principal components are orthogonal.
• PCA projection: $\mathbf{y} = W\mathbf{x}$
• Each row of W is a direction along which x will be projected (a minimal sketch of this projection follows below).
[Figure: data projected onto principal directions $w_1$ and $w_2$]
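As a concrete reference, here is a minimal NumPy sketch of this projection, assuming PCA is computed from the eigendecomposition of the sample covariance; the function name and test data are illustrative only.

```python
import numpy as np

def pca_projection(X, k):
    """Project rows of X (n_samples x n_features) onto the top-k principal components.

    Returns Y = Xc @ W.T, where each row of W is a principal direction.
    """
    Xc = X - X.mean(axis=0)                  # center the data
    C = np.cov(Xc, rowvar=False)             # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort directions by decreasing variance
    W = eigvecs[:, order[:k]].T              # rows of W are principal directions
    return Xc @ W.T, W

# Example: 2-D correlated data projected onto its first principal component
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
Y, W = pca_projection(X, k=1)
```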
PCA
• PCA removes the correlations between components, but that does not mean the components become independent.
• No correlation: $\mathrm{Cov}(y_1, y_2) = E[y_1 y_2] - E[y_1]E[y_2] = 0$
• Independence: $p(y_1, y_2) - p(y_1)\,p(y_2) = 0$
• Only for the Gaussian distribution does no correlation imply independence (illustrated by the sketch after this slide).
• Independent Component Analysis (ICA) aims at finding a set of independent components.
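The following small numerical illustration (not from the slides) shows two variables with (near-)zero covariance that are clearly dependent, since one is a deterministic function of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
y1 = rng.uniform(-1.0, 1.0, size=100_000)   # symmetric, non-Gaussian
y2 = y1 ** 2                                 # fully determined by y1, so dependent

# Covariance is (close to) zero: E[y1*y2] = E[y1^3] = 0 and E[y1] = 0
cov = np.mean(y1 * y2) - np.mean(y1) * np.mean(y2)
print(f"cov(y1, y2) ~ {cov:.4f}")            # ~0, yet y2 = y1^2 is not independent of y1
```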
Source separation problem
• M independent sources $\{s_1, \dots, s_M\}$
• Mixture observations of the signals:
$x_i = \sum_{j=1}^{M} a_{ij} s_j$, i.e., $\mathbf{x} = A\mathbf{s}$
• $A = [a_{ij}]$ is the mixing matrix.
• Can we find the mixing matrix and recover the sources?
• ICA answers this question; a toy sketch of the mixing model follows this slide.
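A toy sketch of the mixing model, with two made-up non-Gaussian sources and an assumed 2-by-2 mixing matrix; it only generates the observations x = A s that ICA would later try to unmix.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 2000)

# Two independent, non-Gaussian sources s (M = 2)
s1 = np.sign(np.sin(2 * np.pi * 7 * t))      # square wave
s2 = rng.laplace(size=t.size)                # heavy-tailed noise
S = np.vstack([s1, s2])                      # shape (M, T)

# Unknown mixing matrix A; observations x = A s
A = np.array([[0.8, 0.3],
              [0.4, 0.9]])
X = A @ S                                    # each row: x_i = sum_j a_ij * s_j

# ICA would estimate W ~ A^{-1} from X alone, recovering y = W x ~ s
```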
Inverse problem
• Mixture of signals: $\mathbf{x} = A\mathbf{s}$
• ICA: find W such that the components of $\mathbf{y} = W\mathbf{x}$ are as independent as possible.
• y is an estimate of s.
• W is an estimate of $A^{-1}$.
PCA VS. ICA
• ICA finds the underlying independent components that generate the data.
ICA for Natural images
• The ICA components correspond to natural image structures.
PCA for Natural images
• PCA components are orthogonal, which may not correspond to any independent structures in natural images.
Applications: denoising images
• Noise and image are independent.
[Figure: denoising comparison – original, noisy, median filter, ICA]
Statistical independence
• Definition: components $y_1, \dots, y_M$ are independent if their joint density factorizes, $p(y_1, \dots, y_M) = p(y_1)\cdots p(y_M)$.
Source ambiguity
• Independent sources can be recovered only up to sign, scale, and permutation.
• If s is changed by sign, scale, and permutation, there exists another mixing matrix such that the observed signals x stay unchanged.
• Proof: let P be a permutation matrix and D a diagonal scaling matrix; then
$\mathbf{x} = A\mathbf{s} = (A D^{-1} P^{-1})\,[P D\,\mathbf{s}]$
(a numerical check follows this slide).
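A quick numerical check of this ambiguity, using an arbitrary random mixing matrix and example choices of P and D:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2))                  # original mixing matrix
s = rng.laplace(size=(2, 100))               # original sources

P = np.array([[0.0, 1.0],                    # permutation matrix (swap the sources)
              [1.0, 0.0]])
D = np.diag([2.0, -0.5])                     # scaling / sign-flip matrix

A_new = A @ np.linalg.inv(D) @ np.linalg.inv(P)   # A D^{-1} P^{-1}
s_new = P @ D @ s                                 # P D s

x = A @ s
x_new = A_new @ s_new
print(np.allclose(x, x_new))                 # True: the observations are identical
```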
Preprocessing: subtracting mean
• Mean: $\mathbf{m} = E[\mathbf{x}]$
• Replace x with $\mathbf{x} - \mathbf{m}$ so the observations have zero mean.
• In this case, the original sources s also have zero mean.
Preprocessing: whitening
• Covariance matrix of the observed signals: $C = E[\mathbf{x}\mathbf{x}^T]$
• Do SVD (eigendecomposition): $C = E D E^T$
• Let $\mathbf{z} = E D^{-1/2} E^T \mathbf{x}$; then z is the whitened signal, because $E[\mathbf{z}\mathbf{z}^T] = E D^{-1/2} E^T C\, E D^{-1/2} E^T = I$
• Define a new mixing matrix $A^{*} = E D^{-1/2} E^T A$; then $\mathbf{z} = A^{*}\mathbf{s}$
• We also have $E[\mathbf{z}\mathbf{z}^T] = A^{*} E[\mathbf{s}\mathbf{s}^T] A^{*T} = A^{*} A^{*T} = I$, so $A^{*}$ is orthogonal (a whitening sketch follows this slide).
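A minimal whitening sketch following the derivation above (eigendecomposition of the sample covariance); it assumes x is already zero-mean and stored with one sample per column.

```python
import numpy as np

def whiten(X):
    """Whiten zero-mean data X (n_features x n_samples): return z with E[z z^T] = I."""
    C = X @ X.T / X.shape[1]                 # sample covariance E[x x^T]
    d, E = np.linalg.eigh(C)                 # C = E diag(d) E^T
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T  # whitening matrix V = E D^{-1/2} E^T
    Z = V @ X                                # whitened signals z = V x
    return Z, V

# After whitening, the effective mixing matrix A* = V A is orthogonal,
# since E[z z^T] = A* E[s s^T] A*^T = A* A*^T = I for unit-variance sources.
```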
Preprocessing: Benefit
• Reducing the number of free parameters.
• An orthogonal N-by-N matrix $A^{*}$ has only $N(N-1)/2$ free parameters, instead of the $N^2$ parameters of a general mixing matrix.
Solving ICA
• Problem: given whitened zero-mean x, find an orthogonal matrix W so that the components of y = Wx are as independent as possible.
• Question: how do we measure the independence of the components?
• Central limit theorem: the sum of a set of i.i.d. random variables approaches a Gaussian distribution.
Non-Gaussianity and independence
• $y = \mathbf{w}^T\mathbf{x} = \mathbf{w}^T A\mathbf{s}$ is a weighted sum of the sources s, where $\mathbf{w}^T$ is a row of W.
• If y is a mixture of several sources, then by the central limit theorem y is closer to Gaussian.
• If instead y equals a single source (up to scale and sign), then y should be far from Gaussian.
• Non-Gaussianity therefore serves as a measure of the independence of y.
Measure of non-Gaussianity
• Kurtosis – the fourth-order cumulant: $\kappa_4(y) = E[y^4] - 3\,(E[y^2])^2$ (equal to $E[y^4] - 3$ for unit-variance y); it is zero for a Gaussian (computed in the sketch below).
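A small sketch of this measure; the distributions used for comparison are illustrative, not from the slides.

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth-order cumulant kappa4(y) = E[y^4] - 3 (E[y^2])^2; zero for a Gaussian."""
    y = y - y.mean()
    return np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(0)
print(excess_kurtosis(rng.normal(size=100_000)))     # ~0    (Gaussian)
print(excess_kurtosis(rng.laplace(size=100_000)))    # ~+3   (super-Gaussian)
print(excess_kurtosis(rng.uniform(-1, 1, 100_000)))  # ~-1.2 (sub-Gaussian)
```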
The Fast ICA algorithm (Hyvarinen)
• Given whitened zero-mean data z, find w such that $y = \mathbf{w}^T\mathbf{z}$ is as far from Gaussian as possible and w is a unit vector, $\mathbf{w}^T\mathbf{w} = 1$.
• Maximize the kurtosis: $f(\mathbf{w}) = \kappa_4(y) = E[y^4] - 3$, s.t. $\mathbf{w}^T\mathbf{w} = 1$.
• Lagrangian function: $L(\mathbf{w}) = f(\mathbf{w}) + \lambda(\mathbf{w}^T\mathbf{w} - 1)$
• KKT (stationarity) condition for the constrained problem: $f'(\mathbf{w}) + 2\lambda\mathbf{w} = 0$, i.e.,
$4\,E[(\mathbf{w}^T\mathbf{z})^3\,\mathbf{z}] + 2\lambda\mathbf{w} = 0$
Algorithm
• Randomly initialize w(1) as a unit vector.
• Update: $\mathbf{w}(k+1) \leftarrow E[(\mathbf{w}(k)^T\mathbf{z})^3\,\mathbf{z}] - 3\,\mathbf{w}(k)$
• Normalize: $\mathbf{w}(k+1) \leftarrow \mathbf{w}(k+1) / \|\mathbf{w}(k+1)\|$
• Repeat until convergence (a sketch of this fixed-point iteration follows).
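A minimal sketch of this one-unit fixed-point iteration, assuming Z is whitened zero-mean data with one sample per column; the function name and convergence test are illustrative.

```python
import numpy as np

def fastica_one_unit(Z, n_iter=100, tol=1e-8, seed=0):
    """Kurtosis-based FastICA fixed point for one component.

    Z: whitened, zero-mean data, shape (n_features, n_samples).
    Returns a unit vector w so that y = w^T z is maximally non-Gaussian.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)                            # random unit initialization w(1)
    for _ in range(n_iter):
        y = w @ Z                                     # y = w^T z for all samples
        w_new = (Z * y ** 3).mean(axis=1) - 3.0 * w   # E[(w^T z)^3 z] - 3 w
        w_new /= np.linalg.norm(w_new)                # renormalize to unit length
        if np.abs(np.abs(w_new @ w) - 1.0) < tol:     # converged (up to sign)
            w = w_new
            break
        w = w_new
    return w
```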
Estimate the other components
• Given an estimate of $\mathbf{w}_1$, find other directions to recover more sources.
• The 2nd w is found with the same formulation, but with the additional constraint $\mathbf{w} \perp \mathbf{w}_1$.
• For the 3rd, 4th, ... components, one more orthogonality constraint is added each time (see the deflation sketch below).
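A sketch of this deflation scheme, repeating the kurtosis fixed-point step and projecting out previously found directions with Gram-Schmidt so each new w stays orthogonal to the earlier ones; names and defaults are illustrative.

```python
import numpy as np

def fastica_deflation(Z, n_components, n_iter=200, seed=0):
    """Estimate several components one by one, keeping each new w orthogonal
    to the previously found directions (Gram-Schmidt deflation). Z must be whitened."""
    rng = np.random.default_rng(seed)
    W = []                                             # previously found unit vectors
    for _ in range(n_components):
        w = rng.normal(size=Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            y = w @ Z
            w = (Z * y ** 3).mean(axis=1) - 3.0 * w    # kurtosis fixed-point step
            for wj in W:                               # enforce w ⟂ w_1, ..., w_{k-1}
                w -= (w @ wj) * wj
            w /= np.linalg.norm(w)
        W.append(w)
    return np.vstack(W)                                # rows estimate rows of A^{-1}
```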
Another independence measure: negentropy
• Among all distributions with the same variance, the Gaussian has the maximal entropy.
• Maximize the negentropy $J(\mathbf{y}) = H(\mathbf{y}_{gauss}) - H(\mathbf{y})$,
where $\mathbf{y}_{gauss}$ is a Gaussian with the same covariance as y; negentropy is nonnegative and zero only for a Gaussian, so a large value means strong non-Gaussianity.
• Because $y = \mathbf{w}^T\mathbf{z}$, w is a unit vector, and z has covariance I, y has unit variance.
Approximation to Negentropy
• Negentropy is difficult to compute directly (it requires the density of y).
• Approximation using 3rd- and 4th-order cumulants: $J(y) \approx \frac{1}{12}E[y^3]^2 + \frac{1}{48}\kappa_4(y)^2$
• Approximation using non-quadratic functions $G$: $J(y) \approx c\,(E[G(y)] - E[G(\nu)])^2$, where $\nu$ is a standard Gaussian variable (a sketch of this approximation follows this list).
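A sketch of the non-quadratic approximation with the common choice $G(u) = \frac{1}{a}\log\cosh(au)$; the Gaussian reference term is estimated by Monte Carlo here, and the test distributions are illustrative.

```python
import numpy as np

def negentropy_approx(y, a=1.0, n_gauss=1_000_000, seed=0):
    """Approximate negentropy J(y) ∝ (E[G(y)] - E[G(nu)])^2 with G(u) = (1/a) log cosh(a u).

    y is assumed zero-mean and unit-variance (e.g., y = w^T z on whitened data);
    nu is a standard Gaussian reference, estimated here by Monte Carlo.
    """
    rng = np.random.default_rng(seed)
    G = lambda u: np.log(np.cosh(a * u)) / a
    E_G_y = np.mean(G(y))
    E_G_nu = np.mean(G(rng.normal(size=n_gauss)))     # Gaussian reference value
    return (E_G_y - E_G_nu) ** 2

rng = np.random.default_rng(1)
print(negentropy_approx(rng.normal(size=100_000)))                 # ~0 for a Gaussian y
print(negentropy_approx(rng.laplace(size=100_000) / np.sqrt(2)))   # > 0 for a non-Gaussian y
```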
Question
• In MP3, we will use PCA to project images into a subspace where the obtained components are supposed to be independent. Is this assumption valid?
• PCA only gives uncorrelated components.
• Only under a Gaussian distribution do uncorrelated components imply independence.
• So we need to verify whether the pixels (or the projected components) are generated from a Gaussian, using kurtosis and negentropy (a sketch of such a check follows this slide).
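A hedged sketch of such a check: compute the excess kurtosis of each projected component, with values near zero suggesting near-Gaussian (hence approximately independent) components. The random matrix below is only a stand-in for the actual PCA-projected image data in MP3.

```python
import numpy as np

def gaussianity_check(Y):
    """Excess kurtosis of each column of Y (n_samples x n_components).

    Values near 0 suggest near-Gaussian components, so PCA's uncorrelated
    components are then approximately independent; large |kurtosis| suggests not.
    """
    Yc = Y - Y.mean(axis=0)
    Yc /= Yc.std(axis=0)                     # standardize each component
    return (Yc ** 4).mean(axis=0) - 3.0      # kappa4 per component

# Stand-in for MP3: replace this random matrix with the PCA-projected image data.
rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 5))
print(gaussianity_check(Y))
```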
Summary
• ICA recovers a set of independent components
• PCA finds a set of uncorrelated components
• By the central limit theorem, we use non-Gaussianity to find the independent components.
• Surrogate measures: kurtosis and negentropy.
• FastICA algorithm – an iterative algorithm, no closed-form solution.
• Applications: separating independent sources from mixture signals.
• Image denoising.
• Voice separation.