LING 696B: PCA and other linear projection methods
1
LING 696B: PCA and other linear projection methods
2
Curse of dimensionality
The higher the dimension, the more data is needed to draw any conclusion
Probability density estimation:
- Continuous: histograms
- Discrete: k-factorial designs
Decision rules:
- Nearest-neighbor and k-nearest neighbor
3
How to reduce dimension?
Assume we know something about the distribution
Parametric approach: assume data follow distributions within a family H
Example: counting histograms for 10-D data needs lots of bins, but knowing the data form a pancake allows us to fit a Gaussian: (number of bins)^10 cells vs. 10 + 10*11/2 = 65 parameters
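A back-of-the-envelope check of those counts (the choice of 20 bins per dimension is just for illustration):

```python
# Parameter counts for density estimation in d = 10 dimensions.
d = 10
bins_per_dim = 20                       # illustrative choice

histogram_cells = bins_per_dim ** d     # one count per cell: (number of bins)^10
gaussian_params = d + d * (d + 1) // 2  # mean (10) + symmetric covariance (55)

print(histogram_cells)                  # 10240000000000 cells
print(gaussian_params)                  # 65 parameters
```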
4
Linear dimension reduction
The pancake/Gaussian assumption is crucial for linear methods
Examples:
- Principal Components Analysis
- Multidimensional Scaling
- Factor Analysis
5
Covariance structure of multivariate Gaussian
2-dimensional example
No correlations --> diagonal covariance matrix, e.g. Σ = [[σ1², 0], [0, σ2²]]
(The diagonal entries give the variance in each dimension; the off-diagonal entries give the correlation between dimensions)
Special case: Σ = I, where the negative log likelihood is just the squared Euclidean distance to the center, up to scaling and a constant
6
Covariance structure of multivariate Gaussian
Non-zero correlations --> full covariance matrix, Cov(X1, X2) ≠ 0, e.g. Σ = [[σ1², σ12], [σ12, σ2²]]
Nice property of Gaussians: closed under linear transformation
This means we can remove correlation by rotation
7
Covariance structure of multivariate Gaussian
Rotation matrix: R = (w1, w2), where w1, w2 are two unit vectors perpendicular to each other
(Figures: rotation by 90 degrees and rotation by 45 degrees, showing the rotated basis vectors w1, w2)
8
Covariance structure of multivariate Gaussian
Matrix diagonalization: any 2x2 covariance matrix A can be written as A = R D R^T, where R is a rotation matrix and D is diagonal
Interpretation: we can always find a rotation to make the covariance look "nice" -- no correlation between dimensions
This IS PCA when applied to N dimensions
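A minimal numpy sketch of this diagonalization on a made-up 2x2 covariance matrix:

```python
import numpy as np

# A made-up 2x2 covariance matrix with non-zero correlation.
A = np.array([[2.0, 1.2],
              [1.2, 1.0]])

# Symmetric eigendecomposition: A = R D R^T, with R orthogonal (a rotation,
# possibly plus reflection) and D diagonal (no correlation between dimensions).
eigvals, R = np.linalg.eigh(A)
D = np.diag(eigvals)

print(np.allclose(R @ D @ R.T, A))      # True: A = R D R^T
print(np.allclose(R.T @ R, np.eye(2)))  # True: R is orthogonal
```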
9
Computation of PCA
The new coordinates uniquely identify the rotation (in 3-D: 3 coordinates w1, w2, w3)
In computation, it is easier to identify one coordinate at a time
Step 1: centering the data, X <-- X - mean(X), since we want to rotate around the center
10
Computation of PCA
Step 2: finding a direction of projection that has the maximal "stretch"
Linear projection of X onto vector w: Proj_w(X) = X (N x d) * w (d x 1), with X centered
Now measure the stretch: the sample variance Var(X*w)
11
Computation of PCA
Step 3: formulate this as a constrained optimization problem
Objective of optimization: Var(X*w)
Need a constraint on w (otherwise the objective can grow without bound), so only consider the direction: ||w|| = 1
So formally: find the w that achieves max_{||w||=1} Var(X*w)
12
Computation of PCA
Some algebra (homework):
Var(x) = E[(x - E[x])^2] = E[x^2] - (E[x])^2
Applied to matrices (homework): Var(X*w) = w^T X^T X w = w^T Cov(X) w (why?)
Cov(X) is a d x d matrix (homework):
- Symmetric (easy)
- For any y, y^T Cov(X) y >= 0 (tricky)
13
Computation of PCA
Going back to the optimization problem: max_{||w||=1} Var(X*w) = max_{||w||=1} w^T Cov(X) w
The maximizer w1 is the eigenvector of Cov(X) with the largest eigenvalue, and the maximal variance is that eigenvalue
w1 is the first Principal Component! (see demo; a minimal sketch follows below)
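A minimal numpy sketch of Steps 1-3 on made-up 2-D pancake data: center, form the covariance, and take the top eigenvector as w1.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up "pancake" data: 500 points, stretched by 3.0 and 0.5 along the two axes.
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])

X = X - X.mean(axis=0)                # Step 1: center the data
C = X.T @ X / len(X)                  # sample covariance (the 1/N factor does not change directions)

eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
w1 = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue

print(w1)                             # ~ +/-[1, 0]: the long axis of the pancake
print(eigvals[-1])                    # the maximal variance itself, ~ 9
```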
14
More principal components
We keep looking among all the projections perpendicular to w1
Formally: max_{||w2||=1, w2 ⊥ w1} w2^T Cov(X) w2
This turns out to be the eigenvector corresponding to the 2nd largest eigenvalue (see demo)
w2 gives the second of the new coordinates
15
Rotation
Can keep going until we find all projections/coordinates w1, w2, ..., wd
Putting them together, we have a big matrix W = (w1, w2, ..., wd)
W is called an orthogonal matrix
This corresponds to a rotation (sometimes plus reflection) of the pancake
The rotated pancake has no correlation between dimensions (see demo and the sketch below)
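A self-contained numpy sketch of this rotation on made-up correlated 2-D data; after multiplying by W, the covariance is (numerically) diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up correlated 2-D data: a stretched cloud rotated by 30 degrees.
theta = np.pi / 6
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])) @ rot.T
X = X - X.mean(axis=0)

C = X.T @ X / len(X)
eigvals, W = np.linalg.eigh(C)          # columns of W are w1, ..., wd

Y = X @ W                               # rotate the pancake into the new coordinates
C_rot = Y.T @ Y / len(Y)

print(np.allclose(W.T @ W, np.eye(2)))  # True: W is orthogonal
print(np.round(C_rot, 3))               # (numerically) diagonal: no correlation left
```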
16
When does dimension reduction occur?
Decomposition of the covariance matrix: Cov(X) = λ1 w1 w1^T + λ2 w2 w2^T + ... + λd wd wd^T
If only the first few eigenvalues are significant, we can ignore the rest, e.g. keep only 2-D coordinates of X
17
Measuring "degree" of reduction
How much of the total spread do the leading components capture? Compare the leading eigenvalues (the pancake's axis lengths a1, a2) against the rest
(Figure: pancake data in 3-D)
18
Reconstruction from principal components
Perfect reconstruction (x centered): x = (x·w1) w1 + (x·w2) w2 + ... + (x·wd) wd, where each piece is a length (the projection x·wi) times a direction (wi); perfect reconstruction uses all the pieces
Reconstruction error: ||x - Σ_{i=1..k} (x·wi) wi||^2, keeping only the bigger pieces
Minimizing this error is another formulation of PCA (a numpy sketch follows below)
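A numpy sketch of the reconstruction error as more principal components (the "bigger pieces") are kept; the 3-D data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up pancake data in 3-D: most of the variance lives in the first two axes.
X = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.2])
X = X - X.mean(axis=0)

C = X.T @ X / len(X)
eigvals, W = np.linalg.eigh(C)
W = W[:, ::-1]                       # reorder columns: largest eigenvalue first

for k in (1, 2, 3):
    Wk = W[:, :k]                    # keep the k biggest pieces
    X_hat = (X @ Wk) @ Wk.T          # project onto the top-k PCs and map back
    err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    print(k, round(err, 3))          # error drops to ~0 once all PCs are used
```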
19
A creative interpretation/implementation of PCA
Any x can be reconstructed from principal components (the PCs form a basis for the whole space)
(Figure: a network with an input layer X, a hidden layer, and an output layer X; the connection weights W give the hidden "encoding" -- the "neural firing" -- and map it back to the output)
When (# of hidden units) < (# of input units), the network does dimension reduction
This can be used to implement PCA (see the sketch below)
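A rough numpy sketch of that network, with made-up data and a 2-unit hidden layer: a linear autoencoder trained by gradient descent to reproduce its input. With squared reconstruction error, the hidden layer ends up spanning roughly the same subspace as the leading PCs, though it need not recover the individual rotated components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 5-D data whose variance is concentrated in 2 directions.
X = rng.normal(size=(400, 5)) * np.array([4.0, 3.0, 0.3, 0.2, 0.1])
X = X - X.mean(axis=0)

d, h = 5, 2                                   # input size and hidden size (h < d)
W_enc = rng.normal(scale=0.1, size=(d, h))    # input -> hidden ("encoding")
W_dec = rng.normal(scale=0.1, size=(h, d))    # hidden -> output

lr = 1e-3
for _ in range(2000):
    H = X @ W_enc                 # hidden activations ("neural firing")
    X_hat = H @ W_dec             # reconstructed output
    E = X_hat - X                 # reconstruction error
    # Gradients of the (halved) mean squared reconstruction error.
    grad_dec = H.T @ E / len(X)
    grad_enc = X.T @ (E @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(round(np.mean(E ** 2), 3))  # small: the 2 hidden units capture most of the variance
```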
20
An intuitive application of PCA: (Story and Titze) and others
Vocal tract measurements are high dimensional (different articulators)
Measurements from different positions are correlated
Underlying geometry: a few articulatory parameters, possibly pancake-like after collapsing a number of different sounds
Big question: relate low-dimensional articulatory parameters (tongue shape) to low-dimensional acoustic parameters (F1/F2)
21
Story and Titze's application of PCA
Source data: area function data obtained from MRI (d = 44)
Step 1: calculate the mean
Interestingly, the mean produces a schwa-like frequency response
22
Story and Titze's application of PCA
Step 2: subtract the mean from the area function (center the data)
Step 3: form the covariance matrix R = X^T X (a d x d matrix), X ~ (x, p)
23
Story and Titze's application of PCA
Step 4: eigen-decomposition of the covariance matrix to get the PCs; Story calls them "empirical modes"
Length of projection: qi = (x - mean)·wi
Reconstruction: x ≈ mean + q1 w1 + q2 w2 + ...
24
Story and Titze's application of PCA
Story's principal components: the first 2 PCs can do most of the reconstruction
They can be seen as perturbations of the overall tongue shape (from the mean): values < 0 give a constriction, values > 0 an expansion
25
Story and Titze's application of PCA
The principal components are interpretable as control parameters
The acoustic-to-articulatory mapping is almost one-to-one after dimension reduction
26
Applying PCA to ultrasound data?
Another imaging technique; generates a tongue profile similar to X-ray and MRI
High-dimensional and correlated
Need dimension reduction to interpret articulatory parameters (see demo)
27
An unintuitive application of PCA
Latent Semantic Indexing in document retrieval
Documents as vectors of word counts (dimensions like "market", "stock", "bonds")
Try to extract some "features" by linear combination of word counts
The underlying geometry is unclear (mean? distance?)
The meaning of the principal components is unclear (rotation?)
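A toy numpy sketch of the idea, with a made-up five-word vocabulary and four documents: PCA applied to the document-by-word count matrix. Whether the mean and the rotation mean anything here is exactly the question raised above.

```python
import numpy as np

# Toy term-document counts (rows = documents, columns = words); all numbers made up.
vocab = ["market", "stock", "bonds", "tongue", "vowel"]
X = np.array([[4, 3, 2, 0, 0],     # a finance document
              [5, 2, 3, 0, 0],     # another finance document
              [0, 1, 0, 4, 5],     # a phonetics document
              [0, 0, 1, 5, 3]],    # another phonetics document
             dtype=float)

Xc = X - X.mean(axis=0)            # center (does a mean of word counts make sense?)
C = Xc.T @ Xc / len(Xc)
eigvals, W = np.linalg.eigh(C)

pc1 = W[:, -1]                     # "feature" = linear combination of word counts
print(dict(zip(vocab, np.round(pc1, 2))))
print(np.round(Xc @ pc1, 2))       # documents separate along this component
```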
28
Summary of PCA
PCA looks for: a sequence of linear, orthogonal projections that reveal interesting structure in data (a rotation)
Defining "interesting":
- Maximal variance under each projection
- Uncorrelated structure after projection
29
Departure from PCA: 3 directions of divergence
Other definitions of "interesting"? Linear Discriminant Analysis, Independent Component Analysis
Other methods of projection? Linear but not orthogonal: sparse coding; implicit, non-linear mappings
Turning PCA into a generative model: Factor Analysis
30
Re-thinking "interestingness"
It all depends on what you want
Linear Discriminant Analysis (LDA): supervised learning
Example: separating 2 classes
(Figure: the direction of maximal variance vs. the direction of maximal separation; a sketch follows below)
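A minimal numpy sketch contrasting the two criteria on made-up 2-class data: the PCA direction maximizes variance, while the Fisher LDA direction (proportional to S_w^{-1}(mu1 - mu2)) maximizes separation between the classes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up classes: elongated clouds whose means differ along the short axis.
A = rng.normal(size=(200, 2)) * np.array([4.0, 0.5]) + np.array([0.0, -1.5])
B = rng.normal(size=(200, 2)) * np.array([4.0, 0.5]) + np.array([0.0, +1.5])
X = np.vstack([A, B])

# PCA direction: maximal variance over the pooled data.
Xc = X - X.mean(axis=0)
_, V = np.linalg.eigh(Xc.T @ Xc / len(Xc))
w_pca = V[:, -1]

# Fisher LDA direction: maximal separation, w proportional to S_w^{-1} (mu1 - mu2).
S_w = np.cov(A.T) + np.cov(B.T)              # within-class scatter (up to scaling)
w_lda = np.linalg.solve(S_w, A.mean(axis=0) - B.mean(axis=0))
w_lda /= np.linalg.norm(w_lda)

print(np.round(w_pca, 2))   # ~ +/-[1, 0]: the long axis, ignores the classes
print(np.round(w_lda, 2))   # ~ +/-[0, 1]: the axis that separates the classes
```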
31
Re-thinking "interestingness"
Most high-dimensional data look Gaussian under linear projections, so maybe non-Gaussian is more interesting
Independent Component Analysis, projection pursuit
Example: an ICA projection of 2-class data picks the direction most unlike a Gaussian (e.g. by maximizing kurtosis); see the sketch below
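A rough sketch of the "most unlike a Gaussian" idea on made-up 2-class data: scan candidate directions and score each projection by how far its excess kurtosis is from 0 (the Gaussian value). The winning direction is the class-separating one, not the maximal-variance one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-class data, as on the previous slide: the long axis is class-irrelevant.
A = rng.normal(size=(500, 2)) * np.array([4.0, 0.5]) + np.array([0.0, -1.5])
B = rng.normal(size=(500, 2)) * np.array([4.0, 0.5]) + np.array([0.0, +1.5])
X = np.vstack([A, B])
X = X - X.mean(axis=0)

def excess_kurtosis(z):
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0        # 0 for a Gaussian

# Scan candidate unit directions and keep the most non-Gaussian projection.
angles = np.linspace(0, np.pi, 180, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
scores = [abs(excess_kurtosis(X @ w)) for w in dirs]
w_best = dirs[int(np.argmax(scores))]

print(np.round(w_best, 2))   # ~ +/-[0, 1]: the bimodal, class-separating direction
```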
32
The "efficient coding" perspective
Sparse coding: projections do not have to be orthogonal, and there can be more basis vectors than the dimension of the space (basis expansion)
Neural interpretation (Dana Ballard's talk last week)
(Figure: a point x represented with basis vectors w1, w2, w3, w4)
p << d: compact coding (PCA); p > d: sparse coding
33
"Interesting" can be expensive
Often faces difficult optimization problems: many constraints are needed, and lots of parameter sharing
Expensive to compute, no longer an eigenvalue problem
34
PCA's relatives: Factor Analysis
PCA is not a generative model: the reconstruction error is not a likelihood, it is sensitive to outliers, and it is hard to build into bigger models
Factor Analysis: adds measurement noise to account for variability
- Factors z: spherical Gaussian, N(0, I)
- Observation: x = Λ z + noise, where Λ is the loading matrix (scaled PCs)
- Measurement noise: N(0, R), R diagonal
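A small generative sketch of this model with made-up sizes and parameters: draw spherical-Gaussian factors, stretch/rotate them with a loading matrix, and add diagonal measurement noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 1000, 3, 2                       # samples, observed dims, factors (made up)

Lam = rng.normal(size=(d, k))              # loading matrix (plays the role of scaled PCs)
R = np.diag([0.1, 0.2, 0.05])              # diagonal measurement-noise covariance
mu = np.zeros(d)

z = rng.normal(size=(n, k))                # factors: spherical Gaussian N(0, I)
noise = rng.normal(size=(n, d)) * np.sqrt(np.diag(R))
X = mu + z @ Lam.T + noise                 # observation: x = mu + Lambda z + noise

# The implied covariance of x is Lambda Lambda^T + R.
print(np.round(np.cov(X.T), 2))
print(np.round(Lam @ Lam.T + R, 2))        # close to the sample covariance above
```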
35
PCA's relatives: Factor Analysis
Generative view: sphere --> stretch and rotate --> add noise
Learning: a version of the EM algorithm (see demo and synthesis)
36
Mixture of Factor Analyzers
Same intuition as other mixture models: there may be several pancakes out there, each with its own center/rotation
37
PCA's relatives: Metric multidimensional scaling
Approaches the problem in a different way: no measurements of the stimuli, only pairwise "distances" between stimuli
Intends to recover some psychological space for the stimuli (see Jeff's talk)