10-701 Introduction to Machine Learning
PCA
Slides based on 18-661 Fall 2018
PCA
To understand a phenomenon we measure various related quantities
If we knew what to measure or how to represent our measurements we might find simple relationships
But in practice we often measure redundant signals, e.g., US and European shoe sizes
We also represent data via the method by which it was gathered, e.g., pixel representation of brain imaging data
Raw data can be complex and high-dimensional
Issues • Measure redundant signals • Represent data via the method by which it was gathered
Goal: Find a ‘better’ representation for the data • To visualize and discover hidden patterns • As preprocessing for a supervised task
Dimensionality Reduction
How do we define ‘better’?
E.g., Shoe Size
[Figure: scatter plot of noisy shoe-size measurements, European Size vs. American Size]
We take noisy measurements on the European and American scales
• Modulo noise, we expect perfect correlation
How can we do ‘better’, i.e., find a simpler, compact representation?
• Pick a direction and project onto this direction
Goal: Minimize Reconstruction Error
[Figure: European Size vs. American Size, points and their projections onto a line]
Minimize Euclidean distances between original points and their projections
PCA solution solves this problem!
Linear Regression — predict y from x. Evaluate the accuracy of predictions (represented by the blue line) by the vertical distances between points and the line
PCA — reconstruct the 2D data via 2D data with a single degree of freedom. Evaluate reconstructions (represented by the blue line) by the Euclidean distances between points and their projections
[Figure: left, linear regression of y on x; right, PCA projection, European Size vs. American Size]
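To make the reconstruction-error view concrete, here is a minimal NumPy sketch (not from the slides; the toy data and candidate direction are arbitrary): center 2D points, project them onto a unit direction p, and measure the squared Euclidean distances to the projections — PCA chooses p to minimize this quantity.

```python
import numpy as np

# Illustrative sketch: reconstruction error of projecting centered 2D points onto a direction p
rng = np.random.default_rng(2)
t = rng.normal(0, 3, size=200)
X = np.column_stack([t, t + rng.normal(0, 0.3, size=200)])  # two noisy, correlated features
X = X - X.mean(axis=0)                                      # center the data

p = np.array([1.0, 1.0]) / np.sqrt(2)       # a candidate unit-norm projection direction
X_hat = np.outer(X @ p, p)                  # reconstructions on the line spanned by p
reconstruction_error = np.sum((X - X_hat) ** 2)   # PCA picks p to minimize this
```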
Another Goal: Maximize Variance
To identify patterns we want to study variation across observations
Can we do ‘better’, i.e., find a compact representation that captures variation?
[Figure: European Size vs. American Size, direction of maximal variance]
PCA solution finds directions of maximal variance!
PCA Formulation
PCA: find a lower-dimensional representation of raw data, $Z = XP$
• $X$ is n × d (raw data)
• $Z$ is n × k (reduced representation, PCA ‘scores’)
• $P$ is d × k (columns are the k principal components)
• Variance constraints
Linearity assumption ($Z = XP$) simplifies the problem
[Diagram: $X$ (n × d) times $P$ (d × k) ≈ $Z$ (n × k)]
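To make the shapes concrete, a minimal NumPy sketch of the linear map $Z = XP$ (not from the slides; the data and the placeholder P below are arbitrary, not actual principal components):

```python
import numpy as np

# Shape check for Z = XP (toy sizes: n=100 points, d=10 raw features, k=2 reduced features)
n, d, k = 100, 10, 2
X = np.random.randn(n, d)      # raw data, n x d
P = np.random.randn(d, k)      # placeholder for k principal components as columns, d x k
Z = X @ P                      # reduced representation ('scores'), n x k
assert Z.shape == (n, k)
```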
Given n training points with d features:
• $X \in \mathbb{R}^{n \times d}$: matrix storing points
• $x_j^{(i)}$: jth feature for ith point
• $\mu_j$: mean of jth feature
Variance of 1st feature: $\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n} \big(x_1^{(i)} - \mu_1\big)^2$
Variance of 1st feature (assuming zero mean): $\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n} \big(x_1^{(i)}\big)^2$
Covariance of 1st and 2nd features (assuming zero mean): $\sigma_{12} = \frac{1}{n}\sum_{i=1}^{n} x_1^{(i)} x_2^{(i)}$
• Symmetric: $\sigma_{12} = \sigma_{21}$
• Zero → uncorrelated
• Large magnitude → (anti) correlated / redundant
• $\sigma_{12} = \sigma_1^2 = \sigma_2^2$ → features are the same
Covariance Matrix
Covariance matrix generalizes this idea for many features: $C_X = \frac{1}{n} X^\top X$ is the d × d covariance matrix (with zero-mean features)
• ith diagonal entry equals variance of ith feature
• ijth entry is covariance between ith and jth features
• Symmetric (makes sense given the definition of covariance)
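As a quick illustration (not part of the slides), the covariance matrix can be computed in a few lines of NumPy on toy data and checked against the built-in estimator:

```python
import numpy as np

# Toy data: n points, d features
X = np.random.randn(100, 2)

# Center the data so every feature has zero mean
X_centered = X - X.mean(axis=0)

# d x d covariance matrix C_X = (1/n) X^T X
n = X_centered.shape[0]
C_X = (X_centered.T @ X_centered) / n

# Sanity check against NumPy's estimator (bias=True uses the 1/n normalization)
assert np.allclose(C_X, np.cov(X_centered, rowvar=False, bias=True))
```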
What constraints make sense in the reduced representation?
• No feature correlation, i.e., all off-diagonals in $C_Z$ are zero
• Rank-ordered features by variance, i.e., sorted diagonals of $C_Z$
PCA Formulation
PCA: find a lower-dimensional representation of raw data, $Z = XP$
• $X$ is n × d (raw data)
• $Z$ is n × k (reduced representation, PCA ‘scores’)
• $P$ is d × k (columns are the k principal components)
• Variance / Covariance constraints: $P$ equals the top k eigenvectors of $C_X$
PCA Solution
All covariance matrices have an eigendecomposition
• $C_X = U \Lambda U^\top$ (eigendecomposition)
• $U$ is d × d (columns are eigenvectors, sorted by their eigenvalues)
• $\Lambda$ is d × d (diagonals are eigenvalues, off-diagonals are zero)
The d eigenvectors are orthonormal directions of max variance
• Associated eigenvalues equal the variance in these directions
• 1st eigenvector is the direction of max variance (variance is $\lambda_1$)
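A minimal sketch of this eigendecomposition-based solution in NumPy; the function name pca and its interface are illustrative, not the course's reference code:

```python
import numpy as np

def pca(X, k):
    """Minimal PCA sketch: return the top-k principal components, scores, and eigenvalues.

    Assumes X is an (n, d) array; illustrative only.
    """
    # Center the data (zero-mean features)
    X_centered = X - X.mean(axis=0)
    n = X_centered.shape[0]

    # Covariance matrix C_X = (1/n) X^T X
    C_X = (X_centered.T @ X_centered) / n

    # Eigendecomposition C_X = U Lambda U^T; eigh handles symmetric matrices
    # and returns eigenvalues in ascending order, so sort them descending
    eigvals, eigvecs = np.linalg.eigh(C_X)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    P = eigvecs[:, :k]        # d x k: top-k eigenvectors (principal components)
    Z = X_centered @ P        # n x k: PCA scores
    return P, Z, eigvals
```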
Choosing k
How should we pick the dimension of the new representation?
Visualization: Pick top 2 or 3 dimensions for plotting purposes
Other analyses: Capture ‘most’ of the variance in the data
• Recall that eigenvalues are variances in the directions specified by eigenvectors, and that eigenvalues are sorted
• Fraction of retained variance: $\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i}$
• Can choose k such that we retain some fraction of the variance, e.g., 95%
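A small sketch of this rule, assuming the eigenvalues arrive sorted in descending order (e.g., from the pca sketch above); the name choose_k and the 95% default are illustrative:

```python
import numpy as np

def choose_k(eigvals, target=0.95):
    """Pick the smallest k whose retained-variance fraction reaches `target`.

    `eigvals` are the covariance eigenvalues sorted in descending order.
    """
    retained = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(retained, target) + 1)
```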
Other Practical Tips
PCA assumptions (linearity, orthogonality) are not always appropriate
• Various extensions to PCA with different underlying assumptions, e.g., manifold learning, Kernel PCA, ICA
Centering is crucial, i.e., we must preprocess data so that all features have zero mean before applying PCA
PCA results depend on the scaling of the data
• Data is sometimes rescaled in practice before applying PCA
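A hedged preprocessing sketch (toy data; whether to rescale, and how, is application-dependent):

```python
import numpy as np

# Toy data with features on very different scales
X = np.random.randn(100, 2) * np.array([1.0, 50.0])

X_centered = X - X.mean(axis=0)                       # zero-mean features (required before PCA)
X_standardized = X_centered / X_centered.std(axis=0)  # optional: rescale to unit variance
```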
Orthogonal and Orthonormal Vectors
Example vectors: $a = [1\ 0]^\top$, $b = [0\ 1]^\top$, $c = [1\ 1]^\top$, $d = [2\ 0]^\top$
Orthogonal vectors are perpendicular to each other
• Equivalently, their dot product equals zero
• $a^\top b = 0$ and $d^\top b = 0$, but c isn't orthogonal to the others
Orthonormal vectors are orthogonal and have unit norm
• a and b are orthonormal, but b and d are not orthonormal
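These properties are easy to verify numerically; a tiny NumPy check of the example vectors (illustrative only):

```python
import numpy as np

a, b, c, d = np.array([1, 0]), np.array([0, 1]), np.array([1, 1]), np.array([2, 0])
assert a @ b == 0 and d @ b == 0                      # a,b and d,b are orthogonal
assert c @ a != 0 and c @ b != 0                      # c is not orthogonal to a or b
assert np.linalg.norm(a) == np.linalg.norm(b) == 1.0  # a and b are also orthonormal
assert np.linalg.norm(d) != 1.0                       # d is not unit norm, so b,d are not orthonormal
```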
PCA Iterative Algorithm
k = 1: Find the direction of max variance, project onto this direction
• Locations along this direction are the new 1D representation
[Figure: European Size vs. American Size, with the direction of max variance]
More generally, for i in {1, …, k}:
• Find the direction of max variance that is orthonormal to the previously selected directions, project onto this direction
• Locations along this direction are the ith feature in the new representation
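One concrete (though not the only) way to realize this greedy view is power iteration with deflation; the sketch below is an illustration under that assumption, not the algorithm the slides prescribe:

```python
import numpy as np

def pca_iterative(X, k, n_iter=500):
    """Illustrative greedy PCA sketch via power iteration with deflation.

    Assumes X is an (n, d) array; each step finds the current direction of
    max variance and then removes ('deflates') that variance from C.
    """
    X_centered = X - X.mean(axis=0)
    n, d = X_centered.shape
    C = (X_centered.T @ X_centered) / n      # covariance matrix
    P = np.zeros((d, k))

    for i in range(k):
        p = np.random.randn(d)
        p /= np.linalg.norm(p)
        for _ in range(n_iter):
            p = C @ p                        # power iteration toward the top eigenvector of C
            p /= np.linalg.norm(p)
        P[:, i] = p
        C = C - (p @ C @ p) * np.outer(p, p)  # deflate: remove variance along p
    return P, X_centered @ P                  # components and scores
```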
Eigendecomposition
All covariance matrices have an eigendecomposition
• $C_X = U \Lambda U^\top$ (eigendecomposition)
• $U$ is d × d (columns are eigenvectors, sorted by their eigenvalues)
• $\Lambda$ is d × d (diagonals are eigenvalues, off-diagonals are zero)
Eigenvector / Eigenvalue equation: $C_X u = \lambda u$
• By definition $u^\top u = 1$ (unit norm)
• Example: $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \implies$ eigenvector: $u = [1\ 0]^\top$, eigenvalue: $\lambda = 1$
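The eigenvector / eigenvalue equation is easy to check numerically; a short sketch using NumPy's eigh on the 2 × 2 example above (illustrative only):

```python
import numpy as np

# 2x2 covariance matrix from the example; eigh handles symmetric matrices
C_X = np.array([[1.0, 0.0],
                [0.0, 1.0]])
eigvals, eigvecs = np.linalg.eigh(C_X)

# Each column u of eigvecs satisfies C_X u = lambda u and has unit norm
for lam, u in zip(eigvals, eigvecs.T):
    assert np.allclose(C_X @ u, lam * u)
    assert np.isclose(np.linalg.norm(u), 1.0)
```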
PCA Formulation, k = 1
PCA: find a one-dimensional representation of the raw data, $z = Xp$
• $X$ is n × d (raw data)
• $z$ is n × 1 (reduced representation, PCA ‘scores’)
• $p$ is d × 1 (the single principal component)
• Variance constraint
Variance of the scores: $\sigma_z^2 = \frac{1}{n}\sum_{i=1}^{n} \big(z^{(i)}\big)^2 = \frac{1}{n}\|z\|_2^2 = \frac{1}{n}\|Xp\|_2^2$
Goal: Maximize the variance $\sigma_z^2$, i.e., $\max_p \|z\|_2^2$ where $\|p\|_2 = 1$
$\sigma_z^2 = \frac{1}{n}\|z\|_2^2 = \frac{1}{n} z^\top z$ (relationship between Euclidean distance and dot product)
$\quad = \frac{1}{n}(Xp)^\top (Xp)$ (definition: $z = Xp$)
$\quad = \frac{1}{n}\, p^\top X^\top X p$ (transpose property: $(Xp)^\top = p^\top X^\top$; associativity of multiplication)
$\quad = p^\top C_X p$ (definition: $C_X = \frac{1}{n} X^\top X$)
Restated Goal: $\max_p\ p^\top C_X p$ where $\|p\|_2 = 1$
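A numerical sanity check of this derivation on toy data (illustrative; any unit-norm direction p works):

```python
import numpy as np

# Variance of the 1D scores z = Xp equals p^T C_X p for unit-norm p
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
X = X - X.mean(axis=0)              # centered data
n = X.shape[0]
C_X = (X.T @ X) / n

p = rng.standard_normal(3)
p /= np.linalg.norm(p)              # enforce ||p||_2 = 1

z = X @ p
assert np.isclose((z @ z) / n, p @ C_X @ p)
```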
Connection to Eigenvectors
Restated Goal: $\max_p\ p^\top C_X p$ where $\|p\|_2 = 1$
Recall the eigenvector / eigenvalue equation: $C_X u = \lambda u$
• By definition $u^\top u = 1$, and thus $u^\top C_X u = \lambda$
• But this is the expression we're optimizing, and thus maximal variance is achieved when $p$ is the top eigenvector of $C_X$
Similar arguments can be used for k > 1
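To see this connection empirically, a small sketch comparing $p^\top C_X p$ for random unit directions against the top eigenvalue (toy data, illustrative only):

```python
import numpy as np

# No random unit direction captures more variance than the top eigenvector of C_X
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))  # correlated toy data
X = X - X.mean(axis=0)
C_X = (X.T @ X) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(C_X)
top_var = eigvals[-1]                     # variance along the top eigenvector

for _ in range(1000):
    p = rng.standard_normal(4)
    p /= np.linalg.norm(p)
    assert p @ C_X @ p <= top_var + 1e-9  # top eigenvector maximizes p^T C_X p
```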