10-701 Introduction to Machine Learning
PCA
Slides based on 18-661 Fall 2018
PCA
To understand a phenomenon we measure various related quantities
If we knew what to measure or how to represent our measurements we might find simple relationships
But in practice we often measure redundant signals, e.g., US and European shoe sizes
We also represent data via the method by which it was gathered, e.g., pixel representation of brain imaging data
Raw data can be complex and high-dimensional
Issues • Measure redundant signals • Represent data via the method by which it was gathered
Goal: Find a ‘better’ representation for the data • To visualize and discover hidden patterns • As preprocessing for a supervised task
Dimensionality Reduction
How do we define ‘better’?
E.g., Shoe Size
[Figure: scatter plot of noisy shoe-size measurements, European Size vs. American Size]
We take noisy measurements on the European and American scales
• Modulo noise, we expect perfect correlation
How can we do ‘better’, i.e., find a simpler, compact representation?
• Pick a direction and project onto this direction
Goal: Minimize Reconstruction Error
[Figure: European Size vs. American Size, points and their projections onto a line]
Minimize Euclidean distances between original points and their projections
PCA solution solves this problem!
Linear Regression — predict y from x. Evaluate the accuracy of predictions (represented by the blue line) by the vertical distances between points and the line
PCA — reconstruct the 2D data via 2D data with a single degree of freedom. Evaluate reconstructions (represented by the blue line) by the Euclidean distances between points and their projections
[Figure: left, linear regression of y on x; right, PCA projection, European Size vs. American Size]
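To make the reconstruction-error view concrete, here is a minimal NumPy sketch (not from the slides; the toy data and candidate direction are arbitrary): center 2D points, project them onto a unit direction p, and measure the squared Euclidean distances to the projections — PCA chooses p to minimize this quantity.

```python
import numpy as np

# Illustrative sketch: reconstruction error of projecting centered 2D points onto a direction p
rng = np.random.default_rng(2)
t = rng.normal(0, 3, size=200)
X = np.column_stack([t, t + rng.normal(0, 0.3, size=200)])  # two noisy, correlated features
X = X - X.mean(axis=0)                                      # center the data

p = np.array([1.0, 1.0]) / np.sqrt(2)       # a candidate unit-norm projection direction
X_hat = np.outer(X @ p, p)                  # reconstructions on the line spanned by p
reconstruction_error = np.sum((X - X_hat) ** 2)   # PCA picks p to minimize this
```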
Another Goal: Maximize Variance
To identify patterns we want to study variation across observations
Can we do ‘better’, i.e., find a compact representation that captures variation?
[Figure: European Size vs. American Size, direction of maximal variance]
PCA solution finds directions of maximal variance!
PCA Formulation
PCA: find a lower-dimensional representation of raw data, $Z = XP$
• $X$ is n × d (raw data)
• $Z$ is n × k (reduced representation, PCA ‘scores’)
• $P$ is d × k (columns are the k principal components)
• Variance constraints
Linearity assumption ($Z = XP$) simplifies the problem
[Diagram: $X$ (n × d) times $P$ (d × k) ≈ $Z$ (n × k)]
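To make the shapes concrete, a minimal NumPy sketch of the linear map $Z = XP$ (not from the slides; the data and the placeholder P below are arbitrary, not actual principal components):

```python
import numpy as np

# Shape check for Z = XP (toy sizes: n=100 points, d=10 raw features, k=2 reduced features)
n, d, k = 100, 10, 2
X = np.random.randn(n, d)      # raw data, n x d
P = np.random.randn(d, k)      # placeholder for k principal components as columns, d x k
Z = X @ P                      # reduced representation ('scores'), n x k
assert Z.shape == (n, k)
```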
Given n training points with d features:
• $X \in \mathbb{R}^{n \times d}$: matrix storing points
• $x_j^{(i)}$: jth feature for ith point
• $\mu_j$: mean of jth feature
Variance of 1st feature: $\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n} \big(x_1^{(i)} - \mu_1\big)^2$
Variance of 1st feature (assuming zero mean): $\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n} \big(x_1^{(i)}\big)^2$
Covariance of 1st and 2nd features (assuming zero mean): $\sigma_{12} = \frac{1}{n}\sum_{i=1}^{n} x_1^{(i)} x_2^{(i)}$
• Symmetric: $\sigma_{12} = \sigma_{21}$
• Zero → uncorrelated
• Large magnitude → (anti) correlated / redundant
• $\sigma_{12} = \sigma_1^2 = \sigma_2^2$ → features are the same
Covariance Matrix
Covariance matrix generalizes this idea for many features: $C_X = \frac{1}{n} X^\top X$ is the d × d covariance matrix (with zero-mean features)
• ith diagonal entry equals variance of ith feature
• ijth entry is covariance between ith and jth features
• Symmetric (makes sense given the definition of covariance)
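As a quick illustration (not part of the slides), the covariance matrix can be computed in a few lines of NumPy on toy data and checked against the built-in estimator:

```python
import numpy as np

# Toy data: n points, d features
X = np.random.randn(100, 2)

# Center the data so every feature has zero mean
X_centered = X - X.mean(axis=0)

# d x d covariance matrix C_X = (1/n) X^T X
n = X_centered.shape[0]
C_X = (X_centered.T @ X_centered) / n

# Sanity check against NumPy's estimator (bias=True uses the 1/n normalization)
assert np.allclose(C_X, np.cov(X_centered, rowvar=False, bias=True))
```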
What constraints make sense in the reduced representation?
• No feature correlation, i.e., all off-diagonals in $C_Z$ are zero
• Rank-ordered features by variance, i.e., sorted diagonals of $C_Z$
PCA Formulation
PCA: find a lower-dimensional representation of raw data, $Z = XP$
• $X$ is n × d (raw data)
• $Z$ is n × k (reduced representation, PCA ‘scores’)
• $P$ is d × k (columns are the k principal components)
• Variance / Covariance constraints: $P$ equals the top k eigenvectors of $C_X$
PCA Solution
All covariance matrices have an eigendecomposition
• $C_X = U \Lambda U^\top$ (eigendecomposition)
• $U$ is d × d (columns are eigenvectors, sorted by their eigenvalues)
• $\Lambda$ is d × d (diagonals are eigenvalues, off-diagonals are zero)
The d eigenvectors are orthonormal directions of max variance
• Associated eigenvalues equal the variance in these directions
• 1st eigenvector is the direction of max variance (variance is $\lambda_1$)
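A minimal sketch of this eigendecomposition-based solution in NumPy; the function name pca and its interface are illustrative, not the course's reference code:

```python
import numpy as np

def pca(X, k):
    """Minimal PCA sketch: return the top-k principal components, scores, and eigenvalues.

    Assumes X is an (n, d) array; illustrative only.
    """
    # Center the data (zero-mean features)
    X_centered = X - X.mean(axis=0)
    n = X_centered.shape[0]

    # Covariance matrix C_X = (1/n) X^T X
    C_X = (X_centered.T @ X_centered) / n

    # Eigendecomposition C_X = U Lambda U^T; eigh handles symmetric matrices
    # and returns eigenvalues in ascending order, so sort them descending
    eigvals, eigvecs = np.linalg.eigh(C_X)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    P = eigvecs[:, :k]        # d x k: top-k eigenvectors (principal components)
    Z = X_centered @ P        # n x k: PCA scores
    return P, Z, eigvals
```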
Choosing k
How should we pick the dimension of the new representation?
Visualization: Pick top 2 or 3 dimensions for plotting purposes
Other analyses: Capture ‘most’ of the variance in the data
• Recall that eigenvalues are variances in the directions specified by eigenvectors, and that eigenvalues are sorted
• Fraction of retained variance: $\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i}$
• Can choose k such that we retain some fraction of the variance, e.g., 95%
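A small sketch of this rule, assuming the eigenvalues arrive sorted in descending order (e.g., from the pca sketch above); the name choose_k and the 95% default are illustrative:

```python
import numpy as np

def choose_k(eigvals, target=0.95):
    """Pick the smallest k whose retained-variance fraction reaches `target`.

    `eigvals` are the covariance eigenvalues sorted in descending order.
    """
    retained = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(retained, target) + 1)
```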
Other Practical Tips
PCA assumptions (linearity, orthogonality) are not always appropriate
• Various extensions to PCA with different underlying assumptions, e.g., manifold learning, Kernel PCA, ICA
Centering is crucial, i.e., we must preprocess data so that all features have zero mean before applying PCA
PCA results depend on the scaling of the data
• Data is sometimes rescaled in practice before applying PCA
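A hedged preprocessing sketch (toy data; whether to rescale, and how, is application-dependent):

```python
import numpy as np

# Toy data with features on very different scales
X = np.random.randn(100, 2) * np.array([1.0, 50.0])

X_centered = X - X.mean(axis=0)                       # zero-mean features (required before PCA)
X_standardized = X_centered / X_centered.std(axis=0)  # optional: rescale to unit variance
```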
Orthogonal and Orthonormal Vectors
Example vectors: $a = [1\ 0]^\top$, $b = [0\ 1]^\top$, $c = [1\ 1]^\top$, $d = [2\ 0]^\top$
Orthogonal vectors are perpendicular to each other
• Equivalently, their dot product equals zero
• $a^\top b = 0$ and $d^\top b = 0$, but c isn't orthogonal to the others
Orthonormal vectors are orthogonal and have unit norm
• a and b are orthonormal, but b and d are not orthonormal
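These properties are easy to verify numerically; a tiny NumPy check of the example vectors (illustrative only):

```python
import numpy as np

a, b, c, d = np.array([1, 0]), np.array([0, 1]), np.array([1, 1]), np.array([2, 0])
assert a @ b == 0 and d @ b == 0                      # a,b and d,b are orthogonal
assert c @ a != 0 and c @ b != 0                      # c is not orthogonal to a or b
assert np.linalg.norm(a) == np.linalg.norm(b) == 1.0  # a and b are also orthonormal
assert np.linalg.norm(d) != 1.0                       # d is not unit norm, so b,d are not orthonormal
```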
PCA Iterative Algorithm
k = 1: Find the direction of max variance, project onto this direction
• Locations along this direction are the new 1D representation
[Figure: European Size vs. American Size, with the direction of max variance]
More generally, for i in {1, …, k}:
• Find the direction of max variance that is orthonormal to the previously selected directions, project onto this direction
• Locations along this direction are the ith feature in the new representation
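One concrete (though not the only) way to realize this greedy view is power iteration with deflation; the sketch below is an illustration under that assumption, not the algorithm the slides prescribe:

```python
import numpy as np

def pca_iterative(X, k, n_iter=500):
    """Illustrative greedy PCA sketch via power iteration with deflation.

    Assumes X is an (n, d) array; each step finds the current direction of
    max variance and then removes ('deflates') that variance from C.
    """
    X_centered = X - X.mean(axis=0)
    n, d = X_centered.shape
    C = (X_centered.T @ X_centered) / n      # covariance matrix
    P = np.zeros((d, k))

    for i in range(k):
        p = np.random.randn(d)
        p /= np.linalg.norm(p)
        for _ in range(n_iter):
            p = C @ p                        # power iteration toward the top eigenvector of C
            p /= np.linalg.norm(p)
        P[:, i] = p
        C = C - (p @ C @ p) * np.outer(p, p)  # deflate: remove variance along p
    return P, X_centered @ P                  # components and scores
```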
Eigendecomposition
All covariance matrices have an eigendecomposition
• $C_X = U \Lambda U^\top$ (eigendecomposition)
• $U$ is d × d (columns are eigenvectors, sorted by their eigenvalues)
• $\Lambda$ is d × d (diagonals are eigenvalues, off-diagonals are zero)
Eigenvector / Eigenvalue equation: $C_X u = \lambda u$
• By definition $u^\top u = 1$ (unit norm)
• Example: $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \implies$ eigenvector: $u = [1\ 0]^\top$, eigenvalue: $\lambda = 1$
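The eigenvector / eigenvalue equation is easy to check numerically; a short sketch using NumPy's eigh on the 2 × 2 example above (illustrative only):

```python
import numpy as np

# 2x2 covariance matrix from the example; eigh handles symmetric matrices
C_X = np.array([[1.0, 0.0],
                [0.0, 1.0]])
eigvals, eigvecs = np.linalg.eigh(C_X)

# Each column u of eigvecs satisfies C_X u = lambda u and has unit norm
for lam, u in zip(eigvals, eigvecs.T):
    assert np.allclose(C_X @ u, lam * u)
    assert np.isclose(np.linalg.norm(u), 1.0)
```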
PCA Formulation, k = 1
PCA: find a one-dimensional representation of the raw data, $z = Xp$
• $X$ is n × d (raw data)
• $z$ is n × 1 (reduced representation, PCA ‘scores’)
• $p$ is d × 1 (the single principal component)
• Variance constraint
Variance of the scores: $\sigma_z^2 = \frac{1}{n}\sum_{i=1}^{n} \big(z^{(i)}\big)^2 = \frac{1}{n}\|z\|_2^2 = \frac{1}{n}\|Xp\|_2^2$
Goal: Maximize the variance $\sigma_z^2$, i.e., $\max_p \|z\|_2^2$ where $\|p\|_2 = 1$
$\sigma_z^2 = \frac{1}{n}\|z\|_2^2 = \frac{1}{n} z^\top z$ (relationship between Euclidean distance and dot product)
$\quad = \frac{1}{n}(Xp)^\top (Xp)$ (definition: $z = Xp$)
$\quad = \frac{1}{n}\, p^\top X^\top X p$ (transpose property: $(Xp)^\top = p^\top X^\top$; associativity of multiplication)
$\quad = p^\top C_X p$ (definition: $C_X = \frac{1}{n} X^\top X$)
Restated Goal: $\max_p\ p^\top C_X p$ where $\|p\|_2 = 1$
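A numerical sanity check of this derivation on toy data (illustrative; any unit-norm direction p works):

```python
import numpy as np

# Variance of the 1D scores z = Xp equals p^T C_X p for unit-norm p
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
X = X - X.mean(axis=0)              # centered data
n = X.shape[0]
C_X = (X.T @ X) / n

p = rng.standard_normal(3)
p /= np.linalg.norm(p)              # enforce ||p||_2 = 1

z = X @ p
assert np.isclose((z @ z) / n, p @ C_X @ p)
```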
Connection to Eigenvectors
Restated Goal: $\max_p\ p^\top C_X p$ where $\|p\|_2 = 1$
Recall the eigenvector / eigenvalue equation: $C_X u = \lambda u$
• By definition $u^\top u = 1$, and thus $u^\top C_X u = \lambda$
• But this is the expression we're optimizing, and thus maximal variance is achieved when $p$ is the top eigenvector of $C_X$
Similar arguments can be used for k > 1
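To see this connection empirically, a small sketch comparing $p^\top C_X p$ for random unit directions against the top eigenvalue (toy data, illustrative only):

```python
import numpy as np

# No random unit direction captures more variance than the top eigenvector of C_X
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))  # correlated toy data
X = X - X.mean(axis=0)
C_X = (X.T @ X) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(C_X)
top_var = eigvals[-1]                     # variance along the top eigenvector

for _ in range(1000):
    p = rng.standard_normal(4)
    p /= np.linalg.norm(p)
    assert p @ C_X @ p <= top_var + 1e-9  # top eigenvector maximizes p^T C_X p
```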