
Page 1:

10-701 Introduction to Machine Learning

PCA

Slides based on 18-661 Fall 2018

Page 2:

PCA

Page 3:

To understand a phenomenon, we measure various related quantities

If we knew what to measure or how to represent our measurements, we might find simple relationships

But in practice we often measure redundant signals, e.g., US and European shoe sizes

We also represent data via the method by which it was gathered, e.g., pixel representation of brain imaging data

Raw data can be complex and high-dimensional

Page 4:

Issues
• Measure redundant signals
• Represent data via the method by which it was gathered

Goal: Find a 'better' representation for data
• To visualize and discover hidden patterns
• Preprocessing for a supervised task

Dimensionality Reduction

How do we define ‘better’?

Pages 5-8: E.g., Shoe Size

[Scatter plot: European Size vs. American Size]

We take noisy measurements on the European and American scales
• Modulo noise, we expect perfect correlation

How can we do 'better', i.e., find a simpler, compact representation?
• Pick a direction and project onto this direction


Page 9: Goal: Minimize Reconstruction Error

[Scatter plot: European Size vs. American Size]

Minimize the Euclidean distances between the original points and their projections

The PCA solution solves this problem!

Page 10:

[Scatter plot with a fitted line; axis labels include American Size, European Size, x, and y]

Linear regression: predict y from x. Evaluate the accuracy of predictions (represented by the blue line) by the vertical distances between the points and the line.

PCA: reconstruct the 2D data via 2D data with a single degree of freedom. Evaluate reconstructions (represented by the blue line) by Euclidean distances.
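To make the distinction concrete, here is a minimal NumPy sketch (our own illustration, not from the slides; the synthetic data and all variable names are made up) that computes both error measures on noisy, nearly collinear 2D data.

```python
# Regression penalizes vertical distances to the line; PCA penalizes
# perpendicular (Euclidean) distances to the projection line.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 * x + 0.1 * rng.normal(size=100)       # nearly collinear, noisy 2D data
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)                        # center the data
xc, yc = Xc[:, 0], Xc[:, 1]

# Regression error: mean squared vertical distance to the least-squares line
w = (xc @ yc) / (xc @ xc)
vertical_err = np.mean((yc - w * xc) ** 2)

# PCA reconstruction error: mean squared perpendicular distance to the top direction
C = Xc.T @ Xc / Xc.shape[0]                    # 2 x 2 covariance matrix
p = np.linalg.eigh(C)[1][:, -1]                # eigenvector with the largest eigenvalue
X_hat = np.outer(Xc @ p, p)                    # project onto p and reconstruct
reconstruction_err = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))

print(vertical_err, reconstruction_err)        # reconstruction error <= vertical error
```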

Pages 11-13: Another Goal: Maximize Variance

[Scatter plot: European Size vs. American Size]

To identify patterns we want to study variation across observations

Can we do 'better', i.e., find a compact representation that captures variation? The PCA solution finds directions of maximal variance!

Page 14: PCA Formulation

PCA: find a lower-dimensional representation of the raw data
• $X$ is n × d (raw data)
• $Z = XP$ is n × k (reduced representation, PCA 'scores')
• $P$ is d × k (columns are the k principal components)
• Variance constraints

The linearity assumption ($Z = XP$) simplifies the problem

[Diagram: the n × k score matrix $Z$ equals the n × d data matrix $X$ times the d × k projection matrix $P$]
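As a small, hedged illustration of these shapes (our own snippet, not from the slides; the sizes n = 100, d = 5, k = 2 are made up):

```python
import numpy as np

n, d, k = 100, 5, 2
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))                         # raw data, n x d
P = np.linalg.qr(rng.normal(size=(d, k)))[0]        # some d x k matrix with orthonormal columns
Z = X @ P                                           # reduced representation ('scores'), n x k
print(X.shape, P.shape, Z.shape)                    # (100, 5) (5, 2) (100, 2)
```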

Page 15:

Given n training points with d features:
• $X \in \mathbb{R}^{n \times d}$: matrix storing the points
• $x^{(i)}_j$: jth feature of the ith point
• $\mu_j$: mean of the jth feature

Variance of the 1st feature: $\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n} \big(x^{(i)}_1 - \mu_1\big)^2$

Variance of the 1st feature (assuming zero mean): $\sigma_1^2 = \frac{1}{n}\sum_{i=1}^{n} \big(x^{(i)}_1\big)^2$
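A quick numerical sanity check of the zero-mean formula (our own snippet, not from the slides; the toy data is made up):

```python
# After centering, (1/n) * sum of squared values of a feature column matches
# the variance NumPy computes for that feature.
import numpy as np

X = np.random.default_rng(0).normal(loc=3.0, size=(1000, 4))   # toy data with nonzero mean
Xc = X - X.mean(axis=0)                                         # make every feature zero mean
n = Xc.shape[0]
sigma1_sq = (Xc[:, 0] ** 2).sum() / n                           # variance of the 1st feature
print(np.isclose(sigma1_sq, X[:, 0].var()))                     # True
```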

Page 16:

Given n training points with d features:
• $X \in \mathbb{R}^{n \times d}$: matrix storing the points
• $x^{(i)}_j$: jth feature of the ith point
• $\mu_j$: mean of the jth feature

Covariance of the 1st and 2nd features (assuming zero mean): $\sigma_{12} = \frac{1}{n}\sum_{i=1}^{n} x^{(i)}_1 x^{(i)}_2$

• Symmetric: $\sigma_{12} = \sigma_{21}$
• Zero → uncorrelated
• Large magnitude → (anti-)correlated / redundant
• $\sigma_{12} = \sigma_1^2 = \sigma_2^2$ → features are the same

Page 17: Covariance Matrix

The covariance matrix generalizes this idea to many features:

$C_X = \frac{1}{n} X^\top X$ is the d × d covariance matrix (for zero-mean features)

• The ith diagonal entry equals the variance of the ith feature
• The ijth entry is the covariance between the ith and jth features
• Symmetric (makes sense given the definition of covariance)
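A minimal NumPy sketch of this construction (our own code, not the course's; the toy data is made up), compared against NumPy's own estimator:

```python
# Build the d x d covariance matrix C_X = (1/n) X^T X from zero-mean features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])    # correlated toy data
Xc = X - X.mean(axis=0)                   # center: the formula assumes zero-mean features
n = Xc.shape[0]
C = Xc.T @ Xc / n                         # d x d covariance matrix

print(np.allclose(C, np.cov(Xc, rowvar=False, bias=True)))    # True: matches np.cov with 1/n normalization
```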

Page 18: PCA Formulation

PCA: find a lower-dimensional representation of the raw data
• $X$ is n × d (raw data)
• $Z = XP$ is n × k (reduced representation, PCA 'scores')
• $P$ is d × k (columns are the k principal components)
• Variance / covariance constraints

What constraints make sense for the reduced representation?
• No feature correlation, i.e., all off-diagonal entries of $C_Z$ are zero
• Features rank-ordered by variance, i.e., the diagonal entries of $C_Z$ are sorted

Page 19: PCA Formulation

PCA: find a lower-dimensional representation of the raw data
• $X$ is n × d (raw data)
• $Z = XP$ is n × k (reduced representation, PCA 'scores')
• $P$ is d × k (columns are the k principal components)
• Variance / covariance constraints

$P$ equals the top k eigenvectors of $C_X$

[Diagram: the n × k score matrix $Z$ equals the n × d data matrix $X$ times the d × k projection matrix $P$]

Page 20: PCA Solution

All covariance matrices have an eigendecomposition
• $C_X = U \Lambda U^\top$ (eigendecomposition)
• $U$ is d × d (columns are eigenvectors, sorted by their eigenvalues)
• $\Lambda$ is d × d (diagonal entries are eigenvalues, off-diagonal entries are zero)

The d eigenvectors are orthonormal directions of maximal variance
• The associated eigenvalues equal the variance in these directions
• The 1st eigenvector is the direction of maximal variance (the variance is $\lambda_1$)
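Putting the pieces together, here is a hedged end-to-end sketch of PCA via this eigendecomposition (our own code under the slides' description, not the course's reference implementation; the function name pca and all variables are made up):

```python
# Center X, form C_X, take the top k eigenvectors as P, and project: Z = X P.
import numpy as np

def pca(X, k):
    """Return (Z, P, eigvals): scores, principal components, eigenvalues (descending)."""
    Xc = X - X.mean(axis=0)                      # centering is crucial (see the practical tips)
    C = Xc.T @ Xc / Xc.shape[0]                  # d x d covariance matrix C_X
    eigvals, U = np.linalg.eigh(C)               # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort eigenvalues in descending order
    eigvals, U = eigvals[order], U[:, order]
    P = U[:, :k]                                 # top k eigenvectors = principal components
    Z = Xc @ P                                   # n x k reduced representation ('scores')
    return Z, P, eigvals

# Toy usage on noisy, nearly collinear 2D data (like the shoe-size example)
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, t]) + 0.05 * rng.normal(size=(200, 2))
Z, P, eigvals = pca(X, k=1)
print(P.ravel(), eigvals)                        # top direction is roughly +/-(0.71, 0.71)
```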

Page 21: Choosing k

How should we pick the dimension of the new representation?

Visualization: pick the top 2 or 3 dimensions for plotting purposes

Other analyses: capture 'most' of the variance in the data
• Recall that the eigenvalues are the variances in the directions specified by the eigenvectors, and that the eigenvalues are sorted
• Fraction of retained variance: $\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$

Can choose k such that we retain some fraction of the variance, e.g., 95%
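A small sketch of this rule (our own helper, with the made-up name choose_k; it assumes eigvals is already sorted in descending order, as returned by the pca() sketch above):

```python
import numpy as np

def choose_k(eigvals, target=0.95):
    """Smallest k whose top-k eigenvalues retain at least `target` of the variance."""
    fractions = np.cumsum(eigvals) / np.sum(eigvals)    # retained fraction for k = 1, 2, ...
    return int(np.searchsorted(fractions, target) + 1)

eigvals = np.array([5.0, 2.0, 0.8, 0.1, 0.1])           # a toy, already-sorted spectrum
print(choose_k(eigvals))                                 # 3: the top 3 retain >= 95% of the variance
```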

Page 22: Other Practical Tips

PCA assumptions (linearity, orthogonality) are not always appropriate
• Various extensions of PCA make different underlying assumptions, e.g., manifold learning, kernel PCA, ICA

Centering is crucial, i.e., we must preprocess the data so that all features have zero mean before applying PCA

PCA results depend on the scaling of the data
• Data is sometimes rescaled in practice before applying PCA
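A minimal preprocessing sketch for these tips (our own code; center_and_scale is a made-up helper, and standardizing to unit variance is just one common choice of rescaling, not prescribed by the slides):

```python
import numpy as np

def center_and_scale(X, scale=True, eps=1e-12):
    mu = X.mean(axis=0)
    Xc = X - mu                              # centering: required before PCA
    if scale:
        sigma = Xc.std(axis=0)
        Xc = Xc / np.maximum(sigma, eps)     # optional rescaling to unit variance
    return Xc

X = np.array([[1000.0, 0.1], [2000.0, 0.3], [1500.0, 0.2]])   # features on wildly different scales
print(center_and_scale(X).std(axis=0))                         # ~ [1., 1.]
```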

Page 23: Orthogonal and Orthonormal Vectors

Example vectors: $a = [1\ 0]^\top$, $b = [0\ 1]^\top$, $c = [1\ 1]^\top$, $d = [2\ 0]^\top$

Orthogonal vectors are perpendicular to each other
• Equivalently, their dot product equals zero
• $a^\top b = 0$ and $d^\top b = 0$, but $c$ isn't orthogonal to the others

Orthonormal vectors are orthogonal and have unit norm
• $a$ and $b$ are orthonormal, but $b$ and $d$ are not orthonormal
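A quick NumPy check of these definitions using the same example vectors (our own snippet):

```python
import numpy as np

a, b, c, d = np.array([1., 0.]), np.array([0., 1.]), np.array([1., 1.]), np.array([2., 0.])
print(a @ b, d @ b, c @ a)                         # 0.0 0.0 1.0 -> c is not orthogonal to a
print(np.linalg.norm(a), np.linalg.norm(d))        # 1.0 2.0 -> a has unit norm, d does not
```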

Page 24: PCA Iterative Algorithm

k = 1: find the direction of max variance, project onto this direction
• Locations along this direction are the new 1D representation

[Scatter plot: European Size vs. American Size]

More generally, for i in {1, …, k}:
• Find the direction of max variance that is orthonormal to the previously selected directions, and project onto this direction
• Locations along this direction are the ith feature in the new representation
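One way to turn this iterative description into code is power iteration plus deflation; the sketch below is our own illustration under that choice (the slides do not prescribe an implementation, and iterative_pca is a made-up name). In practice the eigendecomposition route from Page 20 is preferred.

```python
import numpy as np

def iterative_pca(X, k, n_iter=500):
    """Greedy PCA: power iteration for the max-variance direction, then deflation."""
    Xc = X - X.mean(axis=0)                          # center first
    n, d = Xc.shape
    P = np.zeros((d, k))
    residual = Xc.copy()
    rng = np.random.default_rng(0)
    for i in range(k):
        p = rng.normal(size=d)
        for _ in range(n_iter):                      # power iteration on residual^T residual
            p = residual.T @ (residual @ p)
            p /= np.linalg.norm(p)
        P[:, i] = p
        residual -= np.outer(residual @ p, p)        # deflate: remove the found direction
    Z = Xc @ P                                       # the new k-dimensional representation
    return Z, P

Z, P = iterative_pca(np.random.default_rng(1).normal(size=(200, 5)), k=2)
print(np.round(P.T @ P, 6))                          # ~ identity: the directions are orthonormal
```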

Page 25: Eigendecomposition

All covariance matrices have an eigendecomposition
• $C_X = U \Lambda U^\top$ (eigendecomposition)
• $U$ is d × d (columns are eigenvectors, sorted by their eigenvalues)
• $\Lambda$ is d × d (diagonal entries are eigenvalues, off-diagonal entries are zero)

Eigenvector / eigenvalue equation: $C_X u = \lambda u$
• By definition, $u^\top u = 1$ (unit norm)
• Example: $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \implies$ eigenvector $u = [1\ 0]^\top$, eigenvalue $\lambda = 1$
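A hedged numerical check of this slide with NumPy (our own snippet; the toy matrix C_X is made up):

```python
# For a symmetric toy covariance matrix, np.linalg.eigh returns U and the
# eigenvalues; each column u of U satisfies C_X u = lambda * u with u^T u = 1.
import numpy as np

C_X = np.array([[2.0, 0.8],
                [0.8, 1.0]])                 # a symmetric toy covariance matrix
eigvals, U = np.linalg.eigh(C_X)             # ascending eigenvalues, orthonormal columns

print(np.allclose(C_X, U @ np.diag(eigvals) @ U.T))           # True: C_X = U Lambda U^T
print(np.allclose(C_X @ U[:, -1], eigvals[-1] * U[:, -1]))    # True: C_X u = lambda u
print(np.isclose(U[:, -1] @ U[:, -1], 1.0))                   # True: unit norm
```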

Page 26: PCA Formulation

PCA: find a lower-dimensional representation of the raw data
• $X$ is n × d (raw data)
• $Z = XP$ is n × k (reduced representation, PCA 'scores')
• $P$ is d × k (columns are the k principal components)
• Variance / covariance constraints

[Diagram: the n × k score matrix $Z$ equals the n × d data matrix $X$ times the d × k projection matrix $P$]

Page 27: PCA Formulation, k = 1

PCA: find a one-dimensional representation of the raw data
• $X$ is n × d (raw data)
• $z = Xp$ is n × 1 (reduced representation, PCA 'scores')
• $p$ is d × 1 (the first principal component)
• Variance constraint

Variance of the scores: $\sigma^2_z = \frac{1}{n}\sum_{i=1}^{n} \big(z^{(i)}\big)^2 = \frac{1}{n}\|z\|_2^2 = \frac{1}{n}\|Xp\|_2^2$

Goal: maximize the variance $\sigma^2_z$, i.e., $\max_p \|z\|_2^2$ where $\|p\|_2 = 1$

Page 28:

Goal: maximize the variance $\sigma^2_z$, i.e., $\max_p \|z\|_2^2$ where $\|p\|_2 = 1$

$\sigma^2_z = \frac{1}{n}\|z\|_2^2 = \frac{1}{n} z^\top z = \frac{1}{n}(Xp)^\top(Xp) = \frac{1}{n} p^\top X^\top X p = p^\top C_X p$

Step justifications:
• Relationship between the Euclidean norm and the dot product: $\|z\|_2^2 = z^\top z$
• Definition: $z = Xp$
• Transpose property $(Xp)^\top = p^\top X^\top$ and associativity of matrix multiplication
• Definition: $C_X = \frac{1}{n} X^\top X$

Restated goal: $\max_p\; p^\top C_X p$ where $\|p\|_2 = 1$
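A hedged numerical check of the identity derived above (our own snippet; the toy data and the vector p are made up):

```python
# For any unit vector p, the variance of the 1D scores z = X p equals p^T C_X p.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
Xc = X - X.mean(axis=0)
C_X = Xc.T @ Xc / Xc.shape[0]

p = rng.normal(size=4)
p /= np.linalg.norm(p)                      # enforce the unit-norm constraint
z = Xc @ p
print(np.isclose(z.var(), p @ C_X @ p))     # True
```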

Page 29: Connection to Eigenvectors

Restated goal: $\max_p\; p^\top C_X p$ where $\|p\|_2 = 1$

Recall the eigenvector / eigenvalue equation: $C_X u = \lambda u$
• By definition $u^\top u = 1$, and thus $u^\top C_X u = \lambda$
• But this is the expression we're optimizing, and thus the maximal variance is achieved when $p$ is the top eigenvector of $C_X$

Similar arguments can be used for k > 1
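A closing sanity check of this argument (our own snippet; the toy data and the random-vector comparison are made up):

```python
# The top eigenvector of C_X attains a larger value of p^T C_X p than random
# unit vectors, and that maximum value equals the top eigenvalue lambda_1.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # correlated toy data
Xc = X - X.mean(axis=0)
C_X = Xc.T @ Xc / Xc.shape[0]

eigvals, U = np.linalg.eigh(C_X)
top = U[:, -1]                                            # eigenvector with the largest eigenvalue

rand_vals = []
for v in rng.normal(size=(2000, 6)):
    p = v / np.linalg.norm(v)
    rand_vals.append(p @ C_X @ p)

print(top @ C_X @ top >= max(rand_vals))                  # True: the top eigenvector maximizes p^T C_X p
print(np.isclose(top @ C_X @ top, eigvals[-1]))           # True: the maximum value is lambda_1
```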