
Page 1:

CS 6890: Deep Learning

Razvan C. Bunescu

School of Electrical Engineering and Computer Science

[email protected]

Principal Component Analysis

Page 2:

Principal Component Analysis (PCA)

• A technique widely used for:
  – dimensionality reduction.
  – data compression.
  – feature extraction.
  – data visualization.

• Two equivalent definitions of PCA:
  1) Project the data onto a lower dimensional space such that the variance of the projected data is maximized.
  2) Project the data onto a lower dimensional space such that the mean squared distance between data points and their projections (average projection cost) is minimized.

[Figure: the two formulations illustrated as maximum variance vs. minimum error.]

Page 3:

Principal Component Analysis (PCA)


Page 4:

Principal Component Analysis (PCA)


Page 5:

PCA (Maximum Variance)

• Let $X = \{x_n\}_{1 \le n \le N}$ be a set of observations:
  – Each $x_n \in \mathbb{R}^D$ (D is the dimensionality of $x_n$).

• Project X onto an M-dimensional space (M < D) such that the variance of the projected X is maximized.
  – The minimum error formulation leads to the same solution [PRML 12.1.2].
    • It shows how PCA can be used for compression.

• Work out the solution for M = 1, then generalize to any M < D.

Page 6:

PCA (Maximum Variance, M = 1)

• The lower dimensional space is defined by a vector $u_1 \in \mathbb{R}^D$.
  – Only the direction is important $\Rightarrow$ choose $\|u_1\| = 1$.

• Each $x_n$ is projected onto a scalar $u_1^T x_n$.

• The (sample) mean of the data is $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$.

• The (sample) mean of the projected data is $u_1^T \bar{x}$.

Page 7:

PCA (Maximum Variance, M = 1)

• The (sample) variance of the projected data:

  $\frac{1}{N}\sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T \Sigma u_1$

  where $\Sigma$ is the data covariance matrix:

  $\Sigma = \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$

• The optimization problem is:

  minimize: $-u_1^T \Sigma u_1$
  subject to: $u_1^T u_1 = 1$
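A minimal NumPy sketch (not from the slides; data and variable names are illustrative) checking numerically that the sample variance of the 1-D projections equals $u_1^T \Sigma u_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 500
X = rng.normal(size=(D, N))              # one observation x_n per column

u1 = rng.normal(size=D)
u1 /= np.linalg.norm(u1)                 # unit-norm direction

x_bar = X.mean(axis=1, keepdims=True)    # sample mean (D x 1)
Xc = X - x_bar
Sigma = Xc @ Xc.T / N                    # data covariance matrix (D x D)

proj = u1 @ X                            # scalars u1^T x_n
var_proj = np.mean((proj - u1 @ x_bar.ravel()) ** 2)
print(var_proj, u1 @ Sigma @ u1)         # the two values agree
```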

Page 8:

PCA (Maximum Variance, M = 1)

• Lagrangian function:

  $L_P(u_1, \lambda_1) = -u_1^T \Sigma u_1 + \lambda_1 (u_1^T u_1 - 1)$

  where $\lambda_1$ is the Lagrange multiplier for the constraint $u_1^T u_1 = 1$.

• Solve:

  $\frac{\partial L_P}{\partial u_1} = 0 \Rightarrow \Sigma u_1 = \lambda_1 u_1 \Rightarrow$ $u_1$ is an eigenvector of $\Sigma$ and $\lambda_1$ is an eigenvalue of $\Sigma$.

  $\Rightarrow -u_1^T \Sigma u_1 = -\lambda_1 u_1^T u_1 = -\lambda_1$, so the variance $u_1^T \Sigma u_1 = \lambda_1$ is maximized when $\lambda_1$ is the largest eigenvalue of $\Sigma$.
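A small NumPy check of this result (illustrative, not from the slides): the eigenvector with the largest eigenvalue of $\Sigma$ attains a higher projected variance than random unit directions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
X = A @ rng.normal(size=(4, 1000))       # correlated data, one x_n per column
Xc = X - X.mean(axis=1, keepdims=True)
Sigma = Xc @ Xc.T / X.shape[1]

# eigh returns the eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
u1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

dirs = rng.normal(size=(100, 4))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
best_random = max(v @ Sigma @ v for v in dirs)
print(u1 @ Sigma @ u1, ">=", best_random)  # u1 maximizes u^T Sigma u
```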

Page 9:

PCA (Maximum Variance, M = 1)

• $\lambda_1$ is the largest eigenvalue of $\Sigma$.
• $u_1$ is the eigenvector corresponding to $\lambda_1$:
  – also called the first principal component.

• For M < D dimensions:
  – $u_1, u_2, \ldots, u_M$ are the eigenvectors corresponding to the largest eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_M$ of $\Sigma$.
  – proof by induction.

Page 10:

PCA on Normalized Data

• Preprocess the data $X = \{x^{(i)}\}_{1 \le i \le m}$ such that:
  – features have the same mean (0).
  – features have the same variance (1).
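A minimal sketch of this preprocessing (illustrative, not from the slides; it assumes X holds one example $x^{(i)}$ per column, matching the covariance formula $\Sigma = \frac{1}{m} X X^T$ used later):

```python
import numpy as np

def normalize(X):
    """Zero-mean, unit-variance features; X is D x m (one example per column)."""
    X = X - X.mean(axis=1, keepdims=True)          # each feature gets mean 0
    X = X / (X.std(axis=1, keepdims=True) + 1e-8)  # each feature gets variance 1
    return X
```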

Page 11:

PCA on Natural Images

• Stationarity: the statistics in one part of the image should be the same as in any other part.
  ⇒ no need for variance normalization.
  ⇒ do mean normalization by subtracting from each image its mean intensity.
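A sketch of this per-image mean normalization (illustrative; `images` is assumed to be an array of flattened image patches, one per column):

```python
import numpy as np

def mean_normalize_images(images):
    """Subtract each image's own mean intensity; images is D x m."""
    return images - images.mean(axis=0, keepdims=True)
```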

Page 12:

PCA on Normalized Data

• The covariance matrix is:

  $\Sigma = \frac{1}{m} X X^T = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T$

• The eigenvectors are:

  $\Sigma u_j = \lambda_j u_j$, where $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_D$ and $u_j^T u_j = 1$.

• Equivalent with:

  $\Sigma U = U \Lambda$, where $U = [u_1, u_2, \ldots, u_D]$, $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$, $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_D$, and $U^T U = I$.
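An illustrative NumPy sketch of this eigendecomposition (the helper name `pca_eig` is mine, not from the slides; X is D x m and already normalized; `np.linalg.eigh` is used because $\Sigma$ is symmetric, and the eigenvalues are re-sorted in decreasing order to match the slides' convention):

```python
import numpy as np

def pca_eig(X):
    """Return (U, lam): columns of U are the eigenvectors of Sigma = X X^T / m,
    sorted by decreasing eigenvalue lam."""
    m = X.shape[1]
    Sigma = X @ X.T / m
    lam, U = np.linalg.eigh(Sigma)      # ascending eigenvalues
    order = np.argsort(lam)[::-1]       # largest first
    return U[:, order], lam[order]
```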

Page 13:

PCA on Normalized Data

• U is an orthogonal (rotation) matrix, i.e. $U^T U = I$.

• The full transformation (rotation) of $x^{(i)}$ through PCA is:

  $y^{(i)} = U^T x^{(i)} \Rightarrow x^{(i)} = U y^{(i)}$

• The k-dimensional projection of $x^{(i)}$ through PCA is:

  $y^{(i)} = U_{1,k}^T x^{(i)} = [u_1, \ldots, u_k]^T x^{(i)} \Rightarrow \hat{x}^{(i)} = U_{1,k} y^{(i)}$

• How many components k should be used?
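A sketch of the full rotation, the k-dimensional projection, and the corresponding reconstruction (illustrative; it assumes U comes from the `pca_eig` helper sketched above):

```python
def pca_project(X, U, k):
    """Rotate X (D x m), keep the top-k components, and reconstruct."""
    Y_full = U.T @ X                 # full rotation, D x m
    Y_k = U[:, :k].T @ X             # k-dimensional projection, k x m
    X_hat = U[:, :k] @ Y_k           # reconstruction back in R^D
    return Y_full, Y_k, X_hat
```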

Page 14:

How many components k should be used?

• Compute the percentage of variance retained by $Y = \{y^{(i)}\}$, for each value of k:

  $y^{(i)} = [u_1, \ldots, u_k]^T x^{(i)}$

  $\mathrm{Var}(k) = \sum_{j=1}^{k} \mathrm{Var}\left[ y_j \right] = \sum_{j=1}^{k} \mathrm{Var}\left[ u_j^T x \right] = \frac{1}{m} \sum_{j=1}^{k} \sum_{i=1}^{m} \left( u_j^T x^{(i)} - u_j^T \bar{x} \right)^2 = \frac{1}{m} \sum_{j=1}^{k} \sum_{i=1}^{m} \left( u_j^T x^{(i)} \right)^2 = \sum_{j=1}^{k} \lambda_j$

  HW: Prove that the variance of the j-th component is $\lambda_j$.
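A quick numerical check of the identity $\mathrm{Var}(k) = \sum_{j=1}^{k} \lambda_j$ (illustrative; reuses the `pca_eig` sketch above on centered data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2000)) * np.linspace(3.0, 0.5, 6)[:, None]  # D x m
X = X - X.mean(axis=1, keepdims=True)

U, lam = pca_eig(X)                        # helper sketched earlier
k = 3
Y_k = U[:, :k].T @ X                       # k-dimensional projections
var_k = np.sum(np.mean(Y_k ** 2, axis=1))  # Var(k) of the projected data
print(var_k, lam[:k].sum())                # the two values agree
```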

Page 15:

How many components k should be used?

• Compute the percentage of variance retained by $Y = \{y^{(i)}\}$, for each value of k:

  – Variance retained: $\mathrm{Var}(k) = \sum_{j=1}^{k} \lambda_j$

  – Total variance: $\mathrm{Var}(D) = \sum_{j=1}^{D} \lambda_j$

  – Percentage of variance retained: $P(k) = \dfrac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{D} \lambda_j}$

Page 16:

How many components k should be used?

• Compute the percentage of variance retained by $Y = \{y^{(i)}\}$, for each value of k:

  $P(k) = \dfrac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{D} \lambda_j}$

• Choose the smallest k so as to retain 99% of the variance:

  $k = \arg\min_{1 \le k \le D} \left[ P(k) \ge 0.99 \right]$
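An illustrative sketch of this selection rule, given the eigenvalues in decreasing order (the helper `choose_k` is hypothetical, not from the slides):

```python
import numpy as np

def choose_k(lam, threshold=0.99):
    """Smallest k such that the top-k eigenvalues retain `threshold` of the variance."""
    P = np.cumsum(lam) / np.sum(lam)        # P(k) for k = 1..D
    return int(np.argmax(P >= threshold)) + 1
```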

Page 17:

PCA on Normalized Data:

[Figure: the normalized data points $[x_1^{(i)}, x_2^{(i)}]^T$]

Page 18:

Rotation through PCA:

[Figure: the rotated data points $[u_1^T x^{(i)}, u_2^T x^{(i)}]^T$]

Page 19:

1-Dimensional PCA Projection:

[Figure: the 1-dimensional PCA projections $[u_1^T x^{(i)}, 0]^T$]

Page 20:

1-Dimensional PCA Approximation:

[Figure: the 1-dimensional PCA approximations $u_1 u_1^T x^{(i)}$]

Page 21:

PCA as a Linear Auto-Encoder

• The full transformation (rotation) of $x^{(i)}$ through PCA is:

  $y = U^T x \Rightarrow x = U y$

• The k-dimensional projection of $x^{(i)}$ through PCA is:

  $y = U_{1,k}^T x = [u_1, \ldots, u_k]^T x \Rightarrow \hat{x} = U_{1,k} y = U_{1,k} U_{1,k}^T x$

• The minimum error formulation of PCA:

  $U_{1,k}^* = \arg\min_{U_{1,k}} \sum_{i=1}^{m} \left\| U_{1,k} U_{1,k}^T x^{(i)} - x^{(i)} \right\|^2$

  ⇒ a linear auto-encoder with tied weights!
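An illustrative sketch of this objective (not the course's code): it evaluates the tied-weight reconstruction error for the top-k eigenvector basis and for a random orthonormal basis, so the PCA solution should give the smaller error; it reuses the `pca_eig` helper sketched earlier.

```python
import numpy as np

def reconstruction_error(U1k, X):
    """Sum over examples of || U1k U1k^T x - x ||^2 (columns of X are examples)."""
    return np.sum((U1k @ (U1k.T @ X) - X) ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 500)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]
X = X - X.mean(axis=1, keepdims=True)

U, lam = pca_eig(X)                           # helper sketched earlier
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthonormal basis
k = 2
print(reconstruction_error(U[:, :k], X), "<=", reconstruction_error(Q[:, :k], X))
```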

Page 22:

PCA as a Linear Auto-Encoder

[Figure: a linear auto-encoder with inputs $x_1, \ldots, x_4$, hidden units $z_i^{(2)}$, and outputs reconstructing $x_1, \ldots, x_4$; the encoder weights form $U_{1,k}^T$ and the decoder weights form $U_{1,k}$, with tied weights $u_i = [w_{i1}^{(1)}, w_{i2}^{(1)}, w_{i3}^{(1)}, w_{i4}^{(1)}]^T = [w_{1i}^{(2)}, w_{2i}^{(2)}, w_{3i}^{(2)}, w_{4i}^{(2)}]^T$.]

Page 23:

PCA and Decorrelation

• The full transformation (rotation) of $x^{(i)}$ through PCA is:

  $y^{(i)} = U^T x^{(i)} \Rightarrow Y = U^T X$

• What is the covariance matrix of the rotated data Y?

  $\frac{1}{m} Y Y^T = \frac{1}{m} (U^T X)(U^T X)^T = \frac{1}{m} U^T X X^T U = U^T \left( \frac{1}{m} X X^T \right) U = U^T \Sigma U = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$

  ⇒ the features in y are decorrelated!
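A short numerical check of the decorrelation claim (illustrative; reuses the `pca_eig` sketch above):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(4, 4))
X = A @ rng.normal(size=(4, 3000))        # correlated features, D x m
X = X - X.mean(axis=1, keepdims=True)

U, lam = pca_eig(X)                       # helper sketched earlier
Y = U.T @ X                               # rotated data
C = Y @ Y.T / X.shape[1]                  # covariance of Y
print(np.round(C, 3))                     # approximately diag(lambda_1, ..., lambda_D)
```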

Page 24:

PCA Whitening (Sphering)

• The goal of whitening is to make the input less redundant, i.e. the learning algorithm sees a training input where:

  1. The features are not correlated with each other.

  2. The features all have the same variance.

• How to achieve this:

  1. PCA already results in uncorrelated features:

     $y^{(i)} = U^T x^{(i)} \Leftrightarrow Y = U^T X$,  with  $\frac{1}{m} Y Y^T = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_D)$

  2. Transform to identity covariance (PCA Whitening):

     $y_j^{(i)} = \dfrac{u_j^T x^{(i)}}{\sqrt{\lambda_j}} \Leftrightarrow y^{(i)} = \Lambda^{-1/2} U^T x^{(i)} \Leftrightarrow Y = \Lambda^{-1/2} U^T X$
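An illustrative PCA-whitening sketch (reuses the `pca_eig` helper; see the smoothing slide at the end for the ε-regularized version):

```python
import numpy as np

def pca_whiten(X):
    """PCA whitening: Y = Lambda^{-1/2} U^T X, so (1/m) Y Y^T is approximately I."""
    U, lam = pca_eig(X)                   # helper sketched earlier
    return np.diag(1.0 / np.sqrt(lam)) @ (U.T @ X)
```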

Page 25:

PCA on Normalized Data:

[Figure: the normalized data points $[x_1^{(i)}, x_2^{(i)}]^T$]

Page 26:

Rotation through PCA:

[Figure: the rotated data points $[u_1^T x^{(i)}, u_2^T x^{(i)}]^T$]

Page 27:

PCA Whitening:

[Figure: the PCA-whitened data points $\left[ \dfrac{u_1^T x^{(i)}}{\sqrt{\lambda_1}}, \dfrac{u_2^T x^{(i)}}{\sqrt{\lambda_2}} \right]^T$]

Page 28:

ZCA Whitening (Sphering)

• Observation: If Y has identity covariance and R is an orthogonal matrix, then RY has identity covariance.

1. PCA Whitening:

   $Y_{PCA} = \Lambda^{-1/2} U^T X$

2. ZCA Whitening:

   $Y_{ZCA} = U Y_{PCA} = U \Lambda^{-1/2} U^T X$

Out of all rotations R, the choice R = U makes $Y_{ZCA}$ closest to the original X.
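An illustrative ZCA-whitening sketch (rotates the PCA-whitened data back with U; reuses the `pca_eig` helper):

```python
import numpy as np

def zca_whiten(X):
    """ZCA whitening: Y = U Lambda^{-1/2} U^T X (identity covariance, closest to X)."""
    U, lam = pca_eig(X)                   # helper sketched earlier
    return U @ np.diag(1.0 / np.sqrt(lam)) @ U.T @ X
```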

Page 29:

ZCA Whitening:

[Figure: the ZCA-whitened data points $Y_{ZCA} = U \Lambda^{-1/2} U^T X$]

Page 30:

Smoothing

• When an eigenvalue $\lambda_j$ is very close to 0, scaling by $\lambda_j^{-1/2}$ is numerically unstable.

• Smoothing: add a small ε to the eigenvalues before scaling for PCA/ZCA whitening:

  $y_j^{(i)} = \dfrac{u_j^T x^{(i)}}{\sqrt{\lambda_j + \varepsilon}}$,  with $\varepsilon \approx 10^{-5}$

• ZCA whitening is a rough model of how the biological eye (the retina) processes images (through retinal neurons).
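An illustrative smoothed-whitening sketch (adds ε before the square root; reuses the `pca_eig` helper, and the `zca` flag is a hypothetical convenience, not from the slides):

```python
import numpy as np

def whiten(X, eps=1e-5, zca=False):
    """PCA/ZCA whitening with eigenvalue smoothing for numerical stability."""
    U, lam = pca_eig(X)                   # helper sketched earlier
    Y = np.diag(1.0 / np.sqrt(lam + eps)) @ (U.T @ X)
    return U @ Y if zca else Y
```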