Lab 9 Slides: PCA
Brett Bernstein (CDS at NYU), November 14, 2018

  • PCA

    1 Define x1 = (4, 1), x2 = (−3, 1), and x3 = (1, 1).

      1 Give a one-dimensional affine subspace of R2 that best approximates these three points.

      2 Use this to represent each point using a single number (i.e., reduce the dimension from 2 to 1).

    2 Suppose x1, . . . , xn ∈ Rp are datapoints you want to represent in k < p dimensions.

      1 Explain how to do this using PCA.

      2 How do you determine a value for k?

      3 How can you implement PCA using the SVD? (See the sketch below.)

      4 Why should we perform dimensionality reduction?

    3 Suppose there are two eigenvectors of the covariance matrix that correspond to large eigenvalues, and the rest of the eigenvalues are small. How do we interpret this?
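
    A minimal sketch of PCA via the SVD (question 2.3), assuming NumPy; the function name and the demo on the question-1 points are mine:

        import numpy as np

        def pca_svd(X, k):
            """Project the rows of X (n x p) onto the top-k principal components."""
            mu = X.mean(axis=0)              # feature means
            Xc = X - mu                      # center the data
            # Rows of Vt are the right singular vectors, i.e. the principal directions.
            _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
            W = Vt[:k].T                     # p x k matrix of the top-k directions
            return Xc @ W, W, mu             # n x k scores, directions, mean

        # Question 1: the best one-dimensional affine subspace is {mu + t w : t ∈ R}.
        X = np.array([[4.0, 1.0], [-3.0, 1.0], [1.0, 1.0]])
        scores, W, mu = pca_svd(X, k=1)
        print(mu, W.ravel(), scores.ravel())  # the line y = 1; one score per point

    For these three points the recovered line is y = 1, and the single score per point answers question 1.2.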

  • Scree Plot

    (Scree plot figure omitted.)
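
    How a scree plot is typically drawn, as one way to pick k (a sketch assuming NumPy and Matplotlib; the data here is a random placeholder):

        import numpy as np
        import matplotlib.pyplot as plt

        X = np.random.default_rng(0).normal(size=(200, 10))        # placeholder data
        Xc = X - X.mean(axis=0)
        eigvals = np.linalg.svd(Xc, compute_uv=False) ** 2 / (len(X) - 1)

        plt.plot(np.arange(1, eigvals.size + 1), eigvals, "o-")
        plt.xlabel("component index")
        plt.ylabel("covariance eigenvalue")
        plt.show()                     # look for the "elbow" and keep k components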

  • Variance Along a Direction

    1 Let x1, . . . , xn ∈ Rp, and fix a direction w ∈ Rp with ‖w‖ = 1. We define the variance along the direction w by

        (1/(n − 1)) ∑_{i=1}^n (wᵀxi − wᵀµ)²,

      where µ = (1/n) ∑_{i=1}^n xi is the sample mean. On the homework we will see that the first eigenvector of the covariance matrix gives the direction with maximum variance. Why is this desirable? (A numerical check follows below.)

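    A quick numerical check of that claim (a sketch assuming NumPy; the data is made up):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
        mu = X.mean(axis=0)

        def var_along(w):
            w = w / np.linalg.norm(w)                 # enforce ||w|| = 1
            return np.sum((X @ w - mu @ w) ** 2) / (len(X) - 1)

        S = np.cov(X, rowvar=False)                   # sample covariance matrix
        lams, U = np.linalg.eigh(S)                   # eigenvalues in ascending order
        print(var_along(U[:, -1]), lams[-1])          # equal: top eigenvector attains λ_max
        print(max(var_along(rng.normal(size=2)) for _ in range(1000)))  # never exceeds λ_max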

  • Project Along Direction (x̃i = xi − µ)

    (Figure omitted: centered points x̃1, . . . , x̃7 with a unit direction w.)


  • Project Along Direction (x̃i = xi − µ)

    (Figure omitted: the same points x̃1, . . . , x̃7 with their projections, the wᵀx̃i-values, marked along w.)

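    The wᵀx̃i-values in the figure are just inner products; a sketch (assuming NumPy, with made-up points):

        import numpy as np

        X = np.array([[2.0, 1.0], [0.5, 2.0], [-1.0, 1.5], [-2.0, -0.5],
                      [1.0, -1.0], [2.5, 0.0], [-0.5, -2.0]])      # seven sample points
        Xt = X - X.mean(axis=0)               # centered points x̃_i
        w = np.array([1.0, 1.0])
        w = w / np.linalg.norm(w)             # unit direction
        print(Xt @ w)                         # the seven wᵀx̃_i-values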

  • Low Variance Direction

    (Figures omitted: projection of the data onto a low-variance direction.)

  • High Variance Direction

    (Figures omitted: projection of the data onto a high-variance direction.)

  • Standardization

    1 Someone suggests that you should standardize each feature before running PCA (i.e., subtract the mean of each feature, and then divide by the standard deviation). Does this have any effect? (See the sketch below.)

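    A sketch suggesting the answer empirically (assuming NumPy; the correlated, badly scaled data is made up). Standardizing replaces the covariance matrix with the correlation matrix, so in general the principal directions change:

        import numpy as np

        rng = np.random.default_rng(1)
        Z0 = rng.normal(size=(300, 2))
        X = (Z0 @ np.array([[1.0, 0.5], [0.5, 1.0]])) * np.array([1.0, 10.0])

        def top_direction(A):
            _, _, Vt = np.linalg.svd(A - A.mean(axis=0), full_matrices=False)
            return Vt[0]

        Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each feature
        print(top_direction(X))                    # dominated by the large-scale feature
        print(top_direction(Z))                    # roughly ±(1, 1)/√2 after standardizing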

  • Centering: Uncentered Data

    (Figure omitted: the directions u1, u2 computed without centering the data.)

  • Centering: Centered Data

    (Figure omitted: the directions u1, u2 after centering the data.)
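
    What the figures illustrate, as a sketch (assuming NumPy; the offset data is made up): without centering, the first singular vector of the raw data matrix chases the mean rather than the spread:

        import numpy as np

        rng = np.random.default_rng(2)
        X = rng.normal(size=(200, 2)) @ np.diag([2.0, 0.5]) + np.array([10.0, 10.0])

        _, _, Vt_raw = np.linalg.svd(X, full_matrices=False)                  # uncentered
        _, _, Vt_ctr = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
        print(Vt_raw[0])   # points roughly toward the mean (10, 10)
        print(Vt_ctr[0])   # roughly ±(1, 0), the direction of largest spread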

  • Scaling

    (Figure omitted: the principal directions u1, u2 of the original data.)

  • Scaling: Multiply Second Feature by 4

    (Figure omitted: the principal directions u1, u2 after the scaling.)
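
    The same experiment for these figures (a sketch assuming NumPy; the data is made up): rescaling one feature changes which directions carry the most variance:

        import numpy as np

        rng = np.random.default_rng(3)
        X = rng.normal(size=(200, 2)) @ np.diag([2.0, 1.0])   # more spread in feature 1
        Y = X * np.array([1.0, 4.0])                          # multiply feature 2 by 4

        def directions(A):
            _, _, Vt = np.linalg.svd(A - A.mean(axis=0), full_matrices=False)
            return Vt

        print(directions(X)[0])   # roughly ±(1, 0)
        print(directions(Y)[0])   # roughly ±(0, 1): the principal directions changed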

  • PCA Example

    Example

    A collection of people come to a testing site to have their heights measured twice. The two testers use different measuring devices, each of which introduces errors into the measurement process. Below we depict some of the measurements computed (already centered).


  • PCA Example (Data is Centered)

    (Scatter plot omitted: Tester 1 on the horizontal axis, about −10 to 10; Tester 2 on the vertical axis, about −20 to 20.)

    1 Describe (vaguely) what you expect the sample covariance matrix to look like.

    2 What do you think the principal directions u1 and u2 are?


  • PCA Example: Solutions

    1 We expect tester 2 to have a larger variance than tester 1, and the two testers' measurements to be nearly perfectly correlated. The sample covariance matrix is

        S = [ 40.5154   93.5069
              93.5069  232.8653 ].

    2 We have S = UΛUᵀ, with

        U = [ 0.3762  −0.9265
              0.9265   0.3762 ],

        Λ = [ 270.8290  0
              0         2.5518 ].

      Note that trace Λ = trace S.

      Since λ2 is small, u2 is almost in the nullspace of S. This suggests −0.9265x + 0.3762y ≈ 0 for centered data points (x, y) ∈ R2, i.e., y ≈ 2.46x. Maybe tester 2 used centimeters and tester 1 used inches.

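    A sketch verifying these numbers (assuming NumPy):

        import numpy as np

        S = np.array([[40.5154, 93.5069],
                      [93.5069, 232.8653]])
        lams, U = np.linalg.eigh(S)          # eigenvalues in ascending order
        print(lams)                          # ≈ [2.5518, 270.8290]
        print(U)                             # columns are u2, u1 (up to sign)
        print(np.trace(S), lams.sum())       # the traces agree
        print(0.9265 / 0.3762)               # slope ≈ 2.46 (2.54 cm per inch)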

  • PCA Example: Plot In Terms of Principal Components

    (Scatter plots omitted: the data in the original Tester 1/Tester 2 axes, and re-plotted in principal-component coordinates, with u1 ranging over about −20 to 20 and u2 over about −1.25 to 6.25.)

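    Re-plotting in principal-component coordinates is just a change of basis; a sketch (assuming NumPy, with U from the solution slide and made-up centered points):

        import numpy as np

        U = np.array([[0.3762, -0.9265],
                      [0.9265,  0.3762]])           # columns u1, u2
        Xc = np.array([[4.0, 10.0], [-2.0, -5.0]])  # made-up centered points
        print(Xc @ U)   # (u1-score, u2-score) pairs; the u2-scores are tiny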

  • Principal Components Are Linear

    Suppose we have the following labeled data.

    (Scatter plot omitted: two colored clusters.)

    How can we apply PCA and obtain a single principal component that distinguishes the colored clusters?


  • Principal Components Are Linear: Doesn’t Work

    (Figure omitted: the first principal component of the raw data does not separate the clusters.)

  • Principal Components Are Linear

    1 In general, we can deal with non-linearity by adding features or using kernels.

    2 Using kernels results in the technique called Kernel PCA.

    3 Below we added the feature ‖x̃i‖² and took the first principal component (figure omitted; see the sketch after this list).

    4 Next class we will look at diffusion maps.

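    A sketch of item 3 (assuming NumPy, and assuming the figure showed two concentric clusters, which is the typical example): appending ‖x̃i‖² makes the radius visible to a linear method.

        import numpy as np

        rng = np.random.default_rng(4)
        n = 200
        theta = rng.uniform(0, 2 * np.pi, size=n)
        r = np.where(np.arange(n) < n // 2, 1.0, 3.0)         # two concentric rings
        X = np.c_[r * np.cos(theta), r * np.sin(theta)] + 0.05 * rng.normal(size=(n, 2))

        Xt = X - X.mean(axis=0)                               # centered points x̃_i
        F = np.c_[Xt, np.sum(Xt**2, axis=1)]                  # append the feature ||x̃_i||²
        Fc = F - F.mean(axis=0)
        _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
        pc1 = Fc @ Vt[0]                                      # first principal component scores
        print(pc1[: n // 2].mean(), pc1[n // 2:].mean())      # the rings get well-separated scores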

  • Diagonalization

    1 Suppose A ∈ Rn×n has a linearly independent list of n eigenvectors v1, . . . , vn with eigenvalues λ1, . . . , λn. Can we factor A in a way similar to the spectral decomposition?

    2 The Fibonacci sequence is defined by F0 = 0, F1 = 1, and Fk+2 = Fk+1 + Fk for k ≥ 0. How quickly does Fk grow (linearly, polynomially, exponentially)? (See the sketch below.)

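    A sketch for question 2 (assuming NumPy): the factorization asked for in question 1 is A = VΛV⁻¹, which gives Aᵏ = VΛᵏV⁻¹; applying it to the Fibonacci step matrix shows that Fk grows exponentially, like φᵏ with φ the golden ratio.

        import numpy as np

        A = np.array([[1.0, 1.0], [1.0, 0.0]])    # (F_{k+2}, F_{k+1}) = A (F_{k+1}, F_k)
        lams, V = np.linalg.eig(A)                # eigenvalues φ ≈ 1.618 and 1 − φ
        phi = lams.max()

        def fib(k):
            Ak = V @ np.diag(lams**k) @ np.linalg.inv(V)   # A^k = V Λ^k V^{-1}
            return Ak[0, 1]                                # F_k is the (0, 1) entry of A^k

        print(round(fib(10)))           # 55
        print(fib(20) / fib(19), phi)   # the ratio approaches φ: exponential growth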