2009 7 PCA + Factor Analyses


TRANSCRIPT

  • Slide 1/68

    Data Reduction I: PCA and Factor Analysis

    Carlos Pinto

    Neuropsychiatric Genetics Group, University of Dublin

    Data Analysis Seminars, 11 November 2009

  • Slide 2/68

    Overview

    1. Distortions in Data

    2. Variance and Covariance

    3. PCA: A Toy Example

    4. Running PCA in R

    5. Factor Analysis

    6. Running Factor Analysis in R

    7. Conclusions

  • Slides 3–6/68

    Sources of Confusion in Data

    1. Noise

    2. Rotation

    3. Redundancy

    ... in linear systems


  • Slides 7–10/68

    Noise

    1. Noise arises from inaccuracies in our measurements

    2. Noise is measured relative to the signal:

    SNR = σ²_signal / σ²_noise

    The SNR must be large if we are to extract useful information from our experiment.
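    As a quick illustration in R (the signal and noise vectors here are made up for the example):

        signal <- sin(seq(0, 10, by = 0.01))        # hypothetical clean signal
        noise  <- rnorm(length(signal), sd = 0.1)   # hypothetical measurement noise
        snr    <- var(signal) / var(noise)          # ratio of the two variances
        snr                                         # large values mean the data are informative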


  • Slide 11/68

    Variance in Signal and Noise

  • Slides 12–13/68

    Rotation

    1. We don't always know the true underlying variables in the system we are studying

    2. We may actually be measuring some (hopefully linear!) combination of the true variables.


  • Slides 14–15/68

    Redundancy

    1. We don't always know whether the variables we are measuring are independent.

    2. We may actually be measuring one or more (hopefully linear!) combinations of the same set of variables.


  • Slide 16/68

    Degrees of Redundancy


  • Slides 18–20/68

    Goals of PCA

    1. To isolate noise

    2. To separate out the redundant degrees of freedom

    3. To eliminate the effects of rotation


  • Slide 21/68

    Example: Measuring Two Correlated Variables

  • Slide 22/68

    Reduced Data

  • Slide 23/68

    Variance

    Variable X, measurements {x1 ... xn}

    Variable Y, measurements {y1 ... yn}

    Variance of X: σ²_XX = Σ_i (x_i - x̄)(x_i - x̄) / (n - 1)

    Variance of Y: σ²_YY = Σ_i (y_i - ȳ)(y_i - ȳ) / (n - 1)
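    R's var() implements exactly this n - 1 formula; for instance, on the X of the toy example below:

        x <- c(1, 2)
        var(x)   # ((1 - 1.5)^2 + (2 - 1.5)^2) / (2 - 1) = 0.5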

  • Slide 24/68

    Covariance

    Covariance of X and Y:

    σ²_XY = Σ_i (x_i - x̄)(y_i - ȳ) / (n - 1) = σ²_YX

    The covariance measures the degree of the (linear!) relationship between X and Y.

    Small values of the covariance indicate that the variables are uncorrelated (no linear relationship).

  • Slide 25/68

    Covariance Matrix

    We can write the four variances associated with our two variables in the form of a matrix:

    ( C_XX  C_XY )
    ( C_YX  C_YY )

    The diagonal elements are the variances and the off-diagonal elements are the covariances.

    The covariance matrix is symmetric, since C_XY = C_YX.
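    In R, cov() computes both a single covariance and, given a matrix, the full covariance matrix; for instance, on the toy data of Example 1 below:

        x <- c(1, 2)
        y <- c(3, 6)
        cov(x, y)                  # 1.5
        cov(cbind(X = x, Y = y))   # the full symmetric 2 x 2 matrix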

  • Slide 26/68

    Three Variables

    For 3 variables, we get a 3 × 3 matrix:

    ( C11  C12  C13 )
    ( C21  C22  C23 )
    ( C31  C32  C33 )

    As before, the diagonal elements are the variances and the off-diagonal elements are the covariances, and the matrix is symmetric.

  • Slides 27–28/68

    Interpretation of the Covariance Matrix

    1. The diagonal terms are the variances of our different variables.

    2. Large diagonal values correspond to strong signals.

    3. The off-diagonal terms are the covariances between different variables. They reflect distortions in the data (noise, rotation, redundancy).

    4. Large off-diagonal values correspond to distortions in our data.


  • Slides 29–31/68

    Minimizing Distortion

    1. If we redefine our variables (as linear combinations of each other) the covariance matrix will change.

    2. We want to change the covariance matrix so that the off-diagonal elements are close to zero.

    3. That is, we want to diagonalize the covariance matrix.


  • Slides 32–33/68

    Example 1: Toy Example

    X = (1, 2), so x̄ = 3/2

    Y = (3, 6), so ȳ = 9/2

    Subtract the means, since we are only interested in the variances:

    X = (-1/2, 1/2)

    Y = (-3/2, 3/2)


  • Slide 34/68

    Variances

    C11 = (-1/2)(-1/2) + (1/2)(1/2) = 1/2

    C12 = (-1/2)(-3/2) + (1/2)(3/2) = 3/2

    C21 = (-3/2)(-1/2) + (3/2)(1/2) = 3/2

    C22 = (-3/2)(-3/2) + (3/2)(3/2) = 9/2

    (With n = 2 the denominator n - 1 is 1, so the sums themselves are the variances.)

  • Slide 35/68

    Covariance Matrix

    The covariance matrix is:

    C = ( 1/2  3/2 )
        ( 3/2  9/2 )

    This is not diagonal: the nonzero off-diagonal terms indicate the presence of distortion, in this case because the two variables are correlated.
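    The eigendecomposition of C can be checked with R's eigen(). Note that eigen() returns unit-length eigenvectors, so its eigenvalues are 5 and 0; the unnormalised axes used on the next slides (X′ = X + 3Y, Y′ = -3X + Y) scale the leading variance by a factor of 10:

        C <- matrix(c(1/2, 3/2,
                      3/2, 9/2), nrow = 2, byrow = TRUE)
        eigen(C)   # values 5, 0; vectors proportional to (1, 3) and (-3, 1), up to sign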

  • Slides 36–38/68

    Rotation of Axes I

    We need to redefine our variables to diagonalize the covariance matrix.

    Choose new variables which are a linear combination of the old ones:

    X′ = a1 X + a2 Y

    Y′ = b1 X + b2 Y

    We have to determine the four constants a1, a2, b1, b2 such that the covariance matrix is diagonal.


  • Slide 39/68

    Rotation of Axes II

    In this case, the constants are easy to find:

    X′ = X + 3Y

    Y′ = -3X + Y

  • Slide 40/68

    Recalculating the Data Points

    Our data points relative to the new axes are now:

    X′1 = -1/2 + 3(-3/2) = -5

    X′2 = 1/2 + 3(3/2) = 5

    Y′1 = -3(-1/2) + (-3/2) = 0

    Y′2 = -3(1/2) + 3/2 = 0

  • Slide 41/68

    Recalculating the Variances

    The new variances are now:

    C′11 = (-5)(-5) + (5)(5) = 50

    C′12 = (-5)(0) + (5)(0) = 0

    C′21 = (0)(-5) + (0)(5) = 0

    C′22 = (0)(0) + (0)(0) = 0

  • Slide 42/68

    Diagonalised Covariance Matrix

    The covariance matrix is now:

    C′ = ( 50  0 )
         (  0  0 )

    1. All off-diagonal entries are now zero, so the data distortions have been removed.

    2. The variance of Y′ (second entry on the diagonal) is now zero. This variable has been eliminated, so the data has been dimensionally reduced (from 2 dimensions to 1).
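    A small verification in R, applying the slides' (unnormalised) rotation to the centred data:

        x  <- c(-1/2, 1/2)     # mean-centred X
        y  <- c(-3/2, 3/2)     # mean-centred Y
        xp <- x + 3 * y        # X′ = X + 3Y
        yp <- -3 * x + y       # Y′ = -3X + Y
        cov(cbind(xp, yp))     # diagonal 50 and 0, off-diagonals 0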

  • Slide 43/68

    Eigenvalues

    1. The numbers on the diagonal of C′ are called the eigenvalues of the covariance matrix (in our example they carry the extra factor of 10 that comes from the unnormalised axes).

    2. Large eigenvalues correspond to large variances: these are the important variables to consider.

    3. Small eigenvalues correspond to small variances: these variables can sometimes be neglected.

  • Slide 44/68

    Eigenvectors

    The directions of the new rotated axes are called the eigenvectors of the covariance matrix.

    Etymological Note:

    vector: from Latin, "carrier"

    eigen: from German, "proper to, belonging to"

  • Slide 45/68

    Example 2: A More Realistic Example

    3 variables, 5 observations:

    Finance, X   Marketing, Y   Policy, Z
         3            6            5
         7            3            3
        10            9            8
         3            9            7
        10            6            5

  • Slide 46/68

    Running PCA in R
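    The R commands themselves are not in the transcript; a minimal sketch of one way to run PCA on the Example 2 data, using base R's prcomp (princomp is an alternative):

        finance   <- c(3, 7, 10, 3, 10)
        marketing <- c(6, 3, 9, 9, 6)
        policy    <- c(5, 3, 8, 7, 5)
        dat <- data.frame(finance, marketing, policy)

        pca <- prcomp(dat)   # centres the data; PCA on the covariance matrix
        summary(pca)         # proportion of variance explained per component
        pca$rotation         # the loadings (eigenvectors of the covariance matrix)
        screeplot(pca)       # variances of the components as a bar plot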

  • Slide 47/68

    PCA Output from R

  • Slide 48/68

    Assumptions and Limitations of PCA

    1. Large variances represent signal; small variances represent noise.

    2. The new axes are linear combinations of the original axes.

    3. The principal components are orthogonal (perpendicular) to each other.

    4. The distributions of our measurements are fully described by their mean and variance.

  • Slide 49/68

    Advantages of PCA

    1. The procedure is hypothesis free.

    2. The procedure is well defined.

    3. The solution is (essentially) unique.

  • Slide 50/68

    Factor Analysis

    There are a multiplicity of procedures which go under the name of factor analysis.

    Factor analysis is hypothesis driven and tries to find a priori structure in the data.

    We'll use the data from Example 2 in the next few slides...

  • Slides 51–53/68

    Factor Model

    Hypothesis: A significant part of the variance of our 3 variables can be expressed as linear combinations of two underlying factors, F1 and F2:

    X = a1 F1 + a2 F2 + eX

    Y = b1 F1 + b2 F2 + eY

    Z = c1 F1 + c2 F2 + eZ

    eX, eY and eZ allow for errors in the measurements of our 3 variables.
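    To make the model concrete, here is an illustrative simulation in R; the loadings and error standard deviations are invented for the example, not taken from the slides:

        set.seed(1)
        n  <- 10000
        F1 <- rnorm(n)   # factor 1: mean 0, variance 1
        F2 <- rnorm(n)   # factor 2: mean 0, variance 1
        X  <- 0.8 * F1 + 0.2 * F2 + rnorm(n, sd = 0.5)   # hypothetical loadings
        Y  <- 0.1 * F1 + 0.9 * F2 + rnorm(n, sd = 0.5)
        Z  <- 0.5 * F1 + 0.5 * F2 + rnorm(n, sd = 0.5)
        cov(X, Y)   # close to a1*b1 + a2*b2 = 0.26, as slide 56 predicts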


  • Slides 54–55/68

    Assumptions

    1. The factors Fi are independent with mean 0 and variance 1.

    2. The error terms are independent with mean 0 and variance σ²_i.

    With these assumptions, we can calculate the variances in terms of our hypothesised loadings.

  • Slide 56/68

    Factor Model Variances

    Since the factors have unit variance and are independent of each other and of the errors, the model implies:

    C_XX = a1² + a2² + σ²_eX

    C_YY = b1² + b2² + σ²_eY

    C_ZZ = c1² + c2² + σ²_eZ

    C_XY = a1 b1 + a2 b2

    C_XZ = a1 c1 + a2 c2

    C_YZ = b1 c1 + b2 c2

  • Slide 57/68

    Observed Variances for Example 2

    C_XX = 9.84

    C_YY = 5.04

    C_ZZ = 3.04

    C_XY = -0.36

    C_XZ = 0.44

    C_YZ = 3.84

    Note: the observations in the table are not mean-adjusted; the means are subtracted in the calculation, and the denominator used here is n rather than n - 1.
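    These values can be reproduced in R by rescaling cov()'s n - 1 denominator to n:

        finance   <- c(3, 7, 10, 3, 10)
        marketing <- c(6, 3, 9, 9, 6)
        policy    <- c(5, 3, 8, 7, 5)
        n <- 5
        cov(cbind(finance, marketing, policy)) * (n - 1) / n
        # diagonal: 9.84, 5.04, 3.04; off-diagonals: -0.36, 0.44, 3.84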

  • Slides 58–59/68

    Fitting the Model

    To fit the model we would like to choose the loadings ai, bi and ci so that the predicted variances match the observed variances.

    Problem: The solution is not unique.

    That is, there are different choices for the loadings that yield the same values for the variances; in fact, an infinite number...


  • Slides 60–61/68

    So ... what does the computer do?

    In essence, it finds an initial set of loadings and then rotates these to find the best solution according to a user-specified criterion:

    Varimax: tends to maximise the difference in loading between factors.

    Quartimax: tends to maximise the difference in loading between variables.
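    In R, base stats supplies varimax() (and promax()); quartimax is not in base R, but packages such as GPArotation provide it. A sketch on a made-up loadings matrix:

        L <- matrix(c(0.8, 0.7, 0.6, 0.3,    # hypothetical loadings on factor 1
                      0.3, 0.2, 0.4, 0.8),   # hypothetical loadings on factor 2
                    ncol = 2)
        varimax(L)                   # rotated loadings plus the rotation matrix
        # GPArotation::quartimax(L)  # quartimax, if GPArotation is installed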


  • Slide 62/68

    Example 3

    Physics   History   English   Maths   Politics
       8         4         3        8        5
       7         3         3        9        3
      10         4         3        9        4
       3         8         9        4        7
       4         8         6        4        7
       5         9         8        5        9
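    The R session for this example is not in the transcript; a sketch of running both analyses on the Example 3 marks, assuming two factors as in the factor model above (with only six observations this is purely illustrative, and factanal may warn about convergence):

        physics  <- c(8, 7, 10, 3, 4, 5)
        history  <- c(4, 3, 4, 8, 8, 9)
        english  <- c(3, 3, 3, 9, 6, 8)
        maths    <- c(8, 9, 9, 4, 4, 5)
        politics <- c(5, 3, 4, 7, 7, 9)
        marks <- data.frame(physics, history, english, maths, politics)

        prcomp(marks, scale. = TRUE)                        # PCA view of Example 3
        factanal(marks, factors = 2, rotation = "varimax")  # factor-analysis view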

  • Slide 63/68

    PCA Analysis of Example 3

  • Slide 64/68

    Factor Analysis of Example 3

  • Slide 65/68

    Some Opinions on Factor Analysis...

  • Slide 66/68

    Take Home

  • Slide 67/68

    1. Measurements are subject to many distortions, including noise, rotation and redundancy.

    2. PCA is a robust, well defined method for reducing data and investigating underlying structure.

    3. There are many approaches to factor analysis; all of them are subject to a certain degree of arbitrariness and should be used with caution.

  • Slide 68/68

    Further Reading

    A Tutorial on Principal Components Analysis, Jonathon Shlens
    http://www.snl.salk.edu/~shlens/notes.html

    Factor Analysis, Peter Tryfos
    www.yorku.ca/ptryfos/f1400.pdf