MAA704, Multivariate analysis
Christopher Engström
Multivariate analysis: Principal component analysis, Partial least squares regression, Linear discriminant analysis
December 20, 2013



Today's lecture

- Principal component analysis (PCA)
- Partial least squares regression (PLS-R)
- Linear discriminant analysis (LDA)


Principal component analysis

Principal component analysis (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageable size.

- The new dataset can then be used for your analysis, supposedly with little loss of information.
- The new dataset can also be used to find hidden relations in the data that might not have been obvious otherwise.


Assume we have n sets of data, each containing m measurements, for example:

- The value of n stocks over a period of time.
- The water usage of n households every day over some period of time.
- The rainfall at n different places over time.
- And many more.

What if we could instead work with a smaller set of j < n sets of data, each with m measurements?


What about using the Gram-Schmidt process?

- If one set of data were a linear combination of the others, we could use the Gram-Schmidt process.
- We would then have a lower-dimensional dataset with no loss of information!
- Unfortunately, even if the dataset contains very strong dependencies, this is generally not the case for data collected from a real process.
- Any measurement of real data contains some random errors, which quickly destroy any use we could have for the Gram-Schmidt process.


But what if the datasets are "nearly" linearly dependent?

- For example, points close to a plane in three dimensions.
- We could represent our points in $\mathbb{R}^3$ as points on the plane (in $\mathbb{R}^2$) by projecting the points onto the plane.
- We could then work in this 2-dimensional space instead, with (probably) a small loss of information.
- PCA works in a similar way, as we will describe next.


Before doing any PCA we need to pre-process our data a bit.

- We start by subtracting the mean from every individual set.
- We then let every set of measurements correspond to one row of a matrix, resulting in an $n \times m$ matrix X (as sketched below).
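A minimal sketch of this pre-processing step, assuming numpy; the data shape (4 sets of 100 measurements) and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical example: n = 4 sets of measurements, m = 100 measurements each.
rng = np.random.default_rng(0)
data = rng.normal(size=(4, 100))

# Subtract the mean from every individual set and keep the sets as the rows
# of the n x m matrix X used on the following slides.
X = data - data.mean(axis=1, keepdims=True)
```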


Next we want to find the single vector that "best" describes the data, in the sense that projecting all the measurements in X onto the line described by this vector captures the maximum variance in the original data.

- Setting the length of the vector equal to one, this vector $w = (w_1, w_2, \ldots, w_m)$ should then satisfy:

$$w = \arg\max_{\|w\|=1} \left\{ \sum_{i=1}^{n} (x_i \cdot w)^2 \right\}$$

- Writing this in matrix form yields:

$$w = \arg\max_{\|w\|=1} \left\{ \|Xw\|^2 \right\} = \arg\max_{\|w\|=1} \left\{ w^\top X^\top X w \right\}$$


Since $\|w\| = 1$ we can write this as:

$$w = \arg\max_{\|w\|=1} \left\{ \frac{w^\top X^\top X w}{w^\top w} \right\}$$

- This is called the Rayleigh quotient for $M = X^\top X$, provided M is a Hermitian matrix ($M^H = M$).
- In practice M will always be a Hermitian matrix for any real measured data.
- But how do we find the vector which gives the maximum?


We let

$$R(M, w) = \frac{w^\top X^\top X w}{w^\top w}$$

We write $w = c_1 v_1 + c_2 v_2 + \ldots + c_m v_m$ as a linear combination of the eigenvectors of M, resulting in:

$$R(M, w) = \frac{\left(\sum_{j=1}^{m} c_j v_j\right)^\top X^\top X \left(\sum_{j=1}^{m} c_j v_j\right)}{\left(\sum_{j=1}^{m} c_j v_j\right)^\top \left(\sum_{j=1}^{m} c_j v_j\right)}$$

Since the vectors $v_j$ are orthogonal, and using that $M v_j = \lambda_j v_j$, we can write this as:

$$R(M, w) = \frac{\sum_{j=1}^{m} c_j^2 \lambda_j}{\sum_{j=1}^{m} c_j^2}$$


Our problem can then be written as:

$$w = \arg\max_{\sum_{j=1}^{m} c_j^2 = 1} \left\{ \frac{\sum_{j=1}^{m} c_j^2 \lambda_j}{\sum_{j=1}^{m} c_j^2} \right\}$$

This is clearly maximized by taking $c_j = 1$ for the largest eigenvalue $\lambda_j$ (and all other coefficients zero), which means that w should be equal to the eigenvector corresponding to the largest eigenvalue of M.
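A small numerical check of this claim, assuming numpy; the data is a random stand-in for the pre-processed matrix X.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))          # stand-in for the pre-processed n x m matrix
M = X.T @ X                           # M = X^T X, real symmetric (hence Hermitian)

eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
w = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

rayleigh = lambda v: (v @ M @ v) / (v @ v)

print(np.isclose(rayleigh(w), eigvals[-1]))   # the quotient attains the largest eigenvalue
trials = rng.normal(size=(1000, 5))
print(max(rayleigh(v) for v in trials) <= eigvals[-1] + 1e-9)  # no random direction beats it
```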


So, by finding the eigenvalues and eigenvectors of $M = X^\top X$, we obtain the first "principal component" as the eigenvector corresponding to the largest eigenvalue.

- Similarly, the direction which gives the largest variance orthogonal to the first vector is the eigenvector corresponding to the second largest eigenvalue.
- $M = X^\top X$ is called the covariance matrix of X.
- Now that we have our principal components, we can reduce the dimension of our data with a (hopefully) low loss of information.


To reduce the dimension we create a new basis with only the vectors pointing in directions of large variance.

- If we wanted to use all the eigenvectors we would get:

$$T = XW$$

where W contains all the eigenvectors of M as its columns.

- By only taking the first L eigenvectors we get:

$$T_L = X W_L$$

where $W_L$ is an $m \times L$ matrix consisting of the first L columns of W (the eigenvectors ordered by decreasing eigenvalue).


But how do we know how many eigenvectors to include?

- The eigenvalues give an estimate of how much of the "variance" is captured by the corresponding eigenvectors; we can use this!
- You might want to keep at least some proportion of the "variance", for example:

$$\frac{V(L)}{V(n)} \geq 0.9$$

where $V(i)$ is the sum of the i largest eigenvalues (this threshold rule is sketched in code below).

- Another method is to plot V and only include the eigenvectors corresponding to eigenvalues that give a significant increase in V.
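A minimal sketch of the threshold rule, assuming numpy; the function name choose_L, the 0.9 cut-off, and the random stand-in data are illustrative choices.

```python
import numpy as np

def choose_L(X, keep=0.9):
    """Smallest L such that V(L)/V(total) >= keep, with M = X^T X as above."""
    M = X.T @ X
    eigvals = np.linalg.eigvalsh(M)[::-1]    # eigenvalues, largest first
    V = np.cumsum(eigvals)                   # V(1), V(2), ...
    return int(np.searchsorted(V / V[-1], keep) + 1)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))  # stand-in for pre-processed data
print(choose_L(X))   # number of eigenvectors needed to keep 90% of the "variance"
```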


- In practice one often uses the correlation matrix rather than the covariance matrix; this fixes problems with some measurements having a much higher variance than others.
- Briefly described here was the covariance method; there is another approach based on the singular value decomposition (SVD) which is more numerically stable (a sketch follows below).
- While PCA can be useful for reducing the dimension of many datasets, it does not always work: we need at least some correlation between the different measurements for it to be useful.
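A brief sketch of the SVD route, assuming numpy; it only illustrates the standard relation that the right singular vectors of X are the eigenvectors of $X^\top X$ and the squared singular values its eigenvalues, so the covariance matrix never has to be formed explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))            # stand-in for the pre-processed n x m matrix

U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T

# Same spectrum as M = X^T X, obtained without forming X^T X.
eigvals = np.linalg.eigvalsh(X.T @ X)
print(np.allclose(np.sort(S**2), np.sort(eigvals)))

L = 2
T_L = X @ Vt[:L].T                       # scores from the first L principal directions
```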


Short overview of the covariance method (a code sketch follows the list):

1. Remove the mean from every individual set of measurements and order the sets as the rows of the matrix X.
2. Calculate the covariance matrix $M = X^\top X$ (or use the correlation matrix).
3. Calculate the eigenvalues and eigenvectors of M.
4. Choose the number of eigenvectors to include, for example enough to keep at least 0.9 of the "variance".
5. Calculate $T_L = X W_L$, which is our new dataset.
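A compact sketch that strings the five steps together, assuming numpy; the function name, the 0.9 threshold, and the random stand-in data are illustrative, and the correlation-matrix variant is left out.

```python
import numpy as np

def pca_covariance(data, keep=0.9):
    """Covariance-method PCA following the five steps above (illustrative sketch)."""
    # 1) remove the mean from every individual set (the rows of X)
    X = data - data.mean(axis=1, keepdims=True)
    # 2) covariance-style matrix
    M = X.T @ X
    # 3) eigenvalues and eigenvectors, sorted by decreasing eigenvalue
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]
    eigvals, W = eigvals[order], eigvecs[:, order]
    # 4) keep enough eigenvectors to retain `keep` of the "variance"
    ratio = np.cumsum(eigvals) / eigvals.sum()
    L = int(np.searchsorted(ratio, keep) + 1)
    # 5) project onto the first L eigenvectors
    W_L = W[:, :L]
    return X @ W_L, W_L, eigvals

rng = np.random.default_rng(4)
T_L, W_L, eigvals = pca_covariance(rng.normal(size=(100, 8)))
```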


Partial least squares regression

Partial least squares regression (PLS-R) is a method used to predict a set of variables (observations) from another set of variables (predictors).

- Y is an $I \times K$ matrix with I observations of K dependent variables.
- X is an $I \times J$ matrix with J predictors for the I observations.
- The goal is to predict the K dependent variables using the J predictor variables.
- When the number of predictors J is large compared to the number of observations I, ordinary multiple regression often fails because X tends to be singular.


The goal is to find a T such that:

$$X = T P^\top + E$$

where $T P^\top$ is a projection of X onto the basis in T and E is an error term.

We then estimate Y using:

$$\hat{Y} = T B Q^\top$$

We will look at how to find the different matrices shortly.


Overview of similar methods and their differences:

- Principal component regression: use PCA on the predictors X. This gives informative directions for the predictors, but these directions might not explain the predicted variables well. Based on the spectral factorization of $X^\top X$.
- Maximum redundancy analysis: use PCA on the predicted variables Y. Seeks directions in X which explain the responses in Y well, but we might not get an accurate prediction. Based on the spectral factorization of $Y^\top Y$.
- Partial least squares regression: we choose the vectors in T such that the covariance between X and Y is maximized. Based on the singular value decomposition (SVD) of $X^\top Y$.


What do we mean by maximizing the covariance between X and Y, and how do we find such vectors?

- In order to maximize the covariance we want to find vectors

$$t = Xw, \quad w^\top w = 1$$
$$u = Yq, \quad q^\top q = 1$$

such that $t^\top u$ is maximized.

- w can be shown to be equal to the first left singular vector of $X^\top Y$.
- q can be shown to be equal to the first right singular vector of $X^\top Y$ (see the sketch below).
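A minimal sketch of extracting the first pair of weight vectors, assuming numpy; the data is a random stand-in and b follows the slide's definition $b_i = t_i^\top u_i$.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 7))                    # I = 30 observations, J = 7 predictors
Y = X @ rng.normal(size=(7, 3)) + 0.1 * rng.normal(size=(30, 3))   # K = 3 responses

# First weight pair from the SVD of X^T Y.
U, S, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
w = U[:, 0]        # first left singular vector  -> weights for X
q = Vt[0]          # first right singular vector -> weights for Y

t = X @ w          # score vector for X
u = Y @ q          # score vector for Y
b = t @ u          # b_1 = t^T u, used to predict Y from t
```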


To predict Y we wanted to use:

$$\hat{Y} = T B Q^\top$$

- T is a matrix containing all the vectors $t_i$ as its columns.
- B is a diagonal matrix with elements $b_i = t_i^\top u_i$, used to predict Y from $t_i$.
- Q is a matrix containing all the vectors $q_i$ as its columns.


- In practice, not all singular values are computed; instead a few are computed iteratively and the corresponding vectors are "deflated" from the matrices X and Y.
- This can be done in a number of different ways, which we will however not look at in detail here (one common variant is sketched below).
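The slides leave the deflation scheme unspecified, so the sketch below is only one common variant (unit-norm scores, X deflated by $t p^\top$ and Y by $b\,t q^\top$); the function name and details are assumptions, not the lecture's prescription.

```python
import numpy as np

def pls_components(X, Y, n_components):
    """One common PLS deflation scheme (illustrative; other variants exist)."""
    X = np.array(X, dtype=float)
    Y = np.array(Y, dtype=float)
    T, P, Q, B = [], [], [], []
    for _ in range(n_components):
        # dominant singular pair of the current X^T Y
        U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
        w, q = U[:, 0], Vt[0]
        t = X @ w
        t /= np.linalg.norm(t)        # unit-norm score vector
        u = Y @ q
        p = X.T @ t                   # loading vector for X
        b = t @ u                     # regression weight
        # deflate: remove what this component explains from X and Y
        X -= np.outer(t, p)
        Y -= b * np.outer(t, q)
        T.append(t); P.append(p); Q.append(q); B.append(b)
    return (np.column_stack(T), np.column_stack(P),
            np.column_stack(Q), np.diag(B))

# Usage: T, P, Q, B = pls_components(X, Y, 2); then T @ B @ Q.T approximates Y.
```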


Similarly to PCA, we can say something about the total "variance" of X and Y explained by the first L vectors:

$$V_Y(L) = \frac{\sum_{i=1}^{L} b_i^2}{SS_Y}$$

where $SS_Y$ is the sum of squares of the measurements in Y after subtracting the mean and dividing by the standard deviation. $V_Y(L)$ is then the proportion of "variance" of Y explained by the first L vectors.


$$V_X(L) = \frac{\sum_{i=1}^{L} p_i^\top p_i}{SS_X}$$

where $SS_X$ is the sum of squares of the measurements in X after subtracting the mean and dividing by the standard deviation. $V_X(L)$ is then the proportion of "variance" of X explained by the first L vectors.

- Here $p_i = E_i^\top t_i$, where $E_i = X - \sum_{j=1}^{i-1} t_j p_j^\top$ (X normalized as above).


Linear discriminant analysis

Sometimes you might want to cluster your data, but you already know your clusters beforehand; you just want to be able to predict which cluster a given measurement belongs to.

- This is called a classification problem, where we typically have some observations and know which class they belong to. We use these to "train" our method, which we can then use to classify new observations whose class we don't know.
- This is similar to clustering of data, but we already know our clusters and their interpretation.
- We will look at linear discriminant analysis (LDA), which is used to classify observations into one of 2 different classes.
- It can quite easily be extended to multiple classes as well.


Our objective is to reduce the dimension while preserving as much of the information discriminating the classes from each other as possible.

- Assume we have N D-dimensional samples.
- $N_1$ of these samples belong to class $c_1$ and $N_2$ belong to class $c_2$.
- Our aim is to project the samples x onto a line $y = w^\top x$, resulting in a scalar value for every sample.
- We want to find the line which, when projected upon, best separates the two classes.


To evaluate the class separability we will use something called the Fisher linear discriminant:

- Maximizing this measure will give us the "best" line.
- But first we need to look at some quantities that we will need.


We start by looking at the distance between the projected means of the classes.

- In the original space the mean $\mu_i$ of a class is easily found:

$$\mu_i = \frac{1}{N_i} \sum_{x \in c_i} x$$

- And for the projected means we get:

$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in c_i} y = \frac{1}{N_i} \sum_{x \in c_i} w^\top x = w^\top \mu_i$$

- We could now use the distance between the projected means:

$$|\tilde{\mu}_1 - \tilde{\mu}_2| = |w^\top (\mu_1 - \mu_2)|$$


Although the distance between the projected means might separate the classes well, it does not take the variance of the data into consideration.

- If there is high variance in the same direction as the one obtained by maximizing the distance between the projected means, we could get bad separability anyway.
- To solve this we will look at the variance within a class (also called the scatter) on the projected line:

$$S_i^2 = \sum_{y \in c_i} (y - \tilde{\mu}_i)^2$$

- Adding the scatter of both classes, we get the "within-class scatter":

$$S_1^2 + S_2^2$$


Fisher's linear discriminant is defined as the linear function $w^\top x$ which maximizes J(w):

$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{S_1^2 + S_2^2}$$

We are thus looking for a projection where elements from the same class are projected close to each other (low within-class scatter) and the projected class means are far apart.


To maximize J(w) we start by expressing it in terms of w.

- We start by looking at the "within-class scatter":

$$S_i^2 = \sum_{y \in c_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in c_i} (w^\top x - w^\top \mu_i)^2 = \sum_{x \in c_i} w^\top (x - \mu_i)(x - \mu_i)^\top w$$

- We define $S_i = \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^\top$ and $S_W = S_1 + S_2$, and get:

$$S_1^2 + S_2^2 = w^\top (S_1 + S_2) w = w^\top S_W w$$


If we instead look at the distance between the projected means, we get:

$$|\tilde{\mu}_1 - \tilde{\mu}_2|^2 = (w^\top \mu_1 - w^\top \mu_2)^2 = w^\top (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top w$$

We call $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^\top$ the between-class scatter matrix and get:

$$|\tilde{\mu}_1 - \tilde{\mu}_2|^2 = w^\top S_B w$$


Combining the two gives the following function of w which we want to maximize:

$$J(w) = \frac{w^\top S_B w}{w^\top S_W w}$$

We can find the maximum by setting the gradient to zero:

$$\frac{d}{dw} J(w) = 0$$


With the help of some matrix calculus, this is equivalent to the eigenvalue problem:

$$S_W^{-1} S_B w = J w$$

with solution:

$$w^* = \arg\max_{w} \left[ \frac{w^\top S_B w}{w^\top S_W w} \right] = S_W^{-1} (\mu_1 - \mu_2)$$
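A small numeric sketch of this solution on made-up Gaussian data, assuming numpy; the midpoint threshold used for classification is an illustrative choice, not something prescribed on the slides.

```python
import numpy as np

rng = np.random.default_rng(6)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # samples from class c1
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(120, 2))   # samples from class c2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class c1
S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class c2
S_W = S1 + S2                         # within-class scatter matrix

w = np.linalg.solve(S_W, mu1 - mu2)   # w* = S_W^{-1} (mu1 - mu2)

# Classify a new sample by comparing its projection w^T x with the midpoint
# of the two projected class means (class c1 has the larger projected mean).
threshold = 0.5 * (w @ mu1 + w @ mu2)

def classify(x):
    return "c1" if w @ x > threshold else "c2"

print(classify(np.array([0.2, -0.1])), classify(np.array([2.8, 2.3])))
```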


A short note on generalizations to more than two classes.

- Fisher's linear discriminant itself can be generalized to more classes; this results in finding a subspace spanned by $w_1, w_2, \ldots, w_{C-1}$, where C is the number of classes.
- Another alternative is to classify each class against all of the other classes, giving C classifiers that can then be combined.
- It is also possible to build a classifier for every pair of classes.


While LDA works well for Gaussian distributions, some other shapes or types of distributions can cause problems.

- Some class shapes can lead to problems, for example two intertwined "C" shapes.
- If both classes have the same mean, we cannot classify them.
- If the information lies not in the means of the classes but in their variances, it will also fail.