
DS-GA 1002 Lecture notes 9 November 16, 2015

Linear models

1 Projections

The projection of a vector x onto a subspace S is the vector in S that is closest to x. In order to define this rigorously, we start by introducing the concept of direct sum. If two subspaces are disjoint, i.e. their only common point is the origin, then a vector that can be written as a sum of a vector from each subspace is said to belong to their direct sum.

Definition 1.1 (Direct sum). Let V be a vector space. For any subspaces S1, S2 ⊆ V such that

S1 ∩ S2 = {0} (1)

the direct sum is defined as

S1 ⊕ S2 := {x | x = s1 + s2 s1 ∈ S1, s2 ∈ S2} . (2)

The representation of a vector in the direct sum of two subspaces is unique.

Lemma 1.2. Any vector x ∈ S1 ⊕ S2 has a unique representation

x = s1 + s2 s1 ∈ S1, s2 ∈ S2. (3)

Proof. If x ∈ S1 ⊕ S2 then by definition there exist s1 ∈ S1, s2 ∈ S2 such that x = s1 + s2. Assume x = s′1 + s′2 with s′1 ∈ S1, s′2 ∈ S2; then s1 − s′1 = s′2 − s2. This implies that s1 − s′1 and s′2 − s2 are in S1 and also in S2. However, S1 ∩ S2 = {0}, so we conclude s1 = s′1 and s2 = s′2.

We can now define the projection of a vector x onto a subspace S by separating the vector into a component that belongs to S and another that belongs to its orthogonal complement.

Definition 1.3 (Orthogonal projection). Let V be a vector space. The orthogonal projection of a vector x ∈ V onto a subspace S ⊆ V is a vector denoted by PS x such that x − PS x ∈ S⊥.

Theorem 1.4 (Properties of orthogonal projections). Let V be a vector space. Every vector x ∈ V has a unique orthogonal projection PS x onto any subspace S ⊆ V of finite dimension. In particular, x can be expressed as

x = PS x+ PS⊥ x. (4)

For any vector s ∈ S

〈x, s〉 = 〈PS x, s〉 . (5)

For any orthonormal basis b1, . . . , bm of S,

PS x = ∑_{i=1}^m 〈x, bi〉 bi.   (6)

Proof. Since V has finite dimension, so does S, which consequently has an orthonormal basis with finite cardinality b′1, . . . , b′m by Theorem 3.7 in Lecture Notes 8. Consider the vector

p := ∑_{i=1}^m 〈x, b′i〉 b′i.   (7)

It turns out that x − p is orthogonal to every vector in the basis. For 1 ≤ j ≤ m,

〈x − p, b′j〉 = 〈x − ∑_{i=1}^m 〈x, b′i〉 b′i, b′j〉   (8)
             = 〈x, b′j〉 − ∑_{i=1}^m 〈x, b′i〉 〈b′i, b′j〉   (9)
             = 〈x, b′j〉 − 〈x, b′j〉 = 0,   (10)

so by Lemma 3.3 in Lecture Notes 8, x − p ∈ S⊥ and p is an orthogonal projection. Since S ∩ S⊥ = {0}¹, there cannot be two other vectors x1 ∈ S, x2 ∈ S⊥ such that x = x1 + x2, so the orthogonal projection is unique.

Notice that o := x − p is a vector in S⊥ such that x − o = p is in S and therefore in (S⊥)⊥. This implies that o is the orthogonal projection of x onto S⊥ and establishes (4).

Equation (5) follows immediately from the orthogonality of any vector s ∈ S and PS⊥ x.

Equation (6) follows from (5) and Lemma 3.5 in Lecture Notes 8.

Computing the norm of the projection of a vector onto a subspace is easy if we have access to an orthonormal basis (as long as the norm is induced by the inner product).

Lemma 1.5 (Norm of the projection). The norm of the projection of an arbitrary vector x ∈ V onto a subspace S ⊆ V of dimension d can be written as

||PS x||〈·,·〉 = √( ∑_{i=1}^d 〈bi, x〉² )   (11)

for any orthonormal basis b1, . . . , bd of S.

¹For any vector v that belongs to both S and S⊥, 〈v, v〉 = ||v||²〈·,·〉 = 0, which implies v = 0.


Proof. By (6),

||PS x||²〈·,·〉 = 〈PS x, PS x〉   (12)
              = 〈 ∑_{i=1}^d 〈bi, x〉 bi, ∑_{j=1}^d 〈bj, x〉 bj 〉   (13)
              = ∑_{i=1}^d ∑_{j=1}^d 〈bi, x〉 〈bj, x〉 〈bi, bj〉   (14)
              = ∑_{i=1}^d 〈bi, x〉².   (15)

Finally, we prove that the projection of a vector x onto a subspace S is indeed the vector in S that is closest to x in the distance induced by the inner-product norm.

Example 1.6 (Projection onto a one-dimensional subspace). To compute the projection of a vector x onto a one-dimensional subspace spanned by a vector v, we use the fact that {v/||v||〈·,·〉} is a basis for span(v) (it is a set containing a unit vector that spans the subspace) and apply (6) to obtain

P_span(v) x = (〈v, x〉 / ||v||²〈·,·〉) v.   (16)
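To make formulas (6) and (16) concrete, the following is a minimal numerical sketch (assuming Python with numpy, which is not part of these notes; the function name is illustrative). It projects a vector onto the column span of a matrix using an orthonormal basis obtained by QR, and checks that the result matches the one-dimensional formula (16) and that the residual is orthogonal to the subspace.

import numpy as np

# Orthogonal projection onto the span of the columns of A, using an
# orthonormal basis of col(A) obtained from a QR factorization (formula (6)).
def project_onto_columns(A, x):
    Q, _ = np.linalg.qr(A)              # columns of Q: orthonormal basis b_1, ..., b_d
    return Q @ (Q.T @ x)                # sum_i <x, b_i> b_i

x = np.array([3.0, 1.0, 2.0])
v = np.array([1.0, 1.0, 0.0])

# One-dimensional case, formula (16): P_span(v) x = (<v, x> / ||v||^2) v.
p1 = (v @ x / (v @ v)) * v
p2 = project_onto_columns(v.reshape(-1, 1), x)

print(np.allclose(p1, p2))              # True: both formulas agree
print(np.isclose((x - p1) @ v, 0.0))    # True: the residual is orthogonal to span(v)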

Theorem 1.7 (The orthogonal projection is closest). The orthogonal projection of a vector x onto a subspace S belonging to the same inner-product space is the closest vector to x that belongs to S in terms of the norm induced by the inner product. More formally, PS x is the solution to the optimization problem

minimize_u   ||x − u||〈·,·〉   (17)
subject to   u ∈ S.   (18)

Proof. Take any point s ∈ S such that s ≠ PS x. Then

||x − s||²〈·,·〉 = ||PS⊥ x + PS x − s||²〈·,·〉   (19)
              = ||PS⊥ x||²〈·,·〉 + ||PS x − s||²〈·,·〉   (20)
              > ||PS⊥ x||²〈·,·〉 = ||x − PS x||²〈·,·〉 because s ≠ PS x,   (21)

where (20) follows from the Pythagorean theorem, since PS⊥ x belongs to S⊥ and PS x − s to S.


2 Linear minimum-MSE estimation

We are interested in estimating the sample of a continuous random variable X from the sample y of a random variable Y. If we know the joint distribution of X and Y, then the optimal estimator in terms of MSE is the conditional mean E (X|Y = y). However, it is often very challenging to completely characterize a joint distribution between two quantities, but it is more tractable to obtain an estimate of their first and second order moments. In this case it turns out that we can obtain the optimal linear estimate of X given Y by using our knowledge of linear algebra.

Theorem 2.1 (Best linear estimator). Assume that we know the means µX, µY and variances σ²X, σ²Y of two random variables X and Y and their correlation coefficient ρXY. The best linear estimate of the form aY + b of X given Y in terms of mean-square error is

g_LMMSE(y) = ρXY σX (y − µY) / σY + µX.   (22)

Proof. First we determine the value of b. The cost function as a function of b is

h(b) = E((X − aY − b)²) = E((X − aY)²) + b² − 2b E(X − aY)   (23)
     = E((X − aY)²) + b² − 2b (µX − aµY).   (24)

The first and second derivatives with respect to b are

h′(b) = 2b − 2 (µX − aµY),   (25)
h″(b) = 2.   (26)

Since h″ is positive, the function is convex, so the minimum is obtained by setting h′(b) to zero, which yields

b = µX − aµY.   (27)

Consider the centered random variables X̄ := X − µX and Ȳ := Y − µY. It turns out that to find a we just need to find the best estimate of X̄ of the form aȲ:

E((X − aY − b)²) = E((X − µX − a (Y − µY) + µX − aµY − b)²)   (28)
                 = E((X̄ − aȲ)²),   (29)

and clearly any a that minimizes the left-hand side also minimizes the right-hand side and vice versa.


Consider the vector space of zero-mean random variables. X̄ and Ȳ belong to this vector space. In fact,

〈X̄, Ȳ〉 = E(X̄ Ȳ)   (30)
        = Cov(X, Y)   (31)
        = σX σY ρXY,   (32)

||X̄||²〈·,·〉 = E(X̄²)   (33)
           = σ²X,   (34)

||Ȳ||²〈·,·〉 = E(Ȳ²)   (35)
           = σ²Y.   (36)

Any random variable of the form aȲ belongs to the subspace spanned by Ȳ. Since the distance in this vector space is induced by the mean-square norm, by Theorem 1.7 the vector of the form aȲ that best approximates X̄ is just the projection of X̄ onto the subspace spanned by Ȳ, which we will denote by P_Ȳ X̄. This subspace has dimension 1, so {Ȳ/σY} is a basis for the subspace. The projection is consequently equal to

P_Ȳ X̄ = 〈X̄, Ȳ/σY〉 Ȳ/σY   (37)
       = 〈X̄, Ȳ〉 Ȳ/σ²Y   (38)
       = σX ρXY Ȳ / σY.   (39)

So a = σX ρXY / σY, which concludes the proof.

In words, the linear estimator of X given Y is obtained by

1. centering Y by removing its mean,

2. normalizing Y by dividing by its standard deviation,

3. scaling the result using the correlation between Y and X,

4. scaling again using the standard deviation of X,

5. recentering by adding the mean of X.
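The five steps above are easy to check numerically. The sketch below (assuming Python with numpy; the variable names and the simulation are illustrative, not part of the notes) builds the linear MMSE estimator (22) from known moments and compares its mean-square error on simulated jointly Gaussian data with that of the constant estimator µX.

import numpy as np

rng = np.random.default_rng(0)

# Moments assumed known, as in Theorem 2.1.
mu_x, mu_y = 1.0, -2.0
sigma_x, sigma_y = 2.0, 3.0
rho = 0.8

# Simulate correlated samples of (X, Y) with these moments.
n = 100_000
z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
x = mu_x + sigma_x * z1
y = mu_y + sigma_y * (rho * z1 + np.sqrt(1 - rho**2) * z2)

# Linear MMSE estimate from (22): center y, normalize, scale by rho and sigma_x, recenter.
x_hat = rho * sigma_x * (y - mu_y) / sigma_y + mu_x

print(np.mean((x - x_hat) ** 2))   # ~ sigma_x^2 (1 - rho^2) = 1.44
print(np.mean((x - mu_x) ** 2))    # ~ sigma_x^2 = 4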


3 Matrices

A matrix is a rectangular array of numbers. We denote the vector space of m × n matrices by Rm×n. We denote the ith row of a matrix A by Ai:, the jth column by A:j and the (i, j) entry by Aij. The transpose of a matrix is obtained by switching its rows and columns.

Definition 3.1 (Transpose). The transpose A^T of a matrix A ∈ Rm×n is the matrix in Rn×m with entries

(A^T)ij = Aji.   (40)

A symmetric matrix is a matrix that is equal to its transpose.

Matrices map vectors to other vectors through a linear operation called the matrix-vector product.

Definition 3.2 (Matrix-vector product). The product of a matrix A ∈ Rm×n and a vector x ∈ Rn is a vector Ax ∈ Rm such that

(Ax)i = ∑_{j=1}^n Aij x[j]   (41)
      = 〈Ai:, x〉,   (42)

i.e. the ith entry of Ax is the dot product between the ith row of A and x.

Equivalently,

Ax = ∑_{j=1}^n A:j x[j],   (43)

i.e. Ax is a linear combination of the columns of A weighted by the entries in x.
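The two views of the matrix-vector product, (41)-(42) as dot products with the rows and (43) as a linear combination of the columns, coincide; the short sketch below (assuming Python with numpy, not part of the notes) verifies this on a small example.

import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [3.0, -1.0, 4.0]])
x = np.array([2.0, 1.0, -1.0])

# Row view (41)-(42): the ith entry is the dot product of the ith row with x.
row_view = np.array([A[i, :] @ x for i in range(A.shape[0])])

# Column view (43): Ax is a linear combination of the columns weighted by the entries of x.
col_view = sum(x[j] * A[:, j] for j in range(A.shape[1]))

print(np.allclose(row_view, A @ x))   # True
print(np.allclose(col_view, A @ x))   # True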

One can easily check that the transpose of the product of two matrices A and B is equal to the transposes multiplied in the reverse order,

(AB)T = BTAT . (44)

We can now express the dot product between two vectors x and y as

〈x, y〉 = xT y = yT x. (45)

The identity matrix is a matrix that maps any vector to itself.


Definition 3.3 (Identity matrix). The identity matrix in Rn×n is

I = [ 1 0 · · · 0 ]
    [ 0 1 · · · 0 ]
    [    · · ·    ]
    [ 0 0 · · · 1 ].   (46)

Clearly, for any x ∈ Rn, Ix = x.

Definition 3.4 (Matrix multiplication). The product of two matrices A ∈ Rm×n and B ∈ Rn×p is a matrix AB ∈ Rm×p such that

(AB)ij = ∑_{k=1}^n Aik Bkj = 〈Ai:, B:j〉,   (47)

i.e. the (i, j) entry of AB is the dot product between the ith row of A and the jth column of B.

Equivalently, the jth column of AB is the result of multiplying A and the jth column of B,

(AB):j = A B:j,   (48)

and the ith row of AB is the result of multiplying the ith row of A and B.

Square matrices may have an inverse. If they do, the inverse is a matrix that reverses the effect of the matrix on any vector.

Definition 3.5 (Matrix inverse). The inverse of a square matrix A ∈ Rn×n is a matrix A−1 ∈ Rn×n such that

AA−1 = A−1A = I. (49)

Lemma 3.6. The inverse of a matrix is unique.

Proof. Let us assume there is another matrix M such that AM = I. Then

M = A−1 A M   by (49)   (50)
  = A−1.   (51)

An important class of matrices is that of orthogonal matrices.


Definition 3.7 (Orthogonal matrix). An orthogonal matrix is a square matrix U such that its inverse is equal to its transpose,

U^T U = U U^T = I.   (52)

By definition, the columns U:1, U:2, . . . , U:n of any orthogonal matrix have unit norm and are orthogonal to each other, so they form an orthonormal basis (it's somewhat confusing that orthogonal matrices are not called orthonormal matrices instead). We can interpret applying U^T to a vector x as computing the coefficients of its representation in the basis formed by the columns of U. Applying U to U^T x recovers x by scaling each basis vector with the corresponding coefficient:

x = U U^T x = ∑_{i=1}^n 〈U:i, x〉 U:i.   (53)

Applying an orthogonal matrix to a vector does not affect its norm; it just rotates the vector.

Lemma 3.8 (Orthogonal matrices preserve the norm). For any orthogonal matrix U ∈ Rn×n and any vector x ∈ Rn,

||Ux||2 = ||x||2.   (54)

Proof. By the definition of an orthogonal matrix,

||Ux||₂² = x^T U^T U x   (55)
        = x^T x   (56)
        = ||x||₂².   (57)
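Lemma 3.8 is easy to check numerically; the sketch below (assuming Python with numpy; drawing the Q factor of a random matrix is just a convenient way to produce an orthogonal matrix and is not part of the notes) verifies (52) and (54).

import numpy as np

rng = np.random.default_rng(1)

# A random orthogonal matrix: the Q factor of the QR factorization of a random square matrix.
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
print(np.allclose(U.T @ U, np.eye(5)))   # U^T U = I, as in (52)

x = rng.standard_normal(5)
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))   # ||Ux||_2 = ||x||_2, as in (54)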

4 Eigendecomposition

An eigenvector v of A satisfies

Av = λv (58)

for a scalar λ, which is the corresponding eigenvalue. Even if A is real, its eigenvectors and eigenvalues can be complex.


Lemma 4.1 (Eigendecomposition). If a square matrix A ∈ Rn×n has n linearly independent eigenvectors v1, . . . , vn with eigenvalues λ1, . . . , λn, it can be expressed in terms of a matrix Q, whose columns are the eigenvectors, and a diagonal matrix Λ containing the eigenvalues,

A = [ v1 v2 · · · vn ] [ λ1 0 · · · 0  ]
                       [ 0 λ2 · · · 0  ]
                       [     · · ·     ]
                       [ 0 0 · · · λn  ] [ v1 v2 · · · vn ]−1   (59)
  = Q Λ Q−1.   (60)

Proof. We have

A Q = [ Av1 Av2 · · · Avn ]   (61)
    = [ λ1v1 λ2v2 · · · λnvn ]   (62)
    = Q Λ.   (63)

As we will establish later on, if the columns of a square matrix are all linearly independent, then the matrix has an inverse, so multiplying both sides by Q−1 on the right completes the proof.

Lemma 4.2. Not all matrices have an eigendecomposition.

Proof. Consider for example the matrix

[ 0 1 ]
[ 0 0 ].   (64)

Assume the matrix has a nonzero eigenvalue λ corresponding to an eigenvector with entries v1 and v2; then

[ v2 ]   [ 0 1 ] [ v1 ]   [ λv1 ]
[ 0  ] = [ 0 0 ] [ v2 ] = [ λv2 ],   (65)

which implies that v2 = 0 and hence v1 = 0, since we have assumed that λ ≠ 0. This implies that the matrix does not have nonzero eigenvalues associated to nonzero eigenvectors. Its only eigenvectors are multiples of (1, 0)^T with eigenvalue 0, so it cannot have two linearly independent eigenvectors.

An interesting use of the eigendecomposition is computing successive matrix products very fast. Assume that we want to compute

A A · · · A x = A^k x,   (66)


i.e. we want to apply A to x k times. A^k cannot be computed by taking the power of its entries (try out a simple example to convince yourself). However, if A has an eigendecomposition,

A^k = Q Λ Q−1 Q Λ Q−1 · · · Q Λ Q−1   (67)
    = Q Λ^k Q−1   (68)
    = Q [ λ1^k 0 · · · 0  ]
        [ 0 λ2^k · · · 0  ]
        [      · · ·      ]
        [ 0 0 · · · λn^k  ] Q−1,   (69)

using the fact that for diagonal matrices applying the matrix repeatedly is equivalent to taking the power of the diagonal entries. This allows us to compute the k matrix products using just 3 matrix products and taking the power of n numbers.
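As a sketch of (67)-(69) (assuming Python with numpy; the example matrix is ours), the following compares A^k computed by repeated multiplication with Q Λ^k Q−1.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
k = 10

# Eigendecomposition A = Q Lambda Q^{-1}.
eigvals, Q = np.linalg.eig(A)

# A^k via the eigendecomposition: only the diagonal entries are raised to the kth power.
A_k_eig = Q @ np.diag(eigvals**k) @ np.linalg.inv(Q)

# A^k by repeated multiplication.
A_k_direct = np.linalg.matrix_power(A, k)

print(np.allclose(A_k_eig, A_k_direct))   # True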

From high-school or undergraduate algebra you probably remember how to compute eigenvectors using determinants. In practice, this is usually not a viable option due to stability issues. A popular technique to compute eigenvectors is based on the following insight. Let A ∈ Rn×n be a matrix with eigendecomposition Q Λ Q−1 and let x be an arbitrary vector in Rn. Since the columns of Q are linearly independent, they form a basis for Rn, so we can represent x as

x = ∑_{i=1}^n αi Q:i,   αi ∈ R, 1 ≤ i ≤ n.   (70)

Now let us apply A to x k times,

A^k x = ∑_{i=1}^n αi A^k Q:i   (71)
      = ∑_{i=1}^n αi λi^k Q:i.   (72)

If we assume that the eigenvalues are ordered according to their magnitudes, that the magnitude of the first one is strictly larger than the rest, |λ1| > |λ2| ≥ . . ., and that α1 ≠ 0 (which happens with high probability if we draw a random x), then as k grows larger the term α1 λ1^k Q:1 dominates. This term will blow up or tend to zero unless we normalize every time before applying A. Adding the normalization step to this procedure results in the power method or power iteration, an algorithm of great importance in numerical linear algebra.

Algorithm 4.3 (Power method).
Input: A matrix A.
Output: An estimate of the eigenvector of A corresponding to the largest eigenvalue.


Figure 1: Illustration of the first three iterations x1, x2, x3 of the power method for a matrix with eigenvectors v1 and v2, whose corresponding eigenvalues are λ1 = 1.05 and λ2 = 0.1661.

Initialization: Set x1 := x/||x||2, where x contains random entries.
For i = 2, . . . , k, compute

xi := A xi−1 / ||A xi−1||2.   (73)

Figure 1 illustrates the power method on a simple example, where the matrix, which was just drawn at random, is equal to

A = [ 0.930 0.388 ]
    [ 0.237 0.286 ].   (74)

The convergence to the eigenvector corresponding to the eigenvalue with the largest magnitude is very fast.
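Algorithm 4.3 takes only a few lines of code. Here is a minimal sketch (assuming Python with numpy; the function name and the number of iterations are ours), applied to the matrix in (74); the output matches the dominant eigenvector computed by a standard eigenvalue routine up to sign.

import numpy as np

def power_method(A, k=50, seed=0):
    # Estimate of the eigenvector of A associated with the largest-magnitude eigenvalue.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x)            # initialization: x_1 := x / ||x||_2
    for _ in range(k):
        x = A @ x
        x /= np.linalg.norm(x)        # normalize before applying A again, as in (73)
    return x

A = np.array([[0.930, 0.388],
              [0.237, 0.286]])

v_est = power_method(A)
eigvals, eigvecs = np.linalg.eig(A)
v_true = eigvecs[:, np.argmax(np.abs(eigvals))]

# Distance to the true dominant eigenvector, up to a sign flip (should be ~0).
print(min(np.linalg.norm(v_est - v_true), np.linalg.norm(v_est + v_true)))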

5 Time-homogeneous Markov chains

A Markov chain is a sequence of discrete random variables X0, X1, . . . such that

p_{Xk+1 | X0, X1, ..., Xk}(xk+1 | x0, x1, . . . , xk) = p_{Xk+1 | Xk}(xk+1 | xk).   (75)

In words, Xk+1 is conditionally independent of Xj for j ≤ k − 1 conditioned on Xk. If the value of the random variables is restricted to a finite set {α1, . . . , αn} with probability one and the transition probabilities do not depend on k, the Markov chain is said to be time homogeneous. More formally,

Pij := pXk+1|Xk(αi|αj) (76)

only depends on i and j, not on k for all 1 ≤ i, j ≤ n, k ≥ 0.

If a Markov chain is time homogeneous, we can group the transition probabilities Pij into a transition matrix P. We express the pmf of Xk, restricted to the set where it can be nonzero, as a vector,

πk = [ pXk(α1) ]
     [ pXk(α2) ]
     [   · · ·  ]
     [ pXk(αn) ].   (77)

By the Chain Rule, the pmf of Xk can be computed from the pmf of X0 using the transition matrix,

πk = P P · · · P π0 = P^k π0.   (78)

In some cases, no matter how we initialize the Markov chain, the Markov chain forgets its initial state and converges to a stationary distribution. This is exploited in Markov-chain Monte Carlo methods, which make it possible to sample from arbitrary distributions by building the corresponding Markov chain. These methods are very useful in Bayesian statistics.

A Markov chain that converges to a stationary distribution π∞ is said to be ergodic. Note that necessarily P π∞ = π∞, so that π∞ is an eigenvector of the transition matrix with a corresponding eigenvalue equal to one.

Conversely, let a transition matrix P of a Markov chain have a valid eigendecomposition with n linearly independent eigenvectors v1, v2, . . . and corresponding eigenvalues λ1 > λ2 ≥ λ3 ≥ . . .. If the eigenvector corresponding to the largest eigenvalue has non-negative entries, then

π∞ := v1 / ∑_{i=1}^n v1(i)   (79)

is a valid pmf and

Pπ∞ = λ1π∞ (80)

is also a valid pmf, which is only possible if the largest eigenvalue λ1 equals one. Now, if we represent any possible initial pmf π0 in the basis formed by the eigenvectors of P, we have

π0 = ∑_{i=1}^n αi vi,   αi ∈ R, 1 ≤ i ≤ n,   (81)

and

πk = P^k π0   (82)
   = ∑_{i=1}^n αi P^k vi   (83)
   = α1 λ1^k v1 + ∑_{i=2}^n αi λi^k vi.   (84)


Since the rest of the eigenvalues are strictly smaller than one,

lim_{k→∞} πk = α1 λ1 v1 = π∞,   (85)

where the last equality follows from the fact that the πk all belong to the closed set {π | ∑_{i=1}^n π(i) = 1}, so the limit also belongs to the set and hence is a valid pmf. We refer the interested reader to more advanced texts treating Markov chains for further details.
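The convergence argument can be checked numerically. The following sketch (assuming Python with numpy; the transition matrix is an illustrative example, not taken from the notes) iterates πk = P πk−1 for a small column-stochastic matrix and compares the limit with the eigenvector of P associated with the eigenvalue 1, normalized to sum to one as in (79).

import numpy as np

# Column-stochastic transition matrix: P[i, j] is the probability of moving from state j to state i.
P = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])

pi = np.array([1.0, 0.0, 0.0])         # arbitrary initial pmf pi_0
for _ in range(200):
    pi = P @ pi                        # pi_k = P^k pi_0, as in (78)

# Stationary distribution: eigenvector with eigenvalue 1, normalized to sum to one.
eigvals, eigvecs = np.linalg.eig(P)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi_inf = v / v.sum()

print(np.allclose(pi, pi_inf))         # True: the chain forgets its initial state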

6 Eigendecomposition of symmetric matrices

The following lemma shows that eigenvectors of a symmetric matrix corresponding to different nonzero eigenvalues are necessarily orthogonal.

Lemma 6.1. If A ∈ Rn×n is symmetric and ui, uj are eigenvectors of A corresponding to different nonzero eigenvalues λi ≠ λj (λi, λj ≠ 0), then

ui^T uj = 0.   (86)

Proof. Since A = A^T,

ui^T uj = (1/λi) (A ui)^T uj   (87)
        = (1/λi) ui^T A^T uj   (88)
        = (1/λi) ui^T A uj   (89)
        = (λj/λi) ui^T uj.   (90)

Since λj/λi ≠ 1, this is only possible if ui^T uj = 0.

It turns out that every n × n symmetric matrix has n linearly independent eigenvectors. The proof of this is beyond the scope of these notes. An important consequence is that all symmetric matrices have an eigendecomposition of the form

A = U D U^T,   (91)

where U = [ u1 u2 · · · un ] is an orthogonal matrix.
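A quick numerical sketch of (91) (assuming Python with numpy; numpy.linalg.eigh is the standard routine for symmetric matrices and is not something introduced in the notes):

import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B + B.T                           # a symmetric matrix

# eigh returns the eigenvalues and an orthogonal matrix whose columns are eigenvectors.
eigvals, U = np.linalg.eigh(A)

print(np.allclose(U.T @ U, np.eye(4)))              # the eigenvectors are orthonormal
print(np.allclose(A, U @ np.diag(eigvals) @ U.T))   # A = U D U^T, as in (91)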

The eigenvalues λ1, λ2, . . . , λn of a symmetric matrix can be positive, negative or zero. They determine the value of the quadratic form

q(x) := x^T A x = ∑_{i=1}^n λi (x^T ui)².   (92)


If we order the eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn, then the first eigenvalue is the maximum value attained by the quadratic form if its input has unit ℓ2 norm, the second eigenvalue is the maximum value attained by the quadratic form if we restrict its argument to be normalized and orthogonal to the first eigenvector, and so on.

Theorem 6.2. For any symmetric matrix A ∈ Rn×n with normalized eigenvectors u1, u2, . . . , un and corresponding eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λn,

λ1 = max_{||u||2 = 1} u^T A u,   (93)
u1 = arg max_{||u||2 = 1} u^T A u,   (94)
λk = max_{||u||2 = 1, u ⊥ u1, . . . , uk−1} u^T A u,   (95)
uk = arg max_{||u||2 = 1, u ⊥ u1, . . . , uk−1} u^T A u.   (96)

The theorem is proved in Section A.1 of the appendix.

7 The singular-value decomposition

If we consider the columns and rows of a matrix as sets of vectors, then we can study their respective spans.

Definition 7.1 (Row and column space). The row space row(A) of a matrix A is the span of its rows. The column space col(A) is the span of its columns.

It turns out that the row space and the column space of any matrix have the same dimension. We name this quantity the rank of the matrix.

Theorem 7.2. The rank is well defined;

dim (col (A)) = dim (row (A)) . (97)

Section A.2 of the appendix contains the proof.

The following theorem states that we can decompose any real matrix into the product of orthogonal matrices containing bases for its row and column space and a diagonal matrix with a nonnegative diagonal. It is a fundamental result in linear algebra, but its proof is beyond the scope of these notes.


Theorem 7.3. Without loss of generality let m ≤ n. Every rank-r real matrix A ∈ Rm×n has a unique singular-value decomposition (SVD) of the form

A = [ u1 u2 · · · um ] [ σ1 0 · · · 0  ] [ v1^T  ]
                       [ 0 σ2 · · · 0  ] [ v2^T  ]
                       [     · · ·     ] [  · · · ]
                       [ 0 0 · · · σm  ] [ vm^T  ]   (98)
  = U S V^T,   (99)

where the singular values σ1 ≥ σ2 ≥ · · · ≥ σm ≥ 0 are nonnegative real numbers, the matrix U ∈ Rm×m containing the left singular vectors is orthogonal, and the matrix V ∈ Rn×m containing the right singular vectors is a submatrix of an orthogonal matrix (i.e. its columns form an orthonormal set).

Note that we can write the matrix as a sum of rank-1 matrices,

A = ∑_{i=1}^m σi ui vi^T   (100)
  = ∑_{i=1}^r σi ui vi^T,   (101)

where r is the number of nonzero singular values. The first r left singular vectors u1, u2, . . . , ur ∈ Rm form an orthonormal basis of the column space of A and the first r right singular vectors v1, v2, . . . , vr ∈ Rn form an orthonormal basis of the row space of A. Therefore the rank of the matrix is equal to r.
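A numerical sketch of (98)-(101) (assuming Python with numpy; the example matrix is ours), checking that the sum of rank-1 terms reproduces the matrix and that the number of nonzero singular values equals the rank:

import numpy as np

rng = np.random.default_rng(3)
# A 3 x 5 matrix of rank 2: the product of a 3 x 2 factor and a 2 x 5 factor.
A = rng.standard_normal((3, 2)) @ rng.standard_normal((2, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U S V^T with S = diag(s)

# Reconstruct A as a sum of rank-1 matrices sigma_i u_i v_i^T, as in (100).
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_sum))                        # True

# The rank equals the number of nonzero singular values (2 here, up to round-off).
print(np.sum(s > 1e-10), np.linalg.matrix_rank(A))  # 2 2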

8 Principal component analysis

The goal of dimensionality-reduction methods is to project high-dimensional data onto a lower-dimensional space while preserving as much information as possible. These methods are a basic tool in data analysis; some applications include visualization (especially if we project onto R2 or R3), denoising and increasing computational efficiency. Principal component analysis (PCA) is a linear dimensionality-reduction technique based on the SVD.

If we interpret a set of data vectors as belonging to an ambient vector space, applying PCA allows us to find directions in this space along which the data have a high variation. This is achieved by centering the data and then extracting the singular vectors corresponding to the largest singular values. The next two sections provide a geometric and a probabilistic justification.


Algorithm 8.1 (Principal component analysis).
Input: n data vectors x1, x2, . . . , xn ∈ Rm, a number k ≤ min{m, n}.
Output: The first k principal components, a set of orthonormal vectors of dimension m.

1. Center the data. Compute

xi := xi − (1/n) ∑_{j=1}^n xj,   (102)

for 1 ≤ i ≤ n.

2. Group the centered data in a data matrix X,

X = [ x1 x2 · · · xn ].   (103)

Compute the SVD of X and extract the left singular vectors corresponding to the k largest singular values. These are the first k principal components.
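Algorithm 8.1 is only a few lines of code. Below is a minimal sketch (assuming Python with numpy; the function name and the synthetic data are ours, not part of the notes).

import numpy as np

def pca(data, k):
    # First k principal components of the n data vectors given as the columns of data (m x n).
    mean = data.mean(axis=1, keepdims=True)
    X = data - mean                            # step 1: center the data, as in (102)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]                            # step 2: leading left singular vectors of X

rng = np.random.default_rng(4)
# 2D data whose variation is mostly along the direction (1, 1).
data = np.array([[1.0], [1.0]]) * rng.standard_normal(100) + 0.1 * rng.standard_normal((2, 100))

print(pca(data, 1)[:, 0])   # approximately +-(1, 1)/sqrt(2)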

8.1 PCA: Geometric interpretation

Once the data are centered, the energy of the projection of the data points onto different directions in the ambient space reflects the variation of the dataset along those directions. PCA selects the directions that maximize the ℓ2 norm of the projection and are mutually orthogonal. By Lemma 1.5, the sum of the squared ℓ2 norms of the projection of the centered data x1, x2, . . . , xn onto a 1D subspace spanned by a unit-norm vector u can be expressed as

∑_{i=1}^n ||P_span(u) xi||₂² = ∑_{i=1}^n u^T xi xi^T u   (104)
                             = u^T X X^T u   (105)
                             = ||X^T u||₂².   (106)

If we want to maximize the energy of the projection onto a subspace of dimension k, an option is to choose orthogonal 1D projections sequentially. First we choose a unit vector u1 that maximizes ||X^T u||₂² and consequently spans the 1D subspace that is best adapted to the data. Then, we choose a second unit vector u2 orthogonal to the first which maximizes ||X^T u||₂² and hence spans the 1D subspace that is best adapted to the data while being in the orthogonal complement of u1. We repeat this procedure until we have k orthogonal directions. This is exactly equivalent to performing PCA, as proved in Theorem 8.2 below. The k directions correspond to the first k principal components. Figure 2 provides an example in 2D. Note how the singular values are proportional to the energy that lies in the direction of the corresponding principal component.


Figure 2: PCA of a dataset with n = 100 2D vectors with different configurations. The two first singular values reflect how much energy is preserved by projecting onto the two first principal components u1 and u2 (from left to right: σ1/√n = 0.705, 0.9832, 1.3490 and σ2/√n = 0.690, 0.3559, 0.1438).

Theorem 8.2. For any matrix X ∈ Rm×n, where n > m, with left singular vectors u1, u2, . . . , um corresponding to the nonzero singular values σ1 ≥ σ2 ≥ . . . ≥ σm,

σ1 = max_{||u||2 = 1} ||X^T u||2,   (107)
u1 = arg max_{||u||2 = 1} ||X^T u||2,   (108)
σk = max_{||u||2 = 1, u ⊥ u1, . . . , uk−1} ||X^T u||2,   2 ≤ k ≤ r,   (109)
uk = arg max_{||u||2 = 1, u ⊥ u1, . . . , uk−1} ||X^T u||2,   2 ≤ k ≤ r.   (110)

Proof. If the SVD of X is U S V^T, then the eigendecomposition of X X^T is equal to

X X^T = U S V^T V S U^T = U S² U^T,   (111)

where V^T V = I because n > m and the matrix has m nonzero singular values. S² is a diagonal matrix containing σ1² ≥ σ2² ≥ . . . ≥ σm² on its diagonal.

The result now follows from applying Theorem 6.2 to the quadratic form

u^T X X^T u = ||X^T u||₂².   (112)

This result shows that PCA is equivalent to choosing the best (in terms of ℓ2 norm) k 1D subspaces following a greedy procedure, since at each step we choose the best 1D subspace


Figure 3: PCA applied to n = 100 2D data points. On the left the data are not centered (σ1/√n = 5.077, σ2/√n = 0.889). As a result the dominant principal component u1 lies in the direction of the mean of the data and PCA does not reflect the actual structure. Once we center (σ1/√n = 1.261, σ2/√n = 0.139), u1 becomes aligned with the direction of maximal variation.

orthogonal to the previous ones. A natural question to ask is whether this method produces the best k-dimensional subspace. A priori this is not necessarily the case; many greedy algorithms produce suboptimal results. However, in this case the greedy procedure is indeed optimal: the subspace spanned by the first k principal components is the best subspace we can choose in terms of the ℓ2 norm of the projections. The theorem is proved in Section A.3 of the appendix.

Theorem 8.3. For any matrix X ∈ Rm×n with left singular vectors u1, u2, . . . , um corresponding to the nonzero singular values σ1 ≥ σ2 ≥ . . . ≥ σm,

∑_{i=1}^n ||P_span(u1,u2,...,uk) xi||₂² ≥ ∑_{i=1}^n ||P_S xi||₂²,   (113)

for any subspace S of dimension k ≤ min{m, n}.

Figure 3 illustrates the importance of centering before applying PCA. Theorems 8.2 and 8.3 still hold if the data are not centered. However, the norm of the projection onto a certain direction no longer reflects the variation of the data. In fact, if the data are concentrated around a point that is far from the origin, the first principal component will tend to be aligned in that direction. This makes sense, as projecting onto that direction captures more energy. As a result, the principal components do not capture the directions of maximum variation within the cloud of data.


8.2 PCA: Probabilistic interpretation

Let us interpret our data, x1, x2, . . . , xn in Rm, as samples of a random vector X of dimension m. Recall that we are interested in determining the directions of maximum variation of the data in ambient space. In probabilistic terms, we want to find the directions in which the data have higher variance. The covariance matrix of the data provides this information. In fact, we can use it to determine the variance of the data in any direction.

Lemma 8.4. Let u be a unit vector. Then

Var(X^T u) = u^T ΣX u.   (114)

Proof.

Var(X^T u) = E((X^T u)²) − E(X^T u)²   (115)
           = E(u^T X X^T u) − E(u^T X) E(X^T u)   (116)
           = u^T (E(X X^T) − E(X) E(X)^T) u   (117)
           = u^T ΣX u.   (118)

Of course, if we only have access to samples of the random vector, we do not know the covariance matrix of the vector. However, we can approximate it using the empirical covariance matrix.

Definition 8.5 (Empirical covariance matrix). The empirical covariance matrix of the vectors x1, x2, . . . , xn in Rm is equal to

Σn := (1/n) ∑_{i=1}^n (xi − x̄n)(xi − x̄n)^T   (119)
    = (1/n) X X^T,   (120)

where x̄n is the sample mean, as defined in Definition 1.3 of Lecture Notes 4, and X is the matrix containing the centered data as defined in (103).
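A quick numerical sketch of (119)-(120) and of the connection with PCA (assuming Python with numpy; not part of the notes): for centered data the empirical covariance equals (1/n) X X^T, and its eigenvectors coincide, up to sign, with the principal components.

import numpy as np

rng = np.random.default_rng(5)
data = np.array([[3.0], [0.5]]) * rng.standard_normal((2, 500))   # 2 x n data matrix

mean = data.mean(axis=1, keepdims=True)
X = data - mean                                     # centered data, as in (103)
n = X.shape[1]

Sigma_n = (X @ X.T) / n                             # empirical covariance (120)
print(np.allclose(Sigma_n, np.cov(data, bias=True)))   # matches numpy's estimator

# The eigenvectors of Sigma_n coincide (up to sign and ordering) with the left singular vectors of X.
_, U_eig = np.linalg.eigh(Sigma_n)                  # eigenvalues in ascending order
U_svd, _, _ = np.linalg.svd(X, full_matrices=False) # singular values in descending order
print(np.allclose(np.abs(U_eig[:, ::-1]), np.abs(U_svd)))   # True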

If we assume that the mean of the data is zero (i.e. that the data have been centered using the true mean), then the empirical covariance is an unbiased estimator of the true covariance matrix:

E( (1/n) ∑_{i=1}^n Xi Xi^T ) = (1/n) ∑_{i=1}^n E(Xi Xi^T)   (121)
                             = ΣX.   (122)


Figure 4: Principal components of n data vectors sampled from a 2D Gaussian distribution, for n = 5, 20 and 100. The eigenvectors of the true covariance matrix of the distribution are shown alongside those of the empirical covariance.

If the higher moments of the data E(Xi² Xj²) and E(Xi⁴) are finite, by Chebyshev's inequality the entries of the empirical covariance matrix converge to the entries of the true covariance matrix. This means that in the limit

Var(X^T u) = u^T ΣX u   (123)
           ≈ (1/n) u^T X X^T u   (124)
           = (1/n) ||X^T u||₂²   (125)

for any unit-norm vector u. In the limit, the principal components correspond to the directions of maximum variance of the underlying random vector. These directions also correspond to the eigenvectors of the true covariance matrix by Theorem 6.2. Figure 4 illustrates how the principal components converge to the eigenvectors of ΣX.


A Proofs

A.1 Proof of Theorem 6.2

The eigenvectors are an orthonormal basis (they are mutually orthogonal and we assume that they have been normalized), so we can represent any unit-norm vector hk that is orthogonal to u1, . . . , uk−1 as

hk = ∑_{i=k}^m αi ui,   (126)

where

||hk||₂² = ∑_{i=k}^m αi² = 1,   (127)

by Lemma 1.5. Note that h1 is just an arbitrary unit-norm vector.

Now we will show that the value of the quadratic form, when the normalized input is restricted to be orthogonal to u1, . . . , uk−1, cannot be larger than λk:

hk^T A hk = ∑_{i=1}^n λi ( ∑_{j=k}^m αj ui^T uj )²   by (92) and (126)   (128)
          = ∑_{i=k}^m λi αi²   because u1, . . . , um is an orthonormal basis   (129)
          ≤ λk ∑_{i=k}^m αi²   because λk ≥ λk+1 ≥ . . . ≥ λm   (130)
          = λk   by (127).   (131)

This establishes (93) and (95). To prove (94) and (96) we just need to show that uk achieves the maximum:

uk^T A uk = ∑_{i=1}^n λi (ui^T uk)²   (132)
          = λk.   (133)

A.2 Proof of Theorem 7.2

It is sufficient to prove

dim(row(A)) ≤ dim(col(A))   (134)

for an arbitrary matrix A. We can apply the result to A^T to establish dim(row(A)) ≥ dim(col(A)), since row(A^T) = col(A) and col(A^T) = row(A).

To prove (134), let r := dim(row(A)) and let x1, . . . , xr ∈ Rn be a basis for row(A). Consider the vectors Ax1, . . . , Axr ∈ Rm. They belong to col(A) by (43), so if they are linearly independent then dim(col(A)) must be at least r. We will prove that this is the case by contradiction.

Assume that Ax1, . . . , Axr are linearly dependent. Then there exist coefficients α1, . . . , αr ∈ R, not all zero, such that

0 = ∑_{i=1}^r αi A xi = A ( ∑_{i=1}^r αi xi )   (by linearity of the matrix product).   (135)

This implies that ∑_{i=1}^r αi xi is orthogonal to every row of A and hence to every vector in row(A). However, it is in the span of a basis of row(A) by construction! This is only possible if ∑_{i=1}^r αi xi = 0, which is a contradiction because x1, . . . , xr are assumed to be linearly independent.

A.3 Proof of Theorem 8.3

We prove the result by induction. The base case, k = 1, follows immediately from (108). To complete the proof we need to show that if the result is true for k − 1 ≥ 1 (this is the induction hypothesis), then it also holds for k.

Let S be an arbitrary subspace of dimension k. We choose an orthonormal basis for the subspace b1, b2, . . . , bk such that bk is orthogonal to u1, u2, . . . , uk−1. We can do this by using any vector that is linearly independent of u1, u2, . . . , uk−1 and subtracting its projection onto the span of u1, u2, . . . , uk−1 (if the result is always zero then S is in the span and consequently cannot have dimension k).

By the induction hypothesis,

∑_{i=1}^n ||P_span(u1,u2,...,uk−1) xi||₂² = ∑_{i=1}^{k−1} ||X^T ui||₂²   by (6)   (136)
                                         ≥ ∑_{i=1}^n ||P_span(b1,b2,...,bk−1) xi||₂²   (137)
                                         = ∑_{i=1}^{k−1} ||X^T bi||₂²   by (6).   (138)


By (110),

∑_{i=1}^n ||P_span(uk) xi||₂² = ||X^T uk||₂²   (139)
                              ≥ ∑_{i=1}^n ||P_span(bk) xi||₂²   (140)
                              = ||X^T bk||₂².   (141)

Combining (138) and (141) we conclude

∑_{i=1}^n ||P_span(u1,u2,...,uk) xi||₂² = ∑_{i=1}^k ||X^T ui||₂²   (142)
                                        ≥ ∑_{i=1}^k ||X^T bi||₂²   (143)
                                        = ∑_{i=1}^n ||P_S xi||₂².   (144)
