
ELE 535: Machine Learning and Pattern Recognition

Peter J. Ramadge

Fall 2015, v1.0

© P. J. Ramadge 2015. Please do not distribute without permission.


Chapter 1

Data And Data Embeddings

Machine learning is essentially about learning structural patterns and relations in data and using these to make decisions about new data points. This can include identifying low dimensional structure (dimensionality reduction), clustering the data based on a measure of similarity, the prediction of a category (classification), or the prediction of the value of an associated unmeasured variable (regression). We want the data to be the primary guide for accomplishing this but we sometimes...

1.1 Data as a Set of Vectors

We often represent the data of interest as a set of vectors in some Euclidean space Rn. This has two major advantages: 1) we can exploit the algebraic structure of Rn and 2) we can exploit its Euclidean geometry. This enables tools and concepts such as differential calculus, convexity, convex optimization, probability distributions, and so on.

In some cases there is a natural embedding of the data into Rn. In other cases we must carefully construct a useful embedding of the data into Rn (possibly ignoring some information).

Example 1.1.1. In a medical context the readily measured variables might be: age, gender, weight, blood pressure, resting heart rate, respiratory rate, body temperature, blood analysis, and so on. Group an appropriate subset of these measurable variables into a vector x ∈ Rn. The variable of interest y ∈ R might be: the existence of an infection, the level of an infection, the degree of reaction to a drug, the probability that the patient is about to have a heart attack, and so on. From measurements of the first set of medical variables x we would like to predict the value of y.

Example 1.1.2. Medical Example 2. Not written up yet.

Example 1.1.3. Document Example 1. Not written up yet.

Example 1.1.4. fMRI Example 1. Not written up yet.

1.2 A Quick Review of the Algebraic Structure of Rn

Below we give a quick summary of the key algebraic properties of Rn. Although we do this in the context of Rn, the concepts and constructions described generalize to any finite dimensional vector space.


1.2.1 Linear combinations

A linear combination of vectors x1, . . . , xk using scalars α1, . . . , αk produces the vector x = ∑_{j=1}^k αj xj. The span of vectors x1, . . . , xk ∈ Rn is the set of all such linear combinations:

span{x1, . . . , xk} = {x : x = ∑_{j=1}^k αj xj, for some scalars αj, j = 1, . . . , k}.
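As a concrete numerical illustration, one can test whether a vector lies in the span of x1, . . . , xk by finding the best linear combination and checking that the residual is (numerically) zero. The following NumPy sketch does this; the helper name in_span and the tolerance are illustrative choices.

```python
import numpy as np

def in_span(x, vectors, tol=1e-10):
    """Numerically test whether x lies in span{vectors} in R^n."""
    A = np.column_stack(vectors)                   # spanning vectors as columns of an n x k matrix
    alpha, *_ = np.linalg.lstsq(A, x, rcond=None)  # best coefficients for x ~ sum_j alpha_j x_j
    residual = np.linalg.norm(x - A @ alpha)
    return residual < tol, alpha

x1 = np.array([1.0, 0.0, 1.0])
x2 = np.array([0.0, 1.0, 1.0])
print(in_span(x1 + 2 * x2, [x1, x2]))                # in the span, coefficients ~ [1, 2]
print(in_span(np.array([0.0, 0.0, 1.0]), [x1, x2]))  # not in the span
```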

1.2.2 Subspaces

A subspace of Rn is a subset U ⊆ Rn that is closed under linear combinations of its elements. So for any x1, . . . , xk ∈ U and any scalars α1, . . . , αk, ∑_{j=1}^k αj xj ∈ U. A subspace U of Rn is itself a vector space under the field R.

There are two common ways to specify a subspace. First, for any x1, . . . , xk ∈ Rn, U = span{x1, . . . , xk} is a subspace of Rn. So any subset of vectors specifies a subspace through the span operation.

Lemma 1.2.1. span{x1, . . . , xk} is the smallest subspace of Rn that contains the vectors x1, . . . , xk.

Proof. It is clear that U = span{x1, . . . , xk} contains the vectors x1, . . . , xk and is a subspace of Rn. Suppose V is a subspace of Rn that contains the vectors x1, . . . , xk. Since V is a subspace, it is closed under linear combinations of these vectors, and it follows that span{x1, . . . , xk} = U ⊆ V. So U is the smallest subspace that contains {x1, . . . , xk}.

The second method gives an implicit specification through a set of linear equations. Given fixed scalars αi, i = 1, . . . , n, the set U = {x : ∑_{i=1}^n αi x(i) = 0} is a subspace of Rn. More generally, given k sets of scalars {α_i^{(j)}}_{i=1}^n, j = 1, . . . , k, the set U = {x : ∑_{i=1}^n α_i^{(j)} x(i) = 0, j = 1, . . . , k} is a subspace of Rn.

1.2.3 Subspace intersection and subspace sum

For U, V subspaces of Rn, define

U ∩ V ≜ {x : x ∈ U and x ∈ V}
U + V ≜ {x : x = u + v, some u ∈ U, v ∈ V}.

U ∩ V is the set of vectors in both U and V, and U + V is the set of all vectors formed as the sum of a vector in U and a vector in V.

Lemma 1.2.2. For subspaces U, V of Rn, U ∩ V and U + V are subspaces of Rn. U ∩ V is the largest subspace contained in both U and V, and U + V is the smallest subspace containing both U and V.

Proof. Exercise.

1.2.4 Linear independence

A finite set of vectors {x1, . . . , xk} ⊂ Rn is linearly independent if for each set of scalars α1, . . . , αk,

∑_{i=1}^k αi xi = 0 ⇒ αi = 0, i = 1, . . . , k.

A set of vectors which is not linearly independent is said to be linearly dependent. Notice that a linearly independent set cannot contain the zero vector.
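For a finite set of vectors in Rn, linear independence can be checked numerically: stack the vectors as the columns of a matrix and compare its rank to the number of vectors. A small NumPy sketch (the function name is an illustrative choice):

```python
import numpy as np

def is_linearly_independent(vectors):
    """{x1,...,xk} is linearly independent iff the n x k matrix with these columns has rank k."""
    A = np.column_stack(vectors)
    return np.linalg.matrix_rank(A) == A.shape[1]

x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([1.0, 1.0, 0.0])
print(is_linearly_independent([x1, x2]))             # True
print(is_linearly_independent([x1, x2, x1 + x2]))    # False: third vector is a combination of the others
print(is_linearly_independent([x1, np.zeros(3)]))    # False: an independent set cannot contain 0
```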


Lemma 1.2.3. The following conditions are equivalent:

1) {x1, . . . , xk} is linearly independent.

2) Each x ∈ span{x1, . . . , xk} is a unique linear combination of x1, . . . , xk, that is, ∑_{i=1}^k αi xi = ∑_{i=1}^k βi xi ⇒ αi = βi, i = 1, . . . , k.

3) No element of {x1, . . . , xk} can be written as a linear combination of the others.

Proof. We prove:

(1 → 2) Suppose that x can be represented in two ways: x = ∑_{i=1}^k αi xi and x = ∑_{i=1}^k βi xi. Subtracting these equations yields ∑_{i=1}^k (αi − βi) xi = 0. Then the linear independence of {x1, . . . , xk} implies that αi = βi, i = 1, . . . , k. So the representation of x is unique.

(2 → 3) Suppose xm = ∑_{j≠m} αj xj. Then xm has two distinct representations as a linear combination of x1, . . . , xk: the one given above and xm = ∑_{j=1}^k βj xj with βj = 0 if j ≠ m and βm = 1. This contradicts 2). Hence no element of {x1, . . . , xk} can be written as a linear combination of the others.

(3 → 1) Suppose that ∑_{j=1}^k αj xj = 0. If αm ≠ 0, then we can write xm = ∑_{j≠m} βj xj with βj = −αj/αm. This violates 3). Hence αj = 0, j = 1, . . . , k.

1.2.5 Spanning sets and bases

Let U be a subspace of Rn. A finite set of vectors {x1, . . . , xk} is said to span U, or to be a spanning set for U, if U = span{x1, . . . , xk}. In this case, every x ∈ U can be written as a linear combination of the vectors xj, j = 1, . . . , k. A spanning set may be redundant in the sense that one or more elements of the set may be a linear combination of a subset of the elements.

A basis for a subspace U ⊆ Rn is a linearly independent finite set of vectors that spans U. The spanning property ensures that every vector in U can be represented as a linear combination of the basis vectors and linear independence ensures that this representation is unique. A vector space that has a basis is said to be finite dimensional.

We show that Rn is finite dimensional by exhibiting a basis. The standard basis for Rn is the set of vectors ej, j = 1, . . . , n, with

ei(j) = 1 if i = j, and 0 otherwise.

It is clear that if ∑_{i=1}^n αi ei = 0, then αi = 0, i = 1, . . . , n. Hence the set is linearly independent. It is also clear that any vector in Rn can be written as a linear combination of the ei's. Hence Rn is a finite dimensional vector space.

We show below that every subspace U ⊆ Rn has a basis, and every basis contains the same number of elements. We define the dimension of U to be the number of elements in any basis for U. The standard basis for Rn implies that every basis for Rn has n elements and hence Rn has dimension n.

Lemma 1.2.4. Every nonzero subspace U ⊆ Rn has a basis and every such basis contains the same number of vectors.

Proof. Rn is finite dimensional and has a basis with n elements. List the basis as L0 = {x1, . . . , xn}. Since U ≠ {0}, U contains a nonzero vector b1. It must hold that b1 ∈ span(L0) since L0 spans Rn. If U = span{b1}, then we are done. Otherwise, add b1 to the start of L0 to form the new ordered list L1 = {b1, x1, . . . , xn}. Then L1 is a linearly dependent set that spans Rn. Proceeding from the left, there must be a first vector in L1 that is linearly dependent on the subset of vectors that precedes it.


Figure 1.1: The coordinates of a point x with respect to a given basis {x1, . . . , xn}.

Suppose this is xp. Removing xp from L1 yields a list L′1 that still spans Rn. Since U ≠ span{b1}, there must be a nonzero vector b2 ∈ U not contained in span{b1}. If span{b1, b2} = U, we are done. Otherwise, add b2 after b1 in L′1 to obtain a new ordered list L2 = {b1, b2, x1, . . . , xp−1, xp+1, . . . , xn}. Then L2 is linearly dependent and spans Rn. Proceeding from the left, there must be a first vector in L2 that is linearly dependent on the subset of vectors that precedes it. This can't be one of the bj's since these are linearly independent. Hence we can again remove one of the remaining xj's to obtain a reduced list L′2 that spans Rn. Since span{b1, b2} ≠ U, there must be a vector b3 ∈ U such that b3 ∉ span{b1, b2}. Adding this after b2 in L′2 gives a new linearly dependent list L3 that spans Rn. In this way, we either terminate with a basis for U or continue to remove xj's from the ordered set and add bj's until all the xj's are removed. In that event, Ln = {b1, . . . , bn} is a linearly independent spanning set for Rn and U = Rn. In either case, U has a basis.

Let {x1, . . . , xk} and {y1, . . . , ym} be bases for U. First form the ordered list L0 = {x1, . . . , xk}. We will progressively add one of the yj's to the start of the list and, if possible, remove one of the xj's. To begin, set L1 = {ym, x1, . . . , xk}. By assumption, U = span{L0}. Hence ym ∈ span{L0}. It follows that L1 is linearly dependent and spans U. Proceeding from the left there must be a first vector in L1 that is linearly dependent on the subset of vectors that precedes it. Suppose this is xp. Removing xp from L1 leaves a list L′1 that still spans U. Hence ym−1 ∈ span(L′1). So adding ym−1 to L′1 in the first position gives a new linearly dependent list L2 = {ym−1, ym, x1, . . . , xp−1, xp+1, . . . , xk} that spans U. Proceeding from the left there must be a first vector in L2 that is linearly dependent on the subset of vectors that precedes it. This can't be one of the yj's since these are linearly independent. Hence we can again remove one of the remaining xj's to obtain a reduced list spanning U. Then adding ym−2 in the first place gives a new linearly dependent list L3 spanning U. In this way, we continue to remove xj's from the ordered set and add yj's. If we remove all the xj's before adding all of the yj's we obtain a contradiction, since that would imply that {y1, . . . , ym} is linearly dependent. So we must be able to add all of the yj's, and each addition removes one of the xj's. Hence m ≤ k. A symmetric argument with the roles of the two bases interchanged shows that k ≤ m. Hence m = k.

1.2.6 Coordinates

The coordinates of x ∈ Rn with respect to a basis {xj}_{j=1}^n are the unique scalars αj such that x = ∑_{j=1}^n αj xj. Every vector uniquely determines, and is uniquely determined by, its coordinates. For the standard basis, the coordinates of x ∈ Rn are simply the entries of x: x = ∑_{j=1}^n x(j) ej.

You can think of the coordinates as a way to locate (or construct) x starting from the origin. You simply go via x1 scaled by α1, then via x2 scaled by α2, and so on. This is not a unique path, since the scaled basis elements can be added in any order, but all such paths reach the same point x. This is illustrated in Fig 1.1. In this sense, coordinates are like map coordinates with the basic map navigation primitives specified by the basis elements. That also makes it clear that if we choose a different basis, then we must use a different set of coordinates to reach the same point x.
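Numerically, the coordinates of x with respect to a basis of Rn are found by solving one linear system: if the basis vectors are the columns of B, then the coordinates α satisfy Bα = x. A brief NumPy sketch (the particular basis and vector are arbitrary):

```python
import numpy as np

B = np.array([[1.0, 1.0],
              [0.0, 1.0]])        # basis {x1, x2} of R^2 as the columns of B
x = np.array([3.0, 2.0])

alpha = np.linalg.solve(B, x)     # unique coordinates since the basis matrix is invertible
print(alpha)                      # [1. 2.]  so x = 1*x1 + 2*x2
print(np.allclose(B @ alpha, x))  # reconstructing x from its coordinates
```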


1.3 Problems

1.1. Show that:

a) A linearly independent set in Rn containing n vectors is a basis for Rn.

b) A subset of Rn containing k > n vectors is linearly dependent.

c) If U is a proper subspace of Rn, then dim(U) < n.

1.4 Notes

We have given a brief outline of the algebraic structure of Rn. For a more detailed introduction see the relevant sections in: Strang, Linear Algebra and its Applications, Chapter 2.


Chapter 2

The Geometry of Rn

Rn also has a geometric structure defined by the Euclidean inner product and norm. These add the important concepts of length, distance, angle, and orthogonality. It will be convenient to discuss these concepts using the vector space (Cn, C). The concepts and definitions readily specialize to (Rn, R). In addition, the vector spaces of complex and real matrices of given fixed dimensions can also be given a Euclidean geometry.

2.1 Inner Product and Norm on Cn

The inner product of vectors x, y ∈ Cn is the scalar

<x, y> = ∑_{k=1}^n x(k) ȳ(k),

where ȳ denotes the element-wise complex conjugate of y. This can also be written in terms of a matrix product as <x, y> = x^T ȳ. The inner product satisfies the following basic properties.

Lemma 2.1.1 (Properties of the Inner Product). For x, y, z ∈ Cn and α ∈ C,

1) <x, x> ≥ 0 with equality ⇔ x = 0

2) <x, y> = \overline{<y, x>} (conjugate symmetry)

3) <αx, y> = α<x, y>

4) <x + y, z> = <x, z> + <y, z>

Proof. These claims follow from the definition of the inner product via simple algebra.

The inner product also specifies the corresponding Euclidean norm on Cn via the formula:

‖x‖ = (<x, x>)^{1/2} = (∑_{k=1}^n |x(k)|²)^{1/2}.
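A short NumPy sketch of these definitions (note that NumPy's vdot conjugates its first argument, so the sum is written out explicitly to match the convention above of conjugating the second argument); it also checks the Cauchy-Schwarz inequality stated below on a random example.

```python
import numpy as np

def inner(x, y):
    # <x, y> = sum_k x(k) * conj(y(k))   (conjugate on the second argument)
    return np.sum(x * np.conj(y))

def norm(x):
    # ||x|| = <x, x>^(1/2); the imaginary part is zero up to rounding
    return np.sqrt(inner(x, x).real)

rng = np.random.default_rng(0)
x = rng.normal(size=4) + 1j * rng.normal(size=4)
y = rng.normal(size=4) + 1j * rng.normal(size=4)

print(np.isclose(norm(x), np.linalg.norm(x)))         # agrees with NumPy's Euclidean norm
print(abs(inner(x, y)) <= norm(x) * norm(y) + 1e-12)  # Cauchy-Schwarz (Lemma 2.1.2 below)
```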

Based on the definition and properties of the inner product and the definition of the norm, we can then derive the famous Cauchy-Schwarz inequality.

Lemma 2.1.2 (Cauchy-Schwarz Inequality). For all x, y ∈ Cn, |<x, y>| ≤ ‖x‖ ‖y‖.


Figure 2.1: An illustration of the triangle inequality.

Proof. Exercise.

By Cauchy-Schwarz, for nonzero x and y we have

0 ≤ |<x, y>| / (‖x‖ ‖y‖) ≤ 1.

This allows us to define the angle θ ∈ [0, π/2] between x and y by

cos θ = |<x, y>| / (‖x‖ ‖y‖).

For vectors in Rn we can dispense with the modulus function and write

<x, y> = ‖x‖‖y‖ cos(θ),

where now θ ∈ [0, π] and cos θ = <x, y>/(‖x‖ ‖y‖). So in Rn, the inner product of unit length vectors is the cosine of the angle between them.

Finally, we have the properties of the norm function.

Lemma 2.1.3 (Properties of the Norm). For x, y ∈ Cn and α ∈ C:

1) ‖x‖ ≥ 0 with equality if and only if x = 0 (positivity).

2) ‖αx‖ = |α| ‖x‖ (scaling).

3) ‖x+ y‖ ≤ ‖x‖+ ‖y‖ (triangle inequality).

Proof. Items 1) and 2) easily follow from the definition of the norm. Item 3) can be proved using the Cauchy-Schwarz inequality and is left as an exercise.

The norm ‖x‖ measures the “length” or “size” of the vector x. Equivalently, ‖x‖ is the distance between 0 and x, and ‖x − y‖ is the distance between x and y. The triangle inequality is illustrated in Fig. 2.1. If ‖x‖ = 1, x is called a unit vector or a unit direction. The set {x : ‖x‖ = 1} of all unit vectors is called the unit sphere.

2.2 Orthogonality and Orthonormal Bases

Vectors x, y ∈ Cn are orthogonal, written x ⊥ y, if <x, y> = 0. A set of vectors {x1, . . . , xk} in Rn is orthogonal if each pair is orthogonal: xi ⊥ xj, i, j = 1, . . . , k, i ≠ j.

Theorem 2.2.1 (Pythagoras). If x1, . . . , xk is an orthogonal set, then ‖∑_{j=1}^k xj‖² = ∑_{j=1}^k ‖xj‖².


Proof. Using only the definition of the norm and properties of the inner product we have:

‖∑_{j=1}^k xj‖² = <∑_{i=1}^k xi, ∑_{j=1}^k xj> = ∑_{i=1}^k ∑_{j=1}^k <xi, xj> = ∑_{j=1}^k <xj, xj> = ∑_{j=1}^k ‖xj‖².

A set of vectors {x1, . . . , xk} in Cn is orthonormal if it is orthogonal and every vector in the set has unit norm (‖xj‖ = 1, j = 1, . . . , k).

Lemma 2.2.1. An orthonormal set is linearly independent.

Proof. Let {x1, . . . , xk} be an orthonormal set and suppose that ∑_{j=1}^k αj xj = 0. Then for each xi we have 0 = <∑_{j=1}^k αj xj, xi> = αi.

An orthonormal basis for Cn is a basis of orthonormal vectors. Since an orthonormal set is always linearly independent, any set of n orthonormal vectors is an orthonormal basis for Cn. Orthonormal bases have a particularly convenient property: it is easy to find the coordinates of any vector x with respect to such a basis. To see this, let {x1, . . . , xn} be an orthonormal basis and x = ∑_j αj xj. Then

<x, xk> = <∑_j αj xj, xk> = ∑_j αj <xj, xk> = αk.

So the coordinate of x with respect to the basis element xk is simply αk = <x, xk>, k = 1, . . . , n.
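In code this means that for an orthonormal basis no linear system needs to be solved: if the basis vectors are the columns of Q, the coordinate vector is simply Q^T x (or Q*x in the complex case). A small NumPy sketch using a random orthonormal basis obtained from a QR factorization:

```python
import numpy as np

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # columns of Q: an orthonormal basis of R^4

x = rng.normal(size=4)
alpha = Q.T @ x                   # k-th coordinate is <x, q_k>; all of them at once is Q^T x
print(np.allclose(Q @ alpha, x))  # True: x = sum_k alpha_k q_k
```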

2.3 General Inner Product Spaces

More generally, a real or complex vector space X equipped with a function <·, ·> satisfying the properties listed in Lemma 2.1.1 is called an inner product space. We give an important example below.

2.3.1 Inner Product on Cm×n

We can define an inner product on the vector space of complex matrices Cm×n by:

<A, B> = ∑_{i,j} Aij B̄ij.

This function satisfies the properties listed in Lemma 2.1.1. The corresponding norm is:

‖A‖F = (<A, A>)^{1/2} = (∑_{i,j} |Aij|²)^{1/2}.

This is frequently called the Frobenius norm, hence the special notation ‖A‖F. The following lemma gives a very useful alternative expression for <A, B>.

Lemma 2.3.1. For all A, B ∈ Cm×n, <A, B> = trace(A^T B̄) = trace(B*A).

Proof. Exercise.
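A quick numerical check of the matrix inner product, its trace form, and the Frobenius norm (a sketch; the random matrices are arbitrary and the trace is written as trace(B*A), which for this inner product agrees with the form in Lemma 2.3.1):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 2)) + 1j * rng.normal(size=(3, 2))
B = rng.normal(size=(3, 2)) + 1j * rng.normal(size=(3, 2))

ip_sum = np.sum(A * np.conj(B))            # <A, B> = sum_ij A_ij conj(B_ij)
ip_trace = np.trace(np.conj(B).T @ A)      # trace(B* A)
print(np.isclose(ip_sum, ip_trace))        # the two expressions agree

fro = np.sqrt(np.sum(np.abs(A) ** 2))      # ||A||_F from the definition
print(np.isclose(fro, np.linalg.norm(A, 'fro')))
```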


2.4 Orthogonal and Unitary Matrices

A square matrix Q ∈ Rn×n is orthogonal if Q^T Q = QQ^T = In. In this case, the columns of Q form an orthonormal basis for Rn and Q^T is the inverse of Q. We denote the set of n × n orthogonal matrices by On.

Lemma 2.4.1. If Q ∈ On, then for each x, y ∈ Rn, <Qx,Qy> = <x, y> and ‖Qx‖ = ‖x‖.

Proof. <Qx, Qy> = x^T Q^T Q y = x^T y = <x, y>, and ‖Qx‖² = <Qx, Qx> = <x, x> = ‖x‖².

Lemma 2.4.2. The set On contains the identity matrix In, and is closed under matrix multiplication and matrix inverse.

Proof. If Q, W are orthogonal, then (QW)^T (QW) = W^T Q^T Q W = In and (QW)(QW)^T = Q W W^T Q^T = In. So QW is orthogonal. If Q is orthogonal, Q⁻¹ = Q^T is orthogonal. Clearly, In is orthogonal.

Hence the set of matrices On forms a (noncommutative) group under matrix multiplication. This is called the n × n orthogonal group.

For complex matrices a slight change is required. A square complex matrix Q ∈ Cn×n is called a unitary matrix if Q*Q = QQ* = I, where Q* = Q̄^T is the conjugate transpose of Q. It is readily verified that if Q is a unitary matrix, then for each x, y ∈ Cn, <Qx, Qy> = <x, y> and ‖Qx‖ = ‖x‖. So multiplication of a vector by a unitary (or orthogonal) matrix preserves inner products, angles, norms, distances, and hence Euclidean geometry.
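A brief numerical illustration of these invariances, using a random orthogonal matrix produced by a QR factorization (a sketch; any Q ∈ On would do):

```python
import numpy as np

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))    # a random orthogonal matrix
print(np.allclose(Q.T @ Q, np.eye(5)))          # Q^T Q = I_n

x, y = rng.normal(size=5), rng.normal(size=5)
print(np.isclose((Q @ x) @ (Q @ y), x @ y))                   # inner products are preserved
print(np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))   # norms (hence distances) are preserved
```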

2.5 Problems

2.1. The mean of a vector x ∈ Rn is the scalar mx = (1/n) ∑_{i=1}^n x(i). Show that the set of all vectors in Rn with mean mx = 0 is a subspace U0 ⊂ Rn of dimension n − 1. Show that all vectors in U0 are orthogonal to 1 ∈ Rn, where 1 denotes the vector with all components equal to 1.

2.2. The correlation of x1, x2 ∈ Rn is the scalar:

ρ(x1, x2) = <x1, x2> / (‖x1‖ ‖x2‖).

For given x1, what vectors x2 maximize the correlation? What vectors x2 minimize the correlation? Show that ρ(x1, x2) ∈ [−1, 1] and is zero precisely when the vectors are orthogonal.

2.3. Use the definition of the inner product and its properties listed in Lemma 2.1.1, together with the definition of the norm, to prove the Cauchy-Schwarz Inequality (Lemma 2.1.2).

a) First let x, y ∈ Cn with ‖x‖ = ‖y‖ = 1.

1) Set x̂ = <x, y>y and rx = x − x̂. Show that <rx, y> = <rx, x̂> = 0.

2) Show that ‖rx‖² = 1 − <x, x̂>.

3) Using the previous result and the definition of x̂ show that |<x, y>| ≤ 1.

b) Prove the result when ‖x‖ ≠ 0 and ‖y‖ ≠ 0.

c) Prove the result when x or y (or both) is zero.

2.4. Prove the triangle inequality for the Euclidean norm in Cn. Expand ‖x + y‖² using the properties of the inner product, and note that 2Re(<x, y>) ≤ 2|<x, y>|.

2.5. Let X, Y be inner product spaces over the same field F with F = R or F = C. A linear isometry from X to Y is a linear function D : X → Y that preserves distances: (∀x ∈ X) ‖D(x)‖ = ‖x‖. Show that a linear isometry between inner product spaces also preserves inner products.


a) First examine ‖D(x+ y)‖2 and conclude that Re(<Dx,Dy>) = Re(<x, y>).

b) Now examine ‖D(x+ iy)‖2 where i is the imaginary unit.

2.6. Let Pn denote the set of n × n permutation matrices. Show that Pn is a (noncommutative) group under matrix multiplication. Show that every permutation matrix is an orthogonal matrix. Hence Pn is a subgroup of On.

2.7. Show that for A,B ∈ Cm×n:

a) <A, B> = trace(A^T B̄) = trace(B*A).

b) |trace(B*A)| ≤ trace(A*A)^{1/2} trace(B*B)^{1/2}.

2.8. Show that the Euclidean norm in Cn is:

a) permutation invariant: if y is a permutation of the entries of x, then ‖y‖ = ‖x‖.

b) an absolute norm: if y = |x| component-wise, then ‖y‖ = ‖x‖.

2.9. Let b1, . . . , bn be an orthonormal basis for Cn (or Rn), and set Bj,k = bj bk^T, j, k = 1, . . . , n. Show that {Bj,k = bj bk^T}_{j,k=1}^n is an orthonormal basis for Cn×n.

2.6 Notes

For a more detailed introduction to the Euclidean geometry of Rn, see the relevant sections in: Strang, Linear Algebra and its Applications, Chapter 2.


Chapter 3

Orthogonal Projection

We now consider the following fundamental problem. Given a data vector x in an inner product space X and a subspace U ⊂ X, find the closest point to x in U. This operation is a simple building block that we will use repeatedly.

3.1 Simplest Instance

The simplest instance of our problem is: given x, u ∈ Rn with ‖u‖ = 1, find the closest point to x in span{u}. This can be posed as the simple constrained optimization problem:

min_{z ∈ Rn} (1/2)‖x − z‖²  s.t. z ∈ span{u}.    (3.1)

The subspace span{u} is a line through the origin in the direction u, and we seek the point z on this line that is closest to x. So we must have z = αu for some scalar α. Hence we can equivalently solve the unconstrained optimization problem:

min_{α ∈ R} (1/2)‖x − αu‖².

Expanding the objective function in (3.1) and setting z = αu yields:

(1/2)‖x − z‖² = (1/2)<x − z, x − z> = (1/2)‖x‖² − α<u, x> + (1/2)α²‖u‖².

This is a quadratic in α with a positive coefficient on the second order term. Hence there is a unique value of α that minimizes the objective. Setting the derivative of the above expression w.r.t. α equal to zero gives the unique solution α = <u, x>. Hence the optimal solution is

x̂ = <u, x> u.

The associated error vector rx = x − x̂ is called the residual. We claim that the residual is orthogonal to u and hence to the subspace span{u}. To see this note that

<u, rx> = <u, x − x̂> = <u, x> − <u, x><u, u> = 0.

Thus x̂ is the unique orthogonal projection of x onto the line span{u}. This is illustrated in Figure 3.1. By Pythagoras, we have ‖x‖² = ‖x̂‖² + ‖rx‖².


Figure 3.1: Orthogonal projection of x onto a line through zero.

We can also write the solution using matrix notation. Noting that <u, x> = u^T x, we have

x̂ = (uu^T)x = Px
rx = (I − uu^T)x = (I − P)x.

So for fixed u, both x̂ and rx are linear functions of x. As one might expect, these linear functions have some special properties. For example, since x̂ ∈ span{u}, the projection of x̂ onto span{u} must be x̂. So we must have P² = P. We can easily check this using the formula P = uu^T: P² = (uu^T)(uu^T) = u(u^T u)u^T = uu^T = P, since u^T u = ‖u‖² = 1. In addition, we note that P = uu^T is symmetric. So P is symmetric (P^T = P) and idempotent (P² = P). A matrix with these two properties is called a projection matrix.
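The projection onto a line and its properties are easy to check numerically. A NumPy sketch (random u and x; the checks mirror the claims above):

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.normal(size=3)
u = u / np.linalg.norm(u)          # unit direction

P = np.outer(u, u)                 # P = u u^T projects onto span{u}
x = rng.normal(size=3)
x_hat = P @ x                      # orthogonal projection of x onto the line
r = x - x_hat                      # residual

print(np.isclose(u @ r, 0.0))                        # residual is orthogonal to u
print(np.allclose(P @ P, P), np.allclose(P, P.T))    # P is idempotent and symmetric
print(np.isclose(np.linalg.norm(x) ** 2,
                 np.linalg.norm(x_hat) ** 2 + np.linalg.norm(r) ** 2))  # Pythagoras
```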

3.2 Projection of x onto a Subspace U

Now let U be a subspace of Rn with an orthonormal basis {u1, . . . , uk}. For a given x ∈ Rn, we seek a point z in U that minimizes the distance to x:

min_{z ∈ Rn} (1/2)‖x − z‖²  s.t. z ∈ U.    (3.2)

Since we can uniquely write z = ∑_{j=1}^k αj uj, we can equivalently pose this as the unconstrained optimization problem:

min_{α1,...,αk} (1/2)‖x − ∑_{j=1}^k αj uj‖².

Using the definition of the norm and the properties of the inner product, we can expand the objective function to obtain:

(1/2)‖x − z‖² = (1/2)<x − z, x − z>
             = (1/2)‖x‖² − <z, x> + (1/2)‖z‖²
             = (1/2)‖x‖² − ∑_{j=1}^k αj <uj, x> + (1/2) ∑_{j=1}^k αj².

In the last line we used Pythagoras to write ‖z‖² = ∑_{j=1}^k αj². Taking the derivative with respect to αj and setting this equal to zero yields the unique solution αj = <uj, x>, j = 1, . . . , k. So the unique closest


point in U to x is:

x̂ = ∑_{j=1}^k <uj, x> uj.    (3.3)

Moreover, the residual rx = x − x̂ is orthogonal to every uj and hence to the subspace U = span{u1, . . . , uk}. To see this compute:

<uj, rx> = <uj, x − x̂> = <uj, x> − <uj, x̂> = <uj, x> − <uj, x> = 0.

Thus x̂ is the unique orthogonal projection of x onto U, and by Pythagoras, ‖x‖² = ‖x̂‖² + ‖rx‖². This is illustrated in Figure 3.2. From (3.3), notice that x̂ and the residual rx = x − x̂ are linear functions of x.

Figure 3.2: Orthogonal projection of x onto the subspace U.

We can also write these results as matrix equations. First, from (3.3) we have

x̂ = ∑_{j=1}^k uj uj^T x = (∑_{j=1}^k uj uj^T) x = Px,

with P = ∑_{j=1}^k uj uj^T. Let U ∈ Rn×k be the matrix with columns u1, . . . , uk. Then

P = ∑_{j=1}^k uj uj^T = UU^T.

Hence we can write

x̂ = UU^T x
rx = (I − UU^T) x.

This confirms that x̂ and rx are linear functions of x and that P is symmetric and idempotent.
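The same checks for projection onto a k-dimensional subspace, with the orthonormal basis stored as the columns of U (a sketch; the basis here comes from a QR factorization of a random matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 2
U, _ = np.linalg.qr(rng.normal(size=(n, k)))   # U^T U = I_k, columns span a k-dim subspace

x = rng.normal(size=n)
x_hat = U @ (U.T @ x)          # projection x_hat = U U^T x
r = x - x_hat                  # residual, which lies in the orthogonal complement

print(np.allclose(U.T @ r, 0.0))                      # <u_j, r> = 0 for every basis vector
P = U @ U.T
print(np.allclose(P @ P, P), np.allclose(P, P.T))     # projection matrix: idempotent and symmetric
```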

3.3 The Orthogonal Complement of a Subspace

The orthogonal complement of a subspace U of Rn is the subset:

U⊥ = {x ∈ Rn : (∀u ∈ U) x ⊥ u}.

So U⊥ is the set of vectors orthogonal to every vector in the subspace U. When U = span{u} we write U⊥ as simply u⊥. The set U⊥ is easily shown to be a subspace of Rn with U ∩ U⊥ = 0.


Lemma 3.3.1. U⊥ is a subspace of Rn and U ∩ U⊥ = 0.

Proof. Exercise.

For example, it is clear that {0}⊥ = Rn and (Rn)⊥ = {0}. For a unit vector u, u⊥ is the n − 1 dimensional hyperplane in Rn passing through the origin and with normal u. Given a subspace U in Rn and x ∈ Rn, the projection x̂ of x onto U lies in U and the residual rx lies in U⊥.

An important consequence of the orthogonality of U and U⊥ is that every x ∈ Rn has a unique representation of the form x = u + v with u ∈ U and v ∈ U⊥. This follows from the properties of the orthogonal projection of x onto U.

Lemma 3.3.2. Every x ∈ Rn has a unique representation in the form x = u+ v with u ∈ U and v ∈ U⊥.

Proof. By the properties of orthogonal projection we have x = x̂ + rx with x̂ ∈ U and rx ∈ U⊥. This gives one decomposition of the required form. Suppose there are two decompositions of this form: x = ui + vi, with ui ∈ U and vi ∈ U⊥, i = 1, 2. Subtracting these expressions gives (u1 − u2) = −(v1 − v2). Now u1 − u2 ∈ U and v1 − v2 ∈ U⊥, and since U ∩ U⊥ = 0 (Lemma 3.3.1), we must have u1 = u2 and v1 = v2.

It follows from Lemma 3.3.2 that U + U⊥ = Rn. This simply states that every vector in Rn is the sum of some vector in U and some vector in U⊥. Because this representation is also unique, this is sometimes written as Rn = U ⊕ U⊥ and we say that Rn is the direct sum of U and U⊥. Several additional properties of the orthogonal complement are covered in Problem 3.5.

3.4 Problems

3.1. Given x, y ∈ Rn find the closest point to x on the line through 0 in the direction of y.

3.2. Let u1, . . . , uk ∈ Rn be an ON set spanning a subspace U and let v ∈ Rn with v ∉ U. Find the point on the linear manifold M = {x : x − v ∈ U} that is closest to a given point y ∈ Rn. [Hint: transform the problem to one that you know how to solve.]

3.3. A Householder transformation on Rn is a linear transformation that reflects each point x in Rn about a given n − 1 dimensional subspace U specified by giving its unit normal u ∈ Rn. To reflect x about U we want to move it orthogonally through the subspace to the point on the opposite side that is equidistant from the subspace.

a) Given U = u⊥ = {x : uTx = 0}, find the required Householder matrix.

b) Show that a Householder matrix H is symmetric, orthogonal, and is its own inverse.

3.4. Prove Lemma 3.3.1.

3.5. Let X be a real or complex inner product space of dimension n, and U, V be subspaces of X. Prove each of the following:

a) U ⊆ V implies V⊥ ⊆ U⊥.

b) (U⊥)⊥ = U .

c) (U + V)⊥ = U⊥ ∩ V⊥

d) (U ∩ V)⊥ = U⊥ + V⊥

e) If dim(U) = k, then dim(U⊥) = n− k


Chapter 4

Principal Component Analysis

Given a set of data {xj ∈ Rn}_{j=1}^p, and an integer k < n, we ask if there is a k-dimensional subspace onto which we can project the data so that the sum of the squared norms of the residuals is minimized. It turns out that for every 1 ≤ k < n, there is indeed a subspace that minimizes this metric. If the resulting approximation is reasonably accurate, then the original data lies approximately on a k-dimensional subspace in Rn. Hence, at the cost of small approximation error, we gain the benefit of reducing the dimension of the data to k. This is one form of linear dimensionality reduction.

There is also a connection with another way of thinking about the data: how the data is spread out about its sample mean. Directions in which the data does not have significant variation could be eliminated, allowing the data to be represented in a lower dimensional subspace. This leads to a core method of dimensionality reduction known as Principal Component Analysis (PCA). It selects a subspace onto which to project the data that maximizes the captured variance of the original data. It turns out that this subspace minimizes the sum of squared norms of the resulting residuals.

4.1 Preliminaries

We will need a few properties of symmetric matrices. Recall that a matrix S ∈ Rn×n is symmetric if S^T = S. The eigenvalues and eigenvectors of real symmetric matrices have some special properties.

Lemma 4.1.1. A symmetric matrix S ∈ Rn×n has n real eigenvalues and n real orthonormal eigenvectors.

Proof. Let Sx = λx with x ≠ 0. Taking complex conjugates, S x̄ = S̄ x̄ = λ̄ x̄ since S is real. Hence x̄^T S x = λ ‖x‖² and, using the symmetry of S, x̄^T S x = (S x̄)^T x = λ̄ ‖x‖². Subtracting these expressions and using x ≠ 0 yields λ = λ̄. Thus λ is real. It follows that x can be selected in Rn.

We prove the second claim under the simplifying assumption that S has n distinct eigenvalues. Let Sx1 = λ1x1 and Sx2 = λ2x2 with λ1 ≠ λ2. Then x2^T S x1 = λ1 x2^T x1 and, since S is symmetric, x2^T S x1 = (S x2)^T x1 = λ2 x2^T x1. Subtracting these expressions and using λ1 ≠ λ2 yields x2^T x1 = 0. Thus x1 ⊥ x2. For a proof without our simplifying assumption, see Theorem 2.5.6 in Horn and Johnson.

A matrix P ∈ Rn×n with the property that for all x ∈ Rn, x^T P x ≥ 0 is said to be positive semidefinite. Similarly, P is positive definite if for all x ≠ 0, x^T P x > 0. Without loss of generality, we will always assume P is symmetric. If not, P can be replaced by the symmetric matrix Q = (1/2)(P + P^T) since x^T Q x = (1/2) x^T (P + P^T) x = x^T P x.

Here is a fundamental property of such matrices.


Lemma 4.1.2. If P ∈ Rn×n is symmetric and positive semidefinite (resp. positive definite), then all the eigenvalues of P are real and nonnegative (resp. positive) and the eigenvectors of P can be selected to be real and orthonormal.

Proof. Since P is symmetric all of its eigenvalues are real and it has a set of n real ON eigenvectors. Let x be an eigenvector with eigenvalue λ. Then x^T P x = x^T λ x = λ‖x‖² ≥ 0. Hence λ ≥ 0. If P is PD and x ≠ 0, then x^T P x > 0. Hence λ‖x‖² > 0 and thus λ > 0.

4.2 Centering Data

The sample mean of a set of data {xj ∈ Rn}_{j=1}^p is the vector µ = (1/p) ∑_{j=1}^p xj. By subtracting µ from each xj, forming yj = xj − µ, we translate the data vectors so that the new sample mean is zero:

(1/p) ∑_{j=1}^p yj = (1/p) ∑_{j=1}^p (xj − µ) = µ − µ = 0.

This is called centering the data. We can also express the centering operation in matrix form as follows. Form the data into the matrix X = [x1, . . . , xp] ∈ Rn×p. Then µ = (1/p) X 1, where 1 ∈ Rp denotes the vector of all 1's. Let Y denote the corresponding matrix of centered data and u = (1/√p) 1. Then

Y = X − µ1^T = X − (1/p) X 1 1^T = X(I − (1/p) 1 1^T) = X(I − uu^T).

From this point forward we assume that the data has been centered.
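Centering is a one-line operation on the data matrix; the sketch below also confirms that the matrix form X(I − (1/p)11^T) gives the same result (random data, arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 4, 10
X = rng.normal(size=(n, p)) + 3.0            # data matrix, columns are the data points x_j

mu = X.mean(axis=1, keepdims=True)           # sample mean, shape (n, 1)
Y = X - mu                                   # centered data, Y = X - mu 1^T

one = np.ones((p, 1))
Y2 = X @ (np.eye(p) - (1.0 / p) * (one @ one.T))   # equivalent matrix form

print(np.allclose(Y, Y2))                    # the two forms agree
print(np.allclose(Y.mean(axis=1), 0.0))      # centered data has zero sample mean
```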

4.3 Parameterizing the Family of k-Dimensional Subspaces

A subspace U ⊆ Rn of dimension k ≤ n can be represented by an orthonormal basis for U. However, this representation is not unique since there are infinitely many orthonormal bases for U. Any such basis contains k vectors and these can be arranged into the columns of an n×k matrix U = [u1, . . . , uk] ∈ Rn×k with U^T U = Ik.

Let U1, U2 ∈ Rn×k be two orthonormal bases for the same k-dimensional subspace U. Since U1 is a basis for U and every column of U2 lies in U, there must exist a matrix Q ∈ Rk×k such that U2 = U1Q. It follows that Q = U1^T U2. Using U1 U1^T U2 = U2 and U2 U2^T U1 = U1, we then have

Q^T Q = U2^T U1 U1^T U2 = U2^T U2 = Ik, and
Q Q^T = U1^T U2 U2^T U1 = U1^T U1 = Ik.

Hence Q ∈ Ok. So any two orthonormal basis representations U1, U2 of U are related by a k×k orthogonal matrix Q: U2 = U1Q and U1 = U2Q^T.

4.4 An Optimal Projection Subspace

We seek a k-dimensional subspace U such that the orthogonal projection of the data onto U minimizes the sum of squared norms of the residuals. Assuming such a subspace exists, we call it an optimal projection subspace of dimension k.


Let the columns of U ∈ Rn×k be an orthonormal basis for a subspace U. Then the matrix of projected data is X̂ = UU^T X and the corresponding matrix of residuals is X − UU^T X. Hence we seek to solve:

min_{U ∈ Rn×k} ‖X − UU^T X‖F²  s.t. U^T U = Ik.    (4.1)

The solution of this problem can't be unique since if U is a solution so is UQ for every Q ∈ Ok. These solutions correspond to different parameterizations of the same subspace. In addition, it is of interest to determine if two distinct subspaces could both be optimal projection subspaces of dimension k.

Using standard equalities, the objective function of Problem 4.1 can be rewritten as

‖X − UU^T X‖F² = trace(X^T (I − UU^T)(I − UU^T) X) = trace(XX^T) − trace(U^T XX^T U).

Hence, letting P ∈ Rn×n denote the symmetric positive semidefinite matrix XX^T, we can equivalently solve the following problem:

max_{U ∈ Rn×k} trace(U^T P U)  s.t. U^T U = Ik.    (4.2)

Problem 4.2 is a well known problem. For the simplest version with k = 1, we have the following standard result.

Theorem 4.4.1 (Horn and Johnson, 4.2.2). Let P ∈ Rn×n be a symmetric positive semidefinite matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn. The problem

max_{u ∈ Rn} u^T P u  s.t. u^T u = 1    (4.3)

has the optimal value λ1. This is achieved if and only if u is a unit norm eigenvector of P for the eigenvalue λ1.

Proof. We want to maximize x^T P x subject to x^T x = 1. Bring in a Lagrange multiplier µ and form the Lagrangian L(x, µ) = x^T P x + µ(1 − x^T x). Taking the derivative of this expression with respect to x and setting this equal to zero yields Px = µx. Hence µ must be an eigenvalue of P with x a corresponding eigenvector normalized so that x^T x = 1. For such x, x^T P x = µ x^T x = µ. Hence the maximum achievable value of the objective is λ1 and this is achieved when u is a corresponding unit norm eigenvector of P. Conversely, if u is any unit norm eigenvector of P for λ1, then u^T P u = λ1 and hence u is a solution.

Theorem 4.4.1 can be generalized as follows. However, the proof uses results we have not covered yet.

Theorem 4.4.2 (Horn and Johnson, 4.3.18). Let P ∈ Rn×n be a symmetric positive semidefinite matrix with eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λn. The problem

max_{U ∈ Rn×k} trace(U^T P U)  s.t. U^T U = Ik    (4.4)

has the optimal value ∑_{j=1}^k λj. Moreover, this is achieved if the columns of U are k orthonormal eigenvectors for the largest k eigenvalues λ1, . . . , λk of P.


Proof. Let P = V Σ V^T be an eigen-decomposition of P with V ∈ On and Σ ∈ Rn×n a diagonal matrix with the eigenvalues λ1 ≥ · · · ≥ λn listed in decreasing order down the diagonal. We want to maximize trace(U^T V Σ V^T U) = trace(Σ V^T U U^T V) = trace(Σ W W^T), where W = V^T U ∈ On×k. This is equivalent to maximizing <Σ, WW^T> by choice of W ∈ On×k, then setting U = VW.

The maximization of <Σ, WW^T> can be solved as follows. Let Z ∈ On×(n−k) be an orthonormal basis for R(W)⊥. Then the matrices Σ and WW^T have the following singular value decompositions

Σ = [Ik 0; 0 In−k] Σ [Ik 0; 0 In−k]^T  and  WW^T = [W Z] [Ik 0; 0 0] [W Z]^T.

It is a standard result that if we are free to select the left and right singular vectors of a matrix B, then the inner product <A, B> is maximized when the left and right singular vectors of B are chosen to equal the left and right singular vectors of A, respectively. Hence selecting W = [Ik 0]^T maximizes the inner product <Σ, WW^T>. This gives U = VW = Vk, where Vk is the matrix of the first k columns of V, and results in the optimal objective value ∑_{j=1}^k λj.

It follows from Theorem 4.4.2 that a solution U⋆ to Problem 4.2 is obtained by selecting the columns of U⋆ to be a set of orthonormal eigenvectors of P = XX^T corresponding to its k largest eigenvalues. Working backwards, we see that U⋆ is then also a solution to Problem 4.1. In both cases, there is nothing special about U⋆ beyond the fact that it spans U⋆ = R(U⋆). Any basis of the form U⋆Q with Q ∈ Ok spans the same optimal subspace U⋆.

We also note that U⋆ may not be unique. To see this, consider the situation when λk = λk+1. When this holds, the selection of a k-th eigenvector in U⋆ is not unique.

In summary, a solution to Problem 4.1 can be obtained as follows. Find the k largest eigenvalues of XX^T and a corresponding set of orthonormal eigenvectors U⋆. Then over all k dimensional subspaces, U⋆ = R(U⋆) minimizes the sum of the squared norms of the projection residuals. By projecting each xj to x̂j = U⋆(U⋆)^T xj we obtain a representation of the data as points on U⋆. Moreover, if we now represent x̂j by its coordinates yj = (U⋆)^T xj with respect to the orthonormal basis U⋆, then we have linearly mapped the data into k-dimensional space.
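The recipe just summarized is short in code: form XX^T, take the eigenvectors for the k largest eigenvalues, and project. A NumPy sketch on synthetic data that lies close to a 2-dimensional subspace (the variable names and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, k = 5, 200, 2
basis = rng.normal(size=(n, k))                       # hidden 2-dim subspace
X = basis @ rng.normal(size=(k, p)) + 0.01 * rng.normal(size=(n, p))
X = X - X.mean(axis=1, keepdims=True)                 # center the data

evals, evecs = np.linalg.eigh(X @ X.T)                # eigh returns eigenvalues in ascending order
U_star = evecs[:, ::-1][:, :k]                        # eigenvectors for the k largest eigenvalues

X_hat = U_star @ (U_star.T @ X)                       # projected data (still in R^n)
Y = U_star.T @ X                                      # k-dimensional coordinates of the data

print(np.linalg.norm(X - X_hat, 'fro') ** 2)          # sum of squared residual norms: small
```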

4.5 An Alternative Viewpoint

We now consider an alternative way to view the same problem. This will give some additional insights into the solution we have derived.

4.5.1 The Sample Covariance of the Data

The data points xj are “spread out” around the sample mean µ. In the case of scalars, to measure the spread around µ we form the sample variance (1/p) ∑_{j=1}^p (xj − µ)². However, for vectors the situation is more complicated since variation about the mean can also depend on direction.

We will continue with our assumption that the data has zero sample mean. Hence we examine how the data is spread around the vector 0. Select a unit norm vector u ∈ Rn and project xj onto the line through 0 in the direction u. This yields x̂j = uu^T xj, j = 1, . . . , p. Since the direction is fixed to be u, the projected data is effectively specified by the set of scalars u^T xj. This set of scalars also has zero sample mean: ∑_{j=1}^p u^T xj = u^T ∑_{j=1}^p xj = 0.


So the spread of the data in direction u can be quantified by the scalar sample variance

σ²(u) = (1/p) ∑_{j=1}^p (u^T xj)² = (1/p) ∑_{j=1}^p (u^T xj)(u^T xj)^T = u^T ((1/p) ∑_{j=1}^p xj xj^T) u.    (4.5)

This expresses the variance of the data as a function of the direction u. Let

R = (1/p) ∑_{j=1}^p xj xj^T.    (4.6)

R is called the sample covariance matrix of the (centered) data. The product xj xj^T is a real n×n symmetric matrix formed by the outer product of the j-th data point with itself. The sample covariance is the mean of these matrices and hence is also a real n × n symmetric matrix. More generally, if the data is not centered but has sample mean µ, then the sample covariance is

R = (1/p) ∑_{j=1}^p (xj − µ)(xj − µ)^T.    (4.7)

Lemma 4.5.1. The sample covariance matrix R is symmetric positive semidefinite.

Proof. R is clearly symmetric. Positive semidefiniteness follows by noting that for any x ∈ Rn,

x^T R x = x^T ((1/p) ∑_{j=1}^p xj xj^T) x = (1/p) ∑_{j=1}^p (x^T xj)(xj^T x) = (1/p) ∑_{j=1}^p (xj^T x)² ≥ 0.

4.5.2 Directions of Maximum Variance

Using R and (4.5) we can concisely express the variance of the data in direction u as

σ²(u) = u^T R u.    (4.8)

Hence the direction u in which the data has maximum sample variance is given by the solution of the problem:

arg max_{u ∈ Rn} u^T R u  s.t. u^T u = 1    (4.9)

with R a symmetric positive semidefinite matrix. This is Problem 4.3. By Theorem 4.4.1, the data has maximum variance σ1² in the direction v1, where σ1² ≥ 0 is the largest eigenvalue of R and v1 is a corresponding unit norm eigenvector of R.

We must take care if we want to find two directions with the largest variance. Without any constraint, the second direction can come arbitrarily close to v1 and variance σ1². One way to prevent this is to constrain the second direction to be orthogonal to the first. Then if we want a third direction, constrain it to be orthogonal to the two previous directions, and so on. In this case, for k orthogonal directions we want to find U = [u1, . . . , uk] ∈ On,k to maximize ∑_{j=1}^k uj^T R uj = trace(U^T R U). Hence we want to solve Problem 4.4 with P = R. By Theorem 4.4.2, the solution is attained by taking the k directions to be unit norm eigenvectors v1, . . . , vk for the largest k eigenvalues of R.


By this means we obtain n orthonormal directions of maximum (sample) variance in the data. These directions v1, v2, . . . , vn and the corresponding variances σ1² ≥ σ2² ≥ · · · ≥ σn² are eigenvectors and corresponding eigenvalues of R: R vj = σj² vj, j = 1, . . . , n. The vectors vj are called the principal components of the data, and this decomposition is called Principal Component Analysis (PCA). Let V be the matrix with the vj as its columns, and Σ² = diag(σ1², . . . , σn²) (note σ1² ≥ σ2² ≥ · · · ≥ σn²). Then PCA is an ordered eigen-decomposition of the sample covariance matrix: R = V Σ² V^T.

There is a clear connection between PCA and finding a subspace that minimizes the sum of squared norms of the residuals. We can see this by writing

R = (1/p) ∑_{j=1}^p xj xj^T = (1/p) XX^T.

So the sample covariance is just a scalar multiple of the matrix XX^T. This means that the principal components are just the eigenvectors of XX^T listed in order of decreasing eigenvalues. In particular, the first k principal components are the first k eigenvectors (ordered by eigenvalue) of XX^T. This is exactly the orthonormal basis U⋆ that defines an optimal k-dimensional projection subspace U⋆. So the leading k principal components give a particular orthonormal basis for an optimal k-dimensional projection subspace.

A direction in which the data has small variance relative to σ1² may not be an important direction: after all, the data stays close to the mean in this direction. If one accepts this hypothesis, then the directions of largest variance are the important directions; they capture most of the variability in the data. This suggests that we could select k < rank(R) and project the data onto the k directions of largest variance. Let Vk = [v1, v2, . . . , vk]. Then the projection onto the span of the columns of Vk is x̂j = Vk(Vk^T xj). The term yj = Vk^T xj gives the coordinates of x̂j with respect to Vk. Then the product Vk yj synthesizes x̂j using these coefficients to form the appropriate linear combination of the columns of Vk.

Here is a critical observation: since the directions are fixed and known, we don't need to form x̂j. Instead we can simply map xj to the coordinate vector yj ∈ Rk. No information is lost in working with yj instead of x̂j since the latter is an invertible linear function of the former. Hence {yj}_{j=1}^p gives a new set of data that captures most of the variation in the original data, and lies in a reduced dimension space (k ≤ rank(R) ≤ n).

The natural next question is how to select k. Clearly this involves a tradeoff between the size of k and the amount of variation in the original data that is captured in the projection. The “variance” captured by the projection is ν² = ∑_{j=1}^k σj² and the “variance” in the residual is ρ² = ∑_{j=k+1}^n σj². Reducing k reduces ν² and increases ρ². The selection of k thus involves determining how much of the total variance in X needs to be captured in order to successfully use the projected data to complete the analysis or decision task at hand. For example, if the projected data is to be used to learn a classifier, then one needs to select the value of k that yields acceptable (or perhaps best) classifier performance. This could be done using cross-validation.
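A simple way to examine this tradeoff numerically is to compute the eigenvalues σj² of the sample covariance and the fraction of the total variance captured by the first k of them. The sketch below picks the smallest k capturing at least 95% of the variance; the 95% threshold is only a common heuristic, not a rule from these notes.

```python
import numpy as np

def explained_variance(X):
    """Eigenvalues sigma_j^2 of R = (1/p) X X^T in decreasing order (X assumed centered)."""
    p = X.shape[1]
    evals = np.linalg.eigvalsh((X @ X.T) / p)[::-1]
    return np.clip(evals, 0.0, None)           # clip tiny negative values due to rounding

rng = np.random.default_rng(8)
X = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 300)) + 0.05 * rng.normal(size=(8, 300))
X = X - X.mean(axis=1, keepdims=True)

var = explained_variance(X)
frac = np.cumsum(var) / np.sum(var)            # fraction of total variance captured by the first k
k = int(np.searchsorted(frac, 0.95)) + 1       # smallest k capturing at least 95% of the variance
print(np.round(frac, 3), k)
```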

4.6 Problems

4.1. Let X ∈ Rn×p. Show that the set of nonzero eigenvalues of XX^T is the same as the set of nonzero eigenvalues of X^T X.


Chapter 5

Singular Value Decomposition

5.1 Overview

We now discuss in detail a very useful matrix factorization called the singular value decomposition (SVD). The SVD extends the idea of eigen-decomposition of square matrices to non-square matrices. It is useful in general, but has specific application in data analysis, dimensionality reduction (PCA), low rank matrix approximation, and some forms of regression.

5.2 Preliminaries

Recall that for A ∈ Rm×n, the range of A is the subspace of Rm defined by R(A) = {y : y = Ax, some x ∈ Rn} ⊆ Rm. So the range of A is the set of all vectors that can be formed as a linear combination of the columns of A. The nullspace of A is the subspace of Rn defined by N(A) = {x : Ax = 0}. This is the set of all vectors that are mapped to the zero vector in Rm by A.

The following fundamental result from linear algebra will be very useful.

Theorem 5.2.1. Let A ∈ Rm×n have nullspace N (A) and rangeR(A). Then N (A)⊥ = R(AT ).

Proof. Let x ∈ N(A). Then Ax = 0 and x^T A^T = 0. So for every y ∈ Rm, x^T (A^T y) = 0. Thus x ∈ R(A^T)⊥. This shows that N(A) ⊆ R(A^T)⊥. Now for all subspaces: (a) (U⊥)⊥ = U, and (b) U ⊆ V implies V⊥ ⊆ U⊥ (see Problem 3.5). Applying these properties yields R(A^T) ⊆ N(A)⊥.

Conversely, suppose x ∈ R(A^T)⊥. Then for all y ∈ Rm, x^T A^T y = 0. Hence for all y ∈ Rm, y^T A x = 0. This implies Ax = 0 and hence that x ∈ N(A). Thus R(A^T)⊥ ⊆ N(A) and N(A)⊥ ⊆ R(A^T).

We have shown R(A^T) ⊆ N(A)⊥ and N(A)⊥ ⊆ R(A^T). Thus N(A)⊥ = R(A^T).

The rank of A is the dimension r of the range of A. Clearly this equals the number of linearly independent columns in A. The rank r is also the number of linearly independent rows of A. Thus r ≤ min(m, n). The matrix A is said to be full rank if r = min(m, n).

An m×n rank one matrix has the form yx^T where y ∈ Rm and x ∈ Rn are both nonzero. Notice that for all w ∈ Rn, (yx^T)w = y(x^T w) is a scalar multiple of y. Moreover, by suitable choice of w we can make this scalar any real value. So R(yx^T) = span{y} and the rank of yx^T is one.

5.2.1 Induced Norm of a Matrix

The Euclidean norm in Rp is also known as the 2-norm and is often denoted by ‖ · ‖2. We will henceforth adopt this notation.


The gain of a matrix A ∈ Rm×n when acting on a unit norm vector x ∈ Rn is given by the norm of the vector Ax. This measures the change in the vector magnitude resulting from the application of A. More generally, for x ≠ 0, define the gain by G(A, x) = ‖Ax‖2/‖x‖2, where in the numerator the norm is in Rm, and in the denominator it is in Rn. The maximum gain of A over all x ∈ Rn is then:

G(A) = max_{x ≠ 0} ‖Ax‖2 / ‖x‖2.

This is called the induced matrix 2-norm of A, and is denoted by ‖A‖2. It is induced by the Euclidean norms on Rn and Rm. From the definition of the induced norm we see that

‖A‖2² = max_{‖x‖2 = 1} ‖Ax‖2² = max_{‖x‖2 = 1} x^T (A^T A) x.

Since A^T A is real, symmetric and positive semidefinite, the solution of this problem is to select x to be a unit norm eigenvector of A^T A for the largest eigenvalue. So

‖A‖2 = √(λmax(A^T A)).    (5.1)

Because of this connection with eigenvalues, the induced matrix 2-norm is sometimes also called the spectral norm.

It is easy to check that the induced norm is indeed a norm. It also has the following additional properties.

Lemma 5.2.1. Let A,B be matrices of appropriate size and x ∈ Rn. Then

1) ‖Ax‖2 ≤ ‖A‖2‖x‖2;

2) ‖AB‖2 ≤ ‖A‖2 ‖B‖2.

Proof. Exercise.

Important: The induced matrix 2-norm and the matrix Euclidean norm are distinct norms on Rm×n. Recall, the Euclidean norm on Rm×n is called the Frobenius norm and is denoted by ‖A‖F.

5.3 Singular Value Decomposition

We first present the main SVD result in what is called the compact form. We then give interpretations of the SVD and indicate an alternative version known as the full SVD. After these discussions, we turn our attention to the ideas and constructions that form the foundation of the SVD.

Theorem 5.3.1 (Singular Value Decomposition). Let A ∈ Rm×n have rank r ≤ min{m, n}. Then there exist U ∈ Rm×r with U^T U = Ir, V ∈ Rn×r with V^T V = Ir, and a diagonal matrix Σ ∈ Rr×r with diagonal entries σ1 ≥ σ2 ≥ · · · ≥ σr > 0, such that

A = UΣV^T = ∑_{j=1}^r σj uj vj^T.

The positive scalars σj are called the singular values of A. The r orthonormal columns of U are called the left or output singular vectors of A, and the r orthonormal columns of V are called the right or input singular vectors of A. The conditions U^T U = Ir and V^T V = Ir indicate that U and V have orthonormal columns. But in general, since U and V need not be square matrices, UU^T ≠ Im and VV^T ≠ In (in general U and V are not orthogonal matrices). Notice also that the theorem does not claim that U and V are unique. We discuss this issue later in the chapter. The decomposition is illustrated in Fig. 5.1.


Figure 5.1: A visualization of the matrices in the compact SVD.

Figure 5.2: A visualization of the three operational steps in the compact SVD. The projection of x ∈ Rn onto N(A)⊥ is represented in terms of the basis v1, v2: here x̂ = α1v1 + α2v2. These coordinates are scaled by the singular values. Then the scaled coordinates are transferred to the output space Rm and used to form the result y = Ax as the linear combination y = σ1α1u1 + σ2α2u2.

Lemma 5.3.1. The matrices U and V in the compact SVD have the following additional properties:

a) The columns of U form an orthonormal basis for the range of A.

b) The columns of V form an orthonormal basis for N (A)⊥.

Proof. a) You can see this by writing Ax = U(ΣV^T x) and noting that Σ is invertible and V^T vj = ej, where ej is the j-th standard basis vector for Rr. So the range of ΣV^T is Rr. It follows that R(U) = R(A). b) By taking transposes and using part a), the columns of V form an ON basis for the range of A^T. Then using N(A)⊥ = R(A^T) yields that the columns of V form an orthonormal basis for N(A)⊥.

The above observations lead to the following operational interpretation of the SVD. For x ∈ Rn, the operation V^T x gives the coordinates with respect to V of the orthogonal projection of x onto the subspace N(A)⊥. (The orthogonal projection is x̂ = VV^T x.) These r coordinates are then individually scaled using the r diagonal entries of Σ. Finally, we synthesize the output vector by using the scaled coordinates and the ON basis U for R(A): y = U(ΣV^T x). So the SVD has three steps: (1) an analysis step: V^T x, (2) a scaling step: Σ(V^T x), and (3) a synthesis step: U(ΣV^T x). In particular, when x = vk, y = Ax = σk uk, k = 1, . . . , r. So the r ON basis vectors for N(A)⊥ are mapped to scaled versions of corresponding ON basis vectors for R(A). This is illustrated in Fig. 5.2.
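In NumPy the compact SVD is obtained by truncating the reduced SVD at the numerical rank; the sketch below also walks through the three operational steps (analyze, scale, synthesize) for a vector x. The rank tolerance is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.normal(size=(5, 3)) @ rng.normal(size=(3, 4))    # a 5 x 4 matrix of rank 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)         # reduced SVD
r = int(np.sum(s > 1e-10 * s[0]))                        # numerical rank
Uc, Sc, Vc = U[:, :r], s[:r], Vt[:r, :].T                # compact factors: A = Uc diag(Sc) Vc^T
print(np.allclose(A, Uc @ np.diag(Sc) @ Vc.T))           # reconstruction

x = rng.normal(size=4)
coords = Vc.T @ x                # (1) analysis: coordinates of the projection of x onto N(A)-perp
y = Uc @ (Sc * coords)           # (2) scaling by the singular values, (3) synthesis in R(A)
print(np.allclose(y, A @ x))     # the three steps reproduce Ax
```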

5.3.1 Singular Values and Norms

Recall that the induced matrix 2-norm of A is the maximum gain of A (§5.2.1). We show below that this is related to the singular values of the matrix A. First note that

‖Ax‖2² = ‖UΣV^T x‖2² = x^T (V Σ² V^T) x.


Figure 5.3: A visualization of the action of A on the unit sphere in Rn in terms of its SVD.

Figure 5.4: A visualization of the matrices in the full SVD.

To maximize this expression we select x to be a unit norm eigenvector of V Σ² V^T with maximum eigenvalue. Hence we use x = v1 and achieve ‖Ax‖2² = σ1². So the input direction with the most gain is v1, this appears in the output in the direction u1, and the gain is σ1: Av1 = σ1u1. Hence

‖A‖2 = σ1.    (5.2)

So the induced 2-norm of A is given by the maximum singular value of A.

We can also express the Frobenius norm of a matrix in terms of its singular values. To see this, let A ∈ Rm×n have rank r and write the compact SVD of A in the form A = ∑_{j=1}^r σj uj vj^T. In this form we see that the SVD expresses A as a positive linear combination of the rank one matrices uj vj^T, j = 1, . . . , r. These matrices form an orthonormal set in Rm×n:

<uk vk^T, uj vj^T> = trace((uk vk^T)^T (uj vj^T)) = trace(uk^T uj vj^T vk) = 0 if j ≠ k, and 1 if j = k.

So the SVD is selecting an orthonormal basis of rank one matrices {uj vj^T}_{j=1}^r specifically adapted to A, and expressing A as a positive linear combination of this basis.

With these insights, we can apply Pythagoras' Theorem to the expression ‖A‖F² = ‖∑_{j=1}^r σj uj vj^T‖F² to obtain:

‖A‖F = (∑_{j=1}^r σj²)^{1/2}.    (5.3)

So the Frobenius norm of A is the Euclidean norm of the singular values of A.
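Both norm identities are easy to confirm numerically from the singular values alone (a sketch with a random matrix):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.normal(size=(6, 4))
s = np.linalg.svd(A, compute_uv=False)    # singular values in decreasing order

print(np.isclose(np.linalg.norm(A, 2), s[0]))                         # ||A||_2 = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s ** 2))))  # ||A||_F = (sum_j sigma_j^2)^(1/2)
```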


5.3.2 The Full SVD

There is a second version of the SVD that is often convenient in various proofs involving the SVD. Often this second version is just called the SVD. However, to emphasize its distinctness from the equally useful compact SVD, we refer to it as a full SVD.

The basic idea is very simple. Let A = Uc Σc Vc^T be a compact SVD with Uc ∈ Rm×r, Vc ∈ Rn×r, and Σc ∈ Rr×r. To Uc we add an orthonormal basis for R(Uc)⊥ to form the orthogonal matrix U = [Uc Uc⊥] ∈ Rm×m. Similarly, to Vc we add an orthonormal basis for R(Vc)⊥ to form the orthogonal matrix V = [Vc Vc⊥] ∈ Rn×n. To ensure that these extra columns in U and V do not interfere with the factorization of A, we form Σ ∈ Rm×n by padding Σc with zero entries:

Σ = [ Σc            0_{r×(n−r)}
      0_{(m−r)×r}   0_{(m−r)×(n−r)} ].

We then have a full SVD factorization A = UΣV^T. The utility of the full SVD derives from U and V being orthogonal (hence invertible) matrices. The full SVD is illustrated in Fig. 5.4.
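NumPy returns the full SVD when full_matrices=True; the sketch below builds the padded Σ explicitly and checks that U and V are orthogonal and that A = UΣV^T.

```python
import numpy as np

rng = np.random.default_rng(11)
m, n = 5, 3
A = rng.normal(size=(m, n))

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U is m x m, Vt is n x n
Sigma = np.zeros((m, n))                          # pad the r x r diagonal block with zeros
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(U @ U.T, np.eye(m)))            # U is orthogonal
print(np.allclose(Vt.T @ Vt, np.eye(n)))          # V is orthogonal
print(np.allclose(A, U @ Sigma @ Vt))             # A = U Sigma V^T
```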

If P is a symmetric positive definite matrix, a full SVD of P is simply an eigen-decomposition of P: UΣV^T = QΣQ^T, where Q is the orthogonal matrix of eigenvectors of P. In this sense, the SVD extends the eigen-decomposition by using different orthonormal sets of vectors in the input and output spaces.

5.4 Inner Workings of the SVD

We now give a quick overview of where the matrices U, V and Σ of the SVD come from. Let A ∈ Rm×n have rank r. So the range of A has dimension r and the nullspace of A has dimension n − r.

Let B = A^T A ∈ Rn×n. Since B is a symmetric positive semi-definite (PSD) matrix, it has non-negative eigenvalues and a full set of orthonormal eigenvectors. Order the eigenvalues in decreasing order: σ1² ≥ σ2² ≥ · · · ≥ σn² ≥ 0 and let vj denote the eigenvector for σj². So

B vj = σj² vj, j = 1, . . . , n.

Noting that Ax = 0 if and only if Bx = 0, we see that the null space of B also has dimension n − r. It follows that n − r of the eigenvectors of B must lie in N(A) and r must lie in N(A)⊥. Hence

σ1² ≥ · · · ≥ σr² > 0 and σ_{r+1}² = · · · = σn² = 0,

with v1, . . . , vr an orthonormal basis for N(A)⊥.

Now consider C = AA^T ∈ Rm×m. This is also symmetric and PSD. Hence C has nonnegative eigenvalues and a full set of orthonormal eigenvectors. Order the eigenvalues in decreasing order: λ1² ≥ λ2² ≥ · · · ≥ λm² ≥ 0 and let uj denote the eigenvector for λj². So

C uj = λj² uj, j = 1, . . . , m.

Since R(A^T) = N(A)⊥, the dimension of R(A^T) is r, and that of N(A^T) is m − r. By the same reasoning as above, m − r of the eigenvectors of C must lie in N(A^T) and r must lie in R(A). Hence

λ1² ≥ · · · ≥ λr² > 0 and λ_{r+1}² = · · · = λm² = 0,

with u1, . . . , ur an orthonormal basis for R(A).


Now we show a relationship between σj², λj² and the corresponding eigenvectors vj, uj, for j = 1, . . . , r. First consider B vj = σj² vj with σj² > 0. Then

C(Avj) = (AA^T)(Avj) = A(A^T A vj) = A(B vj) = σj² (Avj).

So either Avj = 0, or Avj is an eigenvector of C with eigenvalue σj². The first case, Avj = 0, contradicts A^T A vj = σj² vj with σj² > 0 since Avj = 0 implies (A^T A)vj = 0. Hence Avj must be an eigenvector of C with eigenvalue σj². Assume for simplicity that the positive eigenvalues of A^T A and AA^T are distinct. Then for some k, with 1 ≤ k ≤ r:

σj² = λk² and Avj = α uk, with α > 0.

We can take α > 0 by swapping −uk for uk if necessary. Using this result we find

vj^T B vj = σj² vj^T vj = σj², and vj^T B vj = (Avj)^T (Avj) = α² uk^T uk = α².

So we must have α = σj and Avj = σj uk.

Now do the same analysis for C uk = λk² uk with λk² > 0. This yields

B(A^T uk) = (A^T A)(A^T uk) = A^T(AA^T uk) = λk² (A^T uk).

Since λk² > 0, we can't have A^T uk = 0. So A^T uk is an eigenvector of A^T A with eigenvalue λk². Under the assumption of distinct nonzero eigenvalues, this implies that for some p with 1 ≤ p ≤ r,

λk² = σp² and A^T uk = β vp, for some β ≠ 0.

Using this expression to evaluate uk^T C uk we find λk² = β². Hence β² = λk² = σp² and A^T uk = β vp.

We now have two ways to evaluate A^T A vj:

A^T A vj = σj² vj (by definition), and A^T A vj = A^T(A vj) = α A^T uk = αβ vp (using the above analysis).

Equating these answers gives j = p and αβ = σj². Since α > 0, it follows that β > 0 and α = σj = λj = β. Thus Avj = σj uj, j = 1, . . . , r. Written in matrix form this is almost the compact SVD:

A [v1 · · · vr] = [u1 · · · ur] diag(σ1, . . . , σr).

From this we deduce that A V V^T = U Σ V^T. V V^T computes the orthogonal projection of x onto N(A)⊥. Hence for every x ∈ Rn, A V V^T x = Ax. Thus A V V^T = A, and we have A = U Σ V^T.

Finally note that σj = √(λj(A^T A)) = √(λj(AA^T)), j = 1, . . . , r. So the singular values are always unique. If the singular values are distinct, the SVD is unique up to sign interchanges between the uj and vj. But this still leaves the representation A = ∑_{j=1}^r σj uj vj^T unique. If the singular values are not distinct, then U and V are not unique. For example, In = U In U^T for every orthogonal matrix U.

5.5 Problems

5.1. Let A ∈ Rn×n be a square invertible matrix with SVD A = ∑_{j=1}^n σj uj vj^T. Show that A⁻¹ = ∑_{j=1}^n (1/σj) vj uj^T.
