Lecture 21: SVD, Latent Semantic Indexing, and Dimensional Reduction
Shang-Hua Teng
Singular Value Decomposition

A = σ1 u1 v1ᵀ + σ2 u2 v2ᵀ + … + σr ur vrᵀ

where
• u1 … ur are the r orthonormal vectors that form a basis of C(A), and
• v1 … vr are the r orthonormal vectors that form a basis of C(Aᵀ)
The Singular Value Decomposition

Reduced form:

A    =    U      Σ      Vᵀ
m×n      m×r    r×r    r×n

Full form (Σ padded with zero blocks):

A    =    U      Σ      Vᵀ
m×n      m×m    m×n    n×n
The Singular Value Reduction

A    =    U      Σ      Vᵀ
m×n      m×r    r×r    r×n

Ak   =    Uk     Σk     Vkᵀ
m×n      m×k    k×k    k×n
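A minimal sketch of this reduction in NumPy; the matrix A here is an arbitrary illustrative example, not one from the lecture:

```python
import numpy as np

# Example matrix (arbitrary values, chosen only for illustration)
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Thin SVD: U is m x r, s holds the singular values, Vt is r x n
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k reduction: keep the first k columns of U, the k largest
# singular values, and the first k rows of V^T
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(A_k.shape)                   # same shape as A: (2, 3)
print(np.linalg.matrix_rank(A_k))  # at most k
```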
Distance between Two Matrices

• Frobenius norm of a matrix A:

  ‖A‖F = sqrt( Σi=1..m Σj=1..n aij² )

• Distance between two matrices A and B:

  ‖A − B‖F = sqrt( Σi,j (aij − bij)² )
Approximation Theorem
• [Schmidt 1907; Eckart and Young 1936]

Among all m × n matrices B of rank at most k, Ak is the one that minimizes ‖A − B‖F:

  ‖A − Ak‖F = min over rank(B) ≤ k of ‖A − B‖F = sqrt( σk+1² + … + σr² )
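The theorem can be checked numerically; a sketch using a random matrix in NumPy (the matrix and the competing rank-k matrix B are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius distance to the best rank-k approximation equals the
# root-sum-of-squares of the discarded singular values sigma_{k+1}..sigma_r
err = np.linalg.norm(A - A_k, 'fro')
tail = np.sqrt(np.sum(s[k:] ** 2))
print(np.isclose(err, tail))  # True

# Any other rank-k matrix B does at least as badly
B = U[:, :k] @ np.diag(0.9 * s[:k]) @ Vt[:k, :]
print(err <= np.linalg.norm(A - B, 'fro'))  # True
```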
Application: Image Compression
• Uncompressed m by n pixel image: m×n numbers
• Rank-k approximation of image:
  – k singular values
  – the first k columns of U (m-vectors)
  – the first k columns of V (n-vectors)
  – Total: k × (m + n + 1) numbers
Example: Yogi (Uncompressed)
• Source: [Will]
• Yogi: rock photographed by the Sojourner Mars mission
• 256 × 264 grayscale bitmap → 256 × 264 matrix M
• Pixel values in [0, 1]
• 256 × 264 = 67,584 numbers
Example: Yogi (Compressed)
• M has 256 singular values
• Rank-81 approximation of M:
• 81 × (256 + 264 + 1) = 42,201 numbers
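The storage counts above can be checked directly; a rank-k approximation stores k singular values plus the first k columns of U (length m) and of V (length n):

```python
# Storage cost of a rank-k SVD approximation of an m x n image:
# k singular values + k columns of U + k columns of V
def compressed_size(m, n, k):
    return k * (m + n + 1)

m, n, k = 256, 264, 81   # the Yogi bitmap and rank from the slides
uncompressed = m * n
compressed = compressed_size(m, n, k)
print(uncompressed)  # 67584
print(compressed)    # 42201
```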
Eigenface
• Patented by MIT
• Uses two-dimensional, global grayscale images
• Each face is mapped to a vector of numbers
• Creates an image subspace (face space) that best discriminates between faces
• Works only on properly lit, frontal images
• Latent Semantic Analysis (LSA)
• Latent Semantic Indexing (LSI)
• Principal Component Analysis (PCA)
Term-Document Matrix
• Index each document (by human or by computer)
  – fij: counts, frequencies, weights, etc.

            doc 1   doc 2   …   doc n
  term 1  [  f11     f12    …    f1n  ]
  term 2  [  f21     f22    …    f2n  ]
    ⋮
  term m  [  fm1     fm2    …    fmn  ]
• Each document can be regarded as a point in m dimensions
Document-Term Matrix
• Index each document (by human or by computer)
  – fij: counts, frequencies, weights, etc.

           term 1  term 2   …   term n
  doc 1  [  f11     f12     …    f1n  ]
  doc 2  [  f21     f22     …    f2n  ]
    ⋮
  doc m  [  fm1     fm2     …    fmn  ]

• Each document can be regarded as a point in n dimensions
            c1  c2  c3  c4  c5  m1  m2  m3  m4
  human      1   0   0   1   0   0   0   0   0
  interface  1   0   1   0   0   0   0   0   0
  computer   1   1   0   0   0   0   0   0   0
  user       0   1   1   0   1   0   0   0   0
  system     0   1   1   2   0   0   0   0   0
  response   0   1   0   0   1   0   0   0   0
  time       0   1   0   0   1   0   0   0   0
  EPS        0   0   1   1   0   0   0   0   0
  survey     0   1   0   0   0   0   0   0   1
  trees      0   0   0   0   0   1   1   1   0
  graph      0   0   0   0   0   0   1   1   1
  minors     0   0   0   0   0   0   0   1   1
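A sketch of computing k = 2 LSI coordinates for this term-document matrix with NumPy; scaling the coordinates by the singular values is one common convention, not the only one:

```python
import numpy as np

# The term-document matrix above: rows are terms (human ... minors),
# columns are documents c1..c5, m1..m4
A = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

# Each term: first k values of its row (scaled by the singular values);
# each document: first k values of its column
term_coords = U[:, :k] * s[:k]
doc_coords = (Vt[:k, :] * s[:k, None]).T

print(term_coords.shape)  # (12, 2)
print(doc_coords.shape)   # (9, 2)
```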
LSI
• Using k = 2, each term and each document gets two coordinates, which can be plotted against LSI Factor 1 and LSI Factor 2.
[Figure: 2-D LSI plot; documents about "differential equations" and about "applications & algorithms" cluster along different factors.]
• Each term's coordinates are the first k values of its row of the term matrix T.
• Each document's coordinates are the first k values of its column of the document matrix D.
Positive Definite Matrices and Quadratic Shapes
• For any m × n matrix A, all eigenvalues of AAᵀ and AᵀA are non-negative.
• Symmetric matrices that have positive eigenvalues are called positive definite.
• Symmetric matrices that have non-negative eigenvalues are called positive semi-definite.
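A quick numerical check of the first claim; the random matrix is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))

# A^T A is symmetric, so eigvalsh applies; its eigenvalues are the
# squared singular values of A (padded with zeros), hence non-negative
eigvals = np.linalg.eigvalsh(A.T @ A)

print(eigvals.min() >= -1e-10)  # True (up to floating-point error)
```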