Singular Value Decomposition in Text Mining
Ram AkellaUniversity of California Berkeley Silicon Valley Center/SC
Lecture 4bFebruary 9, 2011
Class Outline
- Summary of last lecture
- Indexing
- Vector Space Models
- Matrix Decompositions
- Latent Semantic Analysis
- Mechanics
- Example
Summary of previous class
- Principal Component Analysis
- Singular Value Decomposition
- Uses
- Mechanics
- Example: swap rates
Introduction

How can we retrieve information using a search engine?

We can represent the query and the documents as vectors (the vector space model). To construct these vectors, however, we must first perform some preliminary document preparation. Documents are then retrieved by finding the document vectors closest to the query vector. Which distance measure is the most suitable for retrieving documents?
Search engine
Document File Preparation
Manual indexing:
- Relationships and concepts between topics can be established
- It is expensive and time consuming
- It may not be reproducible if the index is destroyed
- The huge amount of information suggests a more automated system
Document File Preparation
Automatic indexing. To build an automatic index, we need to perform two steps:
- Document analysis: decide what information or parts of the document should be indexed
- Token analysis: decide which words should be used in order to obtain the best representation of the semantic content of the documents
Document Normalization

After this preliminary analysis we need to perform further preprocessing of the data:
- Remove stop words: function words such as "a", "an", "as", "for", "in", "of", "the", and other very frequent words
- Stemming: group morphological variants, e.g. plurals ("streets" -> "street") and adverbs ("fully" -> "full")

Current stemming algorithms can make mistakes: both "police" and "policy" are reduced to "polic".
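As an illustration only, a toy stop-word filter and suffix-stripping stemmer might look like the sketch below. The stop-word list and suffix rules are made up for the example; this is not the Porter algorithm, but it reproduces both the intended normalization and the over-stemming pitfall mentioned above.

```python
# Toy normalization: stop-word removal plus naive suffix stripping.
# The suffix list is illustrative, NOT a real stemming rule set.
STOP_WORDS = {"a", "an", "as", "for", "in", "of", "the"}

def naive_stem(word):
    # Strip one of a few common suffixes; real stemmers apply ordered rules.
    for suffix in ("ies", "es", "e", "s", "y"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(tokens):
    return [naive_stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]

print(normalize(["the", "streets", "of", "London"]))  # -> ['street', 'london']
print(naive_stem("fully"))                            # -> full
print(naive_stem("police"), naive_stem("policy"))     # both -> polic (over-stemming)
```

Note how the same crude rules that correctly map "streets" to "street" also conflate "police" and "policy", exactly the failure mode the slide describes.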
File Structures

Once we have eliminated the stop words and applied the stemmer to the documents, we can construct the document file: we extract the terms that should be used in the index and assign a number to each document.
File Structures: Dictionary
We will construct a searchable dictionary of terms by arranging them alphabetically and indicating the frequency of each term in the collection
Term Global Frequency
banana 1
cranb 2
Hanna 2
hunger 1
manna 1
meat 1
potato 1
query 1
rye 2
sourdough 1
spiritual 1
wheat 2
File Structures: Inverted List

For each term, we list the documents that contain it together with the position of the term in each document.
Term (Doc, Position)
banana (5,7)
cranb (4,5); (6,4)
Hanna (1,7); (8,2)
hunger (9,4)
manna (2,6)
meat (7,6)
potato (4,3)
query (3,8)
rye (3,3);(6,3)
sourdough (5,5)
spiritual (7,5)
wheat (3,5);(6,6)
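An inverted list of this kind can be built in a few lines. The two-document corpus below is hypothetical, used only to show the (document, position) bookkeeping:

```python
from collections import defaultdict

# Minimal inverted-index sketch: for each term we store
# (document number, 1-based position of the term in that document).
docs = {
    1: "Hanna bakes rye bread each morning",  # made-up toy documents
    2: "manna fell from the sky",
}

inverted = defaultdict(list)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split(), start=1):
        inverted[token].append((doc_id, pos))

print(inverted["rye"])    # -> [(1, 3)]
print(inverted["manna"])  # -> [(2, 1)]
```

In practice the tokens would first pass through the stop-word and stemming steps described earlier.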
Vector Space Model

The vector space model can be used to represent the terms and documents of a text collection. A collection of n documents indexed by m terms can be represented as an m × n matrix, where the rows correspond to terms and the columns to documents. Once we construct the matrix, we can normalize its columns in order to have unit vectors.
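A minimal sketch of building and column-normalizing such a term-document matrix, with made-up counts:

```python
import numpy as np

# m x n term-document matrix: rows = terms, columns = documents.
# The frequencies below are invented for illustration.
A = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [1., 1., 0.]])

norms = np.linalg.norm(A, axis=0)   # one 2-norm per document column
A_unit = A / norms                  # broadcasting divides each column by its norm

print(np.linalg.norm(A_unit, axis=0))  # -> [1. 1. 1.]
```

After normalization every document vector has unit length, so dot products between columns are directly the cosines used in query matching.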
Query Matching

To retrieve a document we should:
- transform the query into a vector
- look for the document vectors most similar to the query vector

One of the most common similarity measures is the cosine between vectors, defined as

cos θ_j = (a_jᵀ q) / (‖a_j‖₂ ‖q‖₂)

where a_j is a document vector and q is the query vector.
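A sketch of cosine-based query matching, using a made-up term-document matrix and query:

```python
import numpy as np

def cosine_scores(A, q):
    """cos(theta_j) = a_j . q / (||a_j|| ||q||) for every document column a_j."""
    return (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Invented 3-term, 3-document matrix and a query mentioning terms 1 and 2.
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.]])
q = np.array([1., 1., 0.])

scores = cosine_scores(A, q)
best = int(np.argmax(scores))
print(scores)                 # -> [1.  0.5 0.5]
print("best document:", best) # -> best document: 0
```

Document 0 matches the query exactly (same direction), so its cosine is 1; a retrieval threshold would then decide which of the other documents to return.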
Example:

Using the book titles, we want to retrieve books about "Child Proofing". The query vector is

q = (0, 1, 0, 0, 0, 1, 0, 0)ᵀ

Computing the cosines against the document vectors gives

cos θ₂ = cos θ₃ = 0.4082
cos θ₅ = cos θ₆ = 0.500

With a threshold of 0.5, the 5th and the 6th books would be retrieved.
Term Weighting

In order to improve retrieval, we can give some terms more weight than others. Each matrix entry combines a local term weight and a global term weight. A simple local weight is the binary function

χ(f) = 1 if f > 0, 0 if f = 0

and a common ingredient of global weights (e.g. entropy weighting) is the proportion

p_ij = f_ij / Σ_j f_ij

where f_ij is the frequency of term i in document j.
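A sketch combining the binary local weight χ with one standard global weight. Here idf is used as the global weight for simplicity; the p_ij proportion above feeds the entropy global weight instead, which is another common choice:

```python
import numpy as np

# Made-up term frequencies: 3 terms x 3 documents.
A = np.array([[2., 0., 1.],
              [1., 1., 1.],
              [0., 3., 0.]])

local = (A > 0).astype(float)   # binary local weight chi(f_ij)
n_docs = A.shape[1]
df = (A > 0).sum(axis=1)        # number of documents containing each term
idf = np.log(n_docs / df)       # global weight: rarer terms score higher
W = local * idf[:, None]        # weighted term-document matrix

print(W)
```

Note that term 2, which occurs in every document, gets a zero row: globally common terms carry no discriminating power, which is exactly the point of global weighting.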
Synonymy and Polysemy

Doc 1: auto, engine, bonnet, tyres, lorry, boot
Doc 2: car, emissions, hood, make, model, trunk
Doc 3: make, hidden, Markov, model, emissions, normalize

Synonymy: Docs 1 and 2 describe the same topic with different vocabulary; they will have a small cosine but are related.

Polysemy: Docs 2 and 3 share words ("make", "model", "emissions") used with different meanings; they will have a large cosine but are not truly related.
Matrix Decompositions

To produce a reduced-rank approximation of the document matrix, we first need to identify the dependence between columns (documents) and rows (terms). Two useful tools are:
- QR factorization
- SVD decomposition
QR Factorization

The document matrix A can be decomposed as

A = QR

where Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix. This factorization can be used to determine the basis vectors of any matrix A, and hence to describe the semantic content of the corresponding text collection.
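With NumPy this factorization can be computed directly; the matrix below is made up for illustration:

```python
import numpy as np

# QR factorization of a small (invented) 4 x 3 term-document matrix.
# mode="complete" returns Q as m x m and R as m x n, matching the slide.
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

Q, R = np.linalg.qr(A, mode="complete")

print(Q.shape, R.shape)                  # -> (4, 4) (4, 3)
print(np.allclose(A, Q @ R))             # -> True: A is recovered exactly
print(np.allclose(Q.T @ Q, np.eye(4)))   # -> True: Q is orthogonal
```

The first r columns of Q (r = rank of A) form an orthonormal basis for the column space of A, which is the "semantic" subspace spanned by the documents.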
Example

[The slide shows the numerical orthogonal factor Q for a sample term-document matrix A; the entries are garbled in this transcript and are omitted.]
Example

[The slide shows the corresponding upper triangular factor R; the entries are garbled in this transcript and are omitted.]
Query Matching

We can rewrite the cosine distance using this decomposition. Since a_j = Q r_j and Q preserves 2-norms,

cos θ_j = (a_jᵀ q) / (‖a_j‖₂ ‖q‖₂) = ((Q r_j)ᵀ q) / (‖Q r_j‖₂ ‖q‖₂) = (r_jᵀ (Qᵀ q)) / (‖r_j‖₂ ‖q‖₂)

where r_j refers to column j of the matrix R.
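A quick numerical check of this identity on a made-up matrix, using NumPy's (reduced) QR:

```python
import numpy as np

# Because a_j = Q r_j and Q has orthonormal columns (so it preserves 2-norms),
# cosines computed from (r_j, Q^T q) equal the ones computed from (a_j, q).
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.]])
q = np.array([1., 0., 1.])

Q, R = np.linalg.qr(A)
direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
via_qr = (R.T @ (Q.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

print(np.allclose(direct, via_qr))  # -> True
```

The practical gain is that Qᵀq is computed once per query, after which each document costs only a (sparse, triangular) inner product with r_j.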
Singular Value Decomposition (SVD)

This decomposition provides reduced-rank approximations in the column and row spaces of the document matrix. It is defined as

A = U Σ Vᵀ

where U is m × m, Σ is m × n diagonal, and V is n × n. The columns of U are orthogonal eigenvectors of AAᵀ, the columns of V are orthogonal eigenvectors of AᵀA, and the singular values σ₁ ≥ … ≥ σ_r on the diagonal of Σ are the square roots of the nonzero eigenvalues of AᵀA (equivalently, of AAᵀ).
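A sketch with NumPy, verifying the reconstruction and the eigenvalue relation on a made-up matrix:

```python
import numpy as np

# SVD of an invented 4 x 3 matrix, plus the link between singular values
# and the eigenvalues of A^T A.
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(sigma) @ Vt))   # -> True

eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]      # eigenvalues, descending
print(np.allclose(sigma, np.sqrt(eigvals)))      # -> True: sigma_i = sqrt(lambda_i)
```

`full_matrices=False` returns the "thin" factors, which suffice for reconstruction; the complete m × m and n × n factors of the definition differ only by extra orthonormal columns spanning the null spaces.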
Latent Semantic Analysis (LSA)

LSA is the application of the SVD in text mining: we decompose the term-document matrix A into the three matrices U, Σ, and V. With terms as rows and documents as columns of A, the rows of U correspond to terms and the rows of V (columns of Vᵀ) correspond to documents.
Latent Semantic Analysis
Once we have decomposed the document matrix A, we can reduce its rank to account for synonymy and polysemy in the retrieval of documents: select the singular vectors associated with the k largest singular values σ in each matrix and reconstruct the matrix A from them.
Query Matching

The cosines between the query vector q and the n document vectors of the rank-k approximation A_k = U_k Σ_k V_kᵀ can be represented as

cos θ_j = ((A_k e_j)ᵀ q) / (‖A_k e_j‖₂ ‖q‖₂) = ((U_k Σ_k V_kᵀ e_j)ᵀ q) / (‖U_k Σ_k V_kᵀ e_j‖₂ ‖q‖₂)

where e_j is the canonical vector of dimension n. Defining s_j = Σ_k V_kᵀ e_j, and using the fact that U_k has orthonormal columns, this formula simplifies to

cos θ_j = (s_jᵀ (U_kᵀ q)) / (‖s_j‖₂ ‖q‖₂),   j = 1, …, n
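A numerical sketch of this simplification, with a made-up matrix and query:

```python
import numpy as np

# With s_j = Sigma_k V_k^T e_j, the cosine between q and column j of the
# rank-k matrix A_k equals s_j^T (U_k^T q) / (||s_j|| ||q||).
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
q = np.array([1., 0., 0., 1.])
k = 2

U, sigma, Vt = np.linalg.svd(A)
Uk, Sk, Vtk = U[:, :k], np.diag(sigma[:k]), Vt[:k, :]

S = Sk @ Vtk                              # column j of S is s_j
via_factors = (S.T @ (Uk.T @ q)) / (np.linalg.norm(S, axis=0) * np.linalg.norm(q))

Ak = Uk @ Sk @ Vtk                        # explicit rank-k matrix, for comparison
direct = (Ak.T @ q) / (np.linalg.norm(Ak, axis=0) * np.linalg.norm(q))

print(np.allclose(via_factors, direct))   # -> True
```

As with the QR version, U_kᵀq is computed once per query, and each document then costs only a k-dimensional inner product with s_j.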
Example

Apply the LSA method to the following technical memo titles:

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement

m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Example

First we construct the term-document matrix:

           c1 c2 c3 c4 c5 m1 m2 m3 m4
human       1  0  0  1  0  0  0  0  0
interface   1  0  1  0  0  0  0  0  0
computer    1  1  0  0  0  0  0  0  0
user        0  1  1  0  1  0  0  0  0
system      0  1  1  2  0  0  0  0  0
response    0  1  0  0  1  0  0  0  0
time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
survey      0  1  0  0  0  0  0  0  1
trees       0  0  0  0  0  1  1  1  0
graph       0  0  0  0  0  0  1  1  1
minors      0  0  0  0  0  0  0  1  1
Example

The resulting decomposition is the following:

{U} =
human      0.22 -0.11  0.29 -0.41 -0.11 -0.34  0.52 -0.06 -0.41
interface  0.20 -0.07  0.14 -0.55  0.28  0.50 -0.07 -0.01 -0.11
computer   0.24  0.04 -0.16 -0.59 -0.11 -0.25 -0.30  0.06  0.49
user       0.40  0.06 -0.34  0.10  0.33  0.38  0.00  0.00  0.01
system     0.64 -0.17  0.36  0.33 -0.16 -0.21 -0.17  0.03  0.27
response   0.27  0.11 -0.43  0.07  0.08 -0.17  0.28 -0.02 -0.05
time       0.27  0.11 -0.43  0.07  0.08 -0.17  0.28 -0.02 -0.05
EPS        0.30 -0.14  0.33  0.19  0.11  0.27  0.03 -0.02 -0.17
survey     0.21  0.27 -0.18 -0.03 -0.54  0.08 -0.47 -0.04 -0.58
trees      0.01  0.49  0.23  0.03  0.59 -0.39 -0.29  0.25 -0.23
graph      0.04  0.62  0.22  0.00 -0.07  0.11  0.16 -0.68  0.23
minors     0.03  0.45  0.14 -0.01 -0.30  0.28  0.34  0.68  0.18
Example

{Σ} = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)
Example

{V} = (shown transposed: rows are singular dimensions, columns the documents c1 … m4)

 0.20  0.61  0.46  0.54  0.28  0.00  0.01  0.02  0.08
-0.06  0.17 -0.13 -0.23  0.11  0.19  0.44  0.62  0.53
 0.11 -0.50  0.21  0.57 -0.51  0.10  0.19  0.25  0.08
-0.95 -0.03  0.04  0.27  0.15  0.02  0.02  0.01 -0.03
 0.05 -0.21  0.38 -0.21  0.33  0.39  0.35  0.15 -0.60
-0.08 -0.26  0.72 -0.37  0.03 -0.30 -0.21  0.00  0.36
 0.18 -0.43 -0.24  0.26  0.67 -0.34 -0.15  0.25  0.04
-0.01  0.05  0.01 -0.02 -0.06  0.45 -0.76  0.45 -0.07
-0.06  0.24  0.02 -0.08 -0.26 -0.62  0.02  0.52 -0.45
Example

We will perform a rank-2 reconstruction:
- Select the first two vectors in each matrix and set the remaining singular values to zero
- Reconstruct the document matrix
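The two steps above can be sketched with NumPy on the example's term-document matrix:

```python
import numpy as np

# Rank-2 reconstruction: keep the two largest singular values, drop the rest.
# Rows: human, interface, computer, user, system, response, time, EPS,
#       survey, trees, graph, minors. Columns: c1..c5, m1..m4.
A = np.array([
    [1,0,0,1,0,0,0,0,0],
    [1,0,1,0,0,0,0,0,0],
    [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0],
    [0,1,1,2,0,0,0,0,0],
    [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0],
    [0,0,1,1,0,0,0,0,0],
    [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0],
    [0,0,0,0,0,0,1,1,1],
    [0,0,0,0,0,0,0,1,1],
], dtype=float)

U, sigma, Vt = np.linalg.svd(A)
k = 2
A2 = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# "user" (row 3) now has positive weight in c4 (column 3), a document it
# never occurs in but which contains "human".
print(round(A2[3, 3], 2))  # ~0.70
```

This reproduces the reconstructed matrix shown on the next slide (up to rounding).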
Example

           c1    c2    c3    c4    c5    m1    m2    m3    m4
human 0.16 0.40 0.38 0.47 0.18 -0.05 -0.12 -0.16 -0.09
interface 0.14 0.37 0.33 0.40 0.16 -0.03 -0.07 -0.10 -0.04
computer 0.15 0.51 0.36 0.41 0.24 0.02 0.06 0.09 0.12
user 0.26 0.84 0.61 0.70 0.39 0.03 0.08 0.12 0.19
system 0.45 1.23 1.05 1.27 0.56 -0.07 -0.15 -0.21 -0.05
response 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22
time 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22
EPS 0.22 0.55 0.51 0.63 0.24 -0.07 -0.14 -0.20 -0.11
survey 0.10 0.53 0.23 0.21 0.27 0.14 0.31 0.44 0.42
trees -0.06 0.23 -0.14 -0.27 0.14 0.24 0.55 0.77 0.66
graph -0.06 0.34 -0.15 -0.30 0.20 0.31 0.69 0.98 0.85
minors -0.04 0.25 -0.10 -0.21 0.15 0.22 0.50 0.71 0.62
After the reconstruction, the word "user" now has weight in the documents where the word "human" appears, even though the two terms never co-occur in the original titles: the reduced-rank representation has captured their latent association.