Singular Value Decomposition in Text Mining
Ram AkellaUniversity of California Berkeley Silicon Valley Center/SC
Lecture 4bFebruary 9, 2011
Class Outline
- Summary of last lecture
- Indexing
- Vector Space Models
- Matrix Decompositions
- Latent Semantic Analysis
- Mechanics
- Example
Summary of previous class
- Principal Component Analysis
- Singular Value Decomposition
- Uses
- Mechanics
- Example: swap rates
Introduction

How can we retrieve information using a search engine?

We can represent the query and the documents as vectors (the vector space model). To construct these vectors, however, we must first perform some preliminary document preparation. Documents are then retrieved by finding the document vectors closest to the query vector. Which distance measure is the most suitable for retrieving documents?
Search engine
Document File Preparation
Manual indexing:
- Relationships and concepts between topics can be established
- It is expensive and time consuming
- It may not be reproducible if the index is destroyed
- The huge amount of information suggests a more automated system
Document File Preparation
Automatic indexing. To build an automatic index, we need to perform two steps:
- Document analysis: decide what information or parts of the document should be indexed
- Token analysis: decide which words should be used in order to obtain the best representation of the semantic content of the documents
Document Normalization

After this preliminary analysis we need to perform further preprocessing of the data:
- Remove stop words: function words such as "a", "an", "as", "for", "in", "of", "the", and other very frequent words
- Stemming: group morphological variants, e.g. plurals ("streets" -> "street") and adverbs ("fully" -> "full")

Current stemming algorithms can make mistakes: both "police" and "policy" are reduced to "polic".
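As an illustration only, a toy stop-word filter and suffix-stripping stemmer might look like the sketch below. The stop-word list and suffix rules are made up for the example; this is not the Porter algorithm, but it reproduces both the intended normalization and the over-stemming pitfall mentioned above.

```python
# Toy normalization: stop-word removal plus naive suffix stripping.
# The suffix list is illustrative, NOT a real stemming rule set.
STOP_WORDS = {"a", "an", "as", "for", "in", "of", "the"}

def naive_stem(word):
    # Strip one of a few common suffixes; real stemmers apply ordered rules.
    for suffix in ("ies", "es", "e", "s", "y"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(tokens):
    return [naive_stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]

print(normalize(["the", "streets", "of", "London"]))  # -> ['street', 'london']
print(naive_stem("fully"))                            # -> full
print(naive_stem("police"), naive_stem("policy"))     # both -> polic (over-stemming)
```

Note how the same crude rules that correctly map "streets" to "street" also conflate "police" and "policy", exactly the failure mode the slide describes.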
File Structures

Once we have eliminated the stop words and applied the stemmer to the documents, we can construct the document file: we extract the terms that should be used in the index and assign a number to each document.
File Structures: Dictionary
We will construct a searchable dictionary of terms by arranging them alphabetically and indicating the frequency of each term in the collection
Term Global Frequency
banana 1
cranb 2
Hanna 2
hunger 1
manna 1
meat 1
potato 1
query 1
rye 2
sourdough 1
spiritual 1
wheat 2
File Structures: Inverted List

For each term, we list the documents that contain it together with the position of the term in each document.
Term (Doc, Position)
banana (5,7)
cranb (4,5); (6,4)
Hanna (1,7); (8,2)
hunger (9,4)
manna (2,6)
meat (7,6)
potato (4,3)
query (3,8)
rye (3,3);(6,3)
sourdough (5,5)
spiritual (7,5)
wheat (3,5);(6,6)
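An inverted list of this kind can be built in a few lines. The two-document corpus below is hypothetical, used only to show the (document, position) bookkeeping:

```python
from collections import defaultdict

# Minimal inverted-index sketch: for each term we store
# (document number, 1-based position of the term in that document).
docs = {
    1: "Hanna bakes rye bread each morning",  # made-up toy documents
    2: "manna fell from the sky",
}

inverted = defaultdict(list)
for doc_id, text in docs.items():
    for pos, token in enumerate(text.lower().split(), start=1):
        inverted[token].append((doc_id, pos))

print(inverted["rye"])    # -> [(1, 3)]
print(inverted["manna"])  # -> [(2, 1)]
```

In practice the tokens would first pass through the stop-word and stemming steps described earlier.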
Vector Space Model

The vector space model can be used to represent the terms and documents of a text collection. A collection of n documents indexed by m terms can be represented as an m × n matrix, where the rows correspond to terms and the columns to documents. Once we construct the matrix, we can normalize its columns in order to have unit vectors.
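A minimal sketch of building and column-normalizing such a term-document matrix, with made-up counts:

```python
import numpy as np

# m x n term-document matrix: rows = terms, columns = documents.
# The frequencies below are invented for illustration.
A = np.array([[1., 0., 2.],
              [0., 1., 1.],
              [1., 1., 0.]])

norms = np.linalg.norm(A, axis=0)   # one 2-norm per document column
A_unit = A / norms                  # broadcasting divides each column by its norm

print(np.linalg.norm(A_unit, axis=0))  # -> [1. 1. 1.]
```

After normalization every document vector has unit length, so dot products between columns are directly the cosines used in query matching.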
Query Matching

To retrieve a document we should:
- transform the query into a vector
- look for the document vectors most similar to the query vector

One of the most common similarity measures is the cosine between vectors, defined as

cos θ_j = (a_jᵀ q) / (‖a_j‖₂ ‖q‖₂)

where a_j is a document vector and q is the query vector.
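A sketch of cosine-based query matching, using a made-up term-document matrix and query:

```python
import numpy as np

def cosine_scores(A, q):
    """cos(theta_j) = a_j . q / (||a_j|| ||q||) for every document column a_j."""
    return (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Invented 3-term, 3-document matrix and a query mentioning terms 1 and 2.
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.]])
q = np.array([1., 1., 0.])

scores = cosine_scores(A, q)
best = int(np.argmax(scores))
print(scores)                 # -> [1.  0.5 0.5]
print("best document:", best) # -> best document: 0
```

Document 0 matches the query exactly (same direction), so its cosine is 1; a retrieval threshold would then decide which of the other documents to return.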
Example:

Using the book titles, we want to retrieve books about "Child Proofing". The query vector is

q = (0, 1, 0, 0, 0, 1, 0, 0)ᵀ

Computing the cosines against the document vectors gives

cos θ₂ = cos θ₃ = 0.4082
cos θ₅ = cos θ₆ = 0.500

With a threshold of 0.5, the 5th and the 6th books would be retrieved.
Term Weighting

In order to improve retrieval, we can give some terms more weight than others. Each matrix entry combines a local term weight and a global term weight. A simple local weight is the binary function

χ(f) = 1 if f > 0, 0 if f = 0

and a common ingredient of global weights (e.g. entropy weighting) is the proportion

p_ij = f_ij / Σ_j f_ij

where f_ij is the frequency of term i in document j.
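A sketch combining the binary local weight χ with one standard global weight. Here idf is used as the global weight for simplicity; the p_ij proportion above feeds the entropy global weight instead, which is another common choice:

```python
import numpy as np

# Made-up term frequencies: 3 terms x 3 documents.
A = np.array([[2., 0., 1.],
              [1., 1., 1.],
              [0., 3., 0.]])

local = (A > 0).astype(float)   # binary local weight chi(f_ij)
n_docs = A.shape[1]
df = (A > 0).sum(axis=1)        # number of documents containing each term
idf = np.log(n_docs / df)       # global weight: rarer terms score higher
W = local * idf[:, None]        # weighted term-document matrix

print(W)
```

Note that term 2, which occurs in every document, gets a zero row: globally common terms carry no discriminating power, which is exactly the point of global weighting.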
Synonymy and Polysemy

Doc 1: auto, engine, bonnet, tyres, lorry, boot
Doc 2: car, emissions, hood, make, model, trunk
Doc 3: make, hidden, Markov, model, emissions, normalize

Synonymy: Docs 1 and 2 describe the same topic with different vocabulary; they will have a small cosine but are related.

Polysemy: Docs 2 and 3 share words ("make", "model", "emissions") used with different meanings; they will have a large cosine but are not truly related.
Matrix Decompositions

To produce a reduced-rank approximation of the document matrix, we first need to identify the dependence between columns (documents) and rows (terms). Two useful tools are:
- QR factorization
- SVD decomposition
QR Factorization

The document matrix A can be decomposed as

A = QR

where Q is an m × m orthogonal matrix and R is an m × n upper triangular matrix. This factorization can be used to determine the basis vectors of any matrix A, and hence to describe the semantic content of the corresponding text collection.
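With NumPy this factorization can be computed directly; the matrix below is made up for illustration:

```python
import numpy as np

# QR factorization of a small (invented) 4 x 3 term-document matrix.
# mode="complete" returns Q as m x m and R as m x n, matching the slide.
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

Q, R = np.linalg.qr(A, mode="complete")

print(Q.shape, R.shape)                  # -> (4, 4) (4, 3)
print(np.allclose(A, Q @ R))             # -> True: A is recovered exactly
print(np.allclose(Q.T @ Q, np.eye(4)))   # -> True: Q is orthogonal
```

The first r columns of Q (r = rank of A) form an orthonormal basis for the column space of A, which is the "semantic" subspace spanned by the documents.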
Example

[The slide shows the numerical orthogonal factor Q for a sample term-document matrix A; the entries are garbled in this transcript and are omitted.]
Example

[The slide shows the corresponding upper triangular factor R; the entries are garbled in this transcript and are omitted.]
Query Matching

We can rewrite the cosine distance using this decomposition. Since a_j = Q r_j and Q preserves 2-norms,

cos θ_j = (a_jᵀ q) / (‖a_j‖₂ ‖q‖₂) = ((Q r_j)ᵀ q) / (‖Q r_j‖₂ ‖q‖₂) = (r_jᵀ (Qᵀ q)) / (‖r_j‖₂ ‖q‖₂)

where r_j refers to column j of the matrix R.
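A quick numerical check of this identity on a made-up matrix, using NumPy's (reduced) QR:

```python
import numpy as np

# Because a_j = Q r_j and Q has orthonormal columns (so it preserves 2-norms),
# cosines computed from (r_j, Q^T q) equal the ones computed from (a_j, q).
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.]])
q = np.array([1., 0., 1.])

Q, R = np.linalg.qr(A)
direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
via_qr = (R.T @ (Q.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

print(np.allclose(direct, via_qr))  # -> True
```

The practical gain is that Qᵀq is computed once per query, after which each document costs only a (sparse, triangular) inner product with r_j.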
Singular Value Decomposition (SVD)

This decomposition provides reduced-rank approximations in the column and row spaces of the document matrix. It is defined as

A = U Σ Vᵀ

where U is m × m, Σ is m × n diagonal, and V is n × n. The columns of U are orthogonal eigenvectors of AAᵀ, the columns of V are orthogonal eigenvectors of AᵀA, and the singular values σ₁ ≥ … ≥ σ_r on the diagonal of Σ are the square roots of the nonzero eigenvalues of AᵀA (equivalently, of AAᵀ).
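A sketch with NumPy, verifying the reconstruction and the eigenvalue relation on a made-up matrix:

```python
import numpy as np

# SVD of an invented 4 x 3 matrix, plus the link between singular values
# and the eigenvalues of A^T A.
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(sigma) @ Vt))   # -> True

eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]      # eigenvalues, descending
print(np.allclose(sigma, np.sqrt(eigvals)))      # -> True: sigma_i = sqrt(lambda_i)
```

`full_matrices=False` returns the "thin" factors, which suffice for reconstruction; the complete m × m and n × n factors of the definition differ only by extra orthonormal columns spanning the null spaces.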
Latent Semantic Analysis (LSA)

LSA is the application of the SVD in text mining: we decompose the term-document matrix A into the three matrices U, Σ, and V. With terms as rows and documents as columns of A, the rows of U correspond to terms and the rows of V (columns of Vᵀ) correspond to documents.
Latent Semantic Analysis
Once we have decomposed the document matrix A, we can reduce its rank to account for synonymy and polysemy in the retrieval of documents: select the singular vectors associated with the k largest singular values σ in each matrix and reconstruct the matrix A from them.
Query Matching

The cosines between the query vector q and the n document vectors of the rank-k approximation A_k = U_k Σ_k V_kᵀ can be represented as

cos θ_j = ((A_k e_j)ᵀ q) / (‖A_k e_j‖₂ ‖q‖₂) = ((U_k Σ_k V_kᵀ e_j)ᵀ q) / (‖U_k Σ_k V_kᵀ e_j‖₂ ‖q‖₂)

where e_j is the canonical vector of dimension n. Defining s_j = Σ_k V_kᵀ e_j, and using the fact that U_k has orthonormal columns, this formula simplifies to

cos θ_j = (s_jᵀ (U_kᵀ q)) / (‖s_j‖₂ ‖q‖₂),   j = 1, …, n
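A numerical sketch of this simplification, with a made-up matrix and query:

```python
import numpy as np

# With s_j = Sigma_k V_k^T e_j, the cosine between q and column j of the
# rank-k matrix A_k equals s_j^T (U_k^T q) / (||s_j|| ||q||).
A = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
q = np.array([1., 0., 0., 1.])
k = 2

U, sigma, Vt = np.linalg.svd(A)
Uk, Sk, Vtk = U[:, :k], np.diag(sigma[:k]), Vt[:k, :]

S = Sk @ Vtk                              # column j of S is s_j
via_factors = (S.T @ (Uk.T @ q)) / (np.linalg.norm(S, axis=0) * np.linalg.norm(q))

Ak = Uk @ Sk @ Vtk                        # explicit rank-k matrix, for comparison
direct = (Ak.T @ q) / (np.linalg.norm(Ak, axis=0) * np.linalg.norm(q))

print(np.allclose(via_factors, direct))   # -> True
```

As with the QR version, U_kᵀq is computed once per query, and each document then costs only a k-dimensional inner product with s_j.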
Example

Apply the LSA method to the following technical memo titles:

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement

m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Example

First we construct the term-document matrix:

           c1 c2 c3 c4 c5 m1 m2 m3 m4
human       1  0  0  1  0  0  0  0  0
interface   1  0  1  0  0  0  0  0  0
computer    1  1  0  0  0  0  0  0  0
user        0  1  1  0  1  0  0  0  0
system      0  1  1  2  0  0  0  0  0
response    0  1  0  0  1  0  0  0  0
time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
survey      0  1  0  0  0  0  0  0  1
trees       0  0  0  0  0  1  1  1  0
graph       0  0  0  0  0  0  1  1  1
minors      0  0  0  0  0  0  0  1  1
Example

The resulting decomposition is the following:

{U} =
human      0.22 -0.11  0.29 -0.41 -0.11 -0.34  0.52 -0.06 -0.41
interface  0.20 -0.07  0.14 -0.55  0.28  0.50 -0.07 -0.01 -0.11
computer   0.24  0.04 -0.16 -0.59 -0.11 -0.25 -0.30  0.06  0.49
user       0.40  0.06 -0.34  0.10  0.33  0.38  0.00  0.00  0.01
system     0.64 -0.17  0.36  0.33 -0.16 -0.21 -0.17  0.03  0.27
response   0.27  0.11 -0.43  0.07  0.08 -0.17  0.28 -0.02 -0.05
time       0.27  0.11 -0.43  0.07  0.08 -0.17  0.28 -0.02 -0.05
EPS        0.30 -0.14  0.33  0.19  0.11  0.27  0.03 -0.02 -0.17
survey     0.21  0.27 -0.18 -0.03 -0.54  0.08 -0.47 -0.04 -0.58
trees      0.01  0.49  0.23  0.03  0.59 -0.39 -0.29  0.25 -0.23
graph      0.04  0.62  0.22  0.00 -0.07  0.11  0.16 -0.68  0.23
minors     0.03  0.45  0.14 -0.01 -0.30  0.28  0.34  0.68  0.18
Example

{Σ} = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)
Example

{V} = (shown transposed: rows are singular dimensions, columns the documents c1 … m4)

 0.20  0.61  0.46  0.54  0.28  0.00  0.01  0.02  0.08
-0.06  0.17 -0.13 -0.23  0.11  0.19  0.44  0.62  0.53
 0.11 -0.50  0.21  0.57 -0.51  0.10  0.19  0.25  0.08
-0.95 -0.03  0.04  0.27  0.15  0.02  0.02  0.01 -0.03
 0.05 -0.21  0.38 -0.21  0.33  0.39  0.35  0.15 -0.60
-0.08 -0.26  0.72 -0.37  0.03 -0.30 -0.21  0.00  0.36
 0.18 -0.43 -0.24  0.26  0.67 -0.34 -0.15  0.25  0.04
-0.01  0.05  0.01 -0.02 -0.06  0.45 -0.76  0.45 -0.07
-0.06  0.24  0.02 -0.08 -0.26 -0.62  0.02  0.52 -0.45
Example

We will perform a rank-2 reconstruction:
- Select the first two vectors in each matrix and set the remaining singular values to zero
- Reconstruct the document matrix
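The two steps above can be sketched with NumPy on the example's term-document matrix:

```python
import numpy as np

# Rank-2 reconstruction: keep the two largest singular values, drop the rest.
# Rows: human, interface, computer, user, system, response, time, EPS,
#       survey, trees, graph, minors. Columns: c1..c5, m1..m4.
A = np.array([
    [1,0,0,1,0,0,0,0,0],
    [1,0,1,0,0,0,0,0,0],
    [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0],
    [0,1,1,2,0,0,0,0,0],
    [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0],
    [0,0,1,1,0,0,0,0,0],
    [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0],
    [0,0,0,0,0,0,1,1,1],
    [0,0,0,0,0,0,0,1,1],
], dtype=float)

U, sigma, Vt = np.linalg.svd(A)
k = 2
A2 = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# "user" (row 3) now has positive weight in c4 (column 3), a document it
# never occurs in but which contains "human".
print(round(A2[3, 3], 2))  # ~0.70
```

This reproduces the reconstructed matrix shown on the next slide (up to rounding).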
Example

           c1    c2    c3    c4    c5    m1    m2    m3    m4
human 0.16 0.40 0.38 0.47 0.18 -0.05 -0.12 -0.16 -0.09
interface 0.14 0.37 0.33 0.40 0.16 -0.03 -0.07 -0.10 -0.04
computer 0.15 0.51 0.36 0.41 0.24 0.02 0.06 0.09 0.12
user 0.26 0.84 0.61 0.70 0.39 0.03 0.08 0.12 0.19
system 0.45 1.23 1.05 1.27 0.56 -0.07 -0.15 -0.21 -0.05
response 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22
time 0.16 0.58 0.38 0.42 0.28 0.06 0.13 0.19 0.22
EPS 0.22 0.55 0.51 0.63 0.24 -0.07 -0.14 -0.20 -0.11
survey 0.10 0.53 0.23 0.21 0.27 0.14 0.31 0.44 0.42
trees -0.06 0.23 -0.14 -0.27 0.14 0.24 0.55 0.77 0.66
graph -0.06 0.34 -0.15 -0.30 0.20 0.31 0.69 0.98 0.85
minors -0.04 0.25 -0.10 -0.21 0.15 0.22 0.50 0.71 0.62
After the reconstruction, the word "user" now has weight in the documents where the word "human" appears, even though the two terms never co-occur in the original titles: the reduced-rank representation has captured their latent association.