Latent Semantic Indexing
Term-document matrices are very large, though most cells are “zeros”.
But the number of topics that people talk about is small (in some sense): clothes, movies, politics, …
Each topic can be represented as a cluster of (semantically) related terms, e.g. clothes = golf, jacket, shoe, …
Can we represent the term-document space by a lower-dimensional “latent” space (latent space = set of topics)?
Searching with latent topics

Given a collection of documents, LSI learns clusters of frequently co-occurring terms (e.g. information retrieval, ranking and web).
If you query with ranking, information retrieval, LSI “automatically” extends the search to documents that also (and even ONLY) include web.
Selection based on ‘Golf’

[Figure: a document base of 20 toy documents, each drawn as a small bag of words. Several concern the Golf car, e.g. {Golf, Car, Topgear, Petrol, GTI}, {Golf, Car, Clarkson, Petrol, Badge}, {Golf, Petrol, Topgear, Polo, Red}, {Car, Petrol, Topgear, GTI, Polo}; one concerns golf the sport, {Golf, Tiger, Woods, Belfry, Tee}; the rest cover other topics (motorbikes, sailing, fish ponds, PCs, gardening, office supplies). With standard VSM, the query Golf selects the 4 documents containing the word Golf.]
[Figure repeated: the 4 documents selected by the query ‘Golf’ within the 20-document base.]

The most relevant words associated with Golf in the selected documents, ranked by wf.idf:

Car: 2 * (20/3) = 13
Topgear: 2 * (20/3) = 13
Petrol: 3 * (20/16) = 4
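The wf.idf arithmetic above can be replayed in a few lines; the wf and df counts are read off the figure, and idf is taken as the raw ratio N/df (no logarithm), as on the slide.

```python
# wf.idf ranks from the slide: wf = occurrences of the term across the
# 4 selected docs, df = docs containing the term in the whole 20-doc base.
N = 20
stats = {"Car": (2, 3), "Topgear": (2, 3), "Petrol": (3, 16)}

scores = {term: wf * (N / df) for term, (wf, df) in stats.items()}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.0f}")
```

Rounded to integers, this reproduces the slide’s ranks of roughly 13, 13 and 4.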
[Figure repeated: selection based on ‘Golf’, with the wf.idf ranks Car = 13, Topgear = 13, Petrol = 4.]

Since words are weighted also with respect to the idf, Car and Topgear turn out to be related to Golf more than Petrol (which occurs in 16 of the 20 documents): this is the weight of co-occurrences.
[Figure repeated: the 20-document base, the selection based on ‘Golf’, and its wf.idf ranks: Car = 13, Topgear = 13, Petrol = 4.]

Selection based on the semantic domain of Golf:
We now search with all the words in the “semantic domain” of Golf. The list of retrieved docs now includes one new document, {Car, Wheel, Topgear, GTI, Polo}, and does not include a previously retrieved document.
[Figure repeated: selection based on the semantic domain of Golf; the slide prints the ranks 26, 17, 17, 0 and 30 for the selected documents.]

The co-occurrence-based ranking improves the performance. Note that one of the most relevant docs does NOT include the word Golf, and a doc with a “spurious” sense (golf as a sport) disappears.
Linear Algebra Background

The LSI method: how to detect “topics”
Eigenvalues & Eigenvectors

Eigenvectors (for a square m×m matrix S): Sv = λv, with λ the eigenvalue and v the (right) eigenvector.

How many eigenvalues are there at most? (S - λI)v = 0 only has a non-zero solution if det(S - λI) = 0; this is an m-th order equation in λ which can have at most m distinct solutions (roots of the characteristic polynomial), and they can be complex even though S is real.
Example of eigenvectors/eigenvalues

Def: A vector v ∈ R^m, v ≠ 0, is an eigenvector of an m×m matrix S with corresponding eigenvalue λ, if: Sv = λv
Example of computation (remember: the characteristic polynomial det(S - λI) = 0)

In the example, 2 and 4 are the eigenvalues of S.

Notice that we computed only the DIRECTION of the eigenvectors: any non-zero rescaling of an eigenvector is still an eigenvector.
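The computation can be checked numerically; the slide’s matrix is not in the transcript, so the symmetric matrix below is a stand-in chosen to have the same eigenvalues, 2 and 4.

```python
import numpy as np

# Stand-in symmetric matrix with eigenvalues 2 and 4.
S = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# eigh is the routine for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors (returned in ascending eigenvalue order).
eigvals, eigvecs = np.linalg.eigh(S)

# Each column v satisfies S v = lambda v; only its DIRECTION matters,
# since any rescaling of v is still an eigenvector.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(S @ v, lam * v)
    assert np.allclose(S @ (2 * v), lam * (2 * v))
```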
Matrix-vector multiplication

Eigenvectors define an orthonormal space O: two eigenvectors of different eigenvalues of a symmetric matrix are orthogonal.

Therefore, any generic vector x (say, with coordinates 2, 4 and 6) can be viewed as a combination of the eigenvectors: x = 2v1 + 4v2 + 6v3, where 2, 4 and 6 are the coordinates of x in the orthonormal space O.

Notice, there is not “the” set of eigenvectors for a matrix, but rather an orthonormal basis (an eigenspace).
Example of 2 bidimensional orthonormal spaces
Difference between orthonormal and orthogonal?

Orthogonal means that the dot product is null. Orthonormal means that the dot product is null and that the norms are equal to 1.

If two or more vectors are orthonormal they are also orthogonal, but the inverse is not true.
Matrix-vector multiplication

Thus a matrix-vector multiplication such as Sx (S as before, x the generic vector of the previous slide, x = 2v1 + 4v2 + 6v3) can be rewritten in terms of the eigenvalues/eigenvectors of S:

Sx = S(2v1 + 4v2 + 6v3) = 2Sv1 + 4Sv2 + 6Sv3 = 2λ1v1 + 4λ2v2 + 6λ3v3

Even though x is an arbitrary vector, the action of S on x (the transformation) is determined by the eigenvalues/eigenvectors. Why?
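A numerical sketch of this identity (the matrix and vector here are illustrative assumptions, not the slide’s):

```python
import numpy as np

S = np.array([[3.0, 1.0],
              [1.0, 3.0]])         # symmetric: orthonormal eigenvectors
lams, V = np.linalg.eigh(S)        # columns of V are the eigenvectors

x = np.array([1.0, 5.0])
coords = V.T @ x                   # coordinates of x in the eigenbasis

# S x = sum_i lambda_i * c_i * v_i : the action of S on any x is fully
# determined by the eigenvalues/eigenvectors.
Sx = sum(lam * c * v for lam, c, v in zip(lams, coords, V.T))
assert np.allclose(Sx, S @ x)
```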
Matrix-vector multiplication

Multiplication of a matrix by a vector = a linear transformation of the initial vector.
Geometric explanation

Multiplying a matrix by a vector has two effects on the vector: rotation (the coordinates of the vector change) and scaling (the length changes). The maximum compression and rotation depend on the matrix’s eigenvalues λi (s1, s2 and s3 in the picture; x is a generic vector, the v are the eigenvectors).
Geometric explanation

In the distortion, the maximum effect is played by the biggest eigenvalues (s1 and s2 in the picture).

The eigenvalues describe the distortion operated by the matrix on the original vector.
Geometric explanation

In this linear transformation, the image of “La Gioconda” is modified but the central vertical axis is the same. The blue vector has changed, not the red one: hence the red one is an eigenvector of the linear transformation. Furthermore, the red vector has been neither enlarged, shortened, nor inverted, therefore its eigenvalue is 1 (the transformation translates the image pixels along the blue direction): Sv = v.
Eigen/diagonal Decomposition

Let S be a square matrix with m orthogonal eigenvectors.

Theorem: there exists an eigen decomposition S = U Λ U^-1 (cf. the matrix diagonalization theorem), with Λ diagonal.

The columns of U are the eigenvectors of S; the diagonal elements of Λ are the eigenvalues of S.

The decomposition is unique for distinct eigenvalues.
Diagonal decomposition: why/how

Let U have the eigenvectors as columns: U = [v1 … vm].

Then, SU can be written:

SU = S [v1 … vm] = [λ1·v1 … λm·vm] = [v1 … vm] · diag(λ1 … λm) = U Λ

Thus SU = U Λ, or U^-1 S U = Λ, and S = U Λ U^-1.
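A quick numerical check of SU = UΛ and S = U Λ U^-1, on an arbitrary 2×2 matrix with distinct eigenvalues (an assumption, not a slide example):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [0.0, 3.0]])           # distinct eigenvalues: 2 and 3
lams, U = np.linalg.eig(S)           # columns of U are eigenvectors
Lam = np.diag(lams)

assert np.allclose(S @ U, U @ Lam)                   # SU = U Lambda
assert np.allclose(S, U @ Lam @ np.linalg.inv(U))    # S = U Lambda U^-1
```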
Example

From one eigenvalue we get v21 = 0 and v22 any real; from the other, v11 = -2·v12. (The worked matrix appears only in the slide image.)
Diagonal decomposition – example 2

Recall S =
2 1
1 2
with λ1 = 1, λ2 = 3.

The eigenvectors (1, -1) and (1, 1) form
U =
 1 1
-1 1

Inverting, we have
U^-1 =
1/2 -1/2
1/2  1/2

(Recall U U^-1 = I.)

Then, S = U Λ U^-1 =
 1 1     1 0     1/2 -1/2
-1 1  ×  0 3  ×  1/2  1/2
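Example 2 can be verified by multiplying the three factors back together:

```python
import numpy as np

U = np.array([[1.0, 1.0],
              [-1.0, 1.0]])          # eigenvectors (1,-1) and (1,1) as columns
Lam = np.diag([1.0, 3.0])            # eigenvalues 1 and 3
U_inv = np.linalg.inv(U)             # [[1/2, -1/2], [1/2, 1/2]]

S = U @ Lam @ U_inv
assert np.allclose(S, [[2.0, 1.0], [1.0, 2.0]])      # recovers S
```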
So what?

What do these matrices have to do with IR? Recall our M × N term-document matrices… But everything so far needs square matrices, so you need to be patient and learn one last notion.
Singular Value Decomposition for non-square matrices

For an M × N matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

A = U Σ V^T, where U is M × M, Σ is M × N and V is N × N.

The columns of U are orthogonal eigenvectors of A·A^T (left singular vectors). The columns of V are orthogonal eigenvectors of A^T·A (right singular vectors).

Σ = diag(σ1 … σr), the singular values. The eigenvalues λ1 … λr of A·A^T are the same as the eigenvalues of A^T·A, and σi = √λi.
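A sketch of the full SVD on a random non-square matrix, checking the shapes and the link between singular values and the eigenvalues of A·A^T:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))                 # M = 4, N = 6

U, s, Vt = np.linalg.svd(A)                     # full SVD
assert U.shape == (4, 4) and Vt.shape == (6, 6)

Sigma = np.zeros((4, 6))                        # Sigma is M x N
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)           # A = U Sigma V^T

# singular values = square roots of the eigenvalues of A A^T
eig = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
assert np.allclose(s, np.sqrt(eig))
```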
An example
Singular Value Decomposition
Illustration of SVD dimensions and sparseness
Reduced SVD

If we retain only k singular values, and set the rest to 0, then we don’t need the matrix parts in red. Then Σ is k×k, U is M×k, V^T is k×N, and Ak is M×N. This is referred to as the reduced SVD.

It is the convenient (space-saving) and usual form for computational applications.

http://www.bluebit.gr/matrix-calculator/calculate.aspx
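The reduced SVD in code: keep the top k singular triplets and drop the rest (the sizes here are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))                 # M = 8, N = 5
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]        # M x k, k, k x N

Ak = Uk @ np.diag(sk) @ Vtk                     # rank-k approximation, M x N
assert Ak.shape == A.shape
assert np.linalg.matrix_rank(Ak) == k
```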
Let’s recap

Since the yellow part is zero, an exact representation of A is A = Ur Σr Vr^T, keeping only the r non-zero singular values. But for some k < r, a good approximation is Ak = Uk Σk Vk^T (both for M < N and for M > N).
Example of rank approximation

A* = Uk × Σk × Vk^T =

0.981 0.000 0.000 0.000 1.985
0.000 0.000 3.000 0.000 0.000
0.000 0.000 0.000 0.000 0.000
0.000 4.000 0.000 0.000 0.000

(the factor matrices are in the slide image)
Approximation error

How good (bad) is this approximation? It’s the best possible, measured by the Frobenius norm of the error:

min over X with rank(X) = k of ||A - X||F = ||A - Ak||F = sqrt(σk+1^2 + … + σr^2)

where the σi are ordered such that σi ≥ σi+1. This suggests why the Frobenius error drops as k increases.
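The identity can be confirmed numerically: the Frobenius error of the rank-k truncation equals the energy of the dropped singular values (random matrix as an example).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err = np.linalg.norm(A - Ak, "fro")
assert np.allclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```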
SVD low-rank approximation

Whereas the term-doc matrix A may have M = 50000, N = 10 million (and rank close to 50000), we can construct an approximation A100 with rank 100!!! Of all rank-100 matrices, it would have the lowest Frobenius error.

C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
Images give a better intuition
[Figure: rank-K reconstructions of the same photo for K = 10, 20, 30, 100, and K = 322 (the original image): the reconstruction sharpens as K grows.]
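The picture series can be reproduced with a few lines of NumPy; here a synthetic rank-1 gradient stands in for the slides’ photo, which is not available in the transcript.

```python
import numpy as np

# Synthetic grayscale "image": an outer product, hence exactly rank 1.
h, w = 64, 96
img = np.outer(np.linspace(0.0, 1.0, h), np.linspace(1.0, 2.0, w))

U, s, Vt = np.linalg.svd(img, full_matrices=False)

def rank_k(k):
    """Rank-k reconstruction, storing only k * (h + w + 1) numbers."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The gradient is rank 1, so K = 1 already reconstructs it; a real
# photo needs the larger K values shown on the slides (10, 20, 30, ...).
assert np.allclose(rank_k(1), img)
```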
We save space!! But this is only one issue
But what about texts and IR? (finally!!)

Latent Semantic Indexing via the SVD

If A is a term-document matrix, then A·A^T and A^T·A are the matrices of term and document co-occurrences, respectively.
Example 2
Term-document matrix
A
Co-occurrences of terms in documents

Let L = A·A^T, i.e. Lij = Σk Aik·Ajk (word i in doc k times word j in doc k): Lij is the number of documents in which wi and wj co-occur.

Similarly, (A^T·A)ij is the number of co-occurring words in docs di and dj (or vice versa, if A is a document-term matrix).
Term co-occurrences example

L(trees, graph) = (0 0 0 0 0 1 1 1 0) · (0 0 0 0 0 0 1 1 1)^T = 2
A numerical example

     d1 d2 d3
w1 |  1  0  1 |
w2 |  1  0  1 |
w3 |  0  1  0 |

A·A^T =

| 1 0 1 |   | 1 1 0 |   | 2 2 0 |
| 1 0 1 | × | 0 0 1 | = | 2 2 0 |
| 0 1 0 |   | 1 1 0 |   | 0 0 1 |

The result has entries w11 w12 w13 / w21 w22 w23 / w31 w32 w33, but wij = wji, hence it is symmetrical.
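The same computation in code:

```python
import numpy as np

A = np.array([[1, 0, 1],     # w1        columns: d1, d2, d3
              [1, 0, 1],     # w2
              [0, 1, 0]])    # w3

L = A @ A.T                  # term co-occurrences
assert (L == [[2, 2, 0], [2, 2, 0], [0, 0, 1]]).all()
assert (L == L.T).all()      # wij = wji: the matrix is symmetrical
```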
LSI: the intuition

Projecting A in the term space (t1, t2, t3): the green, yellow and red vectors are documents.

The black vectors are the unit eigenvectors of A: they represent the main “directions” of the document vectors.

The blue segments give the intuition of the eigenvalues of A: bigger values are those for which the projection of all the vectors on the direction of the corresponding eigenvector is higher.
LSI intuition

We now project our document vectors on the reference orthonormal system represented by the 3 black vectors.
LSI intuition

…and we remove the dimension(s) with the lower eigenvalues.

Now the two axes represent combinations of words, i.e. a latent semantic space.
Example

     d1    d2    d3    d4
t1 | 1.000 1.000 0.000 0.000 |
t2 | 1.000 0.000 0.000 0.000 |
t3 | 0.000 0.000 1.000 1.000 |
t4 | 0.000 0.000 1.000 1.000 |

A = U Σ V^T with

U =
 0.000 -0.851 -0.526  0.000
 0.000 -0.526  0.851  0.000
-0.707  0.000  0.000 -0.707
-0.707  0.000  0.000  0.707

Σ = diag(2.000, 1.618, 0.618, 0.000)

V^T =
 0.000  0.000 -0.707 -0.707
-0.851 -0.526  0.000  0.000
 0.526 -0.851  0.000  0.000
 0.000  0.000 -0.707  0.707

We project terms and docs on two dimensions, v1 and v2 (the principal eigenvectors). We obtain two latent semantic coordinates: s1: (t1, t2) and s2: (t3, t4).

Approximated new term-doc matrix:

1.172 0.724 0.000 0.000
0.724 0.448 0.000 0.000
0.000 0.000 1.000 1.000
0.000 0.000 1.000 1.000

Even if t2 does not occur in d2, now if we query with t2 the system will return also d2!!
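The whole example can be reproduced with an SVD plus a rank-2 truncation (signs of the singular vectors may differ from the slide, which does not affect Ak):

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0, 0.0],   # t1       columns: d1..d4
              [1.0, 0.0, 0.0, 0.0],   # t2
              [0.0, 0.0, 1.0, 1.0],   # t3
              [0.0, 0.0, 1.0, 1.0]])  # t4

U, s, Vt = np.linalg.svd(A)
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# t2 gets a non-zero weight in d2 even though A[1, 1] = 0:
assert A[1, 1] == 0.0 and A2[1, 1] > 0.4
```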
Co-occurrence space

In principle, the space of document or term co-occurrences has a much (much!) higher dimension than the original space of terms!! But with SVD we consider only the most relevant ones, through rank reduction.
LSI: what are the steps

From the term-doc matrix A, compute its approximation Ak with the SVD.

Project docs, terms and queries in a space of k << r dimensions (the k “surviving” eigenvectors) and compute cos-similarity as usual. These dimensions are not the original axes, but those defined by the orthonormal space of the reduced matrix Ak.

The new coordinates of q in the orthonormal space of Ak are the σi·qi (i = 1, 2, …, k << r).
Projecting terms, documents and queries in the LS space

If A = U Σ V^T we also have:

V = A^T U Σ^-1

and the projections:

of a term:     t → t^T V Σ^-1
of a document: d → d^T U Σ^-1
of a query:    q → q^T U Σ^-1

After the rank-k approximation A ≅ Ak = Uk Σk Vk^T:

dk ≅ d^T Uk Σk^-1
qk ≅ q^T Uk Σk^-1
sim(q, d) = sim(q^T Uk Σk^-1, d^T Uk Σk^-1)
Consider a term-doc matrix M × N (M = 11, N = 3) and a query q

(The matrix A and the query are shown in the slide image.)
1. Compute the SVD: A = U Σ V^T
2. Obtain a low-rank approximation (k = 2): Ak = Uk Σk Vk^T
3a. Compute doc/query similarity

For N documents, Ak has N columns, each representing the coordinates of a document di projected in the k LSI dimensions.

A query is considered like a document, and is projected in the LSI space.
3c. Compute the query vector

qk = q^T Uk Σk^-1

q is projected in the 2-dimensional LSI space!
Documents and queries projected in the LSI space
q/d similarity
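The steps above can be chained into a minimal end-to-end sketch; the tiny term-doc matrix and query below are invented for illustration (the slides’ 11×3 matrix is only in the images).

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],       # terms x docs, invented counts
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([1.0, 0.0, 0.0, 0.0])   # query containing only term 1

# 1. SVD; 2. keep k = 2 dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk_inv = U[:, :k], np.diag(1.0 / s[:k])

# 3. project docs (columns of A) and the query: x_k = x^T U_k Sigma_k^-1
docs_k = A.T @ Uk @ Sk_inv           # one row per document
q_k = q @ Uk @ Sk_inv

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sims = [cos(q_k, d) for d in docs_k]  # rank docs by cosine in LSI space
```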
An overview of a semantic network of terms based on the top 100 most significant latent semantic dimensions (Zhu&Chen)
Another LSI Example
[Figure: A^T, with rows d1 d2 d3 and columns t1 t2 t3 t4, and the term co-occurrence matrix A·A^T over t1 … t4.]
Ak = Uk Σk Vk^T ≈ A

Now it is as if t3 belonged to d1!