Post on 20-Dec-2015
1
Latent Semantic Indexing
Jieping Ye
Department of Computer Science & Engineering
Arizona State University
http://www.public.asu.edu/~jye02
4
Some Properties of SVD
• That is, A_k is the optimal approximation of A among all matrices of rank at most k, with the approximation error measured by the Frobenius norm
• This forms the basis of LSI (Latent Semantic Indexing) in information retrieval
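The optimality property can be checked numerically. The following is a minimal sketch, assuming NumPy and a random test matrix; the helper name `truncated_svd` and the matrix sizes are illustrative choices, not from the slides. It builds A_k from the k largest singular triplets and compares its Frobenius error with the discarded singular values and with an arbitrary rank-k competitor.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k approximation A_k and all singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)        # fixed seed, illustrative only
A = rng.standard_normal((6, 5))
k = 2
A_k, s = truncated_svd(A, k)

# Frobenius error of the truncation ...
err = np.linalg.norm(A - A_k, "fro")
# ... equals the root of the sum of squares of the discarded singular values.
tail = np.sqrt(np.sum(s[k:] ** 2))

# An arbitrary rank-k matrix should not approximate A any better.
B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 5))
err_other = np.linalg.norm(A - B, "fro")

print(err, tail, err_other)
```

Here `err` and `tail` agree up to rounding, and `err_other` is no smaller than `err`.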
6
Applications of SVD
• Pseudoinverse
• Range, null space and rank
• Matrix approximation
• Other examples
http://en.wikipedia.org/wiki/Singular_value_decomposition
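As a rough illustration of the first two applications, the sketch below reads the rank, a range basis, a null-space basis, and the pseudoinverse off one SVD. The 3 x 3 matrix and the relative tolerance `1e-10` are assumptions made for the example, not values from the slides.

```python
import numpy as np

# Hypothetical rank-2 matrix: the second row is twice the first.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A)

# Numerical rank: singular values above a relative tolerance (assumed 1e-10).
tol = 1e-10 * s[0]
rank = int(np.sum(s > tol))

# Range of A: the first `rank` left singular vectors.
range_basis = U[:, :rank]
# Null space of A: the right singular vectors beyond the rank.
null_basis = Vt[rank:, :].T

# Pseudoinverse: invert only the nonzero singular values.
s_inv = np.zeros_like(s)
s_inv[:rank] = 1.0 / s[:rank]
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

print(rank)
print(np.allclose(A @ null_basis, 0.0))
print(np.allclose(A_pinv, np.linalg.pinv(A, rcond=1e-10)))
```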
7
LSI (Latent Semantic Indexing)
• Introduction
• Latent Semantic Indexing
  • LSI
  • Query
  • Updating
• An example
8
Problem Introduction
• The traditional term-matching method doesn’t work well in information retrieval
• We want to capture the concepts instead of words. Concepts are reflected in the words. However,
  • One term may have multiple meanings
  • Different terms may have the same meaning
9
LSI (Latent Semantic Indexing)
• The LSI approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem.
• The goal is to find effective models to represent the relationship between terms and documents. Hence a set of terms, which is by itself incomplete and unreliable, will be replaced by some set of entities which are more reliable indicants.
11
LSI, the Method (cont.)
• Each row (term) and each column (document) of A gets mapped into the k-dimensional LSI space by the SVD.
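The mapping can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: A is taken to be a term-by-document matrix, the counts are made up, and terms are represented by the rows of U_k and documents by the rows of V_k. Other scalings, such as U_k Σ_k and V_k Σ_k, are also used in practice.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows = terms, columns = documents).
A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 2, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 1]], dtype=float)
k = 2  # dimension of the LSI space, chosen arbitrarily for this example

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]        # left singular vectors for the k largest singular values
S_k = np.diag(s[:k])  # k x k diagonal matrix of those singular values
V_k = Vt[:k, :].T     # matching right singular vectors

term_coords = U_k     # one k-dimensional row per term (row of A)
doc_coords = V_k      # one k-dimensional row per document (column of A)

# Consistency check: projecting each document column d as d^T U_k S_k^{-1}
# reproduces that document's row of V_k.
projected_docs = A.T @ U_k @ np.linalg.inv(S_k)

print(term_coords.shape, doc_coords.shape)
print(np.allclose(projected_docs, doc_coords))
```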
12
Query
• A query q is also mapped into this space, by
  $\hat{q} = q^{T} U_k \Sigma_k^{-1}$
• Compare the similarity in the new space
• Intuition: Dimension reduction through LSI brings together “related” axes in the vector space.
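A small worked example may help make the intuition concrete. The sketch below assumes the common folding-in formula q̂ = qᵀ U_k Σ_k⁻¹, documents represented by the rows of V_k, and cosine similarity as the comparison. The five-term vocabulary, the three documents, and the helper names `fold_in` and `cosine` are all made up for illustration.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]
# Hypothetical term-by-document matrix:
#   d0 = {car, engine}, d1 = {automobile, engine}, d2 = {flower, petal}
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]
S_k_inv = np.diag(1.0 / s[:k])
V_k = Vt[:k, :].T  # row j holds the LSI coordinates of document j

def fold_in(q):
    """Map a term-space query vector into the k-dimensional LSI space."""
    return q @ U_k @ S_k_inv

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query containing only the term "car".
q = np.array([1, 0, 0, 0, 0], dtype=float)
q_hat = fold_in(q)

# Similarity in the original term space versus the LSI space.
raw_sims = [cosine(q, A[:, j]) for j in range(A.shape[1])]
lsi_sims = [cosine(q_hat, V_k[j]) for j in range(V_k.shape[0])]

print(raw_sims)
print(lsi_sims)
```

In this toy collection the query shares no term with d1, so plain term matching scores it zero. In the two-dimensional LSI space d0 and d1 coincide, so the query matches both, while d2 stays orthogonal to it.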
18
How to set the value of k?
• LSI is useful only if k << n.
• If k is too large, it doesn't capture the underlying latent semantic space; if k is too small, too much is lost.
• No principled way of determining the best k.
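In the absence of a principled rule, one commonly used heuristic (not taken from these slides) is to keep the smallest k whose singular values account for a chosen fraction of the total squared singular values. The sketch below assumes that heuristic. The function name `smallest_k_for_energy`, the threshold values, and the sample singular values are hypothetical.

```python
import numpy as np

def smallest_k_for_energy(singular_values, threshold=0.9):
    """Smallest k whose leading singular values hold `threshold` of the squared total."""
    sq = np.asarray(singular_values, dtype=float) ** 2
    energy = np.cumsum(sq) / np.sum(sq)
    # Index of the first cumulative fraction that reaches the threshold.
    idx = int(np.searchsorted(energy, threshold))
    # Clamp in case rounding keeps the last fraction just below the threshold.
    return min(idx + 1, len(sq))

s = np.array([3.0, 2.0, 1.0, 0.5])  # made-up singular values
k_90 = smallest_k_for_energy(s, 0.90)
k_95 = smallest_k_for_energy(s, 0.95)
print(k_90, k_95)
```

Such a threshold only summarizes how much of A the truncation retains. It says nothing about retrieval quality, so in practice k is usually tuned against a retrieval measure on held-out queries.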
19
How well does LSI work?
• Effectiveness of LSI compared to regular term-matching depends on nature of documents.
  • Typical improvement: 0 to 30% better precision.
  • Advantage greater for texts in which synonymy and ambiguity are more prevalent.
  • Best when recall is high.
• Costs of LSI might outweigh improvement.
  • SVD is computationally expensive; limited use for really large document collections
  • Inverted index not possible