Latent Semantic Indexing
or How I Learned to Stop Worrying and Love Math I Don’t Understand
Adam Carlson
3/1/99 CS590Q W99 - Latent Semantic Indexing - Adam Carlson
Outline
• Discourse Segmentation
• LSI Motivation
• Math - How to do LSI
• Applications
• More Math - Why does it work
• Wacky Ideas
Discourse Segmentation
• Some collections (like the web) have high variance in document length
• Sometimes things like sentences or paragraphs work, sometimes they don’t
• Would like to segment documents according to topic
TextTiling
• Break document into units of fixed length
• Score cohesion between units
• Look for patterns of low cohesion surrounded by high cohesion
– Indicates a change of subject
• Found good agreement with human judges
• Possible application for LSI measures of coherence
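The cohesion scoring above can be sketched with plain term-count vectors. This is a minimal illustration, not Hearst's actual TextTiling (which uses fixed token-sequence blocks and smoothing); the helper names and toy text are invented.

```python
# Sketch of TextTiling-style cohesion scoring: cohesion between
# adjacent units is the cosine of their term-count vectors; local
# minima suggest topic boundaries.
import numpy as np

def term_vector(unit, vocab):
    """Count vector for one unit of text over a fixed vocabulary."""
    words = unit.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cohesion_scores(units, vocab):
    """Cosine similarity between each pair of adjacent units."""
    vecs = [term_vector(u, vocab) for u in units]
    scores = []
    for a, b in zip(vecs, vecs[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        scores.append(a @ b / denom if denom else 0.0)
    return scores

units = ["the cat sat on the mat",
         "the cat chased the dog",
         "stocks fell as markets closed"]
vocab = sorted(set(" ".join(units).split()))
scores = cohesion_scores(units, vocab)
# The first pair shares vocabulary (high cohesion); the second pair
# shares none (cohesion 0), suggesting a topic boundary.
```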
Using Co-occurrence Information
• Major problems with word-matching
– Synonymy (one meaning, many words)
– Polysemy (one word, many meanings)
• Solutions
– Concept search
– Query expansion
– Clustering
– Latent Semantic Indexing (almost)
Latent Semantic Indexing is ...
• Latent
– Captures underlying structure of the corpus
• Semantic
– Groups words by “conceptual” similarity
• Cool
– Lots of neat applications
• Not a Silver Bullet
– Not really semantic, just MDS, and expensive
What is LSI?
• Restructures vector space so that co-occurrences are mapped together
• Captures transitive co-occurrence relations
• Application of dimensional reduction to the term-document matrix
– Throw out da noise, bring in da regularities
• Form of clustering
Document Vector Space
[Figure: Doc 1 and Doc 2 plotted against term axes (House, Home, Domicile, Kumquat, Apple, Orange, Pear)]
Semantic Space
[Figure: the same terms (House, Home, Domicile, Kumquat, Apple, Orange, Pear) projected onto LSI Dim 1 and LSI Dim 2]
Singular Value Decomposition
A = U · D · V^T
where A is the m×n term-document matrix (rows are terms, columns are documents), U is m×r, D is the r×r diagonal matrix of singular values, and V^T is r×n (r = rank of A).
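The factorization can be checked on a toy term-document matrix; the terms and counts below are invented for illustration.

```python
import numpy as np

# Toy term-document matrix A: rows are terms (house, home, apple,
# orange), columns are 3 documents; counts invented for illustration.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# Thin SVD: A = U @ D @ Vt with D diagonal, singular values descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(s)
recon = U @ D @ Vt  # reconstructs A up to floating-point error
```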
Term-Document Matrix Approximation
Â_k = U_k · D_k · V_k^T
where Â_k is the m×n rank-k approximation (rows are terms, columns are documents), U_k is m×k, D_k is the k×k diagonal matrix of the k largest singular values, and V_k^T is k×n.
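The rank-k truncation can be sketched with numpy (toy matrix with invented counts). By the Eckart-Young theorem, the squared Frobenius error of the truncation equals the sum of the squared dropped singular values.

```python
import numpy as np

# Toy term-document matrix (4 terms x 3 documents, counts invented).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the two largest singular values
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_hat is the best least-squares rank-k approximation of A; its
# squared Frobenius error is the sum of the discarded s_i squared.
err = np.linalg.norm(A - A_hat, "fro") ** 2
```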
Properties of Â
• Best least-squares approximation of A given only k dimensions
• Terms and documents which were similar in A are more similar in Â
• This measure of similarity is transitive
So what can we do with this?
LSI Tricks and Tips
• Use Â to query using the standard cosine measure
• Use U_k·D_k for term similarity
• Use D_k·V_k^T for document similarity
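A minimal sketch of querying in the reduced space, using the common "folding-in" convention q_hat = q^T U_k D_k^{-1} (standard LSI practice, though not spelled out on the slide); the toy matrix and the query are invented.

```python
import numpy as np

# Toy term-document matrix: rows are terms (house, home, apple,
# orange), columns are 3 documents; counts invented.
A = np.array([[1.0, 1.0, 0.0],   # house
              [1.0, 0.0, 0.0],   # home
              [0.0, 0.0, 1.0],   # apple
              [0.0, 1.0, 1.0]])  # orange

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Dk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q = np.array([0.0, 1.0, 0.0, 0.0])  # query vector for "home"
q_hat = q @ Uk @ np.linalg.inv(Dk)  # fold query into LSI space
doc_coords = (Dk @ Vtk).T           # one row of coordinates per document

sims = [cosine(q_hat, d) for d in doc_coords]
best = int(np.argmax(sims))
# Document 0, the only one containing "home", ranks first.
```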
Applications
• Information Retrieval
– Improve retrieval
– Cross-language retrieval
– Document routing/filtering
– Measuring text coherence
• Cognitive Science
– Learning synonyms
– Subject matter knowledge
– Word sorting behavior
– Lexical priming
• Education
– Essay grading
– Text selection
Standard Vector Space Retrieval in LSI Space
• Improves recall at the expense of precision
• Compared against term-document vector space systems (SMART and Voorhees) [Deerwester et al. 1990]:
– LSI did best on the MED dataset
– SMART did best on the CISI dataset
– but LSI was comparable to SMART when stemming was added
Cross Language Retrieval
• Train on multilingual corpora using “combined” documents
• Add in single language documents
• Query in LSI space
[Landauer & Littman 1990] French & English
[Landauer, Littman & Stornetta 1992] Japanese & English
[Young 1994] Greek & English
[Dumais, Landauer & Littman 1996] Comparisons between LSI, non-LSI and Machine Translation
Document Routing/Filtering
• Match reviewers with papers to be reviewed based on reviewers’ publications [Dumais & Nielsen 1992]
• Select papers for researchers to read based on other papers they liked [Foltz & Dumais 1992]
LSI goes to college
• Train LSI on encyclopedia articles
• Test against TOEFL synonym test
• Results comparable to (non-native) college applicants
[Landauer & Dumais 1996]
• Train on introductory Psychology texts
• Receive passing grade on multiple-choice questions (but did worse than students)
[Landauer, Foltz & Laham 1998]
Essay Grading
• Several techniques
– Use essay (or sentences from essay) to query into a textbook or database of graded essays
– Grade based on cosine from the text or closest graded essay
– More consistent than expert human graders
– Is that good?
[Landauer, Laham & Foltz 1998]
Routing Meets Education
• Run LSA on a bunch of texts at different levels of sophistication
• Have student write short essay about topic
• Use essay as query to select most appropriate text for student
[Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch and Landauer 1998]
Measuring Text Coherence
• Use LSI to compute the cosine of each sentence with the following one [Foltz, Kintsch & Landauer 1998]
• Correlates highly with established methods
• Can indicate where coherence breaks down
• Can be used to measure how semantic content changes across a text (discourse segmentation?)
Outline
• Discourse Segmentation
• LSI Motivation
• Math - How to do LSI
• Applications
• More Math - Why does it work
• Wacky Ideas
Least Squares Approximation
(Why does it work? 1st Attempt)
• Â is best least-squares approximation to A using just k dimensions
Least Squares cont.
• Why does this work?
• Are these the regularities we want to capture?
• Why approximate at all? (hint: overfitting)
Not very convincing
Neural Network Explanation
(Why does it work? 2nd Attempt)
• Consider a fully connected 3-layer network
– First layer is terms
– Middle layer has k units
– Last layer is documents
– Weights on the hidden layer will adjust to group terms that appear in similar documents and documents containing similar terms
– This is analogous to the SVD matrices
[Figure: terms T1, T2, T3, T4 connected through hidden units H1, H2 to documents D1, D2, D3, D4, D5]
Spectral Analysis
(Why does it work? 3rd Attempt)
• Kleinberg’s “Authoritative Sources”
– A link provides evidence of authority
• Authoritative sources are pointed to by hubs
• Hubs point to authoritative sources
– Give every page some “weight”
– Move weight back and forth across links
– Stabilizes with authorities and hubs
– Equivalent to spectral analysis (eigenstuff)
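The weight-passing loop above can be sketched as a power iteration on a tiny link graph; the adjacency matrix is invented for illustration.

```python
import numpy as np

# Toy link graph: L[i][j] = 1 means page i links to page j.
# Page 2 is pointed to by pages 0, 1 and 3, so it should emerge
# as the top authority.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 1]], dtype=float)

hubs = np.ones(4)
auths = np.ones(4)
for _ in range(50):
    auths = L.T @ hubs              # authority: pointed to by good hubs
    hubs = L @ auths                # hub: points to good authorities
    auths /= np.linalg.norm(auths)  # renormalize each round
    hubs /= np.linalg.norm(hubs)

# This power iteration converges to the principal eigenvectors of
# L^T L (authorities) and L L^T (hubs) -- the "eigenstuff" above.
```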
Spectral Analysis cont.
• Co-occurrence instead of authority
• Links are documents with the same word
• Similar documents have many similar words
• Similar words occur in similar documents
• Turn the Kleinberg crank and get:
– Authoritative sources = similar documents
– Hubs = words that occur in similar documents
– Doesn’t exactly fit (asymmetric)
More Eigenexplanation
(Why does it work? 4th Attempt)
• Rank of a matrix is a measure of how much information it contains
• Rows which are linear combinations of each other can be removed
• In this case, some singular values will be 0
Eigenvalues cont.
• Consider vectors of terms X, Y and Z
• X = [1 1 0 0 1 0 ...]
• Y = [0 0 1 1 0 0 ...]
• Z = [1 1 2 2 0 1 ...]
– Z ≈ X + 2Y
– So some singular value of A is low
– By forcing that singular value to 0, we merge X, Y and Z
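A quick numeric check of this example. The trailing "..." in the vectors is elided on the slide; here I assume only the six shown components, which is enough to make the point.

```python
import numpy as np

# The three term rows from the slide (assumption: only the six shown
# components; the "..." entries are dropped).
X = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
Y = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
Z = np.array([1.0, 1.0, 2.0, 2.0, 0.0, 1.0])  # close to X + 2*Y

A = np.vstack([X, Y, Z])
s = np.linalg.svd(A, compute_uv=False)
# Because Z nearly equals X + 2*Y, the smallest singular value is
# small relative to the largest; zeroing it merges the three rows
# onto two latent directions.
```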
LSI Theory
• Under certain assumptions– Corpus has k topics
– Each topic has n>l unique terms
– Documents can cover multiple topics
– 95% of content words in document are on-topic
• LSI is guaranteed to separate documents into proper topics
• Speedup with random projection
[Papadimitriou, Raghavan, Tamaki & Vempala 1998]
Related Techniques
• PCA/Factor analysis/Multi-dimensional scaling
• Neural nets
• Kohonen Maps
Dimensionality Reduction
• Dimensionality reduction takes high-dimensional data and re-expresses it in a lower dimension
• PCA
– If you were only allowed 1 line to represent all the data, what would it be?
• The one that explains the greatest variance
– Recur
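The "best line, then recur" idea can be sketched via the SVD of centered data; the points are synthetic, stretched along the x-axis so the expected first component is known.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 2-D points, stretched along the x-axis, so the first principal
# component should be dominated by that axis.
pts = rng.normal(size=(200, 2)) * np.array([5.0, 1.0])

# PCA = SVD of the mean-centered data; each row of Vt is a principal
# direction, ordered by the variance it explains. "Recurring" on the
# residual just yields the next rows of Vt.
centered = pts - pts.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc1 = Vt[0]  # the single best line through the data
```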
PCA cont.
Wacky ideas
• Hierarchical concept clustering
• Measure spatial deviations– Communication barriers– Language drift
• Statistical/Symbolic Hybrids
Hierarchical Concept Clustering
[Figure, three panels: “Bat” shown with Ball, Glove, Umpire and with Moth, Sonar, Bird]
• LSI doesn’t handle polysemy well
• Find subspaces which separate polysemous words into different clusters
• Hopefully those subspaces correspond to topics
• Lather, rinse, repeat
Finding Communication Barriers
• Want to find terms which have different meanings in different corpora
• Judge words by the company they keep
• Look for words which are in cohesive clusters in both corpora but the terms in those clusters are different
Communication Barriers cont.
• Tried with pro-choice/pro-life corpora
• Poor results
– Didn’t use cohesive clusters
– Not enough data
– Highly variable data
• Possible fix - start with baseline corpus and measure drift as other corpora are merged in
Tracking Language Drift
• Follow changes in clusters as a corpus grows
• Hierarchical Agglomerative Clustering may have discontinuities
– Use these to mark significant changes
Hybrid Approach
• Merge statistical analysis (LSI) with symbolic analysis (MindNet)
• Use LSI term similarity metric to assign strengths to MindNet relations
• Incorporate syntactic information
– Preprocess documents, adding POS or attachment information to words
– Time-N Flies-V Like-AVP An-Det Arrow-N