Latent Semantic Indexing, or How I Learned to Stop Worrying and Love Math I Don’t Understand

Adam Carlson


Page 1: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand

Latent Semantic Indexing, or
How I Learned to Stop Worrying and Love Math I Don’t Understand

Adam Carlson

Page 2: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand

3/1/99 CS590Q W99 - Latent Semantic Indexing - Adam Carlson


Outline

• Discourse Segmentation

• LSI Motivation

• Math - How to do LSI

• Applications

• More Math - Why does it work

• Wacky Ideas

Page 3: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Discourse Segmentation

• Some collections (like the web) have high variance in document length

• Sometimes things like sentences or paragraphs work, sometimes they don’t

• Would like to segment documents according to topic

Page 4: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


TextTiling

• Break document into units of fixed length

• Score cohesion between units

• Look for patterns of low cohesion surrounded by high cohesion
– Indicates a change of subject

• Found good agreement with human judges

• Possible application for LSI measures of coherence
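The TextTiling steps above can be sketched as follows. This is a simplified version (Hearst’s actual algorithm also smooths the scores and uses depth scoring, which we omit); the function names are ours.

```python
# Minimal TextTiling-style sketch: split tokens into fixed-size blocks,
# score lexical cohesion between adjacent blocks with a cosine over word
# counts, and look for low points (possible topic boundaries).
from collections import Counter
import math

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a if w in b)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttiling_scores(tokens, block_size=20):
    blocks = [Counter(tokens[i:i + block_size])
              for i in range(0, len(tokens), block_size)]
    # Cohesion between each pair of adjacent blocks; a dip suggests a shift.
    return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
```

On a text that switches vocabulary halfway through, the score profile dips exactly at the boundary block pair.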

Page 5: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Using Co-occurrence Information

• Major problems with word-matching
– Synonymy (one meaning, many words)
– Polysemy (one word, many meanings)

• Solutions
– Concept search
– Query expansion
– Clustering
– Latent Semantic Indexing (almost)

Page 6: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Latent Semantic Indexing is ...

• Latent
– Captures underlying structure of corpus

• Semantic
– Groups words by “conceptual” similarity

• Cool
– Lots of neat applications

• Not a Silver Bullet
– Not really semantic; just MDS, and expensive

Page 7: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


What is LSI

• Restructures vector space so that co-occurrences are mapped together

• Captures transitive co-occurrence relations

• Application of dimensionality reduction to the term-document matrix
– Throw out da noise, bring in da regularities

• Form of clustering

Page 8: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Document vector space

[Figure: the terms House, Home, Domicile, Kumquat, Apple, Orange, and Pear plotted as points in a space whose axes are Doc 1 and Doc 2]

Page 9: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Semantic Space

[Figure: the same terms re-plotted in the reduced space along axes LSI Dim 1 and LSI Dim 2]

Page 10: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Singular Value Decomposition

A (m×n) = U (m×r) · D (r×r) · Vᵀ (r×n)

(rows of A are terms; columns are documents)
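The factorization above can be verified numerically. A minimal sketch using NumPy on a toy term-document matrix (the matrix values are illustrative, not from the slides):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# SVD: A = U @ diag(d) @ Vt, with singular values d sorted largest first.
U, d, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(d) @ Vt)
```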

Page 11: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Term-Document Matrix Approximation

Âk (m×n) = Uk (m×k) · Dk (k×k) · Vkᵀ (k×n)

(rows are terms; columns are documents; only the k largest singular values are kept)
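The truncation above can be sketched directly (toy data; the least-squares optimality checked here is the Eckart-Young property):

```python
import numpy as np

A = np.random.default_rng(0).random((6, 4))
U, d, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the k largest singular values
A_hat = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]

# A_hat is the best rank-k least-squares approximation of A; its Frobenius
# error equals the root-sum-square of the dropped singular values.
err = np.linalg.norm(A - A_hat)
assert np.isclose(err, np.sqrt(np.sum(d[k:] ** 2)))
```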

Page 12: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Properties of Â

• Best least-squares approximation of A given only k dimensions

• Terms and documents which were similar in A are more similar in Â

• This measure of similarity is transitive

So what can we do with this?

Page 13: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


LSI Tricks and Tips

• Use Â to query using the standard cosine measure

• Use Uk·Dk for term similarity

• Use Dk·VkT for document similarity

Âk (m×n) = Uk (m×k) · Dk (k×k) · Vkᵀ (k×n)   (rows = terms, columns = documents)
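These tricks can be sketched in NumPy. The query fold-in formula q̂ = qᵀ · Uk · Dk⁻¹ follows the standard LSI treatment (Deerwester et al.); the toy matrix is illustrative.

```python
import numpy as np

A = np.array([[1., 1., 0.],   # rows = terms
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])  # columns = documents
U, d, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Dk, Vtk = U[:, :k], np.diag(d[:k]), Vt[:k, :]

term_coords = Uk @ Dk       # compare rows for term-term similarity
doc_coords = (Dk @ Vtk).T   # compare rows for document-document similarity

# Fold a query (a term-count vector) into LSI space, then rank documents
# by cosine between q_hat and each row of doc_coords.
q = np.array([1., 0., 1., 0.])
q_hat = q @ Uk @ np.linalg.inv(Dk)
```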

Page 14: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Applications

• Information Retrieval
– Improve retrieval
– Cross-language retrieval
– Document routing/filtering
– Measuring text coherence

• Cognitive Science
– Learning synonyms
– Subject matter knowledge
– Word sorting behavior
– Lexical priming

• Education
– Essay grading
– Text selection

Page 15: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Standard Vector Space Retrieval in LSI Space

• Improves recall at the expense of precision
• Compared to term-document vector space, SMART and Voorhees [Deerwester et al. 1990]
– LSI did best on the MED dataset
– SMART did best on the CISI dataset
– but LSI was comparable to SMART when stemming was added

Page 16: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Cross Language Retrieval

• Train on multilingual corpora using “combined” documents

• Add in single language documents

• Query in LSI space

[Landauer & Littman 1990] French & English
[Landauer, Littman & Stornetta 1992] Japanese & English
[Young 1994] Greek & English
[Dumais, Landauer & Littman 1996] Comparisons between LSI, no-LSI, and Machine Translation

Page 17: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Document Routing/Filtering

• Match reviewers with papers to be reviewed based on reviewers’ publications [Dumais & Nielsen 1992]

• Select papers for researchers to read based on other papers they liked [Foltz & Dumais 1992]

Page 18: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


LSI goes to college

• Train LSI on encyclopedia articles

• Test against TOEFL synonym test

• Results comparable to (non-native) college applicants

[Landauer & Dumais 1996]

• Train on introductory Psychology texts

• Receive passing grade on multiple-choice questions (but did worse than students)

[Landauer, Foltz & Laham 1998]

Page 19: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Essay Grading

• Several techniques
– Use essay (or sentences from essay) to query into textbook or database of graded essays
– Grade based on cosine from text or closest graded essay
– More consistent than expert human graders
– Is that good?

[Landauer, Laham & Foltz 1998]

Page 20: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Routing Meets Education

• Run LSA on a bunch of texts at different levels of sophistication

• Have student write short essay about topic

• Use essay as query to select most appropriate text for student

[Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch and Landauer 1998]

Page 21: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Measuring Text Coherence

• Use LSI to compute the cosine of each sentence with the following one [Foltz, Kintsch & Landauer 1998]

• Correlates highly with established methods

• Can indicate where coherence breaks down

• Can be used to measure how semantic content changes across a text (discourse segmentation?)
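The sentence-to-sentence cosine profile described above can be sketched as follows. It assumes the sentence vectors have already been mapped into LSI space; the function name is ours.

```python
import numpy as np

def coherence_profile(sentence_vecs):
    """Cosine of each sentence vector with the next one; low points in
    the profile suggest a coherence break (a possible topic boundary)."""
    sims = []
    for a, b in zip(sentence_vecs, sentence_vecs[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sims.append(float(a @ b / denom) if denom else 0.0)
    return sims
```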

Page 22: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Outline

• Discourse Segmentation

• LSI Motivation

• Math - How to do LSI

• Applications

• More Math - Why does it work

• Wacky Ideas

Page 23: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Least Squares Approximation
Why does it work? 1st Attempt

• Â is best least-squares approximation to A using just k dimensions

Page 24: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Least Squares cont.

• Why does this work?

• Are these the regularities we want to capture?

• Why approximate at all? (hint: overfitting)

Not very convincing

Page 25: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Neural Network Explanation
Why does it work? 2nd Attempt

• Consider a fully connected 3-layer network
– First layer is terms
– Middle layer has k units
– Last layer is documents
– Weights on the hidden layer will adjust to group terms that appear in similar documents and documents containing similar terms
– This is analogous to the SVD matrices

[Figure: network with term nodes T1-T4 feeding hidden units H1 and H2, which feed document nodes D1-D5]

Page 26: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Spectral Analysis
Why does it work? 3rd Attempt

• Kleinberg’s “Authoritative Sources”
– A link provides evidence of authority
• Authoritative sources are pointed to by hubs
• Hubs point to authoritative sources
– Give every page some “weight”
– Move weight back and forth across links
– Stabilizes into authorities and hubs
– Equivalent to spectral analysis (eigenstuff)

Page 27: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Spectral Analysis cont.

• Co-occurrence instead of authority

• Links are documents with the same word
• Similar documents have many similar words
• Similar words occur in similar documents
• Turn the Kleinberg crank and get:
– Authoritative sources = similar documents
– Hubs = words that occur in similar documents
– Doesn’t exactly fit (asymmetric)

Page 28: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


More Eigenexplanation
Why does it work? 4th Attempt

• Rank of a matrix is a measure of how much information it contains

• Rows which are linear combinations of each other can be removed

• In this case, some singular values will be 0

Page 29: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Eigenvalues cont.

• Consider vectors of terms X, Y and Z
– X = [1 1 0 0 1 0 ... ]
– Y = [0 0 1 1 0 0 ... ]
– Z = [1 1 2 2 0 1 ... ]
– Z ≈ X + 2Y
– Some singular value of A is low
– By forcing that singular value to 0, we merge X, Y and Z
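This merging effect can be demonstrated numerically. Since the third row is nearly a linear combination of the first two, the smallest singular value of the stacked matrix comes out small:

```python
import numpy as np

X = np.array([1., 1., 0., 0., 1., 0.])
Y = np.array([0., 0., 1., 1., 0., 0.])
Z = np.array([1., 1., 2., 2., 0., 1.])  # close to X + 2Y, not exactly equal

A = np.vstack([X, Y, Z])
d = np.linalg.svd(A, compute_uv=False)  # singular values, largest first
# Because Z is nearly dependent on X and Y, the smallest singular value
# is small relative to the largest; zeroing it merges the three rows
# into a rank-2 structure.
assert d[2] < 0.5 * d[0]
```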

Page 30: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


LSI Theory

• Under certain assumptions
– Corpus has k topics
– Each topic has n > l unique terms
– Documents can cover multiple topics
– 95% of content words in a document are on-topic

• LSI is guaranteed to separate documents into the proper topics

• Speedup with random projection

[Papadimitriou, Raghavan, Tamaki & Vempala 1998]

Page 31: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Related Techniques

• PCA/Factor analysis/Multi-dimensional scaling

• Neural nets

• Kohonen Maps

Page 32: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Dimensionality Reduction

• Dimensionality reduction takes high-dimensional data and re-expresses it in a lower dimension

• PCA
– If you were only allowed one line to represent all the data, what would it be?
• The one that explains the greatest variance
– Recurse
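The one-line-of-greatest-variance idea can be sketched with the SVD (synthetic data; the (1, 1) direction is our choice of example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Points scattered along the direction (1, 1), plus a little noise.
t = rng.normal(size=200)
X = np.column_stack([t, t]) + 0.05 * rng.normal(size=(200, 2))
X -= X.mean(axis=0)  # PCA works on centered data

# The first right singular vector is the line explaining the most variance.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
first_pc = Vt[0]
# It should align (up to sign) with the unit vector along (1, 1).
assert abs(abs(first_pc @ np.array([1., 1.]) / np.sqrt(2)) - 1.0) < 0.01
```

"Recurse" then means: remove that component and repeat on the residual, yielding the second component, and so on.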

Page 33: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


PCA cont.

Page 34: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Wacky ideas

• Hierarchical concept clustering

• Measure spatial deviations– Communication barriers– Language drift

• Statistical/Symbolic Hybrids

Page 35: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Hierarchical Concept Clustering

[Figure: the terms Bat, Ball, Glove, Umpire, Moth, Sonar, and Bird shown in three successive cluster views, with the polysemous “Bat” split between a baseball cluster and an animal cluster]

• LSI doesn’t handle polysemy well

• Find subspaces which separate polysemous words into different clusters

• Hopefully those subspaces correspond to topics

• Lather, rinse, repeat

Page 36: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Finding Communication Barriers

• Want to find terms which have different meanings in different corpora

• Judge words by the company they keep

• Look for words which are in cohesive clusters in both corpora but the terms in those clusters are different

Page 37: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Communication Barriers cont.

• Tried with pro-choice/pro-life corpora

• Poor results
– Didn’t use cohesive clusters
– Not enough data
– Highly variable data

• Possible fix: start with a baseline corpus and measure drift as other corpora are merged in

Page 38: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Tracking Language Drift

• Follow changes in clusters as a corpus grows

• Hierarchical Agglomerative Clustering may have discontinuities
– Use these to mark significant changes

Page 39: Latent Semantic Indexing or How I Learned to Stop Worrying and Love Math I Don’t Understand


Hybrid Approach

• Merge statistical analysis (LSI) with symbolic analysis (MindNet)

• Use LSI term similarity metric to assign strengths to MindNet relations

• Incorporate syntactic information
– Preprocess documents, adding POS or attachment information to words
– Time-N Flies-V Like-AVP An-Det Arrow-N