ISP 433/633 Week 10 Vocabulary Problem & Latent Semantic Indexing Partly based on G.Furnas SI503 slides

Post on 20-Dec-2015

ISP 433/633 Week 10

Vocabulary Problem & Latent Semantic Indexing

Partly based on G.Furnas SI503 slides

Synonymy

• Same meaning, different words

• Access, retrieval, look-up

Polysemy

• Same word, different meaning

• E.g., bank
  – River side
  – Financial institution

Exercise 1

• Get out a piece of paper

• List exemplars (members) of the category for 30 sec

• Category is...

Category is...

Flowers

Extended Free Recall is not Easy

Brain is NOT built that way!

Consequences:

• Trouble generating synonyms for query terms

• Recall problem or Precision problem?

Exercise 2

• On a piece of paper, write the name you would give to a Web site that tells about interesting activities occurring in the Albany area
  – E.g., this site would tell you what is interesting to do on Friday or Saturday night
  – Make the site name 20 characters or less.

Lexical Variability

• Very low probability that two people's terms match, in almost every circumstance examined (p = .06 to .18)
• Severe consequences for lexically based access to information or functionality
  – Performance problems with lexically based IR
    • I.e., querying
    • Precision problem or recall problem?
• Note: The generation problem (a few slides back) leads you to underestimate the lexical variability problem: you can’t generate many of the alternatives, so you think the variability is lower than it is.
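
The match probability can be estimated from answers like those collected in Exercise 2. A minimal sketch, assuming "match" means that two people wrote exactly the same name (ignoring case); the function name and the sample answers are hypothetical and are not data from the slides:

```python
from itertools import combinations


def pairwise_match_probability(names):
    """Fraction of distinct pairs of people who chose the same name."""
    normalized = [name.strip().lower() for name in names]
    pairs = list(combinations(normalized, 2))
    if not pairs:
        raise ValueError("need at least two names")
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)


# Hypothetical answers from six people (illustrative only).
answers = [
    "Albany Events",
    "Whats Up Albany",
    "Albany Nightlife",
    "albany events",
    "Capital Fun",
    "Albany Tonight",
]
probability = pairwise_match_probability(answers)
print(f"{probability:.3f}")  # 1 matching pair out of 15
```

Six people give 15 pairs; only one pair agrees, so most lookups based on one person's wording would miss what another person wrote.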

Solutions for Lexical Variability Problem

• Direct Manipulation
  – Windows, Icons, Menus, Pointers (WIMP)
  – Recognition rather than recall
  – But works for only a limited number of items before navigation gets to be an issue
• Controlled Vocabulary
  – Adaptive burden (learning) on users
• Adaptive Indexing
  – Adaptive burden on the system
• Semantic Indexing
  – E.g., NLP-based retrieval, Latent Semantic Indexing

Adaptive Index

An index that learns from its mistakes
• Adds links for words it used to miss on
• Orders results by popularity for the given query

Comments:
• Learns about the most needed words first
• In the natural context of their use
• Success requires a sufficient density of usage across the population...
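
A minimal sketch of these two behaviours, assuming the system can observe which item a user finally settled on after typing a term. The class and method names are hypothetical, not taken from the slides:

```python
from collections import defaultdict


class AdaptiveIndex:
    """Toy index that adds term -> target links as users succeed."""

    def __init__(self):
        # term -> target -> number of users who reached the target via the term
        self._links = defaultdict(lambda: defaultdict(int))

    def lookup(self, term):
        """Return known targets for a term, most popular first."""
        targets = self._links.get(term.lower(), {})
        return sorted(targets, key=lambda target: (-targets[target], target))

    def record_success(self, term, target):
        """Call when a user who typed `term` ended up choosing `target`."""
        self._links[term.lower()][target] += 1


index = AdaptiveIndex()
first_try = index.lookup("erase")  # the index has never seen this word

# Three users who typed "erase" eventually found what they wanted.
index.record_success("erase", "delete_file")
index.record_success("erase", "clear_screen")
index.record_success("erase", "delete_file")

later_try = index.lookup("erase")
print(first_try, later_try)
```

The first lookup misses; after enough users have passed through, the same word leads to the targets ordered by how often they were chosen. With too few users, most words never get a link, which is the density-of-usage caveat above.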

Recall Vector Space Model

• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• Q1 = “information, retrieval”

Term dimensions: (computer, information, retrieval)

D1 = (1, 1, 1)
D2 = (1, 0, 1)
Q1 = (0, 1, 1)
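
To rank D1 and D2 against Q1, the vectors need a similarity measure. A minimal sketch, assuming cosine similarity (a common choice in the vector space model; the slide itself does not name the measure):

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Dimensions: (computer, information, retrieval)
d1 = (1, 1, 1)
d2 = (1, 0, 1)
q1 = (0, 1, 1)

sim_d1 = cosine(d1, q1)  # 2 / (sqrt(3) * sqrt(2)), about 0.816
sim_d2 = cosine(d2, q1)  # 1 / (sqrt(2) * sqrt(2)), about 0.5
print(f"D1: {sim_d1:.3f}  D2: {sim_d2:.3f}")
```

D1 shares both query terms and ranks above D2, which shares only "retrieval".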

Problem with Term Space

• Indexing terms contain only a fraction of the terms that users may use in a query
  – Synonymy
  – Hard for users to come up with alternative terms
  – Indexing terms may be quite different from users’ terms
    • High lexical variability
  – Precision problem or recall problem?

Problem with Term Space (cont.)

• Terms are associated with unrelated documents
  – Polysemy
  – One solution is Controlled Vocabulary
    • Bank -> financial institution
    • Expensive and restrictive
  – Another solution is to add more terms to a Boolean query
    • Bank AND finance
    • Hard to come up with more terms
    • Terms may not be in the index
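
The trade-off of the Boolean fix can be seen on a tiny collection. A minimal sketch, assuming whitespace tokenization and a plain inverted index; the documents and function names are hypothetical:

```python
def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents that contain every one of the given terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical collection: two financial documents, one about a river.
docs = {
    "d1": "bank raises mortgage rates finance news",
    "d2": "fishing from the river bank",
    "d3": "bank approves loan for small business",
}
index = build_inverted_index(docs)

only_bank = boolean_and(index, "bank")
bank_and_finance = boolean_and(index, "bank", "finance")
bank_and_lender = boolean_and(index, "bank", "lender")
print(sorted(only_bank), sorted(bank_and_finance), sorted(bank_and_lender))
```

"bank" alone pulls in the river document (polysemy). "bank AND finance" removes it but also drops d3, a financial document that never uses the word "finance". "lender" is not in the index at all, so that query returns nothing.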

Problem with Term Space (cont.)

• Terms are considered independent from each other (orthogonal term dimensions)
  – Not the case
  – Too many dimensions

Semantic Space

• Space of meanings
• Both terms and documents are data points
• Reduced number of dimensions
• Avoid Synonymy problem
• Avoid Polysemy problem

[Figure: axes Meaning 1, Meaning 2, Meaning 3, with points T1 = (1, 1, 1), Q1 = (0, 1, 1), D2 = (1, 0, 1)]

How?

• Meanings are hidden (latent semantics)
• Take advantage of dependency among terms
  – Occurrence of some patterns of words gives a clue as to the likely occurrence of others
    • “Bank, mortgage” co-present -> occurrence of “Finance”
  – These correlated words are close to each other in semantic space
  – Semantically unrelated words are far away
    • “River” will be far away, so the meaning of “Bank” will be quite clear
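
The dependency among terms shows up directly in co-occurrence counts. A minimal sketch, assuming each term is described by the documents it appears in and compared with cosine similarity; this only illustrates the raw correlations that a semantic space builds on, it is not LSI itself, and the documents are hypothetical:

```python
import math


def term_vectors(docs):
    """Map each term to a 0/1 vector with one entry per document."""
    tokenized = [set(text.lower().split()) for text in docs]
    vocabulary = sorted(set().union(*tokenized))
    return {
        term: [1 if term in tokens else 0 for tokens in tokenized]
        for term in vocabulary
    }


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical documents (illustrative only).
docs = [
    "bank mortgage finance",
    "bank mortgage finance loan",
    "bank river water",
    "river water fishing",
]
vectors = term_vectors(docs)

mortgage_finance = cosine(vectors["mortgage"], vectors["finance"])
mortgage_river = cosine(vectors["mortgage"], vectors["river"])
bank_river = cosine(vectors["bank"], vectors["river"])
print(f"{mortgage_finance:.2f} {mortgage_river:.2f} {bank_river:.2f}")
```

"mortgage" and "finance" always appear together, so they point in the same direction; "mortgage" and "river" never co-occur, so they are far apart. "bank" sits between the two groups, and the words around it indicate which sense is meant.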

Original Term Document Matrix

Latent Semantic Indexing (LSI)

• Transform the term-document matrix
• To get some orthogonal factors
  – Using SVD (singular value decomposition)
  – Uncover underlying independent meanings
• Map terms, query and documents to this factor space
• Do the rest as in the vector space model
  – Compute similarity between query and document
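
These steps can be sketched end to end on a toy matrix. The sketch below makes several assumptions that are not in the slides: the term-document matrix is hypothetical; the top factors are approximated with power iteration on A Aᵀ as a dependency-free stand-in for a library SVD routine; and queries and documents are both mapped by projecting onto the top-k left singular vectors, which is one of several conventions in use.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def top_factors(A, k, iterations=500):
    """Approximate the top-k left singular vectors of A (terms x documents)
    by power iteration with deflation on the symmetric matrix A A^T.
    Assumes k does not exceed the rank of A."""
    n_terms = len(A)
    gram = [[dot(A[i], A[j]) for j in range(n_terms)] for i in range(n_terms)]
    factors = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(n_terms)]  # fixed, arbitrary start
        for _step in range(iterations):
            w = [dot(row, v) for row in gram]
            for u in factors:  # remove directions already found
                overlap = dot(w, u)
                w = [a - overlap * b for a, b in zip(w, u)]
            length = math.sqrt(dot(w, w))
            v = [a / length for a in w]
        factors.append(v)
    return factors


def to_factor_space(factors, term_vector):
    """Coordinates of a term-space vector along each factor."""
    return [dot(u, term_vector) for u in factors]


# Hypothetical matrix. Rows: car, automobile, engine, flower, garden.
# Columns: "car engine", "automobile engine", "flower garden", "flower garden".
A = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
documents = [[row[j] for row in A] for j in range(len(A[0]))]
query = [1, 0, 0, 0, 0]  # the single word "car"

factors = top_factors(A, k=2)
query_k = to_factor_space(factors, query)
documents_k = [to_factor_space(factors, d) for d in documents]

term_scores = [cosine(query, d) for d in documents]
lsi_scores = [cosine(query_k, d) for d in documents_k]
print([round(s, 2) for s in term_scores])
print([round(s, 2) for s in lsi_scores])
```

In term space the query "car" scores zero against the document "automobile engine", since they share no word. In the two-factor space the two land in nearly the same direction, because "car" and "automobile" both co-occur with "engine", while the flower documents stay unrelated.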

Semantic Space

Results

Administrivia

• Volunteer to pick up, distribute, collect and return teaching evaluation forms?