
Page 1: CS 572: Information Retrieval

CS 572: Information Retrieval

Lecture 9: Language Models for IR (cont’d)

Acknowledgments: Some slides in this lecture were adapted

from Chris Manning (Stanford) and Jin Kim (UMass’12)

2/10/2016 CS 572: Information Retrieval. Spring 2016

Page 2: CS 572: Information Retrieval


New: IR based on Language Model (LM)

[Figure: an information need is expressed as a query; each document d1, d2, …, dn in the collection induces a document model Md1, Md2, …, Mdn, and documents are scored by the generation probability P(Q | Md).]

• A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!

• The LM approach directly exploits that idea!

Page 3: CS 572: Information Retrieval

Probabilistic Language Modeling

• Goal: compute the probability of a document, a sentence, or sequence of words:

P(W) = P(w1,w2,w3,w4,w5…wn)

• Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4)

• A model that computes either of these:

P(W) or P(wn|w1,w2…wn-1) is called a language model.

• A better name might be “the grammar”, but “language model” or LM is standard

Page 4: CS 572: Information Retrieval

Evaluation: How good is our model?

• Does our language model prefer good sentences to bad ones?
– It should assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences
• We train the parameters of our model on a training set
• We test the model’s performance on data we haven’t seen
– A test set is an unseen dataset that is different from our training set, totally unused
– An evaluation metric tells us how well our model does on the test set

Page 5: CS 572: Information Retrieval

Training on the test set

• We can’t allow test sentences into the training set

• Otherwise we will assign them an artificially high probability when we see them in the test set

• “Training on the test set”

• Bad science!


Page 6: CS 572: Information Retrieval

Extrinsic evaluation of N-gram models

• Best evaluation for comparing models A and B

– Put each model in a task

• spelling corrector, speech recognizer, IR system

– Run the task, get an accuracy for A and for B

• How many misspelled words corrected properly

• How many relevant/non-relevant docs retrieved

– Compare accuracy for A and B

• Problematic!

– Time consuming (re-index docs/re-run search/user study) – can take days or weeks

– Difficult to pinpoint problems in complex system/task

Page 7: CS 572: Information Retrieval

Intrinsic Evaluation: Perplexity

• Perplexity is an intrinsic measure: only a rough proxy for task performance
– A bad approximation unless the test data looks just like the training data
– So generally only useful in pilot experiments
– But it is helpful to think about

Page 8: CS 572: Information Retrieval

Intrinsic Evaluation: Perplexity

• The Shannon Game:
– How well can we predict the next word?
– Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____

Candidate continuations for the first blank:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
…
fried rice 0.0001
…
and 1e-100

Page 9: CS 572: Information Retrieval

Perplexity (formal definition)

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

Chain rule:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

For bigrams:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

Minimizing perplexity is the same as maximizing probability.

The best language model is one that best predicts an unseen test set:
• it gives the highest P(sentence)

Page 10: CS 572: Information Retrieval

Perplexity as branching factor

• Suppose we have a sentence consisting of random digits

• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
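Working this example out (not spelled out on the slide): for a string of N digits, each with probability 1/10,

PP(W) = \left( \left(\tfrac{1}{10}\right)^{N} \right)^{-1/N} = 10

so the perplexity equals the branching factor of 10 equally likely next symbols.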

Page 11: CS 572: Information Retrieval

Lower perplexity = better model

• Training 38 million words, test 1.5 million words, WSJ

N-gram order:   Unigram   Bigram   Trigram
Perplexity:       962       170      109

Page 12: CS 572: Information Retrieval

The perils of overfitting

• N-grams only work well for word prediction if the test corpus looks like the training corpus

– In real life, it often doesn’t

– We need to train robust models that generalize!

– One kind of generalization: Zeros!

• Things that don’t ever occur in the training set

–But occur in the test set

Page 13: CS 572: Information Retrieval

Summary: Discounts for Smoothing


Page 14: CS 572: Information Retrieval

Smoothing: Interpolation


Page 15: CS 572: Information Retrieval


Smoothing: Basic Interpolation Model

• General formulation of the LM for IR

– The user has a document in mind, and generates the query from this document.

– The equation represents the probability that the document that the user had in mind was in fact this one.

p(Q, d) = p(d)\,\prod_{t \in Q}\big( (1-\lambda)\, p(t) + \lambda\, p(t \mid M_d) \big)

where p(t) is the general (collection) language model and p(t | M_d) is the individual-document model.
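To make the mixture concrete, here is a minimal Python sketch of query-likelihood scoring with this interpolation (a sketch under stated assumptions: function and variable names are illustrative, term statistics are assumed to be available as plain dictionaries, and p(d) is taken as uniform and dropped):

import math
from collections import Counter

def interpolated_score(query_terms, doc_terms, collection_counts, collection_size, lam=0.5):
    # log of prod_t [ (1 - lam) * p(t) + lam * p(t | M_d) ]
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_general = collection_counts.get(t, 0) / collection_size   # general LM
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0          # document LM
        p_mix = (1 - lam) * p_general + lam * p_doc
        if p_mix == 0.0:
            return float("-inf")                                     # term unseen everywhere
        score += math.log(p_mix)
    return score

Documents are then ranked by this score for a given query; because the collection model gives almost every query term nonzero probability, the mixture avoids the zero-probability problem of the unsmoothed document model.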

Page 16: CS 572: Information Retrieval

Jelinek-Mercer Smoothing

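For reference, the standard Jelinek-Mercer estimate is exactly this fixed-coefficient interpolation, written per word as (following the λ convention of the previous slide; some texts swap the roles of λ and 1-λ):

p_\lambda(w \mid d) = \lambda\,\frac{c(w,d)}{|d|} + (1-\lambda)\, p(w \mid C)

where c(w, d) is the count of w in d, |d| is the document length, and p(w | C) is the collection language model.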

Page 17: CS 572: Information Retrieval

Dirichlet Smoothing

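For reference, the standard Dirichlet-prior (Bayesian) smoothing estimate, with pseudo-count parameter μ and collection model p(w | C), is:

p_\mu(w \mid d) = \frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu}

Longer documents are smoothed less: the collection model's effective weight, μ / (|d| + μ), shrinks as |d| grows.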

Page 18: CS 572: Information Retrieval

How to set the lambdas?

• Use a held-out corpus

• Choose λs to maximize the probability of held-out data:
– Fix the N-gram probabilities (on the training data)
– Then search for the λs that give the largest probability to the held-out set (equivalently, the lowest perplexity on the held-out set)

[Data split: Training Data | Held-Out Data | Test Data]
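A minimal sketch of this tuning loop in Python, assuming a bigram/unigram interpolation and illustrative names (p_bigram and p_unigram are fixed maximum-likelihood estimates from the training data):

import math

def heldout_logprob(lam, heldout_bigrams, p_bigram, p_unigram):
    # log-probability of held-out (w_prev, w) pairs under lam*P(w|w_prev) + (1-lam)*P(w)
    total = 0.0
    for w_prev, w in heldout_bigrams:
        p = lam * p_bigram.get((w_prev, w), 0.0) + (1 - lam) * p_unigram.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

def choose_lambda(heldout_bigrams, p_bigram, p_unigram,
                  candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # only the mixture weight is searched; the N-gram estimates stay fixed
    return max(candidates,
               key=lambda lam: heldout_logprob(lam, heldout_bigrams, p_bigram, p_unigram))

A finer grid (or EM on the held-out data) can be used in place of the small candidate list.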

Page 19: CS 572: Information Retrieval

Huge web-scale n-grams

• How to deal with, e.g., Google N-gram corpus

• Pruning
– Only store N-grams with count > threshold
• Remove singletons of higher-order n-grams
– Entropy-based pruning
• Efficiency
– Efficient data structures like tries
– Bloom filters: approximate language models
– Store words as indexes, not strings
• Use Huffman coding to fit large numbers of words into two bytes
– Quantize probabilities (4–8 bits instead of 8-byte float)

Page 20: CS 572: Information Retrieval

Smoothing for Web-scale N-grams

• “Stupid backoff” (Brants et al. 2007)

• No discounting, just use relative frequencies


S(w_i \mid w_{i-k+1}^{\,i-1}) =
\begin{cases}
\dfrac{\mathrm{count}(w_{i-k+1}^{\,i})}{\mathrm{count}(w_{i-k+1}^{\,i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{\,i}) > 0 \\
0.4 \cdot S(w_i \mid w_{i-k+2}^{\,i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{\mathrm{count}(w_i)}{N}
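A minimal Python sketch of stupid backoff over raw n-gram counts (names are illustrative; counts maps word tuples to counts and counts[()] holds the total number of tokens N):

def stupid_backoff(context, word, counts, alpha=0.4):
    # S(word | context): relative frequency if the full n-gram was seen,
    # otherwise back off to a shorter context and multiply by alpha
    context = tuple(context)
    if not context:
        return counts.get((word,), 0) / counts.get((), 1)   # S(w) = count(w) / N
    numerator = counts.get(context + (word,), 0)
    denominator = counts.get(context, 0)
    if numerator > 0 and denominator > 0:
        return numerator / denominator                      # no discounting
    return alpha * stupid_backoff(context[1:], word, counts, alpha)

Note that S is a score, not a probability: the values for a given context do not sum to 1, which is exactly the simplification that makes this cheap at web scale.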

Page 21: CS 572: Information Retrieval

N-gram Smoothing Summary

• Add-1 smoothing:

– OK for text categorization, not for language modeling

• The most commonly used method in NLP:

– Extended Interpolated Kneser-Ney (see textbook)

• For very large N-grams like the Web:

– Stupid backoff

• For IR: variants of interpolation, discriminative models (choose λ to maximize retrieval metrics, not perplexity)


Page 22: CS 572: Information Retrieval

Language Modeling Toolkits

• SRILM

– http://www.speech.sri.com/projects/srilm/

• KenLM

– https://kheafield.com/code/kenlm/

Page 23: CS 572: Information Retrieval

Google N-Gram Release, August 2006

Page 24: CS 572: Information Retrieval

Google N-Gram Release

• serve as the incoming 92

• serve as the incubator 99

• serve as the independent 794

• serve as the index 223

• serve as the indication 72

• serve as the indicator 120

• serve as the indicators 45

• serve as the indispensable 111

• serve as the indispensible 40

• serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Page 25: CS 572: Information Retrieval

Google Book N-grams

• http://ngrams.googlelabs.com/

Page 26: CS 572: Information Retrieval

Higher Order LMs for IR


Page 27: CS 572: Information Retrieval

Models of Text Generation


Page 28: CS 572: Information Retrieval

Ranking with Language Models


Page 29: CS 572: Information Retrieval

Ranking with LMs: Main Components

• Query probability: what is the probability of generating the given query from a language model?

• Document probability: what is the probability of generating the given document from a language model?

• Model comparison: how “close” are two language models?


Page 30: CS 572: Information Retrieval

Ranking Using LMs: Multinomial


Page 31: CS 572: Information Retrieval

Ranking with LMs: Multi-Bernoulli


Page 32: CS 572: Information Retrieval

Score: Query Likelihood

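For reference, the standard query-likelihood score ranks documents by the probability of generating the query from a smoothed document model θ_d (e.g., the Jelinek-Mercer or Dirichlet estimates above), usually computed in log space:

\log p(Q \mid \theta_d) = \sum_{w \in Q} c(w, Q)\, \log p(w \mid \theta_d)

where c(w, Q) is the count of w in the query.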

Page 33: CS 572: Information Retrieval

Score 2: Document Likelihood


Page 34: CS 572: Information Retrieval

Score: Likelihood ratio (odds)


Page 35: CS 572: Information Retrieval

Score: Model Comparison


Page 36: CS 572: Information Retrieval

• Relative entropy between the two distributions

• Cost in bits of coding using Q when true distribution is P

Kullback-Leibler Divergence

D_{KL}(P \,\|\, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)} = \sum_i P(i)\big(\log P(i) - \log Q(i)\big)

H(P) = -\sum_i P(i)\,\log P(i)


Page 37: CS 572: Information Retrieval

Kullback-Leibler Divergence

D_{KL}(P \,\|\, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)}

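A minimal Python sketch of KL divergence used for ranking by model comparison (names are illustrative; the second model is assumed to be smoothed so it is nonzero wherever the first is):

import math

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) for distributions given as word -> probability dicts
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def kl_score(query_model, doc_model):
    # rank documents by -D_KL(query || document): smaller divergence = better match
    return -kl_divergence(query_model, doc_model)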

Page 38: CS 572: Information Retrieval

Two-stage Smoothing [Zhai & Lafferty 02]

2008 © ChengXiang Zhai

p(w \mid d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)

Stage 1 (Dirichlet prior, Bayesian): the μ term explains unseen words using the collection LM p(w | C).

Stage 2 (two-component mixture): the λ term explains noise in the query using a user background model p(w | U), which can be approximated by p(w | C).

Page 39: CS 572: Information Retrieval

Structured Document Retrieval[Ogilvie & Callan 03]

2008 © ChengXiang Zhai

Q = q_1 q_2 \ldots q_m

p(Q \mid D, R=1) = \prod_{i=1}^{m} p(q_i \mid D, R=1)

p(q_i \mid D, R=1) = \sum_{j=1}^{k} s(D_j \mid D, R=1)\; p(q_i \mid D_j, R=1)

[Figure: a document D divided into parts D1 (Title), D2 (Abstract), D3, …, Dk (body parts such as Body-Part1, Body-Part2).]

– Want to combine different parts of a document with appropriate weights
– Anchor text can be treated as a “part” of a document
– Applicable to XML retrieval
– Generation process: select a part Dj, then generate a query word using Dj
– The “part selection” probability s(Dj | D, R=1) serves as the weight for Dj and can be trained using EM
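A minimal Python sketch of this part-mixture query likelihood (illustrative names; each part model is a word -> probability dict and the weights sum to 1):

import math

def part_mixture_query_loglik(query_terms, part_models, part_weights):
    # log p(Q | D, R=1) = sum over query words of log sum_j s(D_j) * p(q | D_j)
    score = 0.0
    for q in query_terms:
        p = sum(w * pm.get(q, 0.0) for pm, w in zip(part_models, part_weights))
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

In practice each part model would itself be smoothed, and the weights s(Dj | D, R=1) estimated by EM as noted above.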

Page 40: CS 572: Information Retrieval

LMs for IR: Rules of Thumb


Page 41: CS 572: Information Retrieval


LMs vs. vector space model (1)

LMs have some things in common with vector space models.

Term frequency is directly in the model, but it is not scaled in LMs.

Probabilities are inherently “length-normalized”. Cosine normalization does something similar for vector space.

Mixing document and collection frequencies has an effect similar to idf.

Terms that are rare in the general collection, but common in some documents, will have a greater influence on the ranking.

Page 42: CS 572: Information Retrieval


LMs vs. vector space model (2)

LMs vs. vector space model: commonalities
Term frequency is directly in the model.
Probabilities are inherently “length-normalized”.
Mixing document and collection frequencies has an effect similar to idf.

LMs vs. vector space model: differences
LMs: based on probability theory.
Vector space: based on similarity, a geometric / linear algebra notion.
Collection frequency vs. document frequency.
Details of term frequency, length normalization, etc.

Page 43: CS 572: Information Retrieval


Vector space (tf-idf) vs. LM

The language modeling approach always does better in these experiments… but note that where the approach shows significant gains is at higher levels of recall.

Page 44: CS 572: Information Retrieval


LM vs. Prob. Model for IR

• The main difference is whether “relevance” figures explicitly in the model or not
– The LM approach attempts to do away with modeling relevance
• The LM approach assumes that documents and expressions of information problems are of the same type
• Computationally tractable, intuitively appealing

Page 45: CS 572: Information Retrieval


LM vs. Prob. Model for IR

• Problems of the basic LM approach
– The assumption of equivalence between document and information-problem representation is unrealistic
– Very simple models of language
– Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance
– Can’t easily accommodate phrases, passages, Boolean operators
• Current extensions focus on putting relevance back into the model, etc.

Page 46: CS 572: Information Retrieval

Ambiguity makes queries difficult

A query such as “AA”: American Airlines? Or Alcoholics Anonymous?


Page 47: CS 572: Information Retrieval

Query Clarity

• Clarity score ~ low ambiguity
• Cronen-Townsend et al., SIGIR 2002
• Compare a language model
– over the relevant documents for a query
– over all possible documents
• The more different these are, the clearer the query is
• “programming perl” vs. “the”


Page 48: CS 572: Information Retrieval

Clarity score

\text{Clarity score} = \sum_{w \in V} P(w \mid Q)\, \log_2 \frac{P(w \mid Q)}{P_{\text{coll}}(w)}

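This is the KL divergence, in bits, between the query language model and the collection language model. A minimal, self-contained Python sketch (the query and collection models are assumed to be word -> probability dicts; names are illustrative):

import math

def clarity(query_model, collection_model, eps=1e-12):
    # KL divergence in bits between the query LM and the collection LM;
    # higher clarity = the query LM is far from the collection LM = less ambiguous query
    return sum(pw * math.log2(pw / max(collection_model.get(w, 0.0), eps))
               for w, pw in query_model.items() if pw > 0)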

Page 49: CS 572: Information Retrieval

2008 © ChengXiang Zhai

Predicting Query Difficulty [Cronen-Townsend et al. 02]

• Observations:
– Discriminative queries tend to be easier
– Comparison of the query model and the collection model can indicate how discriminative a query is
• Method:
– Define “query clarity” as the KL-divergence between an estimated query model (or relevance model) and the collection LM
– An enriched query LM can be estimated by exploiting pseudo feedback (e.g., a relevance model)
• Correlation between the clarity scores and retrieval performance

\text{clarity}(Q) = \sum_{w} p(w \mid \theta_Q)\, \log \frac{p(w \mid \theta_Q)}{p(w \mid \text{Collection})}

Page 50: CS 572: Information Retrieval

Clarity scores on TREC-7 collection


Page 51: CS 572: Information Retrieval

Can use many more features

• http://www.slideshare.net/DavidCarmel/sigir12-tutorial-query-perfromance-prediction-for-ir
