
Page 1: CS 572: Information Retrieval

CS 572: Information Retrieval

Lecture 9: Language Models for IR (cont’d)

Acknowledgments: Some slides in this lecture were adapted

from Chris Manning (Stanford) and Jin Kim (UMass’12)

2/10/2016 CS 572: Information Retrieval. Spring 2016

Page 2: CS 572: Information Retrieval


New: IR based on Language Model (LM)

[Figure: an information need is expressed as a query; each document d1, d2, …, dn in the collection induces a document model Md1, Md2, …, Mdn, and documents are scored by the generation probability P(Q | Md).]

• A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!

• The LM approach directly exploits that idea!

Page 3: CS 572: Information Retrieval

Probabilistic Language Modeling

• Goal: compute the probability of a document, a sentence, or sequence of words:

P(W) = P(w1,w2,w3,w4,w5…wn)

• Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4)

• A model that computes either of these:

P(W) or P(wn|w1,w2…wn-1) is called a language model.

• A better name might be “the grammar”, but “language model” or LM is standard

Page 4: CS 572: Information Retrieval

Evaluation: How good is our model?

• Does our language model prefer good sentences to bad ones?
– It should assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences
• We train the parameters of our model on a training set
• We test the model’s performance on data we haven’t seen
– A test set is an unseen dataset that is different from our training set, totally unused
– An evaluation metric tells us how well our model does on the test set

Page 5: CS 572: Information Retrieval

Training on the test set

• We can’t allow test sentences into the training set

• Otherwise we will assign them an artificially high probability when we see them in the test set

• “Training on the test set”

• Bad science!


Page 6: CS 572: Information Retrieval

Extrinsic evaluation of N-gram models

• Best evaluation for comparing models A and B

– Put each model in a task

• spelling corrector, speech recognizer, IR system

– Run the task, get an accuracy for A and for B

• How many misspelled words corrected properly

• How many relevant/non-relevant docs retrieved

– Compare accuracy for A and B

• Problematic!

– Time consuming (re-index docs/re-run search/user study) – can take days or weeks

– Difficult to pinpoint problems in complex system/task

Page 7: CS 572: Information Retrieval

Intrinsic Evaluation: Perplexity

• Perplexity is an intrinsic measure: only a rough proxy for task performance
– A bad approximation unless the test data looks just like the training data
– So generally only useful in pilot experiments
– But it is helpful to think about

Page 8: CS 572: Information Retrieval

Intrinsic Evaluation: Perplexity

• The Shannon Game:
– How well can we predict the next word?
– Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____

Candidate continuations for the first blank:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
…
fried rice 0.0001
…
and 1e-100

Page 9: CS 572: Information Retrieval

Perplexity (formal definition)

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

Chain rule:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

For bigrams:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

Minimizing perplexity is the same as maximizing probability.

The best language model is one that best predicts an unseen test set:
• it gives the highest P(sentence)

Page 10: CS 572: Information Retrieval

Perplexity as branching factor

• Suppose we have a sentence consisting of random digits

• What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
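Working this example out (not spelled out on the slide): for a string of N digits, each with probability 1/10,

PP(W) = \left( \left(\tfrac{1}{10}\right)^{N} \right)^{-1/N} = 10

so the perplexity equals the branching factor of 10 equally likely next symbols.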

Page 11: CS 572: Information Retrieval

Lower perplexity = better model

• Training 38 million words, test 1.5 million words, WSJ

N-gram order:   Unigram   Bigram   Trigram
Perplexity:       962       170      109

Page 12: CS 572: Information Retrieval

The perils of overfitting

• N-grams only work well for word prediction if the test corpus looks like the training corpus

– In real life, it often doesn’t

– We need to train robust models that generalize!

– One kind of generalization: Zeros!

• Things that don’t ever occur in the training set

–But occur in the test set

Page 13: CS 572: Information Retrieval

Summary: Discounts for Smoothing


Page 14: CS 572: Information Retrieval

Smoothing: Interpolation


Page 15: CS 572: Information Retrieval


Smoothing: Basic Interpolation Model

• General formulation of the LM for IR

– The user has a document in mind, and generates the query from this document.

– The equation represents the probability that the document that the user had in mind was in fact this one.

p(Q, d) = p(d)\,\prod_{t \in Q}\big( (1-\lambda)\, p(t) + \lambda\, p(t \mid M_d) \big)

where p(t) is the general (collection) language model and p(t | M_d) is the individual-document model.
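To make the mixture concrete, here is a minimal Python sketch of query-likelihood scoring with this interpolation (a sketch under stated assumptions: function and variable names are illustrative, term statistics are assumed to be available as plain dictionaries, and p(d) is taken as uniform and dropped):

import math
from collections import Counter

def interpolated_score(query_terms, doc_terms, collection_counts, collection_size, lam=0.5):
    # log of prod_t [ (1 - lam) * p(t) + lam * p(t | M_d) ]
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_general = collection_counts.get(t, 0) / collection_size   # general LM
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0          # document LM
        p_mix = (1 - lam) * p_general + lam * p_doc
        if p_mix == 0.0:
            return float("-inf")                                     # term unseen everywhere
        score += math.log(p_mix)
    return score

Documents are then ranked by this score for a given query; because the collection model gives almost every query term nonzero probability, the mixture avoids the zero-probability problem of the unsmoothed document model.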

Page 16: CS 572: Information Retrieval

Jelinek-Mercer Smoothing

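For reference, the standard Jelinek-Mercer estimate is exactly this fixed-coefficient interpolation, written per word as (following the λ convention of the previous slide; some texts swap the roles of λ and 1-λ):

p_\lambda(w \mid d) = \lambda\,\frac{c(w,d)}{|d|} + (1-\lambda)\, p(w \mid C)

where c(w, d) is the count of w in d, |d| is the document length, and p(w | C) is the collection language model.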

Page 17: CS 572: Information Retrieval

Dirichlet Smoothing

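For reference, the standard Dirichlet-prior (Bayesian) smoothing estimate, with pseudo-count parameter μ and collection model p(w | C), is:

p_\mu(w \mid d) = \frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu}

Longer documents are smoothed less: the collection model's effective weight, μ / (|d| + μ), shrinks as |d| grows.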

Page 18: CS 572: Information Retrieval

How to set the lambdas?

• Use a held-out corpus

• Choose λs to maximize the probability of held-out data:
– Fix the N-gram probabilities (on the training data)
– Then search for the λs that give the largest probability to the held-out set (equivalently, the lowest perplexity on the held-out set)

[Data split: Training Data | Held-Out Data | Test Data]
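A minimal sketch of this tuning loop in Python, assuming a bigram/unigram interpolation and illustrative names (p_bigram and p_unigram are fixed maximum-likelihood estimates from the training data):

import math

def heldout_logprob(lam, heldout_bigrams, p_bigram, p_unigram):
    # log-probability of held-out (w_prev, w) pairs under lam*P(w|w_prev) + (1-lam)*P(w)
    total = 0.0
    for w_prev, w in heldout_bigrams:
        p = lam * p_bigram.get((w_prev, w), 0.0) + (1 - lam) * p_unigram.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

def choose_lambda(heldout_bigrams, p_bigram, p_unigram,
                  candidates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    # only the mixture weight is searched; the N-gram estimates stay fixed
    return max(candidates,
               key=lambda lam: heldout_logprob(lam, heldout_bigrams, p_bigram, p_unigram))

A finer grid (or EM on the held-out data) can be used in place of the small candidate list.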

Page 19: CS 572: Information Retrieval

Huge web-scale n-grams

• How to deal with, e.g., Google N-gram corpus

• Pruning
– Only store N-grams with count > threshold
• Remove singletons of higher-order n-grams
– Entropy-based pruning
• Efficiency
– Efficient data structures like tries
– Bloom filters: approximate language models
– Store words as indexes, not strings
• Use Huffman coding to fit large numbers of words into two bytes
– Quantize probabilities (4–8 bits instead of 8-byte float)

Page 20: CS 572: Information Retrieval

Smoothing for Web-scale N-grams

• “Stupid backoff” (Brants et al. 2007)

• No discounting, just use relative frequencies


S(w_i \mid w_{i-k+1}^{\,i-1}) =
\begin{cases}
\dfrac{\mathrm{count}(w_{i-k+1}^{\,i})}{\mathrm{count}(w_{i-k+1}^{\,i-1})} & \text{if } \mathrm{count}(w_{i-k+1}^{\,i}) > 0 \\
0.4 \cdot S(w_i \mid w_{i-k+2}^{\,i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{\mathrm{count}(w_i)}{N}
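A minimal Python sketch of stupid backoff over raw n-gram counts (names are illustrative; counts maps word tuples to counts and counts[()] holds the total number of tokens N):

def stupid_backoff(context, word, counts, alpha=0.4):
    # S(word | context): relative frequency if the full n-gram was seen,
    # otherwise back off to a shorter context and multiply by alpha
    context = tuple(context)
    if not context:
        return counts.get((word,), 0) / counts.get((), 1)   # S(w) = count(w) / N
    numerator = counts.get(context + (word,), 0)
    denominator = counts.get(context, 0)
    if numerator > 0 and denominator > 0:
        return numerator / denominator                      # no discounting
    return alpha * stupid_backoff(context[1:], word, counts, alpha)

Note that S is a score, not a probability: the values for a given context do not sum to 1, which is exactly the simplification that makes this cheap at web scale.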

Page 21: CS 572: Information Retrieval

N-gram Smoothing Summary

• Add-1 smoothing:

– OK for text categorization, not for language modeling

• The most commonly used method in NLP:

– Extended Interpolated Kneser-Ney (see textbook)

• For very large N-grams like the Web:

– Stupid backoff

• For IR: variants of interpolation, discriminative models (choose λ to maximize retrieval metrics, not perplexity)


Page 22: CS 572: Information Retrieval

Language Modeling Toolkits

• SRILM

– http://www.speech.sri.com/projects/srilm/

• KenLM

– https://kheafield.com/code/kenlm/

Page 23: CS 572: Information Retrieval

Google N-Gram Release, August 2006

Page 24: CS 572: Information Retrieval

Google N-Gram Release

• serve as the incoming 92

• serve as the incubator 99

• serve as the independent 794

• serve as the index 223

• serve as the indication 72

• serve as the indicator 120

• serve as the indicators 45

• serve as the indispensable 111

• serve as the indispensible 40

• serve as the individual 234

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Page 25: CS 572: Information Retrieval

Google Book N-grams

• http://ngrams.googlelabs.com/

Page 26: CS 572: Information Retrieval

Higher Order LMs for IR


Page 27: CS 572: Information Retrieval

Models of Text Generation


Page 28: CS 572: Information Retrieval

Ranking with Language Models


Page 29: CS 572: Information Retrieval

Ranking with LMs: Main Components

• Query probability: what is the probability of generating the given query from a language model?

• Document probability: what is the probability of generating the given document from a language model?

• Model comparison: how “close” are two language models?


Page 30: CS 572: Information Retrieval

Ranking Using LMs: Multinomial


Page 31: CS 572: Information Retrieval

Ranking with LMs: Multi-Bernoulli


Page 32: CS 572: Information Retrieval

Score: Query Likelihood

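For reference, the standard query-likelihood score ranks documents by the probability of generating the query from a smoothed document model θ_d (e.g., the Jelinek-Mercer or Dirichlet estimates above), usually computed in log space:

\log p(Q \mid \theta_d) = \sum_{w \in Q} c(w, Q)\, \log p(w \mid \theta_d)

where c(w, Q) is the count of w in the query.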

Page 33: CS 572: Information Retrieval

Score 2: Document Likelihood


Page 34: CS 572: Information Retrieval

Score: Likelihood ratio (odds)


Page 35: CS 572: Information Retrieval

Score: Model Comparison


Page 36: CS 572: Information Retrieval

• Relative entropy between the two distributions

• Cost in bits of coding using Q when true distribution is P

Kullback-Leibler Divergence

D_{KL}(P \,\|\, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)} = \sum_i P(i)\big(\log P(i) - \log Q(i)\big)

H(P) = -\sum_i P(i)\,\log P(i)


Page 37: CS 572: Information Retrieval

Kullback-Leibler Divergence

D_{KL}(P \,\|\, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)}

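A minimal Python sketch of KL divergence used for ranking by model comparison (names are illustrative; the second model is assumed to be smoothed so it is nonzero wherever the first is):

import math

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) for distributions given as word -> probability dicts
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)

def kl_score(query_model, doc_model):
    # rank documents by -D_KL(query || document): smaller divergence = better match
    return -kl_divergence(query_model, doc_model)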

Page 38: CS 572: Information Retrieval

Two-stage Smoothing [Zhai & Lafferty 02]

2008 © ChengXiang Zhai

p(w \mid d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w \mid C)}{|d| + \mu} + \lambda\, p(w \mid U)

Stage 1 (Dirichlet prior, Bayesian): the μ term explains unseen words using the collection LM p(w | C).

Stage 2 (two-component mixture): the λ term explains noise in the query using a user background model p(w | U), which can be approximated by p(w | C).

Page 39: CS 572: Information Retrieval

Structured Document Retrieval[Ogilvie & Callan 03]

2008 © ChengXiang Zhai

Q = q_1 q_2 \ldots q_m

p(Q \mid D, R=1) = \prod_{i=1}^{m} p(q_i \mid D, R=1)

p(q_i \mid D, R=1) = \sum_{j=1}^{k} s(D_j \mid D, R=1)\; p(q_i \mid D_j, R=1)

[Figure: a document D divided into parts D1 (Title), D2 (Abstract), D3, …, Dk (body parts such as Body-Part1, Body-Part2).]

– Want to combine different parts of a document with appropriate weights
– Anchor text can be treated as a “part” of a document
– Applicable to XML retrieval
– Generation process: select a part Dj, then generate a query word using Dj
– The “part selection” probability s(Dj | D, R=1) serves as the weight for Dj and can be trained using EM
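A minimal Python sketch of this part-mixture query likelihood (illustrative names; each part model is a word -> probability dict and the weights sum to 1):

import math

def part_mixture_query_loglik(query_terms, part_models, part_weights):
    # log p(Q | D, R=1) = sum over query words of log sum_j s(D_j) * p(q | D_j)
    score = 0.0
    for q in query_terms:
        p = sum(w * pm.get(q, 0.0) for pm, w in zip(part_models, part_weights))
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

In practice each part model would itself be smoothed, and the weights s(Dj | D, R=1) estimated by EM as noted above.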

Page 40: CS 572: Information Retrieval

LMs for IR: Rules of Thumb


Page 41: CS 572: Information Retrieval


LMs vs. vector space model (1)

LMs have some things in common with vector space models.

Term frequency is directly in the model, but it is not scaled in LMs.

Probabilities are inherently “length-normalized”. Cosine normalization does something similar for vector space.

Mixing document and collection frequencies has an effect similar to idf.

Terms that are rare in the general collection, but common in some documents, will have a greater influence on the ranking.

Page 42: CS 572: Information Retrieval


LMs vs. vector space model (2)

LMs vs. vector space model: commonalities
Term frequency is directly in the model.
Probabilities are inherently “length-normalized”.
Mixing document and collection frequencies has an effect similar to idf.

LMs vs. vector space model: differences
LMs: based on probability theory.
Vector space: based on similarity, a geometric / linear algebra notion.
Collection frequency vs. document frequency.
Details of term frequency, length normalization, etc.

Page 43: CS 572: Information Retrieval


Vector space (tf-idf) vs. LM

The language modeling approach always does better in these experiments… but note that where the approach shows significant gains is at higher levels of recall.

Page 44: CS 572: Information Retrieval


LM vs. Prob. Model for IR

• The main difference is whether “relevance” figures explicitly in the model or not
– The LM approach attempts to do away with modeling relevance
• The LM approach assumes that documents and expressions of information problems are of the same type
• Computationally tractable, intuitively appealing

Page 45: CS 572: Information Retrieval


LM vs. Prob. Model for IR

• Problems of the basic LM approach
– The assumption of equivalence between document and information-problem representation is unrealistic
– Very simple models of language
– Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance
– Can’t easily accommodate phrases, passages, Boolean operators
• Current extensions focus on putting relevance back into the model, etc.

Page 46: CS 572: Information Retrieval

Ambiguity makes queries difficult

A query such as “AA”: American Airlines? Or Alcoholics Anonymous?


Page 47: CS 572: Information Retrieval

Query Clarity

• Clarity score ~ low ambiguity
• Cronen-Townsend et al., SIGIR 2002
• Compare a language model
– over the relevant documents for a query
– over all possible documents
• The more different these are, the clearer the query is
• “programming perl” vs. “the”


Page 48: CS 572: Information Retrieval

Clarity score

\text{Clarity score} = \sum_{w \in V} P(w \mid Q)\, \log_2 \frac{P(w \mid Q)}{P_{\text{coll}}(w)}

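This is the KL divergence, in bits, between the query language model and the collection language model. A minimal, self-contained Python sketch (the query and collection models are assumed to be word -> probability dicts; names are illustrative):

import math

def clarity(query_model, collection_model, eps=1e-12):
    # KL divergence in bits between the query LM and the collection LM;
    # higher clarity = the query LM is far from the collection LM = less ambiguous query
    return sum(pw * math.log2(pw / max(collection_model.get(w, 0.0), eps))
               for w, pw in query_model.items() if pw > 0)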

Page 49: CS 572: Information Retrieval

2008 © ChengXiang Zhai

Predicting Query Difficulty [Cronen-Townsend et al. 02]

• Observations:
– Discriminative queries tend to be easier
– Comparison of the query model and the collection model can indicate how discriminative a query is
• Method:
– Define “query clarity” as the KL-divergence between an estimated query model (or relevance model) and the collection LM
– An enriched query LM can be estimated by exploiting pseudo feedback (e.g., a relevance model)
• Correlation between the clarity scores and retrieval performance

\text{clarity}(Q) = \sum_{w} p(w \mid \theta_Q)\, \log \frac{p(w \mid \theta_Q)}{p(w \mid \text{Collection})}

Page 50: CS 572: Information Retrieval

Clarity scores on TREC-7 collection


Page 51: CS 572: Information Retrieval

Can use many more features

• http://www.slideshare.net/DavidCarmel/sigir12-tutorial-query-perfromance-prediction-for-ir
