information models for ad hoc information retrieval, sigir 2010

Information-Based Models for Ad Hoc IR

Stephane Clinchant 1,2 Eric Gaussier 2

1 Xerox Research Centre Europe

2 Laboratoire d’Informatique de Grenoble, Univ. Grenoble 1

SIGIR’10, 20 July 2010

S.Clinchant E.Gaussier (XRCE-LIG) Information-Based Models for Ad Hoc IR SIGIR’10, 20 July 2010 1 / 33

Overview

Information Models: Normalization, Probability Distribution, RSV

Heuristic Constraints: Condition 1, Condition 2, Condition 3, Condition 4

Burstiness: Phenomenon, Property of Prob. Distributions

Informative Content

Use Shannon’s information to weigh words in documents

[Plot: −log P(X) as a function of P(X)]

$$\mathrm{Inf}(x) = -\log P(x \mid \Theta_C)$$

Informative content = deviation from an average behavior

- Observation by Harter (70): specialty words deviate from a Poisson distribution, whereas non-specialty words follow it
- Informative content is core to Divergence From Randomness models
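As a rough illustration of the informative content defined above, the following sketch shows that a less probable observation carries more Shannon information. The function name and the two probabilities are made up for the example; they do not come from the talk.

```python
import math

def informative_content(p: float) -> float:
    """Shannon information -log P(x), in nats, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p)

# Hypothetical probabilities, chosen only for illustration.
common = informative_content(0.5)    # close to average behavior: little information
rare = informative_content(0.001)    # strong deviation: much more information
print(common, rare)
```

A word whose observed frequency is unlikely under the corpus model therefore gets a larger weight.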

Information-based Model

Main idea:

1 Discrete term frequencies x are renormalized into continuous values t(x), to account for different document lengths

2 For each term w, values t(x) are assumed to follow a distribution P with parameter $\lambda_w$ on the corpus, i.e. $Tf_w \mid \lambda_w \sim P$

3 Queries and documents are compared with a surprise measure, a mean information:

$$RSV(q, d) = \sum_{w \in q} -x^q_w \log P(Tf_w > t(x^d_w) \mid \lambda_w)$$
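The three steps above can be sketched generically, with the survival function P(Tf > t | λ) passed in as a parameter. The names `rsv` and `toy_survival` are hypothetical, and the exponential survival function is only a placeholder to make the sketch runnable; it is not one of the distributions proposed in the talk.

```python
import math
from typing import Callable, Dict

def rsv(query_tf: Dict[str, float],
        doc_t: Dict[str, float],
        lam: Dict[str, float],
        survival: Callable[[float, float], float]) -> float:
    """Sum over query terms of -x_w^q * log P(Tf_w > t_w^d | lambda_w)."""
    score = 0.0
    for w, xq in query_tf.items():
        t = doc_t.get(w, 0.0)  # normalized frequency, 0 if w is absent from d
        score += -xq * math.log(survival(t, lam[w]))
    return score

def toy_survival(t: float, lam: float) -> float:
    # Placeholder (assumption): exponential tail exp(-t / lam), with P(T > 0) = 1,
    # so terms absent from the document contribute nothing to the score.
    return math.exp(-t / lam)

# Hypothetical query, document and corpus parameters.
query = {"bursty": 1.0, "model": 2.0}
doc = {"bursty": 3.0}
lam = {"bursty": 0.5, "model": 0.1}
score = rsv(query, doc, lam, toy_survival)
print(score)
```

Swapping `toy_survival` for another survival function changes the model without touching the ranking code.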

Outline

1 Model Properties
   - Retrieval Heuristics
   - Burstiness Phenomenon

2 Two Power-Law Instances
   - log-logistic model
   - smoothed power-law model

3 Experiments

4 Extension to PRF

Notations

$x^d_w$: frequency of word w in document d ($x^q_w$: in the query)

$t^d_w$: normalized term frequency

$Tf_w$: random variable for the frequency of word w

$l_d$: length of document d

$idf_w$: corpus parameter for word w

$\theta$: model parameter

Most (ad hoc) IR models can be written as:

$$RSV(q, d) = \sum_{w \in q \cap d} f(x^q_w)\, h(x^d_w, l_d, idf_w, \theta)$$

⇒ What do we know about h?

Overview

Condition 1

Documents with more occurrences of query terms get higher scores than documents with fewer occurrences

$$\forall (l, idf, \theta),\quad \frac{\partial h(x, l, idf, \theta)}{\partial x} > 0 \quad (h \text{ increases with } x)$$

[Figure: a "good" h is increasing in x; a "bad" h is decreasing.]

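Condition 2

The increase in the retrieval score should be smaller for larger term frequencies. Ex: going from 2 to 4 occurrences should matter more than going from 50 to 52.

$$\forall (l, idf, \theta),\quad \frac{\partial^2 h(x, l, idf, \theta)}{\partial x^2} < 0 \quad (h \text{ concave})$$

[Figure: a "good" h is concave (difference of scores decreases); a "bad" h is convex (difference of scores increases).]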

Condition 3

Longer documents, when compared to shorter ones with exactly the same number of occurrences of query terms, should be penalized (they are likely to cover additional topics)

$$\forall (x, idf, \theta),\quad \frac{\partial h(x, l, idf, \theta)}{\partial l} < 0 \quad (h \text{ decreasing with } l)$$

Condition 4: IDF Effect

It is important to downweight terms occurring in many documents

$$\forall (x, l, \theta),\quad \frac{\partial h(x, l, idf, \theta)}{\partial idf} > 0 \quad (\text{IDF effect})$$

[Figure: h(x, IDF=10) and h(x, IDF=5) plotted against x. IDF effect: h(x, IDF=10) > h(x, IDF=5).]

Condition 1: h increases with x

Condition 2: h is concave

Condition 3: h decreases with l

Condition 4: h increases with idf (IDF Effect)

Additional conditions in the paper

⇒ Analytical Reformulation of TFC1, TFC2, LNC1 and TDC:

Fang et al, A Formal Study of Information Retrieval Heuristics, SIGIR’04
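The four conditions can be checked numerically for a candidate weighting function using finite differences. In the sketch below, `h` is a toy function chosen only for illustration (it is not a model from the talk), and the evaluation point and step size are arbitrary assumptions. A check at a single point is a sanity test, not a proof.

```python
import math

def h(x: float, l: float, idf: float) -> float:
    """Toy weighting function (an assumption, not a model from the talk)."""
    return idf * math.log(1.0 + x / l)

def check_conditions(h, x=3.0, l=100.0, idf=2.0, eps=1e-3):
    """Finite-difference estimates of the four derivative conditions at one point."""
    d_x = (h(x + eps, l, idf) - h(x - eps, l, idf)) / (2 * eps)
    d_xx = (h(x + eps, l, idf) - 2 * h(x, l, idf) + h(x - eps, l, idf)) / eps ** 2
    d_l = (h(x, l + eps, idf) - h(x, l - eps, idf)) / (2 * eps)
    d_idf = (h(x, l, idf + eps) - h(x, l, idf - eps)) / (2 * eps)
    return {
        "condition 1 (increasing in x)": d_x > 0,
        "condition 2 (concave in x)": d_xx < 0,
        "condition 3 (decreasing in l)": d_l < 0,
        "condition 4 (IDF effect)": d_idf > 0,
    }

results = check_conditions(h)
print(results)
```

The same helper can be pointed at any other `h(x, l, idf)` to see which conditions it appears to violate.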

Overview

Burstiness Phenomenon

We now turn to word frequency distributions:

Church and Gale [1] showed that a 2-Poisson model yields a poor fit to word frequencies

A possible explanation: the behavior of words which tend to appear in bursts, i.e. burstiness

Once a word appears in a document, it is much more likely to appear again

Recent work on the Dirichlet Compound Multinomial

⇒ Which distributions can account for burstiness?

[1] Poisson Mixtures

Burstiness Property of Probability Distributions

Definition

A distribution P is bursty iff, for all ε > 0, the function $g_\varepsilon$ defined by:

$$g_\varepsilon(x) = P(X \ge x + \varepsilon \mid X \ge x)$$

is a strictly increasing function of x

Interpretation: it becomes easier to generate more occurrences

$$g_\varepsilon(x) \text{ strictly increasing} \iff \Delta(x) = \log g_\varepsilon(x) \text{ strictly increasing} \iff \Delta(x) = \log P(X \ge x + \varepsilon) - \log P(X \ge x) \text{ strictly increasing}$$

As ∆ < 0, the absolute value of the successive differences ∆ decreases
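The definition above can be probed numerically by evaluating $g_\varepsilon$ on a grid. This is a minimal sketch: the function names, the grid and the heavy-tailed survival function λ/(x + λ) are illustrative assumptions, and a grid check can refute burstiness but cannot prove it.

```python
import math

def is_bursty_on_grid(survival, xs, eps=1.0):
    """Check that g(x) = P(X >= x + eps) / P(X >= x) strictly increases along xs."""
    g = [survival(x + eps) / survival(x) for x in xs]
    return all(b > a for a, b in zip(g, g[1:]))

def power_law_survival(x, lam=0.5):
    # Heavy-tailed survival lam / (x + lam), used here as an illustrative choice.
    return lam / (x + lam)

def gaussian_survival(x, mean=5.0, std=1.0):
    # P(X >= x) for a normal distribution, via the complementary error function.
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2.0)))

xs = [float(i) for i in range(0, 11)]
print(is_bursty_on_grid(power_law_survival, xs))
print(is_bursty_on_grid(gaussian_survival, xs))
```

For the power-law tail, g(x) = 1 − 1/(x + 1 + λ) grows with x, whereas the Gaussian tail thins out faster and faster, so each further occurrence becomes harder to generate.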

Geometric Interpretation of Burstiness

[Figure: log P(X > x) plotted against x. For a bursty distribution, Delta = log P(X > x + e) − log P(X > x) increases with x; as Delta < 0, its absolute value decreases. A Gaussian (mean = 5, std = 1) is not bursty.]

Overview

Information Models & Heuristics Constraints:

Models defined by:

$$\overbrace{\left(-\log P(Tf_w > t^d_w \mid \lambda_w)\right)}^{\text{Function } h} \quad (1)$$

Condition 1: h increasing with x ✓

Condition 3: h penalizes long documents ✓

Condition 2: h concave

Theorem

If the distribution P is bursty, then the information model defined with P is concave

IDF effect and 2 additional Conditions depend on the choice of P
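One way to see why the theorem holds is sketched below. It assumes that P is continuous (so that P(X ≥ t) = P(X > t)), that h is continuous, and that h is studied as a function of the normalized frequency t; the symbol S for the survival function is introduced here for convenience.

```latex
% Notation (assumption): S(t) = P(Tf_w > t \mid \lambda_w), \quad h(t) = -\log S(t).
% Burstiness: for every \varepsilon > 0, t \mapsto \log S(t+\varepsilon) - \log S(t)
% is strictly increasing.
\begin{align*}
\log S(t+\varepsilon) - \log S(t) \ \text{strictly increasing in } t
  &\iff h(t+\varepsilon) - h(t) \ \text{strictly decreasing in } t \\
  &\implies h(t+2\varepsilon) - h(t+\varepsilon) < h(t+\varepsilon) - h(t) \\
  &\iff h(t+\varepsilon) > \tfrac{1}{2}\bigl(h(t) + h(t+2\varepsilon)\bigr).
\end{align*}
```

The last line is strict midpoint concavity, which together with continuity gives concavity of h in t. Concavity in the raw frequency x then follows whenever t is a linear function of x.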

Characterization of Information Models

1 Normalization of frequencies: increasing in x, decreasing in l
   ex: DFR normalization $t^d_w = x^d_w \log\left(1 + c\,\frac{avg_l}{l_d}\right)$

2 Probability distribution: continuous and bursty, with support $[0, +\infty)$

3 Retrieval function:

$$RSV(q, d) = \sum_{w \in q \cap d} -x^q_w \log P(Tf_w > t^d_w \mid \lambda_w)$$

with $\lambda_w = F_w / N$ or $\lambda_w = N_w / N$, where:
- $F_w$: frequency of w in the corpus
- $N_w$: document frequency of w
- $N$: number of documents in the collection
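The normalization and the corpus parameter can be sketched as two small helpers. This assumes the usual DFR form t = x · log(1 + c · avg_l / l_d) and the document-frequency ratio N_w / N for λ_w; the function names and the numbers in the usage lines are hypothetical.

```python
import math

def dfr_normalize(x: float, doc_len: float, avg_len: float, c: float = 1.0) -> float:
    """Normalized frequency t = x * log(1 + c * avg_len / doc_len)."""
    return x * math.log(1.0 + c * avg_len / doc_len)

def corpus_lambda(doc_freq: int, n_docs: int) -> float:
    """Corpus parameter lambda_w = N_w / N (document-frequency ratio)."""
    return doc_freq / n_docs

# Hypothetical values: same raw frequency, short versus long document.
t_short = dfr_normalize(2, doc_len=50, avg_len=100)
t_long = dfr_normalize(2, doc_len=200, avg_len=100)
print(t_short, t_long, corpus_lambda(10, 1000))
```

The normalized value grows with x and shrinks as the document gets longer, which is what the first item of the characterization requires.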

Two Power-law Instances

The log-logistic and smoothed power law models

Log-Logistic Model

Log-Logistic distribution

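$$P(Tf_w > t^d_w \mid \lambda_w) = \frac{\lambda_w}{t^d_w + \lambda_w}$$

The LGD model is defined by

1 DFR normalization with parameter c

2 $Tf_w \sim \mathrm{LogLogistic}(\lambda_w = N_w / N)$

3 Ranking model (as before):

$$RSV(q, d) = \sum_{w \in q \cap d} x^q_w \left[-\log P(Tf_w > t^d_w \mid \lambda_w)\right]$$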

Meets all conditions for all parameter values
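A minimal sketch of LGD scoring follows. It assumes the log-logistic survival function λ_w / (t + λ_w) and the DFR normalization t = x · log(1 + c · avg_l / l_d); the function signature, the default c = 1.0 and the toy statistics are assumptions made for the example.

```python
import math
from typing import Dict

def lgd_score(query_tf: Dict[str, float], doc_tf: Dict[str, float],
              doc_len: float, avg_len: float,
              doc_freq: Dict[str, int], n_docs: int, c: float = 1.0) -> float:
    """Sum over w in q and d of x_w^q * -log P(Tf_w > t | lambda_w)."""
    score = 0.0
    for w, xq in query_tf.items():
        x = doc_tf.get(w, 0)
        if x == 0:
            continue  # the sum only runs over terms shared by q and d
        t = x * math.log(1.0 + c * avg_len / doc_len)  # DFR normalization
        lam = doc_freq[w] / n_docs                      # lambda_w = N_w / N
        # -log(lam / (t + lam)) rewritten as log((t + lam) / lam)
        score += xq * math.log((t + lam) / lam)
    return score

# Hypothetical corpus statistics and documents.
query = {"burstiness": 1.0}
df = {"burstiness": 10}
low = lgd_score(query, {"burstiness": 2}, 100, 100, df, 1000)
high = lgd_score(query, {"burstiness": 5}, 100, 100, df, 1000)
print(low, high)
```

With a single parameter c, the only tuning needed is the normalization constant.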

Smoothed Power Law SPL

Distribution on [0,+∞) with parameter 0 < λ < 1:

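$$P(Tf_w > t^d_w \mid \lambda_w) = \frac{\lambda_w^{\frac{t^d_w}{t^d_w + 1}} - \lambda_w}{1 - \lambda_w}$$

IR Model:

$Tf_w \sim \mathrm{SPL}(\lambda_w = N_w / N)$

$$RSV(q, d) = \sum_{w \in q \cap d} x^q_w \left[-\log P(Tf_w > t^d_w \mid \lambda_w)\right]$$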

Meets all conditions
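The SPL term weight can be sketched in the same spirit, assuming the survival function (λ^(t/(t+1)) − λ) / (1 − λ) with 0 < λ < 1. The function names and the values of t and λ below are illustrative assumptions.

```python
import math

def spl_survival(t: float, lam: float) -> float:
    """P(Tf > t | lam) = (lam ** (t / (t + 1)) - lam) / (1 - lam), for 0 < lam < 1."""
    return (lam ** (t / (t + 1.0)) - lam) / (1.0 - lam)

def spl_weight(t: float, lam: float) -> float:
    """Information -log P(Tf > t | lam) carried by a normalized frequency t."""
    return -math.log(spl_survival(t, lam))

# Hypothetical normalized frequencies for a term with lam = 0.1.
weights = [spl_weight(t, 0.1) for t in (1.0, 2.0, 3.0)]
gains = [b - a for a, b in zip(weights, weights[1:])]
print(weights, gains)
```

The weights grow with t while the successive gains shrink, and a rarer term (smaller λ) receives a larger weight for the same t.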

Experiments

Comparison with language models, BM25 and DFR models

Corpora: ROBUST, TREC-3, CLEF03, GIRT, with short (-t) and long (-d) queries

6 query sets: ROB-d, ROB-t, T3-t, GIRT, CLEF-d, CLEF-t

Methodology:

1 Divide each collection into 10 training/test splits

2 Learn the best parameter (µ, c, k1) to optimize MAP or P10 on the training set

3 Measure MAP or P10 on the 10 splits and test the difference with a t-test.
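Step 3 can be illustrated with a paired t statistic computed over the 10 splits. The per-split MAP values below are invented for the example, and both the choice of a paired test and the 5% two-sided threshold (2.262 for 9 degrees of freedom) are assumptions, since the slides do not state them.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical MAP per test split for two models (not results from the talk).
model_a = [0.271, 0.268, 0.275, 0.270, 0.266, 0.273, 0.269, 0.272, 0.274, 0.267]
model_b = [0.268, 0.266, 0.270, 0.269, 0.262, 0.270, 0.268, 0.268, 0.271, 0.265]

t_stat = paired_t_statistic(model_a, model_b)
CRITICAL_9_DOF = 2.262  # two-sided 5% critical value of Student's t, 9 d.o.f.
print(t_stat, abs(t_stat) > CRITICAL_9_DOF)
```

Pairing by split removes the variation between splits, so small but consistent differences can still be detected.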

Comparison with Dirichlet Smoothing

Table: LGD and SPL versus LM-Dirichlet after 10 splits; bold indicates significant difference

MAP   ROB-d  ROB-t  GIR   T3-t  CL-t  CL-d
DIR   27.1   25.1   41.1  25.6  36.2  48.5
LGD   27.4   25.0   42.1  24.8  36.8  49.7

P10   ROB-d  ROB-t  GIR   T3-t  CL-t  CL-d
DIR   45.6   43.3   68.6  54.0  28.4  33.8
LGD   46.2   43.5   69.0  54.3  28.6  34.5

MAP   ROB-d  ROB-t  GIR   T3-t  CL-t  CL-d
DIR   26.7   25.0   40.9  27.1  36.2  50.2
SPL   25.6   24.9   42.1  26.8  36.4  46.9

P10   ROB-d  ROB-t  GIR   T3-t  CL-t  CL-d
DIR   45.2   43.8   68.2  52.8  27.3  32.8
SPL   46.6   44.7   70.8  55.3  27.1  32.9

Comparison with DFR models

Table: LGD and SPL versus PL2 after 10 splits; bold indicates significant difference

MAP   ROB-d  ROB-t  GIR   T3-t  CL-t  CL-d
PL2   26.2   24.8   40.6  24.9  36.0  47.2
LGD   27.3   24.7   40.5  24.0  36.2  47.5

P10   ROB-d  ROB-t  GIR   T3-t  CL-t  CL-d
PL2   46.4   44.1   68.2  55.0  28.7  33.1
LGD   46.6   43.2   66.7  53.9  28.5  33.7

MAP   ROB-d  ROB-t  GIR   T3-t  CL-t  CL-d
PL2   26.3   25.2   42.8  25.8  37.3  45.7
SPL   26.3   25.2   42.7  25.3  37.4  44.1

P10   ROB-d  ROB-t  GIR   T3-t  CL-t  CL-d
PL2   46.0   45.2   69.3  54.8  26.2  32.7
SPL   47.0   45.2   69.8  55.4  25.9  32.9

Extension to Pseudo Relevance Feedback

Mean information of the top retrieved documents

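$$\mathrm{Info}_R(w) = \frac{1}{|R|} \sum_{d \in R} -\log P(Tf_w > t^d_w; \lambda_w)$$

Query Update:

$$x^{q2}_w = \frac{x^q_w}{\max_w x^q_w} + \beta\, \frac{\mathrm{Info}_R(w)}{\max_w \mathrm{Info}_R(w)}$$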
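The feedback step can be sketched as follows: compute the mean information of each term over the feedback set R, keep the most informative terms, and interpolate with the normalized original query. The log-logistic survival function λ / (t + λ), the function names, the `term_count` cut-off and all numbers are assumptions made for the example.

```python
import math
from typing import Dict, List

def loglogistic_survival(t: float, lam: float) -> float:
    # Assumed survival function P(Tf > t | lam) = lam / (t + lam).
    return lam / (t + lam)

def info_r(term: str, feedback_docs: List[Dict[str, float]], lam: Dict[str, float]) -> float:
    """Mean of -log P(Tf_w > t_w^d; lambda_w) over the feedback documents."""
    total = sum(-math.log(loglogistic_survival(d.get(term, 0.0), lam[term]))
                for d in feedback_docs)
    return total / len(feedback_docs)

def update_query(query_tf, feedback_docs, lam, beta=1.0, term_count=10):
    """x_w^{q2} = x_w^q / max x^q + beta * Info_R(w) / max Info_R."""
    vocab = {w for d in feedback_docs for w in d}
    info = {w: info_r(w, feedback_docs, lam) for w in vocab}
    ranked = sorted(info.items(), key=lambda kv: kv[1], reverse=True)
    top = dict(ranked[:term_count])  # keep the most informative terms only
    max_q = max(query_tf.values())
    max_info = max(top.values())
    return {w: query_tf.get(w, 0.0) / max_q + beta * top.get(w, 0.0) / max_info
            for w in set(query_tf) | set(top)}

# Hypothetical normalized frequencies in two feedback documents.
docs = [{"retrieval": 2.0, "bursty": 3.0}, {"bursty": 1.5, "model": 0.5}]
lam = {"retrieval": 0.2, "bursty": 0.01, "model": 0.3}
new_query = update_query({"retrieval": 1.0}, docs, lam, beta=0.5, term_count=2)
print(new_query)
```

Dividing both parts by their maximum puts query weights and feedback information on the same scale before β mixes them.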

Comparison with other PRF Models

Mixture Model (Zhai)

R comes from a mixture of a relevant topic model $\theta_w$ and the corpus language model (multinomial distribution)

Query Update:

$$p(w \mid q2) = \alpha\, p(w \mid q) + (1 - \alpha)\, \theta_w$$

Bo2 Model (Amati)

Documents in R are merged together. A geometric probability model measures the informative content of a word

Query Update:

$$x^{q2}_w = \frac{x^q_w}{\max_w x^q_w} + \beta\, \frac{\mathrm{Info}_{Bo2}(w)}{\max_w \mathrm{Info}_{Bo2}(w)}$$

Pseudo Relevance Feedback Experiments

1 Divide each collection into 10 training/test splits

2 Learn the best interpolation weight (β, α) to optimize MAP on the training set

3 Measure MAP on the 10 splits and test difference with a t-test

4 Change |R| and termCount TC to add to the queries

5 Repeat

Table: MAP, bold indicates best performance, ∗ significant difference over LM and Bo2 models

Model    |R|  TC  ROB-t  GIRT   TREC3-t  CLEF-t
LM+MIX   5    5   27.5   44.4   30.7     36.6
INL+Bo2  5    5   26.5   42.0   30.6     37.6
LGD      5    5   28.3∗  44.3   32.9∗    37.6
LM+MIX   5    10  28.3   45.7∗  33.6     37.4
INL+Bo2  5    10  27.5   42.7   32.6     37.5
LGD      5    10  29.4∗  44.9   35.0∗    40.2∗
LM+MIX   10   10  28.4   45.5   31.8     37.6
INL+Bo2  10   10  27.2   43.0   32.3     37.4
LGD      10   10  30.0∗  46.8∗  35.5∗    38.9
LM+MIX   10   20  29.0   46.2   33.7     38.2
INL+Bo2  10   20  27.7   43.5   33.8     37.7
LGD      10   20  30.3∗  47.6∗  37.4∗    38.6

Table: Mean average precision (MAP) of PRF experiments; bold indicates best performance, ∗ significant difference over LM and Bo2 models

Model  |R|  TC  ROB-t  GIR    T3-t   CL-t
LGD    5    5   28.3∗  44.3   32.9∗  37.6
SPL    5    5   28.9∗  45.6∗  32.9∗  39.0∗
LGD    5    10  29.4∗  44.9   35.0∗  40.2∗
SPL    5    10  29.6∗  47.0∗  34.6∗  39.5∗
LGD    10   10  30.0∗  46.8∗  35.5∗  38.9
SPL    10   10  30.0∗  48.9∗  33.8∗  39.1∗
LGD    10   20  30.3∗  47.6∗  37.4∗  38.6
SPL    10   20  29.9∗  50.2∗  34.3   39.7∗
LGD    20   20  29.5∗  48.9∗  37.2∗  41.0∗
SPL    20   20  28.8   50.3∗  33.9   39.0∗

Conclusion

Can we design IR models compatible with empirical evidence?

⇒ Proposal: Information Models modelling burstiness (better fit to data)

Analytical Characterization of Retrieval Constraints

Definition of Burstiness for Probability Distributions

Information-Based Models compliant with Retrieval Constraints
   - Bursty Distribution ⇒ Concave Model

Extension to PRF

The Log-logistic and Smoothed Power Law Models
   - Similar/Better Performance to LM and DFR without PRF, better with

Questions ?

Relation with DFR

DFR Models are defined by:

$$RSV(q, d) = \sum_{w \in q \cap d} -x^q_w\, \mathrm{Inf}_2(t^d_w) \log P(t^d_w)$$

We can show that:

Inf2 makes DFR models concave (condition 2)

Without Inf2 , DFR models have poor performances

Discrete laws with continuous values

2 notions of information (non-homogeneous)

⇒ Information Models use continuous laws and a single concept of information
