query-drift prevention for robust query expansion - presentation

Robust Query Expansion Based on Query-Drift Prevention

Liron Zighelnic
Academic advisor: Dr. Oren Kurland

Based on our work at SIGIR 08’

The Faculty of Industrial Engineering and Management
Technion - Israel Institute of Technology

30.6.2009 - Information Systems Seminar


Outline

1 Background
  Ad Hoc Retrieval

2 Query Expansion
  Motivation
  Pseudo Relevance Feedback
  The Performance Robustness Problem
  Query Expansion Models

3 Query-Drift Prevention
  Improving Robustness Using Fusion
  Improving Robustness Using Re-ordering Methods

4 Experimental Evaluation

5 Related Work

6 Summary

7 Questions


Our Mission - Ad Hoc Retrieval

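[Figure: ad hoc retrieval. An information need is expressed as a query q; the retrieval system scores the documents d_i ∈ C of the corpus C with Score_init(d) and returns a ranked list of documents D_init = d_1, d_2, d_3, d_4, ..., d_n; the documents from d_n+1 onward fall outside the list.]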


Query Expansion - Motivation

Users tend to use (very) short queries

The polysemy problem (e.g., q: "Paris Hilton")

The vocabulary mismatch problem (e.g., q: "view photos", d: "nature pictures")

Expansion: Relevance Feedback vs. Pseudo Relevance Feedback (a.k.a. blind feedback) (Buckley et al. 94’, Xu and Croft 96’)


Pseudo Relevance Feedback

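[Figure: pseudo relevance feedback. An expanded query is submitted to the retrieval system, which scores documents with Score_pf(d) and returns the expansion-based list PF(D_init) = d'_1, d'_2, d'_3, d'_4, ..., d'_n.]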


The Performance Robustness Problem

Problems:

D_init may contain many non-relevant documents.

The initially retrieved document list D_init may not manifest all query-related aspects (Buckley 04’).

Consequences:

Query drift: the shift in "intention" from the original query to its expanded form (Mitra et al. 98’) (e.g., q: "Paris Hilton", q’: "Paris Hilton Whitney model heiress").

While on average, pseudo-feedback-based query expansion methods improve retrieval effectiveness over that of retrieval using the original query, there are numerous queries for which this is not true.


The Performance Robustness Problem - Cont.

[Figure: RM1 Query Drift - ROBUST Corpus, Queries 301-350. x-axis: queries; y-axis: difference in effectiveness (from -0.4 to 0.4).]


Query Expansion Models

The Relevance Model - RM1 (Lavrenko and Croft 01’): The relevance model paradigm assumes that there exists a (language) model RM1 that generates terms both in the query and in the relevant documents (footnote 1).

\[ p_{\mathrm{RM1}}(w) \stackrel{\mathrm{def}}{=} \sum_{d \in D_{\mathrm{init}}} p_d(w)\, p_q(d) \]

The Interpolated Relevance Model - RM3 (Abdul-Jaleel et al. 04’): query-anchoring at the model level:

\[ p_{\mathrm{RM3}}(w) \stackrel{\mathrm{def}}{=} \lambda\, p_q(w) + (1-\lambda)\, p_{\mathrm{RM1}}(w) \]

-------------

1. p_x(y) denotes the "similarity" between x and y
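
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the work: it assumes that p_d(w) is an unsmoothed maximum-likelihood unigram model and that the weights p_q(d) are supplied already normalized over D_init. The names `mle_model`, `rm1`, and `rm3` and all toy values are hypothetical.

```python
from collections import Counter


def mle_model(tokens):
    """Unsmoothed maximum-likelihood unigram model (an assumption of this sketch)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def rm1(doc_models, doc_weights):
    """p_RM1(w) = sum over d in D_init of p_d(w) * p_q(d)."""
    model = {}
    for doc_id, p_d in doc_models.items():
        for w, p in p_d.items():
            model[w] = model.get(w, 0.0) + p * doc_weights[doc_id]
    return model


def rm3(query_model, rm1_model, lam):
    """p_RM3(w) = lam * p_q(w) + (1 - lam) * p_RM1(w)."""
    vocab = set(query_model) | set(rm1_model)
    return {
        w: lam * query_model.get(w, 0.0) + (1 - lam) * rm1_model.get(w, 0.0)
        for w in vocab
    }


# Toy D_init with two documents; the weights p_q(d) are made up and sum to 1.
doc_models = {
    "d1": mle_model(["paris", "hilton", "hotel", "france"]),
    "d2": mle_model(["paris", "hilton", "heiress", "model"]),
}
doc_weights = {"d1": 0.25, "d2": 0.75}
query_model = mle_model(["paris", "hilton"])

expanded = rm3(query_model, rm1(doc_models, doc_weights), lam=0.5)
```

With λ = 1 the expanded model reduces to the query model, and with λ = 0 it reduces to RM1, which is why λ controls how strongly the expansion is anchored to the original query.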


Query Expansion Models- Cont.

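Rocchio-1: If we take the RM1 model and set p_q(d) to a uniform distribution, we get the following model:

\[ p_{\mathrm{Rocchio1}}(w) \stackrel{\mathrm{def}}{=} \sum_{d \in D_{\mathrm{init}}} p_d(w) \cdot \frac{1}{|D_{\mathrm{init}}|} \]

where all documents in D_init are equal contributors to the constructed model.

Rocchio-3 (Rocchio 71’): query-anchoring at the model level:

\[ p_{\mathrm{Rocchio3}}(w) \stackrel{\mathrm{def}}{=} \lambda\, p_q(w) + (1-\lambda) \sum_{d \in D_{\mathrm{init}}} p_d(w) \cdot \frac{1}{|D_{\mathrm{init}}|} \]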


Query Expansion Models- Cont.

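Model     | Weighting with respect to p_q(d) | Interpolation with the original query | Definition
RM1       | yes                              | no                                    | \( \sum_{d \in D_{\mathrm{init}}} p_d(w)\, p_q(d) \)
RM3       | yes                              | yes                                   | \( \lambda\, p_q(w) + (1-\lambda) \sum_{d \in D_{\mathrm{init}}} p_d(w)\, p_q(d) \)
Rocchio1  | no                               | no                                    | \( \sum_{d \in D_{\mathrm{init}}} p_d(w) \cdot \frac{1}{|D_{\mathrm{init}}|} \)
Rocchio3  | no                               | yes                                   | \( \lambda\, p_q(w) + (1-\lambda) \sum_{d \in D_{\mathrm{init}}} p_d(w) \cdot \frac{1}{|D_{\mathrm{init}}|} \)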
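
The two switches that distinguish the four models (weighting documents by p_q(d), and interpolating with the original query) can be sketched as one function with two optional arguments. This is only an illustration under assumptions: `expansion_model` and its parameter names are hypothetical, the distributions are toy values, and passing `None` is the convention chosen here for switching a feature off.

```python
def expansion_model(doc_models, doc_weights=None, query_model=None, lam=None):
    """Build one of the four models from two switches.

    doc_weights=None -> uniform 1/|D_init| weights (Rocchio1, Rocchio3);
    otherwise the supplied p_q(d) weights are used (RM1, RM3).
    lam=None -> no interpolation with the query (RM1, Rocchio1);
    otherwise the result is anchored to query_model (RM3, Rocchio3).
    """
    if doc_weights is None:
        doc_weights = {d: 1.0 / len(doc_models) for d in doc_models}
    feedback = {}
    for doc_id, p_d in doc_models.items():
        for w, p in p_d.items():
            feedback[w] = feedback.get(w, 0.0) + p * doc_weights[doc_id]
    if lam is None:
        return feedback
    vocab = set(feedback) | set(query_model)
    return {
        w: lam * query_model.get(w, 0.0) + (1 - lam) * feedback.get(w, 0.0)
        for w in vocab
    }


# Toy inputs (made up): term distributions p_d(w), weights p_q(d), query model p_q(w).
docs = {"d1": {"paris": 0.5, "hotel": 0.5}, "d2": {"paris": 0.5, "heiress": 0.5}}
p_q_d = {"d1": 0.8, "d2": 0.2}
p_q_w = {"paris": 1.0}

rm1 = expansion_model(docs, doc_weights=p_q_d)
rm3 = expansion_model(docs, doc_weights=p_q_d, query_model=p_q_w, lam=0.5)
rocchio1 = expansion_model(docs)
rocchio3 = expansion_model(docs, query_model=p_q_w, lam=0.5)
```

In this toy case the term "hotel" gets more mass under RM1 than under Rocchio1, because d1 carries a higher weight p_q(d) than d2; under uniform weights both documents contribute equally.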


Our Idea - Using Fusion

Data fusion - combining retrieval methods or query representations.

Data fusion - motivation:

Using a variety of methods (results) utilizes different aspects of the search space and hence returns more relevant results.

Fusion is efficient, since it incurs minimal overhead.




Improving Robustness Using Fusion - Motivation

Documents ranked high by both retrieved lists are potentially relevant since they constitute a good match to both forms of the presumed information need.

A document ranked high by the initial retrieval can be assumed to have a high surface-level similarity to the original query.

Query expansion can add aspects that were not in the original query but may be relevant to the information need and may improve the retrieval.

A document that is ranked high by both the initial retrieval and the expansion is assumed (potentially) to suffer less from query drift.

Documents that are retrieved using a variety of query representations have a high chance of being relevant (Belkin et al. 93’, Robertson 97’).





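The following retrieval methods operate on D_init ∪ PF(D_init).

Combmnz (Fox and Shaw 94’) rewards documents that are ranked high in both D_init and PF(D_init) (footnotes 2, 3):

\[ \mathrm{Score}_{\mathrm{combmnz}}(d) \stackrel{\mathrm{def}}{=} \left( \delta[d \in D_{\mathrm{init}}] + \delta[d \in \mathrm{PF}(D_{\mathrm{init}})] \right) \cdot \left( \frac{\mathrm{Score}_{\mathrm{init}}(d)}{\sum_{d' \in D_{\mathrm{init}}} \mathrm{Score}_{\mathrm{init}}(d')} + \frac{\mathrm{Score}_{\mathrm{pf}}(d)}{\sum_{d' \in \mathrm{PF}(D_{\mathrm{init}})} \mathrm{Score}_{\mathrm{pf}}(d')} \right) \]

-------------

2. For statement s, δ[s] = 1 if s is true and 0 otherwise.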
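
As a rough sketch of the Combmnz formula, the following Python fuses two toy score lists. It assumes positive retrieval scores and that a document missing from a list contributes 0 from that list; the function names and all score values are hypothetical.

```python
def normalize(scores):
    """Divide each score by the sum of the scores in its list."""
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


def combmnz(init_scores, pf_scores):
    """(number of lists containing d) * (sum of d's sum-normalized scores)."""
    init_norm = normalize(init_scores)
    pf_norm = normalize(pf_scores)
    fused = {}
    for d in set(init_scores) | set(pf_scores):
        n_lists = (d in init_scores) + (d in pf_scores)
        # Assumption: a document absent from a list contributes 0 from that list.
        fused[d] = n_lists * (init_norm.get(d, 0.0) + pf_norm.get(d, 0.0))
    return fused


init_scores = {"d1": 3.0, "d2": 2.0, "d3": 1.0}  # toy Score_init values for D_init
pf_scores = {"d2": 4.0, "d3": 3.0, "d4": 3.0}    # toy Score_pf values for PF(D_init)

fused = combmnz(init_scores, pf_scores)
ranking = sorted(fused, key=fused.get, reverse=True)
```

In this toy example d1 has the highest initial score, yet d2 and d3 overtake it in the fused ranking because they appear in both lists and their summed scores are doubled.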




Improving Robustness Using Fusion - Cont.

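The interpolation algorithm:

Differentially weights the initial score and the pseudo-feedback-based score using an interpolation parameter λ:

\[ \mathrm{Score}_{\mathrm{interpolation}}(d) \stackrel{\mathrm{def}}{=} \lambda\, \delta[d \in D_{\mathrm{init}}] \frac{\mathrm{Score}_{\mathrm{init}}(d)}{\sum_{d' \in D_{\mathrm{init}}} \mathrm{Score}_{\mathrm{init}}(d')} + (1-\lambda)\, \delta[d \in \mathrm{PF}(D_{\mathrm{init}})] \frac{\mathrm{Score}_{\mathrm{pf}}(d)}{\sum_{d' \in \mathrm{PF}(D_{\mathrm{init}})} \mathrm{Score}_{\mathrm{pf}}(d')} \]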
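
The interpolation score can be sketched as follows. This is a minimal illustration that assumes positive scores; the function name `interpolate` and the toy values are hypothetical, and the dictionary lookup with a default of 0 plays the role of the indicator δ[·].

```python
def interpolate(init_scores, pf_scores, lam):
    """lam * normalized Score_init(d) + (1 - lam) * normalized Score_pf(d)."""
    init_total = sum(init_scores.values())
    pf_total = sum(pf_scores.values())
    fused = {}
    for d in set(init_scores) | set(pf_scores):
        # A missing document gets 0 from that list, mirroring delta[d in list].
        init_part = init_scores.get(d, 0.0) / init_total
        pf_part = pf_scores.get(d, 0.0) / pf_total
        fused[d] = lam * init_part + (1 - lam) * pf_part
    return fused


init_scores = {"d1": 3.0, "d2": 2.0, "d3": 1.0}  # toy Score_init values for D_init
pf_scores = {"d2": 4.0, "d3": 3.0, "d4": 3.0}    # toy Score_pf values for PF(D_init)

balanced = interpolate(init_scores, pf_scores, lam=0.5)
init_only = interpolate(init_scores, pf_scores, lam=1.0)
```

Setting λ = 1 ignores the expansion-based scores entirely, so documents that appear only in PF(D_init) receive a score of 0; setting λ = 0 does the opposite.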



Improving Robustness Using Re-ordering Methods

rerank

The rerank method (e.g., Kurland and Lee 04’) re-orders the (top) pseudo-feedback-based retrieval results by the initial scores of documents. This method anchors the documents in PF(D_init) to the query by using their initial scores.

\[ \mathrm{Score}_{\mathrm{rerank}}(d) \stackrel{\mathrm{def}}{=} \delta[d \in \mathrm{PF}(D_{\mathrm{init}})]\, \mathrm{Score}_{\mathrm{init}}(d) \]

rev_rerank

The rev_rerank method re-orders the (top) initial retrieval results by the pseudo-feedback-based scores of documents.

\[ \mathrm{Score}_{\mathrm{rev\_rerank}}(d) \stackrel{\mathrm{def}}{=} \delta[d \in D_{\mathrm{init}}]\, \mathrm{Score}_{\mathrm{pf}}(d) \]
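
Both re-ordering methods amount to sorting one list by the scores of the other, as the sketch below shows. It assumes positive scores and gives a document that has no score in the other list a score of 0, which is a choice made for this illustration; the function names and toy values are hypothetical.

```python
def rerank(pf_list, init_scores):
    """Re-order the documents of PF(D_init) by their initial scores."""
    # Assumption: a document without an initial score is treated as scoring 0.
    return sorted(pf_list, key=lambda d: init_scores.get(d, 0.0), reverse=True)


def rev_rerank(init_list, pf_scores):
    """Re-order the documents of D_init by their pseudo-feedback-based scores."""
    # Assumption: a document without a pseudo-feedback score is treated as scoring 0.
    return sorted(init_list, key=lambda d: pf_scores.get(d, 0.0), reverse=True)


init_scores = {"d1": 3.0, "d2": 2.0, "d3": 1.0}  # toy Score_init values
pf_scores = {"d4": 5.0, "d3": 4.0, "d2": 3.0}    # toy Score_pf values

reranked = rerank(["d4", "d3", "d2"], init_scores)
rev_reranked = rev_rerank(["d1", "d2", "d3"], pf_scores)
```

Note that neither method changes which documents are returned, only their order: rerank keeps the documents of PF(D_init), and rev_rerank keeps those of D_init.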


Experimental Evaluation

Evaluation

Evaluation methods:

MAP - Mean Average Precision - effectiveness measurement

<Init - Percentage of queries for which the expansion-based performance is worse than that of using the original query (measure of robustness)
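
The two measures can be sketched as follows. This illustration assumes binary relevance judgments and, for <Init, that per-query performance is compared using average precision; the slides do not state which per-query measure is used, so that choice, like the function names and toy values, is an assumption.

```python
def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(ap_by_query):
    """MAP: the mean of the per-query average precision values."""
    return sum(ap_by_query.values()) / len(ap_by_query)


def pct_worse_than_init(init_ap, expanded_ap):
    """<Init: percentage of queries whose expansion-based AP is below the initial AP."""
    worse = sum(1 for q in init_ap if expanded_ap[q] < init_ap[q])
    return 100.0 * worse / len(init_ap)


ap_example = average_precision(["d1", "d2", "d3"], relevant={"d1", "d3"})

# Toy per-query AP values (made up) for the initial and the expansion-based rankings.
init_ap = {"q1": 0.5, "q2": 0.4, "q3": 0.3, "q4": 0.2}
expanded_ap = {"q1": 0.6, "q2": 0.3, "q3": 0.3, "q4": 0.5}

map_expanded = mean_average_precision(expanded_ap)
robustness = pct_worse_than_init(init_ap, expanded_ap)
```

In the toy data the expansion raises MAP from 0.35 to 0.425 while still hurting one query in four, which is the kind of gap between average effectiveness and robustness that <Init is meant to expose.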

TREC collections:

corpus   queries            disks
TREC     51-200             1-3
ROBUST   301-450, 601-700   4,5
WSJ      151-200            1-2
SJMN     51-150             3
AP       51-150             1-3



Query Drift Prevention Methods Applied for RM1

[Figure: Query Drift Prevention Methods Applied for RM1 - MAP. x-axis: corpus (TREC, ROBUST, WSJ, SJMN, AP); y-axis: MAP; series: RM1, Interpolation, combmnz, rerank, rev_rerank, RM3.]

[Figure: Query Drift Prevention Methods Applied for RM1 - Robustness. x-axis: corpus; y-axis: <Init; series: RM1, Interpolation, combmnz, rerank, rev_rerank, RM3.]

"i" and "e" indicate statistically significant MAP differences with the initial ranking and RM1, respectively.



Robustness of Expansion Methods w/o Combmnz

[Figure: Robustness of Expansion Methods. x-axis: corpus; y-axis: <Init; series: RM1, RM3, Rocchio1, Rocchio3.]

[Figure: Robustness of Combmnz Applied for Expansion Methods. x-axis: corpus; y-axis: <Init; series: RM1 combmnz, RM3 combmnz, Rocchio1 combmnz, Rocchio3 combmnz.]



Robustness Improvement Due to Combmnz

[Figure: Robustness Improvement Due to Combmnz. x-axis: corpus (TREC, ROBUST, WSJ, SJMN, AP, AVERAGE); y-axis: % improvement (from 0 to 0.6); series: RM3 combmnz, RM1 combmnz, Rocchio3 combmnz, Rocchio1 combmnz.]



RM3 - The Impact of λ on Effectiveness and Robustness

\[ p_{\mathrm{RM3}}(w) \stackrel{\mathrm{def}}{=} \lambda\, p_q(w) + (1-\lambda)\, p_{\mathrm{RM1}}(w) \]



Comparison with a Cluster-Based Re-sampling Method (Lee et al. 08’)

              TREC          ROBUST        WSJ           SJMN          AP
              MAP   <Init   MAP   <Init   MAP   <Init   MAP   <Init   MAP   <Init
RM3           20    28.7    30    28.1    34.8  20      24.6  29      29.1  28.3
RM3 combmnz   17.9  16.7    27.1  19.3    30.7  18      21.6  23      26.5  16.2
RM3 rerank    16.9  22.7    25.5  15.3    28.4  14      19.9  11      25.1  12.1
Clusters      19.8  31.3    29.9  32.9    32.7  24      25    31      29.4  28.3


Related Work

Improving Robustness

Selecting, sampling, and weighting documents from the initial search (e.g., Billerbeck and Zobel 03’, Li and Croft 05’, Tao and Zhai 06’, Collins-Thompson and Callan 07’)

Selecting and weighting terms (Mitra et al. 98’, Carpineto et al. 01’, Cao et al. 08’)


Related Work - Cont.

Using clustering (Lu et al. 97’, Buckley et al. 98’, Lee et al. 08’)

Predicting whether a given expanded query will be more effective than the original one (Cronen-Townsend et al. 04’)

Predicting which expansion form from a set of candidates will perform best (Winaver et al. 07’)

Query-anchoring at the model level (Zhai and Lafferty 01’, Abdul-Jaleel et al. 04’)


Summary

Fusion can potentially ameliorate query drift (similarity-based vs. rank-based)

Trade-off between effectiveness and robustness

Pre-retrieval vs. post-retrieval query anchoring


Questions?

Thank you for your time
