a formal study of information retrieval heuristics

A Formal Study of Information Retrieval Heuristics Hui Fang , Tao Tao , ChengXiang Zhai University of Illinois at Urbana Champaign Urbana SIGIR 2004 Presented by CHU Huei-Ming 2004/01/17

Page 1: A Formal Study of Information Retrieval Heuristics

A Formal Study of Information Retrieval Heuristics

Hui Fang, Tao Tao, ChengXiang Zhai

University of Illinois at Urbana-Champaign, Urbana

SIGIR 2004

Presented by CHU Huei-Ming 2004/01/17

Page 2: A Formal Study of Information Retrieval Heuristics

Outline

• Formal Definitions of Heuristic Retrieval Constraints
• Analysis of Three Representative Retrieval Formulas
– Pivoted Normalization Method
– Okapi Method
– Dirichlet Prior Method
• Experiments
• Conclusion and Future Work

Page 3: A Formal Study of Information Retrieval Heuristics

Formal Definitions of Heuristic Retrieval Constraints

• Six intuitive and desirable constraints
• Any reasonable retrieval formula should satisfy
– Term Frequency Constraints (TFCs)
– Term Discrimination Constraints (TDC)
– Length Normalization Constraints (LNCs)
– TF-Length Constraints (TF-LNC)

Page 4: A Formal Study of Information Retrieval Heuristics

Formal Definitions of Heuristic Retrieval Constraints

• Term Frequency Constraints (TFCs)
– TFC1: Let q = {w}. Assume |d1| = |d2|. If c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q)
– TFC2: Let q = {w}. Assume |d1| = |d2| = |d3| and c(w,d1) > 0. If c(w,d2) - c(w,d1) = 1 and c(w,d3) - c(w,d2) = 1, then f(d2,q) - f(d1,q) > f(d3,q) - f(d2,q)
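
As a rough illustration, the two TFCs can be checked numerically for a candidate TF transform. The sketch below assumes the single-term, equal-length setting of the constraints, so only the TF component of a formula matters. The transform 1 + ln(1 + ln(c)) is the one used by the pivoted normalization formula analyzed later in the talk; the function names and the tested count range are illustrative choices, not part of the paper.

```python
import math

def tf_weight(count: int) -> float:
    """Sublinear TF transform 1 + ln(1 + ln(c)); absent terms get weight 0."""
    return 1.0 + math.log(1.0 + math.log(count)) if count > 0 else 0.0

def satisfies_tfc1(weight, max_count: int = 50) -> bool:
    """TFC1: at equal document length, a higher count must score strictly higher."""
    return all(weight(c + 1) > weight(c) for c in range(0, max_count))

def satisfies_tfc2(weight, max_count: int = 50) -> bool:
    """TFC2: the gain from one extra occurrence must shrink as the count grows."""
    return all(
        weight(c + 1) - weight(c) > weight(c + 2) - weight(c + 1)
        for c in range(1, max_count)  # TFC2 assumes c(w,d1) > 0
    )

print(satisfies_tfc1(tf_weight), satisfies_tfc2(tf_weight))
# Raw counts grow linearly, so the gain never shrinks and TFC2 fails.
print(satisfies_tfc2(lambda c: float(c)))
```

A check like this only covers a finite range of counts, so it can refute a constraint but cannot prove it in general.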

Page 5: A Formal Study of Information Retrieval Heuristics

Formal Definitions of Heuristic Retrieval Constraints

• Term Discrimination Constraints (TDC)
– Let q be a query, and let w1, w2 ∈ q be two query terms
– Assume |d1| = |d2| and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2)

– If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q)

Page 6: A Formal Study of Information Retrieval Heuristics

Formal Definitions of Heuristic Retrieval Constraints

• Length Normalization Constraints (LNCs)
– LNC1: Let q be a query and d1, d2 be two documents. If for some word w' ∉ q, c(w',d2) = c(w',d1) + 1, but for any query term w, c(w,d2) = c(w,d1), then f(d1,q) ≥ f(d2,q)
– LNC2: Let q be a query and d1, d2 be two documents. For all k > 1, if |d1| = k · |d2| and c(w,d1) = k · c(w,d2) for all terms w, then f(d1,q) ≥ f(d2,q)

Page 7: A Formal Study of Information Retrieval Heuristics

Formal Definitions of Heuristic Retrieval Constraints

• TF-Length Constraints (TF-LNC)
– Let q = {w}, and let d1 and d2 be two documents
– If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2)
– then f(d1,q) > f(d2,q)

Page 8: A Formal Study of Information Retrieval Heuristics


Formal Definitions of Heuristic Retrieval Constraints

Page 9: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas

• Pivoted Normalization Method
• Okapi Method
• Dirichlet Prior Method

Page 10: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method

• Retrieval function

$$f(d,q) = \sum_{w \in q \cap d} \frac{1 + \ln(1 + \ln(c(w,d)))}{(1-s) + s\,\frac{|d|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}$$

• Analyzing
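
The pivoted normalization retrieval function can be sketched in a few lines of Python. This is a minimal illustration, assuming the query and document are token lists and that document frequencies are supplied in a dictionary. The function name, the argument names, and the default s = 0.2 are assumptions made for the example, not values taken from the slides.

```python
import math
from collections import Counter

def pivoted_score(query, doc, df, num_docs, avdl, s=0.2):
    """Pivoted normalization score of a tokenized document for a tokenized query.

    df maps each term to its document frequency; num_docs is N; avdl is the
    average document length; s is the slope parameter (default is illustrative).
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    # Length normalization: (1 - s) + s * |d| / avdl
    norm = (1.0 - s) + s * len(doc) / avdl
    score = 0.0
    for w, cq in q_counts.items():
        cd = d_counts.get(w, 0)
        if cd == 0:
            continue  # the sum runs only over terms shared by query and document
        tf = 1.0 + math.log(1.0 + math.log(cd))
        idf = math.log((num_docs + 1) / df[w])
        score += tf / norm * cq * idf
    return score

doc = ["retrieval", "model", "retrieval", "text"]
print(pivoted_score(["retrieval"], doc, {"retrieval": 10}, num_docs=99, avdl=4))
```

When |d| equals avdl, the normalizer is 1 regardless of s, which is why the analysis that follows often fixes one document length at avdl.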

Page 11: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method

• Check the TF-LNC constraint: when |d1| = avdl, it is equivalent to

$$s < \frac{\bigl(h(c(w,d_1)) - h(c(w,d_2))\bigr) \times avdl}{\bigl(c(w,d_1) - c(w,d_2)\bigr) \times \bigl(1 + h(c(w,d_1))\bigr)}, \qquad h(x) = \ln(1 + \ln(x))$$

• TF-LNC is satisfied only if s is below a certain upper bound
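
To make the upper bound on s concrete, the following sketch evaluates it for given term counts and checks it against the single-term pivoted score. It assumes |d1| = avdl and c(w,d1) > c(w,d2) ≥ 1, as on the slide; the helper names and the sample counts are hypothetical.

```python
import math

def h(x):
    """h(x) = ln(1 + ln(x)), as defined on the slide."""
    return math.log(1.0 + math.log(x))

def tf_lnc_upper_bound(c1, c2, avdl):
    """Upper bound on s for TF-LNC, assuming |d1| = avdl and c1 > c2 >= 1."""
    return (h(c1) - h(c2)) * avdl / ((c1 - c2) * (1.0 + h(c1)))

def single_term_score(count, doc_len, avdl, s):
    """Pivoted score for a one-term query, with the shared IDF factor dropped."""
    return (1.0 + h(count)) / ((1.0 - s) + s * doc_len / avdl)

avdl, c1, c2 = 100, 50, 49
bound = tf_lnc_upper_bound(c1, c2, avdl)
len_d1, len_d2 = avdl, avdl - (c1 - c2)  # |d1| = |d2| + c1 - c2
for s in (0.9 * bound, 1.1 * bound):
    holds = single_term_score(c1, len_d1, avdl, s) > single_term_score(c2, len_d2, avdl, s)
    print(f"s={s:.3f} TF-LNC holds: {holds}")
```

For small counts the bound exceeds 1 and never binds; it becomes restrictive only when the term already occurs many times in both documents.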

Page 12: A Formal Study of Information Retrieval Heuristics

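Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method

• Check the LNC2 constraint

$$\frac{1 + \ln(1 + \ln(k \cdot c(w,d_2)))}{(1-s) + s\,\frac{k \cdot |d_2|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)} \;\ge\; \frac{1 + \ln(1 + \ln(c(w,d_2)))}{(1-s) + s\,\frac{|d_2|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}$$

Writing tf_1 and tf_2 for the numerators on the left and right sides, this is equivalent to

$$s \le \frac{tf_1 - tf_2}{\left(\frac{k \cdot |d_2|}{avdl} - 1\right) tf_2 - \left(\frac{|d_2|}{avdl} - 1\right) tf_1}$$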

Page 13: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method

• Consider the common case when |d2| = avdl

$$s \le \left(\frac{tf_1}{tf_2} - 1\right) \times \frac{1}{k - 1}$$

$$tf_1 = 1 + \ln(1 + \ln(k \cdot c(w,d_2))), \qquad tf_2 = 1 + \ln(1 + \ln(c(w,d_2)))$$

• Performance can be bad for a large s
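
The LNC2 bound for the common case |d2| = avdl is easy to evaluate numerically. The sketch below is only an illustration under that assumption; the function names and the sample values of k and c(w,d2) are arbitrary choices.

```python
import math

def tf(count):
    """TF part of the pivoted formula: 1 + ln(1 + ln(c)), for c >= 1."""
    return 1.0 + math.log(1.0 + math.log(count))

def lnc2_upper_bound(c2, k):
    """Upper bound on s from LNC2 when |d2| = avdl, c(w,d1) = k * c(w,d2), k > 1."""
    tf1, tf2 = tf(k * c2), tf(c2)
    return (tf1 / tf2 - 1.0) / (k - 1.0)

# The bound tightens as the document is concatenated with itself more times.
for k in (2, 5, 10):
    print(k, lnc2_upper_bound(1, k))
```

Because the bound shrinks as k grows, a large s is the setting most likely to violate LNC2, which matches the remark that performance can be bad for a large s.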

Page 14: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method

• Check the TDC constraint
– It is equivalent to c(w2,d1) ≥ c(w1,d2), so TDC is only conditionally satisfied

Page 15: A Formal Study of Information Retrieval Heuristics

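Analysis of Three Representative Retrieval Formulas: Okapi Method

• Retrieval function

$$f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \times \frac{(k_1 + 1) \times c(w,d)}{k_1\left((1-b) + b\,\frac{|d|}{avdl}\right) + c(w,d)} \times \frac{(k_3 + 1) \times c(w,q)}{k_3 + c(w,q)}$$

• Parameters: k1 (between 1.0 and 2.0), b (usually 0.75), and k3 (between 0 and 1000)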
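
A minimal Python sketch of the Okapi retrieval function follows. It assumes tokenized inputs and a dictionary of document frequencies; the function and argument names are illustrative. The defaults k1 = 1.2, b = 0.75, and k3 = 1000 are taken from the parameter values quoted in this talk.

```python
import math
from collections import Counter

def okapi_score(query, doc, df, num_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Okapi score of a tokenized document for a tokenized query."""
    q_counts, d_counts = Counter(query), Counter(doc)
    # Length-normalized k1 term: k1 * ((1 - b) + b * |d| / avdl)
    length_norm = k1 * ((1.0 - b) + b * len(doc) / avdl)
    score = 0.0
    for w, cq in q_counts.items():
        cd = d_counts.get(w, 0)
        if cd == 0:
            continue  # only terms occurring in both query and document contribute
        idf = math.log((num_docs - df[w] + 0.5) / (df[w] + 0.5))
        doc_part = (k1 + 1.0) * cd / (length_norm + cd)
        query_part = (k3 + 1.0) * cq / (k3 + cq)
        score += idf * doc_part * query_part
    return score

print(okapi_score(["a"], ["a", "b"], {"a": 1}, num_docs=10, avdl=2))
```

Note that the IDF factor here can change sign, unlike the one in the pivoted normalization formula.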

Page 16: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas: Okapi Method

• Analysis
– When df(w) > N/2, the IDF part of the formula is negative
– When the IDF part is positive (mostly true for keyword queries)
– TFCs and LNCs are satisfied
– TF-LNC constraint: considering the common case when |d2| = avdl, the constraint is equivalent to b ≤ avdl / c(w, d2)
– TDC is equivalent to c(w2,d1) ≥ c(w1,d2), the same condition as for the pivoted normalization method
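
The sign change of the Okapi IDF part can be seen directly by evaluating it on both sides of df(w) = N/2. The snippet below is a small sketch; the function name and the collection size are made up for the example.

```python
import math

def okapi_idf(df, num_docs):
    """IDF part of the Okapi formula: ln((N - df + 0.5) / (df + 0.5))."""
    return math.log((num_docs - df + 0.5) / (df + 0.5))

num_docs = 1000  # hypothetical collection size
for df in (100, 500, 600):
    print(df, okapi_idf(df, num_docs))
```

A term that occurs in more than half of the documents therefore lowers the score of every document that contains it, which matters most for verbose queries with many common words.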

Page 17: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas: Okapi Method

• Modify Okapi Method
– Solve the problem of negative IDF
– Replace the original IDF in Okapi with the regular IDF in the pivoted normalization formula
– The performance is better on the verbose queries
• Analysis result

Page 18: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method

• Retrieval function

$$f(d,q) = \sum_{w \in q \cap d} c(w,q) \cdot \ln\left(1 + \frac{c(w,d)}{\mu \cdot p(w|C)}\right) + |q| \cdot \ln\frac{\mu}{|d| + \mu}$$

• Use the Dirichlet prior smoothing method to smooth a document language model

$$p(w|d) = \frac{c(w,d) + \mu \cdot p(w|C)}{\sum_{w} c(w,d) + \mu}$$

• Rank the documents by the likelihood of the query under the estimated language model of each document
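
The link between the smoothed language model and the Dirichlet prior retrieval function can be sketched as follows. The code assumes tokenized inputs and a dictionary of collection probabilities p(w|C); the names and the default μ = 2000 are assumptions for illustration, not values from the slides.

```python
import math
from collections import Counter

def smoothed_prob(w, doc, p_collection, mu=2000.0):
    """Dirichlet-smoothed p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    return (Counter(doc)[w] + mu * p_collection[w]) / (len(doc) + mu)

def dirichlet_score(query, doc, p_collection, mu=2000.0):
    """Dirichlet prior retrieval function for tokenized query and document."""
    q_counts, d_counts = Counter(query), Counter(doc)
    score = len(query) * math.log(mu / (len(doc) + mu))
    for w, cq in q_counts.items():
        cd = d_counts.get(w, 0)
        if cd > 0:  # the sum runs only over terms shared by query and document
            score += cq * math.log(1.0 + cd / (mu * p_collection[w]))
    return score

query = ["a", "b", "a"]
doc = ["a", "c", "c", "a", "d"]
p_c = {"a": 0.01, "b": 0.02, "c": 0.3, "d": 0.1}  # made-up collection model
log_likelihood = sum(math.log(smoothed_prob(w, doc, p_c, mu=10.0)) for w in query)
constant = sum(math.log(p_c[w]) for w in query)  # does not depend on the document
print(log_likelihood - constant, dirichlet_score(query, doc, p_c, mu=10.0))
```

The query log-likelihood and the retrieval function differ only by a term that is the same for every document, so both produce the same ranking.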

Page 19: A Formal Study of Information Retrieval Heuristics

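Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method

• Analysis
– The LNC2 constraint is equivalent to c(w,d2) ≥ |d2| · p(w|C), which is usually satisfied for content-carrying words
– The TDC constraint leads to a lower bound for the parameter μ

$$\ln\left(1 + \frac{c(w_1,d_1)}{\mu\,p(w_1|C)}\right) + \ln\left(1 + \frac{c(w_2,d_1)}{\mu\,p(w_2|C)}\right) + 2\ln\frac{\mu}{|d_1| + \mu} \;\ge\; \ln\left(1 + \frac{c(w_1,d_2)}{\mu\,p(w_1|C)}\right) + \ln\left(1 + \frac{c(w_2,d_2)}{\mu\,p(w_2|C)}\right) + 2\ln\frac{\mu}{|d_2| + \mu}$$

$$\mu \ge \frac{c(w_1,d_1) - c(w_2,d_2)}{p(w_2|C) - p(w_1|C)}$$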

Page 20: A Formal Study of Information Retrieval Heuristics

Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method

• Analysis
– TDC: consider a common case for w2 where p(w2|C) = 1/avdl

$$\mu \ge \frac{c(w_1,d_1) - c(w_2,d_2)}{p(w_2|C)} = avdl \times \bigl(c(w_1,d_1) - c(w_2,d_2)\bigr)$$

– This means that for discriminative words with a high term frequency in a document, μ needs to be sufficiently large
– in order to balance TF and IDF appropriately
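
The two lower bounds on μ from the TDC analysis can be compared with a short calculation. This sketch assumes p(w2|C) > p(w1|C), meaning w1 is the more discriminative term; the function names and sample numbers are hypothetical.

```python
def tdc_mu_lower_bound(c_w1_d1, c_w2_d2, p_w1, p_w2):
    """General TDC bound: (c(w1,d1) - c(w2,d2)) / (p(w2|C) - p(w1|C)).

    Assumes p_w2 > p_w1, i.e. w1 is rarer in the collection than w2.
    """
    return (c_w1_d1 - c_w2_d2) / (p_w2 - p_w1)

def tdc_mu_common_case(c_w1_d1, c_w2_d2, avdl):
    """Simplified bound for p(w2|C) = 1/avdl: avdl * (c(w1,d1) - c(w2,d2))."""
    return avdl * (c_w1_d1 - c_w2_d2)

avdl = 500  # hypothetical average document length
print(tdc_mu_common_case(5, 1, avdl))
print(tdc_mu_lower_bound(5, 1, p_w1=1e-5, p_w2=1.0 / avdl))
```

The simplified bound is slightly looser than the general one, because dropping p(w1|C) from the denominator can only make the right-hand side smaller.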

Page 21: A Formal Study of Information Retrieval Heuristics

Experiments: Setup

• Document set
– AP: news articles
– DOE: technical reports
– FR: government documents
– ADF: combination of AP, DOE, and FR
– Web: web data used in TREC8
– Trec7: ad hoc data used in TREC7
– Trec8: ad hoc data used in TREC8

Page 22: A Formal Study of Information Retrieval Heuristics

Experiments: Setup

• Query combination
– Short-keyword (SK, keyword title)
– Short-verbose (SV, one-sentence description)
– Long-keyword (LK, keyword list)
– Long-verbose (LV, multiple sentences)
• Preprocessing
– Only stemming with the Porter stemmer
– No stop words have been removed

Page 23: A Formal Study of Information Retrieval Heuristics

Experiments: Parameter Sensitivity

• Pivoted normalization method

• The analysis of the LNC2 constraint for the pivoted normalization method suggests that s should be smaller than 0.4

Page 24: A Formal Study of Information Retrieval Heuristics

Experiments: Parameter Sensitivity

• Okapi method: k1 = 1.2, k3 = 1000, b varies from 0.1 to 1.0

Page 25: A Formal Study of Information Retrieval Heuristics

Experiments: Parameter Sensitivity

• Dirichlet prior method

Page 26: A Formal Study of Information Retrieval Heuristics

Experiments: Parameter Sensitivity

• Dirichlet prior method

Page 27: A Formal Study of Information Retrieval Heuristics

Experiments: Performance Comparison

Page 28: A Formal Study of Information Retrieval Heuristics

Experiments: Performance Comparison

• For any query type, the performance of Dirichlet prior method is comparable to pivoted normalization method

• For keyword queries, the performance of Okapi is comparable to the other two retrieval formulas

• For verbose queries, the performance of Okapi may be worse than others due to the possible negative IDF part in the formula

Page 29: A Formal Study of Information Retrieval Heuristics

Experiments: Performance Comparison

• Average precision comparison

Page 30: A Formal Study of Information Retrieval Heuristics


Conclusion and Future Work

• Define six basic constraints that any reasonable retrieval function should satisfy

• When a constraint is not satisfied, it often indicates non-optimality of the method