A Formal Study of Information Retrieval Heuristics
Hui Fang, Tao Tao, ChengXiang Zhai
University of Illinois at Urbana-Champaign, Urbana
SIGIR 2004
Presented by CHU Huei-Ming 2004/01/17
Outline
• Formal Definitions of Heuristic Retrieval Constraints
• Analysis of Three Representative Retrieval Formulas
– Pivoted Normalization Method
– Okapi Method
– Dirichlet Prior Method
• Experiments
• Conclusion and Future Work
Formal Definitions of Heuristic Retrieval Constraints
• Six intuitive and desirable constraints
• Any reasonable retrieval formula should satisfy
– Term Frequency Constraints (TFCs)
– Term Discrimination Constraint (TDC)
– Length Normalization Constraints (LNCs)
– TF-Length Constraint (TF-LNC)
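• Notation: c(w,d) is the count of term w in document d, |d| is the length of d, and f(d,q) is the retrieval score of d for query q
• Term Frequency Constraints (TFCs)
– TFC1: Let q = {w} and assume |d1| = |d2|. If c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q)
– TFC2: Let q = {w} and assume |d1| = |d2| = |d3| and c(w,d1) > 0. If c(w,d2) - c(w,d1) = 1 and c(w,d3) - c(w,d2) = 1, then f(d2,q) - f(d1,q) > f(d3,q) - f(d2,q)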
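The two TFCs can be checked numerically for a single-term query. The sketch below is only an illustration: `tf_score` is an assumed toy scoring function (the double-log TF transform at fixed document length), and `linear`, `satisfies_tfc1`, and `satisfies_tfc2` are hypothetical helper names, not part of the paper.

```python
import math

def tf_score(count):
    """Toy single-term score at fixed document length: 1 + ln(1 + ln(c)).

    Assumption: only the TF part matters because all documents have equal length.
    """
    return 1.0 + math.log(1.0 + math.log(count)) if count > 0 else 0.0

def linear(count):
    """Raw term frequency, used as a counter-example for TFC2."""
    return float(count)

def satisfies_tfc1(score, max_count=50):
    # TFC1: a higher count must give a strictly higher score.
    return all(score(c + 1) > score(c) for c in range(0, max_count))

def satisfies_tfc2(score, max_count=50):
    # TFC2: the gain from each extra occurrence must shrink (starting at c >= 1).
    return all(
        score(c + 1) - score(c) > score(c + 2) - score(c + 1)
        for c in range(1, max_count)
    )

print("double-log TF:", satisfies_tfc1(tf_score), satisfies_tfc2(tf_score))
print("raw TF:       ", satisfies_tfc1(linear), satisfies_tfc2(linear))
```

Raw term frequency rewards every extra occurrence equally, so it passes TFC1 but not TFC2 in this check, while the sublinear transform passes both.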
• Term Discrimination Constraint (TDC)
– Let q be a query, and let w1, w2 ∈ q be two query terms
– Assume |d1| = |d2| and c(w1,d1) + c(w2,d1) = c(w1,d2) + c(w2,d2)
– If idf(w1) ≥ idf(w2) and c(w1,d1) ≥ c(w1,d2), then f(d1,q) ≥ f(d2,q)
• Length Normalization Constraints (LNCs)
– LNC1
• Let q be a query, and let d1 and d2 be two documents
• If for some word w′ ∉ q, c(w′,d2) = c(w′,d1) + 1, but for any query term w, c(w,d2) = c(w,d1), then f(d1,q) ≥ f(d2,q)
– LNC2
• Let q be a query, and let d1 and d2 be two documents. For all k > 1:
• If |d1| = k · |d2| and for all terms w, c(w,d1) = k · c(w,d2),
• then f(d1,q) ≥ f(d2,q)
• TF-Length Constraint (TF-LNC)
– Let q = {w}, and let d1 and d2 be two documents
– If c(w,d1) > c(w,d2) and |d1| = |d2| + c(w,d1) - c(w,d2)
– then f(d1,q) > f(d2,q)
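The length-related constraints can be written as small predicates over a single-term scoring function. This is a sketch under stated assumptions: `toy_score` is a hypothetical score with pivoted-style length normalization, and the values `avdl=100` and `s=0.2` are arbitrary illustration values.

```python
import math

def toy_score(count, doc_len, avdl=100.0, s=0.2):
    """Hypothetical single-term score: sublinear TF over pivoted length normalization."""
    if count == 0:
        return 0.0
    tf = 1.0 + math.log(1.0 + math.log(count))
    return tf / ((1.0 - s) + s * doc_len / avdl)

def check_lnc1(score, count, doc_len):
    # d2 is d1 plus one non-query word: the score must not increase.
    return score(count, doc_len) >= score(count, doc_len + 1)

def check_lnc2(score, count, doc_len, k):
    # d1 is d2 concatenated with itself k times: the score must not drop.
    return score(k * count, k * doc_len) >= score(count, doc_len)

def check_tf_lnc(score, c1, c2, len2):
    # d1 is d2 plus (c1 - c2) extra occurrences of the query term.
    if c1 <= c2:
        raise ValueError("TF-LNC requires c(w,d1) > c(w,d2)")
    return score(c1, len2 + (c1 - c2)) > score(c2, len2)

print(check_lnc1(toy_score, 3, 100))
print(check_lnc2(toy_score, 2, 100, 2), check_lnc2(toy_score, 2, 100, 10))
print(check_tf_lnc(toy_score, 5, 2, 100))
```

With these toy settings, LNC2 holds for k = 2 but fails for k = 10, which hints at why length normalization parameters need an upper bound.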
Analysis of Three Representative Retrieval Formulas
• Pivoted Normalization Method
• Okapi Method
• Dirichlet Prior Method
Analysis of Three Representative Retrieval Formulas: Pivoted Normalization Method
• Retrieval function

$$
f(d,q) = \sum_{w \in q \cap d} \frac{1 + \ln\big(1 + \ln(c(w,d))\big)}{(1-s) + s\,\frac{|d|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}
$$

• Analysis
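A minimal sketch of the pivoted normalization retrieval function, assuming tokenized input. The function name, the argument layout (`df` as a dictionary of document frequencies), and the default `s=0.2` are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def pivoted_score(query, doc, df, num_docs, avdl, s=0.2):
    """Pivoted normalization score of one document for one query.

    query, doc: lists of tokens; df: dict mapping term -> document frequency.
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    norm = (1.0 - s) + s * len(doc) / avdl
    score = 0.0
    for w, c_wq in q_counts.items():
        c_wd = d_counts.get(w, 0)
        if c_wd == 0:
            continue  # the sum only runs over terms in both q and d
        tf = 1.0 + math.log(1.0 + math.log(c_wd))
        idf = math.log((num_docs + 1) / df[w])
        score += tf / norm * c_wq * idf
    return score

example = pivoted_score(["a"], ["a", "b", "a"], {"a": 10}, num_docs=99, avdl=3)
print(example)
```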
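• Check the TF-LNC constraint: when |d1| = avdl, it is equivalent to the following condition on s

$$
s \le \frac{\big(h(c(w,d_1)) - h(c(w,d_2))\big) \cdot avdl}{\big(c(w,d_1) - c(w,d_2)\big) \cdot \big(1 + h(c(w,d_1))\big)},
\qquad h(x) = \ln(1 + \ln(x))
$$

• TF-LNC is satisfied only if s is below a certain upper bound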
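The upper bound on s can be evaluated directly and compared with the scores it is supposed to order. The sketch below assumes |d1| = avdl and uses a deliberately small avdl so that the bound falls below 1; the helper names are hypothetical.

```python
import math

def h(x):
    return math.log(1.0 + math.log(x))

def tf_lnc_upper_bound(c1, c2, avdl):
    """Upper bound on s from TF-LNC when |d1| = avdl and c1 > c2 >= 1."""
    return (h(c1) - h(c2)) * avdl / ((c1 - c2) * (1.0 + h(c1)))

def pivoted_single_term(count, doc_len, avdl, s):
    # TF and length parts only; the shared IDF factor cancels in the comparison.
    return (1.0 + h(count)) / ((1.0 - s) + s * doc_len / avdl)

avdl, c1, c2 = 10, 9, 1
len1, len2 = avdl, avdl - (c1 - c2)  # |d1| = |d2| + c1 - c2
bound = tf_lnc_upper_bound(c1, c2, avdl)
for s in (0.5, 0.9):
    ok = pivoted_single_term(c1, len1, avdl, s) > pivoted_single_term(c2, len2, avdl, s)
    print(f"s={s}: below bound={s < bound}, TF-LNC holds={ok}")
```

For realistic collections avdl is much larger than the count difference, so the bound is often above 1 and TF-LNC holds for any s in [0, 1].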
• Check the LNC2 constraint: it is equivalent to

$$
\frac{1 + \ln\big(1 + \ln(k \cdot c(w,d_2))\big)}{(1-s) + s\,\frac{k\,|d_2|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}
\;\ge\;
\frac{1 + \ln\big(1 + \ln(c(w,d_2))\big)}{(1-s) + s\,\frac{|d_2|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}
$$

that is,

$$
s \le \frac{tf_1 - tf_2}{\left(\frac{k\,|d_2|}{avdl} - 1\right) tf_2 - \left(\frac{|d_2|}{avdl} - 1\right) tf_1}
$$

where

$$
tf_1 = 1 + \ln\big(1 + \ln(k \cdot c(w,d_2))\big), \qquad tf_2 = 1 + \ln\big(1 + \ln(c(w,d_2))\big)
$$

• Consider the common case when |d2| = avdl

$$
s \le \left(\frac{tf_1}{tf_2} - 1\right) \cdot \frac{1}{k-1}
$$

• Performance can be bad for a large s
14
Analysis of Three Representative Retrieval Formulas Pivoted Normalization Method
• Check the TDC constraint
– It is equivalent to c(w2,d1) ≥ c(w1,d2), so TDC is only conditionally satisfied
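Analysis of Three Representative Retrieval Formulas: Okapi Method
• Retrieval function

$$
f(d,q) = \sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \times \frac{(k_1 + 1) \cdot c(w,d)}{k_1\left((1-b) + b\,\frac{|d|}{avdl}\right) + c(w,d)} \times \frac{(k_3 + 1) \cdot c(w,q)}{k_3 + c(w,q)}
$$

• Parameters: k1 (between 1.0 and 2.0), b (usually 0.75), and k3 (between 0 and 1000)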
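A minimal sketch of the Okapi retrieval function, assuming tokenized input. The function name and argument layout are illustrative; the defaults k1 = 1.2, b = 0.75, and k3 = 1000 are picked from the parameter ranges mentioned in the slides.

```python
import math
from collections import Counter

def okapi_score(query, doc, df, num_docs, avdl, k1=1.2, b=0.75, k3=1000.0):
    """Okapi score of one document for one query.

    query, doc: lists of tokens; df: dict mapping term -> document frequency.
    """
    q_counts, d_counts = Counter(query), Counter(doc)
    length_norm = k1 * ((1.0 - b) + b * len(doc) / avdl)
    score = 0.0
    for w, c_wq in q_counts.items():
        c_wd = d_counts.get(w, 0)
        if c_wd == 0:
            continue  # the sum only runs over terms in both q and d
        idf = math.log((num_docs - df[w] + 0.5) / (df[w] + 0.5))
        tf_part = (k1 + 1.0) * c_wd / (length_norm + c_wd)
        qtf_part = (k3 + 1.0) * c_wq / (k3 + c_wq)
        score += idf * tf_part * qtf_part
    return score

doc = ["a", "b", "a", "c"]
print(okapi_score(["a"], doc, {"a": 10}, num_docs=100, avdl=4))
print(okapi_score(["a"], doc, {"a": 60}, num_docs=100, avdl=4))  # df > N/2
```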
• Analysis
– When df(w) > N/2, the IDF part of the formula is negative
– When the IDF part is positive (mostly true for keyword queries), the TFCs and LNCs are satisfied
– TF-LNC constraint: in the common case |d2| = avdl, the constraint is equivalent to b ≤ avdl / c(w,d2)
– TDC is equivalent to c(w2,d1) ≥ c(w1,d2), the same condition as for the pivoted normalization formula
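The TF-LNC condition on b can be checked numerically for the TF part of Okapi. The sketch assumes a positive IDF (so it cancels in the comparison) and |d2| = avdl; values of b above 1 are outside the usual range and are used here only to cross the bound avdl / c(w,d2). All names are hypothetical.

```python
def okapi_tf(count, doc_len, avdl, k1=1.2, b=0.75):
    """TF part of the Okapi formula with length normalization."""
    return (k1 + 1.0) * count / (k1 * ((1.0 - b) + b * doc_len / avdl) + count)

def tf_lnc_holds(c1, c2, avdl, b):
    # |d2| = avdl and |d1| = |d2| + c1 - c2.
    return okapi_tf(c1, avdl + (c1 - c2), avdl, b=b) > okapi_tf(c2, avdl, avdl, b=b)

avdl, c1, c2 = 100, 12, 10
print("bound on b:", avdl / c2)
for b in (0.75, 5.0, 20.0):
    print(b, tf_lnc_holds(c1, c2, avdl, b))
```

Because c(w,d2) cannot exceed |d2| = avdl, the bound is at least 1, so the usual settings of b satisfy TF-LNC in this case.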
• Modified Okapi method
– Solves the problem of negative IDF
– Replaces the original IDF in Okapi with the regular IDF in the pivoted normalization formula
– The performance is better on verbose queries
• Analysis result
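The effect of the IDF replacement described above can be seen by comparing the two IDF forms on terms of different document frequency. This is an illustrative sketch; the collection size of 1000 and the function names are assumptions.

```python
import math

def okapi_idf(df, num_docs):
    """Original Okapi IDF: negative once df exceeds N/2."""
    return math.log((num_docs - df + 0.5) / (df + 0.5))

def pivoted_idf(df, num_docs):
    """IDF from the pivoted normalization formula: positive for any df <= N."""
    return math.log((num_docs + 1) / df)

for df in (10, 500, 900):
    print(df, round(okapi_idf(df, 1000), 3), round(pivoted_idf(df, 1000), 3))
```

Verbose queries tend to contain common words with df(w) > N/2, which is where the two forms differ in sign.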
Analysis of Three Representative Retrieval Formulas: Dirichlet Prior Method
• Retrieval function

$$
f(d,q) = \sum_{w \in q \cap d} c(w,q) \cdot \ln\left(1 + \frac{c(w,d)}{\mu \cdot p(w|C)}\right) + |q| \cdot \ln\frac{\mu}{|d| + \mu}
$$

• Use the Dirichlet prior smoothing method to smooth a document language model

$$
p_\mu(w|d) = \frac{c(w,d) + \mu \cdot p(w|C)}{\sum_{w'} c(w',d) + \mu}
$$

• Rank the documents according to the likelihood of the query under the estimated language model of each document
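A minimal sketch of the Dirichlet prior retrieval function next to the smoothed query log-likelihood it is derived from. The function names, the dictionary `p_coll` of collection probabilities p(w|C), and the default `mu=2000.0` are illustrative assumptions.

```python
import math
from collections import Counter

def dirichlet_score(query, doc, p_coll, mu=2000.0):
    """Dirichlet prior retrieval score; p_coll maps term -> p(w|C)."""
    q_counts, d_counts = Counter(query), Counter(doc)
    score = len(query) * math.log(mu / (len(doc) + mu))
    for w, c_wq in q_counts.items():
        c_wd = d_counts.get(w, 0)
        if c_wd > 0:
            score += c_wq * math.log(1.0 + c_wd / (mu * p_coll[w]))
    return score

def query_log_likelihood(query, doc, p_coll, mu=2000.0):
    """Log-likelihood of the query under the smoothed document model."""
    d_counts = Counter(doc)
    return sum(
        math.log((d_counts.get(w, 0) + mu * p_coll[w]) / (len(doc) + mu))
        for w in query
    )

p_coll = {"a": 0.01, "b": 0.1, "c": 0.5}
query = ["a", "b"]
for doc in (["a", "a", "c"], ["b", "c", "c", "c"]):
    gap = query_log_likelihood(query, doc, p_coll) - dirichlet_score(query, doc, p_coll)
    print(round(gap, 6))
```

The two quantities differ only by the sum of ln p(w|C) over the query terms, which does not depend on the document, so they produce the same ranking.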
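• Analysis
– LNC2 constraint is equivalent to c(w,d2) ≥ |d2| · p(w|C), which is usually satisfied for content-carrying words
– TDC constraint leads to a lower bound for the parameter μ. For a query with the two terms w1 and w2, TDC requires

$$
\ln\left(1 + \frac{c(w_1,d_1)}{\mu\,p(w_1|C)}\right) + \ln\left(1 + \frac{c(w_2,d_1)}{\mu\,p(w_2|C)}\right) + 2\ln\frac{\mu}{|d_1| + \mu}
\;\ge\;
\ln\left(1 + \frac{c(w_1,d_2)}{\mu\,p(w_1|C)}\right) + \ln\left(1 + \frac{c(w_2,d_2)}{\mu\,p(w_2|C)}\right) + 2\ln\frac{\mu}{|d_2| + \mu}
$$

which is equivalent to

$$
\mu \ge \frac{c(w_1,d_1) - c(w_2,d_2)}{p(w_2|C) - p(w_1|C)}
$$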
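The lower bound on μ can be checked against the two-term scores directly. The sketch below uses made-up counts and collection probabilities (w1 is the more discriminative term, so p(w1|C) < p(w2|C)); the helper names are hypothetical.

```python
import math

def mu_lower_bound(c_w1_d1, c_w2_d2, p_w1, p_w2):
    """TDC lower bound on mu; assumes p_w2 > p_w1."""
    return (c_w1_d1 - c_w2_d2) / (p_w2 - p_w1)

def two_term_score(c_w1, c_w2, doc_len, p_w1, p_w2, mu):
    # Dirichlet prior score for the query {w1, w2}, each term occurring once in q.
    return (math.log(1.0 + c_w1 / (mu * p_w1))
            + math.log(1.0 + c_w2 / (mu * p_w2))
            + 2.0 * math.log(mu / (doc_len + mu)))

p_w1, p_w2, doc_len = 0.001, 0.01, 100
d1 = (5, 1)  # c(w1,d1), c(w2,d1)
d2 = (3, 3)  # c(w1,d2), c(w2,d2), same total as d1
bound = mu_lower_bound(d1[0], d2[1], p_w1, p_w2)
print("lower bound on mu:", round(bound, 1))
for mu in (100.0, 500.0):
    f1 = two_term_score(d1[0], d1[1], doc_len, p_w1, p_w2, mu)
    f2 = two_term_score(d2[0], d2[1], doc_len, p_w1, p_w2, mu)
    print(mu, "TDC holds:", f1 >= f2)
```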
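• Analysis (continued)
– TDC: consider a common case of w2, p(w2|C) = 1/avdl. Because p(w2|C) - p(w1|C) ≤ p(w2|C), the TDC lower bound on μ implies

$$
\mu \ge \frac{c(w_1,d_1) - c(w_2,d_2)}{p(w_2|C)} = avdl \times \big(c(w_1,d_1) - c(w_2,d_2)\big)
$$

– This means that for discriminative words with a high term frequency in a document, μ needs to be sufficiently large
– in order to balance the TF and IDF appropriately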
Experiments: Setup
• Document set
– AP: news articles; DOE: technical reports; FR: government documents
– ADF: combination of AP, DOE, FR
– Web: web data used in TREC8
– Trec7: ad hoc data used in TREC7
– Trec8: ad hoc data used in TREC8
• Query combination
– Short-keyword (SK, keyword title)
– Short-verbose (SV, one-sentence description)
– Long-keyword (LK, keyword list)
– Long-verbose (LV, multiple sentences)
• Preprocessing
– Only stemming, with the Porter stemmer
– No stop words were removed
Experiments: Parameter Sensitivity
• Pivoted normalization method
• The analysis of the LNC2 constraint for the pivoted normalization method suggests that s should be smaller than 0.4
• Okapi method: k1 = 1.2, k3 = 1000, b varies from 0.1 to 1.0
• Dirichlet prior method
Experiments: Performance Comparison
• For any query type, the performance of the Dirichlet prior method is comparable to that of the pivoted normalization method
• For keyword queries, the performance of Okapi is comparable to that of the other two retrieval formulas
• For verbose queries, the performance of Okapi may be worse than the others because the IDF part of the formula can be negative
• Average precision comparison
Conclusion and Future Work
• Defined six basic constraints that any reasonable retrieval function should satisfy
• When a constraint is not satisfied, it often indicates non-optimality of the method