Relevance Feedback

Main idea: modify the existing query based on relevance judgements.
- Extract terms from relevant documents and add them to the query, and/or re-weight the terms already in the query.

Two main approaches:
- Automatic (pseudo-relevance feedback)
- Users select relevant documents
- Users/system select terms from an automatically-generated list
Relevance Feedback

Usually do both:
- expand the query with new terms
- re-weight terms already in the query

There are many variations:
- usually positive weights for terms from relevant docs
- sometimes negative weights for terms from non-relevant docs
- remove terms that appear ONLY in non-relevant documents
Relevance Feedback for Vector Model

In the "ideal" case where we know the relevant documents a priori:

    Q_opt = (1/|Cr|) · Σ_{dj ∈ Cr} dj  −  (1/(N − |Cr|)) · Σ_{dj ∉ Cr} dj

where Cr = set of documents that are truly relevant to Q, and N = total number of documents.
Rocchio Method

    Q1 = α·Q0 + (β/|Dr|) · Σ_{dj ∈ Dr} dj  −  (γ/|Dn|) · Σ_{dj ∈ Dn} dj

Q0 is the initial query; Q1 is the query after one iteration. Dr is the set of relevant docs; Dn is the set of non-relevant docs. Typically α = 1, β = 0.75, γ = 0.25.

Other variations are possible, but performance is similar.
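The Rocchio update above can be sketched in a few lines of Python; this is a minimal illustration over plain weight-vector lists, not a full IR system.

```python
# Minimal sketch of the Rocchio update: Q1 = a*Q0 + (b/|Dr|)*sum(Dr) - (g/|Dn|)*sum(Dn).
# Queries and documents are plain Python lists of term weights of equal length.
def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.25):
    """Return the updated query vector Q1."""
    n = len(q0)
    q1 = [alpha * w for w in q0]
    if rel_docs:
        for d in rel_docs:
            for i in range(n):
                q1[i] += beta / len(rel_docs) * d[i]
    if nonrel_docs:
        for d in nonrel_docs:
            for i in range(n):
                q1[i] -= gamma / len(nonrel_docs) * d[i]
    return q1
```

Note how a term present only in the non-relevant document ends up with a reduced (possibly negative) weight, which is why negative weights need care.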
Rocchio/Vector Illustration

[Figure: 2-D term space with axes "retrieval" and "information" (0 to 1.0); Q0, D1, D2, Q', Q" plotted.]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science     = (0.2, 0.8)
D2 = retrieval systems       = (0.9, 0.1)

Q'  = ½·Q0 + ½·D1 = (0.45, 0.55)
Q'' = ½·Q0 + ½·D2 = (0.80, 0.20)
Example Rocchio Calculation

Relevant docs:
    R1 = (0.120, 0.100, 0.100, 0.025, 0.050, 0.002, 0.020, 0.009, 0.020)
    R2 = (0.120, 0.000, 0.000, 0.050, 0.025, 0.025, 0.000, 0.000, 0.030)
Non-relevant doc:
    S1 = (0.000, 0.020, 0.000, 0.025, 0.005, 0.000, 0.020, 0.010, 0.030)
Original query:
    Q  = (0.950, 0.000, 0.450, 0.000, 0.500, 0.000, 0.000, 0.000, 0.000)
Constants: α = 1, β = 0.75, γ = 0.25

Rocchio calculation:
    Q_new = α·Q + (β/2)·(R1 + R2) − (γ/1)·S1

Resulting feedback query:
    Q_new = (1.040, 0.033, 0.488, 0.022, 0.527, 0.010, 0.002, 0.000875, 0.011)
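The example's arithmetic can be re-checked directly (the slide rounds most components to three decimals):

```python
# Re-check the Rocchio example numerically with alpha=1, beta=0.75, gamma=0.25.
alpha, beta, gamma = 1.0, 0.75, 0.25
Q  = [0.950, 0.000, 0.450, 0.000, 0.500, 0.000, 0.000, 0.000, 0.000]
R1 = [0.120, 0.100, 0.100, 0.025, 0.050, 0.002, 0.020, 0.009, 0.020]
R2 = [0.120, 0.000, 0.000, 0.050, 0.025, 0.025, 0.000, 0.000, 0.030]
S1 = [0.000, 0.020, 0.000, 0.025, 0.005, 0.000, 0.020, 0.010, 0.030]

# Q_new = alpha*Q + (beta/2)*(R1 + R2) - (gamma/1)*S1, component by component.
Q_new = [alpha * q + (beta / 2) * (r1 + r2) - gamma * s
         for q, r1, r2, s in zip(Q, R1, R2, S1)]
```

The exact values are (1.04, 0.0325, 0.4875, 0.021875, 0.526875, 0.010125, 0.0025, 0.000875, 0.01125), which round to the feedback query shown above.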
Rocchio Method

Rocchio automatically re-weights terms and adds in new terms (from relevant docs):
- you have to be careful when using negative terms
- Rocchio is not a machine learning algorithm

Most methods perform similarly:
- results are heavily dependent on the test collection
- machine learning methods are proving to work better than standard IR approaches like Rocchio
Relevance Feedback in the Probabilistic Model

    sim(dj, q) ~ Σ_i wiq · wij · ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|R̄)) / P(ki|R̄) ] )

How do we get the probabilities P(ki|R) and P(ki|R̄)?

Initial estimates based on assumptions:
    P(ki|R)  = 0.5
    P(ki|R̄) = ni / N     where ni is the number of docs that contain ki

Use this initial guess to retrieve an initial ranking, then improve upon this initial ranking.
Improving the Initial Ranking

Recall sim(dj, q) from the previous slide. Let
    V  = set of docs initially retrieved
    Vi = subset of the retrieved docs that contain ki

Re-evaluate the estimates:
    P(ki|R)  = Vi / V
    P(ki|R̄) = (ni − Vi) / (N − V)

Repeat recursively.
Improving the Initial Ranking (smoothing)

To avoid problems when V = 1 and Vi = 0:
    P(ki|R)  = (Vi + 0.5) / (V + 1)
    P(ki|R̄) = (ni − Vi + 0.5) / (N − V + 1)

Alternatively:
    P(ki|R)  = (Vi + ni/N) / (V + 1)
    P(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
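The smoothed re-estimation step is small enough to state as code; this sketch implements the 0.5-smoothed variant above.

```python
# Sketch: smoothed probability estimates for probabilistic relevance feedback.
# V = number of docs initially retrieved, V_i = retrieved docs containing k_i,
# n_i = docs in the whole collection containing k_i, N = collection size.
def estimates(V, V_i, n_i, N):
    p_ki_R    = (V_i + 0.5) / (V + 1)            # P(k_i | R)
    p_ki_nonR = (n_i - V_i + 0.5) / (N - V + 1)  # P(k_i | not R)
    return p_ki_R, p_ki_nonR
```

With V = 1 and V_i = 0 this yields 0.25 rather than the degenerate 0, which is exactly the problem the smoothing is meant to avoid.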
Using Relevance Feedback

Known to improve results in TREC-like conditions (no user involved).

What about with a user in the loop?
- How might you measure this?
- Precision/recall figures for the unseen documents need to be computed.
Relevance Feedback Summary

Iterative query modification can improve precision and recall for a standing query.

In at least one study, users were able to make good choices by seeing which terms were suggested for relevance feedback and selecting among them.
Query Expansion

Add terms that are closely related to the query terms, to improve precision and recall. Two variants:
- Local: only analyze the closeness among the set of documents that are returned.
- Global: consider all the documents in the corpus a priori.

How to decide which terms are closely related? THESAURI!!
- Hand-coded thesauri (Roget and his brothers)
- Automatically generated thesauri:
  - correlation based (association, nearness)
  - similarity based (terms as vectors in doc space)
  - statistical (clustering techniques)
Correlation / Co-occurrence Analysis

Co-occurrence analysis: terms that are related to terms in the original query may be added to the query. Two terms are related if they have high co-occurrence in documents.

Let n      = the number of documents,
    n1, n2 = the number of documents containing terms t1 and t2,
    m      = the number of documents containing both t1 and t2.

If t1 and t2 are independent:
    m/n ≈ (n1/n) · (n2/n)
If t1 and t2 are correlated:
    m/n > (n1/n) · (n2/n)

The degree of correlation can be measured by how far m/n exceeds the independence estimate.
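The independence test above amounts to comparing observed and expected co-occurrence rates; a minimal sketch:

```python
# Sketch: compare the observed co-occurrence rate m/n against the rate
# expected under independence, (n1/n)*(n2/n). A ratio well above 1 suggests
# the terms are correlated and t2 is a candidate expansion term for t1.
def correlation_ratio(n, n1, n2, m):
    expected = (n1 / n) * (n2 / n)   # P(t1)P(t2) under independence
    observed = m / n                 # empirical P(t1 and t2)
    return observed / expected if expected > 0 else float("inf")
```

For example, with n = 100, n1 = n2 = 10 and m = 5, the ratio is 5.0: the terms co-occur five times more often than independence predicts.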
Association Clusters

Let M be the term-document matrix:
- for the full corpus (global), or
- for the docs in the set of initial results (local)
(also, sometimes stems are used instead of terms)

Correlation matrix C = M·Mᵀ (term-doc × doc-term = term-term):
    Cuv = Σ_dj f_{tu,dj} · f_{tv,dj}

Un-normalized association matrix:  Suv = Cuv
Normalized association matrix:     Suv = Cuv / (Cuu + Cvv − Cuv)

The nth association cluster for a term tu is the set of terms tv such that Suv are the n largest values among Su1, Su2, …, Suk.
Example

Term-document matrix:
         d1 d2 d3 d4 d5 d6 d7
    K1    2  1  0  2  1  1  0
    K2    0  0  1  0  2  2  5
    K3    1  0  3  0  4  0  0

Correlation matrix C = M·Mᵀ:
    | 11  4  6 |
    |  4 34 11 |
    |  6 11 26 |

Normalized correlation matrix, e.g.
    S12 = C12 / (C11 + C22 − C12) = 4 / (11 + 34 − 4) ≈ 0.097:

    | 1.0   0.097 0.193 |
    | 0.097 1.0   0.224 |
    | 0.193 0.224 1.0   |

The 1st association cluster for K2 is {K3}.
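Both matrices in this example can be re-derived from the raw counts:

```python
# Re-derive the association matrices from the slide's term-document counts.
M = [[2, 1, 0, 2, 1, 1, 0],   # K1
     [0, 0, 1, 0, 2, 2, 5],   # K2
     [1, 0, 3, 0, 4, 0, 0]]   # K3

# C = M * M^T : co-occurrence weight between each pair of terms.
C = [[sum(a * b for a, b in zip(ru, rv)) for rv in M] for ru in M]

# Normalized association: S_uv = C_uv / (C_uu + C_vv - C_uv).
S = [[C[u][v] / (C[u][u] + C[v][v] - C[u][v]) for v in range(3)]
     for u in range(3)]
```

For K2, S21 ≈ 0.097 and S23 ≈ 0.224, so K3 is its nearest associated term, matching the slide.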
Scalar Clusters

Consider the normalized association matrix S. The "association vector" Au of term u is (Su1, Su2, …, Suk).

To measure neighborhood-induced correlation between terms, take the cosine between the association vectors of terms u and v:

    Suv = (Au · Av) / (|Au| · |Av|)

Even if terms u and v have low direct correlation, they may be transitively correlated (e.g. a term w has high correlation with both u and v).

The nth scalar cluster for a term tu is the set of terms tv such that Suv are the n largest values among Su1, Su2, …, Suk.
Example

Term-document matrix (as before):
         d1 d2 d3 d4 d5 d6 d7
    K1    2  1  0  2  1  1  0
    K2    0  0  1  0  2  2  5
    K3    1  0  3  0  4  0  0

Normalized correlation matrix:
    | 1.0   0.097 0.193 |
    | 0.097 1.0   0.224 |
    | 0.193 0.224 1.0   |

Trace of the pairwise cosine computations over the association vectors AK1, AK2, AK3 (the rows of the matrix above):

    USER(43): (neighborhood normatrix)
    0: (COSINE-METRIC (1.0 0.09756097 0.19354838) (1.0 0.09756097 0.19354838))
    0: returned 1.0
    0: (COSINE-METRIC (1.0 0.09756097 0.19354838) (0.09756097 1.0 0.2244898))
    0: returned 0.22647195
    0: (COSINE-METRIC (1.0 0.09756097 0.19354838) (0.19354838 0.2244898 1.0))
    0: returned 0.38323623
    0: (COSINE-METRIC (0.09756097 1.0 0.2244898) (1.0 0.09756097 0.19354838))
    0: returned 0.22647195
    0: (COSINE-METRIC (0.09756097 1.0 0.2244898) (0.09756097 1.0 0.2244898))
    0: returned 1.0
    0: (COSINE-METRIC (0.09756097 1.0 0.2244898) (0.19354838 0.2244898 1.0))
    0: returned 0.43570948
    0: (COSINE-METRIC (0.19354838 0.2244898 1.0) (1.0 0.09756097 0.19354838))
    0: returned 0.38323623
    0: (COSINE-METRIC (0.19354838 0.2244898 1.0) (0.09756097 1.0 0.2244898))
    0: returned 0.43570948
    0: (COSINE-METRIC (0.19354838 0.2244898 1.0) (0.19354838 0.2244898 1.0))
    0: returned 1.0

Scalar (neighborhood) cluster matrix:
    | 1.0   0.226 0.383 |
    | 0.226 1.0   0.435 |
    | 0.383 0.435 1.0   |

The 1st scalar cluster for K2 is still {K3}.
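The Lisp trace above can be reproduced in Python: take the cosine between every pair of rows of the normalized association matrix.

```python
import math

# Rows of the normalized association matrix S (from the previous example).
S = [[1.0, 0.09756097, 0.19354838],
     [0.09756097, 1.0, 0.2244898],
     [0.19354838, 0.2244898, 1.0]]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Scalar (neighborhood) cluster matrix: cosines between association vectors.
scalar = [[cosine(S[u], S[v]) for v in range(3)] for u in range(3)]
```

This yields 0.2265, 0.3832 and 0.4357 for the off-diagonal entries, matching the trace up to rounding.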
Metric Clusters

Let r(ti, tj) be the minimum distance (in terms of the number of separating words) between ti and tj in any single document (infinity if they never occur together in a document).

Define the cluster matrix:
    Suv = 1 / r(tu, tv)
(a variant uses the average distance over co-occurrences instead of the minimum)

The nth metric cluster for a term tu is the set of terms tv such that Suv are the n largest values among Su1, Su2, …, Suk.

Note: r(ti, tj) is also useful for proximity queries and phrase queries.
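A minimal sketch of the metric-cluster score for two distinct terms, with documents represented as token lists:

```python
# Sketch: minimum word distance r(t1, t2) over all documents, and the
# metric-cluster score S = 1 / r. Returns 0.0 when the terms never co-occur.
def min_distance(docs, t1, t2):
    best = float("inf")
    for doc in docs:
        pos1 = [i for i, w in enumerate(doc) if w == t1]
        pos2 = [i for i, w in enumerate(doc) if w == t2]
        for i in pos1:
            for j in pos2:
                best = min(best, abs(i - j))
    return best

def metric_score(docs, t1, t2):
    r = min_distance(docs, t1, t2)
    return 0.0 if r == float("inf") else 1.0 / r
```

For example, in the single document "information retrieval systems", the terms "information" and "systems" are two word positions apart, giving a score of 0.5.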
Similarity Thesaurus

The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. It is obtained by considering that the terms are concepts in a concept space. Each term is indexed by the documents in which it appears: terms assume the original role of documents, while documents are interpreted as indexing elements.
Motivation

[Figure: the query Q and terms Ka, Kb, Kv, Ki, Kj plotted in the term-concept space.]
Similarity Thesaurus

Terminology:
- t: number of terms in the collection
- N: number of documents in the collection
- f_{i,j}: frequency of occurrence of the term ki in the document dj
- t_j: vocabulary of document dj (number of distinct terms)
- itf_j: inverse term frequency for document dj

Inverse term frequency for document dj:
    itf_j = log(t / t_j)

To each term ki is associated a vector
    ki = (w_{i,1}, w_{i,2}, …, w_{i,N})
where
    w_{i,j} = (0.5 + 0.5 · f_{i,j} / max_j(f_{i,j})) · itf_j
              / sqrt( Σ_{l=1..N} [ (0.5 + 0.5 · f_{i,l} / max_l(f_{i,l})) · itf_l ]² )

Idea: it is no surprise if the Oxford dictionary mentions the word!
Similarity Thesaurus

The relationship between two terms ku and kv is computed as a correlation factor c_{u,v} given by

    c_{u,v} = ku · kv = Σ_dj w_{u,j} · w_{v,j}

(similar to the scalar-clusters idea, but with the tf/itf weighting defining the term vector)

The global similarity thesaurus is built through the computation of the correlation factor c_{u,v} for each pair of indexing terms [ku, kv] in the collection.
- Expensive, but possible to do incremental updates.
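The construction can be sketched end to end on a hypothetical toy frequency matrix (the 3×3 matrix M below is invented for illustration); each term becomes a unit vector over documents and c_{u,v} is a dot product.

```python
import math

# Hypothetical toy data: f_{i,j} counts, rows = terms, cols = documents.
M = [[2, 1, 0],
     [0, 1, 3],
     [1, 0, 1]]
t = len(M)                                                      # terms in collection
vocab = [sum(1 for row in M if row[j] > 0) for j in range(3)]   # t_j per document
itf = [math.log(t / tj) for tj in vocab]                        # itf_j = log(t / t_j)

def term_vector(i):
    """Normalized concept-space vector for term k_i (slide's tf/itf weighting)."""
    maxf = max(M[i])
    raw = [(0.5 + 0.5 * M[i][j] / maxf) * itf[j] for j in range(3)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

def c(u, v):
    """Correlation factor c_{u,v} = k_u . k_v."""
    return sum(a * b for a, b in zip(term_vector(u), term_vector(v)))
```

Because the vectors are normalized, c(u, u) = 1 and every cross-term correlation lies between 0 and 1 for non-negative weights.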
Query Expansion with a Global Thesaurus

Three steps:
1. Represent the query in the concept space used for representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q, kv).
Query Expansion - Step One

To the query q is associated a vector q in the term-concept space, given by

    q = Σ_{ki ∈ q} w_{i,q} · ki

where w_{i,q} is a weight associated to the index-term/query pair [ki, q].
Query Expansion - Step Two

Compute a similarity sim(q, kv) between each term kv and the user query q:

    sim(q, kv) = q · kv = Σ_{ku ∈ q} w_{u,q} · c_{u,v}

where c_{u,v} is the correlation factor.
Query Expansion - Step Three

Add the top r ranked terms according to sim(q, kv) to the original query q to form the expanded query q'. To each expansion term kv in the query q' is assigned a weight w_{v,q'} given by

    w_{v,q'} = sim(q, kv) / Σ_{ku ∈ q} w_{u,q}

The expanded query q' is then used to retrieve new documents for the user.
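Steps two and three can be sketched together, assuming a precomputed correlation table c[(u, v)] and query weights w_q (the toy values below are hypothetical):

```python
# Sketch of expansion steps 2 and 3: rank candidate terms by
# sim(q, k_v) = sum_u w_{u,q} * c_{u,v}, then add the top-r terms with
# weight sim(q, k_v) / sum_u w_{u,q}.
def expand_query(w_q, c, vocab, r):
    sims = {v: sum(w * c.get((u, v), 0.0) for u, w in w_q.items())
            for v in vocab if v not in w_q}
    total = sum(w_q.values())
    top = sorted(sims, key=sims.get, reverse=True)[:r]
    expanded = dict(w_q)
    for v in top:
        expanded[v] = sims[v] / total
    return expanded

# Hypothetical toy data: one query term "a", two candidate expansion terms.
w_q = {"a": 1.0}
corr = {("a", "b"): 0.8, ("a", "c"): 0.2}
q_expanded = expand_query(w_q, corr, ["a", "b", "c"], r=1)
```

With r = 1 only the best-correlated term "b" is added, with weight 0.8.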
Statistical Thesaurus Formulation

Expansion terms must be low-frequency terms. However, it is difficult to cluster low-frequency terms.

Idea: cluster documents into classes instead, and use the low-frequency terms in these documents to define our thesaurus classes. This clustering algorithm must produce small and tight clusters.
A Clustering Algorithm (Complete Link)

This is a document clustering algorithm which produces small and tight clusters:
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu, Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.

The similarity between two clusters is defined as the MINIMUM of the similarities between all pairs of inter-cluster documents.

[Figure: dendrogram over D1-D4; C1 and C3 merge at similarity 0.99 (C1,3), then with C2 at 0.29 (C1,3,2), then with C4 at 0.00 (C1,3,2,4).]
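The steps above can be sketched compactly; cluster similarity is the minimum pairwise document similarity, which is what keeps complete-link clusters tight. The example run uses the dendrogram's similarity values.

```python
# Sketch of complete-link agglomerative clustering. `sim` maps each
# frozenset pair of documents to their similarity; merging stops once the
# best inter-cluster similarity falls below `threshold`.
def complete_link(docs, sim, threshold):
    clusters = [frozenset([d]) for d in docs]

    def cluster_sim(a, b):
        # Complete link: MINIMUM similarity over all cross-cluster pairs.
        return min(sim[frozenset([x, y])] for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(cluster_sim(a, b), a, b)
                 for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        best, a, b = max(pairs, key=lambda p: p[0])
        if best < threshold:
            break
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters

# Dendrogram example: sim(1,3)=0.99, sim(1,2)=0.40, sim(2,3)=0.29, doc 4 ~ 0.
sim = {frozenset(p): s for p, s in
       {(1, 3): 0.99, (1, 2): 0.40, (2, 3): 0.29,
        (1, 4): 0.0, (2, 4): 0.0, (3, 4): 0.0}.items()}
clusters = complete_link([1, 2, 3, 4], sim, threshold=0.9)
```

With threshold 0.9, only D1 and D3 merge (at 0.99); the next candidate merge would score min(0.40, 0.29) = 0.29 and is rejected, leaving three clusters.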
Selecting the Terms that Compose Each Class

Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows. Obtain from the user three parameters:
- TC: threshold class
- NDC: number of documents in a class
- MIDF: minimum inverse document frequency

[Figure: dendrogram over D1-D4; merges at 0.99 (C1,3), 0.29 (C1,3,2), 0.00 (C1,3,2,4).]
Selecting the Terms that Compose Each Class (continued)

Use the parameter TC as a threshold value for determining which document clusters will be used to generate thesaurus classes:
- This threshold has to be surpassed by sim(Cu, Cv) if the documents in the clusters Cu and Cv are to be selected as sources of terms for a thesaurus class.

Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered:
- A low value of NDC might restrict the selection to the smaller cluster Cu+v.
Selecting the Terms that Compose Each Class (continued)

Consider the set of documents in each document cluster pre-selected above. Only the lower-frequency terms are used as sources of terms for the thesaurus classes.

The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class.
Query Expansion Based on a Statistical Thesaurus

Use the thesaurus classes for query expansion. Compute an average term weight wtc for each thesaurus class C:

    wtc = ( Σ_{ki ∈ C} w_{i,C} ) / |C|
Query Expansion Based on a Statistical Thesaurus (continued)

wtc can be used to compute a thesaurus class weight Wc as

    Wc = wtc / (0.5 · |C|)
Query Expansion Sample

Parameters: TC = 0.90, NDC = 2.00, MIDF = 0.2

Documents:
    Doc1 = D, D, A, B, C, A, B, C
    Doc2 = E, C, E, A, A, D
    Doc3 = D, C, B, B, D, A, B, C, A
    Doc4 = A

Pairwise similarities:
    sim(1,3) = 0.99    sim(1,2) = 0.40    sim(2,3) = 0.29
    sim(4,1) = sim(4,2) = sim(4,3) = 0.00

[Figure: dendrogram; C1,3 at 0.99, C1,3,2 at 0.29, C1,3,2,4 at 0.00.]

Inverse document frequencies:
    idf A = 0.0, idf B = 0.3, idf C = 0.12, idf D = 0.12, idf E = 0.60

Query:          q  = A E E
Expanded query: q' = A B E E
Query Expansion Based on a Statistical Thesaurus

Problems with this approach: initialization of the parameters TC, NDC and MIDF.
- TC depends on the collection.
- Inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC.
- A high value of TC might yield classes with too few terms.
Conclusion

- A thesaurus is an efficient method to expand queries.
- The computation is expensive, but it is executed only once.
- Query expansion based on a similarity thesaurus may use high-frequency terms to expand the query.
- Query expansion based on a statistical thesaurus needs well-defined parameters.
Using Correlation for Term Change

- Low frequency to medium frequency: by synonym recognition.
- High frequency to medium frequency: by phrase recognition.