TRANSCRIPT
Chap. 5
Chapter 5
Query Operations
Contents
Introduction
User relevance feedback
Automatic local analysis
Automatic global analysis
Trends and research issues
5.1 Introduction
Difficulty of formulating user queries
– Insufficient knowledge of the collection make-up and of the retrieval environment
Query reformulation
– Two basic steps
  Query expansion: expanding the original query with new terms
  Term reweighting: reweighting the terms in the expanded query
Introduction (Cont.)
Query reformulation (cont.)
– Three approaches
  User relevance feedback: based on feedback information from the user
  Local feedback: based on information derived from the set of documents initially retrieved (the local set)
  Global feedback: based on global information derived from the document collection
5.2 User Relevance Feedback
User’s role in relevance feedback cycle– is presented with a list of the retrieved documents
– marks relevant documents
Main idea of relevance feedback– Selecting important terms, or expressions, attached to the
documents that have been identified as relevant by the user
– Enhancing the importance of these terms in new query formulation
The new query will be moved towards the relevant documents and away from the non-relevant ones
User Relevance Feedback (Cont.)
Experiments have shown good improvements in precision for small test collections when RF is used.
Advantage of RF– Shields the user from the details of the query
reformulation process.
– Breaks down the whole searching task into a sequence of small steps which are easier to grasp.
– Provides a controlled process designed to emphasize relevant terms and de-emphasize non-relevant terms.
5.2.1 Query Expansion and Term Reweighting for the Vector Model
Application of RF to the vector model– Considers that the term-weight vectors of the
documents identified as relevant have similarities among themselves.
– It is assumed that non-relevant documents have term-weight vectors which are dissimilar from the ones for the relevant documents.
– The basic idea is to reformulate the query such that it gets closer to the term-weight vector space of the relevant documents
Query Expansion and Term Reweighting for the VM (Cont.)
Optimal query

  \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j

– But, the relevant documents (C_r) are not known a priori.

D_r: set of relevant documents among the retrieved documents
D_n: set of non-relevant documents among the retrieved documents
C_r: set of relevant documents among all documents in the collection
|D_r|, |D_n|, |C_r|: number of documents in the sets D_r, D_n, and C_r, respectively
\alpha, \beta, \gamma: tuning constants
Query Expansion and Term Reweighting for the VM (Cont.)
Incremental change of the initial query vector
– Standard_Rocchio:

  \vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j

– Ide_Regular:

  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j

– Ide_Dec_Hi:

  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \, \max_{non\text{-}relevant}(\vec{d}_j)

  \max_{non\text{-}relevant}(\vec{d}_j): the highest ranked non-relevant document

The information contained in the relevant documents is more important than the information provided by the non-relevant documents (\beta > \gamma).
Positive feedback strategy: \gamma = 0
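The Rocchio-style update can be sketched over sparse dict vectors. This is a minimal illustration, not the book's code: the helper names `add` and `rocchio` are hypothetical, and the default constants alpha=1.0, beta=0.75, gamma=0.15 are common choices assumed here, not prescribed by the text.

```python
# A minimal sketch of Standard_Rocchio over sparse term-weight vectors.
# Vectors are dicts mapping term -> weight; names and defaults are illustrative.

def add(q, d, factor):
    """Accumulate factor * d into the query vector q (in place)."""
    for term, w in d.items():
        q[term] = q.get(term, 0.0) + factor * w

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q + (beta/|Dr|) * sum(rel) - (gamma/|Dn|) * sum(nonrel)."""
    q_m = {t: alpha * w for t, w in q.items()}
    for d in rel:
        add(q_m, d, beta / len(rel))
    for d in nonrel:
        add(q_m, d, -gamma / len(nonrel))
    # Negative weights are usually clipped to zero in practice.
    return {t: w for t, w in q_m.items() if w > 0}
```

Note how relevant documents both reweight existing query terms and expand the query with new terms (here, any term of a relevant document absent from the original query).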
Query Expansion and Term Reweighting for the VM (Cont.)
Advantages– Simplicity
The modified term weights are computed directly from the set of retrieved documents.
– Good results The modified query vector does reflect a portion of the
intended query semantics.
Disadvantages– No optimality criterion is adopted.
5.2.2 Term Reweighting for the Probabilistic Model
Probabilistic ranking formula

  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)

– But, the probabilities P(k_i|R) and P(k_i|\bar{R}) are unknown.

P(k_i|R): probability of observing the term k_i in the set R of relevant documents
P(k_i|\bar{R}): probability of observing the term k_i in the set \bar{R} of non-relevant documents
Term Reweighting for the Probabilistic Model (Cont.)
Probability estimation
– Initial search
  P(k_i|R) is constant for all terms k_i: P(k_i|R) = 0.5
  The term probability distribution P(k_i|\bar{R}) can be approximated by the distribution in the whole collection: P(k_i|\bar{R}) = n_i / N

  sim_{initial}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \frac{N - n_i}{n_i}

n_i: the number of documents in the collection which contain the term k_i
Term Reweighting for the Probabilistic Model (Cont.)
Probability estimation (cont.)
– Feedback search
  Accumulated statistics related to the relevance or non-relevance of previously retrieved documents are used:

  P(k_i|R) = \frac{|D_{r,i}|}{|D_r|} ; \quad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}

  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \left( \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \cdot \frac{N - |D_r| - n_i + |D_{r,i}|}{n_i - |D_{r,i}|} \right)

D_r: set of relevant retrieved documents
D_{r,i}: subset of D_r composed of the documents which contain the term k_i
Term Reweighting for the Probabilistic Model (Cont.)
Probability estimation (cont.)
– Feedback search (cont.)
  No query expansion occurs. The same query terms are reweighted using feedback information provided by the user.
  Problems arise for small values of |D_r| and |D_{r,i}|.
  An adjustment factor is often added: 0.5 or n_i/N.

  P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1} ; \quad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}

  or

  P(k_i|R) = \frac{|D_{r,i}| + n_i/N}{|D_r| + 1} ; \quad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + n_i/N}{N - |D_r| + 1}
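The adjusted estimates can be combined into a single feedback term weight. A sketch, using the 0.5 adjustment factor; the function name `term_weight` is a hypothetical label, not from the text.

```python
import math

def term_weight(n_i, N, Dr, Dri):
    """Probabilistic feedback reweighting with the 0.5 adjustment factor.

    n_i: documents in the collection containing the term
    N:   collection size
    Dr:  number of relevant retrieved documents (|D_r|)
    Dri: relevant retrieved documents containing the term (|D_{r,i}|)
    """
    p = (Dri + 0.5) / (Dr + 1)            # P(k_i | R)
    q = (n_i - Dri + 0.5) / (N - Dr + 1)  # P(k_i | not R)
    return math.log(p / (1 - p)) + math.log((1 - q) / q)
```

The adjustment keeps both estimates strictly between 0 and 1, so the log-odds stay finite even when a term appears in all or none of the relevant documents.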
Term Reweighting for the Probabilistic Model (Cont.)
Advantages of probabilistic RF
– The feedback process is directly related to the derivation of new weights for query terms.
– Term reweighting is optimal under the assumptions of term independence and binary document indexing.
Disadvantages of probabilistic RF
– Document term weights are not taken into account during the feedback loop.
– Weights of terms in the previous query formulations are also disregarded.
– No query expansion is used.
Probabilistic RF methods do not in general operate as effectively as the conventional vector modification methods.
5.2.3 A Variant of Probabilistic Term Reweighting
Probabilistic ranking formula

  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \, F_{i,j,q}

Initial search

  F_{i,j,q} = (C + idf_i) \, \bar{f}_{i,j} ; \quad \bar{f}_{i,j} = K + (1 - K) \frac{f_{i,j}}{\max_j(f_{i,j})}

– Parameters C and K should be adjusted according to the collection. For automatically indexed collections, C = 0.
Feedback search

  F_{i,j,q} = \left( C + \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) \bar{f}_{i,j}

\bar{f}_{i,j}: normalized within-document frequency
A Variant of Probabilistic Term Reweighting (Cont.)
Advantages– It takes into account the within-document frequencies.
– It adopts a normalized version of these frequencies.
– It introduces the constants C and K, which provide for greater flexibility.
Disadvantages– More complex formulation
– No query expansion
5.2.4 Evaluation of Relevance Feedback Strategies
A simplistic evaluation
– Retrieve a set of documents using the modified query.
– Measure recall-precision figures relative to the set of relevant documents for the original query.
– The results show spectacular improvements.
A significant part of this improvement results from the higher ranks assigned to the set R of documents already identified as relevant during the feedback process.
Since the user has seen these documents already, such evaluation is unrealistic.
It masks any real gains in retrieval performance due to documents not yet seen by the user.
Evaluation of Relevance Feedback Strategies (Cont.)
Residual collection evaluation
– Evaluate the retrieval performance of the modified query considering only the residual collection.
– Our main purpose is to compare the performance of distinct RF strategies.
– Any experimentation involving RF strategies should always evaluate recall-precision figures relative to the residual collection.
Residual collection: the set of all documents minus the set of feedback documents provided by the user
5.3 Automatic Local Analysis
User relevance feedback– Expanded query will retrieve more relevant documents.
– There is an underlying notion of clustering Known relevant documents contain terms which can be used
to describe a larger cluster of relevant documents. The description of this larger cluster of relevant documents is
built interactively with assistance from the user.
Automatic Local Analysis (Cont.)
Automatic relevance feedback– Obtain a description for a larger cluster of relevant
documents automatically.
– Involves identifying terms which are related to the query terms Synonyms, stemming variations, terms which are close to the
query terms in the text, …
– Global feedback & local feedback
Automatic Local Analysis (Cont.)
Global feedback– All documents in the collection are used to determine a
global thesaurus-like structure which defines term relationships.
– This structure is then shown to the user who selects terms for query expansion.
Local feedback– The documents retrieved for a given query q are
examined at query time to determine terms for query expansion.
– Is done without assistance from the user Local clustering & local context analysis
5.3.1 Query Expansion Through Local Clustering
Global clustering
– Build global structures, such as association matrices, which quantify term correlations
– Use correlated terms for query expansion
– Main problem
  There is no consistent evidence that global structures can be used effectively to improve retrieval performance with general collections.
  Global structures do not adapt well to the local context defined by the current query.
Local clustering [Attar & Fraenkel 1977]
– Aims at optimizing the current search.
Query Expansion Through Local Clustering (Cont.)
Basic terminology
– Stem
  V(s): a non-empty subset of words which are grammatical variants of each other
  A canonical form s of V(s) is called a stem. If V(s) = {polish, polishing, polished} then s = polish.
– Local document set D_l
  The set of documents retrieved for a given query q
– Local vocabulary V_l
  The set of all distinct words in the local document set
  The set of all distinct stems derived from the set V_l is referred to as S_l.
Query Expansion Through Local Clustering (Cont.)
Local clustering
– Operates solely on the documents retrieved for the current query
– Its application to the Web is unlikely at this time, since it is frequently necessary to access the text of the retrieved documents
  At a client machine
    Retrieving the text for local analysis would take too long
    It would drastically reduce the interactive nature of the Web interface and the satisfaction of the user
  At the search engine site
    Analyzing the text would represent an extra expenditure of CPU time which is not cost effective at this time.
Query Expansion Through Local Clustering (Cont.)
Local clustering (cont.)– Quite useful in the environment of intranets
– Of great assistance for searching information in specialized document collections (e.g. medical document collection)
– Local feedback strategies are based on expanding the query with terms correlated to the query terms. Such correlated terms are those present in local clusters built from the local document set.
Query Expansion Through Local Clustering (Cont.)
Association clusters
– Based on the frequency of co-occurrence of stems (or terms) inside documents
– The idea is that terms which co-occur frequently inside documents have a synonymity association.
– Correlation between the stems s_u and s_v:

  c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}

f_{s_i,j}: frequency of a stem s_i in a document d_j, d_j \in D_l
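The co-occurrence correlation above amounts to one pass over the local document set. A sketch; the function name `association_matrix` is an illustrative choice, not from the text.

```python
from collections import defaultdict

def association_matrix(docs):
    """c[u][v] = sum over local documents of f(s_u, d_j) * f(s_v, d_j).

    docs: list of stem->frequency dicts, one per document in the local set D_l.
    Returns the unnormalized local association matrix as nested dicts.
    """
    c = defaultdict(lambda: defaultdict(float))
    for freqs in docs:
        for u, fu in freqs.items():
            for v, fv in freqs.items():
                c[u][v] += fu * fv
    return c
```

Note that the diagonal entries c[u][u] are the squared-frequency sums used later by the normalized variant.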
Query Expansion Through Local Clustering (Cont.)
Association clusters (cont.)
– Unnormalized local association matrix

  \vec{m} = (m_{ij}): association matrix with |S_l| rows and |D_l| columns, where m_{ij} = f_{s_i,j}
  \vec{s} = \vec{m} \, \vec{m}^t: local stem-stem association matrix, with s_{u,v} = c_{u,v}
  \vec{m}^t: transpose of \vec{m}

– Normalized local association matrix

  s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}
Query Expansion Through Local Clustering (Cont.)
Association clusters (cont.)
– Building local association clusters
  Consider the u-th row in the association matrix \vec{s} (i.e., the row with all the associations for the stem s_u).
  Let S_u(n) be a function which takes the u-th row and returns the set of n largest values s_{u,v}, where v varies over the set of local stems and v \neq u.
  Then S_u(n) defines a local association cluster around the stem s_u.
  If s_{u,v} is unnormalized, the association cluster is said to be unnormalized.
  If s_{u,v} is normalized, the association cluster is said to be normalized.
Query Expansion Through Local Clustering (Cont.)
Metric clusters
– Two terms which occur in the same sentence seem more correlated than two terms which occur far apart in a document.
– It might be worthwhile to factor in the distance between two terms in the computation of their correlation factor.
– Correlation between the stems s_u and s_v:

  c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}

r(k_i, k_j): distance between the two keywords k_i and k_j (the number of words between them in a same document)
r(k_i, k_j) = \infty: k_i and k_j are in distinct documents
Query Expansion Through Local Clustering (Cont.)
Metric clusters (cont.)
– Unnormalized local metric correlation matrix

  s_{u,v} = c_{u,v}

– Normalized local metric correlation matrix

  s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}

\vec{s}: local stem-stem metric correlation matrix
Query Expansion Through Local Clustering (Cont.)
Metric clusters (cont.)
– Building local metric clusters
  Consider the u-th row in the metric correlation matrix \vec{s} (i.e., the row with all the associations for the stem s_u).
  Let S_u(n) be a function which takes the u-th row and returns the set of n largest values s_{u,v}, where v varies over the set of local stems and v \neq u.
  Then S_u(n) defines a local metric cluster around the stem s_u.
  If s_{u,v} is unnormalized, the metric cluster is said to be unnormalized.
  If s_{u,v} is normalized, the metric cluster is said to be normalized.
Query Expansion Through Local Clustering (Cont.)
Scalar clusters
– The idea is that two stems with similar neighborhoods have some synonymity relationship.
– The relationship is indirect or induced by the neighborhood.
– Quantifying such neighborhood relationships
  Arrange all correlation values s_{u,i} in a vector \vec{s}_u
  Arrange all correlation values s_{v,i} in another vector \vec{s}_v
  Compare these vectors through a scalar measure
  The cosine of the angle between the two vectors is a popular scalar similarity measure.
Query Expansion Through Local Clustering (Cont.)
Scalar clusters (cont.)
– Scalar association matrix \vec{s}

  s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}

  where \vec{s}_u = (s_{u,1}, s_{u,2}, ..., s_{u,n}) and \vec{s}_v = (s_{v,1}, s_{v,2}, ..., s_{v,n})

– Building scalar clusters
  Let S_u(n) be a function which returns the set of n largest values s_{u,v}, v \neq u.
  Then S_u(n) defines a scalar cluster around the stem s_u.
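The scalar (cosine) comparison of two neighborhood vectors can be sketched directly from a correlation matrix. The representation of the matrix as a dict keyed by stem pairs, and the name `scalar_association`, are illustrative assumptions.

```python
import math

def scalar_association(c, u, v, stems):
    """Cosine of the angle between the correlation vectors of stems u and v.

    c: dict mapping (stem, stem) pairs to correlation values s_{u,v}
    stems: the local stems over which the neighborhood vectors are built
    """
    su = [c.get((u, x), 0.0) for x in stems]
    sv = [c.get((v, x), 0.0) for x in stems]
    dot = sum(a * b for a, b in zip(su, sv))
    norm = math.sqrt(sum(a * a for a in su)) * math.sqrt(sum(b * b for b in sv))
    return dot / norm if norm else 0.0
```

Two stems whose neighborhood vectors are proportional get a scalar association of 1, even if the stems themselves never co-occur, which is exactly the "induced" relationship the slide describes.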
Query Expansion Through Local Clustering (Cont.)
Interactive search formulation
– Neighbor
  A stem s_u which belongs to a cluster S_v(n) associated to another stem s_v is said to be a neighbor of s_v.
  While neighbor stems are said to have a synonymity relationship, they are not necessarily synonyms in the grammatical sense.
  They represent distinct keywords which are correlated by the current query context.
  The local aspect of this correlation is reflected in the fact that the documents and stems considered in the correlation matrix are all local.
  Neighbors of the query stems can be used to expand the original query.
Query Expansion Through Local Clustering (Cont.)
Interactive search formulation (cont.)
– Neighbors are an important product of the local clustering process.
  They can be used for extending a search formulation in a promising, unexpected direction, rather than merely complementing it with missing synonyms.

[Figure: the stem s_u as a neighbor of the stem s_v, i.e., s_u inside the cluster S_v(n)]
Query Expansion Through Local Clustering (Cont.)
Expanding a given query q with neighbor stems
– For each stem s_v \in q
  Select m neighbor stems from the cluster S_v(n)
  Add them to the query
– Merging of normalized and unnormalized clusters
  To cover a broader neighborhood, the set S_v(n) might be composed of stems obtained using both normalized and unnormalized correlation factors c_{u,v}.
  An unnormalized cluster tends to group stems whose ties are due to their large frequencies.
  A normalized cluster tends to group stems which are more rare.
  The union of the two clusters provides a better representation of the possible correlations.
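The per-stem expansion step can be sketched as a small function. The cluster representation (precomputed, sorted neighbor lists) and the name `expand_query` are assumptions for illustration.

```python
def expand_query(query_stems, clusters, m):
    """Expand a query with up to m neighbor stems per query stem.

    query_stems: list of stems in the original query
    clusters: dict mapping each stem s_v to its cluster S_v(n), given as a
              list of (neighbor_stem, correlation) pairs sorted descending
    """
    expanded = list(query_stems)
    for s in query_stems:
        for neighbor, _corr in clusters.get(s, [])[:m]:
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded
```

In practice the clusters would come from the association, metric, or scalar matrices built over the local document set, possibly merging normalized and unnormalized variants as described above.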
Query Expansion Through Local Clustering (Cont.)
Expanding a given query q with neighbor stems (cont.)
– Use information about correlated stems to improve the search
  If the correlation factor c_{u,v} is larger than a predefined threshold, then a neighbor stem of s_u can also be interpreted as a neighbor stem of s_v, and vice versa.
  This provides greater flexibility, particularly with Boolean queries.
Query Expansion Through Local Clustering (Cont.)
Experimental results– Usually support the hypothesis of the usefulness of
local clustering methods
– Metric clusters seem to perform better than purely association clusters. This strengthens the hypothesis that there is a correlation
between the association of two terms and the distance between them.
Query Expansion Through Local Clustering (Cont.)
Qualitative arguments– Qualitative arguments in this section are explicitly
based on the fact that all the clusters are local.
– In a global context, clusters are derived from all the documents in the collection which implies that our qualitative argumentation might not stand.
– The main reason is that correlations valid in the whole corpus might not be valid for the current query.
5.3.2 Query Expansion Through Local Context Analysis
Local clustering vs. global analysis– Local clustering
Based on term co-occurrence inside the top ranked documents retrieved for the original query.
Terms which are the best neighbors of each query term are used to expand the original query.
– Global analysis Search for term correlations in the whole collection Usually involve the building of a thesaurus which identifies
term relationships in the whole collection. The terms are treated as concepts The thesaurus is viewed as a concept relationship structure. Consider the use of small contexts and phrase structures
Query Expansion Through Local Context Analysis (Cont.)
Local context analysis [Xu & Croft 1996]
– Combines global & local analysis
1. Retrieve the top n ranked passages using the original query.
   Break up the documents initially retrieved by the query into fixed length passages.
   Rank these passages as if they were documents.
2. For each concept c in the top ranked passages, the similarity sim(q, c) between the whole query q and the concept c is computed using a variant of tf-idf ranking.
3. The top m ranked concepts are added to the original query q.
   The concept in position i of the final concept ranking is assigned the weight

     1 - 0.9 \times i / m

   The terms in the original query q are stressed with a weight of 2.
Query Expansion Through Local Context Analysis (Cont.)
Similarity between each related concept c and the original query q

  sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log(f(c, k_i) \times idf_c)}{\log n} \right)^{idf_i}

– Correlation between the concept c and the query term k_i:

  f(c, k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}

n: the number of top ranked passages considered
pf_{i,j}: frequency of term k_i in the j-th passage
pf_{c,j}: frequency of the concept c in the j-th passage
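The concept-scoring formula above can be sketched as a direct translation. This is a simplified reading of the Xu & Croft formula; the name `lca_sim`, the handling of zero co-occurrence, and the default delta=0.1 are assumptions.

```python
import math

def lca_sim(query_idfs, f, idf_c, n, delta=0.1):
    """sim(q, c) = prod over k_i in q of
       (delta + log(f(c, k_i) * idf_c) / log n) ** idf_i

    query_idfs: {k_i: idf_i} for the query terms
    f: {k_i: f(c, k_i)} co-occurrence of each query term with concept c
    idf_c: inverse frequency of the concept; n: number of top passages
    """
    sim = 1.0
    for k_i, idf_i in query_idfs.items():
        co = f.get(k_i, 0.0)
        # Terms never co-occurring with c contribute only delta (an assumption
        # here, to keep the log defined).
        base = delta + (math.log(co * idf_c) / math.log(n) if co > 0 else 0.0)
        sim *= base ** idf_i
    return sim
```

Because the per-term factors are multiplied, a concept must co-occur with every query term to score well, which is what lets the method favor concepts related to the query as a whole rather than to any single term.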
Query Expansion Through Local Context Analysis (Cont.)
Similarity (cont.)
– Inverse document frequency

  idf_i = \max\left(1, \frac{\log_{10}(N / np_i)}{5}\right) ; \quad idf_c = \max\left(1, \frac{\log_{10}(N / np_c)}{5}\right)

– The idf_i factor in the exponent is introduced to emphasize infrequent query terms.
– Adjusted for operation with TREC data.
– Tuning might be required for operation with a different collection.

N: the number of passages in the collection
np_i: the number of passages containing the term k_i
np_c: the number of passages containing the concept c
\delta: constant parameter which avoids a value equal to zero for sim(q, c)
5.4 Automatic Global Analysis
Global analysis– Expand the query using information from the whole set
of documents in the collection.
– Until the beginning of the 1990s, global analysis was considered to be a technique which failed to yield consistent improvements in retrieval performance with general collections.
– This perception has changed with the appearance of modern procedures for global analysis. Similarity thesaurus & statistical thesaurus
5.4.1 Query Expansion based on a Similarity Thesaurus
Similarity thesaurus– Based on term to term relationships rather than on a
matrix of co-occurrence.
– Terms for expansion are selected based on their similarity to the whole query rather than on their similarities to individual query terms.
Building similarity thesaurus– The terms are concepts in a concept space.
– In this concept space, each term is indexed by the documents in which it appears.
Query Expansion based on a Similarity Thesaurus (Cont.)
Building similarity thesaurus (cont.)
– Term vector

  \vec{k}_i = (w_{i,1}, w_{i,2}, ..., w_{i,N})

  w_{i,j} = \frac{\left(0.5 + 0.5 \frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 itf_l^2}} ; \quad itf_j = \log \frac{t}{t_j}

N: the number of documents in the collection
t: the number of terms in the collection
f_{i,j}: frequency of occurrence of the term k_i in the document d_j
t_j: the number of distinct index terms in the document d_j
itf_j: the inverse term frequency of the document d_j
\max_j(f_{i,j}): the maximum of all f_{i,j} factors for the i-th term
Query Expansion based on a Similarity Thesaurus (Cont.)
Building similarity thesaurus (cont.)
– Relationship between two terms k_u and k_v:

  c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}

  The weights are based on interpreting documents as indexing elements instead of repositories for term co-occurrence.
– Built through the computation of the correlation factor c_{u,v} for each pair of indexing terms in the collection.
  Computationally expensive
  However, the global similarity thesaurus has to be computed only once and can be updated incrementally.
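The term-to-term relationship is just a dot product in the concept space. A minimal sketch over sparse vectors; `term_correlation` is an illustrative name.

```python
def term_correlation(wu, wv):
    """c_{u,v} = k_u . k_v = sum over documents of w_{u,j} * w_{v,j}.

    wu, wv: dicts mapping document id -> weight of the term in that document,
            i.e. the term vectors of the concept space.
    """
    return sum(w * wv[d] for d, w in wu.items() if d in wv)
```

Only documents indexed by both terms contribute, so sparse term vectors keep the pairwise computation tractable even though every term pair must, in principle, be considered.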
Query Expansion based on a Similarity Thesaurus (Cont.)
Query expansion
1. Represent the query in the concept space used for representation of the index terms:

     \vec{q} = \sum_{k_i \in q} w_{i,q} \, \vec{k}_i

   w_{i,q}: weight associated to the index-query pair [k_i, q]

2. Based on the global similarity thesaurus, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q:

     sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}
Query Expansion based on a Similarity Thesaurus (Cont.)
Query expansion (cont.)
3. Expand the query with the top r ranked terms according to sim(q, k_v).
   To each expansion term k_v in the expanded query q' is assigned a weight w_{v,q'} given by

     w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}

[Figure: query q = {k_a, k_b} and its centroid Q_c in the concept space, with nearby terms k_i, k_v, k_j]

The distance of a given term k_v to the query centroid Q_c might be quite distinct from the distances of k_v to the individual query terms.
Query Expansion based on a Similarity Thesaurus (Cont.)
Query-document similarity in the term-concept space

  sim(q, d_j) \sim \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v} ; \quad \vec{d}_j = \sum_{k_i \in d_j} w_{i,j} \, \vec{k}_i

– Analogous to the formula for query-document similarity in the generalized vector space model.
– Thus, the GVSM can be interpreted as a query expansion technique.
– Main difference
  Weight computation
  Only the top r ranked terms are used for query expansion with the term-concept technique.
5.4.2 Query Expansion based on a Statistical Thesaurus
Global statistical thesaurus– The terms selected for expansion must have high term
discrimination values which implies that they must be low frequency terms.
– However, it is difficult to cluster low frequency terms effectively due to the small amount of information about them.
– To circumvent this problem, we cluster documents into classes instead and use the low frequency terms in these documents to define our thesaurus classes. The document clustering algorithm must produce small and
tight clusters. (complete link algorithm)
Query Expansion based on a Statistical Thesaurus (Cont.)
Complete link algorithm
1. Initially, place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
   The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.
3. Determine the pair of clusters [C_u, C_v] with the highest inter-cluster similarity.
4. Merge the clusters C_u and C_v.
5. Verify a stop criterion. If this criterion is not met then go back to step 2.
6. Return a hierarchy of clusters.
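The steps above can be sketched as a naive agglomerative procedure. This quadratic-per-merge sketch only illustrates the complete-link (minimum pairwise similarity) criterion; the similarity-threshold stop criterion and the function names are assumptions, and the flat cluster list stands in for the full hierarchy.

```python
def complete_link(docs, sim, threshold):
    """Complete-link agglomerative clustering (naive sketch).

    docs: list of document ids
    sim: symmetric function sim(d1, d2) -> similarity in [0, 1]
    threshold: stop merging when the best pair falls below this value
    """
    clusters = [[d] for d in docs]

    def cluster_sim(a, b):
        # Complete link: similarity of clusters = MINIMUM pairwise similarity.
        return min(sim(x, y) for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(cluster_sim(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        best, i, j = max(pairs)
        if best < threshold:  # stop criterion
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Taking the minimum over all inter-cluster document pairs is what makes the resulting clusters small and tight, which is exactly the property the thesaurus construction needs.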
Query Expansion based on a Statistical Thesaurus (Cont.)
Hierarchy of three clusters generated by the complete link algorithm
– Inter-cluster similarities indicated in the ovals

[Figure: hierarchy over clusters C_u, C_v, and C_z, with inter-cluster similarities 0.15 and 0.11]
Query Expansion based on a Statistical Thesaurus (Cont.)
The terms that compose each class of the global thesaurus are selected as follows.
– Obtain from the user three parameters
  Threshold class (TC)
  Number of documents in a class (NDC)
  Maximum inverse document frequency (MIDF)
– Use the parameter TC as a threshold value for determining the document clusters that will be used to generate thesaurus classes.
  This threshold has to be surpassed by sim(C_u, C_v) if the documents in the clusters C_u and C_v are to be selected as sources of terms for a thesaurus class.
Query Expansion based on a Statistical Thesaurus (Cont.)
The terms that compose each class of the global thesaurus are selected as follows. (cont.)– Use the parameter NDC as a limit on the size of
clusters (number of documents) to be considered.
– The parameter MIDF defines the maximum value of inverse document frequency for any term which is selected to participate in a thesaurus class. By doing so, it is possible to ensure that only low frequency
terms participate in the thesaurus classes generated (terms too generic are not good synonyms).
Query Expansion based on a Statistical Thesaurus (Cont.)
Query expansion
– Average term weight for each thesaurus class C

  wt_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}

– Thesaurus class weight

  w_C = \frac{wt_C}{0.5 \times |C|}

|C|: number of terms in the thesaurus class C
w_{i,C}: precomputed weight associated with the term-class pair [k_i, C]
These weight formulations have been verified through experimentation and have yielded good results.
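Assuming the weighting reconstructed above (w_C = wt_C divided by 0.5|C|), the computation is a two-line function; `class_weight` is an illustrative name.

```python
def class_weight(term_weights):
    """Average term weight wt_C and thesaurus class weight w_C = wt_C / (0.5 * |C|).

    term_weights: precomputed weights w_{i,C} of the terms in class C.
    Formula follows the reconstruction above; treat it as a sketch.
    """
    C = len(term_weights)
    wt_C = sum(term_weights) / C
    return wt_C / (0.5 * C)
```

Note the 1/|C| factor appears twice (once in the average, once in the class weight), so larger classes are penalized: a term drawn from a small, tight class carries more weight in the expanded query.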
Query Expansion based on a Statistical Thesaurus (Cont.)
Experiments with four test collections (ADI, Medlars, CACM, and ISI)
– Indicate that global analysis using a thesaurus built by the complete link algorithm might yield consistent improvements in retrieval performance.
Main problem: initialization of the parameters TC, NDC, and MIDF
– TC depends on the collection and can be difficult to set properly.
– Inspection of the cluster hierarchy is almost always necessary.
– A high value of TC might yield classes with too few terms, while a low value of TC might yield too few classes.
– NDC can be decided more easily once TC has been set.
– MIDF might be difficult to set and also requires careful consideration.
5.5 Trends and Research Issues
Relevance strategies for dealing with visual displays– New techniques for capturing feedback information from the user
are desirable.
Investigating the utilization of global analysis techniques in the Web
The application of local analysis techniques to the Web– Development of techniques for speeding up query processing at
the search engine site
The combination of local analysis, global analysis, visual displays, and interactive interfaces