TRANSCRIPT
Chap. 5
Chapter 5
Query Operations
Contents
Introduction
User relevance feedback
Automatic local analysis
Automatic global analysis
Trends and research issues
5.1 Introduction
Difficulty of formulating user queries
– Insufficient knowledge of the collection make-up and of the retrieval environment
Query reformulation
– Two basic steps
  Query expansion: expanding the original query with new terms
  Term reweighting: reweighting the terms in the expanded query
Introduction (Cont.)
Query reformulation (cont.)
– Three approaches
  User relevance feedback: based on feedback information from the user
  Local feedback: based on information derived from the set of documents initially retrieved (the local set)
  Global feedback: based on global information derived from the document collection
5.2 User Relevance Feedback
User’s role in relevance feedback cycle– is presented with a list of the retrieved documents
– marks relevant documents
Main idea of relevance feedback– Selecting important terms, or expressions, attached to the
documents that have been identified as relevant by the user
– Enhancing the importance of these terms in new query formulation
The new query will be moved towards the relevant documents and away from the non-relevant ones
User Relevance Feedback (Cont.)
Experiments have shown good improvements in precision for small test collections when RF is used.
Advantage of RF– Shields the user from the details of the query
reformulation process.
– Breaks down the whole searching task into a sequence of small steps which are easier to grasp.
– Provides a controlled process designed to emphasize relevant terms and de-emphasize non-relevant terms.
5.2.1 Query Expansion and Term Reweighting for the Vector Model
Application of RF to the vector model– Considers that the term-weight vectors of the
documents identified as relevant have similarities among themselves.
– It is assumed that non-relevant documents have term-weight vectors which are dissimilar from the ones for the relevant documents.
– The basic idea is to reformulate the query such that it gets closer to the term-weight vector space of the relevant documents
Query Expansion and Term Reweighting for the VM (Cont.)
Optimal query

  \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j

– But, the relevant documents (C_r) are not known a priori.

D_r: set of relevant documents among the retrieved documents
D_n: set of non-relevant documents among the retrieved documents
C_r: set of relevant documents among all documents in the collection
|D_r|, |D_n|, |C_r|: number of documents in the sets D_r, D_n, and C_r, respectively
\alpha, \beta, \gamma: tuning constants
Query Expansion and Term Reweighting for the VM (Cont.)
Incremental change of the initial query vector
– Standard_Rocchio:

  \vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j

– Ide_Regular:

  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j

– Ide_Dec_Hi:

  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \, \max_{non\text{-}relevant}(\vec{d}_j)

  \max_{non\text{-}relevant}(\vec{d}_j): the highest ranked non-relevant document

The information contained in the relevant documents is more important than the information provided by the non-relevant documents (\beta > \gamma).
Positive feedback strategy: \gamma = 0
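The Rocchio-style update can be sketched over sparse dict vectors. This is a minimal illustration, not the book's code: the helper names `add` and `rocchio` are hypothetical, and the default constants alpha=1.0, beta=0.75, gamma=0.15 are common choices assumed here, not prescribed by the text.

```python
# A minimal sketch of Standard_Rocchio over sparse term-weight vectors.
# Vectors are dicts mapping term -> weight; names and defaults are illustrative.

def add(q, d, factor):
    """Accumulate factor * d into the query vector q (in place)."""
    for term, w in d.items():
        q[term] = q.get(term, 0.0) + factor * w

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q + (beta/|Dr|) * sum(rel) - (gamma/|Dn|) * sum(nonrel)."""
    q_m = {t: alpha * w for t, w in q.items()}
    for d in rel:
        add(q_m, d, beta / len(rel))
    for d in nonrel:
        add(q_m, d, -gamma / len(nonrel))
    # Negative weights are usually clipped to zero in practice.
    return {t: w for t, w in q_m.items() if w > 0}
```

Note how relevant documents both reweight existing query terms and expand the query with new terms (here, any term of a relevant document absent from the original query).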
Query Expansion and Term Reweighting for the VM (Cont.)
Advantages– Simplicity
The modified term weights are computed directly from the set of retrieved documents.
– Good results The modified query vector does reflect a portion of the
intended query semantics.
Disadvantages– No optimality criterion is adopted.
5.2.2 Term Reweighting for the Probabilistic Model
Probabilistic ranking formula

  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)

– But, the probabilities P(k_i|R) and P(k_i|\bar{R}) are unknown.

P(k_i|R): probability of observing the term k_i in the set R of relevant documents
P(k_i|\bar{R}): probability of observing the term k_i in the set \bar{R} of non-relevant documents
Term Reweighting for the Probabilistic Model (Cont.)
Probability estimation
– Initial search
  P(k_i|R) is constant for all terms k_i: P(k_i|R) = 0.5
  The term probability distribution P(k_i|\bar{R}) can be approximated by the distribution in the whole collection: P(k_i|\bar{R}) = n_i / N

  sim_{initial}(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \frac{N - n_i}{n_i}

n_i: the number of documents in the collection which contain the term k_i
Term Reweighting for the Probabilistic Model (Cont.)
Probability estimation (cont.)
– Feedback search
  Accumulated statistics related to the relevance or non-relevance of previously retrieved documents are used:

  P(k_i|R) = \frac{|D_{r,i}|}{|D_r|} ; \quad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}

  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \left( \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \cdot \frac{N - |D_r| - n_i + |D_{r,i}|}{n_i - |D_{r,i}|} \right)

D_r: set of relevant retrieved documents
D_{r,i}: subset of D_r composed of the documents which contain the term k_i
Term Reweighting for the Probabilistic Model (Cont.)
Probability estimation (cont.)
– Feedback search (cont.)
  No query expansion occurs. The same query terms are reweighted using feedback information provided by the user.
  Problems arise for small values of |D_r| and |D_{r,i}|.
  An adjustment factor is often added: 0.5 or n_i/N.

  P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1} ; \quad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}

  or

  P(k_i|R) = \frac{|D_{r,i}| + n_i/N}{|D_r| + 1} ; \quad P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + n_i/N}{N - |D_r| + 1}
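The adjusted estimates can be combined into a single feedback term weight. A sketch, using the 0.5 adjustment factor; the function name `term_weight` is a hypothetical label, not from the text.

```python
import math

def term_weight(n_i, N, Dr, Dri):
    """Probabilistic feedback reweighting with the 0.5 adjustment factor.

    n_i: documents in the collection containing the term
    N:   collection size
    Dr:  number of relevant retrieved documents (|D_r|)
    Dri: relevant retrieved documents containing the term (|D_{r,i}|)
    """
    p = (Dri + 0.5) / (Dr + 1)            # P(k_i | R)
    q = (n_i - Dri + 0.5) / (N - Dr + 1)  # P(k_i | not R)
    return math.log(p / (1 - p)) + math.log((1 - q) / q)
```

The adjustment keeps both estimates strictly between 0 and 1, so the log-odds stay finite even when a term appears in all or none of the relevant documents.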
Term Reweighting for the Probabilistic Model (Cont.)
Advantages of probabilistic RF
– The feedback process is directly related to the derivation of new weights for query terms.
– Term reweighting is optimal under the assumptions of term independence and binary document indexing.
Disadvantages of probabilistic RF
– Document term weights are not taken into account during the feedback loop.
– Weights of terms in the previous query formulations are also disregarded.
– No query expansion is used.
Probabilistic RF methods do not in general operate as effectively as the conventional vector modification methods.
5.2.3 A Variant of Probabilistic Term Reweighting
Probabilistic ranking formula

  sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \, F_{i,j,q}

Initial search

  F_{i,j,q} = (C + idf_i) \, \bar{f}_{i,j} ; \quad \bar{f}_{i,j} = K + (1 - K) \frac{f_{i,j}}{\max_j(f_{i,j})}

– Parameters C and K should be adjusted according to the collection. For automatically indexed collections, C = 0.
Feedback search

  F_{i,j,q} = \left( C + \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) \bar{f}_{i,j}

\bar{f}_{i,j}: normalized within-document frequency
A Variant of Probabilistic Term Reweighting (Cont.)
Advantages– It takes into account the within-document frequencies.
– It adopts a normalized version of these frequencies.
– It introduces the constants C and K, which provide for greater flexibility.
Disadvantages– More complex formulation
– No query expansion
5.2.4 Evaluation of Relevance Feedback Strategies
A simplistic evaluation
– Retrieve a set of documents using the modified query.
– Measure recall-precision figures relative to the set of relevant documents for the original query.
– The results show spectacular improvements.
A significant part of this improvement results from the higher ranks assigned to the set R of documents already identified as relevant during the feedback process.
Since the user has seen these documents already, such evaluation is unrealistic.
It masks any real gains in retrieval performance due to documents not yet seen by the user.
Evaluation of Relevance Feedback Strategies (Cont.)
Residual collection evaluation
– Evaluate the retrieval performance of the modified query considering only the residual collection.
– Our main purpose is to compare the performance of distinct RF strategies.
– Any experimentation involving RF strategies should always evaluate recall-precision figures relative to the residual collection.
Residual collection: the set of all documents minus the set of feedback documents provided by the user
5.3 Automatic Local Analysis
User relevance feedback– Expanded query will retrieve more relevant documents.
– There is an underlying notion of clustering Known relevant documents contain terms which can be used
to describe a larger cluster of relevant documents. The description of this larger cluster of relevant documents is
built interactively with assistance from the user.
Automatic Local Analysis (Cont.)
Automatic relevance feedback– Obtain a description for a larger cluster of relevant
documents automatically.
– Involves identifying terms which are related to the query terms Synonyms, stemming variations, terms which are close to the
query terms in the text, …
– Global feedback & local feedback
Automatic Local Analysis (Cont.)
Global feedback– All documents in the collection are used to determine a
global thesaurus-like structure which defines term relationships.
– This structure is then shown to the user who selects terms for query expansion.
Local feedback– The documents retrieved for a given query q are
examined at query time to determine terms for query expansion.
– Is done without assistance from the user Local clustering & local context analysis
5.3.1 Query Expansion Through Local Clustering
Global clustering
– Build global structures, such as association matrices, which quantify term correlations
– Use correlated terms for query expansion
– Main problem
  There is no consistent evidence that global structures can be used effectively to improve retrieval performance with general collections.
  Global structures do not adapt well to the local context defined by the current query.
Local clustering [Attar & Fraenkel 1977]
– Aims at optimizing the current search.
Query Expansion Through Local Clustering (Cont.)
Basic terminology
– Stem
  V(s): a non-empty subset of words which are grammatical variants of each other
  A canonical form s of V(s) is called a stem. If V(s) = {polish, polishing, polished} then s = polish.
– Local document set D_l
  The set of documents retrieved for a given query q
– Local vocabulary V_l
  The set of all distinct words in the local document set
  The set of all distinct stems derived from the set V_l is referred to as S_l.
Query Expansion Through Local Clustering (Cont.)
Local clustering
– Operates solely on the documents retrieved for the current query
– Its application to the Web is unlikely at this time, since it is frequently necessary to access the text of the retrieved documents
  At a client machine
    Retrieving the text for local analysis would take too long
    It would drastically reduce the interactive nature of the Web interface and the satisfaction of the user
  At the search engine site
    Analyzing the text would represent an extra expenditure of CPU time which is not cost effective at this time.
Query Expansion Through Local Clustering (Cont.)
Local clustering (cont.)– Quite useful in the environment of intranets
– Of great assistance for searching information in specialized document collections (e.g. medical document collection)
– Local feedback strategies are based on expanding the query with terms correlated to the query terms. Such correlated terms are those present in local clusters built from the local document set.
Query Expansion Through Local Clustering (Cont.)
Association clusters
– Based on the frequency of co-occurrence of stems (or terms) inside documents
– The idea is that terms which co-occur frequently inside documents have a synonymity association.
– Correlation between the stems s_u and s_v:

  c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}

f_{s_i,j}: frequency of a stem s_i in a document d_j, d_j \in D_l
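The co-occurrence correlation above amounts to one pass over the local document set. A sketch; the function name `association_matrix` is an illustrative choice, not from the text.

```python
from collections import defaultdict

def association_matrix(docs):
    """c[u][v] = sum over local documents of f(s_u, d_j) * f(s_v, d_j).

    docs: list of stem->frequency dicts, one per document in the local set D_l.
    Returns the unnormalized local association matrix as nested dicts.
    """
    c = defaultdict(lambda: defaultdict(float))
    for freqs in docs:
        for u, fu in freqs.items():
            for v, fv in freqs.items():
                c[u][v] += fu * fv
    return c
```

Note that the diagonal entries c[u][u] are the squared-frequency sums used later by the normalized variant.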
Query Expansion Through Local Clustering (Cont.)
Association clusters (cont.)
– Unnormalized local association matrix

  \vec{m} = (m_{ij}): association matrix with |S_l| rows and |D_l| columns, where m_{ij} = f_{s_i,j}
  \vec{s} = \vec{m} \, \vec{m}^t: local stem-stem association matrix, with s_{u,v} = c_{u,v}
  \vec{m}^t: transpose of \vec{m}

– Normalized local association matrix

  s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}
Query Expansion Through Local Clustering (Cont.)
Association clusters (cont.)
– Building local association clusters
  Consider the u-th row in the association matrix \vec{s} (i.e., the row with all the associations for the stem s_u).
  Let S_u(n) be a function which takes the u-th row and returns the set of n largest values s_{u,v}, where v varies over the set of local stems and v \neq u.
  Then S_u(n) defines a local association cluster around the stem s_u.
  If s_{u,v} is unnormalized, the association cluster is said to be unnormalized.
  If s_{u,v} is normalized, the association cluster is said to be normalized.
Query Expansion Through Local Clustering (Cont.)
Metric clusters
– Two terms which occur in the same sentence seem more correlated than two terms which occur far apart in a document.
– It might be worthwhile to factor in the distance between two terms in the computation of their correlation factor.
– Correlation between the stems s_u and s_v:

  c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}

r(k_i, k_j): distance between the two keywords k_i and k_j (the number of words between them in a same document)
r(k_i, k_j) = \infty: k_i and k_j are in distinct documents
Query Expansion Through Local Clustering (Cont.)
Metric clusters (cont.)
– Unnormalized local metric correlation matrix

  s_{u,v} = c_{u,v}

– Normalized local metric correlation matrix

  s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}

\vec{s}: local stem-stem metric correlation matrix
Query Expansion Through Local Clustering (Cont.)
Metric clusters (cont.)
– Building local metric clusters
  Consider the u-th row in the metric correlation matrix \vec{s} (i.e., the row with all the associations for the stem s_u).
  Let S_u(n) be a function which takes the u-th row and returns the set of n largest values s_{u,v}, where v varies over the set of local stems and v \neq u.
  Then S_u(n) defines a local metric cluster around the stem s_u.
  If s_{u,v} is unnormalized, the metric cluster is said to be unnormalized.
  If s_{u,v} is normalized, the metric cluster is said to be normalized.
Query Expansion Through Local Clustering (Cont.)
Scalar clusters
– The idea is that two stems with similar neighborhoods have some synonymity relationship.
– The relationship is indirect or induced by the neighborhood.
– Quantifying such neighborhood relationships
  Arrange all correlation values s_{u,i} in a vector \vec{s}_u
  Arrange all correlation values s_{v,i} in another vector \vec{s}_v
  Compare these vectors through a scalar measure
  The cosine of the angle between the two vectors is a popular scalar similarity measure.
Query Expansion Through Local Clustering (Cont.)
Scalar clusters (cont.)
– Scalar association matrix \vec{s}

  s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}

  where \vec{s}_u = (s_{u,1}, s_{u,2}, ..., s_{u,n}) and \vec{s}_v = (s_{v,1}, s_{v,2}, ..., s_{v,n})

– Building scalar clusters
  Let S_u(n) be a function which returns the set of n largest values s_{u,v}, v \neq u.
  Then S_u(n) defines a scalar cluster around the stem s_u.
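The scalar (cosine) comparison of two neighborhood vectors can be sketched directly from a correlation matrix. The representation of the matrix as a dict keyed by stem pairs, and the name `scalar_association`, are illustrative assumptions.

```python
import math

def scalar_association(c, u, v, stems):
    """Cosine of the angle between the correlation vectors of stems u and v.

    c: dict mapping (stem, stem) pairs to correlation values s_{u,v}
    stems: the local stems over which the neighborhood vectors are built
    """
    su = [c.get((u, x), 0.0) for x in stems]
    sv = [c.get((v, x), 0.0) for x in stems]
    dot = sum(a * b for a, b in zip(su, sv))
    norm = math.sqrt(sum(a * a for a in su)) * math.sqrt(sum(b * b for b in sv))
    return dot / norm if norm else 0.0
```

Two stems whose neighborhood vectors are proportional get a scalar association of 1, even if the stems themselves never co-occur, which is exactly the "induced" relationship the slide describes.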
Query Expansion Through Local Clustering (Cont.)
Interactive search formulation
– Neighbor
  A stem s_u which belongs to a cluster S_v(n) associated to another stem s_v is said to be a neighbor of s_v.
  While neighbor stems are said to have a synonymity relationship, they are not necessarily synonyms in the grammatical sense.
  They represent distinct keywords which are correlated by the current query context.
  The local aspect of this correlation is reflected in the fact that the documents and stems considered in the correlation matrix are all local.
  Neighbors of the query stems can be used to expand the original query.
Query Expansion Through Local Clustering (Cont.)
Interactive search formulation (cont.)
– Neighbors are an important product of the local clustering process.
  They can be used for extending a search formulation in a promising, unexpected direction, rather than merely complementing it with missing synonyms.

[Figure: the stem s_u as a neighbor of the stem s_v, i.e., s_u inside the cluster S_v(n)]
Query Expansion Through Local Clustering (Cont.)
Expanding a given query q with neighbor stems
– For each stem s_v \in q
  Select m neighbor stems from the cluster S_v(n)
  Add them to the query
– Merging of normalized and unnormalized clusters
  To cover a broader neighborhood, the set S_v(n) might be composed of stems obtained using both normalized and unnormalized correlation factors c_{u,v}.
  An unnormalized cluster tends to group stems whose ties are due to their large frequencies.
  A normalized cluster tends to group stems which are more rare.
  The union of the two clusters provides a better representation of the possible correlations.
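The per-stem expansion step can be sketched as a small function. The cluster representation (precomputed, sorted neighbor lists) and the name `expand_query` are assumptions for illustration.

```python
def expand_query(query_stems, clusters, m):
    """Expand a query with up to m neighbor stems per query stem.

    query_stems: list of stems in the original query
    clusters: dict mapping each stem s_v to its cluster S_v(n), given as a
              list of (neighbor_stem, correlation) pairs sorted descending
    """
    expanded = list(query_stems)
    for s in query_stems:
        for neighbor, _corr in clusters.get(s, [])[:m]:
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded
```

In practice the clusters would come from the association, metric, or scalar matrices built over the local document set, possibly merging normalized and unnormalized variants as described above.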
Query Expansion Through Local Clustering (Cont.)
Expanding a given query q with neighbor stems (cont.)
– Use information about correlated stems to improve the search
  If the correlation factor c_{u,v} is larger than a predefined threshold, then a neighbor stem of s_u can also be interpreted as a neighbor stem of s_v, and vice versa.
  This provides greater flexibility, particularly with Boolean queries.
Query Expansion Through Local Clustering (Cont.)
Experimental results– Usually support the hypothesis of the usefulness of
local clustering methods
– Metric clusters seem to perform better than purely association clusters. This strengthens the hypothesis that there is a correlation
between the association of two terms and the distance between them.
Query Expansion Through Local Clustering (Cont.)
Qualitative arguments– Qualitative arguments in this section are explicitly
based on the fact that all the clusters are local.
– In a global context, clusters are derived from all the documents in the collection which implies that our qualitative argumentation might not stand.
– The main reason is that correlations valid in the whole corpus might not be valid for the current query.
5.3.2 Query Expansion Through Local Context Analysis
Local clustering vs. global analysis– Local clustering
Based on term co-occurrence inside the top ranked documents retrieved for the original query.
Terms which are the best neighbors of each query term are used to expand the original query.
– Global analysis Search for term correlations in the whole collection Usually involve the building of a thesaurus which identifies
term relationships in the whole collection. The terms are treated as concepts The thesaurus is viewed as a concept relationship structure. Consider the use of small contexts and phrase structures
Query Expansion Through Local Context Analysis (Cont.)
Local context analysis [Xu & Croft 1996]
– Combines global & local analysis
1. Retrieve the top n ranked passages using the original query.
   Break up the documents initially retrieved by the query into fixed length passages.
   Rank these passages as if they were documents.
2. For each concept c in the top ranked passages, the similarity sim(q, c) between the whole query q and the concept c is computed using a variant of tf-idf ranking.
3. The top m ranked concepts are added to the original query q.
   The concept in position i of the final concept ranking is assigned the weight

     1 - 0.9 \times i / m

   The terms in the original query q are stressed with a weight of 2.
Query Expansion Through Local Context Analysis (Cont.)
Similarity between each related concept c and the original query q

  sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log(f(c, k_i) \times idf_c)}{\log n} \right)^{idf_i}

– Correlation between the concept c and the query term k_i:

  f(c, k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}

n: the number of top ranked passages considered
pf_{i,j}: frequency of term k_i in the j-th passage
pf_{c,j}: frequency of the concept c in the j-th passage
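The concept-scoring formula above can be sketched as a direct translation. This is a simplified reading of the Xu & Croft formula; the name `lca_sim`, the handling of zero co-occurrence, and the default delta=0.1 are assumptions.

```python
import math

def lca_sim(query_idfs, f, idf_c, n, delta=0.1):
    """sim(q, c) = prod over k_i in q of
       (delta + log(f(c, k_i) * idf_c) / log n) ** idf_i

    query_idfs: {k_i: idf_i} for the query terms
    f: {k_i: f(c, k_i)} co-occurrence of each query term with concept c
    idf_c: inverse frequency of the concept; n: number of top passages
    """
    sim = 1.0
    for k_i, idf_i in query_idfs.items():
        co = f.get(k_i, 0.0)
        # Terms never co-occurring with c contribute only delta (an assumption
        # here, to keep the log defined).
        base = delta + (math.log(co * idf_c) / math.log(n) if co > 0 else 0.0)
        sim *= base ** idf_i
    return sim
```

Because the per-term factors are multiplied, a concept must co-occur with every query term to score well, which is what lets the method favor concepts related to the query as a whole rather than to any single term.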
Query Expansion Through Local Context Analysis (Cont.)
Similarity (cont.)
– Inverse document frequency

  idf_i = \max\left(1, \frac{\log_{10}(N / np_i)}{5}\right) ; \quad idf_c = \max\left(1, \frac{\log_{10}(N / np_c)}{5}\right)

– The idf_i factor in the exponent is introduced to emphasize infrequent query terms.
– Adjusted for operation with TREC data.
– Tuning might be required for operation with a different collection.

N: the number of passages in the collection
np_i: the number of passages containing the term k_i
np_c: the number of passages containing the concept c
\delta: constant parameter which avoids a value equal to zero for sim(q, c)
5.4 Automatic Global Analysis
Global analysis– Expand the query using information from the whole set
of documents in the collection.
– Until the beginning of the 1990s, global analysis was considered to be a technique which failed to yield consistent improvements in retrieval performance with general collections.
– This perception has changed with the appearance of modern procedures for global analysis. Similarity thesaurus & statistical thesaurus
5.4.1 Query Expansion based on a Similarity Thesaurus
Similarity thesaurus– Based on term to term relationships rather than on a
matrix of co-occurrence.
– Terms for expansion are selected based on their similarity to the whole query rather than on their similarities to individual query terms.
Building similarity thesaurus– The terms are concepts in a concept space.
– In this concept space, each term is indexed by the documents in which it appears.
Query Expansion based on a Similarity Thesaurus (Cont.)
Building similarity thesaurus (cont.)
– Term vector

  \vec{k}_i = (w_{i,1}, w_{i,2}, ..., w_{i,N})

  w_{i,j} = \frac{\left(0.5 + 0.5 \frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 itf_l^2}} ; \quad itf_j = \log \frac{t}{t_j}

N: the number of documents in the collection
t: the number of terms in the collection
f_{i,j}: frequency of occurrence of the term k_i in the document d_j
t_j: the number of distinct index terms in the document d_j
itf_j: the inverse term frequency of the document d_j
\max_j(f_{i,j}): the maximum of all f_{i,j} factors for the i-th term
Query Expansion based on a Similarity Thesaurus (Cont.)
Building similarity thesaurus (cont.)
– Relationship between two terms k_u and k_v:

  c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}

  The weights are based on interpreting documents as indexing elements instead of repositories for term co-occurrence.
– Built through the computation of the correlation factor c_{u,v} for each pair of indexing terms in the collection.
  Computationally expensive
  However, the global similarity thesaurus has to be computed only once and can be updated incrementally.
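The term-to-term relationship is just a dot product in the concept space. A minimal sketch over sparse vectors; `term_correlation` is an illustrative name.

```python
def term_correlation(wu, wv):
    """c_{u,v} = k_u . k_v = sum over documents of w_{u,j} * w_{v,j}.

    wu, wv: dicts mapping document id -> weight of the term in that document,
            i.e. the term vectors of the concept space.
    """
    return sum(w * wv[d] for d, w in wu.items() if d in wv)
```

Only documents indexed by both terms contribute, so sparse term vectors keep the pairwise computation tractable even though every term pair must, in principle, be considered.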
Query Expansion based on a Similarity Thesaurus (Cont.)
Query expansion
1. Represent the query in the concept space used for representation of the index terms:

     \vec{q} = \sum_{k_i \in q} w_{i,q} \, \vec{k}_i

   w_{i,q}: weight associated to the index-query pair [k_i, q]

2. Based on the global similarity thesaurus, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q:

     sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q} \times c_{u,v}
Query Expansion based on a Similarity Thesaurus (Cont.)
Query expansion (cont.)
3. Expand the query with the top r ranked terms according to sim(q, k_v).
   To each expansion term k_v in the expanded query q' is assigned a weight w_{v,q'} given by

     w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}

[Figure: query q = {k_a, k_b} and its centroid Q_c in the concept space, with nearby terms k_i, k_v, k_j]

The distance of a given term k_v to the query centroid Q_c might be quite distinct from the distances of k_v to the individual query terms.
Query Expansion based on a Similarity Thesaurus (Cont.)
Query-document similarity in the term-concept space

  sim(q, d_j) \sim \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v} ; \quad \vec{d}_j = \sum_{k_i \in d_j} w_{i,j} \, \vec{k}_i

– Analogous to the formula for query-document similarity in the generalized vector space model.
– Thus, the GVSM can be interpreted as a query expansion technique.
– Main difference
  Weight computation
  Only the top r ranked terms are used for query expansion with the term-concept technique.
5.4.2 Query Expansion based on a Statistical Thesaurus
Global statistical thesaurus– The terms selected for expansion must have high term
discrimination values which implies that they must be low frequency terms.
– However, it is difficult to cluster low frequency terms effectively due to the small amount of information about them.
– To circumvent this problem, we cluster documents into classes instead and use the low frequency terms in these documents to define our thesaurus classes. The document clustering algorithm must produce small and
tight clusters. (complete link algorithm)
Query Expansion based on a Statistical Thesaurus (Cont.)
Complete link algorithm
1. Initially, place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
   The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.
3. Determine the pair of clusters [C_u, C_v] with the highest inter-cluster similarity.
4. Merge the clusters C_u and C_v.
5. Verify a stop criterion. If this criterion is not met then go back to step 2.
6. Return a hierarchy of clusters.
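The steps above can be sketched as a naive agglomerative procedure. This quadratic-per-merge sketch only illustrates the complete-link (minimum pairwise similarity) criterion; the similarity-threshold stop criterion and the function names are assumptions, and the flat cluster list stands in for the full hierarchy.

```python
def complete_link(docs, sim, threshold):
    """Complete-link agglomerative clustering (naive sketch).

    docs: list of document ids
    sim: symmetric function sim(d1, d2) -> similarity in [0, 1]
    threshold: stop merging when the best pair falls below this value
    """
    clusters = [[d] for d in docs]

    def cluster_sim(a, b):
        # Complete link: similarity of clusters = MINIMUM pairwise similarity.
        return min(sim(x, y) for x in a for y in b)

    while len(clusters) > 1:
        pairs = [(cluster_sim(a, b), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        best, i, j = max(pairs)
        if best < threshold:  # stop criterion
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Taking the minimum over all inter-cluster document pairs is what makes the resulting clusters small and tight, which is exactly the property the thesaurus construction needs.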
Query Expansion based on a Statistical Thesaurus (Cont.)
Hierarchy of three clusters generated by the complete link algorithm
– Inter-cluster similarities indicated in the ovals

[Figure: hierarchy over clusters C_u, C_v, and C_z, with inter-cluster similarities 0.15 and 0.11]
Query Expansion based on a Statistical Thesaurus (Cont.)
The terms that compose each class of the global thesaurus are selected as follows.
– Obtain from the user three parameters
  Threshold class (TC)
  Number of documents in a class (NDC)
  Maximum inverse document frequency (MIDF)
– Use the parameter TC as a threshold value for determining the document clusters that will be used to generate thesaurus classes.
  This threshold has to be surpassed by sim(C_u, C_v) if the documents in the clusters C_u and C_v are to be selected as sources of terms for a thesaurus class.
Query Expansion based on a Statistical Thesaurus (Cont.)
The terms that compose each class of the global thesaurus are selected as follows. (cont.)– Use the parameter NDC as a limit on the size of
clusters (number of documents) to be considered.
– The parameter MIDF defines the maximum value of inverse document frequency for any term which is selected to participate in a thesaurus class. By doing so, it is possible to ensure that only low frequency
terms participate in the thesaurus classes generated (terms too generic are not good synonyms).
Query Expansion based on a Statistical Thesaurus (Cont.)
Query expansion
– Average term weight for each thesaurus class C

  wt_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}

– Thesaurus class weight

  w_C = \frac{wt_C}{0.5 \times |C|}

|C|: number of terms in the thesaurus class C
w_{i,C}: precomputed weight associated with the term-class pair [k_i, C]
These weight formulations have been verified through experimentation and have yielded good results.
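Assuming the weighting reconstructed above (w_C = wt_C divided by 0.5|C|), the computation is a two-line function; `class_weight` is an illustrative name.

```python
def class_weight(term_weights):
    """Average term weight wt_C and thesaurus class weight w_C = wt_C / (0.5 * |C|).

    term_weights: precomputed weights w_{i,C} of the terms in class C.
    Formula follows the reconstruction above; treat it as a sketch.
    """
    C = len(term_weights)
    wt_C = sum(term_weights) / C
    return wt_C / (0.5 * C)
```

Note the 1/|C| factor appears twice (once in the average, once in the class weight), so larger classes are penalized: a term drawn from a small, tight class carries more weight in the expanded query.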
Query Expansion based on a Statistical Thesaurus (Cont.)
Experiments with four test collections (ADI, Medlars, CACM, and ISI)
– Indicate that global analysis using a thesaurus built by the complete link algorithm might yield consistent improvements in retrieval performance.
Main problem: initialization of the parameters TC, NDC, and MIDF
– TC depends on the collection and can be difficult to set properly.
– Inspection of the cluster hierarchy is almost always necessary.
– A high value of TC might yield classes with too few terms, while a low value of TC might yield too few classes.
– NDC can be decided more easily once TC has been set.
– MIDF might be difficult to set and also requires careful consideration.
5.5 Trends and Research Issues
Relevance strategies for dealing with visual displays– New techniques for capturing feedback information from the user
are desirable.
Investigating the utilization of global analysis techniques in the Web
The application of local analysis techniques to the Web– Development of techniques for speeding up query processing at
the search engine site
The combination of local analysis, global analysis, visual displays, and interactive interfaces