chapter 5: query operations hassan bashiri april 2009 1

46
Chapter 5: Query Operations Hassan Bashiri April 2009 1

Upload: jalen-florey

Post on 31-Mar-2015

223 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Chapter 5:

Query Operations

Hassan Bashiri

April 20091

Page 2: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Cross-Language

• What is CLIR?

• Users enter their query in one language and the search engine retrieves relevant documents in other languages.

Retrieval SystemEnglish Query French Documents

2

Page 3: Chapter 5: Query Operations Hassan Bashiri April 2009 1

11

Term-aligned Sentence-aligned Document-aligned Unaligned

Parallel Comparable

Knowledge-based Corpus-based

Controlled Vocabulary Free Text

Cross-Language Text Retrieval

Query Translation Document Translation

Text Translation Vector Translation

Ontology-based Dictionary-based

Thesaurus-based

3

Page 4: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Language

Visual languages Example: library shown on the screen. Act:

take books, open catalogs, etc.

Better Boolean queries: “I need books by

Cervantes AND Lope de Vega”?!

4

Page 5: Chapter 5: Query Operations Hassan Bashiri April 2009 1

IR Interface

Query interface

Selection interface

Examination interface

Document delivery

5

Page 6: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Retrieval System Model

Query Formulation

Detection

Delivery

Selection

Examination

Index

Docs

User

Indexing

6

Page 7: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Starfield

7

Page 8: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Formulation

No detailed knowledge of collection and retrieval environment difficult to formulate queries well designed for retrieval Need many formulations of queries for good retrieval

First formulation: naïve attempt to retrieve relevant information Documents initially retrieved:

Examined for relevance information Improved query formulations for retrieving additional relevant

documents

Query reformulation: Expanding original query with new terms Reweighting the terms in expanded query

8

Page 9: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Three approaches

Approaches based on feedback from users

(relevance feedback)

Approaches based on information derived from

set of initially retrieved documents (local set of

documents)

Approaches based on global information

derived from document collection

9

Page 10: Chapter 5: Query Operations Hassan Bashiri April 2009 1

User relevance feedback

Most popular query reformulation strategy

Cycle: User presented with list of retrieved documents User marks those which are relevant

In practice: top 10-20 ranked documents are examined Incremental

Select important terms from documents assessed relevant by users Enhance importance of these terms in a new query

Expected: New query moves towards relevant documents and away from non-

relevant documents For Instance

Q1:US Open Q2:US Open Robocup

10

Page 11: Chapter 5: Query Operations Hassan Bashiri April 2009 1

User relevance feedback

Two basic techniques Query expansion

Add new terms from relevant documents

Term reweighting

Modify term weights based on user relevance judgements

11

Page 12: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion and Term Reweighting for the Vector Model

basic idea Relevant documents resemble each other Non-relevant documents have term-weight

vectors which are dissimilar from the ones for the relevant documents

The reformulated query is moved to closer to the term-weight vector space of relevant documents

12

Page 13: Chapter 5: Query Operations Hassan Bashiri April 2009 1

13

Page 14: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion and Term Reweighting for the Vector Model (Continued)

Cr: set of relevant documents

set of non-relevant documents

Dr: set of relevant documents, as identified by the user Dn: set of non-relevant documents

the retrieveddocuments collection

14

Page 15: Chapter 5: Query Operations Hassan Bashiri April 2009 1

User relevance feedback: Vector Space Model

rj rjCd Cd

j

r

j

ropt d

CNd

Cq

||

1

||

1

rD

nD

rC

| |,| |,| |r n rD D C

, ,

: set of relevant documents, as identified by the user, among the retrieved documents;

: set of non-relevant documents among the retrieved documents;

: set of relevant documents among all documents in the collection;

: number of documents in the sets , respectively;

: tuning constants.

15

Page 16: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Standard-Rochio

Ide-Regular

Ide-Dec-Hi

, , : tuning constants (usually, >) =1 (Rochio, 1971) ===1 (Ide, 1971) =0: positive feedback

Dnd

jnDrd

jr

mjj

dD

dD

qq||||

Dnd

jDrd

jmjj

ddqq

)(max jrelevantnonDrd

jm ddqqj

the highest ranked non-relevant document

Similar performance

Calculate the modified query

16

Page 17: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Analysis

advantages simplicity good results

disadvantages No optimality criterion is adopted

17

Page 18: Chapter 5: Query Operations Hassan Bashiri April 2009 1

User relevance feedback: Probabilistic Model

The similarity of a document dj to a query q

))|(

))|(1(log

))|(1(

)|((log),(

1,, RkP

RkP

RkP

RkPwwqdsim

i

i

i

it

ijiqij

)|( RkP i : the probability of observing the term ki in the set R of relevant documents

)|( RkP i : the probability of observing the term ki in the set R of non-relevant documents

5.0)|( RkP iInitial search:N

nRkP i

i )|(

18

Page 19: Chapter 5: Query Operations Hassan Bashiri April 2009 1

i

it

ijiqi

i

it

ijiqi

i

i

i

it

ijiqij

n

nNww

NnNn

ww

RkP

RkP

RkP

RkPwwqdsim

log

))1(

log)5.01(

5.0(log

))|(

))|(1(log

))|(1(

)|((log),(

1,,

1,,

1,,

Feedback search:

||

||)|( ,

r

irii D

D

V

VRkP

||

||)|( ,

r

iriiii DN

Dn

VN

VnRkP

User relevance feedback: Probabilistic Model

19

Page 20: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Feedback search:||

||)|( ,

r

iri D

DRkP

||

||)|( ,

r

irii DN

DnRkP

||

|)|(||

||||

||log

||

||||

||1

||

||1

||

||

log

))|(

))|(1(log

))|(1(

)|((log),(

,

,

,

,

1,,

,

,

,

,

1,,

1,,

iri

irir

irr

irt

ijiqi

r

iri

r

iri

r

ir

r

irt

ijiqi

i

i

i

it

ijiqij

Dn

DnDN

DD

Dww

DN

DnDN

Dn

D

DD

D

ww

RkP

RkP

RkP

RkPwwqdsim

No query expansion occurs

User relevance feedback: Probabilistic Model

20

Page 21: Chapter 5: Query Operations Hassan Bashiri April 2009 1

1||

5.0||

1

5.0)|(

1||

5.0||

1

5.0)|(

,

,

r

iriiii

r

irii

DN

Dn

VN

VnRkP

D

D

V

VRkP

For small values of |Dr| and |Dr,i| (i.e., |Dr|=1, |Dr,i|=0)

Alternative 1:

Alternative 2:

1||

||)|(

1||

||)|(

,

,

r

iiri

i

r

iir

i

DNNn

DnRkP

DNn

DRkP

User relevance feedback: Probabilistic Model

21

Page 22: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Analysis

advantages Feedback process is directly related to the derivation of

new weights for query terms The term reweighting is optimal

disadvantages Document term weights are not considered No query expansion is used

22

Page 23: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion

Query Expansion

Global

Local

Similarity Thesaurus

Statistical Thesaurus

Context Analysis

Clustering

Scalar Clustering

Metric Clustering

Association Clustering

23

Page 24: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Automatic Local Analysis

user relevance feedback Known relevant documents contain terms which can be

used to describe a larger cluster of relevant documents with assistance from the user (clustering)

automatic analysis Obtain a description (i.t.o terms) for a larger cluster of

relevant documents automatically global strategy: global thesaurus-like structure is

trained from all documents before querying local strategy: terms from the documents retrieved for a

given query are selected at query time24

Page 25: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion based on a Similarity Thesaurus

Query expansion is done in three steps as follows:

Represent the query in the concept space used for

representation of the index terms

2 Based on the global similarity thesaurus, compute a

similarity sim(q,kv) between each term kv correlated to the

query terms and the whole query q.

3 Expand the query with the top r ranked terms according to

sim(q,kv)25

Page 26: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion – Step 1

To the query q is associated a vector q in the term-concept space given by

where wi,q is a weight associated to the index-query pair[ki,q]

iqk

qi kwi

,q

26

Page 27: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion – Step 2

Compute a similarity sim(q,kv) between each term kv and the user query q

where cu,v is the correlation factor

qk

vu,qu,vv

u

cwkq)ksim(q,

27

Page 28: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion – Step 3

Add the top r ranked terms according to sim(q,kv) to the original query q to form the expanded query q’To each expansion term kv in the query q’ is assigned a weight wv,q’ given by

The expanded query q’ is then used to retrieve new documents to the user

qkqu,

vq'v,

u

w

)ksim(q,w

28

Page 29: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion - Sample

Doc1 = D, D, A, B, C, A, B, CDoc2 = E, C, E, A, A, DDoc3 = D, C, B, B, D, A, B, C, ADoc4 = A

c(A,A) = 10.991c(A,C) = 10.781c(A,D) = 10.781...c(D,E) = 10.398c(B,E) = 10.396c(E,E) = 10.224

29

Page 30: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion - Sample

Query: q = A E Esim(q,A) = 24.298sim(q,C) = 23.833sim(q,D) = 23.833sim(q,B) = 23.830sim(q,E) = 23.435

New query: q’ = A C D E Ew(A,q')= 6.88w(C,q')= 6.75w(D,q')= 6.75w(E,q')= 6.64

30

Page 31: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Query Expansion

Methods of local analysis extract

information from local set of documents

retrieved to expand the query

An alternative is to expand the query

using information from the whole set of

documents

31

Page 32: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Local Cluster

stem V(s): a non-empty subset of words which are

grammatical variants of each othere.g., {polish, polishing, polished}

A canonical form s of V(s) is called a steme.g., polish

local document set Dl the set of documents retrieved for a given query

local vocabulary Vl (Sl) the set of all distinct words (stems) in the local

document set

32

Page 33: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Local Cluster

basic concept Expanding the query with terms correlated to the

query terms The correlated terms are presented in the local

clusters built from the local document set

local clusters association clusters: co-occurrences of pairs of

terms in documents metric clusters: distance factor between two terms scalar clusters: terms with similar neighborhoods

have some synonymity relationship33

Page 34: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Association Clusters

idea Based on the co-occurrence of stems (or terms) inside

documents

association matrix fsi,j: the frequency of a stem si in a document dj (Dl)

m=(fsi,j): an association matrix with |Sl| rows and |Dl| columns

: a local stem-stem association matrixtmms

34

Page 35: Chapter 5: Query Operations Hassan Bashiri April 2009 1

jsv

Dldj

jsuvu ffc ,,,

vuvvuu

vuvu ccc

cs

,,,

,,

: a correlation between the stems su and sv

: normalized matrix

su,v=cu,v: unnormalized matrix

:)(nsu local association cluster around the stem su

Take u-th rowReturn the set of n largest values su,v (uv)

an element in tmm

35

Page 36: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Metric Clusters

idea Consider the distance between two terms in the

computation of their correlation factor

local stem-stem metric correlation matrix r(ki,kj): the number of words between keywords ki and

kj in a same document

cu,v: metric correlation between stems su and sv

)()(

, ),(

1

svVkj jisuVki

vu kkrc

),(1

),(ji

ji kkrkkr

36

Page 37: Chapter 5: Query Operations Hassan Bashiri April 2009 1

|)(||)(|,

,vu

vuvu sVsV

cs

: normalized matrix

su,v=cu,v: unnormalized matrix

:)(nsu local metric cluster around the stem su

Take u-th rowReturn the set of n largest values su,v (uv)

37

Page 38: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Scalar Clusters

idea Two stems with similar neighborhoods have synonymity

relationship The relationship is indirect or induced by the

neighborhood

scalar association matrix

The row corresponding to a specific termin a term co-occurrence matrix forms its neighborhood

:)(nsu local scalar cluster around the stem su

Take u-th rowReturn the set of n largest values su,v (uv)

||||,

vu

vuvu

ss

sss

38

Page 39: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Interactive Search Formulation

neighbors of the query term sv

Terms su belonging to clusters associated to sv, i.e., suSv(n)

su is called a searchonym of sv

x

Su

Sv

Sv(n)

x

xx

x

x

x

x

x

x

x

x

x

x

xx x

x

x

x39

Page 40: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Similarity Thesaurus

The similarity thesaurus is based on term to term relationships rather than on a matrix of co-occurrence.

This relationship are not derived directly from co-occurrence of terms inside documents.

They are obtained by considering that the terms are concepts in a concept space.

In this concept space, each term is indexed by the documents in which it appears.

Terms assume the original role of documents while documents are interpreted as indexing elements

40

Page 41: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Similarity Thesaurus

Inverse term frequency for document dj

t: number of terms in the collection N: number of documents in the collection fi,j: frequency of occurrence of the term ki in the document dj tj: vocabulary of document dj itfj: inverse term frequency for document dj

To ki is associated a vector

jj t

titf log

),....,,(k ,2,1,i Niii www

41

Page 42: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Similarity Thesaurus

where wi,j is a weight associated to index-document pair[ki,dj]. These weights are computed as follows

N

lj

lil

li

jjij

ji

ji

itff

f

itff

f

w

1

22

,

,

,

,

,

))(max

5.05.0(

))(max

5.05.0(

42

Page 43: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Similarity Thesaurus

The relationship between two terms ku and kv is computed as a correlation factor cu,v given by

The global similarity thesaurus is built through the computation of correlation factor Cu,v for each pair of indexing terms [ku,kv] in the collection

jd

jv,ju,vuvu, wwkkc

43

Page 44: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Represent the query in the concept space used for representation of the index terms

Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q

qk

iqii

kwq

,

qk

vuquvv

u

cwkqkqsim ,,,

query term expand term

44

Page 45: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Expand the query with the top r ranked terms according to sim(q,kv)

quk qu

vwkqsim

qvw,

,,

45

Page 46: Chapter 5: Query Operations Hassan Bashiri April 2009 1

Similarity Thesaurus

This computation is expensiveGlobal similarity thesaurus has to be computed only once and can be updated incrementally

46