Query Reformulation: User Relevance Feedback

Introduction

Difficulty of formulating user queries
– Users have insufficient knowledge of the collection make-up
– Users have insufficient knowledge of the retrieval environment

Query reformulation to improve the user query: two basic methods
– query expansion
  • Expanding the original query with new terms
– term reweighting
  • Reweighting the terms in the expanded query

Introduction

Approaches for query reformulation
– user relevance feedback
  • based on feedback information from the user
– local analysis
  • based on information derived from the set of documents initially retrieved (local set)
– global analysis
  • based on global information derived from the document collection

User Relevance Feedback

User’s role in URF cycle
– is presented with a list of the retrieved documents
– marks relevant documents

Main idea of URF
– selecting important terms, or expressions, attached to the documents that have been identified as relevant by the user
– enhancing the importance of these terms in the new query formulation
– effect: the new query will be moved towards the relevant documents and away from the non-relevant ones

User Relevance Feedback

Advantages of URF
– it shields the user from the details of the query reformulation process
  • users only have to provide a relevance judgment on documents
– it breaks down the whole searching task into a sequence of small steps which are easier to grasp
– it provides a controlled process designed to emphasize relevant terms and de-emphasize non-relevant terms

URF for Vector Model

Assumptions
– the term-weight vectors of the documents identified as relevant to the query have similarities among themselves.
– non-relevant documents have term-weight vectors which are dissimilar from the ones for the relevant documents.

Basic idea
– reformulate the query such that it gets closer to the term-weight vector space of the relevant documents

The Perfect (Vector Model) Query

Assume we know which documents are relevant and which are not.

Given:
– a collection of N documents
– Cr : the set of relevant documents

What is the optimal query?
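The formula answering this question did not survive in this section. A reconstruction in the usual textbook form, using N and Cr as defined above and assuming d_j denotes a document's term-weight vector (the vector notation is an assumption, not taken from the slide):

```latex
\vec{q}_{opt} \;=\; \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
\;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j
```

The first term is the centroid of the relevant documents and the second is the centroid of the remaining N − |Cr| non-relevant documents, so the optimal query points from the non-relevant centroid towards the relevant one.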

Back to Reality

Actually, what we are trying to figure out is which documents are relevant and which are not.

Our ideal query & definitions:
– a collection of N documents
– Cr : the set of relevant documents
– Dr : the set of retrieved documents the user identified as relevant
– Dn : the set of retrieved documents identified as non-relevant
– α, β, γ : tuning constants

Modified Query (Rocchio)

Rocchio & Ide Variations

– Standard Rocchio
– Ide (Regular)
– Ide (Dec_Hi)

where max_non-relevant(dj): the highest ranked non-relevant document
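The formulas for the three variants are missing here. In their commonly cited form, with q the original query vector, q_m the modified query (both symbols are assumed notation), and Dr, Dn, α, β, γ as defined earlier, they can be written as:

```latex
\text{Standard Rocchio:}\quad
\vec{q}_m = \alpha\,\vec{q}
  + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
  - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j

\text{Ide (Regular):}\quad
\vec{q}_m = \alpha\,\vec{q}
  + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j
  - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j

\text{Ide (Dec\_Hi):}\quad
\vec{q}_m = \alpha\,\vec{q}
  + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j
  - \gamma \max\nolimits_{\text{non-relevant}}(\vec{d}_j)
```

The only difference between Standard Rocchio and Ide (Regular) is the normalization by the set sizes |Dr| and |Dn|. Ide (Dec_Hi) replaces the sum over all non-relevant documents with the single highest-ranked non-relevant one.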

Tuning the Feedback

How do we set the tuning constants α, β, γ in the modified query?
– Rocchio originally set α = 1
– Ide originally set α = β = γ = 1

Often, positive relevance feedback is more valuable than negative relevance feedback.
– this implies: β > γ
– purely positive feedback mechanism: γ = 0
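The effect of these constants can be seen in a minimal sketch of the standard Rocchio update. The function names `centroid` and `rocchio`, the toy three-term vectors, and the values β = 0.5, γ = 0.25 are illustrative assumptions, not values from the slides.

```python
def centroid(docs, dim):
    """Mean of a list of term-weight vectors; zero vector if the list is empty."""
    if not docs:
        return [0.0] * dim
    return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Standard Rocchio: alpha*q + beta*centroid(Dr) - gamma*centroid(Dn)."""
    dim = len(query)
    pos = centroid(relevant, dim)
    neg = centroid(nonrelevant, dim)
    return [alpha * query[i] + beta * pos[i] - gamma * neg[i] for i in range(dim)]


# Toy vocabulary of three terms (hypothetical data).
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # Dr: marked relevant by the user
nonrelevant = [[0.0, 0.0, 1.0]]                # Dn: retrieved but not relevant

# beta > gamma: positive feedback counts more than negative feedback.
balanced = rocchio(query, relevant, nonrelevant, beta=0.5, gamma=0.25)

# gamma = 0: purely positive feedback, non-relevant documents are ignored.
positive_only = rocchio(query, relevant, nonrelevant, beta=0.5, gamma=0.0)

print(balanced)       # [1.5, 0.25, -0.25]
print(positive_only)  # [1.5, 0.25, 0.0]
```

In both results the second term enters the query with a non-zero weight (query expansion) and the first term's weight grows (term reweighting). With γ > 0 the third term receives a negative weight; many implementations clip such weights to zero, which this sketch does not do.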

URF for Vector Model

Includes both query expansion and term reweighting

Advantages
– simplicity
  • modified term weights are computed directly from the set of retrieved documents
– good results
  • modified query vector does reflect a portion of the intended query semantics

Issue: As with all learning techniques, this assumes the information need is relatively static.

Evaluation of Relevance Feedback Strategies

Simplistic evaluation: compare the results of the modified query to those of the original query.
– Does not work!
– Results are great, but mostly due to the higher ranking of documents returned by the original query.
– User has already seen these documents.

Evaluation of Relevance Feedback Strategies

More realistic evaluation
– Compute precision and recall on the residual collection (those documents not returned by the original query)
– Because highly-ranked documents are removed, these results can be worse than for the original query.
– That is okay if we are comparing between relevance feedback approaches.
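The difference between the two evaluation styles can be sketched as follows. The document identifiers, the cutoff k = 3, and the helper names `precision_recall` and `residual_eval` are hypothetical; `seen` stands for the documents returned by the original query.

```python
def precision_recall(ranking, relevant, k):
    """Precision and recall over the top-k documents of a ranking."""
    top = ranking[:k]
    hits = sum(1 for d in top if d in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def residual_eval(feedback_ranking, relevant, seen, k):
    """Evaluate on the residual collection: drop documents already shown to the user."""
    residual_ranking = [d for d in feedback_ranking if d not in seen]
    residual_relevant = relevant - seen
    return precision_recall(residual_ranking, residual_relevant, k)


relevant = {"d1", "d2", "d5", "d7"}
seen = {"d1", "d2", "d3"}  # returned by the original query
feedback_ranking = ["d1", "d2", "d5", "d3", "d4", "d7", "d6"]

# Simplistic evaluation: rewarded for re-ranking documents the user already saw.
naive = precision_recall(feedback_ranking, relevant, k=3)

# Residual evaluation: only newly surfaced documents count.
residual = residual_eval(feedback_ranking, relevant, seen, k=3)

print(naive)     # (1.0, 0.75)
print(residual)  # precision 2/3, recall 1.0
```

The naive precision of 1.0 comes largely from d1 and d2, which the user has already judged. The residual figures are not comparable to the original query's scores, but they can be compared across feedback strategies evaluated the same way.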
