comp3410 db32: technologies for knowledge management lecture 7: query broadening to improve ir by...

COMP3410 DB32:Technologies for Knowledge Management

Lecture 7:

Query Broadening to improve IR

By Eric Atwell, School of Computing, University of Leeds

(including re-use of teaching resources from other sources, esp. Stuart Roberts, School of Computing, Univ of Leeds)

Module Objectives

“On completion of this module, students should be able to:

… describe classical and emerging information retrieval techniques, and their relevance to knowledge management; …”

Today’s objectives

• first we look at a method for query broadening that requires input from the user

• then we look at an automatic method for query broadening using a thesaurus

• by the end of the lecture you should understand what a thesaurus, a terminology bank and an ontology are, and how they are used to broaden queries

Some issues to be resolved

• Synonyms
– football / soccer, tap / faucet: search for one, find both?

• Homonyms
– lead (metal or leash?), tap: find both, only want one?

• Local/global contexts determine “good” terms
– football articles won’t mention the word ‘football’, and will have a particular meaning for the word ‘goal’

• Precoordination (proximity query): multi-word terms
– “Venetian blind” vs “blind Venetian”

Evaluation/Effectiveness measures

• effort - required by the users in formulation of queries

• time - between receipt of user query and production of list of ‘hits’

• presentation - of the output

• coverage - of the collection

• recall - the fraction of relevant items retrieved

• precision - the fraction of retrieved items that are relevant

• user satisfaction – with the retrieved items
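As a minimal sketch of the recall and precision measures above, the function below computes both for a single query. The document ids and the relevance judgements are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # items that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgement: 4 documents retrieved, 3 relevant documents in the collection
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d7"})
print(p, r)
```

Here two of the four retrieved items are relevant, and two of the three relevant items were found, so the two measures can differ for the same result list.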

Better hits: Query Broadening

• A user unaware of collection characteristics is likely to formulate a ‘naïve’ query

• query broadening aims to replace the initial query with a new one featuring one or other of:
– new index terms
– adjusted term weights

• One method uses feedback information from the user

• Another method uses a thesaurus / term-bank / ontology

Relevance Feedback

From the response to the initial query, gather relevance information (H = set of retrieved hits, R = set of relevant documents):

$H_R = R \cap H$ = set of retrieved, relevant hits

$H_{NR} = H - R$ = set of retrieved, non-relevant hits

Replace query $q$ with replacement query $q'$:

$$q' = \alpha q + \beta \sum_{d_i \in H_R} \frac{d_i}{|H_R|} - \gamma \sum_{d_i \in H_{NR}} \frac{d_i}{|H_{NR}|}$$

Note: this moves the query vector closer to the centroid of the “relevant retrieved” document vectors and further from the centroid of the “non-relevant retrieved” documents.
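The feedback update can be sketched in a few lines of Python. This is an illustration only: the function name `rocchio`, the parameter names `alpha`, `beta`, `gamma` and the two-term vectors in the usage example are assumptions, not part of the lecture material. Vectors are plain lists of term weights.

```python
def rocchio(q, relevant, nonrelevant, alpha=0.5, beta=0.5, gamma=0.2):
    """Return the replacement query q' as a list of term weights.

    q           -- original query vector
    relevant    -- retrieved document vectors judged relevant
    nonrelevant -- retrieved document vectors judged non-relevant
    """
    def centroid(docs):
        # Mean vector of a set of documents; zero vector if the set is empty
        if not docs:
            return [0.0] * len(q)
        return [sum(column) / len(docs) for column in zip(*docs)]

    c_rel = centroid(relevant)
    c_non = centroid(nonrelevant)
    return [alpha * qi + beta * ri - gamma * ni
            for qi, ri, ni in zip(q, c_rel, c_non)]


# Hypothetical two-term example: two relevant hits, one non-relevant hit
q_new = rocchio([1.0, 0.0], [[0.5, 1.0], [0.5, 0.0]], [[1.0, 1.0]])
print(q_new)
```

The second term, absent from the original query, gains a small positive weight because it is more prominent in the relevant centroid than the non-relevant hit can cancel out.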

Using terms from relevant documents

• We expect documents that are similar to one another in meaning (or usefulness) to have similar index terms.

• The system creates a replacement query (q’) based on q, but adds index terms that have been used to index known relevant documents, increases the relative weight of index terms in q that are also found in relevant documents, and reduces the weight of terms found in non-relevant documents.

How does this help?

• It could help if documents were being missed because of the synonym problem. The user uses the word ‘jam’, but some recipes use ‘jelly’ instead. Once a hit that uses ‘jelly’ has been recognized as relevant, ‘jelly’ will appear in the next version of the query. Hits may now use ‘jelly’ but not ‘jam’.

• Conversely, it can help with the homonym problem. If the user wants references to ‘lead’ (the metal) and gets documents relating to dog-walking, then marking the dog-walking references as not relevant reduces the weight of key words associated with dog-walking.

Pros and cons of feedback

• If γ (the weight on non-relevant hits) is set to 0, non-relevant hits are ignored: a positive feedback system, which is often preferred

• the feedback formula can be applied repeatedly, asking the user for relevance information at each iteration

• relevance feedback is generally considered to be very effective for “high-use” systems

• one drawback is that it is not fully automatic.
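A rough sketch of repeated positive feedback (non-relevant hits ignored), using the ‘jam’ / ‘jelly’ synonym case. The term list, the weights and the two rounds of user judgements are invented for illustration; only the idea of re-applying the update at each iteration comes from the points above.

```python
def positive_feedback(q, relevant, alpha=0.5, beta=0.5):
    """One feedback round that uses only the hits marked relevant."""
    if not relevant:
        return list(q)
    n = len(relevant)
    centroid = [sum(column) / n for column in zip(*relevant)]
    return [alpha * qi + beta * ci for qi, ci in zip(q, centroid)]


terms = ["jam", "jelly", "recipe"]       # hypothetical index terms
q0 = [1.0, 0.0, 0.5]                     # the user only thinks of 'jam'

# Round 1: the user marks as relevant a hit that uses both words
q1 = positive_feedback(q0, [[0.6, 0.7, 0.8]])

# Round 2: the broadened query now reaches a hit that uses 'jelly' only
q2 = positive_feedback(q1, [[0.0, 0.9, 0.7]])

for name, vector in (("q0", q0), ("q1", q1), ("q2", q2)):
    print(name, dict(zip(terms, (round(w, 3) for w in vector))))
```

The weight on ‘jelly’ starts at zero and grows with each round, which is how documents that never mention ‘jam’ become retrievable.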

Simple feedback example:

T = {pudding, jam, traffic, lane, treacle}

d1 = (0.8, 0.8, 0.0, 0.0, 0.4)   Recipe for jam pudding

d2 = (0.0, 0.0, 0.9, 0.8, 0.0)   DoT report on traffic lanes

d3 = (0.8, 0.0, 0.0, 0.0, 0.8)   Recipe for treacle pudding

d4 = (0.6, 0.9, 0.5, 0.6, 0.0)   Radio item on traffic jam in Pudding Lane

Display first 2 documents that match the following query:

q = (1.0, 0.6, 0.0, 0.0, 0.0)

r = (0.91, 0.0, 0.6, 0.73)   (match scores of q against d1, d2, d3, d4)

Retrieved documents are:

d1 : Recipe for jam pudding (relevant)

d4 : Radio item on traffic jam (not relevant)
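The ranking in this example can be checked with a short script. The slides do not name the matching function; cosine similarity is assumed here because it gives scores close to the r values quoted in the example (0.91, 0.0, 0.6, 0.73).

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Document vectors over T = {pudding, jam, traffic, lane, treacle}
docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.4),
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),
    "d3": (0.8, 0.0, 0.0, 0.0, 0.8),
    "d4": (0.6, 0.9, 0.5, 0.6, 0.0),
}
q = (1.0, 0.6, 0.0, 0.0, 0.0)

scores = {name: cosine(q, d) for name, d in docs.items()}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
print(top2)
```

Under this assumption the two best matches are d1 and d4, the same pair shown to the user in the example.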

Positive and Negative Feedback

Suppose we set α and β to 0.5, and γ to 0.2:

$$q' = \alpha q + \beta \sum_{d_i \in H_R} \frac{d_i}{|H_R|} - \gamma \sum_{d_i \in H_{NR}} \frac{d_i}{|H_{NR}|}$$

$$= 0.5\,q + 0.5\,d_1 - 0.2\,d_4$$

$$= 0.5\,(1.0, 0.6, 0.0, 0.0, 0.0) + 0.5\,(0.8, 0.8, 0.0, 0.0, 0.4) - 0.2\,(0.6, 0.9, 0.5, 0.6, 0.0)$$

$$= (0.78, 0.52, -0.1, -0.12, 0.2)$$

(Note $|H_R| = 1$ and $|H_{NR}| = 1$)

Simple feedback example:

T = {pudding, jam, traffic, lane, treacle}

d1 = (0.8, 0.8, 0.0, 0.0, 0.4),

d2 = (0.0, 0.0, 0.9, 0.8, 0.0),

d3 = (0.8, 0.0, 0.0, 0.0, 0.8)

d4 = (0.6, 0.9, 0.5, 0.6, 0.0)

Display first 2 documents that match the following query:

q’ = (0.78, 0.52, -0.1, -0.12, 0.2)

r’ = (0.96, -0.16, 0.71, 0.63)

Retrieved documents are:

d1 : Recipe for jam pudding (relevant)

d3 : Recipe for treacle pudding (relevant)
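The whole feedback round can be replayed in code. This is a sketch under two assumptions that the slides do not state: matching is by cosine similarity, and negative term weights in the replacement query are kept rather than clipped to zero. The weights 0.5, 0.5 and 0.2 are those of the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.4),   # recipe for jam pudding
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),   # DoT report on traffic lanes
    "d3": (0.8, 0.0, 0.0, 0.0, 0.8),   # recipe for treacle pudding
    "d4": (0.6, 0.9, 0.5, 0.6, 0.0),   # radio item on traffic jam
}
q = (1.0, 0.6, 0.0, 0.0, 0.0)

# User feedback on the first result list: d1 relevant, d4 not relevant
q_new = [0.5 * qi + 0.5 * rel - 0.2 * non
         for qi, rel, non in zip(q, docs["d1"], docs["d4"])]

scores = {name: cosine(q_new, d) for name, d in docs.items()}
top2 = sorted(scores, key=scores.get, reverse=True)[:2]
print([round(w, 2) for w in q_new])
print(top2)
```

With the replacement query, the treacle pudding recipe overtakes the traffic report, so both displayed documents are now recipes.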

Thesaurus

• a thesaurus or ontology may contain
– controlled vocabulary of terms or phrases describing a specific restricted topic,
– synonym classes,
– hierarchy defining broader terms (hypernyms) and narrower terms (hyponyms)
– classes of ‘related’ terms.

• a thesaurus or ontology may be:
– generic (as Roget’s thesaurus, or WordNet)
– specific to a certain domain of knowledge, eg medical

Language normalisation

By replacing words from documents and query words with synonyms from a controlled language, we can improve precision and recall:

[Figure: content analysis yields uncontrolled keywords, which the thesaurus maps to index terms; the thesaurus likewise maps the user query to a normalised query, which is matched against the index terms]
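A minimal sketch of language normalisation, assuming the thesaurus is available as a simple mapping from each synonym to one preferred term. The dictionary `THESAURUS` and the sample document and query are hypothetical; a real thesaurus would also need to distinguish word senses.

```python
# Hypothetical controlled vocabulary: each synonym maps to one preferred term
THESAURUS = {
    "data processor": "computer",
    "electronic computer": "computer",
    "information processing system": "computer",
    "soccer": "football",
    "faucet": "tap",
}


def normalise(terms):
    """Replace each term by its preferred form; unknown terms pass through unchanged."""
    return {THESAURUS.get(t.lower(), t.lower()) for t in terms}


doc_index = normalise(["Data processor", "faucet"])   # index terms for one document
query = normalise(["electronic computer"])            # normalised user query
print(query & doc_index)
```

The raw query and document share no words, yet after normalisation both contain the same preferred term and therefore match.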

Thesaurus / Ontology construction

• Include terms likely to be of value in content analysis

• for each term, form classes of related words (separate classes for synonyms, hypernyms, hyponyms)

• form separate classes for each relevant meaning of the word

• terms in a class should occur with roughly equal frequency (not easy: natural-language word frequencies follow Zipf’s law)

• avoid high-frequency terms

• it involves some expert judgment that will not be easy to automate.

Example thesaurus

A public-domain thesaurus (WORDNET) is available from:

http://www.cogsci.princeton.edu/~wn/

/home/cserv1_a/staff/nlplib/WordNet/2.0

/home/cserv1_a/staff/extras/nltk/1.4.2/corpora/wordnet

Synonyms of “computer” (sense 1): data processor, electronic computer, information processing system

Synonyms of “computer” (sense 2): calculator, reckoner, figurer, estimator

Terminology (from WordNet Help)

Hypernym is the generic term used to designate a whole class of specific instances. Y is a hypernym of X if X is a (kind of) Y.

Hyponym is the generic term used to designate a member of a class. X is a hyponym of Y if X is a (kind of) Y.

Coordinate words are words that have the same hypernym.

Hypernym synsets are preceded by "->", and hyponym synsets are preceded by "=>".

Hypernyms

Sense 1
computer, data processor, electronic computer, information processing system
   -> machine
      -> device
         -> instrumentality, instrumentation
            -> artifact, artefact
               -> object, physical object
                  -> entity, something

Hyponyms

Sense 1
computer, data processor, electronic computer, information processing system
   => analog computer, analogue computer
   => digital computer
   => node, client, guest
   => number cruncher
   => pari-mutuel machine, totalizer, totaliser, totalizator, totalisator
   => server, host

Coordinate terms

Sense 1
computer, data processor, electronic computer, information processing system
   -> machine
      => assembly
      => calculator, calculating machine
      => calendar
      => cash machine, cash dispenser, automated teller machine, automatic teller machine, automated teller, automatic teller, ATM
      => computer, data processor, electronic computer, information processing system
      => concrete mixer, cement mixer
      => corker
      => cotton gin, gin
      => decoder
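The three relations can be illustrated on a toy fragment of this hierarchy. The sketch below stores a single child-to-parent table and derives hypernym chains, hyponyms and coordinate terms from it. The table `HYPERNYM` is a hand-copied, simplified subset (one name per synset, one sense per word), not WordNet's own data structure or API.

```python
# Toy fragment of the hierarchy: child -> parent (its hypernym)
HYPERNYM = {
    "computer": "machine",
    "calculator": "machine",
    "cash machine": "machine",
    "decoder": "machine",
    "machine": "device",
    "device": "instrumentality",
    "digital computer": "computer",
    "analog computer": "computer",
    "server": "computer",
}


def hypernym_chain(term):
    """Follow parent links upwards: each step is a broader term."""
    chain = []
    while term in HYPERNYM:
        term = HYPERNYM[term]
        chain.append(term)
    return chain


def hyponyms(term):
    """Direct children of a term: narrower terms."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == term)


def coordinates(term):
    """Terms sharing the same hypernym (excluding the term itself)."""
    parent = HYPERNYM.get(term)
    if parent is None:
        return []
    return [t for t in hyponyms(parent) if t != term]


print(hypernym_chain("computer"))
print(hyponyms("computer"))
print(coordinates("computer"))
```

Unlike the WordNet listing, this sketch leaves the term itself out of its own coordinate list; that is a design choice of the example.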

Thesaurus use

• replace term in document and/or query with term in controlled language

• replace term in query with related or broader term to increase recall

• suggest to user narrower terms to increase precision

[Figure (S): the thesaurus maps both Doc: <data processor> and Query: <electronic computer> to computer (sense 1), so they match]

Thesaurus use

[Figure (B): Query: <node (sense 6)> and Query: <computer (sense 1)>, linked through the thesaurus; each query is matched against the whole collection]

Thesaurus use

[Figure (N): Query: <computer (sense 1)>, the thesaurus, the User, and Query: client; each query is matched against the whole collection]
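Broadening and narrowing a query with a thesaurus might look like the sketch below. The `BROADER` table, the four-document `COLLECTION` and the exact-term `search` function are all hypothetical, kept deliberately small so the effect on the result list is easy to see.

```python
# Hypothetical thesaurus fragment: term -> broader term
BROADER = {"node": "computer", "client": "computer", "server": "computer"}

# Hypothetical collection: document id -> set of index terms
COLLECTION = {
    "doc1": {"computer", "network"},
    "doc2": {"client", "server"},
    "doc3": {"node", "cluster"},
    "doc4": {"garden", "tap"},
}


def search(terms):
    """Return ids of documents sharing at least one index term with the query."""
    wanted = set(terms)
    return sorted(doc for doc, index in COLLECTION.items() if index & wanted)


def broaden(term):
    """Add the broader term, if the thesaurus has one, to increase recall."""
    return {term, BROADER[term]} if term in BROADER else {term}


def narrower(term):
    """List narrower terms that could be suggested to the user to increase precision."""
    return sorted(t for t, broad in BROADER.items() if broad == term)


print(search({"node"}))            # original query
print(search(broaden("node")))     # broadened query
print(narrower("computer"))        # suggestions offered to the user
```

Broadening is automatic here, whereas narrowing only produces suggestions: choosing among them is left to the user, which matches the need for some human intervention.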

Key points

• a thesaurus or ontology can be used to normalise a vocabulary and queries (or documents?)

• it can be used (with some human intervention) to increase recall and precision

• generic thesaurus/ontology may not be effective in specialized collections and/or queries

• Semi-automatic construction of thesaurus/ontology based on the retrieved set of documents has produced some promising results.
