Hsin-Hsi Chen 1
Chapter 2 Modeling
Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Hsin-Hsi Chen 2
Indexing
Hsin-Hsi Chen 3
Indexing
• indexing: assign identifiers to text items
• assign: manual vs. automatic indexing
• identifiers:
  – objective vs. nonobjective text identifiers
    objective identifiers are defined by cataloging rules, e.g., author names, publisher names, dates of publication, …
  – controlled vs. uncontrolled vocabularies
    controlled vocabularies come from instruction manuals, terminological schedules, …
  – single terms vs. term phrases
Hsin-Hsi Chen 4
Two Issues
• Issue 1: indexing exhaustivity
  – exhaustive: assign a large number of terms
  – nonexhaustive: assign fewer terms
• Issue 2: term specificity
  – broad terms (generic): cannot distinguish relevant from nonrelevant items
  – narrow terms (specific): retrieve relatively fewer items, but most of them are relevant
Hsin-Hsi Chen 5
Parameters of retrieval effectiveness
• Recall

  R = (number of relevant items retrieved) / (total number of relevant items in the collection)

• Precision

  P = (number of relevant items retrieved) / (total number of items retrieved)

• Goal: high recall and high precision
Hsin-Hsi Chen 6
                     Nonrelevant items   Relevant items
  Retrieved part            b                  a
  Not retrieved             c                  d

  Precision = a / (a + b)        Recall = a / (a + d)
Hsin-Hsi Chen 7
A Joint Measure
• F-score:

  F = ((β² + 1) · P · R) / (β² · P + R)

• β is a parameter that encodes the relative importance of recall and precision:
  – β = 1: recall and precision are weighted equally
  – β < 1: precision is more important
  – β > 1: recall is more important
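The following minimal Python sketch (not part of the original slides; the document identifiers are invented) computes all three measures for one query:

```python
# Precision, recall, and F-score for a single query.
def precision_recall_f(retrieved, relevant, beta=1.0):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    if p + r == 0:
        return p, r, 0.0
    b2 = beta ** 2
    return p, r, (b2 + 1) * p * r / (b2 * p + r)

# 3 of 4 retrieved items are relevant; the collection holds 6 relevant items.
print(precision_recall_f(["d1", "d2", "d3", "d4"],
                         ["d1", "d2", "d3", "d5", "d6", "d7"]))
# -> (0.75, 0.5, 0.6)
```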
Hsin-Hsi Chen 8
Choices of Recall and Precision
• Both recall and precision vary from 0 to 1.
• In principle, the average user wants to achieve both high recall and high precision.
• In practice, a compromise must be reached because simultaneously optimizing recall and precision is not normally achievable.
Hsin-Hsi Chen 9
Choices of Recall and Precision (Continued)
• Particular choices of indexing and search policies have produced variations in performance ranging from 0.8 precision and 0.2 recall to 0.1 precision and 0.8 recall.
• In many circumstances, recall and precision values between 0.5 and 0.6 are more satisfactory for the average user.
Hsin-Hsi Chen 10
Term-Frequency Consideration
• Function words
  – for example, "and", "or", "of", "but", …
  – the frequencies of these words are high in all texts
• Content words
  – words that actually relate to document content
  – varying frequencies in the different texts of a collection
  – indicate term importance for content
Hsin-Hsi Chen 11
A Frequency-Based Indexing Method
• Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high frequency function words.
• Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di.
• Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T.
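A sketch of these three steps, with an illustrative stop list and threshold (both are assumptions, not values fixed by the slides):

```python
from collections import Counter

STOP_LIST = {"and", "or", "of", "but", "are", "for", "in", "the"}

def index_terms(text, T=1):
    terms = [w for w in text.lower().split() if w not in STOP_LIST]
    tf = Counter(terms)                                  # tf_ij per term
    return {term: f for term, f in tf.items() if f > T}  # keep tf_ij > T

print(index_terms("effective retrieval systems are essential for people "
                  "in need of information and retrieval of information"))
# -> {'retrieval': 2, 'information': 2}
```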
Hsin-Hsi Chen 12
Discussions
• high-frequency terms favor recall
• high precision requires the ability to distinguish individual documents from each other
• a high-frequency term is good for precision when its frequency is not equally high in all documents
Hsin-Hsi Chen 13
Inverse Document Frequency
• Inverse document frequency (IDF) for term Tj:

  idf_j = log(N / df_j)

  where df_j (the document frequency of term Tj) is the number of documents in which Tj occurs, and N is the total number of documents.
• Terms that fulfil both recall and precision occur frequently in individual documents but rarely in the remainder of the collection.
Hsin-Hsi Chen 14
New Term Importance Indicator
• weight wij of a term Tj in a document ti
• Eliminating common function words
• Computing the value of wij for each term Tj in each document Di
• Assigning to the documents of a collection all terms with sufficiently high (tf x idf) factors
w tfN
dfij ij
j
log
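A small illustration of this indicator on an invented three-document collection:

```python
# Computing w_ij = tf_ij * log(N / df_j) over a toy collection.
import math
from collections import Counter

docs = [doc.split() for doc in ("information retrieval systems",
                                "retrieval of legal documents",
                                "document indexing and retrieval")]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tf_idf(docs[0]))
# 'retrieval' occurs in all three documents, so its weight is log(3/3) = 0.
```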
Hsin-Hsi Chen 15
Term-discrimination Value
• Useful index terms distinguish the documents of a collection from each other.
• Document space
  – two documents have very similar term sets when the corresponding points in the document configuration appear close together
  – when a high-frequency term without discrimination power is assigned, it increases the document space density
Hsin-Hsi Chen 16
[Figure: a virtual document space in three states — the original state, after assignment of a good discriminator (documents move apart), and after assignment of a poor discriminator (documents crowd together)]
Hsin-Hsi Chen 17
Good Term Assignment
• When a term is assigned to the documents of a collection, the few items to which the term is assigned will be distinguished from the rest of the collection.
• This should increase the average distance between the items in the collection and hence produce a document space less dense than before.
Hsin-Hsi Chen 18
Poor Term Assignment
• A high frequency term is assigned that does not discriminate between the items of a collection.
• Its assignment will render the documents more similar to each other.
• This is reflected in an increase in document space density.
Hsin-Hsi Chen 19
Term Discrimination Value
• definitiondvj = Q - Qj
where Q and Qj are space densities before and after the assignments of term Tj.
• dvj>0, Tj is a good term; dvj<0, Tj is a poor term.
QN N
sim D Di kki k
N
i
N
1
1 11( )( , )
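A sketch of dv_j under the assumption that sim(D_i, D_k) is the cosine similarity (the slides leave the similarity function open); the weight vectors are invented:

```python
import math

def cos(x, y):
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return sum(a * b for a, b in zip(x, y)) / den if den else 0.0

def density(docs):
    N = len(docs)
    return sum(cos(di, dk) for i, di in enumerate(docs)
               for k, dk in enumerate(docs) if i != k) / (N * (N - 1))

def dv(docs, j):
    # Q: density before assigning term j (its weights zeroed out); Q_j: after.
    before = [[w if i != j else 0.0 for i, w in enumerate(d)] for d in docs]
    return density(before) - density(docs)

docs = [[1, 2, 0], [0, 2, 1], [1, 2, 1]]  # term 1 is equally frequent everywhere
print(round(dv(docs, 1), 3))              # negative: a poor discriminator
```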
Hsin-Hsi Chen 20
[Figure: variations of the term-discrimination value with document frequency. Low-frequency terms (dv_j ≈ 0) are candidates for thesaurus transformation; medium-frequency terms have dv_j > 0; high-frequency terms (document frequency approaching N, dv_j < 0) are candidates for phrase transformation.]
Hsin-Hsi Chen 21
Another Term Weighting
• wij = tfij × dvj
• compared with w_ij = tf_ij × log(N / df_j):
  – the idf factor log(N / df_j) decreases steadily with increasing document frequency
  – dv_j increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger
Hsin-Hsi Chen 22
Term Relationships in Indexing
• Single-term indexing
  – Single terms are often ambiguous.
  – Many single terms are either too specific or too broad to be useful.
• Complex text identifiers
  – subject experts and trained indexers
  – linguistic analysis algorithms, e.g., NP chunkers
  – term-grouping or term-clustering methods
Hsin-Hsi Chen 23
Term Classification (Clustering)
        T1    T2    T3   …   Tt
  D1    d11   d12   d13  …   d1t
  D2    d21   d22   d23  …   d2t
  …
  Dn    dn1   dn2   dn3  …   dnt
Hsin-Hsi Chen 24
Term Classification (Clustering)
• Column part: group terms whose corresponding column representations reveal similar assignments to the documents of the collection.
• Row part: group documents that exhibit sufficiently similar term assignments.
Hsin-Hsi Chen 25
Linguistic Methodologies
• Indexing phrases: nominal constructions including adjectives and nouns
  – Assign syntactic class indicators (i.e., parts of speech) to the words occurring in document texts.
  – Construct word phrases from sequences of words exhibiting certain allowed syntactic markers (noun-noun and adjective-noun sequences).
Hsin-Hsi Chen 26
Term-Phrase Formation
• Term phrase: a sequence of related text words that carries a more specific meaning than its single terms, e.g., "computer science" vs. "computer".

[Figure: the document-frequency diagram repeated — phrase transformation applies to the high-frequency region (dv_j < 0), thesaurus transformation to the low-frequency region (dv_j ≈ 0)]
Hsin-Hsi Chen 27
Simple Phrase-Formation Process
• the principal phrase component (phrase head):
  a term with a document frequency exceeding a stated threshold, or exhibiting a negative discriminator value
• the other components of the phrase:
  medium- or low-frequency terms with stated co-occurrence relationships with the phrase head
• common function words:
  not used in the phrase-formation process
Hsin-Hsi Chen 28
An Example
• Effective retrieval systems are essential for people in need of information.
  – "are", "for", "in", and "of": common function words
  – "systems", "people", and "information": phrase heads
Hsin-Hsi Chen 29
The Formatted Term-Phrases
Sentence: effective retrieval systems are essential for people in need of information

Phrase heads and components        Phrase heads and components
must be adjacent                   co-occur in the sentence
1. retrieval systems*              6. effective systems
2. systems essential               7. systems need
3. essential people                8. effective people
4. people need                     9. retrieval people
5. need information*               10. effective information*
                                   11. retrieval information*
                                   12. essential information*

*: phrases assumed to be useful for content identification
   (2/5 useful under adjacency; 5/12 under sentence co-occurrence)
Hsin-Hsi Chen 30
The Problems
• A phrase-formation process controlled only by word co-occurrences and the document frequencies of certain words is not likely to generate a large number of high-quality phrases.
• Additional syntactic criteria for phrase heads and phrase components may provide further control in phrase formation.
Hsin-Hsi Chen 31
Additional Term-Phrase Formation Steps
• Syntactic class indicators are assigned to the terms, and phrase formation is limited to sequences of specified syntactic markers, such as adjective-noun and noun-noun sequences (but not, e.g., adverb-adjective or adverb-noun sequences).
• The phrase elements are all chosen from within the same syntactic unit, such as the subject phrase, object phrase, or verb phrase.
Hsin-Hsi Chen 32
Consider Syntactic Unit
• effective retrieval systems are essential for people in need of information
• subject phrase– effective retrieval systems
• verb phrase– are essential
• object phrase– people in need of information
Hsin-Hsi Chen 33
Phrases within Syntactic Components
• Adjacent phrase heads and components within syntactic components
  – retrieval systems*
  – people need
  – need information*
• Phrase heads and components co-occur within syntactic components
  – effective systems

  [subj effective retrieval systems] [vp are essential] for [obj people need information]

  2/3 of the adjacent phrases are useful for content identification
Hsin-Hsi Chen 34
Problems
• More stringent phrase formation criteria produce fewer phrases, both good and bad, than less stringent methodologies.
• Prepositional phrase attachment, e.g.,The man saw the girl with the telescope.
• Anaphora resolutionHe dropped the plate on his foot and broke it.
Hsin-Hsi Chen 35
Problems (Continued)
• Any phrase-matching system must be able to deal with the problems of
  – synonym recognition
  – differing word orders
  – intervening extraneous words
• Example
  – retrieval of information vs. information retrieval
Hsin-Hsi Chen 36
Equivalent Phrase Formulation
• Base form: text analysis system
• Variants:
  – system analyzes the text
  – text is analyzed by the system
  – system carries out text analysis
  – text is subjected to system analysis
• Related term substitution
  – text: documents, information items
  – analysis: processing, transformation, manipulation
  – system: program, process
Hsin-Hsi Chen 37
Thesaurus-Group Generation
• Thesaurus transformation
  – broadens index terms whose scope is too narrow to be useful in retrieval
  – a thesaurus must assemble groups of related specific terms under more general, higher-level class indicators

[Figure: the document-frequency diagram repeated — thesaurus transformation applies to the low-frequency region (dv_j ≈ 0)]
Hsin-Hsi Chen 38
Sample Classes of Roget’s Thesaurus
Class indicator   Entries
760               permission, leave, sanction, allowance, tolerance, authorization
761               prohibition, veto, disallowance, injunction, ban, taboo
762               consent, acquiescence, compliance, agreement, acceptance
763               offer, presentation, tender, overture, advance, submission, proposal, proposition, invitation
764               refusal, declining, noncompliance, rejection, denial
Hsin-Hsi Chen 39
The Indexing Prescription (1)
• Identify the individual words in the document collection.
• Use a stop list to delete the function words from the texts.
• Use a suffix-stripping routine to reduce each remaining word to its word-stem form.
• For each remaining word stem Tj in document Di, compute wij.
• Represent each document Di by
  Di = (T1, wi1; T2, wi2; …; Tt, wit)
Hsin-Hsi Chen 40
Word Stemming
• effectiveness --> effective --> effect
• picnicking --> picnic
• king -/-> k
Hsin-Hsi Chen 41
Some Morphological Rules
• Restore a silent e after suffix removal from certain words, so as to produce "hope" from "hoping" rather than "hop".
• Delete certain doubled consonants after suffix removal, so as to generate "hop" from "hopping" rather than "hopp".
• Use a final y for an i in forms such as "easier", so as to generate "easy" instead of "easi".
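A toy illustration of the three rules; the silent-e lookup table is hypothetical, since rule 1 only applies to "certain words", and a real stemmer (e.g., Porter's algorithm) encodes far more cases:

```python
SILENT_E_STEMS = {"hop": "hope", "liv": "live"}   # hypothetical word list

def repair(stem):
    if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        return stem[:-1]                   # rule 2: "hopping" -> "hopp" -> "hop"
    if stem.endswith("i"):
        return stem[:-1] + "y"             # rule 3: "easier" -> "easi" -> "easy"
    return SILENT_E_STEMS.get(stem, stem)  # rule 1: "hoping" -> "hop" -> "hope"

def stem(word):
    for suffix in ("ing", "er", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return repair(word[:-len(suffix)])
    return word

print(stem("hoping"), stem("hopping"), stem("easier"), stem("king"))
# -> hope hop easy king   ("king" is left alone: its stem would be too short)
```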
Hsin-Hsi Chen 42
The Indexing Prescription (2)
• Identify individual text words.
• Use a stop list to delete common function words.
• Use automatic suffix stripping to produce word stems.
• Compute the term-discrimination value for all word stems.
• Use thesaurus class replacement for all low-frequency terms with discrimination values near zero.
• Use the phrase-formation process for all high-frequency terms with negative discrimination values.
• Compute weighting factors for complex indexing units.
• Assign to each document single terms, term phrases, and thesaurus classes, with weights.
Hsin-Hsi Chen 43
Query vs. Document
• Differences
  – Query texts are short.
  – Fewer terms are assigned to queries.
  – The occurrence frequency of a term in a query rarely exceeds 1.

  Q = (wq1, wq2, …, wqt)   where wqj: inverse document frequency
  Di = (di1, di2, …, dit)  where dij: term frequency × inverse document frequency

  sim(Q, Di) = Σ_{j=1}^{t} wqj · dij
Hsin-Hsi Chen 44
Query vs. Document
• When non-normalized documents are used, longer documents with more assigned terms have a greater chance of matching particular query terms than shorter document vectors do.

  sim(Q, Di) = (Σ_{j=1}^{t} wqj · dij) / sqrt(Σ_{j=1}^{t} (dij)²)

or

  sim(Q, Di) = (Σ_{j=1}^{t} wqj · dij) / (sqrt(Σ_{j=1}^{t} (dij)²) · sqrt(Σ_{j=1}^{t} (wqj)²))
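A sketch of ranking with the fully normalized (cosine) variant, reusing the toy vectors from the earlier example:

```python
import math

def cosine(q, d):
    den = (math.sqrt(sum(w * w for w in q)) *
           math.sqrt(sum(w * w for w in d)))
    return sum(wq * wd for wq, wd in zip(q, d)) / den if den else 0.0

docs = {"D1": [2, 3, 5], "D2": [3, 7, 1]}
q = [0, 0, 2]
for name in sorted(docs, key=lambda n: -cosine(q, docs[n])):
    print(name, round(cosine(q, docs[name]), 3))
# D1 0.811, D2 0.13 -- normalization keeps longer vectors from dominating
```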
Hsin-Hsi Chen 45
Relevance Feedback
• Terms present in previously retrieved documents that have been identified as relevant to the user's query are added to the original formulations.
• The weights of the original query terms are altered: the inverse-document-frequency portion of each weight is replaced by a term-relevance weight obtained from the occurrence characteristics of the term in the previously retrieved relevant and nonrelevant documents of the collection.
Hsin-Hsi Chen 46
Relevance Feedback
• Q = (wq1, wq2, ..., wqt)
• Di = (di1, di2, ..., dit)
• The new query may take the form
  Q' = (wq1, wq2, ..., wqt, w'q(t+1), w'q(t+2), ..., w'q(t+m))
• The weights of the newly added terms Tt+1 to Tt+m may consist of a combined term-frequency and term-relevance weight.
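A hedged sketch of the expansion step; weighting new terms by how many relevant documents contain them is a simplification, not the combined term-frequency/term-relevance weight described above:

```python
from collections import Counter

def expand_query(query_weights, relevant_docs, m=2):
    counts = Counter(t for doc in relevant_docs for t in set(doc))
    expanded = dict(query_weights)
    for term, c in counts.most_common():
        if len(expanded) >= len(query_weights) + m:
            break
        if term not in expanded:
            expanded[term] = float(c)   # w'_q for a newly added term
    return expanded

q = {"retrieval": 1.5}
relevant = [["retrieval", "effective", "systems"], ["effective", "feedback"]]
print(expand_query(q, relevant))   # adds 'effective' and one other term
```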
Hsin-Hsi Chen 47
Final Indexing
• Identify individual text words.
• Use a stop list to delete common words.
• Use suffix stripping to produce word stems.
• Replace low-frequency terms with thesaurus classes.
• Replace high-frequency terms with phrases.
• Compute term weights for all single terms, phrases, and thesaurus classes.
• Compare query statements with document vectors.
• Identify some retrieved documents as relevant and some as nonrelevant to the query.
Hsin-Hsi Chen 48
Final Indexing
• Compute term-relevance factors based on available relevance assessments.
• Construct new queries with added terms from relevant documents and term weights based on combined frequency and term-relevance weight.
• Return to step (7): compare query statements with document vectors, and iterate.
Hsin-Hsi Chen 49
Summary of expected effectiveness of automatic indexing
• Basic single-term automatic indexing: baseline
• Use of a thesaurus to group related terms in the given topic area: +10% to +20%
• Use of automatically derived term associations obtained from joint term assignments found in sample document collections: 0% to −10%
• Use of automatically derived term phrases obtained by using co-occurring terms found in the texts of sample collections: +5% to +10%
• Use of one iteration of relevance feedback to add new query terms extracted from previously retrieved relevant documents: +30% to +60%
Hsin-Hsi Chen 50
Models
Hsin-Hsi Chen 51
Ranking
• the central problem of IR
  – predict which documents are relevant and which are not
• ranking
  – establish an ordering of the documents retrieved
• IR models
  – different models provide distinct sets of premises for dealing with document relevance
Hsin-Hsi Chen 52
Information Retrieval Models
• Classic models
  – Boolean model
    • set theoretic
    • documents and queries are represented as sets of index terms
    • compare Boolean query statements with the term sets used to identify document content
  – Vector model
    • algebraic
    • documents and queries are represented as vectors in a t-dimensional space
    • compute global similarities between queries and documents
  – Probabilistic model
    • probabilistic
    • documents and queries are represented on the basis of probability theory
    • compute the relevance probabilities for the documents of a collection
Hsin-Hsi Chen 53
Information Retrieval Models (Continued)
• Structured models
  – refer to the structure present in written text
  – non-overlapping list model
  – proximal nodes model
• Browsing
  – flat
  – structure guided
  – hypertext
Hsin-Hsi Chen 54
Taxonomy of Information Retrieval Models
USER TASK
  Retrieval: ad hoc, filtering
    → Classic models: Boolean, vector, probabilistic
    → Structured models
  Browsing
    → flat, structure guided, hypertext

Model families
  Set theoretic: fuzzy, extended Boolean
  Algebraic: generalized vector, latent semantic indexing, neural network
  Probabilistic: inference network, belief network
Hsin-Hsi Chen 55
Issues of a retrieval system
• Models
  – Boolean
  – vector
  – probabilistic
• Logical views of documents
  – full text
  – set of index terms
• User task
  – retrieval
  – browsing
Hsin-Hsi Chen 56
Combinations of these issues
                                   LOGICAL VIEW OF DOCUMENTS
USER TASK     Index terms               Full text                 Full text + structure
Retrieval     classic, set theoretic,   classic, set theoretic,   structured
              algebraic, probabilistic  algebraic, probabilistic
Browsing      flat                      flat, hypertext           structure guided, hypertext
Hsin-Hsi Chen 57
Retrieval: Ad hoc and Filtering
• Ad hoc retrieval
  – The documents remain relatively static while new queries are submitted.
• Filtering
  – Queries remain relatively static while new documents come into the system (e.g., news wire services for the stock market).
  – A user profile describes the user's preferences.
  – The filtering task indicates to the user which documents might be of interest; deciding which ones are really relevant is fully reserved to the user.
  – Routing: a variation of filtering that ranks the filtered documents and shows this ranking to the user.
Hsin-Hsi Chen 58
User profile
• Simplistic approach
  – The profile is described through a set of keywords.
  – The user provides the necessary keywords.
• Elaborate approach
  – Collect information from the user.
  – initial profile + relevance feedback (relevant and nonrelevant information)
Hsin-Hsi Chen 59
Formal Definition of IR Models
• An IR model is a quadruple [D, Q, F, R(qi, dj)]
  – D: a set composed of logical views (or representations) of the documents in the collection
  – Q: a set composed of logical views (or representations) of the user information needs, i.e., queries
  – F: a framework for modeling document representations, queries, and their relationships
  – R(qi, dj): a ranking function which associates a real number with qi ∈ Q and dj ∈ D
Hsin-Hsi Chen 60
Formal Definition of IR Models (Continued)
• classic Boolean model
  – sets of documents
  – standard operations on sets
• classic vector model
  – t-dimensional vector space
  – standard linear algebra operations on vectors
• classic probabilistic model
  – sets
  – standard probabilistic operations and Bayes' theorem
Hsin-Hsi Chen 61
Basic Concepts of Classic IR
• index terms (usually nouns): index and summarize
• weights of index terms
• Definition
  – K = {k1, …, kt}: the set of all index terms
  – wi,j: a weight of an index term ki in a document dj
  – dj = (w1,j, w2,j, …, wt,j): an index term vector for the document dj
  – gi(dj) = wi,j
• assumption
  – index term weights are mutually independent: wi,j associated with (ki, dj) tells us nothing about wi+1,j associated with (ki+1, dj)
  – a counterexample: the terms computer and network in the area of computer networks
Hsin-Hsi Chen 62
Boolean Model
• The index term weight variables are all binary, i.e., wi,j ∈ {0,1}.
• A query q is a Boolean expression (and, or, not).
• qdnf: the disjunctive normal form of q
• qcc: any of the conjunctive components of qdnf
• sim(dj, q): similarity of dj to q
  – 1: if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc))   → dj is relevant to q
  – 0: otherwise
Hsin-Hsi Chen 63
Boolean Model (Continued)
• Example
  – q = ka ∧ (kb ∨ ¬kc)
  – qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)

  ka ∧ (kb ∨ ¬kc)
  = (ka ∧ kb) ∨ (ka ∧ ¬kc)
  = (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)

  [Figure: Venn diagram over ka, kb, kc with the shaded regions (1,1,1), (1,1,0), (1,0,0)]
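A minimal sketch of this matching rule on the example query:

```python
# A document matches iff its binary term pattern equals one of the
# conjunctive components of q_dnf.
def sim(doc, q_dnf):
    return 1 if tuple(doc) in q_dnf else 0

# q = ka AND (kb OR NOT kc), term order (ka, kb, kc)
q_dnf = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}
print(sim((1, 1, 0), q_dnf))   # 1: relevant
print(sim((0, 1, 0), q_dnf))   # 0: non-relevant
```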
Hsin-Hsi Chen 64
Boolean Model (Continued)
• advantage: simple
• disadvantages
  – binary decision (relevant or non-relevant) without a grading scale
  – exact match only (no partial match)
    • e.g., dj = (0,1,0) is non-relevant to q = ka ∧ (kb ∨ ¬kc)
  – tends to retrieve too few or too many documents
Hsin-Hsi Chen 65
Basic Vector Space Model
• Term-vector representation of
  documents: Di = (ai1, ai2, …, ait)
  queries: Qj = (qj1, qj2, …, qjt)
• t distinct terms are used to characterize content.
• Each term is identified with a term vector T.
• The t term vectors are linearly independent.
• Any vector is represented as a linear combination of the t term vectors.
• The rth document Dr can be written as a document vector:

  Dr = Σ_{i=1}^{t} ari Ti
Hsin-Hsi Chen 66
Document representation in vector space

[Figure: a document vector represented in a two-dimensional vector space]
Hsin-Hsi Chen 67
Similarity Measure
• measured by the inner product of two vectors: x • y = |x| |y| cos α
• document-query similarity:

  Dr • Qs = Σ_{i,j=1}^{t} ari qsj (Ti • Tj)

  where Dr = Σ_{i=1}^{t} ari Ti (document vector) and Qs = Σ_{j=1}^{t} qsj Tj (query vector)

• how to determine the vector components and the term correlations?
Hsin-Hsi Chen 68
Similarity Measure (Continued)
• vector components

        T1    T2    T3   …   Tt
  D1    a11   a12   a13  …   a1t
  D2    a21   a22   a23  …   a2t
  …
  Dn    an1   an2   an3  …   ant
Hsin-Hsi Chen 69
Similarity Measure (Continued)
• term correlations Ti • Tj are not available;
  assumption: term vectors are orthogonal: Ti • Tj = 0 (i ≠ j) and Ti • Tj = 1 (i = j)
• That is, assume that terms are uncorrelated.
• The similarity measurements then simplify to

  sim(Dr, Qs) = Σ_{i=1}^{t} ari qsi        (document-query)

  sim(Dr, Ds) = Σ_{i=1}^{t} ari asi        (document-document)
Hsin-Hsi Chen 70
Sample query-document similarity computation
• D1 = 2T1 + 3T2 + 5T3,  D2 = 3T1 + 7T2 + 1T3,  Q = 0T1 + 0T2 + 2T3
• similarity computations for uncorrelated terms:
  sim(D1, Q) = 2·0 + 3·0 + 5·2 = 10
  sim(D2, Q) = 3·0 + 7·0 + 1·2 = 2
• D1 is preferred
Hsin-Hsi Chen 71
Sample query-document similarity computation (Continued)
• term correlations:
        T1    T2    T3
  T1    1     0.5   0
  T2    0.5   1    −0.2
  T3    0    −0.2   1
• similarity computations for correlated terms:
  sim(D1, Q) = (2T1 + 3T2 + 5T3) • (0T1 + 0T2 + 2T3)
             = 4 T1•T3 + 6 T2•T3 + 10 T3•T3 = 4·0 + 6·(−0.2) + 10·1 = 8.8
  sim(D2, Q) = (3T1 + 7T2 + 1T3) • (0T1 + 0T2 + 2T3)
             = 6 T1•T3 + 14 T2•T3 + 2 T3•T3 = 6·0 + 14·(−0.2) + 2·1 = −0.8
• D1 is preferred
Hsin-Hsi Chen 72
Vector Model
• wi,j: a positive, non-binary weight for (ki,dj)
• wi,q: a positive, non-binary weight for (ki,q)
• q=(w1,q, w2,q, …, wt,q): a query vector, where t is the total number of index terms in the system
• dj= (w1,j, w2,j, …, wt,j): a document vector
Hsin-Hsi Chen 73
Similarity of document dj w.r.t. query q
• The correlation between the vectors dj and q:

  sim(dj, q) = (dj • q) / (|dj| · |q|)
             = (Σ_{i=1}^{t} wi,j · wi,q) / (sqrt(Σ_{i=1}^{t} wi,j²) · sqrt(Σ_{i=1}^{t} wi,q²))

  [Figure: sim(dj, q) = cos(θ), the cosine of the angle between dj and q]

• |q| does not affect the ranking.
• |dj| provides a normalization.
Hsin-Hsi Chen 74
document ranking
• Similarity (i.e., sim(q, dj)) varies from 0 to 1.
• Retrieve the documents with a degree of similarity above a predefined threshold(allow partial matching)
Hsin-Hsi Chen 75
term weighting techniques
• IR problem viewed as one of clustering
  – user query: a specification of a set A of objects
  – clustering problem: determine which documents are in the set A (relevant) and which are not (non-relevant)
  – intra-cluster similarity
    • which features better describe the objects in the set A?
    • the tf factor in the vector model: the raw frequency of a term ki inside a document dj
  – inter-cluster dissimilarity
    • which features better distinguish the objects in the set A from the remaining objects in the collection C?
    • the idf factor (inverse document frequency) in the vector model: the inverse of the frequency of a term ki among the documents in the collection
Hsin-Hsi Chen 76
Definition of tf
• N: the total number of documents in the system
• ni: the number of documents in which the index term ki appears
• freqi,j: the raw frequency of term ki in the document dj
• fi,j: the normalized frequency of term ki in document dj (ranging from 0 to 1):

  fi,j = freqi,j / max_l freql,j

  where the maximum is computed over all terms tl occurring in the document dj.
Hsin-Hsi Chen 77
Definition of idf and tf-idf scheme
• idfi: inverse document frequency for ki:

  idfi = log(N / ni)

• wi,j: term weight by the tf-idf scheme:

  wi,j = fi,j × log(N / ni)

• query term weight (Salton and Buckley):

  wi,q = (0.5 + (0.5 · freqi,q) / max_l freql,q) × log(N / ni)

  where freqi,q is the raw frequency of the term ki in q.
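A sketch of the Salton-Buckley query weight; the document frequencies are invented:

```python
import math
from collections import Counter

def query_weights(query_terms, df, N):
    freq = Counter(query_terms)
    max_f = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_f) * math.log(N / df[t])
            for t, f in freq.items()}

df = {"information": 50, "retrieval": 10}
print(query_weights(["information", "retrieval"], df, N=1000))
```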
Hsin-Hsi Chen 78
Analysis of vector model
• advantages
  – its term-weighting scheme improves retrieval performance
  – its partial-matching strategy allows retrieval of documents that approximate the query conditions
  – its cosine ranking formula sorts the documents according to their degree of similarity to the query
• disadvantage
  – index terms are assumed to be mutually independent
Hsin-Hsi Chen 79
Probabilistic Model
• Given a query, there is an ideal answer set
  – a set of documents which contains exactly the relevant documents and no others
• query process
  – a process of specifying the properties of an ideal answer set
• problem: what are the properties?
Hsin-Hsi Chen 80
Probabilistic Model (Continued)
• Generate a preliminary probabilistic description of the ideal answer set.
• Initiate an interaction with the user.
  – The user looks at the retrieved documents and decides which ones are relevant and which are not.
  – The system uses this information to refine the description of the ideal answer set.
  – Repeat the process many times.
Hsin-Hsi Chen 81
Probabilistic Principle
• Given a user query q and a document dj in the collection, the probabilistic model estimates the probability that the user will find dj relevant.
• assumptions
  – The probability of relevance depends on the query and document representations only.
  – There is a subset of all documents which the user prefers as the answer set for the query q.
• Given a query, the probabilistic model assigns to each document dj a measure of its similarity to the query:

  P(dj relevant to q) / P(dj non-relevant to q)
Hsin-Hsi Chen 82
Probabilistic Principle
• wi,j ∈ {0,1}, wi,q ∈ {0,1}: the index term weight variables are all binary
• q: a query, which is a subset of index terms
• R: the set of documents known (or initially guessed) to be relevant
• R̄ (the complement of R): the set of non-relevant documents
• P(R|dj): the probability that the document dj is relevant to the query q
• P(R̄|dj): the probability that dj is non-relevant to q
Hsin-Hsi Chen 83
similarity
• sim(dj, q): the similarity of the document dj to the query q

  sim(dj, q) = P(R|dj) / P(R̄|dj)                        (by definition)

             = [P(dj|R) · P(R)] / [P(dj|R̄) · P(R̄)]      (Bayes' rule)

             ~ P(dj|R) / P(dj|R̄)                         (P(R) and P(R̄) are the same for all documents)

  P(dj|R): the probability of randomly selecting the document dj from the set R of relevant documents
  P(R): the probability that a document randomly selected from the entire collection is relevant
Hsin-Hsi Chen 84
Assuming independence of index terms:

  sim(dj, q) = P(dj|R) / P(dj|R̄)

             = [ Π_{gi(dj)=1} P(ki|R) · Π_{gi(dj)=0} P(k̄i|R) ] / [ Π_{gi(dj)=1} P(ki|R̄) · Π_{gi(dj)=0} P(k̄i|R̄) ]

where
  P(ki|R): the probability that the index term ki is present in a document randomly selected from the set R
  P(k̄i|R) = 1 − P(ki|R): the probability that ki is not present in a document randomly selected from the set R
(and analogously for R̄).

Taking logarithms and ignoring factors that are constant for all documents:

  sim(dj, q) ~ Σ_{i=1}^{t} gi(dj) · log [ (P(ki|R) · (1 − P(ki|R̄))) / ((1 − P(ki|R)) · P(ki|R̄)) ]
Hsin-Hsi Chen 85
Splitting the logarithm into two factors:

  sim(dj, q) ~ Σ_{i=1}^{t} gi(dj) · [ log( P(ki|R) / (1 − P(ki|R)) ) + log( (1 − P(ki|R̄)) / P(ki|R̄) ) ]

Problem: where is the set R?
Hsin-Hsi Chen 86
Initial guess
• Assume P(ki|R) is constant for all index terms ki:

  P(ki|R) = 0.5

• Assume the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection:

  P(ki|R̄) = ni / N

  (assuming N >> |R|, so that N − |R| ≈ N)
Hsin-Hsi Chen 87
Initial ranking
• V: a subset of the documents initially retrieved and ranked by the probabilistic model (the top r documents)
• Vi: the subset of V composed of documents which contain the index term ki
• Approximate P(ki|R) by the distribution of the index term ki among the documents retrieved so far:

  P(ki|R) = Vi / V

• Approximate P(ki|R̄) by considering that all the non-retrieved documents are non-relevant:

  P(ki|R̄) = (ni − Vi) / (N − V)
Hsin-Hsi Chen 88
Small values of V and Vi
• The estimates P(ki|R) = Vi / V and P(ki|R̄) = (ni − Vi) / (N − V) break down for small values of V and Vi (e.g., V = 1 and Vi = 0).
• alternative 1: add an adjustment factor of 0.5:

  P(ki|R) = (Vi + 0.5) / (V + 1)
  P(ki|R̄) = (ni − Vi + 0.5) / (N − V + 1)

• alternative 2: use ni/N as the adjustment factor:

  P(ki|R) = (Vi + ni/N) / (V + 1)
  P(ki|R̄) = (ni − Vi + ni/N) / (N − V + 1)
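A compact sketch of one feedback iteration under these estimates (alternative 1); for brevity every index term is treated as a query term, and the four documents are invented:

```python
import math

def scores(p_rel, p_non):
    return {t: math.log(p_rel[t] / (1 - p_rel[t])) +
               math.log((1 - p_non[t]) / p_non[t]) for t in p_rel}

def rank(docs, s):
    return sorted(docs, key=lambda d: -sum(s[t] for t in docs[d]))

docs = {"d1": {"a", "b"}, "d2": {"b"}, "d3": {"a", "c"}, "d4": {"c"}}
N = len(docs)
n = {t: sum(t in d for d in docs.values()) for t in "abc"}

p_rel = {t: 0.5 for t in n}                    # initial guess P(ki|R) = 0.5
p_non = {t: n[t] / N for t in n}               # initial guess P(ki|R̄) = ni/N
V = rank(docs, scores(p_rel, p_non))[:2]       # top r = 2 documents

Vi = {t: sum(t in docs[d] for d in V) for t in n}
p_rel = {t: (Vi[t] + 0.5) / (len(V) + 1) for t in n}
p_non = {t: (n[t] - Vi[t] + 0.5) / (N - len(V) + 1) for t in n}
print(rank(docs, scores(p_rel, p_non)))        # re-ranked documents
```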
Hsin-Hsi Chen 89
Analysis of Probabilistic Model
• advantage
  – documents are ranked in decreasing order of their probability of being relevant
• disadvantages
  – the need to guess the initial separation of documents into relevant and non-relevant sets
  – the frequency with which an index term occurs inside a document is not considered
  – the independence assumption for index terms
Hsin-Hsi Chen 90
Comparison of classic models
• Boolean model: the weakest classic model
• Vector model is expected to outperform the probabilistic model with general collections (Salton and Buckley)
Hsin-Hsi Chen 91
Alternative Set Theoretic Models: Fuzzy Set Model
• Model
  – a query term defines a fuzzy set
  – a document has a degree of membership in this set
  – membership function
    • associates a membership degree with each element of the class
    • 0: no membership in the set
    • 1: full membership
    • 0~1: marginal elements of the set
Hsin-Hsi Chen 92
Fuzzy Set Theory
• A fuzzy subset A of a universe of discourse U is characterized by a membership function µA: U → [0,1] which associates with each element u of U a number µA(u) in the interval [0,1].
  – complement:    µ_Ā(u) = 1 − µA(u)
  – union:         µ_{A∪B}(u) = max(µA(u), µB(u))
  – intersection:  µ_{A∩B}(u) = min(µA(u), µB(u))
  (in IR, A plays the role of a class of documents and u the role of a document)
Hsin-Hsi Chen 93
Examples
• Assume U = {d1, d2, d3, d4, d5, d6}.
• Let A and B be {d1, d2, d3} and {d2, d3, d4}, respectively.
• Assume A = {d1:0.8, d2:0.7, d3:0.6, d4:0, d5:0, d6:0} and B = {d1:0, d2:0.6, d3:0.8, d4:0.9, d5:0, d6:0}.
• complement Ā = {d1:0.2, d2:0.3, d3:0.4, d4:1, d5:1, d6:1}      (µ_Ā(u) = 1 − µA(u))
• union A∪B = {d1:0.8, d2:0.7, d3:0.8, d4:0.9, d5:0, d6:0}        (max)
• intersection A∩B = {d1:0, d2:0.6, d3:0.6, d4:0, d5:0, d6:0}     (min)
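The worked example, executed directly:

```python
# The fuzzy set operations above, with membership functions as dicts over U.
U = ["d1", "d2", "d3", "d4", "d5", "d6"]
A = {"d1": 0.8, "d2": 0.7, "d3": 0.6, "d4": 0, "d5": 0, "d6": 0}
B = {"d1": 0, "d2": 0.6, "d3": 0.8, "d4": 0.9, "d5": 0, "d6": 0}

complement_A = {u: 1 - A[u] for u in U}
union_AB = {u: max(A[u], B[u]) for u in U}
intersection_AB = {u: min(A[u], B[u]) for u in U}
print(complement_A)
print(union_AB)
print(intersection_AB)
```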
Hsin-Hsi Chen 94
Fuzzy Information Retrieval
• basic idea
  – Expand the set of index terms in the query with related terms (from a thesaurus) such that additional relevant documents can be retrieved.
  – A thesaurus can be constructed by defining a term-term correlation matrix c (a keyword connection matrix) whose rows and columns are associated with the index terms in the document collection.
Hsin-Hsi Chen 95
Fuzzy Information Retrieval (Continued)
• normalized correlation factor ci,l between two terms ki and kl (0~1):

  ci,l = ni,l / (ni + nl − ni,l)

  where ni is the number of documents containing term ki, nl the number containing kl, and ni,l the number containing both.

• In the fuzzy set associated with each index term ki, a document dj has a degree of membership µi,j:

  µi,j = 1 − Π_{kl ∈ dj} (1 − ci,l)
Hsin-Hsi Chen 96
Fuzzy Information Retrieval (Continued)
• physical meaning
  – A document dj belongs to the fuzzy set associated with the term ki if its own terms are related to ki, i.e., µi,j = 1.
  – If there is at least one index term kl of dj which is strongly related to the index ki, then µi,j ≈ 1 and ki is a good fuzzy index for dj.
  – When all index terms of dj are only loosely related to ki, µi,j ≈ 0 and ki is not a good fuzzy index for dj.
Hsin-Hsi Chen 97
Example
• q = ka ∧ (kb ∨ ¬kc)
    = (ka ∧ kb ∧ kc) ∨ (ka ∧ kb ∧ ¬kc) ∨ (ka ∧ ¬kb ∧ ¬kc)
    = cc1 + cc2 + cc3

  [Figure: Venn diagram of the fuzzy sets Da, Db, Dc with the conjunctive components cc1, cc2, cc3]

  Da: the fuzzy set of documents associated with the index term ka;
  dj ∈ Da has a degree of membership µa,j greater than a predefined threshold K.
  D̄a: the fuzzy set of documents associated with the negation of the index term ka.
Hsin-Hsi Chen 98
Example
Query q = ka ∧ (kb ∨ ¬kc); disjunctive normal form qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0).

  µq,j = µ_{cc1+cc2+cc3, j}
       = 1 − Π_{i=1}^{3} (1 − µ_{cci,j})
       = 1 − (1 − µa,j µb,j µc,j) · (1 − µa,j µb,j (1 − µc,j)) · (1 − µa,j (1 − µb,j)(1 − µc,j))

(1) The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum (implemented as the complement of a negated algebraic product) instead of the max function — more smoothly.
(2) The degree of membership in a conjunctive fuzzy set is computed using an algebraic product instead of the min function.
Recall µ_Ā(u) = 1 − µA(u).
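A sketch of evaluating µq,j for this query with the algebraic product/sum rules; the membership values passed in are toy numbers:

```python
def mu_q(a, b, c):
    cc1 = a * b * c              # conjunct (1,1,1), via algebraic product
    cc2 = a * b * (1 - c)        # conjunct (1,1,0)
    cc3 = a * (1 - b) * (1 - c)  # conjunct (1,0,0)
    # disjunction via algebraic sum: 1 - prod(1 - cc_i)
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(round(mu_q(0.8, 0.7, 0.6), 4))  # membership of one document in q
```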
Hsin-Hsi Chen 99
Alternative Algebraic Model: Generalized Vector Space Model
• independence of index terms
  – ki: a vector associated with the index term ki
  – the set of vectors {k1, k2, …, kt} is linearly independent
  – orthogonality: ki • kj = 0 for i ≠ j
• In the generalized vector space model, the index term vectors are assumed linearly independent but are not pairwise orthogonal.
• The index term vectors are not seen as the basis of the space; they are composed of smaller components derived from the particular collection.
Hsin-Hsi Chen 100
Generalized Vector Space Model
• {k1, k2, …, kt}: index terms in a collection
• wi,j: binary weights associated with the term-document pair [ki, dj]
• The patterns of term co-occurrence (inside documents) can be represented by a set of 2^t minterms:
  m1 = (0, 0, …, 0): points to documents containing none of the index terms
  m2 = (1, 0, …, 0): points to documents containing the index term k1 only
  m3 = (0, 1, …, 0): points to documents containing the index term k2 only
  m4 = (1, 1, …, 0): points to documents containing the index terms k1 and k2 only
  …
  m_{2^t} = (1, 1, …, 1): points to documents containing all the index terms
• gi(mj): returns the weight {0,1} of the index term ki in the minterm mj (1 ≤ i ≤ t)
Hsin-Hsi Chen 101
Generalized Vector Space Model (Continued)
• A 2^t-tuple vector m_i is associated with each minterm mi (a t-tuple vector):
  m_1 = (1, 0, …, 0, 0)
  m_2 = (0, 1, …, 0, 0)
  …
  m_{2^t} = (0, 0, …, 0, 1)
  with m_i • m_j = 0 for i ≠ j (the vectors m_i are pairwise orthogonal).
• e.g., m_4 is associated with the minterm m4, which contains k1 and k2 and no others.
• co-occurrence of index terms inside documents induces dependencies among the index terms.
Hsin-Hsi Chen 102
Example (t = 3):

minterm mr     vector m_r
m1 = (0,0,0)   m_1 = (1,0,0,0,0,0,0,0)
m2 = (0,0,1)   m_2 = (0,1,0,0,0,0,0,0)
m3 = (0,1,0)   m_3 = (0,0,1,0,0,0,0,0)
m4 = (0,1,1)   m_4 = (0,0,0,1,0,0,0,0)
m5 = (1,0,0)   m_5 = (0,0,0,0,1,0,0,0)
m6 = (1,0,1)   m_6 = (0,0,0,0,0,1,0,0)
m7 = (1,1,0)   m_7 = (0,0,0,0,0,0,1,0)
m8 = (1,1,1)   m_8 = (0,0,0,0,0,0,0,1)

d1 (t1)       d11 (t1 t2)
d2 (t3)       d12 (t1 t3)
d3 (t3)       d13 (t1 t2)
d4 (t1)       d14 (t1 t2)
d5 (t2)       d15 (t1 t2 t3)
d6 (t2)       d16 (t1 t2)
d7 (t2 t3)    d17 (t1 t2)
d8 (t2 t3)    d18 (t1 t2)
d9 (t2)       d19 (t1 t2 t3)
d10 (t2 t3)   d20 (t1 t2)

Index vector for k1 (k1 appears in minterms m5, m6, m7, m8):

  k_1 = (c1,5 m_5 + c1,6 m_6 + c1,7 m_7 + c1,8 m_8) / sqrt(c1,5² + c1,6² + c1,7² + c1,8²)

  c1,5 = w1,1 + w1,4
  c1,6 = w1,12
  c1,7 = w1,11 + w1,13 + w1,14 + w1,16 + w1,17 + w1,18 + w1,20
  c1,8 = w1,15 + w1,19
Hsin-Hsi Chen 103
Index vector for k2 (k2 appears in minterms m3, m4, m7, m8; minterm table and document collection as above):

  k_2 = (c2,3 m_3 + c2,4 m_4 + c2,7 m_7 + c2,8 m_8) / sqrt(c2,3² + c2,4² + c2,7² + c2,8²)

  c2,3 = w2,5 + w2,6 + w2,9
  c2,4 = w2,7 + w2,8 + w2,10
  c2,7 = w2,11 + w2,13 + w2,14 + w2,16 + w2,17 + w2,18 + w2,20
  c2,8 = w2,15 + w2,19
Hsin-Hsi Chen 104
Index vector for k3 (k3 appears in minterms m2, m4, m6, m8):

  k_3 = (c3,2 m_2 + c3,4 m_4 + c3,6 m_6 + c3,8 m_8) / sqrt(c3,2² + c3,4² + c3,6² + c3,8²)

  c3,2 = w3,2 + w3,3
  c3,4 = w3,7 + w3,8 + w3,10
  c3,6 = w3,12
  c3,8 = w3,15 + w3,19
Hsin-Hsi Chen 105
Generalized Vector Space Model (Continued)
• Determine the index vector k_i associated with the index term ki:

  k_i = ( Σ_{r | gi(mr)=1} ci,r m_r ) / sqrt( Σ_{r | gi(mr)=1} ci,r² )

  ci,r = Σ_{dj | gl(dj)=gl(mr) for all l} wi,j

• Collect all the vectors m_r in which the index term ki is in state 1.
• Sum up the weights wi,j associated with the index term ki over all documents dj whose term-occurrence pattern coincides with the minterm mr.
Hsin-Hsi Chen 106
Generalized Vector Space Model (Continued)
• k_i • k_j quantifies a degree of correlation between ki and kj:

  k_i • k_j = Σ_{r | gi(mr)=1 ∧ gj(mr)=1} ci,r cj,r

• Documents and queries are expressed in terms of the index term vectors:

  d_j = Σ_i wi,j k_i        q = Σ_i wi,q k_i

• The standard cosine similarity is adopted.
Hsin-Hsi Chen 107
With the index vectors k_1, k_2, k_3 computed above, the term correlations for the example collection are (up to the normalizing factors):

  k_1 • k_2 = c1,7 c2,7 + c1,8 c2,8
  k_1 • k_3 = c1,6 c3,6 + c1,8 c3,8
  k_2 • k_3 = c2,4 c3,4 + c2,8 c3,8

(only the minterms shared by both terms contribute to each dot product)
Hsin-Hsi Chen 108
Comparison with Standard Vector Space Model
d1 (t1): (w1,1,0,0) d11 (t1 t2)
d2 (t3): (0,0,w3,2) d12 (t1 t3)
d3 (t3): (0,0,w3,3) d13 (t1 t2)
d4 (t1): (w1,4,0,0) d14 (t1 t2)
d5 (t2): (0,w2,5,0) d15 (t1 t2 t3)
d6 (t2): (0,w2,6,0) d16 (t1 t2)
d7 (t2 t3): (0,w2,7,w3,7) d17 (t1 t2)
d8 (t2 t3): (0,w2,8,w3,8) d18 (t1 t2)
d9 (t2): (0,w2,9,0) d19 (t1 t2 t3)
d10 (t2 t3): (0,w2,10,w3,10) d20 (t1 t2)
Hsin-Hsi Chen 109
Latent Semantic Indexing Model
• Representation of documents and queries by index terms leads to two problems:
  – problem 1: many unrelated documents might be included in the answer set
  – problem 2: relevant documents which are not indexed by any of the query keywords are not retrieved
• possible solution: concept matching instead of index term matching
  – application in cross-language information retrieval
Hsin-Hsi Chen 110
basic idea
• Map each document and query vector into a lower dimensional space which is associated with concepts
• Retrieval in the reduced space may be superior to retrieval in the space of index terms
Hsin-Hsi Chen 111
Definition
• t: the number of index terms in the collection
• N: the total number of documents
• M=(Mij): a term-document association matrix with t rows and N columns
• Mij: a weight wi,j associated with the term-document pair [ki, dj] (e.g., using tf-idf)
Hsin-Hsi Chen 112
Singular Value Decomposition
(1) Eigen decomposition of a symmetric matrix: if A ∈ R^{n×n} and A = A^T, there exists an orthogonal Q ∈ R^{n×n} (Q Q^T = Q^T Q = I) such that

  A = Q D Q^T

where D = diag(λ1, λ2, …, λn) is a diagonal matrix with λ1 ≥ λ2 ≥ … ≥ λn ≥ 0. Then

  A A^T = (Q D Q^T)(Q D Q^T)^T = Q D (Q^T Q) D^T Q^T = Q D² Q^T
Hsin-Hsi Chen 113
(2) Singular value decomposition of a general matrix A ∈ R^{m×n}: there exist orthogonal U and V (U^T U = I, V^T V = I) such that

  A = U D V^T

where D = diag(σ1, σ2, …, σn) is a diagonal matrix with σ1 ≥ σ2 ≥ … ≥ σn ≥ 0. Then (using (AB)^T = B^T A^T):

  A A^T = (U D V^T)(U D V^T)^T = U D (V^T V) D^T U^T = U D² U^T
Hsin-Hsi Chen 114
From A = Q D Q^T we get A Q = Q D Q^T Q = Q D. Writing Q = [q1, q2, …, qn], where each qi is a column vector:

  A [q1 q2 … qn] = [q1 q2 … qn] diag(λ1, …, λn)
  ⇒ A qk = λk qk

That is, λ1, λ2, …, λn are the eigenvalues of A, and qk is the eigenvector of A corresponding to λk.
Hsin-Hsi Chen 115
Singular Value Decomposition
• M ∈ R^{t×N}: a term-document matrix with t rows and N columns
• M^T M: an N×N document-to-document matrix
• M M^T: a t×t term-to-term matrix
• singular value decomposition: M = K S D^T, where
  – K: the matrix of eigenvectors derived from M M^T (K^T K = I)
  – D: the matrix of eigenvectors derived from M^T M (D^T D = I)
  – S: an r×r diagonal matrix of singular values, where r = min(t, N)
Hsin-Hsi Chen 116
• document-to-document matrix:

  M^T M = (K S D^T)^T (K S D^T) = D S K^T K S D^T = D S² D^T

• term-to-term matrix:

  M M^T = (K S D^T)(K S D^T)^T = K S D^T D S K^T = K S² K^T

• Comparing with A = Q D Q^T (Q the matrix of eigenvectors, D the diagonal matrix of eigenvalues), we obtain:
  – K: the matrix of eigenvectors derived from M M^T
  – D: the matrix of eigenvectors derived from M^T M
  – S: an r×r diagonal matrix of singular values, where r = min(t, N)
• Choosing s < r reduces the concept space.
Hsin-Hsi Chen 117
Consider only the s largest singular values of S (set the remaining ones to zero):

  Ms = Ks Ss Ds^T      (s << t, s << N)

The resultant matrix Ms is the matrix of rank s which is closest to the original matrix M in the least-squares sense.
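A minimal numpy sketch of the whole LSI pipeline on an invented matrix; folding the query in through Ks is one common convention, not the only one:

```python
import numpy as np

M = np.array([[1., 1., 0., 0.],    # toy term-document matrix (t=3, N=4)
              [0., 1., 1., 0.],
              [0., 0., 1., 1.]])
K, svals, Dt = np.linalg.svd(M, full_matrices=False)   # M = K S D^T
s = 2                                                   # keep s largest values
Ks, Ss, Ds = K[:, :s], np.diag(svals[:s]), Dt[:s, :].T

docs_s = Ds @ Ss                    # document coordinates in concept space
q = np.array([1., 0., 1.])          # query as a pseudo-document over terms
q_s = q @ Ks                        # fold the query into the concept space
sims = docs_s @ q_s / (np.linalg.norm(docs_s, axis=1) * np.linalg.norm(q_s))
print(np.round(sims, 3))            # similarity of each document to q
```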
Hsin-Hsi Chen 118
Ranking in LSI
• query: modeled as a pseudo-document in the original term-document matrix M (e.g., as the document with number 0)
• Ms^T Ms: yields the ranks of all documents with respect to this query
Hsin-Hsi Chen 119
Structured Text Retrieval Models
• Definition
  – Combine information on text content with information on the document structure.
  – e.g., same-page(near('atomic holocaust', Figure(label('earth'))))
• Expressive power vs. evaluation efficiency
  – a model based on non-overlapping lists
  – a model based on proximal nodes
• Terminology
  – match point: position in the text of a sequence of words that matches the user query
  – region: a contiguous portion of the text
  – node: a structural component of the document (chapter, section, …)
Hsin-Hsi Chen 120
Non-Overlapping Lists
• Divide the whole text of each document into non-overlapping text regions (lists).
• example: indexing lists
  L0 Chapters: a list of all chapters in the document
  L1 Sections: a list of all sections in the document
  L2 Subsections: a list of all subsections in the document
  L3 Subsubsections: a list of all subsubsections in the document
• Text regions from distinct lists might overlap.

[Figure: a 5000-character document; Chapter 1 spans positions 1–5000; sections 1.1 (1–3000) and 1.2 (3001–5000); subsections 1.1.1 (1–1000), 1.1.2 (1001–3000), 1.2.1 (3001–5000); subsubsections at 1–500, 501–1000, 1001–…]
Hsin-Hsi Chen 121
Non-Overlapping Lists (Continued)
• Data structure
  – a single inverted file
  – each structural component stands as an entry
  – for each entry, there is a list of text regions as a list of occurrences
  – (recall that there is another inverted file for the words in the text)
• Operations
  – Select a region which contains a given word.
  – Select a region A which does not contain any other region B (where B belongs to a list distinct from the list for A).
  – Select a region not contained within any other region.
  – …
Hsin-Hsi Chen 122
Inverted Files
• File is represented as an array of indexed records.
Term 1 Term 2 Term 3 Term 4
Record 1 1 1 0 1
Record 2 0 1 1 1
Record 3 1 0 1 1
Record 4 0 0 1 1
Hsin-Hsi Chen 123
Inverted-file process
• The record-term array is inverted (transposed).
Record 1 Record 2 Record 3 Record 4
Term 1 1 0 1 0
Term 2 1 1 0 0
Term 3 0 1 1 1
Term 4 1 1 1 1
Hsin-Hsi Chen 124
Inverted-file process (Continued)
• Take two or more rows of an inverted term-record array and produce a single combined list of record identifiers.

  Query (term2 AND term3):
    term2:  1 1 0 0
    term3:  0 1 1 1
    AND:    0 1 0 0   → R2
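A sketch of the same AND operation on an inverted file built from the record-term table above:

```python
records = {1: {"term1", "term2", "term4"},
           2: {"term2", "term3", "term4"},
           3: {"term1", "term3", "term4"},
           4: {"term3", "term4"}}

inverted = {}
for rec, terms in records.items():
    for term in terms:
        inverted.setdefault(term, set()).add(rec)

def AND(a, b):
    # Boolean AND as posting-list intersection
    return sorted(inverted.get(a, set()) & inverted.get(b, set()))

print(AND("term2", "term3"))   # -> [2], i.e., record R2
```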
Hsin-Hsi Chen 125
Extensions of Inverted Index Operations (Distance Constraints)
• Distance constraints
  – (A within sentence B): terms A and B must co-occur in a common sentence
  – (A adjacent B): terms A and B must occur adjacently in the text
Hsin-Hsi Chen 126
Extensions of Inverted Index Operations (Distance Constraints)
• Implementation
  – include term locations in the inverted indexes
      information: {R345, R348, R350, …}
      retrieval:   {R123, R128, R345, …}
  – include sentence locations in the indexes
      information: {R345, 25; R345, 37; R348, 10; R350, 8; …}
      retrieval:   {R123, 5; R128, 25; R345, 37; R345, 40; …}
Hsin-Hsi Chen 127
Extensions of Inverted Index Operations (Distance Constraints)
  – include paragraph numbers, sentence numbers within paragraphs, and word numbers within sentences in the indexes
      information: {R345, 2, 3, 5; …}
      retrieval:   {R345, 2, 3, 6; …}
  – query examples:
      (information adjacent retrieval)
      (information within five words retrieval)
  – cost: the size of the indexes
Hsin-Hsi Chen 128
Model Based on Proximal Nodes
• hierarchical vs. flat indexing structures

[Figure: a hierarchical index whose nodes (chapters, sections, subsections, subsubsections, and also paragraphs, pages, lines) record positions in the text, alongside a flat inverted list for the word 'holocaust' (entries 10, 256, 48,324 — positions in the text)]
Hsin-Hsi Chen 129
Model Based on Proximal Nodes (Continued)
• query language
  – specification of regular expressions
  – reference to structural components by name
  – combination of both
  – Example
    • Search for sections, subsections, or subsubsections which contain the word 'holocaust':
      [(*section) with ('holocaust')]
Hsin-Hsi Chen 130
Model Based on Proximal Nodes (Continued)
• Basic algorithm
  – Traverse the inverted list for the term 'holocaust'.
  – For each entry in the list (i.e., an occurrence), search the hierarchical index looking for sections, subsections, and subsubsections.
• Revised algorithm (exploits the fact that successive entries tend to fall under nearby nodes)
  – For the first entry, search as before.
  – Let the last matching structural component be the innermost matching component.
  – Verify whether the innermost matching component also matches the second entry; if it does, the larger structural components above it also do.
Hsin-Hsi Chen 131
Models for Browsing
• Browsing vs. searching
  – The goal of a searching task is clearer in the mind of the user than the goal of a browsing task.
• Models
  – flat browsing
  – structure guided browsing
  – the hypertext model
Hsin-Hsi Chen 132
Models for Browsing
• Flat organization
  – Documents are represented as dots in a 2-D plane.
  – Documents are represented as elements in a 1-D list, e.g., the results of a search engine.
• Structure guided browsing
  – Documents are organized in a directory, which groups documents covering related topics.
• Hypertext model
  – Navigating the hypertext: a traversal of a directed graph.
Hsin-Hsi Chen 133
Trends and Research Issues
• Library systems
  – Cognitive and behavioral issues, oriented particularly toward a better understanding of which criteria users adopt to judge relevance.
• Specialized retrieval systems
  – e.g., legal and business documents
  – how to retrieve all relevant documents without retrieving a large number of unrelated documents
• The Web
  – The user does not know what he wants, or has great difficulty formulating his request.
  – How the paradigm adopted for the user interface affects the ranking.
  – The indexes maintained by the various Web search engines are almost disjoint.