Classic IR Models
E.G.M. Petrakis Information Retrieval Models 1
Classic IR Models

Boolean model
- simple model based on set theory
- queries as Boolean expressions
- adopted by many commercial systems

Vector space model
- queries and documents as vectors in an M-dimensional space (M is the number of terms)
- find the documents most similar to the query in the M-dimensional space

Probabilistic model
- a probabilistic approach
- assume an ideal answer set for each query
- iteratively refine the properties of the ideal answer set
Document Index Terms

Each document is represented by a set of representative index terms or keywords:
- requires text pre-processing (off-line)
- these terms summarize the document contents
- adjectives, adverbs, connectives are less useful
- the index terms are mainly nouns (lexicon look-up)

Not all terms are equally useful:
- very frequent terms are not useful
- very infrequent terms are not useful either
- terms have varying relevance (weights) when used to describe documents
Text Preprocessing

Extract terms from documents and queries (document / query profile). Processing stages:
- word separation, sentence splitting
- change terms to a standard form (e.g., lowercase)
- eliminate stop-words (e.g., and, is, the, …)
- reduce terms to their base form (e.g., eliminate prefixes, suffixes)
- construct term indices (usually inverted files)
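The stages above can be sketched as a small pipeline. This is an illustrative sketch only: the stop-word list is tiny and the suffix stripper is a crude stand-in for a real stemmer.

```python
import re

STOP_WORDS = {"and", "is", "the", "a", "of", "in", "are"}  # tiny illustrative list
SUFFIXES = ("ing", "ly", "ed", "s")  # crude stand-in for a real stemmer

def preprocess(text):
    # word separation + change to a standard form (lowercase)
    tokens = re.findall(r"[a-z]+", text.lower())
    # eliminate stop-words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # reduce terms to a base form by stripping common suffixes
    base = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        base.append(t)
    return base

print(preprocess("The dogs are running in the park"))  # ['dog', 'runn', 'park']
```

The resulting term lists are what the indexing and weighting steps below operate on.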
Text Preprocessing Chart

[Figure: text preprocessing stages, from Baeza-Yates & Ribeiro-Neto, 1999]
Inverted Index

[Figure: an inverted index. The index holds the sorted vocabulary (Greek example terms: άγαλμα, αγάπη, …, δουλειά, …, πρωί, …, ωκεανός); each term points to a posting list of (document, frequency) pairs, e.g., (1,2)(3,4), referencing documents 1–11.]
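A minimal sketch of building such an index, mapping each term to a posting list of (document id, frequency) pairs as in the figure:

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> list of terms.
    Returns term -> posting list of (doc_id, frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # count term frequencies within this document
        for term, freq in sorted(Counter(docs[doc_id]).items()):
            index[term].append((doc_id, freq))
    return dict(index)

docs = {1: ["love", "sea", "love"], 3: ["love"], 4: ["work"]}
index = build_inverted_index(docs)
print(index["love"])  # [(1, 2), (3, 1)]
```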
Basic Notation

- Document: usually text
  - D: document collection (corpus); d: an instance of D
- Query: same representation as documents
  - Q: set of all possible queries; q: an instance of Q
- Relevance: R(d,q)
  - binary relation R: D × Q → {0,1}; d is "relevant" to q iff R(d,q) = 1
  - or degree of relevance: R(d,q) ∈ [0,1]
  - or probability of relevance: R(d,q) = Prob(R|d,q)
Term Weights

- T = {t1, t2, …, tM}: the terms in the corpus
- N: number of documents in the corpus
- dj: a document; dj is represented by (w1j, w2j, …, wMj), where
  - wij > 0 if ti appears in dj
  - wij = 0 otherwise
- q is represented by (q1, q2, …, qM)
- R(d,q) > 0 if q and d have common terms
Term Weighting

[Figure: the term–document matrix, with terms t1 … tM as rows, documents d1 … dN as columns, and weights wij as entries (w11, w12, …, wMN)]
Document Space (corpus)

[Figure: the document space D, showing the query q as a point among relevant and non-relevant documents]
Boolean Model

Based on set theory and Boolean algebra.
- Boolean queries: "John" and "Mary" not "Ann"
- terms linked by "and", "or", "not"
- term weights are 0 or 1 (wij = 0 or 1)
- query terms are present or absent in a document
- a document is relevant if the query condition is satisfied

Pros: simple, used in many commercial systems.
Cons: no ranking, complex queries are not easy to express.
Query Processing

For each term ti in the query q = {t1, t2, …, tM}:
1) use the index to retrieve all dj with wij > 0
2) sort them in decreasing order (e.g., by term frequency)

Return the documents satisfying the query condition. This is slow for many terms, since it involves set intersections. Speed-ups: keep only the top K documents for each term at step 2, or do not process all query terms.
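A sketch of the conjunctive (AND) case over an inverted index, intersecting the shortest posting lists first to keep the intersections cheap; here postings are plain sets of document ids:

```python
def and_query(index, terms):
    """Evaluate an AND query: intersect the posting sets of all terms,
    smallest sets first, with an early exit on an empty result."""
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return set()
    postings.sort(key=len)  # cheapest intersections first
    result = postings[0]
    for p in postings[1:]:
        result = result & p
        if not result:
            break
    return result

index = {"john": {1, 2, 5}, "mary": {2, 5, 7}, "ann": {5}}
print(sorted(and_query(index, ["john", "mary"])))                 # [2, 5]
# "John" and "Mary" not "Ann": subtract the NOT term's posting set
print(sorted(and_query(index, ["john", "mary"]) - index["ann"]))  # [2]
```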
Vector Space Model

- Documents and queries are M-dimensional term vectors
- non-binary weights are assigned to index terms
- a query is similar to a document if their vectors are similar
- retrieved documents are sorted in decreasing order of similarity
- a document may match a query only partially
- SMART is the most popular implementation
Query – Document Similarity

Similarity is defined as the cosine of the angle θ between the document and query vectors:

Sim(q,d) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} = \frac{\sum_{i=1}^{M} w_{iq} w_{id}}{\sqrt{\sum_{i=1}^{M} w_{iq}^2}\,\sqrt{\sum_{i=1}^{M} w_{id}^2}}

[Figure: vectors q and d with the angle θ between them]
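The cosine formula above translates directly into code; a minimal sketch over dense weight vectors:

```python
import math

def cosine_sim(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    nq = math.sqrt(sum(w * w for w in q))
    nd = math.sqrt(sum(w * w for w in d))
    if nq == 0 or nd == 0:
        return 0.0  # a zero vector has no direction
    return dot / (nq * nd)

q = [1.0, 1.0, 0.0]
d = [2.0, 0.0, 0.0]
print(round(cosine_sim(q, d), 4))  # 0.7071 (45-degree angle)
```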
Weighting Scheme

tf × idf weighting scheme: wij is the weight of term ti associated with document dj.

- freqij: raw frequency of term ti in document dj
- tfij: normalized frequency, tf_{ij} = \frac{freq_{ij}}{\max_l freq_{lj}}, where the maximum is computed over all terms l in dj
- idfi: inverse document frequency, idf_i = \log\frac{N}{n_i}, where ni is the number of documents in which term ti occurs

w_{ij} = tf_{ij} \times idf_i = \frac{freq_{ij}}{\max_l freq_{lj}} \log\frac{N}{n_i}
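The scheme above, computed over a toy corpus (a sketch, with documents given as term lists):

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of term lists. Returns, per document, a dict of
    w_ij = (freq_ij / max_l freq_lj) * log(N / n_i)."""
    N = len(docs)
    # n_i: number of documents in which term t_i occurs
    n = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        freq = Counter(d)
        max_freq = max(freq.values())
        weights.append({t: (f / max_freq) * math.log(N / n[t])
                        for t, f in freq.items()})
    return weights

w = tf_idf_weights([["cat", "cat", "dog"], ["dog", "fish"]])
print(round(w[0]["cat"], 4))  # 0.6931 = log(2): frequent here, rare elsewhere
print(w[0]["dog"])            # 0.0: appears in every document
```

Note how a term that occurs in every document gets idf = log(N/N) = 0 and therefore zero weight.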
Weight Normalization

Many ways to express the weights, e.g., using log(tfij). The weight is normalized in [0,1] by document length:

w_{ij} = \frac{(1 + \log(tf_{ij}))\, idf_i}{\sqrt{\sum_{k=1}^{M} (1 + \log(tf_{kj}))^2}}
Normalization by Document Length

w'_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{M} w_{kj}^2}}

- The longer the document, the more likely it is for a given term to appear in it
- Normalize the term weights by document length, so that longer documents are not given more weight
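The length normalization above is just a division by the Euclidean norm of the document's weight vector; a minimal sketch:

```python
import math

def normalize_by_length(weights):
    """w'_ij = w_ij / sqrt(sum_k w_kj^2): scale a document's weight
    vector to unit length so longer documents are not favored."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)  # empty document: nothing to normalize
    return {t: w / norm for t, w in weights.items()}

d = {"cat": 3.0, "dog": 4.0}
print(normalize_by_length(d))  # {'cat': 0.6, 'dog': 0.8}
```

After this step, the cosine similarity of two documents reduces to a plain dot product of their normalized vectors.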
Comments on Term Weighting

- tfij: term frequency; measures how well a term describes a document (intra-document characterization)
- idfi: terms appearing in many documents are not very useful for distinguishing relevant from non-relevant documents (inter-document characterization)
- This scheme favors average terms
Comments on the Vector Space Model

Pros:
- at least as good as the other models
- approximate query matching: a query and a document need not contain exactly the same terms
- allows for ranking of the results

Cons:
- assumes term independence
Document Distance

Consider documents d1, d2 with (unit-length) vectors u1, u2. Their distance is defined as the length of the chord AB between the vector endpoints:

distance(d_1, d_2) = 2\sin(\theta/2) = \sqrt{2(1 - \cos\theta)} = \sqrt{2(1 - similarity(d_1, d_2))}
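A quick numeric check of the identity above, converting a cosine similarity into the corresponding chord distance:

```python
import math

def distance(sim):
    """Chord length between two unit document vectors whose cosine
    similarity is `sim`: 2*sin(theta/2) = sqrt(2*(1 - sim))."""
    return math.sqrt(2.0 * (1.0 - sim))

print(distance(1.0))             # 0.0     (identical directions)
print(round(distance(0.0), 4))   # 1.4142  (orthogonal: sqrt(2))
```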
Probabilistic Model

- Computes the probability that a document is relevant to the query
- ranks the documents according to their probability of being relevant to the query
- Assumption: there is a set R of relevant documents which maximizes the overall probability of relevance (R: the ideal answer set)
- R is not known in advance:
  - initially assume a description (the terms) of R
  - iteratively refine this description
Basic Notation

- D: corpus; d: an instance of D
- Q: set of queries; q: an instance of Q
- R = \{(d,q) \mid d \in D, q \in Q, d \text{ is relevant to } q\} \subseteq D \times Q
- \bar{R} = \{(d,q) \mid d \in D, q \in Q, d \text{ is not relevant to } q\} \subseteq D \times Q
- P(R|d): probability that d is relevant
- P(\bar{R}|d): probability that d is not relevant
Probability of Relevance

P(R|d): probability that d is relevant. By Bayes' rule:

P(R|d) = \frac{P(d|R)\,P(R)}{P(d)}

- P(d|R): probability of selecting d from R
- P(R): probability of selecting R from D
- P(d): probability of selecting d from D
Document Ranking

Take the odds of relevance as the rank; this minimizes the probability of an erroneous judgment:

Sim(q,d) = \frac{P(R|d)}{P(\bar{R}|d)} = \frac{P(d|R)\,P(R)}{P(d|\bar{R})\,P(\bar{R})}

P(R) and P(\bar{R}) are the same for all documents, so

Sim(q,d) = \frac{P(d|R)}{P(d|\bar{R})}
Ranking (cont'd)

Each document is represented by a set of index terms t1, t2, …, tM. Assume binary weights wi for the terms ti: d = (w1, w2, …, wM), where wi = 1 if the term appears in d and wi = 0 otherwise. Assuming independence of the index terms:

P(d|R) = \prod_{t_i \in d} P(t_i|R) \prod_{t_i \notin d} P(\bar{t_i}|R)
Ranking (cont'd)

By taking logarithms and omitting constant terms:

Sim(q,d) = \frac{P(d|R)}{P(d|\bar{R})} \sim \sum_{i=1}^{M} w_{iq} w_{id} \log\frac{P(t_i|R)}{1 - P(t_i|R)} + \sum_{i=1}^{M} w_{iq} w_{id} \log\frac{1 - P(t_i|\bar{R})}{P(t_i|\bar{R})}

R is initially unknown.
Initial Estimation

Make simplifying assumptions such as

P(t_i|R) = 0.5, \quad P(t_i|\bar{R}) = \frac{n_i}{N}

where ni is the number of documents containing ti and N is the total number of documents.
- Retrieve an initial answer set using these values
- Refine the answer iteratively
Improvement

- Let V be the number of documents retrieved initially
- Take the first r answers as relevant
- From them compute Vi: the number of retrieved documents containing ti
- Update the initial probabilities:

P(t_i|R) = \frac{V_i}{V}, \quad P(t_i|\bar{R}) = \frac{n_i - V_i}{N - V}

- Resubmit the query and repeat until convergence
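A minimal sketch of the initial estimation and ranking steps on a toy corpus. The +0.5 smoothing in the estimate of P(t|R̄) is an added assumption (not in the slides) to keep the logarithms defined; the refinement step P(t_i|R) = V_i/V would follow the same pattern.

```python
import math

def bim_score(doc_terms, query_terms, pR, pNR):
    """Binary-independence ranking: for each query term present in the
    document, add log(p/(1-p)) + log((1-q)/q), p = P(t|R), q = P(t|R-bar)."""
    s = 0.0
    for t in query_terms:
        if t in doc_terms:
            p, q = pR[t], pNR[t]
            s += math.log(p / (1 - p)) + math.log((1 - q) / q)
    return s

def initial_estimates(docs, query_terms):
    """Initial assumptions: P(t|R) = 0.5, P(t|R-bar) ~ n_i / N (smoothed)."""
    N = len(docs)
    pR, pNR = {}, {}
    for t in query_terms:
        n_i = sum(1 for d in docs if t in d)
        pR[t] = 0.5
        pNR[t] = (n_i + 0.5) / (N + 1)
    return pR, pNR

docs = [{"ocean", "morning"}, {"morning"}, {"work"}, {"morning", "work"}]
query = ["ocean"]
pR, pNR = initial_estimates(docs, query)
ranked = sorted(range(len(docs)),
                key=lambda i: bim_score(docs[i], query, pR, pNR),
                reverse=True)
print(ranked[0])  # the only document containing "ocean" is ranked first
```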
Comments on the Probabilistic Model

Pros: good theoretical basis.

Cons:
- need to guess the initial probabilities
- binary weights
- independence assumption

Extensions:
- relevance feedback: humans choose the relevant docs
- OKAPI formula for non-binary weights
Comparison of Models

- The Boolean model is simple and used almost everywhere. It does not allow for partial matches. It is the weakest model.
- The vector space model has been shown (Salton and Buckley) to outperform the other two models.
- Various extensions deal with their weaknesses.
Query Modification

The results are not always satisfactory:
- some answers are correct, others are not
- queries can't specify the user's needs precisely

Iteratively reformulate and resubmit the query until the results become satisfactory. Two approaches:
- relevance feedback
- query expansion
Relevance Feedback

Mark answers as:
- relevant: positive examples
- irrelevant: negative examples

The query is a point in document space:
- at each iteration compute a new query point
- the query moves towards an "optimal point" that distinguishes relevant from non-relevant documents
- the weights of the query terms are modified ("term reweighting")
Rocchio Vectors

[Figure: successive query points q0, q1, q2 moving towards the optimal query]
Rocchio Formula

New query point:

q_1 = \alpha q_0 + \frac{\beta}{n_1} \sum_{i=1}^{n_1} d_i - \frac{\gamma}{n_2} \sum_{j=1}^{n_2} d_j

- di: relevant answer; dj: non-relevant answer
- n1: number of relevant answers; n2: number of non-relevant answers
- α, β, γ: relative strength (usually α = β = γ = 1)
- α = 1, β = 0.75, γ = 0.25: q0 and the relevant answers contain the most important information
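The update is a weighted vector sum; a minimal sketch over dense weight vectors. Clipping negative weights to zero is an added convention, not part of the formula above:

```python
def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """q1 = alpha*q0 + (beta/n1)*sum(relevant) - (gamma/n2)*sum(non_relevant).
    All vectors are equal-length lists of term weights."""
    M = len(q0)
    n1, n2 = len(relevant), len(non_relevant)
    q1 = [alpha * w for w in q0]
    for d in relevant:          # pull towards relevant answers
        for i in range(M):
            q1[i] += beta * d[i] / n1
    for d in non_relevant:      # push away from non-relevant answers
        for i in range(M):
            q1[i] -= gamma * d[i] / n2
    # negative weights are usually clipped to 0 (added convention)
    return [max(0.0, w) for w in q1]

q0 = [1.0, 0.0, 0.0]
rel = [[1.0, 1.0, 0.0]]
nonrel = [[0.0, 0.0, 1.0]]
print(rocchio(q0, rel, nonrel))  # [1.75, 0.75, 0.0]
```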
Query Expansion

Adds new terms to the query which are somehow related to the existing terms:
- synonyms from a dictionary (e.g., staff, crew)
- semantically related terms from a thesaurus (e.g., WordNet): man, woman, mankind, human, …
- terms with similar pronunciation (Phonix, Soundex)

Better results in many cases, but the query defocuses (topic drift).
Comments

Do all together:
- query expansion: new terms are added from relevant documents, dictionaries, a thesaurus
- term reweighting by the Rocchio formula

If consistent relevance judgments are provided:
- 2–3 iterations improve the results
- quality depends on the corpus
Extensions

- Pseudo relevance feedback: mark the top k answers as relevant and the bottom k answers as non-relevant, then apply the Rocchio formula
- Relevance models for the probabilistic model:
  - evaluation of the initial answers by humans
  - term reweighting model by Bruce Croft, 1983
Text Clustering

- The grouping of similar vectors into clusters
- Similar documents tend to be relevant to the same requests
- Clustering in the M-dimensional space (M: number of terms)
Clustering Methods

Sound methods, based on the document-to-document similarity matrix:
- graph-theoretic methods
- O(N²) time

Iterative methods, operating directly on the document vectors:
- O(N log N) or O(N²/log N) time
Sound Methods

1. Two documents with similarity > T (threshold) are connected with an edge [Duda & Hart, 1973]
- clusters: the connected components (maximal cliques) of the resulting graph
- problem: selection of an appropriate threshold T
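A sketch of the connected-components variant: connect documents whose similarity exceeds T and collect the components by depth-first search (the maximal-cliques variant is stricter and not shown):

```python
def threshold_clusters(sim, T):
    """sim: N x N similarity matrix. Connect i, j when sim[i][j] > T;
    clusters are the connected components of the resulting graph."""
    n = len(sim)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:  # iterative DFS over the threshold graph
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j not in seen and sim[i][j] > T:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(comp))
    return clusters

sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(threshold_clusters(sim, 0.5))  # [[0, 1], [2]]
```

Raising T splits clusters apart, lowering it merges them, which is exactly the threshold-selection problem noted above.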
2. Zahn's method [Zahn, 1971]
- Find the minimum spanning tree
- For each doc, delete edges with length l > lavg (lavg: the average length of its incident edges)
- Or remove the longest edge (1 edge removed ⇒ 2 clusters, 2 edges removed ⇒ 3 clusters)
- Clusters: the connected components of the resulting graph

[Figure: the dashed (inconsistent) edge is deleted]
Iterative Methods

K-means clustering (K known in advance):
- Choose some seed points (documents) as possible cluster centroids
- Repeat until the centroids do not change:
  - assign each vector (document) to its closest seed
  - compute the new centroids
  - reassign the vectors to improve the clusters
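The loop above can be sketched as plain k-means over document vectors; a fixed iteration count stands in for a convergence test, and seed documents are sampled at random:

```python
import random

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means: pick k seed documents as centroids, then alternate
    assignment (closest centroid, squared Euclidean distance) and
    centroid recomputation (mean of the cluster's members)."""
    rnd = random.Random(seed)
    centroids = [list(v) for v in rnd.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iterations):
        # assign each vector to its closest centroid
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(v, centroids[c])))
        # recompute each centroid as the mean of its members
        for c in range(k):
            members = [v for i, v in enumerate(vectors) if assign[i] == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign

vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]]
labels = kmeans(vectors, 2)
print(labels[0] == labels[1] and labels[2] == labels[3])  # True
```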
Cluster Searching

- The M-dimensional query vector is compared with the cluster centroids
- search the closest cluster
- retrieve the documents with similarity > T
References

- "Modern Information Retrieval", Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Addison-Wesley, 1999
- "Searching Multimedia Databases by Content", Christos Faloutsos, Kluwer Academic Publishers, 1996
- Information Retrieval Resources: http://nlp.stanford.edu/IR-book/information-retrieval.html
- TREC: http://trec.nist.gov/
- SMART: http://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
- LEMUR: http://www.lemurproject.org/
- LUCENE: http://lucene.apache.org/