3.3 Probabilistic IR (htw saar, 2019-02-28)
3.3 Probabilistic IR
§ The vector space model is commonly criticized for being heuristic and lacking a clear model of when a document should be considered relevant
§ Probabilistic IR relies on probability theory to model the event that a document d is relevant to a query q
§ This probability is then estimated based on the terms contained in the document and the query
Information Retrieval / Chapter 3: Retrieval Models
Events and Probabilities
§ Let's consider two events A and B
§ A is the event that an object is a circle
§ B is the event that an object is green
§ We refer to A ∧ B as the joint event that an object is a green circle

P[A] = 5/9    P[B] = 4/9    P[A · B] = P[A, B] = 3/9
Conditional Probabilities
§ The conditional probability P[B | A] (B given A) is the probability that the event B occurs if we already know that the event A has occurred

P[B | A] = P[A · B] / P[A]

§ In our example: P[B | A] = 3/5 and P[A | B] = 3/4
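These values can be reproduced with a few lines of Python (a sketch; the counts are taken from the running example above: 9 objects, 5 circles, 4 green objects, 3 green circles):

```python
# Toy recreation of the slide's running example.
from fractions import Fraction

n_total, n_A, n_B, n_AB = 9, 5, 4, 3

p_A = Fraction(n_A, n_total)    # P[A] = 5/9
p_B = Fraction(n_B, n_total)    # P[B] = 4/9
p_AB = Fraction(n_AB, n_total)  # P[A, B] = 3/9

# P[B | A] = P[A, B] / P[A] and P[A | B] = P[A, B] / P[B]
p_B_given_A = p_AB / p_A  # 3/5
p_A_given_B = p_AB / p_B  # 3/4
```

Using exact fractions avoids any floating point rounding in this small example.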
Independence
§ Two events A and B are called (stochastically) independent if the following holds for their joint probability

P[A · B] = P[A] · P[B]

§ In our example, the events A and B are not independent, since 3/9 ≠ 5/9 · 4/9
Bayes' Theorem
§ Thomas Bayes (1701-1761) is known for the following theorem regarding the conditional probabilities of events

P[A | B] = P[B | A] · P[A] / P[B]

§ Bayes' theorem is particularly useful when, for two events A and B, one of the conditional probabilities is easy to estimate, but the other is hard to estimate
Bayes' Theorem in Action
§ Example: Examining animals in the wild
§ A is the event that the animal is a fox
§ B is the event that the animal has rabies ("Tollwut")
§ Assume that we know the following probabilities
§ P[A] = 0.1 (e.g., estimated based on video surveillance)
§ P[B] = 0.05 (e.g., estimated based on hunted animals)
§ P[A | B] = 0.25 (e.g., estimated based on deceased animals)
§ We can now estimate the probability that a fox has rabies

P[B | A] = P[A | B] · P[B] / P[A] = 0.25 · 0.05 / 0.1 = 0.125
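The calculation above can be checked directly (a minimal sketch using the probabilities from the slide):

```python
import math

# Applying Bayes' theorem to the fox/rabies example.
p_A = 0.1            # P[A]: animal is a fox
p_B = 0.05           # P[B]: animal has rabies
p_A_given_B = 0.25   # P[A | B]

# P[B | A] = P[A | B] * P[B] / P[A]
p_B_given_A = p_A_given_B * p_B / p_A  # about 0.125
```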
Probabilistic Ranking Principle (PRP)
§ The Probabilistic Ranking Principle (PRP) suggests that documents should be ranked in descending order of their probability of being relevant to the query

P[R = 1 | d, q]

(R = 1 indicates the event of observing a relevant document)

§ PRP maximizes precision under the assumptions that the probabilities can be determined exactly and that they are independent (both questionable assumptions)
Binary Independence Model
§ The Binary Independence Model (BIM) considers documents and queries as sets of terms, i.e., a term either occurs in a document or it doesn't
§ BIM assumes that terms occur independently from each other in documents (a questionable assumption)
§ Documents are ranked, following the PRP, according to their probability P[R = 1 | d, q] with

P[R = 1 | d, q] + P[R = 0 | d, q] = 1
Binary Independence Model
§ We obtain the same ranking of documents if we consider their so-called odds ratios

O[R | d, q] = P[R = 1 | d, q] / P[R = 0 | d, q]

§ Applying Bayes' theorem, we obtain

O[R | d, q] = P[R = 1 | q] / P[R = 0 | q] · P[d | R = 1, q] / P[d | R = 0, q]

where the first factor is constant (it depends only on q), so the ranking is determined by P[d | R = 1, q] / P[d | R = 0, q]
Binary Independence Model
§ Assuming that terms occur independently, we obtain

P[d | R = 1, q] / P[d | R = 0, q] = ∏_{v ∈ V} P[v | R = 1, q] / P[v | R = 0, q]

with V as the vocabulary of all known terms

§ Assuming that only terms from the query play a role

P[d | R = 1, q] / P[d | R = 0, q] ≈ ∏_{v ∈ q} P[v | R = 1, q] / P[v | R = 0, q]
Binary Independence Model
§ We can distinguish between terms that occur in a document and terms that don't
§ Let pv and uv denote the probabilities that a term v occurs in a relevant and irrelevant document, respectively

P[d | R = 1, q] / P[d | R = 0, q]
  ≈ ∏_{v ∈ q, v ∈ d} P[v | R = 1, q] / P[v | R = 0, q] · ∏_{v ∈ q, v ∉ d} P[v | R = 1, q] / P[v | R = 0, q]
  ≈ ∏_{v ∈ q, v ∈ d} pv / uv · ∏_{v ∈ q, v ∉ d} (1 − pv) / (1 − uv)
Binary Independence Model
§ This can be rewritten as

P[d | R = 1, q] / P[d | R = 0, q] ≈ ∏_{v ∈ q, v ∈ d} pv (1 − uv) / (uv (1 − pv)) · ∏_{v ∈ q} (1 − pv) / (1 − uv)

where the second factor is constant (it depends only on q), so that

P[d | R = 1, q] / P[d | R = 0, q] ∝ ∏_{v ∈ q, v ∈ d} pv (1 − uv) / (uv (1 − pv))
Computing with Probabilities
§ When representing probabilities as floating point numbers (e.g., double in Java) we have to worry about numerical imprecision
§ We can mitigate the problem of numerical imprecision by applying a logarithmic transformation, thus turning products into sums and operating with logarithms of probabilities
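The underflow problem and the logarithmic fix can be demonstrated in a few lines (a sketch; the per-term probability 1e-5 is an arbitrary illustrative value):

```python
import math

# A product of many small probabilities underflows in IEEE 754
# doubles, while the equivalent sum of log-probabilities does not.
probs = [1e-5] * 100  # hypothetical per-term probabilities

product = 1.0
for p in probs:
    product *= p          # underflows to 0.0 (true value is 1e-500)

log_sum = sum(math.log(p) for p in probs)  # about -1151.3, well-behaved
```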
Binary Independence Model
§ Applying a logarithmic transformation to the Binary Independence Model, we obtain

log ( ∏_{v ∈ q, v ∈ d} pv (1 − uv) / (uv (1 − pv)) ) = Σ_{v ∈ q, v ∈ d} log ( pv (1 − uv) / (uv (1 − pv)) ) = RSVd

§ We can return documents in descending order of their retrieval status value (RSVd) and obtain the same ranking that we would have obtained when computing with the actual probabilities
§ How can we estimate the probabilities pv and uv?
Binary Independence Model
§ Assuming that most documents in the document collection are irrelevant to any query, we estimate

uv = df(v) / |D|

as the probability that the term v occurs in a document that is irrelevant to the query
Binary Independence Model
§ We have no information about which documents are relevant to the query and thus estimate

pv = (1 − pv) = 0.5

as the probability that the term v occurs in a document that is relevant to the query
Binary Independence Model
§ The retrieval status value RSVd can thus be rewritten as the following variant of tf.idf

RSVd = Σ_{v ∈ q, v ∈ d} log ( pv (1 − uv) / (uv (1 − pv)) )
     = Σ_{v ∈ q, v ∈ d} log ( (1 − uv) / uv )
     = Σ_{v ∈ q, v ∈ d} log ( (1 − df(v)/|D|) / (df(v)/|D|) )
     = Σ_{v ∈ q, v ∈ d} log ( (|D| − df(v)) / df(v) )
     ≈ Σ_{v ∈ q, v ∈ d} log ( |D| / df(v) )

under the assumption that most terms occur rarely
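This simplified scoring function is easy to sketch in Python; the three-document toy collection and the query below are hypothetical, chosen only to illustrate the ranking:

```python
import math

# Minimal sketch of BIM ranking with RSVd = sum of log(|D| / df(v))
# over the query terms that occur in d.
docs = {
    "d1": {"probabilistic", "retrieval", "model"},
    "d2": {"vector", "space", "model"},
    "d3": {"probabilistic", "ranking", "principle"},
}
N = len(docs)  # |D|

df = {}  # document frequencies
for terms in docs.values():
    for v in terms:
        df[v] = df.get(v, 0) + 1

def rsv(query, doc_terms):
    return sum(math.log(N / df[v]) for v in query if v in doc_terms)

query = {"probabilistic", "model"}
ranking = sorted(docs, key=lambda d: rsv(query, docs[d]), reverse=True)
# d1 ranks first, since it contains both query terms
```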
Binary Independence Model
§ The Binary Independence Model has been shown to obtain good results on collections with documents of homogeneous lengths, but it does not work well when document lengths differ a lot (e.g., on the Web)
§ Relevance feedback by a user can be incorporated when estimating the probabilities pv and uv
§ While more principled than the vector space model, many of the assumptions made are questionable in practice (e.g., the independence of terms)
Okapi BM25
§ Okapi BM25 is a probabilistic retrieval model that builds on the Binary Independence Model but takes term frequencies into account
§ It assumes that term frequencies in relevant and irrelevant documents follow a Poisson distribution

P[tf(v, d) = k] = λ^k / k! · e^(−λ)

§ The derivation of the retrieval status value is beyond the scope of this lecture
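The Poisson probability mass function from the slide can be written directly (a sketch; λ = 2 is an arbitrary illustrative rate):

```python
import math

# Poisson pmf: P[tf(v, d) = k] = λ^k / k! * e^(-λ)
def poisson_pmf(k, lam):
    return lam ** k / math.factorial(k) * math.exp(-lam)

# sanity check: probabilities over k sum to 1 (first 50 terms, λ = 2)
total = sum(poisson_pmf(k, 2.0) for k in range(50))
```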
Okapi BM25
§ Documents are ranked according to

RSVd = Σ_{v ∈ q} ( (k1 + 1) · tf(v, d) ) / ( k1 · ((1 − b) + b · |d| / avdl) + tf(v, d) ) · log ( (|D| − df(v) + 0.5) / (df(v) + 0.5) )

§ Parameter k1 controls the influence of term frequencies
§ k1 = 0.0 yields a binary model similar to the BIM
§ k1 = 1.2 is a common choice in practice
§ Parameter b controls the normalization of term frequencies based on the document length |d| and the average document length avdl
§ b = 0.0 ignores document lengths
§ b = 0.75 is a common choice in practice
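The formula above translates into a short scoring function (a sketch under the slide's default parameters k1 = 1.2 and b = 0.75; the toy corpus and query are hypothetical):

```python
import math

# Sketch of the BM25 scoring formula shown above.
def bm25(query, doc, docs, k1=1.2, b=0.75):
    N = len(docs)                              # |D|
    avdl = sum(len(d) for d in docs) / N       # average document length
    score = 0.0
    for v in query:
        tf = doc.count(v)
        if tf == 0:
            continue
        df = sum(1 for d in docs if v in d)    # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5))
        length_norm = k1 * ((1 - b) + b * len(doc) / avdl) + tf
        score += (k1 + 1) * tf / length_norm * idf
    return score

docs = [
    "probabilistic retrieval model".split(),  # d0
    "vector space retrieval".split(),         # d1
    "okapi bm25 ranking".split(),             # d2
]
query = ["probabilistic", "model"]
scores = [bm25(query, d, docs) for d in docs]  # d0 scores highest
```

Note that the idf component of this formula can become negative when a term occurs in more than half of the documents; production implementations (e.g., Lucene) adjust for this.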
Okapi BM25
§ Okapi BM25F is an extension that can deal with fielded documents (e.g., title, abstract, body)
§ Okapi BM25 has been shown to yield excellent results in different settings and is considered one of the state-of-the-art retrieval models (e.g., it is available in Apache Lucene)
§ While more principled than the vector space model, many of the assumptions made are questionable in practice (e.g., the independence of terms)
Summary
§ Probabilistic IR relies on probability theory to model the event that a document is relevant to a query
§ The Probabilistic Ranking Principle suggests ranking documents according to their probability of being relevant
§ The Binary Independence Model considers whether a term occurs in a document or not and assumes term independence
§ Okapi BM25 is a more sophisticated model that yields good results and is considered state of the art
Literature
[1] C. D. Manning, P. Raghavan, and H. Schütze: Introduction to Information Retrieval, Cambridge University Press, 2008 (Chapter 11)
[2] W. B. Croft, D. Metzler, and T. Strohman: Search Engines – Information Retrieval in Practice, Pearson Education, 2009 (Chapter 7)