chapter. 02: modeling contenue... 19/10/2015dr. almetwally mostafa 1


Upload: isabella-holt

Post on 02-Jan-2016


Page 1: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Chapter. 02: Modeling Contenue...


Page 2: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Objective: to capture the IR problem using a probabilistic framework

Given a user query, there is an ideal answer set
Querying is the specification of the properties of this ideal answer set (clustering)
But, what are these properties?
Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
Improve by iteration

Page 3: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Probabilistic Model

An initial set of documents is retrieved somehow
The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
The IR system uses this information to refine the description of the ideal answer set
By repeating this process, it is expected that the description of the ideal answer set will improve
Always keep in mind the need to guess, at the very beginning, the description of the ideal answer set
The description of the ideal answer set is modeled in probabilistic terms

Page 4: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Probabilistic Ranking Principle

Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).
The model assumes that this probability of relevance depends on the query and the document representations only.
The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.

But, how to compute the probabilities? What is the sample space?

Page 5: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

The Ranking

Probabilistic ranking is computed as:

$$ sim(q, d_j) = \frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)} $$

This is the odds (likelihood) of the document dj being relevant
Taking the odds minimizes the probability of an erroneous judgement

Definition:
  $w_{ij} \in \{0,1\}$
  $P(R \mid \vec{d_j})$: probability that the given doc is relevant
  $P(\bar{R} \mid \vec{d_j})$: probability that the doc is not relevant

Page 6: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

The Ranking

$$ sim(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})} = \frac{P(\vec{d_j} \mid R) \, P(R)}{P(\vec{d_j} \mid \bar{R}) \, P(\bar{R})} \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})} $$

$P(\vec{d_j} \mid R)$: probability of randomly selecting the document dj from the set R of relevant documents

Page 7: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

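The Ranking

$$ sim(d_j, q) \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})} \sim \frac{\left[ \prod_{g_i(\vec{d_j})=1} P(k_i \mid R) \right] \cdot \left[ \prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid R) \right]}{\left[ \prod_{g_i(\vec{d_j})=1} P(k_i \mid \bar{R}) \right] \cdot \left[ \prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid \bar{R}) \right]} $$

$P(k_i \mid R)$: probability that the index term ki is present in a document randomly selected from the set R of relevant documents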

Page 8: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

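The Ranking

$$ sim(d_j, q) \sim \log \frac{\left[ \prod_{g_i(\vec{d_j})=1} P(k_i \mid R) \right] \cdot \left[ \prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid R) \right]}{\left[ \prod_{g_i(\vec{d_j})=1} P(k_i \mid \bar{R}) \right] \cdot \left[ \prod_{g_i(\vec{d_j})=0} P(\bar{k_i} \mid \bar{R}) \right]} $$

$$ \sim \log K + \sum_{g_i(\vec{d_j})=1} \left[ \log \frac{P(k_i \mid R)}{P(\bar{k_i} \mid R)} + \log \frac{P(\bar{k_i} \mid \bar{R})}{P(k_i \mid \bar{R})} \right] $$

$$ \sim \sum_{i=1}^{t} w_{iq} \, w_{ij} \left( \log \frac{P(k_i \mid R)}{P(\bar{k_i} \mid R)} + \log \frac{P(\bar{k_i} \mid \bar{R})}{P(k_i \mid \bar{R})} \right) $$

where K is the same for all documents (so it does not affect the ranking), $P(\bar{k_i} \mid R) = 1 - P(k_i \mid R)$ and $P(\bar{k_i} \mid \bar{R}) = 1 - P(k_i \mid \bar{R})$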

Page 9: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

The Initial Ranking

$$ sim(d_j, q) \sim \sum_{i=1}^{t} w_{iq} \, w_{ij} \left( \log \frac{P(k_i \mid R)}{P(\bar{k_i} \mid R)} + \log \frac{P(\bar{k_i} \mid \bar{R})}{P(k_i \mid \bar{R})} \right) $$

Probabilities $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$?
Estimates based on assumptions:
  $P(k_i \mid R) = 0.5$
  $P(k_i \mid \bar{R}) = \frac{n_i}{N}$, where $n_i$ is the number of docs that contain ki
Use this initial guess to retrieve an initial ranking
Improve upon this initial ranking
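The initial guess can be turned into a ranking with a few lines of code. The sketch below is only an illustration under stated assumptions: the toy collection, the query, and the names `initial_weight` and `score` are hypothetical, weights are binary, and every query term is assumed to occur in at least one but not all documents so that the logarithms are defined.

```python
import math

# Toy collection (hypothetical data): each doc is the set of index terms it contains.
docs = {
    "d1": {"gold", "truck"},
    "d2": {"gold", "silver"},
    "d3": {"silver"},
    "d4": {"fire"},
    "d5": {"silver", "fire"},
}
query = {"gold", "truck"}
N = len(docs)


def initial_weight(n_i, N):
    """Term weight under the initial guess P(ki|R) = 0.5 and P(ki|~R) = ni/N.

    Assumes 0 < n_i < N so that both logarithms are defined.
    """
    p_rel = 0.5
    p_nonrel = n_i / N
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)


def score(doc_terms):
    """Sum the weights of the query terms present in the doc (binary wiq, wij)."""
    total = 0.0
    for term in query & doc_terms:
        n_i = sum(1 for terms in docs.values() if term in terms)
        total += initial_weight(n_i, N)
    return total


scores = {doc_id: score(terms) for doc_id, terms in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With P(ki|R) fixed at 0.5 the first logarithm vanishes, so the initial ranking is driven only by how rare each query term is in the collection.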

Page 10: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Improving the Initial Ranking

$$ sim(d_j, q) \sim \sum_{i=1}^{t} w_{iq} \, w_{ij} \left( \log \frac{P(k_i \mid R)}{P(\bar{k_i} \mid R)} + \log \frac{P(\bar{k_i} \mid \bar{R})}{P(k_i \mid \bar{R})} \right) $$

Let
  V : set of docs initially retrieved
  Vi : subset of docs retrieved that contain ki
  (V and Vi also denote the sizes of these sets)
Reevaluate estimates:
  $P(k_i \mid R) = \frac{V_i}{V}$
  $P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}$
Repeat recursively

Page 11: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Improving the Initial Ranking

$$ sim(d_j, q) \sim \sum_{i=1}^{t} w_{iq} \, w_{ij} \left( \log \frac{P(k_i \mid R)}{P(\bar{k_i} \mid R)} + \log \frac{P(\bar{k_i} \mid \bar{R})}{P(k_i \mid \bar{R})} \right) $$

To avoid problems with V=1 and Vi=0:
  $P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}$
  $P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$
Also,
  $P(k_i \mid R) = \frac{V_i + n_i/N}{V + 1}$
  $P(k_i \mid \bar{R}) = \frac{n_i - V_i + n_i/N}{N - V + 1}$
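The re-estimation step and its two adjusted variants can be compared side by side. This is a minimal sketch, not a reference implementation: the function names, the `smoothing` argument, and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def reestimate(V_i, V, n_i, N, smoothing="half"):
    """Return (P(ki|R), P(ki|~R)) estimated from the V retrieved docs.

    smoothing=None    -> raw estimates Vi/V and (ni - Vi)/(N - V)
    smoothing="half"  -> add 0.5 to the numerators and 1 to the denominators
    smoothing="prior" -> add ni/N to the numerators and 1 to the denominators
    """
    if smoothing is None:
        adjust, denom_adjust = 0.0, 0
    elif smoothing == "half":
        adjust, denom_adjust = 0.5, 1
    elif smoothing == "prior":
        adjust, denom_adjust = n_i / N, 1
    else:
        raise ValueError(f"unknown smoothing: {smoothing}")
    p_rel = (V_i + adjust) / (V + denom_adjust)
    p_nonrel = (n_i - V_i + adjust) / (N - V + denom_adjust)
    return p_rel, p_nonrel


def term_weight(p_rel, p_nonrel):
    """Per-term contribution to sim(dj, q); undefined when a probability is 0 or 1."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)


# Hypothetical numbers: 10 docs retrieved, none contains ki, ki occurs in 20 of 1000 docs.
raw = reestimate(0, 10, 20, 1000, smoothing=None)      # P(ki|R) is exactly 0
half = reestimate(0, 10, 20, 1000, smoothing="half")
prior = reestimate(0, 10, 20, 1000, smoothing="prior")
print(raw, half, prior)
print(term_weight(*half))
```

The raw estimate gives P(ki|R) = 0 here, for which the logarithm is undefined; both adjusted variants keep the estimate strictly between 0 and 1.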

Page 12: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Pluses and Minuses

Advantages:
  Docs ranked in decreasing order of probability of relevance
Disadvantages:
  need to guess initial estimates for P(ki | R)
  method does not take into account tf and idf factors

Page 13: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1


Brief Comparison of Classic Models

Boolean model does not provide for partial matches and is considered to be the weakest classic model

Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections

This seems also to be the view of the research community

Page 14: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

The Boolean model imposes a binary criterion for deciding relevance

The question of how to extend the Boolean model to accommodate partial matching and a ranking has attracted considerable attention in the past

We now discuss two set theoretic models for this:
  Fuzzy Set Model
  Extended Boolean Model

Page 15: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Fuzzy Set Model

Queries and docs are represented by sets of index terms: matching is approximate from the start
This vagueness can be modeled using a fuzzy framework, as follows:
  with each term is associated a fuzzy set
  each doc has a degree of membership in this fuzzy set
This interpretation provides the foundation for many models for IR based on fuzzy theory
Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)

Page 16: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1


Fuzzy Set Theory

Framework for representing classes whose boundaries are not well defined

Key idea is to introduce the notion of a degree of membership associated with the elements of a set

This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership

Thus, membership is now a gradual notion, contrary to the crisp notion enforced by classic Boolean logic

Page 17: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Fuzzy Set Theory

Definition
  A fuzzy subset A of U is characterized by a membership function
  $$ \mu_A : U \rightarrow [0,1] $$
  which associates with each element u of U a number $\mu_A(u)$ in the interval [0,1]

Definition
  Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of A. Then,
  $$ \mu_{\neg A}(u) = 1 - \mu_A(u) $$
  $$ \mu_{A \cup B}(u) = \max(\mu_A(u), \mu_B(u)) $$
  $$ \mu_{A \cap B}(u) = \min(\mu_A(u), \mu_B(u)) $$
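The three operations in the definition can be sketched directly. The representation below (a fuzzy subset as a dict from element to membership degree) and the sample values are assumptions made for illustration; both subsets are assumed to be defined over the same universe U.

```python
def complement(mu_a):
    """Membership in the complement: 1 - mu_A(u)."""
    return {u: 1 - degree for u, degree in mu_a.items()}


def union(mu_a, mu_b):
    """Membership in A union B: max(mu_A(u), mu_B(u))."""
    return {u: max(mu_a[u], mu_b[u]) for u in mu_a}


def intersection(mu_a, mu_b):
    """Membership in A intersect B: min(mu_A(u), mu_B(u))."""
    return {u: min(mu_a[u], mu_b[u]) for u in mu_a}


# Hypothetical fuzzy subsets over the universe U = {u1, u2, u3}.
A = {"u1": 0.25, "u2": 0.75, "u3": 1.0}
B = {"u1": 0.5, "u2": 0.5, "u3": 0.0}

print(complement(A))
print(union(A, B))
print(intersection(A, B))
```

When every degree is either 0 or 1, these operations reduce to the ordinary Boolean set operations.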

Page 18: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Fuzzy Information Retrieval

Fuzzy sets are modeled based on a thesaurus
This thesaurus is built as follows:
  Let $\vec{c}$ be a term-term correlation matrix
  Let c(i,l) be a normalized correlation factor for (ki,kl):

  $$ c_{i,l} = \frac{n_{i,l}}{n_i + n_l - n_{i,l}} $$

  • ni: number of docs which contain ki
  • nl: number of docs which contain kl
  • n(i,l): number of docs which contain both ki and kl

We now have the notion of proximity among index terms.
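As a small illustration of the correlation factor, the sketch below counts document frequencies over a toy collection. The collection and the function name `correlation` are hypothetical; returning 0 when neither term occurs is a convention chosen here, not something the slides specify.

```python
# Toy collection (hypothetical data): each doc is the set of terms it contains.
docs = [
    {"a", "b"},
    {"a"},
    {"b", "c"},
    {"a", "b", "c"},
]


def correlation(docs, k_i, k_l):
    """Normalized correlation c(i,l) = n(i,l) / (ni + nl - n(i,l))."""
    n_i = sum(1 for terms in docs if k_i in terms)
    n_l = sum(1 for terms in docs if k_l in terms)
    n_il = sum(1 for terms in docs if k_i in terms and k_l in terms)
    denominator = n_i + n_l - n_il
    # Assumption: terms that never occur are treated as uncorrelated.
    return n_il / denominator if denominator else 0.0


print(correlation(docs, "a", "b"))
print(correlation(docs, "a", "c"))
print(correlation(docs, "a", "a"))
```

The factor is symmetric, equals 1 for a term paired with itself, and equals 0 for terms that never co-occur.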

Page 19: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Fuzzy Information Retrieval

The correlation factor c(i,l) can be used to define fuzzy set membership for a document dj as follows:

$$ \mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l}) $$

$\mu_{i,j}$ : membership of doc dj in the fuzzy subset associated with ki

The above expression computes an algebraic sum over all terms in the doc dj (implemented as the complement of a negated algebraic product)

A doc dj belongs to the fuzzy set for ki, if its own terms are associated with ki
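The membership formula can be evaluated term by term. In this sketch the correlation factors are given as a hypothetical lookup table rather than computed from a collection, and missing pairs are assumed to have correlation 0.

```python
# Hypothetical correlation factors c(i,l) between term "a" and other terms.
c = {
    ("a", "a"): 1.0,
    ("a", "b"): 0.5,
    ("a", "c"): 0.25,
}


def corr(k_i, k_l):
    """Symmetric lookup of c(i,l); pairs not listed are assumed uncorrelated."""
    return c.get((k_i, k_l), c.get((k_l, k_i), 0.0))


def membership(k_i, doc_terms):
    """mu(i,j) = 1 - product over the terms kl of dj of (1 - c(i,l))."""
    product = 1.0
    for k_l in doc_terms:
        product *= 1.0 - corr(k_i, k_l)
    return 1.0 - product


print(membership("a", {"b", "c"}))  # doc without "a" but with related terms
print(membership("a", {"a", "b"}))  # doc containing "a" itself
print(membership("a", {"z"}))       # doc with only unrelated terms
```

A document that does not contain ki can still have a non-zero membership when it contains terms correlated with ki, which is what allows partial matching.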

Page 20: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Fuzzy Information Retrieval

$$ \mu_{i,j} = 1 - \prod_{k_l \in d_j} (1 - c_{i,l}) $$

$\mu_{i,j}$ : membership of doc dj in the fuzzy subset associated with ki

If doc dj contains a term kl which is closely related to ki, we have
  $c_{i,l} \sim 1$
  $\mu_{i,j} \sim 1$
  index ki is a good fuzzy index for the doc

Page 21: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

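Fuzzy IR: An Example

$$ q = k_a \wedge (k_b \vee \neg k_c) $$

$$ \vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = \vec{cc_1} + \vec{cc_2} + \vec{cc_3} $$

$$ \mu_{q,j} = \mu_{cc_1 + cc_2 + cc_3,\, j} = 1 - (1 - \mu_{a,j} \mu_{b,j} \mu_{c,j}) \cdot (1 - \mu_{a,j} \mu_{b,j} (1 - \mu_{c,j})) \cdot (1 - \mu_{a,j} (1 - \mu_{b,j}) (1 - \mu_{c,j})) $$

[Figure: Venn diagram of the sets Ka, Kb, and Kc, with the regions cc1, cc2, and cc3 marked.]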
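The membership of a document in the query's fuzzy set follows directly from the three conjunctive components. The sketch below hard-codes this particular query; the function name and the sample membership values are hypothetical.

```python
def query_membership(mu_a, mu_b, mu_c):
    """Membership of dj in the fuzzy set of q = ka AND (kb OR NOT kc).

    mu_a, mu_b, mu_c are the memberships of dj in the fuzzy sets of ka, kb, kc.
    """
    cc1 = mu_a * mu_b * mu_c                # component (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)          # component (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)    # component (1,0,0)
    # Algebraic sum of the three components.
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)


print(query_membership(1.0, 1.0, 1.0))
print(query_membership(0.0, 1.0, 1.0))
print(query_membership(0.5, 0.5, 0.5))
```

With crisp memberships (0 or 1) the result agrees with the Boolean evaluation of the query; intermediate values yield a graded score.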

Page 22: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1


Fuzzy Information Retrieval

Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory

Experiments with standard test collections are not available

Difficult to compare at this time

Page 23: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Extended Boolean Model

Boolean retrieval is simple and elegant
But, no ranking is provided
How to extend the model?
  interpret conjunctions and disjunctions in terms of Euclidean distances

Page 24: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Boolean model is simple and elegant. But, no provision for a ranking
As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
Extend the Boolean model with the notions of partial matching and term weighting
Combine characteristics of the Vector model with properties of Boolean algebra

Page 25: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

The Idea

The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra

Let
  $q = k_x \wedge k_y$
  $w_{xj} = f_{xj} \cdot \frac{idf(x)}{\max_i idf(i)}$ be the weight associated with [kx,dj]
Further, wxj = x and wyj = y

Page 26: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

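The Idea: $q_{and} = k_x \wedge k_y$; wxj = x and wyj = y

[Figure (AND): unit square with horizontal axis x = wxj (term kx) and vertical axis y = wyj (term ky), corners (0,0) and (1,1), with the documents dj and dj+1 plotted as points.]

$$ sim(q_{and}, d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}} $$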

Page 27: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

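The Idea: $q_{or} = k_x \vee k_y$; wxj = x and wyj = y

[Figure (OR): unit square with horizontal axis x = wxj (term kx) and vertical axis y = wyj (term ky), corners (0,0) and (1,1), with the documents dj and dj+1 plotted as points.]

$$ sim(q_{or}, d_j) = \sqrt{\frac{x^2 + y^2}{2}} $$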
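Both two-term similarities are normalized Euclidean distances in the unit square: the AND score measures closeness to the corner (1,1), the OR score measures distance from the corner (0,0). A minimal sketch follows, with hypothetical function names and sample points.

```python
import math


def sim_and(x, y):
    """1 minus the normalized distance from (x, y) to the ideal point (1, 1)."""
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)


def sim_or(x, y):
    """Normalized distance from (x, y) to the worst point (0, 0)."""
    return math.sqrt((x ** 2 + y ** 2) / 2)


for point in [(0, 0), (1, 0), (0.5, 0.5), (1, 1)]:
    print(point, round(sim_and(*point), 4), round(sim_or(*point), 4))
```

A document containing only one of the two terms, such as (1, 0), gets an AND score above 0 and an OR score below 1, unlike in the pure Boolean model.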

Page 28: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Generalizing the Idea

We can extend the previous model to consider Euclidean distances in a t-dimensional space

This can be done using p-norms which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter

A generalized conjunctive query is given by
$$ q_{and} = k_1 \wedge^p k_2 \wedge^p \dots \wedge^p k_t $$

A generalized disjunctive query is given by
$$ q_{or} = k_1 \vee^p k_2 \vee^p \dots \vee^p k_t $$

Page 29: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Generalizing the Idea

$$ sim(q_{and}, d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m} \right)^{1/p} $$

$$ sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p} $$

Page 30: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Properties

$$ sim(q_{or}, d_j) = \left( \frac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p} $$

If p = 1 then (Vector like)
$$ sim(q_{or}, d_j) = sim(q_{and}, d_j) = \frac{x_1 + \dots + x_m}{m} $$

If p = ∞ then (Fuzzy like)
$$ sim(q_{or}, d_j) = \max_i (x_i) $$
$$ sim(q_{and}, d_j) = \min_i (x_i) $$
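The two limiting cases can be checked numerically. The sketch below implements the p-norm formulas for a list of term weights; the function names and the sample weights are hypothetical, and p = ∞ is handled as an explicit special case rather than as a numerical limit.

```python
import math


def sim_or(xs, p):
    """p-norm disjunction over the term weights xs."""
    if math.isinf(p):
        return max(xs)
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)


def sim_and(xs, p):
    """p-norm conjunction over the term weights xs."""
    if math.isinf(p):
        return min(xs)
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)


weights = [0.2, 0.5, 0.8]  # hypothetical weights x1, x2, x3 of one document

for p in (1, 2, 10, 200, math.inf):
    print(p, sim_or(weights, p), sim_and(weights, p))
```

At p = 1 both scores coincide with the plain average of the weights, and as p grows the OR score approaches the maximum weight while the AND score approaches the minimum.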

Page 31: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Properties

By varying p, we can make the model behave as a vector, as a fuzzy, or as an intermediary model

This is quite powerful and is a good argument in favor of the extended Boolean model

$$ (k_1 \vee^2 k_2) \wedge^\infty k_3 $$

k1 and k2 are to be used as in a vectorial retrieval while the presence of k3 is required.

Page 32: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Properties

$$ q = (k_1 \wedge^p k_2) \vee^p k_3 $$

$$ sim(q, d_j) = \left( \frac{\left( 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p}{2} \right)^{1/p} \right)^p + x_3^p}{2} \right)^{1/p} $$

Page 33: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Model is quite powerful
Properties are interesting and might be useful
Computation is somewhat complex
However, distributivity operation does not hold for ranking computation:
  $q_1 = (k_1 \vee k_2) \wedge k_3$
  $q_2 = (k_1 \wedge k_3) \vee (k_2 \wedge k_3)$
  $sim(q_1, d_j) \neq sim(q_2, d_j)$
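The failure of distributivity is easy to see numerically by evaluating the two queries bottom-up, using the score of each inner sub-query as the weight in the outer one. This is a sketch under assumptions: the helper names and the sample weights are hypothetical, and the same p is used for every operator.

```python
import math


def p_or(xs, p):
    """p-norm disjunction."""
    if math.isinf(p):
        return max(xs)
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)


def p_and(xs, p):
    """p-norm conjunction."""
    if math.isinf(p):
        return min(xs)
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)


def sim_q1(x1, x2, x3, p):
    """q1 = (k1 OR k2) AND k3, evaluated from the inside out."""
    return p_and([p_or([x1, x2], p), x3], p)


def sim_q2(x1, x2, x3, p):
    """q2 = (k1 AND k3) OR (k2 AND k3), evaluated from the inside out."""
    return p_or([p_and([x1, x3], p), p_and([x2, x3], p)], p)


x1, x2, x3 = 1.0, 0.0, 1.0  # hypothetical weights of k1, k2, k3 in one document
print(sim_q1(x1, x2, x3, 2), sim_q2(x1, x2, x3, 2))
print(sim_q1(x1, x2, x3, math.inf), sim_q2(x1, x2, x3, math.inf))
```

For p = 2 the two logically equivalent queries score this document differently, while for p = ∞ (the fuzzy-like max/min case) the scores agree.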

Page 34: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Classic models enforce independence of index terms.

For the Vector model:
  The set of term vectors {k1, k2, ..., kt} is linearly independent and forms a basis for the subspace of interest.
  Frequently, this is interpreted as: $\forall i \neq j: \vec{k_i} \cdot \vec{k_j} = 0$

In 1985, Wong, Ziarko, and Wong proposed an interpretation in which the set of terms is linearly independent, but not pairwise orthogonal.

Page 35: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Key Idea:

In the generalized vector model, two index terms might be non-orthogonal and are represented in terms of smaller components (minterms).

As before, let
  wij be the weight associated with [ki,dj]
  {k1, k2, ..., kt} be the set of all terms

If these weights are all binary, all patterns of occurrence of terms within documents can be represented by the minterms:

  m1 = (0,0, ..., 0)
  m2 = (1,0, ..., 0)
  m3 = (0,1, ..., 0)
  m4 = (1,1, ..., 0)
  m5 = (0,0,1, ..., 0)
  ...
  $m_{2^t}$ = (1,1,1, ..., 1)

Here, m2 indicates documents in which solely the term k1 occurs.

Page 36: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Key Idea:

The basis for the generalized vector model is formed by a set of $2^t$ vectors defined over the set of minterms, as follows:

  $\vec{m_1}$ = (1, 0, 0, ..., 0, 0)
  $\vec{m_2}$ = (0, 1, 0, ..., 0, 0)
  $\vec{m_3}$ = (0, 0, 1, ..., 0, 0)
  ...
  $\vec{m_{2^t}}$ = (0, 0, 0, ..., 0, 1)

Notice that $\forall i \neq j: \vec{m_i} \cdot \vec{m_j} = 0$, i.e., they are pairwise orthogonal

Page 37: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1


Key Idea:

Minterm vectors are pairwise orthogonal. But, this does not mean that the index terms are independent:

The minterm m4 is given by:m4 = (1, 1, 0, ..., 0, 0)

This minterm indicates the occurrence of the terms k1 and k2 within the same document. If such a document exists in a collection, we say that the minterm m4 is active and that a dependency between these two terms is induced.

The generalized vector model adopts as a basic foundation the notion that co-occurrence of terms within documents induces dependencies among them.

Page 38: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Forming the Term Vectors

The vector associated with the term ki is computed as:

$$ \vec{k_i} = \frac{\sum_{\forall r,\, g_i(m_r)=1} c_{i,r} \, \vec{m_r}}{\sqrt{\sum_{\forall r,\, g_i(m_r)=1} c_{i,r}^2}} $$

$$ c_{i,r} = \sum_{d_j \,\mid\, \forall l,\; g_l(\vec{d_j}) = g_l(m_r)} w_{ij} $$

The weight $c_{i,r}$ associated with the pair [ki,mr] sums up the weights of the term ki in all the documents which have a term occurrence pattern given by mr.

Notice that for a collection of size N, at most N minterms affect the ranking (and not $2^t$).

Page 39: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Dependency between Index Terms

A degree of correlation between the terms ki and kj can now be computed as:

$$ \vec{k_i} \cdot \vec{k_j} = \sum_{\forall r \,\mid\, g_i(m_r)=1 \,\wedge\, g_j(m_r)=1} c_{i,r} \cdot c_{j,r} $$

This degree of correlation sums up (in a weighted form) the dependencies between ki and kj induced by the documents in the collection (represented by the mr minterms).

Page 40: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

The Generalized Vector Model: An Example

[Figure: Venn diagram of the terms k1, k2, and k3, with the documents d1 to d7 placed in the regions they belong to.]

      k1   k2   k3
d1     2    0    1
d2     1    0    0
d3     0    1    3
d4     2    0    0
d5     1    2    4
d6     0    2    2
d7     0    5    0

q      1    2    3

Page 41: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Computation of $c_{i,r}$

$$ c_{i,r} = \sum_{d_j \,\mid\, \forall l,\; g_l(\vec{d_j}) = g_l(m_r)} w_{ij} $$

Term weights wij:

      k1   k2   k3
d1     2    0    1
d2     1    0    0
d3     0    1    3
d4     2    0    0
d5     1    2    4
d6     0    2    2
d7     0    5    0

q      1    2    3

Term occurrence patterns (minterms):

            k1   k2   k3
d1 = m6      1    0    1
d2 = m2      1    0    0
d3 = m7      0    1    1
d4 = m2      1    0    0
d5 = m8      1    1    1
d6 = m7      0    1    1
d7 = m3      0    1    0

q  = m8      1    1    1

Resulting factors:

      c1,r  c2,r  c3,r
m1      0     0     0
m2      3     0     0
m3      0     5     0
m4      0     0     0
m5      0     0     0
m6      2     0     1
m7      0     3     5
m8      1     2     4
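The table of c(i,r) factors can be reproduced mechanically from the term weights. In the sketch below the helper `minterm_index` and the dict layout are assumptions made for illustration; minterms are numbered as in the slides, with k1 as the lowest-order position so that m2 = (1,0,0) and m8 = (1,1,1).

```python
# Term weights (k1, k2, k3) of the seven example documents.
docs = {
    "d1": (2, 0, 1),
    "d2": (1, 0, 0),
    "d3": (0, 1, 3),
    "d4": (2, 0, 0),
    "d5": (1, 2, 4),
    "d6": (0, 2, 2),
    "d7": (0, 5, 0),
}


def minterm_index(weights):
    """1-based index r of the minterm m_r matching the doc's occurrence pattern."""
    return 1 + sum(1 << i for i, w in enumerate(weights) if w > 0)


# c[r] holds [c(1,r), c(2,r), c(3,r)] for every active minterm m_r.
c = {}
for weights in docs.values():
    r = minterm_index(weights)
    totals = c.setdefault(r, [0, 0, 0])
    for i, w_ij in enumerate(weights):
        totals[i] += w_ij

for r in sorted(c):
    print(f"m{r}", c[r])
```

Only the minterms that actually occur in the collection (here m2, m3, m6, m7, and m8) receive an entry; the remaining rows of the table are all zero.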

Page 42: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Computation of Index Term Vectors

      c1,r  c2,r  c3,r
m1      0     0     0
m2      3     0     0
m3      0     5     0
m4      0     0     0
m5      0     0     0
m6      2     0     1
m7      0     3     5
m8      1     2     4

$$ \vec{k_1} = \frac{3 \vec{m_2} + 2 \vec{m_6} + \vec{m_8}}{\sqrt{3^2 + 2^2 + 1^2}} $$

$$ \vec{k_2} = \frac{5 \vec{m_3} + 3 \vec{m_7} + 2 \vec{m_8}}{\sqrt{5^2 + 3^2 + 2^2}} $$

$$ \vec{k_3} = \frac{1 \vec{m_6} + 5 \vec{m_7} + 4 \vec{m_8}}{\sqrt{1^2 + 5^2 + 4^2}} $$

Page 43: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Computation of Document Vectors

  d1 = 2 k1 + k3
  d2 = k1
  d3 = k2 + 3 k3
  d4 = 2 k1
  d5 = k1 + 2 k2 + 4 k3
  d6 = 2 k2 + 2 k3
  d7 = 5 k2
  q = k1 + 2 k2 + 3 k3

      k1   k2   k3
d1     2    0    1
d2     1    0    0
d3     0    1    3
d4     2    0    0
d5     1    2    4
d6     0    2    2
d7     0    5    0

q      1    2    3

Page 44: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Ranking Computation

$$ \vec{k_1} = \frac{3 \vec{m_2} + 2 \vec{m_6} + \vec{m_8}}{\sqrt{3^2 + 2^2 + 1^2}} $$

$$ \vec{k_2} = \frac{5 \vec{m_3} + 3 \vec{m_7} + 2 \vec{m_8}}{\sqrt{5^2 + 3^2 + 2^2}} $$

$$ \vec{k_3} = \frac{1 \vec{m_6} + 5 \vec{m_7} + 4 \vec{m_8}}{\sqrt{1^2 + 5^2 + 4^2}} $$

  d1 = 2 k1 + k3
  d2 = k1
  d3 = k2 + 3 k3
  d4 = 2 k1
  d5 = k1 + 2 k2 + 4 k3
  d6 = 2 k2 + 2 k3
  d7 = 5 k2
  q = k1 + 2 k2 + 3 k3
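Putting the pieces together, the term vectors, document vectors, and a ranking can be computed in the space of minterms. The sketch below assumes the standard cosine measure for the final comparison, which the slides do not spell out; the helper names are hypothetical, and minterm positions are stored 0-based (position r-1 holds m_r).

```python
import math

docs = {
    "d1": (2, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 3), "d4": (2, 0, 0),
    "d5": (1, 2, 4), "d6": (0, 2, 2), "d7": (0, 5, 0),
}
query = (1, 2, 3)
T = 3  # number of index terms


def minterm_position(weights):
    """0-based position of the minterm matching the occurrence pattern."""
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


# c[i][r]: sum of the weights of term i over the docs whose pattern is minterm r.
c = [[0.0] * (2 ** T) for _ in range(T)]
for weights in docs.values():
    r = minterm_position(weights)
    for i in range(T):
        c[i][r] += weights[i]


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Term vectors k_i expressed over the orthonormal minterm basis.
k = [normalize(row) for row in c]


def to_minterm_space(weights):
    """Map term weights to the vector sum_i w_i * k_i."""
    return [sum(weights[i] * k[i][r] for i in range(T)) for r in range(2 ** T)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


q_vec = to_minterm_space(query)
scores = {name: cosine(to_minterm_space(w), q_vec) for name, w in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
k1_dot_k2 = sum(a * b for a, b in zip(k[0], k[1]))
print(ranking)
print(k1_dot_k2)
```

The non-zero dot product between k1 and k2 comes entirely from the shared minterm m8, that is, from document d5, in which both terms occur together.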

Page 45: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Model considers correlations among index terms

Not clear in which situations it is superior to the standard Vector model

Computation costs are higher

Model does introduce interesting new ideas

Page 46: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Latent Semantic Indexing

Classic IR might lead to poor retrieval due to:
  unrelated documents might be included in the answer set
  relevant documents that do not contain at least one index term are not retrieved

Reasoning: retrieval based on index terms is vague and noisy

The user information need is more related to concepts and ideas than to index terms

A document that shares concepts with another document known to be relevant might be of interest

Page 47: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Latent Semantic Indexing

The key idea is to map documents and queries into a lower dimensional space (i.e., a space composed of higher level concepts, which are fewer in number than the index terms)

Retrieval in this reduced concept space might be superior to retrieval in the space of index terms

Page 48: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Latent Semantic Indexing

Definitions
  Let t be the total number of index terms
  Let N be the number of documents
  Let (Mij) be a term-document matrix with t rows and N columns
  To each element of this matrix is assigned a weight wij associated with the pair [ki,dj]
  The weight wij can be based on a tf-idf weighting scheme

Page 49: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Latent Semantic Indexing

The matrix (Mij) can be decomposed into 3 matrices (singular value decomposition) as follows:

$$ (M_{ij}) = (K) \, (S) \, (D)^t $$

  (K) is the matrix of eigenvectors derived from $(M)(M)^t$
  $(D)^t$ is the matrix of eigenvectors derived from $(M)^t(M)$
  (S) is an r x r diagonal matrix of singular values, where r = min(t,N) is the rank of (Mij) when the matrix has full rank

Page 50: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Computing an Example

Let (Mij) be given by the matrix

      k1   k2   k3   q • dj
d1     2    0    1      5
d2     1    0    0      1
d3     0    1    3     11
d4     2    0    0      2
d5     1    2    4     17
d6     1    2    0      5
d7     0    5    0     10

q      1    2    3

Compute the matrices (K), (S), and $(D)^t$
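One way to carry out this computation is with an off-the-shelf SVD routine. The sketch below assumes NumPy is available and uses `numpy.linalg.svd`; the table is transposed so that terms are rows and documents are columns, matching the t x N definition of (Mij). The variable names are illustrative.

```python
import numpy as np

# Term-document matrix: rows are k1, k2, k3; columns are d1..d7.
M = np.array([
    [2, 1, 0, 2, 1, 1, 0],   # k1
    [0, 0, 1, 0, 2, 2, 5],   # k2
    [1, 0, 3, 0, 4, 0, 0],   # k3
], dtype=float)

# Reduced SVD: K is t x r, s holds the r singular values, Dt is r x N.
K, s, Dt = np.linalg.svd(M, full_matrices=False)
S = np.diag(s)

print("K =\n", K)
print("S =\n", S)
print("D^t =\n", Dt)

# The product of the three factors reconstructs the original matrix.
reconstruction = K @ S @ Dt
print(np.allclose(reconstruction, M))

# The squared singular values are the eigenvalues of M M^t.
eigenvalues = np.linalg.eigvalsh(M @ M.T)
```

The signs of corresponding columns of (K) and rows of (D)^t may differ between SVD implementations, so individual entries can flip sign without affecting the product.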

Page 51: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Latent Semantic Indexing

In the matrix (S), select only the s largest singular values
Keep the corresponding columns in (K) and $(D)^t$
The resultant matrix is called $(M)_s$ and is given by

$$ (M)_s = (K)_s \, (S)_s \, (D)_s^t $$

where s, s < r, is the dimensionality of the concept space

The parameter s should be
  large enough to allow fitting the characteristics of the data
  small enough to filter out the non-relevant representational details

Page 52: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Latent Ranking

The user query can be modelled as a pseudo-document in the original (M) matrix
Assume the query is modelled as the document numbered 0 in the (M) matrix
The matrix

$$ (M)_s^t \, (M)_s $$

quantifies the relationship between any two documents in the reduced concept space

The first row of this matrix provides the rank of all the documents with regard to the user query (represented as the document numbered 0)
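The latent ranking can be sketched end to end with NumPy (assumed to be available). The choice s = 2 and the variable names are hypothetical; the term weights reuse the seven-document example, with the query prepended as document 0.

```python
import numpy as np

# Columns are d1..d7; rows are k1, k2, k3 (weights from the example table).
doc_matrix = np.array([
    [2, 1, 0, 2, 1, 1, 0],
    [0, 0, 1, 0, 2, 2, 5],
    [1, 0, 3, 0, 4, 0, 0],
], dtype=float)
query = np.array([1, 2, 3], dtype=float)

# Model the query as the pseudo-document numbered 0.
M = np.column_stack([query, doc_matrix])

K, singular_values, Dt = np.linalg.svd(M, full_matrices=False)

s_dim = 2  # hypothetical dimensionality of the concept space (s < r)
M_s = K[:, :s_dim] @ np.diag(singular_values[:s_dim]) @ Dt[:s_dim, :]

# Document-document relationships in the reduced space.
relations = M_s.T @ M_s

# First row: relationship of the query (doc 0) to d1..d7.
query_row = relations[0, 1:]
ranking = np.argsort(-query_row) + 1  # document numbers, best first
print(query_row)
print(ranking)
```

Choosing a smaller s merges more term-level detail into fewer concepts, so documents can score well against the query without sharing its exact terms.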

Page 53: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Latent semantic indexing provides an interesting conceptualization of the IR problem

It allows reducing the complexity of the underlying representational framework, which might be explored, for instance, with the purpose of interfacing with the user

Page 54: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Classic IR:
  Terms are used to index documents and queries
  Retrieval is based on index term matching

Motivation:
  Neural networks are known to be good pattern matchers

Page 55: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Neural Networks:
  The human brain is composed of billions of neurons
  Each neuron can be viewed as a small processing unit
  A neuron is stimulated by input signals and emits output signals in reaction
  A chain reaction of propagating signals is called a spread activation process
  As a result of spread activation, the brain might command the body to take physical reactions

Page 56: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Neural Network Model

A neural network is an oversimplified representation of the neuron interconnections in the human brain:
  nodes are processing units
  edges are synaptic connections
  the strength of a propagating signal is modelled by a weight assigned to each edge
  the state of a node is defined by its activation level
  depending on its activation level, a node might issue an output signal

Page 57: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Neural Network for IR:

From the work by Wilkinson & Hingston, SIGIR'91

[Figure: three-layer network. Query term nodes (ka, kb, kc) connect to document term nodes (k1, ..., ka, kb, kc, ..., kt), which connect to document nodes (d1, ..., dj, dj+1, ..., dN).]

Page 58: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Neural Network for IR

Three-layer network
Signals propagate across the network
First level of propagation:
  Query terms issue the first signals
  These signals propagate across the network to reach the document nodes
Second level of propagation:
  Document nodes might themselves generate new signals which affect the document term nodes
  Document term nodes might respond with new signals of their own

Page 59: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Quantifying Signal Propagation

Normalize signal strength (MAX = 1)
Query terms emit initial signal equal to 1
Weight associated with an edge from a query term node ki to a document term node ki:

$$ \overline{W}_{iq} = \frac{w_{iq}}{\sqrt{\sum_i w_{iq}^2}} $$

Weight associated with an edge from a document term node ki to a document node dj:

$$ \overline{W}_{ij} = \frac{w_{ij}}{\sqrt{\sum_i w_{ij}^2}} $$

Page 60: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Quantifying Signal Propagation

After the first level of signal propagation, the activation level of a document node dj is given by:

$$ \sum_i \overline{W}_{iq} \, \overline{W}_{ij} = \frac{\sum_i w_{iq} \, w_{ij}}{\sqrt{\sum_i w_{iq}^2} \cdot \sqrt{\sum_i w_{ij}^2}} $$

which is exactly the ranking of the Vector model
New signals might be exchanged among document term nodes and document nodes in a process analogous to a feedback cycle
A minimum threshold should be enforced to avoid spurious signal generation
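The first level of propagation can be checked with a few lines of code. This sketch covers only that first level (no feedback cycle and no threshold); the function names and the sample weights are hypothetical, and all weight vectors are assumed to be non-zero.

```python
import math


def normalize(weights):
    """Edge weights: each w divided by the Euclidean norm of its vector."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


def activation_levels(query, docs):
    """Activation of each document node after the first propagation level.

    Query term nodes emit a signal of 1, scaled by the normalized query weight
    and then by the normalized document term weight along each path.
    """
    query_edges = normalize(query)
    levels = {}
    for name, weights in docs.items():
        doc_edges = normalize(weights)
        levels[name] = sum(wq * wd for wq, wd in zip(query_edges, doc_edges))
    return levels


# Hypothetical weights over three terms (k1, k2, k3).
query = (1, 2, 3)
docs = {"d1": (2, 0, 1), "d5": (1, 2, 4), "d7": (0, 5, 0)}

levels = activation_levels(query, docs)
print(levels)
```

Each activation level equals the cosine of the angle between the query and document vectors, which is why the first propagation step reproduces the Vector model ranking.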

Page 61: Chapter. 02: Modeling Contenue... 19/10/2015Dr. Almetwally Mostafa 1

Conclusions

Model provides an interesting formulation of the IR problem
Model has not been tested extensively
It is not clear what improvements the model might provide