Basic IR: Modeling Basic IR Task: Match a subset of documents to the user’s query Slightly more complex: and rank the resulting documents by predicted relevance The derivation of relevance leads to different IR models.

Upload: faith-horn

Post on 14-Dec-2015


Page 1: Basic IR: Modeling

Basic IR Task: Match a subset of documents to the user’s query.

Slightly more complex: and rank the resulting documents by predicted relevance.

The derivation of relevance leads to different IR models.

Page 2: Concepts: Term-Document Incidence

Imagine a matrix of terms X documents with 1 when the term appears in the document and 0 otherwise.

How are queries satisfied? What are the problems?

        search  segment  select  semantic
MIR       1        0        1        1
AI        1        1        0        1
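As a minimal sketch of how such an incidence matrix could be built, assuming two toy documents whose texts are invented here so that they reproduce the rows above (the names `docs`, `terms`, and `incidence` are illustrative, not from the slides):

```python
# Toy corpus: the document texts are assumptions chosen to match the table.
docs = {
    "MIR": "search select semantic",
    "AI": "search segment semantic",
}

# Vocabulary: the sorted set of all terms that occur anywhere in the corpus.
terms = sorted({t for text in docs.values() for t in text.split()})

# incidence[doc][term] is 1 if the term occurs in the document, else 0.
incidence = {
    name: {t: int(t in text.split()) for t in terms}
    for name, text in docs.items()
}

for name, row in incidence.items():
    print(name, [row[t] for t in terms])
```

Each row only records presence or absence, which is why this representation alone cannot support ranking.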

Page 3: Concepts: Term Frequency

To support document ranking, we need more than just term incidence.

Term frequency records the number of times a given term appears in each document.

Intuition: The more times a term appears in a document, the more central it is to the topic of the document.
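Counting term frequency can be sketched in a few lines. The tokenization below (lowercasing and splitting on whitespace) is an assumption made for illustration; real systems usually do more.

```python
from collections import Counter

def term_frequencies(text):
    """Count how many times each whitespace-separated term occurs."""
    return Counter(text.lower().split())

tf = term_frequencies("search engines rank search results")
# A Counter returns 0 for terms that never occur.
print(tf["search"], tf["rank"], tf["missing"])
```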

Page 4: Concept: Term Weight

Weights represent the importance of a given term for characterizing a document.

wij is a weight for term i in document j.

Page 5: Mapping Task and Document Type to Model

                        Index Terms   Full Text         Full Text + Structure
Searching (Retrieval)   Classic       Classic           Structured
Surfing (Browsing)      Flat          Flat, Hypertext   Structure Guided, Hypertext

Page 6: IR Models

User Task
  Retrieval: Adhoc, Filtering
    Classic Models: boolean, vector, probabilistic
      Set Theoretic: Fuzzy, Extended Boolean
      Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks
      Probabilistic: Inference Network, Belief Network
    Structured Models: Non-Overlapping Lists, Proximal Nodes
  Browsing
    Browsing: Flat, Structure Guided, Hypertext

from MIR text

Page 7: Classic Models: Basic Concepts

ki is an index term
dj is a document
t is the total number of index terms
K = {k1, k2, …, kt} is the set of all index terms
wij >= 0 is a weight associated with (ki,dj)
wij = 0 indicates that the term does not belong to the doc
vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki,dj)

Page 8: Classic: Boolean Model

Based on set theory: map queries with Boolean operations to set operations

Select documents from the term-document incidence matrix

Pros:

Cons:
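The mapping from Boolean operators to set operations can be sketched as follows. The `postings` data is a toy assumption (each term maps to the set of documents containing it), and the variable names are illustrative.

```python
# postings[term] = set of documents containing the term (toy data, assumed).
postings = {
    "search": {"MIR", "AI"},
    "segment": {"AI"},
    "select": {"MIR"},
    "semantic": {"MIR", "AI"},
}
all_docs = {"MIR", "AI"}

# AND -> intersection, OR -> union, NOT -> complement against all documents.
q_and = postings["search"] & postings["select"]    # search AND select
q_or = postings["segment"] | postings["select"]    # segment OR select
q_not = postings["semantic"] & (all_docs - postings["segment"])  # semantic AND NOT segment

print(q_and, q_or, q_not)
```

Every document either is or is not in the result set, so the output carries no ordering among the matches.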

Page 9: Exact Matching Ignores…

term frequency in document
term scarcity in corpus
size of document
ranking

Page 10: Vector Model

Vector of term weights based on term frequency

Compute similarity between query and document, where both are vectors:
  vec(dj) = (w1j, w2j, ..., wtj)
  vec(q) = (w1q, w2q, ..., wtq)

Similarity is the cosine of the angle between the vectors.

Page 11: Cosine Measure

$$\mathrm{sim}(q, d_j) = \cos(\theta) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}$$

Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1.

[Figure: vectors dj and q separated by the angle θ]

from MIR notes
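The cosine measure can be sketched directly from its definition. The function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made here, not something the slides specify.

```python
import math

def cosine(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention: an empty vector matches nothing
    return dot / (norm_d * norm_q)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # no shared terms
```

Because the result depends only on direction, scaling a document vector (for example, a longer document with proportionally higher weights) does not change its score.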

Page 12: How to Set Wij Weights? TF-IDF

Within document: Term Frequency
  tf measures term density within a document.

Across documents: Inverse Document Frequency
  idf measures informativeness or rarity of a term across the corpus.

$$idf_i = \log\frac{n}{df_i}$$

Page 13: TF * IDF Computation

$$w_{i,d} = tf_{i,d} \times \log(n / df_i)$$

where
  tf_{i,d} = frequency of term i in document d
  n = total number of documents
  df_i = the number of documents that contain term i

What happens as the number of occurrences in a document increases?

What happens as a term becomes more rare?
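One way to explore these two questions is to evaluate the weight for a few inputs. This is a small sketch using the natural logarithm (an assumption; the base only rescales the weights), with hypothetical function names `idf` and `tf_idf`.

```python
import math

def idf(n, df):
    """Inverse document frequency: log(n / df)."""
    return math.log(n / df)

def tf_idf(tf, n, df):
    """Weight of a term occurring tf times, in a corpus of n docs, df of which contain it."""
    return tf * idf(n, df)

# More occurrences in the document: the weight grows linearly with tf.
print(tf_idf(1, 7, 3), tf_idf(2, 7, 3))
# Rarer term (smaller df): the weight grows; a term in every document gets weight 0.
print(idf(7, 5), idf(7, 3), idf(7, 7))
```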

Page 14: TF * IDF

TF may be normalized:
  tf(i,d) = freq(i,d) / max_l(freq(l,d))

IDF is computed normalized to the size of the corpus, and as a log to make TF and IDF values comparable.

IDF requires a static corpus.
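The normalization divides each raw count by the count of the most frequent term in the same document. A minimal sketch, where the function name and the empty-document behavior are assumptions:

```python
def normalized_tf(counts):
    """Divide each raw term count by the largest count in the document."""
    if not counts:
        return {}  # assumed convention for an empty document
    max_freq = max(counts.values())
    return {term: c / max_freq for term, c in counts.items()}

# Illustrative counts for one document.
print(normalized_tf({"k1": 1, "k2": 2, "k3": 4}))
```

The most frequent term always gets tf = 1, so long documents do not receive larger weights simply because they contain more words.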

Page 15: How to Set Wi,q Weights?

1. Create vector directly from query
2. Use modified tf-idf

$$W_{i,q} = \left(0.5 + \frac{0.5 \cdot freq(i,q)}{\max_l\, freq(l,q)}\right) \cdot \log\frac{n}{df_i}$$
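The modified tf-idf query weight can be sketched as a single function. The name `query_weight`, its parameter names, and the use of the natural logarithm are assumptions; the input values are illustrative.

```python
import math

def query_weight(freq, max_freq, n, df):
    """(0.5 + 0.5 * freq / max_freq) * log(n / df) for one query term."""
    return (0.5 + 0.5 * freq / max_freq) * math.log(n / df)

# A term occurring once in a query whose most frequent term occurs 3 times,
# in a corpus of 7 documents, 5 of which contain the term.
print(round(query_weight(1, 3, 7, 5), 2))
```

The 0.5 offset keeps the tf factor between 0.5 and 1, so even the least frequent query term retains at least half of its idf.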

Page 16: The Vector Model: Example

      k1  k2  k3
d1     2   0   1
d2     1   0   0
d3     0   1   3
d4     2   0   0
d5     1   2   4
d6     1   2   0
d7     0   5   0

q      1   2   3

[Figure: documents d1 to d7 plotted in the space of index terms k1, k2, k3]

from MIR notes

Page 17: The Vector Model: Example (cont.)

1. Compute the TF-IDF vector for each document.

For the first document (log is the natural logarithm):
K1: (2/2) * log(7/5) = .33 (0.336 truncated; the same value appears rounded to .34 for d2 and d4)
K2: 0 * log(7/4) = 0
K3: (1/2) * log(7/3) = .42

For the rest:
[.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85], [.17 .56 0], [0 .56 0]

from MIR notes
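Step 1 can be checked with a short script. This is a sketch that assumes the term counts from the example table, normalized tf (count divided by the largest count in the document), and the natural logarithm; the variable names are illustrative.

```python
import math

# Raw term counts (k1, k2, k3) for each document, from the example table.
counts = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
n = len(counts)
num_terms = 3

# df[i] = number of documents that contain term i.
df = [sum(1 for row in counts.values() if row[i] > 0) for i in range(num_terms)]
idf = [math.log(n / df_i) for df_i in df]

# Weight = (count / max count in the document) * idf of the term.
doc_vectors = {
    name: [(c / max(row)) * idf[i] for i, c in enumerate(row)]
    for name, row in counts.items()
}

for name, vec in doc_vectors.items():
    print(name, [round(w, 2) for w in vec])
```

Rounded to two decimals, the k1 weight of d1 comes out as 0.34, the same as for d2 and d4.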

Page 18: The Vector Model: Example (cont.)

2. Compute the TF-IDF for the query [1 2 3]:
K1: (.5 + (.5 * 1)/3) * log(7/5)
K2: (.5 + (.5 * 2)/3) * log(7/4)
K3: (.5 + (.5 * 3)/3) * log(7/3)
which is: [.22 .47 .85]

Page 19: The Vector Model: Example (cont.)

3. Compute the Sim for each document.

D1:
D1*q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
|D1| = sqrt((.33^2) + (.42^2)) = .53
|q| = sqrt((.22^2) + (.47^2) + (.85^2)) = 1.0
sim = .43 / (.53 * 1.0) = .81

D2: .23 D3: .93 D4: .23 D5: .97 D6: .51 D7: .47

(d2 and d4 have identical weight vectors, so they receive the same score.)
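The whole example can be reproduced end to end with a short script. This is a sketch under the same assumptions as the worked numbers: normalized tf for documents, the modified tf-idf for the query, and the natural logarithm. Function and variable names are illustrative.

```python
import math

counts = {
    "d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
    "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0],
}
query_counts = [1, 2, 3]

n = len(counts)
num_terms = len(query_counts)
df = [sum(1 for row in counts.values() if row[i] > 0) for i in range(num_terms)]
idf = [math.log(n / d) for d in df]

def doc_vector(row):
    """Normalized tf times idf for each term of a document."""
    max_freq = max(row)
    return [(c / max_freq) * idf[i] for i, c in enumerate(row)]

def query_vector(row):
    """Modified tf-idf: (0.5 + 0.5 * freq / max freq) times idf."""
    max_freq = max(row)
    return [(0.5 + 0.5 * c / max_freq) * idf[i] for i, c in enumerate(row)]

def cosine(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

q = query_vector(query_counts)
sims = {name: cosine(doc_vector(row), q) for name, row in counts.items()}

# Rank documents by decreasing similarity (ties keep their original order).
ranking = sorted(sims, key=sims.get, reverse=True)
for name in ranking:
    print(name, round(sims[name], 2))
```

Computed without intermediate rounding, d5 and d3 rank highest, and d2 and d4 tie at the bottom.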

Page 20: Vector Model Implementation Issues

Sparse Term X Document matrix

Store term count, term weight, or weighted by idfi?

What if the corpus is not fixed (e.g., the Web)? What happens to IDF?

How to efficiently compute Cosine for large index?

Page 21: Heuristics for Computing Cosine for Large Index

Select from only non-zero cosines

Focus on non-zero cosines for rare (high idf) words

Pre-compute document adjacency
  for each term, pre-compute k nearest docs
  for a t term query, compute cosines from query to union of t pre-computed lists, choose top k
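The pre-computation heuristic could look roughly like the sketch below. It is one possible reading of the slide, not a definitive implementation: "nearest docs" for a term is interpreted here as the documents with the highest cosine to a one-term query, and the names `build_term_lists` and `top_k` are hypothetical.

```python
import heapq
import math

def cosine(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def build_term_lists(doc_vectors, k):
    """For each term index, keep the k documents closest to that single term."""
    num_terms = len(next(iter(doc_vectors.values())))
    lists = []
    for i in range(num_terms):
        unit = [1.0 if j == i else 0.0 for j in range(num_terms)]
        scored = [(cosine(vec, unit), name) for name, vec in doc_vectors.items()]
        best = heapq.nlargest(k, (s for s in scored if s[0] > 0))
        lists.append({name for _, name in best})
    return lists

def top_k(query, doc_vectors, term_lists, k):
    """Score only the union of the pre-computed lists for the query's terms."""
    candidates = set()
    for i, w in enumerate(query):
        if w > 0:
            candidates |= term_lists[i]
    return heapq.nlargest(
        k, candidates, key=lambda name: cosine(doc_vectors[name], query)
    )

# Toy weight vectors over two terms (assumed data).
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
lists = build_term_lists(vectors, k=2)
print(top_k([1.0, 1.0], vectors, lists, k=1))
```

The trade-off is that a document absent from every per-term list is never scored, so the result is an approximation of the true top k.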

Page 22: The TFIDF Vector Model: Pros/Cons

Pros:
  term-weighting improves quality
  cosine ranking formula sorts documents according to degree of similarity to the query

Cons:
  assumes independence of index terms