classic ir models - dalhousie universityweb.cs.dal.ca/~anwar/ir/lecturenotes/l2.pdf · classic ir...

Classic IR Models

5/6/2012 1

Classic IR Models

• Idea

– Each document is represented by index terms.

– An index term is essentially a word whose semantics help convey the meaning of the document.

– Not all index terms are equally useful for describing the document content.

– The importance of each index term to a document is captured by a weight assigned to that term in the document.


Definition

• Let

– t be the number of index terms in the corpus (or system).

– ki be a generic index term.

– K = {k1, k2, …, kt} be the set of index terms.

– wi,j > 0 be the weight associated with index term ki in document dj.

– wi,j = 0 if ki does not appear in dj.

– Each document dj is associated with an index term vector dj = (w1,j, w2,j, …, wt,j).

– gi be a ranking function that returns the weight associated with index term ki in dj: gi(dj) = wi,j.


IR Models

• User Task

– Retrieval: Adhoc, Filtering

– Browsing

• Classic Models: boolean, vector, probabilistic

• Set Theoretic: Fuzzy, Extended Boolean

• Algebraic: Generalized Vector, Lat. Semantic Index, Neural Networks

• Probabilistic: Inference Network, Belief Network

• Structured Models: Non-Overlapping Lists, Proximal Nodes

• Browsing: Flat, Structure Guided, Hypertext


Basic Idea

• Document: set of terms

• Query: Boolean expression over terms

– Satisfying:

• Document evaluates to "true" on single-term query if it contains that term

• Evaluate document on expression query as you would any Boolean expression

• Document satisfies query if evaluates to true on query

Credit: Princeton

Satisfying a Query in the Boolean Model

• What determines if a document satisfies a query?

– That depends ….

• Document model

• Query model

• START SIMPLE

– better understanding

– Use components of simple model later


Boolean Model Example

• Doc 1: “Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new -- underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific knowledge and related technologies… Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.”

• Doc 2: “An introduction to computer science in the context of scientific, engineering, and commercial applications. The goal of the course is to teach basic principles and practical issues, while at the same time preparing students to use computers effectively for applications in computer science …”

• Query:

– (principles AND knowledge) OR (science AND engineering)





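• Query: (principles AND knowledge) OR (science AND engineering)

• Doc 1 term values: principles = 0, knowledge = 1, science = 1, engineering = 0

• (0 AND 1) OR (1 AND 0) = 0

Doc 1: FALSE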




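• Query: (principles AND knowledge) OR (science AND engineering)

• Doc 2 term values: principles = 1, knowledge = 0, science = 1, engineering = 1

• (1 AND 0) OR (1 AND 1) = 1

Doc 2: TRUE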
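The evaluation above can be sketched in code. This is a minimal illustration, not part of the original slides: the nested-tuple query representation and the function name `satisfies` are assumptions, and each document is reduced to the query terms it contains.

```python
def satisfies(doc_terms, query):
    """Return True if a document (a set of terms) satisfies a Boolean query.

    A query is either a single term (str) or a tuple such as
    ("AND", q1, q2), ("OR", q1, q2) or ("NOT", q). This representation
    is an assumption made for illustration.
    """
    if isinstance(query, str):
        # Single-term query: true if the document contains the term.
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(satisfies(doc_terms, a) for a in args)
    if op == "OR":
        return any(satisfies(doc_terms, a) for a in args)
    if op == "NOT":
        return not satisfies(doc_terms, args[0])
    raise ValueError(f"unknown operator: {op}")


# Only the query terms that occur in each document are listed here.
doc1 = {"science", "knowledge"}
doc2 = {"science", "principles", "engineering"}

query = ("OR",
         ("AND", "principles", "knowledge"),
         ("AND", "science", "engineering"))

result1 = satisfies(doc1, query)  # Doc 1
result2 = satisfies(doc2, query)  # Doc 2
```

The same function handles NOT, so it can be used to check the exercise queries by hand.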

Exercise

• Use Doc 1 and Doc 2

• (principles OR knowledge) AND (science AND NOT(engineering))


Implementation Example (Boolean Model)

• Suppose we have a data set of three documents as follows:

– D1 = Programming in Java

– D2 = OO Programming

– D3 = Databases and SQL Programming

• “in” and “and” are dropped (stop words)



• Primary Index

      Database  Java  OO  Programming  SQL
D1    0         1     0   1            0
D2    0         0     1   1            0
D3    1         0     0   1            1

• Inverted Index

Term         Freq.  Pointer -> Postings List
Database     1      D3
Java         1      D1
OO           1      D2
Programming  3      D1, D2, D3
SQL          1      D3
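A minimal sketch of how both indexes could be built for the three example documents. The names `STOP_WORDS`, `NORMALIZE` and `terms_of` are hypothetical, and the one-entry `NORMALIZE` table is only a stand-in for a real stemmer (it maps "databases" to "database" so the result matches the slide).

```python
STOP_WORDS = {"in", "and"}              # stop words dropped in the example
NORMALIZE = {"databases": "database"}   # hypothetical stand-in for a stemmer

docs = {
    "D1": "Programming in Java",
    "D2": "OO Programming",
    "D3": "Databases and SQL Programming",
}


def terms_of(text):
    """Lowercase, drop stop words, and normalize the remaining tokens."""
    tokens = (tok.lower() for tok in text.split())
    return {NORMALIZE.get(tok, tok) for tok in tokens if tok not in STOP_WORDS}


doc_terms = {doc_id: terms_of(text) for doc_id, text in docs.items()}
vocabulary = sorted(set().union(*doc_terms.values()))

# Primary index: one 0/1 row per document, one column per term.
matrix = {
    doc_id: [int(term in terms) for term in vocabulary]
    for doc_id, terms in doc_terms.items()
}

# Inverted index: term -> sorted list of documents containing it.
inverted = {
    term: [doc_id for doc_id in sorted(docs) if term in doc_terms[doc_id]]
    for term in vocabulary
}

# The frequency column is the length of each postings list.
doc_freq = {term: len(postings) for term, postings in inverted.items()}
```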


Term-Document Incidence Boolean Model

           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0

1 if play contains word, 0 otherwise

Query: Brutus AND Caesar BUT NOT Calpurnia


Incidence Vectors

• So we have a 0/1 vector for each term.

• To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.

• 110100 AND 110111 AND 101111 = 100100.
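The bitwise computation can be reproduced with Python integers. This is a small sketch, assuming each incidence vector is stored as a 6-bit integer whose leftmost bit corresponds to the first play; the variable names are illustrative.

```python
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One bit per play, leftmost bit = first play in the list.
incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Python integers are unbounded, so mask the complement to 6 bits.
ALL_PLAYS = (1 << len(plays)) - 1
not_calpurnia = ~incidence["Calpurnia"] & ALL_PLAYS

answer = incidence["Brutus"] & incidence["Caesar"] & not_calpurnia
answer_bits = format(answer, "06b")

# Map the set bits back to play titles.
matching_plays = [play for play, bit in zip(plays, answer_bits) if bit == "1"]
```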


Answers to Query

• Antony and Cleopatra, Act III, Scene ii

Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,

When Antony found Julius Caesar dead,

He cried almost to roaring; and he wept

When at Philippi he found Brutus slain.

• Hamlet, Act III, Scene ii

Lord Polonius: I did enact Julius Caesar I was killed i' the

Capitol; Brutus killed me.


Exercise 1

• D1 = “computer information retrieval”

• D2 = “computer retrieval”

• D3 = “information”

• D4 = “computer information”

• Q1 = “information retrieval”

• Q2 = “information ¬computer”


Exercise 2

0 (no terms)

1 Swift

2 Shakespeare

3 Shakespeare Swift

4 Milton

5 Milton Swift

6 Milton Shakespeare

7 Milton Shakespeare Swift

8 Chaucer

9 Chaucer Swift

10 Chaucer Shakespeare

11 Chaucer Shakespeare Swift

12 Chaucer Milton

13 Chaucer Milton Swift

14 Chaucer Milton Shakespeare

15 Chaucer Milton Shakespeare Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))


Retrieval Evaluation

• User Evaluation

– Relevant

– Not relevant

• System Evaluation

– Retrieved

– Not Retrieved

          Rel.  Not Rel.
Ret.      a     b
Not Ret.  c     d

Recall R= a / (a+c)

Precision P = a / (a+b)
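A short sketch of the two measures, assuming the retrieved and relevant documents are given as sets of document IDs. The function name and the convention of returning 0.0 for an empty denominator are assumptions, not part of the slides.

```python
def recall_precision(retrieved, relevant):
    """Compute (recall, precision) from sets of document IDs."""
    a = len(retrieved & relevant)   # retrieved and relevant
    b = len(retrieved - relevant)   # retrieved but not relevant
    c = len(relevant - retrieved)   # relevant but not retrieved
    # Assumed convention: report 0.0 when a denominator is zero.
    recall = a / (a + c) if (a + c) else 0.0
    precision = a / (a + b) if (a + b) else 0.0
    return recall, precision


# Illustrative data: 4 documents retrieved, 3 documents relevant.
recall, precision = recall_precision({1, 2, 3, 4}, {2, 4, 5})
```

Here a = 2, b = 2 and c = 1, so recall is 2/3 and precision is 1/2.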


Drawing of Recall-Precision

http://ralphlosey.files.wordpress.com http://ilab.cs.ucsb.edu/


Advantage of Boolean IR Modeling

• The Boolean Model

– Fast to implement

– Fast to process a query

– Simple


Boolean Modeling Pitfalls

• Retrieval based on binary decision criteria with no notion of partial matching.

• No ranking of the documents is provided (absence of a grading scale).

• Information need has to be translated into a Boolean expression which most users find awkward.

• The Boolean queries formulated by the users are most often too simplistic.

• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.


Always Remember!

• We care about modeling.

• Implementation can be done in different ways.

• Which way you should select:

– It depends.

• You can go with a hash table or hash tree. When?

• You can use a B-tree. When?

• More about this in assignment 2.

• The Boolean model has extended forms.

• The Boolean model does not take care of ranking (setting rendering priorities).


The Inverted Index

Boolean Model Continued


Example from last class



• Suppose we have a data set of three documents as follows:

– D1 = Programming in Java

– D2 = OO Programming

– D3 = Databases and SQL Programming

• “in” and “and” are dropped (stop words)



• Primary Index

      Database  Java  OO  Programming  SQL
D1    0         1     0   1            0
D2    0         0     1   1            0
D3    1         0     0   1            1

• Inverted Index

Term         Freq.  Pointer -> Postings List
Database     1      D3
Java         1      D1
OO           1      D2
Programming  3      D1, D2, D3
SQL          1      D3


Also, look at Google’s paper: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”


More Detailed Inverted Index


Introduction to Information Retrieval

Inverted index

For each term t, we must store a list of all documents that contain t.

Identify each by a docID, a document serial number

Can we use fixed-size arrays for this?

Brutus     -> 1 2 4 11 31 45 173 174
Caesar     -> 1 2 4 5 6 16 57 132
Calpurnia  -> 2 31 54 101

What happens if the word Caesar is added to document 14?

Sec. 1.2



Inverted index

We need variable-size postings lists

On disk, a continuous run of postings is normal and best

In memory, can use linked lists or variable length arrays

Some tradeoffs in size/ease of insertion

Dictionary    Postings
Brutus     -> 1 2 4 11 31 45 173 174
Caesar     -> 1 2 4 5 6 16 57 132
Calpurnia  -> 2 31 54 101

Each docID entry in a postings list is a posting.

Sorted by docID (more later on why).

Sec. 1.2
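A minimal in-memory sketch of variable-size postings lists, assuming a Python dict that maps each term to a list of docIDs kept in sorted order. The helper name `add_posting` is hypothetical; `bisect` from the standard library finds the insertion point, so adding Caesar to document 14 keeps the list sorted.

```python
import bisect

# Dictionary -> postings, each postings list sorted by docID.
index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}


def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    # Skip the insert if the docID is already present.
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)


add_posting(index, "Caesar", 14)
```

Inserting into the middle of a Python list shifts the later elements, which illustrates the size versus ease-of-insertion tradeoff mentioned above; a linked list would avoid the shift at the cost of extra pointers.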



Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.

-> Tokenizer -> Token stream: Friends Romans Countrymen

-> Linguistic modules -> Modified tokens: friend roman countryman

-> Indexer -> Inverted index:

friend      -> 2 4
roman       -> 1 2
countryman  -> 13 16

More on these later.

Sec. 1.2



Indexer steps: Token sequence

Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sec. 1.2



Indexer steps: Sort

Sort by term, and then by docID

Core indexing step. Why?

Sec. 1.2



Indexer steps: Dictionary & Postings

Multiple term entries in a single document are merged.

Split into Dictionary and Postings

Doc. frequency information is added.

Why frequency? Will discuss later.
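The indexer steps (token sequence, sort, merge, split into dictionary and postings) can be sketched as follows. This is an illustrative sketch only: the regular-expression tokenizer and lowercasing stand in for the tokenizer and linguistic modules, and the function name `build_index` is an assumption.

```python
import re

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; "
       "Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you "
       "Caesar was ambitious",
}


def build_index(docs):
    """Return (dictionary, postings) for a {docID: text} collection."""
    # Step 1: sequence of (modified token, docID) pairs.
    pairs = [
        (token, doc_id)
        for doc_id, text in docs.items()
        for token in re.findall(r"[a-z']+", text.lower())
    ]
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: merge repeated entries of a term within one document.
    postings = {}
    for term, doc_id in pairs:
        plist = postings.setdefault(term, [])
        if not plist or plist[-1] != doc_id:
            plist.append(doc_id)
    # Step 4: the dictionary stores each term's document frequency.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, postings


dictionary, postings = build_index(docs)
```

Because the pairs are sorted first, duplicates of a term within a document are adjacent, so the merge only needs to compare against the last docID appended.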

Sec. 1.2



Where do we pay in storage?

• Terms and counts

• Pointers

• Lists of docIDs

Later in the course:

• How do we index efficiently?

• How much storage do we need?

Sec. 1.2



The index we just built

How do we process a query?

Later - what kinds of queries can we process?

Sec. 1.3



Query processing: AND

Consider processing the query:

Brutus AND Caesar

Locate Brutus in the Dictionary; Retrieve its postings.

Locate Caesar in the Dictionary; Retrieve its postings.

“Merge” the two postings:

Brutus  -> 2 4 8 16 32 64 128
Caesar  -> 1 2 3 5 8 13 21 34

Sec. 1.3



The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus  -> 2 4 8 16 32 64 128
Caesar  -> 1 2 3 5 8 13 21 34

Result  -> 2 8

If the list lengths are x and y, the merge takes O(x+y) operations.

Crucial: postings sorted by docID.

Sec. 1.3



Intersecting two postings lists (a “merge” algorithm)
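A sketch of the two-pointer intersection, assuming both postings lists are Python lists sorted by docID. The function name `intersect` is illustrative.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) steps."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID appears in both lists: keep it and advance both.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that is behind.
            i += 1
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
result = intersect(brutus, caesar)
```

Each loop iteration advances at least one pointer, which is why the sorted order is crucial for the linear bound.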



Boolean queries: Exact match

The Boolean retrieval model lets us ask a query that is a Boolean expression:

Boolean queries use AND, OR and NOT to join query terms

Views each document as a set of words

Is precise: document matches condition or not.

Perhaps the simplest model to build an IR system on

Primary commercial retrieval tool for 3 decades.

Many search systems you still use are Boolean:

Email, library catalog, Mac OS X Spotlight

Sec. 1.3



Example: WestLaw http://www.westlaw.com/

Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

Tens of terabytes of data; 700,000 users

Majority of users still use boolean queries

Example query:

What is the statute of limitations in cases involving the federal tort claims act?

LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM

/3 = within 3 words, /S = in same sentence

Sec. 1.4



Example: WestLaw http://www.westlaw.com/

Another example query:

Requirements for disabled people to be able to access a workplace

disabl! /p access! /s work-site work-place (employment /3 place)

Note that SPACE is disjunction, not conjunction!

Long, precise queries; proximity operators; incrementally developed; not like web search

Many professional searchers still like Boolean search

You know exactly what you are getting

But that doesn’t mean it actually works better….

Sec. 1.4



Boolean queries: More general merges

Exercise: Adapt the merge for the queries:

Brutus AND NOT Caesar

Brutus OR NOT Caesar

Can we still run through the merge in time O(x+y)?

What can we achieve?

Sec. 1.3



Merging

What about an arbitrary Boolean formula?

(Brutus OR Caesar) AND NOT

(Antony OR Cleopatra)

Can we always merge in “linear” time?

Linear in what?

Can we do better?

Sec. 1.3



Query optimization

What is the best order for query processing?

Consider a query that is an AND of n terms.

For each of the n terms, get its postings, then AND them together.

Brutus     -> 2 4 8 16 32 64 128
Caesar     -> 1 2 3 5 8 16 21 34
Calpurnia  -> 13 16

Query: Brutus AND Calpurnia AND Caesar

Sec. 1.3



Query optimization example

Process in order of increasing freq:

start with smallest set, then keep cutting further.

This is why we kept document freq. in dictionary

Brutus     -> 2 4 8 16 32 64 128
Caesar     -> 1 2 3 5 8 16 21 34
Calpurnia  -> 13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.

Sec. 1.3
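A sketch of processing an AND of several terms in order of increasing document frequency. It assumes postings are sorted Python lists and uses the list length as the document frequency; the names `intersect` and `intersect_many` are illustrative.

```python
def intersect(p1, p2):
    """Two-pointer intersection of postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_by_term):
    """AND all terms together, rarest term first.

    Returns (processing order, result). The document frequency of a
    term is taken to be the length of its postings list.
    """
    order = sorted(postings_by_term, key=lambda t: len(postings_by_term[t]))
    if not order:
        return [], []
    result = postings_by_term[order[0]]
    for term in order[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings_by_term[term])
    return order, result


postings_by_term = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
order, result = intersect_many(postings_by_term)
```

Starting from the smallest list keeps every intermediate result no larger than that list, so later merges touch fewer entries.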



More general optimization

e.g., (madding OR crowd) AND (ignoble OR strife)

The question is: in which order should the AND operands be processed to finish fastest?

Get doc. freq.’s for all terms.

Estimate the size of each OR by the sum of its doc. freq.’s (conservative).

Process in increasing order of OR sizes.
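The estimate can be sketched in a few lines. The document frequencies below are made-up numbers used only for illustration (the slides give none for these terms), and the function name `order_or_groups` is hypothetical.

```python
def order_or_groups(groups, doc_freq):
    """Order OR-groups by estimated size (sum of document frequencies).

    The sum is a conservative upper bound on the size of each OR result,
    since a document containing several of the terms is counted more
    than once.
    """
    return sorted(groups, key=lambda group: sum(doc_freq[t] for t in group))


# Hypothetical document frequencies, for illustration only.
doc_freq = {"madding": 10, "crowd": 500, "ignoble": 5, "strife": 40}

groups = [("madding", "crowd"), ("ignoble", "strife")]
processing_order = order_or_groups(groups, doc_freq)
```

With these assumed frequencies, (ignoble OR strife) has an estimated size of 45 and is processed before (madding OR crowd), whose estimate is 510.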

Sec. 1.3



Exercise

Recommend a query processing order for

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Term          Freq
eyes          213312
kaleidoscope  87009
marmalade     107913
skies         271658
tangerine     46653
trees         316812


Classic IR Models

The Vector Space Model


The Vector Space Model

• Document: bag of terms

• Query: list of terms

• Satisfying:

– Each document is scored as to the degree it satisfies the query (non-negative real number)

– doc satisfies query if its score is >0

– Documents are returned in a sorted list decreasing by score (hints for implementation):

• Include only non-zero scores

• Include only highest n documents, some n


How to compute score? Basic Assumptions

• There is a dictionary (aka lexicon) of all terms, numbering t in all

– Number the terms 1, …, t

• Change the model of a document (temporarily):

– A document is a t-dimensional vector

– The ith entry of the vector is the weight (importance) of term i in the document


The Vector Space


How to compute score, continued

• Calculate a vector function of the document vector and the query vector to get the score of the document with respect to the query.

• Choices:

– Measure the distance between the vectors:

• $\mathrm{Dist}(d, q) = \sqrt{\sum_{i=1}^{t} (d_i - q_i)^2}$

• Is a dissimilarity measure

• Not normalized: Dist ranges over $[0, \infty)$

• Fix: use $e^{-\mathrm{Dist}}$, range $(0, 1]$

• Is it the right sense of difference?
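A small sketch of the distance-based score, assuming Dist is the Euclidean distance between two equal-length vectors and that the normalized score is e raised to minus Dist. The function names are illustrative.

```python
import math


def euclidean_distance(d, q):
    """Euclidean distance between document vector d and query vector q."""
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))


def distance_score(d, q):
    """Map the distance into (0, 1]: identical vectors score 1."""
    return math.exp(-euclidean_distance(d, q))


dist = euclidean_distance([1.0, 2.0], [4.0, 6.0])   # sqrt(9 + 16)
same = distance_score([1.0, 2.0], [1.0, 2.0])       # distance 0
```

Note that two documents about the same topic but of very different lengths can be far apart under this measure, which is one reason to consider the angle instead.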


How to compute score, cont’d

• Measure the angle between the vectors:

• Dot product: $d \cdot q = \sum_{i=1}^{t} d_i q_i$

• Is a similarity measure

– Not normalized: dot product ranges over $(-\infty, \infty)$

– Fix: use normalized dot product, range $[-1, 1]$

• $\mathrm{sim}(d, q) = \dfrac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert}$, aka cosine similarity

• In practice vector components are nonnegative, so the range is $[0, 1]$

• This is the most commonly used function for scoring.


Cosine Similarity

• Cosine similarity is a measure of similarity between two vectors of n dimensions by finding the cosine of the angle between them.

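• Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using dot product and magnitude as

$\cos(\theta) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$

• Dot product: $A \cdot B = \sum_{i=1}^{n} A_i B_i$

• Magnitude: $\lVert A \rVert = \sqrt{\sum_{i=1}^{n} A_i^2}$

[Figure: Cosine geometrically; θ is the angle between vectors A and B]

Credit: http://www10.org/cdrom/papers/519/node12.html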
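Cosine similarity can be sketched directly from the dot product and the magnitudes. The function name and the choice to return 0.0 when either vector is all zeros are assumptions made for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


orthogonal = cosine_similarity([1, 0], [0, 1])        # no shared terms
parallel = cosine_similarity([1, 2, 0], [2, 4, 0])    # same direction
```

The second pair points in the same direction even though one vector is twice as long, so the score is unaffected by document length.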


How to Compute Weights of Documents

• The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval.

• It is used to evaluate how important a word is to a document in a collection.

• Two factors:

– How frequent the term is in the document (more frequent, more important)

– How frequent the term is in the collection of documents (less frequent, more important to the current document)


tf (Term Frequency)

• The term count in the given document is simply the number of times a given term appears in that document.

• Usually normalized (why?)

– to prevent a bias towards longer documents.

– e.g. divide by the number of all terms in the document.

• tf is computed as follows: $tf_{i,j} = \dfrac{n_{i,j}}{\sum_k n_{k,j}}$

• $n_{i,j}$ is the number of occurrences of the considered term ($t_i$) in document $d_j$.

• The denominator is the sum of the number of occurrences of all terms in document $d_j$.


idf (Inverse Document Frequency)

• idf is a measure of the general importance of the term.

• Can be computed as follows:

• Document frequency: $df_i = \dfrac{|\{d : t_i \in d\}|}{|D|} = \dfrac{\#\,\text{docs where } t_i \text{ appears}}{\#\,\text{of all docs}}$

• $idf_i = \log \dfrac{|D|}{|\{d : t_i \in d\}| + 1}$

• Where:

– $|D|$ is the total number of documents in the corpus.

– $|\{d : t_i \in d\}|$ is the number of documents where the term $t_i$ appears.

– “1” is usually added to the denominator to prevent division by zero.


tf (Term Frequency), another way to normalize

• Let:

– $N$ be the total number of documents in the dataset

– $n_i$ be the total number of documents that contain term $i$

– $freq_{i,j}$ be the total number of occurrences of term $i$ in doc. $j$

• $IDF_i = \log_2 \dfrac{N}{n_i} = \log_2 N - \log_2 n_i$

• Now,

• $W_{i,j} = tf_{i,j} \cdot idf_i = freq_{i,j} \cdot (\log N - \log n_i)$

We have to normalize, why?

• $W_{i,j} = tf_{i,j} \cdot idf_i = \dfrac{freq_{i,j}}{\max_L freq_{L,j}} \cdot (\log N - \log n_i)$

• $\max_L freq_{L,j}$ is the frequency of the most frequent term $L$ in document $j$

• We may also need to add (+1) to $\log N - \log n_i$. Why?


tf-idf

• The weight of a term in a document is:

Wi,j = tfi,j * idfi

• What does it do?

• It usually filters out common terms.

• What about ranking?
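A sketch of computing tf-idf weights, assuming the max-frequency normalization of tf and idf = log(N / n_i) with the natural logarithm (the base is an assumption; it only rescales the weights). The documents are the four from Exercise 1, and the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter

docs = {
    "D1": ["computer", "information", "retrieval"],
    "D2": ["computer", "retrieval"],
    "D3": ["information"],
    "D4": ["computer", "information"],
}


def tf_idf(docs):
    """Return {doc: {term: weight}} for non-empty tokenized documents."""
    N = len(docs)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # n[term] = number of documents that contain the term.
    n = Counter(term for c in counts.values() for term in c)
    weights = {}
    for doc_id, c in counts.items():
        max_freq = max(c.values())  # most frequent term in this document
        weights[doc_id] = {
            term: (freq / max_freq) * math.log(N / n[term])
            for term, freq in c.items()
        }
    return weights


weights = tf_idf(docs)
```

In this collection "retrieval" occurs in 2 of 4 documents while "computer" occurs in 3, so within D1 the rarer term receives the larger weight.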


Vector Space Model Example

• Doc 1: “Computers have brought the world to our fingertips. We will try to understand at a basic level the science -- old and new – underlying this new Computational Universe. Our quest takes us on a broad sweep of scientific media and related technologies… Ultimately, this study makes us look anew at ourselves -- our genome; language; music; "knowledge"; and, above all, the mystery of our intelligence.”

• Doc 1 frequencies: science 1; knowledge 1; principles 0; engineering 0

• Doc 2 frequencies: science 2; knowledge 0; principles 1; engineering 1


Example, cont’d

• Consider having 5 documents in the collection.

• The Idf for terms in the previous example are:

– science: ln(5/3) = 0.51

– engineering, principles, knowledge: ln(5/1) = 1.6


Ranking

• Term by Doc. table: freq_{j,d} * log(N / n_j).

             Doc 1  Doc 2  Query
science      0.51   1.02   0.51
engineering  0      1.6    1.6
principles   0      1.6    1.6
knowledge    1.6    0      1.6

• Using un-normalized dot product for query: science, engineering, knowledge, principles.

• Also, using 0/1 query vector, we get:

– Cosine (Doc1, Q) = 0.589

– Cosine (Doc2, Q) = 0.807


Vector Space Model (Summary)

• Advantages

– The concept of Ranking.

– Not difficult to implement

– Shown to be effective

• Disadvantages

– What threshold to choose?

– Term Independence

– Term Weights
