
Intro to Information Retrieval

By the end of the lecture you should be able to:

explain the differences between database and information retrieval technologies

describe the basic maths underlying set-theoretic and vector models of classical IR.

Reminder: efficiency is vital

Reminder: Google finds documents which match your keywords. This must be done EFFICIENTLY: we can't just go through each document from start to end for each keyword

So, cache stores copy of document, and also a “cut-down” version of the document for searching: just a “bag of words”, a sorted list (or array/vector/…) of words appearing in the document (with links back to full document)

Try to match keywords against this list; if found, then return the full document
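The bag-of-words idea can be sketched roughly as below. This is only an illustration: the tokenizer (lowercase, letters only), the two-document cache and the names `bag_of_words`, `contains` and `search` are assumptions made for the example, not part of the lecture.

```python
import re
from bisect import bisect_left

def bag_of_words(text):
    """Return the sorted list of distinct words in text (assumed tokenizer: letters only, lowercased)."""
    return sorted(set(re.findall(r"[a-z]+", text.lower())))

def contains(bag, word):
    """Binary search in the sorted bag, so the full document is never scanned."""
    i = bisect_left(bag, word)
    return i < len(bag) and bag[i] == word

# Hypothetical cache: document id -> full copy of the document.
CACHE = {
    "doc1": "Recipe for jam pudding",
    "doc2": "DoT report on traffic lanes",
}

# Cut-down version of each document, with the id as the link back to the full copy.
BAGS = {doc_id: bag_of_words(text) for doc_id, text in CACHE.items()}

def search(keyword):
    """Return the full text of every cached document whose bag contains keyword."""
    keyword = keyword.lower()
    return [CACHE[doc_id] for doc_id, bag in BAGS.items() if contains(bag, keyword)]

print(search("jam"))  # ['Recipe for jam pudding']
```

Keeping the bag sorted is what allows the binary search; every document's bag is still visited once per keyword, which is the cost the inverted file removes.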

Even cleverer: dictionary and inverted file…

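Inverted file structure

[Figure: three linked structures, left to right: dictionary, inverted or postings file, data file]

dictionary: Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), ..

pointers: 1, 3, 6, 7, 9, ..

Inverted or postings file: 1, 2, 1, 2, 3, 2, 2, 3, 4, ..

Data file: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, ..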
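A minimal sketch of the dictionary / postings / data file split, assuming in-memory Python structures stand in for the three files and that each dictionary entry stores a document count plus a 0-based offset into a flat postings list. The sample documents and the names `build_inverted_file` and `lookup` are hypothetical.

```python
import re

def build_inverted_file(data_file):
    """Build (dictionary, postings) from a list of document texts.

    dictionary maps term -> (number of documents containing it, offset into postings);
    postings is one flat list of 1-based document numbers.
    """
    term_docs = {}
    for doc_no, text in enumerate(data_file, start=1):
        for term in set(re.findall(r"[a-z]+", text.lower())):
            term_docs.setdefault(term, []).append(doc_no)

    dictionary, postings = {}, []
    for term in sorted(term_docs):
        docs = term_docs[term]  # ascending, because documents were scanned in order
        dictionary[term] = (len(docs), len(postings))
        postings.extend(docs)
    return dictionary, postings

def lookup(term, dictionary, postings, data_file):
    """Follow dictionary -> postings -> data file for one term."""
    if term not in dictionary:
        return []
    count, offset = dictionary[term]
    return [data_file[doc_no - 1] for doc_no in postings[offset:offset + count]]

# Hypothetical data file.
DATA_FILE = [
    "Recipe for jam pudding",
    "DoT report on traffic lanes",
    "Radio item on traffic jam in Pudding Lane",
]
DICTIONARY, POSTINGS = build_inverted_file(DATA_FILE)
print(lookup("jam", DICTIONARY, POSTINGS, DATA_FILE))
```

A keyword lookup now touches one dictionary entry and a short slice of the postings list, rather than every document.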

IR vs DBMS

                       DBMS             IR

match                  exact            partial or best match
inference              deduction        induction
model                  deterministic    probabilistic
data                   record/field     text document
query language         artificial       natural?
query specification    complete         incomplete
items wanted           matching         relevant
error response         sensitive        insensitive

informal introduction

IR was developed for bibliographic systems. We shall refer to ‘documents’, but the technique extends beyond items of text.

Central to IR is the representation of a document by a set of ‘descriptors’ or ‘index terms’ (“words in the document”).

Searching for a document is carried out (mainly) in the ‘space’ of index terms.

We need a language for formulating queries, and a method for matching queries with document descriptors.

architecture

[Figure: architecture diagram with components: user, query matching, learning component, object base (objects and their descriptions); labelled flows: query, hits, feedback]

basic notation

Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j} \ge 0$ to be a weight associated with the ith keyword and the jth document.

For the jth document, we define an index term vector, $d_j$:

$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$

For example: D = {d1, d2, d3},

T = {pudding, jam, traffic, lane, treacle}

d1 = (1, 1, 0, 0, 0)   Recipe for jam pudding

d2 = (0, 0, 1, 1, 0)   DoT report on traffic lanes

d3 = (1, 1, 1, 1, 0)   Radio item on traffic jam in Pudding Lane
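As an illustration of the notation, the sketch below derives binary index term vectors from the three document descriptions. It assumes a crude normaliser that lowercases words and strips a trailing "s" (so that "lanes" matches the index term "lane"); that rule and the function names are assumptions for this example only.

```python
import re

T = ["pudding", "jam", "traffic", "lane", "treacle"]

def normalise(word):
    """Assumed normaliser: lowercase and drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

def index_term_vector(text, terms=T):
    """Return (w_1, ..., w_n) with w_i = 1 if the ith index term occurs in text, else 0."""
    words = {normalise(w) for w in re.findall(r"[A-Za-z]+", text)}
    return tuple(1 if term in words else 0 for term in terms)

d1 = index_term_vector("Recipe for jam pudding")
d2 = index_term_vector("DoT report on traffic lanes")
d3 = index_term_vector("Radio item on traffic jam in Pudding Lane")
print(d1, d2, d3)  # (1, 1, 0, 0, 0) (0, 0, 1, 1, 0) (1, 1, 1, 1, 0)
```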

set theoretic, Boolean model

Queries are Boolean expressions formed using keywords, eg: (‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’

Query is re-expressed in disjunctive normal form (DNF), eg (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)

To match a document with a query:

$$\mathrm{sim}(d, q_{\mathrm{DNF}}) = \begin{cases} 1 & \text{if } d \text{ is equal to a component of } q_{\mathrm{DNF}} \\ 0 & \text{otherwise} \end{cases}$$

CF: T = {pudding, jam, traffic, lane, treacle}

d1 = (1, 1, 0, 0, 0),

d2 = (0, 0, 1, 1, 0),

d3 = (1, 1, 1, 1, 0)

qDNF = (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
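One way to see where the three DNF components come from is to enumerate every binary vector over T and keep those that satisfy the query. This is a brute-force sketch (it checks all 2^n vectors, so it only suits small n), and the names `query` and `to_dnf` are made up for the example.

```python
from itertools import product

T = ["pudding", "jam", "traffic", "lane", "treacle"]

def query(pudding, jam, traffic, lane, treacle):
    """The example query: (jam OR treacle) AND pudding AND NOT lane AND NOT traffic."""
    return bool((jam or treacle) and pudding and not lane and not traffic)

def to_dnf(predicate, n_terms):
    """Return every binary vector of length n_terms that satisfies predicate."""
    return [bits for bits in product((0, 1), repeat=n_terms) if predicate(*bits)]

q_dnf = to_dnf(query, len(T))
print(q_dnf)  # [(1, 0, 0, 0, 1), (1, 1, 0, 0, 0), (1, 1, 0, 0, 1)]
```

Each vector in the result is one conjunctive component; their disjunction is the query in DNF.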

T = {pudding, jam, traffic, lane, treacle}

[Figure: Venn diagram of the index terms pudding, jam, traffic, lane, treacle]

collecting results

T = {pudding, jam, traffic, lane, treacle}

Query: (‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’

Answer: d1 = (1, 1, 0, 0, 0) Jam pud recipe

[Figure: Venn diagram of pudding, jam, traffic, lane, treacle, annotated with (jam ∨ treacle) ∧ (pudding) − Lane − Traffic]
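The Boolean matching rule can be checked against the three example documents with a few lines of code. The sketch hard-codes the DNF components and document vectors given earlier; the dictionary name `DOCS` is an assumption.

```python
# DNF components of the example query, over T = (pudding, jam, traffic, lane, treacle).
Q_DNF = [(1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (1, 1, 0, 0, 1)]

DOCS = {
    "d1": (1, 1, 0, 0, 0),  # Recipe for jam pudding
    "d2": (0, 0, 1, 1, 0),  # DoT report on traffic lanes
    "d3": (1, 1, 1, 1, 0),  # Radio item on traffic jam in Pudding Lane
}

def sim(d, q_dnf):
    """Return 1 if d is equal to a component of q_dnf, otherwise 0."""
    return 1 if d in q_dnf else 0

hits = [name for name, d in DOCS.items() if sim(d, Q_DNF) == 1]
print(hits)  # ['d1']
```

Note that d3 contains both jam and pudding but is still rejected outright: the model is exact match, with no notion of a near miss.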

Statistical vector model

weights, $1 \ge w_{i,j} \ge 0$, no longer binary-valued

query also represented by a vector

$q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$, eg q = (1.0, 0.6, 0.0, 0.0, 0.8)

CF: T = {pudding, jam, traffic, lane, treacle}

to match jth document with a query:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j| \times |q|} = \frac{\sum_{i=1}^{n} (w_{i,j} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2 \times \sum_{i=1}^{n} w_{i,q}^2}}$$

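[Figure: document vector D1 with components $(w_{1,1}, w_{2,1})$ and query vector Q with components $(w_{1,q}, w_{2,q})$, drawn against term axes T1 and T2, with angle $\theta$ between them]

$$\frac{\sum_{i=1}^{n} (w_{i,j} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2 \times \sum_{i=1}^{n} w_{i,q}^2}} = \cos(\theta)$$

Cosine coefficient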

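[Figure: D1 and Q drawn against term axes T1 and T2, pointing in the same direction, $\theta = 0$]

$$\frac{\sum_{i=1}^{n} (w_{i,j} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2 \times \sum_{i=1}^{n} w_{i,q}^2}} = \cos(0) = 1$$

Cosine coefficient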

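[Figure: D1 lying along axis T1 ($w_{2,1} = 0$) and Q lying along axis T2 ($w_{1,q} = 0$), $\theta = 90º$]

$$\frac{\sum_{i=1}^{n} (w_{i,j} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2 \times \sum_{i=1}^{n} w_{i,q}^2}} = \cos(90º) = 0$$

Cosine coefficient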
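The cosine coefficient and its two limiting cases can be sketched as follows. Returning 0.0 for an all-zero vector is an assumption made here to avoid dividing by zero; the slides do not cover that case.

```python
import math

def cosine(d, q):
    """Cosine of the angle between weight vectors d and q of equal length."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d) * sum(w * w for w in q))
    return dot / norm if norm else 0.0  # assumed convention for a zero vector

same_direction = cosine((1.0, 2.0), (2.0, 4.0))  # theta = 0
orthogonal = cosine((0.7, 0.0), (0.0, 0.4))      # theta = 90 degrees
print(same_direction, orthogonal)
```

Only the direction of the vectors matters: scaling d or q by a positive constant leaves the coefficient unchanged.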

q = (1.0, 0.6, 0.0, 0.0, 0.8)

d1 = (0.8, 0.8, 0.0, 0.0, 0.2) Jam pud recipe

$\sum_{i=1}^{n} (w_{i,1} \times w_{i,q}) = 0.8 \times 1.0 + 0.8 \times 0.6 + 0.0 \times 0.0 + 0.0 \times 0.0 + 0.2 \times 0.8 = 1.44$

$\sum_{i=1}^{n} w_{i,1}^2 = 0.8^2 + 0.8^2 + 0.0^2 + 0.0^2 + 0.2^2 = 1.32$

$\sum_{i=1}^{n} w_{i,q}^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$

$$\mathrm{sim}(d_1, q) = \frac{1.44}{\sqrt{1.32 \times 2.0}} = 0.89$$

q = (1.0, 0.6, 0.0, 0.0, 0.8)

d2 = (0.0, 0.0, 0.9, 0.8, 0.0) DoT Report

$\sum_{i=1}^{n} (w_{i,2} \times w_{i,q}) = 0.0 \times 1.0 + 0.0 \times 0.6 + 0.9 \times 0.0 + 0.8 \times 0.0 + 0.0 \times 0.8 = 0.0$

$\sum_{i=1}^{n} w_{i,2}^2 = 0.0^2 + 0.0^2 + 0.9^2 + 0.8^2 + 0.0^2 = 1.45$

$\sum_{i=1}^{n} w_{i,q}^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$

$$\mathrm{sim}(d_2, q) = \frac{0.0}{\sqrt{1.45 \times 2.0}} = 0.0$$

q = (1.0, 0.6, 0.0, 0.0, 0.8)

d3 = (0.6, 0.9, 1.0, 0.6, 0.0) Radio Traffic Report

$\sum_{i=1}^{n} (w_{i,3} \times w_{i,q}) = 0.6 \times 1.0 + 0.9 \times 0.6 + 1.0 \times 0.0 + 0.6 \times 0.0 + 0.0 \times 0.8 = 1.14$

$\sum_{i=1}^{n} w_{i,3}^2 = 0.6^2 + 0.9^2 + 1.0^2 + 0.6^2 + 0.0^2 = 2.53$

$\sum_{i=1}^{n} w_{i,q}^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$

$$\mathrm{sim}(d_3, q) = \frac{1.14}{\sqrt{2.53 \times 2.0}} = 0.51$$

collecting results

q = (1.0, 0.6, 0.0, 0.0, 0.8)

CF: T = {pudding, jam, traffic, lane, treacle}

Rank   document vector                    document (sim)

1.     d1 = (0.8, 0.8, 0.0, 0.0, 0.2)     Jam pud recipe (0.89)

2.     d3 = (0.6, 0.9, 1.0, 0.6, 0.0)     Radio Traffic Report (0.51)
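The ranking can be reproduced with a short script. It is a sketch under two assumptions: documents with similarity 0 are left out of the ranking (as d2 is in the results listed), and scores are rounded to two decimals for display. The names `DOCS` and `rank` are hypothetical.

```python
import math

Q = (1.0, 0.6, 0.0, 0.0, 0.8)

DOCS = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.2),  # Jam pud recipe
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),  # DoT Report
    "d3": (0.6, 0.9, 1.0, 0.6, 0.0),  # Radio Traffic Report
}

def cosine(d, q):
    """Cosine coefficient between two weight vectors of equal length."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d) * sum(w * w for w in q))
    return dot / norm if norm else 0.0

def rank(docs, q):
    """Return (name, sim) pairs with sim > 0, highest similarity first."""
    scored = [(name, cosine(d, q)) for name, d in docs.items()]
    scored = [(name, s) for name, s in scored if s > 0]  # assumed: drop non-matching documents
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for position, (name, s) in enumerate(rank(DOCS, Q), start=1):
    print(f"{position}. {name} ({s:.2f})")
# 1. d1 (0.89)
# 2. d3 (0.51)
```

Unlike the Boolean model, d3 is retrieved as a partial match even though it contains the unwanted terms traffic and lane.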

Discussion: Set theoretic model

Boolean model is simple, queries have precise semantics, but it is an ‘exact match’ model, and does not rank results

Boolean model popular with bibliographic systems; available on some search engines

Users find Boolean queries hard to formulate

Attempts to use set theoretic model as basis for a partial-match system: Fuzzy set model and the extended Boolean model.

Discussion: Vector Model

Vector model is simple and fast, and results show that it leads to ‘good’ retrieval.

Partial matching leads to ranked output

Popular model with search engines

Underlying assumption of term independence (not realistic! Phrases, collocations, grammar)

Generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).

questions raised

Where do the index terms come from? (ALL the words in the source documents?)

What determines the weights?

How well can we expect these systems to work for practical applications?

How can we improve them?

How do we integrate IR into more traditional DB management?

Questions to think about

Why is a traditional database unsuited to retrieval of unstructured information?

How would you re-express a Boolean query, eg (A or B or (C and not D)), in disjunctive normal form?

For the matching coefficient, sim(., .), show that $0 \le \mathrm{sim}(\cdot, \cdot) \le 1$, and that sim(a, a) = 1.

Compare and contrast the ‘vector’ and ‘set theoretic’ models in terms of power of representation of documents and queries.
