
Intro to Information Retrieval

By the end of the lecture you should be able to: explain the differences between database and information retrieval technologies; describe the basic maths underlying the set-theoretic and vector models of classical IR.

Reminder: efficiency is vital

Google finds documents which match your keywords; this must be done EFFICIENTLY – we can't just go through each document from start to end for each keyword.

So the cache stores a copy of the document, and also a “cut-down” version of the document for searching: just a “bag of words”, a sorted list (or array/vector/…) of the words appearing in the document (with links back to the full document)

Try to match keywords against this list; if found, then return the full document
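As a rough illustration of this idea (my own sketch, not the lecture's code; the document texts and function names are made up), here is a minimal Python version of a sorted "bag of words" with a link back to the full document:

```python
def bag_of_words(text):
    """Return the sorted list of distinct lower-cased words in the text."""
    return sorted(set(text.lower().split()))

# Illustrative documents; in a real engine these would be cached copies of web pages.
documents = {
    1: "Recipe for jam pudding",
    2: "DoT report on traffic lanes",
}
bags = {doc_id: bag_of_words(text) for doc_id, text in documents.items()}

def search(keyword):
    """Return the full text of every document whose bag of words contains the keyword."""
    return [documents[doc_id] for doc_id, bag in bags.items()
            if keyword.lower() in bag]

print(search("jam"))   # ['Recipe for jam pudding']
```

Because the word list is sorted, a real implementation could use binary search rather than the linear `in` test used here.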

Even cleverer: dictionary and inverted file…

Inverted file structure

[Figure: inverted file structure. A dictionary lists each term with its document frequency (Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), …); each dictionary entry points into the inverted (postings) file, which holds the numbers of the documents containing that term; the postings in turn point to the full documents (Doc 1, Doc 2, …, Doc 6, …) in the data file.]
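A minimal sketch of the dictionary/postings idea in Python (illustrative only; the documents are the toy examples used later in the lecture, and all names are my own):

```python
from collections import defaultdict

# Illustrative data file: document number -> full text.
documents = {
    1: "recipe for jam pudding",
    2: "dot report on traffic lanes",
    3: "radio item on traffic jam in pudding lane",
}

# Build the inverted (postings) file: term -> sorted list of document numbers.
postings = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        postings[term].add(doc_id)
postings = {term: sorted(ids) for term, ids in postings.items()}

# The dictionary: term -> document frequency (how many documents contain it).
dictionary = {term: len(ids) for term, ids in postings.items()}

print(dictionary["jam"], postings["jam"])   # 2 [1, 3]
```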

IR vs DBMS

                     DBMS            IR
match                exact           partial or best match
inference            deduction       induction
model                deterministic   probabilistic
data                 record/field    text document
query language       artificial      natural(?)
query specification  complete        incomplete
items wanted         matching        relevant
error response       sensitive       insensitive

informal introduction

IR was developed for bibliographic systems. We shall refer to ‘documents’, but the technique extends beyond items of text.

Central to IR is the representation of a document by a set of ‘descriptors’ or ‘index terms’ (“words in the document”).

Searching for a document is carried out (mainly) in the ‘space’ of index terms.

We need a language for formulating queries, and a method for matching queries with document descriptors.

architecture

[Figure: system architecture. The user issues a query to the query-matching component, which matches it against the object base (objects and their descriptions) and returns hits; feedback from the user on the hits is passed to a learning component, which refines subsequent matching.]

basic notation

Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j} \ge 0$ to be a weight associated with the i-th keyword and the j-th document.

For the j-th document, we define an index term vector $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$.

For example: D = {d1, d2, d3}, T = {pudding, jam, traffic, lane, treacle}

d1 = (1, 1, 0, 0, 0) – recipe for jam pudding
d2 = (0, 0, 1, 1, 0) – DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0) – radio item on traffic jam in Pudding Lane
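As a small illustrative sketch (my own, with assumed details such as the crude prefix matching used in place of proper stemming), these binary index-term vectors can be derived from the documents like this:

```python
# Index terms in a fixed order, and the three example documents.
T = ["pudding", "jam", "traffic", "lane", "treacle"]
docs = [
    "recipe for jam pudding",
    "dot report on traffic lanes",
    "radio item on traffic jam in pudding lane",
]

def term_vector(text, terms):
    """Binary index-term vector: 1 if the term occurs (as a word prefix, a crude
    stand-in for stemming, so 'lanes' matches 'lane'), else 0."""
    words = text.lower().split()
    return [1 if any(w.startswith(t) for w in words) else 0 for t in terms]

for d in docs:
    print(term_vector(d, T))
# [1, 1, 0, 0, 0]
# [0, 0, 1, 1, 0]
# [1, 1, 1, 1, 0]
```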

set theoretic, Boolean model

Queries are Boolean expressions formed using keywords, e.g.:

(‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’

The query is re-expressed in disjunctive normal form (DNF), e.g.

q_DNF = (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)

To match a document with a query:

sim(d, q_DNF) = 1 if d is equal to a component of q_DNF
              = 0 otherwise

CF: T = {pudding, jam, traffic, lane, treacle}

d1 = (1, 1, 0, 0, 0),

d2 = (0, 0, 1, 1, 0),

d3 = (1, 1, 1, 1, 0)

q_DNF = (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)

[Figure: Venn diagram over the index terms pudding, jam, traffic, lane and treacle, showing the regions of term space picked out by the query.]

collecting results

T = {pudding, jam, traffic, lane, treacle}

Query: (‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’, i.e. the set ((jam ∪ treacle) ∩ pudding) − lane − traffic

Answer: d1 = (1, 1, 0, 0, 0) – jam pudding recipe

[Figure: Venn diagram over the five index terms, with only d1 falling in the region selected by the query.]
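A minimal Python sketch of this exact-match rule (my own illustration; the document vectors and DNF components are those of the example above):

```python
# Document vectors over T = (pudding, jam, traffic, lane, treacle).
d1 = (1, 1, 0, 0, 0)   # recipe for jam pudding
d2 = (0, 0, 1, 1, 0)   # DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0)   # radio item on traffic jam in Pudding Lane

# DNF of ('Jam' or 'Treacle') and 'Pudding' and not 'Lane' and not 'Traffic'.
q_dnf = [(1, 1, 0, 0, 0), (1, 0, 0, 0, 1), (1, 1, 0, 0, 1)]

def sim_boolean(d, q_dnf):
    """1 if the document vector equals one of the DNF components, else 0."""
    return 1 if d in q_dnf else 0

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, sim_boolean(d, q_dnf))
# d1 1
# d2 0
# d3 0
```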

Statistical vector model

Weights $0 \le w_{i,j} \le 1$ are no longer binary-valued; the query is also represented by a vector:

$q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$, e.g. q = (1.0, 0.6, 0.0, 0.0, 0.8)

CF: T = {pudding, jam, traffic, lane, treacle}

To match the j-th document with a query:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}$$

Cosine coefficient

[Figure: the document vector D1 and the query vector Q plotted in a two-term space (axes T1 and T2), with components w_{1,1}, w_{2,1} and w_{1,q}, w_{2,q}, and angle θ between the two vectors.]

The similarity is the cosine of the angle θ between document and query vectors:

$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}} = \cos\theta$$

When D1 and Q point in the same direction, θ = 0 and sim = cos(0) = 1; when they are orthogonal (e.g. $w_{1,q} = 0$ and $w_{2,1} = 0$), θ = 90° and sim = cos(90°) = 0.

q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2) – jam pudding recipe

$$\sum_{i=1}^{n} w_{i,1}\, w_{i,q} = 0.8{\times}1.0 + 0.8{\times}0.6 + 0.0{\times}0.0 + 0.0{\times}0.0 + 0.2{\times}0.8 = 1.44$$
$$\sum_{i=1}^{n} w_{i,1}^{2} = 0.8^2 + 0.8^2 + 0.0^2 + 0.0^2 + 0.2^2 = 1.32, \qquad \sum_{i=1}^{n} w_{i,q}^{2} = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$$
$$\mathrm{sim}(d_1, q) = \frac{1.44}{\sqrt{1.32 \times 2.0}} = 0.89$$

q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0) – DoT report

$$\sum_{i=1}^{n} w_{i,2}\, w_{i,q} = 0.0{\times}1.0 + 0.0{\times}0.6 + 0.9{\times}0.0 + 0.8{\times}0.0 + 0.0{\times}0.8 = 0.0$$
$$\sum_{i=1}^{n} w_{i,2}^{2} = 0.0^2 + 0.0^2 + 0.9^2 + 0.8^2 + 0.0^2 = 1.45, \qquad \sum_{i=1}^{n} w_{i,q}^{2} = 2.0$$
$$\mathrm{sim}(d_2, q) = \frac{0.0}{\sqrt{1.45 \times 2.0}} = 0.0$$

q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0) – radio traffic report

$$\sum_{i=1}^{n} w_{i,3}\, w_{i,q} = 0.6{\times}1.0 + 0.9{\times}0.6 + 1.0{\times}0.0 + 0.6{\times}0.0 + 0.0{\times}0.8 = 1.14$$
$$\sum_{i=1}^{n} w_{i,3}^{2} = 0.6^2 + 0.9^2 + 1.0^2 + 0.6^2 + 0.0^2 = 2.53, \qquad \sum_{i=1}^{n} w_{i,q}^{2} = 2.0$$
$$\mathrm{sim}(d_3, q) = \frac{1.14}{\sqrt{2.53 \times 2.0}} = 0.51$$

collecting results

CF: T = {pudding, jam, traffic, lane, treacle}, q = (1.0, 0.6, 0.0, 0.0, 0.8)

Rank   document vector                   document (sim)
1      d1 = (0.8, 0.8, 0.0, 0.0, 0.2)    jam pudding recipe (0.89)
2      d3 = (0.6, 0.9, 1.0, 0.6, 0.0)    radio traffic report (0.51)
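For illustration (my own sketch, not from the lecture), the cosine coefficient can be computed and used to rank the example documents as follows; the printed values reproduce the worked figures above:

```python
from math import sqrt

def cosine_sim(d, q):
    """Cosine coefficient between a document vector d and a query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = sqrt(sum(wd * wd for wd in d))
    norm_q = sqrt(sum(wq * wq for wq in q))
    return 0.0 if norm_d == 0 or norm_q == 0 else dot / (norm_d * norm_q)

q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {
    "d1 jam pudding recipe":   (0.8, 0.8, 0.0, 0.0, 0.2),
    "d2 DoT report":           (0.0, 0.0, 0.9, 0.8, 0.0),
    "d3 radio traffic report": (0.6, 0.9, 1.0, 0.6, 0.0),
}

# Rank the documents by similarity to the query, highest first.
for name, d in sorted(docs.items(), key=lambda kv: cosine_sim(kv[1], q), reverse=True):
    print(f"{name}: {cosine_sim(d, q):.2f}")
# d1 jam pudding recipe: 0.89
# d3 radio traffic report: 0.51
# d2 DoT report: 0.00
```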

Discussion: Set theoretic model

The Boolean model is simple and queries have precise semantics, but it is an ‘exact match’ model and does not rank results.

The Boolean model is popular with bibliographic systems and is available on some search engines.

Users find Boolean queries hard to formulate.

There have been attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.

Discussion: Vector Model

The vector model is simple and fast, and results show it leads to ‘good’ retrieval.

Partial matching leads to ranked output, and it is a popular model with search engines.

Its underlying assumption of term independence is not realistic (phrases, collocations, grammar). The generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).

questions raised

Where do the index terms come from? (ALL the words in the source documents?)

What determines the weights?

How well can we expect these systems to work for practical applications? How can we improve them?

How do we integrate IR into more traditional DB management?

Questions to think about

Why is a traditional database unsuited to the retrieval of unstructured information?

How would you re-express a Boolean query, eg (A or B or (C and not D)), in disjunctive normal form?

For the matching coefficient sim(·, ·), show that 0 ≤ sim(·, ·) ≤ 1, and that sim(a, a) = 1.

Compare and contrast the ‘vector’ and ‘set theoretic’ models in terms of power of representation of documents and queries.
