Intro to Information Retrieval. By the end of the lecture you should be able to: explain the differences between database and information retrieval technologies; describe the basic maths underlying set-theoretic and vector models of classical IR.

Upload: autumn-barry

Post on 28-Mar-2015


Page 1

Intro to Information Retrieval

By the end of the lecture you should be able to:

explain the differences between database and information retrieval technologies

describe the basic maths underlying set-theoretic and vector models of classical IR.

Page 2

Reminder: efficiency is vital

Reminder: Google finds documents which match your keywords; this must be done EFFICIENTLY – can't just go through each document from start to end for each keyword

So, cache stores copy of document, and also a “cut-down” version of the document for searching: just a “bag of words”, a sorted list (or array/vector/…) of words appearing in the document (with links back to full document)

Try to match keywords against this list; if found, then return the full document

Even cleverer: dictionary and inverted file…
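The "bag of words" lookup described on this slide can be sketched in Python as follows. This is a minimal illustration rather than a description of how any particular search engine works: the function names `bag_of_words`, `contains` and `search`, the whitespace tokenisation, and the choice of sample documents are all assumptions made for the example.

```python
from bisect import bisect_left

def bag_of_words(text):
    """Return the sorted list of distinct lower-cased words in a document."""
    return sorted(set(text.lower().split()))

def contains(bag, keyword):
    """Binary-search the sorted bag instead of scanning the full document."""
    key = keyword.lower()
    i = bisect_left(bag, key)
    return i < len(bag) and bag[i] == key

def search(documents, keywords):
    """Return the full documents whose bag contains every keyword."""
    # In a real system the bags would be built once and cached (assumption).
    bags = [bag_of_words(doc) for doc in documents]
    return [doc for doc, bag in zip(documents, bags)
            if all(contains(bag, k) for k in keywords)]

# Sample documents chosen for illustration only
docs = ["Recipe for jam pudding", "DoT report on traffic lanes"]
print(search(docs, ["jam", "pudding"]))
```

Keeping each bag sorted means a keyword test costs a binary search over the distinct words rather than a pass over the whole text.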

Page 3

Inverted file structure

[Figure: three linked structures labelled dictionary, inverted or postings file, and data file. Dictionary entries: Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), ... Postings entries: 1, 2, 1, 2, 3, 2, 2, 3, 4, ... Data file entries: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, ... A further column of numbers reads 1, 3, 6, 7, 9, ...]
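A dictionary that points into a postings file, which in turn points at documents, might be sketched as below. This is an in-memory approximation under stated assumptions: the names `build_inverted_file` and `lookup` are hypothetical, a Python dict stands in for the dictionary and postings file, and document numbers start at 1.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Map each term to the sorted list of document numbers containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents, start=1):
        for word in text.lower().split():
            postings[word].add(doc_id)
    # Sorted terms play the role of the dictionary; the lists are the postings.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

def lookup(inverted_file, keywords):
    """Return the document numbers that contain every keyword."""
    result = None
    for k in keywords:
        ids = set(inverted_file.get(k.lower(), []))
        result = ids if result is None else result & ids
    return sorted(result or [])

# Sample data file, chosen for illustration only
data_file = [
    "Recipe for jam pudding",
    "DoT report on traffic lanes",
    "Radio item on traffic jam in Pudding Lane",
]
inverted = build_inverted_file(data_file)
print(inverted["jam"])
print(lookup(inverted, ["traffic", "jam"]))
```

The point of the structure is that a query touches only the postings of its own keywords, never the documents that do not mention them.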

Page 4

IR vs DBMS

                      DBMS            IR
match                 exact           partial or best match
inference             deduction       induction
model                 deterministic   probabilistic
data                  record/field    text document
query language        artificial      natural?
query specification   complete        incomplete
items wanted          matching        relevant
error response        sensitive       insensitive

Page 5

informal introduction

IR was developed for bibliographic systems. We shall refer to ‘documents’, but the technique extends beyond items of text.

Central to IR is the representation of a document by a set of ‘descriptors’ or ‘index terms’ (“words in the document”).

Searching for a document is carried out (mainly) in the ‘space’ of index terms.

We need a language for formulating queries, and a method for matching queries with document descriptors.

Page 6

architecture

[Figure: block diagram with the components user, Query matching, Learning component, and Object base (objects and their descriptions); the arrows are labelled query, hits, and feedback.]

Page 7

basic notation

Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j} \ge 0$ to be a weight associated with the ith keyword and the jth document.

For the jth document, we define an index term vector, $d_j$:

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$$

For example: D = {d1, d2, d3},

T = {pudding, jam, traffic, lane, treacle}

d1 = (1, 1, 0, 0, 0)   Recipe for jam pudding

d2 = (0, 0, 1, 1, 0)   DoT report on traffic lanes

d3 = (1, 1, 1, 1, 0)   Radio item on traffic jam in Pudding Lane
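The binary index term vectors in this example can be derived mechanically from the three document titles. The sketch below is one way to do it, under assumptions not stated in the slides: the helper names `normalise` and `term_vector` are hypothetical, words are split on whitespace, and a trailing "s" is stripped so that "lanes" matches the index term "lane".

```python
TERMS = ["pudding", "jam", "traffic", "lane", "treacle"]

def normalise(word):
    """Lower-case a word and strip a trailing 's' (a crude assumption)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

def term_vector(text, terms=TERMS):
    """Return the binary weights (w_1j, ..., w_nj) for one document."""
    words = {normalise(w) for w in text.split()}
    return tuple(1 if t in words else 0 for t in terms)

d1 = term_vector("Recipe for jam pudding")
d2 = term_vector("DoT report on traffic lanes")
d3 = term_vector("Radio item on traffic jam in Pudding Lane")
print(d1, d2, d3)
```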

Page 8

set theoretic, Boolean model

Queries are Boolean expressions formed using keywords, eg:

$$(\text{Jam} \lor \text{Treacle}) \land \text{Pudding} \land \lnot\,\text{Lane} \land \lnot\,\text{Traffic}$$

Query is re-expressed in disjunctive normal form (DNF), eg:

$$(1, 1, 0, 0, 0) \lor (1, 0, 0, 0, 1) \lor (1, 1, 0, 0, 1)$$

To match a document with a query:

$$\mathrm{sim}(d, q_{DNF}) = \begin{cases} 1 & \text{if } d \text{ is equal to a component of } q_{DNF} \\ 0 & \text{otherwise} \end{cases}$$

CF: T = {pudding, jam, traffic, lane, treacle}
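The matching rule above can be checked with a short Python sketch. It assumes the example query is (Jam OR Treacle) AND Pudding AND NOT Lane AND NOT Traffic, and it obtains the DNF components by brute-force enumeration of all binary term vectors, which is only practical for a handful of terms. The names `query`, `dnf_components` and `sim` are hypothetical.

```python
from itertools import product

TERMS = ("pudding", "jam", "traffic", "lane", "treacle")

def query(pudding, jam, traffic, lane, treacle):
    """(Jam OR Treacle) AND Pudding AND NOT Lane AND NOT Traffic."""
    return bool((jam or treacle) and pudding and not lane and not traffic)

def dnf_components(predicate, n_terms=len(TERMS)):
    """Enumerate every binary term vector that satisfies the query."""
    return {v for v in product((0, 1), repeat=n_terms) if predicate(*v)}

def sim(d, q_dnf):
    """1 if d equals a component of the DNF query, 0 otherwise."""
    return 1 if tuple(d) in q_dnf else 0

q_dnf = dnf_components(query)
docs = {"d1": (1, 1, 0, 0, 0), "d2": (0, 0, 1, 1, 0), "d3": (1, 1, 1, 1, 0)}
for name, d in docs.items():
    print(name, sim(d, q_dnf))
```

Because the result is either 1 or 0, every matching document is equally "good": the model gives no ordering among hits.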

Page 9

d1 = (1, 1, 0, 0, 0),

d2 = (0, 0, 1, 1, 0),

d3 = (1, 1, 1, 1, 0)

$$(1, 1, 0, 0, 0) \lor (1, 0, 0, 0, 1) \lor (1, 1, 0, 0, 1)$$

T = {pudding, jam, traffic, lane, treacle}

[Figure: Venn diagram of the sets pudding, jam, traffic, lane and treacle.]

Page 10

collecting results

T = {pudding, jam, traffic, lane, treacle}

Query:

$$(\text{Jam} \lor \text{Treacle}) \land \text{Pudding} \land \lnot\,\text{Lane} \land \lnot\,\text{Traffic}$$

Answer: d1 = (1, 1, 0, 0, 0) Jam pud recipe

[Figure: Venn diagram of the sets pudding, jam, traffic, lane and treacle, with the query region marked as (jam ∪ treacle) ∩ (pudding) − Lane − Traffic.]

Page 11

Statistical vector model

weights, $1 \ge w_{i,j} \ge 0$, no longer binary-valued

query also represented by a vector

$q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$ – eg q = (1.0, 0.6, 0.0, 0.0, 0.8)

CF: T = {pudding, jam, traffic, lane, treacle}

to match jth document with a query:

$$\mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{|d_j| \times |q|} = \frac{\sum_{i=1}^{n} (w_{i,j} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2} \times \sum_{i=1}^{n} w_{i,q}^{2}}}$$
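The similarity measure translates almost line for line into code. The following is a minimal sketch; the function name `cosine_sim` is hypothetical, and returning 0.0 for an all-zero vector is an assumption, since the formula itself is undefined in that case.

```python
import math

def cosine_sim(d, q):
    """Cosine coefficient between two equal-length weight vectors."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same number of index terms")
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)

q = (1.0, 0.6, 0.0, 0.0, 0.8)
print(cosine_sim(q, q))
```

Dividing by the two vector lengths means a long document is not favoured simply for having larger weights; only the direction of the vector matters.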

Page 12

Cosine coefficient

[Figure: document vector D1 with components w11 and w21, and query vector Q with components w1q and w2q, drawn on axes T1 and T2 with angle $\theta$ between them.]

$$\frac{\sum_{i=1}^{n} (w_{i,j} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2} \times \sum_{i=1}^{n} w_{i,q}^{2}}} = \cos(\theta)$$

Page 13

Cosine coefficient

[Figure: D1 (components w11, w21) and Q (components w1q, w2q) drawn on axes T1 and T2, pointing in the same direction.]

$$\frac{\sum_{i=1}^{n} (w_{i,j} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2} \times \sum_{i=1}^{n} w_{i,q}^{2}}} = \cos(0) = 1$$

$\theta = 0$

Page 14

Cosine coefficient

[Figure: D1 and Q drawn on axes T1 and T2 at right angles, with w1q = 0 and w21 = 0.]

$$\frac{\sum_{i=1}^{n} (w_{i,j} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2} \times \sum_{i=1}^{n} w_{i,q}^{2}}} = \cos(90^\circ) = 0$$

$\theta = 90^\circ$
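For the two-term case drawn on these slides, both limiting values can be checked directly from the formula. The short derivation below is a sketch; the scaling constant $c > 0$ is introduced only for illustration and does not appear in the slides.

$$
\cos\theta = \frac{w_{1,1} w_{1,q} + w_{2,1} w_{2,q}}{\sqrt{(w_{1,1}^{2} + w_{2,1}^{2}) \times (w_{1,q}^{2} + w_{2,q}^{2})}}
$$

If Q points in the same direction as D1, so that $w_{1,q} = c\,w_{1,1}$ and $w_{2,q} = c\,w_{2,1}$:

$$
\cos\theta = \frac{c\,(w_{1,1}^{2} + w_{2,1}^{2})}{\sqrt{c^{2}\,(w_{1,1}^{2} + w_{2,1}^{2})^{2}}} = 1, \qquad \theta = 0
$$

If $w_{1,q} = 0$ and $w_{2,1} = 0$, with $w_{1,1}$ and $w_{2,q}$ non-zero:

$$
\cos\theta = \frac{w_{1,1} \times 0 + 0 \times w_{2,q}}{\sqrt{w_{1,1}^{2} \times w_{2,q}^{2}}} = 0, \qquad \theta = 90^\circ
$$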

Page 15

q = (1.0, 0.6, 0.0, 0.0, 0.8)

d1 = (0.8, 0.8, 0.0, 0.0, 0.2) Jam pud recipe

$$\sum_{i=1}^{n} (w_{i,1} \times w_{i,q}) = 0.8 \times 1.0 + 0.8 \times 0.6 + 0.0 \times 0.0 + 0.0 \times 0.0 + 0.2 \times 0.8 = 1.44$$

$$\sum_{i=1}^{n} w_{i,1}^{2} = 0.8^{2} + 0.8^{2} + 0.0^{2} + 0.0^{2} + 0.2^{2} = 1.32$$

$$\sum_{i=1}^{n} w_{i,q}^{2} = 1.0^{2} + 0.6^{2} + 0.0^{2} + 0.0^{2} + 0.8^{2} = 2.0$$

$$\mathrm{sim}(d_1, q) = \frac{\sum_{i=1}^{n} (w_{i,1} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,1}^{2} \times \sum_{i=1}^{n} w_{i,q}^{2}}} = \frac{1.44}{\sqrt{1.32 \times 2.0}} = 0.89$$

Page 16

q = (1.0, 0.6, 0.0, 0.0, 0.8)

d2 = (0.0, 0.0, 0.9, 0.8, 0.0) DoT Report

$$\sum_{i=1}^{n} (w_{i,2} \times w_{i,q}) = 0.0 \times 1.0 + 0.0 \times 0.6 + 0.9 \times 0.0 + 0.8 \times 0.0 + 0.0 \times 0.8 = 0.0$$

$$\sum_{i=1}^{n} w_{i,2}^{2} = 0.0^{2} + 0.0^{2} + 0.9^{2} + 0.8^{2} + 0.0^{2} = 1.45$$

$$\sum_{i=1}^{n} w_{i,q}^{2} = 1.0^{2} + 0.6^{2} + 0.0^{2} + 0.0^{2} + 0.8^{2} = 2.0$$

$$\mathrm{sim}(d_2, q) = \frac{\sum_{i=1}^{n} (w_{i,2} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,2}^{2} \times \sum_{i=1}^{n} w_{i,q}^{2}}} = \frac{0.0}{\sqrt{1.45 \times 2.0}} = 0.0$$

Page 17

q = (1.0, 0.6, 0.0, 0.0, 0.8)

d3 = (0.6, 0.9, 1.0, 0.6, 0.0) Radio Traffic Report

$$\sum_{i=1}^{n} (w_{i,3} \times w_{i,q}) = 0.6 \times 1.0 + 0.9 \times 0.6 + 1.0 \times 0.0 + 0.6 \times 0.0 + 0.0 \times 0.8 = 1.14$$

$$\sum_{i=1}^{n} w_{i,3}^{2} = 0.6^{2} + 0.9^{2} + 1.0^{2} + 0.6^{2} + 0.0^{2} = 2.53$$

$$\sum_{i=1}^{n} w_{i,q}^{2} = 1.0^{2} + 0.6^{2} + 0.0^{2} + 0.0^{2} + 0.8^{2} = 2.0$$

$$\mathrm{sim}(d_3, q) = \frac{\sum_{i=1}^{n} (w_{i,3} \times w_{i,q})}{\sqrt{\sum_{i=1}^{n} w_{i,3}^{2} \times \sum_{i=1}^{n} w_{i,q}^{2}}} = \frac{1.14}{\sqrt{2.53 \times 2.0}} = 0.51$$

Page 18

collecting results

q = (1.0, 0.6, 0.0, 0.0, 0.8)

CF: T = {pudding, jam, traffic, lane, treacle}

Rank   document vector                   document               (sim)
1.     d1 = (0.8, 0.8, 0.0, 0.0, 0.2)    Jam pud recipe         (0.89)
2.     d3 = (0.6, 0.9, 1.0, 0.6, 0.0)    Radio Traffic Report   (0.51)
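The ranked list can be reproduced with a few lines of Python. This is a sketch under assumptions: the function names `cosine_sim` and `rank` are hypothetical, and dropping documents with a similarity of zero is a choice made here to mirror the two-row result, not something the slides state as a rule.

```python
import math

def cosine_sim(d, q):
    """Cosine coefficient; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(d, q))
    norms = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norms if norms else 0.0

def rank(documents, q):
    """Score every document, drop zero scores, and sort best first."""
    scored = [(name, cosine_sim(vec, q)) for name, vec in documents.items()]
    hits = [s for s in scored if s[1] > 0]
    return sorted(hits, key=lambda s: s[1], reverse=True)

q = (1.0, 0.6, 0.0, 0.0, 0.8)
documents = {
    "d1 Jam pud recipe": (0.8, 0.8, 0.0, 0.0, 0.2),
    "d2 DoT Report": (0.0, 0.0, 0.9, 0.8, 0.0),
    "d3 Radio Traffic Report": (0.6, 0.9, 1.0, 0.6, 0.0),
}
for position, (name, score) in enumerate(rank(documents, q), start=1):
    print(f"{position}. {name} ({score:.2f})")
```

Unlike the Boolean model, d3 is still returned even though it mentions traffic and lane; it is simply ranked below d1.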

Page 19

Discussion: Set theoretic model

Boolean model is simple, queries have precise semantics, but it is an ‘exact match’ model, and does not rank results

Boolean model popular with bibliographic systems; available on some search engines

Users find Boolean queries hard to formulate

Attempts to use set theoretic model as basis for a partial-match system: Fuzzy set model and the extended Boolean model.

Page 20

Discussion: Vector Model

Vector model is simple and fast, and results show it leads to ‘good’ results.

Partial matching leads to ranked output

Popular model with search engines

Underlying assumption of term independence (not realistic! Phrases, collocations, grammar)

Generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).

Page 21

questions raised

Where do the index terms come from? (ALL the words in the source documents?)

What determines the weights?

How well can we expect these systems to work for practical applications?

How can we improve them?

How do we integrate IR into more traditional DB management?

Page 22

Questions to think about

Why is a traditional database unsuited to retrieval of unstructured information?

How would you re-express a Boolean query, eg (A or B or (C and not D)), in disjunctive normal form?

For the matching coefficient, sim(., .), show that $0 \le \mathrm{sim}(\cdot, \cdot) \le 1$, and that sim(a, a) = 1.

Compare and contrast the ‘vector’ and ‘set theoretic’ models in terms of power of representation of documents and queries.