TRANSCRIPT
Fundamentals in Information Retrieval
Jean-Cédric CHAPPELIER
Martin RAJMAN
Emmanuel ECKARD
LIA
Contents
► Introduction
► The vector space model
• Indexing terms (tokenization, filtering, Bags of Words)
• Weighting schemes (tf.idf, Okapi BM25)
• Cosine proximity measure
► Representing the queries
► Evaluating IR systems
► Limitations of the vector space model and Distributional Semantics
► More advanced models
► Key points and references
Information Retrieval
Definition
Selection of documents relevant to a query in an unstructured collection of documents.
► unstructured: not produced with IR in mind, not a database.
► document: here, natural language text (but could also be video, audio or images)
► query: utterance in natural language (possibly augmented with commands, see later)
► relevant:
1. User’s perspective: answering their IR needs
2. Mathematically: maximizing a formally defined “proximity measure”
Example of Information retrieval: issuing a query on an unstructured collection
► query (“Alan Turing”)
► search among unstructured collection (Wikipedia articles)
Example of Information retrieval: results returned by the system
► list of results with a percentage match
► highest matches first
Ambiguity
Sometimes unintended results occur
Example
query: “Chicago school”
wanted?
► schools in Chicago (IL)?
► body of works in economics?
► architectural style?
► where to learn how to play Chicago (game):
► bridge?
► or poker?
How is IR done?
Tasks
► have the computer represent documents (at the adequate level): preprocessing, indexing, ...
► represent the queries, not necessarily the same way as the documents (short queries, operators, ...)
► define a relevance measure between document and query representations
Similarities with other NLP tasks
► Classification (no query)
► Data mining (formatted data)
► Information extraction (retrieve short parts of documents)
Representing the documents: the vector space model
[Figure: documents d1, d2, d3 as points in a space whose axes are the indexing terms t1, t2, t3.]
► indexing terms define the axes
► documents are points in the vector space
Vector Space model
Objective
Representing documents as vectors derived from the distributions of indexing terms in the document collection.
Principle
► V, a finite vocabulary of indexing terms
► R, a representation space
► R_D : V* → R, a representation function
► similarity: Mprox : R × R → R+
Indexing terms
Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.
► V, a finite vocabulary: aardvark, begin, cry, information, laugh, long, Marianne, retrieval, time, ...
→ Now so long Marianne it’s time that we began to laugh and cry and cry and laugh about it all again.
Indexing the documents
Definition
Selecting the relevant features in the documents (typically words) to use for representing the documents.
In practice
Indexing is the set of mechanisms used to build the vocabulary of indexing terms.
This vocabulary is usually very large: typically several thousand indexing terms.
Themes related to indexing:
► Tokenization
► Stop words
► Zipf and Luhn laws
► Bag of Words (BoW) model
Tokenization
Definition
Tokenization: splitting the text into words (a prerequisite to choosing indexing terms)
Example
► easy: whitespace
Now is the winter of our discontent
Made glorious summer by this son of York
► less easy: spaces are not always indicative of term segmentation (compounds): Distributional Semantics Information Retrieval and Latent Semantic Indexing performance comparison
► agglutinative languages are a problem: Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
► technical terms
Tokenization of technical terms & conclusion on tokenization
e.g. in Chemistry
Methionyl-glutaminyl-arginyl-tyrosyl-glutamyl-seryl-leucyl-phenyl-alanyl-alanyl-glutaminyl-leucyl-
lysyl-glutamyl-arginyl-lysy-glutamyl-gycyl-alanyl-phenyl-alanyl-valyl-prolyl-phenyl-alanyl-valyl-
threonyl-leucyl-glycyl-aspartyl-prolyl-glycyl-isoleucyl-glutamyl-glutaminyl-seryl-leucyl-lysyl-
isoleucyl-aspartyl-threonyl-leucyl-isoleucyl-glutamyl-alanyl-glycyl-alanyl-aspartyl-alanyl-leucyl-
glutamyl-leucyl-glycyl-isoleucyl-prolyl-phenyl-alanyl-seryl-aspartyl-prolyl-leucyl-alanyl-aspartyl-
glycyl-prolyl-threonyl-isoleucyl-glutaminyl-asparaginyl-alanyl-threonyl-leucyl-arginyl-alanyl-
phenyl-alanyl-alanyl-alanyl-glycyl-valyl-threonyl-prolyl-alanyl-glutaminyl-cysteinyl-phenyl-alanyl-
glutamyl-methionyl-leucyl-alanyl-leucyl-isoleucyl-arginyl-glutaminyl-lysyl-histidyl-prolyl-threonyl-
isoleucyl-prolyl-isoleucyl-glycyl-leucyl-leucyl-methionyl-tyrosyl-alanyl-asparaginyl-leucyl-valyl-
phenyl-...
Conclusion on tokenization: it is a full-fledged NLP issue → requires NLP techniques.
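As an illustration of the “easy” whitespace case (a minimal sketch only; the `tokenize` helper and its regular expression are assumptions for this example, not the tokenizer of any real IR system):

```python
import re

def tokenize(text):
    # Lowercase and split on runs of non-letter characters: a crude
    # approximation that already mishandles clitics ("it's" -> it, s)
    # and compounds -- hence the need for real NLP techniques.
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

print(tokenize("Now is the winter of our discontent"))
# ['now', 'is', 'the', 'winter', 'of', 'our', 'discontent']
```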
Choice of indexing terms
Filtering
Choice of indexing terms using various filters:
► on morpho-syntactic categories
► on stop-words
► on frequencies
Benefits
► more informative indexes
► smaller indexes (tractability)
Indexing terms: filtering with morpho-syntactic categories
Filtering
Some morpho-syntactic categories (e.g. determiners, conjunctions, …) do not carry much semantic content, while others (e.g. nouns, verbs, …) do.
► Keep only the terms in a selected set of morpho-syntactic categories (e.g. nouns and adjectives)
Indexing terms: filtering with stop words
Definition
Stop word: term explicitly excluded from indexing.
Example
stop list: the; a; ’s; in; but; I; we; my; your; their; then
Young men’s love then lies
Not truly in their hearts, but in their eyes.
Document: Young men love lies truly hearts eyes
► Benefits:
► cheap way to remove classes of words without semantic content
► Problems:
To be or not to be
► this sentence would be entirely removed.
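A minimal sketch of stop-word filtering; the toy stop list below is an assumption that extends the example stop list with “to”, “be”, “or”, “not” so as to reproduce the problematic case:

```python
# Toy stop list (hypothetical; real stop lists are much longer).
STOP_WORDS = {"the", "a", "'s", "in", "but", "i", "we", "my", "your",
              "their", "then", "to", "be", "or", "not"}

def filter_stop_words(tokens):
    # Drop every token that appears in the stop list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# The problematic case from the slide: the whole sentence disappears.
print(filter_stop_words("to be or not to be".split()))  # []
```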
Indexing terms: filtering with frequencies
Zipf and Luhn laws
If r is the rank of a term and n is its number of occurrences (frequency) in the collection:
► Zipf (1949): n ∼ 1/r
► Luhn (1958): mid-rank terms are the best indicators of topics
Indexing terms: filtering with frequencies
[Figure: word frequency plotted against word rank, with an upper and a lower cut-off; words above the upper cut-off are too common, words below the lower cut-off are too rare, and the significant words lie between the two.]
Bag of words model
Assumption
Positions of the terms are ignored. Term distribution is indicative enough of the meaning.
Model
d1 = {(t1, n(d1, t1)); (t2, n(d1, t2)); ...}
d2 = {(t1, n(d2, t1)); (t2, n(d2, t2)); ...}
A document is a multiset of terms
Example
Now so long, Marianne; it’s time that we began to laugh and cry and cry; and laugh about it all again.
→ ([begin,1] [cry,2] [laugh,2] [long,1] [Marianne,1] [time,1])
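The multiset above can be built with a `Counter` (a sketch that assumes tokenization and lemmatization, e.g. “began” → “begin”, have already been applied):

```python
from collections import Counter

# Toy vocabulary and a pre-tokenized, lemmatized document (assumed inputs).
vocabulary = {"begin", "cry", "laugh", "long", "marianne", "time"}
tokens = ["now", "so", "long", "marianne", "time", "begin",
          "laugh", "cry", "cry", "laugh"]

# The Bag of Words: a multiset mapping each kept term to its count.
bow = Counter(t for t in tokens if t in vocabulary)
print(sorted(bow.items()))
# [('begin', 1), ('cry', 2), ('laugh', 2), ('long', 1), ('marianne', 1), ('time', 1)]
```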
Conclusions on indexing
► Bad indexing can ruin the performance of an otherwise sophisticated IR system
► Good indexing is anything but trivial
Representation function
Definition
To represent documents as vectors, we need to associate the indexing terms with weights.
Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.
► R representation space: R^|V|
→ ([aardvark,?] [begin,?] [cry,?] [information,?] [laugh,?] [long,?] [Marianne,?] [retrieval,?] [time,?])
Weighting schemes: tf.idf
Term Frequency
tf(ti, dj) = number of occurrences of term ti in document dj
Note: sometimes (1 + log(tf(ti, dj))) is used in place of tf(ti, dj)
Term Frequency - Inverse Document Frequency
tf·idf(ti, dj) = tf(ti, dj) · idf(ti)
with
idf(ti) = log( |D| / nb(dk ∋ ti) )
|D|: number of documents in the collection
nb(dk ∋ ti): “document frequency of ti”, i.e. the number of documents that contain term ti
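The definitions above transcribe directly into code (a sketch with hypothetical helper names; documents are assumed to be token lists, and df > 0 for any queried term):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Raw number of occurrences of term in the document.
    return Counter(doc_tokens)[term]

def idf(term, collection):
    # log(|D| / document frequency); assumes the term occurs somewhere.
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df)

def tf_idf(term, doc_tokens, collection):
    return tf(term, doc_tokens) * idf(term, collection)

docs = [["cry", "laugh", "cry"], ["laugh", "time"], ["time", "long"]]
# "cry" occurs twice in docs[0] and in 1 of the 3 documents:
print(tf_idf("cry", docs[0], docs))  # 2 * log(3/1) ≈ 2.197
```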
Example: tf weighting scheme
Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.
► R_D : V* → R representation function, here: Term Frequency
→ ([aardvark,0] [begin,1] [cry,2] [information,0] [laugh,2] [long,1] [Marianne,1] [retrieval,0] [time,1])
→ (0 1 2 0 2 1 1 0 1 ...)
In practice
the vectors are very sparse
Okapi BM25 weighting scheme
Origin
The Okapi BM25 scheme is originally based on Sparck-Jones’ probabilistic model for IR and has been strongly fine-tuned experimentally.
BM25 weight for term t in document d:
w_BM25(t, d) = ( (k + 1) / ( tf(t, d) + k · (1 − b + b · |d| / avdl) ) ) · tf·idf(t, d)
where |d| is the length of document d, avdl is the average document length, and k, b are constants, usually set to values s.t. k ∈ [1.2, 2.0] and b = 0.75.
BM25 is a very good scheme and is often used as a reference.
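The BM25 weight can be transcribed as follows (a sketch; the function name is hypothetical, and the defaults k = 1.2, b = 0.75 follow the constants given above):

```python
def bm25_weight(tf_td, idf_t, doc_len, avdl, k=1.2, b=0.75):
    """BM25 weight: a saturating factor applied to the tf.idf weight.

    tf_td: raw term frequency of t in d; idf_t: idf of t;
    doc_len: |d|; avdl: average document length in the collection.
    """
    saturation = (k + 1) / (tf_td + k * (1 - b + b * doc_len / avdl))
    return saturation * (tf_td * idf_t)

# Term frequency saturates: doubling tf less than doubles the weight.
w1 = bm25_weight(1, 1.0, 100, 100)
w2 = bm25_weight(2, 1.0, 100, 100)
print(w1, w2)  # 1.0 1.375
```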
Proximity measure between documents
Cosine similarity
► bounded: 0 ≤ cos(d1, d2) ≤ 1 for all d1, d2 (with non-negative weights)
► it is a similarity: the greater, the more similar the documents (as opposed to a metric)
► independent of the length of the documents
cos(d1, d2) = (d1 · d2) / (‖d1‖ · ‖d2‖) = ( Σ_{j=1..N} d1j · d2j ) / ( √(Σ_j d1j²) · √(Σ_j d2j²) )
Proximity measure between documents
Document 1
► Now so long, Marianne, it’s time that we began to laugh and cry and cry and laugh about it all again.
► ..., [long,1] [Marianne,1] [time,1] [begin,1] [laugh,2] [cry,2], ...
► d1 = (..., 1, 1, 1, 1, 2, 2, ...)
Document 2
► I haven’t seen Marianne laughing for some time, is she crying all day long?
► ..., [long,1] [Marianne,1] [time,1] [begin,0] [laugh,1] [cry,1], ...
► d2 = (..., 1, 1, 1, 0, 1, 1, ...)
Example
cos(d1, d2) = 7 / (√12 · √5) ≈ 0.904
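The worked example can be checked in a few lines (a sketch over the six shown dimensions only; real vectors are |V|-dimensional and very sparse):

```python
import math

def cosine(d1, d2):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(d1, d2))
    n1 = math.sqrt(sum(x * x for x in d1))
    n2 = math.sqrt(sum(y * y for y in d2))
    return dot / (n1 * n2)

d1 = [1, 1, 1, 1, 2, 2]  # long, Marianne, time, begin, laugh, cry
d2 = [1, 1, 1, 0, 1, 1]
print(round(cosine(d1, d2), 3))  # 0.904
```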
Summary
Choices depending on the application
► Weighting: makes it possible to translate semantic notions into computable models
► Proximity measure: defines the topology of the representation space
Constants
► |V|-dimensional vector space
► very sparse vectors
Queries: definition
Definition
Queries (or “topics”) are “questions” asked to the system
Typically keywords, possibly augmented with operators
dreamt WITHIN 5 philosophy
Supposed unknown at indexing time (a difference between IR and classification or clustering).
Visit http://www.google.com/trends for real-life examples.
Query representation
Example
► easy: as for documents
more things in heaven and earth
► less easy (verbatim sentence)
"more things in heaven and earth"
► quite different from the document (positional information)
dreamt WITHIN 5 philosophy
Conclusion:
Query representation is not necessarily trivial (not always the same as representation of documents).
Problem with short queries
Web queries
On the web,
► the average query length is under three words
► very few users use operators
Language being ambiguous, three-word queries are difficult to satisfy.
Solutions
► query expansion: use knowledge about the query terms to associate them with other terms and improve the query.
► query term reweighting: weight the terms of the query so as to obtain maximum retrieval performance.
► relevance feedback: the user provides the system with an evaluation of the relevance of its answers.
Performance of IR systems: Precision and Recall
Reminder: given an IR system, a document collection, a set of queries, a referential, and an answer by the system:
Definition
Precision is the proportion of the documents retrieved by the system that are relevant (according to the referential)
Definition
Recall is the proportion of the relevant documents which were retrieved by the system
► Precision is problematic if no documents are returned
► Recall can be cheated by returning all documents
Performance of IR systems: Precision and Recall (2)
More formally, if, for any query q,
• S(q) is the set of documents retrieved by the IR system; and
• R(q) is the set of documents truly relevant for q, as defined by a referential associated with the document collection used,
then
• Precision: Pr(q) = |R(q) ∩ S(q)| / |S(q)|
and
• Recall: Rec(q) = |R(q) ∩ S(q)| / |R(q)|
[Figure: the collection C, containing the set R of relevant documents and the set S of retrieved documents, with their overlap R ∩ S.]
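As set operations, Precision and Recall are two lines each (a sketch; the sets R and S below are made-up toy data, not from an evaluation campaign):

```python
def precision(relevant, retrieved):
    # Fraction of the retrieved documents that are relevant.
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    # Fraction of the relevant documents that were retrieved.
    return len(relevant & retrieved) / len(relevant)

R = {"d1", "d2", "d3", "d4"}  # relevant documents for q (referential)
S = {"d1", "d2", "d5"}        # documents retrieved by the system

print(precision(R, S))  # 2/3
print(recall(R, S))     # 0.5
```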
Other performance measures: Precision@n and R-Precision
Precision@n
Precision at n documents:
Pr_n(q) = |R(q) ∩ S_n(q)| / |S_n(q)|
with S_n(q) = the set of the first n documents retrieved
R-Precision
Precision obtained after retrieving as many documents as there are relevant documents, averaged over the queries:
R-Precision = (1/N) · Σ_{i=1..N} Pr_{|R(q_i)|}(q_i)
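Precision@n operates on the ranked result list rather than on an unordered set (a sketch with made-up toy data; note that R-Precision for a single query is simply Precision at n = |R(q)|):

```python
def precision_at_n(relevant, ranked_results, n):
    # Precision computed over the first n retrieved documents only.
    top_n = set(ranked_results[:n])
    return len(relevant & top_n) / len(top_n)

R = {"d1", "d2", "d3"}                     # relevant documents
ranking = ["d1", "d5", "d2", "d6", "d3"]   # system output, best first

print(precision_at_n(R, ranking, 2))       # 0.5
# R-Precision for this single query: precision at n = |R| = 3
print(precision_at_n(R, ranking, len(R)))  # 2/3
```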
Other performance measures: Mean Average Precision
Average Precision
Average of the precisions at the ranks rk(d, q) at which the relevant documents d are retrieved:
AvgP(q) = (1/|R(q)|) · Σ_{d ∈ R(q)} Pr_{rk(d,q)}(q)
Mean Average Precision
Mean over the queries of the Average Precisions:
MAP = (1/N) · Σ_{i=1..N} AvgP(q_i)
MAP measures the tendency of the system to retrieve relevant documents first.
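Average Precision over one ranked result list, and MAP over several queries, can be sketched as follows (toy data; relevant documents that are never retrieved contribute 0, since the sum is divided by |R(q)|):

```python
def average_precision(relevant, ranked_results):
    # Precision at the rank of each retrieved relevant document,
    # summed and divided by the total number of relevant documents.
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(queries):
    # queries: list of (relevant_set, ranked_results) pairs.
    return sum(average_precision(r, s) for r, s in queries) / len(queries)

R = {"d1", "d2"}
print(average_precision(R, ["d1", "d3", "d2"]))  # (1/1 + 2/3) / 2 ≈ 0.833
```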
Plotting average Precision and Recall
[Figure: average Precision vs. Recall curves for three systems: VS (alpha=1), Hybrid (alpha=0.5) and DSIR (alpha=0).]
Aim of the game: push the curve towards the upper right corner
Evaluation campaigns
Evaluation set
1. Document collection
2. Query set
3. Referential
Definition
Referential: list of documents of a collection to be retrieved for one given query (handmade).
Examples of evaluation campaigns
► Smart (1970s)
► TREC (since the 1990s; large collections)
► AMARYLLIS (French)
Limitations of the Vector Space model
Problem
Among the well-known limitations, the impact of
► Polysemy
► Synonymy
Polysemy
Example
Query includes term Bank
→ Bank of England? Bank of fishes? Grand bank? Airplane bank?
Consequences
Negative impact on precision
Synonymy
Example
Query includes term freedom
→ liberty will not be seen as relevant
Consequences
Negative impact on recall
Distributional Semantics Information Retrieval
Idea
There is a high degree of correlation between the observable distributional
characteristics of a term and its meaning:
"a word is characterized by the company it keeps";
Z. Harris (1954), J.R. Firth (1957)
Example
► Some X, for instance, naturally attack rats.
► The X on the roof was exposing its back to the shine of the sun.
► He heard the mewings of X in the forest.
► X is a: . . .
X is a . . .
[Image credit: Bertil Videt, GFDL & CC-by-2.5]
More advanced IR models
• Latent Semantic Indexing
→ aiming at reducing the dimensionality of the original representation space
• Sparck-Jones’ probabilistic models
→ inspirational for the Okapi BM25 weighting scheme
• LDA: Latent Dirichlet Allocation (Blei, Ng, Jordan 2003)
→ probabilistic model with hidden states (“topics”)
Some specific references:
► D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84,
2012.
► J.-C. Chappelier, Topic-based Generative Models for Text Information Access, In
Textual Information Access – Statistical Models, E. Gaussier and F. Yvon eds, ch.
5, pp. 129-178, Wiley-ISTE, April 2012.
Key points
► Vector-space model
► Indexing (and why it is important)
► Weighting schemes: tf-idf
► Cosine similarity measure
► Evaluation: Precision and Recall
References
1. C. D. Manning, P. Raghavan and H. Schütze, “Introduction to Information Retrieval”, Cambridge University Press, 2008.
2. R. Baeza-Yates and B. Ribeiro-Neto, “Modern Information Retrieval”, Addison Wesley, 1999.
3. “Topics in Information Retrieval”, chap. 15 in “Foundations of Statistical Natural Language Processing”, C. D. Manning and H. Schütze, MIT Press, 1999.