
Fundamentals in Information Retrieval
Jean-Cédric Chappelier, Martin Rajman, Emmanuel Eckard (LIA)


Page 1

Fundamentals in Information Retrieval

Jean-Cédric Chappelier

Martin Rajman

Emmanuel Eckard

LIA

1

Page 2

Contents

2

► Introduction

► The vector space model

• Indexing terms (tokenization, filtering, Bags of Words)

• Weighting schemes (tf.idf, Okapi BM25)

• Cosine proximity measure

► Representing the queries

► Evaluating IR systems

► Limitations of the vector space model and Distributional Semantics

► More advanced models

► Key points and references

Page 3

Information Retrieval

Definition

Selection of the documents relevant to a query from an unstructured collection of documents.

► unstructured: not produced with IR in mind, not a database.

► document: here, natural language text (but could also be video, audio or images)

► query: utterance in natural language (possibly augmented with commands, see later)

► relevant:

1. User’s perspective: answering their information retrieval needs
2. Mathematically: maximizing a formally defined “proximity measure”

3

Page 4

Example of Information Retrieval: issuing a query on an unstructured collection

► query (“Alan Turing”)

► search among unstructured collection (Wikipedia articles)

4

Page 5

Example of Information Retrieval: results returned by the system

► list of results with a percentage match

► highest matches first

5

Page 6

Ambiguity

Sometimes unintended results occur

Example
query: “Chicago school”

wanted?

► schools in Chicago (IL)?

► the school of thought in economics?

► architectural style?

► where to learn how to play Chicago (game):

► bridge?
► or poker?

6

Page 7

How is IR done?

Tasks

► have the computer represent documents (at an adequate level): preprocessing, indexing, ...

► represent the queries, not necessarily the same way as the documents (short queries, operators, . . . )

► define a relevance measure between document and query representations

Similarities with other NLP tasks

► Classification (no query)

► Data mining (formatted data)

► Information extraction (retrieve short parts of documents)

7

Page 8

Representing the documents: the vector space model

[Figure: documents d1, d2 and d3 plotted as points in a 3-dimensional vector space whose axes are the indexing terms t1, t2 and t3.]

► indexing terms define the axes

► documents are points in the vector space

8

Page 9

Vector Space model

Objective

Representing documents as vectors derived from the distributions of indexing terms in the document collection.

Principle

► V , a finite vocabulary of indexing terms

► R representation space

► D : V∗ → R, the representation function

► similarity: Mprox : R × R → R+

9

Page 10

Indexing terms

Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.

► V , a finite vocabulary: aardvark, begin, cry, information, laugh, long, Marianne, retrieval, time, ...

→ Now so long Marianne it’s time that we began to laugh and cry and cry and laugh about it all again.

10

In practice

Indexing is the set of mechanisms used to build the vocabulary of indexing terms

This vocabulary is usually very large: typically several thousand indexing terms

Page 11

Indexing the documents

Definition

Selecting the relevant features in the documents (typically words) to use for representing the documents

11

Themes related to indexing:

► Tokenization

► Stop words

► Zipf and Luhn laws

► Bag of Words (BoW) model

Page 12

Tokenization

Definition

Tokenization: splitting the text into words (a prerequisite to choosing indexing terms)

Example
► easy: whitespace

Now is the winter of our discontent

Made glorious summer by this son of York

► less easy: whitespace is not always indicative of term segmentation (compounds): “Distributional Semantics Information Retrieval and Latent Semantic Indexing performance comparison”

► languages with heavy compounding (e.g. German) are a problem: Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz

► Technical terms

12
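The “easy” whitespace case can be sketched in a few lines of Python (a minimal illustration, not from the original slides; it already fails on the compound, compounding-language and technical-term cases above):

```python
def tokenize(text):
    # Naive tokenization: lowercase, split on whitespace,
    # strip punctuation from the edges of each token.
    tokens = (t.strip(".,;:!?\"'") for t in text.lower().split())
    return [t for t in tokens if t]

print(tokenize("Now is the winter of our discontent"))
# → ['now', 'is', 'the', 'winter', 'of', 'our', 'discontent']
```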

Page 13

Tokenization of technical terms & conclusion on tokenization

e.g. in Chemistry

Methionyl-glutaminyl-arginyl-tyrosyl-glutamyl-seryl-leucyl-phenyl-alanyl-alanyl-glutaminyl-leucyl-

lysyl-glutamyl-arginyl-lysy-glutamyl-gycyl-alanyl-phenyl-alanyl-valyl-prolyl-phenyl-alanyl-valyl-

threonyl-leucyl-glycyl-aspartyl-prolyl-glycyl-isoleucyl-glutamyl-glutaminyl-seryl-leucyl-lysyl-

isoleucyl-aspartyl-threonyl-leucyl-isoleucyl-glutamyl-alanyl-glycyl-alanyl-aspartyl-alanyl-leucyl-

glutamyl-leucyl-glycyl-isoleucyl-prolyl-phenyl-alanyl-seryl-aspartyl-prolyl-leucyl-alanyl-aspartyl-

glycyl-prolyl-threonyl-isoleucyl-glutaminyl-asparaginyl-alanyl-threonyl-leucyl-arginyl-alanyl-

phenyl-alanyl-alanyl-alanyl-glycyl-valyl-threonyl-prolyl-alanyl-glutaminyl-cysteinyl-phenyl-alanyl-

glutamyl-methionyl-leucyl-alanyl-leucyl-isoleucyl-arginyl-glutaminyl-lysyl-histidyl-prolyl-threonyl-

isoleucyl-prolyl-isoleucyl-glycyl-leucyl-leucyl-methionyl-tyrosyl-alanyl-asparaginyl-leucyl-valyl-

phenyl-...

13

Conclusion on tokenization: it is a full-fledged NLP issue (→ requires NLP techniques)

Page 14

Choice of indexing terms

Filtering

Choice of indexing terms using various filters:

► on morpho-syntactic categories

► on stop-words

► on frequencies

14

Benefits

► more informative indexes
► smaller indexes (tractability)

Page 15

Indexing terms: filtering with morpho-syntactic categories

Filtering

Some morpho-syntactic categories (e.g. determiners, conjunctions, …) do not carry much semantic content, while others (e.g. nouns, verbs, …) do.

► Keep only the terms in a selected set of morpho-syntactic categories (e.g. nouns and adjectives)

15

Page 16

Indexing terms: filtering with stop words

Definition

Stop word: a term explicitly excluded from indexing.

Example

stoplist: the; a; ’s; in; but; I; we; my; your; their; then

Young men’s love then lies
Not truly in their hearts, but in their eyes.

Document: Young men love lies truly hearts eyes

16

► Benefits:

► cheap way to remove classes of words without semantic content

► Problems:

To be or not to be

► this sentence would be entirely removed.
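Stop-word filtering is cheap to implement; a minimal Python sketch using the slide’s stoplist (tokenization, with “’s” split off as on the slide, is assumed done beforehand):

```python
STOPLIST = {"the", "a", "'s", "in", "but", "i", "we", "my", "your", "their", "then"}

def remove_stop_words(tokens):
    # keep only tokens not in the stoplist (case-insensitive)
    return [t for t in tokens if t.lower() not in STOPLIST]

verse = ["young", "men", "'s", "love", "then", "lies", "not", "truly",
         "in", "their", "hearts", "but", "in", "their", "eyes"]
print(remove_stop_words(verse))
# → ['young', 'men', 'love', 'lies', 'not', 'truly', 'hearts', 'eyes']
```

With this short stoplist, “not” survives; a larger stoplist containing to / be / or / not would indeed empty “To be or not to be” entirely, which is exactly the problem noted above.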

Page 17

Indexing terms: filtering with frequencies

Zipf and Luhn laws

If r is the rank of a term and n is its number of occurrences (frequency) in the collection:

► Zipf (1949): n ∼ 1/r

► Luhn (1958): mid-rank terms are the best indicators of topics

17

Page 18

Indexing terms: filtering with frequencies

[Figure (Luhn): word frequency plotted against word rank, with an upper and a lower cut-off. Words above the upper cut-off are too common, words below the lower cut-off are too rare; the significant words lie in between.]

18

Page 19

Bag of words model

Assumption

Positions of the terms are ignored. Term distribution is indicative enough of the meaning.

Model

d1 = {(t1, n(d1, t1)); (t2, n(d1, t2)); ...}

d2 = {(t1, n(d2, t1)); (t2, n(d2, t2)); ...}

A document is a multiset of terms

Example

Now so long, Marianne; it’s time that we began to laugh and cry and cry; and laugh about it all again.

→ ([begin,1] [cry,2] [laugh,2] [long,1] [Marianne,1] [time,1])
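A bag of words is just a multiset of terms; in Python, `collections.Counter` gives this directly (a sketch assuming tokenization, filtering and lemmatization, e.g. “began” → “begin”, have already been applied):

```python
from collections import Counter

# Filtered and lemmatized tokens of the example document:
terms = ["long", "Marianne", "time", "begin", "laugh", "cry", "cry", "laugh"]
bow = Counter(terms)  # the document as a multiset: term -> count
print(sorted(bow.items()))
# → [('Marianne', 1), ('begin', 1), ('cry', 2), ('laugh', 2), ('long', 1), ('time', 1)]
```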

19

Page 20

Conclusions on indexing

► Bad indexing can ruin the performance of an otherwise sophisticated IR system

► Good indexing is anything but trivial

20

Page 21

Representation function

Definition
To represent documents as vectors, we need to associate the indexing terms with weights

Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.

► R representation space: R|V|

→ ([aardvark,?] [begin,?] [cry,?] [information,?] [laugh,?]

[long,?] [Marianne,?] [retrieval,?] [time,?])

21

Page 22

Weighting schemes: tf.idf

Term Frequency

tf(ti, dj) = number of occurrences of term ti in document dj

Note: sometimes (1 + log(tf(ti, dj))) is used in place of tf(ti, dj)

Term Frequency – Inverse Document Frequency

tf·idf(ti, dj) = tf(ti, dj) · idf(ti)

with

idf(ti) = log( |D| / nb(dk ∋ ti) )

|D|: number of documents in the collection
nb(dk ∋ ti): “document frequency of ti”, i.e. the number of documents that contain term ti
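These definitions translate directly into code; a minimal sketch with documents as token lists (the three-document toy collection is illustrative only):

```python
import math
from collections import Counter

def tf(term, doc):
    # tf(t, d): number of occurrences of t in d
    return Counter(doc)[term]

def idf(term, docs):
    # idf(t) = log(|D| / number of documents containing t)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [["cry", "laugh", "cry"], ["laugh", "time"], ["time", "long"]]
# "cry" occurs twice in docs[0] and appears in 1 of the 3 documents:
print(tf_idf("cry", docs[0], docs))  # 2 * log(3) ≈ 2.197
```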

22

Page 23

Example: tf weighting scheme

Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.

► D : V∗ → R, the representation function; here: Term Frequency

→ ([aardvark,0] [begin,1] [cry,2] [information,0] [laugh,2] [long,1] [Marianne,1] [retrieval,0] [time,1])

→(0 1 2 0 2 1 1 0 1 ...)

In practice

the vectors are very sparse

23

Page 24

Okapi BM25 weighting scheme

Origin

The Okapi BM25 scheme is originally based on Spärck Jones’ probabilistic model for IR and has been strongly fine-tuned experimentally.

BM25 weight for term t in document d:

w_BM25(t, d) = [ (k + 1) / ( tf(t, d) + k · (1 − b + b · |d| / avdl) ) ] · tf·idf(t, d)

where |d| is the length of document d, avdl is the average document length, and k, b are constants, usually set to values s.t. k ∈ [1.2, 2.0] and b = 0.75.

BM25 is a very good scheme and is often used as a reference.

24
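A sketch of this weighting in Python (k = 1.5 and b = 0.75 are illustrative defaults, and the idf value is a made-up example); it shows BM25’s key property, saturation in tf:

```python
import math

def bm25_weight(tf_td, idf_t, dlen, avdl, k=1.5, b=0.75):
    # w_BM25(t, d) = (k+1) / (tf + k*(1 - b + b*|d|/avdl)) * tf * idf
    norm = k * (1 - b + b * dlen / avdl)
    return (k + 1) / (tf_td + norm) * (tf_td * idf_t)

idf = math.log(1000 / 50)  # e.g. a term occurring in 50 of 1000 documents
for f in (1, 2, 4, 8):
    print(f, bm25_weight(f, idf, dlen=100, avdl=100))
# the weight grows sub-linearly in tf (saturation), unlike plain tf.idf
```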

Page 25

Proximity measure between documents

Cosine similarity

► bounded (0 ≤ cos(d1, d2) ≤ 1 for all d1, d2 with non-negative coordinates)

► it is a similarity: the greater, the more similar the documents (as opposed to a metric)

► independent of the length of the document

25

cos(d1, d2) = (d1 · d2) / (‖d1‖ ‖d2‖) = ( Σ_{j=1..N} d1j · d2j ) / ( √(Σ_j d1j²) · √(Σ_j d2j²) )

Page 26

Proximity measure between documents

Document 1
► Now so long, Marianne, it’s time that we began to laugh and cry and cry and laugh about it all again.

► ..., [long,1] [Marianne,1] [time,1] [begin,1] [laugh,2] [cry,2], ...

► d1 = (...,1,1,1,1,2,2,...)

Document 2
► I haven’t seen Marianne laughing for some time, is she crying all day long?

► ..., [long,1] [Marianne,1] [time,1] [begin,0] [laugh,1] [cry,1], ...

► d2 = (...,1,1,1,0,1,1,...)

Example

cos(d1, d2) = 7 / (√12 · √5) ≈ 0.904
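The worked example can be checked with a few lines of Python (vectors restricted to the six terms shown above):

```python
import math

def cosine(d1, d2):
    # cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)

# tf vectors over (long, Marianne, time, begin, laugh, cry):
d1 = [1, 1, 1, 1, 2, 2]
d2 = [1, 1, 1, 0, 1, 1]
print(round(cosine(d1, d2), 3))  # → 0.904
```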

26

Page 27

Summary

Choices depending on the application

► Weighting: makes it possible to translate semantic notions into computable models

► Proximity measure: defines the topology of the representation space

Constants

► |V|-dimensional vector space

► very sparse vectors

27

Page 28

Queries: definition

Definition

Queries (or “topics”) are “questions” asked to the system

Typically keywords, possibly augmented with operators

dreamt WITHIN 5 philosophy

Queries are supposed to be unknown at indexing time (a difference between IR and classification or clustering).
Visit http://www.google.com/trends for real-life examples.

28

Page 29

Query representation

Example

► easy: as for documents

more things in heaven and earth

► less easy (verbatim sentence)

"more things in heaven and earth"

► quite different from the document (positional information)

dreamt WITHIN 5 philosophy

Conclusion:

Query representation is not necessarily trivial (and not always the same as the representation of documents).

29

Page 30

Problem with short queries

Web queries

On the web,

► the average query length is under three words

► very few users use operators

Language being ambiguous, three-word queries are difficult to satisfy.

Solutions

► query expansion: use knowledge about the query terms to associate them with other terms and improve the query.

► query term reweighting: weight the terms of the query so as to obtain maximum retrieval performance.

► relevance feedback: the user provides the system with an evaluation of the relevance of its answers.

30

Page 31

Performances of IR systems: Precision and Recall

Reminder: given an IR system, a document collection, queries, a referential, and an answer returned by the system:

Definition

Precision is the proportion of the documents retrieved by the system that are relevant (according to the referential)

Definition

Recall is the proportion of the relevant documents which were retrieved by the system

► Precision is problematic if no documents are returned

► Recall can be cheated by returning all documents

31

Page 32

Performances of IR systems: Precision and Recall (2)

More formally, if, for any query q,

• S(q) is the set of documents retrieved by the IR system; and

• R(q) is the set of documents truly relevant for q, as defined by a referential associated to the used document collection,

then

• Precision:

and

• Recall:

[Figure: Venn diagram of the collection C, showing the set R of relevant documents and the set S of retrieved documents overlapping.]

32

Pr(q) = |R(q) ∩ S(q)| / |S(q)|

Rec(q) = |R(q) ∩ S(q)| / |R(q)|
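With the retrieved and relevant documents as sets, both measures are one-liners (toy data for illustration):

```python
def precision(S, R):
    # proportion of the retrieved documents that are relevant
    return len(S & R) / len(S) if S else 0.0

def recall(S, R):
    # proportion of the relevant documents that were retrieved
    return len(S & R) / len(R) if R else 0.0

S = {"d1", "d2", "d3", "d4"}  # retrieved by the system
R = {"d1", "d3", "d5"}        # relevant according to the referential
print(precision(S, R), recall(S, R))  # → 0.5 0.6666666666666666
```

Note how returning the whole collection drives recall to 1 while precision collapses, the “cheat” mentioned above.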

Page 33

Other performance measures: Precision@n and R-Precision

Precision@n

Precision at n documents:

with Sn(q) = the set of the n first documents retrieved

R-Precision

Precision obtained after retrieving as many documents as there are relevant documents, averaged over queries

33

Pr_n(q) = |R(q) ∩ S_n(q)| / |S_n(q)|

R-Precision = (1/N) Σ_{i=1..N} Pr_{|R(q_i)|}(q_i)
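Both measures only need the ranked system output and the referential; a Python sketch on a toy ranking:

```python
def precision_at_n(ranked, relevant, n):
    # Pr_n(q): precision over the n first retrieved documents
    top = ranked[:n]
    return sum(1 for d in top if d in relevant) / len(top)

ranked = ["d1", "d7", "d3", "d9", "d5"]  # system output, best match first
relevant = {"d1", "d3", "d5"}
print(precision_at_n(ranked, relevant, 3))              # → 0.6666666666666666
# R-Precision for this single query: precision at n = |R(q)| documents
print(precision_at_n(ranked, relevant, len(relevant)))  # here also n = 3
```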

Page 34

Other performance measures: Mean Average Precision

Average Precision

Average, over the relevant documents d, of the precision at the rank rk(d, q) at which d is retrieved:

Mean Average Precision

Mean over the queries of the Average Precisions

MAP measures the tendency of the system to retrieve relevant documents first.

34

AvgP(q) = (1/|R(q)|) Σ_{d ∈ R(q)} Pr_{rk(d,q)}(q)

MAP = (1/N) Σ_{i=1..N} AvgP(q_i)
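A sketch of Average Precision and MAP in Python (relevant documents never retrieved contribute a precision of 0, a common convention):

```python
def average_precision(ranked, relevant):
    # AvgP(q): mean over d in R(q) of the precision at the rank where d appears
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    # runs: one (ranked_list, relevant_set) pair per query
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["d1", "d7", "d3", "d9", "d5"]
relevant = {"d1", "d3", "d5"}
print(average_precision(ranked, relevant))  # (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
```

Swapping a relevant document towards the top raises AvgP, which is why MAP rewards systems that retrieve relevant documents first.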

Page 35

Plotting average Precision and Recall

[Figure: average Precision (y-axis, 0 to 0.6) plotted against Recall (x-axis, 0 to 1) for three systems: DSIR (alpha=0), Hybrid (alpha=0.5) and VS (alpha=1).]

Aim of the game: push the curve towards the upper right corner

35

Page 36

Evaluation campaigns

Evaluation set

1. Document collection

2. Query set

3. Referential

Definition

Referential: the (handmade) list of the documents of a collection that should be retrieved for a given query.

Examples of evaluation campaigns

► SMART (1970s)

► TREC (since the 1990s; large collections)

► Amaryllis (French)

36

Page 37

Limitations of the Vector Space model

Problem

Among the well-known limitations is the impact of:

► Polysemy

► Synonymy

37

Page 38

Polysemy

Example

Query includes term Bank

→ Bank of England? bank of fish? Grand Banks? airplane bank?

Consequences

Negative impact on precision

38

Page 39

Synonymy

Example

Query includes term freedom

→ liberty will not be seen as relevant

Consequences

Negative impact on recall

39

Page 40

Distributional Semantics Information Retrieval

Idea

There is a high degree of correlation between the observable distributional characteristics of a term and its meaning:

"a word is characterized by the company it keeps";

Z. Harris (1954), J.R. Firth (1957)

Example

► Some X , for instance, naturally attack rats.

► The X on the roof was exposing its back to the shine of the sun.

► He heard the mewings of X in the forest .

► X is a: . . .

40

Page 41

X is a . . .

[Photo: a cat. Credit: Bertil Videt, GFDL & CC-BY-2.5]

41

Page 42

More advanced IR models

• Latent Semantic Indexing

→ aiming at reducing the dimensionality of the original representation space

• Spärck Jones’ probabilistic models

→ inspirational for the Okapi BM25 weighting scheme

• LDA: Latent Dirichlet Allocation (Blei, Ng, Jordan 2003)

→ probabilistic model with hidden states (“topics”)

Some specific references:

► D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

► J.-C. Chappelier. Topic-based Generative Models for Text Information Access. In Textual Information Access – Statistical Models, E. Gaussier and F. Yvon (eds), ch. 5, pp. 129–178, Wiley-ISTE, April 2012.

42

Page 43

Keypoints

► Vector-space model

► Indexing (and why it is important)

► Weighting schemes: tf-idf

► Cosine similarity measure

► Evaluation: Precision and Recall

43

Page 44

References

1 C. D. Manning, P. Raghavan and H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008.

2 R. Baeza-Yates and B. Ribeiro-Neto, "Modern Information Retrieval", Addison-Wesley, 1999.

3 "Topics in Information Retrieval", chap. 15 in "Foundations of Statistical Natural Language Processing", C. D. Manning and H. Schütze, MIT Press, 1999.

44