TRANSCRIPT
Fundamentals in Information Retrieval
Jean-Cédric CHAPPELIER
Martin RAJMAN
Emmanuel ECKARD
LIA
Contents
► Introduction
► The vector space model
• Indexing terms (tokenization, filtering, Bags of Words)
• Weighting schemes (tf.idf, Okapi BM25)
• Cosine proximity measure
► Representing the queries
► Evaluating IR systems
► Limitations of the vector space model and Distributional Semantics
► More advanced models
► Key points and references
Information Retrieval
Definition
Selection of documents relevant to a query in an unstructured collection of documents.
► unstructured: not produced with IR in mind, not a database.
► document: here, natural language text (but could also be video, audio or images)
► query: utterance in natural language (possibly augmented with commands, see later)
► relevant:
1. User’s perspective: answering their IR needs
2. Mathematically: maximizing a formally defined “proximity measure”
Example of Information retrieval: issuing a query on an unstructured collection
► query (“Alan Turing”)
► search among unstructured collection (Wikipedia articles)
Example of Information retrieval: results returned by the system
► list of results with a percentage match
► highest matches first
Ambiguity
Sometimes unintended results occur
Example
query: “Chicago school”
wanted?
► schools in Chicago (IL)?
► body of works in economics?
► architectural style?
► where to learn how to play Chicago (game):
► bridge?
► or poker?
How is IR done?
Tasks
► have the computer represent documents (at the adequate level): preprocessing, indexing, ...
► represent the queries, not necessarily the same way as the documents (short queries, operators, ...)
► define a relevance measure between document and query representations
Similarities with other NLP tasks
► Classification (no query)
► Data mining (formatted data)
► Information extraction (retrieve short parts of documents)
Representing the documents: the vector space model
[Figure: documents d1, d2, d3 as points in a space whose axes are the indexing terms t1, t2, t3.]
► indexing terms define the axes
► documents are points in the vector space
Vector Space model
Objective
Representing documents as vectors derived from the distributions of indexing terms in the document collection.
Principle
► V, a finite vocabulary of indexing terms
► R, a representation space
► R_D : V* → R, a representation function
► similarity: Mprox : R × R → R+
Indexing terms
Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.
► V, a finite vocabulary: aardvark, begin, cry, information, laugh, long, Marianne, retrieval, time, ...
→ Now so long Marianne it’s time that we began to laugh and cry and cry and laugh about it all again.
Indexing the documents
Definition
Selecting the relevant features in the documents (typically words) to use for representing the documents.
In practice
Indexing is the set of mechanisms used to build the vocabulary of indexing terms.
This vocabulary is usually very large: typically several thousand indexing terms.
Themes related to indexing:
► Tokenization
► Stop words
► Zipf and Luhn laws
► Bag of Words (BoW) model
Tokenization
Definition
Tokenization: splitting the text into words (a prerequisite to choosing indexing terms)
Example
► easy: whitespace
Now is the winter of our discontent
Made glorious summer by this son of York
► less easy: spaces are not always indicative of term segmentation (compounds): Distributional Semantics Information Retrieval and Latent Semantic Indexing performance comparison
► agglutinative languages are a problem: Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
► technical terms
Tokenization of technical terms & conclusion on tokenization
e.g. in Chemistry
Methionyl-glutaminyl-arginyl-tyrosyl-glutamyl-seryl-leucyl-phenyl-alanyl-alanyl-glutaminyl-leucyl-
lysyl-glutamyl-arginyl-lysy-glutamyl-gycyl-alanyl-phenyl-alanyl-valyl-prolyl-phenyl-alanyl-valyl-
threonyl-leucyl-glycyl-aspartyl-prolyl-glycyl-isoleucyl-glutamyl-glutaminyl-seryl-leucyl-lysyl-
isoleucyl-aspartyl-threonyl-leucyl-isoleucyl-glutamyl-alanyl-glycyl-alanyl-aspartyl-alanyl-leucyl-
glutamyl-leucyl-glycyl-isoleucyl-prolyl-phenyl-alanyl-seryl-aspartyl-prolyl-leucyl-alanyl-aspartyl-
glycyl-prolyl-threonyl-isoleucyl-glutaminyl-asparaginyl-alanyl-threonyl-leucyl-arginyl-alanyl-
phenyl-alanyl-alanyl-alanyl-glycyl-valyl-threonyl-prolyl-alanyl-glutaminyl-cysteinyl-phenyl-alanyl-
glutamyl-methionyl-leucyl-alanyl-leucyl-isoleucyl-arginyl-glutaminyl-lysyl-histidyl-prolyl-threonyl-
isoleucyl-prolyl-isoleucyl-glycyl-leucyl-leucyl-methionyl-tyrosyl-alanyl-asparaginyl-leucyl-valyl-
phenyl-...
Conclusion on tokenization: it is a full-fledged NLP issue → requires NLP techniques.
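As an illustration of the “easy” whitespace case (a minimal sketch only; the `tokenize` helper and its regular expression are assumptions for this example, not the tokenizer of any real IR system):

```python
import re

def tokenize(text):
    # Lowercase and split on runs of non-letter characters: a crude
    # approximation that already mishandles clitics ("it's" -> it, s)
    # and compounds -- hence the need for real NLP techniques.
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

print(tokenize("Now is the winter of our discontent"))
# ['now', 'is', 'the', 'winter', 'of', 'our', 'discontent']
```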
Choice of indexing terms
Filtering
Choice of indexing terms using various filters:
► on morpho-syntactic categories
► on stop-words
► on frequencies
Benefits
► more informative indexes
► smaller indexes (tractability)
Indexing terms: filtering with morpho-syntactic categories
Filtering
Some morpho-syntactic categories (e.g. determiners, conjunctions, …) do not carry much semantic content, while others (e.g. nouns, verbs, …) do.
► Keep only the terms in a selected set of morpho-syntactic categories (e.g. nouns and adjectives)
Indexing terms: filtering with stop words
Definition
Stop word: term explicitly excluded from indexing.
Example
stop list: the; a; ’s; in; but; I; we; my; your; their; then
Young men’s love then lies
Not truly in their hearts, but in their eyes.
Document: Young men love lies truly hearts eyes
► Benefits:
► cheap way to remove classes of words without semantic content
► Problems:
To be or not to be
► this sentence would be entirely removed.
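A minimal sketch of stop-word filtering; the toy stop list below is an assumption that extends the example stop list with “to”, “be”, “or”, “not” so as to reproduce the problematic case:

```python
# Toy stop list (hypothetical; real stop lists are much longer).
STOP_WORDS = {"the", "a", "'s", "in", "but", "i", "we", "my", "your",
              "their", "then", "to", "be", "or", "not"}

def filter_stop_words(tokens):
    # Drop every token that appears in the stop list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

# The problematic case from the slide: the whole sentence disappears.
print(filter_stop_words("to be or not to be".split()))  # []
```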
Indexing terms: filtering with frequencies
Zipf and Luhn laws
If r is the rank of a term and n is its number of occurrences (frequency) in the collection:
► Zipf (1949): n ∼ 1/r
► Luhn (1958): mid-rank terms are the best indicators of topics
Indexing terms: filtering with frequencies
[Figure: word frequency plotted against word rank, with an upper and a lower cut-off; words above the upper cut-off are too common, words below the lower cut-off are too rare, and the significant words lie between the two.]
Bag of words model
Assumption
Positions of the terms are ignored. Term distribution is indicative enough of the meaning.
Model
d1 = {(t1, n(d1, t1)); (t2, n(d1, t2)); ...}
d2 = {(t1, n(d2, t1)); (t2, n(d2, t2)); ...}
A document is a multiset of terms
Example
Now so long, Marianne; it’s time that we began to laugh and cry and cry; and laugh about it all again.
→ ([begin,1] [cry,2] [laugh,2] [long,1] [Marianne,1] [time,1])
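The multiset above can be built with a `Counter` (a sketch that assumes tokenization and lemmatization, e.g. “began” → “begin”, have already been applied):

```python
from collections import Counter

# Toy vocabulary and a pre-tokenized, lemmatized document (assumed inputs).
vocabulary = {"begin", "cry", "laugh", "long", "marianne", "time"}
tokens = ["now", "so", "long", "marianne", "time", "begin",
          "laugh", "cry", "cry", "laugh"]

# The Bag of Words: a multiset mapping each kept term to its count.
bow = Counter(t for t in tokens if t in vocabulary)
print(sorted(bow.items()))
# [('begin', 1), ('cry', 2), ('laugh', 2), ('long', 1), ('marianne', 1), ('time', 1)]
```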
Conclusions on indexing
► Bad indexing can ruin the performance of an otherwise sophisticated IR system
► Good indexing is anything but trivial
Representation function
Definition
To represent documents as vectors, we need to associate the indexing terms with weights.
Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.
► R representation space: R^|V|
→ ([aardvark,?] [begin,?] [cry,?] [information,?] [laugh,?] [long,?] [Marianne,?] [retrieval,?] [time,?])
Weighting schemes: tf.idf
Term Frequency
tf(ti, dj) = number of occurrences of term ti in document dj
Note: sometimes (1 + log(tf(ti, dj))) is used in place of tf(ti, dj)
Term Frequency - Inverse Document Frequency
tf·idf(ti, dj) = tf(ti, dj) · idf(ti)
with
idf(ti) = log( |D| / nb(dk ∋ ti) )
|D|: number of documents in the collection
nb(dk ∋ ti): “document frequency of ti”, i.e. the number of documents that contain term ti
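The definitions above transcribe directly into code (a sketch with hypothetical helper names; documents are assumed to be token lists, and df > 0 for any queried term):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Raw number of occurrences of term in the document.
    return Counter(doc_tokens)[term]

def idf(term, collection):
    # log(|D| / document frequency); assumes the term occurs somewhere.
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df)

def tf_idf(term, doc_tokens, collection):
    return tf(term, doc_tokens) * idf(term, collection)

docs = [["cry", "laugh", "cry"], ["laugh", "time"], ["time", "long"]]
# "cry" occurs twice in docs[0] and in 1 of the 3 documents:
print(tf_idf("cry", docs[0], docs))  # 2 * log(3/1) ≈ 2.197
```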
Example: tf weighting scheme
Example
► Now so long, Marianne
it’s time that we began to laugh and cry and cry and laugh about it all again.
► R_D : V* → R representation function, here: Term Frequency
→ ([aardvark,0] [begin,1] [cry,2] [information,0] [laugh,2] [long,1] [Marianne,1] [retrieval,0] [time,1])
→ (0 1 2 0 2 1 1 0 1 ...)
In practice
the vectors are very sparse
Okapi BM25 weighting scheme
Origin
The Okapi BM25 scheme is originally based on Sparck-Jones’ probabilistic model for IR and has been strongly fine-tuned experimentally.
BM25 weight for term t in document d:
w_BM25(t, d) = ( (k + 1) / ( tf(t, d) + k · (1 − b + b · |d| / avdl) ) ) · tf·idf(t, d)
where |d| is the length of document d, avdl is the average document length, and k, b are constants, usually set to values s.t. k ∈ [1.2, 2.0] and b = 0.75.
BM25 is a very good scheme and is often used as a reference.
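The BM25 weight can be transcribed as follows (a sketch; the function name is hypothetical, and the defaults k = 1.2, b = 0.75 follow the constants given above):

```python
def bm25_weight(tf_td, idf_t, doc_len, avdl, k=1.2, b=0.75):
    """BM25 weight: a saturating factor applied to the tf.idf weight.

    tf_td: raw term frequency of t in d; idf_t: idf of t;
    doc_len: |d|; avdl: average document length in the collection.
    """
    saturation = (k + 1) / (tf_td + k * (1 - b + b * doc_len / avdl))
    return saturation * (tf_td * idf_t)

# Term frequency saturates: doubling tf less than doubles the weight.
w1 = bm25_weight(1, 1.0, 100, 100)
w2 = bm25_weight(2, 1.0, 100, 100)
print(w1, w2)  # 1.0 1.375
```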
Proximity measure between documents
Cosine similarity
► bounded: 0 ≤ cos(d1, d2) ≤ 1 for all d1, d2 (with non-negative weights)
► it is a similarity: the greater, the more similar the documents (as opposed to a metric)
► independent of the length of the documents
cos(d1, d2) = (d1 · d2) / (‖d1‖ · ‖d2‖) = ( Σ_{j=1..N} d1j · d2j ) / ( √(Σ_j d1j²) · √(Σ_j d2j²) )
Proximity measure between documents
Document 1
► Now so long, Marianne, it’s time that we began to laugh and cry and cry and laugh about it all again.
► ..., [long,1] [Marianne,1] [time,1] [begin,1] [laugh,2] [cry,2], ...
► d1 = (..., 1, 1, 1, 1, 2, 2, ...)
Document 2
► I haven’t seen Marianne laughing for some time, is she crying all day long?
► ..., [long,1] [Marianne,1] [time,1] [begin,0] [laugh,1] [cry,1], ...
► d2 = (..., 1, 1, 1, 0, 1, 1, ...)
Example
cos(d1, d2) = 7 / (√12 · √5) ≈ 0.904
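The worked example can be checked in a few lines (a sketch over the six shown dimensions only; real vectors are |V|-dimensional and very sparse):

```python
import math

def cosine(d1, d2):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(d1, d2))
    n1 = math.sqrt(sum(x * x for x in d1))
    n2 = math.sqrt(sum(y * y for y in d2))
    return dot / (n1 * n2)

d1 = [1, 1, 1, 1, 2, 2]  # long, Marianne, time, begin, laugh, cry
d2 = [1, 1, 1, 0, 1, 1]
print(round(cosine(d1, d2), 3))  # 0.904
```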
Summary
Choices depending on the application
► Weighting: makes it possible to translate semantic notions into computable models
► Proximity measure: defines the topology of the representation space
Constants
► |V|-dimensional vector space
► very sparse vectors
Queries: definition
Definition
Queries (or “topics”) are “questions” asked to the system
Typically keywords, possibly augmented with operators
dreamt WITHIN 5 philosophy
Supposed unknown at indexing time (a difference between IR and classification or clustering).
Visit http://www.google.com/trends for real-life examples.
Query representation
Example
► easy: as for documents
more things in heaven and earth
► less easy (verbatim sentence)
"more things in heaven and earth"
► quite different from the document (positional information)
dreamt WITHIN 5 philosophy
Conclusion:
Query representation is not necessarily trivial (not always the same as representation of documents).
Problem with short queries
Web queries
On the web,
► the average query length is under three words
► very few users use operators
Language being ambiguous, three-word queries are difficult to satisfy.
Solutions
► query expansion: use knowledge about the query terms to associate them with other terms and improve the query.
► query term reweighting: weight the terms of the query so as to obtain maximum retrieval performance.
► relevance feedback: the user provides the system with an evaluation of the relevance of its answers.
Performance of IR systems: Precision and Recall
Reminder: given an IR system, a document collection, a set of queries, a referential, and an answer by the system:
Definition
Precision is the proportion of the documents retrieved by the system that are relevant (according to the referential)
Definition
Recall is the proportion of the relevant documents which were retrieved by the system
► Precision is problematic if no documents are returned
► Recall can be cheated by returning all documents
Performance of IR systems: Precision and Recall (2)
More formally, if, for any query q,
• S(q) is the set of documents retrieved by the IR system; and
• R(q) is the set of documents truly relevant for q, as defined by a referential associated with the document collection used,
then
• Precision: Pr(q) = |R(q) ∩ S(q)| / |S(q)|
and
• Recall: Rec(q) = |R(q) ∩ S(q)| / |R(q)|
[Figure: the collection C, containing the set R of relevant documents and the set S of retrieved documents, with their overlap R ∩ S.]
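As set operations, Precision and Recall are two lines each (a sketch; the sets R and S below are made-up toy data, not from an evaluation campaign):

```python
def precision(relevant, retrieved):
    # Fraction of the retrieved documents that are relevant.
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant, retrieved):
    # Fraction of the relevant documents that were retrieved.
    return len(relevant & retrieved) / len(relevant)

R = {"d1", "d2", "d3", "d4"}  # relevant documents for q (referential)
S = {"d1", "d2", "d5"}        # documents retrieved by the system

print(precision(R, S))  # 2/3
print(recall(R, S))     # 0.5
```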
Other performance measures: Precision@n and R-Precision
Precision@n
Precision at n documents:
Pr_n(q) = |R(q) ∩ S_n(q)| / |S_n(q)|
with S_n(q) = the set of the first n documents retrieved
R-Precision
Precision obtained after retrieving as many documents as there are relevant documents, averaged over the queries:
R-Precision = (1/N) · Σ_{i=1..N} Pr_{|R(q_i)|}(q_i)
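Precision@n operates on the ranked result list rather than on an unordered set (a sketch with made-up toy data; note that R-Precision for a single query is simply Precision at n = |R(q)|):

```python
def precision_at_n(relevant, ranked_results, n):
    # Precision computed over the first n retrieved documents only.
    top_n = set(ranked_results[:n])
    return len(relevant & top_n) / len(top_n)

R = {"d1", "d2", "d3"}                     # relevant documents
ranking = ["d1", "d5", "d2", "d6", "d3"]   # system output, best first

print(precision_at_n(R, ranking, 2))       # 0.5
# R-Precision for this single query: precision at n = |R| = 3
print(precision_at_n(R, ranking, len(R)))  # 2/3
```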
Other performance measures: Mean Average Precision
Average Precision
Average of the precisions at the ranks rk(d, q) at which the relevant documents d are retrieved:
AvgP(q) = (1/|R(q)|) · Σ_{d ∈ R(q)} Pr_{rk(d,q)}(q)
Mean Average Precision
Mean over the queries of the Average Precisions:
MAP = (1/N) · Σ_{i=1..N} AvgP(q_i)
MAP measures the tendency of the system to retrieve relevant documents first.
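Average Precision over one ranked result list, and MAP over several queries, can be sketched as follows (toy data; relevant documents that are never retrieved contribute 0, since the sum is divided by |R(q)|):

```python
def average_precision(relevant, ranked_results):
    # Precision at the rank of each retrieved relevant document,
    # summed and divided by the total number of relevant documents.
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(queries):
    # queries: list of (relevant_set, ranked_results) pairs.
    return sum(average_precision(r, s) for r, s in queries) / len(queries)

R = {"d1", "d2"}
print(average_precision(R, ["d1", "d3", "d2"]))  # (1/1 + 2/3) / 2 ≈ 0.833
```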
Plotting average Precision and Recall
[Figure: average Precision vs. Recall curves for three systems: VS (alpha=1), Hybrid (alpha=0.5) and DSIR (alpha=0).]
Aim of the game: push the curve towards the upper right corner
Evaluation campaigns
Evaluation set
1. Document collection
2. Query set
3. Referential
Definition
Referential: list of documents of a collection to be retrieved for one given query (handmade).
Examples of evaluation campaigns
► Smart (1970s)
► TREC (since the 1990s; large collections)
► AMARYLLIS (French)
Limitations of the Vector Space model
Problem
Among the well-known limitations, the impact of
► Polysemy
► Synonymy
Polysemy
Example
Query includes term Bank
→ Bank of England? Bank of fishes? Grand bank? Airplane bank?
Consequences
Negative impact on precision
Synonymy
Example
Query includes term freedom
→ liberty will not be seen as relevant
Consequences
Negative impact on recall
Distributional Semantics Information Retrieval
Idea
There is a high degree of correlation between the observable distributional
characteristics of a term and its meaning:
"a word is characterized by the company it keeps";
Z. Harris (1954), J.R. Firth (1957)
Example
► Some X, for instance, naturally attack rats.
► The X on the roof was exposing its back to the shine of the sun.
► He heard the mewings of X in the forest.
► X is a: . . .
X is a . . .
[Image credit: Bertil Videt, GFDL & CC-by-2.5]
More advanced IR models
• Latent Semantic Indexing
→ aiming at reducing the dimensionality of the original representation space
• Sparck-Jones’ probabilistic models
→ inspirational for the Okapi BM25 weighting scheme
• LDA: Latent Dirichlet Allocation (Blei, Ng, Jordan 2003)
→ probabilistic model with hidden states (“topics”)
Some specific references:
► D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84,
2012.
► J.-C. Chappelier, Topic-based Generative Models for Text Information Access, In
Textual Information Access – Statistical Models, E. Gaussier and F. Yvon eds, ch.
5, pp. 129-178, Wiley-ISTE, April 2012.
Key points
► Vector-space model
► Indexing (and why it is important)
► Weighting schemes: tf-idf
► Cosine similarity measure
► Evaluation: Precision and Recall
References
1. C. D. Manning, P. Raghavan and H. Schütze, “Introduction to Information Retrieval”, Cambridge University Press, 2008.
2. R. Baeza-Yates and B. Ribeiro-Neto, “Modern Information Retrieval”, Addison Wesley, 1999.
3. “Topics in Information Retrieval”, chap. 15 in “Foundations of Statistical Natural Language Processing”, C. D. Manning and H. Schütze, MIT Press, 1999.