TRANSCRIPT
Part 1: Indexing, IR Part 2: IR on the web
Document indexing, similarities and retrieval in large scale text collections
Eric Gaussier
Univ. Grenoble Alpes - [email protected]
Eric Gaussier Document indexing, similarities & retrieval 1
Course objectives
Introduce the main concepts, models and algorithms for indexing text documents and computing similarities in large scale text collections
We will focus on:
Document indexing and representations in large scale collections
Standard models for Information Retrieval (IR)
PageRank (computing the importance of a page on the Web)
Learning to rank models
Application domains
Information retrieval
Query indexing module
Document indexing module
Module to match queries and documents
Classification
Binary, multi-class; mono-/multi-label
Flat vs hierarchical
Clustering
Hard vs soft clustering
Flat vs hierarchical
Part 1: Indexing, similarities, information retrieval
Content
1 Indexing
2 Standard IR models
3 Evaluation
Indexing steps
1 Segmentation
Segment a text into words:
the importance of retrieving the good information
→ the, importance, of, retrieving, the, good, information

7 words but only 6 word types; depending on the language, may require a dictionary
2 Stop-word removal (stop-word list)
3 Normalization
Upper/lower-case, inflected forms, lexical families
Lemmatization, stemming
→ Bag-of-words: importance, retriev, inform
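The three indexing steps above can be sketched in a few lines of Python. The stop-word list and suffix stripper below are toy placeholders chosen to reproduce the slide's example (a real system would use a full stop-word list and a proper stemmer such as Porter's):

```python
import re

# Toy lists, for illustration only; "good" is a stop word here solely so
# the output matches the slide's bag-of-words.
STOP_WORDS = {"the", "a", "an", "of", "and", "good"}
SUFFIXES = ["ation", "ing"]  # crude suffix stripping, not a real stemmer

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index(text):
    # 1. Segmentation (with case normalization)
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Normalization (here: crude stemming)
    return [stem(t) for t in tokens]

print(index("the importance of retrieving the good information"))
# → ['importance', 'retriev', 'inform']
```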
Vector space representation
The set of all word types constitutes the vocabulary of a collection. Let M be the size of the vocabulary and N the number of documents in the collection → M-dimensional vector space (each axis corresponds to a word type)

Each document is represented by a vector whose coordinates correspond to:
Presence/absence or number of occurrences of the word type in the doc: w_i^d = tf_i^d

Normalized number of occurrences: w_i^d = tf_i^d / Σ_{i=1}^M tf_i^d

tf·idf: w_i^d = (tf_i^d / Σ_{i=1}^M tf_i^d) × log(N / df_i), where the second factor is idf_i = log(N / df_i) and df_i is the number of docs in which word (type) i occurs
A sparse representation
Most of the words (terms) occur in only a few documents, so most coordinates of each document are null; storage space is thus saved by considering only the words present in each document → sparse representation
Example
document d:
    int l (doc length)
    ArrWords: int[l] (sorted word indices)
    ArrWeights: float[l] (word weights)
    · · ·
How to compute a dot product between documents?
Dot product with sparse representations
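The transcript carries no body for this slide; a plausible sketch, assuming the sorted-array document layout of the previous slide: walk the two sorted index arrays in parallel, in O(l_d + l_d') time instead of O(M) for dense vectors.

```python
def sparse_dot(words_a, weights_a, words_b, weights_b):
    """Dot product of two docs stored as (sorted word indices, weights)."""
    i = j = 0
    dot = 0.0
    while i < len(words_a) and j < len(words_b):
        if words_a[i] == words_b[j]:   # word present in both docs
            dot += weights_a[i] * weights_b[j]
            i += 1
            j += 1
        elif words_a[i] < words_b[j]:  # word only in doc a: contributes 0
            i += 1
        else:                          # word only in doc b: contributes 0
            j += 1
    return dot

# Doc a contains words 1, 4, 7; doc b contains words 4, 5, 7
print(sparse_dot([1, 4, 7], [0.5, 1.0, 2.0], [4, 5, 7], [3.0, 1.0, 0.5]))
# → 4.0 (word 4: 1.0 × 3.0, word 7: 2.0 × 0.5)
```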
Inverted file
It is possible, with sparse representations, to speed up the comparison between docs by relying on an inverted file that provides, for each term, the list of documents it appears in:
word i:
    int l (number of docs)
    ArrDocs: int[l] (sorted doc indices)
    · · ·
Remark: Advantageous with measures (distances, similarities) that do not rely on words absent from the docs; dot/scalar product? cosine? Euclidean distance?
Building an inverted file
With a static collection, 3 main steps:

1 Extraction of id pairs (term, doc) (complete pass over the collection)

2 Sorting according to term id, then doc id

3 Grouping pairs corresponding to the same term
Easy to implement when everything fits into memory
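A minimal in-memory sketch of the three steps, with term strings standing in for term ids:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Build an inverted file in memory (term strings stand in for term ids)."""
    # Step 1: extraction of (term, doc id) pairs, one pass over the collection
    pairs = [(term, doc_id)
             for doc_id, terms in enumerate(docs) for term in set(terms)]
    # Step 2: sort by term id, then doc id
    pairs.sort()
    # Step 3: group pairs of the same term into a sorted posting list
    inverted = defaultdict(list)
    for term, doc_id in pairs:
        inverted[term].append(doc_id)
    return dict(inverted)

docs = [["caesar", "brutus"], ["caesar"], ["kill", "brutus"]]
print(build_inverted_file(docs))
# → {'brutus': [0, 2], 'caesar': [0, 1], 'kill': [2]}
```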
How to proceed with large collections?
Insufficient memory
Intermediate "inverted files" are temporarily stored on disk. As before, 3 main steps:

1 Extraction of id pairs (term, doc) (previous algo.) and writing on file F

2 Reading file F by blocks that can fit into memory; inversion of each block (previous algo.) and writing in a series of files

3 Merging all local files to create the global inverted file
→ Blocked sort-based indexing (BSBI) algorithm
BSBI (1)
n ← 0
while (some docs have not been processed) do
    n ← n + 1
    block ← ParseBlock()
    BSBI-Invert(block)
    WriteBlockToDisk(block, f_n)
MergeBlocks(f_1, ..., f_n; f_merged)
BSBI (2)
The inversion (in BSBI) consists in sorting pairs on two different keys (term and doc ids). Complexity in O(T log T), where T represents the number of (term, doc) pairs
Example
t1 = "brutus", t2 = "caesar", t3 = "julius", t4 = "kill", t5 = "noble"

t1 : d1     t2 : d4     t2 : d1
t3 : d10    t1 : d3     t4 : d8
t5 : d5     t2 : d2     t1 : d7
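In Python, this inversion is just a sort on the (term id, doc id) pairs of the example (the built-in sort gives the O(T log T) bound):

```python
# The nine (term, doc id) pairs of the example, in extraction order
pairs = [("t1", 1), ("t2", 4), ("t2", 1),
         ("t3", 10), ("t1", 3), ("t4", 8),
         ("t5", 5), ("t2", 2), ("t1", 7)]

pairs.sort()  # O(T log T): term id is the primary key, doc id the secondary
print(pairs)
# → [('t1', 1), ('t1', 3), ('t1', 7), ('t2', 1), ('t2', 2), ('t2', 4),
#    ('t3', 10), ('t4', 8), ('t5', 5)]
```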
Standard IR models
The different standard models
Boolean model
Vector-space model
Prob. models
Notations
x_w^q : nbr of occ. of w in query q
x_w^d : nbr of occ. of w in doc d
t_w^d : normalized version of x_w^d (weight)
N : nbr of docs in collection
M : nbr of words in collection
F_w : total nbr of occ. of w: F_w = Σ_d x_w^d
N_w : nbr of docs in which w occurs: N_w = Σ_d I(x_w^d > 0)
y_d : length of doc d
m : average length in the collection
L : length of the collection
RSV Retrieval Status Value (score)
Boolean model (1)
Simple model based on set theory and Boolean algebra, characterized by:
Binary weights (presence/absence)
Queries as boolean expressions
Binary relevance
System relevance: satisfaction of the boolean query
Boolean model (2)
Example
q = programming ∧ language ∧ (C ∨ java)
(in DNF: q = [prog. ∧ lang. ∧ C] ∨ [prog. ∧ lang. ∧ java])

       programming   language   C       java    · · ·
d1     3 (1)         2 (1)      4 (1)   0 (0)   · · ·
d2     5 (1)         1 (1)      0 (0)   0 (0)   · · ·
d0     0 (0)         0 (0)      0 (0)   3 (1)   · · ·

Relevance score: RSV(d_j, q) = 1 iff ∃ q_cc ∈ q_dnf s.t. ∀w, t_w^{d_j} = t_w^{q_cc}; 0 otherwise
Boolean model (3)
Algorithmic considerations: with a sparse term-document matrix, use the inverted file to select all documents in conjunctive blocks (which can be processed in parallel), via intersection of document lists

               d1    d2    d3    · · ·
programming    1     1     0     · · ·
language       1     1     0     · · ·
C              1     0     0     · · ·
· · ·
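The list-intersection step can be sketched as a merge of two sorted posting lists; `intersect` is an illustrative helper, with doc ids taken from the table above:

```python
def intersect(postings_a, postings_b):
    """Intersect two sorted posting lists: docs matching term_a AND term_b."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# q = programming AND language AND C, with the posting lists of the table
postings = {"programming": [1, 2], "language": [1, 2], "C": [1]}
docs = intersect(intersect(postings["programming"], postings["language"]),
                 postings["C"])
print(docs)  # → [1]: only d1 satisfies the conjunction
```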
Boolean model (4)
Advantages and disadvantages
+ Easy to implement (at the basis of all models with a union operator)
- Binary relevance not adapted to topical overlaps
- From an information need to a boolean query

Remark: At the basis of many commercial systems
Vector space model (1)
Corrects two drawbacks of the boolean model: binary weights and binary relevance
It is characterized by:
Positive weights for each term (in docs and queries)
A representation of documents and queries as vectors (see before onbag-of-words)
[Figure: query q and document d as vectors in the term vector space, with axes w_1, w_2, ..., w_M]
Vector space model (2)
Docs and queries are vectors in an M-dimensional space whose axes correspond to word types
Similarity: cosine between two vectors

RSV(d_j, q) = Σ_w t_w^d t_w^q / ( √(Σ_w (t_w^d)²) × √(Σ_w (t_w^q)²) )
Property: The cosine is maximal when the document and the query contain the same words, in the same proportions! It is minimal when they have no term in common (similarity score)
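A small sketch of the cosine RSV on bag-of-words weight dicts (the dicts are illustrative); the second call demonstrates the property above, since the query is proportional to the document:

```python
import math

def cosine_rsv(doc, query):
    """Cosine between two bag-of-words weight dicts {term: weight}."""
    dot = sum(w * query[t] for t, w in doc.items() if t in query)
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

d = {"programming": 3, "language": 2}
print(round(cosine_rsv(d, {"programming": 1, "language": 1}), 3))  # → 0.981
# Same words in the same proportions: the score reaches its maximum, 1
print(round(cosine_rsv(d, {"programming": 6, "language": 4}), 3))  # → 1.0
```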
Vector space model (3)
Advantages and disadvantages
+ Total order (on the document set): distinction between documents that completely or partially answer the information need
- Framework relatively simple; not amenable to different extensions

Complexity: similar to the boolean model (dot product only computed on documents that contain at least one query term)
Probabilistic models
Binary Independence Model and BM25 (S. Robertson & K. Sparck Jones)

Inference Network Model (Inquery) - Belief Network Model (Turtle & Croft)

(Statistical) Language Models:
    Query likelihood (Ponte & Croft)
    Probabilistic distance retrieval model (Zhai & Lafferty)

Divergence from Randomness (Amati & Van Rijsbergen) - Information-based models (Clinchant & Gaussier)
Generalities
Boolean model → binary relevance
Vector space model → similarity score
Probabilistic model → probability of relevance

Two points of view: document generation (probability that the document is relevant to the query - BIR, BM25) and query generation (probability that the document "generated" the query - LM)
Introduction to language models: two dice

Let D1 and D2 be two (standard) dice such that, for small ε:

For D1, P(1) = P(3) = P(5) = 1/3 − ε and P(2) = P(4) = P(6) = ε

For D2, P(1) = P(3) = P(5) = ε and P(2) = P(4) = P(6) = 1/3 − ε

Imagine you observe the sequence Q = (1, 3, 3, 2). Which die most likely produced this sequence?

Answer
P(Q|D1) = (1/3 − ε)³ ε ; P(Q|D2) = (1/3 − ε) ε³
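A quick numeric check of the answer (ε = 0.01 is an arbitrary small value):

```python
import math

eps = 0.01  # arbitrary small value
# D1 favours the odd faces, D2 the even faces, as defined above
D1 = {f: (1 / 3 - eps if f % 2 else eps) for f in range(1, 7)}
D2 = {f: (eps if f % 2 else 1 / 3 - eps) for f in range(1, 7)}

Q = (1, 3, 3, 2)
p_q_d1 = math.prod(D1[x] for x in Q)  # (1/3 - eps)^3 * eps
p_q_d2 = math.prod(D2[x] for x in Q)  # (1/3 - eps) * eps^3
print(p_q_d1 > p_q_d2)  # → True: D1 most likely produced Q
```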
Language model - QL (1)
Documents are dice; a query is a sequence → What is the probability that a document (die) generated the query (sequence)?
(RSV(q, d) =) P(q|d) = Π_{w∈q} P(w|d)^{x_w^q}
How to estimate the quantities P(w |d)?
→ Maximum Likelihood principle ⇒ p(w|d) = x_w^d / Σ_w x_w^d
Problem with query words not present in docs
Language model - QL (2)
Solution: smoothing
One takes into account the collection model:

p(w|d) = (1 − α_d) x_w^d / Σ_w x_w^d + α_d F_w / Σ_w F_w
Example with Jelinek-Mercer smoothing: αd = λ
D: development set (collection, some queries and associated relevance judgements)

λ = 0
Repeat until λ = 1:
    IR on D and evaluation (store evaluation score and associated λ)
    λ ← λ + ε
Select best λ
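A sketch of the smoothed query-likelihood score with Jelinek-Mercer smoothing (α_d = λ); the toy collection and the fixed λ = 0.5 are illustrative, and query words are assumed to occur somewhere in the collection:

```python
import math
from collections import Counter

def ql_score(query, doc, collection, lam=0.5):
    """Log query likelihood with Jelinek-Mercer smoothing (alpha_d = lambda).

    p(w|d) = (1 - lam) * x_w^d / sum_w x_w^d + lam * F_w / sum_w F_w
    Assumes every query word occurs somewhere in the collection.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(w for d in collection for w in d)
    doc_len = sum(doc_counts.values())
    coll_len = sum(coll_counts.values())
    score = 0.0
    for w in query:
        p = (1 - lam) * doc_counts[w] / doc_len \
            + lam * coll_counts[w] / coll_len
        score += math.log(p)  # log of the product, to avoid underflow
    return score

collection = [["caesar", "brutus", "kill"], ["caesar", "caesar", "noble"]]
# "noble" is absent from doc 0, yet smoothing keeps its probability non-zero
print(ql_score(["caesar", "noble"], collection[0], collection))
```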
Language model - QL (3)
Advantages and disadvantages
+ Theoretical framework: simple, well-founded, easy to implement and leading to very good results
+ Easy to extend to other settings such as cross-language IR
- Training data needed to estimate smoothing parameters
- Conceptual deficiency for (pseudo-)relevance feedback

Complexity: similar to the vector space model
Evaluation of IR systems
Relevance judgements
Binary judgements: the doc is relevant (1) or not relevant (0) to the query

Multi-valued judgements: Perfect > Excellent > Good > Correct > Bad

Preference pairs: doc d_A more relevant than doc d_B to the query

Several (large) collections with many (> 30) queries and associated (binary) relevance judgements: TREC collections (trec.nist.gov), CLEF (www.clef-campaign.org), FIRE (fire.irsi.res.in)
Common evaluation measures
MAP (Mean Average Precision)
MRR (Mean Reciprocal Rank)
For a given query q, let r_q be the rank of the first relevant document retrieved
MRR: mean of 1/r_q over all queries
WTA (Winner Takes All)
If the first retrieved doc is relevant, s_q = 1; s_q = 0 otherwise
WTA: mean of s_q over all queries
NDCG (Normalized Discounted Cumulative Gain)
NDCG
• NDCG at position k:

N(k) = Z_k Σ_{j=1}^k (2^{p(j)} − 1) / log_2(j + 1)

where Z_k is a normalization factor, the sum cumulates the contributions up to position k, 2^{p(j)} − 1 is the gain at position j (p(j) is the relevance grade), and 1/log_2(j + 1) is the position discount
• Averaged over all queries
G: Gain

Relevance       Value (gain)
Perfect (5)     31 = 2^5 − 1
Excellent (4)   15 = 2^4 − 1
Good (3)        7 = 2^3 − 1
Correct (2)     3 = 2^2 − 1
Bad (0)         0 = 2^0 − 1
DCG : Discounted CG
Discounting factor: ln(2)/ln(j + 1)

Doc. (rank)   Rel.        Gain   CG             DCG
1             Perf. (5)   31     31             31
2             Corr. (2)   3      34 = 31 + 3    32.9 = 31 + 3 × 0.63
3             Exc. (4)    15     49             40.4
4             Exc. (4)    15     64             46.9
· · ·
Ideal ranking: max DCG
Document (rank)   Relevance       Gain   max DCG
1                 Perfect (5)     31     31
3                 Excellent (4)   15     40.5
4                 Excellent (4)   15     48
· · ·
Normalized DCG
Doc. (rank)   Rel.            Gain   DCG    max DCG   NDCG
1             Perfect (5)     31     31     31        1
2             Correct (2)     3      32.9   40.5      0.81
3             Excellent (4)   15     40.4   48        0.84
4             Excellent (4)   15     46.9   54.5      0.86
· · ·
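The tables above can be reproduced in a few lines of Python. The fifth grade (another Excellent document further down the list) is an assumption needed to match the table's max DCG at position 4; note also that the tables round intermediate values cumulatively, so the last digit may differ slightly:

```python
import math

def dcg(grades, k):
    """DCG at position k, with gain 2^g - 1 and discount 1/log2(j + 1)."""
    return sum((2 ** g - 1) / math.log2(j + 1)
               for j, g in enumerate(grades[:k], start=1))

def ndcg(grades, k):
    ideal = sorted(grades, reverse=True)  # ideal ranking: best grades first
    return dcg(grades, k) / dcg(ideal, k)

# Grades of the ranked list above (5, 2, 4, 4), plus an assumed extra
# Excellent doc further down the list
ranked = [5, 2, 4, 4, 4]
print(round(dcg(ranked, 4), 1))   # → 46.9, as in the DCG table
print(round(ndcg(ranked, 4), 2))  # → 0.86, as in the NDCG table
```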
Remarks on evaluation measures
• Measures for a given position (e.g. a list of 10 retrieved documents)

• NDCG is more general than MAP (multi-valued relevance vs binary relevance)

• Non-continuous (and thus non-differentiable)
Part 2: IR on the web - PageRank and Learning to Rank
Content
1 PageRank
2 IR and ML: Learning to Rank (L2R)
3 Which training data?
What is the particularity of the web?
→ A collection with hyperlinks, the graph of the web, and anchor texts

1 Possibility to augment the standard index of a page with anchor texts

2 Possibility to use the importance of a page in the retrieval score (PageRank)

3 Possibility to augment the representation of a page with new features
PageRank
What is the importance of a page?
1 Number of incoming links
2 Ratio of incoming/outgoing links
3 A page is important if it is often linked by important pages
A simple random walk
Imagine a walker that starts on a page and randomly steps to a page pointed to by the current page. In an infinite random walk, he/she will have visited pages according to their "importance" (the more important a page is, the more likely the walker visits it)
Problems
1 Dead ends, black holes
2 Cycles
Solution: teleportation
At each step, the walker can either randomly choose an outgoing page, with prob. λ, or teleport to any page of the graph, with prob. (1 − λ)

It's as if all web pages were connected (completely connected graph)

The random walk thus defines a Markov chain with probability matrix:

P_ij = λ A_ij / Σ_{j=1}^N A_ij + (1 − λ) / N   if Σ_{j=1}^N A_ij ≠ 0
P_ij = 1 / N                                   otherwise

where A_ij = 1 if there is a link from i to j and 0 otherwise
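A sketch of PageRank by power iteration on this matrix; the damping value λ = 0.85 and the 3-node graph are illustrative, and the dead-end row (no out-links) falls back to the uniform 1/N case of the formula:

```python
def pagerank(A, lam=0.85, iters=100):
    """Power iteration on the teleportation-smoothed transition matrix P."""
    N = len(A)
    P = []
    for row in A:
        out = sum(row)
        if out == 0:  # dead end: teleport uniformly (the 1/N case)
            P.append([1 / N] * N)
        else:
            P.append([lam * a / out + (1 - lam) / N for a in row])
    pi = [1 / N] * N  # uniform initial distribution
    for _ in range(iters):  # pi <- pi P until (approximate) convergence
        pi = [sum(pi[i] * P[i][j] for i in range(N)) for j in range(N)]
    return pi

# Tiny illustrative graph: 0 -> 1, 1 -> 0 and 1 -> 2, 2 has no out-links
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 0, 0]]
scores = pagerank(A)
print(scores)  # a probability distribution; page 1 gets the highest score
```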
Definitions and notations
Definition 1 A sequence of random variables X_0, ..., X_n is said to be a (finite state) Markov chain for some state space S if, for any x_{n+1}, x_n, ..., x_0 ∈ S:

$$P(X_{n+1} = x_{n+1} \mid X_0 = x_0, \ldots, X_n = x_n) = P(X_{n+1} = x_{n+1} \mid X_n = x_n)$$
X_0 is called the initial state and its distribution the initial distribution

Definition 2 A Markov chain is called homogeneous or stationary if P(X_{n+1} = y | X_n = x) is independent of n for any x, y

Definition 3 Let {X_n} be a stationary Markov chain. The probabilities P_ij = P(X_{n+1} = j | X_n = i) are called the one-step transition probabilities. The associated matrix P is called the transition probability matrix
Definitions and notations (cont’d)
Definition 4 Let {X_n} be a stationary Markov chain. The probabilities P^(n)_ij = P(X_{n+m} = j | X_m = i) are called the n-step transition probabilities. The associated matrix P^(n) is called the n-step transition probability matrix
Remark: P is a stochastic matrix
Theorem (Chapman-Kolmogorov equation) Let {X_n} be a stationary Markov chain and n, m ≥ 1. Then:

$$P^{(m+n)}_{ij} = P(X_{m+n} = j \mid X_0 = i) = \sum_{k \in S} P^{(m)}_{ik}\, P^{(n)}_{kj}$$
Regularity (ergodicity)
Definition 5 Let {X_n} be a stationary Markov chain with transition probability matrix P. It is called regular if there exists n_0 > 0 such that P^(n_0)_ij > 0 for all i, j ∈ S
Theorem (fundamental theorem for finite Markov chains) Let {X_n} be a regular, stationary Markov chain on a state space S of t elements. Then there exist π_j, j = 1, 2, ..., t such that:

(a) For any initial state i, P(X_n = j | X_0 = i) → π_j, j = 1, 2, ..., t

(b) The row vector π = (π_1, π_2, ..., π_t) is the unique solution of the equations πP = π, π1 = 1

(c) Any row of P^r converges towards π when r → ∞
Remark: π is called the long-run or stationary distribution
Summary (1)
1 Stationary, regular Markov chains admit a stationary (steady-state) distribution
2 This distribution can be obtained in different ways:
Power method: let the chain run for a sufficiently long time! Any row of P^k converges to π as k → ∞
Linear system: solve the linear system associated with πP = π, π1 = 1 (e.g. Gauss-Seidel)
Eigenvector: π is the left eigenvector of P associated with the highest eigenvalue (which is 1)
The PageRank can be obtained by any of these methods
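A minimal sketch of two of these routes on a small hand-made stochastic matrix (the matrix values are illustrative, not from the slides):

```python
import numpy as np

# A regular stochastic matrix (illustrative values; rows sum to 1)
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])

# Power method: any initial distribution converges to pi
pi = np.full(3, 1.0 / 3)
for _ in range(200):
    pi = pi @ P

# Eigenvector route: left eigenvector of P for eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
v = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
v = v / v.sum()  # normalize so the entries sum to 1

print(np.allclose(pi, v))       # both methods agree
print(np.allclose(pi, pi @ P))  # pi is stationary: pi P = pi
```

For web-scale matrices only the power method (with sparse matrix-vector products) remains practical.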
Summary (2)
Two main innovations at the basis of Web search engines at theend of the 90’s:
1 Rely on additional index terms contained in anchor texts
2 Integrate the importance of a web page (PageRank) into the scoreof a page
IR and ML: Learning to Rank
Introduction to ML and SVMs (1)
One looks for a decision function of the form:

$$f(x) = \operatorname{sgn}(\langle w, x \rangle + b) = \operatorname{sgn}(w^T x + b) = \operatorname{sgn}\Big(b + \sum_{j=1}^{p} w_j x_j\Big)$$

The equation ⟨w, x⟩ + b = 0 defines a hyperplane with margin 2/‖w‖
Figure 15.1 (from Manning et al., Introduction to Information Retrieval, p. 294): the support vectors are the five points right up against the margin of the classifier; the maximum-margin decision hyperplane is chosen so that the margin is maximized.

15.1 Support vector machines: the linearly separable case

For two-class, separable training data sets, such as the one in Figure 14.8 (page 278), there are lots of possible linear separators. Intuitively, a decision boundary drawn in the middle of the void between data items of the two classes seems better than one which approaches very close to examples of one or both classes. Although some learning methods such as the perceptron algorithm (see references in Section 14.7, page 291) find just any linear separator, others, like Naive Bayes, search for the best linear separator according to some criterion. The SVM in particular defines the criterion to be looking for a decision surface that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier. This method of construction necessarily means that the decision function for an SVM is fully specified by a (usually small) subset of the data that defines the position of the separator. These points are referred to as the support vectors (in a vector space, a point can be thought of as a vector between the origin and that point). Figure 15.1 shows the margin and support vectors for a sample problem. Other data points play no part in determining the decision surface that is chosen.

Maximizing the margin seems good because points near the decision surface represent very uncertain classification decisions; there is almost a 50% chance of the classifier deciding either way. A classifier with a large margin makes no low-certainty classification decisions. This gives you a classification safety margin: a slight error in measurement or a slight document variation will not cause a misclassification. Another intuition motivating SVMs ...
Introduction to ML and SVMs (2)
Finding the separating hyperplane with maximal margin amounts to solving the following problem, given a training set {(x^(1), y^(1)), ..., (x^(n), y^(n))}:

$$\begin{cases} \text{Minimize } \frac{1}{2} w^T w \\ \text{subject to } y^{(i)}(\langle w, x^{(i)} \rangle + b) \geq 1, \;\; i = 1, \ldots, n \end{cases}$$
Non-separable case:

$$\begin{cases} \text{Minimize } \frac{1}{2} w^T w + C \sum_i \xi_i \\ \text{subject to } \xi_i \geq 0, \;\; y^{(i)}(\langle w, x^{(i)} \rangle + b) \geq 1 - \xi_i, \;\; i = 1, \ldots, n \end{cases}$$
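The non-separable objective can be minimized in several ways; the sketch below uses plain subgradient descent on the equivalent unconstrained hinge-loss formulation (toy data and step sizes are illustrative, not a production QP solver):

```python
import numpy as np

# Toy separable training set with labels +1/-1
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Minimize 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i(<w,x_i> + b))
C, lr = 10.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1  # examples with active hinge loss (the xi_i > 0 ones)
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print(np.sign(X @ w + b))  # recovers the training labels
```

Dedicated solvers (SMO, coordinate descent) are used in practice, but the objective they optimize is exactly the one above.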
Introduction to ML and SVMs (3)
The decision function can take two equivalent forms. The "primal" form:

$$f(x) = \operatorname{sgn}(\langle w, x \rangle + b) = \operatorname{sgn}(\langle w^*, x_{\text{aug}} \rangle)$$

and the "dual" form:

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{n} \alpha_i\, y^{(i)} \langle x^{(i)}, x \rangle + b\Big)$$
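Assuming scikit-learn is available, the equivalence of the two forms can be checked numerically on toy data: `dual_coef_` stores the products α_i y^(i) for the support vectors (a sketch, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC  # assumption: scikit-learn is installed

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.5, 1.5])
# "Primal" form: <w, x> + b
primal = clf.coef_[0] @ x_new + clf.intercept_[0]
# "Dual" form: sum_i alpha_i y_i <x_i, x> + b, over the support vectors only
dual = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(np.isclose(primal, dual))  # True: both forms give the same score
```

The dual form matters because replacing the inner product with a kernel gives non-linear classifiers at no extra cost.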
Modeling IR as a binary classification problem
What is an example? A doc? A query? → A (query, doc) pair: x = (q, d) ∈ R^p

General coordinates (features) f_i(q, d), i = 1, ..., p, such as:

$$f_1(q,d) = \sum_{t \in q \cap d} \log(t_d) \qquad f_2(q,d) = \sum_{t \in q} \log\Big(1 + \frac{t_d}{|C|}\Big)$$

$$f_3(q,d) = \sum_{t \in q \cap d} \log(\mathrm{idf}(t)) \qquad f_4(q,d) = \sum_{t \in q \cap d} \log\Big(\frac{|C|}{t_C}\Big)$$

$$f_5(q,d) = \sum_{t \in q} \log\Big(1 + \frac{t_d}{|C|}\,\mathrm{idf}(t)\Big) \qquad f_6(q,d) = \sum_{t \in q} \log\Big(1 + \frac{t_d}{|C|}\,\frac{|C|}{t_C}\Big)$$

$$f_7(q,d) = RSV_{\text{vect}}(q,d) \qquad f_8(q,d) = \mathrm{PageRank}(d) \qquad f_9(q,d) = RSV_{LM}(q,d)$$

...
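A toy computation of three of these features (the helper name and the counts are illustrative; t_d is the frequency of t in d, t_C its frequency in the collection C):

```python
import math

def qd_features(q_terms, doc_tf, idf, coll_tf, C_size):
    """Compute f1, f3 and f4 for one (query, doc) pair (illustrative helper).

    doc_tf[t]: frequency of term t in the document (t_d),
    coll_tf[t]: frequency of t in the collection (t_C),
    C_size: collection size |C|."""
    common = [t for t in q_terms if doc_tf.get(t, 0) > 0]  # t in q ∩ d
    f1 = sum(math.log(doc_tf[t]) for t in common)
    f3 = sum(math.log(idf[t]) for t in common)
    f4 = sum(math.log(C_size / coll_tf[t]) for t in common)
    return f1, f3, f4

# Hypothetical counts for a two-term query
q = ["information", "retrieval"]
doc_tf = {"information": 3, "retrieval": 1}
idf = {"information": 2.0, "retrieval": 4.0}
coll_tf = {"information": 500, "retrieval": 250}
feats = qd_features(q, doc_tf, idf, coll_tf, C_size=100000)
print(feats)
```

Each (query, doc) pair thus becomes a fixed-length vector on which any classifier can be trained.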
Application
Each pair x = (q, d) containing a relevant (resp. non-relevant) doc for the query is associated with the positive class +1 (resp. the negative class −1).

Remarks

1 One uses the value of the decision function (not its sign) to obtain an order on documents
2 The method assigns a score to a (query, doc) pair independently of other documents → pointwise method
3 Main advantage over previous models: possibility to easily integrate new (useful) features
4 Main disadvantage: need for many more annotations
5 Another drawback: the objective function differs from the evaluation function (the true objective)
Preference pairs and ranking
1 Relevance is not an absolute notion, and it is easier to compare the relative relevance of, say, two documents
2 One is looking for a function f that preserves a partial order between docs (for a given query): x_i ≺ x_j ⟺ f(x_i) < f(x_j), with x_i again being a (query, doc) pair: x_i = (d_i, q)

Can we apply the same approach as before? Idea: transform ranking information into classification information by forming the difference between pairs. From two documents (d_i, d_j), form:

$$x^{(i,j)} = \Big(x_i - x_j,\; z = \begin{cases} +1 & \text{if } x_i \prec x_j \\ -1 & \text{if } x_j \prec x_i \end{cases}\Big)$$

then apply the previous method!
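The transformation can be sketched as follows; the helper name is hypothetical and the code uses one of the two symmetric labeling conventions (+1 marks the difference taken in the order better − worse):

```python
def pairwise_examples(ranked):
    """ranked: feature vectors x_i ordered from most to least relevant
    (for one query). Emit (x_i - x_j, z) difference examples."""
    examples = []
    n = len(ranked)
    for i in range(n):
        for j in range(i + 1, n):
            diff = [a - b for a, b in zip(ranked[i], ranked[j])]
            examples.append((diff, +1))
            examples.append(([-d for d in diff], -1))  # symmetric example
    return examples

# Three docs for one query, each a 2-feature vector, best first
pairs = pairwise_examples([[3.0, 1.0], [2.0, 0.5], [1.0, 0.2]])
print(len(pairs))  # 3 docs -> 3 ordered pairs -> 6 signed examples
```

A standard linear SVM trained on these signed differences yields the ranking function.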
Remarks on ranking SVM
How to use w* in practice?

Property: d ≻_q d′ iff ⟨w*, x_(q,d) − x_(q,d′)⟩ is positive

However, a strict application is too costly, and one instead uses the SVM score:

$$RSV(q, d) = \langle w^*, x_{(q,d)} \rangle$$

But

No difference between errors made at the top or in the middle of the list
Queries with more relevant documents have a stronger impact on w*
RSVM-IR (1)
Idea: modify the optimization problem so as to take into account the doc ranks (τ_k) and the query type (μ_q):

$$\begin{cases} \text{Minimize } \frac{1}{2} w^T w + C \sum_l \tau_{k(l)}\, \mu_{q(l)}\, \xi_l \\ \text{subject to } \xi_l \geq 0, \;\; y^{(l)} \langle w, x^{(l)} \rangle \geq 1 - \xi_l, \;\; l = 1, \ldots, p \end{cases}$$

where q(l) is the query in the l-th example and k(l) is the rank type of the docs in the l-th example
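Assuming scikit-learn is available, the weighted-slack objective can be approximated by passing the products τ_{k(l)} μ_{q(l)} as per-example weights, since `sample_weight` scales each example's contribution to the slack penalty (a sketch with toy values, not the exact RSVM-IR solver):

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumption: scikit-learn is installed

# Difference vectors x^(l) with labels y^(l), as in ranking SVM
X = np.array([[1.0, 0.5], [0.8, 0.1], [-1.0, -0.5], [-0.8, -0.1]])
y = np.array([1, 1, -1, -1])

# tau_{k(l)} * mu_{q(l)}: up-weight pairs involving top-ranked documents
weights = np.array([3.0, 1.0, 3.0, 1.0])

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y, sample_weight=weights)
print(clf.predict(X))  # separable toy data: training pairs are recovered
```

Errors on heavily weighted pairs cost more, so the learnt w* concentrates on getting the top of the list right.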
RSVM-IR (2)
Once w* has been learnt (standard optimization), it is used as in the standard RSVM

The results obtained are state of the art, especially on web-like collections

A pairwise approach, which moves beyond a limited view of relevance (absolute relevance)
General remarks
1 Listwise approach: directly treat lists as examples; however, no clear gain with respect to pairwise approaches
2 Difficult to rely on an optimal objective function
3 These methods require a lot of annotations
Which training data?
Building training data
• Several annotated collections exist:

TREC (TRECVID for video)

CLEF

NTCIR

• For new collections, such as company intranets, annotated collections do not exist and may be difficult to build → standard models, with little training

• What about the web?
Training data on the web
• An important source of information: click data from users
Use clicks to infer preferences between docs (preference pairs)
In addition, and if possible, use eye-tracking data
• What can be deduced from clicks?
Exploiting clicks (1)
Clicks cannot be used to infer absolute relevance judgements; they can nevertheless be used to infer relative relevance judgements. Let (d_1, d_2, d_3, ...) be an ordered list of documents retrieved for a particular query and let C denote the set of clicked documents. The following strategies can be used to build relative relevance judgements:

1 If d_i ∈ C and d_j ∉ C, then d_i ≻_q d_j
2 If d_i is the last clicked doc, then ∀j < i with d_j ∉ C, d_i ≻_q d_j
3 ∀i ≥ 2 with d_i ∈ C and d_{i−1} ∉ C, d_i ≻_q d_{i−1}
4 ∀i with d_i ∈ C and d_{i+1} ∉ C, d_i ≻_q d_{i+1}
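A sketch of one such heuristic, in the spirit of strategies 1 and 2 (a clicked document is preferred to the unclicked documents ranked above it); the helper name is hypothetical:

```python
def click_preferences(ranked_docs, clicked):
    """Derive relative relevance pairs (preferred, less_preferred) from
    clicks: each clicked doc beats every unclicked doc ranked above it."""
    prefs = []
    for i, d in enumerate(ranked_docs):
        if d in clicked:
            for d_above in ranked_docs[:i]:
                if d_above not in clicked:
                    prefs.append((d, d_above))
    return prefs

# Results list for one query; the user clicked d1 and d3
prefs = click_preferences(["d1", "d2", "d3", "d4"], clicked={"d1", "d3"})
print(prefs)  # [('d3', 'd2')]
```

The resulting pairs feed directly into the pairwise ranking methods of the previous slides.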
Exploiting clicks (2)
The above strategies yield a partial order between docs, leading to a very large training set on which one can deploy learning-to-rank methods.

IR on the web has been characterized by a "data rush":

Index as many pages as possible
Get as much click data as possible
Letor
http://research.microsoft.com/en-us/um/beijing/projects/letor/
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
Conclusion on L2R
Approaches aiming at exploiting all the available information (60 features for the .gov collection, for example, including scores of standard IR models)

Approaches aiming at "ranking" documents (pairwise, listwise)

Many proposals (neural nets, boosting and ensemble methods, ...); no clear difference across all collections

State-of-the-art methods when many features are available
References (1)
• Burges et al. Learning to Rank with Nonsmooth Cost Functions, NIPS2006
• Cao et al. Adapting Ranking SVM to Document Retrieval, SIGIR 2006
• Cao et al. Learning to Rank: From Pairwise to Listwise Approach,ICML 2007
• Goswami et al. Query-based learning of IR model parameters onunlabelled collections, ICTIR 2015
• Joachims et al. Accurately Interpreting Clickthrough Data as ImplicitFeedback, SIGIR 2005
• Liu, Learning to Rank for Information Retrieval, tutorial, 2008.

• Manning et al. Introduction to Information Retrieval. Cambridge University Press, 2008. www-csli.stanford.edu/∼hinrich/information-retrieval-book.html
References (2)
• Nallapati Discriminative model for Information Retrieval, SIGIR 2004
• Yue et al. A Support Vector Method for Optimizing Average Precision,SIGIR 2007
• Workshop LR4IR, 2007 (Learning to Rank for Information Retrieval).