Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Ben Carterette and Praveen Chandar, Dept. of Computer and Information Science, University of Delaware, Newark, DE (CIKM ’09). Date: 2010/05/03. Speaker: Lin, Yi-Jhen. Advisor: Dr. Koh, Jia-Ling.
![Page 1: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/1.jpg)
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval
Ben Carterette and Praveen Chandar
Dept. of Computer and Information Science, University of Delaware
Newark, DE (CIKM ’09)
Date: 2010/05/03
Speaker: Lin, Yi-Jhen
Advisor: Dr. Koh, Jia-Ling
![Page 2: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/2.jpg)
Agenda
- Introduction
  - Motivation, Goal
- Faceted Topic Retrieval
  - Task, Evaluation
- Faceted Topic Retrieval Models
  - 4 kinds of models
- Experiment & Results
- Conclusion
![Page 3: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/3.jpg)
Introduction - Motivation
Modeling documents as independently relevant does not necessarily provide the optimal user experience.
![Page 4: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/4.jpg)
Introduction - Motivation
A traditional evaluation measure would reward System1, since it has higher recall.
Actually, we prefer System2, since it provides more information.
System2 is better!
![Page 5: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/5.jpg)
Introduction
Novelty and diversity become the new definition of relevance and of evaluation measures.
They can be achieved by retrieving documents that are relevant to the query but cover different facets of the topic.
We call this faceted topic retrieval!
![Page 6: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/6.jpg)
Introduction - Goal
The faceted topic retrieval system must be able to find a small set of documents that covers all of the facets.
3 documents that cover 10 facets are preferable to 5 documents that cover 10 facets.
![Page 7: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/7.jpg)
Faceted Topic Retrieval - Task
Define the task in terms of:
- Information need: a faceted topic retrieval information need is one that has a set of answers (facets) that are clearly delineated
- How that need is best satisfied: each answer is fully contained within at least one document
![Page 8: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/8.jpg)
Faceted Topic Retrieval - Task
Example of an information need and its facets (a set of answers):
- invest in next generation technologies
- increase use of renewable energy sources
- invest in renewable energy sources
- double ethanol in gas supply
- shift to biodiesel
- shift to coal
![Page 9: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/9.jpg)
Faceted Topic Retrieval
- Input: a query (a short list of keywords)
- Output: a ranked list of documents D1, D2, ..., Dn that contain as many unique facets as possible
![Page 10: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/10.jpg)
Faceted Topic Retrieval - Evaluation
- S-recall
- S-precision
- Redundancy
![Page 11: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/11.jpg)
Evaluation – an example for S-recall and S-precision
Total: 10 facets (assume the facets in the documents do not overlap)
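The slide's worked example is an image and is not reproduced here, so the following is only a minimal sketch of how the two measures are commonly computed. It assumes the usual subtopic definitions: S-recall at rank k is the fraction of all facets covered by the top k documents, and S-precision at a given S-recall is the smallest possible number of documents needed to reach that recall divided by the rank at which the system reaches it. The judgments, document names, and function names are hypothetical, and the brute-force search for the optimal set is only practical for tiny examples.

```python
from itertools import combinations


def s_recall(ranking, facets_of, total_facets, k):
    """Fraction of all facets covered by the top-k documents."""
    covered = set()
    for doc in ranking[:k]:
        covered |= facets_of.get(doc, set())
    return len(covered) / total_facets


def min_rank(ranking, facets_of, total_facets, recall):
    """Smallest rank at which the ranking reaches the given S-recall, or None."""
    for k in range(1, len(ranking) + 1):
        if s_recall(ranking, facets_of, total_facets, k) >= recall:
            return k
    return None


def optimal_min_rank(facets_of, total_facets, recall):
    """Size of the smallest document set reaching the given S-recall (brute force)."""
    docs = list(facets_of)
    for size in range(1, len(docs) + 1):
        for subset in combinations(docs, size):
            if s_recall(list(subset), facets_of, total_facets, size) >= recall:
                return size
    return None


def s_precision(ranking, facets_of, total_facets, recall):
    """Optimal minimum rank divided by the system's minimum rank at this S-recall."""
    system = min_rank(ranking, facets_of, total_facets, recall)
    best = optimal_min_rank(facets_of, total_facets, recall)
    if system is None or best is None:
        return 0.0
    return best / system


# Hypothetical judgments: 10 facets in total, non-overlapping across documents.
facets_of = {
    "D1": {1, 2, 3, 4},
    "D2": {5, 6, 7},
    "D3": {8, 9, 10},
    "D4": set(),  # a non-relevant document
}
ranking = ["D1", "D4", "D2", "D3"]
print(s_recall(ranking, facets_of, 10, 2))       # facets covered by the top 2
print(s_precision(ranking, facets_of, 10, 1.0))  # optimal rank 3 vs. system rank 4
```

In this toy ranking the non-relevant document at rank 2 delays full coverage by one position, which is what S-precision penalizes.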
![Page 12: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/12.jpg)
Evaluation – an example for Redundancy
![Page 13: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/13.jpg)
Faceted Topic Retrieval Models
4 kinds of models:
- MMR (Maximal Marginal Relevance)
- Probabilistic Interpretation of MMR
- Greedy Result Set Pruning
- A Probabilistic Set-Based Approach
![Page 14: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/14.jpg)
1. MMR
2. Probabilistic Interpretation of MMR
Let c1=0, c3=c4
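The MMR formula on this slide is an image and is not reproduced in the transcript. As a rough illustration only, the sketch below follows the standard greedy formulation of Maximal Marginal Relevance: at each step, pick the remaining document that maximizes λ times its query similarity minus (1 − λ) times its maximum similarity to the documents already selected. The function name, the `lam` parameter, and the toy similarity values are assumptions made for this example, not values from the talk.

```python
def mmr_rank(query_sim, doc_sim, lam=0.5, k=None):
    """Greedy MMR re-ranking.

    query_sim: dict mapping doc id -> similarity to the query
    doc_sim:   function (doc_a, doc_b) -> similarity between two documents
    lam:       trade-off between relevance (1.0) and novelty (0.0)
    """
    remaining = set(query_sim)
    selected = []
    k = len(remaining) if k is None else k
    while remaining and len(selected) < k:

        def score(d):
            # Penalize similarity to the closest already-selected document.
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim[d] - (1.0 - lam) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores: A and B are near-duplicates, C is less relevant but novel.
query_sim = {"A": 0.9, "B": 0.85, "C": 0.5}
pair_sim = {
    frozenset(("A", "B")): 0.95,
    frozenset(("A", "C")): 0.1,
    frozenset(("B", "C")): 0.1,
}


def doc_sim(a, b):
    return pair_sim[frozenset((a, b))]


print(mmr_rank(query_sim, doc_sim, lam=0.5))
```

With `lam=1.0` the procedure reduces to ranking by query similarity alone; lowering `lam` pushes near-duplicates such as B further down the list.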
![Page 15: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/15.jpg)
3. Greedy Result Set Pruning
- First, rank without considering novelty (in order of relevance)
- Second, step down the list of documents and prune documents with similarity greater than some threshold ϴ
- I.e., at rank i, remove any document Dj, j > i, with sim(Dj,Di) > ϴ
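The pruning rule above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the cosine similarity over term-count dictionaries, the toy documents, and the threshold value are all assumptions chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-count dictionaries."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_ranking(ranking, sim, theta):
    """Step down a relevance-ordered ranking; at each surviving rank i, remove
    every lower-ranked document D_j (j > i) with sim(D_j, D_i) > theta."""
    kept = list(ranking)
    i = 0
    while i < len(kept):
        anchor = kept[i]
        kept = kept[: i + 1] + [d for d in kept[i + 1:] if sim(d, anchor) <= theta]
        i += 1
    return kept


# Hypothetical documents: D2 duplicates D1, D3 covers a different facet.
vectors = {
    "D1": {"oil": 2, "ethanol": 1},
    "D2": {"oil": 2, "ethanol": 1},
    "D3": {"coal": 3},
}
ranking = ["D1", "D2", "D3"]
pruned = prune_ranking(ranking, lambda a, b: cosine(vectors[a], vectors[b]), theta=0.8)
print(pruned)
```

Because documents are only ever removed from below the current rank, the relative order of the surviving documents is unchanged.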
![Page 16: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/16.jpg)
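4. A Probabilistic Set-Based Approach
P(F ϵ D): the probability that D contains F
Assuming facets occur independently across documents, the probability that a facet Fj occurs in at least one document in a set D = {D1, ..., Dn} is
$$P(F_j \in D) = 1 - \prod_{i=1}^{n} \bigl(1 - P(F_j \in D_i)\bigr)$$
and the probability that all of the facets in a set F = {F1, ..., Fm} are captured by the documents D is
$$P(F \in D) = \prod_{j=1}^{m} \Bigl(1 - \prod_{i=1}^{n} \bigl(1 - P(F_j \in D_i)\bigr)\Bigr)$$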
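As a small numerical illustration of these two quantities, the sketch below assumes that facet occurrences are independent across documents, so that the probability of a facet appearing in at least one document is one minus the product of the per-document miss probabilities. The matrix layout `p[i][j]` (document i, facet j) and the example values are hypothetical.

```python
import math


def facet_coverage(p, j):
    """Probability that facet j occurs in at least one document,
    assuming independence: 1 - prod_i (1 - p[i][j])."""
    return 1.0 - math.prod(1.0 - row[j] for row in p)


def all_facets_coverage(p):
    """Probability that every facet is captured by the document set:
    the product of the per-facet coverage probabilities."""
    n_facets = len(p[0])
    return math.prod(facet_coverage(p, j) for j in range(n_facets))


# Hypothetical estimates for 2 documents (rows) and 2 facets (columns).
p = [
    [0.5, 0.0],
    [0.5, 1.0],
]
print(facet_coverage(p, 0))    # 1 - 0.5 * 0.5
print(all_facets_coverage(p))  # coverage of facet 0 times coverage of facet 1
```

Adding a document can never lower either value, which is why a size constraint on the set is needed for the selection problem to be meaningful.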
![Page 17: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/17.jpg)
4. A Probabilistic Set-Based Approach
- 4.1 Hypothesizing Facets
- 4.2 Estimating Document-Facet Probabilities
- 4.3 Maximizing Likelihood
![Page 18: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/18.jpg)
4.1 Hypothesizing Facets
Two unsupervised probabilistic methods:
- Relevance modeling
- Topic modeling with LDA
Instead of extracting facets directly from any particular word or phrase, we build a “facet model” P(w|F)
![Page 19: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/19.jpg)
4.1 Hypothesizing Facets
Since we do not know the facet terms or the set of documents relevant to the facet, we will estimate them from the retrieved documents.
Obtain m models from the top m retrieved documents by taking each document along with its k nearest neighbors as the basis for a facet model.
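The grouping step described above might look like the sketch below. It only builds the document groups (each top-m document plus its k nearest neighbors); estimating P(w|F) from each group is left out. The similarity function, the pairwise values, and the assumption that neighbors are drawn from the same retrieved list are illustrative choices, not details given on the slide.

```python
def facet_document_groups(ranking, sim, m, k):
    """For each of the top-m documents, return that document followed by its
    k most similar documents from the rest of the retrieved list."""
    groups = []
    for seed in ranking[:m]:
        others = [d for d in ranking if d != seed]
        neighbors = sorted(others, key=lambda d: sim(seed, d), reverse=True)[:k]
        groups.append([seed] + neighbors)
    return groups


# Hypothetical pairwise similarities between four retrieved documents.
pair_sim = {
    frozenset(("D1", "D2")): 0.9,
    frozenset(("D1", "D3")): 0.2,
    frozenset(("D1", "D4")): 0.1,
    frozenset(("D2", "D3")): 0.3,
    frozenset(("D2", "D4")): 0.4,
    frozenset(("D3", "D4")): 0.8,
}


def sim(a, b):
    return pair_sim[frozenset((a, b))]


ranking = ["D1", "D2", "D3", "D4"]
groups = facet_document_groups(ranking, sim, m=3, k=1)
print(groups)
```

Groups built this way can overlap heavily (here D1 and D2 seed almost the same group), so some hypothesized facets may be near-duplicates of each other.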
![Page 20: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/20.jpg)
Relevance modeling
Estimate m “facet models” P(w|Fj) from a set of retrieved documents using the so-called RM2 approach, where:
- DFj: the set of documents relevant to facet Fj
- fk: facet terms
![Page 21: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/21.jpg)
Topic modeling with LDA
The probabilities P(w|Fj) and P(Fj) can be found through expectation maximization.
![Page 22: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/22.jpg)
4.2 Estimating Document-Facet Probabilities
Both the facet relevance model and the LDA model produce generative probabilities P(Di|Fj).
P(Di|Fj): the probability that sampling terms from the facet model Fj will produce document Di
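One common way to compute such a generation probability is as a unigram likelihood: multiply P(w|Fj) over every term occurrence in the document, working in log space to avoid underflow. The sketch below takes that reading. The small `floor` value used for terms missing from the facet model is an assumption standing in for whatever smoothing is actually used, and the toy facet model is hypothetical.

```python
import math
from collections import Counter


def log_p_doc_given_facet(doc_tokens, facet_model, floor=1e-9):
    """log P(D_i | F_j) under a unigram facet model P(w | F_j).

    Terms absent from the model get the probability `floor`, an assumed
    stand-in for proper smoothing."""
    counts = Counter(doc_tokens)
    return sum(c * math.log(facet_model.get(w, floor)) for w, c in counts.items())


# Hypothetical facet model concentrated on two terms.
facet_model = {"ethanol": 0.5, "fuel": 0.5}
on_facet = log_p_doc_given_facet(["ethanol", "fuel"], facet_model)
off_facet = log_p_doc_given_facet(["coal", "fuel"], facet_model)
print(on_facet, off_facet)
```

Longer documents receive lower raw likelihoods under this formulation, so comparisons across documents of different lengths usually need some form of normalization.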
![Page 23: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/23.jpg)
4.3 Maximizing Likelihood
- Define the likelihood function L(y)
- Constraint: K, the hypothesized minimum number of documents required to cover the facets
- Maximizing L(y) is an NP-hard problem
- Approximate solution: for each facet Fj, take the document Di with the maximum probability for that facet
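The likelihood and constraint on this slide are images and are not reproduced in the transcript, so the sketch below rests on two stated assumptions: that the likelihood of a selected set is the product over facets of one minus the probability that no selected document contains the facet, and that the approximation simply walks over the facets, taking the highest-probability document for each, until K distinct documents are chosen. All names and values are hypothetical.

```python
import math


def set_likelihood(p, selected):
    """Assumed likelihood of a selected document set: for every facet, the
    probability that at least one selected document contains it."""
    n_facets = len(p[0])
    return math.prod(
        1.0 - math.prod(1.0 - p[i][j] for i in selected) for j in range(n_facets)
    )


def greedy_facet_cover(p, K):
    """For each facet in turn, take the document with the highest probability
    for that facet; stop once K distinct documents have been selected."""
    chosen = []
    for j in range(len(p[0])):
        best = max(range(len(p)), key=lambda i: p[i][j])
        if best not in chosen:
            chosen.append(best)
        if len(chosen) == K:
            break
    return chosen


# Hypothetical document-facet probabilities: 3 documents (rows), 3 facets (columns).
p = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.7],
    [0.1, 0.3, 0.6],
]
chosen = greedy_facet_cover(p, K=2)
print(chosen, set_likelihood(p, chosen))
```

The greedy pass runs in time linear in the number of document-facet pairs, in contrast to the exhaustive search over subsets that exact maximization would require.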
![Page 24: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/24.jpg)
Experiment - Data
- A query: a short list of keywords
- Retrieval with a query-likelihood language model (L.M.)
- Top 130 retrieved documents: D1, D2, ..., D130
![Page 25: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/25.jpg)
Experiment - Data
Top 130 retrieved documents (D1, D2, ..., D130), judged by 2 assessors.
For 60 queries:
- 44.7 relevant documents per query
- Each document contains 4.3 facets
- 39.2 unique facets on average (on average, about one unique facet per relevant document)
- Agreement: 72% of all relevant documents were judged relevant by both assessors
![Page 26: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/26.jpg)
Experiment - Data
[Figure: TDT5 sample topic definition, with panels labeled Query and Judgments]
![Page 27: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/27.jpg)
Experiment – Retrieval Engines
Using the Lemur toolkit:
- LM baseline: a query-likelihood language model
- RM baseline: pseudo-feedback with a relevance model
- MMR: query similarity scores from the LM baseline and cosine similarity for novelty
- AvgMix (Prob MMR): the probabilistic MMR model using query-likelihood scores from the LM baseline and the AvgMix novelty score
- Pruning: removes documents from the LM baseline ranking based on cosine similarity
- FM: the set-based facet model
![Page 28: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/28.jpg)
Experiment – Retrieval Engines
FM: the set-based facet model
- FM-RM: each of the top m documents and its K nearest neighbors becomes a “facet model” P(w|Fj); then compute the probability P(Di|Fj)
- FM-LDA: use LDA to discover subtopics zj and obtain P(zj|D); we extract 50 subtopics
![Page 29: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/29.jpg)
Experiments - Evaluation
- Use five-fold cross-validation to train and test systems
- The 48 queries in four folds are used to train model parameters
- The trained parameters are used to obtain ranked results on the remaining 12 queries
- At the minimum optimal rank S-rec, we report S-recall, redundancy, and MAP
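The split described above (60 queries, 48 for training and 12 for testing in each round) can be sketched as follows. How queries are assigned to folds is not stated on the slide; the round-robin assignment and the integer query ids used here are assumptions for illustration.

```python
def five_fold_splits(query_ids, n_folds=5):
    """Yield (train, test) query lists, holding out one fold at a time."""
    # Round-robin assignment of queries to folds (an assumed scheme).
    folds = [query_ids[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [q for f, fold in enumerate(folds) if f != held_out for q in fold]
        yield train, test


query_ids = list(range(1, 61))  # 60 hypothetical query ids
splits = list(five_fold_splits(query_ids))
for train, test in splits:
    print(len(train), len(test))
```

Each query appears in exactly one test fold, so every query contributes one set of held-out results to the reported averages.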
![Page 30: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/30.jpg)
Results
![Page 31: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/31.jpg)
Results
![Page 32: Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval](https://reader034.vdocument.in/reader034/viewer/2022051402/56815a91550346895dc8062b/html5/thumbnails/32.jpg)
Conclusion
We defined a type of novelty retrieval task called faceted topic retrieval: retrieve the facets of an information need in a small set of documents.
We presented two novel models: one that prunes a retrieval ranking and one that is a formally motivated probabilistic model.
Both models are competitive with MMR and outperform another probabilistic model.