probabilistic models of novel document rankings for faceted topic retrieval

Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval

Ben Carterette and Praveen Chandar

Dept. of Computer and Information Science, University of Delaware

Newark, DE (CIKM ’09)

Date: 2010/05/03

Speaker: Lin, Yi-Jhen

Advisor: Dr. Koh, Jia-Ling

Agenda

- Introduction: Motivation, Goal
- Faceted Topic Retrieval: Task, Evaluation
- Faceted Topic Retrieval Models: 4 kinds of models
- Experiment & Results
- Conclusion

Introduction - Motivation

Modeling documents as independently relevant does not necessarily provide the optimal user experience.

A traditional evaluation measure would reward System1, since it has higher recall.

Introduction - Motivation

Actually, we prefer System2, since it returns more information.

System2 is better!

Introduction

Novelty and diversity become the new criteria for relevance and for evaluation measures.

They can be achieved by retrieving documents that are relevant to the query but cover different facets of the topic.

We call this faceted topic retrieval.

Introduction - Goal

The faceted topic retrieval system must be able to find a small set of documents that covers all of the facets.

For example, 3 documents that cover 10 facets are preferable to 5 documents that cover the same 10 facets.

Faceted Topic Retrieval - Task

Define the task in terms of:

- Information need: a faceted topic retrieval information need is one that has a set of answers (facets) that are clearly delineated.
- How that need is best satisfied: each answer is fully contained within at least one document.

Faceted Topic Retrieval - Task

Example: an information need and its facets (a set of answers):

- invest in next generation technologies
- increase use of renewable energy sources
- invest in renewable energy sources
- double ethanol in gas supply
- shift to biodiesel
- shift to coal

Faceted Topic Retrieval

Input: a query (a short list of keywords).

Output: a ranked list of documents (D1, D2, ..., Dn) that contain as many unique facets as possible.

Faceted Topic Retrieval - Evaluation

- S-recall
- S-precision
- Redundancy
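The slides name these measures without defining them. As a minimal sketch, assuming the usual subtopic-recall definition (the fraction of a topic's facets covered by the top k documents) and assuming that redundancy counts facet occurrences beyond the first in the top k, the two could be computed as follows. The function names, the `doc_facets` mapping, and the redundancy definition are illustrative assumptions, not taken from the slides.

```python
def s_recall(ranking, doc_facets, total_facets, k):
    """Fraction of the topic's facets covered by the top-k documents."""
    covered = set()
    for doc in ranking[:k]:
        covered |= doc_facets.get(doc, set())
    return len(covered) / total_facets


def redundancy(ranking, doc_facets, k):
    """Facet occurrences in the top k beyond the first occurrence of each
    facet (an assumed definition, used here only for illustration)."""
    seen, repeats = set(), 0
    for doc in ranking[:k]:
        for facet in doc_facets.get(doc, set()):
            if facet in seen:
                repeats += 1
            else:
                seen.add(facet)
    return repeats


# Hypothetical judgments: document id -> set of facets it contains.
doc_facets = {"D1": {"a", "b"}, "D2": {"b", "c"}, "D3": set()}
ranking = ["D1", "D2", "D3"]

print(s_recall(ranking, doc_facets, total_facets=4, k=2))  # 0.75
print(redundancy(ranking, doc_facets, k=2))                # 1
```

S-precision is left out of the sketch because it compares a system's rank against the minimum rank at which an optimal system reaches the same S-recall, which requires solving a set-cover problem.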

Evaluation – an example for S-recall and S-precision

Total: 10 facets (assume that no facet appears in more than one document)

Evaluation – an example for Redundancy

Faceted Topic Retrieval Models

4 kinds of models:

- MMR (Maximal Marginal Relevance)
- Probabilistic Interpretation of MMR
- Greedy Result Set Pruning
- A Probabilistic Set-Based Approach

1. MMR
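A minimal sketch of MMR re-ranking, assuming the commonly used formulation in which each step picks the document maximizing λ·sim(D, Q) − (1 − λ)·max sim(D, Dj) over the already selected documents Dj. The function name `mmr_rerank`, the parameter `lam`, and the way similarities are passed in are illustrative assumptions.

```python
def mmr_rerank(query_sim, doc_sim, lam=0.5, k=None):
    """Greedy MMR re-ranking.

    query_sim: dict mapping document id -> similarity to the query.
    doc_sim:   function (doc_a, doc_b) -> similarity between two documents.
    lam:       trade-off between relevance (1.0) and novelty (0.0).
    """
    candidates = set(query_sim)
    selected = []
    k = len(candidates) if k is None else k
    while candidates and len(selected) < k:
        def score(d):
            # Penalize similarity to the closest already-selected document.
            max_sim = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * max_sim
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical scores: A and B are near-duplicates, C is different.
pair_sim = {frozenset(("A", "B")): 0.95}
sim = lambda a, b: pair_sim.get(frozenset((a, b)), 0.1)
print(mmr_rerank({"A": 0.9, "B": 0.85, "C": 0.5}, sim, lam=0.5))
# ['A', 'C', 'B']
```

With `lam=1.0` the ranking reduces to plain relevance order; lower values push near-duplicates such as B further down the list.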

2. Probabilistic Interpretation of MMR

Let c1 = 0, c3 = c4

3. Greedy Result Set Pruning

First, rank without considering novelty (in order of relevance).

Second, step down the list of documents and prune documents with similarity greater than some threshold ϴ.

I.e., at rank i, remove any document Dj, j > i, with sim(Dj, Di) > ϴ.
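The two steps above can be sketched as follows. This is a minimal illustration that assumes the ranking is already in relevance order and that a similarity function is supplied by the caller; the names `prune_ranking`, `sim`, and `theta` are hypothetical.

```python
def prune_ranking(ranking, sim, theta):
    """Step down a relevance-ordered ranking; at each rank i, drop every
    lower-ranked document Dj with sim(Dj, Di) > theta."""
    kept = list(ranking)
    i = 0
    while i < len(kept):
        current = kept[i]
        # Keep everything up to rank i, filter everything below it.
        kept = kept[: i + 1] + [d for d in kept[i + 1:] if sim(d, current) <= theta]
        i += 1
    return kept


# Hypothetical similarities: A~B and C~D are near-duplicates.
pair_sim = {frozenset(("A", "B")): 0.9, frozenset(("C", "D")): 0.8}
sim = lambda a, b: pair_sim.get(frozenset((a, b)), 0.1)
print(prune_ranking(["A", "B", "C", "D"], sim, theta=0.5))  # ['A', 'C']
```

Because documents pruned at an earlier rank are never visited, they cannot cause further removals themselves.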

4. A Probabilistic Set-Based Approach

P(F ϵ D): the probability that the document set D contains the facet set F.

Two quantities are defined:

- the probability that a facet Fj occurs in at least one document in a set D
- the probability that all of the facets in a set F are captured by the documents D
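A minimal sketch of both probabilities, under the assumption that facet occurrences are independent across documents and across facets. Under that assumption, P(Fj ϵ D) = 1 − ∏i (1 − P(Fj ϵ Di)) and P(F ϵ D) = ∏j P(Fj ϵ D). The independence assumption and the function names are stated here for illustration and are not recoverable from the slide text.

```python
import math


def facet_coverage_prob(p_facet_in_docs):
    """P(Fj in D): probability that the facet occurs in at least one
    document, given per-document probabilities P(Fj in Di).
    Assumes independence across documents."""
    return 1.0 - math.prod(1.0 - p for p in p_facet_in_docs)


def all_facets_prob(p_matrix):
    """P(F in D): probability that every facet is covered.
    p_matrix[j][i] holds P(Fj in Di). Assumes independence across facets."""
    return math.prod(facet_coverage_prob(row) for row in p_matrix)


# Two facets, two documents (hypothetical probabilities).
print(facet_coverage_prob([0.5, 0.5]))            # 0.75
print(all_facets_prob([[0.5, 0.5], [1.0, 0.0]]))  # 0.75
```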

4. A Probabilistic Set-Based Approach

- 4.1 Hypothesizing Facets
- 4.2 Estimating Document-Facet Probabilities
- 4.3 Maximizing Likelihood

4.1 Hypothesizing Facets

Two unsupervised probabilistic methods:

- Relevance modeling
- Topic modeling with LDA

Instead of extracting facets directly from any particular word or phrase, we build a “facet model” P(w|F).

4.1 Hypothesizing Facets

Since we do not know the facet terms or the set of documents relevant to each facet, we estimate them from the retrieved documents.

We obtain m models from the top m retrieved documents by taking each document, along with its k nearest neighbors, as the basis for a facet model.
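The grouping step can be sketched as follows. This is a minimal illustration that assumes documents are represented as term-count vectors, that nearness is measured by cosine similarity, and that neighbors are drawn from the whole retrieved set; none of these choices is specified in the slides, and the names `cosine` and `facet_groups` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def facet_groups(doc_terms, m, k):
    """For each of the top m documents (by retrieval rank), return the
    indices of that document plus its k nearest neighbors."""
    groups = []
    for i in range(min(m, len(doc_terms))):
        others = [j for j in range(len(doc_terms)) if j != i]
        others.sort(key=lambda j: cosine(doc_terms[i], doc_terms[j]), reverse=True)
        groups.append([i] + others[:k])
    return groups


# Hypothetical retrieved documents in rank order.
docs = [Counter("oil oil gas".split()), Counter("oil gas".split()),
        Counter("coal mine".split()), Counter("coal".split())]
print(facet_groups(docs, m=2, k=1))  # [[0, 1], [1, 0]]
```

Each group of document indices would then serve as the document set from which one facet model P(w|Fj) is estimated.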

Relevance modeling

Estimate m “facet models” P(w|Fj) from a set of retrieved documents using the so-called RM2 approach, where:

- DFj: the set of documents relevant to facet Fj
- fk: the facet terms

Topic modeling with LDA

The probabilities P(w|Fj) and P(Fj) can be found through expectation maximization.

4.2 Estimating Document-Facet Probabilities

Both the facet relevance model and the LDA model produce generative probabilities P(Di|Fj).

P(Di|Fj): the probability that sampling terms from the facet model Fj will produce document Di.
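A minimal sketch of this generative probability, assuming a unigram facet model so that log P(Di|Fj) is the sum over the document's terms of count(w, Di) · log P(w|Fj). The small `floor` probability for terms missing from the facet model is a smoothing assumption made only to keep the logarithm defined; the function name is hypothetical.

```python
import math
from collections import Counter


def log_doc_given_facet(doc_tokens, facet_model, floor=1e-9):
    """log P(D|F) under a unigram facet model P(w|F).

    facet_model: dict mapping term -> P(w|F).
    floor:       assumed probability for terms absent from the model.
    """
    counts = Counter(doc_tokens)
    return sum(c * math.log(facet_model.get(w, floor)) for w, c in counts.items())


# Hypothetical facet model over two terms.
facet = {"oil": 0.5, "gas": 0.5}
on_topic = log_doc_given_facet(["oil", "gas"], facet)    # log(0.25)
off_topic = log_doc_given_facet(["coal", "gas"], facet)
print(on_topic > off_topic)  # True
```

Working in log space avoids numerical underflow when documents contain many terms.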

4.3 Maximizing Likelihood

Define the likelihood function L(y).

Constraint: K, the hypothesized minimum number of documents required to cover the facets.

Maximizing L(y) is an NP-hard problem.

Approximate solution: for each facet Fj, take the document Di with the maximum probability for that facet.
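A minimal sketch of the approximate solution, assuming the quantity maximized for each facet is the document-facet probability P(Di|Fj) from section 4.2 and that a document chosen for several facets is kept only once. Both points, and the name `greedy_select`, are assumptions made for illustration.

```python
def greedy_select(p_doc_given_facet):
    """For each facet Fj, pick the document index with the highest score.

    p_doc_given_facet[j][i] holds the score of document Di for facet Fj.
    Returns the selected document indices in order of first selection.
    """
    selected = []
    for row in p_doc_given_facet:
        best = max(range(len(row)), key=row.__getitem__)
        if best not in selected:  # a document may win several facets
            selected.append(best)
    return selected


# Three facets, three documents (hypothetical scores).
scores = [
    [0.1, 0.7, 0.2],  # facet 0 -> document 1
    [0.6, 0.3, 0.1],  # facet 1 -> document 0
    [0.2, 0.5, 0.3],  # facet 2 -> document 1 again
]
print(greedy_select(scores))  # [1, 0]
```

The heuristic runs in time linear in the size of the score matrix, in contrast to exact maximization of L(y).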

Experiment - Data

A query: a short list of keywords

Retrieval: query-likelihood language model (L.M.)

Result: top 130 retrieved documents (D1, D2, ..., D130)

Experiment - Data

For 60 queries, the top 130 retrieved documents (D1, D2, ..., D130) were judged by 2 assessors:

- 44.7 relevant documents per query
- Each document contains 4.3 facets on average
- 39.2 unique facets on average (roughly one unique facet per relevant document)
- Agreement: 72% of all relevant documents were judged relevant by both assessors

Experiment - Data

[Figure: TDT5 sample topic definition, with the query and the judgments]

Experiment – Retrieval Engines

Using the Lemur toolkit:

- LM baseline: a query-likelihood language model
- RM baseline: pseudo-feedback with a relevance model
- MMR: query similarity scores from the LM baseline and cosine similarity for novelty
- AvgMix (Prob MMR): the probabilistic MMR model using query-likelihood scores from the LM baseline and the AvgMix novelty score
- Pruning: removing documents from the LM baseline ranking based on cosine similarity
- FM: the set-based facet model

Experiment – Retrieval Engines

FM: the set-based facet model

- FM-RM: each of the top m documents and its K nearest neighbors becomes a “facet model” P(w|Fj); then compute the probability P(Di|Fj)
- FM-LDA: use LDA to discover subtopics zj and obtain P(zj|D); we extract 50 subtopics

Experiments - Evaluation

Use five-fold cross-validation to train and test the systems:

- The 48 queries in four folds are used to train the model parameters.
- The trained parameters are used to obtain ranked results on the remaining 12 queries.

At the minimum optimal rank S-rec, we report S-recall, redundancy, and MAP.
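The 48/12 split can be sketched as follows. This is a minimal illustration that assumes queries are assigned to folds by simple interleaving; the slides do not say how the folds were formed, and the name `five_fold_splits` is hypothetical.

```python
def five_fold_splits(query_ids, n_folds=5):
    """Yield (train, test) query lists, holding out one fold at a time."""
    # Assumed fold assignment: every n_folds-th query goes to the same fold.
    folds = [query_ids[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [q for i, fold in enumerate(folds) if i != held_out for q in fold]
        yield train, test


queries = list(range(60))  # placeholder ids for the 60 queries
for train, test in five_fold_splits(queries):
    print(len(train), len(test))  # 48 12 on every iteration
```

Each query appears in exactly one test fold, so every query is ranked once with parameters trained on the other 48.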

Results

Conclusion

We defined a type of novelty retrieval task called faceted topic retrieval: retrieve the facets of an information need in a small set of documents.

We presented two novel models: one that prunes a retrieval ranking, and one that is a formally motivated probabilistic model.

Both models are competitive with MMR and outperform another probabilistic model.
