
Diversifying Search Results

WSDM 2009

Intelligent Database Systems Lab.

School of Computer Science & Engineering

Seoul National University

Center for E-Business Technology, Seoul National University, Seoul, Korea

Presented by Sung Eun Park, 1/25/2011

Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, Samuel Ieong (Microsoft Research)


Contents

Introduction

Intuition

Preliminaries

Model

Problem Formulation

Complexity

Greedy algorithm

Evaluation

Measure

Empirical analysis


Introduction

Ambiguity and diversification

For ambiguous queries, diversification may help users find at least one relevant document

Ex) The other day, we were trying to find the meaning of the word "왕건".

– In the context of "우와 저거 진짜 왕건이다" ("Wow, that one is a real whopper"), where "왕건" is slang for a big thing

– But the search results were all about Wang Geon (왕건), the king who founded Goryeo

[Slide images: King Wang Geon vs. "왕건" as a big thing]


Preliminaries

Model: a taxonomy C of categories (intents); for a query q, P(c | q) is the distribution of intents behind q, and V(d | q, c) is the probability that document d satisfies a user who issued q with intent c.


Problem Formulation

The probability that document d fails to satisfy a user who issues query q with intended category c is $1 - V(d \mid q, c)$.

Multiple intents: for a result set S, the probability that some document in S satisfies intent c is $1 - \prod_{d \in S} \big(1 - V(d \mid q, c)\big)$.

Objective (DIVERSIFY(k)): choose S with |S| = k maximizing
$$P(S \mid q) = \sum_{c} P(c \mid q) \Big(1 - \prod_{d \in S} \big(1 - V(d \mid q, c)\big)\Big)$$
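To make the objective concrete, here is a minimal Python sketch (the names and numbers are illustrative, not from the paper) that brute-forces the best two-document set under P(S | q):

```python
from itertools import combinations

def p_satisfied(S, P_cq, V):
    """P(S|q): probability that set S satisfies the average user of query q.

    P_cq: dict category -> P(c|q)
    V:    dict (doc, category) -> V(d|q,c); missing pairs count as 0
    """
    total = 0.0
    for c, p_c in P_cq.items():
        p_all_fail = 1.0
        for d in S:
            p_all_fail *= 1.0 - V.get((d, c), 0.0)
        total += p_c * (1.0 - p_all_fail)
    return total

# Toy example: two intents, three documents, pick k = 2 by brute force.
P_cq = {"king": 0.8, "big_thing": 0.2}
V = {("d1", "king"): 0.9, ("d2", "king"): 0.8, ("d3", "big_thing"): 0.7}
best = max(combinations(["d1", "d2", "d3"], 2),
           key=lambda S: p_satisfied(S, P_cq, V))
print(best, p_satisfied(best, P_cq, V))
```

With these toy numbers the best pair is (d1, d3) with P(S | q) = 0.86: mixing intents beats taking the two highest-quality "king" documents (0.784), which is exactly the effect diversification is after.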


Complexity

The DIVERSIFY(k) problem is NP-hard. However, the objective P(S | q) is submodular, so the greedy algorithm on the next slide comes with a (1 − 1/e) approximation guarantee.

A Greedy Algorithm

Let R(q) be the top k documents selected by some classical ranking algorithm for the target query. The algorithm reorders R(q) to maximize the objective P(S | q).

Input: k, q, C, D, P(c | q), V(d | q, c); Output: set of documents S

[Slide worked example: a table of V(d | q, c) values and the marginal utilities g(d | q, c) = U(c | q) · V(d | q, c), comparing the utility U(R | q) of the original ranking against U(B | q) for the reordered set as documents are greedily picked.]

• Produces an ordered set of results

• Results are not proportional to the intent distribution

• Results are not ordered by raw quality


Greedy Algorithm (IA-SELECT)

Input: k, q, C, D, P(c | q), V (d | q, c)

Output : set of documents S

When documents may belong to multiple categories, IA-SELECT is no longer guaranteed to be optimal (note the underlying problem is NP-hard).

S ← ∅
∀c ∈ C, U(c | q) ← P(c | q)
while |S| < k do
  for d ∈ D do
    g(d | q, c) ← Σ_c U(c | q) · V(d | q, c)
  end for
  d* ← argmax_d g(d | q, c)
  S ← S ∪ {d*}
  ∀c ∈ C, U(c | q) ← (1 − V(d* | q, c)) · U(c | q)
  D ← D \ {d*}
end while

Marginal Utility

U(c | q): probability that intent c of query q is still unsatisfied by the documents selected so far (initialized to P(c | q))
g(d | q, c): current marginal probability that d satisfies query q with intent c
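A minimal Python sketch of IA-SELECT as reconstructed above (the dictionary-based inputs are my assumption about how to represent P(c | q) and V(d | q, c)):

```python
def ia_select(k, R, categories, P_cq, V):
    """Greedy IA-SELECT: reorder candidates R to maximize P(S|q).

    R:          candidate documents (e.g. top k of a classical ranker)
    categories: the intents C of the query
    P_cq:       dict c -> P(c|q)
    V:          dict (d, c) -> V(d|q,c); missing pairs count as 0
    """
    S = []
    U = dict(P_cq)          # U(c|q), initialized to P(c|q)
    D = list(R)
    while len(S) < k and D:
        # marginal utility g(d|q) = sum_c U(c|q) * V(d|q,c)
        def g(d):
            return sum(U[c] * V.get((d, c), 0.0) for c in categories)
        d_star = max(D, key=g)
        S.append(d_star)
        # discount the intents that d* already satisfies
        for c in categories:
            U[c] *= 1.0 - V.get((d_star, c), 0.0)
        D.remove(d_star)
    return S
```

On the toy numbers from the Problem Formulation sketch, ia_select(2, ["d1", "d2", "d3"], ...) picks d1 first (g = 0.72), after which U(king | q) drops to 0.08, so d3 (g = 0.14) beats d2 (g = 0.064): the greedy order matches the brute-force optimum.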


Classical IR Measures (1)

Result Doc Set (graded relevance):
1. Doc 1, rel=3
2. Doc 2, rel=3
3. Doc 3, rel=2
4. Doc 4, rel=0
5. Doc 5, rel=1
6. Doc 6, rel=2
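Graded labels like these are the usual input to DCG/NDCG, which the evaluation later extends to NDCG-IA; a minimal sketch, assuming the common 2^rel − 1 gain and log2 discount:

```python
import math

def dcg(rels):
    """Discounted cumulative gain with 2^rel - 1 gain and log2 rank discount."""
    return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

print(ndcg([3, 3, 2, 0, 1, 2]))  # the slide's result doc set
```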


Classical IR Measures (2)

RR, MRR

Navigational search / question answering

– The user needs only one or a few high-ranked results

Reciprocal Rank

– How far is the answer document from rank 1? RR = 1 / (rank of the first relevant document)

Example) first relevant document at rank 2 → RR = ½ = 0.5

Mean Reciprocal Rank

– Mean of the RR over the query test set

Result Doc Set (P = relevant, N = non-relevant):
1. Doc N
2. Doc P
3. Doc N
4. Doc N
5. Doc N
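A minimal sketch of RR and MRR with binary labels, matching the P/N example above:

```python
def reciprocal_rank(results):
    """RR: 1/rank of the first relevant result, 0 if there is none."""
    for rank, relevant in enumerate(results, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(result_lists):
    """MRR: mean of RR over the result lists of a query test set."""
    return sum(reciprocal_rank(r) for r in result_lists) / len(result_lists)

print(reciprocal_rank([False, True, False, False, False]))  # 0.5, the slide's example
```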


Classical IR Measures (3)

MAP

Average Precision

– Sum of the precision values at the ranks of the relevant documents, divided by the total number of relevant documents

– Example) (1.00 + 1.00 + 0.75 + 0.67 + 0.38) / 6 = 0.633

Mean Average Precision

– Average of the average-precision values over a set of queries

– MAP = (AP1 + AP2 + ... + APn) / (# of queries)
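A minimal sketch of AP and MAP as defined above (binary relevance; num_relevant is the total number of relevant documents for the query, so unretrieved relevant documents still count in the denominator):

```python
def average_precision(results, num_relevant):
    """AP: sum of precision@rank at each relevant result,
    divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(results, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_relevant if num_relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean AP over (results, num_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)
```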


Evaluation Measure

Intent-aware versions of the classical measures: compute the measure per intent c and weight by the intent distribution, e.g. NDCG-IA(S, k) = Σ_c P(c | q) · NDCG(S, k | c); MAP-IA and MRR-IA are defined analogously.
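A minimal sketch of the intent-aware weighting; the measure argument could be the ndcg function from the earlier sketch, and the dictionary shapes are my assumption:

```python
def intent_aware(measure, P_cq, rels_by_intent):
    """Measure-IA: expectation of a per-intent measure under P(c|q).

    measure:        function(relevance labels for one intent) -> float
    P_cq:           dict c -> P(c|q)
    rels_by_intent: dict c -> relevance labels of the ranking under intent c
    """
    return sum(p * measure(rels_by_intent[c]) for c, p in P_cq.items())
```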


Empirical Evaluation

10,000 queries randomly sampled from logs

Queries classified according to ODP (level 2)

Keep only queries with at least two intents (~900)

Top 50 results from Live, Google, and Yahoo!

Documents are rated on a 5-point scale; >90% of docs have ratings

Docs without ratings are assigned a random grade according to the distribution of rated documents

[Slide diagram: a query classifier assigns intents (ODP categories) to each query; per-category document judgments come from a proprietary repository of human judgments.]


Results

[Slide charts: NDCG-IA; MAP-IA and MRR-IA.]


Evaluation using Mechanical Turk

Sample 200 queries from the dataset used in Experiment 1

[Slide figure: the query shown alongside its candidate categories (category1, category2, category3, ...).]

Workers pick the category they most closely associate with the given query.

Result Doc Set:
1. Doc 1, rel=?
2. Doc 2, rel=?
3. Doc 3, rel=?
4. Doc 4, rel=?
5. Doc 5, rel=?

Workers then judge the corresponding results with respect to the chosen category, using the same 4-point scale.


Evaluation using Mechanical Turk

[Slide charts: results of the Mechanical Turk evaluation.]

Conclusion

Studied how best to diversify results in the presence of ambiguous queries

Provided a greedy algorithm for the objective with good approximation guarantees

Q&A

Thank you
