SIGMOD 2006

Context-sensitive ranking

Rakesh Agrawal Microsoft Search Labs

Ralf Rantzau IBM Silicon Valley Lab

Evimaria Terzi University of Helsinki & Microsoft Search Labs

Work done largely while the authors were in IBM Almaden

The curse of abundance: Too much data and too many answers

• Query shopping.com for a digital camera:

• Query Froogle for a tennis racquet:

Ranking query results

• Algorithms for ranking web pages have been quite successful ([BP’98,Kleinberg98])

– Key idea: Exploit the graph of hyperlinks between web pages

• Can we take a similar approach to ranking database query results?

– Need for a graph structure that accurately describes the relationships between tuples in the database

– Past attempts: schema and key constraints or queries [BHP’04, BHNCS’02, GMT’04]

But are these graphs natural or do they reflect design optimization decisions?

Using preferences to induce a graph of tuples

     Genre (G)   Actor (A)   Title (T)              Language (L)
t1   Drama       Kidman      Birth                  English
t2   Drama       Cruz        Vanilla Sky            English
t3   Sci-Fi      Reeves      Matrix                 English
t4   Comedy      Cruz        Sin noticias de Dios   Spanish
t5   Comedy      Aniston     Rumor has it…          English

• Drama > Sci-Fi: t1>t3 and t2>t3
• Kidman > Reeves: t1>t3
• Matrix > Birth: t3>t1

[Figure: graph over the tuples t1, t2, t3 induced by these preferences]

[ Preferences are predicates of the form “X=x1 > X=x2” ]

Augment preferences with context

     Genre (G)   Actor (A)   Title (T)              Language (L)
t1   Drama       Kidman      Birth                  English
t2   Drama       Cruz        Vanilla Sky            English
t3   Sci-Fi      Reeves      Matrix                 English
t4   Comedy      Cruz        Sin noticias de Dios   Spanish
t5   Comedy      Aniston     Rumor has it…          English

• In general (*): English > Spanish | *
• But in the context of comedies: Spanish > English | Comedies

[ Contexts are predicates of the form “Y=a” ]

Preferences in the past

• Preferences expressed via a numeric score [AW’00, KI’04, KI’05]
– Nicole Kidman: 0.9
– Penelope Cruz: 0.4
– Dramas: 0.8
– Comedies: 0.3
• Pairwise preferences in ML literature [CSS’97]
• Preferences as partial orders [Kießling’02]
• Preferences as first-order formulas [Chomicki’03]

Contextual preferences

     Genre (G)   Actor (A)   Title (T)              Language (L)
t1   Drama       Kidman      Birth                  English
t2   Drama       Cruz        Vanilla Sky            English
t3   Sci-Fi      Reeves      Matrix                 English
t4   Comedy      Cruz        Sin noticias de Dios   Spanish
t5   Comedy      Aniston     Rumor has it…          English

• P1 = {G=Drama > G=Sci-Fi | L=English}: t1>t3|En and t2>t3|En
• P2 = {A=Kidman > A=Reeves | L=English}: t1>t3|En
• P3 = {T=Matrix > T=Birth | L=English}: t3>t1|En

[Figure: weighted preference graph over t1, t2, t3 with edge weights 2/3, 1/3, 1, 1/2, 1/2]
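The edge weights in the figure can be reproduced by letting the preferences vote on each pair of tuples. The following is a minimal sketch under that reading; the function name `preference_graph`, the tuple encoding of P1 to P3, and the even 1/2 : 1/2 split for pairs with no votes are assumptions made for illustration, not definitions taken from the talk.

```python
from fractions import Fraction
from itertools import combinations

# Movie tuples from the slide, keyed by tuple id.
MOVIES = {
    "t1": {"G": "Drama", "A": "Kidman", "T": "Birth", "L": "English"},
    "t2": {"G": "Drama", "A": "Cruz", "T": "Vanilla Sky", "L": "English"},
    "t3": {"G": "Sci-Fi", "A": "Reeves", "T": "Matrix", "L": "English"},
}

# (attribute, preferred value, less preferred value, context attribute, context value)
PREFERENCES = [
    ("G", "Drama", "Sci-Fi", "L", "English"),   # P1
    ("A", "Kidman", "Reeves", "L", "English"),  # P2
    ("T", "Matrix", "Birth", "L", "English"),   # P3
]


def preference_graph(tuples, preferences):
    """Return {(u, v): weight}, where weight is the share of votes for u > v."""
    ids = list(tuples)
    votes = {}
    for attr, better, worse, ctx_attr, ctx_val in preferences:
        for u in ids:
            for v in ids:
                tu, tv = tuples[u], tuples[v]
                in_context = tu[ctx_attr] == ctx_val and tv[ctx_attr] == ctx_val
                if in_context and tu[attr] == better and tv[attr] == worse:
                    votes[(u, v)] = votes.get((u, v), 0) + 1
    graph = {}
    for u, v in combinations(ids, 2):
        fwd, bwd = votes.get((u, v), 0), votes.get((v, u), 0)
        total = fwd + bwd
        # Assumption: a pair that received no votes is split evenly.
        graph[(u, v)] = Fraction(fwd, total) if total else Fraction(1, 2)
        graph[(v, u)] = Fraction(bwd, total) if total else Fraction(1, 2)
    return graph


graph = preference_graph(MOVIES, PREFERENCES)
print(graph[("t1", "t3")], graph[("t3", "t1")])  # 2/3 1/3
```

Two of the three preferences put t1 ahead of t3 and one puts t3 ahead of t1, which gives the 2/3 and 1/3 weights.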

Obtaining preferences

• Users provide preferences voluntarily – in the same way users rate products and services

• Preferences can be automatically collected via browser plug-ins or taskbars (with user permission)

• Preferences can be learned from past data

• Preferences can also be learned from the data (e.g., using association-rule mining)

Preferences are obtained from various sources and can contain cycles and contradictions, which are resolved democratically
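As an illustration of learning preferences from the data, the sketch below applies the confidence rule stated on the IMDB slide of this talk (prefer a over b in context X when conf(X → a) exceeds conf(X → b)) to the small movie table. The helper names `confidence` and `mined_preference`, and the choice to return no preference on a tie, are assumptions for this example.

```python
from fractions import Fraction

# Genre and language columns of the example movie table.
ROWS = [
    {"G": "Drama", "L": "English"},    # t1
    {"G": "Drama", "L": "English"},    # t2
    {"G": "Sci-Fi", "L": "English"},   # t3
    {"G": "Comedy", "L": "Spanish"},   # t4
    {"G": "Comedy", "L": "English"},   # t5
]


def confidence(rows, context, attr, value):
    """conf(context -> attr=value): share of context rows that have attr=value."""
    matching = [r for r in rows if all(r[k] == v for k, v in context.items())]
    if not matching:
        return Fraction(0)
    hits = sum(1 for r in matching if r[attr] == value)
    return Fraction(hits, len(matching))


def mined_preference(rows, context, attr, a, b):
    """Return (preferred, other) for the given context, or None on a tie."""
    conf_a = confidence(rows, context, attr, a)
    conf_b = confidence(rows, context, attr, b)
    if conf_a > conf_b:
        return (a, b)
    if conf_b > conf_a:
        return (b, a)
    return None  # assumption: equal confidences yield no preference


english = {"L": "English"}
print(mined_preference(ROWS, english, "G", "Drama", "Sci-Fi"))  # ('Drama', 'Sci-Fi')
```

Among the four English-language movies, two are dramas and one is science fiction, so the mined preference is G=Drama > G=Sci-Fi | L=English.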

Overview

Question: How to incorporate users' preferences when ranking query results?

Approach:
• Accumulate contextual preferences of the form i1>i2|X
• Order the answer tuples such that the preferences are maximally respected, giving higher weight to those preferences whose contexts more closely match the query

Issues

• How to define similarity between a query and a context?
– See paper for the distance function.
• Can we create orders in an offline step and use their information at query time?
• Should we save all orders?
• How to combine the saved orders while answering queries?

Problem decomposition

[Problem 1]: For every context X build an order τX (Ordering)

[Problem 2]: Given a set of orders Tm = {τ1,…, τm} find ℓ representative orders Tℓ (ClusterOrders)
• Assign each of the input orders to one of the representatives (the closest)
• Associate with each representative σ a set of contexts Yσ

[Problem 3]: Provide top-k results for the query Q (Querying)
– respecting the representative orders and
– weighting respect according to the similarity between query and contexts

Problem 1: The Ordering problem

For a given context X and a set of preferences PX over the tuples D={t1,…,tn}, find an ordering τ of D such that

$$\tau = \arg\max_{\tau'} \mathrm{Agree}(\tau', P_X)$$

[Figure: weighted preference graph over t1, t2, t3 with edge weights 1/2, 1/2, 2/3, 1/3, 1, shown next to the ordering t2, t1, t3]

Agree = 1 + 1/2 + 2/3 = 13/6
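The Agree value on this slide can be checked by summing, over every pair of tuples, the weight of the edge that points from the earlier tuple to the later one. The sketch below assumes the edge assignment t1→t3 = 2/3, t3→t1 = 1/3, t2→t3 = 1 and t1↔t2 = 1/2 each, which is inferred from the example preferences; `agree` and `best_order` are hypothetical helper names, and the exhaustive search is only feasible for a handful of tuples.

```python
from fractions import Fraction
from itertools import permutations

# Assumed edge weights: W[(u, v)] is the support for "u before v".
W = {
    ("t1", "t3"): Fraction(2, 3), ("t3", "t1"): Fraction(1, 3),
    ("t2", "t3"): Fraction(1),
    ("t1", "t2"): Fraction(1, 2), ("t2", "t1"): Fraction(1, 2),
}


def agree(order, weights):
    """Total weight of the preferences that the ordering respects."""
    total = Fraction(0)
    for i, u in enumerate(order):
        for v in order[i + 1:]:
            total += weights.get((u, v), 0)
    return total


def best_order(items, weights):
    """Exhaustive search over all permutations (tiny inputs only)."""
    return max(permutations(items), key=lambda p: agree(p, weights))


print(agree(("t2", "t1", "t3"), W))  # 13/6
```

Under these weights the orderings t2, t1, t3 and t1, t2, t3 both reach 13/6, so the optimum need not be unique.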

Problem 2: The ClusterOrders problem

Given m orders Tm={τ1,…,τm}, each corresponding to a single context Xi, find ℓ representative orders Tℓ such that Cost(Tℓ) is minimized, where

$$\mathrm{Cost}(T_\ell) = \sum_{\tau \in T_m} d(\tau, T_\ell)$$

and

$$d(\tau, T_\ell) = \min_{\sigma \in T_\ell} d(\tau, \sigma)$$

We use the standard Spearman footrule and Kendall tau distances for comparing orderings
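To make the cost function concrete, here is a small sketch of the two distances and of the clustering cost. It uses the textbook definitions (footrule: sum of position displacements; Kendall tau: number of pairs ordered differently), and the five sample orders are invented for illustration rather than taken from the talk.

```python
from itertools import combinations


def footrule(a, b):
    """Spearman footrule: total displacement of items between two orders."""
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(abs(i - pos_b[x]) for i, x in enumerate(a))


def kendall_tau(a, b):
    """Kendall tau: number of item pairs that the two orders rank differently."""
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])


def cost(orders, representatives, dist=kendall_tau):
    """Sum over input orders of the distance to the closest representative."""
    return sum(min(dist(t, r) for r in representatives) for t in orders)


# Hypothetical input orders over the items a..f.
orders = ["abcdef", "bacdef", "abdcef", "fedcba", "fedcab"]
print(cost(orders, ["abcdef", "fedcba"]))  # 3
```

With these made-up orders the per-order Kendall tau distances are 0, 1, 1, 0, 1.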

The ClusterOrders problem: Example

[Figure: input orders over the items a, b, c, d, e, f clustered around two representative orders τ1 and τ2; distances of the input orders to their closest representative: 0, 1, 1, 0, 1]

Cost(τ1) = 2

Cost(τ2) = 1

Cost(τ1, τ2) = 2+1=3

Problem 3: The Querying problem

Provide top-k results for query Q respecting the representative orders and weighting respect using the corresponding set of contexts

Constructing orders from preferences [Problem 1]

• Problem is NP-hard; need for heuristics
• PickPerm algorithm: pick a random permutation, reverse it, and pick the better of the two

[Figure: weighted preference graph over t1, t2, t3 with edge weights 1/2, 1/2, 2/3, 1/3, 1. Random permutation t2, t3, t1: A = 11/6. Reversed permutation t1, t3, t2: A = 7/6. Output: t2, t3, t1.]

[ Inspired by the 2-approximation algorithm for finding the maximum acyclic subgraph of a given graph ]
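A minimal sketch of PickPerm as described above. The edge weights are the assumed assignment for the three-tuple example (t1→t3 = 2/3, t3→t1 = 1/3, t2→t3 = 1, t1↔t2 = 1/2 each), and `pick_perm` and its `rng` parameter are illustrative names.

```python
import random
from fractions import Fraction

# Assumed edge weights: W[(u, v)] is the support for "u before v".
W = {
    ("t1", "t3"): Fraction(2, 3), ("t3", "t1"): Fraction(1, 3),
    ("t2", "t3"): Fraction(1),
    ("t1", "t2"): Fraction(1, 2), ("t2", "t1"): Fraction(1, 2),
}


def agree(order, weights):
    """Total weight of the preferences that the ordering respects."""
    return sum(
        (weights.get((u, v), 0) for i, u in enumerate(order) for v in order[i + 1:]),
        Fraction(0),
    )


def pick_perm(items, weights, rng=random):
    """Shuffle the items, then keep the shuffle or its reverse, whichever agrees more."""
    perm = list(items)
    rng.shuffle(perm)
    reverse = perm[::-1]
    return max(perm, reverse, key=lambda p: agree(p, weights))


order = pick_perm(["t1", "t2", "t3"], W, random.Random(0))
print(order, agree(order, W))
```

Every edge is respected by exactly one of the two candidates, so their Agree values add up to the total edge weight and the better one collects at least half of it. That is the reason for the factor 2 in the approximation guarantee.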

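Greedy algorithm [CSS’97]

• At the i-th iteration pick the i-th element of the output permutation
• At each iteration pick the tuple t with the highest s_val(t) = OutDegree(t) - InDegree(t) in the remaining preference graph

[Figure: Greedy on the weighted preference graph over t1, t2, t3 (edge weights 1/2, 1/2, 2/3, 1/3, 1). First iteration: s_val(t1) = 1/3, s_val(t2) = 1, s_val(t3) = -4/3, so t2 is picked. Second iteration, on the remaining graph with edge weights 2/3 and 1/3: s_val(t1) = 1/3, s_val(t3) = -1/3, so t1 is picked. Output order: t2, t1, t3.]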
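The greedy rule can be sketched in a few lines. The weights below are the assumed edge assignment for the running example (t1→t3 = 2/3, t3→t1 = 1/3, t2→t3 = 1, t1↔t2 = 1/2 each); `greedy_order` and `s_val` as Python functions are illustrative, with weighted degrees standing in for OutDegree and InDegree.

```python
from fractions import Fraction

# Assumed edge weights: W[(u, v)] is the support for "u before v".
W = {
    ("t1", "t3"): Fraction(2, 3), ("t3", "t1"): Fraction(1, 3),
    ("t2", "t3"): Fraction(1),
    ("t1", "t2"): Fraction(1, 2), ("t2", "t1"): Fraction(1, 2),
}


def s_val(t, remaining, weights):
    """Weighted out-degree minus weighted in-degree of t among the remaining tuples."""
    out_deg = sum(weights.get((t, u), 0) for u in remaining if u != t)
    in_deg = sum(weights.get((u, t), 0) for u in remaining if u != t)
    return out_deg - in_deg


def greedy_order(items, weights):
    """Repeatedly emit the remaining tuple with the highest s_val."""
    remaining = list(items)
    order = []
    while remaining:
        best = max(remaining, key=lambda t: s_val(t, remaining, weights))
        order.append(best)
        remaining.remove(best)  # its edges no longer count in later rounds
    return order


print(greedy_order(["t1", "t2", "t3"], W))  # ['t2', 't1', 't3']
```

The first round computes s_val values of 1/3, 1 and -4/3 for t1, t2 and t3, so t2 goes first; after removing t2 the values are 1/3 and -1/3, so t1 comes next.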

MC-algorithm

• Reverse the directions of the edges on the preference graph

• Run a random walk (with random restarts) on the reversed graph

• Rank according to the stationary distribution
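One possible reading of the three steps above, written as a PageRank-style power iteration. The slide does not fix the details, so the restart probability of 0.15, the uniform restart distribution, transition probabilities proportional to edge weight, and the name `mc_order` are all assumptions; the edge weights are the assumed assignment for the three-tuple example.

```python
from fractions import Fraction

# Assumed edge weights: W[(u, v)] is the support for "u before v".
W = {
    ("t1", "t3"): Fraction(2, 3), ("t3", "t1"): Fraction(1, 3),
    ("t2", "t3"): Fraction(1),
    ("t1", "t2"): Fraction(1, 2), ("t2", "t1"): Fraction(1, 2),
}


def mc_order(items, weights, restart=0.15, iterations=200):
    """Rank items by the stationary distribution of a walk on the reversed graph."""
    n = len(items)
    # Reverse every edge: the walk moves from a tuple toward tuples preferred to it.
    out = {u: {} for u in items}
    for (u, v), w in weights.items():
        if w > 0:
            out[v][u] = float(w)
    pi = {u: 1.0 / n for u in items}
    for _ in range(iterations):
        nxt = {u: restart / n for u in items}  # uniform random restart
        for u in items:
            total = sum(out[u].values())
            if total == 0:
                # Tuple with no outgoing edge: spread its mass uniformly.
                for v in items:
                    nxt[v] += (1 - restart) * pi[u] / n
            else:
                for v, w in out[u].items():
                    nxt[v] += (1 - restart) * pi[u] * w / total
        pi = nxt
    ranking = sorted(items, key=lambda u: pi[u], reverse=True)
    return ranking, pi


ranking, pi = mc_order(["t1", "t2", "t3"], W)
print(ranking)
```

With these settings the walk ranks t3 last, because most of the reversed edges lead away from it.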

Performance

• Data generation
– Fix an order on the tuples
– Generate preferences that respect this order
– pc: the probability that a preference is generated between a pair of tuples
• Observations
– For small pc values more orders are compatible, all algorithms are good
– For large pc values MC and Greedy find the optimal order

Reducing the number of orders [Problem 2]

• Finding ℓ representative orders is NP-hard
• Finding ℓ orders from the input ones (good approximation, but still hard)
• Need for heuristics
• Greedy algorithm
– Always pick the order (from the input) that introduces the minimum cost
• Furthest algorithm
– Start by picking a random order τ and add it to the output set of orders Tℓ
– For ℓ-1 iterations pick the order that is furthest away from the orders already in Tℓ
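The Furthest heuristic can be sketched as a furthest-first traversal. In this sketch, "furthest away from the orders already in Tℓ" is interpreted as the largest distance to the closest chosen order, Kendall tau is used as the distance, and a fixed `start` index replaces the random first pick so the run is reproducible; all three choices, and the sample orders, are assumptions.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Number of item pairs that the two orders rank differently."""
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])


def furthest(orders, ell, start=0, dist=kendall_tau):
    """Pick ell representatives: a seed order, then repeatedly the furthest one."""
    chosen = [orders[start]]
    while len(chosen) < min(ell, len(orders)):
        # Distance of an order to the chosen set = distance to its closest member.
        candidate = max(orders, key=lambda t: min(dist(t, c) for c in chosen))
        chosen.append(candidate)
    return chosen


# Hypothetical input orders over the items a..f.
orders = ["abcdef", "bacdef", "abdcef", "fedcba", "fedcab"]
print(furthest(orders, 2))  # ['abcdef', 'fedcba']
```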

Refine the representative orders

• Given the set of representative orders Tℓ, assign each input order τ ∈ Tm to its closest representative in Tℓ (partition Tm into ℓ partitions)*
– Discrete refinement: For each partition pick the best representative of the partition
– Continuous refinement ([DKNS’01]): For each partition find the best representative of the partition

*Notice the resemblance between this problem and the Catalog Segmentation problem [KPR’04]
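The discrete refinement step might look as follows. This sketch assumes that the "best representative" of a partition is the member order with the smallest total distance to the other members (a medoid) and that Kendall tau is the distance; the function names and sample orders are hypothetical.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Number of item pairs that the two orders rank differently."""
    pos_b = {x: i for i, x in enumerate(b)}
    return sum(1 for x, y in combinations(a, 2) if pos_b[x] > pos_b[y])


def assign(orders, reps, dist=kendall_tau):
    """Partition the input orders by their closest representative."""
    parts = {r: [] for r in reps}
    for t in orders:
        closest = min(reps, key=lambda r: dist(t, r))
        parts[closest].append(t)
    return parts


def discrete_refinement(orders, reps, dist=kendall_tau):
    """Replace each representative by the medoid of its partition."""
    refined = []
    for members in assign(orders, reps, dist).values():
        if members:
            medoid = min(members, key=lambda c: sum(dist(t, c) for t in members))
            refined.append(medoid)
    return refined


def cost(orders, reps, dist=kendall_tau):
    """Sum over input orders of the distance to the closest representative."""
    return sum(min(dist(t, r) for r in reps) for t in orders)


# Hypothetical input orders and a deliberately poor initial choice.
orders = ["abcdef", "bacdef", "abdcef", "fedcba", "fedcab"]
initial = ["bacdef", "fedcab"]
refined = discrete_refinement(orders, initial)
print(cost(orders, initial), cost(orders, refined))  # 4 3
```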

Performance

• Data generation
– Fix ℓ underlying orders T
– Generate other orders from T by picking an order in T and adding noise (swaps)
– Compute the cost of the solution with respect to the ground truth
• Observations
– Without refinements: Greedy performs steadily better than Furthest
– With refinements: Both algorithms are equally good
– The groupings are equivalent

Problem 3: The Querying problem

• Use a variation of the TA algorithms [FLN’02, FKS’03]
– Assume k = 2 and a query Q such that: sim(Q,Y1) = 0.5, sim(Q,Y2) = 0.3, sim(Q,Y3) = 0.1

Y1,T1 (0.5)    Y2,T2 (0.3)    Y3,T3 (0.1)
t1  5          t2  5          t4  5
t2  4          t3  4          t3  4
t3  3          t1  3          t1  3
t4  2          t4  2          t5  2
t5  1          t5  1          t2  1

1. At each sequential access
a. Set the threshold TH to be the aggregate of the scores seen in this access

TH = 0.5*5 + 0.3*5 + 0.1*5 = 4.5

b. Do random accesses and compute the score of the objects seen

Scores computed so far: t1 = 3.7, t2 = 3.6, t4 = 2.1

c. Maintain a list of the top-k objects seen so far

Top-2 so far: t1 = 3.7, t2 = 3.6

d. When the scores of the top-k are greater than or equal to the threshold, stop

TH = 0.5*4 + 0.3*4 + 0.1*4 = 3.6

Top-2: t1 = 3.7, t2 = 3.6
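The walkthrough can be replayed with a short sketch of a threshold-style top-k loop. It uses the three sorted lists and the similarities 0.5, 0.3, 0.1 from the example, and aggregates with a similarity-weighted sum; the function name `threshold_topk` is hypothetical, and exact fractions are used so that the comparison against the threshold is not affected by floating-point rounding.

```python
from fractions import Fraction

# One sorted list per representative order: (tuple id, score), best first.
LISTS = [
    [("t1", 5), ("t2", 4), ("t3", 3), ("t4", 2), ("t5", 1)],  # Y1, T1
    [("t2", 5), ("t3", 4), ("t1", 3), ("t4", 2), ("t5", 1)],  # Y2, T2
    [("t4", 5), ("t3", 4), ("t1", 3), ("t5", 2), ("t2", 1)],  # Y3, T3
]
SIMS = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 10)]


def threshold_topk(lists, sims, k):
    """Return (top-k list, number of sequential access rounds used)."""
    lookup = [dict(lst) for lst in lists]  # random access: id -> score per list
    scores = {}
    top, depth = [], 0
    for depth, row in enumerate(zip(*lists), start=1):
        # a. Threshold: aggregate of the scores seen in this sequential access.
        threshold = sum(w * s for w, (_, s) in zip(sims, row))
        # b. Random accesses: full score of every newly seen object.
        for obj, _ in row:
            if obj not in scores:
                scores[obj] = sum(w * table[obj] for w, table in zip(sims, lookup))
        # c. Keep the k best objects seen so far.
        top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        # d. Stop once all of them reach the threshold.
        if len(top) == k and all(score >= threshold for _, score in top):
            break
    return top, depth


top, depth = threshold_topk(LISTS, SIMS, k=2)
print([(name, float(score)) for name, score in top], depth)
```

The loop stops after the second round, when the threshold has dropped to 3.6 and both t1 (3.7) and t2 (3.6) reach it.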

Accuracy of top-k results

• IMDB dataset
– Automatically generate preferences via association-rule mining: ‘A1=a’ > ‘A1=b’ | X if conf(X→a) > conf(X→b)
– Solk: top-k results obtained after clustering
– Gk: top-k results without clustering

$$\mathrm{Accuracy}(Sol_k, G_k, k) = \frac{|Sol_k \cap G_k|}{|Sol_k \cup G_k|}$$
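The set operators of the accuracy formula did not survive in this transcript. The sketch below assumes the measure is the overlap of the two top-k sets divided by their union (a Jaccard ratio); treat both that reading and the function name `accuracy` as assumptions.

```python
def accuracy(sol_k, g_k):
    """Assumed measure: |Sol_k ∩ G_k| / |Sol_k ∪ G_k| for two top-k result lists."""
    sol, g = set(sol_k), set(g_k)
    union = sol | g
    if not union:
        return 1.0  # two empty result lists agree trivially
    return len(sol & g) / len(union)


# Hypothetical top-2 lists with and without clustering.
print(accuracy(["t1", "t2"], ["t1", "t2"]))  # 1.0
print(accuracy(["t1", "t2"], ["t1", "t3"]))
```

Under this reading the value is 1 when clustering leaves the top-k set unchanged and 0 when the two sets are disjoint.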

Recap

• Notion of contextual preferences

• Use of contextual preferences to order database results

• Use of association rules to obtain contextual preferences

• Experimental validation of the effectiveness of the proposed techniques using both synthetic and real data

Conclusions and future work

• The framework of contextual preferences is both intuitive and practical

• The framework is easily extended to accommodate top-k lists and bucket orders

• Scalability of the algorithms needs further investigation

Questions?