rank aggregation

Rank Aggregation

Rank Aggregation: Settings• Multiple items– Web-pages, cars, apartments,….

• Multiple scores for each item– By different reviewers, users, according to different

features…

• Some aggregation function on the scores– Sum, Average, Max…

• Goal: compute the top-k items

Rank Aggregation ExampleModel PriceRank

Honda 9

Volvo 3

Subaru 9

Model ComfortRank

Honda 7

Volvo 10

Subaru 5

Model BeautyRank

Honda 3

Volvo 8

Subaru 4

Model TotalRank(min)

Honda 3

Volvo 3

Subaru 4

Model TotalRank(avg)

Honda 6.333

Volvo 7

Subaru 6

Naïve Algorithm

• Compute the aggregated rank for all items

• Find the best one, then the second best one… the k best one

• Good for small-scale problems• Still not feasible for web scales…

Can we do any better?

• An assumption to help us: each individual list comes sorted– Reasonable for search engines, user rankings…

• Another assumption: monotonicity of the aggregation function

• Now can we do any better?

Fagin's algorithm (FA)

• Do sorted access on all lists in parallel• For every item do random access to the other

lists to fetch all of its values• Stop when at least k items were seen (in the

sorted access) in all lists • Sort the list• Why is this enough?

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

Average

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

Average

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

C 4

Average

Example (top-3)

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

C 4

Average

How do we know not to look further?

Complexity

• Probabilistic analysis on the order of items can be used to show better bounds (with good probability)

• Can we do even better?

Cost model

• This is a very simple settings so we can define a finer cost model than worst case complexity

• In a web context it is important to do so – Since the scale is huge

• We associate some cost Cs with every sorted access , and some cost Cr with every random access

• Denote the cost for algorithm A on input instance I by cost(A,I)

Instance-optimality

• An algorithm A is instance-optimal if for every input instance I, cost(A,I) = O(cost(A',I)) for every algorithm A'

• A very strong notion

• But we can realize it here!

Threshold Algorithm (TA)

• Idea: sometimes we can stop before seeing k objects in every list

• Use a threshold on how good can a score of an unseen object be.

• Based on aggregating the minimal score seen so far in all lists

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

Average

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

Average

T=9.5

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

Average

T=9.5

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

C 4

Average

T=7

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

C 4

Average

T=4

One step less!

Instance-optimality• Theorem: If the aggregation function is strictly monotone and every

two items in a list have distinct grades, then TA is instance-optimal– Intuition: If an algorithm stops on input I before reaching the

threshold, then we can design an input I' on which it is wrong, by changing values it did not see

– TA sees at most K items more than any algorithm on any input

• Strict monotonicity is needed to avoid "lucky guesses" in breaking ties– Thm. In general no instance-optimal algorithm exists

• Theorem: TA is instance-optimal against all algorithms that do not "guess"– i.e. do not do random access to an item they did not see in

sorted access

Restricted Sorted Access

• Some rankings are not available as sorted– E.g. distances from a map site

• Then we can revise TA to do sorted access only on the list where it is possible

• And still instance-optimal! (Against algorithms that work under the same

restrictions, of course)

No Random Access

• Maintain bottom and upper bounds for every item (worst and best grades)

• Best is the aggregation of what we have seen and the worst we have seen in every list, Worst is the aggregation with what we have seen and zeros

• Keep in the list those with top-K "worst" grades– Break ties by "best" grades

• Halt if we have k items in the list, and the best grade for every item out of the list is less than the k'th in the list

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 4.5<S<9

Average

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort Average

Item Score

A 4.5<S<9

B 5<S<10

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3


Item Score

A 4.5<S<9

B 9.5

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3


Item Score

A 4.5<S<9

B 9.5

C 2.5<S<5

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

C 4

Average

Item Score

A 4.5<S<9

B 9.5

C 4

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

C 4

Average

Item Score

A 6.5

B 9.5

C 4

Example

Item Score

A 9

B 9

C 3

D 1

Item Score

B 10

C 5

A 4

D 3

Beauty Comfort

Item Score

A 6.5

B 9.5

C 4

Average

Item Score

A 6.5

B 9.5

C 4

Score(D)<3

rank aggregation

Documents