
Topic 3: Top-K and Skyline Algorithms


What is top-k processing?

• Find k items that best answer a user’s query
  – As a set, as a sorted list, or as a sorted list with scores
  – Usually from among N items, where N >> k
• Application domains
  – Search over structured datasets with user-defined preferences
    • Find large apartments in a good school district in Brooklyn
    • Find cheap hotels that are near a beach
  – Web search & other document retrieval / ranking tasks
    • Find documents about “Massachusetts election health care”
  – Search over multimedia repositories (with probability scores)
    • Find images that show a palm tree next to a house
  – Many others…


SQL Example

SELECT h.id, s.name, h.price, s.tuition
FROM Houses h, Schools s
WHERE h.location = s.location
ORDER BY h.price, s.tuition;

SELECT h.id, s.name
FROM Houses h, Schools s
WHERE h.location = s.location
ORDER BY h.price + 10 * s.tuition as score
STOP AFTER 10

SELECT h.id, s.name
FROM Houses h, Schools s
WHERE distance(h.location, s.location) as dist
ORDER BY h.price + 10 * s.tuition + 2 * dist as score
STOP AFTER 10


Top-K vs. SQL querying

• Compared to standard SQL querying
  – Relevance to a degree, not Boolean
  – Return only the best items, not all items
  – Quality of an item is expressed by a score

SELECT h.id, s.name
FROM Houses h, Schools s
WHERE distance(h.location, s.location) as dist
ORDER BY h.price + 10 * s.tuition + 2 * dist as score
STOP AFTER 10


Why do we care about top-k processing?

• Many practical applications
• Representative of many data management problems
  – Solid application scenarios, and new ones emerging every day
    • E.g., top-k over Web-accessible resources
    • E.g., top-k for social search
  – Variety of algorithmic approaches
  – Explores the trade-off between run-time performance and space overhead
  – Costs and balances the use of available operators


Ranking functions

• Ranking (scoring) functions are used to compute the score of an item.
• Item r(x1, …, xm), where xi are the ranking attributes, e.g., square footage of a house, number of times “Massachusetts” occurs in the text, etc.
• score(r) = g(f1(x1), …, fm(xm))
  – where fi are monotone functions, e.g. f(x) = 2 * x
  – g is a monotone aggregation function, e.g. sum, average, max, min
  – e.g. score(r) = 2 * sq. ft. + 3 * quality of school district

Definition: A function f is monotone if f(x) < f(y) whenever x < y.
An aggregation function g is monotone if g(r) < g(r’) whenever r.xi < r’.xi, for all i.
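As a minimal sketch of how such a ranking function can be composed, the Python below builds score(r) = g(f1(x1), …, fm(xm)) from per-attribute functions and an aggregate. The helper name `make_score` and the two sample houses are assumptions made for illustration; only the formula 2 * sq. ft. + 3 * quality comes from the slide.

```python
from typing import Callable, Sequence


def make_score(fs: Sequence[Callable[[float], float]],
               g: Callable[[Sequence[float]], float]) -> Callable[[Sequence[float]], float]:
    """Build score(r) = g(f1(x1), ..., fm(xm)) from per-attribute functions fs and aggregate g."""
    def score(r: Sequence[float]) -> float:
        return g([f(x) for f, x in zip(fs, r)])
    return score


# score(r) = 2 * sq. ft. + 3 * quality of school district (the slide's example)
house_score = make_score([lambda sqft: 2 * sqft, lambda quality: 3 * quality], sum)

small = (1000, 8)   # hypothetical house: 1000 sq. ft., school quality 8
large = (1200, 9)   # hypothetical house that is larger in every ranking attribute

print(house_score(small))  # 2024
print(house_score(large))  # 2427
```

Because every fi and the aggregate are monotone, the item that is larger in every attribute also gets the larger score, which is the property the top-k algorithms in this topic rely on.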


Quantifying top-K algorithm performance

• Execution time
  – Sequential access (SA)
    • accessing items in order, e.g., by reading from a cursor
    • similar concept to a sequential disk read, where seek time is amortized over multiple accesses
  – Random access (RA)
    • accessing items out of order, e.g., a primary key lookup
    • similar to a random disk read
    • typically more expensive than an SA (even orders of magnitude), sometimes impossible
  – Why not use wall clock time?
• Buffer size
  – How much state do we have to keep during computation?
  – Is the size bounded by some constant (e.g. k), or is it linear in the size of the dataset (N)? (recall that k << N)


Outline
• Intro
• Fundamentals of top-k processing
  – Fagin algorithm (FA)
  – Threshold algorithm (TA)
  – No random access algorithm (NRA)
• Extensions and alternatives
  – top-k with expensive predicates (MPro)
  – TAAT vs. DAAT vs. …
• Skylines
  – Dominance
  – Fundamental algorithms
  – Extensions (demo)


Naïve Computation of the top-k Answers

• Example 1 (on the board & next slide): R (id, annual income, net worth), score(r) = r.income + r.net worth
• Algorithm
  – Compute the score of each item
  – Sort items in decreasing order of score
  – Return k items with the highest score
• Properties of naïve solution
  – Advantage: simple
  – Disadvantage: unacceptable run-time performance when N is high
• Idea: throw space at the problem
  – pre-compute inverted lists for components of the score
  – aggregate partial scores at run-time


Example 1

id     income (K$)   net worth (K$)   score = income + net worth
r1     150           350              500
r2     150           425              575
r3     125           450              575
r4     100           450              550
r5     100           200              300
r6     80            100              180
r7     75            500              575
r8     75            50               125
r9     50            300              350
r10    50            50               100
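The naïve computation (score every item, sort, keep k) can be sketched on the Example 1 relation as follows. The function name `naive_top_k` and the tuple layout are illustrative assumptions; the data is the table above.

```python
# Example 1 relation: (id, income in K$, net worth in K$)
R = [
    ("r1", 150, 350), ("r2", 150, 425), ("r3", 125, 450), ("r4", 100, 450),
    ("r5", 100, 200), ("r6", 80, 100), ("r7", 75, 500), ("r8", 75, 50),
    ("r9", 50, 300), ("r10", 50, 50),
]


def naive_top_k(relation, k):
    """Score all N items, sort in decreasing order of score, return the best k."""
    scored = [(rid, income + net_worth) for rid, income, net_worth in relation]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # O(N log N) over the whole relation
    return scored[:k]


top3 = naive_top_k(R, 3)
print(top3)  # r2, r3 and r7, each with score 575
```

Every item is scored and sorted even though only k = 3 are returned, which is the cost the inverted-list algorithms try to avoid.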


The Basic Indexing Structure: Inverted List

L1                     L2
id     income          id     net worth
r1     150             r7     500
r2     150             r3     450
r3     125             r4     450
r4     100             r2     425
r5     100             r1     350
r6     80              r9     300
r7     75              r5     200
r8     75              r6     100
r9     50              r8     50
r10    50              r10    50
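A small sketch of how the two inverted lists could be precomputed from the Example 1 relation: one list per score component, each sorted in decreasing order of that component. The function name `build_inverted_lists` is an assumption for illustration; a real system would typically keep these lists as indexes on disk.

```python
# Example 1 relation: (id, income in K$, net worth in K$)
R = [
    ("r1", 150, 350), ("r2", 150, 425), ("r3", 125, 450), ("r4", 100, 450),
    ("r5", 100, 200), ("r6", 80, 100), ("r7", 75, 500), ("r8", 75, 50),
    ("r9", 50, 300), ("r10", 50, 50),
]


def build_inverted_lists(relation):
    """Return one (id, value) list per score component, sorted by value descending."""
    l1 = sorted(((rid, income) for rid, income, _ in relation),
                key=lambda entry: entry[1], reverse=True)
    l2 = sorted(((rid, net_worth) for rid, _, net_worth in relation),
                key=lambda entry: entry[1], reverse=True)
    return l1, l2


L1, L2 = build_inverted_lists(R)
print([rid for rid, _ in L1])
print([rid for rid, _ in L2])
```

Python's sort is stable, so items with equal values keep their order from R, which reproduces the order of L1 and L2 shown on the slide.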


Fagin Algorithm (FA)

• Algorithm
  – Access all lists sequentially (SA), in parallel
  – STOP once k items have been seen sequentially in all lists
  – Compute scores of incomplete items by performing a random access (RA)
  – Sort on score, return the best k items
• Work out Example 1 on the board for k = 3
• Performance
  – 8 SA + 2 RA
  – 5 objects in buffer
• Is this algorithm correct?
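The FA steps can be traced with the sketch below, which runs on the two inverted lists of Example 1 with SUM as the aggregate. It is an illustration under assumptions, not a reference implementation: the function name, the per-list dictionaries standing in for random access, and the choice to test the stop condition after each full round of sequential accesses are all assumed here.

```python
L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]


def fagin_algorithm(lists, k):
    """Return (top-k, #SA, #RA, #buffered items); assumes equal-length lists and SUM."""
    m = len(lists)
    index = [dict(lst) for lst in lists]   # stands in for random access by id
    seen = {}                              # id -> {list number: value seen so far}
    sa = 0
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):    # one sequential access per list
            item, value = lst[depth]
            sa += 1
            seen.setdefault(item, {})[i] = value
        # STOP once k items have been seen sequentially in all lists
        if sum(1 for parts in seen.values() if len(parts) == m) >= k:
            break
    ra = 0
    scores = {}
    for item, parts in seen.items():       # complete the incomplete items with RA
        for i in range(m):
            if i not in parts:
                parts[i] = index[i][item]
                ra += 1
        scores[item] = sum(parts.values())
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return top, sa, ra, len(seen)


top, sa, ra, buffered = fagin_algorithm([L1, L2], 3)
print(top, sa, ra, buffered)  # top-3 is r7, r2, r3 (575 each); 8 SA, 2 RA, 5 buffered
```

On Example 1 the trace stops at depth 4, after r3, r4 and r2 have appeared in both lists; r1 and r7 are the two items completed by random access.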


Threshold Algorithm (TA)

• Algorithm
  – Access all lists sequentially (SA), in parallel
  – After each cursor move
    • Compute the score of the item r under the cursor with random accesses (RA)
    • Record r in the buffer if
      – (i) buffer size < k, or
      – (ii) r’s score > kth score; in this case remove the kth item from the buffer
    • Update the threshold τ = g(current list scores), i.e., the aggregate of the values currently under the cursors
    • STOP when kth score ≥ τ
  – Return the k items currently in the buffer
• Work out Example 1 on the board for k = 3
• Performance
  – 5 SA + 4 RA; in general #RA ≤ #SA * (m-1), where m is the number of lists (an item that is already in the buffer needs no new RA)
  – 3 objects in buffer
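The following sketch traces TA on Example 1 with SUM as the aggregate. Several details are assumptions made for illustration: the stop test is "kth score ≥ threshold", checked after every single cursor move; an item already in the buffer is not probed again; and a list that has not been read yet contributes infinity to the threshold.

```python
L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]


def threshold_algorithm(lists, k):
    """Return (top-k, #SA, #RA); assumes equal-length lists and SUM as the aggregate."""
    m = len(lists)
    index = [dict(lst) for lst in lists]   # stands in for random access by id
    last = [float("inf")] * m              # value under each cursor (inf until first read)
    buffer = {}                            # id -> complete score, never more than k entries
    sa = ra = 0

    def ranked():
        return sorted(buffer.items(), key=lambda kv: kv[1], reverse=True)

    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            item, value = lst[depth]       # sequential access
            sa += 1
            last[i] = value
            if item not in buffer:
                score = value
                for j in range(m):         # random accesses to the other lists
                    if j != i:
                        score += index[j][item]
                        ra += 1
                if len(buffer) < k:
                    buffer[item] = score
                else:
                    kth_item = min(buffer, key=buffer.get)
                    if score > buffer[kth_item]:
                        del buffer[kth_item]
                        buffer[item] = score
            threshold = sum(last)          # best possible score of any unseen item
            if len(buffer) == k and min(buffer.values()) >= threshold:
                return ranked(), sa, ra
    return ranked(), sa, ra


top, sa, ra = threshold_algorithm([L1, L2], 3)
print(top, sa, ra)  # r7, r2, r3 (575 each) after 5 SA and 4 RA
```

The fifth sequential access reads r3 from L1, which lowers the threshold to 125 + 450 = 575; the kth buffered score is also 575, so the run stops there.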


Comparison between FA and TA

• Example 2 on the board (k=3)
• Theorem: # SA in TA ≤ # SA in FA
• Theorem: TA requires only bounded buffers; FA’s buffers are not bounded

Instance optimality of TA
• Theorem: TA is instance-optimal over the class of algorithms A that:
  1. correctly find the top-k answers and
  2. do not make any random guesses
• What about over all algorithms that correctly identify the top-k answers?
  – Example 4.4 (PODS 2001) on board


What if we couldn’t do random accesses?

• Sometimes it suffices to output the top-k as a set
• Sometimes we can get away with outputting top-k in sorted order, but with no scores
• Example 8.3 from [Fagin et al. JCSS 2003]


No Random Access Algorithm (NRA)

• Algorithm
  – Access all lists sequentially (SA), in parallel
  – After each cursor move compute
    • Worst-case score W(r), best-case score B(r) for each seen r
    • Sort all seen items on W(r), breaking ties by B(r)
    • τ = g(current list scores) (this is the best-case score of any unseen object)
    • STOP when W(r) of the kth object ≥ τ and no seen item outside the top k has a best-case score B above it
  – If random accesses are possible, compute complete scores of the top-k items
  – Return the top-k items
• Example 1 on the board for k = 3
• Performance
  – 13 SA + 0 RA
  – Optimal performance if no RAs are allowed
  – In reality, computation may be slow: re-sorting potentially large buffers at each step
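A sketch of NRA on Example 1, again with SUM as the aggregate. The assumptions are stated in the comments: values are non-negative, so an unseen component contributes 0 to the worst case W(r); the best case B(r) fills unseen components with the value currently under that list's cursor; and the run stops when the kth W is at least the threshold and at least the B of every seen item outside the top k.

```python
L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]


def nra(lists, k):
    """Return (top-k with worst-case scores, #SA); assumes equal-length lists, values >= 0."""
    m = len(lists)
    seen = {}                          # id -> {list number: value seen under SA}
    last = [float("inf")] * m          # value under each cursor (inf until first read)
    sa = 0

    def worst(parts):                  # W(r): unseen components count as 0
        return sum(parts.values())

    def best(parts):                   # B(r): unseen components take the cursor value
        return sum(parts.get(i, last[i]) for i in range(m))

    def ranked():                      # sort on W, breaking ties by B
        return sorted(seen, key=lambda r: (worst(seen[r]), best(seen[r])), reverse=True)

    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            item, value = lst[depth]   # sequential access only, no random access
            sa += 1
            last[i] = value
            seen.setdefault(item, {})[i] = value
            order = ranked()
            if len(order) < k:
                continue
            top, rest = order[:k], order[k:]
            kth = worst(seen[top[-1]])
            threshold = sum(last)      # best-case score of any unseen item
            if kth >= threshold and all(best(seen[r]) <= kth for r in rest):
                return [(r, worst(seen[r])) for r in top], sa
    return [(r, worst(seen[r])) for r in ranked()[:k]], sa


top, sa = nra([L1, L2], 3)
print(top, sa)  # r2, r3, r7 in some order, after 13 SA
```

In this trace r7 keeps the run going: until its income is read from L1 on the 13th access, its best case stays above the kth worst-case score. The re-sort after every access is the slowdown the slide warns about.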


Top-k Selection: Rank Aggregation from Multiple Lists

• Both sorted and random access: FA, TA, Quick-combine, Multi-Step
• Sorted access only: NRA, Stream-combine
• Random access only: MPro, Upper, Pick

Slide courtesy of Ihab F. Ilyas and Walid G. Aref


Minimal Probing (MPro)

• Recall: TA sequentially accesses one list, probes all other lists for each encountered object.
• What if random accesses (probes) were very expensive?
  – User-defined functions
  – Calls to external services, e.g., web-based
• Idea: execute only the necessary probes [Chang & Hwang, SIGMOD 2002]
• Assumptions
  – One component of the score accessed sequentially (SA), other components are probed (RA)
  – Probing schedule is given, global (same for all items) or per-item


The MPro Example

SELECT id
FROM House
WHERE new(age) s,              -- sequential access
      cheap(price, size) pC,   -- random access, a.k.a. “probe”
      large(size) pL           -- random access
ORDER BY MIN(s, pC, pL)
STOP AFTER 2

id    s: new(age)   pC: cheap(price, size)   pL: large(size)   score = MIN(s, pC, pL)
a     0.90          0.85                     0.75              0.75
b     0.80          0.78                     0.90              0.78
c     0.70          0.75                     0.20              0.20
d     0.60          0.90                     0.90              0.60
e     0.50          0.70                     0.80              0.50


The MPro Algorithm

• For an item r:
  – W(r): score seen so far
  – B(r): best-case score, unseen components have max score (1.0)
• What is a necessary probe?
  – The probe pr(r) is necessary if there do not exist k items v1, …, vk s.t. B(r) < B(vi) for each vi
• The MPro Algorithm
  – Access items sequentially
  – Maintain a best-case queue (a.k.a. the ceiling queue)
    • STOP when k items with complete scores are at the head of the queue
    • Probe the first incomplete item, re-sort the queue
• Example 5 on the board (Figure 4 from [Chang & Hwang, SIGMOD 2002])
  – MIN is the aggregation, not SUM!
  – find the top-2 items (K=2)
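The queue-driven loop can be sketched on the House example (items a to e, MIN aggregation, k = 2). This is an illustrative sketch under assumptions: the ceiling queue is seeded with every item's sequential score s up front, the probing schedule is the same for all items (pC, then pL), and the function name `mpro` and its data layout are made up for the example.

```python
import heapq

# Sequentially accessed predicate s = new(age), already in decreasing order
S = [("a", 0.90), ("b", 0.80), ("c", 0.70), ("d", 0.60), ("e", 0.50)]
# Expensive probed predicates, in schedule order: pC = cheap(price, size), pL = large(size)
P_CHEAP = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.90, "e": 0.70}
P_LARGE = {"a": 0.75, "b": 0.90, "c": 0.20, "d": 0.90, "e": 0.80}


def mpro(sequential, probes, k):
    """Return (top-k with final scores, number of probes) for MIN aggregation."""
    # Heap entries are (-ceiling, id, index of the next probe to run for this item)
    queue = [(-s, item, 0) for item, s in sequential]
    heapq.heapify(queue)
    result = []
    probe_count = 0
    while queue and len(result) < k:
        neg_ceiling, item, nxt = heapq.heappop(queue)
        if nxt == len(probes):
            # A complete item at the head of the queue: no other item can beat it
            result.append((item, -neg_ceiling))
            continue
        # The head is incomplete, so its next probe is necessary
        value = probes[nxt][item]
        probe_count += 1
        ceiling = min(-neg_ceiling, value)
        heapq.heappush(queue, (-ceiling, item, nxt + 1))
    return result, probe_count


top2, probe_count = mpro(S, [P_CHEAP, P_LARGE], 2)
print(top2, probe_count)  # [('b', 0.78), ('a', 0.75)] 4
```

Only a and b are ever probed (4 probes in total), whereas evaluating every predicate for every item would cost 10 probes; c, d and e never reach the head of the queue.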

The big picture: top-k is term-at-a-time

[Query evaluation: strategies and optimizations. Turtle & Flood, 1995.]
• Term-at-a-time (TAAT)
  – Partial scores for documents kept in accumulators
  – May (it had better!) terminate before all documents are considered, but each document may be considered more than once
  – Typically outperforms DAAT when contributions of terms to the score are independent and the dataset is reasonably small
• Document-at-a-time (DAAT)
  – Complete scores for documents computed (i.e., typically all documents are considered)
  – Smaller memory footprint than in TAAT
  – Easier to execute in parallel than TAAT
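To make the difference in traversal order concrete, here is a small sketch of both strategies over a toy inverted index. The postings, document ids and weights are invented for illustration, scores are plain sums of term weights, and early termination is left out, so both functions visit every posting.

```python
import heapq
from collections import defaultdict
from itertools import groupby

# Hypothetical postings: term -> list of (doc id, term weight), sorted by doc id
POSTINGS = {
    "election": [(1, 2), (3, 1)],
    "health": [(1, 1), (2, 4)],
    "care": [(2, 1), (3, 3)],
}


def taat(postings, query):
    """Term-at-a-time: finish one postings list before starting the next."""
    accumulators = defaultdict(int)        # partial scores for all documents seen so far
    for term in query:
        for doc, weight in postings.get(term, []):
            accumulators[doc] += weight
    return dict(accumulators)


def daat(postings, query):
    """Document-at-a-time: merge the lists by doc id and finish one document at a time."""
    merged = heapq.merge(*(postings.get(term, []) for term in query))
    scores = {}
    for doc, group in groupby(merged, key=lambda posting: posting[0]):
        scores[doc] = sum(weight for _, weight in group)   # complete score for this doc
    return scores


query = ["election", "health", "care"]
print(taat(POSTINGS, query))
print(daat(POSTINGS, query))
```

TAAT needs an accumulator for every document touched by any term, while DAAT only holds the postings of the current document, which is the memory trade-off listed above.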

“Score-at-a-time” revisited

• Lots of trade-offs determine whether TAAT or DAAT is better
  – E.g., optimized DAAT costly because of re-sorting (for long queries)
  – E.g., TAAT costly because of memory footprint
  – E.g., DAAT much better for context-sensitive queries
  – Particular algorithms, datasets, scoring functions, architectures bring their own trade-offs
• What about score-at-a-time (Anh and Moffat, SIGIR 2006)?
  – Similar to TAAT, but the order in which terms are considered is determined by the impact of a given term in a given document
  – A custom (efficient) indexing method for a custom (effective) scoring function



Ranking functions and dominance

• Monotone aggregation ranking functions express user preferences
  – r (price, distToBeach); score(r) = w1 * price + w2 * distToBeach
  – score(r1) < score(r2) whenever r1.price < r2.price AND r1.distToBeach < r2.distToBeach
  – this holds for any positive weights w1 and w2! (What if one of the weights was negative? What if both were negative?)
  – In this example, we are interested in minimizing the score
• Mapping a multi-dimensional (e.g., 2D) space into R
  – In top-k, we look for k tuples that have the best score, which is weight-specific, but …
    • users may differ on the weights in their preferences
    • users may want to understand the structure of the 2D space
  – Skylines to the rescue!
    • make the structure of the space explicit
    • are independent of the weights


Skylines represent dominance

• Tuple r1 is better than r2 according to its score if
  – score(r1) < score(r2), i.e.,
  – r1.price < r2.price AND r1.distToBeach < r2.distToBeach
• This is equivalent to a notion of dominance (Pareto-optimality)
  – Definition: r1 dominates r2 iff r1 is as good as or better than r2 in all dimensions, and strictly better in at least one dimension.
  – Some tuples may be incomparable
    • r1: $100, 1.0 mi
    • r2: $ 90, 1.2 mi
    • r3: $110, 1.1 mi
    • r4: $120, 1.1 mi
  – Observe: dominance is transitive
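The dominance definition translates almost directly into code. In the sketch below, tuples are (price, distance to beach) pairs and smaller is better in both dimensions, as in the hotel example; the function name `dominates` is an assumption used for illustration.

```python
def dominates(a, b):
    """True iff a is as good as or better than b everywhere and strictly better somewhere.

    Smaller values are better in every dimension (price, distance to beach).
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


r1 = (100, 1.0)
r2 = (90, 1.2)
r3 = (110, 1.1)
r4 = (120, 1.1)

print(dominates(r1, r3))                     # True: cheaper and closer
print(dominates(r1, r2), dominates(r2, r1))  # False False: r1 and r2 are incomparable
print(dominates(r3, r4), dominates(r1, r4))  # True True: r1 > r3 > r4, so r1 > r4 (transitivity)
```

With these four tuples, r3 and r4 are dominated by r1, while r1 and r2 dominate neither each other nor are dominated by anything else.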

Skyline examples


Skyline properties

• For any monotone ranking function F
  – If point r is best according to F, then r is on the skyline
    • A top-1 hotel will always be on the skyline, irrespective of the weights!
  – If point r is on the skyline, then there exists an F for which r is best
    • Every hotel on the skyline is someone’s favorite
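The first property can be checked empirically on a handful of weight choices. The sketch below is only a spot check, not a proof: the four hotels reuse the (price, distance) tuples from the dominance slide, and the weight pairs are arbitrary positive values chosen for illustration.

```python
hotels = {"r1": (100, 1.0), "r2": (90, 1.2), "r3": (110, 1.1), "r4": (120, 1.1)}


def dominates(a, b):
    """Smaller is better: a is no worse than b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


# Skyline: hotels that no other hotel dominates
skyline = {name for name, point in hotels.items()
           if not any(dominates(other, point) for other in hotels.values())}

# Top-1 hotel under several positive weightings of score = w1 * price + w2 * distToBeach
weight_choices = [(1, 1), (1, 100), (5, 1), (1, 1000)]   # arbitrary positive weights
winners = set()
for w1, w2 in weight_choices:
    best = min(hotels, key=lambda name: w1 * hotels[name][0] + w2 * hotels[name][1])
    winners.add(best)

print(sorted(skyline))  # ['r1', 'r2']
print(sorted(winners))  # every winner is also on the skyline
```

Price-heavy weightings pick r2 and distance-heavy weightings pick r1; no positive weighting can make the dominated hotels r3 or r4 win.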


Computing the Skyline: nested-loops

• We will focus on 2D skylines; these methods generalize to multiple dimensions (some non-trivially)

SELECT * FROM Hotels h
WHERE h.city = 'Nassau'
  AND NOT EXISTS (
    SELECT * FROM Hotels h1
    WHERE h1.city = 'Nassau'
      AND h1.distToBeach <= h.distToBeach
      AND h1.price <= h.price
      AND (h1.distToBeach < h.distToBeach OR h1.price < h.price))

• How is this evaluated?
• Complexity?
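One plausible way to evaluate the NOT EXISTS query is a literal nested loop, sketched here in Python: each hotel is compared against every other hotel, which gives O(n^2) dominance tests. The hotel tuples (price, distToBeach) are hypothetical and the city filter is assumed to have been applied already.

```python
def dominates(a, b):
    """Mirrors the subquery: a <= b in both attributes and a < b in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline_nested_loops(points):
    """Keep every point for which NOT EXISTS a dominating point; O(n^2) comparisons."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Hypothetical Nassau hotels as (price, distToBeach)
hotels = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)]
print(skyline_nested_loops(hotels))  # [(100, 1.0), (90, 1.2)]
```

A database engine may evaluate the correlated subquery differently, but without an index tailored to dominance the amount of work is comparable.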


Computing the Skyline: block-nested-loops

• Maintain a window of incomparable tuples in memory
• Iterate until all tuples are either on the skyline or discarded
• For each tuple r
  – If r is dominated by any tuple in the window, discard r
  – If r dominates a tuple r’ in the window, discard r’, insert r
  – If r is incomparable with all tuples in the window, insert r
  – If the window is full, write r to a temporary file on disk
• At the end of the iteration, output all tuples in the window to a temporary file, with their sequence numbers
• r is on the skyline if it was compared to all other tuples; we can tell this by looking at the sequence number (see ICDE 2001 for details)

Complexity: O(n) best case (when the skyline is small), O(n^2) worst case
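The window-and-overflow mechanics can be sketched in memory as follows. This is a simplified illustration, not the algorithm exactly as published in the ICDE 2001 paper: a Python list stands in for the temporary file, a counter provides the sequence numbers, and a window tuple is confirmed at the end of a pass only if it entered the window before the first tuple was spilled in that pass (so it has been compared with everything that remains). The function name, `window_size` parameter and sample points are assumptions.

```python
def dominates(a, b):
    """Smaller is better: a is no worse than b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def bnl_skyline(points, window_size):
    """Block-nested-loops style skyline with a bounded window (simplified sketch)."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    skyline = []
    window = []                # (sequence number, tuple) pairs, at most window_size
    pending = list(points)     # input of the current pass
    clock = 0                  # sequence number of the tuple being processed
    while pending:
        overflow = []          # stands in for the temporary file
        first_spill = None     # sequence number of the first spilled tuple in this pass
        for r in pending:
            clock += 1
            if any(dominates(w, r) for _, w in window):
                continue                                   # r is dominated: discard it
            window = [(t, w) for t, w in window if not dominates(r, w)]
            if len(window) < window_size:
                window.append((clock, r))
            else:
                if first_spill is None:
                    first_spill = clock
                overflow.append(r)                         # window full: spill r
        # Tuples that entered the window before the first spill met every survivor
        confirmed = [(t, w) for t, w in window
                     if first_spill is None or t < first_spill]
        skyline.extend(w for _, w in confirmed)
        window = [(t, w) for t, w in window
                  if first_spill is not None and t >= first_spill]
        pending = overflow
    return skyline


pts = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1), (80, 2.0), (95, 1.1), (100, 0.9)]
print(sorted(bnl_skyline(pts, window_size=2)))
```

With a window large enough to hold the whole skyline a single pass suffices; a small window forces extra passes over the overflow, which is where the quadratic worst case comes from.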


Computing the Skyline: divide-and-conquer

• Based on multidimensional divide-and-conquer [Bentley’80]
• In the block-nested-loops algorithm, tuples were considered multiple times
  – This is because no order was assumed, and a sequence number was assigned to keep order
  – Idea: sort tuples on one attribute, then compute dominance
• The algorithm
  – Compute the median along dimension d1, and partition the tuples around it
  – Recursively compute skylines S1 and S2 for the partitions
  – Merge S1 and S2: eliminate all r in S2 that are dominated by some r’ in S1
• Complexity: O(n log n) for d=2, best and worst case
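A compact sketch of the divide-and-conquer scheme for 2D points, where smaller is better in both dimensions. Two simplifications are assumed: points are sorted lexicographically once so that splitting at the middle index plays the role of the median along d1 (and ties on d1 cannot put a dominating point in the right half), and the merge step is a plain pairwise filter rather than the optimized merge needed for the O(n log n) bound.

```python
def dominates(a, b):
    """Smaller is better: a is no worse than b everywhere and strictly better somewhere."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def skyline_divide_and_conquer(points):
    """Skyline via median split on d1, recursion, and a merge that prunes S2 against S1."""
    pts = sorted(points)               # order by d1, ties broken by d2

    def solve(lo, hi):
        if hi - lo <= 1:
            return pts[lo:hi]
        mid = (lo + hi) // 2           # median position along d1
        s1 = solve(lo, mid)            # skyline of the half with smaller d1
        s2 = solve(mid, hi)            # skyline of the half with larger d1
        # Merge: nothing in s2 can dominate a point of s1, so only s2 is pruned
        return s1 + [r for r in s2 if not any(dominates(p, r) for p in s1)]

    return solve(0, len(pts))


pts = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1), (80, 2.0), (95, 1.1), (100, 0.9)]
print(skyline_divide_and_conquer(pts))
# [(80, 2.0), (90, 1.2), (95, 1.1), (100, 0.9)]
```

In 2D the pairwise filter can be replaced by a comparison against the smallest d2 value in S1, which is what makes a linear-time merge possible.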

What if one of the coordinates is unknown?

• Skylines for scientific literature search [ICDE 2010]
  – PubMed: 19 million scientific articles
  – Score components: relevance (higher is better) + publication date (more recent is better)
  – Relevance is expensive to compute: a combination of non-independent components (so, document-at-a-time)
  – An upper bound on relevance is much cheaper to compute
• Idea: modify divide-and-conquer to work with score upper-bounds
• An add-on: compute the skyline and k contours
• Demo


Skyline Computation

[Figure sequence: relevance (vertical axis) plotted against publication date (horizontal axis: Mar 2010, Feb 2010, Jan 2010, Dec 2009). Labels in the figures: batch boundary; UB sort order; 3 skyline contours; dominated, with score; dominated, no score.]


References

1. Optimal aggregation algorithms for middleware. Ronald Fagin, Amnon Lotem and Moni Naor. J. Comput. Syst. Sci. 66(4): 614-656 (2003).

2. Minimal probing: supporting expensive predicates for top-K queries. Kevin Chen-Chuan Chang and Seung-won Hwang. SIGMOD 2002.

3. The Skyline operator. Stephan Börzsönyi, Donald Kossmann, Konrad Stocker. ICDE 2001.

4. Semantic Ranking and Result Visualization for Life Sciences Publications. Julia Stoyanovich, William Mee and Kenneth A. Ross. ICDE 2010.
