topic 3 top-k and skyline algorithms. 2 what is top-k processing? find k items that best answer a...

40
Topic 3 Top-K and Skyline Algorithms

Upload: casey-pemberton

Post on 29-Mar-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Topic 3Top-K and Skyline Algorithms

Page 2: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

2

What is top-k processing?

• Find k items that best answer a user’s query– As a set, as a sorted list, or as a sorted list with scores– Usually from among N items, where N >> k

• Application domains– Search over structured datasets with user-defined preferences

• Find large apartments in a good school district in Brooklyn• Find cheap hotels that are near a beach

– Web search & other document retrieval / ranking tasks • Find documents about “Massachusetts election health care”

– Search over multimedia repositories (with probability scores)• Find images that show a palm tree next to a house

– Many others….

Page 3: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

3

SQL Example

SELECT h.id , s.name, h.price, s.tuitionSELECT h.id , s.name, h.price, s.tuition

FROM Houses h , Schools sFROM Houses h , Schools s

WHERE h.location = s.locationWHERE h.location = s.location

ORDER BY h.price, s.tuition;ORDER BY h.price, s.tuition;

SELECT h.id , s.nameSELECT h.id , s.nameFROM Houses h , Schools sFROM Houses h , Schools sWHERE h.location = s.locationWHERE h.location = s.locationORDER BY ORDER BY h.price + 10 x s.tuition as h.price + 10 x s.tuition as scorescoreSTOP AFTER 10STOP AFTER 10

SELECT h.id , s.nameSELECT h.id , s.nameFROM Houses h , Schools sFROM Houses h , Schools sWHERE WHERE distance(h.location, s.location)distance(h.location, s.location) as dist as distORDER BY ORDER BY h.price + 10 x s.tuition as h.price + 10 x s.tuition as score + 2 * dist score + 2 * dist as as scorescoreSTOP AFTER 10STOP AFTER 10

Page 4: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

4

Top-K vs. SQL querying

• Compared to standard SQL querying– Relevance to a degree, not Boolean– Return only the best items, not all items– Quality of an item is expressed by a score

SELECT h.id , s.nameSELECT h.id , s.nameFROM Houses h , Schools sFROM Houses h , Schools sWHERE WHERE distance(h.location, s.location)distance(h.location, s.location) as dist as distORDER BY ORDER BY h.price + 10 x s.tuition as h.price + 10 x s.tuition as score + 2 * dist as scorescore + 2 * dist as scoreSTOP AFTER 10STOP AFTER 10

Page 5: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

5

Why do we care about top-k processing?

• Many practical applications

• Representative of many data management problems– Solid application scenarios, and new emerging every day

• E.g., top-K over Web-accessible resources

• E.g., top-k for social search

– Variety of algorithmic approaches– Explores the trade-off between run-time performance and

space overhead– Costs and balances the use of available operators

Page 6: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

6

Ranking functions

• Ranking (scoring) functions are used to compute the score of an item.• Item r(x1, …, xm), where xi are the ranking attributes, e.g., square

footage of a house, number of times “Massachusetts” occurs in the text, etc.

• score(r) = g (f1 (x1), … , fm (xm))– where fi are monotone functions, e.g. f(x) = 2 * x– g is a monotone aggregation functions, e.g. sum, average, max, min– e.g. score (i) = 2 * sq. ft. + 3 * quality of school district

Definition A function f is monotone if

f(x) < f(y) whenever x < yAn aggregation function g is monotone if

g(r) < g(r’) whenever r. xi < r’. xi, for all i

Page 7: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

7

Quantifying top-K algorithm performance

• Execution time– Sequential access (SA)

• accessing items in order, e.g., by reading from a cursor• similar concept to a sequential disk read, where seek time is amortized

over multiple accesses

– Random access (RA)• accessing items out of order, e.g., a primary key lookup• similar to a random disk read• typically more expensive than an SA (even orders of magnitude),

sometimes impossible

– Why not use wall clock time?

• Buffer size– How much state do we have to keep during computation– Is the size bounded by some constant (e.g. k), or is it linear in the

size of the dataset (N)? (recall that k << N)

Page 8: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

8

Outline Intro

• Fundamentals of top-k processing – Fagin algorithm (FA)– Threshold algorithm (TA)– No random access algorithm (NRA)

• Extensions and alternatives– top-k with expensive predicates (MPro)– TAAT vs. DAAT vs. ….

• Skylines– Dominance– Fundamental algorithms– Extensions (demo)

Page 9: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

9

Naïve Computation of the top-k Answers

• Example 1 (on the board & next slide): R (id, annual income, net worth)score(r) = r.income + r.net worth

• Algorithm– Compute the score of each item– Sort items in decreasing order of score– Return k items with the highest score

• Properties of naïve solution– Advantage - simple– Disadvantage - unacceptable run-time performance when N is high

• Idea : throw space at the problem– pre-compute inverted lists for components of the score– aggregate partial scores at run-time

Page 10: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

10

Example 1

id income (K$) net worth (K$) score

= income + net worth

r1 150 350 500

r2 150 425 575

r3 125 450 575

r4 100 450 550

r5 100 200 300

r6 80 100 180

r7 75 500 575

r8 75 50 125

r9 50 300 350

r10 50 50 100

Page 11: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

11

The Basic Indexing Structure: Inverted List

id income

r1 150

r2 150

r3 125

r4 100

r5 100

r6 80

r7 75

r8 75

r9 50

r10 50

L1 L2

id net worth

r7 500

r3 450

r4 450

r2 425

r1 350

r9 300

r5 200

r6 100

r8 50

r10 50

Page 12: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

12

Fagin Algorithm (FA)

• Algorithm– Access all lists sequentially (SA), in parallel– STOP once k items have been seen sequentially in all lists– Compute scores of incomplete items by performing a random

access (RA)– Sort on score, return the best k items

• Work out Example 1 on the board for k = 3

• Performances– 8 SA + 2 RA– 5 objects in buffer

• Is this algorithm correct?

Page 13: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

13

Threshold Algorithm (TA)

• Algorithm– Access all lists sequentially (SA), in parallel– After each cursor move

• Compute the score of the item r under the cursor with random accesses (RA)

• Record r in the buffer if – (i) buffer size < k– (ii) r’s score > kth score, remove kth item from buffer

• Update the threshold = current list scores

• STOP when kth score > – Return the k items currently in the buffer

• Work out Example 1 on the board for k = 3

• Performance – 5 SA + 4 RA; #RA = #SA * (m-1), where m is the number of lists– 3 objects in buffer

Page 14: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

14

Comparison between FA and TA

• Example 2 on the board (k=3)

• Theorem: # SA in TA < # SA in FA

• Theorem: TA requires only bounded buffers, FAs buffers are not bounded

Page 15: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Instance optimiality of TA

• Theorem: TA is instance-optimal over the class of algorithms A that: 1. correctly find the top-k answers and

2. do not make any random guesses

• What about over all algorithms that correctly identify the top-k answers?– Example 4.4 (PODS 2001) on board

15

Page 16: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

16

What if we couldn’t do random accesses?

• Sometimes it suffices to output the top-k as a set• Sometimes we can get away with outputting top-k in

sorted order, but with no scores

• Example 8.3 from [Fagin et al. JCSS 2003]

Page 17: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

17

No Random Access Algorithm (NRA)

• Algorithm– Access all lists sequentially (SA), in parallel– After each cursor move compute

• Worst-case score W(r), best-case score B(r) for each seen r• Sort all seen items on W(r), breaking ties by B(r) = current list scores (this is the best-case score of any unseen

object)• STOP when W(r) of kth object >

– If random accesses are possible, compute complete scores of the top-K items

– Return the top-k items

• Example 1 on the board for k = 3• Performance

– 13 SA + 0 RA– Optimal performance if no RAs are allowed– In reality, computation may be slow -- re-sorting potentially large buffers at

each step

Page 18: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

18

Outline Intro

Fundamentals of top-k processing – Fagin algorithm (FA)– Threshold algorithm (TA)– No random access algorithm (NRA)

• Extensions and alternatives– top-k with expensive predicates (MPro)– TAAT vs. DAAT vs. ….

• Skylines– Dominance– Fundamental algorithms– Extensions (demo)

Page 19: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

19

Top-k Selection Top-k Selection Rank Aggregation from Multiple ListsRank Aggregation from Multiple Lists

BOTH

SortedAccessOnly

Random AccessOnly

FA, TA, Quick-combine, Multi-Step

NRA, Stream-combine

MPro, Upper, Pick

Slide courtesy of Ihab F. Ilyas and Walid G. ArefIhab F. Ilyas and Walid G. Aref

Page 20: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

20

Minimal Probing (MPro)

• Recall: TA sequentially accesses one list, probes all other lists for each encountered object.

• What if random accesses (probes) were very expensive?– User-defined functions– Calls to external services, e.g., web-based

• Idea: execute only the necessary probes [Chang & Hwang, SIGMOD 2002]

• Assumptions– One component of the score accessed sequentially (SA), other

components are probed (RA)

– Probing schedule is given, global (same for all items) or per-item

Page 21: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

21

The MPro ExampleSELECT idFROM HouseWHERE new (age) s, -- sequential access

cheap(price, size) pC -- random access, a.k.a. “probe”

large(size) pL -- random access

ORDER BY MIN(s, rC, rL) STOP AFTER 2

id snew (age)

pC

cheap (price, size)

pL

large (size)

score

MIN(s, pC, pl)

a 0.90 0.85 0.75 0.75

b 0.80 0.78 0.90 0.78

c 0.70 0.75 0.20 0.20

d 0.60 0.90 0.90 0.60

e 0.50 0.70 0.80 0.50

Page 22: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

22

The MPro Algorithm

• For an item r : – W(r) - score seen so far– B(r) - best-case score, unseen components have max score (1.0)

• What is a necessary probe?– The probe pr(r) is necessary if there do not exist k items

v1, …, vk s.t. B(r) < B(vi) for each vi

• The MPro Algorithm– Access items sequentially– Maintain a best-case queue (a.k.a. the ceiling queue)

• STOP when k items with complete scores are at the head of the queue• Probe the first incomplete item, re-sort the queue

• Example 5 on the board (Figure 4 from [KH, SIGMOD 2002])– MIN is the aggregation, not SUM!– find the top-2 items (K=2)

Page 23: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

The big picture: top-k is term-at-a-time

[Query evaluation: strategies and optimizations. Turtle & Flood, 1995.]• Term-at-a-time (TAAT)

– Partial scores for documents kept in accumulators

– May (it better!) terminate before all documents are considered, but each document may be considered more than once

– Typically outperform DAAT when contributions of terms to the score are independent and dataset is reasonably small

• Document-at-a-time (DAAT)– Complete scores for documents computed (i.e., typically all documents

are considered)

– Smaller memory footprint than in TAAT

– Easier to execute in parallel than TAAT

Page 24: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

“Score-at-a-time” revisited

• Lots of trade-offs determine whether TAAT or DAAT is better– E.g., optimized DAAT costly because of re-sorting (for long queries)

– E.g., TAAT costly because of memory footprint

– E.g., DAAT much better for context-sensitive queries

– Particular algorithms, datasets, scoring functions, architectures bring their own trade-offs

• What about score-at-a-time (Anh and Moffat, SIGIR 2006)?– Similar to TAAT, but the order in which terms are considered is

determined by the impact of a given term in a given document

– A custom (efficient) indexing method for a custom (effective) scoring function

24

Page 25: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

25

Outline Intro

Fundamentals of top-k processing – Fagin algorithm (FA)– Threshold algorithm (TA)– No random access algorithm (NRA)

Extensions and alternatives– top-k with expensive predicates (MPro)– TAAT vs. DAAT vs. ….

• Skylines– Dominance– Fundamental algorithms– Extensions (demo)

Page 26: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Ranking functions and dominance

• Monotone aggregation ranking functions express user preferences– r (price, distToBeach); score(r) = w1 * price+ w2 * distToBeach

– score(r1) < score(r2) whenever

r1.price < r2.price AND r1.distToBeach < r2.distToBeach

– this holds for any positive weights w1 and w2! (What if one of the weights was negative? What if both were negative?)

– In this example, we are interested in minimizing the score

• Mapping a multi-dimensional (e.g.,, 2D) space into R– In top-k, we look for k tuples that have the best score, which is weight-

specific, but …• users may differ on the weights in their preferences• users may want to understand the structure of the 2D space

– Skylines to the rescue! • makes the structure of the space explicit• are independent of the weights

26

Page 27: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Skylines represent dominance

• Tuple r1 is better than r2, according to its score if

– score(r1) < score(r2) , i.e.,

– r1.price < r2.price AND r1.distToBeach < r2.distToBeach

• This is equivalent to a notion of dominance (Pareto-optimality)

– Definition: r1 dominates r2 iff r1 is as good or better than r2 in all dimensions, and strictly better in at least one dimension.

– Some tuples may be incomparable• r1: $100, 1.0 mi

• r2: $ 90, 1.2 mi

• r3: $110, 1.1 mi

• r4: $120, 1.1 mi

– Observe: dominance is transitive27

Page 28: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Skyline examples

28

Page 29: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Skyline properties

• For any monotone ranking function F– If point r is best according to F, then r is on the skyline

• A top-1 hotel will always be on the skyline, irrespective of the weights!

– If point r is on the skyline, then there exists an F for

which r is best• Every hotel on the skyline is someone’s favorite

29

Page 30: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Computing the Skyline: nested-loops

• We will focus on 2D skylines, these methods generalize to multiple dimensions (some non-trivially)

SELECT * FROM Hotels h

WHERE h.city = ‘Nassau’

AND NOT EXISTS (

SELECT * FROM Hotels h1

WHERE h1.city = ‘Nassau’

AND h1.distToBeach <= h.distToBeach

AND h1.price <= h.price

AND ( h1.distToBeach < h.distToBeach OR h1.price < h.price))

• How is this evaluated?• Complexity?

30

Page 31: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Computing the Skyline: block-nested-loops

• Maintain a window of incomparable tuples in memory• Iterate until all tuples are either on the skyline or discarded• For each tuple r

– If r is dominated by any tuple in the window, discard r

– If r dominates a tuple r’ in the window, discard r’, insert r

– If r is incomparable with all tuples in the window, insert r

– If the window if full – write r to a temporary file on disk

• At the end of the iteration, output all tuples in the window to a temporary file, with their sequence numbers

• r is on the skyline if it was compared to all other tuples; we can tell this by looking at the sequence number (see ICDE 2001 for details)

Complexity: O(n) best case – when skyline is small, O(n2) worst case

31

Page 32: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

Computing the Skyline: divide-and-conquer

• Based on multidimensional divide-and-conquer [Bentley’80]

• In the block-nested-loops algorithm, tuples were considered multiple times– This is because no order was assumed, and a sequence number was assigned

to keep order

– Idea – sort tuples on one attributes, then compute dominance

• The algorithm– Compute the median along dimension d1

– Recursively compute skylines S1 and S2 for the partitions

– Merge S1 and S2: eliminate all r in S2 that are dominated by some r’ in S1

• Complexity: O(n log n) for d=2, best and worst case32

Page 33: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

What if one of the coordinates is unknown?

• Skylines for scientific literature search [ICDE 2010]– PubMed: 19 million scientific articles– Score components: relevance (higher is better) + publication

date (more recent is better)– Relevance is expensive to compute: a combination of non-

independent components (so, document-at-a-time)– An upper bound on relevance is much cheaper to compute

• Idea: modify divide-an-conquer to work with score upper-bounds

• An add-on: compute the skyline and k contours• Demo

33

Page 34: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

34

Skyline Computation

Mar 2010 Feb 2010 Jan 2010 Dec 2009

rele

vanc

ebatch

boundary

Page 35: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

35

Skyline Computation

Mar 2010 Feb 2010 Jan 2010 Dec 2009

rele

vanc

ebatch

boundary

UB sortorder

Page 36: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

36

Skyline Computation

Mar 2010 Feb 2010 Jan 2010 Dec 2009

batch boundary

3 skylinecontours

dominated,with score

dominated,no score

rele

vanc

e

Page 37: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

37

Skyline Computation

Mar 2010 Feb 2010 Jan 2010 Dec 2009

batch boundary

3 skylinecontours

dominated,with score

dominated,no score

rele

vanc

e

Page 38: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

38

Skyline Computation

Mar 2010 Feb 2010 Jan 2010 Dec 2009

batch boundary

3 skylinecontours

dominated,with score

dominated,no score

rele

vanc

e

Page 39: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

39

Outline• Intro

• Fundamentals of top-k processing – Fagin algorithm (FA)– Threshold algorithm (TA)– No random access algorithm (NRA)

• Extensions and alternatives– top-k with expensive predicates (MPro)– TAAT vs. DAAT vs. ….

• Skylines– Dominance– Fundamental algorithms– Extensions (demo)

Page 40: Topic 3 Top-K and Skyline Algorithms. 2 What is top-k processing? Find k items that best answer a users query –As a set, as a sorted list, or as a sorted

40

References

1. Optimal aggregation algorithms for middleware. Ronald Fagin, Amnon Lotem and Moni Naor. J. Comput. Syst. Sci. 66(4): 614-656 (2003).

2. Minimal probing: supporting expensive predicates for top-K queries. Kevin Chen-Chuan Chang and Seung-won Hwang. SIGMOD 2002.

3. The Skyline operator. Stephan Börzsönyi, Donald Kossmann, Konrad Stocker. ICDE 2001.

4. Semantic Ranking and Result Visualization for Life Sciences Publications. Julia Stoyanovich, William Mee and Kenneth A. Ross. ICDE 2010.