topic 3 top-k and skyline algorithms. 2 what is top-k processing? find k items that best answer a...
TRANSCRIPT
Topic 3Top-K and Skyline Algorithms
2
What is top-k processing?
• Find k items that best answer a user’s query– As a set, as a sorted list, or as a sorted list with scores– Usually from among N items, where N >> k
• Application domains– Search over structured datasets with user-defined preferences
• Find large apartments in a good school district in Brooklyn• Find cheap hotels that are near a beach
– Web search & other document retrieval / ranking tasks • Find documents about “Massachusetts election health care”
– Search over multimedia repositories (with probability scores)• Find images that show a palm tree next to a house
– Many others….
3
SQL Example
SELECT h.id , s.name, h.price, s.tuitionSELECT h.id , s.name, h.price, s.tuition
FROM Houses h , Schools sFROM Houses h , Schools s
WHERE h.location = s.locationWHERE h.location = s.location
ORDER BY h.price, s.tuition;ORDER BY h.price, s.tuition;
SELECT h.id , s.nameSELECT h.id , s.nameFROM Houses h , Schools sFROM Houses h , Schools sWHERE h.location = s.locationWHERE h.location = s.locationORDER BY ORDER BY h.price + 10 x s.tuition as h.price + 10 x s.tuition as scorescoreSTOP AFTER 10STOP AFTER 10
SELECT h.id , s.nameSELECT h.id , s.nameFROM Houses h , Schools sFROM Houses h , Schools sWHERE WHERE distance(h.location, s.location)distance(h.location, s.location) as dist as distORDER BY ORDER BY h.price + 10 x s.tuition as h.price + 10 x s.tuition as score + 2 * dist score + 2 * dist as as scorescoreSTOP AFTER 10STOP AFTER 10
4
Top-K vs. SQL querying
• Compared to standard SQL querying– Relevance to a degree, not Boolean– Return only the best items, not all items– Quality of an item is expressed by a score
SELECT h.id , s.nameSELECT h.id , s.nameFROM Houses h , Schools sFROM Houses h , Schools sWHERE WHERE distance(h.location, s.location)distance(h.location, s.location) as dist as distORDER BY ORDER BY h.price + 10 x s.tuition as h.price + 10 x s.tuition as score + 2 * dist as scorescore + 2 * dist as scoreSTOP AFTER 10STOP AFTER 10
5
Why do we care about top-k processing?
• Many practical applications
• Representative of many data management problems– Solid application scenarios, and new emerging every day
• E.g., top-K over Web-accessible resources
• E.g., top-k for social search
– Variety of algorithmic approaches– Explores the trade-off between run-time performance and
space overhead– Costs and balances the use of available operators
6
Ranking functions
• Ranking (scoring) functions are used to compute the score of an item.• Item r(x1, …, xm), where xi are the ranking attributes, e.g., square
footage of a house, number of times “Massachusetts” occurs in the text, etc.
• score(r) = g (f1 (x1), … , fm (xm))– where fi are monotone functions, e.g. f(x) = 2 * x– g is a monotone aggregation functions, e.g. sum, average, max, min– e.g. score (i) = 2 * sq. ft. + 3 * quality of school district
Definition A function f is monotone if
f(x) < f(y) whenever x < yAn aggregation function g is monotone if
g(r) < g(r’) whenever r. xi < r’. xi, for all i
7
Quantifying top-K algorithm performance
• Execution time– Sequential access (SA)
• accessing items in order, e.g., by reading from a cursor• similar concept to a sequential disk read, where seek time is amortized
over multiple accesses
– Random access (RA)• accessing items out of order, e.g., a primary key lookup• similar to a random disk read• typically more expensive than an SA (even orders of magnitude),
sometimes impossible
– Why not use wall clock time?
• Buffer size– How much state do we have to keep during computation– Is the size bounded by some constant (e.g. k), or is it linear in the
size of the dataset (N)? (recall that k << N)
8
Outline Intro
• Fundamentals of top-k processing – Fagin algorithm (FA)– Threshold algorithm (TA)– No random access algorithm (NRA)
• Extensions and alternatives– top-k with expensive predicates (MPro)– TAAT vs. DAAT vs. ….
• Skylines– Dominance– Fundamental algorithms– Extensions (demo)
9
Naïve Computation of the top-k Answers
• Example 1 (on the board & next slide): R (id, annual income, net worth)score(r) = r.income + r.net worth
• Algorithm– Compute the score of each item– Sort items in decreasing order of score– Return k items with the highest score
• Properties of naïve solution– Advantage - simple– Disadvantage - unacceptable run-time performance when N is high
• Idea : throw space at the problem– pre-compute inverted lists for components of the score– aggregate partial scores at run-time
10
Example 1
id income (K$) net worth (K$) score
= income + net worth
r1 150 350 500
r2 150 425 575
r3 125 450 575
r4 100 450 550
r5 100 200 300
r6 80 100 180
r7 75 500 575
r8 75 50 125
r9 50 300 350
r10 50 50 100
11
The Basic Indexing Structure: Inverted List
id income
r1 150
r2 150
r3 125
r4 100
r5 100
r6 80
r7 75
r8 75
r9 50
r10 50
L1 L2
id net worth
r7 500
r3 450
r4 450
r2 425
r1 350
r9 300
r5 200
r6 100
r8 50
r10 50
12
Fagin Algorithm (FA)
• Algorithm– Access all lists sequentially (SA), in parallel– STOP once k items have been seen sequentially in all lists– Compute scores of incomplete items by performing a random
access (RA)– Sort on score, return the best k items
• Work out Example 1 on the board for k = 3
• Performances– 8 SA + 2 RA– 5 objects in buffer
• Is this algorithm correct?
13
Threshold Algorithm (TA)
• Algorithm– Access all lists sequentially (SA), in parallel– After each cursor move
• Compute the score of the item r under the cursor with random accesses (RA)
• Record r in the buffer if – (i) buffer size < k– (ii) r’s score > kth score, remove kth item from buffer
• Update the threshold = current list scores
• STOP when kth score > – Return the k items currently in the buffer
• Work out Example 1 on the board for k = 3
• Performance – 5 SA + 4 RA; #RA = #SA * (m-1), where m is the number of lists– 3 objects in buffer
14
Comparison between FA and TA
• Example 2 on the board (k=3)
• Theorem: # SA in TA < # SA in FA
• Theorem: TA requires only bounded buffers, FAs buffers are not bounded
Instance optimiality of TA
• Theorem: TA is instance-optimal over the class of algorithms A that: 1. correctly find the top-k answers and
2. do not make any random guesses
• What about over all algorithms that correctly identify the top-k answers?– Example 4.4 (PODS 2001) on board
15
16
What if we couldn’t do random accesses?
• Sometimes it suffices to output the top-k as a set• Sometimes we can get away with outputting top-k in
sorted order, but with no scores
• Example 8.3 from [Fagin et al. JCSS 2003]
17
No Random Access Algorithm (NRA)
• Algorithm– Access all lists sequentially (SA), in parallel– After each cursor move compute
• Worst-case score W(r), best-case score B(r) for each seen r• Sort all seen items on W(r), breaking ties by B(r) = current list scores (this is the best-case score of any unseen
object)• STOP when W(r) of kth object >
– If random accesses are possible, compute complete scores of the top-K items
– Return the top-k items
• Example 1 on the board for k = 3• Performance
– 13 SA + 0 RA– Optimal performance if no RAs are allowed– In reality, computation may be slow -- re-sorting potentially large buffers at
each step
18
Outline Intro
Fundamentals of top-k processing – Fagin algorithm (FA)– Threshold algorithm (TA)– No random access algorithm (NRA)
• Extensions and alternatives– top-k with expensive predicates (MPro)– TAAT vs. DAAT vs. ….
• Skylines– Dominance– Fundamental algorithms– Extensions (demo)
19
Top-k Selection Top-k Selection Rank Aggregation from Multiple ListsRank Aggregation from Multiple Lists
BOTH
SortedAccessOnly
Random AccessOnly
FA, TA, Quick-combine, Multi-Step
NRA, Stream-combine
MPro, Upper, Pick
Slide courtesy of Ihab F. Ilyas and Walid G. ArefIhab F. Ilyas and Walid G. Aref
20
Minimal Probing (MPro)
• Recall: TA sequentially accesses one list, probes all other lists for each encountered object.
• What if random accesses (probes) were very expensive?– User-defined functions– Calls to external services, e.g., web-based
• Idea: execute only the necessary probes [Chang & Hwang, SIGMOD 2002]
• Assumptions– One component of the score accessed sequentially (SA), other
components are probed (RA)
– Probing schedule is given, global (same for all items) or per-item
21
The MPro ExampleSELECT idFROM HouseWHERE new (age) s, -- sequential access
cheap(price, size) pC -- random access, a.k.a. “probe”
large(size) pL -- random access
ORDER BY MIN(s, rC, rL) STOP AFTER 2
id snew (age)
pC
cheap (price, size)
pL
large (size)
score
MIN(s, pC, pl)
a 0.90 0.85 0.75 0.75
b 0.80 0.78 0.90 0.78
c 0.70 0.75 0.20 0.20
d 0.60 0.90 0.90 0.60
e 0.50 0.70 0.80 0.50
22
The MPro Algorithm
• For an item r : – W(r) - score seen so far– B(r) - best-case score, unseen components have max score (1.0)
• What is a necessary probe?– The probe pr(r) is necessary if there do not exist k items
v1, …, vk s.t. B(r) < B(vi) for each vi
• The MPro Algorithm– Access items sequentially– Maintain a best-case queue (a.k.a. the ceiling queue)
• STOP when k items with complete scores are at the head of the queue• Probe the first incomplete item, re-sort the queue
• Example 5 on the board (Figure 4 from [KH, SIGMOD 2002])– MIN is the aggregation, not SUM!– find the top-2 items (K=2)
The big picture: top-k is term-at-a-time
[Query evaluation: strategies and optimizations. Turtle & Flood, 1995.]• Term-at-a-time (TAAT)
– Partial scores for documents kept in accumulators
– May (it better!) terminate before all documents are considered, but each document may be considered more than once
– Typically outperform DAAT when contributions of terms to the score are independent and dataset is reasonably small
• Document-at-a-time (DAAT)– Complete scores for documents computed (i.e., typically all documents
are considered)
– Smaller memory footprint than in TAAT
– Easier to execute in parallel than TAAT
“Score-at-a-time” revisited
• Lots of trade-offs determine whether TAAT or DAAT is better– E.g., optimized DAAT costly because of re-sorting (for long queries)
– E.g., TAAT costly because of memory footprint
– E.g., DAAT much better for context-sensitive queries
– Particular algorithms, datasets, scoring functions, architectures bring their own trade-offs
• What about score-at-a-time (Anh and Moffat, SIGIR 2006)?– Similar to TAAT, but the order in which terms are considered is
determined by the impact of a given term in a given document
– A custom (efficient) indexing method for a custom (effective) scoring function
24
25
Outline Intro
Fundamentals of top-k processing – Fagin algorithm (FA)– Threshold algorithm (TA)– No random access algorithm (NRA)
Extensions and alternatives– top-k with expensive predicates (MPro)– TAAT vs. DAAT vs. ….
• Skylines– Dominance– Fundamental algorithms– Extensions (demo)
Ranking functions and dominance
• Monotone aggregation ranking functions express user preferences– r (price, distToBeach); score(r) = w1 * price+ w2 * distToBeach
– score(r1) < score(r2) whenever
r1.price < r2.price AND r1.distToBeach < r2.distToBeach
– this holds for any positive weights w1 and w2! (What if one of the weights was negative? What if both were negative?)
– In this example, we are interested in minimizing the score
• Mapping a multi-dimensional (e.g.,, 2D) space into R– In top-k, we look for k tuples that have the best score, which is weight-
specific, but …• users may differ on the weights in their preferences• users may want to understand the structure of the 2D space
– Skylines to the rescue! • makes the structure of the space explicit• are independent of the weights
26
Skylines represent dominance
• Tuple r1 is better than r2, according to its score if
– score(r1) < score(r2) , i.e.,
– r1.price < r2.price AND r1.distToBeach < r2.distToBeach
• This is equivalent to a notion of dominance (Pareto-optimality)
– Definition: r1 dominates r2 iff r1 is as good or better than r2 in all dimensions, and strictly better in at least one dimension.
– Some tuples may be incomparable• r1: $100, 1.0 mi
• r2: $ 90, 1.2 mi
• r3: $110, 1.1 mi
• r4: $120, 1.1 mi
– Observe: dominance is transitive27
Skyline examples
28
Skyline properties
• For any monotone ranking function F– If point r is best according to F, then r is on the skyline
• A top-1 hotel will always be on the skyline, irrespective of the weights!
– If point r is on the skyline, then there exists an F for
which r is best• Every hotel on the skyline is someone’s favorite
29
Computing the Skyline: nested-loops
• We will focus on 2D skylines, these methods generalize to multiple dimensions (some non-trivially)
SELECT * FROM Hotels h
WHERE h.city = ‘Nassau’
AND NOT EXISTS (
SELECT * FROM Hotels h1
WHERE h1.city = ‘Nassau’
AND h1.distToBeach <= h.distToBeach
AND h1.price <= h.price
AND ( h1.distToBeach < h.distToBeach OR h1.price < h.price))
• How is this evaluated?• Complexity?
30
Computing the Skyline: block-nested-loops
• Maintain a window of incomparable tuples in memory• Iterate until all tuples are either on the skyline or discarded• For each tuple r
– If r is dominated by any tuple in the window, discard r
– If r dominates a tuple r’ in the window, discard r’, insert r
– If r is incomparable with all tuples in the window, insert r
– If the window if full – write r to a temporary file on disk
• At the end of the iteration, output all tuples in the window to a temporary file, with their sequence numbers
• r is on the skyline if it was compared to all other tuples; we can tell this by looking at the sequence number (see ICDE 2001 for details)
Complexity: O(n) best case – when skyline is small, O(n2) worst case
31
Computing the Skyline: divide-and-conquer
• Based on multidimensional divide-and-conquer [Bentley’80]
• In the block-nested-loops algorithm, tuples were considered multiple times– This is because no order was assumed, and a sequence number was assigned
to keep order
– Idea – sort tuples on one attributes, then compute dominance
• The algorithm– Compute the median along dimension d1
– Recursively compute skylines S1 and S2 for the partitions
– Merge S1 and S2: eliminate all r in S2 that are dominated by some r’ in S1
• Complexity: O(n log n) for d=2, best and worst case32
What if one of the coordinates is unknown?
• Skylines for scientific literature search [ICDE 2010]– PubMed: 19 million scientific articles– Score components: relevance (higher is better) + publication
date (more recent is better)– Relevance is expensive to compute: a combination of non-
independent components (so, document-at-a-time)– An upper bound on relevance is much cheaper to compute
• Idea: modify divide-an-conquer to work with score upper-bounds
• An add-on: compute the skyline and k contours• Demo
33
34
Skyline Computation
Mar 2010 Feb 2010 Jan 2010 Dec 2009
rele
vanc
ebatch
boundary
35
Skyline Computation
Mar 2010 Feb 2010 Jan 2010 Dec 2009
rele
vanc
ebatch
boundary
UB sortorder
36
Skyline Computation
Mar 2010 Feb 2010 Jan 2010 Dec 2009
batch boundary
3 skylinecontours
dominated,with score
dominated,no score
rele
vanc
e
37
Skyline Computation
Mar 2010 Feb 2010 Jan 2010 Dec 2009
batch boundary
3 skylinecontours
dominated,with score
dominated,no score
rele
vanc
e
38
Skyline Computation
Mar 2010 Feb 2010 Jan 2010 Dec 2009
batch boundary
3 skylinecontours
dominated,with score
dominated,no score
rele
vanc
e
39
Outline• Intro
• Fundamentals of top-k processing – Fagin algorithm (FA)– Threshold algorithm (TA)– No random access algorithm (NRA)
• Extensions and alternatives– top-k with expensive predicates (MPro)– TAAT vs. DAAT vs. ….
• Skylines– Dominance– Fundamental algorithms– Extensions (demo)
40
References
1. Optimal aggregation algorithms for middleware. Ronald Fagin, Amnon Lotem and Moni Naor. J. Comput. Syst. Sci. 66(4): 614-656 (2003).
2. Minimal probing: supporting expensive predicates for top-K queries. Kevin Chen-Chuan Chang and Seung-won Hwang. SIGMOD 2002.
3. The Skyline operator. Stephan Börzsönyi, Donald Kossmann, Konrad Stocker. ICDE 2001.
4. Semantic Ranking and Result Visualization for Life Sciences Publications. Julia Stoyanovich, William Mee and Kenneth A. Ross. ICDE 2010.