Answering Top-k Queries Using Views By: Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris Tsirogiannis (Univ. of Toronto) Presented By:

Post on 14-Jan-2016

TRANSCRIPT

Page 1

Answering Top-k Queries Using Views

By:Gautam Das (Univ. of Texas),

Dimitrios Gunopulos (Univ. of California Riverside),

Nick Koudas (Univ. of Toronto),

Dimitris Tsirogiannis (Univ. of Toronto)

Presented By: Kushal Shah

Lipsa Patel

Page 2

Views

• Definition: Views

• Declaring Views

• Advantages of using Views

Page 3

Views

• A view may be thought of as a table that is derived from one or more underlying base tables.

• Two kinds:

1. Virtual: Not stored in the database; just a query for constructing the relation.

2. Materialized: Actually constructed and stored.

Page 4

Declaring Views

• Materialized:

CREATE MATERIALIZED VIEW <name> AS <query>;

• Virtual (the default, without the MATERIALIZED keyword):

CREATE VIEW <name> AS <query>;

Page 5

Advantages of using Views

• If a database has several tables and we want to see only specific columns from specific tables, we can define a view over them.

• Security: permissions can be configured on a view so that specific users see only specific columns.

Page 7

Top-k Query

• Top-k Query Processing – Definition

• Top-k Example

• Algorithms for Top-k Query Processing

Page 8

Top-k Query Processing

Top-k query processing = finding the k objects that have the highest overall score.

Page 9

Top-k Example

• Users' preferences regarding the ordering of the tuples of a relation can be expressed as a scoring function on the attributes of the relation, e.g.

fQ = 3x1 + 2x2 + 5x3

• The top-k problem is to find the k tuples with the highest score according to a given scoring function.

Relation R:

tid   X1   X2   X3
1     82    1   59
2     53   19   83
3     29   99   15
4     80   45    8
5     28   32   39

Scores under fQ, in decreasing order:

tid   Score
2     612
1     543
4     370
3     360
5     343
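The scores in this example can be reproduced with a few lines of Python. This is only a brute-force sketch that scores every tuple; the names `R`, `f_q` and `top_k` are illustrative and do not come from the slides.

```python
import heapq

# Relation R from the slide: tid -> (X1, X2, X3)
R = {
    1: (82, 1, 59),
    2: (53, 19, 83),
    3: (29, 99, 15),
    4: (80, 45, 8),
    5: (28, 32, 39),
}

def f_q(t):
    """Scoring function of the example: fQ = 3x1 + 2x2 + 5x3."""
    x1, x2, x3 = t
    return 3 * x1 + 2 * x2 + 5 * x3

def top_k(relation, score, k):
    """Return the k (tid, score) pairs with the highest score, best first."""
    scored = ((tid, score(t)) for tid, t in relation.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

ranking = top_k(R, f_q, 5)
print(ranking)
```

Scanning the whole relation like this is exactly what the algorithms on the following slides try to avoid.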

Page 10

Algorithms for Top-k Query Processing

• How? Which algorithms? Related work, and how we complement existing approaches:

• TA [Fagin]

• PREFER [Hristidis]: stores multiple copies of a relation, each ordered according to a different scoring function. To answer a top-k query, the algorithm uses the single copy whose scoring function is closest to the scoring function of the query.

Page 11

Example – Simple Database model

N objects, M attributes (here M = 2):

ObjectID   Attribute 1   Attribute 2
a          0.9           0.85
b          0.8           0.7
c          0.72          0.2
d          0.6           0.9
...

Sorted L1 (by Attribute 1): (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6), ...

Sorted L2 (by Attribute 2): (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), ...

Page 12

Example – Threshold Algorithm

Step 1: parallel sorted access to each list.

L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6), ...
L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), ...

For each object seen:
- get all grades by random access
- determine Min(A1,A2)
- if it is amongst the 2 highest seen, keep it in the buffer

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6

Page 13

Example – Threshold Algorithm

Step 2: determine the threshold value based on the grades currently seen under sorted access: T = min(L1, L2).

L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6, ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2, ...

T = min(0.9, 0.9) = 0.9

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6

- If 2 objects have an overall grade ≥ the threshold value, stop; else go to the next entry position in the sorted lists and repeat step 1. Here neither a (0.85) nor d (0.6) reaches 0.9, so the algorithm continues.

Page 14

Example – Threshold Algorithm

Step 1 (again): parallel sorted access to each list.

L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6), ...
L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), ...

For each object seen:
- get all grades by random access
- determine Min(A1,A2)
- if it is amongst the 2 highest seen, keep it in the buffer

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6
b    0.8    0.7    0.7

Page 15

Example – Threshold Algorithm

Step 2 (again): determine the threshold value based on the grades currently seen: T = min(L1, L2).

L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6, ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2, ...

T = min(0.8, 0.85) = 0.8

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7

- If 2 objects have an overall grade ≥ the threshold value, stop; else go to the next entry position in the sorted lists and repeat step 1. Here only a (0.85) reaches 0.8, so the algorithm continues.

Page 16

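Example – Threshold Algorithm

Situation at stopping condition:

L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6, ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2, ...

T = min(0.72, 0.7) = 0.7

ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7
c    0.72   0.2    0.2

Both buffered objects, a and b, now have an overall grade ≥ 0.7, so the algorithm stops; c (0.2) does not enter the top 2.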
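The walkthrough on the last few slides can be replayed in code. The following is a minimal sketch of the threshold algorithm with Min as the aggregate, assuming both lists are fully available in memory; the function name `threshold_algorithm` and its return shape are choices made here, not part of the slides.

```python
import heapq

def threshold_algorithm(lists, k, agg=min):
    """lists: one list per attribute of (object_id, grade), sorted by grade descending.
    Returns (top-k list of (object_id, overall grade), number of positions read)."""
    grades = [dict(lst) for lst in lists]  # index used for random access
    seen = {}                              # object_id -> overall grade
    top, depth = [], 0
    for depth, row in enumerate(zip(*lists), start=1):
        # Step 1: sorted access in parallel, then random access for missing grades.
        for obj, _ in row:
            if obj not in seen:
                seen[obj] = agg(g[obj] for g in grades)
        # Step 2: threshold from the grades at the current position of each list.
        threshold = agg(grade for _, grade in row)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            break
    return top, depth

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
top2, depth = threshold_algorithm([L1, L2], k=2)
print(top2, depth)
```

On the example lists the loop stops at the third position with a and b in the buffer, matching the stopping situation shown above.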

Page 17

Related Work for Top-k Query Processing

• TA: Sequential as well as Random Access

• PREFER

Page 18

Approach for Top-k Query Processing

• Top-k Query Answering using Views

• Views are Materialized (incurring space overhead)

• Advantages of using views: increased performance because views are small in size

• Space-Performance tradeoff

Page 19

Example Views

Three-attribute relation R:

tid   X1   X2   X3
1     82    1   59
2     53   19   83
3     29   99   15
4     80   45    8
5     28   32   39

V1: top-5 query using function f1 = 2x1 + 5x2

tid   Score
3     553
4     385
5     216
2     201
1     169

V2: top-5 query using function f2 = x2 + 2x3

tid   Score
2     351
1     237
5     177
3     159
4     88

• Top-k ranking queries in SQL-like syntax:

SELECT TOP[k] FROM R ORDER BY Score(q)

where Score(q) is a function that assigns a numeric score to any tuple t.

• Ranking views: views whose only aim is to rank.

• A ranking view is the materialized result of a previously asked top-k query.

Can we answer new top-k queries efficiently using ranking views? Let's see.

Page 20

Formal Definitions

• Ranking Queries

• Ranking Views

Page 21

Ranking Queries

• Ranking queries: top-k ranking queries in SQL-like syntax:

SELECT TOP[k] FROM R WHERE Range(q) ORDER BY Score(q)

• A ranking query may be expressed as a triple Q = (Score(q), k, Range(q)), where

Score(q) = function that assigns a numeric score to any tuple t

Range(q) = selection condition for the tuples of R

• Semantics: retrieve the k tuples with the top scores satisfying the selection condition.

Page 22

Ranking Views

Materialized ranking view V:

For a previously executed query Q1 = (ScoreQ1, k1, RangeQ1), the corresponding materialized ranking view is a set of k1 (tid, ScoreQ1(tid)) pairs, ordered by decreasing values of ScoreQ1(tid).
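As a rough illustration of these two definitions, a ranking query can be held as a (score, k, range) triple and a ranking view as the sorted list it produces. The class and function names below are hypothetical; the data is the five-tuple relation R from the Example Views slide.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Row = Tuple[int, int, int]

@dataclass
class RankingQuery:
    """Q = (Score, k, Range), following the triple on the slide."""
    score: Callable[[Row], float]
    k: int
    in_range: Callable[[Row], bool] = lambda t: True  # default: no selection

def materialize_ranking_view(relation, query):
    """Return the k (tid, score) pairs of the query, ordered by decreasing score."""
    pairs = [(tid, query.score(t)) for tid, t in relation.items() if query.in_range(t)]
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs[: query.k]

R = {1: (82, 1, 59), 2: (53, 19, 83), 3: (29, 99, 15), 4: (80, 45, 8), 5: (28, 32, 39)}

# V1 from the Example Views slide: top-5 under f1 = 2x1 + 5x2, no range condition.
V1 = materialize_ranking_view(R, RankingQuery(lambda t: 2 * t[0] + 5 * t[1], k=5))

# Same scoring function with a made-up range condition X1 >= 50 and k = 2.
V1_ranged = materialize_ranking_view(
    R, RankingQuery(lambda t: 2 * t[0] + 5 * t[1], k=2, in_range=lambda t: t[0] >= 50)
)
print(V1, V1_ranged)
```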

Page 23

Problems we are going to solve

• Top-k Query Answer using Views

• View Selection

Page 24

Top-k Query Answer using Views

Given: Set U of views

Query Q

Obtain an answer to Q combining all the information conveyed by the views in U

Solution: Algorithm named LPTA

Page 25

Problems we are going to solve

• Top-k Query Answer using Views

• View Selection

Page 26

View Selection

• Problem: given a collection of views V = {V1, …, Vr} (base views) and a query Q, determine the most efficient subset U of V on which to execute Q.

• Input to LPTA: the subset U.

• Baseline for obtaining an answer to a ranking query: running TA on the base views.

• Find the subset U that, when utilized by LPTA,

1. provides the answer to the query, and

2. provides the answer faster than running TA on the base views V.

Page 27

Outline

• LPTA Algorithm

• View Selection Problem

Page 28

LPTA: Linear Programming Adaptation of the Threshold Algorithm

LPTA for Top-k Query Answer using Views

1. Scoring function of the query Q: fQ = 3x1 + 10x2

2. Scoring functions of the views (subset of views U):

V1: fV1 = 2x1 + 5x2
V2: fV2 = x1 + 2x2

Top-1 query.

• A view is a set of (tuple identifier, score) pairs.

• The LPTA algorithm requires sorted access on each view in non-increasing order of that score.

Page 29

LPTA Example

Relation R:

tid   x1   x2   x3
1     82    1   59
2     53   19   83
3     29    1    2
4     80   22   90
5     28    8   87
6     12   55   82
7     16   99   42
8     18   42   67
9     42    1   23
10    23   21   88

V1: top-5 query, f1 = 2x1 + 5x2

tid   Score
7     527
6     299
4     270
8     246
2     201

V2: top-3 query, f2 = x2 + 2x3

tid   Score
6     219
4     202
10    197

Answer a top-2 query using LPTA.

Page 30

LPTA Setting

• The algorithm initializes the top-k buffer (here the top-2 buffer) to empty.

V1:

tid   Score
7     527
6     299
4     270
8     246
2     201

V2:

tid   Score
6     219
4     202
10    197

• Sorted access reads tid 7 from V1 and tid 6 from V2. For each tid read, a random access on R retrieves the tuple, and its score is computed according to the query function f3 = 3x1 + 10x2 + 5x3:

tid   x1   x2   x3   Score
7     16   99   42   1248
6     12   55   82   996

Top-2 buffer: (7, 1248), (6, 996)

• Check for the stopping condition.

Page 31

Check for Stopping Condition

• The unseen tuples in the views have to satisfy the following inequalities (the domain of each attribute of R is [1,100]):

0 <= x1, x2, x3 <= 100     (1)
2x1 + 5x2 <= sd1           (2)
x2 + 2x3 <= sd2            (3)

where sd1 = 527 and sd2 = 219 are the scores last read from V1 and V2.

• Unseenmax = solution to the linear program in which we maximize the function f3 = 3x1 + 10x2 + 5x3 subject to these inequalities.

• The solution of the linear program gives the maximum score of any unseen tuple.

• Unseenmax = the maximum possible score (with respect to the ranking query's scoring function) of any tuple not yet visited in the views.

• The algorithm terminates when the top-k buffer is full and Unseenmax <= topkmin.
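For the numbers on this slide the linear program can be checked by hand. The derivation below is a sketch that is not part of the slides: it relies on the observation that, in this particular example, f3 happens to be a non-negative combination of the two view functions, so constraints (2) and (3) bound it directly.

```latex
\begin{aligned}
f_3 = 3x_1 + 10x_2 + 5x_3
    &= 1.5\,(2x_1 + 5x_2) + 2.5\,(x_2 + 2x_3) \\
    &\le 1.5\, sd_1 + 2.5\, sd_2
     = 1.5 \cdot 527 + 2.5 \cdot 219 = 1338 .
\end{aligned}
```

The bound is attained at (x1, x2, x3) = (13.5, 100, 59.5), which satisfies (1), so Unseenmax = 1338. With (7, 1248) and (6, 996) in the top-2 buffer, topkmin = 996 < 1338, so the stopping condition does not hold yet and the algorithm reads the next entry of each view.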

Page 32

Calculating Unseenmax

• Unseenmax = Solution to the linear program where we maximize the function f3 = 3x1 + 10x2 + 5x3 subject to these inequalities.

• A linear programming problem may be defined as the problem of maximizing or minimizing a linear function subject to linear constraints. The constraints may be equalities or inequalities. Here is a simple example.

• Find numbers x1 and x2 that maximize the sum x1 + x2 (the objective function) subject to the constraints

x1 ≥ 0, x2 ≥ 0, and

x1 + 2x2 ≤ 4
4x1 + 2x2 ≤ 12
−x1 + x2 ≤ 1

Page 33

Maximize the function

Convex region

This system of inequalities defines a convex region.

Occasionally, the maximum occurs along an entire edge or face of the constraint set, but then the maximum occurs at a corner point as well.
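Because the maximum is attained at a corner point, a tiny linear program such as the one on the previous slide can be solved by listing the corners of the convex region and evaluating the objective at each. The sketch below does this with exact fractions; it is an illustration of the corner-point property only, not how a real LP solver works, and the names are made up here.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is written as a1*x1 + a2*x2 <= b.
constraints = [
    (-1, 0, 0),   # x1 >= 0
    (0, -1, 0),   # x2 >= 0
    (1, 2, 4),    # x1 + 2x2 <= 4
    (4, 2, 12),   # 4x1 + 2x2 <= 12
    (-1, 1, 1),   # -x1 + x2 <= 1
]

def corner_points(cons):
    """Intersect every pair of boundary lines and keep the feasible intersections."""
    points = set()
    for (a1, a2, b), (c1, c2, d) in combinations(cons, 2):
        det = a1 * c2 - a2 * c1
        if det == 0:          # parallel boundaries never meet
            continue
        x = Fraction(b * c2 - a2 * d, det)   # Cramer's rule
        y = Fraction(a1 * d - b * c1, det)
        if all(p * x + q * y <= r for p, q, r in cons):
            points.add((x, y))
    return points

corners = corner_points(constraints)
best = max(corners, key=lambda p: p[0] + p[1])   # objective: x1 + x2
print(sorted(corners), best, best[0] + best[1])
```

For this example the region has five corners and the objective x1 + x2 is largest at (8/3, 2/3), where it equals 10/3.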

Page 34

LPTA - Example

[Figure: views V1 and V2 shown as sorted lists of (tid, score) pairs, (tid11, s11) … (tid51, s51) and (tid12, s12) … (tid52, s52), next to the relation R(X1, X2) plotted on the normalized domain [0,1] as the unit square O (0,0), P (1,0), R (1,1), T (0,1), with the vectors V1, V2 and Q and the stopping condition marked. Top-1 query; tid11 and tid12 have been read.]

• Views and the top-k query are represented by vectors denoting the direction of increasing score.

• Sorted access on V1: a sweeping line perpendicular to V1 moves from infinity towards the origin.

• Score of a tuple with respect to the query: project that tuple onto the vector of the query.

• Score of a tuple with respect to a view: project that tuple onto the vector of the view.

• Unseenmax: maximum possible score, with respect to the scoring function of the query, of any tuple not yet visited in the views.

Page 35

LPTA - Example (cont’)

[Figure: the same views V1 and V2 and unit square O (0,0), P (1,0), R (1,1), T (0,1) over R(X1, X2), one lock-step later. Top-1 query; tid11, tid12, tid21 and tid22 have been read, and the stopping condition is marked relative to V1, V2 and Q.]

The algorithm will stop early if the scoring function of the views is "similar" to the scoring function of the query.

Page 36

LPTA Algorithm – Pseudo Code

• There is Sequential as well as Random Access.

• Sequential access on views

• Random Access on base table to find the tuple
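The pseudo code itself did not survive in this transcript, so the following is only a sketch of the loop as the earlier slides describe it: lock-step sequential access on the views, random access on R, and a linear program for Unseenmax. To stay self-contained it solves the three-variable program by enumerating corner points rather than calling an LP solver, which is a simplification made here; all function names are hypothetical.

```python
from itertools import combinations

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(rows, rhs):
    """Solve a 3x3 linear system with Cramer's rule; None if it is singular."""
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    sol = []
    for j in range(3):
        m = [list(r) for r in rows]
        for i in range(3):
            m[i][j] = rhs[i]
        sol.append(det3(m) / d)
    return sol

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def unseen_max(query, view_funcs, last_scores, lo=0.0, hi=100.0):
    """Maximize query.x subject to lo <= x_i <= hi and view.x <= last score read."""
    cons = []
    for i in range(3):
        unit = [1.0 if j == i else 0.0 for j in range(3)]
        cons.append((unit, hi))                  # x_i <= hi
        cons.append(([-u for u in unit], -lo))   # x_i >= lo
    cons += [(list(f), s) for f, s in zip(view_funcs, last_scores)]
    best = float("-inf")
    for trio in combinations(cons, 3):           # candidate corner points
        x = solve3([c[0] for c in trio], [c[1] for c in trio])
        if x is not None and all(dot(a, x) <= b + 1e-9 for a, b in cons):
            best = max(best, dot(query, x))
    return best

def lpta(relation, views, view_funcs, query, k):
    """Returns (top-k buffer, lock-steps). If a view runs out before the stopping
    condition holds, the buffer returned is only the best seen so far."""
    scores, top, depth = {}, [], 0
    for depth, row in enumerate(zip(*views), start=1):   # sequential access
        for tid, _ in row:
            scores[tid] = dot(query, relation[tid])      # random access on R
        top = sorted(scores.items(), key=lambda p: p[1], reverse=True)[:k]
        bound = unseen_max(query, view_funcs, [s for _, s in row])
        if len(top) == k and bound <= top[-1][1]:        # Unseenmax <= topkmin
            break
    return top, depth

R = {1: (82, 1, 59), 2: (53, 19, 83), 3: (29, 1, 2), 4: (80, 22, 90), 5: (28, 8, 87),
     6: (12, 55, 82), 7: (16, 99, 42), 8: (18, 42, 67), 9: (42, 1, 23), 10: (23, 21, 88)}
V1 = [(7, 527), (6, 299), (4, 270), (8, 246), (2, 201)]   # f1 = 2x1 + 5x2
V2 = [(6, 219), (4, 202), (10, 197)]                      # f2 = x2 + 2x3
answer, steps = lpta(R, [V1, V2], [(2, 5, 0), (0, 1, 2)], (3, 10, 5), k=2)
print(answer, steps)
```

On the LPTA Example data the sketch stops after two lock-steps with tids 7 and 6 in the buffer: after the second step Unseenmax drops to 953.5, which is below topkmin = 996.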

Page 37

Comparison of LPTA with TA

• LPTA becomes TA when the set of views U is the set of base views.

• Execution cost: both perform sequential as well as random accesses.

These I/O operations play a significant role: they overshadow the costs of CPU operations such as updating the top-k buffer, testing for the stopping condition, and so on.

Page 38

Determining Factor for performance LPTA versus TA

• Sequential and random accesses are highly correlated: every sequential access incurs a random access.

• As a result, the determining factor for performance is (the distance from the beginning of the view that each algorithm has to traverse, i.e. read sequentially, before coming to a halt with the correct answer) × (the number of views participating in the process).

• d = number of lock-steps, r = number of views.

• Running cost: O(dr)

Page 39

Outline

• LPTA Algorithm

• View Selection Problem

Page 40

View Selection Problem

• Given a collection of views V = {V1, …, Vr} and a query Q, determine the most efficient subset U ⊆ V on which to execute Q.

• Conceptual discussion of view selection:
- two-attribute relation (two dimensions)
- multi-attribute relation (any number of dimensions)

• The domain of each attribute is normalized to [0,1].

• An m-attribute relation is referred to as m-dimensional.

Page 41

View Selection – Two Dimension(same side)

[Figure: unit square OPRT with O (0,0), P (1,0), R (1,1), T (0,1) and axes X and Y; query vector Q, view vectors V1 and V2, the minimum top-k tuple M, and the points A, B, B1 and B2.]

• Two views V1 and V2 and the query Q are represented by vectors. Both view vectors are to the same side (clockwise) of the query vector.

• AB ⊥ Q passes through M and intersects the unit square. ABR contains the top-k tuples; ABPOT contains the remaining tuples.

• Sorted access to V1: a sweep line ⊥ to V1 moves from infinity towards the origin.

• Stopping condition for V1: the sweep line crosses AB1, because the convex polygon AB1POT then contains the unseen tuples and score(unseen) <= Score(M).

• Number of sorted accesses: for V1, NumTuples(AB1R); for V2, NumTuples(AB2R).

Page 42

View Selection – Two Dimension(same side) Conclusion

• V2 is slower than V1.

• If several views in two dimensions are available and all their vectors are to one side of the query vector, then it is optimal for LPTA to use the view whose vector is closest to the query vector.

Page 43

Estimating the Number of Tuples

• The numbers of tuples could be estimated and compared simply by comparing the areas of the respective triangles.

• Such an approach needs a uniform distribution within the triangles, which is often quite unrealistic.

• Our approach for view selection uses the conceptual conclusions and also borrows knowledge of the actual data distribution.

Page 44

View Selection – Two Dimension(either side of query)

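[Figure: unit square with O (0,0), P (1,0), R (1,1), T (0,1) and axes X and Y; query vector Q lying between the view vectors V1 and V2, the minimum top-k tuple M, and the points A, B, A1 and B1.]

• Either only V1 or only V2 can be used for execution.

• If only V1 is used to answer the query, the stopping condition is reached once the sweep line perpendicular to V1 crosses position A1B. For V2, the corresponding position is AB1.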

Page 45

View Selection – Two Dimension(either side of query)

[Figure: unit square with O (0,0), P (1,0), R (1,1), T (0,1) and axes X and Y; query vector Q between the view vectors V1 and V2, the minimum top-k tuple M, and the points A, B, A1, B1, A11, B11, A21 and B21.]

• What about running LPTA on both V1 and V2, rather than on only one of them? Two views are better than one.

• The intersection point of the sweep lines perpendicular to V1 and V2 is on the line AB.

• The stopping condition is reached when the sweep lines cross A11B11 and A21B21, respectively, such that

1) the intersection point of A11B11 and A21B21 is on the line AB, and

2) NumTuples(A11B11R) = NumTuples(A21B21PR), since the algorithm sweeps each view in lock-step.

Page 46

LPTA on both Views versus One

• With two views, each sweep line stops before the stopping position it would reach if only that view were used.

• Total number of sorted accesses for two views:

NumTuples(A11B11R) + NumTuples(A21B21PR) = 2 NumTuples(A11B11R)

• If Min(NumTuples(A1BR), NumTuples(AB1PR), 2 NumTuples(A11B11R)) = NumTuples(A1BR): use V1

• If Min(NumTuples(A1BR), NumTuples(AB1PR), 2 NumTuples(A11B11R)) = NumTuples(AB1PR): use V2

• Else use both V1 and V2
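The three-way comparison above amounts to taking the cheapest of three candidate plans. A small sketch, in which the three tuple counts are assumed to be given (the function name and the sample counts are invented for illustration):

```python
def choose_views(n_v1_only, n_v2_only, n_each_with_both):
    """n_v1_only ~ NumTuples(A1BR), n_v2_only ~ NumTuples(AB1PR),
    n_each_with_both ~ NumTuples(A11B11R), read from each view in lock-step."""
    costs = {
        ("V1",): n_v1_only,
        ("V2",): n_v2_only,
        ("V1", "V2"): 2 * n_each_with_both,
    }
    # Ties go to the earlier entry, matching the order of the rules on the slide.
    return min(costs, key=costs.get)

print(choose_views(40, 55, 15))   # both views: 2 * 15 = 30 sorted accesses
print(choose_views(20, 55, 15))   # V1 alone is cheapest
```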

Page 47

Theorem for Two Dimensional Case

Theorem 1: Let V = {V1, …, Vr} be a set of views and Q a query over a two-dimensional dataset. Let Va be the view closest to the query in the anticlockwise direction and Vc the view closest to the query in the clockwise direction, so that they are on either side of the query. Then the optimal execution of LPTA requires the use of either Va or Vc, i.e., the use of a subset of {Va, Vc}.

Page 48

View Selection – Higher Dimension

Extension of Theorem 1

Theorem 2: Let V = {V1, …, Vr} be a set of views and Q a query over an m-dimensional dataset. Then the optimal execution of LPTA requires the use of a subset of views U ⊆ V such that |U| <= m.

Page 49

Outline

• LPTA Algorithm

• View Selection Problem

Cost Estimation Framework

Page 50

Cost Estimation Framework – Running LPTA

• Cost estimation framework: estimates the cost of running LPTA when a specific set of views is used to answer a query.

• Cost = total number of sequential accesses in the views.

• Example: 2 views are used to answer a query; cost = 6 sequential accesses.

[Figure: query vector Q, view vectors V1 and V2, and the line AB through the minimum top-k tuple.]

• Can we find that cost without actually running LPTA?

Page 51

Cost Estimation Framework – without Running LPTA

• EstimateCost(Q, U): Returns an estimate of the cost of running LPTA on exactly this set of views: U

• Used within SelectViews(Q,V) to search for the subset U that minimizes EstimateCost(Q,U)

• EstimateCost(Q,U) takes into account

– Multi-attribute views

– Non-uniform data distribution

Page 52

Simulating LPTA on Histograms rather than on views U

• Equi-depth histograms: the number of tuples in each bucket is the same.

• Base table R: n tuples (in the example, n = 10).

• Hi: equi-depth histogram with b buckets (in the example, b = 2) that represents the distribution of points along the Xi attribute.

• Each bucket represents n/b data points (in the example, 10/2 = 5 data points).
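A minimal sketch of building such a histogram, assuming for simplicity that n is a multiple of b and representing each bucket as a (low, high, count) triple; this representation is a choice made here, not one stated on the slides. The sample values are the x1 column of the ten-tuple relation from the LPTA Example slide.

```python
def equi_depth_histogram(values, b):
    """Split the sorted values into b buckets holding the same number of points.
    Returns a list of (lowest value, highest value, count), one entry per bucket."""
    ordered = sorted(values)
    n = len(ordered)
    if n % b != 0:
        raise ValueError("this sketch assumes n is a multiple of b")
    depth = n // b
    return [(ordered[i], ordered[i + depth - 1], depth) for i in range(0, n, depth)]

x1 = [82, 53, 29, 80, 28, 12, 16, 18, 42, 23]   # n = 10
H1 = equi_depth_histogram(x1, b=2)
print(H1)
```

Each of the two buckets holds 5 points, but the second one spans a much wider range of values, which is how the histogram captures a non-uniform distribution.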

Page 53

Simulating LPTA on Histograms rather than on views U

• In our estimation procedure:

HQ – represents the distribution of score of all tuples of the database according to the scoring function Q

Cannot calculate the score of all tuples, so approximate HQ

Page 54

Simulation of LPTA on Histograms

• We cannot afford to run LPTA on the views U, so we simulate LPTA in a bucket-by-bucket lock-step to estimate the cost.

[Figure: histograms HQ, HV1 and HV2, with topkmin marked on HQ and the estimated cost marked on the view histograms.]

• HQ approximates the score distribution of the query Q.

• HV1, HV2: b-bucket histograms for the score distributions of the views, with n/b tuples per bucket.

• topkmin is pre-estimated because we do not have access to the actual tuples or their tids. Its value is estimated from HQ by determining the bucket that contains the k-th highest tuple. Since topkmin is very likely inside this bucket, we use linear interpolation within the bucket to estimate it.

• The procedure is cheap because there is one iteration of the LPTA algorithm for every n/b tuples, using the values at the bucket boundaries to approximate the value of the scoring function.
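The interpolation step for topkmin might look as follows. This is a sketch under assumptions that the slides do not spell out: HQ is given as (high, low) score boundaries per bucket, ordered from the highest scores down, and scores are taken to be evenly spread inside a bucket. The function name and the sample histogram are made up.

```python
def estimate_topkmin(buckets, depth, k):
    """buckets: (high, low) score boundaries, highest-scoring bucket first.
    depth: tuples per bucket (n/b). Returns the estimated k-th highest score."""
    j, r = divmod(k - 1, depth)      # bucket holding the k-th tuple, 0-based rank inside it
    high, low = buckets[j]
    # Linear interpolation: the last tuple of a bucket sits at its low boundary.
    return high - (r + 1) * (high - low) / depth

# Hypothetical HQ with n = 10 tuples in b = 2 buckets.
HQ = [(1300, 800), (800, 100)]
print(estimate_topkmin(HQ, depth=5, k=2))   # second of five tuples in the first bucket
print(estimate_topkmin(HQ, depth=5, k=5))   # last tuple of the first bucket
```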

Page 55

Calculating the Estimated cost

• Number of buckets visited along each view: d (in the example, d = 3).

• Number of views: r1 (in the example, r1 = 2).

• Number of tuples per bucket: n/b (in the example, 10).

• Compute the smallest number of tuples n1 that need to be scanned from the last bucket before stopping (in the example, n1 = 2).

• Estimated number of sorted accesses: ((d − 1) n/b + n1) r1 = ((2)(10) + 2) 2 = 44

• The running time is therefore O((d − 1) + log n1) lock-step iterations.

Page 56

Outline

• LPTA Algorithm

• View Selection Problem

Cost Estimation Framework

View Selection Algorithms

Page 57

EstimateCost(Q, U) – Pseudo-code

• SelectViews(Q, V) : Select the subset of views U which minimizes the EstimateCost

• Exhaustive (E) Approach: Estimate the cost of all possible subsets of V and select the subset of views with the smallest cost.

• Feasible for databases with few attributes

• Greedy Approach: Keep expanding the set of views to use until the estimated cost stops reducing.

• SelectViews(Q,V) – Pseudo code
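The pseudo code is not reproduced in this transcript. The greedy approach as described in the bullet above can be sketched like this, with `estimate_cost` standing in for EstimateCost(Q, U); it is passed in as a callable because its internals are not shown here, and the toy cost table is invented purely to exercise the loop.

```python
def select_views_greedy(query, views, estimate_cost):
    """Keep adding the view that lowers the estimated cost most; stop when no
    addition reduces it. Returns (chosen views, estimated cost)."""
    chosen, best = [], float("inf")
    remaining = list(views)
    while remaining:
        candidate, cost = min(
            ((v, estimate_cost(query, chosen + [v])) for v in remaining),
            key=lambda pair: pair[1],
        )
        if cost >= best:          # estimated cost stopped reducing
            break
        chosen.append(candidate)
        remaining.remove(candidate)
        best = cost
    return chosen, best

# Toy stand-in for EstimateCost: a lookup table of made-up costs per subset.
toy_costs = {
    frozenset({"V1"}): 40, frozenset({"V2"}): 55, frozenset({"V3"}): 90,
    frozenset({"V1", "V2"}): 30, frozenset({"V1", "V3"}): 70,
    frozenset({"V1", "V2", "V3"}): 45,
}
chosen, cost = select_views_greedy("Q", ["V1", "V2", "V3"],
                                   lambda q, u: toy_costs[frozenset(u)])
print(chosen, cost)
```

Unlike the exhaustive approach, this evaluates only a handful of subsets, at the price of possibly missing the best one.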

Page 58

SelectViewsSpherical (SVS)

• Requires the solution of a single linear program.

• Assumes a uniform data distribution; very cheap.

• Fix the score s. Maximize the scoring function of the query, Max(fQ), subject to the inequalities that the scoring function of each view is at most s (fV <= s).

• Selected views: the views whose hyperplanes intersect at the point that maximizes the scoring function.

[Figure: unit square with corners (0,0), (1,0) and (0,1); query vector Q; the hyperplane of each view drawn at score s.]

Page 59

Select Views By Angle (SVA)

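• Select Views By Angle (SVA): sort the views by increasing angle with respect to the query vector.

• Selected views: the views closest to the query, which will result in the minimum running time for the algorithm.

[Figure: unit square with corners (0,0), (1,0) and (0,1); query vector Q and view vectors V1, V2, V3 and V4, ranked 1 to 4 by their angle to Q.]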
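Sorting by angle is straightforward once views and query are written as coefficient vectors. A sketch follows, using the query and the two view functions from the LPTA slide (fQ = 3x1 + 10x2, fV1 = 2x1 + 5x2, fV2 = x1 + 2x2) plus a third view V3 invented here for contrast; the function names are not from the slides.

```python
import math

def angle_to_query(view, query):
    """Angle in radians between a view's coefficient vector and the query's."""
    dot = sum(v * q for v, q in zip(view, query))
    norm = math.sqrt(sum(v * v for v in view)) * math.sqrt(sum(q * q for q in query))
    return math.acos(max(-1.0, min(1.0, dot / norm)))   # clamp against rounding

def select_views_by_angle(views, query, m):
    """Names of the m views with the smallest angle to the query vector."""
    ranked = sorted(views, key=lambda name: angle_to_query(views[name], query))
    return ranked[:m]

views = {"V1": (2, 5), "V2": (1, 2), "V3": (5, 1)}   # V3 is hypothetical
Q = (3, 10)
print(select_views_by_angle(views, Q, m=2))
print({name: round(math.degrees(angle_to_query(vec, Q)), 1) for name, vec in views.items()})
```

V1 forms the smallest angle with Q (about 5 degrees), V2 is next (about 10 degrees), and V3 is far off.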

Page 60

Outline

• LPTA Algorithm

• View Selection Problem

Cost Estimation Framework

View Selection Algorithms

• Experimental Evaluation

Page 61

Experimental Evaluation

• Two types of datasets: real and synthetic (uniform, and Zipf data with varying skew).

• The real dataset contains 30K tuples from a website specializing in automobiles.

• Experiments conducted:
– Performance comparison of LPTA, PREFER and TA
– Performance of LPTA using each of the view selection algorithms
– Scalability of the LPTA algorithm

Page 62

Performance comparison of LPTA, PREFER and TA

[Figures: two plots, one for the uniform dataset (3d) and one for the real dataset (2d).]

Page 63

Conclusions

• Using views for top-k query answering

• LPTA: linear programming adaptation of TA

• View selection problem, cost estimation framework, view selection algorithms

• Experimental evaluation

Page 64

References

• Answering Top-k Queries Using Views: Gautam Das, Dimitrios Gunopulos, Nick Koudas, Dimitris Tsirogiannis

• Optimal Aggregation Algorithms for Middleware: Ronald Fagin, Amnon Lotem, Moni Naor

• aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt