answering top-k queries with multi-dimensional selections: the ranking cube approach dong xin,...

29
Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer Science University of Illinois at Urbana- Champaign VLDB 2006

Upload: rebecca-chase

Post on 08-Jan-2018

218 views

Category:

Documents


0 download

DESCRIPTION

3 Multi-Dimensional Ranking Analysis Consider an online used car database R Type (e.g., sedan, convertible) Maker (e.g., Ford, Hyundai) Color (e.g., red, silver) Price Mileage select top 10 * from R where type = “convertible” order by price + mileage asc select top 10 * from R where type = “convertible” and color = “red” order by price + mileage asc select top 10 * from R where type = “convertible” and color = “red” and maker = “Ford” order by price + mileage asc Roll Up Drill Down

TRANSCRIPT

Page 1: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach

Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

VLDB 2006

Page 2: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

2

Outline

• Introduction• Ranking cube• Answering top-k queries by Ranking Cube• Ranking Fragments• Performance study• Discussion and Conclusions

Page 3: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

3

Multi-Dimensional Ranking Analysis• Consider an online used car database R

Type (e.g., sedan, convertible)Maker (e.g., Ford, Hyundai)Color (e.g., red, silver)PriceMileage

select top 10 * from Rwhere type = “convertible” order by price + mileage asc

select top 10 * from Rwhere type = “convertible” and color = “red”order by price + mileage asc

select top 10 * from Rwhere type = “convertible” and color = “red” and maker = “Ford”order by price + mileage asc

Roll Up

Drill Down

Page 4: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

4

OLAP with Ranking?• Data cube revisited

– Pre-compute multi-dimensional group-bys– Traditional measures: SUM, COUNT, AVG

• Materializing all top-k results is not feasible– Different k values– Various ranking functions

• e.g., order by (price-20k)^2 + (mileage-10k)^2 asc

• Our Proposal: Ranking Cube– Semi-online computation with semi-offline materialization– Support a broad class of ranking functions

Page 5: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

5

More on Rank-Aware Data Cube

• Given a relation R– A1, A2, …, As are selection dimensions– N1,N2,…,Nr are ranking dimensions– {Ai} and {Nj} are not exclusive

• Our goal: efficiently answering top-k queries in a multi-dimensional space

Ranking function f(x) satisfies:

1. Given the sub-domain of x, the extreme point x* can be computed

2. Given a sub-domain of x, the upper and lower bounds of f(x) can be computed

Page 6: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

6

Rank-Aware Query Processing• Rank-aware materialization for linear functions

– Onion [Chang et al, SIGMOD’00], PREFER [Hristidis et al. SIGMOD’01], Robust Indexing [Xin et al. VLDB’06]

• Rank-aware query transformation– Map rank query to range query [Chaudhuri et al. VLDB’99, Bruno et al. TO

DS’02]

• Rank-aware query optimization– TA [Fagin et al. PODS’ 01], RankSQL [Li et al. SIGMOD’05], Boolean+Ran

king [Zhang et al. SIGMOD’06]

• Rank aggregate– RankAgg [Li et al. SIGMOD’06], ObjectFinder [Kaushik et al. SIGMOD’06]

• Rank query with Joins– Ranked Join indices [Tsaparas et al. ICDE’03], Rank-Join [Ilyas et al, VLD

B’03, SIGMOD ’04]

• And more…

Page 7: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

7

What’s New with Ranking Cube• An effort made to enrich the data cube

– Inherit the power of multi-dimensional analysis

– A new rank-aware materialization without assuming particular (e.g., linear) function

• Top-k query processing based on Rank Cube– Transfer a top-k query to a sequence of selection queries

– Block-level access instead of tuple-level access

– No modification needed in DBMS

Page 8: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

8

Outline

• Introduction• Ranking cube• Answering top-k queries by Ranking Cube• Ranking Fragments• Performance study• Discussion and Conclusions

Page 9: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

9

Ranking Cube• Intuition

– Given a ranking function, the ranking cube should be able to:• Quickly locate the most promising data region• How many tuples are there, and which tuples?• Efficient data retrieval

• Approach– Step 1: Create logical block space for rank analysis

• Group geometrically closed tuples into blocks by data partitioning

– Equi-depth– R-tree– Clustering

• Compute (logical) block ID for each block• The logical block space constitutes the basis for data cubing

Page 10: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

10

Ranking Cube (cont.)• Approach (cont.)

– Step 1: Create logical block space for rank analysis• Each tuple is associated with a (logical) block ID

– Step 2: Compute measures in ranking cube• Group-by with selection dimensions• Straight-forward measure: logical block IDs, as well as the lis

t of tuple ID (TID) inside• Alternative measure: Compressed version (will discuss later)

– Step 3: Create physical block space for efficient I/O• The size of the logical block differs in each cuboids due to th

e multi-dimensional selections• Group nearby logical blocks into a physical block for efficient

data retrieval

Page 11: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

11

Constructing Ranking Cube

A1 A2 {B: TID}

1 1 {1: 1,4} {5: 3}

1 2 {11: 2}

.. .. …

Expected Logical Block

Size P

Measure in Ranking Cube

A cell in ranking cube

Generating Logical Block Dimension

A1,A2: Selection Dimensions

N1,N2: Ranking Dimensions

Create Logical Block Space

N1

N2

Data Cubing

Table for data cubing Block table

Page 12: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

12

Constructing Ranking Cube (cont.)A1 A2 {B:TID}

1 1 {1:1,4}

.. .. …

A1 {B:TID}

1 {1: 1, 4}, {5: 3}

.. …

The sizes of TID list in different cuboids are not balanced due to the different

cardinality of each dimension

Physical block: Merge nearby logical blocks

Logical block: Original block

partitions

A1 A2 B’ {B:TID}

1 1 1 {1:1,4} ,{5:3}

1 1 3 {9:17}

.. .. .. …

Physical Block

Page 13: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

13

Outline

• Introduction• Ranking cube• Answering top-k queries by Ranking Cube• Ranking Fragments• Performance study• Discussion and Conclusions

Page 14: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

14

Query Processing (1)

• Data access methodsGet physical block from Ranking Cube:

Clustered index on Cell identifiers (A1, A2, B’)

Get logical block from Block Table:

Clustered index on B

A1 A2 B’ {B:TID}

1 1 1 {1:1,4} ,{5:3}

1 1 3 {9:17}

.. .. .. …

Page 15: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

15

Query Processing (2)Locate the first logical block (b1)The target physical block (t1, t3, t4) is retrieved(t1,t4) is returned, t3 is buffered

Query processing works on logical block space

Data accessing works on physical block space

Ranking Cube maintains the mapping between logical block and physical block

S list: maintains the current top answers

H list: maintains the best possible unseen answers

Locate the second logical block (b5)The target physical block (t1, t3, t4) is identifiedt3 was buffered, thus is directed returned

Select top 2 * from R where A1=1 and A2=1 order by N1+N2 asc

Page 16: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

16

Query Processing (3)• Determine next logical blocks to be retrieved

– First logical block: analyzed by ranking function– Continuing logical blocks

• Found in neighboring blocks (for convex functions)• Decompose the space and analyze each of them (for other

functions)

Ranking Function:

N1+N2

First Block

Second Block

Ranking Function:

(N1-0.5)^2+(N2-0.5)^2

First Block

Second Block

Page 17: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

17

Outline

• Introduction• Ranking cube• Answering top-k queries by Ranking Cube• Ranking Fragments• Performance study• Discussion and Conclusions

Page 18: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

18

Ranking Fragments (1)

• Curse of dimensionality?

ABCD

ABC ABD ACD BCD

AC

BC

AD

BD

CD

A DB C

ABPartition dimensions into several groups

Materialize low dimensional cuboids offline

Assembly high dimensional cuboids online

Mining Cube Approach

[Li et al, VLDB’04]

Page 19: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

19

Ranking Fragments (2)

Materialized:

Cuboids A1, Cuboids A2

To Assemble:

Cuboids A1A2

Do not assemble the whole cuboids

Assemble the required cells only

Requested Cell in Cuboids A1A2

Cuboids A1:

Request (A1=1, B=b1), return { t1, t4, t10,…}

Cuboids A2: Request (A2=1, B=b1), return {t1, t4, t3,…}

Merge two lists for cuboids A1A2:

Request (A1=1,A2=1, B=b1), return { t1, t4,}

Page 20: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

20

• Decouple the cubing and partitioning modules• Unified logical block space for all cuboids

– Reduces computation and space comparing with m-dim indices– Makes the online fragment assembly easier– Advanced partitioning methods for high-dimensional and structural

data

• Compressing TID list – Lossless compression: e.g., dictionary encoding, null suppression– Lossy compression: e.g., bloom filter– High-level summary: e.g., count (mean) in each logical block– Compressing across cuboids: e.g., correlation between cells

• Block-level data access instead of tuple-level access

Ranking Cube: Beyond the index

Page 21: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

21

Outline

• Introduction• Ranking cube• Answering top-k queries by Ranking Cube• Ranking Fragments• Performance study• Discussion and Conclusions

Page 22: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

22

Experimental Results

• Performance study– Baseline: SQL Server

• Index on each dimension– Query transformation [Chaudhuri et al. VLDB’99]

• Transform a ranked query to a range selection query• Multi-dimensional index

– Ranking cube (fragment)• Index on cube cells (cuboids)• Index on block ids (block tables)

Page 23: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

23

Experiment Setting• Implementation details

– C# + MS SQL Server 2005– Store all ranking cubes, block tables in SQL Server

• Synthetic data sets

• Real data set: Forest CoverType – 12 selection dimensions, 3 ranking dimensions, 3.5M tu

ples

Page 24: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

24

Execution Time w.r.t. K

Synthetic data

Default parameter setting

Page 25: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

25

# Dimensions in Ranking Function

Synthetic data with 4 ranking dimensions

Partitioning was built on all 4 ranking dimensions

The number of ranking attributes in queries are varied from 2 to 4

Page 26: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

26

Number of Data Tuples

Vary the size of the database from 1M to 10M

Very promising performance on large datasets

Page 27: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

27

Ranking Fragments

Forest CoverType data

Partition selection dimensions into groups with size 3

Build ranking fragments on each group

Vary the fragment size

Page 28: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

28

Space Usage

1. Space usage grows linearly with number of selection dimensions

2. Most space is used to store the cell identifiers in the relational table

3. The space usage can be greatly reduced by storing the ranking cube out of the relational table

Build ranking fragments with group size 2

Page 29: Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach Dong Xin, Jiawei Han, Hong Cheng, Xiaolei Li Department of Computer

29

Conclusions and Future work• OLAP with Ranking

– Ranking Cube as semi-offline materialization– Ranked query processing by semi-online computation

• Current status– Extended to multi-relational ranked queries using

multi-rank-cube

• Future work– Apply compression techniques– Exploit and compare different partitioning strategies– Support more query types