cse 707: modern database system seminar lecture 1

CSE 707: Modern database system seminar

Lecture 2: Data Cube

Zhuoyue Zhao

9/8/2021

Before the lecture

◼ Presentations

– Background, problem definition, solution and experiments (30 – 50 min or 20 – 40 slides)

– Q&A and discussion

– Post a private message on Piazza with the slides; they will be posted on the course website

• Share the slides before every Wednesday if possible

◼ Sign up for your talk if you haven’t already

– There are 3 slots left

◼ Paper summary

– Submit via Piazza

– Make it a PDF attachment

• LaTeX or print to PDF in any text editor (e.g., Word)


Today’s agenda

◼ Required reading

– Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. In SIGMOD ‘96.

◼ Submit paper summary of the required reading by 9/14 11:59 pm

– Post a private note in assignment_9/8 folder in Piazza.


Data warehousing

◼ Star schema

– A large fact table connected to many small dimension tables via referential constraints

• Can be denormalized as a single table (conceptually, not necessarily materialized)

◼ SPJA (Select-Project-Join-Aggregation) queries on star schema

Figure: star schema and its denormalized schema, with columns:

s_key s_city s_state s_region t_key t_day t_month t_year c_key p_key cost quantity discount price …

SPJA query on the star schema

SELECT s_state AS state, t_year AS year, t_month AS month,
       SUM(price * quantity * (1 - discount)) AS revenue
FROM sales, store, time
WHERE t_year >= 2016 AND t_year <= 2020
  AND sales.store_key = store.store_key
  AND sales.time_key = time.time_key
GROUP BY s_state, t_year, t_month;

Data warehousing (cont’d)

◼ SPJA (Select-Project-Join-Aggregation) queries on star schema

– => Aggregation queries on the fact table/denormalized schema

Aggregation query on denormalized schema

SELECT s_state AS state, t_year AS year, t_month AS month,
       SUM(price * quantity * (1 - discount)) AS revenue
FROM denormalized_sales
WHERE t_year >= 2016 AND t_year <= 2020
GROUP BY s_state, t_year, t_month;

Materialized aggregation queries

◼ Fact tables/denormalized tables can be huge in size

– DBMS may not be able to answer aggregation queries very quickly

◼ Materialization helps reduce latency

– Queries may be answered using the views (which are smaller than the original table)

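Materialized view on revenue by state and month

CREATE MATERIALIZED VIEW revenue_by_state_month AS
SELECT s_state AS state, t_year AS year, t_month AS month,
       SUM(price * quantity * (1 - discount)) AS revenue
FROM denormalized_sales
GROUP BY s_state, t_year, t_month;

state  year  month  revenue
NY     2015  1
NY     2016  12
NJ     2016  1
NY     2021  1
…

Query against the view

SELECT *
FROM revenue_by_state_month
WHERE year >= 2016 AND year <= 2020;

state  year  month  revenue
NY     2016  12
NJ     2016  1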
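A minimal sketch of the same idea using Python's built-in sqlite3 module. SQLite has no CREATE MATERIALIZED VIEW, so the sketch emulates one with CREATE TABLE ... AS SELECT. The rows in denormalized_sales are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE denormalized_sales ("
    "s_state TEXT, t_year INTEGER, t_month INTEGER, "
    "price REAL, quantity INTEGER, discount REAL)"
)
# Hypothetical rows, made up for this sketch only.
rows = [
    ("NY", 2015, 1, 10.0, 2, 0.0),
    ("NY", 2016, 12, 20.0, 1, 0.5),
    ("NY", 2016, 12, 5.0, 4, 0.0),
    ("NJ", 2016, 1, 8.0, 3, 0.0),
    ("NY", 2021, 1, 7.0, 1, 0.0),
]
conn.executemany("INSERT INTO denormalized_sales VALUES (?, ?, ?, ?, ?, ?)", rows)

AGG = """
SELECT s_state AS state, t_year AS year, t_month AS month,
       SUM(price * quantity * (1 - discount)) AS revenue
FROM denormalized_sales
{where}
GROUP BY s_state, t_year, t_month
"""

# Emulated materialized view: store the aggregate result as a plain table.
conn.execute("CREATE TABLE revenue_by_state_month AS " + AGG.format(where=""))

# The same question answered from the base table and from the summary table.
direct = conn.execute(
    AGG.format(where="WHERE t_year >= 2016 AND t_year <= 2020")
    + " ORDER BY state, year, month"
).fetchall()
via_view = conn.execute(
    "SELECT * FROM revenue_by_state_month "
    "WHERE year >= 2016 AND year <= 2020 ORDER BY state, year, month"
).fetchall()

print(direct == via_view, via_view)
```

Both queries return the same rows, but the second one scans the (smaller) summary table instead of the fact table.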

A running example


The fact table in TPC-D

◼ Fact table: lineitem (part, supplier, customer, sales, …)

– Dimension columns: part, supplier, customer

– Measure column: sales

◼ Interested in sales grouped by all possible combinations of dimensions

1. part, supplier, customer (6M rows)

2. part, customer (6M rows)

3. part, supplier (0.8M rows)

4. supplier, customer (6M rows)

5. part (0.2M rows)

6. supplier (0.01M rows)

7. customer (0.1M rows)

8. none (1 row)

part supplier customer sales
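The eight group-bys above are just the subsets of the three dimension columns. A small sketch that enumerates them (the variable names are chosen here for illustration):

```python
from itertools import combinations

dimensions = ("part", "supplier", "customer")

# Every subset of the dimension columns is one group-by query (view),
# listed from the most detailed grouping down to "none".
views = [
    combo
    for size in range(len(dimensions), -1, -1)
    for combo in combinations(dimensions, size)
]

for view in views:
    print(", ".join(view) if view else "none")
```

With D dimension columns this produces 2^D views, which is why materializing all of them quickly becomes expensive.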

A running example (cont’d)

◼ Assuming query cost = size of view/table & storage cost = the number of rows

– Some of the queries may be answered using others

– Not necessary to materialize all views – some may be large and may not reduce query cost a lot

• e.g., (part, customer) may be answered using (part, supplier, customer) with the same cost

                  Full              No                Materializing everything
                  materialization   materialization   except 2. and 4.
Storage cost      19M               6M                7M
Avg. query cost   2.39M             6M                2.39M
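The numbers in the table can be reproduced with a short sketch under the stated cost model (query cost = rows scanned in the smallest usable materialized view). Row counts are taken from the running example; the function and variable names are illustrative.

```python
# Row counts from the running example; frozenset() is the "none" view.
SIZES = {
    frozenset({"part", "supplier", "customer"}): 6_000_000,
    frozenset({"part", "customer"}): 6_000_000,
    frozenset({"part", "supplier"}): 800_000,
    frozenset({"supplier", "customer"}): 6_000_000,
    frozenset({"part"}): 200_000,
    frozenset({"supplier"}): 10_000,
    frozenset({"customer"}): 100_000,
    frozenset(): 1,
}
TOP = frozenset({"part", "supplier", "customer"})


def storage_cost(materialized):
    """Total number of rows stored."""
    return sum(SIZES[v] for v in materialized)


def avg_query_cost(materialized):
    """Average, over all 8 group-by queries, of the rows scanned.

    A query can use any materialized view whose group-by columns are a
    superset of its own; it picks the smallest such view.
    """
    total = 0
    for q in SIZES:
        total += min(SIZES[v] for v in materialized if q <= v)
    return total / len(SIZES)


full = set(SIZES)
none = {TOP}
partial = full - {
    frozenset({"part", "customer"}),
    frozenset({"supplier", "customer"}),
}

for name, s in [("full", full), ("none", none), ("partial", partial)]:
    print(f"{name}: storage {storage_cost(s) / 1e6:.2f}M, "
          f"avg cost {avg_query_cost(s) / 1e6:.2f}M")
```

This prints 19.11M / 2.39M for full materialization, 6.00M / 6.00M for no materialization, and 7.11M / 2.39M when views 2 and 4 are skipped.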


How many views do we need to materialize to ensure reasonable performance?

Problem statement: Given a large fact table with D dimension columns and one measure column, as well as a space budget k, we want to decide which views to materialize in order to minimize the average query cost of all possible group-by aggregation queries over the table.

Assumptions: Linear cost model. No index over dimension columns, so selections must be answered by scanning the whole table/view.

Some questions to keep in mind

1. How to define the space budget?

2. How to define the average query cost? (arithmetic average or weighted average)

3. How to determine the costs without running all the queries?

4. How to efficiently search in the space of solutions ($2^D$ in size)?

5. How to handle more than one measure column?

6. What if the fact tables have irrelevant columns that are not of interest?

7. …

The lattice framework

◼ Let $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_{2^D}\}$ be the set of queries (views)

– e.g., $Q_1 = (\text{part}, \text{customer})$, denoted by its group-by column(s)

◼ Define the partial order $(\mathcal{Q}, \preceq)$

– $Q_1 \preceq Q_2$ iff $Q_1$ may be answered with only the results of $Q_2$

• $Q_1 = (\text{part}, \text{customer})$, $Q_2 = (\text{part})$ $\Rightarrow Q_2 \preceq Q_1$

– Some queries are not comparable

• $Q_1 = (\text{part}, \text{customer})$, $Q_2 = (\text{part}, \text{supplier})$ $\Rightarrow Q_1 \not\preceq Q_2 \wedge Q_2 \not\preceq Q_1$

– Can be represented as a lattice

• Adjacent queries in the lattice diagram differ in one group-by column

• Lattice for non-hierarchical dimensions is always a hyper-cube

Figure: lattice diagram for the running example
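For non-hierarchical dimensions the partial order is just set inclusion on group-by columns. A minimal sketch for the running example (function names are illustrative):

```python
from itertools import combinations

DIMS = ("part", "supplier", "customer")
VIEWS = [frozenset(c) for n in range(len(DIMS) + 1) for c in combinations(DIMS, n)]


def precedes(q1, q2):
    """q1 ≼ q2: q1 can be answered from the result of q2 alone."""
    return q1 <= q2


def comparable(q1, q2):
    return precedes(q1, q2) or precedes(q2, q1)


# Lattice-diagram edges: pairs that differ in exactly one group-by column.
edges = [(lo, hi) for lo in VIEWS for hi in VIEWS if lo < hi and len(hi - lo) == 1]

pc = frozenset({"part", "customer"})
ps = frozenset({"part", "supplier"})
print(precedes(frozenset({"part"}), pc))  # (part) ≼ (part, customer)
print(comparable(pc, ps))                 # neither can answer the other
print(len(edges))                         # edge count of the 3-dimensional cube
```

The three printed values are True, False and 12; the 12 edges are those of a 3-dimensional hypercube.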

Hierarchical dimensions

◼ Some dimensions of the table may be hierarchical

– Consisting of more than one column

• E.g., time: (day, month, year, week)

– Complex hierarchy among the columns – the lattice may not be a hyper-cube

Figure: lattice diagram of a hierarchical dimension

Composite lattice for multiple dimensions

◼ Denote queries as $n$-ary tuples of dimensions

– $Q_1 = (a_1, a_2, \ldots, a_n)$, $Q_2 = (b_1, b_2, \ldots, b_n)$

– $Q_1 \preceq Q_2$ iff $\forall i \in [n],\ a_i \preceq b_i$

Figure: lattice diagrams of the customer dimension (nodes: c (customer), n (nation), none), the part dimension (nodes: p (part), s (size), t (type), none), and the composite lattice of (customer, part)
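A sketch of the component-wise order on the (customer, part) example. The diagrams are not reproduced here, so the hierarchy edges below are an assumption: customer rolls up to nation, and part rolls up to either size or type, which are not comparable with each other.

```python
from itertools import product

# Assumed hierarchies: each level maps to the set of levels it can answer
# (including itself). These edges are not given in the text.
CUSTOMER = {"c": {"c", "n", "none"}, "n": {"n", "none"}, "none": {"none"}}
PART = {
    "p": {"p", "s", "t", "none"},
    "s": {"s", "none"},
    "t": {"t", "none"},
    "none": {"none"},
}
HIERARCHIES = (CUSTOMER, PART)

# A query is one level per dimension, e.g. ("n", "s") groups by nation and size.
QUERIES = list(product(*HIERARCHIES))


def precedes(q1, q2):
    """q1 ≼ q2 iff every component of q1 is answerable from that of q2."""
    return all(a in h[b] for a, b, h in zip(q1, q2, HIERARCHIES))


print(len(QUERIES))                        # 3 customer levels x 4 part levels
print(precedes(("n", "s"), ("c", "p")))    # answerable from the top view
print(precedes(("c", "s"), ("c", "t")))    # size and type are not comparable
```

Under these assumed hierarchies the composite lattice has 12 nodes, and the last check is False because size cannot be derived from type.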

Optimizing view selection via data-cube lattice

◼ Suppose the space budget k is the number of views we want to create

– In addition to the top view (which includes all dimensions)

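◼ Average query cost $\bar{c}(\mathcal{Q}|S) = \frac{\sum_{Q \in \mathcal{Q}} c(Q|S)}{|\mathcal{Q}|}$ given the set of selected views $S$

– $c(Q|S) = |Q'|$ where $Q'$ is the smallest view s.t. $Q' \in S \wedge Q \preceq Q'$

◼ Goal: minimize $\bar{c}(\mathcal{Q}|S)$ subject to $|S| = k + 1 \wedge \text{top view} \in S$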

◼ Unfortunately, the problem is NP-hard


A greedy algorithm for view selection

◼ Greedy algorithm

S = {top view}
for i = 1 to k do
    v = argmax_{v ∉ S} B(v, S)
    S = S ⋃ {v}
return S

◼ Define the benefit of adding $v$ to $S$ as the total cost reduction of all queries preceding $v$ in the partial order $(\mathcal{Q}, \preceq)$.

1. $\forall Q \preceq v,\ B_Q(v, S) \triangleq \max(0,\ c(Q|S) - |v|)$

2. $B(v, S) \triangleq \sum_{Q \preceq v} B_Q(v, S)$

An example lattice diagram (view sizes in parentheses), k=3

a (100)

b (50) c (75)

d (20) e (30) f (40)

g (1) h (10)
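A runnable sketch of the greedy selection on the example lattice. The view sizes are the ones shown above; the lattice edges are an assumption (taken to match the corresponding example in the required reading), since the diagram itself is not available in this text.

```python
# View sizes come from the slide; CHILDREN (which views each view answers
# directly) is an assumption because the diagram is not available here.
SIZE = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
CHILDREN = {
    "a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
    "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": [],
}
TOP = "a"


def answerable_from(v):
    """All views Q with Q ≼ v (v itself and everything below it)."""
    seen, stack = {v}, [v]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


BELOW = {v: answerable_from(v) for v in SIZE}


def cost(q, selected):
    """c(Q|S): size of the smallest selected view that can answer q."""
    return min(SIZE[v] for v in selected if q in BELOW[v])


def benefit(v, selected):
    """B(v, S): total cost reduction over all Q ≼ v if v is added to S."""
    return sum(max(0, cost(q, selected) - SIZE[v]) for q in BELOW[v])


def greedy(k):
    selected = [TOP]
    for _ in range(k):
        candidates = [v for v in SIZE if v not in selected]
        selected.append(max(candidates, key=lambda v: benefit(v, selected)))
    return selected


chosen = greedy(3)
total_cost = sum(cost(q, chosen) for q in SIZE)
print(chosen, total_cost)
```

Under the assumed edges, the first pick is b (benefit 5 × 50 = 250), followed by f and d, and the total query cost drops from 800 to 420.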



A bad case for the greedy algorithm

k=2

The greedy algorithm is a $(1 - 1/e)$-approximation: the total benefit of its selection is at least $(1 - 1/e) \approx 63\%$ of the benefit of the optimal selection.
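The guarantee is usually stated on benefit (total cost reduction) in a form that depends on k. The symbols $B_{\mathrm{greedy}}$ and $B_{\mathrm{opt}}$ are shorthand introduced here, and the exact statement is worth checking against the required reading.

```latex
\frac{B_{\mathrm{greedy}}}{B_{\mathrm{opt}}}
  \;\ge\; 1 - \left(\frac{k-1}{k}\right)^{k}
  \;\ge\; 1 - \frac{1}{e} \approx 0.63
```

For k = 2 the middle term evaluates to $1 - (1/2)^2 = 3/4$, and it decreases toward $1 - 1/e$ as k grows.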

Extensions to the basic model

◼ Queries are not likely to be asked with the same frequency

– Weight the benefits when computing 𝐵 𝑣, 𝑆 using query probabilities

– How to find the frequencies?

◼ Space budget = fixed amount of space, instead of the number of views

– a variant of the knapsack problem

– use the benefit per unit space of a view as the selection criterion

– may be arbitrarily bad if a large view is excluded for slightly lower benefit per unit space

• The algorithm achieves at least $(1 - 1/e - f)$ of the optimal benefit if no view takes more than $fS$ space, i.e., a fraction $f$ of the total space $S$.
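A sketch of the space-budget variant on the earlier example lattice (sizes a=100 through h=10). As before, the lattice edges are assumed. Skipping views that no longer fit and stopping at zero benefit are simplifications chosen for this sketch, not necessarily the exact rule used in the paper.

```python
# Sizes are from the example lattice; the edges are assumed because the
# diagram itself is not available here.
SIZE = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
CHILDREN = {
    "a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
    "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": [],
}


def answerable_from(v):
    """All views that can be answered from v, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


BELOW = {v: answerable_from(v) for v in SIZE}


def cost(q, selected):
    return min(SIZE[v] for v in selected if q in BELOW[v])


def benefit(v, selected):
    return sum(max(0, cost(q, selected) - SIZE[v]) for q in BELOW[v])


def greedy_by_space(budget):
    """Pick views by benefit per unit space until nothing else fits.

    `budget` counts space for extra views only; the top view is always kept.
    """
    selected, remaining = ["a"], budget
    while True:
        fits = [v for v in SIZE if v not in selected and SIZE[v] <= remaining]
        if not fits:
            return selected
        best = max(fits, key=lambda v: benefit(v, selected) / SIZE[v])
        if benefit(best, selected) == 0:
            return selected
        selected.append(best)
        remaining -= SIZE[best]


picked = greedy_by_space(60)
used = sum(SIZE[v] for v in picked[1:])
print(picked, used, sum(cost(q, picked) for q in SIZE))
```

With a budget of 60 the sketch picks the small views g, h and d and never reaches b, which illustrates how favoring benefit per unit space can pass over a large but valuable view.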


How to estimate the costs of views

◼ Estimating the number of distinct groups

– Use samples of the raw data or of views to estimate the sizes of preceding views in the lattice

– Many different estimators for distinct counts exist [1]

[1] Peter J. Haas, et al. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In VLDB ’95.
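As one concrete illustration, the sketch below implements Chao's estimator, a classical sample-based distinct-count estimator from the statistics literature. It is only an example of the kind of estimator meant here, not necessarily the one used in the lecture or the paper.

```python
from collections import Counter


def chao_estimate(sample):
    """Chao's estimator: d + f1^2 / (2 * f2).

    d is the number of distinct values seen in the sample, f1 the number
    of values seen exactly once, and f2 the number seen exactly twice.
    """
    value_counts = Counter(sample)
    freq_of_freq = Counter(value_counts.values())
    d = len(value_counts)
    f1, f2 = freq_of_freq[1], freq_of_freq[2]
    if f2 == 0:
        # Bias-corrected form, used when no value occurs exactly twice.
        return d + f1 * (f1 - 1) / 2
    return d + f1 * f1 / (2 * f2)


# Hypothetical sample of group-by keys: 5 distinct values observed,
# two of them once (3, 4) and two of them twice (1, 2).
estimate = chao_estimate([1, 1, 2, 2, 3, 4, 5, 5, 5])
print(estimate)
```

To estimate the size of a view, the sample elements would be the tuples of that view's group-by columns drawn from the raw data.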

How to choose k?

◼ The hypercube case

– If all dimensions have equal domain size $r$ and the top view has $m$ rows

• assuming the data are distributed evenly across groups

• any view with $i$ dimensions has $r^i$ possible groups

when $r^i \ge m$, some of the groups will definitely be missing => less benefit

when $r^i < m$, most of the groups exist => more benefit

set $k = \log_r m$
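The cliff around $i = \log_r m$ can be seen numerically. The sketch assumes rows fall uniformly at random into the $r^i$ possible groups (a stand-in for "distributed evenly") and uses the standard expected-occupancy formula; the values of r and m are hypothetical.

```python
import math


def expected_groups(r, i, m):
    """Expected number of non-empty groups for a view with i dimensions,
    assuming m rows spread uniformly at random over r**i possible groups."""
    cells = r ** i
    return cells * (1 - (1 - 1 / cells) ** m)


r, m = 10, 1000  # hypothetical domain size and top-view row count
cliff = math.log(m, r)
print("cliff at i ~", round(cliff, 2))
for i in range(1, 7):
    print(i, r ** i, round(expected_groups(r, i, m), 1))
```

Below the cliff a view has about $r^i$ rows, far fewer than m, so materializing it pays off; above it the view has nearly m rows and is barely smaller than the top view.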


How to choose k?

◼ The hypercube case

– If there are unequal domain sizes

• similar situation but the cliff is distributed among ranks

◼ Space and time optimal solutions

– See the full paper for details


Evaluation on TPC-D

Figures: greedy order of view selections; time and space trade-off (cost vs. k)

Summary

◼ A general lattice framework for analyzing and optimizing materialized view selection in data cubes

◼ A greedy algorithm for data-cube view selection that achieves at least (1 - 1/e) of the optimal benefit (reduction in query cost) given a fixed space budget

◼ Theoretical analysis of the choice of k in hyper-cube lattices of data cubes

◼ Works well on the TPC-D benchmark


Next time

◼ Reading for 9/15

– Michael Stonebraker, et al. C-Store: A Column-oriented DBMS. In VLDB ‘05.

– Presenter: Songtao Wei

◼ This week’s paper summary is due on 9/14 at 11:59 pm

– Post a private note with PDF in assignment_9/8


https://cse.buffalo.edu/~zzhao35/teaching/cse707_fall21/cstore.pdf
