CSE 707: Modern database system seminar
Lecture 2: Data Cube
Zhuoyue Zhao
9/8/2021

Before the lecture
• Presentations
  - Background, problem definition, solution and experiments (30-50 min or 20-40 slides)
  - Q&A and discussion
  - Post a private message on Piazza with the slides; they will be posted on the course website
    * Share the slides before every Wednesday if possible
• Sign up for your talk if you haven't
  - There are 3 slots left
• Paper summary
  - Submit via Piazza
  - Make it a PDF attachment
    * LaTeX or print to PDF in any text editor (e.g., Word)

Today's agenda
• Required reading
  - Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman. Implementing Data Cubes Efficiently. In SIGMOD '96.
• Submit paper summary of the required reading by 9/14 11:59 pm
  - Post a private note in the assignment_9/8 folder on Piazza.

Data warehousing
• Star schema
  - A large fact table connected to many small dimension tables via referential constraints
    * Can be denormalized as a single table (conceptually, not necessarily materialized)
• SPJA (Select-Project-Join-Aggregation) queries on star schema

[Figure: star schema]
Denormalized schema columns: s_key, s_city, s_state, s_region, t_key, t_day, t_month, t_year, c_key, p_key, cost, quantity, discount, price, ...

SPJA query on the star schema:
    SELECT s_state AS state, t_year AS year, t_month AS month,
           SUM(price * quantity * (1 - discount)) AS revenue
    FROM sales, store, time
    WHERE t_year >= 2016 AND t_year <= 2020
      AND sales.store_key = store.store_key
      AND sales.time_key = time.time_key
    GROUP BY s_state, t_year, t_month;

Data warehousing (cont'd)
• SPJA (Select-Project-Join-Aggregation) queries on star schema
  - => Aggregation queries on the fact table/denormalized schema

Aggregation query on denormalized schema:
    SELECT s_state AS state, t_year AS year, t_month AS month,
           SUM(price * quantity * (1 - discount)) AS revenue
    FROM denormalized_sales
    WHERE t_year >= 2016 AND t_year <= 2020
    GROUP BY s_state, t_year, t_month;
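
To make the equivalence concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The tiny tables, their rows, and the key column names are made up for illustration; the denormalized table is modeled as a plain (non-materialized) view over the joins. Both queries should return the same groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical miniature star schema (illustrative rows only).
CREATE TABLE store (store_key INTEGER PRIMARY KEY, s_state TEXT);
CREATE TABLE time  (time_key  INTEGER PRIMARY KEY, t_year INTEGER, t_month INTEGER);
CREATE TABLE sales (store_key INTEGER, time_key INTEGER,
                    price REAL, quantity INTEGER, discount REAL);
INSERT INTO store VALUES (1, 'NY'), (2, 'NJ');
INSERT INTO time  VALUES (1, 2015, 1), (2, 2016, 12), (3, 2016, 1);
INSERT INTO sales VALUES (1, 1, 10.0, 2, 0.0),
                         (1, 2, 20.0, 1, 0.5),
                         (1, 2,  5.0, 4, 0.0),
                         (2, 3,  8.0, 5, 0.25);
-- Conceptual denormalization: join every dimension into the fact table.
CREATE VIEW denormalized_sales AS
SELECT price, quantity, discount, s_state, t_year, t_month
FROM sales JOIN store ON sales.store_key = store.store_key
           JOIN time  ON sales.time_key  = time.time_key;
""")

# SPJA query on the star schema (ORDER BY added only to make output stable).
star_rows = conn.execute("""
SELECT s_state AS state, t_year AS year, t_month AS month,
       SUM(price * quantity * (1 - discount)) AS revenue
FROM sales, store, time
WHERE t_year >= 2016 AND t_year <= 2020
  AND sales.store_key = store.store_key
  AND sales.time_key = time.time_key
GROUP BY s_state, t_year, t_month
ORDER BY state, year, month
""").fetchall()

# The same question asked as a pure aggregation over the denormalized schema.
flat_rows = conn.execute("""
SELECT s_state AS state, t_year AS year, t_month AS month,
       SUM(price * quantity * (1 - discount)) AS revenue
FROM denormalized_sales
WHERE t_year >= 2016 AND t_year <= 2020
GROUP BY s_state, t_year, t_month
ORDER BY state, year, month
""").fetchall()

print(star_rows == flat_rows, flat_rows)
```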

Materialized aggregation queries
• Fact tables/denormalized tables can be huge
  - The DBMS may not be able to answer aggregation queries quickly
• Materialization helps reduce latency
  - Queries may be answered using the views (which are smaller than the original table)

Materialized view on revenue by state and month:
    CREATE MATERIALIZED VIEW revenue_by_state_month AS
    SELECT s_state AS state, t_year AS year, t_month AS month,
           SUM(price * quantity * (1 - discount)) AS revenue
    FROM denormalized_sales
    GROUP BY s_state, t_year, t_month;

| state | year | month | revenue |
|---|---|---|---|
| NY | 2015 | 1 | |
| NY | 2016 | 12 | |
| NJ | 2016 | 1 | |
| NY | 2021 | 1 | |
| ... | | | |

Query against the view:
    SELECT *
    FROM revenue_by_state_month
    WHERE year >= 2016 AND year <= 2020;

Rows returned by the query:

| state | year | month | revenue |
|---|---|---|---|
| NY | 2016 | 12 | |
| NJ | 2016 | 1 | |
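
A small sketch of the same idea with Python's `sqlite3`. SQLite has no `CREATE MATERIALIZED VIEW`, so the view is emulated here with `CREATE TABLE ... AS SELECT` (an assumption of this sketch, with no refresh logic). The rows are invented; the point is only that the filtered query over the smaller aggregate matches the query over the base table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical denormalized fact table with a handful of rows.
CREATE TABLE denormalized_sales (s_state TEXT, t_year INTEGER, t_month INTEGER,
                                 price REAL, quantity INTEGER, discount REAL);
INSERT INTO denormalized_sales VALUES
  ('NY', 2015,  1, 10.0, 2, 0.0),
  ('NY', 2016, 12, 20.0, 1, 0.5),
  ('NY', 2016, 12,  5.0, 4, 0.0),
  ('NJ', 2016,  1,  8.0, 5, 0.25),
  ('NY', 2021,  1,  4.0, 1, 0.0);
-- Stand-in for a materialized view: store the aggregate result as a table.
CREATE TABLE revenue_by_state_month AS
SELECT s_state AS state, t_year AS year, t_month AS month,
       SUM(price * quantity * (1 - discount)) AS revenue
FROM denormalized_sales
GROUP BY s_state, t_year, t_month;
""")

# Answer the query from the (smaller) precomputed aggregate.
from_view = conn.execute("""
SELECT * FROM revenue_by_state_month
WHERE year >= 2016 AND year <= 2020
ORDER BY state, year, month
""").fetchall()

# Answer the same query by scanning and aggregating the base table.
from_base = conn.execute("""
SELECT s_state, t_year, t_month, SUM(price * quantity * (1 - discount))
FROM denormalized_sales
WHERE t_year >= 2016 AND t_year <= 2020
GROUP BY s_state, t_year, t_month
ORDER BY s_state, t_year, t_month
""").fetchall()

view_size = conn.execute("SELECT COUNT(*) FROM revenue_by_state_month").fetchone()[0]
base_size = conn.execute("SELECT COUNT(*) FROM denormalized_sales").fetchone()[0]
print(from_view == from_base, view_size, base_size)
```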

A running example
• Fact table: lineitem (part, supplier, customer, sales, ...)
  - Dimension columns: part, supplier, customer
  - Measure column: sales
• Interested in sales GROUPED BY all possible combinations of dimensions
  1. part, supplier, customer (6M rows)
  2. part, customer (6M rows)
  3. part, supplier (0.8M rows)
  4. supplier, customer (6M rows)
  5. part (0.2M rows)
  6. supplier (0.01M rows)
  7. customer (0.1M rows)
  8. none (1 row)

[Figure: the fact table in TPC-D, with columns part, supplier, customer, sales]

A running example (cont'd)
• Assuming query cost = size of the view/table scanned and storage cost = the number of rows
  - Some of the queries may be answered using others
  - Not necessary to materialize all views: some may be large and may not reduce query cost a lot
    * e.g., (part, customer) may be answered using (part, supplier, customer) with the same cost

| | Full materialization | No materialization | Materializing everything except 2. and 4. |
|---|---|---|---|
| Storage cost | 19M | 6M | 7M |
| Avg. query cost | 2.39M | 6M | 2.39M |
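
The numbers in the table can be reproduced with a short sketch under the stated cost model (query cost = rows of the smallest materialized view that can answer it). Views are represented by their sets of group-by columns; the function names are illustrative. Sizes are in millions of rows, so the slide's 19M and 7M are rounded.

```python
DIMS = ("part", "supplier", "customer")

# Row counts in millions, taken from the running example.
SIZE = {
    frozenset(DIMS): 6.0,
    frozenset({"part", "customer"}): 6.0,
    frozenset({"part", "supplier"}): 0.8,
    frozenset({"supplier", "customer"}): 6.0,
    frozenset({"part"}): 0.2,
    frozenset({"supplier"}): 0.01,
    frozenset({"customer"}): 0.1,
    frozenset(): 0.000001,  # "none": a single row
}
TOP = frozenset(DIMS)


def storage_cost(selected):
    """Total rows stored across the materialized views."""
    return sum(SIZE[v] for v in selected)


def query_cost(q, selected):
    """Rows scanned: the smallest materialized view whose columns include q's."""
    return min(SIZE[v] for v in selected if q <= v)


def avg_query_cost(selected):
    """Unweighted average over all 8 group-by queries."""
    return sum(query_cost(q, selected) for q in SIZE) / len(SIZE)


full = set(SIZE)
none = {TOP}
partial = full - {frozenset({"part", "customer"}),
                  frozenset({"supplier", "customer"})}

for name, views in [("full", full), ("none", none), ("partial", partial)]:
    print(name, round(storage_cost(views), 2), round(avg_query_cost(views), 2))
```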

How many views do we need to materialize to ensure reasonable performance?

Problem statement: Given a large fact table with $D$ dimension columns and one measure column, as well as a space budget $k$, we want to decide which views to materialize in order to minimize the average query cost of all possible group-by aggregation queries over the table.

Assumptions: Linear cost model. No index over dimension columns, so selections must be answered by scanning the whole table/view.

Some questions to keep in mind
1. How to define the space budget?
2. How to define the average query cost? (arithmetic average or weighted average)
3. How to determine the costs without running all the queries?
4. How to efficiently search the space of solutions (subsets of the $2^D$ views)?
5. How to handle more than one measure column?
6. What if the fact table has irrelevant columns that are not of interest?
7. ...

The lattice framework
• Let $\mathcal{Q} = \{q_1, q_2, \ldots, q_{2^D}\}$ be the set of queries (views)
  - e.g., $q_1 = (\text{part}, \text{customer})$, denoted by its group-by column(s)
• Define the partial order $(\mathcal{Q}, \preceq)$
  - $q_1 \preceq q_2$ iff $q_1$ may be answered with only the results of $q_2$
    * $q_1 = (\text{part}, \text{customer}),\ q_2 = (\text{part}) \Rightarrow q_2 \preceq q_1$
  - Some queries are not comparable
    * $q_1 = (\text{part}, \text{customer}),\ q_2 = (\text{part}, \text{supplier}) \Rightarrow q_1 \not\preceq q_2 \wedge q_2 \not\preceq q_1$
  - Can be represented as a lattice
    * Adjacent queries in the lattice diagram differ in one group-by column
    * The lattice for non-hierarchical dimensions is always a hypercube

[Figure: lattice diagram for the running example]
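
For non-hierarchical dimensions, the partial order reduces to set containment of group-by columns. The sketch below (function names are illustrative) enumerates the $2^D$ views of the running example and the lattice edges between views that differ in exactly one column.

```python
from itertools import combinations

DIMS = ("part", "supplier", "customer")


def all_views(dims):
    """Every subset of the dimension columns, from the top view down to 'none'."""
    return [frozenset(c)
            for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]


def precedes(q1, q2):
    """q1 can be answered from q2 alone iff q1's columns are a subset of q2's."""
    return q1 <= q2


def lattice_edges(views):
    """(lower, upper) pairs of views that differ in exactly one group-by column."""
    return [(lo, hi) for hi in views for lo in views
            if lo < hi and len(hi) - len(lo) == 1]


views = all_views(DIMS)
edges = lattice_edges(views)

part = frozenset({"part"})
part_customer = frozenset({"part", "customer"})
part_supplier = frozenset({"part", "supplier"})

print(len(views), len(edges))            # size of the 3-dimensional hypercube
print(precedes(part, part_customer))     # comparable pair
print(precedes(part_customer, part_supplier),
      precedes(part_supplier, part_customer))  # incomparable pair
```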

Hierarchical dimensions
• Some dimensions of the table may be hierarchical
  - Consisting of more than one column
    * E.g., time: (day, month, year, week)
  - Complex hierarchy among the columns, so the lattice may not be a hypercube

[Figure: lattice diagram of a hierarchical dimension]

Composite lattice for multiple dimensions
• Denote queries as $n$-ary tuples of dimensions
  - $q_1 = (a_1, a_2, \ldots, a_n),\ q_2 = (b_1, b_2, \ldots, b_n)$
  - $q_1 \preceq q_2$ iff $\forall i \in [n],\ a_i \preceq b_i$

[Figure: customer dimension with levels c (customer), n (nation), none; part dimension with levels p (part), s (size), t (type), none; lattice diagram of (customer, part)]
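
A sketch of the composite order as a component-wise test. The hierarchy edges below are an assumption, since the diagrams are not recoverable from the transcript: nation is taken to roll up from customer, size and type each from part, and "none" from the coarsest levels.

```python
from itertools import product

# Each level maps to the finer levels it can be computed from directly.
# ASSUMED hierarchies (the slide's diagrams are not available).
CUSTOMER = {"c": set(), "n": {"c"}, "none": {"n"}}
PART = {"p": set(), "s": {"p"}, "t": {"p"}, "none": {"s", "t"}}
HIERARCHIES = (CUSTOMER, PART)


def finer_or_equal(level, parents):
    """All levels from which `level` can be computed, including itself."""
    seen, stack = {level}, [level]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def precedes(q1, q2, hierarchies=HIERARCHIES):
    """q1 precedes q2 iff every component of q1 can be computed from q2's."""
    return all(b in finer_or_equal(a, h)
               for a, b, h in zip(q1, q2, hierarchies))


# The composite lattice has one view per combination of levels.
views = list(product(CUSTOMER, PART))

print(len(views))
print(precedes(("n", "s"), ("c", "p")))     # both components roll up
print(precedes(("c", "s"), ("n", "p")))     # customer cannot come from nation
print(precedes(("none", "s"), ("n", "t")))  # size cannot come from type
```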

Optimizing view selection via data-cube lattice
• Suppose the space budget $k$ is the number of views we want to create
  - In addition to the top view (which includes all dimensions)
• Average query cost $\bar{C}(\mathcal{Q} \mid S) = \frac{\sum_{q \in \mathcal{Q}} C(q \mid S)}{|\mathcal{Q}|}$, given the set of selected views $S$
  - $C(q \mid S) = |q'|$, where $q'$ is the smallest view s.t. $q' \in S \wedge q \preceq q'$
• Goal: minimize $\bar{C}(\mathcal{Q} \mid S)$ subject to $|S| = k + 1 \wedge \text{top view} \in S$
• Unfortunately, the problem is NP-hard

A greedy algorithm for view selection
• Greedy algorithm

An example lattice diagram (nodes listed level by level, with the number shown next to each view):
    a (100)
    b (50)   c (75)
    d (20)   e (30)   f (40)
    g (1)    h (10)

    S = {top view}, k = 3

Greedy algorithm (k = 3):
    S = {top view}
    for i = 1 to k do
        v = argmax_{v ∉ S} B(v, S)
        S = S ∪ {v}
    return S

Define the benefit of adding $v$ to $S$ as the total cost reduction of all queries preceding $v$ in the partial order $(\mathcal{Q}, \preceq)$:
1. $\forall q \preceq v,\ B_q(v, S) \leftarrow \max(0,\ C(q \mid S) - |v|)$
2. $B(v, S) \leftarrow \sum_{q \preceq v} B_q(v, S)$
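
A runnable sketch of the greedy loop on the example lattice. The view sizes come from the slide; the edges (`CHILDREN`) are not recoverable from the transcript and are assumed here to follow the example lattice in the required reading. Function names are illustrative.

```python
# View sizes from the example lattice on the slide.
SIZE = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
# ASSUMED edges: each view lists the views directly below it in the lattice.
CHILDREN = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e", "f"],
            "d": ["g"], "e": ["g", "h"], "f": ["h"], "g": [], "h": []}
TOP = "a"


def answerable_from(v):
    """All queries q that precede v (including v itself)."""
    seen, stack = {v}, [v]
    while stack:
        for child in CHILDREN[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def query_cost(q, selected):
    """C(q | S): size of the smallest selected view that can answer q."""
    return min(SIZE[v] for v in selected if q in answerable_from(v))


def benefit(v, selected):
    """B(v, S): total cost reduction over all queries that v can answer."""
    return sum(max(0, query_cost(q, selected) - SIZE[v])
               for q in answerable_from(v))


def greedy(k):
    """Start from the top view and add the most beneficial view k times."""
    selected = [TOP]
    for _ in range(k):
        candidates = [v for v in SIZE if v not in selected]
        selected.append(max(candidates, key=lambda v: benefit(v, selected)))
    return selected


def total_cost(selected):
    return sum(query_cost(q, selected) for q in SIZE)


picked = greedy(3)
print(picked, total_cost([TOP]), total_cost(picked))
```

Under the assumed edges, the first round picks b (benefit 50 for each of its five queries), and later rounds re-evaluate benefits against the views already selected, which is why f is preferred over e in the second round.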

A greedy algorithm for view selection (cont'd)
[Figure: a bad case for the greedy algorithm, k = 2]
The greedy algorithm is a $(1 - 1/e)$-approximation: the benefit of the views it selects is at least $(1 - 1/e) \approx 63\%$ of the benefit of the optimal selection.
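
As a sanity check of the bound on a small instance, the sketch below compares greedy against a brute-force optimum. It uses an illustrative eight-view lattice whose sizes and "can answer" sets are assumptions of this sketch, not the bad case from the missing figure. Benefit is measured as the reduction in total query cost relative to materializing only the top view.

```python
from itertools import combinations
from math import e

# Illustrative lattice (ASSUMED): view sizes and the queries each view can answer.
SIZE = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
ANSWERS = {"a": "abcdefgh", "b": "bdegh", "c": "cefgh", "d": "dg",
           "e": "egh", "f": "fh", "g": "g", "h": "h"}
TOP = "a"


def total_cost(selected):
    """Sum over all queries of the smallest selected view that answers them."""
    return sum(min(SIZE[v] for v in selected if q in ANSWERS[v]) for q in SIZE)


def benefit_of(selected):
    """Cost reduction relative to materializing only the top view."""
    return total_cost({TOP}) - total_cost(selected)


def greedy(k):
    """Adding the view with the largest benefit = the lowest resulting cost."""
    selected = {TOP}
    for _ in range(k):
        best = min((v for v in SIZE if v not in selected),
                   key=lambda v: total_cost(selected | {v}))
        selected.add(best)
    return selected


def optimal(k):
    """Brute force over all choices of k views besides the top view."""
    others = [v for v in SIZE if v != TOP]
    return min(({TOP, *combo} for combo in combinations(others, k)),
               key=total_cost)


k = 3
ratio = benefit_of(greedy(k)) / benefit_of(optimal(k))
print(round(ratio, 3), ratio >= 1 - 1 / e)
```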

Extensions to the basic model
• Queries are not likely to be asked with the same frequency
  - Weight the benefits when computing $B(v, S)$ using query probabilities
  - How to find the frequencies?
• Space budget = fixed amount of space, instead of the number of views
  - A variant of the knapsack problem
  - Use the benefit per unit space as the selection criterion
  - May be arbitrarily bad if a large view is excluded for a slightly lower benefit per unit space
    * The algorithm is $(1 - 1/e - f)$ optimal if no view takes more than a fraction $f$ of the space budget.
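
Both extensions can be sketched together: benefits are weighted by per-query weights, and views are ranked by benefit per unit space. The lattice, the uniform weights, the budget of 70, and the stopping rule (consider only views that still fit in the remaining budget) are all assumptions made for illustration.

```python
# Illustrative lattice (ASSUMED): view sizes and the queries each view can answer.
SIZE = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}
ANSWERS = {"a": "abcdefgh", "b": "bdegh", "c": "cefgh", "d": "dg",
           "e": "egh", "f": "fh", "g": "g", "h": "h"}
TOP = "a"
# Hypothetical query weights; uniform here, but could be estimated frequencies.
WEIGHT = {q: 1.0 for q in SIZE}


def query_cost(q, selected):
    return min(SIZE[v] for v in selected if q in ANSWERS[v])


def weighted_benefit(v, selected):
    """Benefit of v with each query's cost reduction scaled by its weight."""
    return sum(WEIGHT[q] * max(0, query_cost(q, selected) - SIZE[v])
               for q in ANSWERS[v])


def greedy_by_space(budget):
    """Repeatedly add the fitting view with the best benefit per unit space."""
    selected, remaining = [TOP], budget
    while True:
        scored = [(weighted_benefit(v, selected) / SIZE[v], v)
                  for v in SIZE
                  if v not in selected and SIZE[v] <= remaining]
        scored = [s for s in scored if s[0] > 0]
        if not scored:
            return selected
        _, best = max(scored)
        selected.append(best)
        remaining -= SIZE[best]


chosen = greedy_by_space(70)
print(chosen, sum(SIZE[v] for v in chosen if v != TOP))
```

Note how the tiny views g and h are picked first because of their high benefit per unit space, which illustrates why the per-unit criterion can pass over a large but valuable view.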

How to estimate the costs of views
• Estimating the number of distinct groups
  - Use samples of the raw data or of views to estimate the sizes of preceding views in the lattice
  - Many different estimators for distinct count exist [1]

[1] Peter J. Haas, et al. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In VLDB '95.
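
As one concrete example of a sample-based estimator, here is a minimal sketch of Chao's estimator, $\hat{D} = d + f_1^2 / (2 f_2)$, where $d$ is the number of distinct values in the sample and $f_j$ is the number of values seen exactly $j$ times. It is chosen here only for brevity, not because the slide or [1] recommends it; the fallback when $f_2 = 0$ is the usual bias-corrected form.

```python
from collections import Counter


def chao_estimate(sample):
    """Estimate the number of distinct values in the population from a sample.

    `sample` is any iterable of hashable group keys (e.g., tuples of the
    group-by column values of sampled rows).
    """
    counts = Counter(sample)
    d = len(counts)                       # distinct values seen in the sample
    freq_of_freq = Counter(counts.values())
    f1, f2 = freq_of_freq[1], freq_of_freq[2]  # singletons and doubletons
    if f2 > 0:
        return d + f1 * f1 / (2 * f2)
    # Bias-corrected variant, used when no value appears exactly twice.
    return d + f1 * (f1 - 1) / 2


# Toy sample of group keys: 'a' and 'c' appear twice, 'b' and 'd' once.
estimate = chao_estimate(["a", "a", "b", "c", "c", "d"])
print(estimate)
```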

How to choose k?
• The hypercube case
  - If all dimensions have equal domain size $r$ and the top view has $m$ rows
    * assuming the data are distributed evenly across groups
    * any view with $i$ dimensions has $r^i$ possible groups
      when $r^i \geq m$, some of the groups will definitely be missing => less benefit
      when $r^i < m$, most of the groups exist => more benefit
      set $k = \log_r m$
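
A small numeric sketch of the "cliff": with equal domain size $r$ and $m$ rows in the top view, the number of possible groups $r^i$ grows until it reaches $m$, after which a view cannot be much smaller than the raw data. Approximating the view size as $\min(r^i, m)$ is a simplification assumed here for illustration.

```python
def possible_groups(r, i):
    """Number of possible groups for a view with i dimensions of domain size r."""
    return r ** i


def cliff_rank(r, m):
    """Smallest i with r**i >= m, i.e. ceil(log_r m), using integer arithmetic."""
    i = 0
    while r ** i < m:
        i += 1
    return i


def estimated_view_size(r, i, m):
    """Crude size estimate: a view has at most r**i groups and at most m rows."""
    return min(possible_groups(r, i), m)


r, m, num_dims = 10, 1000, 5
sizes = [estimated_view_size(r, i, m) for i in range(num_dims + 1)]
print(cliff_rank(r, m), sizes)
```

Views at or above the cliff rank are about as large as the top view, so materializing them yields little benefit under the linear cost model.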

How to choose k? (cont'd)
• The hypercube case
  - If there are unequal domain sizes
    * similar situation, but the cliff is distributed among ranks
• Space- and time-optimal solutions
  - See the full paper for details

Evaluation on TPC-D
[Figures: greedy order of view selections; time and space trade-off (axes: k, cost)]

Summary
• A general lattice framework for analyzing and optimizing materialized view selection in data cubes
• A greedy algorithm for data-cube view selection whose benefit (reduction in average query cost) is at least $(1 - 1/e)$ of the optimal, given a fixed space budget
• Theoretical analysis of the choice of $k$ in hypercube lattices of data cubes
• Works well on the TPC-D benchmark data

Next time
• Reading for 9/15
  - Michael Stonebraker, et al. C-store: a column-oriented DBMS. In VLDB '05.
  - Presenter: Songtao Wei
• This week's paper summary is due on 9/14 at 11:59 pm
  - Post a private note with PDF in assignment_9/8