1 query optimization in compressed database systems zhiyuan chen and johannes gehrke cornell...
Post on 21-Dec-2015
![Page 1: 1 Query Optimization In Compressed Database Systems Zhiyuan Chen and Johannes Gehrke Cornell University Flip Korn AT&T Labs](https://reader030.vdocument.in/reader030/viewer/2022032521/56649d575503460f94a3596c/html5/thumbnails/1.jpg)
Query Optimization In Compressed Database Systems
Zhiyuan Chen and Johannes Gehrke
Cornell University
Flip Korn
AT&T Labs
Why Compression?
CPU speed outpaces disk speed exponentially!
– x10 / decade (bandwidth), x100 / decade (latency)
Trade CPU for I/O: improve query performance
+ Save bandwidth for sequential I/O
+ Improve buffer pool hit ratio
- Pay decompression cost
Environment
– Decision support queries
– Lossless compression
Issues
Database compression methods
Efficient query processing
Database Compression Methods
General-purpose compression
– Only compression ratio matters
– Large decompression unit (whole file)
Database compression
– Both compression ratio and decompression cost matter
– Small decompression unit (attribute or tuple)
Our setting: a single attribute can be decompressed on its own
Efficient Query Processing
Compared to uncompressed DB
– When to decompress
– Assumption: no compression in query processing
Our story
– Different strategies of when to decompress
– None of them is always optimal
– Combined optimization problem: query plan + decompression placement
– Solutions
– Experiments
Different Decompression Strategies
[Figure: three plans for the join R.A = S.B over relations R and S; diagram labels: Mem, Disk]
Eager: D(R), D(S), then join R.A = S.B; all uncompressed
Lazy: D(R.A), D(S.B), then join R.A = S.B; A, B uncompressed
Transient: join d(R.A) = d(S.B) directly on R and S; all compressed
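The difference between the lazy and transient strategies can be illustrated with a toy nested-loop join. This is a minimal sketch, not the implementation behind the talk: the `Codec` class, its dictionary encoding, and the function names are assumptions made purely for illustration. The sketch counts how often a value is decompressed under each strategy.

```python
class Codec:
    """Toy dictionary codec (an assumption, not the scheme used in the talk)."""

    def __init__(self, values):
        self.symbols = sorted(set(values))
        self.codes = {s: i for i, s in enumerate(self.symbols)}
        self.calls = 0  # number of decompress() invocations so far

    def compress(self, value):
        return self.codes[value]

    def decompress(self, code):
        self.calls += 1
        return self.symbols[code]


def lazy_join(r_codes, s_codes, r_codec, s_codec):
    """Explicitly decompress each join value once, then join on plain values."""
    r_plain = [r_codec.decompress(c) for c in r_codes]
    s_plain = [s_codec.decompress(c) for c in s_codes]
    return [(i, j)
            for i, a in enumerate(r_plain)
            for j, b in enumerate(s_plain)
            if a == b]


def transient_join(r_codes, s_codes, r_codec, s_codec):
    """Decompress only inside the comparison; the inputs stay compressed."""
    return [(i, j)
            for i, a in enumerate(r_codes)
            for j, b in enumerate(s_codes)
            if r_codec.decompress(a) == s_codec.decompress(b)]
```

With 3 rows in R and 4 rows in S, the lazy join performs 3 + 4 = 7 decompressions, while the transient nested-loop join performs 2 x 3 x 4 = 24. In exchange, the transient join never materializes uncompressed inputs.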
Which Strategy Is Optimal?
Lazy vs. eager
– Lazy is always better
Transient vs. lazy
– Transient: more I/O savings
– Lazy: lower decompression cost
In practice
– Numerical attributes: transient is always better
– String attributes: no clear winner
• Expensive to decompress
• High I/O savings if compressed
An Example With TPC-H Data
Select S_NAME, S_ADDRESS, C_NAME, C_PHONE
From Supplier, Customer
Where S_ADDRESS = C_ADDRESS
Order by S_NAME, C_NAME
[Plan: join Supplier and Customer on S_A = C_A, then Sort(S_N, C_N)]
Transient vs. Lazy
[Figure: three plans for the example query]
Lazy BNL (2s), then lazy sort (7s): 1 attribute compressed
Lazy BNL (2s), then transient sort (3s): 3 attributes compressed
Transient BNL (42s), then transient sort (0.5s): all attributes compressed
An optimization problem!
Interactions With Traditional Optimization
Optimal plan returned by System R is no longer optimal!
Algorithm: run System R, then decide when to decompress.
[Figure: two plans for the example query]
Lazy BNL (2s), then transient sort (3s): 3 attributes compressed
Transient SM (2.5s), then transient sort (0.5s): all attributes compressed (pruned by System R)
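The effect described on this slide can be replayed with the stage timings shown in the figure. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the cost of a plan is simply the join cost plus the sort cost, and that a separate first step compares plans on join cost alone.

```python
# Stage costs in seconds, copied from the slide. Treating the plan cost as
# join cost + sort cost is an assumption of this sketch.
PLANS = {
    "Lazy BNL, then transient sort": {"join": 2.0, "sort": 3.0},
    "Transient SM, then transient sort": {"join": 2.5, "sort": 0.5},
}


def total_cost(stages):
    """Cost of a complete plan under the additive assumption."""
    return stages["join"] + stages["sort"]


def two_step_choice(plans):
    """Pick the cheapest join first and accept whatever sort cost follows."""
    return min(plans, key=lambda name: plans[name]["join"])


def joint_choice(plans):
    """Compare complete plans, decompression placement included."""
    return min(plans, key=lambda name: total_cost(plans[name]))
```

Under these assumptions the two-step choice is the BNL plan (2 + 3 = 5 s), while the joint choice is the SM plan (2.5 + 0.5 = 3 s), which a join-only comparison would have discarded.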
Compression Aware Optimization
Given a query and a compressed DB: Find the optimal query plan
New operators
– Explicit decompression operators
– Transient versions of existing relational operators
Search space: O(n^m) factor over old search space
– n is the depth of the plan
– m is the number of attributes
– Each attribute explicitly decompressed at most once
– For each attribute, n places to decompress explicitly
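The n^m factor follows from choosing one of n plan levels independently for each of the m attributes. A minimal sketch that enumerates these choices is given below; the function name and the numbering of levels from 1 to n are assumptions made for illustration.

```python
from itertools import product


def explicit_decompression_placements(depth, attributes):
    """List every way to pick one plan level (1..depth) per attribute.

    Each result maps an attribute to the level at which it is explicitly
    decompressed, so there are depth ** len(attributes) results.
    """
    levels = range(1, depth + 1)
    return [dict(zip(attributes, choice))
            for choice in product(levels, repeat=len(attributes))]
```

For a plan of depth 3 and the two join attributes S_A and C_A, the function returns 3^2 = 9 placements.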
Dynamic Programming - OPT
Extend System R optimizer
– Bottom up, one minimal plan per interesting property
– New property: which attributes remain compressed
Blowup reduced from n^m to 2^m
[Figure: two plans kept for the join of Customer and Supplier]
Lazy BNL (2s). Property: S_A, C_A uncompressed
Transient SM join (2.5s). Property: all compressed
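Treating "which attributes remain compressed" as a plan property means the optimizer keeps one cheapest plan per subset of attributes, and m attributes have 2^m subsets. The sketch below shows only that bookkeeping step, not a full optimizer; the function names and the tuple layout of a candidate are assumptions.

```python
from itertools import combinations


def compression_properties(attributes):
    """All subsets of attributes that may still be compressed (2**m of them)."""
    attrs = sorted(attributes)
    return [frozenset(subset)
            for size in range(len(attrs) + 1)
            for subset in combinations(attrs, size)]


def keep_cheapest_per_property(candidates):
    """Keep the cheapest plan for each compression property.

    candidates: iterable of (plan_name, cost, still_compressed_attributes).
    Returns {frozenset(still_compressed_attributes): (plan_name, cost)}.
    """
    best = {}
    for name, cost, still_compressed in candidates:
        key = frozenset(still_compressed)
        if key not in best or cost < best[key][1]:
            best[key] = (name, cost)
    return best
```

With the two plans from the slide, the transient SM join (2.5 s) survives even though the lazy BNL join (2 s) is cheaper, because the two plans carry different properties. The five attributes of the example query give 2^5 = 32 possible properties.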
Min-K Heuristic Algorithm
Store plans for k rather than 2^m properties
– The k properties whose plans are cheapest
Storage blowup reduced from 2^m to k
Time: still exponential blowup in the worst case
[Figure: join on S_A, C_A over inputs S_A,… and C_A,…]
Stored plans:
– Lazy: S_A, C_A
– Transient: S_A, C_A
– Lazy: S_A, transient: C_A
– Transient: S_A, lazy: C_A
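The pruning step of Min-K can be sketched as keeping only the k cheapest entries of a per-property plan table. This is an illustration under assumptions, not the authors' code: the function name `min_k` and the table layout (property mapped to a plan name and cost) are hypothetical.

```python
import heapq


def min_k(best_per_property, k):
    """Keep only the k properties whose stored plans are cheapest.

    best_per_property: {property: (plan_name, cost)}.
    Returns a dict with at most k entries, cheapest first.
    """
    kept = heapq.nsmallest(k, best_per_property.items(),
                           key=lambda item: item[1][1])
    return dict(kept)
```

With k = 1 and the costs used earlier in the talk (lazy BNL 2 s, transient SM join 2.5 s), only the lazy BNL entry is kept, so a plan that is more expensive at this level is lost even if it would have been cheaper overall.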
Min-K Heuristics (2)
If transient decompression is bad for one join attribute, it is often bad for the other as well
– BNL join: both S_A and C_A decompressed N^2 times
Time blowup is 2k
[Figure: join on S_A, C_A over inputs S_A,… and C_A,…]
Only consider two cases. Stored plans:
– Lazy: S_A, C_A
– Transient: S_A, C_A
Experiments
Setup
– Modify Predator query engine & optimizer
– Algorithms
• Uncompressed, Eager, Lazy, Transient-Only, Two-Step, OPT, Min-1, Min-2
– 100 MB TPC-H data
– 50% compression ratio
– Pentium III 550 MHz, vary buffer pool size
Experimental Setup (2)
Randomly add join conditions on string attributes
Divide queries into workloads
– Number of string join conditions, number of join tables
Metrics: for algorithm X
– Average relative-cost: Average(cost of plan returned by X / cost of opt plan)
– Average blowup factor: Average(# plans searched by X / # plans by System R)
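Both metrics are per-query ratios averaged over a workload. A minimal sketch of how they could be computed is shown below; the function names and the list-based inputs are assumptions, and the numbers in the usage note are made up for illustration rather than taken from the experiments.

```python
from statistics import mean


def average_relative_cost(plan_costs, optimal_costs):
    """Mean over queries of (cost of X's plan / cost of the optimal plan)."""
    if len(plan_costs) != len(optimal_costs):
        raise ValueError("need one optimal cost per query")
    return mean(c / o for c, o in zip(plan_costs, optimal_costs))


def average_blowup_factor(plans_searched, system_r_plans):
    """Mean over queries of (# plans searched by X / # plans by System R)."""
    if len(plans_searched) != len(system_r_plans):
        raise ValueError("need one System R count per query")
    return mean(x / s for x, s in zip(plans_searched, system_r_plans))
```

For two hypothetical queries with plan costs 5.0 and 3.0 against optimal costs 2.5 and 3.0, the average relative cost is (2.0 + 1.0) / 2 = 1.5.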
Average Relative Cost
Queries with 3-4 join tables, buffer pool 10% of compressed DB
[Chart: Rel-cost (y-axis 0 to 14) vs. number of join conditions on string attributes (0 to 3). Series: OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed]
Distribution of Query Performance
Percentage of good plans (cost within twice of OPT) for all queries
[Chart: percentage of good plans (y-axis 0% to 100%) for Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Not Compressed]
Optimization Cost
Queries with 3-4 join tables
[Chart: blowup factor (y-axis 0 to 60) vs. number of join conditions on string attributes (0 to 4). Series: OPT, Min-2]
Related Work
How to compress
– Roth&Horn93, Iyer&Wilhite94, Goldstein98
How to query
– Graefe&Shapiro91, Westmann00, Greer99
Query optimization
– Compressed MOLAP aggregates: Li99
– Compressed bitmap indices: Amer-Yahia&Johnson00
– Expensive predicates:
• Chaudhuri&Shim99, Hellerstein93
Conclusions & Future Work
Novel optimization problem
– Search for regular query plan + when to decompress
– Separate search is sub-optimal
– OPT and Min-K heuristic
– Up to an order of magnitude improvement in experiments
Future work
– Caching decompressed values
– Updates
Search Space
[Figure: plan over S_A, … with join S_A = C_A followed by Sort(S_A), and an explicit D(S_A) operator]
3 places to place D(S_A): 3 extended plans (3 is depth)
Before D(S_A): convert to transient (transient join)
After D(S_A): as it is (regular sort)
n^m blowup over old space
– n: depth of plan
– m: number of attributes
Relative-Cost - Varying Buffer Pool Size
Queries with 3-4 join tables, 2 additional string joins
[Chart: Rel-cost (y-axis 0 to 14) vs. buffer pool size as % of compressed DB (10%, 40%, 200%). Series: OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed]
Relative Performance (2)
Queries with more than 5 join tables
[Chart: Rel-cost (y-axis 0 to 12) vs. number of join conditions on string attributes (0 to 3). Series: OPT, Min-2, Min-1, Two-Step, Eager, Lazy, Transient-Only, Uncompressed]