University of Massachusetts Amherst · Department of Computer Science
Optimizing Probabilistic Query Processing on Continuous Uncertain
Data
Liping Peng, Yanlei Diao, Anna Liu
VLDB 2011, Seattle, WA, USA
Applications of Uncertain Data Management
Motivating Application – Sloan Digital Sky Survey
Q1:
  SELECT *
  FROM Galaxy AS G
  WHERE G.r < 22
    AND G.q_r^2 + G.u_r^2 > 0.25
Q2:
  SELECT *
  FROM Galaxy AS G1, Galaxy AS G2
  WHERE G1.OBJ_ID < G2.OBJ_ID
    AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
    AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
    AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4
name | type | description
OBJ_ID | bigint | SDSS identifier
… | | …
(rowc, rowc_err) | real | (row center position, error term)
(colc, colc_err) | real | (column center position, error term)
(q_u, qErr_u) | real | (Stokes Q parameter, error term)
(u_u, uErr_u) | real | (Stokes U parameter, error term)
(ra, dec, ra_err, dec_err, ra_dec_corr) | real | (right ascension, declination, error in ra, error in dec, ra/dec correlation)
… | | …
Continuous uncertain data
Complex selection and join predicates
Return answers of high confidence efficiently
Previously Proposed Data Model
[Tran et al. PODS: A New Model and Processing Algorithms for Uncertain Data Streams. SIGMOD 2010;
 Tran et al. Conditioning and Aggregating Uncertain Data Streams: Going Beyond Expectations. PVLDB 2010]
Gaussian Mixture Models (GMMs) for continuous uncertain attributes
Tuple model (example row): Object_ID = MA123456, with uncertain attributes Speed, X, Y and TEP = 0.7
• Flexible
• Succinct
• Computationally efficient
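As a rough illustration of this data model, the sketch below represents one continuous uncertain attribute as a one-dimensional GMM and evaluates its density and cumulative probability. The component weights, means, and standard deviations are made-up values, and `gmm_pdf` and `gmm_cdf` are hypothetical helper names that do not come from the cited work.

```python
from statistics import NormalDist

# Hypothetical GMM for an uncertain attribute such as Speed:
# a list of (weight, mean, standard deviation); the weights sum to 1.
speed_gmm = [(0.6, 50.0, 2.0), (0.4, 58.0, 3.0)]

def gmm_pdf(x, components):
    """Density of the mixture at x: weighted sum of Gaussian densities."""
    return sum(w * NormalDist(mu, sd).pdf(x) for w, mu, sd in components)

def gmm_cdf(x, components):
    """Pr[X <= x] under the mixture: weighted sum of Gaussian cdfs."""
    return sum(w * NormalDist(mu, sd).cdf(x) for w, mu, sd in components)

# Probability that the attribute falls in a range, here 48 < Speed < 55.
p_range = gmm_cdf(55.0, speed_gmm) - gmm_cdf(48.0, speed_gmm)
```

Storing only a few (weight, mean, deviation) triples per attribute is what makes the representation succinct, and range probabilities reduce to a handful of cdf evaluations.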
Scope of Problem
Continuous uncertain data: Gaussian Mixture Models (GMMs)
Select-Project-Join (SPJ) queries with threshold λ
  SELECT * FROM Galaxy AS G WHERE G.r > 22    (λ=0.7)
Results with tuple existence probability (TEP) > λ
[Figure: input tuples rid 1 and 2, each with TEP 1; after the selection the TEPs shown are 0.8 and 0.5]
Probabilistic threshold query processing and optimization
• Avoid expensive operations for non-viable tuples
• Find efficient plans based on predicates and distributions
Outline
Motivation
Optimize Probabilistic Threshold Selections
Optimize Probabilistic Threshold Joins
Per-tuple Based Planning and Execution
Evaluation
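Probabilistic Threshold Selections
  SELECT * FROM Galaxy AS G WHERE G.q_r^2 + G.u_r^2 < 0.25
Return tuples with TEP > 0.8 (λ)
Continuous uncertain attributes: S = {q_r, u_r}
Selection condition θ: q_r^2 + u_r^2 < 0.25
Selection region Rθ: the set of points (q_r, u_r) that satisfy θ
Given a tuple with distribution f, the probability to satisfy θ:
  \int_{R_\theta} f(x)\,dx > \lambda
[Figure: density f over the (q_r, u_r) plane with the selection region Rθ]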
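A minimal sketch of what this probability means, assuming q_r and u_r are independent Gaussians (an assumption made only for this example; the parameter values are invented). It estimates the integral of f over Rθ by sampling, which is a stand-in for exact integration and not the method presented in the talk.

```python
import random

def estimate_selection_probability(mu_q, sd_q, mu_u, sd_u, radius_sq=0.25,
                                   samples=20000, seed=7):
    """Monte Carlo estimate of Pr[q_r^2 + u_r^2 < radius_sq]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        q = rng.gauss(mu_q, sd_q)
        u = rng.gauss(mu_u, sd_u)
        if q * q + u * u < radius_sq:  # the sample lies inside R_theta
            hits += 1
    return hits / samples

lam = 0.8  # probability threshold
p_near = estimate_selection_probability(0.0, 0.1, -0.1, 0.1)  # mass close to the origin
p_far = estimate_selection_probability(1.0, 2.2, 0.1, 1.1)    # mass spread widely
kept = [name for name, p in [("near", p_near), ("far", p_far)] if p > lam]
```

Only the tuple whose probability mass sits inside Rθ passes the threshold; the other one is dropped even though it satisfies θ with nonzero probability.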
Probabilistic Threshold Selections
Q2:
  SELECT *
  FROM Galaxy AS G1, Galaxy AS G2
  WHERE G1.OBJ_ID < G2.OBJ_ID
    AND |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05
    AND |(G1.g-G1.r)-(G2.g-G2.r)| < 0.05
    AND (G1.rowc-G2.rowc)^2 + (G1.colc-G2.colc)^2 < 1E4
Return tuples with TEP > 0.8 (λ)
Continuous uncertain attributes:
  S = {G1.u, G1.g, G1.r, G1.rowc, G1.colc, G2.u, G2.g, G2.r, G2.rowc, G2.colc}
Selection condition θ, selection region Rθ
Given a tuple with distribution f, the probability to satisfy θ:
  \int_{R_\theta} f(x)\,dx > \lambda
A high-dimensional integral for each tuple!
Applying Fast Filters to Avoid Integrals
Derive an upper bound (Ũ) for the integral at a low cost
• If Ũ<λ, filter tuples without computing integrals
• Otherwise, still integrate to compute the true probability
A general approach to derive an upper bound
• Given a tuple X, define a (multi-dim) Chebyshev region Rλ(X)
• Test the overlap of Rλ(X) with predicate region Rθ
• If Rλ(X) and Rθ are disjoint, filter the tuple
A geometric intersection problem
• Constrained optimization generally. Can use techniques like Lagrange multipliers
Fast filters for common predicates
[Figure: Chebyshev region Rλ(X) and predicate region Rθ for |u|<0.2 and |g|<0.2; axes u and g, each from -0.2 to 0.2]
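The filtering idea can be sketched in one dimension with the classical Chebyshev inequality, Pr[|X-μ| ≥ d] ≤ σ²/d². This is a simplified special case written for illustration: the function names are hypothetical, the predicate is a single interval, and the multi-dimensional Chebyshev region from the talk is not reproduced here.

```python
def chebyshev_upper_bound(mu, sigma, lo, hi):
    """Upper bound on Pr[lo < X < hi] for any X with mean mu and std sigma."""
    if lo <= mu <= hi:
        return 1.0  # the mean lies in the predicate range: no useful bound
    # Distance from the mean to the nearest edge of the predicate range.
    d = lo - mu if mu < lo else mu - hi
    # Pr[lo < X < hi] <= Pr[|X - mu| >= d] <= sigma^2 / d^2
    return min(1.0, (sigma / d) ** 2)

def can_filter(mu, sigma, lo, hi, lam):
    """True if the tuple can be dropped without computing the integral."""
    return chebyshev_upper_bound(mu, sigma, lo, hi) < lam

# Hypothetical tuples tested against the range 10 < X < 12 with lambda = 0.7.
far_tuple_filtered = can_filter(0.0, 1.0, 10.0, 12.0, 0.7)    # bound is about 0.01
near_tuple_filtered = can_filter(11.0, 1.0, 10.0, 12.0, 0.7)  # bound is 1.0
```

The bound needs only the mean and variance of the tuple, so it costs a few arithmetic operations, and tuples that survive it still go through exact integration.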
Reducing Dimensionality of Integration
σθ : n-dim space
• region: Rθ
• distribution: fX(x)
• integral: \int_{R_\theta} f_X(x)\,dx
Linear transformation (LT): Y=BX+b
σθ’ : m-dim space
• region: R’θ = {y | y=Bx+b, x ∈ Rθ}
• distribution: fY(y)
• integral: \int_{R'_\theta} f_Y(y)\,dy
An algorithm to find a transformation matrix B of size m×n, m≤n
• if m<n, LT helps to reduce dimensionality
• if m=n, LT does not help
Let X ~ N(μ,Σ) be n-dimensional and Y = BX+b with B of size m×n and b of size m×1; then Y ~ N(Bμ+b, BΣB^T) is m-dimensional
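A small sketch of this Gaussian property, applied to the predicate |(G1.u-G1.g)-(G2.u-G2.g)| < 0.05 from Q2. The means and variances are invented, the four attributes are assumed independent, and `transform_gaussian` is a hypothetical helper. With B = [1, -1, -1, 1] the 4-dimensional integral collapses to a 1-dimensional one.

```python
from math import sqrt
from statistics import NormalDist

def transform_gaussian(B, b, mu, Sigma):
    """Mean and covariance of Y = B X + b for X ~ N(mu, Sigma)."""
    m, n = len(B), len(mu)
    mean = [sum(B[i][k] * mu[k] for k in range(n)) + b[i] for i in range(m)]
    # BS = B * Sigma, then cov = BS * B^T
    BS = [[sum(B[i][k] * Sigma[k][j] for k in range(n)) for j in range(n)]
          for i in range(m)]
    cov = [[sum(BS[i][k] * B[j][k] for k in range(n)) for j in range(m)]
           for i in range(m)]
    return mean, cov

# X = (G1.u, G1.g, G2.u, G2.g) with made-up means and variance 0.01^2 each.
mu = [19.0, 18.0, 19.02, 18.0]
var = 0.01 ** 2
Sigma = [[var if i == j else 0.0 for j in range(4)] for i in range(4)]

B = [[1.0, -1.0, -1.0, 1.0]]  # Y = (G1.u - G1.g) - (G2.u - G2.g)
mean, cov = transform_gaussian(B, [0.0], mu, Sigma)

# The predicate |Y| < 0.05 is now a 1-D interval probability.
y = NormalDist(mean[0], sqrt(cov[0][0]))
p = y.cdf(0.05) - y.cdf(-0.05)
```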
Outline
Motivation
Optimize Probabilistic Threshold Selections
Optimize Probabilistic Threshold Joins
Per-tuple Based Planning and Execution
Evaluation
Probabilistic Threshold Joins
A probabilistic threshold join of relations R and S returns the tuple pairs (r,s), with r in R and s in S, whose probability of satisfying the join condition θ exceeds λ
True match: tuple pair (r,s) such that Pr[θ(r,s)] > λ
Large numbers of intermediate tuples!
Key idea: filtered cross-product using indexes
• For each tuple r, the index returns a subset of S to pair with r
• (r,s) pairs returned by the index include all true matches
• A necessary condition for a true match
• “Tight” enough, a sufficient and necessary condition if possible
Designing an Index
Build an index on S for a<R.A-S.A<b
              | search key              | query region
Deterministic | S.A                     | Instantiate with a deterministic value of R.A, e.g. when R.A=5, 5-b<S.A<5-a
Probabilistic | Quantities concerning S | Instantiate with quantities concerning R (a necessary condition for a true match)
A distribution instead of a deterministic value!
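For the deterministic case, the index lookup can be sketched with a sorted list standing in for the index on S.A. The function name and the sample values are illustrative only; a real system would use a B-tree or a similar structure.

```python
import bisect

def band_join_candidates(sorted_s, r_value, a, b):
    """Values of S.A with a < r_value - S.A < b, i.e. r_value-b < S.A < r_value-a."""
    lo = bisect.bisect_right(sorted_s, r_value - b)  # first S.A strictly above r_value-b
    hi = bisect.bisect_left(sorted_s, r_value - a)   # first S.A at or above r_value-a
    return sorted_s[lo:hi]

s_values = [1, 2, 3, 4, 5, 6]  # S.A, kept sorted (the "index")
matches = band_join_candidates(s_values, r_value=5, a=1, b=3)  # 2 < S.A < 4
```

When R.A is a distribution there is no single `r_value` to plug in, which is why the probabilistic index needs search keys and query regions built from distribution parameters instead.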
Band Join of GMMs (a<R.A-S.A<b)
r.A: Xr, μr, σr^2
s.A: Xs, μs, σs^2
Z=Xr-Xs follows a GMM with μz=μr-μs and σz^2=σr^2+σs^2
Theorem 1 (necessary condition):
Search key:
Query region:
Overlap test of RQ1 and RI [x1,x2;y1,y2]: RI overlaps with RQ1 if its upper left vertex (x1,y2) is in RQ1
[Figure: query region RQ1 in the (x, y) plane, with μr-a marked, above an index tree over regions R1 to R7, …]
Band Join of Gaussians (a<R.A-S.A<b)
Z=Xr-Xs
Theorem 2: Given Z~N(μ,σ^2), Pr[a<Z<b] > λ iff there exists an … such that … (the condition uses the inverse of the standard normal cdf)
Search key:
Query region:
Gaussian properties offer a sufficient and necessary condition
Overlap test: Requires math derivation; can be implemented efficiently
[Figure: query region in the (x’, y’) plane]
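As a point of reference for what the index must decide, the sketch below computes Pr[a<Z<b] directly for Z=Xr-Xs, assuming Xr and Xs are independent Gaussians (the parameter values are invented). This is the brute-force per-pair check, not the search key and query region of Theorem 2.

```python
from math import sqrt
from statistics import NormalDist

def band_join_probability(mu_r, sd_r, mu_s, sd_s, a, b):
    """Pr[a < Xr - Xs < b] for independent Gaussian Xr and Xs."""
    # Z = Xr - Xs is Gaussian with mean mu_r - mu_s and variance sd_r^2 + sd_s^2.
    z = NormalDist(mu_r - mu_s, sqrt(sd_r ** 2 + sd_s ** 2))
    return z.cdf(b) - z.cdf(a)

lam = 0.7
p = band_join_probability(mu_r=5.0, sd_r=0.3, mu_s=3.0, sd_s=0.4, a=1.0, b=3.0)
is_true_match = p > lam
```

Running this check for every (r,s) pair is the cross-product the index is meant to avoid; a sufficient and necessary index condition returns exactly the pairs for which `is_true_match` would be true.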
Outline
Motivation
Optimize Probabilistic Threshold Selections
Optimize Probabilistic Threshold Joins
Per-tuple Based Planning and Execution
Evaluation
Query Planning
Logical Operators
Physical Operators
• Fast filters based on inequalities
• Exact selection using integrals (with LT)
• Filtered cross-product using indexes
How to arrange operators to get an efficient plan?
Per-tuple Based Planning
Q1: SELECT * FROM Galaxy WHERE r < 24 AND q_r^2+u_r^2 > 0.25
    θ1: r < 24
    θ2: q_r^2+u_r^2 > 0.25
Tuple Attributes
id | r          | q_r       | u_r
1  | N(27, 2.2) | N(1, 2.2) | N(0.1, 1.1)
2  | N(21, 0.1) | N(0, 0.1) | N(-0.1, 0.1)
Predicate Selectivities
   | θ1   | θ2
t1 | 0.08 | 0.95
t2 | 1    | 0.0002
Consider both selectivity and cost like the traditional planner
Differences
• Exact selections are expensive due to the use of integrals
• Selectivity should be defined on a per-tuple basis
=> The optimal order varies on a per-tuple basis
Optimal plan for t1: θ1 θ2
Optimal plan for t2: θ2 θ1
[Figure: distributions of r for the two tuples, axis ticks at 20, 24, 25, 30]
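The per-tuple selectivity of θ1 (r < 24) can be checked with a short calculation. The sketch assumes that the second parameter of N(·,·) in the tuple table is a standard deviation; the slide does not say whether it is a standard deviation or a variance. Under that reading the results are close to the values 0.08 and 1 shown on the slide.

```python
from statistics import NormalDist

# Attribute r of the two example tuples, read as N(mean, standard deviation).
r_dist = {
    "t1": NormalDist(27.0, 2.2),
    "t2": NormalDist(21.0, 0.1),
}

# Per-tuple selectivity of theta1: Pr[r < 24].
theta1_selectivity = {tid: dist.cdf(24.0) for tid, dist in r_dist.items()}

# theta1 almost always rejects t1 but never rejects t2, so applying it first
# pays off for t1 only.
theta1_first = [tid for tid, sel in theta1_selectivity.items() if sel < 0.5]
```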
Tuple-based Query Planning and Execution
Tuple t1 from R needs to go through three selection predicates and five join predicates
[Figure: to-process tuple pool holding t1, t2, t3, t4; selections σθ1 σθ2 σθ3; joins θ4 θ5 θ6 θ7 θ8]
Predicates on R | σθ1 | σθ2 | σθ3
Est. cost       | 100 | 300 | 10^4
Selectivity     | 0.8 | 0.2 | 0.1
Rank            | 2   | 1   | 3
Join R with: S, T
Predicate | θ4  | θ5  | θ6  | θ7   | θ8
Est. cost | 500 | 300 | 100 | 10^4 | 50
Has index | Y   | Y   | N   | Y    | N
#candidates, Choose: 10 4 105 1021✓
Step 1: Estimate selectivities and rank selection predicates
Step 2: Execute filters first, then exact selections
Step 3: Choose a relation to join with
Step 4: Execute the (filtered) cross-product
After the join: selection: θ4 θ5 θ6; join: θ7 θ8
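Step 1 can be sketched with the classical predicate-ordering heuristic, which sorts predicates by (1 - selectivity) / cost. The slide does not state its ranking formula, so this metric is an assumption; it does reproduce the ranks 2, 1, 3 when the flattened table values are read as costs 100, 300, 10^4 and selectivities 0.8, 0.2, 0.1.

```python
# Selection predicates for one tuple: estimated cost and per-tuple selectivity,
# as read from the slide's table (an interpretation of the flattened values).
predicates = {
    "theta1": {"cost": 100.0, "selectivity": 0.8},
    "theta2": {"cost": 300.0, "selectivity": 0.2},
    "theta3": {"cost": 1e4, "selectivity": 0.1},
}

def benefit_per_cost(p):
    """Expected fraction of tuples rejected per unit of evaluation cost."""
    return (1.0 - p["selectivity"]) / p["cost"]

# Evaluate the predicate with the highest benefit per cost first.
order = sorted(predicates, key=lambda name: benefit_per_cost(predicates[name]),
               reverse=True)
ranks = {name: position + 1 for position, name in enumerate(order)}
```

Because the selectivities are estimated per tuple, this sort runs once per tuple, which is affordable only because it involves a handful of predicates.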
Outline
Motivation
Optimize Probabilistic Threshold Selections
Optimize Probabilistic Threshold Joins
Per-tuple Based Planning and Execution
Evaluation using Data and Queries from SDSS
Fast Filters for Selections
General filter vs. exact integration
SELECT * FROM Galaxy WHERE 100<rowc<100+δ AND 100<colc<100+δ (λ=0.7)
Data Characteristics
• Gaussians (from SDSS)
Parameters
• δ: predicate range
• λ: probability threshold
Metrics
• Time per tuple
• Without filters, constant high cost for all ranges tested
• With filters, per-tuple cost is very low for small predicate ranges
• More improvement for larger λ values tested
Indexes for Band Joins (stream)
SELECT * FROM Galaxy AS R, Galaxy AS S WHERE |R.u-S.u|<δ (λ=0.7, W=500)
Xbound join index [R. Cheng et al. VLDB 2004 & CIKM 2006]
• Given a distribution f and [l,u], store x% quantiles from both ends
• A loose necessary condition for true matches
[Figures: xbound vs GaussJoin in efficiency; xbound vs GaussJoin in filtering power]
• Our index for Gaussians returns exactly the true match set
• Xbound returns more candidates
• Our index significantly outperforms xbound in efficiency
Tuple Based Planning and Execution
Q1:
  SELECT *
  FROM Galaxy AS G
  WHERE G.r < δ1                     (θ1)
    AND G.q_r^2+G.u_r^2 > δ2^2       (θ2)
Static query planning [Y. Qi et al. SIGMOD 2010]
• A fixed plan for each query based on the selectivities of predicates over the entire data set
Dynamic query planning
• Rank predicates for each tuple
Optimal query planning
• Generate the best plan for each tuple offline and load it into memory before execution
δ1 | δ2  | static order | static time (ms) | dynamic time (ms) | performance gain | optimal time (ms)
20 | 0.2 | [1 2]        | 0.6              | 0.181             | 70%              | 0.177
20 | 0.5 | [1 2]        | 0.6              | 0.068             | 89%              | 0.067
20 | 1   | [2 1]        | 9.6              | 0.050             | 99%              | 0.048
22 | 0.2 | [2 1]        | 18.2             | 7.216             | 60%              | 7.007
22 | 0.5 | [2 1]        | 13.9             | 1.515             | 89%              | 1.482
22 | 1   | [2 1]        | 9.6              | 0.351             | 96%              | 0.348
24 | 0.2 | [2 1]        | 18.2             | 15.613            | 14%              | 15.287
24 | 0.5 | [2 1]        | 14.4             | 6.390             | 56%              | 6.334
24 | 1   | [2 1]        | 9.6              | 2.264             | 76%              | 2.236
Over 50% gains in most cases
Very close to the optimal in all cases
Conclusions
Optimize probabilistic threshold selections
• Fast filters to avoid integrals
• Reducing dimensionality of integration by linear transformation
Optimize probabilistic threshold joins
• Filtered cross-product using new indexes
Dynamic, per-tuple based planning
Evaluation
• Significant performance gains over the state-of-the-art indexing technique and query optimizer
Future work
• Extend to a larger class of queries including group-by aggregates
• Support user-defined functions
• Query optimization with correlated tuples
Thank you!
Q & A
Optimizing Probabilistic Query Processing on Continuous Uncertain Data
Liping Peng, Yanlei Diao, Anna Liu
http://claro.cs.umass.edu/