scorpion explaining away outliers in aggregate queries eugene wu and sam madden mit
Post on 22-Dec-2015
223 Views
Preview:
TRANSCRIPT
ScorpionExplaining Away Outliers in Aggregate Queries
eugene wu and sam maddenMIT
http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
Table
Split Aggregate Visualize
USA Italy
Exp
en
ses
China
SELECT sum(cost)FROM expensesGROUPBY country
USA ItalyChina
Exp
en
ses
SELECT sum(cost)FROM expensesGROUPBY country
USA ItalyChina
Exp
en
ses
SELECT sum(cost)FROM expensesGROUPBY country
USA ItalyChina
Exp
en
ses
SELECT sum(cost)FROM expensesGROUPBY country
USA ItalyChina
GivenOutlier and normal results
Understand Why
Exp
en
ses
SELECT sum(cost)FROM expensesGROUPBY country
GivenOutlier and normal results
caused the outliers?
most caused the outliers?
caused outliers but didn’t affect normal outputs?
USA ItalyChina
What input properties
Exp
en
ses
SELECT sum(cost)FROM expensesGROUPBY country
Can’t Touch This
Provenance
Data!
Provenance
$$$
SELECT SUM(cost)FROM sam’s bank account
Provenance
SELECT SUM(cost)FROM sam’s bank account
$$$
Provenance
SELECT SUM(cost)FROM sam’s bank account
$$$
Provenance
Darn! Ya caught
me
SELECT SUM(cost)FROM sam’s bank account
$$$
Provenance
http://weknowmemes.com/2012/04/whats-the-point/
Filter for “most
influential”
Provenance
SELECT SUM(cost)FROM sam’s bank account
Provenance
Provenance
Faceting
http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance
Faceting
http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance
FacetingDimensionality :(Dealing with multiple outliers?
http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf
Provenance
Faceting
Provenance
Faceting
Scorpion!
USA Italy
Understand Why
GivenOutlier and normal results
China
Exp
en
ses
USA ItalyChina
Predicates correlated with outliers
Find
GivenOutlier and normal results
Desc = “toilets”
Exp
en
ses
USA ItalyChina
Removing predicate from inputs “fixes” outliers & maintains normal results
Predicates correlated with outliers
Find
s.t.
GivenOutlier and normal results
Exp
en
ses
Desc = “toilets”
USA ItalyChina
Removing predicate from inputs “fixes” outliers & maintains normal results
Predicates correlated with outliers
Find
s.t.
GivenOutlier and normal results
Exp
en
ses
Desc = “toilets”
Removing predicate from inputs “fixes” outliers & maintains normal results
Predicates correlated with outliers
Find
s.t.
USA ItalyChina
GivenOutlier and normal results
Exp
en
ses
Desc = “toilets”
Formalize “influence” as metric
Predicate search heuristicsSome results
T
p(T)
T
Desc = “toilet”
p(T)
T
T – p(T)
p(T)
T
p(T)
p(T)
p(T)
Δoutput
|p(T)|p(T)
Δoutput
|p(T)|p(T)
Δoutput
Δoutput|p(T)|
InfluenceMetric
Δoutput|p(T)|
Δf(x)Δx
Sensitivity Analysis
InfluenceMetric
Δoutput
|p(T)|ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutput
|p(T)|
Δoutput V
|p(T)|
ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutput V
|p(T)|
Δoutput V
|p(T)|c
ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutput
|p(T)|
Δoutlier V
|p(T)|cΔNormal
Δoutput V
|p(T)|c
-
ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutput
|p(T)|
Δoutput V
|p(T)|
Δoutlier V
|p(T)|cΔNormal-
ΔOutput
“High vs Low”
|p(T)|
ΔNormal
Multiple Outputs
Δoutlier V
|p(T)|cmean ΔNormal maxoutlier normal-
Δoutput
|p(T)|
Δoutput V
|p(T)|c
Δoutput V
|p(T)|
Δoutlier V
|P(T)|cΔHold-out
Δoutlier
|P(T)|
Δoutlier V
|P(T)|
Δoutlier V
|P(T)|c
-
Δoutput
“High vs Low”
|P(T)|
ΔNormal
Multiple Outputs
Δoutlier V
|P(T)|cmean ΔHold-out maxoutlier normal-
influence(p)
Formalize “influence” as metricPredicate search heuristics
Some results
influence(p)argmaxp ∈ predicatesp* =
influence(p)argmaxp ∈ predicatesp* =
O(agg(T-p(T)))
influence(p)argmaxp ∈ predicatesp* =
O(agg(T-p(T)))
SUM({1,2,3,4,5}) = 15
influence(p)argmaxp ∈ predicatesp* =
O(agg(T-p(T)))
SUM({1,2,3,4,5}) = 15
p
influence(p)argmaxp ∈ predicatesp* =
O(agg(T-p(T)))
SUM({1,2,3,4,5}) = 15-{4,5}
p
influence(p)argmaxp ∈ predicatesp* =
O(agg(T-p(T)))
SUM({1,2,3,4,5}) = 15
SUM({1,2,3}) = 6
-{4,5}
p
influence(p)argmaxp ∈ predicatesp* =
O(exponential) O(agg(T-p(T)))
influence(p)argmaxp ∈ predicatesp* =
O(exponential)
Operator Properties
O(agg(T-p(T)))
influence(p)argmaxp ∈ predicatesp* =
O(exponential) O(agg(p(T)))
Operator Properties
Incrementally removable
influence(p)argmaxp ∈ predicatesp* =
O(exponential) O(agg(p(T)))
Incrementally removable
SUM({1,2,3,4,5}) = 15
p
influence(p)argmaxp ∈ predicatesp* =
O(exponential) O(agg(p(T)))
Incrementally removable
15 - SUM({ 4,5}) = 6
SUM({1,2,3,4,5}) = 15
p
influence(p)argmaxp ∈ predicatesp* =
O(exponential) O(agg(p(T)))
SUMCOUNTAVGSTDDEV
Incrementally removable
influence(p)argmaxp ∈ predicatesp* =
O(exponential) O(agg(p(T)))
SUMCOUNTAVGSTDDEV
MEDIANMODE
Incrementally removable
influence(p)argmaxp ∈ predicatesp* =
IndependentIncrementally removable
O(agg(p(T)))O(exponential)
Leastinfluence
Mostinfluence
influence(p)argmaxp ∈ predicatesp* =
IndependentIncrementally removable
O(agg(p(T)))O(exponential)
Leastinfluence
Mostinfluence
influence(p)argmaxp ∈ predicatesp* =
IndependentIncrementally removable
O(agg(p(T)))O(exponential)
Leastinfluence
Mostinfluence
influence(p)argmaxp ∈ predicatesp* =
IndependentIncrementally removable
O(agg(p(T)))O(exponential)
Leastinfluence
Mostinfluence
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
Anti-monotonic
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
Anti-monotonic
p’⊂p
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
Anti-monotonic
p’⊂p
influence(p’) ≤ influence(p)
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
Bottom Up Anti-monotonic
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
Bottom Up Anti-monotonic
influence(p)argmaxp ∈ predicatesp* =
Top DownIndependent
Incrementally removable
O(agg(p(T)))O(exponential)
Bottom Up Anti-monotonic
Formalize “influence” as metricPredicate search heuristics
Some results
SELECT sum(Y) GROUPBY XS
um
(Y)
X
SELECT sum(Y) GROUPBY XS
um
(Y)
X
Z
Y
SELECT sum(Y) GROUPBY XS
um
(Y)
X
Z
Y
Z
SELECT sum(Y) GROUPBY XS
um
(Y)
X
Z
Y
Z
1K 5K 10K
thousand tuples / group
1K 5K 10K
100
1000
10
thousand tuples / group
cost
seco
nd
s
1K 5K 10K
100
1000
10
thousand tuples / group
cost
seco
nd
sNaive
1K 5K 10K
100
1000
10
thousand tuples / group
cost
seco
nd
sNaive
Top down
1K 5K 10K
100
1000
10
thousand tuples / group
cost
seco
nd
sNaive
Top down
Bottom up
influence metric that is
accessible to end-usersfor
Data cleaningData explorationProvenance reduction
scorpion
http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
eugenewu@mit.edu
scorpion
http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
eugenewu@mit.edu
scorpion
http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html
eugenewu@mit.edu
C-parameter
Δoutput V |p(T)|c
Y
Z
Low C High C
Z
Δoutlier V
|p(T)|cmean ΔNormal maxoutlier normal
USA ItalyChina
Exp
en
ses
top related