scorpion explaining away outliers in aggregate queries eugene wu and sam madden mit

Post on 22-Dec-2015

223 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

ScorpionExplaining Away Outliers in Aggregate Queries

eugene wu and sam maddenMIT

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

Table

Split Aggregate Visualize

USA Italy

Exp

en

ses

China

SELECT sum(cost)FROM expensesGROUPBY country

USA ItalyChina

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

USA ItalyChina

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

USA ItalyChina

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

USA ItalyChina

GivenOutlier and normal results

Understand Why

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

GivenOutlier and normal results

caused the outliers?

most caused the outliers?

caused outliers but didn’t affect normal outputs?

USA ItalyChina

What input properties

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

Can’t Touch This

Provenance

Data!

Provenance

$$$

SELECT SUM(cost)FROM sam’s bank account

Provenance

SELECT SUM(cost)FROM sam’s bank account

$$$

Provenance

SELECT SUM(cost)FROM sam’s bank account

$$$

Provenance

Darn! Ya caught

me

SELECT SUM(cost)FROM sam’s bank account

$$$

Provenance

http://weknowmemes.com/2012/04/whats-the-point/

Filter for “most

influential”

Provenance

SELECT SUM(cost)FROM sam’s bank account

Provenance

Provenance

Faceting

http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf

Provenance

Faceting

http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf

Provenance

FacetingDimensionality :(Dealing with multiple outliers?

http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf

Provenance

Faceting

Provenance

Faceting

Scorpion!

USA Italy

Understand Why

GivenOutlier and normal results

China

Exp

en

ses

USA ItalyChina

Predicates correlated with outliers

Find

GivenOutlier and normal results

Desc = “toilets”

Exp

en

ses

USA ItalyChina

Removing predicate from inputs “fixes” outliers & maintains normal results

Predicates correlated with outliers

Find

s.t.

GivenOutlier and normal results

Exp

en

ses

Desc = “toilets”

USA ItalyChina

Removing predicate from inputs “fixes” outliers & maintains normal results

Predicates correlated with outliers

Find

s.t.

GivenOutlier and normal results

Exp

en

ses

Desc = “toilets”

Removing predicate from inputs “fixes” outliers & maintains normal results

Predicates correlated with outliers

Find

s.t.

USA ItalyChina

GivenOutlier and normal results

Exp

en

ses

Desc = “toilets”

Formalize “influence” as metric

Predicate search heuristicsSome results

T

p(T)

T

Desc = “toilet”

p(T)

T

T – p(T)

p(T)

T

p(T)

p(T)

p(T)

Δoutput

|p(T)|p(T)

Δoutput

|p(T)|p(T)

Δoutput

Δoutput|p(T)|

InfluenceMetric

Δoutput|p(T)|

Δf(x)Δx

Sensitivity Analysis

InfluenceMetric

Δoutput

|p(T)|ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Δoutput

|p(T)|

Δoutput V

|p(T)|

ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Δoutput V

|p(T)|

Δoutput V

|p(T)|c

ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Δoutput

|p(T)|

Δoutlier V

|p(T)|cΔNormal

Δoutput V

|p(T)|c

-

ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Δoutput

|p(T)|

Δoutput V

|p(T)|

Δoutlier V

|p(T)|cΔNormal-

ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Δoutlier V

|p(T)|cmean ΔNormal maxoutlier normal-

Δoutput

|p(T)|

Δoutput V

|p(T)|c

Δoutput V

|p(T)|

Δoutlier V

|P(T)|cΔHold-out

Δoutlier

|P(T)|

Δoutlier V

|P(T)|

Δoutlier V

|P(T)|c

-

Δoutput

“High vs Low”

|P(T)|

ΔNormal

Multiple Outputs

Δoutlier V

|P(T)|cmean ΔHold-out maxoutlier normal-

influence(p)

Formalize “influence” as metricPredicate search heuristics

Some results

influence(p)argmaxp ∈ predicatesp* =

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

SUM({1,2,3,4,5}) = 15

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

SUM({1,2,3,4,5}) = 15

p

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

SUM({1,2,3,4,5}) = 15-{4,5}

p

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

SUM({1,2,3,4,5}) = 15

SUM({1,2,3}) = 6

-{4,5}

p

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(T-p(T)))

influence(p)argmaxp ∈ predicatesp* =

O(exponential)

Operator Properties

O(agg(T-p(T)))

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

Operator Properties

Incrementally removable

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

Incrementally removable

SUM({1,2,3,4,5}) = 15

p

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

Incrementally removable

15 - SUM({ 4,5}) = 6

SUM({1,2,3,4,5}) = 15

p

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

SUMCOUNTAVGSTDDEV

Incrementally removable

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

SUMCOUNTAVGSTDDEV

MEDIANMODE

Incrementally removable

influence(p)argmaxp ∈ predicatesp* =

IndependentIncrementally removable

O(agg(p(T)))O(exponential)

Leastinfluence

Mostinfluence

influence(p)argmaxp ∈ predicatesp* =

IndependentIncrementally removable

O(agg(p(T)))O(exponential)

Leastinfluence

Mostinfluence

influence(p)argmaxp ∈ predicatesp* =

IndependentIncrementally removable

O(agg(p(T)))O(exponential)

Leastinfluence

Mostinfluence

influence(p)argmaxp ∈ predicatesp* =

IndependentIncrementally removable

O(agg(p(T)))O(exponential)

Leastinfluence

Mostinfluence

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Anti-monotonic

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Anti-monotonic

p’⊂p

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Anti-monotonic

p’⊂p

influence(p’) ≤ influence(p)

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Bottom Up Anti-monotonic

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Bottom Up Anti-monotonic

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Bottom Up Anti-monotonic

Formalize “influence” as metricPredicate search heuristics

Some results

SELECT sum(Y) GROUPBY XS

um

(Y)

X

SELECT sum(Y) GROUPBY XS

um

(Y)

X

Z

Y

SELECT sum(Y) GROUPBY XS

um

(Y)

X

Z

Y

Z

SELECT sum(Y) GROUPBY XS

um

(Y)

X

Z

Y

Z

1K 5K 10K

thousand tuples / group

1K 5K 10K

100

1000

10

thousand tuples / group

cost

seco

nd

s

1K 5K 10K

100

1000

10

thousand tuples / group

cost

seco

nd

sNaive

1K 5K 10K

100

1000

10

thousand tuples / group

cost

seco

nd

sNaive

Top down

1K 5K 10K

100

1000

10

thousand tuples / group

cost

seco

nd

sNaive

Top down

Bottom up

influence metric that is

accessible to end-usersfor

Data cleaningData explorationProvenance reduction

scorpion

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

eugenewu@mit.edu

scorpion

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

eugenewu@mit.edu

scorpion

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

eugenewu@mit.edu

C-parameter

Δoutput V |p(T)|c

Y

Z

Low C High C

Z

Δoutlier V

|p(T)|cmean ΔNormal maxoutlier normal

USA ItalyChina

Exp

en

ses

top related