Scorpion: Explaining Away Outliers in Aggregate Queries


Eugene Wu and Sam Madden, MIT

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

Table

Split, Aggregate, Visualize

SELECT sum(cost) FROM expenses GROUP BY country

[Figure: Expenses by country: USA, China, Italy]

Given: outlier and normal results

Understand why

What input properties caused the outliers?

What input properties most caused the outliers?

What input properties caused the outliers but didn’t affect normal outputs?

Can’t Touch This

Provenance

Data!

$$$

SELECT SUM(cost) FROM sam’s bank account

Darn! Ya caught me

http://weknowmemes.com/2012/04/whats-the-point/

Filter for “most influential”

Faceting

http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf

Dimensionality :(

Dealing with multiple outliers?

Scorpion!

Given: outlier and normal results

Find: predicates correlated with outliers, e.g. Desc = “toilets”

s.t. removing the tuples that match the predicate from the inputs “fixes” the outliers and maintains the normal results
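The problem statement can be illustrated on a toy table. This is a minimal sketch: the rows, the costs, the choice of Italy as the outlier group, and the helper names `sum_by_country` and `remove_matching` are all made up for illustration and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical expenses rows (country, desc, cost); the values are invented.
EXPENSES = [
    ("USA", "food", 10), ("USA", "travel", 20),
    ("China", "food", 12), ("China", "travel", 18),
    ("Italy", "food", 11), ("Italy", "travel", 19),
    ("Italy", "toilets", 500),
]

def sum_by_country(rows):
    """SELECT sum(cost) FROM expenses GROUP BY country."""
    totals = defaultdict(int)
    for country, _desc, cost in rows:
        totals[country] += cost
    return dict(totals)

def remove_matching(rows, predicate):
    """Return the rows that do not satisfy the predicate."""
    return [row for row in rows if not predicate(row)]

def is_toilets(row):
    """The candidate predicate: Desc = "toilets"."""
    return row[1] == "toilets"

before = sum_by_country(EXPENSES)
after = sum_by_country(remove_matching(EXPENSES, is_toilets))
print(before)
print(after)
```

In this invented data, removing the matching rows brings the Italy total in line with the other groups and leaves the USA and China totals unchanged, which is the behaviour the "s.t." condition asks for.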

Formalize “influence” as metric

Predicate search heuristics

Some results

T

p(T), e.g. p: Desc = “toilet”

T – p(T)

Δoutput

Influence metric

Sensitivity analysis: Δf(x) / Δx

ΔOutput

$$\frac{\Delta\text{output}}{|p(T)|}$$

“High vs Low”

$$\frac{\Delta\text{output} \cdot V}{|p(T)|}$$

|p(T)|

$$\frac{\Delta\text{output} \cdot V}{|p(T)|^{c}}$$

ΔNormal (also labelled ΔHold-out)

$$\frac{\Delta\text{outlier} \cdot V}{|p(T)|^{c}} - \Delta\text{Normal}$$

Multiple Outputs

$$\text{influence}(p) = \operatorname*{mean}_{\text{outlier}} \frac{\Delta\text{outlier} \cdot V}{|p(T)|^{c}} - \max_{\text{normal}} \Delta\text{Normal}$$
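The final form of the metric can be sketched in a few lines. Several choices here are assumptions rather than anything the slides state: Δ is taken as agg(T) minus agg(T – p(T)), V is +1 for an outlier that is too high and -1 for one that is too low, the normal term uses the absolute change, and a predicate that matches nothing contributes 0. The function names and the data are hypothetical.

```python
def delta(rows, predicate, agg=sum):
    """Return (change in the aggregate when p(T) is removed, |p(T)|)."""
    kept = [r for r in rows if not predicate(r)]
    change = agg(r["cost"] for r in rows) - agg(r["cost"] for r in kept)
    return change, len(rows) - len(kept)

def influence(outlier_groups, normal_groups, predicate, c=0.5, agg=sum):
    """mean over outliers of (delta * V / |p(T)|^c) minus max over normals of |delta|."""
    scores = []
    for rows, direction in outlier_groups:
        change, matched = delta(rows, predicate, agg)
        # Assumption: a predicate that matches no tuples has no influence.
        scores.append(change * direction / matched ** c if matched else 0.0)
    outlier_term = sum(scores) / len(scores)
    normal_term = max(
        (abs(delta(rows, predicate, agg)[0]) for rows in normal_groups),
        default=0.0,
    )
    return outlier_term - normal_term

# Invented data: one outlier group flagged "too high" (V = +1), two normal groups.
italy = [{"desc": "food", "cost": 11}, {"desc": "travel", "cost": 19},
         {"desc": "toilets", "cost": 500}]
usa = [{"desc": "food", "cost": 10}, {"desc": "travel", "cost": 20}]
china = [{"desc": "food", "cost": 12}, {"desc": "travel", "cost": 18}]

toilets_score = influence([(italy, +1)], [usa, china], lambda r: r["desc"] == "toilets")
food_score = influence([(italy, +1)], [usa, china], lambda r: r["desc"] == "food")
print(toilets_score, food_score)
```

On this data the "toilets" predicate scores far higher than "food", because removing food rows barely moves the outlier and also disturbs the normal groups.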

Formalize “influence” as metric

Predicate search heuristics

Some results

$$p^{*} = \operatorname*{arg\,max}_{p \in \text{predicates}} \text{influence}(p)$$

arg max over predicates: O(exponential)

influence(p): O(agg(T – p(T)))

SUM({1,2,3,4,5}) = 15

p(T) = {4,5}

SUM({1,2,3,4,5} – {4,5}) = SUM({1,2,3}) = 6

Operator Properties

Incrementally removable: influence(p) in O(agg(p(T)))

SUM({1,2,3,4,5}) = 15

15 - SUM({4,5}) = 6

SUM, COUNT, AVG, STDDEV

MEDIAN, MODE
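The 15 - SUM({4,5}) = 6 step generalises to any aggregate that can be recovered from a small, subtractable summary. The sketch below uses a (count, sum, sum of squares) summary to cover SUM, COUNT, AVG and STDDEV. The function names `state`, `remove` and `recover` are illustrative choices, and the population form of the standard deviation is an assumption.

```python
import math

def state(values):
    """Constant-size summary of a bag of numbers: (count, sum, sum of squares)."""
    values = list(values)
    return (len(values), sum(values), sum(v * v for v in values))

def remove(total, part):
    """Summary of T - p(T), computed from the summaries of T and p(T) alone."""
    return tuple(t - p for t, p in zip(total, part))

def recover(summary):
    """Turn a summary back into COUNT, SUM, AVG and (population) STDDEV."""
    n, s, sq = summary
    mean = s / n
    variance = sq / n - mean * mean
    return {"count": n, "sum": s, "avg": mean,
            "stddev": math.sqrt(max(variance, 0.0))}

total = state([1, 2, 3, 4, 5])   # computed once over T
part = state([4, 5])             # only p(T) is scanned per predicate
result = recover(remove(total, part))
print(result)
```

The summary of T is computed once, so evaluating a predicate only needs a pass over p(T), which is where the O(agg(p(T))) cost comes from.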

Independent

[Figure: Least influence, Most influence]

Top Down

Anti-monotonic

$$p' \subset p \implies \text{influence}(p') \le \text{influence}(p)$$

Bottom Up
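The anti-monotonic property is what lets a search skip whole families of predicates. The sketch below is not the Bottom Up algorithm from the talk, whose details the slides do not give. It only shows the pruning rule, under two assumptions: the score is the change in SUM with c = 0, and all costs are non-negative, so a contained predicate can never score higher than the predicate that contains it. The predicate hierarchy and values are invented.

```python
def removal_delta(values):
    """Change in SUM when `values` are removed (c = 0, so no size normalisation)."""
    return sum(values)

def predicates_above(node, threshold):
    """Names of predicates in a containment hierarchy whose score reaches the threshold.

    node = (name, matched_values, children); each child matches a subset
    of the tuples its parent matches.
    """
    name, values, children = node
    if removal_delta(values) < threshold:
        # Anti-monotonic: every contained predicate scores no higher,
        # so the whole subtree can be skipped without evaluating it.
        return []
    found = [name]
    for child in children:
        found.extend(predicates_above(child, threshold))
    return found

# Invented hierarchy over the Italy group; children refine their parent.
hierarchy = (
    "country = Italy", [11, 19, 500], [
        ("country = Italy AND desc = toilets", [500], []),
        ("country = Italy AND desc IN (food, travel)", [11, 19], [
            ("country = Italy AND desc = food", [11], []),
            ("country = Italy AND desc = travel", [19], []),
        ]),
    ],
)

survivors = predicates_above(hierarchy, threshold=100)
print(survivors)
```

Here the food/travel branch scores 30, below the threshold, so its two refinements are never evaluated.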

Formalize “influence” as metric

Predicate search heuristics

Some results

SELECT sum(Y) GROUP BY X

[Figure: Sum(Y) against X, with further axes Y and Z]

[Figure: cost in seconds (ticks at 10, 100, 1000) against thousand tuples / group (1K, 5K, 10K), for Naive, Top down and Bottom up]

An influence metric that is accessible to end-users, for

Data cleaning

Data exploration

Provenance reduction

scorpion

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

[email protected]


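C-parameter

$$\frac{\Delta\text{output} \cdot V}{|p(T)|^{c}}$$

[Figure: panels over Y and Z, Low C and High C]

$$\operatorname*{mean}_{\text{outlier}} \frac{\Delta\text{outlier} \cdot V}{|p(T)|^{c}} - \max_{\text{normal}} \Delta\text{Normal}$$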
