scorpion explaining away outliers in aggregate queries eugene wu and sam madden mit

96
Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT p://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

Upload: wilfred-jacobs

Post on 22-Dec-2015

223 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

ScorpionExplaining Away Outliers in Aggregate Queries

eugene wu and sam maddenMIT

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

Page 2: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT
Page 3: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT
Page 4: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT
Page 5: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT
Page 6: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT
Page 7: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT
Page 8: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Table

Split Aggregate Visualize

Page 9: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA Italy

Exp

en

ses

China

SELECT sum(cost)FROM expensesGROUPBY country

Page 10: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA ItalyChina

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

Page 11: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA ItalyChina

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

Page 12: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA ItalyChina

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

Page 13: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA ItalyChina

GivenOutlier and normal results

Understand Why

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

Page 14: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

GivenOutlier and normal results

caused the outliers?

most caused the outliers?

caused outliers but didn’t affect normal outputs?

USA ItalyChina

What input properties

Exp

en

ses

SELECT sum(cost)FROM expensesGROUPBY country

Page 15: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Can’t Touch This

Page 16: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT
Page 17: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

Data!

Page 18: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

$$$

SELECT SUM(cost)FROM sam’s bank account

Page 19: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

SELECT SUM(cost)FROM sam’s bank account

$$$

Page 20: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

SELECT SUM(cost)FROM sam’s bank account

$$$

Page 21: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

Darn! Ya caught

me

SELECT SUM(cost)FROM sam’s bank account

$$$

Page 22: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

http://weknowmemes.com/2012/04/whats-the-point/

Page 23: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Filter for “most

influential”

Provenance

SELECT SUM(cost)FROM sam’s bank account

Page 24: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

Page 25: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

Faceting

http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf

Page 26: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

Faceting

http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf

Page 27: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

FacetingDimensionality :(Dealing with multiple outliers?

http://www.perceptualedge.com/articles/Whitepapers/Three_Blind_Men.pdf

Page 28: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

Faceting

Page 29: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Provenance

Faceting

Scorpion!

Page 30: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA Italy

Understand Why

GivenOutlier and normal results

China

Exp

en

ses

Page 31: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA ItalyChina

Predicates correlated with outliers

Find

GivenOutlier and normal results

Desc = “toilets”

Exp

en

ses

Page 32: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA ItalyChina

Removing predicate from inputs “fixes” outliers & maintains normal results

Predicates correlated with outliers

Find

s.t.

GivenOutlier and normal results

Exp

en

ses

Desc = “toilets”

Page 33: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA ItalyChina

Removing predicate from inputs “fixes” outliers & maintains normal results

Predicates correlated with outliers

Find

s.t.

GivenOutlier and normal results

Exp

en

ses

Desc = “toilets”

Page 34: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Removing predicate from inputs “fixes” outliers & maintains normal results

Predicates correlated with outliers

Find

s.t.

USA ItalyChina

GivenOutlier and normal results

Exp

en

ses

Desc = “toilets”

Page 35: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Formalize “influence” as metric

Predicate search heuristicsSome results

Page 36: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

T

Page 37: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

p(T)

T

Desc = “toilet”

Page 38: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

p(T)

T

Page 39: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

T – p(T)

p(T)

T

Page 40: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

p(T)

Page 41: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

p(T)

Page 42: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

p(T)

Δoutput

Page 43: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

|p(T)|p(T)

Δoutput

Page 44: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

|p(T)|p(T)

Δoutput

Δoutput|p(T)|

InfluenceMetric

Page 45: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Δoutput|p(T)|

Δf(x)Δx

Sensitivity Analysis

InfluenceMetric

Page 46: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Δoutput

|p(T)|ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Page 47: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Δoutput

|p(T)|

Δoutput V

|p(T)|

ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Page 48: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Δoutput V

|p(T)|

Δoutput V

|p(T)|c

ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Δoutput

|p(T)|

Page 49: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Δoutlier V

|p(T)|cΔNormal

Δoutput V

|p(T)|c

-

ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Δoutput

|p(T)|

Δoutput V

|p(T)|

Page 50: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Δoutlier V

|p(T)|cΔNormal-

ΔOutput

“High vs Low”

|p(T)|

ΔNormal

Multiple Outputs

Δoutlier V

|p(T)|cmean ΔNormal maxoutlier normal-

Δoutput

|p(T)|

Δoutput V

|p(T)|c

Δoutput V

|p(T)|

Page 51: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Δoutlier V

|P(T)|cΔHold-out

Δoutlier

|P(T)|

Δoutlier V

|P(T)|

Δoutlier V

|P(T)|c

-

Δoutput

“High vs Low”

|P(T)|

ΔNormal

Multiple Outputs

Δoutlier V

|P(T)|cmean ΔHold-out maxoutlier normal-

influence(p)

Page 52: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Formalize “influence” as metricPredicate search heuristics

Some results

Page 53: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Page 54: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

Page 55: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

SUM({1,2,3,4,5}) = 15

Page 56: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

SUM({1,2,3,4,5}) = 15

p

Page 57: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

SUM({1,2,3,4,5}) = 15-{4,5}

p

Page 58: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(agg(T-p(T)))

SUM({1,2,3,4,5}) = 15

SUM({1,2,3}) = 6

-{4,5}

p

Page 59: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(T-p(T)))

Page 60: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(exponential)

Operator Properties

O(agg(T-p(T)))

Page 61: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

Operator Properties

Incrementally removable

Page 62: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

Incrementally removable

SUM({1,2,3,4,5}) = 15

p

Page 63: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

Incrementally removable

15 - SUM({ 4,5}) = 6

SUM({1,2,3,4,5}) = 15

p

Page 64: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

SUMCOUNTAVGSTDDEV

Incrementally removable

Page 65: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

O(exponential) O(agg(p(T)))

SUMCOUNTAVGSTDDEV

MEDIANMODE

Incrementally removable

Page 66: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

IndependentIncrementally removable

O(agg(p(T)))O(exponential)

Leastinfluence

Mostinfluence

Page 67: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

IndependentIncrementally removable

O(agg(p(T)))O(exponential)

Leastinfluence

Mostinfluence

Page 68: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

IndependentIncrementally removable

O(agg(p(T)))O(exponential)

Leastinfluence

Mostinfluence

Page 69: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

IndependentIncrementally removable

O(agg(p(T)))O(exponential)

Leastinfluence

Mostinfluence

Page 70: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Page 71: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Page 72: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Page 73: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Page 74: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Anti-monotonic

Page 75: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Anti-monotonic

p’⊂p

Page 76: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Anti-monotonic

p’⊂p

influence(p’) ≤ influence(p)

Page 77: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Bottom Up Anti-monotonic

Page 78: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Bottom Up Anti-monotonic

Page 79: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence(p)argmaxp ∈ predicatesp* =

Top DownIndependent

Incrementally removable

O(agg(p(T)))O(exponential)

Bottom Up Anti-monotonic

Page 80: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Formalize “influence” as metricPredicate search heuristics

Some results

Page 81: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

SELECT sum(Y) GROUPBY XS

um

(Y)

X

Page 82: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

SELECT sum(Y) GROUPBY XS

um

(Y)

X

Z

Y

Page 83: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

SELECT sum(Y) GROUPBY XS

um

(Y)

X

Z

Y

Z

Page 84: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

SELECT sum(Y) GROUPBY XS

um

(Y)

X

Z

Y

Z

Page 85: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

1K 5K 10K

thousand tuples / group

Page 86: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

1K 5K 10K

100

1000

10

thousand tuples / group

cost

seco

nd

s

Page 87: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

1K 5K 10K

100

1000

10

thousand tuples / group

cost

seco

nd

sNaive

Page 88: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

1K 5K 10K

100

1000

10

thousand tuples / group

cost

seco

nd

sNaive

Top down

Page 89: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

1K 5K 10K

100

1000

10

thousand tuples / group

cost

seco

nd

sNaive

Top down

Bottom up

Page 90: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

influence metric that is

accessible to end-usersfor

Data cleaningData explorationProvenance reduction

Page 91: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

scorpion

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

[email protected]

Page 92: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

scorpion

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

[email protected]

Page 93: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

scorpion

http://springfieldpunx.blogspot.com/2010/11/mortal-kombat-ninjas-scorpion.html

[email protected]

Page 94: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

C-parameter

Δoutput V |p(T)|c

Y

Z

Low C High C

Z

Page 95: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

Δoutlier V

|p(T)|cmean ΔNormal maxoutlier normal

Page 96: Scorpion Explaining Away Outliers in Aggregate Queries eugene wu and sam madden MIT

USA ItalyChina

Exp

en

ses