Wavelet Synopses with Error Guarantees
Minos Garofalakis, Phillip B. Gibbons
Information Sciences Research Center, Bell Labs, Lucent Technologies, Murray Hill, NJ 07974
ACM SIGMOD 2002

TRANSCRIPT

1

Wavelet Synopses with Error Guarantees

Minos Garofalakis, Phillip B. Gibbons
Information Sciences Research Center, Bell Labs, Lucent Technologies, Murray Hill, NJ 07974

ACM SIGMOD 2002

2

Outline

Introduction
Wavelet basics
Probabilistic wavelet synopses
Experimental study
Conclusions

3

Introduction

The wavelet decomposition has proven effective in reducing large amounts of data to compact sets of wavelet coefficients (termed "wavelet synopses") that can be used to provide fast and reasonably accurate approximate answers to queries.

Due to the exploratory nature of many Decision Support System applications, there are a number of scenarios in which the user may prefer a fast approximate answer.

4

Introduction

A major criticism of wavelet-based techniques is that conventional wavelet synopses cannot provide guarantees on the error of individual approximate query answers.

5

Introduction

The problems with approximate query processing over wavelet synopses stem from their deterministic approach to selecting coefficients and from their lack of error guarantees.

We propose an approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers.

6

Introduction

The technique is based on a probabilistic thresholding scheme that assigns each coefficient a probability of being retained, based on its importance to the reconstruction of individual data values, and then flips coins to select the synopsis.

7

Wavelet basics

Given the data vector A, the wavelet transform of A can be computed by recursive pairwise averaging and differencing (the Haar decomposition).

To equalize the importance of all wavelet coefficients, each coefficient is normalized according to its level in the decomposition.
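The averaging-and-differencing computation can be sketched in a few lines of Python (an illustration of the standard unnormalized Haar transform, not code from the slides):

```python
def haar_transform(data):
    """Unnormalized Haar wavelet decomposition of a vector whose length
    is a power of two.  Returns the coefficient vector [c0, c1, c2, ...]:
    the overall average, then details from coarsest to finest level."""
    coeffs = []
    a = list(data)
    while len(a) > 1:
        averages = [(a[i] + a[i + 1]) / 2 for i in range(0, len(a), 2)]
        details = [(a[i] - a[i + 1]) / 2 for i in range(0, len(a), 2)]
        coeffs = details + coeffs  # finer-level details stay to the right
        a = averages
    return a + coeffs
```

For a hypothetical vector such as [2, 2, 0, 2, 3, 5, 4, 4], this yields [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].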

8

Wavelet basics

A helpful tool for exploring and understanding the key properties of the wavelet decomposition is the error-tree structure.

9

Wavelet basics

The important reconstruction properties:

(P1) The reconstruction of any data value di depends only on the values of the nodes in path(di).

(P2) The range sum d(l:h) = Σ_{i=l..h} di can be computed using only the coefficients in path(dl) ∪ path(dh).

10

Wavelet basics

d5 = c0 − c2 + c5 − c10 = 65 − 14 + (−20) − 28 = 3
d(3:5) = 3c0 + (1 − 2)c2 − c4 + 2c5 − c9 + (1 − 1)c10 = 93
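Property (P1) can be checked mechanically. The sketch below walks the error-tree path for a hypothetical 8-value Haar-transformed vector (the 16-value data behind the slide's figure is not recoverable from this transcript):

```python
import math

def reconstruct(coeffs, i):
    """Rebuild data value d_i from only the log2(N)+1 coefficients on
    path(d_i) in the error tree: a coefficient contributes with sign +1
    when d_i lies in its left subtree and -1 otherwise (property P1)."""
    n = len(coeffs)
    levels = int(math.log2(n))
    value = coeffs[0]   # the root average c0 always contributes +c0
    j = 1               # c1 is the root's single detail child
    for l in range(levels):
        bit = (i >> (levels - 1 - l)) & 1   # 0 = left subtree, 1 = right
        value += coeffs[j] if bit == 0 else -coeffs[j]
        j = 2 * j + bit                     # descend toward d_i
    return value

# Hypothetical example: Haar transform of [2, 2, 0, 2, 3, 5, 4, 4].
coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
```

Here reconstruct(coeffs, 4) recovers 3.0 and reconstruct(coeffs, 5) recovers 5.0, matching the original data.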

11

Probabilistic wavelet synopses: A. The problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retains the B wavelet coefficients with the largest absolute value after normalization; this deterministic process minimizes the overall L2 error.
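A minimal sketch of this deterministic thresholding, assuming the unnormalized Haar convention in which a coefficient's L2 contribution scales with the square root of its support (`largest_b_synopsis` and `weight` are illustrative names, not from the paper):

```python
import math

def largest_b_synopsis(coeffs, B):
    """Deterministic thresholding: keep the B coefficients whose
    normalized absolute value is largest, zero out the rest.  For the
    unnormalized Haar transform, coefficient c_j contributes to
    N / 2^level(j) data values, so its L2 weight is |c_j| * sqrt(support)."""
    n = len(coeffs)

    def weight(j):
        support = n if j == 0 else n // (2 ** int(math.log2(j)))
        return abs(coeffs[j]) * math.sqrt(support)

    keep = set(sorted(range(n), key=weight, reverse=True)[:B])
    return [c if j in keep else 0 for j, c in enumerate(coeffs)]
```

For example, largest_b_synopsis([2.75, -1.25, 0.5, 0, 0, -1, -1, 0], 3) keeps c0, c1, and one of the tied finest-level details.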

12

Probabilistic wavelet synopses: A. The problem with conventional wavelets

Retaining only c0 = 65 gives d5 = 65 − 0 + 0 − 0 = 65 and d(3:5) = 3·65 − 0 − 0 + 0 − 0 = 195.

13

Probabilistic wavelet synopses: A. The problem with conventional wavelets

Root causes: (1) strict deterministic thresholding; (2) independent thresholding; (3) the bias resulting from dropping coefficients without compensating for their loss.

14

Probabilistic wavelet synopses: B. General Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value (the rounding value) or down to zero.

By carefully selecting the rounding values we ensure that: (1) we expect a total of B coefficients to be retained; and (2) we minimize a desired error metric in the reconstruction of the data.

15

Probabilistic wavelet synopses: B. General Approach

The key idea in the thresholding scheme is to associate with each coefficient ci a random variable Ci such that (1) Ci = 0 with some probability, and (2) E[Ci] = ci, where we select a rounding value λi for each non-zero ci such that 0 < ci/λi ≤ 1.

16

Probabilistic wavelet synopses: B. General Approach

Our thresholding scheme essentially "rounds" each non-zero wavelet coefficient ci independently to either λi or zero, by flipping a biased coin with success probability ci/λi.

Its variance is simply Var[Ci] = ci(λi − ci).
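The biased coin flip can be sketched as follows (`probabilistic_round` is an illustrative name; the success probability ci/λi follows from the unbiasedness requirement E[Ci] = ci):

```python
import random

def probabilistic_round(coeffs, lambdas, rng):
    """Round each non-zero c_i independently to lambda_i with success
    probability y_i = c_i / lambda_i, and to zero otherwise.  Then
    E[C_i] = c_i (unbiased) and Var[C_i] = c_i * (lambda_i - c_i)."""
    out = []
    for c, lam in zip(coeffs, lambdas):
        if c != 0 and rng.random() < c / lam:
            out.append(lam)
        else:
            out.append(0.0)
    return out
```

Averaging many independent rounds of a coefficient c = 2 with λ = 4 gives an empirical mean close to 2, illustrating unbiasedness.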

17

Probabilistic wavelet synopses: B. General Approach

For example: λ0 = c0, λ10 = 2c10, λi = 3ci/2 for the other coefficients.

18

Probabilistic wavelet synopses: B. General Approach

The impact of the λi's:

Choosing λi closer to ci reduces the variance.

Choosing λi further from ci reduces the expected number of retained coefficients.

19

Probabilistic wavelet synopses: C. Rounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimizes some overall error metric (e.g., the L2 error).


20

Probabilistic wavelet synopses: C. Rounding to minimize the expected mean-square error

Letting yi = ci/λi denote the retention probability of ci (stated here in terms of the normalized coefficients), the expected L2-error minimization problem is equivalent to minimizing Σi ci²(1/yi − 1) subject to Σi yi = B.

Based on the Cauchy–Schwarz inequality, the minimum value of the objective is reached when yi = B·|ci| / Σj |cj|.
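Spelled out, the Cauchy–Schwarz step (reconstructed here in terms of the retention probabilities yi = ci/λi; the slide's own formulas were lost in transcription) runs:

```latex
% Minimize \sum_i c_i^2 (1/y_i - 1) subject to \sum_i y_i = B,\ 0 < y_i \le 1.
% By Cauchy--Schwarz,
\Big(\sum_i \frac{c_i^2}{y_i}\Big)\Big(\sum_i y_i\Big)
  \;\ge\; \Big(\sum_i |c_i|\Big)^2,
\qquad\text{so}\qquad
\sum_i \frac{c_i^2}{y_i} \;\ge\; \frac{\big(\sum_i |c_i|\big)^2}{B},
% with equality exactly when |c_i|/\sqrt{y_i} \propto \sqrt{y_i},
% i.e. when  y_i = B\,|c_i| \Big/ \sum_j |c_j|.
```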

21

Probabilistic wavelet synopses: C. Rounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual data values (the relative error).

The goal is to produce an estimate for each value di such that the relative error of the reconstructed value, |d̂i − di| / max{|di|, S} (where S is a sanity bound), is small.

23

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

The expected value of the reconstructed d̂i already equals di, so we would like to minimize the variance of d̂i.

More precisely, we seek to minimize the normalized standard error NSE(d̂i) = √Var(d̂i) / max{|di|, S} for a reconstructed data value.

24

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

Note that by applying Chebyshev's Inequality we obtain, for all α > 1: Pr( |d̂i − di| / max{|di|, S} ≥ α · NSE(d̂i) ) ≤ 1/α².

So minimizing the NSE will indeed minimize the probabilistic bounds on the relative-error metric.

25

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

26

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

We would like to formulate a dynamic-programming recurrence for this problem.

Let PATHSj denote the set of all root-to-leaf paths in Tj, and let M[j, B] denote the optimal value of the maximum NSE among all data values dk in Tj, assuming a space budget of B.

27

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

The recurrence for M[j, B] is depicted in Equation (11) of the paper.

28

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

29

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

The problem with (11) is that the yi and bL each range over a continuous interval, making the recurrence infeasible to use directly.

The key technical idea is to quantize the solution space.

We modify the constraint so that each yi is restricted to multiples of 1/q, where q is an input integer.
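The quantization itself is simple to state in code (a sketch; `quantized_probabilities` is an illustrative name):

```python
def quantized_probabilities(q):
    """Quantize the solution space: restrict each retention probability
    y_i to the multiples {1/q, 2/q, ..., q/q}, so the dynamic program
    searches a finite grid instead of a continuous interval."""
    return [k / q for k in range(1, q + 1)]
```

With q = 10 (the value used in the experiments), each yi is drawn from {0.1, 0.2, ..., 1.0}.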

30

Probabilistic wavelet synopses: E. Low-bias probabilistic wavelet synopses

Each coefficient is either retained in its entirety or discarded, according to the probabilities yi, where as before the yi's are selected to minimize a desired error metric.

31

Probabilistic wavelet synopses: F. Summary of the approach

32

Experimental study

A Zipfian data generator was used to produce Zipfian frequencies for various levels of skew (z parameter between 0.3 and 2.0).

We also use a real-world data set downloaded from the National Forest Service.

We set q = 10, the sanity bound S to the 10-percentile value in the data, and the perturbation Δ = min{0.01, S/100}.
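A Zipfian frequency generator of the kind described can be sketched as follows (an illustration under assumed conventions, not the authors' generator; `zipf_frequencies` is a hypothetical name):

```python
def zipf_frequencies(n, z, total=10000):
    """Generate n Zipfian frequencies with skew parameter z: value i
    gets frequency proportional to 1 / i^z, scaled so the counts sum to
    roughly `total`.  z = 0 is uniform; larger z means more skew."""
    weights = [1.0 / (i ** z) for i in range(1, n + 1)]
    s = sum(weights)
    return [round(total * w / s) for w in weights]
```

Sweeping z from 0.3 to 2.0, as in the study, moves the distribution from nearly uniform to highly skewed.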

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions

We have introduced probabilistic wavelet synopses, the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers.

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 2: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

2

Outline

Introduction Wavelet basics Probabilistic wavelet synopses Experimental study Conclusions

3

Introduction The wavelet decomposition has demonstrated

the effectiveness in reducing large amounts of data to compact sets of wavelet coefficients (termed ldquowavelet synopsesrdquo) that can be used to provide fast and reasonably accurate approximate answers to queries

Due to exploratory nature of many Decision Support Systems applications there are a number of scenarios in which the user may prefer a fast approximate answer

4

Introduction A major criticism of wavelet-based

techniques is the fact that conventional wavelet synopses can not provide guarantees on the error of individual approximate query answers

5

Introduction The problem for approximate query

processing with wavelet synopses due to their deterministic approach to selecting coefficients and their lack of error guarantees

We propose a approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 3: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

3

Introduction The wavelet decomposition has demonstrated

the effectiveness in reducing large amounts of data to compact sets of wavelet coefficients (termed ldquowavelet synopsesrdquo) that can be used to provide fast and reasonably accurate approximate answers to queries

Due to exploratory nature of many Decision Support Systems applications there are a number of scenarios in which the user may prefer a fast approximate answer

4

Introduction A major criticism of wavelet-based

techniques is the fact that conventional wavelet synopses can not provide guarantees on the error of individual approximate query answers

5

Introduction The problem for approximate query

processing with wavelet synopses due to their deterministic approach to selecting coefficients and their lack of error guarantees

We propose a approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 4: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

4

Introduction A major criticism of wavelet-based

techniques is the fact that conventional wavelet synopses can not provide guarantees on the error of individual approximate query answers

5

Introduction The problem for approximate query

processing with wavelet synopses due to their deterministic approach to selecting coefficients and their lack of error guarantees

We propose a approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 5: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

5

Introduction The problem for approximate query

processing with wavelet synopses due to their deterministic approach to selecting coefficients and their lack of error guarantees

We propose a approach to building wavelet synopses that enables unbiased approximate query answers with error guarantees on the accuracy of individual answers

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses: B. General approach

Our thresholding scheme essentially "rounds" each non-zero wavelet coefficient ci independently to either λi or zero, by flipping a biased coin with success probability yi = ci/λi.

Its variance is simply Var(Ci) = ci * (λi - ci).
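Since Ci equals λi with probability ci/λi and 0 otherwise, E[Ci] = ci and Var(Ci) = ci(λi - ci). A quick simulation (with arbitrary illustrative values for ci and λi) confirms the unbiasedness empirically:

```python
import random

def round_coefficient(c, lam, rng):
    """Biased coin flip: retain c as lam with probability c/lam, else 0.
    E[C] = lam * (c / lam) = c, so the estimate is unbiased;
    Var[C] = lam**2 * (c / lam) - c**2 = c * (lam - c)."""
    return lam if rng.random() < c / lam else 0.0

rng = random.Random(0)
c, lam = 28.0, 42.0                      # retention probability 2/3
draws = [round_coefficient(c, lam, rng) for _ in range(100_000)]
mean = sum(draws) / len(draws)
print(mean)   # close to c = 28 (standard error of the mean is about 0.06)
```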

17

Probabilistic wavelet synopses: B. General approach

For example: λ0 = c0, λ10 = 2c10, λi = 3ci/2.

18

Probabilistic wavelet synopses: B. General approach - the impact of the λi's

Choosing λi closer to ci reduces the variance of Ci.

Choosing λi further from ci reduces the expected number of retained coefficients.

19

Probabilistic wavelet synopses: C. Rounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimizes some overall error metric (e.g., the L2 error).

20

Probabilistic wavelet synopses: C. Rounding to minimize the expected mean-square error

Letting yi = ci/λi (the retention probability of ci), the expected L2 error minimization problem is equivalent to minimizing the sum of ci^2 * (1/yi - 1) subject to the sum of the yi being at most B.

Based on the Cauchy-Schwarz inequality, the minimum value of the objective is reached when the yi are proportional to |ci|, i.e., yi = B * |ci| / sum_j |cj| (capped at 1).
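The resulting allocation can be computed directly. A sketch (function name and coefficient values are illustrative; it caps probabilities at 1 but omits the redistribution of any leftover budget that a complete solution would perform):

```python
def min_l2_retention_probs(coeffs, B):
    """Retention probabilities y_i minimizing the expected L2 error
    sum_i c_i^2 * (1 / y_i - 1)  subject to  sum_i y_i <= B.
    Cauchy-Schwarz => y_i proportional to |c_i|, capped at 1."""
    total = sum(abs(c) for c in coeffs)
    return [min(1.0, B * abs(c) / total) if c else 0.0 for c in coeffs]

# Illustrative coefficients, with a budget of B = 2 expected retentions:
ys = min_l2_retention_probs([65, -14, 20, 28], 2)
print(ys)   # the largest coefficient is retained with probability 1
```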

21

Probabilistic wavelet synopses: C. Rounding to minimize the expected mean-square error

22

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual data values (the relative error).

The goal is to produce an estimate d̂i for each value di such that the relative error |d̂i - di| / max{|di|, S} is small, where S is a sanity bound.

23

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

The expected value of the estimate d̂i is di; we would like to minimize its variance.

More precisely, we seek to minimize the normalized standard error for a reconstructed data value: NSE(d̂i) = sqrt(Var(d̂i)) / max{|di|, S}.

24

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

Note that by applying Chebyshev's inequality we obtain, for all α > 1: P[ |d̂i - di| >= α * NSE(d̂i) * max{|di|, S} ] <= 1/α².

So minimizing the NSE will indeed minimize the probabilistic bounds on the relative error metric.
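The Chebyshev bound can be sanity-checked by simulation for a single rounded coefficient (the values below are arbitrary illustrations): the deviation exceeds α standard deviations with frequency at most 1/α².

```python
import random

rng = random.Random(1)
c, lam = 20.0, 50.0                    # retention probability c/lam = 0.4
sigma = (c * (lam - c)) ** 0.5         # sqrt(Var) = sqrt(c * (lam - c))
alpha = 1.2                            # any alpha > 1
n = 200_000

# Count how often |C - c| >= alpha * sigma over many coin flips.
hits = sum(
    1 for _ in range(n)
    if abs((lam if rng.random() < c / lam else 0.0) - c) >= alpha * sigma
)
print(hits / n, "<=", 1 / alpha ** 2)  # roughly 0.4 <= ~0.694
```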

25

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

26

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem.

Let PATHSj denote the set of all root-to-leaf paths in Tj, and let M[j, B] denote the optimal (minimum possible) value of the maximum NSE among all data values dk in Tj, assuming a space budget of B.

27

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

M[j, B] is defined by the dynamic-programming recurrence depicted in (11).

28

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

29

Probabilistic wavelet synopses: D. Rounding to minimize the maximum relative error

The problem in (11) is that yi and bL each range over a continuous interval, making the recurrence infeasible to use directly.

The key technical idea is to quantize the solution space.

We modify the constraint so that each yi is restricted to the discrete set {1/q, 2/q, ..., 1}, where q is an input integer.
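To make the quantized dynamic program concrete, here is a much-simplified hypothetical sketch: it ignores the sanity bound and relative-error normalization, and simply minimizes, over allocations in multiples of 1/q, the maximum variance ci^2 * (1/yi - 1) accumulated along any root-to-leaf path of a tiny error tree (coefficient values and tree are illustrative, not from the talk).

```python
from functools import lru_cache

# Tiny error tree in heap order: node j has children 2j and 2j + 1;
# indices absent from `coeffs` are data leaves. Values are illustrative.
coeffs = {1: 65.0, 2: 14.0, 3: 30.0}
q = 4                 # quantization: y_i in {1/q, 2/q, ..., 1}
budget = 8            # total quanta available (i.e., B = budget / q)

def var(c, k):
    """Variance c^2 * (1/y - 1) when y = k/q is spent on coefficient c."""
    y = k / q
    return c * c * (1.0 / y - 1.0)

@lru_cache(maxsize=None)
def M(j, b):
    """Minimum, over allocations of b quanta within the subtree rooted at
    node j, of the maximum variance accumulated on any root-to-leaf path
    of that subtree (infinite if some coefficient would get 0 quanta)."""
    if j not in coeffs:                    # data leaf: nothing to spend
        return 0.0
    best = float("inf")
    for k in range(1, min(b, q) + 1):      # quanta for coefficient j
        for bl in range(0, b - k + 1):     # split the rest among children
            br = b - k - bl
            cost = var(coeffs[j], k) + max(M(2 * j, bl), M(2 * j + 1, br))
            best = min(best, cost)
    return best

print(M(1, budget))   # 588.0 for this tiny example
```

The real recurrence additionally normalizes each path's variance by max{|dk|, S} and works with NSE rather than raw variance, but the quantized search over k and the budget split (bL, bR) have the same shape.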

30

Probabilistic wavelet synopses: E. Low-bias probabilistic wavelet synopses

Each coefficient ci is either retained (as ci) or discarded according to the probability yi, where, as before, the yi's are selected to minimize a desired error metric.

31

Probabilistic wavelet synopses: F. Summary of the approach

32

Experimental study: A Zipfian data generator was used to produce Zipfian frequencies for various levels of skew (z parameter between 0.3 and 2.0).

We also use a real-world data set downloaded from the National Forest Service.

We set q = 10, take the sanity bound S to be the 10th percentile of the data, and use perturbation Δ = min{0.01, S/100}.

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions: We have introduced probabilistic wavelet synopses, the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers.

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics.

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach.

Page 6: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

6

Introduction The technique is based on probabilistic thre

sholding scheme that assigns each coefficient a probability of being retained based on its importance to the reconstruction of individual data values and then flips coins to select the synopsis

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 7: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

7

Wavelet basics Given the data vector A the wavelet

transform of A can be computed as follow

In order equalize the importance of all wavelet coefficients we normalize the coefficient is

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 8: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

8

Wavelet basics A helpful tool for exploring and

understanding the key properties of the wavelet decomposition is error tree structure

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 9: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

9

Wavelet basics The important reconstruction properties

(P1)The reconstruction of any data value di depends on the values of the nodes in path(di)

(P2)The range sum d(lh)=

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 10: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

10

Wavelet basics

d5=c0-c2+c5-c10=65-14+(-20)-28=3 d(35)=3c0+(1-2)c2-c4+2c5-c9+(1-1)c10=93

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses E. Low-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to its probability y_i, where, as before, the y_i's are selected to minimize a desired error metric

31

Probabilistic wavelet synopses F. Summary of the approach

32

Experimental study A Zipfian data generator was used to produce Zipfian frequencies for various levels of skew (z parameter between 0.3 and 2.0)

We also use a real-world data set downloaded from the National Forest Service

We set q = 10, the sanity bound S to the 10th percentile of the data values, and the perturbation Δ = min{0.01, S/100}
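Zipfian frequencies of this kind can be generated along the following lines (a generic sketch, not the authors' generator; the function name and scaling are assumptions):

```python
def zipf_frequencies(n, z, total=10000.0):
    """Frequencies for n values following a Zipfian distribution with
    skew parameter z (z = 0 is uniform; larger z is more skewed),
    scaled so the frequencies sum to `total`."""
    weights = [1.0 / rank ** z for rank in range(1, n + 1)]
    s = sum(weights)
    return [total * w / s for w in weights]
```

At z = 0 every value gets the same frequency; as z grows toward 2.0 the mass concentrates heavily on the lowest ranks, which is what makes the skewed settings in the experiments harder to summarize.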

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We have introduced probabilistic wavelet synopses,

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 11: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

11

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Conventional coefficient thresholding is a completely deterministic process that typically retain the B wavelet coefficients with the largest absolute value after normalization this deterministic process minimizes the overall L2 error

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 12: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

12

Probabilistic wavelet synopsesAThe problem with conventional wavelets

d5=65-0+0-0=65 d(35)=365-0-0+0-0=195

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 13: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

13

Probabilistic wavelet synopsesAThe problem with conventional wavelets

Root causes (1)strict deterministic thresholding (2)independent thresholding (3)the bias resulting from dropping coeffi

cients without compensating for their loss

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 14: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

14

Probabilistic wavelet synopses BGeneral Approach

Our scheme deterministically retains the most important coefficients while randomly rounding the other coefficients either up to a larger value( rounding value) or down to zero

By carefully selecting the rounding values we ensure that (1)We expect a total of B coefficients to be

retained (2)We minimize a desired error metric in the

reconstruction of the data

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 15: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

15

Probabilistic wavelet synopses BGeneral Approach

The key idea in thresholding scheme is to associate a random variable Ci such that (1)Ci=0 with some probability (2)E[Ci] = ci

where we select a rounding value λi for each non-zero ci such that

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 16: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

16

Probabilistic wavelet synopses BGeneral Approach

Our thresholding scheme essentially ldquoroundsrdquo each non-zero wavelet coefficient ci independently to either λi or zero by flipping a biased coin with success probability

It variance is simply

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)

28

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

29

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The problem in (11) is that the yi and bL each range over a continuous interval making it infeasible to use

The key technical idea is to quantize the solution space

We modify the constraint

where q is a input integer

30

Probabilistic wavelet synopses ELow-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to the probabilities yi where as before the yirsquos are selected to minimize a desired error metric

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 17: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

17

Probabilistic wavelet synopses BGeneral Approach 1

For example λ0=c0 λ10= 2c10 λi=3ci2

18

Probabilistic wavelet synopses BGeneral Approach The impact of the λirsquo s

λi closer ci reduce the variance

λi further from ci reduces the expected number of retained coefficients

19

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

A reasonable approach is to select the λi values in a way that minimize the some overall error metric (egL2)

1

20

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Letting and The expected L2 error minimization problem is

equivalent to

Based on the Cauchy-Schwarz inequality the minimum value of the objective is reached when

21

Probabilistic wavelet synopses CRounding to minimize the expected mean-square error

Let

22

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We focus on minimizing the maximum reconstruction error for individual (related error)

The goal is to produce estimate for each value di such that

23

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

The expected value of we would like to minimize the variance

More precisely we seek to minimize the normalized standard error for a reconstructed data value

24

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

Note that by applying Chebyshevrsquos Inequality we obtain( for all αgt1)

So that minimizing NSE will indeed minimize the probabilistic bounds on relative error metric

25

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

26

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

We would like to formulate a dynamic programming recurrence for this problem

Let PATHSj denote the set of all root-to-leaf pahts in Tj M[jB] denote the optimal value of the maximum among all data dk in Tj assuming a space budget of B

27

Probabilistic wavelet synopses DRounding to minimize the maximum relative error

M[jB] depicted in (11)


The problem with (11) is that the yi and the bL each range over a continuous interval, making the recurrence infeasible to use directly.

The key technical idea is to quantize the solution space.

We modify the constraint so that each yi is restricted to integer multiples of 1/q, where q is an input integer.
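A simplified sketch of the quantized dynamic program, assuming a perfect binary error tree stored in heap order and ignoring the sanity-bound normalization; node j's variance contribution to every root-to-leaf path below it under retention probability y is taken as c_j²·(1-y)/y, consistent with the unbiased scheme. Names are illustrative, not the paper's.

```python
from functools import lru_cache

def max_path_variance(coeffs, budget_units, q):
    """Quantized DP sketch: coeffs in heap order, budget in units of 1/q.

    M(j, b) = minimum, over retention probabilities y_j in {1/q, ..., q/q}
    and integer splits of the remaining units between the two children, of
        var_j(y_j) + max(M(left, bl), M(right, br)),
    where var_j(y) = c_j**2 * (1 - y) / y.
    """
    n = len(coeffs)

    @lru_cache(maxsize=None)
    def M(j, b):
        if j >= n:                              # past a leaf: no cost, no budget needed
            return 0.0
        best = float("inf")
        for units in range(1, min(b, q) + 1):   # y_j = units / q > 0 (unbiasedness)
            y = units / q
            var = coeffs[j] ** 2 * (1 - y) / y
            rest = b - units
            left, right = 2 * j + 1, 2 * j + 2
            for bl in range(rest + 1):          # split remaining units
                cand = var + max(M(left, bl), M(right, rest - bl))
                best = min(best, cand)
        return best

    return M(0, budget_units)
```

Quantizing y to multiples of 1/q is exactly what turns the continuous recurrence into this finite table over integer budget units.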


Probabilistic wavelet synopses E. Low-bias probabilistic wavelet synopses

Each coefficient is either retained or discarded according to its probability yi, where as before the yi's are selected to minimize a desired error metric.


Probabilistic wavelet synopses F. Summary of the approach


Experimental study

A Zipfian data generator was used to produce Zipfian frequencies for various levels of skew (z parameter between 0.3 and 2.0).

We also use a real-world data set downloaded from the National Forest Service.

We let q = 10, the sanity bound S be the 10-percentile value in the data, and the perturbation Δ = min{0.01, S/100}.
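The Zipfian frequencies used in the synthetic experiments can be generated in a few lines; a sketch, with `zipf_frequencies` a hypothetical name:

```python
def zipf_frequencies(n, z):
    """Frequencies f_i proportional to 1 / i**z, normalized to sum to 1.

    Larger z means more skew: probability mass concentrates on early ranks.
    """
    weights = [1.0 / (i ** z) for i in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

Sweeping z from 0.3 to 2.0, as in the experiments, moves the data from mildly to heavily skewed.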


Conclusions

We have introduced probabilistic wavelet synopses, the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers.

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach


34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 31: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

31

Probabilistic wavelet synopsesF Summary of the approach

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 32: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

32

Experimental study A Zipfian data generator was used to produ

ce Zipfian frequencies for various levels of skew (z parameter between 03 to 20)

We use real world data set download from the National Forest Service

Let q=10 sanity bound S as the 10-percentile in the da

ta perturbation Δ= min001 S100

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 33: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

33

Experimental study

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 34: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

34

Experimental study

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 35: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

35

Experimental study

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach

Page 36: 1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray

36

Conclusions We has introduced probabilistic wavelet synopses

the first wavelet-based data reduction technique that provably enables unbiased data reconstruction with error guarantees on individual approximate answers

We have described a number of novel techniques for tuning our scheme to minimize desired error metrics

Experimental results on real-world and synthetic data sets demonstrate that probabilistic wavelet synopses significantly reduce relative error compared with the deterministic approach