The Impact of Duality on Data Synopsis Problems
Panagiotis Karras, KDD, San Jose, August 13th, 2007
Work with Dimitris Sacharidis and Nikos Mamoulis


Page 1

The Impact of Duality on Data Synopsis Problems

Panagiotis Karras
KDD, San Jose, August 13th, 2007

Work with Dimitris Sacharidis and Nikos Mamoulis

Page 2

Introduction

• Data synopsis problems require the optimization of error under a bound on space.

• Classical approaches treat them in a direct manner, producing complicated solutions and sometimes resorting to heuristics.

• The parameters involved (space and error) have a monotonic relationship.

• Hence, an alternative approach is possible, based on the dual, error-bounded problems.

Page 3

Outline

• Histograms.
• Restricted Haar Wavelet Synopses.
• Unrestricted Haar and Haar+ Synopses.
• Experiments.
• Conclusions.

Page 4

Histograms

• Approximate a data set [d1, d2, …, dn] with B buckets, si = [bi, ei, vi], so that a maximum-error metric is minimized.

• Classical solution: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005

$$E(i, b) = \min_{1 \le j < i} \max\{E(j, b-1),\ \mathcal{E}(j+1, i)\}$$

Here $E(i, b)$ is the minimum error for the first i values using b buckets, and $\mathcal{E}(j+1, i)$ is the error of a single bucket over $d_{j+1}, \dots, d_i$.

Complexity: $O(nB \log^2 n)$

• Recent solutions:

Buragohain et al. ICDE 2007: $O(n \log n \log U)$

Guha and Shim TKDE 19(7) 2007: $O(n + B^2 \log^3 n)$

Linear for: $B \le \sqrt{n / \log^3 n}$, e.g. $n = 2^{30} = 1{,}073{,}741{,}824 \Rightarrow B \le 199$

For weighted error: $O(n \log n + B^2 \log^6 n)$
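The classical space-bounded recurrence can be illustrated with a direct dynamic program. This is only a minimal sketch, assuming unweighted maximum absolute error (so the best error of one bucket is half its value range). The function name `optimal_max_error` is hypothetical, and the plain triple loop costs O(n²B), not the faster bounds cited on the slide.

```python
def optimal_max_error(data, B):
    """Minimum achievable max-abs error using at most B buckets (sketch)."""
    n = len(data)
    INF = float("inf")
    # E[b][i]: best error for the first i values with at most b buckets.
    E = [[INF] * (n + 1) for _ in range(B + 1)]
    for b in range(B + 1):
        E[b][0] = 0.0  # an empty prefix needs no bucket
    for b in range(1, B + 1):
        for i in range(1, n + 1):
            lo = hi = data[i - 1]
            best = INF
            # The last bucket covers data[j..i-1]; grow it leftwards.
            for j in range(i - 1, -1, -1):
                lo = min(lo, data[j])
                hi = max(hi, data[j])
                bucket_err = (hi - lo) / 2  # assumption: unweighted abs error
                best = min(best, max(E[b - 1][j], bucket_err))
            E[b][i] = best
    return E[B][n]


sample = [4, 5, 6, 2, 15, 17, 3, 6, 9, 12]
print(optimal_max_error(sample, 4))
```

Every cell depends on the bucket budget b, which is the tabulation over space allocations that the dual approach avoids.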

Page 5

Histograms

• Solve the error-bounded problem.

Example with Maximum Absolute Error bound ε = 2:

Data: 4 5 6 2 15 17 3 6 9 12 …

Buckets: [ 4 ] [ 16 ] [ 4.5 ] [ …

• Generalized to any weighted maximum-error metric.

Each value $d_i$ defines a tolerance interval $\left[d_i - \frac{\varepsilon}{w_i},\ d_i + \frac{\varepsilon}{w_i}\right]$.

A bucket is closed when the running intersection of the intervals becomes empty.

Complexity: $O(n)$
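The single-pass procedure on this slide can be sketched as follows. This is an illustrative reading rather than the authors' code: the name `error_bounded_histogram` is hypothetical, weights default to 1, and each bucket value is taken as the midpoint of the final interval intersection, which matches the slide's example values.

```python
def error_bounded_histogram(data, eps, weights=None):
    """Greedy O(n) pass: fewest buckets with weighted max error <= eps (sketch)."""
    if weights is None:
        weights = [1.0] * len(data)
    buckets = []  # each bucket is (start index, end index, value)
    start = 0
    lo, hi = float("-inf"), float("inf")  # running intersection
    for i, (d, w) in enumerate(zip(data, weights)):
        # Tolerance interval of d_i: [d_i - eps/w_i, d_i + eps/w_i].
        new_lo = max(lo, d - eps / w)
        new_hi = min(hi, d + eps / w)
        if new_lo > new_hi:
            # Intersection became empty: close the bucket before d_i.
            buckets.append((start, i - 1, (lo + hi) / 2))
            start = i
            new_lo, new_hi = d - eps / w, d + eps / w
        lo, hi = new_lo, new_hi
    if data:
        buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets


for bucket in error_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 2):
    print(bucket)
```

On the slide's data with ε = 2, the first three buckets take the values 4, 16 and 4.5, as in the example.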

Page 6

Histograms

• Apply to the space-bounded problem.

Perform binary search in the domain of the error bound ε.

For an error bound ε that requires space $B' \le B$ and yields actual error $\varepsilon' \le \varepsilon$, run an optimality test: rerun the error-bounded algorithm under the strict constraint error $< \varepsilon'$ instead of error $\le \varepsilon$. If that run requires $\tilde{B} > B$ space, then the optimal solution has been reached.

Complexity: $O(n \log \varepsilon^*)$, where $\varepsilon^*$ is the optimal error. It is independent of the number of buckets B.
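A minimal sketch of the binary-search reduction follows, under a simplifying assumption that is not on the slide: the data are integers and the error is unweighted, so every candidate optimum is a multiple of 0.5 and the search can run over that grid, without the strict-inequality optimality test. The names `greedy_buckets` and `space_bounded_histogram` are hypothetical.

```python
def greedy_buckets(data, eps):
    """Error-bounded greedy pass (unweighted); returns the bucket values."""
    values = []
    lo, hi = float("-inf"), float("inf")
    for d in data:
        new_lo, new_hi = max(lo, d - eps), min(hi, d + eps)
        if new_lo > new_hi:  # empty intersection: close the bucket
            values.append((lo + hi) / 2)
            new_lo, new_hi = d - eps, d + eps
        lo, hi = new_lo, new_hi
    if data:
        values.append((lo + hi) / 2)
    return values


def space_bounded_histogram(data, B):
    """Smallest error whose greedy histogram fits in B >= 1 buckets (sketch)."""
    # Assumption: integer data, so candidate errors are k / 2 for integer k.
    lo, hi = 0, max(data) - min(data)  # k = hi always fits in one bucket
    while lo < hi:
        mid = (lo + hi) // 2
        if len(greedy_buckets(data, mid / 2)) <= B:
            hi = mid       # feasible: try a smaller error
        else:
            lo = mid + 1   # needs too many buckets: raise the error
    eps = lo / 2
    return eps, greedy_buckets(data, eps)


print(space_bounded_histogram([4, 5, 6, 2, 15, 17, 3, 6, 9, 12], 4))
```

The search works because the number of buckets needed never increases as the error bound grows, which is the monotonic relationship the talk relies on. B affects only the feasibility comparison, not the cost of each greedy pass.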

Page 7

Restricted Haar Wavelet Synopses

[Figure: Haar error tree for the data 34 16 2 20 20 0 36 16. Pairwise averages: 25 11 10 26, then 18 18. Coefficients: overall average 18; details 0; 7 -8; 9 -9 10 10.]

• Select a subset of the Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized.

• Classical solution: Garofalakis and Kumar PODS 2004; Guha VLDB 2005

$$E(i, v, b) = \min_{b'} \begin{cases} \max\{E(i_L, v, b'),\ E(i_R, v, b - b')\} \\ \max\{E(i_L, v + z_i, b'),\ E(i_R, v - z_i, b - b' - 1)\} \end{cases}$$

Here v is the incoming value at node i and $z_i$ is its coefficient; the first case drops $z_i$ and the second retains it.

Complexity: $O(n^2)$
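The coefficient values in the figure can be reproduced with a short Haar decomposition. This is a minimal sketch, assuming the unnormalized convention (average and half-difference of each pair) and an input length that is a power of two; `haar_decompose` is a hypothetical name.

```python
def haar_decompose(data):
    """Return [overall average, detail coefficients from coarse to fine]."""
    coeffs = []
    level = list(data)
    while len(level) > 1:
        pairs = list(zip(level[0::2], level[1::2]))
        averages = [(a + b) / 2 for a, b in pairs]
        details = [(a - b) / 2 for a, b in pairs]
        coeffs = details + coeffs  # coarser levels go in front
        level = averages
    return level + coeffs


print(haar_decompose([34, 16, 2, 20, 20, 0, 36, 16]))
# -> [18.0, 0.0, 7.0, -8.0, 9.0, -9.0, 10.0, 10.0]
```

With this convention a left child is rebuilt as the incoming value plus the coefficient and a right child as the incoming value minus it, which is the v + z / v − z pattern in the recurrence.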

Page 8

Restricted Haar Wavelet Synopses

• Solve the error-bounded problem. Muthukrishnan FSTTCS 2005

$$S(i, v) = \min\{S(i_L, v) + S(i_R, v),\ S(i_L, v + z_i) + S(i_R, v - z_i) + 1\}$$

Local search within each of the $\frac{n}{\log n}$ subtrees in the bottom $\log \log n$ Haar tree levels.

Complexity: $O\!\left(\frac{n^2}{\log n}\right)$

• Apply to the space-bounded problem.

Complexity: $O\!\left(\frac{n^2}{\log n} \log \varepsilon^*\right)$; no significant advantage
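The S(i, v) recurrence for the restricted case can be sketched as a plain recursion. This is an illustration only: it is the straightforward quadratic evaluation, not Muthukrishnan's subquadratic local-search algorithm, and it assumes unweighted maximum absolute error. The heap-style node numbering and the names `haar_decompose` and `min_restricted_coeffs` are hypothetical choices.

```python
def haar_decompose(data):
    """Unnormalized Haar transform: [average, details coarse to fine]."""
    coeffs, level = [], list(data)
    while len(level) > 1:
        pairs = list(zip(level[0::2], level[1::2]))
        coeffs = [(a - b) / 2 for a, b in pairs] + coeffs
        level = [(a + b) / 2 for a, b in pairs]
    return level + coeffs


def min_restricted_coeffs(data, eps):
    """Fewest retained Haar coefficients with max abs error <= eps (sketch)."""
    n = len(data)  # assumed to be a power of two, n >= 2
    c = haar_decompose(data)
    INF = float("inf")

    def S(i, v):
        # Nodes 1..n-1 are detail coefficients; index i >= n is leaf data[i-n].
        if i >= n:
            return 0 if abs(data[i - n] - v) <= eps else INF
        drop = S(2 * i, v) + S(2 * i + 1, v)
        keep = S(2 * i, v + c[i]) + S(2 * i + 1, v - c[i]) + 1
        return min(drop, keep)

    # The overall average c[0] sits above node 1 and may be kept or dropped.
    return min(S(1, 0.0), S(1, c[0]) + 1)


sample = [34, 16, 2, 20, 20, 0, 36, 16]
print(min_restricted_coeffs(sample, 18))
```

Unlike E(i, v, b), the state carries no budget b, so nothing is tabulated per space allocation.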

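Page 9

Unrestricted Haar and Haar+ Synopses

• Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized.

• Classical solutions:

Unrestricted Haar: Guha and Harb KDD 2005, SODA 2006

$$E(i, v, b) = \min_{z \in S_{i,v},\ 0 \le b' \le b - (z \ne 0)} \max\{E(i_L, v + z, b'),\ E(i_R, v - z, b - b' - (z \ne 0))\}$$

Haar+: Karras and Mamoulis ICDE 2007

[Figure: Haar+ tree over data d0, d1, d2, d3, with root coefficient c0 and coefficient triads C1, C2, C3 made up of coefficients c1 to c9, each edge marked + or -.]

$$E(i, v, b) = \min \begin{cases}
\min_{z_h \in S^{H}_{i,v},\ 0 \le b' \le b - (z_h \ne 0)} \max\{E(i_L, v + z_h, b'),\ E(i_R, v - z_h, b - b' - (z_h \ne 0))\} \\
\min_{z_l \in S^{L}_{i,v},\ 0 \le b' \le b - (z_l \ne 0)} \max\{E(i_L, v + z_l, b'),\ E(i_R, v, b - b' - (z_l \ne 0))\} \\
\min_{z_r \in S^{R}_{i,v},\ 0 \le b' \le b - (z_r \ne 0)} \max\{E(i_L, v, b'),\ E(i_R, v + z_r, b - b' - (z_r \ne 0))\}
\end{cases}$$

The term $(z \ne 0)$ counts 1 when z is nonzero and 0 otherwise.

Complexity: time $O(R^2 n \log n \log^2 B)$, space $O\!\left(R B \log \frac{n}{B}\right)$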

Page 10

Unrestricted Haar and Haar+ Synopses

• Solve the error-bounded problem.

Unrestricted Haar:

$$S(i, v) = \min_{z \in S_{i,v}} \{S(i_L, v + z) + S(i_R, v - z) + (z \ne 0)\}$$

Haar+:

$$S(i, v) = \min \begin{cases}
\min_{z_h \in S^{H}_{i,v}} \{S(i_L, v + z_h) + S(i_R, v - z_h) + (z_h \ne 0)\} \\
\min_{z_l \in S^{L}_{i,v}} \{S(i_L, v + z_l) + S(i_R, v) + (z_l \ne 0)\} \\
\min_{z_r \in S^{R}_{i,v}} \{S(i_L, v) + S(i_R, v + z_r) + (z_r \ne 0)\}
\end{cases}$$

Complexity: $O(R^2 n \log n)$

• Apply to the space-bounded problem.

Complexity: time $O(R^2 n (\log \varepsilon^* + \log n))$, space $O(R \log n + n)$; significant time & space advantage
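The unrestricted-Haar form of S(i, v) can be sketched in the same style as the restricted one. The transcript does not define the candidate set S_{i,v}, so this sketch makes its own assumption: all reconstructed values are multiples of a hypothetical step `delta` and stay within the range spanned by the data and 0. The function name and node numbering are also illustrative.

```python
import math
from functools import lru_cache


def min_unrestricted_coeffs(data, eps, delta=1):
    """Fewest nonzero coefficients with max abs error <= eps (sketch)."""
    n = len(data)  # assumed to be a power of two, n >= 2
    INF = float("inf")
    # Assumption: incoming values live on the grid k * delta, kmin <= k <= kmax.
    kmin = math.floor(min(min(data), 0) / delta)
    kmax = math.ceil(max(max(data), 0) / delta)
    span = kmax - kmin

    @lru_cache(maxsize=None)
    def S(i, k):
        # Nodes 1..n-1 are internal; index i >= n is leaf data[i - n].
        if i >= n:
            return 0 if abs(data[i - n] - k * delta) <= eps else INF
        best = INF
        for z in range(-span, span + 1):  # candidate coefficient z * delta
            if kmin <= k + z <= kmax and kmin <= k - z <= kmax:
                cost = S(2 * i, k + z) + S(2 * i + 1, k - z) + (z != 0)
                best = min(best, cost)
        return best

    # The root coefficient sets the incoming value of node 1 (free if zero).
    return min(S(1, k) + (k != 0) for k in range(kmin, kmax + 1))


sample = [34, 16, 2, 20, 20, 0, 36, 16]
print(min_unrestricted_coeffs(sample, 18))
```

With memoization the table has one entry per node and grid value, and each entry scans the grid once, which gives the quadratic dependence on the number of candidate values.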

Page 11

Experiments: Histograms, Time vs. n

Page 12

Experiments: Histograms, Time vs. B

Page 13

Experiments: Haar Wavelets, Time vs. n

Page 14

Experiments: Haar Wavelets, Time vs. B

Page 15

Experiments: Haar+, Time vs. n

Page 16

Experiments: Haar+, Time vs. B

Page 17

Conclusions

• Offline space-bounded data synopsis problems are more easily solvable through their error-bounded counterparts.

• Complexities are lower and independent of synopsis space.

• Dual-problem-based algorithms are simpler, more scalable, more general, more elegant, and more memory-parsimonious than the direct ones.

• Future: application to other data representation models and to multi-measure, multi-dimensional data.

Page 18

Related Work

• H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998.

• S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004.

• M. Garofalakis and A. Kumar. Wavelet synopses for general error metrics. TODS, 30(4):888–928, 2005 (also PODS 2004).

• S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005.

• S. Guha and B. Harb. Wavelet synopses for data streams: minimizing non-Euclidean error. KDD 2005.

• S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006.

• S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005.

• P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007.

Page 19

Thank you! Questions?

More discussion at Board 17 this evening

Page 21

Compact Hierarchical Histograms

• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.

• Heuristic solutions: Reiss et al. VLDB 2006

BnnBO loglog2

[Figure: CHH tree with nodes c0 to c6 over data values d0, d1, d2, d3.]

nnBO 2log

time

space

The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node.

[Reiss et al. VLDB 2006]

Page 22

Compact Hierarchical Histograms

• Solve the error-bounded problem. Next-to-bottom level case:

dcbavdcba

dcbavdcbadcbavdcba

dcbavdcba

viS

,,,,

,,,,,,,,

,,,,

,2

,1

,0

,

1,,, ** ii ssviSv

cic2i c2i+1

bav ,

z00

ba, dc,

dcba ,,

cic2i0 0

z

dcbav ,,

dc, ba,

dcba ,,

dcz , dcbaz ,,

Page 23

Compact Hierarchical Histograms

• Solve the error-bounded problem. General, recursive case:

0000

00000000

0000

**

**

**

,2

,1

,

,

RLRL

RLRLRLRL

RLRL

v

vv

v

ss

ss

ss

viS

RL

RL

RL

ii

ii

ii

*0

*0 ,,

RL iRiL sviSvsviSv RL

Complexity: nnOn

On 2log

0 1log

22

time

space

• Apply to the space-bounded problem.

Complexity: Polynomially Tractable

nOOn

log

02

nnnO logloglog *2