The Impact of Duality on Data Representation Problems
Panagiotis Karras
HKU, June 14th, 2007
Introduction
• Many data representation problems require the optimization of one parameter under a bound on one or more others.
• Classical approaches treat them in a direct manner, producing complicated solutions, and sometimes resorting to heuristics.
• Parameters involved have a monotonic relationship.
• Hence, an alternative approach is possible, based on dual problems.
Outline
• Histograms.
• Restricted Haar Wavelet Synopses.
• Unrestricted Haar and Haar+ Synopses.
• l-Diversification in 1D.
• Compact Hierarchical Histograms.
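Histograms
• Approximate a data set [d1, d2, …, dn] with B buckets, si = [bi, ei, vi], so that a maximum-error metric is minimized.
• Classical solution: Jagadish et al. VLDB 1998; Guha et al. VLDB 2004; Guha VLDB 2005
$E(i, b) = \min_{1 \le j < i} \max\{ E(j, b-1),\ \mathrm{err}(j+1, i) \}$, where $\mathrm{err}(j+1, i)$ stands for the error of a single bucket covering $d_{j+1}, \ldots, d_i$
Complexity: $O(n B \log^2 n)$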
• Recent solutions: Buragohain et al. ICDE 2007: $O(n \log n \log U)$
Guha and Shim TKDE 19(7) 2007: $O(n + B^2 \log^3 n)$ (linear for $B \le \sqrt{n / \log^3 n}$)
For $n = 2^{30} = 1{,}073{,}741{,}824$, this requires $B \le 199$
Histograms
• Solve the error-bounded problem.
Maximum Absolute Error bound ε = 2
Data: 4 5 6 2 15 17 3 6 9 12 …
Buckets: [ 4 ] [ 16 ] [ 4.5 ] [ …
• Generalized to any weighted maximum-error metric.
Each value $d_i$ defines a tolerance interval $\left[ d_i - \frac{\varepsilon}{w_i},\ d_i + \frac{\varepsilon}{w_i} \right]$.
A bucket is closed when the running intersection of the intervals becomes empty.
Complexity: $O(n)$
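The greedy pass for the error-bounded problem can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function name `error_bounded_histogram`, the `weights` parameter, and the choice of the midpoint of the surviving interval as the bucket value are all hypothetical.

```python
import math

def error_bounded_histogram(data, eps, weights=None):
    """Greedy sketch: fewest buckets with weighted maximum error <= eps.

    Returns a list of (begin, end, value) buckets over 0-based indices.
    """
    if weights is None:
        weights = [1.0] * len(data)    # unweighted maximum absolute error
    buckets = []
    start = 0
    lo, hi = -math.inf, math.inf       # running intersection of tolerance intervals
    for i, (d, w) in enumerate(zip(data, weights)):
        new_lo = max(lo, d - eps / w)
        new_hi = min(hi, d + eps / w)
        if new_lo > new_hi:            # intersection is empty: close the bucket
            buckets.append((start, i - 1, (lo + hi) / 2))
            start = i
            new_lo, new_hi = d - eps / w, d + eps / w
        lo, hi = new_lo, new_hi
    if data:
        buckets.append((start, len(data) - 1, (lo + hi) / 2))
    return buckets

example = [4, 5, 6, 2, 15, 17, 3, 6, 9, 12]
print([value for _, _, value in error_bounded_histogram(example, eps=2)])
# prints [4.0, 16.0, 4.5, 10.5]
```

Any value inside the surviving interval is a valid bucket value; the midpoint is used here only because it reproduces the slide example for ε = 2.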
Histograms
• Apply to the space-bounded problem.
Perform binary search in the domain of the error bound ε.
For an error value requiring space $\tilde{B} \le B$, with actual error $\tilde{\varepsilon}$, run an optimality test:
Run the error-bounded algorithm under the constraint error $< \tilde{\varepsilon}$ instead of error $\le \tilde{\varepsilon}$.
If error $< \tilde{\varepsilon}$ requires more than $B$ space, then the optimal solution has been reached.
Complexity: $O(n \log \varepsilon^*)$, independent of the number of buckets B
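The binary search with its optimality test can be sketched as below for the unweighted maximum absolute error. This is an illustrative sketch, not the authors' code: the names `min_buckets` and `space_bounded_error`, the `strict` flag, and the starting range of the search are assumptions.

```python
def min_buckets(data, eps, strict=False):
    """Greedy pass: (bucket count, actual max abs error) under bound eps.

    With strict=True the bound is error < eps instead of error <= eps.
    """
    count, actual = 1, 0.0
    lo = hi = data[0]
    for d in data[1:]:
        new_lo, new_hi = min(lo, d), max(hi, d)
        half = (new_hi - new_lo) / 2           # best error if d joins the bucket
        fits = half < eps if strict else half <= eps
        if fits:
            lo, hi = new_lo, new_hi
        else:                                   # close the bucket, restart at d
            actual = max(actual, (hi - lo) / 2)
            count += 1
            lo = hi = d
    return count, max(actual, (hi - lo) / 2)

def space_bounded_error(data, B):
    """Smallest max abs error achievable with at most B >= 1 buckets."""
    lo, hi = 0.0, (max(data) - min(data)) / 2   # hi is feasible with one bucket
    eps = hi
    while True:
        count, actual = min_buckets(data, eps)
        if count > B:
            lo = eps                            # bound too tight for B buckets
        else:
            if actual == 0.0:
                return 0.0
            stricter, _ = min_buckets(data, actual, strict=True)
            if stricter > B:                    # optimality test passed
                return actual
            hi = actual                         # a smaller error is still feasible
        eps = (lo + hi) / 2

example = [4, 5, 6, 2, 15, 17, 3, 6, 9, 12]
print(space_bounded_error(example, B=4))        # prints 2.0
```

Each probe is one linear pass, and the space budget B only appears in the comparisons, which is why the cost of a probe does not grow with B.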
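Restricted Haar Wavelet Synopses
• Select a subset of Haar wavelet decomposition coefficients, so that a maximum-error metric is minimized.
[Figure: Haar wavelet decomposition tree of the data 34 16 2 20 20 0 36 16, with pairwise averages 25 11 10 26, then 18 18, overall average 18, and detail coefficients 0, then 7 -8, then 9 -9 10 10.]
• Classical solution: Garofalakis and Kumar PODS 2004; Guha VLDB 2005
$$E(i, v, b) = \min \begin{cases} \min_{0 \le b' \le b} \max\{ E(i_L, v, b'),\ E(i_R, v, b - b') \} \\ \min_{0 \le b' \le b - 1} \max\{ E(i_L, v + z_i, b'),\ E(i_R, v - z_i, b - b' - 1) \} \end{cases}$$
Complexity: $O(n^2)$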
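The decomposition behind the example data 34 16 2 20 20 0 36 16 can be reproduced with a short Python sketch. The function names `haar_decompose` and `haar_reconstruct` and the heap-style coefficient layout (index 0 for the overall average, node i with children 2i and 2i+1) are assumptions made for illustration.

```python
def haar_decompose(data):
    """Return [overall average, detail coefficients in level order]."""
    n = len(data)                          # assumed to be a power of two
    coeffs = [0.0] * n
    current = [float(x) for x in data]
    while len(current) > 1:
        half = len(current) // 2
        averages = [(current[2 * k] + current[2 * k + 1]) / 2 for k in range(half)]
        details = [(current[2 * k] - current[2 * k + 1]) / 2 for k in range(half)]
        coeffs[half:2 * half] = details    # node k of this level sits at index half + k
        current = averages
    coeffs[0] = current[0]
    return coeffs

def haar_reconstruct(coeffs):
    """Invert haar_decompose: add the detail to the left, subtract it on the right."""
    values = [coeffs[0]]
    half = 1
    while half < len(coeffs):
        values = [v + sign * coeffs[half + k]
                  for k, v in enumerate(values) for sign in (1, -1)]
        half *= 2
    return values

data = [34, 16, 2, 20, 20, 0, 36, 16]
print(haar_decompose(data))
# prints [18.0, 0.0, 7.0, -8.0, 9.0, -9.0, 10.0, 10.0]
```

A restricted synopsis keeps some of these coefficients at their computed values and treats the rest as zero; keeping only the average 18, for instance, reconstructs every value as 18.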
Restricted Haar Wavelet Synopses
• Solve the error-bounded problem. Muthukrishnan FSTTCS 2005
$S(i, v) = \min\{ S(i_L, v) + S(i_R, v),\ S(i_L, v + z_i) + S(i_R, v - z_i) + 1 \}$
Local search within each of the $\frac{n}{\log n}$ subtrees in the bottom $\log \log n$ Haar tree levels
Complexity: $O\left( \frac{n^2}{\log n} \right)$
• Apply to the space-bounded problem.
Complexity: $O\left( \frac{n^2}{\log n} \log \varepsilon^* \right)$: no significant advantage
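The recurrence for S(i, v) can be evaluated directly with memoization, as in the sketch below. It shows the plain recurrence only and does not include the subquadratic local search attributed to Muthukrishnan; the names `haar_decompose` and `min_restricted_coeffs` and the heap-style node numbering are assumptions.

```python
from functools import lru_cache

def haar_decompose(data):
    """Haar coefficients: index 0 is the average, node i has children 2i, 2i+1."""
    coeffs = [0.0] * len(data)             # len(data) assumed a power of two
    current = [float(x) for x in data]
    while len(current) > 1:
        half = len(current) // 2
        coeffs[half:2 * half] = [(current[2 * k] - current[2 * k + 1]) / 2
                                 for k in range(half)]
        current = [(current[2 * k] + current[2 * k + 1]) / 2 for k in range(half)]
    coeffs[0] = current[0]
    return coeffs

def min_restricted_coeffs(data, eps):
    """Fewest retained Haar coefficients with max abs error <= eps."""
    n = len(data)
    c = haar_decompose(data)
    inf = float("inf")

    @lru_cache(maxsize=None)
    def S(i, v):
        # v is the value contributed by the retained ancestors of node i
        if i >= n:                         # leaf: data item i - n
            return 0 if abs(data[i - n] - v) <= eps else inf
        drop = S(2 * i, v) + S(2 * i + 1, v)
        keep = 1 + S(2 * i, v + c[i]) + S(2 * i + 1, v - c[i])
        return min(drop, keep)

    # The overall average c[0] is itself either dropped or kept
    return min(S(1, 0.0), 1 + S(1, c[0]))

data = [34, 16, 2, 20, 20, 0, 36, 16]
print(min_restricted_coeffs(data, eps=18))   # prints 1
```

With ε = 18 the overall average alone suffices, while ε = 0 forces all seven non-zero coefficients to be kept.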
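Unrestricted Haar and Haar+ Synopses
• Assign arbitrary values to Haar/Haar+ coefficients, so that a maximum-error metric is minimized.
• Classical solutions: Guha and Harb KDD 2005, SODA 2006
$E(i, v, b) = \min_{z \in S_i^v,\ 0 \le b' \le b - (z \ne 0)} \max\{ E(i_L, v + z, b'),\ E(i_R, v - z, b - b' - (z \ne 0)) \}$, where $(z \ne 0)$ counts 1 if z is non-zero and 0 otherwise
Time: $O(R^2 n \log n \log^2 B)$
[Figure: Haar+ tree over the data d0, d1, d2, d3, with root coefficient c0 and the triads C1 (c1, c2, c3), C2 (c4, c5, c6), and C3 (c7, c8, c9); edges carry + and - signs.]
Haar+:
$$E(i, v, b) = \min \begin{cases} \min_{z_h \in S^v_{i,H},\ 0 \le b' \le b - (z_h \ne 0)} \max\{ E(i_L, v + z_h, b'),\ E(i_R, v - z_h, b - b' - (z_h \ne 0)) \} \\ \min_{z_l \in S^v_{i,L},\ 0 \le b' \le b - (z_l \ne 0)} \max\{ E(i_L, v + z_l, b'),\ E(i_R, v, b - b' - (z_l \ne 0)) \} \\ \min_{z_r \in S^v_{i,R},\ 0 \le b' \le b - (z_r \ne 0)} \max\{ E(i_L, v, b'),\ E(i_R, v + z_r, b - b' - (z_r \ne 0)) \} \end{cases}$$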
Space: $O\left( R B \log \frac{n}{B} + n \right)$
Karras and Mamoulis ICDE 2007
Unrestricted Haar and Haar+ Synopses
• Solve the error-bounded problem.
Unrestricted Haar:
$S(i, v) = \min_{z \in S_i^v} \{ S(i_L, v + z) + S(i_R, v - z) + (z \ne 0) \}$
Haar+:
$$S(i, v) = \min \begin{cases} \min_{z_h \in S^v_{i,H}} \{ S(i_L, v + z_h) + S(i_R, v - z_h) + (z_h \ne 0) \} \\ \min_{z_l \in S^v_{i,L}} \{ S(i_L, v + z_l) + S(i_R, v) + (z_l \ne 0) \} \\ \min_{z_r \in S^v_{i,R}} \{ S(i_L, v) + S(i_R, v + z_r) + (z_r \ne 0) \} \end{cases}$$
Complexity: time $O(R^2 n \log n)$, space $O(R \log n + n)$
• Apply to the space-bounded problem.
Complexity: $O(R^2 n \log n \log \varepsilon^*)$: significant time & space advantage
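The unrestricted Haar recurrence for the error-bounded problem can be tried on small inputs with the sketch below. It is a direct, unoptimized evaluation under stated assumptions: candidate values are restricted to multiples of a resolution `delta`, and the names `min_unrestricted_coeffs`, `k_lo`, and `k_hi` are hypothetical. It covers plain unrestricted Haar only, not the Haar+ triads.

```python
import math
from functools import lru_cache

def min_unrestricted_coeffs(data, eps, delta=1.0):
    """Fewest non-zero coefficients (multiples of delta) with max abs error <= eps."""
    n = len(data)                           # assumed a power of two, n >= 2
    inf = float("inf")
    # Grid index k stands for the value k * delta. An incoming value is the
    # average of the reconstructed leaves below it, so it stays in this range.
    k_lo = min(math.floor((min(data) - eps) / delta), 0)
    k_hi = max(math.ceil((max(data) + eps) / delta), 0)

    @lru_cache(maxsize=None)
    def S(i, k):
        if i >= n:                          # leaf: data item i - n
            return 0 if abs(data[i - n] - k * delta) <= eps else inf
        best = inf
        for z in range(k_lo - k_hi, k_hi - k_lo + 1):
            left, right = k + z, k - z
            if k_lo <= left <= k_hi and k_lo <= right <= k_hi:
                best = min(best, (z != 0) + S(2 * i, left) + S(2 * i + 1, right))
        return best

    # The root coefficient z is added to an incoming value of 0
    return min((z != 0) + S(1, z) for z in range(k_lo, k_hi + 1))

print(min_unrestricted_coeffs([1, 3], eps=1))   # prints 1
```

For the data [1, 3] and ε = 1, a single coefficient of value 2 at the root approximates both items within the bound, even though 2 is the data average only by coincidence; in general the chosen values need not be Haar coefficients of the data.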
l-Diversification in 1D
• Given a database table T(A1, A2, …, An), a quasi-identifier attribute set QT is a subset of attributes which can reveal the personal identity of records.
• An equivalence class with respect to quasi-identifier attribute set QT is a set of records indistinguishable in the projection of T on QT.
• A database table T with quasi-identifier set QT and sensitive attribute S conforms to the l-diversity property iff each equivalence class in T with respect to QT has at least l well-represented values of S [Machanavajjhala et al. ICDE 2006].
• Utility metric: extent of equivalence class (group).
• Other parameter: outliers, i.e. records whose quasi-identifier values are suppressed.
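The l-diversity definition can be checked mechanically, as in the Python sketch below. It assumes the simplest reading of "well-represented", namely l distinct sensitive values per equivalence class; the function name `is_l_diverse` and the sample records, including the generalized age ranges, are hypothetical.

```python
from collections import defaultdict

def is_l_diverse(records, l):
    """records: iterable of (quasi_identifier, sensitive_value) pairs.

    Assumes 'well-represented' means 'distinct' (distinct l-diversity).
    """
    groups = defaultdict(set)
    for quasi_identifier, sensitive in records:
        groups[quasi_identifier].add(sensitive)   # one equivalence class per key
    return all(len(values) >= l for values in groups.values())

# Hypothetical table: generalized age range as the quasi-identifier
table = [
    ("10-30", "Flu"),
    ("10-30", "Parkinson's"),
    ("10-30", "Hyperthyroidism"),
    ("50-70", "Flu"),
    ("50-70", "Lead Poisoning"),
]
print(is_l_diverse(table, 2), is_l_diverse(table, 3))   # prints True False
```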
l-Diversification in 1D
• A two-dimensional example.
[Figure: records plotted by the quasi-identifier attributes Age (10 to 90) and Postcode (1 to 7), marked with the sensitive values Lead Poisoning, Parkinson’s, Flu, and Hyperthyroidism.]
l-Diversification in 1D
• Study the problem in one dimension (a single quasi-identifier).
• Total order exists.
• Similar to histogram construction.
• Polynomially tractable.
[Figure: records r1 to r6 plotted by quasi-identifier against sensitive value, across the sensitive value domains D1 to D4.]
• Groups are consecutive in each sensitive value domain.
• Groups follow the same order in each domain.
• Example for l = 3.
l-Diversification in 1D
[Figure: quasi-identifier axis against sensitive value, with the extents e and E marked.]
l-Diversification in 1D
• Given an interval I of extent E, which includes c items with m different sensitive values, the number of possible boundaries/groups in I is:
[Equations: piecewise $O(\cdot)$ bounds on the number of boundaries $B_m^c$ and on the number of groups $C_m^c$, as functions of c and m.]
l-Diversification in 1D
• Solve the outlier minimization problem.
[Equation: recurrence for the minimum number of outliers N over boundaries a, b, c.]
Complexity: time $O(C_m^c\, C_m^w\, n \log n)$, space $O(w\, B_m^c)$
• Apply to the accuracy maximization problem.
Complexity: the time bound above multiplied by the logarithmic factor of the binary search
• Apply to the privacy maximization problem.
Complexity: the time bound above multiplied by the logarithmic factor of the binary search
Compact Hierarchical Histograms
• Assign arbitrary values to CHH coefficients, so that a maximum-error metric is minimized.
[Figure: CHH tree with root c0, children c1 and c2, and nodes c3, c4, c5, c6 above the data d0, d1, d2, d3.]
• Heuristic solutions: Reiss et al. VLDB 2006
[Equations: time and space bounds of the heuristic solutions, as functions of n and B.]
"The benefit of making node B a bucket (occupied) node depends on whether node A is a bucket node – and also on whether node C is a bucket node." [Reiss et al. VLDB 2006]
Compact Hierarchical Histograms
• Solve the error-bounded problem. Next-to-bottom level case
[Equation: S(i, v) for a node ci with children c2i and c2i+1, defined by cases with the values 0, 1, and 2, according to how the incoming value v relates to a, b, c, d.]
[Figure: node ci with children c2i and c2i+1, showing the assigned values z and 0 for the cases over a, b, c, d.]
Compact Hierarchical Histograms
• Solve the error-bounded problem. General, recursive case
[Equation: S(i, v), defined by cases in terms of the left and right subproblems S(iL, vL) and S(iR, vR) and their space requirements s*.]
Complexity: [Equations: time and space bounds as functions of n.]
• Apply to the space-bounded problem.
Complexity: Polynomially Tractable
[Equations: bounds as functions of n with logarithmic factors.]
Conclusions
• Offline data representation problems under constraints are more easily solvable through their counterparts optimizing another parameter.
• Dual-problem-based algorithms are simpler, more scalable, more elegant, and more memory-parsimonious than the direct ones.
• In the CHH case, the dual-problem-based algorithm achieves an optimal solution to the maximum-error longest-prefix-match CHH partitioning problem, which was considered intractable.
• Future: assessment of privacy and CHH solutions.
Related Work
• H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. VLDB 1998
• S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. VLDB 2004
• M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum-error metrics. PODS 2004
• S. Guha. Space efficiency in synopsis construction algorithms. VLDB 2005
• S. Guha and B. Harb. Wavelet synopses for data streams: minimizing non-Euclidean error. KDD 2005
• S. Muthukrishnan. Subquadratic algorithms for workload-aware Haar wavelet synopses. FSTTCS 2005
• S. Guha and B. Harb. Approximation algorithms for wavelet transform coding of data streams. SODA 2006
• F. Reiss, M. Garofalakis, and J. M. Hellerstein. Compact histograms for hierarchical identifiers. VLDB 2006
• A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ICDE 2006
• P. Karras and N. Mamoulis. The Haar+ tree: a refined synopsis data structure. ICDE 2007
Thank you! Questions?