![Page 1: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/1.jpg)
Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes
Xiaohui Yu¹, Calisto Zuzarte², Ken Sevcik¹
¹ University of Toronto
² IBM Toronto Lab
[email protected]
![Page 2: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/2.jpg)
November 3, 2005 CIKM 2005 2
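Distinct value combinations

| Country | City | Hotel Name |
|---|---|---|
| Germany | Bremen | Hilton |
| Germany | Bremen | Best Western |
| Germany | Frankfurt | InterCity |
| Canada | Toronto | Four Seasons |
| Canada | Toronto | Intercontinental |

The (Country, City) columns contain 3 distinct value combinations: (Germany, Bremen), (Germany, Frankfurt), and (Canada, Toronto).

COLSCARD (COlumn Set CARDinality) = 3

The problem: estimating COLSCARD for a given set of attributes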
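As a point of reference, the exact COLSCARD of the hotel table can be computed by brute force when the data is accessible. The sketch below is only an illustration; the function name `colscard` and the use of column positions are assumptions made here, not part of the talk.

```python
# Rows of the example Hotel table from the slide: (Country, City, Hotel Name).
hotels = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def colscard(rows, columns):
    """Exact number of distinct value combinations over the given column positions."""
    return len({tuple(row[c] for c in columns) for row in rows})


print(colscard(hotels, (0, 1)))  # (Country, City) -> 3
```

This exact count requires a full scan of the data, which is what the estimation techniques in the talk aim to avoid.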
![Page 3: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/3.jpg)
Motivation

- Cardinality estimation for query optimization, e.g.,
  - Estimating the size of $\pi_{\text{country},\,\text{city}}(\text{Hotel})$
  - Estimating the size of the aggregation `SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold FROM sales GROUP BY sales_date, sales_person`
- Approximate query answering, e.g., COUNT queries
![Page 4: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/4.jpg)
Roadmap

- Related work
- Estimation with known marginal distributions
  - Upper/lower bounds
  - An estimator
- Estimation with histograms
- Experimental results
- Conclusions
![Page 5: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/5.jpg)
Related work

- Previous work has focused on the case of a single attribute: [HÖT88], [HÖT89], [HNSS’95], [HS’98], [CCMN’00]
- A sampling approach is used.
  - Estimation through sampling is difficult [CCMN’00]
- No existing statistical information is exploited.
![Page 6: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/6.jpg)
Our solution

- Considering multiple attributes
- Utilizing existing statistics on individual attributes
  - Readily available in most database systems
  - Does not require access to the data
- Granularity of statistics
  - Exact marginal frequency distributions
  - Approximate distributions: histograms, etc.
![Page 7: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/7.jpg)
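Estimation with known marginals

- Number of distinct values in attribute $A_i$: $d_i$ $(i = 1, 2, \ldots, m)$
- Frequency vector: $\mathbf{f}_i = (f_{i1}, f_{i2}, \ldots, f_{id_i})$, with $\sum_{j=1}^{d_i} f_{ij} = 1$

| Country | City | Hotel Name |
|---|---|---|
| Germany | Bremen | Hilton |
| Germany | Bremen | Best Western |
| Germany | Frankfurt | InterCity |
| Canada | Toronto | Four Seasons |
| Canada | Toronto | Intercontinental |

For this table:

- $\mathbf{f}_1 = (0.6, 0.4)$ (Country)
- $\mathbf{f}_2 = (0.4, 0.2, 0.4)$ (City)
- $\mathbf{f}_3 = (0.2, 0.2, 0.2, 0.2, 0.2)$ (Hotel Name)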
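The marginal frequency vectors for the hotel table can be derived with a simple count per column. This is a minimal sketch; the helper name `marginal` is hypothetical, and values are listed in order of first appearance in the table.

```python
from collections import Counter

# Rows of the example Hotel table: (Country, City, Hotel Name).
hotels = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def marginal(rows, column):
    """Relative frequency of each distinct value in one column."""
    counts = Counter(row[column] for row in rows)  # keeps first-seen order
    return [count / len(rows) for count in counts.values()]


f1 = marginal(hotels, 0)  # Country
f2 = marginal(hotels, 1)  # City
f3 = marginal(hotels, 2)  # Hotel Name
print(f1, f2, f3)
```

Each vector sums to 1, and its length is the number of distinct values $d_i$ of the attribute.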
![Page 8: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/8.jpg)
The naïve estimator

$\text{COLSCARD} = \min\left(\prod_{i=1}^{m} d_i,\; N\right)$

- $\prod_{i=1}^{m} d_i$: the number of possible value combinations
- $d_i$: the number of distinct values in attribute $A_i$
- $N$ (sanity bound): COLSCARD cannot be greater than the table size

The problem: some value combinations with low occurrence probabilities may not appear in the table!
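The overestimation problem is visible even on the small hotel table. The sketch below applies the naïve formula to the (Country, City) columns and compares it with the true count; the function name `naive_colscard` is an assumption made for illustration.

```python
from math import prod

# (Country, City) columns of the example Hotel table.
rows = [
    ("Germany", "Bremen"),
    ("Germany", "Bremen"),
    ("Germany", "Frankfurt"),
    ("Canada", "Toronto"),
    ("Canada", "Toronto"),
]


def naive_colscard(distinct_counts, n_rows):
    """min(product of per-attribute distinct counts, table size)."""
    return min(prod(distinct_counts), n_rows)


d = [len({row[i] for row in rows}) for i in range(2)]  # [2, 3]
naive = naive_colscard(d, len(rows))  # min(2 * 3, 5)
actual = len(set(rows))               # true COLSCARD
print(naive, actual)
```

Here the naïve estimate is capped by the table size, yet it still exceeds the true count because combinations such as (Canada, Bremen) never occur.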
![Page 9: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/9.jpg)
Upper/Lower bounds

- Trivial bounds
  - Upper bound: $\min\left(\prod_{i=1}^{m} d_i,\; N\right)$ (the naïve estimator)
  - Lower bound: $\max(d_1, d_2, \ldots, d_m)$
- Tighter bounds?
  - In the case of two attributes, tighter bounds are available.
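The trivial bounds need only the per-attribute distinct counts and the table size. A minimal sketch follows, with `trivial_bounds` as a hypothetical helper name; the inputs in the usage line (two attributes with 3 distinct values each, N = 10) are taken from the tighter-bounds example in this talk.

```python
from math import prod


def trivial_bounds(distinct_counts, n_rows):
    """Return (lower, upper) trivial bounds on COLSCARD."""
    lower = max(distinct_counts)                 # at least as many as the widest attribute
    upper = min(prod(distinct_counts), n_rows)   # the naive estimator
    return lower, upper


print(trivial_bounds([3, 3], 10))  # (3, 9)
```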
![Page 10: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/10.jpg)
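Tighter bounds

Example with N = 10 and two attributes, each described by a (value, freq) table:

- $A_1$ has values a, b, c; one value occurs 8 times and the other two occur once each.
- $A_2$ has values d, e, f, with frequencies 4, 4, and 2.

Bounds:

- Naïve bounds: [3, 9]
- The $A_1$ value with frequency 8 must be paired with between 2 and 3 distinct $A_2$ values, i.e. [2, 3]: no $A_2$ value occurs more than 4 times, and $A_2$ has only 3 distinct values. Each $A_1$ value with frequency 1 contributes exactly 1 combination.
- Lower bound = 2 + 1 + 1 = 4
- Upper bound = 3 + 1 + 1 = 5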
![Page 11: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/11.jpg)
Expected number of combinations

- Assumptions
  1. The data distributions of individual columns are independent
  2. The occurrence of each combination in the table is independent
- $\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \cdots \otimes \mathbf{f}_m$
  - Each element of $\mathbf{f}$ represents the frequency of a specific value combination.
  - It serves as an estimate of the probability of occurrence of that combination.
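Under the independence assumption, the joint frequency vector is the product of the marginals taken over every value combination. The sketch below builds it in pure Python for the Country and City marginals of the hotel table; the name `joint_frequencies` is an assumption, and the ordering (first attribute varying slowest) is simply the order produced by `itertools.product`.

```python
from itertools import product
from math import prod


def joint_frequencies(marginals):
    """Product of marginal frequencies for every value combination."""
    return [prod(combo) for combo in product(*marginals)]


f1 = [0.6, 0.4]        # Country
f2 = [0.4, 0.2, 0.4]   # City
f = joint_frequencies([f1, f2])
print(len(f), f)
```

The result has $d_1 \cdot d_2 = 6$ entries that sum to 1. It assigns a nonzero frequency to (Canada, Bremen), a combination that never occurs in the table, which is a direct consequence of assuming independence.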
![Page 12: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/12.jpg)
Estimator

- The probability of the i-th combination not appearing in a particular tuple is $(1 - f_i)$
- The probability of the i-th combination not appearing in the table (of size $N$) is $(1 - f_i)^N$
- The expected number of value combinations is

$E[\text{COLSCARD}] = M - \sum_{i=1}^{M} (1 - f_i)^N \quad \left(M = \prod_{j=1}^{m} d_j\right)$
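The step from the per-combination probabilities to the expectation can be spelled out with indicator variables. The notation $X_i$ is introduced here for illustration only: $X_i = 1$ if the i-th combination appears in the table and $0$ otherwise.

$$
E[\text{COLSCARD}] = E\left[\sum_{i=1}^{M} X_i\right] = \sum_{i=1}^{M} \Pr(X_i = 1) = \sum_{i=1}^{M} \left(1 - (1 - f_i)^N\right) = M - \sum_{i=1}^{M} (1 - f_i)^N
$$

The second equality is linearity of expectation; the third uses the probability $(1 - f_i)^N$ that combination $i$ is absent from all $N$ tuples.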
![Page 13: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/13.jpg)
Example revisited

Estimate the COLSCARD for attribute set $(A_1, A_2, A_3)$, given

- $\mathbf{f}_1 = (0.1, 0.3, 0.6)$
- $\mathbf{f}_2 = (0.01, 0.99)$
- $\mathbf{f}_3 = (0.05, 0.95)$
- $N = 100$

$\mathbf{f} = \mathbf{f}_1 \otimes \mathbf{f}_2 \otimes \mathbf{f}_3 = (0.00005, 0.00095, 0.00495, 0.09405, 0.00015, 0.00285, 0.01485, 0.28215, 0.00030, 0.00570, 0.02970, 0.56430)$

- Naïve estimate: 3 × 2 × 2 = 12
- New estimate: 5.94
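The numbers in this example can be checked with a few lines of code. The sketch below implements the expected-value formula from the Estimator slide directly; the function name `expected_colscard` is an assumption, and enumerating all $M$ combinations is only practical for small examples like this one.

```python
from itertools import product
from math import prod


def expected_colscard(marginals, n_rows):
    """M - sum_i (1 - f_i)^N, with f_i the product of marginal frequencies."""
    m_total = prod(len(f) for f in marginals)
    absent = sum((1.0 - prod(combo)) ** n_rows for combo in product(*marginals))
    return m_total - absent


f1 = [0.1, 0.3, 0.6]
f2 = [0.01, 0.99]
f3 = [0.05, 0.95]

naive = min(prod(len(f) for f in (f1, f2, f3)), 100)
estimate = expected_colscard([f1, f2, f3], 100)
print(naive)               # 12
print(round(estimate, 2))  # 5.94
```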
![Page 14: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/14.jpg)
![Page 15: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/15.jpg)
Estimation with histograms

- Histograms exist on individual attributes
- Two classes of histograms
  - Partition-based
  - End-biased
- Marginals can be (approximately) reconstructed from histograms
- Optimal histograms in each class?
![Page 16: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/16.jpg)
Optimal histograms

- Minimizing the error incurred by histograms: $\mathrm{ERR} = |\mathrm{EST}_{\text{hist}} - \mathrm{EST}_{\text{exact}}|$
- Partition-based histograms
  - A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used.
![Page 17: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/17.jpg)
Optimal end-biased histograms

- An end-biased histogram with B buckets stores
  - The exact frequencies of B-1 attribute values
  - The average frequency of the remaining values
- Which B-1 values to store exactly?
- Most widely used end-biased histograms store the frequencies of the most frequent values
  - Not always optimal for COLSCARD estimation!
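To make the structure concrete, the sketch below reconstructs an approximate marginal from an end-biased histogram: chosen values keep their exact frequency and every other value receives the average of the remainder. The function name and the index-based interface are assumptions made for illustration; the frequency vector is borrowed from the two-attribute example in this talk.

```python
def end_biased_reconstruct(freqs, exact_indices):
    """Keep freqs at exact_indices; replace all others by their average."""
    exact = set(exact_indices)
    rest = [i for i in range(len(freqs)) if i not in exact]
    avg = sum(freqs[i] for i in rest) / len(rest) if rest else 0.0
    return [freqs[i] if i in exact else avg for i in range(len(freqs))]


freqs = [0.01, 0.02, 0.03, 0.94]
# B = 2 buckets: store one exact frequency (here the most frequent value).
approx = end_biased_reconstruct(freqs, [3])
print(approx)
```

With the most frequent value stored exactly, the three remaining values share the average frequency 0.02, so the reconstructed vector still sums to 1.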
![Page 18: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/18.jpg)
Example

- Attributes $(A_1, A_2)$, $N = 10$
- $\mathbf{f}_1 = (0.1, 0.9)$
- Case 1: $\mathbf{f}_2 = (0.01, 0.02, 0.03, 0.94)$
- Case 2: $\mathbf{f}_2' = (0.01, 0.29, 0.31, 0.39)$
- Choose 1 frequency to store exactly

Error table (columns: index of the frequency stored)

| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| $\mathbf{f}_2$ | 1.68 | 2.01 | 2.17 | 0.15 |
| $\mathbf{f}_2'$ | 0.01 | 1.10 | 1.09 | 1.02 |
![Page 19: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/19.jpg)
Optimal end-biased histograms

- Exhaustive search takes time proportional to $\binom{d_j}{B-1}$
- We prove that the optimal choice is one of the following:
  - Most frequent values
  - Least frequent values
  - A combination of most frequent and least frequent values
- Only need to search both ends
  - Cost is linear in B, independent of $d_j$!
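The "search both ends" idea can be sketched as follows: for each split of the B-1 exact slots between the least frequent and most frequent values, rebuild the marginal, re-run the estimator, and keep the split with the smallest $|\mathrm{EST}_{\text{hist}} - \mathrm{EST}_{\text{exact}}|$. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the estimator is the plain formula $M - \sum_i (1 - f_i)^N$, and the sort is done inside the function purely for convenience. Its error values are not claimed to reproduce the error table in the example.

```python
from itertools import product
from math import prod


def expected_colscard(marginals, n_rows):
    """M - sum_i (1 - f_i)^N over all value combinations."""
    total = prod(len(f) for f in marginals)
    absent = sum((1.0 - prod(c)) ** n_rows for c in product(*marginals))
    return total - absent


def end_biased(freqs, exact):
    """Exact frequencies at the chosen indices, average everywhere else."""
    rest = [i for i in range(len(freqs)) if i not in exact]
    avg = sum(freqs[i] for i in rest) / len(rest) if rest else 0.0
    return [freqs[i] if i in exact else avg for i in range(len(freqs))]


def best_end_biased(freqs, others, n_rows, buckets):
    """Try the B candidate splits between both ends; return (error, indices).

    Assumes buckets - 1 <= len(freqs).
    """
    order = sorted(range(len(freqs)), key=lambda i: freqs[i])  # ascending
    exact_est = expected_colscard([freqs, *others], n_rows)
    slots = buckets - 1
    best = None
    for n_high in range(slots + 1):          # B candidates in total
        n_low = slots - n_high
        chosen = set(order[:n_low]) | set(order[len(order) - n_high:])
        approx = end_biased(freqs, chosen)
        err = abs(expected_colscard([approx, *others], n_rows) - exact_est)
        if best is None or err < best[0]:
            best = (err, sorted(chosen))
    return best


f1 = [0.1, 0.9]
case1 = best_end_biased([0.01, 0.02, 0.03, 0.94], [f1], 10, 2)
case2 = best_end_biased([0.01, 0.29, 0.31, 0.39], [f1], 10, 2)
print(case1[1], case2[1])  # zero-based indices of the stored value
```

On the two cases from the example, this sketch stores the most frequent value in case 1 and the least frequent value in case 2, in line with the observation that the most frequent values are not always the best choice.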
![Page 20: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/20.jpg)
![Page 21: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/21.jpg)
Experiments – Data sets

- Synthetic data
  - Skew: Zipfian parameter z = 0 (uniform) to 4 (highly skewed)
  - Number of tuples: 10K to 1M
- Real data
  - Cover Type: 581,012 tuples, 10 attributes
  - Census Income: 32,561 tuples, 14 attributes
- Error measure: ratio error, $\mathrm{ERR} = \max\{\text{true}/\text{est} - 1,\; \text{est}/\text{true} - 1\}$
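The ratio error is symmetric in over- and underestimation: an estimate that is off by a factor of two in either direction scores 1. A minimal sketch, with `ratio_error` as a hypothetical function name and both arguments assumed to be positive:

```python
def ratio_error(true_value, estimate):
    """max(true/est - 1, est/true - 1); 0 means a perfect estimate."""
    return max(true_value / estimate - 1.0, estimate / true_value - 1.0)


print(ratio_error(3, 3))  # 0.0
print(ratio_error(3, 6))  # 1.0 (overestimate by 2x)
print(ratio_error(6, 3))  # 1.0 (underestimate by 2x)
```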
![Page 22: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/22.jpg)
Effect of data skew

N = 100K, $d_i$ = 1k

| ERR | z1 = 0, z2 = 0 | z1 = 0, z2 = 2 | z1 = 0, z2 = 4 | z1 = 4, z2 = 4 |
|---|---|---|---|---|
| Proposed estimator | 0.000237 | 0.000933 | 0.000982 | 0.0654 |
| Naive estimator | 0.0516 | 6.5171 | 5.9423 | 8.4921 |
![Page 23: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/23.jpg)
Effect of number of tuples

[Figure: ERR (y-axis, 0 to 0.14) versus number of tuples N (x-axis, 1000 to 1000000), with one series each for z=0, z=2, and z=4]
![Page 24: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/24.jpg)
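Results on real data

[Figure: two charts showing how many attribute pairs fall into each error range. Legend: ERR≤0.05, 0.05<ERR≤0.1, 0.1<ERR≤0.5, 0.5<ERR≤1, ERR>1. Segment labels as extracted: (a) Cover Type, 45 pairs: 31, 4, 3, 5, 2; (b) Census Income, 91 pairs: 59, 19, 10, 2, 1]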
![Page 25: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/25.jpg)
Accuracy of end-biased histograms

[Figure: ERR (y-axis, 0 to 0.3) versus number of buckets (x-axis: 10, 20, 30, 50)]

Results on the “capital-gain” attribute of Census Income data set
![Page 26: Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes](https://reader035.vdocument.in/reader035/viewer/2022062816/568155fb550346895dc3c63e/html5/thumbnails/26.jpg)
Conclusions

- Utilizing existing knowledge maintained in database systems
- Proposed upper/lower bounds as well as an estimator
- Considered two cases
  - Exact marginal frequencies
  - Histograms: optimal histograms
- Experimental results show the effectiveness of the proposed method