Liang Jin and Chen Li
VLDB’2005
Supported by NSF CAREER Award IIS-0238586
Selectivity Estimation for Fuzzy String Predicates in Large Data Sets
2
Example: a movie database
Star            Title                                         Year  Genre
Keanu Reeves    The Matrix                                    1999  Sci-Fi
Samuel Jackson  Star Wars: Episode III - Revenge of the Sith  2005  Sci-Fi
Schwarzenegger  The Terminator                                1984  Sci-Fi
Samuel Jackson  Goodfellas                                    1990  Drama
…               …                                             …     …
“Find movies starring Schwarrzenger”?
Find movies with a star “similar to” Schwarrzenger.
3
Queries with Fuzzy String Predicates
• Stars: name similar to “Schwarrzenger”
• Employees: SSN similar to “430-87-7294”
• Customers: telephone number similar to “412-0964”
• Similar to:
  – a domain-specific function
  – returns a similarity value between two strings
• Example: edit distance
  – ed(s1, s2): minimum # of operations (insertion, deletion, substitution) to change s1 to s2
  – ed(Tom Hanks, Ton Hank) = 2
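The edit distance defined above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to change s1 into s2."""
    # prev[j] is the distance between the already processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute c1 by c2, or match for free
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, the example on the slide
```

Keeping only two rows of the table makes the memory cost linear in the length of s2, which is enough when only the distance (and not the edit sequence) is needed.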
Database
4
Selectivity Estimation: Problem Formulation
A bag of strings
Input: fuzzy string predicate P(q, δ)
star SIMILARTO ’Schwarrzenger’
Output: # of strings s that satisfy dist(s,q) <= δ
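For reference, the exact answer that an estimator tries to approximate can be obtained by scanning the whole bag of strings. The sketch below assumes edit distance as the dist function; `exact_selectivity` is a hypothetical helper name, and the sample list of stars is taken from the movie table earlier in the talk.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def exact_selectivity(strings, q, delta):
    """Count the strings s in the bag with dist(s, q) <= delta (a full scan)."""
    return sum(1 for s in strings if edit_distance(s, q) <= delta)


stars = ["Keanu Reeves", "Samuel Jackson", "Schwarzenegger", "Samuel Jackson"]
print(exact_selectivity(stars, "Schwarrzenger", 3))
```

A scan like this costs one distance computation per string, which is why an optimizer needs a cheaper estimate instead.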
5
Why Selectivity Estimation?
SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND year BETWEEN 1980 AND 1999;

Movies
Star            Title                                         Year  Genre
Keanu Reeves    The Matrix                                    1999  Sci-Fi
Samuel Jackson  Star Wars: Episode III - Revenge of the Sith  2005  Sci-Fi
Schwarzenegger  The Terminator                                1984  Sci-Fi
Samuel Jackson  Goodfellas                                    1990  Drama
…               …                                             …     …

SELECT *
FROM Movies
WHERE star SIMILARTO 'Schwarrzenger'
AND year BETWEEN 1970 AND 1971;
The optimizer needs to know the selectivity of a predicate to decide a good plan.
6
Rest of the talk
• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Proximity between strings
  – Histograms and estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
7
Intuition of SEPIA
Selectivity Estimation of Approximate Predicates
[Figure: a cluster with pivot p and a string s, and a query string q, connected by vectors v1 and v2; a probability curve over ed(p,s) reaching 10%, 28%, 44%, and 100% at 1, 2, 3, and 4.]
8
Proximity between Strings
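[Figure: a cluster with pivot lucia containing luciano and lucas, each at edit distance 2 from the pivot; the query string lukas is at edit distance 3 from the pivot.]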
Edit Distance? Not discriminative enough
9
Edit Vector from s1 to s2
• A vector <I, D, S>
  – I: # of insertions
  – D: # of deletions
  – S: # of substitutions
  in a sequence of edit operations whose length is the edit distance

lucia to luciano: <2,0,0>
lucia to lucas: <1,1,0>
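An edit vector can be read off the same dynamic program that computes the edit distance, by carrying the operation counts along with the cost. The sketch below is an illustration, not the authors' code. When several minimum-cost edit sequences exist it keeps the one with the fewest substitutions; that tie-breaking rule is an assumption made here because it reproduces the vectors shown on these slides, and the paper may define it differently.

```python
def edit_vector(s1: str, s2: str):
    """Return (I, D, S) for one minimum-cost edit sequence from s1 to s2.

    Assumption: ties between minimum-cost sequences are broken toward
    fewer substitutions.
    """
    # Each cell holds (cost, S, I, D); Python compares tuples lexicographically,
    # so min() prefers lower cost first, then fewer substitutions.
    row = [(j, 0, j, 0) for j in range(len(s2) + 1)]  # empty prefix of s1: j insertions
    for i, c1 in enumerate(s1, start=1):
        new = [(i, 0, 0, i)]  # empty prefix of s2: i deletions
        for j, c2 in enumerate(s2, start=1):
            c, s, ins, dele = row[j - 1]
            diag = (c, s, ins, dele) if c1 == c2 else (c + 1, s + 1, ins, dele)
            c, s, ins, dele = row[j]
            up = (c + 1, s, ins, dele + 1)    # delete c1
            c, s, ins, dele = new[j - 1]
            left = (c + 1, s, ins + 1, dele)  # insert c2
            new.append(min(diag, up, left))
        row = new
    _, s, ins, dele = row[-1]
    return (ins, dele, s)


print(edit_vector("lucia", "luciano"))  # (2, 0, 0)
print(edit_vector("lucia", "lucas"))    # (1, 1, 0)
```

The three components always sum to the edit distance, so the vector refines the distance rather than replacing it.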
10
Why Edit Vector? More discriminative
[Figure: a cluster with pivot lucia; edit vectors from the pivot: luciano <2,0,0>, lucas <1,1,0>, lukas <1,1,1>.]
11
SEPIA histograms: Overview
Each cluster (Cluster 1, Cluster 2, ..., Cluster k, with pivots p1, p2, ..., pk) has its own frequency table. Example frequency tables:

Edit Vector  # of Strings
<0,0,0>      3
<0,1,0>      40
...          ...

Edit Vector  # of Strings
<0,0,0>      4
<0,0,1>      12
<0,1,0>      7
...          ...

Edit Vector  # of Strings
<0,0,0>      2
<1,0,2>      84
...          ...

Global PPD Table

Vector v1  Vector v2  Edit Distance  Percentage (%)  Count
<1,0,1>    <1,1,0>    1              30              9
<1,0,1>    <1,1,0>    2              60              18
<1,0,1>    <1,1,0>    3              100             30
<1,1,1>    <1,1,0>    2              32              8
<1,1,1>    <1,1,0>    3              76              19
<1,1,1>    <1,1,0>    4              88              22
<1,1,1>    <1,1,0>    5              100             25
…          …          …              …               …
12
Frequency table for each cluster
Cluster i, pivot pi

Edit Vector  # of Strings
<0,0,0>      4
<0,0,1>      12
<0,1,0>      7
...          ...

[Figure: strings in cluster i marked with edit vector [0,1,0] from the pivot.]
7 strings with an edit vector <0,1,0> from pi
13
Global PPD Table
Proximity Pair Distribution table

Vector v1  Vector v2  Edit Distance  Percentage (%)  Count
<1,0,1>    <1,1,0>    1              30              9
<1,0,1>    <1,1,0>    2              60              18
<1,0,1>    <1,1,0>    3              100             30
<1,1,1>    <1,1,0>    2              32              8
<1,1,1>    <1,1,0>    3              76              19
<1,1,1>    <1,1,0>    4              88              22
<1,1,1>    <1,1,0>    5              100             25
…          …          …              …               …

[Figure: a cluster with pivot p and a string s, and a query string q, with vectors <1,0,1> and <1,1,0>; a probability curve over ed(p,s) reaching 30%, 60%, and 100% at 1, 2, and 3.]
14
SEPIA histograms: summary
[Figure: the complete SEPIA structure. Each cluster (Cluster 1, Cluster 2, ..., Cluster k, with pivots p1, p2, ..., pk) has a frequency table mapping edit vectors to # of strings; a single Global PPD Table maps (Vector v1, Vector v2, Edit Distance) to Percentage (%) and Count.]
15
Selectivity Estimation: ed(lukas, 2)
Cluster i: pivot lucia, query string lukas, edit vector [1,1,1] between them

Frequency Table i
Edit Vector  # of Strings
...          ...
<0,1,0>      40

Global PPD Table
Vector v1  Vector v2  Edit Distance  Percentage (%)  Count
<1,1,1>    <0,1,0>    2              76              19
...        ...

Expected Contribution: 76% * 40
• Do it for all v2 vectors in each cluster, for all clusters
• Take the sum of these contributions
16
Selectivity Estimation for ed(q,d)
• For each cluster Ci
• For each v2 in frequency table of Ci
• Use (v1,v2,d) to lookup PPD
• Take the sum of these f * N
• Pruning possible (triangle inequality)

Cluster i: v1 is the vector between q and the pivot

Frequency Table i
Edit Vector  # of Strings
...          ...
v2           N

Global PPD Table
Vector v1  Vector v2  Edit Distance  Percentage (%)  Count
v1         v2         d              f               ...

Expected Contribution: f * N
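The estimation loop above can be sketched in a few lines. The data layout is an assumption made for illustration: each cluster is given as a pair of its precomputed v1 (the vector between the query and that cluster's pivot) and its frequency table, and the PPD table is a dictionary keyed by (v1, v2, d) holding the fraction f. Treating a missing PPD entry as a contribution of 0 is also an assumption, and the triangle-inequality pruning is not implemented here.

```python
def estimate_selectivity(clusters, ppd, d):
    """Sum f * N over every v2 entry of every cluster's frequency table.

    clusters: list of (v1, freq_table), where freq_table maps v2 -> N
    ppd:      dict mapping (v1, v2, d) -> fraction f in [0, 1]
    """
    total = 0.0
    for v1, freq_table in clusters:
        for v2, n in freq_table.items():
            f = ppd.get((v1, v2, d), 0.0)  # assumption: unseen combination contributes 0
            total += f * n
    return total


# The single contribution worked out on the previous slide: 76% * 40.
clusters = [((1, 1, 1), {(0, 1, 0): 40})]
ppd = {((1, 1, 1), (0, 1, 0), 2): 0.76}
print(round(estimate_selectivity(clusters, ppd, 2), 1))  # 30.4
```

The cost of one estimate is proportional to the total number of frequency-table entries, not to the number of strings in the data set.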
17
Outline
• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Proximity between strings
  – Histograms and estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
18
Clustering Strings
Two example algorithms
• Lexicographic order based
• K-Medoids
  – Choose initial pivots
  – Assign each string to its closest pivot
  – Swap a pivot with another string
  – Reassign the strings
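The K-Medoids steps listed above can be sketched as a simple swap-based loop. This is an illustrative version, not the implementation used in the talk: taking the first k distinct strings in sorted order as initial pivots, accepting any swap that lowers the total distance, and stopping when no single swap helps are all assumptions. It tries every pivot/string swap, so it is only practical for small inputs.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def assign(strings, pivots):
    """Assign each string to its closest pivot; return (clusters, total distance)."""
    clusters = {p: [] for p in pivots}
    total = 0
    for s in strings:
        d, p = min((edit_distance(p, s), p) for p in pivots)
        clusters[p].append(s)
        total += d
    return clusters, total


def k_medoids(strings, k):
    """Return {pivot: members}; needs at least k distinct strings."""
    pivots = sorted(set(strings))[:k]  # assumption: naive choice of initial pivots
    clusters, cost = assign(strings, pivots)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sorted(set(strings) - set(pivots)):
                trial = pivots[:i] + [cand] + pivots[i + 1:]  # swap pivot i with cand
                trial_clusters, trial_cost = assign(strings, trial)
                if trial_cost < cost:  # keep the swap only if it improves the clustering
                    pivots, clusters, cost = trial, trial_clusters, trial_cost
                    improved = True
    return clusters


names = ["lucia", "lucas", "luciano", "hanks", "hank", "banks"]
for pivot, members in k_medoids(names, 2).items():
    print(pivot, members)
```

Because the total distance is a non-negative integer that strictly decreases with every accepted swap, the loop always terminates.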
19
Number of Clusters
It affects:
• Cluster quality
  – Similarity of strings within each cluster
• Costs:
  – Space
  – Estimation time
20
Constructing Frequency Tables
• For each cluster, group strings based on their edit vector from the pivot
• Count the frequency for each group
[Figure: cluster i with pivot pi; strings that share the edit vector [0,1,0] from the pivot are grouped together.]
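Grouping and counting amounts to building a multiset of edit vectors. The sketch below assumes the edit vector from the pivot to each string has already been computed and is passed in as an (I, D, S) tuple; `build_frequency_table` is a name chosen here for illustration, and the counts reuse the example table from the earlier slide.

```python
from collections import Counter


def build_frequency_table(vectors):
    """Map each edit vector (I, D, S) to the number of strings in the cluster having it."""
    return Counter(vectors)


# One vector per string in the cluster, matching the example frequency table.
vectors = [(0, 0, 0)] * 4 + [(0, 0, 1)] * 12 + [(0, 1, 0)] * 7
table = build_frequency_table(vectors)
print(table[(0, 1, 0)])  # 7
```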
21
Constructing PPD Table
• Get enough samples of string triplets (q,p,s)
• Propose a few heuristics
  – ALL_RAND
  – CLOSE_RAND
  – CLOSE_LEX
  – CLOSE_UNIQUE

[Figure: triplets are drawn from a collection of q strings and a set of clusters (pivot p, string s, vectors v1 and v2); a probability curve over ed(p,s) reaching 10%, 28%, 44%, and 100% at 1, 2, 3, and 4.]
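One way to turn sampled triplets into a PPD table is to count, for each vector pair, how many triplets fall at each edit distance and then accumulate. The sketch below is a guess at the bookkeeping, not the paper's procedure: it enumerates every (q, p, s) combination instead of applying one of the sampling heuristics, it takes v1 as the vector from p to q and v2 as the vector from p to s, and it breaks edit-vector ties toward fewer substitutions. All three choices are assumptions.

```python
from collections import defaultdict


def edit_vector(s1, s2):
    """(I, D, S) of a minimum-cost edit sequence; ties favor fewer substitutions (assumption)."""
    row = [(j, 0, j, 0) for j in range(len(s2) + 1)]  # cells are (cost, S, I, D)
    for i, c1 in enumerate(s1, start=1):
        new = [(i, 0, 0, i)]
        for j, c2 in enumerate(s2, start=1):
            c, s, ins, dele = row[j - 1]
            diag = (c, s, ins, dele) if c1 == c2 else (c + 1, s + 1, ins, dele)
            c, s, ins, dele = row[j]
            up = (c + 1, s, ins, dele + 1)
            c, s, ins, dele = new[j - 1]
            left = (c + 1, s, ins + 1, dele)
            new.append(min(diag, up, left))
        row = new
    _, s, ins, dele = row[-1]
    return (ins, dele, s)


def build_ppd(queries, clusters):
    """clusters maps pivot -> member strings.

    Returns {(v1, v2): {d: (cumulative count, cumulative percentage)}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for q in queries:
        for p, members in clusters.items():
            v1 = edit_vector(p, q)
            for s in members:
                v2 = edit_vector(p, s)
                d = sum(edit_vector(q, s))  # I + D + S equals the edit distance
                counts[(v1, v2)][d] += 1
    ppd = {}
    for pair, by_distance in counts.items():
        total = sum(by_distance.values())
        running, rows = 0, {}
        for d in sorted(by_distance):
            running += by_distance[d]  # triplets with distance <= d
            rows[d] = (running, 100.0 * running / total)
        ppd[pair] = rows
    return ppd


ppd = build_ppd(["lukas"], {"lucia": ["lucia", "lucas", "luciano"]})
print(ppd[((1, 1, 1), (1, 1, 0))])  # {1: (1, 100.0)}
```

With realistic sample sizes each vector pair collects many triplets, so the cumulative percentages rise gradually with d, as in the example table.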
22
Dynamic Maintenance: Frequency Table
Take insertion as an example
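Cluster i, pivot pi

Edit Vector  # of Strings
<0,0,0>      4
<0,0,1>      12
<0,1,0>      7 → 8
...          ...

[Figure: a new string with edit vector [0,1,0] from the pivot is inserted into cluster i, so the count for <0,1,0> changes from 7 to 8.]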
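The insertion case shown above touches a single counter. In the sketch below the frequency table is a `Counter` keyed by (I, D, S) tuples, and the new string's edit vector from the pivot is assumed to be already computed. The slide only covers insertion; the `delete_string` counterpart is an assumption added for symmetry.

```python
from collections import Counter


def insert_string(freq_table, vector):
    """Account for a new string whose edit vector from the pivot is `vector`."""
    freq_table[vector] += 1


def delete_string(freq_table, vector):
    """Assumed counterpart for deletions; drops the entry when its count reaches zero."""
    if freq_table[vector] <= 1:
        freq_table.pop(vector, None)
    else:
        freq_table[vector] -= 1


table = Counter({(0, 0, 0): 4, (0, 0, 1): 12, (0, 1, 0): 7})
insert_string(table, (0, 1, 0))
print(table[(0, 1, 0)])  # 8
```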
23
Dynamic Maintenance: PPD
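[Figure: a new string is inserted into one of the clusters used in the construction of the PPD. For each q in the collection of q strings used in that construction, with pivot p, vectors v1 and v2, and ed(p,s) = 2, the PPD rows for (v1, v2) are updated: among the rows for edit distances 0, 1, 2, and 3 (counts 8, 19, 22, 25; percentages 32, 76, 88, 100), the count is incremented (+1) and the percentages are adjusted.]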
24
Improving Estimation Accuracy
• A post-processing step to further improve estimation accuracy
• See paper for details.
25
Outline
• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
  – Proximity between strings
  – Histograms and estimation algorithm
• Construction and maintenance of SEPIA
• Experiments
26
Data
• Citeseer:
  – 71K author names
  – Length: [2,20], avg = 12
• Movie records from UCI KDD repository:
  – 11K movie titles
  – Length: [3,80], avg = 35
• Introduced duplicates:
  – 10% of records
  – # of duplicates: [1,20], uniform
• Final results:
  – Citeseer: 142K author names
  – UCI KDD: 23K movie titles
27
Setting
• Test bed
  – PC: 2.4 GHz P4, 1.2 GB RAM, Windows XP
  – Visual C++ compiler
• Query workload:
  – Strings from the data
  – Strings not in the data
  – Results similar
• Quality measurements
  – Relative error: (f_est – f_real) / f_real
  – Absolute relative error: |f_est – f_real| / f_real
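The two quality measurements translate directly into code. In this sketch, f_est is the estimated count and f_real the true count; returning infinity when f_real is 0 and f_est is not (and 0 when both are 0) is an assumption made here, since the formulas are undefined in that case.

```python
import math


def relative_error(f_est, f_real):
    """(f_est - f_real) / f_real; negative values mean underestimation."""
    if f_real == 0:
        # Assumption: the formula is undefined here; report infinity unless both are 0.
        return 0.0 if f_est == 0 else math.inf
    return (f_est - f_real) / f_real


def absolute_relative_error(f_est, f_real):
    """|f_est - f_real| / f_real."""
    return abs(relative_error(f_est, f_real))


print(relative_error(30, 40))           # -0.25
print(absolute_relative_error(30, 40))  # 0.25
```

The signed version shows whether the estimator tends to over- or underestimate, while the absolute version is the one to average over a workload.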
28
Quartile distribution of relative errors
[Figure: quartile distribution of relative errors. X-axis: Relative Error (%), from -100 to 100 in steps of 25, plus Infinity. Y-axis: Percentage in Workload, from 0 to 1. Data set 1; CLOSE_RAND; 1000 clusters.]
29
Number of Clusters
30
Dynamic Maintenance
More results in the paper:
• Extension to other similarity functions
• More experimental results
31
Related Work
• Traditional histograms
• Selectivity estimation for predicates with wildcards: star LIKE “%Hanks%”
• Answering fuzzy predicates efficiently (another talk in this conference)
32
Conclusions
• Important to support queries with fuzzy string predicates
• SEPIA: provides accurate selectivity estimation
  – Structures can be efficiently constructed and maintained
  – Extensible to various similarity measures
Q&A?
The Flamingo Project : http://www.ics.uci.edu/~flamingo/