liang jin and chen li vldb’2005 supported by nsf career award iis-0238586 selectivity estimation...

Liang Jin and Chen Li

VLDB’2005

Supported by NSF CAREER Award IIS-0238586

Selectivity Estimation for Fuzzy String Predicates in Large Data Sets

2

Example: a movie database

Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

“Find movies starring Schwarrzenger”?

Find movies with a star “similar to” Schwarrzenger.

3

Queries with Fuzzy String Predicates

• Stars: name similar to “Schwarrzenger”
• Employees: SSN similar to “430-87-7294”
• Customers: telephone number similar to “412-0964”

• Similar to:
– a domain-specific function
– returns a similarity value between two strings

• Example: edit distance
– ed(s1, s2): minimum # of operations (insertion, deletion, substitution) to change s1 to s2
– ed(Tom Hanks, Ton Hank) = 2
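The edit distance defined above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch, not the authors' implementation; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] is the distance between the previous prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2, the slide's example
```

Only two rows of the table are kept at a time, so memory is linear in the length of `s2`.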

Database

4

Selectivity Estimation: Problem Formulation

A bag of strings

Input: fuzzy string predicate P(q, δ)

star SIMILARTO ’Schwarrzenger’

Output: # of strings s that satisfy dist(s,q) <= δ

5

Why Selectivity Estimation?

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND year BETWEEN [1980,1999];

Movies

Star | Title | Year | Genre
Keanu Reeves | The Matrix | 1999 | Sci-Fi
Samuel Jackson | Star Wars: Episode III - Revenge of the Sith | 2005 | Sci-Fi
Schwarzenegger | The Terminator | 1984 | Sci-Fi
Samuel Jackson | Goodfellas | 1990 | Drama
… | … | … | …

SELECT *

FROM Movies

WHERE star SIMILARTO ’Schwarrzenger’

AND year BETWEEN [1970,1971];

The optimizer needs to know the selectivity of each predicate to choose a good plan, for example whether to evaluate the fuzzy predicate or the year range first.

6

Rest of the talk

• Motivation: selectivity estimation of fuzzy predicates
• Our approach: SEPIA
– Proximity between strings
– Histograms and estimation algorithm
• Construction and maintenance of SEPIA
• Experiments

7

Intuition of SEPIA

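SEPIA: Selectivity Estimation of Approximate Predicates

[Figure: a cluster with pivot p and a string s, and a query string q. v1 is the edit vector between q and p; v2 is the edit vector between p and s. A chart shows probability values of 10%, 28%, 44% and 100% over distances 1 to 4.]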

8

Proximity between Strings

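[Figure: a cluster with a pivot, and a query string. The strings shown are lucia, luciano, lucas and lukas; the edge labels are edit distances 2, 2 and 3.]

Edit distance? Not discriminative enough.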

9

Edit Vector from s1 to s2

• A vector <I, D, S>
– I: # of insertions
– D: # of deletions
– S: # of substitutions
in a sequence of edit operations whose length equals the edit distance

Examples:
– lucia to luciano: <2,0,0>
– lucia to lucas: <1,1,0>
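One way to obtain an edit vector <I, D, S> is to fill the usual edit-distance table and then walk back through it, counting each operation type. The sketch below makes an assumption the slides do not state: when several minimum-cost sequences exist, it prefers insertion, then deletion, then substitution. The function name `edit_vector` is illustrative.

```python
def edit_vector(s1: str, s2: str) -> tuple:
    """Return (I, D, S) for one minimum-cost edit sequence turning s1 into s2."""
    n, m = len(s1), len(s2)
    # dist[i][j] = edit distance between s1[:i] and s2[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                             # deletion
                dist[i][j - 1] + 1,                             # insertion
                dist[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # substitution or match
            )
    # Walk back from the bottom-right corner, counting operations.
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, no cost
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ins += 1             # assumed tie-break: insertion first
            j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            sub += 1
            i, j = i - 1, j - 1
    return ins, dele, sub


print(edit_vector("lucia", "luciano"))  # (2, 0, 0)
print(edit_vector("lucia", "lucas"))    # (1, 1, 0)
```

The components always sum to the edit distance. A different tie-break would still give a valid vector but could return, for instance, (0, 0, 2) for lucia to lucas.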

10

Why Edit Vector? More discriminative

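[Figure: a cluster with the strings lucia, luciano and lucas, plus the string lukas. Edit vectors relative to lucia: luciano <2,0,0>, lucas <1,1,0>, lukas <1,1,1>.]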

11

SEPIA histograms: Overview

One frequency table per cluster (Cluster 1, Cluster 2, ..., Cluster k, with pivots p1, p2, ..., pk). Example frequency tables (edit vector: # of strings):

<0,0,0>: 3 | <0,1,0>: 40 | ...
<0,0,0>: 4 | <0,0,1>: 12 | <0,1,0>: 7 | ...
<0,0,0>: 2 | <1,0,2>: 84 | ...

Global PPD Table

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1> | <1,1,0> | 3 | 100 | 30
<1,0,1> | <1,1,0> | 2 | 60 | 18
<1,0,1> | <1,1,0> | 1 | 30 | 9
<1,1,1> | <1,1,0> | 5 | 100 | 25
<1,1,1> | <1,1,0> | 4 | 88 | 22
<1,1,1> | <1,1,0> | 3 | 76 | 19
<1,1,1> | <1,1,0> | 2 | 32 | 8
… | … | … | … | …

12

Frequency table for each cluster

Cluster i, pivot pi

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7
... | ...

7 strings with an edit vector <0,1,0> from pi

13

Global PPD Table

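PPD: Proximity Pair Distribution table

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,0,1> | <1,1,0> | 3 | 100 | 30
<1,0,1> | <1,1,0> | 2 | 60 | 18
<1,0,1> | <1,1,0> | 1 | 30 | 9
<1,1,1> | <1,1,0> | 5 | 100 | 25
<1,1,1> | <1,1,0> | 4 | 88 | 22
<1,1,1> | <1,1,0> | 3 | 76 | 19
<1,1,1> | <1,1,0> | 2 | 32 | 8
… | … | … | … | …

[Figure: a cluster with pivot p and string s, and a query string q, labeled with the edit vectors <1,0,1> and <1,1,0>. For this pair the probability is 30%, 60% and 100% at edit distances 1, 2 and 3.]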

14

SEPIA histograms: summary

• One frequency table per cluster (Cluster 1, Cluster 2, ..., Cluster k, with pivots p1, p2, ..., pk): edit vector from the pivot, # of strings
• One global PPD table: Vector v1, Vector v2, Edit Distance, Percentage (%), Count

15

Selectivity Estimation: ed(lukas, 2)

Example for cluster i, with pivot lucia and query string lukas (edit vector <1,1,1>):

Frequency Table i

Edit Vector | # of Strings
<0,1,0> | 40
... | ...

Global PPD Table

Vector v1 | Vector v2 | Edit Distance | Percentage (%) | Count
<1,1,1> | <0,1,0> | 2 | 76 | 19
... | ... | ... | ... | ...

Expected Contribution: 76% * 40

• Do it for all v2 vectors in each cluster, for all clusters
• Take the sum of these contributions

16

Selectivity Estimation for ed(q,d)

• For each cluster Ci
– For each v2 in the frequency table of Ci
– Use (v1, v2, d) to look up the PPD
• Take the sum of these f * N
• Pruning possible (triangle inequality)

[Figure: cluster i with its pivot and the query string q, linked by v1. Frequency Table i gives N, the # of strings with edit vector v2. The Global PPD Table gives f, the percentage for (v1, v2, d). Expected Contribution: f * N]
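The lookup-and-sum procedure above can be sketched as follows. The data layout is an assumption made for illustration: each frequency table is a dict from v2 to N, the PPD is a dict from (v1, v2, d) to a fraction f, and a missing PPD entry contributes 0. The slides only say that pruning with the triangle inequality is possible; the two shortcuts below are one way to apply it, using the fact that the components of an edit vector sum to the edit distance.

```python
def estimate_selectivity(query_vectors, freq_tables, ppd, d):
    """Sum f * N over every v2 of every cluster.

    query_vectors[i] -- v1, the edit vector between the query and pivot i
    freq_tables[i]   -- dict mapping v2 to N, the # of strings in cluster i
    ppd              -- dict mapping (v1, v2, d) to the fraction f (assumed layout)
    """
    total = 0.0
    for v1, table in zip(query_vectors, freq_tables):
        to_pivot = sum(v1)  # ed(q, pivot)
        for v2, n in table.items():
            to_string = sum(v2)  # ed(pivot, s)
            if abs(to_pivot - to_string) > d:
                continue         # lower bound exceeds d: no string qualifies
            if to_pivot + to_string <= d:
                total += n       # upper bound within d: every string qualifies
                continue
            total += ppd.get((v1, v2, d), 0.0) * n  # assumption: unseen pair counts as 0
    return total


# The slide's example: query lukas, pivot lucia, d = 2.
ppd = {((1, 1, 1), (0, 1, 0), 2): 0.76}
tables = [{(0, 1, 0): 40}]
print(estimate_selectivity([(1, 1, 1)], tables, ppd, d=2))  # about 30.4, i.e. 76% * 40
```

With a single cluster and a single v2 this reduces to the slide's expected contribution of 76% * 40.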

17

Outline




18

Clustering Strings

Two example algorithms
• Lexicographic order based
• K-Medoids
– Choose initial pivots
– Assign each string to its closest pivot
– Swap a pivot with another string
– Reassign the strings
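The four K-Medoids steps can be sketched as below. Several details are assumptions, since the slides do not give them: the first k strings serve as initial pivots, a swap is kept only when it lowers the total distance from strings to their pivots, and the input strings are distinct. The helper `ed` is a compact recursive edit distance that is adequate for short demo strings.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ed(a: str, b: str) -> int:
    """Edit distance by memoized recursion (fine for short strings)."""
    if not a or not b:
        return len(a) + len(b)
    return min(ed(a[1:], b) + 1, ed(a, b[1:]) + 1, ed(a[1:], b[1:]) + (a[0] != b[0]))


def assign(strings, pivots):
    """Assign each string to its closest pivot."""
    clusters = {p: [] for p in pivots}
    for s in strings:
        clusters[min(pivots, key=lambda p: ed(p, s))].append(s)
    return clusters


def cost(clusters):
    """Total distance from every string to the pivot of its cluster."""
    return sum(ed(p, s) for p, members in clusters.items() for s in members)


def k_medoids(strings, k):
    pivots = list(strings[:k])  # assumption: first k strings as initial pivots
    clusters = assign(strings, pivots)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in strings:
                if candidate in pivots:
                    continue
                # Swap pivot i with the candidate and reassign all strings.
                trial = pivots[:i] + [candidate] + pivots[i + 1:]
                trial_clusters = assign(strings, trial)
                if cost(trial_clusters) < cost(clusters):
                    pivots, clusters, improved = trial, trial_clusters, True
    return clusters


names = ["lucia", "lucas", "luciano", "lukas", "hanks", "hank", "tanks"]
for pivot, members in k_medoids(names, 2).items():
    print(pivot, members)
```

The loop stops because each accepted swap strictly lowers the total cost. Trying every swap is expensive on large data sets, so this is only meant to show the shape of the algorithm.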

19

Number of Clusters

It affects:
• Cluster quality
– Similarity of strings within each cluster
• Costs:
– Space
– Estimation time

20

Constructing Frequency Tables

• For each cluster, group strings based on their edit vector from the pivot

• Count the frequency for each group

[Figure: cluster i with pivot pi; the strings with edit vector <0,1,0> from pi form one group.]
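Grouping and counting by edit vector maps directly onto a counter. In this minimal sketch the edit-vector function is passed in as `vector_of`, a hypothetical parameter; the demo uses a small lookup table holding the vectors from the earlier slides instead of computing them.

```python
from collections import Counter


def build_frequency_table(pivot, strings, vector_of):
    """Group the strings of one cluster by edit vector from the pivot; count each group."""
    return Counter(vector_of(pivot, s) for s in strings)


# Edit vectors from the pivot "lucia", taken from the earlier slides.
SLIDE_VECTORS = {"lucia": (0, 0, 0), "luciano": (2, 0, 0), "lucas": (1, 1, 0)}

cluster = ["lucia", "luciano", "lucas", "lucas"]  # a bag: duplicates are counted
table = build_frequency_table("lucia", cluster, lambda p, s: SLIDE_VECTORS[s])
print(table[(1, 1, 0)])  # 2
```

Inserting a new string into the cluster then amounts to incrementing the count for its edit vector.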

21

Constructing PPD Table

• Get enough samples of string triplets (q, p, s)
• Propose a few heuristics
– ALL_RAND
– CLOSE_RAND
– CLOSE_LEX
– CLOSE_UNIQUE

[Figure: query strings q come from a collection of q strings; pivots p and strings s come from a set of clusters. For a pair (v1, v2), a chart shows probability values of 10%, 28%, 44% and 100% over distances 1 to 4.]
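Once triplets have been sampled, the PPD rows can be filled by counting. The sketch below starts from triplets already reduced to (v1, v2, ed(q, s)) and does not model the sampling heuristics, which the slides only name. It assumes the percentage for a distance d is cumulative, that is, the share of samples with ed(q, s) <= d, which matches the slide's counts of 9, 18 and 30 out of 30.

```python
from collections import defaultdict


def build_ppd(samples, max_distance):
    """samples: iterable of (v1, v2, ed_qs). Returns {(v1, v2, d): (fraction, count)}."""
    by_pair = defaultdict(list)
    for v1, v2, ed_qs in samples:
        by_pair[(v1, v2)].append(ed_qs)
    ppd = {}
    for (v1, v2), distances in by_pair.items():
        total = len(distances)
        for d in range(max_distance + 1):
            count = sum(1 for x in distances if x <= d)  # cumulative count (assumed)
            ppd[(v1, v2, d)] = (count / total, count)
    return ppd


# Made-up samples chosen to reproduce the slide's rows for one vector pair.
v1, v2 = (1, 0, 1), (1, 1, 0)
samples = [(v1, v2, 1)] * 9 + [(v1, v2, 2)] * 9 + [(v1, v2, 3)] * 12
ppd = build_ppd(samples, max_distance=3)
for d in (1, 2, 3):
    print(d, ppd[(v1, v2, d)])
```

For d = 1, 2, 3 the counts come out as 9, 18 and 30, giving 30%, 60% and 100%.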

22

Dynamic Maintenance: Frequency Table

Take insertion as an example

Cluster i, pivot pi

Edit Vector | # of Strings
<0,0,0> | 4
<0,0,1> | 12
<0,1,0> | 7 → 8
... | ...

A new string with edit vector <0,1,0> from pi raises that count from 7 to 8.

23

Dynamic Maintenance: PPD

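[Figure: a new string s, a pivot p from one of the clusters used in the construction of PPD, and a query string q from the collection of q strings used in the construction of PPD, with vectors v1 and v2 and ed(p,s)=2. In the PPD rows for (v1, v2), shown with percentages 100, 88, 76, 32 and counts 25, 22, 19, 8, counts get +1 and the percentages are adjusted.]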

24

Improving Estimation Accuracy

• A post-processing step to further improve estimation accuracy

• See paper for details.

25

Outline




26

Data

• Citeseer:
– 71K author names
– Length: [2,20], avg = 12
• Movie records from UCI KDD repository:
– 11K movie titles
– Length: [3,80], avg = 35
• Introduced duplicates:
– 10% of records
– # of duplicates: [1,20], uniform
• Final results:
– Citeseer: 142K author names
– UCI KDD: 23K movie titles

27

Setting

• Test bed
– PC: 2.4GHz P4, 1.2GB RAM, Windows XP
– Visual C++ compiler
• Query workload:
– Strings from the data
– Strings not in the data
– Results similar
• Quality measurements
– Relative error: (f_est - f_real) / f_real
– Absolute relative error: |f_est - f_real| / f_real
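The two quality measurements translate directly into code. In this small sketch the handling of an empty true result (f_real = 0) is an assumption: a nonzero estimate is treated as infinite error.

```python
import math


def relative_error(f_est, f_real):
    """(f_est - f_real) / f_real; signed, so underestimates are negative."""
    if f_real == 0:
        # Assumption: a nonzero estimate for an empty true result is an infinite error.
        return 0.0 if f_est == 0 else math.inf
    return (f_est - f_real) / f_real


def absolute_relative_error(f_est, f_real):
    """|f_est - f_real| / f_real."""
    return abs(relative_error(f_est, f_real))


print(relative_error(30, 40))           # -0.25
print(absolute_relative_error(30, 40))  # 0.25
```

The signed version shows whether an estimator tends to over- or underestimate, while the absolute version is the one to average over a workload.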

28

Quartile distribution of relative errors

[Figure: x-axis: Relative Error (%), ticks -100, -75, -50, -25, 0, 25, 50, 75, 100, Infinity; y-axis: Percentage in Workload, ticks 0 to 1.]

Data set 1. CLOSE_RAND; 1000 clusters

29

Number of Clusters

30

Dynamic Maintenance

More results in the paper:
• Extension to other similarity functions
• More experimental results

31

Related Work

• Traditional histograms
• Selectivity estimation for predicates with wildcards: star LIKE “%Hanks%”
• Answering fuzzy predicates efficiently (another talk in this conference)

32

Conclusions

• Important to support queries with fuzzy string predicates

• SEPIA: provides accurate selectivity estimation
– Structures can be efficiently constructed and maintained
– Extendable to various similarity measures

Q&A?

The Flamingo Project : http://www.ics.uci.edu/~flamingo/
