![Page 1: Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1] Pirooz Chubak May 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022062620/551ac7c555034656628b5cec/html5/thumbnails/1.jpg)
Extending Q-Grams to Estimate Selectivity of String Matching with
Low Edit Distance [1]
Pirooz Chubak
May 22, 2008
Motivation
• Selectivity estimation of approximate string matching queries
• Applications
– Misspelling correction/suggestion
– Data integration and data cleaning
– Query optimization (generating query plans)
Approximate String Matching
• String similarity measures
– Edit distance
– Hamming distance
– Jaccard similarity coefficient
• Edit distance
– Minimum number of edit operations (insertion, deletion, replacement) needed to convert one string into the other
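The definition above can be made concrete with the standard dynamic-programming computation. This is a minimal sketch assuming unit cost for each insertion, deletion and replacement; the function name `edit_distance` is illustrative and does not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and replacement."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # replace ca by cb, or match
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so memory stays proportional to the length of `b`.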
Short Identifying Substring
• SIS by Chaudhuri et al. [2]
– A string s usually has a short substring s’ such that if an attribute value contains s’, it almost always contains s
– Thus, approximate the selectivity of long query strings with that of their shorter substrings
Related Work
• SEPIA [3]
– Clusters similar strings
– Selects a pivot for each cluster
– Captures the edit distance distribution with histograms
– For each query, visits all the clusters and estimates the number of strings within the distance threshold
Problem Statement
• Given a query string sq and a bag of strings DB, estimate the size of the answer set
$\{\, s \mid ed(s_q, s) \le \tau,\ s \in DB \,\}$, where $\tau$ is the edit distance threshold
• Interested in low edit thresholds (1-3)
Basic Definitions
• Q-gram
– Any string of length q
• N-gram table
– Frequencies of all q-grams for q = 1…N
• Ans(sq,iDjImR) = set of strings s’ such that sq can be converted to s’ with i deletions, j insertions and m replacements
• Ans(sq,k) = set of strings s’ obtained from sq with exactly k edit operations
Examples
• Ans(“abcd”,1R) = {“?bcd”,“a?cd”,“ab?d”,“abc?”}
• Alphabet for extended Q-grams = Σ ∪ {#, $, ?}
• 3-gram table for “beau” contains frequencies for
– 1-grams (b, e, a, u)
– 2-grams (#b, be, ea, au, u$)
– 3-grams (#be, bea, eau, au$)
• Extended 3-gram table also contains frequencies for
– 2-grams (?b, ?e, ?u, b?, e?, a?, u?, ??, #?, ?$)
– 3-grams (?ea, #?e, ??$, etc.)
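To make the “beau” example concrete, the following sketch builds an extended N-gram table for a small bag of strings. It rests on assumptions that the slides do not state: strings are padded with `#` and `$`, the boundary symbols are not counted as 1-grams, any position of a q-gram (including a boundary symbol) may be replaced by the wildcard `?`, and frequencies count q-gram occurrences. The name `extended_ngram_table` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def extended_ngram_table(strings, n):
    """Count every q-gram (q = 1..n) and all of its wildcard variants."""
    table = Counter()
    for s in strings:
        padded = "#" + s + "$"          # assumed begin/end markers
        for q in range(1, n + 1):
            for start in range(len(padded) - q + 1):
                gram = padded[start:start + q]
                if q == 1 and gram in "#$":
                    continue            # assumption: no 1-grams for boundary symbols
                # Replace every subset of positions by '?' (the empty subset keeps the gram).
                for r in range(q + 1):
                    for positions in combinations(range(q), r):
                        chars = list(gram)
                        for p in positions:
                            chars[p] = "?"
                        table["".join(chars)] += 1
    return table

table = extended_ngram_table(["beau"], 3)
print(table["be"], table["?ea"], table["??"])
```

Because every q-gram of length q yields 2^q variants, the table grows quickly with N, which is one reason the approach trades space for accuracy.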
Replacement semi-lattice
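• Assume only replacements are allowed
• E.g. Ans(“abcd”,2R)
– Possible answers = ab??, a?c?, ?bc?, a??d, ?b?d, ??cd
• Find the value of |Ans(“abcd”,2R)| using the inclusion-exclusion principle over S1 = ab??, … , S6 = ??cd:
$|S_1 \cup \dots \cup S_6| = \sum_{r=1}^{6} (-1)^{r+1} \sum_{i_1 < \dots < i_r} |S_{i_1} \cap \dots \cap S_{i_r}|$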
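The base strings listed above can be enumerated mechanically by choosing which k positions of the query become wildcards. A minimal sketch, with the hypothetical helper name `replacement_patterns`:

```python
from itertools import combinations

def replacement_patterns(query: str, k: int) -> list[str]:
    """All patterns obtained by replacing exactly k positions of query with '?'."""
    patterns = []
    for positions in combinations(range(len(query)), k):
        chars = list(query)
        for p in positions:
            chars[p] = "?"
        patterns.append("".join(chars))
    return patterns

print(replacement_patterns("abcd", 2))
```

A query of length l has C(l, k) such patterns, so "abcd" with k = 2 yields the six sets S1 to S6.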
Replacement semi-lattice (Cont.)
Get the values of intersections from this table and plug them into the formula for |Ans(“abcd”,2R)|
Semi-lattice for Ans(“abcd”,2R)
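Since the figure is not reproduced here, the nodes of the semi-lattice can be regenerated with a short sketch. It assumes that `?` matches exactly one arbitrary character, so two patterns of equal length intersect position by position; `intersect` and `semi_lattice` are illustrative names, not taken from the slides.

```python
def intersect(p, q):
    """Pattern matching exactly the strings matched by both p and q, or None if there are none."""
    if len(p) != len(q):
        return None
    out = []
    for a, b in zip(p, q):
        if a == "?":
            out.append(b)
        elif b == "?" or a == b:
            out.append(a)
        else:
            return None          # two different concrete characters clash
    return "".join(out)

def semi_lattice(base):
    """All distinct non-empty intersections of one or more base patterns."""
    nodes = set(base)
    frontier = set(base)
    while frontier:
        new = set()
        for p in frontier:
            for b in base:
                m = intersect(p, b)
                if m is not None and m not in nodes:
                    new.add(m)
        nodes |= new
        frontier = new
    return nodes

base = ["ab??", "a?c?", "?bc?", "a??d", "?b?d", "??cd"]
print(sorted(semi_lattice(base), key=lambda n: (-n.count("?"), n)))
```

For this query the lattice has three levels: the six base strings with two wildcards, four nodes with one wildcard, and the query "abcd" itself.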
General Formulas
• Generalize the above idea to find |Ans(sq,kR)|
• The general formula for deletion is simple: it can be shown to always be the sum of the frequencies of the level-0 nodes
• The general case for insertion can be very complex; only cases with at most 3 insertions are of interest
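The deletion case can be illustrated directly. Deleting k characters produces concrete strings with no wildcards, so distinct results match disjoint sets of database strings and their frequencies simply add up. The sketch below assumes exact frequencies from a toy bag of strings rather than estimates from an N-gram table; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def deletion_answers(query, k):
    """Distinct strings obtained by deleting exactly k characters from query."""
    if k > len(query):
        return set()
    keep = len(query) - k
    return {"".join(query[i] for i in kept)
            for kept in combinations(range(len(query)), keep)}

def count_deletions(db, query, k):
    """Size of Ans(query, kD) in db: the sum of the frequencies of the distinct results."""
    freq = Counter(db)
    return sum(freq[s] for s in deletion_answers(query, k))

print(count_deletions(["ab", "ab", "bc", "xy"], "abc", 1))
```

Collecting the results in a set matters when the query has repeated characters, since different deletions can produce the same string.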
Estimate selectivity
• General idea
– Group Ans(sq,k) by the length of the strings (l-k ... l+k, where l is the length of sq)
– Estimate the size of each subset separately
• Ans(“abcde”,2)
– 5 subsets, having strings of length 3 to 7
– Length 3 is Ans(“abcde”,2D)
– Length 5 is Ans(“abcde”,1I1D) ∪ Ans(“abcde”,2R)
– Lots of overlap between these two sets
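The grouping by length can be enumerated with a few lines of code. This sketch assumes that exactly k operations are split into insertions, deletions and replacements, and that the resulting length is l - (deletions) + (insertions). Labels follow the slide's "1I1D" ordering; the function name `group_by_length` is illustrative.

```python
def group_by_length(query_len: int, k: int) -> dict[int, list[str]]:
    """Map each result length to the operation mixes (e.g. '1I1D') that produce it."""
    groups: dict[int, list[str]] = {}
    for d in range(k + 1):                 # number of deletions
        for i in range(k - d + 1):         # number of insertions
            r = k - d - i                  # remaining operations are replacements
            if d + r > query_len:
                continue                   # not enough characters to delete and replace
            length = query_len - d + i
            label = "".join(f"{n}{op}" for n, op in ((i, "I"), (d, "D"), (r, "R")) if n)
            groups.setdefault(length, []).append(label)
    return groups

for length, mixes in sorted(group_by_length(5, 2).items()):
    print(length, mixes)
```

For a query of length 5 and k = 2, only length 5 is reached by two different mixes, which is where the overlap between answer sets has to be handled.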
Estimate selectivity (Cont.)
• Combined approach
– Obtain base strings for both sets
– Remove redundant base strings
• Ans(“abcde”,2R) generates “abc??”
• Ans(“abcde”,1I1D) generates “abcd?”
• “abc??” covers all the strings in “abcd?”
– Remove “abcd?” from base strings
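The redundancy check can be sketched as a containment test between wildcard patterns: a pattern covers another if it is at least as general at every position. This assumes `?` matches exactly one character; `covers` and `remove_redundant` are hypothetical helper names.

```python
def covers(p, q):
    """True if every string matched by pattern q is also matched by pattern p."""
    return len(p) == len(q) and all(a == "?" or a == b for a, b in zip(p, q))

def remove_redundant(base):
    """Drop duplicates and every base string covered by a different base string."""
    unique = list(dict.fromkeys(base))     # remove exact duplicates, keep order
    return [q for q in unique
            if not any(p != q and covers(p, q) for p in unique)]

print(remove_redundant(["abc??", "abcd?", "?bcde"]))
```

Removing covered base strings before intersecting keeps the later inclusion-exclusion step smaller without changing the union.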
Estimate selectivity (cont.)
• BasicEQ, for a given string length
– Find the base strings (remove redundancies)
– Iteratively intersect base strings to obtain r-intersections (r = 2..|base strings|)
• This will generate new nodes in the hierarchy
– Partition the nodes and estimate their frequencies
– Add these estimated frequencies
Estimate selectivity (cont.)
• Node Partitioning
– Partition the nodes, so that every node q in a partition has the same coefficient Cq
– Cq is the number of times q appears in all the intersections of base strings
– For each partition find Cq and sum of frequencies of its nodes
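One way to read the coefficient Cq is as the signed number of times node q appears among the r-intersections in the inclusion-exclusion sum. The sketch below follows that reading, which is an assumption rather than the slides' stated definition. It uses exact frequencies from a toy bag of strings instead of N-gram table estimates, and enumerates all subsets, so it is only practical for a handful of base strings. All names are illustrative.

```python
from itertools import combinations

def intersect(p, q):
    """Position-wise intersection of two wildcard patterns, or None if empty."""
    if len(p) != len(q):
        return None
    out = []
    for a, b in zip(p, q):
        if a == "?":
            out.append(b)
        elif b == "?" or a == b:
            out.append(a)
        else:
            return None
    return "".join(out)

def matches(pattern, s):
    return len(pattern) == len(s) and all(a == "?" or a == b for a, b in zip(pattern, s))

def node_coefficients(base):
    """Signed count of how often each node occurs as an r-intersection of base strings."""
    coeff = {}
    for r in range(1, len(base) + 1):
        sign = 1 if r % 2 else -1
        for subset in combinations(base, r):
            node = subset[0]
            for p in subset[1:]:
                node = intersect(node, p)
                if node is None:
                    break
            if node is not None:
                coeff[node] = coeff.get(node, 0) + sign
    return coeff

def union_size(db, base):
    """Sum over nodes of Cq times the (here exact) frequency of q."""
    return sum(c * sum(matches(node, s) for s in db)
               for node, c in node_coefficients(base).items())

base = ["ab??", "a?c?", "?bc?", "a??d", "?b?d", "??cd"]
coeff = node_coefficients(base)
partitions = {}
for node, c in coeff.items():
    partitions.setdefault(c, []).append(node)   # nodes sharing a coefficient form one partition
print(partitions)
```

Under this reading, the nodes for "abcd" with two replacements fall into three partitions, one per lattice level, so only one coefficient and one frequency sum are needed per level.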
Frequency Estimation
• Estimate the frequency of an extended q-gram in the extended N-gram table
• Maximal Overlap (MO) [4]
– Finds the substring in the table that has the maximum overlap with sq
• MAX approach
– If MO(“abc?”) < MO(“abcd”), then use MO(“abcd”) as the estimate for “abc?”
• MO+
– Find the substring with the minimum frequency
• MM
– Combination of MAX and MO+
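A simplified sketch of the MO idea and the MAX adjustment follows. It makes several assumptions beyond the slides: the table holds exact occurrence counts of all plain substrings up to length n (with n at least 2), no wildcards or boundary symbols are handled, and the maximal substrings are therefore the consecutive length-n windows of the string. The estimator in [4] is defined over a pruned structure, so this only illustrates the chaining of overlapping substrings. All names are hypothetical.

```python
from collections import Counter

def substring_counts(db, n):
    """Occurrence counts of all substrings of length 1..n."""
    counts = Counter()
    for s in db:
        for q in range(1, n + 1):
            for i in range(len(s) - q + 1):
                counts[s[i:i + q]] += 1
    return counts

def mo_estimate(counts, s, n):
    """Chain consecutive n-grams, dividing by the count of their (n-1)-character overlap."""
    if len(s) <= n:
        return float(counts[s])            # short strings are looked up directly
    est = float(counts[s[:n]])
    for i in range(1, len(s) - n + 1):
        overlap = counts[s[i:i + n - 1]]
        if overlap == 0:
            return 0.0
        est *= counts[s[i:i + n]] / overlap
    return est

def max_adjust(pattern_estimate, concrete_estimate):
    """MAX rule: a pattern such as 'abc?' is at least as frequent as 'abcd'."""
    return max(pattern_estimate, concrete_estimate)

counts = substring_counts(["abcd", "abce", "xbcd"], 3)
print(mo_estimate(counts, "abcd", 3))
```

The MAX rule is sound because every string matching "abcd" also matches "abc?", so the estimate for the more general pattern should never be the smaller of the two.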
Estimate selectivity (cont.)
• BasicEQ is efficient if the general formulas are applicable
• Propose OptEQ, which adds two enhancements to BasicEQ
– Approximates the coefficient Cq, trading exactness for better performance
– Groups the sets of strings obtained in each iteration of BasicEQ to speed up intersection tests (checking for emptiness)
Experimental Evaluation (method, NB, NE, PT)
Experimental Evaluation
Experimental Evaluation
Experimental Evaluation
Space vs. Accuracy
Conclusions
• Proposed OptEQ
– Approximates coefficients of partitions
– Groups semi-lattices to obtain scalability
– More accurate than SEPIA
– Exploits disk space to give higher precision
• MM and MAX estimates give good results
References
[1] H. Lee, R. T. Ng, and K. Shim, “Extending Q-grams to Estimate Selectivity of String Matching with Low Edit Distance”, VLDB 2007
[2] S. Chaudhuri, V. Ganti, and L. Gravano, “Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem”, ICDE 2004
[3] L. Jin and C. Li, “Selectivity Estimation for Fuzzy String Predicates in Large Data Sets”, VLDB 2005
[4] H. V. Jagadish, R. T. Ng, and D. Srivastava, “Substring Selectivity Estimation”, PODS 1999