Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join
Jiannan Wang (Tsinghua, China), Guoliang Li (Tsinghua, China), Jianhua Feng (Tsinghua, China)


TRANSCRIPT

Slide 1
Jiannan Wang (Tsinghua, China), Guoliang Li (Tsinghua, China), Jianhua Feng (Tsinghua, China)

Slide 2
Outline
  Introduction
  Fuzzy-Token Similarity
  String Similarity Join using Fuzzy-Token Similarity
    Signature Scheme for Token Sets
    Signature Scheme for Tokens
  Experiment
  Conclusion
2011/4/13 Fast-Join @ ICDE2011

Slide 3
Background: String Similarity Join
  Find similar string pairs between two string sets.
  An essential operation in many applications.

  Card #     Name             Addr                    Phn
  1234****   Jeffery Ullman   CS Dept. Stanford, CA   111-1111
  1018****   Marvin Minsky    CS Dept., MIT, MA       222-2222

  Card #     Name, Email               Tel
  1205****   David [email protected]
  0101****   Jeffrey [email protected]   (650)111-1111

  "Jeffery Ullman" vs. "Jeffrey Ullman": perform a similarity join on the name attribute.

Slide 4
Background: String Similarity Join
  Find similar string pairs between two string sets.
  An essential operation in many applications.

  User Id    Query                Timestamp
  1018****   ICDE 2011 Hanover    2011-01-15 10:12:10
  1234****   NBA All Stars 2011   2011-01-15 11:05:06
  2823****   ICDE Hannover        2011-01-15 11:10:10
  6345****   weather Hanover      2011-01-15 12:34:10

  Perform a self similarity join on the query attribute.

Slide 5
Motivation: Existing Similarity Metrics
  Token-based similarity: Dice, Cosine, Jaccard
  Character-based similarity: Edit Distance, Edit Similarity
  Hybrid similarity: GED [SIGMOD 03]
  Example: S1 = "nba mcgrady", S2 = "macgrady nba"
    Jaccard(S1, S2) = 1/3
    ED(S1, S2) = 8
    GED(S1, S2) = 0

Slide 6
Outline (as on Slide 2)

Slide 7
Token-based Similarity
  Dice similarity
  Cosine similarity
  Jaccard similarity
  Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}, |T1 ∩ T2| = 1
  Only exactly matched token pairs count, i.e. T1 ∩ T2.

Slide 8
  Quantify token similarity.
  Weighted bipartite graph between T1 and T2.
  Fuzzy overlap: maximum weighted matching.
  Better than |T1 ∩ T2| = 1.
  [Figure: weighted bipartite graph; token nodes: mcgrady, nba, wnba, macgrady, nba; edge weights: 0.125, 0.75, 0.875, 0.143, 1, 0.125]

Slide 9
Fuzzy-Token Similarity
  Fuzzy-Dice similarity
  Fuzzy-Cosine similarity
  Fuzzy-Jaccard similarity
  Example: T1 = {nba, mcgrady}, T2 = {macgrady, nba}

Slide 10
Comparison with Existing Similarities

Slide 11
Outline (as on Slide 2)

Slide 12
String Similarity Join using Fuzzy-Token Similarity
  s1  = "kobe and trancy"      T1  = {kobe, and, trancy}
  s2  = "trcy macgrady mvp"    T2  = {trcy, macgrady, mvp}
  s'1 = "kobe bryant age"      T'1 = {kobe, bryant, age}
  s'2 = "mvp tracy mcgrady"    T'2 = {mvp, tracy, mcgrady}
  Tokenization; result pair: (s2, s'2)
  Naive solution: enumerating N^2 pairs is quite expensive.

Slide 13
Using Existing Methods

Slide 14
Our Signature Scheme
  The superscript denotes which token generates the signature.

Slide 15
Outline (as on Slide 2)

Slide 16
Prefix Filtering Signature Scheme
  Alphabetical order
  Remove the 2 largest signatures

Slide 17
Token Sensitive Signature Scheme
  Prefix Filtering: No!
  Token Sensitive: Yes!
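As a rough illustration of the prefix filtering idea named on Slide 16 (sort signatures alphabetically, then remove the largest ones), the sketch below applies the standard prefix filter to exact token overlap. It is not the token-sensitive scheme from the talk. The overlap threshold c = 3 is an assumption chosen so that c - 1 = 2 signatures are removed, matching the slide; the function names and the sample token sets are hypothetical.

```python
from itertools import combinations


def prefix(tokens, c):
    """Sort tokens alphabetically and drop the c - 1 largest ones."""
    ordered = sorted(set(tokens))
    return set(ordered[: max(len(ordered) - (c - 1), 0)])


def candidate_pairs(token_sets, c):
    """Keep only pairs whose prefixes share at least one token.

    If two sets share at least c tokens, the smallest shared token has
    at least c - 1 larger shared tokens in each set, so it survives in
    both prefixes. No qualifying pair is lost by this filter.
    """
    prefixes = [prefix(tokens, c) for tokens in token_sets]
    return [
        (i, j)
        for i, j in combinations(range(len(token_sets)), 2)
        if prefixes[i] & prefixes[j]
    ]


def overlap_join(token_sets, c):
    """Verify each candidate by computing its exact token overlap."""
    return [
        (i, j)
        for i, j in candidate_pairs(token_sets, c)
        if len(set(token_sets[i]) & set(token_sets[j])) >= c
    ]


# Hypothetical token sets, loosely based on tokens that appear in the slides.
token_sets = [
    ["kobe", "bryant", "age", "mvp"],
    ["kobe", "bryant", "mvp", "nba"],
    ["tracy", "mcgrady", "mvp", "nba"],
    ["weather", "hanover", "icde", "2011"],
]

print(candidate_pairs(token_sets, 3))
print(overlap_join(token_sets, 3))
```

In this sample, the second and third sets share only two tokens, and their prefixes are disjoint, so the pair is pruned without ever computing its overlap. Exact-match signatures of this kind cannot pair "mcgrady" with "macgrady", which is the gap the fuzzy signature schemes in the talk address.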
Slide 18
Token Sensitive Signature Scheme (Contd)
  Alphabetical order
  Delete the maximal number of largest signatures that contain 2 tokens
  Candidates: {(T2, T4)}
  Candidates: {(T1, T2), (T1, T3), (T1, T4), (T2, T4)}

Slide 19
Outline (as on Slide 2)

Slide 20
Partition-NED Signature Scheme

Slide 21
Partition t

Slide 22
Partition t

Slide 23
Partition t (Contd)
  [Figure values: -3, -2, 2]

Slide 24
Pruning Techniques
  Reduce substrings from 21 to 8

Slide 25
Comparison with Partition-ED (SIGMOD 09)

Slide 26
Outline (as on Slide 2)

Slide 27
Experiment Setup
  Data sets
    DBLP Author: author names from the DBLP dataset
    AOL Query Log: queries from the AOL dataset
  Environment
    C++, GCC 4.2.3, Ubuntu
    Intel Core 2 Quad X5450 3.00GHz processor and 4 GB memory

Slide 28
Result Quality

Slide 29
Evaluation on Different Signature Schemes for Tokens

Slide 30
Evaluation on Different Signature Schemes for Token Sets

Slide 31
Put Everything Together

Slide 32
Outline (as on Slide 2)

Slide 33
Conclusion
  Fuzzy-token similarity
    Hybrid similarity
    Subsumes many well-known similarities
    High result quality
  String similarity join using fuzzy-token similarity
    Signature-based framework
    Token-sensitive signature scheme
    Partition-NED signature scheme
    Achieves higher performance than the state-of-the-art methods, both theoretically and experimentally

Slide 34
http://dbgroup.cs.tsinghua.edu.cn/wangjn/projects/fastjoin/
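The fuzzy-token similarity summarized in the conclusion can be made concrete with a small sketch of the fuzzy overlap from Slide 8: weight each token pair by its normalized edit similarity, then take a maximum weighted matching. Several parts are assumptions, because the formulas did not survive in this transcript: the Fuzzy-Jaccard expression below is written by analogy with ordinary Jaccard, the `delta` threshold on edge weights is a hypothetical parameter, and the brute-force matching is only suitable for tiny token sets and is not the method used in the talk.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Normalized edit similarity: 1 - ED(a, b) / max(|a|, |b|)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(t1, t2, delta=0.0):
    """Maximum weighted bipartite matching, found by brute force.

    Edges whose weight is below delta (hypothetical threshold) count as 0,
    which is equivalent to removing them from the graph.
    """
    small, large = sorted((list(t1), list(t2)), key=len)
    best = 0.0
    for assignment in permutations(large, len(small)):
        total = 0.0
        for a, b in zip(small, assignment):
            weight = edit_similarity(a, b)
            if weight >= delta:
                total += weight
        best = max(best, total)
    return best


def fuzzy_jaccard(t1, t2, delta=0.0):
    """Assumed form: fuzzy overlap / (|T1| + |T2| - fuzzy overlap)."""
    overlap = fuzzy_overlap(t1, t2, delta)
    denominator = len(t1) + len(t2) - overlap
    # Undefined for two empty sets; this sketch returns 0.0 in that case.
    return overlap / denominator if denominator else 0.0


t1 = ["nba", "mcgrady"]
t2 = ["macgrady", "nba"]
print(edit_similarity("mcgrady", "macgrady"))  # edge weight 0.875 on Slide 8
print(fuzzy_overlap(t1, t2))                   # 1 + 0.875 = 1.875
print(fuzzy_jaccard(t1, t2))
```

Under these assumptions the pair from Slide 7 has a fuzzy overlap of 1.875 instead of an exact overlap of 1, because "mcgrady" and "macgrady" now contribute 0.875 rather than 0.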