Efficient Algorithms for Approximate Member Extraction Using Signature-based Inverted Lists
Jialong Han
Co-authored with Jiaheng Lu, Xiaofeng Meng
Renmin University of China
Jiaheng Lu, Jialong Han, Xiaofeng Meng
Introduction: An Example
A dictionary of strings we are interested in, e.g. product names, postal addresses…
We are going to locate their “approximate appearances” in a series of documents.
See the meaning of “approximate appearance” in the following example:
Problem Definition
Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k).
Here we call r a piece of evidence for m.
Similarity() is a function measuring the similarity of two strings.
Strings are viewed as sets of tokens (words).
An example for Sim(): Jaccard similarity:
$$J(r, m) = \frac{wt(r \cap m)}{wt(r \cup m)}$$
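As an illustration of the weighted Jaccard similarity, here is a minimal Python sketch. The token weights are the ones given in the dictionary example later in the slides; the function names `wt` and `weighted_jaccard` are hypothetical, and the sketch assumes whitespace tokenization and that every token has a known weight.

```python
# Token weights from the dictionary example in the slides.
WEIGHTS = {"digital": 1.0, "camera": 1.0, "canon": 2.0, "nikon": 2.0,
           "slr": 2.0, "eos": 7.0, "5d": 9.0}


def wt(tokens, weights=WEIGHTS):
    """Total weight of a set of tokens (assumes every token has a weight)."""
    return sum(weights[t] for t in tokens)


def weighted_jaccard(r, m, weights=WEIGHTS):
    """J(r, m) = wt(r ∩ m) / wt(r ∪ m), with strings viewed as token sets."""
    r_set, m_set = set(r.split()), set(m.split())
    union = r_set | m_set
    if not union:
        return 0.0
    return wt(r_set & m_set, weights) / wt(union, weights)


r1 = "canon eos 5d digital camera"
m = "canon eos digital camera"
print(weighted_jaccard(r1, m))  # 11 / 20 = 0.55
```

The heavy token "5d" is missing from m, so the similarity drops well below 1 even though four of the five tokens agree.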
Outline
Introduction
State-of-the-art techniques
  The filtration-verification framework
  K-signature scheme
  Inverted Signature-based Hashtable
Our algorithms and evaluations
Conclusion
Why pre-pruning is needed
We need to spot evidence to decide whether a substring m should be extracted.
Simple verification against all dictionary strings may be inefficient.
Pre-pruning followed by post-verification is beneficial.
But should it be running-speed-oriented or filtering-power-oriented? Less time or fewer survivors?
The issue of compromise comes again
A balance between the two stages should be reached:
More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time
Overall performance = Tf + Tv ?
K-signature scheme
Proposed by Chakrabarti et al. (SIGMOD 2008)
Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
K is a parameter for tuning filtration power
Potential evidence loss
  A counter-example was found for k=3
  We could only prove that it works for k=1 and k=∞
Inverted Signature-based Hashtable
Proposed by Chakrabarti et al. (SIGMOD 2008)
Each dictionary string is encoded into a solid 0-1 matrix
  A ‘1’ for each occurrence of a <token, sig-token> tuple (‘1’-rectangle)
Bitwise-OR all solid matrices to get the matrix of R
Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
  Formalized into an NP-complete problem
  The solution yields filtering power that is too weak
Outline
Introduction
State-of-the-art techniques
Our algorithms and evaluations
  Corrected filtering conditions
  EvSCAN: Filtration by SIL
  EvITER: Incremental optimization on EvSCAN
  Supporting Dynamic Thresholds
Conclusion
If Sim(m,r) ≥ δ, what do we have?
wt(Sig(m)∩Sig(r)) ≥ τ(m)  (too strict!)
wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}  (our proposed theorem, proved by us)
So the threshold does not remain constant: it involves the unknown evidence r.
Our solution: use inverted lists to count sig-token overlaps.
Note that sig-tokens usually have low document frequency (e.g. with IDF as weights).
Signature-based Inverted Lists
Lists are indexed by sig-tokens.
Each sig-token of a string creates a node (containing the string’s id) in the corresponding list.
E.g. R = { r1 = “canon eos 5d digital camera”, r2 = “nikon digital slr camera”, r3 = “canon slr camera” }, with wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).
[Figure: inverted lists headed by the sig-tokens 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0), slr (2.0); the list nodes hold the string ids 1, 1, 2, 1, 2, 2, 3, 3]
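The list construction described above can be sketched in Python as follows. This is only an illustration: the signature sets in `SIGNATURES` are an assumption read off the slide's figure (three nodes for r1, three for r2, two for r3), not a computed result, and `build_sil` is a hypothetical helper name.

```python
from collections import defaultdict

# ASSUMPTION: signature sets inferred from the slide's figure for δ = 0.8.
# They are given here as input rather than derived from the weights.
SIGNATURES = {
    1: {"5d", "eos", "canon"},      # r1 = "canon eos 5d digital camera"
    2: {"nikon", "slr", "camera"},  # r2 = "nikon digital slr camera"
    3: {"canon", "slr"},            # r3 = "canon slr camera"
}


def build_sil(signatures):
    """Build signature-based inverted lists: sig-token -> list of string ids."""
    sil = defaultdict(list)
    for rid in sorted(signatures):       # ascending ids keep each list sorted
        for token in signatures[rid]:
            sil[token].append(rid)       # one node per (sig-token, rid) pair
    return dict(sil)


sil = build_sil(SIGNATURES)
for token in sorted(sil):
    print(token, sil[token])
```

Tokens that are in no signature (here "digital") get no list at all, which is what keeps the index small.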
Filtration by SIL
Use an array called the “accumulator” to compute the overlapped sig weight wt(Sig(m)∩Sig(r)).
E.g. m = “canon eos digital camera”, δ = 0.8
[Figure: the signature-based inverted lists (5d 9.0, canon 2.0, camera 1.0, eos 7.0, nikon 2.0, slr 2.0; node ids 1, 1, 2, 1, 2, 2, 3, 3) scanned for the sig-tokens of m]
rid: 1, 2, 3
min{τ(m),τ(r)}: 6.8, 3.8, 3
Accumulator (wt(Sig(m)∩Sig(r))) values shown: 2.0, 9.0, 2.0, 0
Qualified!
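The accumulator scan can be sketched in Python as below. Several inputs are assumptions: the list contents and Sig(m) = {eos, canon} are inferred from the slide figures, the per-rid thresholds min{τ(m), τ(r)} are copied from the slide's table rather than computed, and a dict keyed by rid stands in for the accumulator array. `filter_candidates` is a hypothetical name.

```python
# ASSUMPTIONS: list contents and Sig(m) are inferred from the slide figures;
# the thresholds min{τ(m), τ(r)} are copied from the slide's table.
WEIGHTS = {"5d": 9.0, "canon": 2.0, "camera": 1.0,
           "eos": 7.0, "nikon": 2.0, "slr": 2.0}
SIL = {"5d": [1], "canon": [1, 3], "camera": [2],
       "eos": [1], "nikon": [2], "slr": [2, 3]}
THRESHOLDS = {1: 6.8, 2: 3.8, 3: 3.0}


def filter_candidates(sig_m, sil, weights, thresholds):
    """Accumulate wt(Sig(m) ∩ Sig(r)) per rid, then keep rids meeting their threshold."""
    accumulator = {rid: 0.0 for rid in thresholds}
    for token in sig_m:
        for rid in sil.get(token, []):       # scan the list of each sig-token of m
            accumulator[rid] += weights[token]
    candidates = sorted(rid for rid, acc in accumulator.items()
                        if acc >= thresholds[rid])
    return accumulator, candidates


sig_m = {"eos", "canon"}  # assumed signature of m = "canon eos digital camera"
accumulator, candidates = filter_candidates(sig_m, SIL, WEIGHTS, THRESHOLDS)
print(accumulator)  # {1: 9.0, 2: 0.0, 3: 2.0}
print(candidates)   # [1]
```

Only r1 survives filtration here; it would still have to pass the verification stage before m is extracted.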
EvITER: Progressive Computation
Recall that we are checking all substrings.
Some of them are quite similar, so they share duplicate computation.
An intuition: if m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r.
Formally, we proved the following:
  Let ES(m) be the set of “potential evidence” for m, and list[t] = {s | s is a dictionary string that contains token t}.
  We have ES(m·t) ⊆ ES(m) ∪ list[t].
Example
Document M: “…. cannon eos digital camera lens…”, with m = “cannon eos digital camera” and t = “lens”
ES(m) = {r1}
List[t]: lens, 3.0 → nodes 22, 53
We know that only r1, r22, r53 can possibly match “cannon eos digital camera lens”
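The example can be replayed with a short Python sketch of the containment ES(m·t) ⊆ ES(m) ∪ list[t]. The helper names `extend_evidence` and `iterate_evidence` are hypothetical, and the `still_possible` callback is a placeholder for whatever filtering condition prunes the candidate set; by default it keeps everything.

```python
def extend_evidence(es_m, token, inverted_lists):
    """Superset of ES(m·t): the union ES(m) ∪ list[t]."""
    return set(es_m) | set(inverted_lists.get(token, ()))


def iterate_evidence(tokens, inverted_lists,
                     still_possible=lambda rid, prefix: True):
    """Grow a substring token by token, carrying the candidate evidence along.

    still_possible is a hypothetical hook for the filtering condition;
    the default keeps every candidate.
    """
    es, prefix = set(), []
    for token in tokens:
        prefix.append(token)
        es = {rid for rid in extend_evidence(es, token, inverted_lists)
              if still_possible(rid, prefix)}
        yield " ".join(prefix), set(es)


inverted_lists = {"lens": [22, 53]}   # list[t] for t = "lens", as on the slide
candidates = extend_evidence({1}, "lens", inverted_lists)
print(sorted(candidates))  # [1, 22, 53]
```

The point of the containment is that the candidates for the longer substring come from the previous candidate set plus one list lookup, instead of a fresh scan of every list.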
Flow of Evidence
EvITER for “Evidence ITERATION”
……
The Static Threshold Problem
How does this index work so far?
- “Get ready for δ=0.8 please.”
- “Please wait 30min for index generation…”
- “Ready!”
- “Document M1, δ=0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ=0.9…”
- “Sorry, please wait another 30min for index regeneration…”
- “:-(”
The Static Threshold Problem
This one seems better
- “Get ready for δ>=0.8 please.”
- “Please wait 30min for index generation…”
- “Ready!”
- “Document M1, δ=0.8. Go!”
- “…Extraction complete.”
- “Document M2, and I want δ=0.9…”
- “…Extraction complete.”
- “:-)”
Supporting Dynamic Thresholds
An observation:
  When δ descends, a string r’s tokens fall into Sig(r) one by one, in the order of their weight ranking.
  I.e. any node <sig-token, rid> is “active” when δ is below a certain threshold u<sig-token, rid>.
We record u<sig-token, rid> in each node and sort the nodes in each list in descending order of their u values.
For any given δ, we only need to retrieve a prefix of each list to get all “active” nodes.
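The prefix retrieval can be sketched in Python as follows. The u values and rids in `canon_list` are made up for illustration, `active_nodes` is a hypothetical name, and treating a node as active when u is strictly greater than δ is an assumption about the boundary case.

```python
from itertools import takewhile

# Hypothetical list for one sig-token: (u, rid) nodes sorted by descending u.
canon_list = [(0.95, 3), (0.90, 1), (0.80, 7), (0.60, 4)]


def active_nodes(nodes, delta):
    """Return the rids in the prefix of nodes whose threshold u exceeds delta.

    Because nodes are sorted by descending u, the scan stops at the first
    inactive node. ASSUMPTION: a node is active when u > delta (strict).
    """
    return [rid for u, rid in takewhile(lambda node: node[0] > delta, nodes)]


print(active_nodes(canon_list, 0.85))  # [3, 1]
print(active_nodes(canon_list, 0.5))   # [3, 1, 7, 4]
```

One index built for the lowest δ of interest can then serve any higher δ by reading a shorter prefix, with no regeneration.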
Experimental Datasets
DBLP: 274,788 paper titles
1,838,973 URLs
Balance should be reached
Recall our two stages of filtration and verification
Performance (DBLP)
Conclusion
Our method causes no false negatives.
Our method achieves a good balance between the two phases of filtration and verification.
We also propose EvITER to eliminate duplicate computation.
Our method is both effective and efficient.
References
[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918-929, 2006.
[2] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, 2008.
[3] A. Chandel, P. C. Nagesh, and S. Sarawagi. Efficient batch top-k search for dictionary-based entity recognition. In ICDE, page 28, 2006.
[4] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[5] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness.
[6] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491-500, 2001.
[7] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257-266, 2008.
[8] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
[9] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31-88, 2001.
[10] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, 2004.
[11] A. Singhal. Modern information retrieval: A brief overview. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35-43, 2001.
[12] E. Sutinen and J. Tarhio. On using q-gram locations in approximate string matching. In ESA, pages 327-340, 1995.
[13] W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit distance constraints. In SIGMOD Conference, 2009.