Master's Program in Computer Science (Mestrado em Ciência da Computação)
Sidney R. Junior,
Rafael D. Quirino,
Leonardo A. Ribeiro,
Wellington S. Martins
Introduction
Background
The gSSJoin Algorithm
Experiments
Related Work
Conclusion and Future Work
gSSJoin: a GPU-based Set Similarity Join Algorithm
Sidney R. Junior, Rafael D. Quirino, Leonardo A. Ribeiro, Wellington S. Martins
www.inf.ufg.br
October 7, 2016
Agenda
1 Introduction
2 Background
3 The gSSJoin Algorithm
4 Experiments
5 Related Work
6 Conclusion and Future Work
Introduction
A set similarity join returns all pairs of similar sets in a dataset. Two sets are considered similar if a set similarity function scores them at or above a given threshold.
Applications
Integration
Cleaning
Plagiarism identification
Data Mining
Current Algorithms
Most state-of-the-art algorithms use a technique called prefix filtering to prune dissimilar sets by inspecting only a fraction (the prefix) of each set.
Limitations
Prefix filtering is effective only at high threshold values. When the threshold is low, a larger fraction of each set must be inspected, and far more candidate pairs remain to be verified after the pruning step.
Contributions
A fine-grained parallel algorithm for both indexing the data and performing the set similarity joins.
A highly threaded GPU implementation that takes advantage of high occupancy, the memory hierarchy, and coalesced memory access.
A scalable multi-GPU implementation that exploits both data parallelism and task parallelism.
Extensive experiments with standard real-world textual datasets.
Data Representation
We map strings to sets of tokens called q-grams. A q-gram is a substring of length q obtained by sliding a window over the characters of the input string.
3-grams of "gSSJoin": {"gSS", "SSJ", "SJo", "Joi", "oin"}
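The sliding-window definition can be illustrated with a minimal Python sketch. The function name `qgrams` is hypothetical and chosen only for this illustration; the slides do not say how the authors tokenize strings in their implementation.

```python
def qgrams(text: str, q: int) -> list[str]:
    """Return the q-grams of `text` in order, sliding a window of width q.

    A string shorter than q yields an empty list (an assumption of this sketch).
    """
    if q <= 0:
        raise ValueError("q must be positive")
    return [text[i:i + q] for i in range(len(text) - q + 1)]


# The example from the slide: the 3-grams of "gSSJoin".
print(qgrams("gSSJoin", 3))
```

Converting the resulting list to a `set` gives the token set used in the rest of the talk.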
Formal Definition
Given two set collections $R$ and $S$, a set similarity function $sim$, and a threshold $\tau$, the set similarity join between $R$ and $S$ returns all scored set pairs $\langle (r, s), \tau' \rangle$ such that $(r, s) \in R \times S$ and $sim(r, s) = \tau' \geq \tau$.
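Read literally, the definition describes a filter over the cross product of the two collections. The sketch below spells that out as a brute-force baseline; it is an illustration of the definition only, not the gSSJoin algorithm. The names `naive_similarity_join` and `jaccard` are hypothetical, and Jaccard is used merely as one possible similarity function.

```python
from itertools import product


def jaccard(r: frozenset, s: frozenset) -> float:
    """One possible similarity function: |r ∩ s| / |r ∪ s| for non-empty sets."""
    return len(r & s) / len(r | s)


def naive_similarity_join(R, S, sim, tau):
    """Return every scored pair ((r, s), score) with score = sim(r, s) >= tau.

    Evaluates sim on all |R| * |S| pairs, exactly as the definition reads.
    """
    result = []
    for r, s in product(R, S):
        score = sim(r, s)
        if score >= tau:
            result.append(((r, s), score))
    return result


R = [frozenset("abc"), frozenset("abd")]
S = [frozenset("abc"), frozenset("xyz")]
for (r, s), score in naive_similarity_join(R, S, jaccard, 0.5):
    print(sorted(r), sorted(s), score)
```

The quadratic number of similarity evaluations is what filtering techniques, or massive parallelism, try to tame.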
Similarity Function
A set similarity function $sim(r, s)$ returns a value in $[0, 1]$ that represents the similarity of $r$ and $s$. A greater value indicates higher similarity.
Jaccard similarity: $J(r, s) = \frac{|r \cap s|}{|r \cup s|}$
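As a worked example, the following sketch computes the Jaccard similarity of the 3-gram sets of two strings. The helper names are hypothetical, and returning 1.0 for two empty sets is a convention assumed here, not something the slides state.

```python
def qgram_set(text: str, q: int) -> set[str]:
    """Set of q-grams of `text` (sliding window of width q)."""
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def jaccard(r: set, s: set) -> float:
    """Jaccard similarity |r ∩ s| / |r ∪ s|."""
    if not r and not s:
        return 1.0  # assumed convention for two empty sets
    return len(r & s) / len(r | s)


r = qgram_set("gSSJoin", 3)  # {"gSS", "SSJ", "SJo", "Joi", "oin"}
s = qgram_set("SSJoin", 3)   # {"SSJ", "SJo", "Joi", "oin"}
print(jaccard(r, s))         # 4 shared tokens out of 5 distinct ones -> 0.8
```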
Overlap Constraint
A predicate involving the Jaccard similarity and a threshold $\tau$ can be equivalently rewritten as a set overlap constraint: $J(r, s) \geq \tau \iff |r \cap s| \geq \frac{\tau}{1+\tau}(|r| + |s|)$.
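The equivalence follows from $|r \cup s| = |r| + |s| - |r \cap s|$: multiplying $J(r, s) \geq \tau$ through by the union size and collecting the overlap terms gives $(1+\tau)\,|r \cap s| \geq \tau\,(|r| + |s|)$. The sketch below checks this numerically. The function names are hypothetical, and the threshold is passed as an exact `Fraction` (an assumption of this sketch) to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction


def min_overlap(size_r: int, size_s: int, tau: Fraction) -> int:
    """Smallest integer overlap satisfying |r ∩ s| >= tau / (1 + tau) * (|r| + |s|)."""
    return math.ceil(tau / (1 + tau) * (size_r + size_s))


def jaccard_from_sizes(size_r: int, size_s: int, overlap: int) -> Fraction:
    """Jaccard similarity expressed through set sizes and overlap only."""
    return Fraction(overlap, size_r + size_s - overlap)


tau = Fraction(1, 2)
needed = min_overlap(5, 4, tau)
print(needed)                                       # 3
print(jaccard_from_sizes(5, 4, needed) >= tau)      # True
print(jaccard_from_sizes(5, 4, needed - 1) >= tau)  # False
```

Working with an overlap count instead of a ratio is convenient because the count can be accumulated incrementally while scanning tokens.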
Prefix
For a threshold $\tau$, we can identify all candidate matches of a given set $r$ using a prefix of length $|r| - \lceil |r| \cdot \tau \rceil + 1$.
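To see how the prefix grows as the threshold drops, the formula can be evaluated for a few thresholds. This is a minimal sketch: `prefix_length` and `prefix` are hypothetical names, the threshold is an exact `Fraction` to keep the ceiling free of rounding error, and the token list is assumed to be already sorted under a fixed global token order, which prefix filtering requires.

```python
import math
from fractions import Fraction


def prefix_length(size: int, tau: Fraction) -> int:
    """Prefix length |r| - ceil(|r| * tau) + 1 for a set of `size` tokens."""
    return size - math.ceil(size * tau) + 1


def prefix(sorted_tokens: list, tau: Fraction) -> list:
    """First prefix_length tokens of a list assumed sorted by the global order."""
    return sorted_tokens[:prefix_length(len(sorted_tokens), tau)]


# Prefix length of a 10-token set at decreasing thresholds.
for tau in (Fraction(9, 10), Fraction(7, 10), Fraction(1, 2), Fraction(3, 10)):
    print(float(tau), prefix_length(10, tau))
```

For a 10-token set the prefix covers 2 tokens at threshold 0.9 but 8 tokens at 0.3, which is why prefix filtering prunes less at low thresholds.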
Current Algorithms
Limitations of Current Algorithms
The gSSJoin Algorithm
Exploits massive parallelism instead of filtering
Builds an inverted index in the pre-processing phase
Performs a set similarity search for each set
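The two phases listed above can be sketched sequentially in Python. This is only a CPU-side illustration of the general inverted-index idea under stated assumptions, not the authors' GPU kernels: all function names are hypothetical, the join is assumed to be a self-join that reports each pair once, and Jaccard is assumed as the similarity function. In a GPU setting the loops over tokens and postings are the natural candidates for parallel execution.

```python
from collections import defaultdict


def build_inverted_index(sets):
    """Pre-processing: map each token to the ids of the sets that contain it."""
    index = defaultdict(list)
    for set_id, tokens in enumerate(sets):
        for token in tokens:
            index[token].append(set_id)
    return index


def similarity_search(query_id, sets, index, tau):
    """Find all sets with id > query_id whose Jaccard similarity to the query is >= tau."""
    overlap = defaultdict(int)
    for token in sets[query_id]:
        for other_id in index.get(token, ()):
            if other_id > query_id:       # report each unordered pair once
                overlap[other_id] += 1    # count shared tokens
    results = []
    for other_id, common in overlap.items():
        union = len(sets[query_id]) + len(sets[other_id]) - common
        score = common / union
        if score >= tau:
            results.append(((query_id, other_id), score))
    return results


def self_join(sets, tau):
    """Build the index once, then run one similarity search per set."""
    index = build_inverted_index(sets)
    pairs = []
    for query_id in range(len(sets)):
        pairs.extend(similarity_search(query_id, sets, index, tau))
    return pairs


sets = [frozenset("abc"), frozenset("abd"), frozenset("xy")]
print(self_join(sets, 0.5))
```

No candidate is pruned before counting: every set sharing at least one token with the query has its full overlap computed, trading extra work for regular, parallel-friendly access patterns.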
Pre-processing
Set Similarity Search
Multi-GPU
The inverted index is created on every GPU
The search sets are divided among the GPUs
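One simple way to divide the search sets is to give each device a contiguous, near-equal range of query ids while every device keeps a full copy of the index. The slides do not say which division scheme the authors use, so the contiguous split and the name `partition_queries` below are assumptions made purely for illustration.

```python
def partition_queries(num_sets: int, num_devices: int) -> list[range]:
    """Split query ids 0..num_sets-1 into num_devices contiguous, near-equal chunks."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    base, extra = divmod(num_sets, num_devices)
    chunks = []
    start = 0
    for device in range(num_devices):
        # The first `extra` devices receive one additional query each.
        size = base + (1 if device < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


# Ten search sets spread over four devices.
for device, chunk in enumerate(partition_queries(10, 4)):
    print(device, list(chunk))
```

Because each device holds the whole index, the chunks can be processed independently and the partial results concatenated at the end.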
Experimental Setup
We implemented gSSJoin using the CUDA Toolkit version 7.5. The experiments were conducted on a machine running CentOS 7.2.1511 (64-bit), with 24 Intel Xeon E5-2620, 16 GB of ECC RAM, and four Zotac Nvidia GeForce GTX Titan Black GPUs, each with 6 GB of RAM and 2,880 CUDA cores.
Databases
Table: Dataset statistics.
Database   Number of sets   Max   Mean    Standard deviation
DBLP       100k             205   70      23.9
Netflix    200k             54    30.54   8.2
Performance Results
Figure: DBLP dataset using 3-grams.
Figure: Netflix dataset using 3-grams.
Figure: DBLP dataset using 2-grams.
Figure: Netflix dataset using 2-grams.
Table: Execution time (secs) using multiple GPUs, threshold = 0.5
Number of GPUs    1       2       3       4
DBLP 2-gram       27.56   14.37   10.4    7.2
DBLP 3-gram       16.73   8.77    6       4.47
Netflix 2-gram    41.10   21.63   14.37   11.2
Netflix 3-gram    23.52   12.69   8.5     6.5
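The speed-up and parallel efficiency implied by these timings can be derived with a few lines of Python. The numbers are copied from the table; the definitions used (speed-up as one-GPU time divided by n-GPU time, efficiency as speed-up divided by n) are the usual ones and are an assumption here, since the slides report only raw times.

```python
# Execution times in seconds for 1, 2, 3 and 4 GPUs (threshold = 0.5), from the table.
times = {
    "DBLP 2-gram": [27.56, 14.37, 10.4, 7.2],
    "DBLP 3-gram": [16.73, 8.77, 6.0, 4.47],
    "Netflix 2-gram": [41.10, 21.63, 14.37, 11.2],
    "Netflix 3-gram": [23.52, 12.69, 8.5, 6.5],
}


def speedups(row):
    """Speed-up relative to the single-GPU time, one value per GPU count."""
    return [row[0] / t for t in row]


def efficiencies(row):
    """Parallel efficiency: speed-up divided by the number of GPUs used."""
    return [s / (n + 1) for n, s in enumerate(speedups(row))]


for name, row in times.items():
    s4 = speedups(row)[3]
    e4 = efficiencies(row)[3]
    print(f"{name}: speed-up on 4 GPUs = {s4:.2f}, efficiency = {e4:.0%}")
```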
Related Work
State-of-the-art algorithms are intrinsically sequential [Sarawagi and Kirpal 2004, Chaudhuri et al. 2006, Bayardo et al. 2007, Xiao et al. 2011, Ribeiro and Härder 2011, Wang et al. 2012].
A distributed-memory MapReduce framework for similarity joins was proposed by [Vernica et al. 2010].
Parallel solutions such as [Cruz et al. 2015] perform approximate set similarity joins using MinHash and Locality-Sensitive Hashing.
Conclusion
The gSSJoin algorithm performs much better than state-of-the-art CPU-based algorithms when the threshold is low and the database is more uniform.
In future work, we plan to incorporate candidate filtering techniques to obtain even greater speed-ups.
Questions?
email: srjsoftware@gmail.com
References
[Cruz et al. 2015]
Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. (2015). GPU Acceleration of Set Similarity Joins. In DEXA, pages 384-398.
[Bayardo et al. 2007]
Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up All Pairs Similarity Search. In WWW, pages 131-140.
[Chaudhuri et al. 2006]
Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. In ICDE, page 5.
[Ribeiro and Härder 2011]
Ribeiro, L. A. and Härder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, 36(1):62-78.
[Sarawagi and Kirpal 2004]
Sarawagi, S. and Kirpal, A. (2004). Efficient Set Joins on Similarity Predicates. In SIGMOD.
[Vernica et al. 2010]
Vernica, R., Carey, M. J., and Li, C. (2010). Efficient Parallel Set-Similarity Joins Using MapReduce. In SIGMOD, pages 495-506.
[Xiao et al. 2011]
Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient Similarity Joins for Near-Duplicate Detection. TODS, 36(3):15.
[Wang et al. 2012]
Wang, J., Li, G., and Feng, J. (2012). Can We Beat the Prefix Filtering?: An Adaptive Framework for Similarity Join and Search. In SIGMOD, pages 85-96.
GPU Architecture