gssjoin: a gpu-based set similarity join...

Post on 11-Aug-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

gSSJoin: a GPU-based Set Similarity Join

Algorithm

Sidney R. Junior, Rafael D. Quirino, Leonardo A. Ribeiro,Wellington S. Martins

www.inf.ufg.br

October 7, 2016

1 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Agenda

1 Introduction

2 Background

3 The gSSJoin Algorithm

4 Experiments

5 Related Work

6 Conclusion and Future Work

2 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Introduction

Set similarity join returns all pairs of similar sets from a dataset.Sets are considered similar if the value returned by a setsimilarity function for them is not less than a given threshold.

3 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Applications

Integration

Cleaning

Plagiarism identi�cation

Data Mining

4 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Current Algorithms

Most state-of-the-art algorithms uses a technique called pre�x

�ltering to prune dissimilar sets by inspecting only a fraction ofthem.

5 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Limitations

Pre�x �ltering is only e�ective at high threshold values. Whenthe threshold value is low, the fraction needed to be inspectedis bigger and the remaining sets needed to be veri�ed after thepruning step is much higher.

6 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Contributions

A �ne-grained parallel algorithm for both indexing the dataand performing the set similarity joins.

A highly threaded GPU implementation that takesadvantage of intensive occupation, hierarchical memory,and coalesced memory access.

A scalable multi-GPU implementation that exploits bothdata parallelism and task parallelism.

Extensive experimental work with standard real-worldtextual datasets.

7 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Data Representation

We map the strings to sets of tokens called q-grams. q-gramsare sub-strings of length q obtained by sliding a window overthe characters of the input string.

3-grams of "gSSJoin": {"gSS", "SSJ", "SJo", "Joi", "oin"}

8 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Formal De�nition

Given two set collections R and S, a set similarity function sim,and a threshold τ , the set similarity join between R and Sreturns all scored set pairs 〈(r , s), τ ′〉 s.t. (r , s) ∈ R× S andsim (r , s) = τ ′ ≥ τ .

9 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Similarity Function

A set similarity function sim (r , s) returns a value in [0, 1] torepresent their similarity. A greater value indicates that r and s

have higher similarity.

10 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Similarity Function

Jaccard Similarity : J (r , s) = |r∩s||r∪s|

11 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Overlap Constraint

A predicate involving the Jaccard similarity and a threshold τcan be equivalently rewritten into a set overlap constraint:J (r , s) ≥ τ ⇐⇒ |r ∩ s| ≥ τ

1+τ (|r |+ |s|).

12 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Pre�x

For a threshold τ , we can identify all candidate matches of agiven set r using a pre�x of length |r | − d|r | · τe+ 1.

13 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Current Algorithms

14 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Limitations of Current Algorithms

15 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Limitations of Current Algorithms

16 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Limitations of Current Algorithms

17 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

The gSSJoin Algorithm

Exploits massive parallelism instead of �ltering

Builds an inverted index on the pre-processing phase

Performs a set similarity search for each set

18 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Pre-processing

19 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Set Similarity Search

20 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Multi-GPU

The inverted index is created in all GPUs

The search sets are divided between the GPUs

21 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Multi-GPU

22 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Experimental Setup

We implemented gSSJoin using the CUDA Toolkit version 7.5.The experiments were conducted on a machine running CentOs7.2.1511 64-bits, with 24 Intel Xeon E5-2620, 16GB of ECCRAM, and four GeForce Zotac Nvidia GTX Titan Black, with6GB of RAM and 2,880 CUDA cores each.

23 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Databases

Table : Dataset statistics.

Database Number of sets Max Mean Standard deviation

DBLP 100k 205 70 23.9

Net�ix 200k 54 30.54 8.2

24 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Performance Results

Figure : DBLP dataset using 3-grams.

25 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Performance Results

Figure : Net�ix dataset using 3-grams.

26 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Performance Results

Figure : DBLP dataset using 2-grams.

27 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Performance Results

Figure : Net�ix dataset using 2-grams.

28 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Performance Results

Table : Execution Time (secs) Using Multiple GPUs, Threshold = 0.5

Number of GPUs: 1 2 3 4

DBLP 2gram 27.56 14.37 10.4 7.2

DBLP 3gram 16.73 8.77 6 4.47

Net�ix 2gram 41.10 21.63 14.37 11.2

Net�ix 3gram 23.52 12.69 8.5 6.5

29 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Related Work

The state-of-the-art algorithms are intrinsically sequential[Sarawagi and Kirpal, Chaudhuri et al. 2006, Bayardo et al.2007, Xiao et al. 2011, Ribeiro and Härder 2011, Wang etal. 2012].

A distributed memory MapReduce framework to performsimilarity joins was proposed by [Vernica et at. 2010].

Parallel solutions such as [Cruz et al. 2015] performapproximate set similarity joins using MinHash and LocalitySensitive Hashing.

30 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Conclusion

The gSSJoin algorithm has a much better performance than thestate-of-the-art CPU-based algorithms when the threshold islow and the database is more uniform.

In future work, we plan to incorporate candidate �lteringtechniques to obtain even greater speed-ups.

31 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

Questions?

email: srjsoftware@gmail.com

32 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

References

[Cruz et al. 2015]

Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. (2015).GPU Acceleration of Set Similarity Joins. In DEXA, pages 384�398.

[Bayardo et al. 2007]

Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up All PairsSimilarity Search. In WWW, pages 131�140.

[Chaudhuri et al. 2006]

Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A primitiveoperator for similarity joins in data cleaning. In ICDE, page 5.

[Ribeiro and Härder 2011]

Ribeiro, L. A. and Härder, T. (2011). Generalizing Pre�x Filtering toImprove Set Similarity Joins. Information Systems, 36(1):62�78.

[Sarawagi and Kirpal 2004]

Sarawagi, S. and Kirpal, A. E�cient Set Joins on SimilarityPredicates. In SIGMOD.

33 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

References

[Vernica et al. 2010]

Vernica, R., Carey, M. J., and Li, C. (2010). E�cient Parallel SetSimilarity Joins using MapReduce. In SIGMOD, pages 495�506.

[Xiao et al. 2011]

Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). E�cientSimilarity Joins for Near-duplicate Detection. TODS, 36(3):15.

[Wang et al. 2012]

Wang, J., Li, G., and Feng, J. (2012). Can We Beat the Pre�xFiltering?: An Adaptive Framework for Similarity Join and Search. InSIGMOD, pages 85�96.

34 / 35

Mestrado emCiência daComputação

Sidney R.Junior,

Rafael D.Quirino,

Leonardo A.Ribeiro,

WellingtonS. Martins

Introduction

Background

The gSSJoinAlgorithm

Experiments

RelatedWork

Conclusionand FutureWork

GPU Architecture

35 / 35

top related