a parallel association rule mining algorithm for corpus

Post on 17-Jul-2015


An MPI-based Parallel Association Rule Mining (ARM) Algorithm for Corpus

Shankai Yan, 8 November 2014

Email: yan.shankai@mail.scut.edu.cn

Address: School of Software Engineering, South China University of Technology, Guangzhou, Guangdong

Presentation Outline

Application of ARM for Corpus

Serial Algorithm Description

Parallel Algorithm Description

Experiments


Application of ARM for Corpus

Detecting Privacy Leaks

Talent Recruitment


Detecting Privacy Leaks

Richard Chow, Philippe Golle, Jessica Staddon. Detecting Privacy Leaks Using Corpus-based Association Rules. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.893-901, 2008.

Talent Recruitment

The DISCOTEX System (job postings)

Raymond J. Mooney, Un Yong Nahm. Text Mining with Information Extraction. Proceedings of the 4th International MIDP Colloquium, pp.141-160, 2003.

Serial Algorithm Description

Shankai Yan, Pingjian Zhang. A Fast Association Rule Mining Algorithm for Corpus. International Conference on Intelligent Systems and Knowledge Engineering, pp.449-459, 2013.

Hash Inverted Index Construction


k-Frequent Itemsets Generation

Association Rules Generation
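The slides name the three serial steps (hash inverted index construction, k-frequent itemsets generation, association rules generation) without giving details. The sketch below is a generic illustration of that kind of pipeline, not the algorithm from the cited paper. It assumes that documents are lists of terms, that support is counted as the number of documents containing every term of an itemset, and that a Python dict stands in for the hash inverted index. All function names and the toy corpus are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            index[term].add(doc_id)
    return index


def frequent_itemsets(index, min_support):
    """Level-wise search: support of an itemset is the size of the
    intersection of its terms' posting sets."""
    level = {frozenset([term]): ids
             for term, ids in index.items() if len(ids) >= min_support}
    frequent = {}
    while level:
        frequent.update({items: len(ids) for items, ids in level.items()})
        next_level = {}
        # Join two k-itemsets whose union has exactly k+1 terms.
        for (a, ids_a), (b, ids_b) in combinations(level.items(), 2):
            union = a | b
            if len(union) == len(a) + 1 and union not in next_level:
                ids = ids_a & ids_b
                if len(ids) >= min_support:
                    next_level[union] = ids
        level = next_level
    return frequent


def association_rules(frequent, min_confidence):
    """Emit (lhs, rhs, confidence) with confidence = supp(lhs + rhs) / supp(lhs)."""
    rules = []
    for items, support in frequent.items():
        for size in range(1, len(items)):
            for lhs in map(frozenset, combinations(sorted(items), size)):
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, items - lhs, confidence))
    return rules


# Toy corpus (hypothetical): each document is a list of terms.
docs = [["mpi", "cluster", "openmp"],
        ["mpi", "cluster"],
        ["mpi", "openmp"],
        ["corpus", "mining"]]
index = build_inverted_index(docs)
frequent = frequent_itemsets(index, min_support=2)
rules = association_rules(frequent, min_confidence=0.9)
```

On this toy corpus the only rules that reach the 0.9 confidence threshold are cluster => mpi and openmp => mpi, both with confidence 1.0.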


Parallel Algorithm Description

Figure: parallel pipeline. Data labels: Corpus, 1-Frequent Itemsets, Association Rules. Stage labels: Input Data Decomposition, Hash Inverted Index Synchronization, Communication Pattern, Association Rules Generation.


Input Data Decomposition

Adjust the bound C of the first-fit decreasing (FFD) algorithm for the bin-packing problem to find the minimum C for which the number of bins equals the number of processes.

Example: Find a strategy to dispatch documents of sizes [13, 7, 20, 13, 12, 7, 12] to 4 processes.

Figure: FFD packings for bounds C=30, C=20, and C=25.
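A minimal sketch of this decomposition, using the document sizes from the example. The slide does not say how the bound C is adjusted, so the linear scan upward from a lower bound is an assumption, as are the function names. The scan stops at the first C for which FFD needs no more bins than there are processes.

```python
from math import ceil


def first_fit_decreasing(sizes, capacity):
    """Pack sizes into bins of the given capacity using first-fit decreasing."""
    bins = []
    for size in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)  # first bin with enough room
                break
        else:
            bins.append([size])  # no bin fits: open a new one
    return bins


def min_capacity_partition(sizes, n_procs):
    """Scan C upward and return the first C whose FFD packing fits n_procs bins.

    The search strategy is an assumption; the slide only says that C is adjusted.
    """
    lower = max(max(sizes), ceil(sum(sizes) / n_procs))
    for capacity in range(lower, sum(sizes) + 1):
        bins = first_fit_decreasing(sizes, capacity)
        if len(bins) <= n_procs:
            return capacity, bins
    raise ValueError("n_procs must be at least 1")


doc_sizes = [13, 7, 20, 13, 12, 7, 12]
capacity, bins = min_capacity_partition(doc_sizes, n_procs=4)
print(capacity, bins)
```

With plain FFD as written here, the three bounds on the slide need 4 bins (C=30), 5 bins (C=20), and 4 bins (C=25). The scan settles on C=24, which gives per-process loads of 20, 20, 20, and 24.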


Hash Inverted Index Synchronization


Communication Pattern


Experiments

Data set: Sogou Labs Corpus (http://www.sogou.com/labs/resources.html)

Data set   Documents   Size      Terms
Small      10^3        2.4 MB    15710
Medium     10^4        31.2 MB   35617
Large      10^5        368 MB    135527

Figure: elapsed time (s) versus node number (serial, 1, 3, 5, 7) for small(MPI), medium(MPI), small(MPI+OpenMP), and medium(MPI+OpenMP).

Parallel Efficiency:

small(MPI): [1.39%, 6.72%]
small(MPI+OpenMP): [2.15%, 7.19%]
medium(MPI): [1.54%, 7.37%]
medium(MPI+OpenMP): [2.25%, 7.94%]

Figure: elapsed time (s) versus node number (serial, 1, 3, 5, 7) for large(MPI) and large(MPI+OpenMP).

Parallel Efficiency:

large(MPI): [9.67%, 27.05%]
large(MPI+OpenMP): [61.00%, 70.48%]
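The slides report parallel efficiency as ranges but do not define it. The sketch below assumes the usual definition: speedup (serial time divided by parallel time) divided by the number of workers. The timings are hypothetical, chosen only to show the arithmetic, and are not taken from the experiments.

```python
def parallel_efficiency(t_serial, t_parallel, n_workers):
    """Efficiency = speedup / workers, with speedup = t_serial / t_parallel."""
    speedup = t_serial / t_parallel
    return speedup / n_workers


# Hypothetical timings: 100 s serial, 25 s on 8 workers.
efficiency = parallel_efficiency(100.0, 25.0, 8)
print(f"{efficiency:.2%}")
```

Here the speedup is 4 on 8 workers, so the efficiency is 50%.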

Thanks For Listening

Shankai Yan, 8 November 2014