Parallel Mining of Closed Sequential Patterns

Shengnan Cong, Jiawei Han, David Padua

Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, Illinois, USA, 2005

Advisor: Jia-Ling Koh

Speaker: Chun-Wei Hsieh

Introduction

Numerous applications:
– DNA sequences, analysis of web logs, customer shopping sequences, XML query access patterns…

Closed sequential patterns
– retain all the information of the complete set of frequent patterns
– are more compact

Many applications are time-critical and involve huge volumes of data.

Sequential Algorithm-BIDE

Step 1: Identify the frequent 1-sequences.

Step 2: Project the dataset along each frequent 1-sequence.

Step 3: Mine each resulting projected dataset.

Sequential Algorithm-BIDE

The projected dataset for sequence AB is {C, CB, C, BCA}.
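The first two steps can be sketched in a few lines of Python. This is only an illustration, not the BIDE implementation: the four-sequence dataset is an assumption (the slide's example dataset did not survive in the text) chosen so that projecting along AB gives the set quoted above, each sequence is treated as a string of single items, and the helper names `frequent_items` and `project` are hypothetical.

```python
from collections import Counter


def frequent_items(dataset, min_support):
    """Step 1: items that occur in at least min_support sequences."""
    counts = Counter()
    for seq in dataset:
        counts.update(set(seq))  # count each item at most once per sequence
    return {item: n for item, n in counts.items() if n >= min_support}


def project(dataset, prefix):
    """Step 2: for each sequence containing prefix as a subsequence,
    keep the suffix that follows the first occurrence of prefix."""
    projected = []
    for seq in dataset:
        pos = 0
        for item in prefix:
            pos = seq.find(item, pos)
            if pos == -1:
                break  # prefix does not occur in this sequence
            pos += 1
        else:
            projected.append(seq[pos:])
    return projected


# Assumed example dataset (not taken from the slide text).
dataset = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(frequent_items(dataset, min_support=2))
print(project(dataset, "AB"))
```

Step 3 would then mine each projected dataset recursively in the same way.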

Task Decomposition

1. Each processor counts the occurrence of 1-sequences in a different part of the dataset. A global add reduction is executed to obtain the overall counts.

2. Build pseudo-projections. This is done in parallel by assigning a different part of the dataset to each processor. The pseudo-projections are communicated to all processors via an all-to-all broadcast.

3. Use dynamic scheduling to distribute the processing of the projections across processors.
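Step 1 of this decomposition can be illustrated with a small sketch. It assumes threads stand in for processors and that the global add reduction is emulated by summing `Counter` objects; a cluster implementation would use message passing instead. The names `local_counts` and `global_counts` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def local_counts(partition):
    """Count, within one partition, how many sequences contain each item."""
    counts = Counter()
    for seq in partition:
        counts.update(set(seq))
    return counts


def global_counts(dataset, n_workers):
    """Split the dataset, count each part in parallel, then add the partial counts."""
    partitions = [dataset[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_counts, partitions))
    # Emulated "global add reduction": element-wise sum of the partial counts.
    return reduce(lambda a, b: a + b, partials, Counter())


# Assumed toy dataset of single-item sequences.
print(global_counts(["CAABC", "ABCB", "CABC", "ABBCA"], n_workers=2))
```

Because addition is associative, the result does not depend on how the dataset is partitioned.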

Task Decomposition

In the second step, it is more efficient to implement the broadcast using a virtual ring structure.

Assume there are N processors. Processor K
– only receives packages from Processor ((K-1) mod N)
– only sends packages to Processor ((K+1) mod N)

The broadcast needs (N-1) send-receive steps and consumes no more than 0.5% of the mining time.
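The ring broadcast can be simulated in a single process to check the step count. This is a sketch under the assumption that each processor starts with one package and, in every step, forwards the package it received most recently; the function name `ring_all_to_all` is hypothetical and no real network communication takes place.

```python
def ring_all_to_all(packages):
    """Simulate an all-to-all broadcast over a virtual ring.

    packages[k] is the package initially held by processor k.
    Returns what each processor holds at the end and the number of steps.
    """
    n = len(packages)
    received = [[pkg] for pkg in packages]  # each processor starts with its own package
    in_flight = list(packages)              # package each processor forwards next
    steps = 0
    for _ in range(n - 1):
        # Processor k receives only from processor (k - 1) mod n,
        # which is the same as every processor sending to (k + 1) mod n.
        incoming = [in_flight[(k - 1) % n] for k in range(n)]
        for k in range(n):
            received[k].append(incoming[k])
        in_flight = incoming
        steps += 1
    return received, steps


held, steps = ring_all_to_all(["P0", "P1", "P2", "P3"])
print(steps)
print(held[0])
```

After N-1 steps every processor holds all N packages, and each step involves only neighbor-to-neighbor traffic.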

Task Scheduling

1. A master processor maintains a queue of pseudo-projection identifiers. Each of the other processors is initially assigned a projection.

2. After mining a projection, a processor sends a request to the master processor for another projection.

3. This process continues until the queue of projections is empty.
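The scheduling loop above can be sketched as a small event-driven simulation. It assumes the mining time of every projection is known in advance (in practice it is not) and ignores communication cost; `dynamic_schedule` and the cost values are hypothetical.

```python
import heapq
from collections import deque


def dynamic_schedule(costs, n_workers):
    """Simulate master-queue scheduling.

    costs[i] is the (assumed known) mining time of projection i.
    Returns the projections mined by each worker and the total elapsed time.
    """
    queue = deque(range(len(costs)))           # master's queue of projection ids
    assignment = {w: [] for w in range(n_workers)}
    free_at = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(free_at)                     # (time the worker becomes free, worker id)
    makespan = 0.0
    while queue:
        t, w = heapq.heappop(free_at)          # next worker to request a projection
        task = queue.popleft()
        assignment[w].append(task)
        finish = t + costs[task]
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, w))
    return assignment, makespan


assignment, makespan = dynamic_schedule([5, 1, 1, 1, 2], n_workers=2)
print(assignment, makespan)
```

In this toy run one worker spends the whole time on the first projection while the other mines the remaining four, which is the load-balancing behavior dynamic scheduling is meant to provide.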

Task Scheduling

If the largest subtask takes 25% of the total mining time, the best possible speedup is only 4 regardless of the number of processors available.
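The bound can be written out as a short derivation. The symbols are introduced here for illustration: $T$ is the total (sequential) mining time, $t_{\max}$ is the time of the largest subtask, and $T_p$ is the parallel time on $p$ processors. The derivation assumes the largest subtask is not decomposed and ignores scheduling overhead.

```latex
T_p \ge t_{\max}
\quad\Longrightarrow\quad
S_p = \frac{T}{T_p} \le \frac{T}{t_{\max}} = \frac{T}{0.25\,T} = 4
\qquad \text{for every } p .
```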

To improve the dynamic scheduling, the approach is to find which projections require a long mining time and to decompose them.

Relative Mining Time Estimation

Random sampling
– selects a random subset of the projections
– is not accurate if the overhead is kept small

Selective sampling
– uses every sequence of the projections
– discards infrequent 1-sequences and the last L frequent 1-sequences (L = a given fraction t × the average length of the sequences in the dataset)

Selective sampling

For example,
– assume (A : 4), (B : 4), (C : 4), (D : 3), (E : 3), (F : 3), (G : 1) are the 1-sequences
– the support threshold = 4
– the average length of the sequences in the dataset = 4
– suppose t = 75%

L = 4 × 0.75 = 3

Given the sequence AABCACDCFDB, removing the infrequent items D and F leaves AABCACCB, and discarding the last 3 frequent items reduces the sequence to AABCA.
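The example can be reproduced with a short sketch. It assumes sequences are strings of single items and that L is truncated to an integer; the function name `selective_sample` and its parameters are hypothetical.

```python
def selective_sample(sequence, support, min_support, avg_length, t):
    """Drop infrequent items, then drop the last L remaining (frequent) items."""
    frequent = {item for item, s in support.items() if s >= min_support}
    kept = [item for item in sequence if item in frequent]
    L = int(t * avg_length)  # assumption: L is truncated to an integer
    return "".join(kept[:max(0, len(kept) - L)])


support = {"A": 4, "B": 4, "C": 4, "D": 3, "E": 3, "F": 3, "G": 1}
print(selective_sample("AABCACDCFDB", support, min_support=4, avg_length=4, t=0.75))
# prints AABCA
```

Mining the reduced sequences is much cheaper than mining the full projection, which is what keeps the estimation overhead low.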

Relative Mining Time Estimation

Par-CSP Algorithm

Experiments

64 nodes
OS: Red Hat Linux 7.2
CPU: 1GHz Intel Pentium 3
RAM: 1GB
Compiler: GNU g++ 2.96

Experiments

• Synthetic dataset: IBM dataset generator

• Real dataset: Gazelle, Web click-stream

Experiments
