Page 1 (source: people.irisa.fr/Alexandre.Termier/dmv/DMV_CM5.pdf)

Pattern mining in parallel environments

Alexandre Termier

Université de Rennes 1 – IRISA – Equipe LACODAM

DMV – M2 SIF

Page 2

Naive introduction

• Pattern mining: find (interesting) patterns in data
  • cf. Marc’s previous courses

• Needs a lot of computing power
  • Exploration of a huge search space
  • Potentially costly pattern interest test

• Nowadays, computing power comes from parallelism
  • Multicore processors
  • Clusters
  • GPUs

=> How to exploit parallel environments for pattern mining?

Page 3

(Tentative) motivations

Use of computing power for pattern mining?

• Mine large datasets
  • Ex: actual supermarket dataset ~ 4 TB

• Mine “troublesome” datasets
  • Ex: bioinformatics, SNP data: ~1000 lines / 5 000 000 columns, 25% density
  • The best current FIS (frequent itemset) algorithms give up around 20 000 columns (yes, LCM too)

• Mine “complex” patterns
  • Graphs: interest = subgraph isomorphism

• Make a finer-grained analysis
  • Usually, reduce the minimum support threshold…

Page 4

Counter-argument

• Pattern mining outputs millions of patterns

• Few have actual value

• Why bother computing billions of patterns?

Page 5

Motivations, take two

• Many solutions to handle pattern overabundance (more on the way)
  • Post-processing
  • Constraints
  • Pattern sets (ex: KRIMP)
  • Statistics-based pattern interest functions
  • …

• Use computing power to find more interesting patterns
  • Efficient parallel pattern space exploration
  • Efficient parallel evaluation of complex pattern interest functions
  • Interactive navigation in pattern space

Page 6

Parallel environments discussed in this talk

1. Multicore processors

2. Clusters

3. GPUs (a bit)

4. Manycores (some hints)

Page 7

Parallel performance 101

Slides from Marc Snir (University of Illinois at Urbana-Champaign), taken from the IJCAI Tutorial on Parallel Data Mining

Page 8

Sometimes Parallelism is Easy

• Painting a fence:
  • Time = (picket_painting_time) * (#pickets) / (#painters)
  • Perfect parallelism

Slide from Marc Snir

Page 9

Up To a Limit

• Task granularity cannot be too small
  • Too many painters spoil the fence

Slide from Marc Snir

Page 10

Sometimes Parallelism Does Not Help

• How many babies can 9 women produce in one month?

Slide from Marc Snir

Page 11

Some Definitions

• $T_P$ = compute time with $P$ HW threads
• $T_1$ = sequential compute time
• $T_\infty$ = compute time with no limitation on #threads = critical path length

$T_P \ge T_\infty$; $T_P \ge T_1 / P$

• Efficient algorithm: $T_P \approx T_1 / P$, which requires $T_\infty \ll T_1 / P$, i.e. $P \ll T_1 / T_\infty$

• Cannot efficiently use more HW threads than the “average width” of the computation

• Example, Amdahl's law: a fraction $\alpha$ of the computation is sequential, $(1-\alpha)$ is fully parallel
  • $T_\infty = \alpha T_1$; can efficiently use ~$1/\alpha$ processors
  • E.g., 10% of code is sequential -> should not use more than ~10 HW threads

Slide from Marc Snir
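The Amdahl example above can be checked numerically. The following is a minimal sketch, assuming the usual form of the bound (sequential fraction alpha, the rest perfectly parallel over p threads); the function names are illustrative and do not come from the slides.

```python
def amdahl_speedup(alpha: float, p: int) -> float:
    """Upper bound on speedup with p threads when a fraction alpha is sequential."""
    return 1.0 / (alpha + (1.0 - alpha) / p)


def max_speedup(alpha: float) -> float:
    """Limit of the bound when p grows without limit: T_1 / T_inf = 1 / alpha."""
    return 1.0 / alpha


if __name__ == "__main__":
    # With 10% sequential code, adding threads beyond ~10 brings little.
    for p in (1, 2, 10, 100, 1000):
        print(f"p={p:>4}  speedup <= {amdahl_speedup(0.10, p):.2f}")
    print("limit:", max_speedup(0.10))
```

With alpha = 0.10, ten threads already reach more than half of the limit of 10, which is the reasoning behind the "~1/alpha processors" rule of thumb.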

Page 12

Speedup and Efficiency

Speedup

• Measure of how much faster the computation executes versus the best serial code
  • Serial time divided by parallel time

• Example: painting a picket fence
  • 30 minutes of preparation (serial)
  • One minute to paint a single picket
  • 30 minutes of cleanup (serial)
  • Thus, 300 pickets take 360 minutes (serial time)

Slide from Marc Snir

Page 13

Speedup and Efficiency

Computing Speedup

Number of painters    Time (minutes)          Speedup
1                     30 + 300 + 30 = 360     1.0X
2                     30 + 150 + 30 = 210     1.7X
10                    30 + 30 + 30 = 90       4.0X
100                   30 + 3 + 30 = 63        5.7X
Infinite              30 + 0 + 30 = 60        6.0X

• Speedup = $T_1 / T_P$

• Speedup ≤ $P$ ($P$ workers reduce time by at most a factor of $P$)

• Speedup ≤ $T_1 / T_\infty$ (Amdahl’s law)

Amdahl’s Law: potential speedup is restricted by the serial portion

Slide from Marc Snir

Page 14

Speedup

• $T_1 / T_P$

• Speedup is usually sub-linear and reaches a plateau
  • How could one have superlinear speedup?

[Figure: plot of ideal speedup and actual speedup, both axes ranging from 1 to 19]

Slide from Marc Snir

Page 15

Speedup and Efficiency

Efficiency

• Measure of how effectively computation resources (threads) are kept busy

• Speedup divided by number of HW threads

Number of painters    Time (minutes)          Speedup    Efficiency
1                     360                     1.0X       100%
2                     30 + 150 + 30 = 210     1.7X       85%
10                    30 + 30 + 30 = 90       4.0X       40%
100                   30 + 3 + 30 = 63        5.7X       5.7%
Infinite              30 + 0 + 30 = 60        6.0X       0%

Slide from Marc Snir
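The fence-painting figures can be reproduced with a few lines of code. This is a small sketch under the slide's assumptions (30 minutes of serial preparation, one minute per picket, 30 minutes of serial cleanup, 300 pickets); the function names are made up for the illustration.

```python
def fence_time(painters: int, pickets: int = 300,
               prep: int = 30, cleanup: int = 30) -> float:
    """Total minutes: serial preparation + parallel painting + serial cleanup."""
    return prep + pickets / painters + cleanup


def speedup(painters: int) -> float:
    """Serial time divided by parallel time."""
    return fence_time(1) / fence_time(painters)


def efficiency(painters: int) -> float:
    """Speedup divided by the number of workers."""
    return speedup(painters) / painters


if __name__ == "__main__":
    for p in (1, 2, 10, 100):
        print(f"{p:>4} painters: {fence_time(p):6.1f} min, "
              f"speedup {speedup(p):.1f}X, efficiency {efficiency(p):.1%}")
```

Computed without intermediate rounding, the efficiency for 2 painters comes out slightly above the 85% of the table, which divides the already rounded speedup 1.7 by 2.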

Page 16

Efficiency

• $T_1 / (P \, T_P)$

• Efficiency is usually < 1 and decreasing

[Figure: plot of ideal efficiency and actual efficiency, vertical axis from 0 to 1]

Slide from Marc Snir

Page 17

Pattern mining on multicore processors

Page 18

Multicore processors

• Most of today's processors are multicore processors
  • Moore's law is still active: the number of transistors on a chip doubles every 18 months
  • But clock frequency doesn’t increase anymore…
  • …so computing power comes from multiple cores on a chip

• Multicore processors have
  • Independent computing cores (from 2 to 12 usually)
  • Shared L3 cache
  • Shared or private L2 cache
  • Private L1 cache

Page 19

Example – Intel Nehalem/Westmere

• 4-10 cores
• 1 or 2 threads per core
• Vector unit per core
• Three cache levels:
  • Private L1 (32 KB data, 32 KB instruction)
  • Private L2 (512 KB)
  • Shared L3 (16-30 MB)
• 32 nm technology
• Can be assembled in quad-chip SMPs

~50 Gflop/s peak!

Slide from Marc Snir

Page 20

Cache hierarchy / architecture schema

hwloc library: get the architecture schema

Page 21

Multicore pitfalls for the pattern miner

• Synchronization
  • Protects accesses to shared memory areas
  • Sequentializes code
  • Avoid it: tree-shaped search space exploration (more on this later)

• Load imbalance
  • [Figure: timeline of Thread 1 and Thread 2; Thread 1 finishes first and stays idle until Thread 2 finishes]

• Bus bandwidth saturation
  • N cores / 1 bus to connect them to memory
  • Computations much faster than memory transfers
  • Or too many data transfers
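The load imbalance pitfall can be illustrated with a toy simulation. The sketch below compares a static split of tasks between threads with a dynamic scheme where each idle thread takes the next task from a shared queue; the task costs and function names are invented for the example and do not model any specific miner.

```python
import heapq


def static_makespan(costs, n_threads):
    """Split tasks into contiguous equal-sized blocks, one block per thread."""
    size = -(-len(costs) // n_threads)  # ceiling division
    return max(sum(costs[i:i + size]) for i in range(0, len(costs), size))


def dynamic_makespan(costs, n_threads):
    """Each idle thread takes the next task from a shared queue."""
    finish = [0] * n_threads            # min-heap of thread finish times
    for cost in costs:
        earliest = heapq.heappop(finish)   # thread that becomes idle first
        heapq.heappush(finish, earliest + cost)
    return max(finish)


# One heavy branch of the search space and many light ones.
costs = [8, 1, 1, 1, 1, 1, 1, 1, 1]

if __name__ == "__main__":
    print("static :", static_makespan(costs, 2))
    print("dynamic:", dynamic_makespan(costs, 2))
```

With the static split, the thread that received the light tasks finishes early and idles; the dynamic scheme keeps both threads busy, at the price of a shared queue (and thus some synchronization).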

Page 22

Case study #1: subtree mining / Tatikonda et al.

• Paper: Shirish Tatikonda, Srinivasan Parthasarathy: Mining Tree-Structured Data on Multicore Systems. PVLDB 2(1): 694-705 (2009)

• Excellent illustration of the impact of bandwidth pressure on pattern mining

Page 23

Subtree mining 101

• Input: tree database, where each transaction is a tree

• Output: frequent subtree patterns

• Relies on subtree inclusion (costly):
  • Induced subtree: preserves parent-child relationships
  • Embedded subtree: preserves ancestor-descendant relationships
  • Every induced subtree is an embedded subtree

[Figure: example trees over nodes A, B, C, D illustrating an induced subtree and an embedded subtree]

Page 24

Algorithm overview

• Two primary steps
  • Candidate subtree generation
    • Generate all possible candidate subtrees
    • Challenge: search space traversal
  • Support counting
    • Evaluate the frequency of each candidate
    • Challenge: subtree isomorphism

• A recursive pattern-growth approach
  • Start with a seed pattern (a single node)
  • Repeatedly grow the pattern by adding nodes (pattern extension)
    • This step corresponds to search space traversal
  • Evaluate the frequency of the generated pattern

Pattern_mine(P):
    sup = find_frequency(P)
    if sup ≥ θ:
        for each P’ obtained by growing P with a new node:
            Pattern_mine(P’)

[Figure: pattern growth starting from a seed pattern (single node A), extended with new nodes step by step]

How can we do it efficiently?

Slide from Shirish Tatikonda
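To make the recursion concrete, here is a runnable sketch of the pattern-growth scheme. It is only an illustration: it mines itemsets rather than subtrees (so that the inclusion test is a simple subset test), and the toy database, the threshold and all names are assumptions, not the algorithm of the paper.

```python
def find_frequency(pattern, database):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in database if pattern <= t)


def pattern_mine(pattern, database, theta, items, out):
    """Recursive pattern growth: keep P if frequent, then try its extensions."""
    sup = find_frequency(pattern, database)
    if sup < theta:
        return                      # prune: no extension of P can be frequent
    if pattern:
        out[tuple(sorted(pattern))] = sup
    last = max(pattern) if pattern else ""
    for item in items:
        if item > last:             # fixed extension order: each pattern is generated once
            pattern_mine(pattern | {item}, database, theta, items, out)


database = [frozenset("ABC"), frozenset("ABD"), frozenset("AB"), frozenset("CD")]
items = sorted(set().union(*database))
result = {}
pattern_mine(frozenset(), database, theta=2, items=items, out=result)

if __name__ == "__main__":
    for pattern, sup in sorted(result.items()):
        print(pattern, sup)
```

For subtrees, `find_frequency` would be replaced by a subtree inclusion test, which is where most of the cost (and of the memory traffic discussed next) comes from.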

Page 25

Usual approach

• Search space exploration:
  • Represent trees with sequences (Asai et al., Zaki)
  • Explore the search space of sequences

• Subtree inclusion test
  • Costly: store found embeddings -> embedding lists
  • Pre-2005 tradeoff: more memory than computing power

Page 26

Bandwidth usage of TreeMiner (Zaki et al. 2005)

• Needs lots of transfers of embedding lists…

• …that have poor cache locality

• Result:
  • 1.2 GB/s usage per core!
  • Speedup: ~2 on 8 processors…

Page 27

Reducing bandwidth usage

• Store embedding lists -> recompute embedding lists on the fly
  • Post 2005, CPU is cheap, memory is expensive!

• Only process a fixed number of embeddings at a time

Resulting bandwidth usage:
• TreeMiner: 1.2 GB/s
• TRIPS: 200 MB/s

Page 28

Other challenge: task partitioning

• The search space is partitioned into equivalence classes
• Each equivalence class contains many patterns
• Processing each pattern involves many trees

Workload skew is present at every level
• One equivalence class may contain more patterns than another
• Processing one pattern may be more expensive than processing another
• One tree may be bigger than another

Challenge: load balancing in the presence of skew

[Figure: search space rooted at A, B, C, partitioned into equivalence classes]

Slide from Shirish Tatikonda

Page 29

Adaptive Design

Key idea: adaptively and automatically adjust the type and granularity of work that is shared among cores

Job pools, from coarse-grain to fine-grain parallelism:
• Task pool: process multiple patterns in parallel
• Tree pool: process a single pattern, i.e., multiple trees, in parallel
• Column pool: process a single pattern w.r.t. a single tree in parallel (i.e., the dynamic programming matrix is processed in parallel)

[Diagram: a thread pool serves the job pools, with context switches when work is ready or when pools are empty]

Slide from Shirish Tatikonda

Page 30

Implementation of Parallel Algorithm

[Diagram labels: miningMethod(...); check the complexity of current work; job-spawning condition; task-parallel, data-parallel, chunk-parallel; tree pool; chunk pool; Process_the_job; thread pool; light-weight context switching; general-purpose scheduling service]

Slide from Shirish Tatikonda

Page 31

Performance – Parallel Efficiency

On two data sets: Cslogs (web analytics) and Treebank (computational linguistics)

[Figure: speedup vs. number of cores (1 to 8) on a dual quad-core system, for Cslogs and Treebank]

[Figure: speedup vs. number of processors (1 to 16) on an SGI Altix SMP system, for Cslogs and Treebank, with and without fine-grained parallelism]

1) Near-linear speedups

2) Need for fine-grain parallelism
  • Without which the speedups saturate

3) Memory optimizations are critical
  • Without them, the algorithms are not scalable (speedup of 1.7 on 8 processors)

Slide from Shirish Tatikonda

Page 32

Map of pattern mining families

• Specific approaches: FIS (Apriori, 93; FPGrowth, 00; LCM, 04)
• Constraint pattern mining
• Generic pattern mining (Boley et al., 07-10; Arimura & Uno, 09): strong accessibility
• ParaMiner [DMKD, 14]: strong accessibility + decomposability

Page 33

ParaMiner

2014

Page 34

ParaMiner: algorithm

Page 35

ParaMiner’s initial scalability

Page 36

[Figure: average latency evolution in ParaMiner; x-axis: #cores (1, 2, 4, 8, 16, 32); y-axis: average latency in cycles (0 to 14); series: GRI and FIS]

Page 37

[Figure: enumeration tree with itemsets A, B, C at the first level, AB, AC, BC, BD, CD, CE at the second level, and ABC, ABD, ACE, BCD, CDE, CDF, CEF at the third level; every Select operation accesses the full dataset]

Page 38

Dataset reduction

[Figure: same enumeration tree; the first-level Select operations access the full dataset, the second-level ones access the reduced datasets DA, DB, DC, and the third-level ones access the reduced datasets DAB, DAC, DBC, DCD, DCE]
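The idea of dataset reduction can be sketched on itemsets: each branch of the enumeration tree scans only the part of the data that can still matter for it. The code below is a simplified illustration with assumed names and a toy dataset; it shows the classic conditional-dataset construction, not ParaMiner's actual implementation.

```python
def reduce_dataset(dataset, item):
    """Reduced dataset for `item`: keep only the transactions containing it,
    and in those keep only the items that can still extend the pattern."""
    return [frozenset(i for i in t if i > item) for t in dataset if item in t]


def mine(dataset, theta, prefix=(), out=None):
    """Depth-first itemset mining where each branch scans its reduced dataset."""
    out = {} if out is None else out
    counts = {}
    for t in dataset:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    for item in sorted(counts):
        if counts[item] >= theta:
            pattern = prefix + (item,)
            out[pattern] = counts[item]
            # The recursive call never touches the full dataset again.
            mine(reduce_dataset(dataset, item), theta, pattern, out)
    return out


dataset = [frozenset("ABC"), frozenset("ABD"), frozenset("AB"), frozenset("CD")]
patterns = mine(dataset, theta=2)

if __name__ == "__main__":
    print(patterns)
```

Smaller datasets per branch mean smaller working sets, which is what relieves the memory bus when many cores explore branches at the same time.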

Page 39

Dataset reduction 2.0

[Figure: same enumeration tree; Select operations access a mix of the full dataset and of reduced datasets (DA, DAB, DC, DCE)]

Page 40

Problem                                  Speedup before (max: 32)    Speedup after (max: 32)
Frequent Itemset Mining (dense data)     3                           21
Frequent Itemset Mining (sparse data)    11                          25
Gradual Pattern Mining                   27.5                        28.5
Closed Relational Graph Mining           3                           5

Page 41

Conclusion for multicores

• Ubiquitous parallel environment
  • Getting easier to program
    • C++11, futures/promises, async…
    • Java 8 streams
    • Scala actors…

• Main problem: cores contend for bus bandwidth
  • Requires designing algorithms with a small working set

• Use the right profiling tools!
  • Java -> YourKit; C++/other -> VTune, Linux hardware counter libraries
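As an illustration of the futures style mentioned above, the following sketch explores each top-level branch of an itemset search tree as an independent task. It uses Python's standard `concurrent.futures` module; the dataset, threshold and function names are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

DATASET = [frozenset("ABC"), frozenset("ABD"), frozenset("AB"), frozenset("CD")]
THETA = 2


def support(pattern):
    """Number of transactions containing the pattern."""
    return sum(1 for t in DATASET if pattern <= t)


def explore_branch(start_item, items):
    """Mine every frequent itemset whose smallest item is start_item."""
    found, stack = {}, [frozenset([start_item])]
    while stack:
        pattern = stack.pop()
        sup = support(pattern)
        if sup >= THETA:
            found[tuple(sorted(pattern))] = sup
            stack.extend(pattern | {i} for i in items if i > max(pattern))
    return found


def mine_parallel(n_workers=4):
    """Submit one future per branch, then gather the results."""
    items = sorted(set().union(*DATASET))
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(explore_branch, i, items) for i in items]
        for future in futures:
            results.update(future.result())  # branches are disjoint: no locking needed
    return results


if __name__ == "__main__":
    print(mine_parallel())
```

Because the branches generate disjoint sets of patterns, no synchronization is needed during the exploration. In CPython, threads do not speed up CPU-bound code because of the global interpreter lock; `ProcessPoolExecutor` offers the same interface with processes.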

Page 42

VTune (Intel)

Page 43

Pattern mining on clusters

Page 44

Clusters

• Homogeneous/heterogeneous network of (multicore) machines

• Network of clusters: grid -> backbone of cloud

• Cheapest way to get tremendous
  • Computing power
  • RAM
  • Storage space

• Introduces new problems
  • Slow communications between nodes
  • Data locality
  • Fault tolerance

Page 45

Cluster computing main environments

• MPI

• MapReduce

• Spark

Page 46

MPI

• Message-passing paradigm

• The programmer has total control over communications

• One abstraction level above socket programming

• Communication primitives
  • One-to-one
  • One-to-many / one-to-all
  • Many-to-many / all-to-all

• Messages are byte arrays

• No fault tolerance

=> Powerful but hard to use correctly

Page 47

MapReduce

• Based on the functional paradigm

• Two types of operations
  • Map
  • Reduce

• Hadoop also offers:
  • Distributed file system (HDFS)
  • Fault tolerance
    • For files
    • For Map/Reduce jobs

• Everything is committed to disk
  • Very slow
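The two operations can be mimicked in a few lines of plain Python. The sketch below counts item frequencies, a typical first step of frequent itemset mining; it runs in a single process and only illustrates the programming model (the `shuffle` step stands for what the framework does between the two phases). All names are illustrative and none of this is Hadoop's API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(transaction):
    """Map: emit one (item, 1) pair per item of a transaction."""
    return [(item, 1) for item in transaction]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the partial counts of one item."""
    return key, reduce(lambda a, b: a + b, values)


transactions = [["A", "B", "C"], ["A", "B"], ["C", "D"]]
pairs = [kv for t in transactions for kv in map_phase(t)]
item_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(item_counts)
```

In a real deployment the map calls run on the nodes holding the data blocks, and the grouped pairs travel over the network and through disk, which explains the cost noted above.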

Page 48

Spark

• Designed for iterative computations (including data mining)
  • Data stored in memory (and/or disk)
  • One to two orders of magnitude faster than Map/Reduce

• Based on RDD: Resilient Distributed Dataset
  • Distributed collection paradigm
  • An RDD is divided into blocks -> each block fits into a node’s RAM
  • RDD transformation operations

• Fault tolerance via reconstruction
  • Keep RDD lineage as metadata
  • Recompute lost RDDs / RDD blocks

• Computing power is cheap!

Page 49

Mining top-k-per-item over clusters

• Web data: long tail

• Standard Frequent Itemset Mining + long tail:

Slide parts from Martin Kirchgessner / Vincent Leroy

Page 50

Top-k-per-item frequent itemsets

Slide from Martin Kirchgessner / Vincent Leroy

Page 51

Mining top-k-per-item itemsets

• TopPI
  • Martin Kirchgessner, Vincent Leroy et al., Grenoble-Alpes University
  • In submission
  • Computes top-k closed frequent itemsets per item
  • Based on heavily modified LCM
  • Multicore and MapReduce versions

• PFP: Parallel FP-Growth
  • Li et al., RecSys 2008
  • Based on FPGrowth
  • The frequent itemset miner of Mahout (MapReduce)
  • Computes at most k itemsets per item
  • Sloppy output definition

Page 52

Example (TopPI)

Page 53

Reminder: closure extension (Uno et al., 03)

Each branch generates different itemsets, no need for synchronization

Page 54

Overview of general TopPI algorithm

Page 55

TopPI over MapReduce

• Distribute branches over nodes
  • Branch = starting item
  • Each node receives a set of starting items G to process

• Distribute the top-k collector
  • Collector: for each item, a heap of size k storing the current top-k for this item
  • In a distributed setting, a node can only fill collectors for items of G
    • May not be the actual top-k of these items
  • Second phase: workers get the complementary top-k (items not in G)
  • Merge both
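The collector described above can be sketched with one bounded heap per item. The class below is a hypothetical, simplified illustration (the name `TopKCollector` and its methods are invented, and ties are broken arbitrarily by itemset order); it is not TopPI's implementation.

```python
import heapq


class TopKCollector:
    """For each item, a min-heap of at most k (support, itemset) entries."""

    def __init__(self, k):
        self.k = k
        self.heaps = {}

    def collect(self, item, support, itemset):
        """Offer an itemset as a candidate for the top-k of `item`."""
        heap = self.heaps.setdefault(item, [])
        entry = (support, tuple(sorted(itemset)))
        if entry in heap:
            return                              # already collected
        if len(heap) < self.k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)      # evict the current weakest entry

    def merge(self, other):
        """Fold another worker's partial top-k into this collector."""
        for item, heap in other.heaps.items():
            for support, itemset in heap:
                self.collect(item, support, itemset)

    def top(self, item):
        """Current top-k of `item`, best first."""
        return sorted(self.heaps.get(item, []), reverse=True)


worker1, worker2 = TopKCollector(k=2), TopKCollector(k=2)
worker1.collect("A", 5, "AB")
worker1.collect("A", 3, "AC")
worker2.collect("A", 4, "AD")
worker1.merge(worker2)

if __name__ == "__main__":
    print(worker1.top("A"))
```

Merging partial collectors is what the second phase amounts to: each worker only sees part of the candidates for an item, and the true top-k appears only once the partial heaps are combined.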

Page 56

Page 57

PFP

• Partition the database over the workers (shards)

• Count the frequency of all individual items & organize them in groups
  • Same as in TopPI

• Make group-dependent transactions (conditional datasets), mine each with FP-Growth

• Aggregate discovered frequent itemsets
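The group-dependent transactions step can be sketched as follows. This is an assumption-laden illustration: items are ranked by decreasing frequency and cut into contiguous groups, and for each group a transaction contributes its longest frequency-ordered prefix ending with an item of that group. The helper names are invented, and the FP-Growth mining of each group is not shown.

```python
from collections import Counter


def build_groups(transactions, n_groups):
    """Rank items by decreasing frequency and cut the ranking into groups."""
    counts = Counter(i for t in transactions for i in t)
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    size = -(-len(ranked) // n_groups)          # ceiling division
    rank = {item: r for r, item in enumerate(ranked)}
    group = {item: r // size for item, r in rank.items()}
    return rank, group


def group_dependent_transactions(transaction, rank, group):
    """Yield (group id, prefix) pairs: for each group, the longest prefix of the
    frequency-ordered transaction that ends with an item of that group."""
    ordered = sorted(transaction, key=rank.__getitem__)
    seen = set()
    for j in range(len(ordered) - 1, -1, -1):
        g = group[ordered[j]]
        if g not in seen:
            seen.add(g)
            yield g, ordered[:j + 1]


transactions = [["A", "B", "C"], ["A", "B"], ["A", "C", "D"]]
rank, group = build_groups(transactions, n_groups=2)
emitted = list(group_dependent_transactions(["A", "C", "D"], rank, group))

if __name__ == "__main__":
    print(group)
    print(emitted)
```

Each group then receives only the prefixes it needs, so the workers can mine their groups independently before the final aggregation.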

Page 58

Results – time comparison

Datasets:
• LastFM: 1.2M lines x 1.2M columns (277 MB)
• Supermarket: 55M lines x 400k columns (2.8 GB)

Cluster: 51 x [2 Xeon E5520 (4 cores each), 24 GB RAM], 4 tasks / node

Page 59

Results – TopPI speedup

Page 60

Results – output comparison

Page 61

Arabesque

• Arabesque: A System for Distributed Graph Mining. Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, Ashraf Aboulnaga. SOSP 2015

• Problem:
  • There are frameworks for analyzing very large graphs
    • Pregel, Giraph
  • But they are ill-suited to frequent subgraph mining
    • Their base element is the vertex

• Arabesque http://arabesque.io
  • Embedding as base element
  • Numerous problems fit this model

Page 62

Slide from Mohammed J. Zaki

Page 63

Slide from Mohammed J. Zaki

Page 64

Slide from Mohammed J. Zaki

Page 65

Slide from Mohammed J. Zaki

Page 66

Slide from Mohammed J. Zaki

Page 67

Slide from Mohammed J. Zaki

Page 68

Conclusion for clusters

• Choose the right environment, stay alert for new approaches
  • Currently Spark is the way to go for most people

• Make your computations as independent as possible
  • -> tree-shaped search space exploration (but risk of load imbalance)
  • -> limit the number of “barriers”

• Control your data partitioning

• Use the right profiling tools!

Page 69

Spark UI

Page 70

Pattern mining on GPUs

Page 71

GPUs

• Different paradigm
  • CPU: low latency, low throughput
  • GPU: high latency, high throughput

• Many simple cores
  • Simpler control logic
  • Less cache

• Data parallelism
  • SIMD

Page 72

GPUs for pattern mining

• Usual pattern mining approaches: task parallelism
  • Ill-suited to GPUs

• Slow data transfers between host and GPU

• => few GPU pattern mining approaches
  • Apriori based on bitsets + vertical format
  • FPGrowth based on an array representation of the FP-tree

Page 73

Apriori vertical + bitsets on GPU

Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He, Qiong Luo: Frequent itemset mining on graphics processors. DaMoN 2009: 34-42
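The vertical bitset representation lends itself to data parallelism because support counting becomes a bitwise AND followed by a bit count. The sketch below shows the principle on the CPU, with Python integers standing in for the bit vectors; the function names are illustrative and this is not the code of the paper.

```python
def vertical_bitsets(transactions):
    """Vertical format: one bitset per item, where bit t is set
    when transaction number t contains the item."""
    bitsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            bitsets[item] = bitsets.get(item, 0) | (1 << tid)
    return bitsets


def support(itemset, bitsets):
    """AND the bitsets of the items (itemset must be non-empty),
    then count the bits that remain set."""
    items = iter(itemset)
    acc = bitsets.get(next(items), 0)
    for item in items:
        acc &= bitsets.get(item, 0)
    return bin(acc).count("1")


transactions = [["A", "B", "C"], ["A", "B"], ["C", "D"]]
bitsets = vertical_bitsets(transactions)

if __name__ == "__main__":
    print(support(["A", "B"], bitsets))
    print(support(["A", "D"], bitsets))
```

On a GPU, each thread would typically handle one machine word of the bit vectors, so the AND and the bit count run over all words at once with regular, contiguous memory accesses.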

Page 74

Comparison Apriori GPU / FPGrowth CPU

Page 75

Conclusion on GPUs

• Pattern mining on GPUs?
  • Risky business
  • Most researchers who published on it changed topic…

• Novel manycore processors may be better suited to the task

Page 76

Some hints on Manycores

Page 77

Manycore processors

• Middle ground between multicores and GPUs
  • ~100s of cores

• Cores
  • Simpler than those of multicores
  • More complex than GPU cores

• Cores have
  • Full-fledged control logic
  • Some cache

Page 78

Current manycores

• Intel Xeon Phi
  • 61 cores
  • 512k L2 cache per core
  • Extension board (same as GPU)

• Kalray MPPA
  • 256 cores
  • 16 clusters of 16 cores, 2 MB per cluster

• Tilera GX
  • 72 cores
  • 256k L2 cache per core

• All: cache coherency optional

Page 79

Manycores for pattern mining?

• Better control logic than GPU…

• …however:
  • Small caches / onboard memory (working set must be kept small)
  • Slow data transfers with host (as of now)

• Manycores designed for complex streaming applications:
  • May be adapted for online pattern mining

• Also designed for: performance per watt
  • See: Emilio Francesquini, Márcio Bastos Castro, Pedro H. Penna, Fabrice Dupros, Henrique C. Freitas, Philippe Olivier Alexandre Navaux, Jean-François Méhaut: On the energy efficiency and performance of irregular application executions on multicore, NUMA and manycore platforms. J. Parallel Distrib. Comput. 76: 32-48 (2015)

Page 80

Conclusion

• Parallelism is getting easier and easier to get into
  • -> provides the necessary gain in performance

• Performance should be used to:
  • Show that I am faster than colleagues and get papers: so 2005!
  • Extract more significant patterns
  • Allow better interactivity with analysts
    • See KDD IDEA workshop 2013-2015
    • See our CIKM 2015 paper (Omidvar Tehrani et al.)

Page 81

Backup slides
