Source: people.irisa.fr/alexandre.termier/dmv/dmv_cm5.pdf

TRANSCRIPT
Pattern mining in parallel environments
Alexandre Termier
Université de Rennes 1 – IRISA – Equipe LACODAM
DMV – M2 SIF
Naive introduction
• Pattern mining: find (interesting) patterns in data
  • cf Marc’s previous courses
• Needs a lot of computing power
  • Exploration of a huge search space
  • Potentially costly pattern interest tests
• Nowadays, computing power comes from parallelism
  • Multicore processors
  • Clusters
  • GPUs
=> How to exploit parallel environments for pattern mining?
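As a quick refresher of the mining task itself, frequent itemset mining can be stated in a few lines of Python. This brute-force sketch (toy transactions and names invented here, not from the course) shows why the search space explodes: it tests every one of the 2^|items| candidate itemsets against the whole database.

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def frequent_itemsets(transactions, min_support):
    """Enumerate every itemset whose support reaches min_support (naive)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions containing the candidate.
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
    return frequent

print(frequent_itemsets(transactions, min_support=2))
```

Real miners (Apriori, FPGrowth, LCM) prune this exponential enumeration aggressively, but the cost structure stays the same: a huge search space plus a support test per candidate.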
(Tentative) motivations
Use of computing power for pattern mining?
• Mine large datasets
  • Ex: an actual supermarket dataset ~ 4 TB
• Mine “troublesome” datasets
  • Ex: bioinformatics, SNP data: ~1,000 lines / 5,000,000 columns, 25% density
  • The best current FIS algorithms surrender around 20,000 columns (yes, LCM too)
• Mine “complex” patterns
  • Graphs: interest = subgraph isomorphism
• Make a finer-grained analysis
  • Usually means reducing the minimum support threshold…
Counter-argument
• Pattern mining outputs millions of patterns
• Few have actual value
• Why bother computing billions of patterns ?
Motivations, take two
• Many solutions to handle pattern overabundance (more on the way)
  • Post-processing
  • Constraints
  • Pattern sets (ex: KRIMP)
  • Statistics-based pattern interest functions
  • …
• Use computing power to find more interesting patterns
  • Efficient parallel pattern space exploration
  • Efficient parallel evaluation of complex pattern interest functions
  • Interactive navigation in pattern space
Parallel environments discussed in this talk
1. Multicore processors
2. Clusters
3. GPUs (a bit)
4. Manycores (some hints)
Parallel performance 101
Slides from Marc Snir – University of Illinois at Urbana-Champaign
They come from the IJCAI Tutorial on Parallel Data Mining
Sometimes Parallelism is Easy
• Painting a fence:
• Time = (picket_painting_time) * (# pickets)/#painters
• Perfect parallelism
Slide from Marc Snir
Up To a Limit
• Task granularity cannot be too small
― Too many painters spoil the fence
Slide from Marc Snir
Sometimes Parallelism Does Not Help
• How many babies can 9 women make in one month?
Slide from Marc Snir
Some Definitions
TP = compute time with P HW threads
T1 = sequential compute time
T∞ = compute time with no limit on #threads = critical path length

TP ≥ T∞ ; TP ≥ T1/P
• Efficient algorithm: TP ~ T1/P, i.e. T∞ << T1/P, i.e. P << T1/T∞
• Cannot efficiently use more HW threads than the “average width” of the computation
• Example, Amdahl’s law: a fraction α of the computation is sequential, (1-α) fully parallel
  • T∞ = αT1 ; can efficiently use ~1/α processors
  • E.g., 10% of the code is sequential -> should not use more than ~10 HW threads
Slide from Marc Snir
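Amdahl's law is easy to check numerically. This small sketch (illustrative, not from the slides) shows the speedup of a 10%-sequential program saturating near 1/α = 10 no matter how many threads are added.

```python
def amdahl_speedup(alpha, p):
    """Speedup of a program whose fraction alpha is sequential, run on p threads.

    The sequential part always costs alpha * T1; only the rest divides by p.
    """
    return 1.0 / (alpha + (1.0 - alpha) / p)

# 10% sequential code: speedup saturates near 1/alpha = 10.
for p in (1, 10, 100, 10**6):
    print(p, round(amdahl_speedup(0.10, p), 2))
```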
Speedup
• Measure of how much faster the computation executes versus the best serial code
• Serial time divided by parallel time
• Example: painting a picket fence
  • 30 minutes of preparation (serial)
  • One minute to paint a single picket
  • 30 minutes of cleanup (serial)
  • Thus, painting 300 pickets takes 360 minutes (serial time)
Speedup and Efficiency
Slide from Marc Snir
Computing Speedup

Number of painters | Time                | Speedup
1                  | 30 + 300 + 30 = 360 | 1.0X
2                  | 30 + 150 + 30 = 210 | 1.7X
10                 | 30 + 30 + 30 = 90   | 4.0X
100                | 30 + 3 + 30 = 63    | 5.7X
Infinite           | 30 + 0 + 30 = 60    | 6.0X
• Speedup = T1/TP
• Speedup ≤ P (P workers reduce time by at most a factor of P)
• Speedup ≤ T1/T∞ (Amdahl’s law)
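The fence numbers above can be reproduced directly. This sketch (parameter names invented here) computes TP, the speedup T1/TP, and the efficiency for the 300-picket fence.

```python
def fence_time(painters, pickets=300, prep=30, cleanup=30, per_picket=1):
    """Total painting time: serial prep/cleanup plus perfectly parallel painting."""
    return prep + per_picket * pickets / painters + cleanup

t1 = fence_time(1)  # 360 minutes, the serial time
for p in (1, 2, 10, 100):
    tp = fence_time(p)
    speedup = t1 / tp
    efficiency = speedup / p
    print(p, tp, round(speedup, 1), f"{efficiency:.0%}")
```

Even with infinitely many painters, the 60 serial minutes cap the speedup at 6X, which is Amdahl's law in action.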
Amdahl’s Law
Potential speedup is restricted by the serial portion
Speedup and Efficiency
Slide from Marc Snir
Speedup
• T1/TP
• Speedup is usually sub-linear and has a plateau
  • How could one have superlinear speedup?
[Figure: measured speedup vs. ideal linear speedup for 1-19 threads; the measured curve falls below the ideal and plateaus]
Slide from Marc Snir
Efficiency
Number of painters | Time                | Speedup | Efficiency
1                  | 360                 | 1.0X    | 100%
2                  | 30 + 150 + 30 = 210 | 1.7X    | 85%
10                 | 30 + 30 + 30 = 90   | 4.0X    | 40%
100                | 30 + 3 + 30 = 63    | 5.7X    | 5.7%
Infinite           | 30 + 0 + 30 = 60    | 6.0X    | 0%
• Measure of how effectively computation resources (threads) are kept busy
• Speedup divided by the number of HW threads
Speedup and Efficiency
Slide from Marc Snir
Efficiency
• T1/(PTP)
• Efficiency is usually < 1 and decreasing
[Figure: measured efficiency vs. the ideal efficiency of 1; efficiency decreases as threads are added]
Slide from Marc Snir
Pattern mining on multicore processors
Multicore processors
• Most of today’s processors are multicore processors
  • Moore’s law still active: #transistors on chip doubles every 18 months
• But clock frequency doesn’t increase anymore…
• …so computing power comes from multiple cores on chip
• Multicore processors have
  • Independent computing cores (from 2 to 12 usually)
• Shared L3 cache
• Shared or private L2 cache
• Private L1 cache
Example – Intel Nehalem/Westmere
• 4-10 cores
• 1/2 threads per core
• Vector unit per core
• Three cache levels:
  • Private L1 (32 KB data, 32 KB instruction)
  • Private L2 (512 KB)
  • Shared L3 (16-30 MB)
• 32 nm technology
• Can be assembled in quad-chip SMPs
• ~50 Gflop/s peak!
Slide from Marc Snir
Cache hierarchy / architecture schema
hwloc library: get the architecture schema
Multicore pitfalls for the pattern miner
• Synchronization
  • Protects accesses to shared memory areas
  • Sequentializes code
  • Avoid it: tree-shaped search space exploration (more on this later)
• Load imbalance
• Bus bandwidth saturation
  • N cores / 1 bus to connect them to memory
  • Computations much faster than memory transfers
  • Or too many data transfers
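The "tree-shaped search space exploration" trick can be sketched as follows: give each worker its own subtree of the search space (here, all itemsets sharing a starting item over a toy universe ABCD, an invented example), so workers share no mutable state and need no locks at all.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

ITEMS = "ABCD"

def explore_branch(prefix):
    """Enumerate all extensions of `prefix` using only items after its last item.

    Each branch is a disjoint subtree of the search space: workers write only
    to their own result list, so no synchronization is needed.
    """
    last = ITEMS.index(prefix[-1])
    tail = ITEMS[last + 1:]
    found = [prefix]
    for size in range(1, len(tail) + 1):
        for ext in combinations(tail, size):
            found.append(prefix + "".join(ext))
    return found

# One branch per starting item: workers never touch each other's results.
with ThreadPoolExecutor(max_workers=4) as pool:
    branches = list(pool.map(explore_branch, ITEMS))

patterns = [p for branch in branches for p in branch]
print(sorted(patterns))  # all 15 non-empty subsets of ABCD
```

The price of this independence is the load imbalance below: the "A" branch holds 8 of the 15 itemsets, while the "D" branch holds only 1.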
[Figure: load imbalance: Thread 1 finishes early and sits idle while Thread 2 keeps working]
Case study #1: subtree mining / Tatikonda et al.
• Paper: Shirish Tatikonda, Srinivasan Parthasarathy:
Mining Tree-Structured Data on Multicore Systems. PVLDB 2(1): 694-705 (2009)
• Excellent illustration of the impact of bandwidth pressure on pattern mining
Subtree mining 101
• Input: Tree database where transaction = tree
• Output: frequent subtree patterns
• Relies on subtree inclusion (costly) :
• Induced subtree: preserves parent-child relationships
• Embedded subtree: preserves ancestor-descendant relationships
• Every induced subtree is an embedded subtree
[Figure: example trees illustrating induced vs. embedded subtree inclusion]
Algorithm overview
• Two primary steps
  • Candidate subtree generation
    • Generate all possible candidate subtrees
    • Challenge: search space traversal
  • Support counting
    • Evaluate each candidate for its frequency
    • Challenge: subtree isomorphism
• A recursive pattern-growth approach
  • Start with a seed pattern (a single node)
  • Repeatedly grow the pattern by adding nodes (pattern extension)
    • This step corresponds to search space traversal
  • Evaluate the frequency of the generated pattern
Pattern_mine(P)
  loop
    sup = find_frequency(P)
    if sup ≥ θ
      P' = grow P with a new node
      Pattern_mine(P')
[Figure: a seed pattern A repeatedly grown into AB, A-B, … by pattern extension]
How can we do it efficiently ?
Slide from Shirish Tatikonda
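The pseudocode above translates almost line for line into Python. This sketch applies pattern growth to itemsets rather than trees (an assumption made to keep it short; the real algorithm needs a costly subtree-isomorphism test inside find_frequency).

```python
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
ITEMS = sorted({i for t in transactions for i in t})
THETA = 2  # minimum support

def support(pattern):
    """Frequency evaluation: count transactions containing the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def pattern_mine(pattern, results):
    """Grow `pattern` one item at a time; recurse while it stays frequent."""
    last = max((ITEMS.index(i) for i in pattern), default=-1)
    for item in ITEMS[last + 1:]:         # only extend "to the right": no duplicates
        grown = pattern | {item}
        if support(grown) >= THETA:       # the sup >= theta test
            results.append(frozenset(grown))
            pattern_mine(grown, results)  # pattern extension
        # infrequent branch: pruned, nothing below it can be frequent

found = []
pattern_mine(set(), found)
print(found)  # 6 frequent patterns: A, AB, AC, B, BC, C
```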
Usual approach
• Search space exploration:
  • Represent trees with sequences (Asai et al., Zaki)
  • Explore the search space of sequences
• Subtree inclusion test
  • Costly: store found embeddings -> embedding lists
• Pre-2005 tradeoff: more memory than computing power
Bandwidth usage of TreeMiner (Zaki et al. 2005)
• Needs lots of transfers of embedding lists…
• …that have poor cache locality
• Result:
  • 1.2 GB/s bus usage per core!
  • Speedup: ~2 on 8 processors…
Reducing bandwidth usage
• Instead of storing embedding lists, recompute them on the fly
  • Post-2005, CPU is cheap, memory is expensive!
• Only process a fixed number of embeddings at a time
TreeMiner: 1.2 GB/s -> TRIPS: 200 MB/s
Other challenge: task partitioning
[Figure: the search space is partitioned into equivalence classes; each equivalence class contains many patterns; processing each pattern involves many trees]
Workload skew is present at every level
― One equivalence class may contain more patterns than another
― Processing one pattern may be more expensive than another
― One tree may be bigger than another
Challenge: load balancing in the presence of skew
Slide from Shirish Tatikonda
Key idea: adaptively and automatically adjust the type and granularity of the work that is shared among cores

Adaptive Design
• A thread pool draws work from three job pools:
  • Task pool: process multiple patterns in parallel (coarse-grain parallelism)
  • Tree pool: process a single pattern, i.e. multiple trees, in parallel
  • Column pool: process a single pattern w.r.t. a single tree in parallel
    (i.e., the dynamic programming matrix is processed in parallel; fine-grain parallelism)
• Threads context-switch to another pool when their pool is empty and work is ready elsewhere
Slide from Shirish Tatikonda
Implementation of the parallel algorithm
[Figure: miningMethod(...) checks the complexity of the current work (job-spawning condition) and spawns jobs into a tree pool or a chunk pool; a general-purpose scheduling service with a thread pool and light-weight context switching executes the task-parallel, data-parallel and chunk-parallel work]
Slide from Shirish Tatikonda
Performance – Parallel Efficiency
[Figure: speedup curves for Cslogs and Treebank, vs. number of cores on a dual quad-core system and vs. number of processors on an SGI Altix SMP system, with and without fine-grained parallelism]
1) Near-linear speedups
2) Need for fine-grain parallelism
   ― Without it, the speedups saturate
3) Memory optimizations are critical
   ― Without them, the algorithms are not scalable (speedup of 1.7 on 8 processors)
On two datasets: Cslogs (web analytics) and Treebank (computational linguistics)
Slide from Shirish Tatikonda
Constraint pattern mining

Map of pattern mining families
[Diagram summarized:]
• Specific approaches: FIS (Apriori, 93; FPGrowth, 00; LCM, 04)
• Generic pattern mining (Boley et al., 07-10; Arimura & Uno, 09): strong accessibility
• ParaMiner [DMKD, 14]: strong accessibility + decomposability
ParaMiner (2014)
ParaMiner: algorithm
ParaMiner’s initial scalability
[Figure: average latency (cycles) evolution in ParaMiner as #cores grows from 1 to 32, for GRI and FIS]
[Figure: tree-shaped exploration of itemsets (A, B, C, then AB, AC, BC, BD, CD, CE, …); each candidate extension runs a Select test over the full dataset]
[Figure: the same exploration with dataset reduction: each Select test runs on a reduced dataset (DA, DB, DC, then DAB, DAC, …) instead of the full dataset]
[Figure: dataset reduction 2.0: reduced datasets are reused across branches instead of being duplicated, further cutting memory traffic]
Problem                               | Speedup (before), max: 32 | Speedup (after), max: 32
Frequent Itemset Mining (dense data)  | 3                         | 21
Frequent Itemset Mining (sparse data) | 11                        | 25
Gradual Pattern Mining                | 27.5                      | 28.5
Closed Relational Graph Mining        | 3                         | 5
Conclusion for multicores
• Ubiquitous parallel environment
• Getting easier to program
  • C++11, futures/promises, async…
  • Java 8 streams
  • Scala actors…
• Main problem: cores contend for bus bandwidth
  • Requires designing algorithms with a small working set
• Use the right profiling tools!
  • Java -> YourKit; C++/other -> VTune, Linux hardware counter libraries
VTune (Intel)
Pattern mining on clusters
Clusters
• Homogeneous/heterogeneous network of (multicore) machines
• Network of clusters: grid -> backbone of the cloud
• Cheapest way to get tremendous
  • Computing power
  • RAM
  • Storage space
• Introduces new problems
  • Slow communications between nodes
• Data locality
• Fault tolerance
Cluster computing main environments
• MPI
• MapReduce
• Spark
MPI
• Message-passing paradigm
• The programmer has total control over communications
• One abstraction level above socket programming
• Communication primitives
  • One-to-one
  • One-to-many / one-to-all
  • Many-to-many / all-to-all
• Messages are byte arrays
• No fault tolerance
=> Powerful but hard to use correctly
MapReduce
• Based on the functional paradigm
• Two types of operation
  • Map
  • Reduce
• Hadoop also offers:
  • Distributed file system (HDFS)
  • Fault tolerance
    • For files
    • For Map/Reduce jobs
• Everything committed to disk
  • Super slow
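The Map/Reduce pair of operations can be illustrated in pure Python on the item-counting step of pattern mining: map emits (item, 1) pairs, a shuffle groups them by key, and reduce sums each group. The framework plumbing is simulated here in-process; this is a sketch of the programming model, not Hadoop code.

```python
from collections import defaultdict
from itertools import chain

transactions = [["A", "B"], ["B", "C"], ["A", "B", "C"]]

# Map: each transaction independently emits (item, 1) pairs.
def map_phase(transaction):
    return [(item, 1) for item in transaction]

# Shuffle: group values by key, as the framework does between the two phases.
groups = defaultdict(list)
for key, value in chain.from_iterable(map_phase(t) for t in transactions):
    groups[key].append(value)

# Reduce: sum the counts of each item.
counts = {item: sum(values) for item, values in groups.items()}
print(counts)  # {'A': 2, 'B': 3, 'C': 2}
```

In a real deployment each map call runs on the node holding its shard, and the shuffle moves (key, value) pairs across the network, which is where the disk and communication costs mentioned above come in.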
Spark
• Designed for iterative computations (including data mining)
  • Data stored in memory (and/or disk)
• One/two orders of magnitude faster than MapReduce
• Based on RDDs: Resilient Distributed Datasets
  • Distributed collection paradigm
  • An RDD is divided into blocks -> each block fits into a node’s RAM
  • RDD transformation operations
• Fault tolerance via reconstruction
  • Keep RDD lineage as metadata
  • Recompute lost RDDs / RDD blocks
    • Computing power is cheap!
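The lineage idea can be modeled in a few lines of Python: each dataset records its parent and the transformation that produced it, so a lost block is recomputed instead of being replicated. This ToyRDD class is an invented illustration of the principle, not Spark's API.

```python
class ToyRDD:
    """Toy dataset that records its lineage (parent + transformation)."""

    def __init__(self, data=None, parent=None, transform=None):
        self._cache = data           # may be dropped on "failure"
        self.parent = parent
        self.transform = transform

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda d: [fn(x) for x in d])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda d: [x for x in d if pred(x)])

    def collect(self):
        if self._cache is None:      # lost block: recompute from lineage
            self._cache = self.transform(self.parent.collect())
        return self._cache

base = ToyRDD(data=[1, 2, 3, 4, 5])
squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(squares.collect())   # [1, 9, 25]

squares._cache = None      # simulate losing the computed block...
print(squares.collect())   # ...recovered by recomputation: [1, 9, 25]
```

Storing only the transformation chain is far cheaper than replicating data, which is exactly the "computing power is cheap" tradeoff above.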
Mining top-k-per-item over clusters
• Web data: long tail
• Standard Frequent Itemset Mining + long tail:
Slide parts: Martin Kirchgessner / Vincent Leroy
Top-k-per-item frequent itemsets
Slide: Martin Kirchgessner / Vincent Leroy
Mining top-k-per-item itemsets
• TopPI
  • Martin Kirchgessner, Vincent Leroy et al., Grenoble-Alpes University
  • In submission
  • Computes the top-k closed frequent itemsets per item
  • Based on a heavily modified LCM
  • Multicore and MapReduce versions
• PFP: Parallel FP-Growth
  • Li et al., RecSys 2008
  • Based on FPGrowth
  • The frequent itemset miner of Mahout (MapReduce)
  • Computes at most k itemsets per item
  • Sloppy output definition
Example (TopPI)
Reminder: closure extension (Uno et al., 03)
Each branch generates different itemsets, no need for synchronization
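The closure operation itself is simple: the closure of an itemset is the intersection of all transactions that contain it. A minimal sketch on toy transactions (invented here, not TopPI's code):

```python
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C", "D"}]

def closure(itemset):
    """Closure of an itemset: intersection of all transactions containing it."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return set(itemset)          # itemset occurs nowhere: closure is itself
    result = set(covering[0])
    for t in covering[1:]:
        result &= t
    return result

print(sorted(closure({"A"})))  # ['A', 'B'] : every transaction with A also has B
print(sorted(closure({"C"})))  # ['A', 'B', 'C']
```

Closed itemsets (those equal to their closure) summarize all frequent itemsets without loss, and closure extension enumerates them so that distinct branches never generate the same closed itemset, hence the absence of synchronization noted above.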
Overview of general TopPI algorithm
TopPI over MapReduce
• Distribute branches over nodes
  • Branch = starting item
  • Each node receives a set of starting items G to process
• Distribute the top-k collector
  • Collector: for each item, a heap of size k storing the current top-k for this item
  • In a distributed setting, a node can only fill collectors for items of G
    • May not be the actual top-k of these items
• Second phase: workers get the complement top-k (items not in G)
• Merge both
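The per-item collector can be sketched with Python's heapq: a min-heap of size k per item, where a new (support, itemset) pair evicts the current k-th best only if its support is higher. Names and data are invented for illustration.

```python
import heapq
from collections import defaultdict

class TopKCollector:
    """For each item, keep a min-heap of the k best (support, itemset) pairs."""

    def __init__(self, k):
        self.k = k
        self.heaps = defaultdict(list)

    def collect(self, item, support, itemset):
        heap = self.heaps[item]
        entry = (support, itemset)
        if len(heap) < self.k:
            heapq.heappush(heap, entry)
        elif support > heap[0][0]:          # better than the current k-th best
            heapq.heapreplace(heap, entry)  # pop the min, push the newcomer

    def top_k(self, item):
        return sorted(self.heaps[item], reverse=True)

collector = TopKCollector(k=2)
for sup, its in [(5, "AB"), (3, "AC"), (8, "ABC"), (1, "AD")]:
    collector.collect("A", sup, its)

print(collector.top_k("A"))  # [(8, 'ABC'), (5, 'AB')]
```

The heap minimum (heap[0][0]) also gives a dynamic support threshold that the miner can use to prune branches that cannot enter any top-k.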
PFP
• Partition the database over the workers (shards)
• Count the frequency of all individual items & organize them into groups
  • Same as in TopPI
• Make group-dependent transactions (conditional datasets), mine each with FP-Growth
• Aggregate the discovered frequent itemsets
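The group-dependent transaction step can be sketched as follows (a simplification of PFP's projection, with an invented item ranking and grouping): for each transaction sorted by decreasing item frequency, every group it touches receives the shortest prefix that still contains all of that group's items.

```python
# Frequent items ranked by decreasing frequency, split into groups.
rank = ["B", "A", "C", "D"]              # B is the most frequent item
group_of = {"B": 0, "A": 0, "C": 1, "D": 1}

def shard(transaction):
    """Emit, per group, the transaction prefix needed to mine that group's items."""
    ordered = sorted(transaction, key=rank.index)
    shards = {}
    # Walk right to left: the first (rightmost) hit per group fixes its prefix.
    for i in range(len(ordered) - 1, -1, -1):
        g = group_of[ordered[i]]
        if g not in shards:
            shards[g] = ordered[:i + 1]
    return shards

print(shard({"A", "C", "B", "D"}))
# group 1 needs the full prefix, group 0 only up to its last item:
# {1: ['B', 'A', 'C', 'D'], 0: ['B', 'A']}
```

Each group's shards then form an independent conditional dataset that one worker mines with FP-Growth, with no communication until the final aggregation.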
Results – time comparison
LastFM: 1.2M lines x 1.2M columns (277 MB)
Supermarket: 55M lines x 400k columns (2.8 GB)
51 x [2 Xeon E5520 4-core, 24 GB RAM], 4 tasks/node
Results – TopPI speedup
Results – output comparison
Arabesque
• Arabesque: A System for Distributed Graph Mining
Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, Ashraf Aboulnaga. SOSP 2015
• Problem:
  • There are frameworks for analyzing very large graphs
    • Pregel, Giraph
  • But they are ill-adapted to frequent subgraph mining
    • Their base element is the vertex
• Arabesque http://arabesque.io
  • Embedding as the base element
  • Numerous problems fit in
Slide Mohammed J. Zaki
Conclusion for clusters
• Choose the right environment, stay alert for new approaches
  • Currently Spark is the way to go for most people
• Make your computations as independent as possible
  • -> tree-shaped search space exploration (but load imbalance risk)
  • -> limit the number of “barriers”
• Control your data partitioning
• Use the right profiling tools !
Spark UI
Pattern mining on GPUs
GPUs
• Different paradigm
  • CPU: low latency, low throughput
  • GPU: high latency, high throughput
• Many simple cores
  • Simpler control logic
  • Less cache
• Data parallelism
  • SIMD
GPUs for pattern mining
• Pattern mining’s usual approaches: task parallelism
  • Ill-adapted to GPUs
• Slow data transfers between host and GPU
• => few GPU pattern mining approaches
  • Apriori based on bitsets + a vertical format
  • FPGrowth based on an array representation of the FP-tree
Apriori vertical + bitsets on GPU
Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He, Qiong Luo:
Frequent itemset mining on graphics processors. DaMoN 2009: 34-42
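The bitset + vertical-format idea maps naturally onto bitwise operations: each item gets a bitset with one bit per transaction, and support counting becomes an AND followed by a popcount, exactly the kind of data-parallel work a GPU does well. Python ints serve as arbitrary-width bitsets in this sketch (illustrative, not the paper's code).

```python
transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B"}]

# Vertical format: for each item, a bitset whose bit t is set iff
# transaction t contains the item.
bitsets = {}
for t, transaction in enumerate(transactions):
    for item in transaction:
        bitsets[item] = bitsets.get(item, 0) | (1 << t)

def support(itemset):
    """Support = popcount of the AND of the items' bitsets."""
    bits = (1 << len(transactions)) - 1   # start from all transactions
    for item in itemset:
        bits &= bitsets[item]             # the AND a GPU does word-parallel
    return bin(bits).count("1")           # popcount

print(support({"A", "B"}))  # 2 (transactions 0 and 2)
```

On a GPU, the long bitsets are split into machine words processed by thousands of threads, which is what the paper exploits for candidate support counting.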
Comparison Apriori GPU / FPGrowth CPU
Conclusion on GPUs
• Pattern mining on GPUs?
  • Risky business
  • Most researchers who published on it changed topic…
• Novel manycore processors may be better adapted to the task
Some hints on Manycores
Manycore processors
• Middle ground between multicores and GPUs
  • ~100s of cores
• Cores are
  • Simpler than those of multicores
  • More complex than GPU cores
• Cores have
  • Full-fledged control logic
  • Some cache
Current manycores
• Intel Xeon Phi
  • 61 cores
  • 512 KB L2 cache per core
  • Extension board (same as a GPU)
• Kalray MPPA
  • 256 cores
  • 16 clusters of 16 cores, 2 MB per cluster
• Tilera GX
  • 72 cores
  • 256 KB L2 cache per core
• All: cache coherency optional
Manycores for pattern mining ?
• Better control logic than GPUs…
• …however:
  • Small caches / onboard memory (working set must be kept small)
  • Slow data transfers with the host (as of now)
• Manycores are designed for complex streaming applications:
  • May be adapted to online pattern mining
• Also designed for performance per watt
  • See: Emilio Francesquini, Márcio Bastos Castro, Pedro H. Penna, Fabrice Dupros, Henrique C. Freitas, Philippe Olivier Alexandre Navaux, Jean-François Méhaut: On the energy efficiency and performance of irregular application executions on multicore, NUMA and manycore platforms. J. Parallel Distrib. Comput. 76: 32-48 (2015)
Conclusion
• Parallelism is getting easier and easier to get into
  • -> provides the necessary gain in performance
• Performance should be used to:
  • Show that I am faster than colleagues and get papers: so 2005!
  • Extract more significant patterns
  • Allow better interactivity with analysts
    • See the KDD IDEA workshops 2013-2015
    • See our CIKM 2015 paper (Omidvar Tehrani et al.)
Backup slides