Source: people.irisa.fr/alexandre.termier/dmv/dmv_cm5.pdf

TRANSCRIPT
Pattern mining in parallel environments
Alexandre Termier
Université de Rennes 1 – IRISA – Equipe LACODAM
DMV – M2 SIF
Naive introduction
• Pattern mining: find (interesting) patterns in data
  • cf Marc’s previous courses
• Needs a lot of computing power
  • Exploration of a huge search space
  • Potentially costly pattern interest tests
• Nowadays, computing power comes from parallelism
  • Multicore processors
  • Clusters
  • GPUs
=> How to exploit parallel environments for pattern mining?
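As a quick refresher of the mining task itself, frequent itemset mining can be stated in a few lines of Python. This brute-force sketch (toy transactions and names invented here, not from the course) shows why the search space explodes: it tests every one of the 2^|items| candidate itemsets against the whole database.

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def frequent_itemsets(transactions, min_support):
    """Enumerate every itemset whose support reaches min_support (naive)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = number of transactions containing the candidate.
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
    return frequent

print(frequent_itemsets(transactions, min_support=2))
```

Real miners (Apriori, FPGrowth, LCM) prune this exponential enumeration aggressively, but the cost structure stays the same: a huge search space plus a support test per candidate.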
(Tentative) motivations
Use of computing power for pattern mining?
• Mine large datasets
  • Ex: an actual supermarket dataset ~ 4 TB
• Mine “troublesome” datasets
  • Ex: bioinformatics, SNP data: ~1,000 lines / 5,000,000 columns, 25% density
  • The best current FIS algorithms surrender around 20,000 columns (yes, LCM too)
• Mine “complex” patterns
  • Graphs: interest = subgraph isomorphism
• Make a finer-grained analysis
  • Usually means reducing the minimum support threshold…
Counter-argument
• Pattern mining outputs millions of patterns
• Few have actual value
• Why bother computing billions of patterns ?
Motivations, take two
• Many solutions to handle pattern overabundance (more on the way)
  • Post-processing
  • Constraints
  • Pattern sets (ex: KRIMP)
  • Statistics-based pattern interest functions
  • …
• Use computing power to find more interesting patterns
  • Efficient parallel pattern space exploration
  • Efficient parallel evaluation of complex pattern interest functions
  • Interactive navigation in pattern space
Parallel environments discussed in this talk
1. Multicore processors
2. Clusters
3. GPUs (a bit)
4. Manycores (some hints)
Parallel performance 101
Slides from Marc Snir – University of Illinois at Urbana-Champaign
They come from the IJCAI Tutorial on Parallel Data Mining
Sometimes Parallelism is Easy
• Painting a fence:
• Time = (picket_painting_time) * (# pickets)/#painters
• Perfect parallelism
Slide from Marc Snir
Up To a Limit
• Task granularity cannot be too small
― Too many painters spoil the fence
Slide from Marc Snir
Sometimes Parallelism Does Not Help
• How many babies can 9 women make in one month?
Slide from Marc Snir
Some Definitions
TP = compute time with P HW threads
T1 = sequential compute time
T∞ = compute time with no limit on #threads = critical path length

TP ≥ T∞ ; TP ≥ T1/P
• Efficient algorithm: TP ~ T1/P, i.e. T∞ << T1/P, i.e. P << T1/T∞
• Cannot efficiently use more HW threads than the “average width” of the computation
• Example, Amdahl’s law: a fraction α of the computation is sequential, (1-α) fully parallel
  • T∞ = αT1 ; can efficiently use ~1/α processors
  • E.g., 10% of the code is sequential -> should not use more than ~10 HW threads
Slide from Marc Snir
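Amdahl's law is easy to check numerically. This small sketch (illustrative, not from the slides) shows the speedup of a 10%-sequential program saturating near 1/α = 10 no matter how many threads are added.

```python
def amdahl_speedup(alpha, p):
    """Speedup of a program whose fraction alpha is sequential, run on p threads.

    The sequential part always costs alpha * T1; only the rest divides by p.
    """
    return 1.0 / (alpha + (1.0 - alpha) / p)

# 10% sequential code: speedup saturates near 1/alpha = 10.
for p in (1, 10, 100, 10**6):
    print(p, round(amdahl_speedup(0.10, p), 2))
```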
Speedup
• Measure of how much faster the computation executes versus the best serial code
• Serial time divided by parallel time
• Example: painting a picket fence
  • 30 minutes of preparation (serial)
  • One minute to paint a single picket
  • 30 minutes of cleanup (serial)
  • Thus, painting 300 pickets takes 360 minutes (serial time)
Speedup and Efficiency
Slide from Marc Snir
Computing Speedup

Number of painters | Time                | Speedup
1                  | 30 + 300 + 30 = 360 | 1.0X
2                  | 30 + 150 + 30 = 210 | 1.7X
10                 | 30 + 30 + 30 = 90   | 4.0X
100                | 30 + 3 + 30 = 63    | 5.7X
Infinite           | 30 + 0 + 30 = 60    | 6.0X
• Speedup = T1/TP
• Speedup ≤ P (P workers reduce time by at most a factor of P)
• Speedup ≤ T1/T∞ (Amdahl’s law)
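The fence numbers above can be reproduced directly. This sketch (parameter names invented here) computes TP, the speedup T1/TP, and the efficiency for the 300-picket fence.

```python
def fence_time(painters, pickets=300, prep=30, cleanup=30, per_picket=1):
    """Total painting time: serial prep/cleanup plus perfectly parallel painting."""
    return prep + per_picket * pickets / painters + cleanup

t1 = fence_time(1)  # 360 minutes, the serial time
for p in (1, 2, 10, 100):
    tp = fence_time(p)
    speedup = t1 / tp
    efficiency = speedup / p
    print(p, tp, round(speedup, 1), f"{efficiency:.0%}")
```

Even with infinitely many painters, the 60 serial minutes cap the speedup at 6X, which is Amdahl's law in action.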
Amdahl’s Law
Potential speedup is restricted by the serial portion
Speedup and Efficiency
Slide from Marc Snir
Speedup
• T1/TP
• Speedup is usually sub-linear and has a plateau
  • How could one have superlinear speedup?
[Figure: measured speedup vs. ideal linear speedup for 1-19 threads; the measured curve falls below the ideal and plateaus]
Slide from Marc Snir
Efficiency
Number of painters | Time                | Speedup | Efficiency
1                  | 360                 | 1.0X    | 100%
2                  | 30 + 150 + 30 = 210 | 1.7X    | 85%
10                 | 30 + 30 + 30 = 90   | 4.0X    | 40%
100                | 30 + 3 + 30 = 63    | 5.7X    | 5.7%
Infinite           | 30 + 0 + 30 = 60    | 6.0X    | 0%
• Measure of how effectively computation resources (threads) are kept busy
• Speedup divided by the number of HW threads
Speedup and Efficiency
Slide from Marc Snir
Efficiency
• T1/(PTP)
• Efficiency is usually < 1 and decreasing
[Figure: measured efficiency vs. the ideal efficiency of 1; efficiency decreases as threads are added]
Slide from Marc Snir
Pattern mining on multicore processors
Multicore processors
• Most of today’s processors are multicore processors
  • Moore’s law still active: #transistors on chip doubles every 18 months
• But clock frequency doesn’t increase anymore…
• …so computing power comes from multiple cores on chip
• Multicore processors have
  • Independent computing cores (from 2 to 12 usually)
• Shared L3 cache
• Shared or private L2 cache
• Private L1 cache
Example – Intel Nehalem/Westmere
• 4-10 cores
• 1/2 threads per core
• Vector unit per core
• Three cache levels:
  • Private L1 (32 KB data, 32 KB instruction)
  • Private L2 (512 KB)
  • Shared L3 (16-30 MB)
• 32 nm technology
• Can be assembled in quad-chip SMPs
• ~50 Gflop/s peak!
Slide from Marc Snir
Cache hierarchy / architecture schema
hwloc library: get the architecture schema
Multicore pitfalls for the pattern miner
• Synchronization
  • Protects accesses to shared memory areas
  • Sequentializes code
  • Avoid it: tree-shaped search space exploration (more on this later)
• Load imbalance
• Bus bandwidth saturation
  • N cores / 1 bus to connect them to memory
  • Computations much faster than memory transfers
  • Or too many data transfers
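The "tree-shaped search space exploration" trick can be sketched as follows: give each worker its own subtree of the search space (here, all itemsets sharing a starting item over a toy universe ABCD, an invented example), so workers share no mutable state and need no locks at all.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

ITEMS = "ABCD"

def explore_branch(prefix):
    """Enumerate all extensions of `prefix` using only items after its last item.

    Each branch is a disjoint subtree of the search space: workers write only
    to their own result list, so no synchronization is needed.
    """
    last = ITEMS.index(prefix[-1])
    tail = ITEMS[last + 1:]
    found = [prefix]
    for size in range(1, len(tail) + 1):
        for ext in combinations(tail, size):
            found.append(prefix + "".join(ext))
    return found

# One branch per starting item: workers never touch each other's results.
with ThreadPoolExecutor(max_workers=4) as pool:
    branches = list(pool.map(explore_branch, ITEMS))

patterns = [p for branch in branches for p in branch]
print(sorted(patterns))  # all 15 non-empty subsets of ABCD
```

The price of this independence is the load imbalance below: the "A" branch holds 8 of the 15 itemsets, while the "D" branch holds only 1.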
[Figure: load imbalance: Thread 1 finishes early and sits idle while Thread 2 keeps working]
Case study #1: subtree mining / Tatikonda et al.
• Paper: Shirish Tatikonda, Srinivasan Parthasarathy:
Mining Tree-Structured Data on Multicore Systems. PVLDB 2(1): 694-705 (2009)
• Excellent illustration of the impact of bandwidth pressure on pattern mining
Subtree mining 101
• Input: Tree database where transaction = tree
• Output: frequent subtree patterns
• Relies on subtree inclusion (costly) :
• Induced subtree: preserves parent-child relationships
• Embedded subtree: preserves ancestor-descendant relationships
• Every induced subtree is an embedded subtree
[Figure: example trees illustrating induced vs. embedded subtree inclusion]
Algorithm overview
• Two primary steps
  • Candidate subtree generation
    • Generate all possible candidate subtrees
    • Challenge: search space traversal
  • Support counting
    • Evaluate each candidate for its frequency
    • Challenge: subtree isomorphism
• A recursive pattern-growth approach
  • Start with a seed pattern (a single node)
  • Repeatedly grow the pattern by adding nodes (pattern extension)
    • This step corresponds to search space traversal
  • Evaluate the frequency of the generated pattern
Pattern_mine(P)
  loop
    sup = find_frequency(P)
    if sup ≥ θ
      P' = grow P with a new node
      Pattern_mine(P')
[Figure: a seed pattern A repeatedly grown into AB, A-B, … by pattern extension]
How can we do it efficiently ?
Slide from Shirish Tatikonda
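The pseudocode above translates almost line for line into Python. This sketch applies pattern growth to itemsets rather than trees (an assumption made to keep it short; the real algorithm needs a costly subtree-isomorphism test inside find_frequency).

```python
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
ITEMS = sorted({i for t in transactions for i in t})
THETA = 2  # minimum support

def support(pattern):
    """Frequency evaluation: count transactions containing the pattern."""
    return sum(1 for t in transactions if pattern <= t)

def pattern_mine(pattern, results):
    """Grow `pattern` one item at a time; recurse while it stays frequent."""
    last = max((ITEMS.index(i) for i in pattern), default=-1)
    for item in ITEMS[last + 1:]:         # only extend "to the right": no duplicates
        grown = pattern | {item}
        if support(grown) >= THETA:       # the sup >= theta test
            results.append(frozenset(grown))
            pattern_mine(grown, results)  # pattern extension
        # infrequent branch: pruned, nothing below it can be frequent

found = []
pattern_mine(set(), found)
print(found)  # 6 frequent patterns: A, AB, AC, B, BC, C
```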
Usual approach
• Search space exploration:
  • Represent trees with sequences (Asai et al., Zaki)
  • Explore the search space of sequences
• Subtree inclusion test
  • Costly: store found embeddings -> embedding lists
• Pre-2005 tradeoff: more memory than computing power
Bandwidth usage of TreeMiner (Zaki et al. 2005)
• Needs lots of transfers of embedding lists…
• …that have poor cache locality
• Result:
  • 1.2 GB/s bus usage per core!
  • Speedup: ~2 on 8 processors…
Reducing bandwidth usage
• Instead of storing embedding lists, recompute them on the fly
  • Post-2005, CPU is cheap, memory is expensive!
• Only process a fixed number of embeddings at a time
TreeMiner: 1.2 GB/s -> TRIPS: 200 MB/s
Other challenge: task partitioning
[Figure: the search space is partitioned into equivalence classes; each equivalence class contains many patterns; processing each pattern involves many trees]
Workload skew is present at every level
― One equivalence class may contain more patterns than another
― Processing one pattern may be more expensive than another
― One tree may be bigger than another
Challenge: load balancing in the presence of skew
Slide from Shirish Tatikonda
Key idea: adaptively and automatically adjust the type and granularity of the work that is shared among cores

Adaptive Design
• A thread pool draws work from three job pools:
  • Task pool: process multiple patterns in parallel (coarse-grain parallelism)
  • Tree pool: process a single pattern, i.e. multiple trees, in parallel
  • Column pool: process a single pattern w.r.t. a single tree in parallel
    (i.e., the dynamic programming matrix is processed in parallel; fine-grain parallelism)
• Threads context-switch to another pool when their pool is empty and work is ready elsewhere
Slide from Shirish Tatikonda
Implementation of the parallel algorithm
[Figure: miningMethod(...) checks the complexity of the current work (job-spawning condition) and spawns jobs into a tree pool or a chunk pool; a general-purpose scheduling service with a thread pool and light-weight context switching executes the task-parallel, data-parallel and chunk-parallel work]
Slide from Shirish Tatikonda
Performance – Parallel Efficiency
[Figure: speedup curves for Cslogs and Treebank, vs. number of cores on a dual quad-core system and vs. number of processors on an SGI Altix SMP system, with and without fine-grained parallelism]
1) Near-linear speedups
2) Need for fine-grain parallelism
   ― Without it, the speedups saturate
3) Memory optimizations are critical
   ― Without them, the algorithms are not scalable (speedup of 1.7 on 8 processors)
On two datasets: Cslogs (web analytics) and Treebank (computational linguistics)
Slide from Shirish Tatikonda
Constraint pattern mining

Map of pattern mining families
[Diagram summarized:]
• Specific approaches: FIS (Apriori, 93; FPGrowth, 00; LCM, 04)
• Generic pattern mining (Boley et al., 07-10; Arimura & Uno, 09): strong accessibility
• ParaMiner [DMKD, 14]: strong accessibility + decomposability
ParaMiner (2014)
ParaMiner: algorithm
ParaMiner’s initial scalability
[Figure: average latency (cycles) evolution in ParaMiner as #cores grows from 1 to 32, for GRI and FIS]
[Figure: tree-shaped exploration of itemsets (A, B, C, then AB, AC, BC, BD, CD, CE, …); each candidate extension runs a Select test over the full dataset]
[Figure: the same exploration with dataset reduction: each Select test runs on a reduced dataset (DA, DB, DC, then DAB, DAC, …) instead of the full dataset]
[Figure: dataset reduction 2.0: reduced datasets are reused across branches instead of being duplicated, further cutting memory traffic]
Problem                               | Speedup (before), max: 32 | Speedup (after), max: 32
Frequent Itemset Mining (dense data)  | 3                         | 21
Frequent Itemset Mining (sparse data) | 11                        | 25
Gradual Pattern Mining                | 27.5                      | 28.5
Closed Relational Graph Mining        | 3                         | 5
Conclusion for multicores
• Ubiquitous parallel environment
• Getting easier to program
  • C++11, futures/promises, async…
  • Java 8 streams
  • Scala actors…
• Main problem: cores contend for bus bandwidth
  • Requires designing algorithms with a small working set
• Use the right profiling tools!
  • Java -> YourKit; C++/other -> VTune, Linux hardware counter libraries
VTune (Intel)
Pattern mining on clusters
Clusters
• Homogeneous/heterogeneous network of (multicore) machines
• Network of clusters: grid -> backbone of the cloud
• Cheapest way to get tremendous
  • Computing power
  • RAM
  • Storage space
• Introduces new problems
  • Slow communications between nodes
• Data locality
• Fault tolerance
Cluster computing main environments
• MPI
• MapReduce
• Spark
MPI
• Message-passing paradigm
• The programmer has total control over communications
• One abstraction level above socket programming
• Communication primitives
  • One-to-one
  • One-to-many / one-to-all
  • Many-to-many / all-to-all
• Messages are byte arrays
• No fault tolerance
=> Powerful but hard to use correctly
MapReduce
• Based on the functional paradigm
• Two types of operation
  • Map
  • Reduce
• Hadoop also offers:
  • Distributed file system (HDFS)
  • Fault tolerance
    • For files
    • For Map/Reduce jobs
• Everything committed to disk
  • Super slow
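The Map/Reduce pair of operations can be illustrated in pure Python on the item-counting step of pattern mining: map emits (item, 1) pairs, a shuffle groups them by key, and reduce sums each group. The framework plumbing is simulated here in-process; this is a sketch of the programming model, not Hadoop code.

```python
from collections import defaultdict
from itertools import chain

transactions = [["A", "B"], ["B", "C"], ["A", "B", "C"]]

# Map: each transaction independently emits (item, 1) pairs.
def map_phase(transaction):
    return [(item, 1) for item in transaction]

# Shuffle: group values by key, as the framework does between the two phases.
groups = defaultdict(list)
for key, value in chain.from_iterable(map_phase(t) for t in transactions):
    groups[key].append(value)

# Reduce: sum the counts of each item.
counts = {item: sum(values) for item, values in groups.items()}
print(counts)  # {'A': 2, 'B': 3, 'C': 2}
```

In a real deployment each map call runs on the node holding its shard, and the shuffle moves (key, value) pairs across the network, which is where the disk and communication costs mentioned above come in.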
Spark
• Designed for iterative computations (including data mining)
  • Data stored in memory (and/or disk)
• One/two orders of magnitude faster than MapReduce
• Based on RDDs: Resilient Distributed Datasets
  • Distributed collection paradigm
  • An RDD is divided into blocks -> each block fits into a node’s RAM
  • RDD transformation operations
• Fault tolerance via reconstruction
  • Keep RDD lineage as metadata
  • Recompute lost RDDs / RDD blocks
    • Computing power is cheap!
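The lineage idea can be modeled in a few lines of Python: each dataset records its parent and the transformation that produced it, so a lost block is recomputed instead of being replicated. This ToyRDD class is an invented illustration of the principle, not Spark's API.

```python
class ToyRDD:
    """Toy dataset that records its lineage (parent + transformation)."""

    def __init__(self, data=None, parent=None, transform=None):
        self._cache = data           # may be dropped on "failure"
        self.parent = parent
        self.transform = transform

    def map(self, fn):
        return ToyRDD(parent=self, transform=lambda d: [fn(x) for x in d])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda d: [x for x in d if pred(x)])

    def collect(self):
        if self._cache is None:      # lost block: recompute from lineage
            self._cache = self.transform(self.parent.collect())
        return self._cache

base = ToyRDD(data=[1, 2, 3, 4, 5])
squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(squares.collect())   # [1, 9, 25]

squares._cache = None      # simulate losing the computed block...
print(squares.collect())   # ...recovered by recomputation: [1, 9, 25]
```

Storing only the transformation chain is far cheaper than replicating data, which is exactly the "computing power is cheap" tradeoff above.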
Mining top-k-per-item over clusters
• Web data: long tail
• Standard Frequent Itemset Mining + long tail:
Slide parts: Martin Kirchgessner / Vincent Leroy
Top-k-per-item frequent itemsets
Slide: Martin Kirchgessner / Vincent Leroy
Mining top-k-per-item itemsets
• TopPI
  • Martin Kirchgessner, Vincent Leroy et al., Grenoble-Alpes University
  • In submission
  • Computes the top-k closed frequent itemsets per item
  • Based on a heavily modified LCM
  • Multicore and MapReduce versions
• PFP: Parallel FP-Growth
  • Li et al., RecSys 2008
  • Based on FPGrowth
  • The frequent itemset miner of Mahout (MapReduce)
  • Computes at most k itemsets per item
  • Sloppy output definition
Example (TopPI)
Reminder: closure extension (Uno et al., 03)
Each branch generates different itemsets, no need for synchronization
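The closure operation itself is simple: the closure of an itemset is the intersection of all transactions that contain it. A minimal sketch on toy transactions (invented here, not TopPI's code):

```python
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C", "D"}]

def closure(itemset):
    """Closure of an itemset: intersection of all transactions containing it."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return set(itemset)          # itemset occurs nowhere: closure is itself
    result = set(covering[0])
    for t in covering[1:]:
        result &= t
    return result

print(sorted(closure({"A"})))  # ['A', 'B'] : every transaction with A also has B
print(sorted(closure({"C"})))  # ['A', 'B', 'C']
```

Closed itemsets (those equal to their closure) summarize all frequent itemsets without loss, and closure extension enumerates them so that distinct branches never generate the same closed itemset, hence the absence of synchronization noted above.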
Overview of general TopPI algorithm
TopPI over MapReduce
• Distribute branches over nodes
  • Branch = starting item
  • Each node receives a set of starting items G to process
• Distribute the top-k collector
  • Collector: for each item, a heap of size k storing the current top-k for this item
  • In a distributed setting, a node can only fill collectors for items of G
    • May not be the actual top-k of these items
• Second phase: workers get the complement top-k (items not in G)
• Merge both
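The per-item collector can be sketched with Python's heapq: a min-heap of size k per item, where a new (support, itemset) pair evicts the current k-th best only if its support is higher. Names and data are invented for illustration.

```python
import heapq
from collections import defaultdict

class TopKCollector:
    """For each item, keep a min-heap of the k best (support, itemset) pairs."""

    def __init__(self, k):
        self.k = k
        self.heaps = defaultdict(list)

    def collect(self, item, support, itemset):
        heap = self.heaps[item]
        entry = (support, itemset)
        if len(heap) < self.k:
            heapq.heappush(heap, entry)
        elif support > heap[0][0]:          # better than the current k-th best
            heapq.heapreplace(heap, entry)  # pop the min, push the newcomer

    def top_k(self, item):
        return sorted(self.heaps[item], reverse=True)

collector = TopKCollector(k=2)
for sup, its in [(5, "AB"), (3, "AC"), (8, "ABC"), (1, "AD")]:
    collector.collect("A", sup, its)

print(collector.top_k("A"))  # [(8, 'ABC'), (5, 'AB')]
```

The heap minimum (heap[0][0]) also gives a dynamic support threshold that the miner can use to prune branches that cannot enter any top-k.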
PFP
• Partition the database over the workers (shards)
• Count the frequency of all individual items & organize them into groups
  • Same as in TopPI
• Make group-dependent transactions (conditional datasets), mine each with FP-Growth
• Aggregate the discovered frequent itemsets
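The group-dependent transaction step can be sketched as follows (a simplification of PFP's projection, with an invented item ranking and grouping): for each transaction sorted by decreasing item frequency, every group it touches receives the shortest prefix that still contains all of that group's items.

```python
# Frequent items ranked by decreasing frequency, split into groups.
rank = ["B", "A", "C", "D"]              # B is the most frequent item
group_of = {"B": 0, "A": 0, "C": 1, "D": 1}

def shard(transaction):
    """Emit, per group, the transaction prefix needed to mine that group's items."""
    ordered = sorted(transaction, key=rank.index)
    shards = {}
    # Walk right to left: the first (rightmost) hit per group fixes its prefix.
    for i in range(len(ordered) - 1, -1, -1):
        g = group_of[ordered[i]]
        if g not in shards:
            shards[g] = ordered[:i + 1]
    return shards

print(shard({"A", "C", "B", "D"}))
# group 1 needs the full prefix, group 0 only up to its last item:
# {1: ['B', 'A', 'C', 'D'], 0: ['B', 'A']}
```

Each group's shards then form an independent conditional dataset that one worker mines with FP-Growth, with no communication until the final aggregation.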
Results – time comparison
LastFM: 1.2M lines x 1.2M columns (277 MB)
Supermarket: 55M lines x 400k columns (2.8 GB)
51 x [2 Xeon E5520 4-core, 24 GB RAM], 4 tasks/node
Results – TopPI speedup
Results – output comparison
Arabesque
• Arabesque: A System for Distributed Graph Mining
Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, Ashraf Aboulnaga. SOSP 2015
• Problem:
  • There are frameworks for analyzing very large graphs
    • Pregel, Giraph
  • But they are ill-adapted to frequent subgraph mining
    • Their base element is the vertex
• Arabesque http://arabesque.io
  • Embedding as the base element
  • Numerous problems fit in
Slide Mohammed J. Zaki
Conclusion for clusters
• Choose the right environment, stay alert for new approaches
  • Currently Spark is the way to go for most people
• Make your computations as independent as possible
  • -> tree-shaped search space exploration (but load imbalance risk)
  • -> limit the number of “barriers”
• Control your data partitioning
• Use the right profiling tools !
Spark UI
Pattern mining on GPUs
GPUs
• Different paradigm
  • CPU: low latency, low throughput
  • GPU: high latency, high throughput
• Many simple cores
  • Simpler control logic
  • Less cache
• Data parallelism
  • SIMD
GPUs for pattern mining
• Pattern mining’s usual approaches: task parallelism
  • Ill-adapted to GPUs
• Slow data transfers between host and GPU
• => few GPU pattern mining approaches
  • Apriori based on bitsets + a vertical format
  • FPGrowth based on an array representation of the FP-tree
Apriori vertical + bitsets on GPU
Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He, Qiong Luo:
Frequent itemset mining on graphics processors. DaMoN 2009: 34-42
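The bitset + vertical-format idea maps naturally onto bitwise operations: each item gets a bitset with one bit per transaction, and support counting becomes an AND followed by a popcount, exactly the kind of data-parallel work a GPU does well. Python ints serve as arbitrary-width bitsets in this sketch (illustrative, not the paper's code).

```python
transactions = [{"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B"}]

# Vertical format: for each item, a bitset whose bit t is set iff
# transaction t contains the item.
bitsets = {}
for t, transaction in enumerate(transactions):
    for item in transaction:
        bitsets[item] = bitsets.get(item, 0) | (1 << t)

def support(itemset):
    """Support = popcount of the AND of the items' bitsets."""
    bits = (1 << len(transactions)) - 1   # start from all transactions
    for item in itemset:
        bits &= bitsets[item]             # the AND a GPU does word-parallel
    return bin(bits).count("1")           # popcount

print(support({"A", "B"}))  # 2 (transactions 0 and 2)
```

On a GPU, the long bitsets are split into machine words processed by thousands of threads, which is what the paper exploits for candidate support counting.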
Comparison Apriori GPU / FPGrowth CPU
Conclusion on GPUs
• Pattern mining on GPUs?
  • Risky business
  • Most researchers who published on it changed topic…
• Novel manycore processors may be better adapted to the task
Some hints on Manycores
Manycore processors
• Middle ground between multicores and GPUs
  • ~100s of cores
• Cores are
  • Simpler than those of multicores
  • More complex than GPU cores
• Cores have
  • Full-fledged control logic
  • Some cache
Current manycores
• Intel Xeon Phi
  • 61 cores
  • 512 KB L2 cache per core
  • Extension board (same as a GPU)
• Kalray MPPA
  • 256 cores
  • 16 clusters of 16 cores, 2 MB per cluster
• Tilera GX
  • 72 cores
  • 256 KB L2 cache per core
• All: cache coherency optional
Manycores for pattern mining ?
• Better control logic than GPUs…
• …however:
  • Small caches / onboard memory (working set must be kept small)
  • Slow data transfers with the host (as of now)
• Manycores are designed for complex streaming applications:
  • May be adapted to online pattern mining
• Also designed for performance per watt
  • See: Emilio Francesquini, Márcio Bastos Castro, Pedro H. Penna, Fabrice Dupros, Henrique C. Freitas, Philippe Olivier Alexandre Navaux, Jean-François Méhaut: On the energy efficiency and performance of irregular application executions on multicore, NUMA and manycore platforms. J. Parallel Distrib. Comput. 76: 32-48 (2015)
Conclusion
• Parallelism is getting easier and easier to get into
  • -> provides the necessary gain in performance
• Performance should be used to:
  • Show that I am faster than colleagues and get papers: so 2005!
  • Extract more significant patterns
  • Allow better interactivity with analysts
    • See the KDD IDEA workshops 2013-2015
    • See our CIKM 2015 paper (Omidvar Tehrani et al.)
Backup slides