scalable clustering december 16 algorithm, big data ...€¦ · framework online-offline of...
TRANSCRIPT
Scalable Clustering Algorithm,
M. Ghesmoune, T. Sarazin, M. Lebbah, H. Azzag
1
BIG DATA, MACHINE LEARNING AND SOCIAL NETWORK ANALYSIS, DECEMBER 16
Outline
2
● Context
● Clustering using MapReduce
● Deal with large data sets such as streams
● Conclusion & Perspectives
3
Context
Visualization
Clustering
ExplorationVisualization
Clustering
Tutorial, Ieee bigdata 2014
Difficulties :
- Structure
- similarity measure ?
- Number of clusters ?
(Combinatory)
- Validation (unlabeled data)
- data types : categorical, mixed ...
Two alternatives
Data
Algo. LEArning
…
MapReduce / Spark
4
Algo. LEA
LEA
LEA
LEA
Massive data mining as stream mining
Spark as an alternative
[Sparks et al ICDM 2013]
5
Logistic regression
Clustering
6
Implementation : K-Means
7
data = spark.textFile("hdfs://...") .map(parsePoint)centroids = Array( Point(randX(), randY()), Point(randX(), randY()))
x
y
Compute distance with prototypes
8
x
y closestCentroid(p, centroids)
Assignment
9
x
y closestCentroid(p, centroids)
Assignment
10
x
y closestCentroid(p, centroids)
Map - Assignments
11
x
y val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)) )
Reduce - update of prototypes
12
x
y val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) }
Iteration 1
13
x
y val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) } pointStats.foreach{case(id, value) => centroids(id) = value._1 / value._2 }
Iteration 2
14
x
y for (i <- 1 until 10) { val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)) ) val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) } pointStats.foreach{case(id, value) => centroids(id) = value._1 / value._2 }}
15
Topological Map
Prototypewx
Why ?➢ Topological organization➢ Generalization of K-means ➢ Adapted to MapReduce (batch version )➢ Used for visualisation ➢ Used for exploration phase
MapReduce / Spark
Assignment
Quantization
map1
Reduce
map1
Row assignments Column assignment
map2
Reduce
Quantization
Reduce
SOM BiTM
https://github.com/TugdualSarazin/spark-clustering 16
Quantization
BiTM-MapReduce-SPARK
2 millions, 20 variables
2 millions, 40 variables
Two alternatives
Data
Algo. LEArning
…
MapReduce / Spark
18
Algo. LEA
LEA
LEA
LEA
Massive data mining as stream mining
Big Data as data stream
19
20
Big Data as data stream
Framework online-offline of clustering data streams
21
Big Data as data stream
Framework online-offline of clustering data streams
G-STREAM : GNG + Data Stream
GNG : [Fritzke 95]
― Evolutive topology― Number of cells is not fixed
22
G-STREAM and others
G-Stream: characteristics
• No initialization phase of the model,
• A graph representing the topological structure,
• Creating multiple nodes at the same time,
• One single stage (online), (no offline stage)
• Use of a reservoir.
… G-Stream
24
Data sets
25
G-Stream: Example
26
… G-Stream
G-Stream on letter4
G-Stream on DS1
G-Stream on DS2
G-Stream vs GNG online: accuracy
30
● Accuracy
G-Stream vs GNG online: RMS Error● RMS Error
31
G-Stream vs GNG online: #Nodes● Nombre de noeuds
32
Accuracy
33
NMI
34
Conclusion & perspectives● Data set size has increased significantly
○ MapReduce is crucial for some algorithms ○ Deal with large data sets such as streams
● New approach ○ Resampling & Sketching ○ Boosting & bagging [Kleiner et al ICML 2012]
35
ReferencesAriel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael I. Jordan. The Big Data Bootstrap. Proceedings of the 29th International Conference on Machine Learning (ICML-12). Pages: 1759--1766. 2012
Ghesmoune M, Azzag H, Lebbah M. (2014), «G-Stream: Growing Neural Gas over Data Stream», Neural Information Processing. Lecture Notes in Computer Science Volume 8834, 2014, pp 207-214. Kuching, Sarawak, Malaysia, 03-06 November 2014
Sarazin T, Lebbah M, Azzag H. (2014), "Biclustering using Spark-MapReduce". IEEE International Conference on Big Data. October 27-30, 2014, pp. 58-60. Washington DC, USA . (Poster).
E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. Franklin, M. I. Jordan, T. Kraska. MLI: An API for Distributed Machine Learning. International Conference on Data Mining (ICDM), 2013
36
37