scalable clustering december 16 algorithm, big data ...€¦ · framework online-offline of...

Scalable Clustering Algorithm,

M. Ghesmoune, T. Sarazin, M. Lebbah, H. Azzag

1

BIG DATA, MACHINE LEARNING AND SOCIAL NETWORK ANALYSIS, DECEMBER 16

Outline

2

● Context

● Clustering using MapReduce

● Deal with large data sets such as streams

● Conclusion & Perspectives

3

Context

Visualization

Clustering

ExplorationVisualization

Clustering

Tutorial, Ieee bigdata 2014

Difficulties :

- Structure

- similarity measure ?

- Number of clusters ?

(Combinatory)

- Validation (unlabeled data)

- data types : categorical, mixed ...

Two alternatives

Data

Algo. LEArning

…

MapReduce / Spark

4

Algo. LEA

LEA

LEA

LEA

Massive data mining as stream mining

Spark as an alternative

[Sparks et al ICDM 2013]

5

Logistic regression

http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html

Clustering

6

Implementation : K-Means

7

data = spark.textFile("hdfs://...") .map(parsePoint)centroids = Array( Point(randX(), randY()), Point(randX(), randY()))

x

y

Compute distance with prototypes

8

x

y closestCentroid(p, centroids)

Assignment

9

x


Assignment

10

x


Map - Assignments

11

x

y val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)) )

Reduce - update of prototypes

12

x

y val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) }

Iteration 1

13

x

y val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) } pointStats.foreach{case(id, value) => centroids(id) = value._1 / value._2 }

Iteration 2

14

x

y for (i <- 1 until 10) { val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)) ) val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) } pointStats.foreach{case(id, value) => centroids(id) = value._1 / value._2 }}

15

Topological Map

Prototypewx

Why ?➢ Topological organization➢ Generalization of K-means ➢ Adapted to MapReduce (batch version )➢ Used for visualisation ➢ Used for exploration phase

MapReduce / Spark

Assignment

Quantization

map1

Reduce

map1

Row assignments Column assignment

map2

Reduce

Quantization

Reduce

SOM BiTM

https://github.com/TugdualSarazin/spark-clustering 16

Quantization

BiTM-MapReduce-SPARK

2 millions, 20 variables

2 millions, 40 variables

Two alternatives

Data

Algo. LEArning

…

MapReduce / Spark

18

Algo. LEA

LEA

LEA

LEA

Massive data mining as stream mining

Big Data as data stream

19

20


Framework online-offline of clustering data streams

21


Framework online-offline of clustering data streams

G-STREAM : GNG + Data Stream

GNG : [Fritzke 95]

― Evolutive topology― Number of cells is not fixed

22

G-STREAM and others

G-Stream: characteristics

• No initialization phase of the model,

• A graph representing the topological structure,

• Creating multiple nodes at the same time,

• One single stage (online), (no offline stage)

• Use of a reservoir.

… G-Stream

24

Data sets

25

G-Stream: Example

26

… G-Stream

G-Stream on letter4

G-Stream on DS1

G-Stream on DS2

G-Stream vs GNG online: accuracy

30

● Accuracy

G-Stream vs GNG online: RMS Error● RMS Error

31

G-Stream vs GNG online: #Nodes● Nombre de noeuds

32

Accuracy

33

NMI

34

Conclusion & perspectives● Data set size has increased significantly

○ MapReduce is crucial for some algorithms ○ Deal with large data sets such as streams

● New approach ○ Resampling & Sketching ○ Boosting & bagging [Kleiner et al ICML 2012]

35

ReferencesAriel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael I. Jordan. The Big Data Bootstrap. Proceedings of the 29th International Conference on Machine Learning (ICML-12). Pages: 1759--1766. 2012

Ghesmoune M, Azzag H, Lebbah M. (2014), «G-Stream: Growing Neural Gas over Data Stream», Neural Information Processing. Lecture Notes in Computer Science Volume 8834, 2014, pp 207-214. Kuching, Sarawak, Malaysia, 03-06 November 2014

Sarazin T, Lebbah M, Azzag H. (2014), "Biclustering using Spark-MapReduce". IEEE International Conference on Big Data. October 27-30, 2014, pp. 58-60. Washington DC, USA . (Poster).

E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. Franklin, M. I. Jordan, T. Kraska. MLI: An API for Distributed Machine Learning. International Conference on Data Mining (ICDM), 2013

36

scalable clustering december 16 algorithm, big data ...€¦ · framework online-offline of...

Documents