scalable clustering december 16 algorithm, big data ...€¦ · framework online-offline of...

37
Scalable Clustering Algorithm, M. Ghesmoune, T. Sarazin, M. Lebbah, H. Azzag 1 BIG DATA, MACHINE LEARNING AND SOCIAL NETWORK ANALYSIS, DECEMBER 16

Upload: others

Post on 13-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Scalable Clustering Algorithm,

M. Ghesmoune, T. Sarazin, M. Lebbah, H. Azzag

1

BIG DATA, MACHINE LEARNING AND SOCIAL NETWORK ANALYSIS, DECEMBER 16

Page 2: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Outline

2

● Context

● Clustering using MapReduce

● Deal with large data sets such as streams

● Conclusion & Perspectives

Page 3: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

3

Context

Visualization

Clustering

ExplorationVisualization

Clustering

Tutorial, Ieee bigdata 2014

Difficulties :

- Structure

- similarity measure ?

- Number of clusters ?

(Combinatory)

- Validation (unlabeled data)

- data types : categorical, mixed ...

Page 4: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Two alternatives

Data

Algo. LEArning

MapReduce / Spark

4

Algo. LEA

LEA

LEA

LEA

Massive data mining as stream mining

Page 5: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Spark as an alternative

[Sparks et al ICDM 2013]

5

Logistic regression

Page 6: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Clustering

6

Page 7: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Implementation : K-Means

7

data = spark.textFile("hdfs://...") .map(parsePoint)centroids = Array( Point(randX(), randY()), Point(randX(), randY()))

x

y

Page 8: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Compute distance with prototypes

8

x

y closestCentroid(p, centroids)

Page 9: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Assignment

9

x

y closestCentroid(p, centroids)

Page 10: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Assignment

10

x

y closestCentroid(p, centroids)

Page 11: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Map - Assignments

11

x

y val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)) )

Page 12: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Reduce - update of prototypes

12

x

y val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) }

Page 13: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Iteration 1

13

x

y val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) } pointStats.foreach{case(id, value) => centroids(id) = value._1 / value._2 }

Page 14: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Iteration 2

14

x

y for (i <- 1 until 10) { val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)) ) val pointStats=closest.reduceByKey{ case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2) } pointStats.foreach{case(id, value) => centroids(id) = value._1 / value._2 }}

Page 15: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

15

Topological Map

Prototypewx

Why ?➢ Topological organization➢ Generalization of K-means ➢ Adapted to MapReduce (batch version )➢ Used for visualisation ➢ Used for exploration phase

Page 16: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

MapReduce / Spark

Assignment

Quantization

map1

Reduce

map1

Row assignments Column assignment

map2

Reduce

Quantization

Reduce

SOM BiTM

https://github.com/TugdualSarazin/spark-clustering 16

Quantization

Page 17: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

BiTM-MapReduce-SPARK

2 millions, 20 variables

2 millions, 40 variables

Page 18: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Two alternatives

Data

Algo. LEArning

MapReduce / Spark

18

Algo. LEA

LEA

LEA

LEA

Massive data mining as stream mining

Page 19: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Big Data as data stream

19

Page 20: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

20

Big Data as data stream

Framework online-offline of clustering data streams

Page 21: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

21

Big Data as data stream

Framework online-offline of clustering data streams

Page 22: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-STREAM : GNG + Data Stream

GNG : [Fritzke 95]

― Evolutive topology― Number of cells is not fixed

22

Page 23: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-STREAM and others

Page 24: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-Stream: characteristics

• No initialization phase of the model,

• A graph representing the topological structure,

• Creating multiple nodes at the same time,

• One single stage (online), (no offline stage)

• Use of a reservoir.

… G-Stream

24

Page 25: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Data sets

25

Page 26: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-Stream: Example

26

… G-Stream

Page 27: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-Stream on letter4

Page 28: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-Stream on DS1

Page 29: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-Stream on DS2

Page 30: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-Stream vs GNG online: accuracy

30

● Accuracy

Page 31: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-Stream vs GNG online: RMS Error● RMS Error

31

Page 32: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

G-Stream vs GNG online: #Nodes● Nombre de noeuds

32

Page 33: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Accuracy

33

Page 34: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

NMI

34

Page 35: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

Conclusion & perspectives● Data set size has increased significantly

○ MapReduce is crucial for some algorithms ○ Deal with large data sets such as streams

● New approach ○ Resampling & Sketching ○ Boosting & bagging [Kleiner et al ICML 2012]

35

Page 36: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

ReferencesAriel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Michael I. Jordan. The Big Data Bootstrap. Proceedings of the 29th International Conference on Machine Learning (ICML-12). Pages: 1759--1766. 2012

Ghesmoune M, Azzag H, Lebbah M. (2014), «G-Stream: Growing Neural Gas over Data Stream», Neural Information Processing. Lecture Notes in Computer Science Volume 8834, 2014, pp 207-214. Kuching, Sarawak, Malaysia, 03-06 November 2014

Sarazin T, Lebbah M, Azzag H. (2014), "Biclustering using Spark-MapReduce". IEEE International Conference on Big Data. October 27-30, 2014, pp. 58-60. Washington DC, USA . (Poster).

E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. Franklin, M. I. Jordan, T. Kraska. MLI: An API for Distributed Machine Learning. International Conference on Data Mining (ICDM), 2013

36

Page 37: Scalable Clustering DECEMBER 16 Algorithm, BIG DATA ...€¦ · Framework online-offline of clustering data streams. 21 Big Data as data stream Framework online-offline of clustering

37