
Page 1: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit


Data Streaming for Autonomic Computing in the EGEE framework

Xiangliang Zhang, Cyril Furtlehner, Michèle Sebag

TAO - INRIA CNRS, Université de Paris-Sud, F-91405 Orsay Cedex, France


Contents

1 Motivation
  Motivation: Autonomic Computing
  Introduction of Affinity Propagation (AP)

2 Hierarchical AP (Hi-AP): Clustering Large-scale Data
  Hi-AP Algorithm
  Hi-AP Application on EGEE Grid logs

3 StrAP: Clustering Streaming Data
  Challenges and Related Work
  StrAP Algorithm
  StrAP Application on Intrusion Detection (KDD99 data)
  A StrAP-based Real-time Online Grid Monitoring System

4 Conclusion and Perspectives


Motivations of Autonomic Computing


Goals of Autonomic Computing

AUTONOMIC VISION & MANIFESTO
http://www.research.ibm.com/autonomic/manifesto/

Self-managing system with the ability of

Self-healing: detect, diagnose and repair problems

Self-configuring: automatically incorporate and configure components

Self-optimizing: ensure optimal functioning with respect to defined requirements

Self-protecting: anticipate and defend against security breaches

Data Mining for Autonomic Computing


Autonomic Grid Computing System

EGEE: Enabling Grids for E-sciencE, http://www.eu-egee.org

EGEE User Forum: annual event since 2007


Job stream monitoring by clustering

Goal: summarize the large-scale, fast-arriving data.

provide a compact description

help to find interesting patterns

classify the incoming data

Challenges:

Large size: save all the data and process them as a whole? This requires huge disk, CPU, and memory (impossible for data of size GB, TB, even PB). Process the data part by part? Then how can global optimization be guaranteed?

Changing distribution: for time-ordered data, how can the clusters keep tracking the evolving data?


What is Clustering ?

unsupervised learning method

group similar points together in the same group (cluster)

widely used on various problems: discovery of interesting groups, data structure presentation, data classification, data compression, dimensionality reduction or feature selection

many clustering methods are available, e.g., hierarchical clustering methods, density-based methods (DBSCAN), partitioning methods (k-means)


Our requirements of clustering method

No need to set the number K of clusters (a double-edged sword)

global optimization of the clustering result: not locally optimized by a greedy approach

stable clustering result: not affected by the initialization

real data points as representative exemplars (cluster centers): suits application fields where averaged centers are meaningless, e.g. molecules, or jobs described by categorical attributes


Affinity Propagation (AP) (Frey & Dueck, Science 2007)


Iterations of Message passing in AP


Introduction of AP

input:

Data: $x_1, x_2, \ldots, x_N$; Distance: $d(x_i, x_j)$

find:

$\sigma: x_i \mapsto \sigma(x_i)$, the exemplar representing $x_i$, such that

$$\max_{\sigma} \sum_{i=1}^{N} S(x_i, \sigma(x_i))$$

where

$S(x_i, x_j) = -d^2(x_i, x_j)$ if $i \neq j$

$S(x_i, x_i) = -s^*$, with $s^*$ a user-defined parameter (penalty)

$s^* = \infty$: only one exemplar (one cluster)

$s^* = 0$: every point is an exemplar (N clusters)
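
The objective above can be made concrete with a small sketch. It assumes Euclidean distance and NumPy; the function names `similarity_matrix` and `net_similarity` and the toy data are illustrative only and do not come from the slides.

```python
import numpy as np

def similarity_matrix(X, s_star):
    """S[i, j] = -d^2(x_i, x_j) for i != j, and S[i, i] = -s_star."""
    diff = X[:, None, :] - X[None, :, :]
    S = -np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(S, -s_star)
    return S

def net_similarity(S, sigma):
    """Objective sum_i S(x_i, sigma(x_i)) for an exemplar assignment sigma."""
    return S[np.arange(len(sigma)), sigma].sum()

# Toy data (hypothetical): three points on a line.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
S = similarity_matrix(X, s_star=20.0)

one_cluster = net_similarity(S, np.array([1, 1, 1]))    # everyone picks the middle point
all_exemplars = net_similarity(S, np.array([0, 1, 2]))  # every point is its own exemplar
```

With a large penalty the single-cluster assignment scores higher than the all-exemplars one, which is the trade-off that $s^*$ controls.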


AP: a message passing algorithm


Message passed

$$r(i, k) = S(x_i, x_k) - \max_{k', k' \neq k} \{ a(i, k') + S(x_i, x_{k'}) \}$$

$$r(k, k) = S(x_k, x_k) - \max_{k', k' \neq k} \{ S(x_k, x_{k'}) \}$$

$$a(i, k) = \min \Big\{ 0, \; r(k, k) + \sum_{i', i' \neq i, k} \max\{0, r(i', k)\} \Big\}$$

$$a(k, k) = \sum_{i', i' \neq k} \max\{0, r(i', k)\}$$

The index of the exemplar $\sigma(x_i)$ associated to $x_i$ is finally defined as:

$$\sigma(x_i) = \arg\max \{ r(i, k) + a(i, k), \; k = 1 \ldots N \}$$
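
A minimal NumPy sketch of these message updates follows. Two choices in it are assumptions rather than something stated on the slide: the messages are damped (factor `damping`) and run for a fixed number of iterations, and the responsibility rule is applied uniformly to all entries including the diagonal, as is common in vectorized AP code. The function name and toy data are illustrative.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, n_iter=200):
    """Return, for each point i, the index argmax_k r(i,k) + a(i,k)."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    idx = np.arange(n)
    for _ in range(n_iter):
        # r(i,k) = S(i,k) - max_{k' != k} (a(i,k') + S(i,k'))
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[idx, best]
        AS[idx, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, best] = S[idx, best] - second
        R = damping * R + (1 - damping) * R_new  # damping: an assumption

        # a(i,k) = min(0, r(k,k) + sum_{i' != i,k} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[idx, idx].copy()
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new
    return (A + R).argmax(axis=1)

# Toy data (hypothetical): two well-separated groups of three points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
              [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
S = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
np.fill_diagonal(S, -20.0)  # S(x_i, x_i) = -s*, here s* = 20
exemplars = affinity_propagation(S)
```

Each entry of `exemplars` is the index of the real data point chosen to represent that point, so no averaged center is ever computed.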


Summary of AP

Affinity Propagation (AP)

A clustering method

Converges by iterations of message passing

No need of K (the number of clusters)

Real points as exemplars

An application of belief propagation (simplified graph + message passing)

Cons

Computational complexity problems

Similarity computation: $O(N^2)$

Message passing: $O(N^2 \log N)$


Hierarchical AP

Divide-and-conquer (inspired by Guha et al, TKDE2003)


Weighted AP

AP → WAP

$x_i$ → $(x_i, n_i)$

$S(x_i, x_j) \longrightarrow n_i \times S(x_i, x_j)$: price for $x_i$ to select $x_j$ as an exemplar

$S(x_i, x_i) \longrightarrow S(x_i, x_i) + (n_i - 1) \times \epsilon$: price to select $x_i$ as exemplar, where $\epsilon$ is the variance of the $n_i$ points

Proposition

WAP ≡ AP with duplications (aggregations)
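
The two WAP rules can be sketched as a transformation of an ordinary similarity matrix. This is only an illustration: the function name `weighted_similarity` is hypothetical, `eps` is taken as one given value per aggregated point, and the sign convention is copied literally from the slide.

```python
import numpy as np

def weighted_similarity(S, n, eps):
    """Apply the WAP rules to a similarity matrix S.

    n[i]   : number of original points aggregated into x_i
    eps[i] : variance term for the n[i] points behind x_i (assumed given)
    """
    W = n[:, None] * S                              # S(x_i, x_j) -> n_i * S(x_i, x_j)
    np.fill_diagonal(W, np.diag(S) + (n - 1) * eps) # S(x_i, x_i) -> S(x_i, x_i) + (n_i - 1) * eps
    return W

S = np.array([[-5.0, -4.0],
              [-4.0, -5.0]])
n = np.array([3.0, 1.0])    # x_1 stands for three points, x_2 for one
eps = np.array([0.5, 0.0])
W = weighted_similarity(S, n, eps)
```

The result `W` can then be fed to any AP routine that accepts a precomputed similarity matrix.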


Hierarchical AP

Complexity of Hi-AP is $O(N^{3/2})$ (X. Zhang et al, ECML/PKDD 2008)

NB: can be iteratively reduced to $O(N^{1+\gamma})$ (X. Zhang et al, SIGKDD 2009)
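
A rough way to see where an exponent of 3/2 can come from, as a sketch under stated assumptions rather than the authors' derivation: suppose AP on $n$ points costs on the order of $n^2$, and the $N$ points are split into $\sqrt{N}$ subsets of $\sqrt{N}$ points each. The split size is an assumption of this sketch, and the cost of the second-level WAP on the collected exemplars is ignored.

```latex
\underbrace{\sqrt{N}}_{\text{subsets}} \times
\underbrace{O\big((\sqrt{N})^{2}\big)}_{\text{AP on one subset}}
= \sqrt{N} \times O(N)
= O\big(N^{3/2}\big)
```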


Validation of Hi-AP on EGEE jobs

EGEE (Enabling Grids for E-sciencE)

Grid Observatory: http://www.grid-observatory.org/

description of jobs (237,087)

4 numeric features: duration of execution

1 symbolic feature: name of queue


Evaluation: Distortion

$$D([\sigma]) = \sum_{i=1}^{N} d^2(x_i, \sigma(x_i))$$

[Figure: distortion (up to about 2.2 × 10^5) versus number of clusters K (50 to 300) for hierarchical K-centers, Hi-AP simple, and Hi-AP, on the 237,087 jobs.]

Running time: 10 min on an Intel 2.66 GHz Dual-Core PC with 2 GB memory.

Hi-AP has the lowest distortion compared to the baseline method.
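
For reference, the distortion criterion is straightforward to compute once each point has an exemplar index. The sketch below assumes Euclidean distance and NumPy; the function name and toy data are illustrative.

```python
import numpy as np

def distortion(X, sigma):
    """D([sigma]) = sum_i d^2(x_i, sigma(x_i)), with sigma[i] the exemplar index of x_i."""
    return float(np.sum((X - X[sigma]) ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
d_one = distortion(X, np.array([1, 1, 1]))  # all points represented by the middle one
d_all = distortion(X, np.array([0, 1, 2]))  # every point is its own exemplar
```

Distortion always drops as K grows, which is why the comparison in the figure is made at equal numbers of clusters.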


Challenges of Stream Clustering

Data stream:

a real-time, continuous, ordered sequence of items arriving at a very high speed (Golab & Özsu, SIGMOD 2003)

e.g., network traffic data, sensor network monitoring data

Data streams clustering

Provide compact description of data flow

Incremental model updating

No specified number of clusters

Process in real-time

Available results at any time


Related works

Divide-and-conquer strategy (Guha et al, TKDE 2003): fixed segmentation window, hence not able to handle a changing distribution


A two-level scheme (Aggarwal et al, VLDB 2003)

online level to summarize the evolving data stream

offline level to generate the clusters using the summary

a clustering method is used to get the initial micro-clusters and the final clusters, e.g., the density-based clustering method DBSCAN (Cao et al, SDM 2006)

Problem: the online clustering model is not provided, or is only available when it is required by users.


Stream clustering

[Figure: incoming stream items, the current model, and the reservoir.]


Does $x_t$ fit the current model?

if yes, update the model

otherwise, go to reservoir


Has the distribution changed?

CHANGE TEST

if yes, rebuild the model

otherwise, continue


StrAP Method

data stream → data streaming process system → models $\{e_i, n_i, \Sigma_i, t_i\}$

Does $x_t$ fit the current model?

if yes, update the model: update the weight with time decay (decay window ∆)

otherwise, go to reservoir

Has the distribution changed?

if yes, rebuild the model from the current model and the reservoir by WAP

otherwise, continue
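
The fit-or-reservoir step can be sketched as follows. Several parts are assumptions made for illustration and are not given on the slide: the fit test is a distance threshold `eps` to the nearest exemplar, the decay rule multiplies the old weight by ∆ / (∆ + age) before counting the new item, and the $\Sigma_i$ term of the model is left out. The change test and the WAP rebuild are not shown.

```python
import math

def nearest(model, x):
    """Return (index, distance) of the exemplar closest to x."""
    dists = [math.dist(m["e"], x) for m in model]
    i = min(range(len(model)), key=dists.__getitem__)
    return i, dists[i]

def strap_step(model, reservoir, x, t, eps, delta):
    """Process item x arriving at time t; return True if x fit the model."""
    i, d = nearest(model, x)
    if d < eps:  # hypothetical fit test: close enough to an exemplar
        m = model[i]
        # Hypothetical decay rule with decay window delta, then count x.
        m["n"] = m["n"] * delta / (delta + (t - m["t"])) + 1.0
        m["t"] = t
        return True
    reservoir.append(x)  # outlier: kept for a later rebuild
    return False

# Model entries hold exemplar e_i, weight n_i and last-update time t_i.
model = [{"e": (0.0, 0.0), "n": 1.0, "t": 0}]
reservoir = []
fit_a = strap_step(model, reservoir, (0.1, 0.0), t=1, eps=1.0, delta=10.0)
fit_b = strap_step(model, reservoir, (5.0, 5.0), t=2, eps=1.0, delta=10.0)
```

Items that do not fit accumulate in `reservoir`, which is what the rebuild step later combines with the current exemplars.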


Rebuild the model?

when the reservoir is full

when changes are detected: Page-Hinkley statistic (Cumulative-Sum-like test) (Page, Biometrika 1954; Hinkley, Biometrika 1971)

[Figure: $p_t$, $\bar{p}_t$, $m_t$ and $M_t$ versus time $t$ (0 to 1000), for a $p_t$ with changing distribution.]

$$\bar{p}_t = \frac{1}{t} \sum_{\ell=1}^{t} p_\ell$$

$$m_t = \sum_{\ell=1}^{t} (p_\ell - \bar{p}_\ell + \delta)$$

$$M_t = \max\{m_\ell\}$$

$$PH_t = M_t - m_t$$

if $PH_t > \lambda$, a change is detected

How to set λ?
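
The Page-Hinkley quantities above translate almost line for line into a running computation. The sketch below follows the formulas as written on the slide, taking the maximum over all $m_\ell$ seen so far; the function name and the example sequence are illustrative.

```python
def page_hinkley(p, delta, lam):
    """Return the first 1-based time t at which PH_t > lam, or None."""
    total = 0.0          # running sum of p_l, used for the mean p-bar_t
    m = 0.0              # m_t
    M = float("-inf")    # M_t = max of m_l seen so far
    for t, p_t in enumerate(p, start=1):
        total += p_t
        mean = total / t
        m += p_t - mean + delta
        M = max(M, m)
        if M - m > lam:  # PH_t = M_t - m_t
            return t
    return None

# Hypothetical signal: the observed variable drops from 1 to 0 at t = 6.
change_at = page_hinkley([1.0] * 5 + [0.0] * 5, delta=0.01, lam=1.0)
no_change = page_hinkley([0.5] * 10, delta=0.01, lam=1.0)
```

A larger λ delays detection and reduces false alarms, which is the trade-off behind the question of how to set it.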


Setting of λ

fixed empirical value (X. Zhang et al, ECML/PKDD 2008)

self-adaptive change detection test (X. Zhang et al, SIGKDD 2009)

Self-adapt λ ≡ An optimization problem

BIC:

$$F_\lambda = \frac{1}{|C|} \sum_{i=1}^{|C|} \left( \frac{1}{n_i} \sum_{e_j \in C_i} d(e_j, e_i^*) \right) + \varphi \frac{\rho}{2} \log N + \eta O_t$$

∝ loss + size of model + percentage of outliers

OPTIMIZATION:

ε-greedy search from a finite set of λ values: $\lambda = \arg\min \{ E(F_\lambda) \}$

λ1         λ2         λ3         λ4         ...
E(F_λ1)    E(F_λ2)    E(F_λ3)    E(F_λ4)    ...

Gaussian Process Regression based on $\{\lambda_i, F_{\lambda_i}\}$: a continuous value of λ is generated
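
The ε-greedy choice over a finite set of λ values can be sketched as below. It assumes the running estimates of $E(F_\lambda)$ are kept in a dictionary; the function name, the candidate values and the costs are hypothetical, and the Gaussian Process variant is not shown.

```python
import random

def epsilon_greedy_lambda(lambdas, mean_cost, epsilon, rng):
    """Pick a lambda: explore with probability epsilon, else take the argmin of E(F_lambda)."""
    if rng.random() < epsilon:
        return rng.choice(lambdas)  # exploration: any candidate
    return min(lambdas, key=lambda lam: mean_cost[lam])  # exploitation

lambdas = [10.0, 20.0, 40.0, 80.0]                         # hypothetical candidates
mean_cost = {10.0: 0.9, 20.0: 0.4, 40.0: 0.6, 80.0: 1.2}   # hypothetical E(F_lambda) estimates
rng = random.Random(0)

greedy_pick = epsilon_greedy_lambda(lambdas, mean_cost, epsilon=0.0, rng=rng)
explore_pick = epsilon_greedy_lambda(lambdas, mean_cost, epsilon=1.0, rng=rng)
```

After each rebuild the observed $F_\lambda$ would be folded back into `mean_cost`, so the estimates improve as the stream is processed.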


Validation of StrAP on KDD99 data

Data used

Real world data: KDD99 data

intrusion detection benchmark

494,021 network connection records in $\mathbb{R}^{34}$

23 classes: 1 normal + 22 attacks

Baseline: DenStream (Cao et al., SDM 2006)

Performance indicators (supervised setting)

Clustering accuracy

Clustering purity

KDD Cup 1999 data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
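The slide names the two indicators without defining them. The sketch below uses one common reading, stated here as an assumption: each cluster is labeled with its majority class, purity is the average over clusters of the majority-class share, and accuracy is the share of all points whose class matches their cluster's label. The function name `purity_and_accuracy` is hypothetical.

```python
from collections import Counter, defaultdict


def purity_and_accuracy(cluster_ids, class_labels):
    """Return (purity, accuracy) under the majority-class assumption."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster].append(label)

    # Size of the majority class inside each cluster, and each cluster's size.
    majority = [Counter(labels).most_common(1)[0][1] for labels in members.values()]
    sizes = [len(labels) for labels in members.values()]

    purity = sum(m / s for m, s in zip(majority, sizes)) / len(sizes)
    accuracy = sum(majority) / sum(sizes)
    return purity, accuracy
```

The two numbers differ when clusters have unequal sizes: purity weights every cluster equally, while accuracy weights every point equally.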


Accuracy and Purity along time

Error rate along time < 2%

[Figure: error rate (%) against time steps (0 to 5 × 10^5), with restart points marked.]

Higher clustering purity than DenStream

[Figure: cluster purity (%) per time window (1 to 4), comparing StrAP with ∆=15000, StrAP with ∆=5000, and DenStream.]


Discussion

StrAP vs DenStream

Pros

better accuracy

True detection rate: 99.18%

False alarm rate: 1.39%

Online error rate < 2%

model available at any time

Cons

running time: DenStream 7 seconds, StrAP 7 minutes


Multi-scale Realtime Grid Monitoring System


[Figure: bar chart of the percentage of jobs assigned (%) to each cluster (1 to 5) and to the outliers; each cluster is labeled with its exemplar, shown as a job vector, e.g. [7 0 0 0 0 0].]


[Figure: percentage of jobs (%) per day (days 0 to 160), distribution of jobs in cluster [7 0 0 0 0 0].]

[Figure: percentage of jobs (%) per day (days 0 to 160), distribution of jobs in cluster [0 0 0 0 0 0].]


Experimental Data

EGEE logs of 39 Resource Brokers (RBs) over 5 months (2006-01-01 to 2006-05-31)

5,268,564 jobs

for each job:

its final status (good, or type of error)

6 features describing the time cost of the services in the job lifecycle


Experimental Results: Online Monitoring outputs


Real-time Monitoring: when change detected

Online summary of the streaming jobs into clusters:

[Figure: bar chart of the percentage of jobs assigned (%) to each cluster (1 to 5) and to the reservoir; each cluster is labeled with its exemplar, shown as a job vector, e.g. [7 0 0 0 0 0].]


[Figure: bar chart of the percentage of jobs assigned (%) to each cluster (1 to 8) and to the reservoir; exemplars are shown as job vectors, including [0 0 0 0 0 0] and [7 0 0 0 0 0]. Annotation: LogMonitor is getting clogged.]


Clustering Accuracy

[Figure: accuracy (%) against time step (0 to 5 × 10^6), comparing StrAP with PH threshold λ_t and streaming k-centers.]

10% higher than the baseline method (streaming k-centers)


Discussion

Real-time quality (330K jobs/day):

tested on an Intel 2.66 GHz dual-core PC with 2 GB of memory

10k jobs per minute when coded in Matlab

60k jobs per minute when coded in C/C++

concise online summary of the streaming jobs, with

proportion of defects

performance of the grid services
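The real-time claim can be checked with simple arithmetic on the figures quoted above. The snippet below is only that arithmetic; the function name `headroom` is an assumption for the example, and it treats the 330K jobs/day as spread evenly over the day, which ignores bursts.

```python
MINUTES_PER_DAY = 24 * 60


def headroom(arrivals_per_day: float, capacity_per_minute: float) -> float:
    """Ratio of processing capacity to the average arrival rate."""
    arrivals_per_minute = arrivals_per_day / MINUTES_PER_DAY
    return capacity_per_minute / arrivals_per_minute


matlab_headroom = headroom(330_000, 10_000)   # Matlab figure from the slide
cpp_headroom = headroom(330_000, 60_000)      # C/C++ figure from the slide
```

330K jobs/day is about 229 jobs per minute on average, so both implementations process jobs well over an order of magnitude faster than they arrive.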


Experimental Results: Offline Analysis


Large time-scale Monitoring: Global view

the historical behavior of interesting exemplars

without prior knowledge about failure patterns

summarizing gigabytes of data


Bad Super Exemplars: day view

[Figure: heat map of super clusters (1 to 20) against days (20 to 140), with a color scale from 0 to 90%.]

"early stopped error": who and when?

Date      Jan 7-13    Jan 30 - Feb 3    Mar 16-21    May 17-19
UserID    A1          A1                B1           D1 and A1


Discussion and Conclusion

real-time monitoring of Grid job streams

providing multi-scale models to describe the status of the Grid

proportion of the different types of job patterns (real-time view, day view, week view, ...)

rupture steps

offline global analysis

good quality clustering is guaranteed


Conclusion, Algorithm

Scalability: Hi-AP

Reduce complexity from $O(N^2)$ to $O(N^{3/2})$

Iteratively reduce toward $O(N^{1+\gamma})$

Stream clustering: StrAP

Framework of processing the streaming data

Hybridized with an efficient change detection method, Page-Hinkley

Model available at any time

BUT: slower than DenStream
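The $O(N^{3/2})$ figure quoted for Hi-AP can be motivated by a short cost count. This is only a sketch under assumptions that are not stated on this slide: the $N$ points are split into $\sqrt{N}$ subsets of $\sqrt{N}$ points each, AP costs a quadratic number of operations in the size of its input, and each subset yields at most $K$ exemplars, with $K$ independent of $N$.

```latex
\begin{align*}
\text{AP on all subsets:} \quad & \sqrt{N} \cdot O\big((\sqrt{N})^2\big) = O(N^{3/2}) \\
\text{AP on the exemplars:} \quad & O\big((K\sqrt{N})^2\big) = O(K^2 N) \\
\text{total:} \quad & O(N^{3/2}) + O(K^2 N) = O(N^{3/2}) \quad \text{for fixed } K
\end{align*}
```

Under these assumptions the first stage dominates, which is consistent with the reduction from $O(N^2)$ listed above.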


Conclusion, Application

Network Intrusion Detection (KDD99 data)

clustering in a single scan of the data

using less than 1% of the data to build the model (active learning)

high clustering and classification accuracy

Autonomic Grid Computing

real-time grid monitoring system

visualized online output describing grid running status

offline output for historical performance analysis

multi-scale analysis of system behaviors


Ongoing work

Flexible Clustering Methods

Fixed number of clusters by message passing

Arbitrary-shape clusters by message passing

Comprehensive model of streaming data

using several representative exemplars covering the cluster, instead of one center point

Online Learning

Assess the alarm level attached to a given model

criticality of a cluster based on its frequency over time

User profiling

the clusters -> new features -> describe the users (viewing a user as a set of clusters)


Thank you for your attention.

Xiangliang ZHANG

[email protected]

http://www.lri.fr/~xlzhang