distributed clustering from data streams
Post on 14-Jun-2015
Distributed Clustering for Smart Grids
Pedro Rodrigues, João Gama
University of Porto, Portugal
Project KDUS (PTDC/EIA-EIA/98355/2008)
4 September 2011, NGDM '11
Smart Grids
Smart Grids: monitoring information on top of the electrical grid
Internet-like communications layer
A shift in the way in which power grids are operated
Intelligent monitoring in real time
Interactive with consumers and markets
Optimized to make the best use of resources and equipment
Predictive rather than reactive
Distributed across geographical and organizational boundaries
Smart Grids and Data Mining
A smart grid forms a (possibly decomposable) network of distributed sources of high-speed data streams.
The dynamics of data are unknown:
the topology of network changes over time,
the number of meters tends to increase and
the context where the meter acts evolves over time.
Several data mining tasks are involved: prediction, cluster (profiling) analysis, event and anomaly detection, correlation analysis, etc.
All these characteristics constitute real challenges and opportunities for applied research in distributed data mining.
The requirements of near real-time analysis over multiple time horizons and multiple spatial aggregations make these analyses an even harder research challenge.
Outline
Rationale
Clustering distributed data streams
Local-to-Global Clustering of data sources
Sensors are usually small, low-cost devices capable of sensing some attribute and of communicating with other sensors.
Sensor networks can include thousands of sensors, each one being capable of measuring, analysing and transmitting a stream of data.
Resources are scarce, which limits the possibilities for heavy computation, and sensors operate under limited bandwidth.
Rationale Sensor Networks
Comprehension
Extract information about global interaction between sources by looking at the data they produce.
When no other information is available, usual knowledge discovery approaches are based on unsupervised techniques (e.g. clustering).
However, two different stream clustering problems exist:
clustering streaming data points (e.g. meter readings)
clustering streaming data sources (e.g. meters)
Rationale Comprehension of Ubiquitous Data Streams
Information about dense regions of the sensor data space.
[Figure: clusters labelled A, B and C]
Rationale Comprehension by Clustering Data Points
Information about groups of sensors that behave similarly over time.
Possible scenario
Sensors collecting electricity demand data from different homes, exploring similar consumption patterns.
[Figure: clusters labelled A, B and C]
Rationale Comprehension by Clustering Data Sources
Setting
Sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data)
Objective
To keep a clustering of the observations that are created by aggregating each node's data as a feature in a centralized stream.
[Figure: clusters labelled A, B and C]
DGClust Setting and Objective
Problems
high-speed data streams → excessive storage and processing
widely spread network → heavy communication
centralized clustering → high dimensionality
dynamic data → outdated models
Research Question
Do local discretization and representative clustering improve validity, communication load and computation load when applied to distributed sensor data streams?
DGClust Problems and Research Question
DGClust – Distributed Grid Clustering (Local Step)
Each sensor keeps an online ordinal discretization of its data.
Partition Incremental Discretization
[Figure: a sensor's current state on its discretization grid, e.g. "low" or "D"]
DGClust Methodology : Local Step
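
The local step can be pictured with a much simpler stand-in than Partition Incremental Discretization. The sketch below is not that algorithm: it assumes a fixed equal-width grid over a known value range, and the class name, range and number of bins are illustrative choices only. It shows how a sensor could map each reading to an ordinal state and detect when that state changes.

```python
class EqualWidthDiscretizer:
    """Maps each reading of one sensor to an ordinal bin index (its state).

    Assumption: a fixed equal-width grid, used here only as a stand-in
    for the incremental discretization named on the slide.
    """

    def __init__(self, low, high, n_bins):
        self.low = low
        self.high = high
        self.n_bins = n_bins
        self.state = None  # last bin index seen

    def update(self, value):
        """Return (state, changed) after observing one reading."""
        width = (self.high - self.low) / self.n_bins
        index = int((value - self.low) // width)
        index = max(0, min(self.n_bins - 1, index))  # clamp out-of-range readings
        changed = index != self.state
        self.state = index
        return index, changed


sensor = EqualWidthDiscretizer(low=0.0, high=100.0, n_bins=4)
states = [sensor.update(v) for v in [3.0, 12.0, 60.0, 97.0, 99.0]]
```

Only the readings that move the sensor into a different bin produce a state change, which is what the later steps rely on.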
DGClust – Distributed Grid Clustering (Aggregating Step)
The central server gathers the global state of the network.
Sensors whose state has not changed since the last communication do not transmit to the server.
[Figure: sensors report their discretized states (low, high, A, B, D) and the server combines them into one global state]
DGClust Methodology : Aggregating Step
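
A minimal sketch of the aggregating step, assuming a direct link between every sensor and the server. The names `CentralServer` and `run_round` are hypothetical; the point is only that a sensor sends a message when its state differs from the one it last sent, and the server's global state is the tuple of the latest known states.

```python
class CentralServer:
    """Holds the latest known state of every sensor (hypothetical helper)."""

    def __init__(self, n_sensors):
        self.global_state = [None] * n_sensors
        self.messages_received = 0

    def receive(self, sensor_id, state):
        self.global_state[sensor_id] = state
        self.messages_received += 1


def run_round(server, last_sent, current_states):
    """Each sensor transmits only if its state differs from what it last sent."""
    for sensor_id, state in enumerate(current_states):
        if last_sent[sensor_id] != state:
            server.receive(sensor_id, state)
            last_sent[sensor_id] = state
    return tuple(server.global_state)


server = CentralServer(n_sensors=3)
last_sent = [None, None, None]
first = run_round(server, last_sent, ["low", "high", "low"])
second = run_round(server, last_sent, ["low", "high", "high"])
```

In the second round only the third sensor transmits, so the server has received four messages in total instead of six.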
DGClust – Distributed Grid Clustering (Representative Step)
Server keeps a small list of the most frequent global states.
Space-Saving Frequent Items Monitoring
[Figure: table of the most frequent global states (combinations of per-sensor states such as low, high, A, B, C, D) with their counts: 523, 334, 89, ...]
DGClust Methodology : Representative Step
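
The Space-Saving idea can be sketched in a few lines. This is a plain dictionary version written for illustration (the class name and the linear-time search for the minimum are assumptions, not the data structure a production system would use): at most `capacity` global states are monitored, and when a new state arrives while the list is full, it replaces the least frequent one and inherits its count plus one.

```python
class SpaceSaving:
    """Keeps at most `capacity` counters; counts may overestimate true frequencies."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def add(self, item):
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1
        else:
            # Evict the item with the smallest counter; the newcomer
            # inherits that counter plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self, n):
        """Return the n monitored items with the largest counters."""
        return sorted(self.counts.items(), key=lambda kv: -kv[1])[:n]


monitor = SpaceSaving(capacity=2)
stream = [("low", "low"), ("high", "low"), ("low", "low"),
          ("high", "high"), ("low", "low")]
for state in stream:
    monitor.add(state)
```

Memory stays bounded by `capacity` regardless of how many distinct global states the network produces.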
DGClust – Distributed Grid Clustering (Clustering Step)
Server applies partitional clustering to the most frequent states.
Furthest Point Clustering + Online Adaptive K-Means
DGClust Methodology : Clustering Step
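
A rough sketch of the two ingredients named above, assuming frequent states are represented as numeric tuples (for example bin indices). `furthest_point_centers` follows the usual furthest-point heuristic; `online_kmeans_update` is a simple running-mean update used here as a stand-in, not necessarily the adaptive variant used in DGClust.

```python
import math


def furthest_point_centers(points, k):
    """Pick k centers: start with the first point, then repeatedly add the
    point whose distance to its nearest chosen center is largest."""
    centers = [points[0]]
    while len(centers) < k:
        next_center = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(next_center)
    return centers


def online_kmeans_update(centers, counts, point):
    """Move the nearest center toward the new point (running mean)."""
    j = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    counts[j] += 1
    centers[j] = tuple(c + (x - c) / counts[j] for c, x in zip(centers[j], point))
    return j


frequent_states = [(0, 0), (0, 1), (5, 5), (9, 0)]
centers = furthest_point_centers(frequent_states, k=3)
initial_centers = list(centers)
counts = [1, 1, 1]
updated = online_kmeans_update(centers, counts, (1, 0))
```

The furthest-point pass gives well-separated starting centers, and each new frequent state then only nudges the center closest to it.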
DGClust Example (k=5) Varying Resources
Quality of results does not depend on the number of sensors.
Communication reduction is constant for any number of sensors (as long as a direct link to the server exists).
higher discretization granularity → higher clustering quality, lower communication reduction
higher number of sensors → more clustering updates
DGClust Main Findings
Setting
Sensors in a wide network produce streams of heterogeneously distributed data (each sensor produces a univariate stream of data)
Objective
To keep, at each node, a clustering of the entire network of sensors.
[Figure: clusters labelled A, B and C]
L2GClust Setting and Objective
Each sensor keeps a sketch of its most recent data.
The common approach to focusing on recent data is a sliding window.
Even within the sliding window, the most recent data point is usually more important than the oldest one, which is about to be discarded.
In ubiquitous streaming data sources, such as sensor networks, resources like memory and processing power are scarce.
Sometimes there is not even enough memory to store all the data points inside the window.
Memoryless α-fading average
L2GClust Methodology : Local Sketch
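
One common way to write a memoryless fading average is sketched below. The exact formulation is an assumption, since the slide gives only the name: a fading sum S_t = x_t + α·S_{t-1} is divided by a fading count N_t = 1 + α·N_{t-1}, so only two numbers are stored per sensor and no window of past readings is needed.

```python
class FadingAverage:
    """Running average in which older readings are discounted by alpha per step."""

    def __init__(self, alpha):
        self.alpha = alpha        # 0 < alpha <= 1; alpha = 1 gives the plain mean
        self.weighted_sum = 0.0   # fading sum S_t
        self.weight = 0.0         # fading count N_t

    def update(self, value):
        """Fold one reading into the sketch and return the current average."""
        self.weighted_sum = value + self.alpha * self.weighted_sum
        self.weight = 1.0 + self.alpha * self.weight
        return self.weighted_sum / self.weight


faded = FadingAverage(alpha=0.5)
faded.update(10.0)
faded_result = faded.update(20.0)   # (20 + 0.5 * 10) / (1 + 0.5)

plain = FadingAverage(alpha=1.0)
plain.update(10.0)
plain_result = plain.update(20.0)   # ordinary mean of 10 and 20
```

Smaller values of α make the sketch forget old readings faster, which plays the role of a shorter window.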
[Figure: example network with node values 1, 2, 10, 99, 95, 11, 10, 100, 3, 10, 2, 12, 5, 10]
L2GClust Example : Local Clustering
Centroids {6.9, 98.0}
[Figure: same node values as in the previous slide]
L2GClust Example : Local Clustering
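
The centroids shown for this example are consistent with splitting the fourteen node values into a low group and a high group. The sketch below assumes a plain two-cluster k-means on the scalar values, initialised at the minimum and maximum; the slide does not state which procedure produced the centroids, so this is only a check of the arithmetic.

```python
values = [1, 2, 10, 99, 95, 11, 10, 100, 3, 10, 2, 12, 5, 10]


def kmeans_1d(data, centers, iterations=10):
    """Plain Lloyd iterations on scalar data."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


centroids = kmeans_1d(values, centers=[min(values), max(values)])
print([round(c, 1) for c in centroids])
```

The eleven low values average 76/11 and the three high values average 294/3, which round to 6.9 and 98.0.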
Each node's estimate of the global clustering is computed by clustering the centroids of its direct neighbors' estimates.
Furthest Point Clustering
In effect, each node builds an ensemble of the clusterings held by its direct neighbors.
Instead of broadcasting the sketch of its own data, each node broadcasts its estimate of the global clustering.
L2GClust Methodology : Local Clustering
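
A simplified picture of this step, under loud assumptions: centroids are scalars, a node merges only what it receives from its neighbors, and the merge is one furthest-point seeding followed by a single assignment and averaging pass. The function names and the neighbor values are hypothetical, and the sketch does not attempt to reproduce the figures in the example slides.

```python
def furthest_point_seeds(values, k):
    """Choose k well-separated seeds from scalar values."""
    seeds = [values[0]]
    while len(seeds) < k:
        seeds.append(max(values, key=lambda v: min(abs(v - s) for s in seeds)))
    return seeds


def merge_neighbor_estimates(neighbor_estimates, k):
    """Cluster the pooled centroids received from direct neighbors."""
    pooled = [c for estimate in neighbor_estimates for c in estimate]
    seeds = furthest_point_seeds(pooled, k)
    groups = [[] for _ in seeds]
    for c in pooled:
        nearest = min(range(k), key=lambda i: abs(c - seeds[i]))
        groups[nearest].append(c)
    # Skip empty groups, which can occur if two seeds coincide.
    return sorted(sum(g) / len(g) for g in groups if g)


# Hypothetical estimates from three neighbors, each with k = 2 centroids.
received = [[8.0, 96.0], [10.0, 98.0], [6.0, 94.0]]
estimate = merge_neighbor_estimates(received, k=2)
```

Because only k centroids travel over each link, the size of a message does not grow with the number of sensors in the network.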
Centroids {6.9, 98.0}
[Figure: example network with node values 88.07, 88.06, 2.80, 1.21, 3.58, 3.74, 87.37, 4.19, 88.03, 3.50, 88.12, 86.31, 2.41, 88.06]
L2GClust Example : Local Clustering
{7.71, 97.1}
{10.59, 97.38}
{5.10, 95.00}
Centroids {6.9, 98.0}
[Figure: same node values as in the previous slide]
L2GClust Example : Local Clustering
{10.36, 97.1}
Comparison was performed against the same strategy executed at a central server with access to all data.
The measured outcome was the agreement between each node's clustering estimate and the centralized clustering, averaged over all nodes.
Kappa statistic → cluster sanity
Proportion of agreement → cluster validity
K = (P(A) - P(e)) / (1 - P(e)), where P(A) is the observed proportion of agreement and P(e) is the agreement expected by chance
State-of-the-art Simulator
Each sensor in the simulation (Visual Sense) generates a Gaussian stream with mean from one of the predefined Gaussian clusters.
Evaluated parameters were number of clusters, network size, and cluster overlap.
L2GClust Evaluation Summary
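
The two measures can be computed as in the sketch below, assuming the cluster labels of a node's estimate have already been matched to those of the centralized clustering (the matching itself is not shown, and the label lists are made up for illustration). P(e) is taken from the marginal label frequencies of the two clusterings, as in Cohen's kappa.

```python
from collections import Counter


def agreement_and_kappa(node_labels, central_labels):
    """Return (P(A), K) for two labelings of the same sensors."""
    n = len(node_labels)
    p_agree = sum(a == b for a, b in zip(node_labels, central_labels)) / n
    node_freq = Counter(node_labels)
    central_freq = Counter(central_labels)
    # Agreement expected by chance, from each labeling's marginal frequencies.
    p_chance = sum(node_freq[c] * central_freq[c] for c in node_freq) / (n * n)
    # Note: undefined when p_chance == 1 (both labelings use a single label).
    kappa = (p_agree - p_chance) / (1 - p_chance)
    return p_agree, kappa


# Hypothetical cluster assignments for six sensors.
node = ["A", "A", "B", "B", "A", "B"]
central = ["A", "A", "B", "B", "B", "B"]
p_a, k = agreement_and_kappa(node, central)
```

Here five of six sensors agree, chance agreement is 0.5, and K comes out at two thirds, which illustrates why Kappa is stricter than the raw proportion of agreement.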
L2GClust Results
Average proportion of agreement converges (with small fluctuations).
L2GClust Results
Sanity was confirmed with Kappa statistic always above 0.58.
L2GClust Results
Real data from electricity demand sensors showed the ability to improve with examples.
Local sketch yields:
memoryless storage of summaries;
a straightforward adaptation to most recent data;
a reduction of the system's sensitivity to uncertainty;
Local clustering with direct neighbors yields:
no forwarding of information (reduced communication);
low dimensionality of the clustering problem;
sensitive information better preserved.
Future Work
Evaluate L2GClust on smart grid sensor networks.
L2GClust Main Properties
Thank you!