A Unified Framework Supporting Interactive Exploration of Density-Based Clusters in Streaming Windows
Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward
Worcester Polytechnic Institute
EDBT 2010, Submitted
This work is supported under NSF grants CCF-0811510, IIS-0119276, IIS-0414380.
What are Density-Based Clusters?
Clusters that are defined by individual data points (tuples) and their local “neighborhood”.
How do they differ from K-median-style clusters?
[Figure: two example clusterings, one labeled Cluster 1 and Cluster 2, the other labeled Cluster 1 through Cluster 4]
Formal Definition
Core Object: has more than θcnt neighbors within distance θrange of it.
Edge Object: not a core object, but a neighbor of a core object.
Noise: not a core object and not a neighbor of any core object.
[Figure: labeled example data points]
A Density-Based Cluster (DB-Cluster) is a maximal group of connected core objects and the edge objects attached to them.
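The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `db_clusters`, the use of Euclidean distance, and the choice to treat a point at exactly θrange as a neighbor are all assumptions. The core test follows the slide's wording ("more than θcnt neighbors").

```python
from collections import deque
from math import dist


def db_clusters(points, theta_range, theta_cnt):
    """Label each point as core, edge or noise and group DB-clusters.

    points: dict mapping a point id to its coordinates (a tuple of floats).
    Returns (roles, clusters): roles maps id -> "core" | "edge" | "noise";
    clusters is a list of sets of point ids.
    """
    ids = list(points)
    # Neighbors of p: every other point within theta_range of p (assumed inclusive).
    neighbors = {
        p: {q for q in ids if q != p and dist(points[p], points[q]) <= theta_range}
        for p in ids
    }
    # Core object: more than theta_cnt neighbors.
    core = {p for p in ids if len(neighbors[p]) > theta_cnt}
    roles = {}
    for p in ids:
        if p in core:
            roles[p] = "core"
        elif neighbors[p] & core:
            roles[p] = "edge"
        else:
            roles[p] = "noise"

    clusters, seen = [], set()
    for start in ids:
        if start not in core or start in seen:
            continue
        # Breadth-first search over connected core objects.
        cluster, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if q in core and q not in seen:
                    seen.add(q)
                    queue.append(q)
                # Core neighbors join the cluster; non-core neighbors attach as edges.
                cluster.add(q)
        clusters.append(cluster)
    return roles, clusters


# Hypothetical data: a small square of points, one attached point, one outlier.
pts = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (0.0, 1.0),
       4: (1.0, 1.0), 5: (2.0, 1.0), 6: (9.0, 9.0)}
roles, clusters = db_clusters(pts, theta_range=1.0, theta_cnt=1)
```

In this toy data set, points 1 to 4 are core objects, point 5 is an edge object attached through point 4, and point 6 is noise.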
Cluster Detection in Sliding Windows
[Figure: a stream of data points 1 to 8 covered by two sliding windows, W1 and W2]
[Figure: template density-based clustering query over sliding windows, annotated with its pattern-specific and window-specific parts]
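One way to picture the two groups of query parameters is as a small record type. The sketch below is hypothetical: the class name `ClusterQuery`, the field names, and the use of count-based windows (sizes measured in tuples) are assumptions, chosen to match the θrange, θcnt, window size and slide size that appear elsewhere in these slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ClusterQuery:
    # Pattern-specific parameters (named after the slide's θrange and θcnt).
    theta_range: float
    theta_cnt: int
    # Window-specific parameters, counted in tuples here (an assumption).
    win: int
    slide: int

    def window_bounds(self, k: int) -> Tuple[int, int]:
        """Return the first and last tuple index (1-based, inclusive) of window Wk."""
        first = k * self.slide + 1
        return first, first + self.win - 1


query = ClusterQuery(theta_range=1.0, theta_cnt=3, win=16, slide=4)
w0 = query.window_bounds(0)  # tuples 1..16
w1 = query.window_bounds(1)  # tuples 5..20
```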
Application Examples:
Stock Market: transaction info is clustered for Stock Analysts.
Are there intensive-transaction areas in the last hour of transactions?
Battlefield: position info is clustered for the Commander.
Where are the main clusters formed by enemy warcraft?
State of the Art
Existing algorithms for density-based clustering queries over sliding windows include Incremental DBSCAN, Exact-N, Abstract-C and Extra-N [Ester98] [Yang09].
Extra-N becomes inefficient as the win/slide ratio increases.
No evolution semantics have been defined for how density-based clusters change over time.
No existing system allows interactive exploration of density-based clusters in streaming windows.
Goals
1. A more efficient density-based clustering algorithm over streams.
2. Evolution semantics that intuitively explain cluster changes.
3. A visualized pattern space allowing interactive exploration of clusters.
Review: Existing Algorithm (Extra-N)
In highly dynamic streaming environments:
Re-computation.
Incremental cluster maintenance.
Extra-N [Yang09] proposed a hybrid neighbor relationship (neighborship) mechanism to represent the cluster structure.
Maintain “Exact Neighborships” (neighbor lists) for non-core objects.
Maintain “Abstract Neighborships” (cluster memberships) for core objects.
A general concept of “Predicted View” is applied to update the cluster structure efficiently.
Key: a compact and easily maintainable cluster representation.
Concept of Predicted Views
window size=16, slide size=4, time=1
[Figure: four panels over data points 1 to 16: Current View of W0, Predicted View of W1, Predicted View of W2, Predicted View of W3. A timeline of points 1 to 16 marks the extents of windows W0, W1, W2 and W3.]
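The membership of each predicted view in this example can be reproduced with a short sketch. It only tracks which current points survive into each future window; the predicted views described in the slides also carry the cluster structure over those points, which is not modeled here. The function name `predicted_views` and the count-based indexing are assumptions.

```python
def predicted_views(first, win, slide):
    """Map each window offset k to the current points that survive into it.

    first: index of the oldest tuple in the current window.
    Offset 0 is the current view; offsets 1, 2, ... are predicted views.
    """
    last = first + win - 1
    views = {}
    k = 0
    # A future window still overlaps the current one while its start is <= last.
    while first + k * slide <= last:
        views[k] = list(range(first + k * slide, last + 1))
        k += 1
    return views


views = predicted_views(first=1, win=16, slide=4)
# views[0] holds points 1..16, views[1] 5..16, views[2] 9..16, views[3] 13..16
```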
Update Predicted Views
window size=16, slide size=4, time=1
[Figure: new data points 17 to 20 arrive and the view of W0 expires. Panels: Expired View of W0, Current View of W1, Predicted View of W2, Predicted View of W3, Predicted View of W4. A timeline of points 5 to 20 marks the extents of windows W1, W2, W3 and W4.]
Inefficiency of Extra-N
When the Win/Slide ratio increases (for example, Win=10000, Slide=10), a large number of predicted views must be maintained independently.
This places a heavy burden on both CPU and memory resources.
Number of views maintained: Win/Slide.
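As a quick check of the scale involved, the arithmetic can be written out. The sketch assumes one view per slide position inside a window, which matches the earlier 16/4 example with its four views (W0 to W3); the helper name `num_views` is hypothetical.

```python
def num_views(win, slide):
    """Views (current plus predicted) kept when each is maintained independently."""
    # Integer ceiling division, so a partial last slide still counts as a view.
    return -(-win // slide)


small = num_views(16, 4)       # the running example
large = num_views(10000, 10)   # the case quoted on the slide
```

Under this assumption the running example needs 4 views, while Win=10000 and Slide=10 needs 1000.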
Proposed Solution: IWIN
Is there any relationship among the clusters identified?
“Growth Property” among DB-cluster Sets
[Figure: Independent Cluster Structure Storage vs. Hierarchical Cluster Structure Storage for clusters c4, c5 and c6; label: Grow]
If every cluster Ci in Clu_Set1 is “contained” by some cluster in Clu_Set2, then Clu_Set2 is a “Growth” of Clu_Set1.
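The growth test can be expressed directly over sets of point ids. This sketch reads “contained” as the subset relation, which is an assumption, and the function name `is_growth` and the sample cluster sets are hypothetical.

```python
def is_growth(clu_set1, clu_set2):
    """True if every cluster in clu_set1 is contained by some cluster in clu_set2."""
    return all(any(c1 <= c2 for c2 in clu_set2) for c1 in clu_set1)


# Hypothetical cluster sets, each a list of sets of point ids.
smaller = [{13, 14, 15}]
larger = [{9, 10, 13, 14, 15}, {11, 12}]
grows = is_growth(smaller, larger)     # every small cluster fits inside a large one
reverse = is_growth(larger, smaller)   # the converse does not hold
```

When the property holds, the smaller set does not need its own copy of the structure: it can plausibly be stored as a refinement of the larger one, which is the idea behind the hierarchical storage named in the figure.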
Integrated Vs. Independent Maintenance of Predicted Views
[Figure: IWIN: integrated maintenance vs. Extra-N: independent maintenance]
Benefits of Integrated Maintenance
Benefits for Memory Resources: the memory needed to store the cluster sets identified by multiple queries in QG is independent of |QG|.
Benefits for Computational Resources: the multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally rather than independently.
IWIN outperforms Extra-N in both CPU and memory utilization.
Goals
1. A more efficient density-based clustering algorithm over streams.
2. Evolution semantics that intuitively explain cluster changes.
3. A visualized pattern space allowing interactive exploration of clusters.
Why do we need evolution semantics?
Analysts need to know how clusters change over time.
These changes are hard to observe by looking at the clusters alone (even with visualization).
Commander:
History: Did any clusters merge?
Now: Are there any new clusters?
Future: Will any cluster break up shortly?
Proposed Semantics
Single-Step Evolutions: birth, termination, split, merge, preserve/expand/shrink
Multi-Step Evolutions: split-expand, split-merge, shrink-split
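The single-step evolution types listed above can be illustrated with a simple overlap-based classifier. The slides do not give formal definitions, so every rule here is an assumption: clusters are matched by shared members, and a one-to-one match is labeled expand, shrink or preserve by comparing sizes. The function name `single_step_evolutions` is hypothetical.

```python
def single_step_evolutions(old, new):
    """Classify how clusters in `old` relate to clusters in `new`.

    old, new: lists of sets of point ids from two consecutive windows.
    Returns a list of (kind, old_indexes, new_indexes) tuples.
    """
    # Overlap maps: which new clusters share members with each old one, and vice versa.
    succ = {i: [j for j, n in enumerate(new) if o & n] for i, o in enumerate(old)}
    pred = {j: [i for i, o in enumerate(old) if o & n] for j, n in enumerate(new)}
    events = []
    for j, olds in pred.items():
        if not olds:
            events.append(("birth", [], [j]))
        elif len(olds) > 1:
            events.append(("merge", olds, [j]))
    for i, news in succ.items():
        if not news:
            events.append(("termination", [i], []))
        elif len(news) > 1:
            events.append(("split", [i], news))
        elif len(pred[news[0]]) == 1:
            # One-to-one match: compare sizes (an assumed criterion).
            j = news[0]
            if len(new[j]) > len(old[i]):
                kind = "expand"
            elif len(new[j]) < len(old[i]):
                kind = "shrink"
            else:
                kind = "preserve"
            events.append((kind, [i], [j]))
    return events


# Hypothetical cluster sets from two consecutive windows.
old = [{1, 2, 3}, {4, 5}, {6, 7, 8, 9}]
new = [{2, 3, 10}, {6, 7}, {8, 9}, {11, 12}]
events = single_step_evolutions(old, new)
```

In this example the first old cluster is preserved, the second terminates, the third splits in two, and one new cluster is born. Multi-step evolutions such as split-expand would chain these events across several windows.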
How to Compute
Extract Predicted Evolution (before window slide)
Update Evolution (after window slide)
[Figure: example evolutions labeled preserve, split, preserve and shrink]
Conclusion for Proposed Semantics
1. Intuitively describes cluster evolution over time.
2. Easily maintainable: can be computed on the fly during cluster maintenance.
Goals
1. A more efficient density-based clustering algorithm over streams.
2. Evolution semantics that intuitively explain cluster changes.
3. A visualized pattern space allowing interactive exploration of clusters.
Outline
1. What is Neighbor-Based Pattern Detection
2. State-of-Art
3. Potential Solutions & Their Inefficiency
4. Proposed Solution: Extra-N
5. Experimental Study
6. Conclusion
Why needed?
Analysts need to navigate along the time axis to learn the current state, review the history, and predict the near future.
Example: how are the two clusters in the current window related to those detected 30 minutes back?
Analysts need to study the clusters and their evolution at different abstraction levels.
Example: for routine traffic monitoring, only the positions of major clusters need to be reported; when an accident happens, specific information about cluster members needs to be reported.
Proposed Pattern Space
Evaluation for IWIN
Alternative Methods:
1. Incremental DBSCAN [Ester98]
2. Extra-N [Yang09]
3. IWIN
Real Streaming Data:
1. GMTI data recording information about moving vehicles [Mitre08].
2. STT data recording stock transactions from NYSE [INETATS08].
Measurements:
1. Average processing time for each tuple.
2. Memory footprint.
Evaluation for IWIN
Case Study 1
Case Study 2
Conclusion
1. Presented the first unified framework supporting interactive exploration of density-based clusters in streaming windows.
2. Designed a more efficient density-based clustering algorithm, IWIN.
3. Defined the first evolution semantics for density-based clusters.
4. Our experimental study confirms both the efficiency and the effectiveness of the proposed framework.
Future work
Support multiple queries.
Support other pattern types, such as outliers, association rules…
Support pattern storage and match.
More?
The End
Thanks