Post on 20-Dec-2015
![Page 1: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/1.jpg)
A Framework for Projected Clustering of High Dimensional Data Streams
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004
![Page 2: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/2.jpg)
Motivation and Underlying Concepts
•Not all dimensions should be considered when clustering high-dimensional data
•The Fading Cluster Structure: a fading function f(t) = 2^(-λt), where λ is the decay rate, down-weights older points
•The half-life t0 of a point is defined as the time at which f(t0) = (1/2) f(0)
•A fading cluster structure at time t for a set of d-dimensional points is a (2d + 1)-tuple: the per-dimension weighted sums of squares, the per-dimension weighted sums, and the total weight of the points
•The fading cluster structure satisfies two properties: additivity and temporal multiplicity
•The clustering process requires simultaneous maintenance of the clusters and of the set of dimensions associated with each cluster
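The fading function, half-life, additivity, and temporal multiplicity can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the exponential fading function f(t) = 2^(-λt), and the names `FadingCluster`, `decay_to`, and `add_point` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


def fading(t: float, decay_rate: float) -> float:
    """Assumed fading function f(t) = 2^(-decay_rate * t), so f(0) = 1."""
    return 2.0 ** (-decay_rate * t)


def half_life(decay_rate: float) -> float:
    """Time t0 with f(t0) = (1/2) f(0); for the assumed f this is 1 / decay_rate."""
    return 1.0 / decay_rate


@dataclass
class FadingCluster:
    """Hypothetical container for the (2d + 1) cluster statistics."""
    fc2: List[float]    # per-dimension weighted sum of squares
    fc1: List[float]    # per-dimension weighted sum
    weight: float       # total weight of the points
    last_update: float  # time of the last update

    @classmethod
    def from_point(cls, x: Sequence[float], t: float) -> "FadingCluster":
        return cls([v * v for v in x], list(x), 1.0, t)

    def decay_to(self, t: float, decay_rate: float) -> None:
        """Temporal multiplicity: with no new points, every statistic
        is multiplied by f(t - last_update)."""
        factor = fading(t - self.last_update, decay_rate)
        self.fc2 = [v * factor for v in self.fc2]
        self.fc1 = [v * factor for v in self.fc1]
        self.weight *= factor
        self.last_update = t

    def add_point(self, x: Sequence[float], t: float, decay_rate: float) -> None:
        """Additivity: decay to time t, then add the statistics of the
        one-point cluster {x} component by component."""
        self.decay_to(t, decay_rate)
        self.fc2 = [s + v * v for s, v in zip(self.fc2, x)]
        self.fc1 = [s + v for s, v in zip(self.fc1, x)]
        self.weight += 1.0

    def centroid(self) -> List[float]:
        return [s / self.weight for s in self.fc1]


cluster = FadingCluster.from_point([2.0, 4.0], t=0.0)
cluster.add_point([4.0, 0.0], t=2.0, decay_rate=0.5)
print(cluster.fc1, cluster.fc2, cluster.weight)
```

With a decay rate of 0.5 the half-life is 2 time units, so the first point carries weight 0.5 by the time the second point arrives.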
![Page 3: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/3.jpg)
HPStream : High-Dimensional Projected Stream Clustering Method
![Page 4: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/4.jpg)
HPStream Algorithm – Brief Explanation
-Set parameters
-Normalization Process
-Initial clustering with k-means on the first InitNumber points
-ComputeDimensions: this procedure chooses the dimensions for each cluster so that the spread along the chosen dimensions is as small as possible
-FindProjectedDist: determines the cluster closest to the incoming data point
-FindLimitingRadius: determines the limiting radius of that cluster
-Finally, decide whether the point joins the closest cluster or starts a new one, and which cluster to delete
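The steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published pseudocode: dimensions are chosen per cluster rather than jointly across clusters, the projected distance is taken as a root-mean-square distance over the chosen dimensions, and the limiting radius is the spread radius factor times the cluster's RMS spread in those dimensions. The function names mirror the procedure names on the slide, but their signatures and the dictionary layout of a cluster are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def dimension_spreads(fc1: Sequence[float], fc2: Sequence[float],
                      weight: float) -> List[float]:
    """Per-dimension variance recovered from the cluster statistics."""
    return [max(s2 / weight - (s1 / weight) ** 2, 0.0)
            for s1, s2 in zip(fc1, fc2)]


def compute_dimensions(spreads: Sequence[float], l: int) -> List[int]:
    """Pick the l dimensions with the smallest spread (assumed per cluster)."""
    ranked = sorted(range(len(spreads)), key=lambda j: spreads[j])
    return sorted(ranked[:l])


def find_projected_dist(x: Sequence[float], centroid: Sequence[float],
                        dims: Sequence[int]) -> float:
    """RMS distance to the centroid over the chosen dimensions only."""
    return math.sqrt(sum((x[j] - centroid[j]) ** 2 for j in dims) / len(dims))


def find_limiting_radius(spreads: Sequence[float], dims: Sequence[int],
                         spread_radius_factor: float) -> float:
    """Assumed form: factor times the RMS spread in the chosen dimensions."""
    rms = math.sqrt(sum(spreads[j] for j in dims) / len(dims))
    return spread_radius_factor * rms


def assign(x: Sequence[float], clusters: List[Dict], l: int,
           spread_radius_factor: float) -> int:
    """Return the index of the closest cluster if x lies within its limiting
    radius, or -1 to signal that a new cluster should be created."""
    best_index, best_dist, best_radius = -1, math.inf, 0.0
    for i, c in enumerate(clusters):
        spreads = dimension_spreads(c["fc1"], c["fc2"], c["weight"])
        dims = compute_dimensions(spreads, l)
        centroid = [s / c["weight"] for s in c["fc1"]]
        dist = find_projected_dist(x, centroid, dims)
        if dist < best_dist:
            best_index, best_dist = i, dist
            best_radius = find_limiting_radius(spreads, dims, spread_radius_factor)
    return best_index if best_dist <= best_radius else -1


# One cluster built from the points (0, 0, 0) and (2, 0, 10), unit weights.
cluster = {"fc1": [2.0, 0.0, 10.0], "fc2": [4.0, 0.0, 100.0], "weight": 2.0}
print(assign((1.5, 0.5, 100.0), [cluster], l=2, spread_radius_factor=2.0))
print(assign((5.0, 0.0, 5.0), [cluster], l=2, spread_radius_factor=2.0))
```

In this toy cluster the third dimension has by far the largest spread, so it is excluded from the projection. The first test point is absorbed even though it is far away along that dimension, while the second is rejected because it is far away along a chosen one.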
![Page 5: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/5.jpg)
![Page 6: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/6.jpg)
![Page 7: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/7.jpg)
![Page 8: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/8.jpg)
Experimental Setup
HPStream compared with CluStream : both implemented in Microsoft Visual C++
One synthetic data set and 2 real-world data sets
- Network Intrusion and Forest cover type data sets.
Comparison criteria for judging the 2 algorithms:
- accuracy : clustering quality
- efficiency : stream processing rate
- sensitivity : varying decay rate, average projected dimensionality l, and radius threshold
- scalability : varying number of dimensions and clusters
Parameters initialized as follows: Decay-rate = 0.5, Spread radius factor = 2, InitNumber = 2000, Average Projected Dimensionality l > d/2.
![Page 9: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/9.jpg)
Comparing Accuracy : Using clustering quality and cluster purity
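As a rough illustration of the purity measure, the sketch below assumes cluster purity is the average, over clusters, of the fraction of points carrying that cluster's dominant class label. The function name `cluster_purity` and the toy data are hypothetical and are not taken from the experiments.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def cluster_purity(assignments: Sequence[Hashable],
                   labels: Sequence[Hashable]) -> float:
    """Average fraction of the dominant class label within each cluster."""
    by_cluster: Dict[Hashable, List[Hashable]] = {}
    for cluster_id, label in zip(assignments, labels):
        by_cluster.setdefault(cluster_id, []).append(label)
    fractions = [
        Counter(members).most_common(1)[0][1] / len(members)
        for members in by_cluster.values()
    ]
    return sum(fractions) / len(fractions)


# Cluster 0 holds two "normal" points and one "attack"; cluster 1 is pure.
assignments = [0, 0, 0, 1, 1]
labels = ["normal", "normal", "attack", "attack", "attack"]
print(cluster_purity(assignments, labels))
```

Here cluster 0 has purity 2/3 and cluster 1 has purity 1, so the average is 5/6.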
![Page 10: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/10.jpg)
Accuracy comparison continued:
![Page 11: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/11.jpg)
Accuracy comparison continued:
![Page 12: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/12.jpg)
Efficiency comparison using Stream Processing Rate:
![Page 13: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/13.jpg)
Sensitivity : Varying ‘l’
![Page 14: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/14.jpg)
Sensitivity: Varying radius threshold and decay rate
![Page 15: A Framework for Projected Clustering of High Dimensional Data Streams Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004](https://reader031.vdocument.in/reader031/viewer/2022032800/56649d425503460f94a1d893/html5/thumbnails/15.jpg)
Scalability : varying dimensionality and number of clusters