Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN
Remaining Lectures in 2009
1. Advanced Clustering and Outlier Detection
2. Advanced Classification and Prediction
3. Top Ten Data Mining Algorithms (short)
4. Course Summary (short)
5. Assignment5 Student Presentations
Clustering Part2: Advanced Clustering and Outlier Detection
1. Hierarchical Clustering
2. More on Density-based Clustering: DENCLUE
3. [EM Top10-DM-Alg]
4. Cluster Evaluation Measures
5. Outlier Detection
More on Clustering
1. Hierarchical Clustering to be discussed on Nov. 11
2. DBSCAN will be used in programming project
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram
– A tree-like diagram that records the sequence of merges or splits
[Figure: dendrogram of six points (leaf order 1, 3, 2, 5, 4, 6; merge heights from 0 to 0.2) next to the corresponding nested-cluster diagram]
Agglomerative Clustering Algorithm
More popular hierarchical clustering technique
Basic algorithm is straightforward
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between clusters distinguish the different algorithms
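The basic algorithm above can be sketched in a few lines of Python. This is a naive illustration, not the course's reference implementation: the function name `agglomerative`, the sample points, and the choice of Euclidean distance are assumptions, and instead of maintaining an explicit proximity matrix it recomputes cluster distances from the points in every round.

```python
import math

def agglomerative(points, linkage=min):
    """Naive agglomerative clustering; returns the list of merges performed.

    `linkage` reduces the pairwise point distances between two clusters to a
    single number: min gives MIN (single link), max gives MAX (complete link).
    """
    # Steps 1-2: every point starts as its own cluster (indices into `points`).
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(a, b):
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    # Steps 3-6: repeat until only a single cluster remains.
    while len(clusters) > 1:
        # Step 4: find the two closest clusters.
        d, x, y = min(
            (cluster_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        # Step 5 is implicit here: distances are recomputed from the points,
        # so replacing the two clusters by their union updates all proximities.
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges

# Hypothetical sample data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
merges = agglomerative(points)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

The recorded merge sequence is exactly the information a dendrogram displays. Passing `linkage=max` switches the sketch from MIN to MAX without other changes.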
Starting Situation
Start with clusters of individual points and a proximity matrix
[Figure: points p1 ... p12 as singleton clusters, and the proximity matrix with rows and columns p1, p2, p3, p4, p5, ...]
Intermediate Situation
After some merging steps, we have some clusters
[Figure: clusters C1 to C5 over points p1 ... p12, and the proximity matrix with rows and columns C1, C2, C3, C4, C5]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 to C5 over points p1 ... p12, and the proximity matrix with rows and columns C1, C2, C3, C4, C5]
After Merging
The question is “How do we update the proximity matrix?”
[Figure: clusters C1, C3, C4 and the merged cluster C2 U C5; in the proximity matrix, the row and column of C2 U C5 are filled with "?"]
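For the MIN and MAX definitions of cluster proximity, the "?" entries have a simple answer: the distance from the merged cluster to any other cluster is the minimum (MIN) or maximum (MAX) of the two old entries. The sketch below illustrates this; the function name `merge_in_matrix` and all distance values are hypothetical, chosen only so that C2 and C5 are the closest pair.

```python
def merge_in_matrix(dist, a, b, combine=min):
    """Return a new distance matrix in which clusters a and b are merged.

    `dist` is a symmetric dict of dicts keyed by cluster label. `combine`
    decides how the "?" entries are filled: min for MIN (single link),
    max for MAX (complete link).
    """
    merged = f"{a} U {b}"
    others = [c for c in dist if c not in (a, b)]
    # Entries between untouched clusters are copied unchanged.
    new = {c: {d: dist[c][d] for d in others if d != c} for c in others}
    new[merged] = {}
    for c in others:
        value = combine(dist[a][c], dist[b][c])
        new[merged][c] = value
        new[c][merged] = value
    return new

# Hypothetical distances between the five clusters (C2 and C5 are closest).
labels = ["C1", "C2", "C3", "C4", "C5"]
raw = {
    ("C1", "C2"): 4.0, ("C1", "C3"): 6.0, ("C1", "C4"): 7.0, ("C1", "C5"): 5.0,
    ("C2", "C3"): 3.0, ("C2", "C4"): 8.0, ("C2", "C5"): 1.0,
    ("C3", "C4"): 2.5, ("C3", "C5"): 3.5, ("C4", "C5"): 9.0,
}
dist = {c: {} for c in labels}
for (p, q), value in raw.items():
    dist[p][q] = value
    dist[q][p] = value

single = merge_in_matrix(dist, "C2", "C5", combine=min)
complete = merge_in_matrix(dist, "C2", "C5", combine=max)
print(single["C2 U C5"])
print(complete["C2 U C5"])
```

Only one row and one column change per merge, which is why the update step is cheap compared with recomputing all pairwise point distances.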
How to Define Inter-Cluster Similarity
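[Figure: points p1 to p5 grouped into two clusters, with the question "Similarity?" between them, and the proximity matrix with rows and columns p1, p2, p3, p4, p5, ...]
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
– Ward’s Method uses squared error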
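The listed definitions can be compared side by side on a small example. The sketch below is an illustration under stated assumptions: it works with Euclidean distances rather than similarities, all function names and sample points are hypothetical, and Ward's method is represented by the increase in squared error caused by a merge.

```python
import math
from itertools import product

def pairwise(a, b):
    """All point-to-point distances between clusters a and b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def d_min(a, b):
    return min(pairwise(a, b))            # MIN (single link)

def d_max(a, b):
    return max(pairwise(a, b))            # MAX (complete link)

def d_group_average(a, b):
    return sum(pairwise(a, b)) / (len(a) * len(b))

def centroid(c):
    return tuple(sum(coord) / len(c) for coord in zip(*c))

def d_centroid(a, b):
    return math.dist(centroid(a), centroid(b))

def ward_increase(a, b):
    """Increase in total squared error if a and b were merged."""
    n_a, n_b = len(a), len(b)
    return n_a * n_b / (n_a + n_b) * d_centroid(a, b) ** 2

# Hypothetical clusters.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
for name, f in [("MIN", d_min), ("MAX", d_max), ("group average", d_group_average),
                ("centroid", d_centroid), ("Ward increase", ward_increase)]:
    print(name, round(f(a, b), 3))
```

The group average always lies between the MIN and MAX values, which is one way to see it as a compromise between the two.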
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise proximity between points in the two clusters.
Need to use average connectivity for scalability since total proximity favors large clusters
$$\mathrm{proximity}(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i,\; p_j \in Cluster_j} \mathrm{proximity}(p_i, p_j)}{|Cluster_i| \times |Cluster_j|}$$
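|    | I1   | I2   | I3   | I4   | I5   |
|----|------|------|------|------|------|
| I1 | 1.00 | 0.90 | 0.10 | 0.65 | 0.20 |
| I2 | 0.90 | 1.00 | 0.70 | 0.60 | 0.50 |
| I3 | 0.10 | 0.70 | 1.00 | 0.40 | 0.30 |
| I4 | 0.65 | 0.60 | 0.40 | 1.00 | 0.80 |
| I5 | 0.20 | 0.50 | 0.30 | 0.80 | 1.00 |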
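As a worked example, the group-average proximity can be evaluated on the similarity values of I1 to I5 given above. The cluster choices {I1, I2}, {I4, I5} and {I3} are assumptions made for illustration, as is the function name `group_average`.

```python
names = ["I1", "I2", "I3", "I4", "I5"]
rows = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]
sim = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

def group_average(sim, cluster_i, cluster_j):
    """Average pairwise proximity between the points of two clusters."""
    total = sum(sim[p][q] for p in cluster_i for q in cluster_j)
    return total / (len(cluster_i) * len(cluster_j))

# (0.65 + 0.20 + 0.60 + 0.50) / (2 * 2)
print(group_average(sim, ["I1", "I2"], ["I4", "I5"]))
# (0.10 + 0.70) / (2 * 1)
print(group_average(sim, ["I1", "I2"], ["I3"]))
# (0.40 + 0.30) / (2 * 1)
print(group_average(sim, ["I4", "I5"], ["I3"]))
```

Because the table holds similarities, the "closest" pair of clusters is the one with the largest group-average value.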
2009 Teaching of Clustering
Clustering Part1: Basics (September/October)
1. What is Clustering?
2. Partitioning/Representative-based Clustering
• K-means
• K-medoids
3. Density Based Clustering centering on DBSCAN
4. Region Discovery
5. Grid-based Clustering
6. Similarity Assessment
Clustering Part2: Advanced Topics (November)
DBSCAN (http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf)
DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– Input parameters: MinPts and Eps
– A point is a core point if it has at least a specified number of points (MinPts) within Eps; the DBSCAN paper counts the point itself
• These are points that are in the interior of a cluster
– A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point
– A noise point is any point that is neither a core point nor a border point.
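The three point types can be computed directly from the definitions above. The sketch below assumes Euclidean distance and assumes that a point counts as core when at least MinPts points, itself included, lie within Eps (the convention of the DBSCAN paper); the function name `classify` and the sample points are hypothetical.

```python
import math

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    # Eps-neighborhood of each point (the point itself is included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical data: a unit square, one point next to it, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.2, 1.0), (8.0, 8.0)]
labels = classify(points, eps=1.5, min_pts=4)
print(labels)
```

The computation is quadratic in the number of points; a spatial index would be needed for larger datasets.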
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm (simplified view for teaching)
1. Create a graph whose nodes are the points to be clustered
2. For each core point c, create an edge from c to every point p in the Eps-neighborhood of c
3. Set N to the nodes of the graph
4. If N does not contain any core points, terminate
5. Pick a core point c in N
6. Let X be the set of nodes that can be reached from c by going forward
   1. Create a cluster containing X ∪ {c}
   2. N = N \ (X ∪ {c})
7. Continue with step 4
Remarks: points that are not assigned to any cluster are outliers; http://www2.cs.uh.edu/~ceick/7363/Papers/dbscan.pdf gives a more efficient implementation by performing steps 2 and 6 in parallel.
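The simplified graph view translates almost line by line into Python. The sketch below is a teaching illustration, not the efficient algorithm of the DBSCAN paper. It assumes Euclidean distance and the "at least MinPts, self included" core-point convention, and it makes one choice the steps leave open: a border point reachable from two clusters stays in the cluster found first. The name `dbscan_graph` and the sample data are hypothetical.

```python
import math

def dbscan_graph(points, eps, min_pts):
    """Return (clusters, outliers) as lists of point indices."""
    n = len(points)
    # Steps 1-2: nodes are point indices; edges leave core points only.
    neighborhood = [
        {j for j in range(n) if math.dist(points[i], points[j]) <= eps}
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighborhood[i]) >= min_pts}
    edges = {c: neighborhood[c] - {c} for c in core}

    remaining = set(range(n))            # Step 3: N
    clusters = []
    while remaining & core:              # Step 4: stop when N has no core point
        c = min(remaining & core)        # Step 5: any core point in N would do
        # Step 6: collect everything reachable from c by following edges forward.
        reached, stack = {c}, [c]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                # Assumption: points already assigned elsewhere are skipped.
                if nxt in remaining and nxt not in reached:
                    reached.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(reached))   # Step 6.1
        remaining -= reached               # Step 6.2
    return clusters, sorted(remaining)     # leftover points are outliers

# Hypothetical data: two squares, one border point, one outlier.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.2, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 20.0),
]
clusters, outliers = dbscan_graph(points, eps=1.5, min_pts=4)
print(clusters)
print(outliers)
```

Because border points have no outgoing edges, the traversal passes through core points only and stops at the border of each cluster.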
DBSCAN: Core, Border and Noise Points
[Figure: two panels, "Original Points" and "Point types: core, border and noise", for Eps = 10, MinPts = 4]
When DBSCAN Works Well
[Figure: two panels, "Original Points" and "Clusters"]
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
[Figure: three panels: "Original Points", the result for (MinPts=4, Eps=9.75), and the result for (MinPts=4, Eps=9.12)]
Problems with
• Varying densities
• High-dimensional data
Assignment 3 Dataset: Earthquake
Assignment3 Dataset: Complex9
http://www2.cs.uh.edu/~ml_kdd/Complex&Diamond/2DData.htm
[Figure: two panels, "K-Means in Weka" and "DBSCAN in Weka"]
Dataset: http://www2.cs.uh.edu/~ml_kdd/Complex&Diamond/Complex9.txt
DBSCAN: Determining EPS and MinPts
The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
Noise points have their kth nearest neighbor at a farther distance
So, plot the sorted distance of every point to its kth nearest neighbor
[Figure: sorted k-distance plot with the regions "Non-Core-points" and "Core-points" marked]
Run DBSCAN for MinPts=4 and Eps=5
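The values behind such a plot are easy to compute. The sketch below returns the sorted distance of every point to its kth nearest neighbor; it assumes Euclidean distance, leaves the plotting itself out, and uses a hypothetical function name and sample data.

```python
import math

def sorted_k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances to all other points (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical data: two dense squares and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 20.0),
]
k_dist = sorted_k_distances(points, k=3)
print([round(d, 3) for d in k_dist])
```

In the sorted list, the clustered points form a flat stretch and the isolated point produces a sharp jump at the end; a value of Eps just above the flat stretch separates the two groups.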