Chapter DM:II (continued)
II. Cluster Analysis
- Cluster Analysis Basics
- Hierarchical Cluster Analysis
- Iterative Cluster Analysis
- Density-Based Cluster Analysis
- Cluster Evaluation
- Constrained Cluster Analysis
DM:II-198 Cluster Analysis © STEIN 2006-2019
Cluster Evaluation
Overview

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
[Jain/Dubes 1990]
Cluster Evaluation [Tan/Steinbach/Kumar 2005]

[Figure: four panels: Random points, DBSCAN, k-means, Complete link.]
Cluster Evaluation
Overview

Cluster evaluation can address different issues:
- Provide evidence whether data contains non-random structures.
- Relate found structures in the data to externally provided class information.
- Rank alternative clusterings with regard to their quality.
- Determine the ideal number of clusters.
- Provide information to choose a suitable clustering approach.

(1) External validity measures: Analyze how close a clustering is to an (external) reference.
(2) Internal validity measures: Analyze intrinsic characteristics of a clustering.
(3) Relative validity measures: Analyze the sensitivity (of internal measures) during clustering generation.
[Figure: taxonomy of cluster validity measures. Node labels: cluster validity; external; internal; relative; absolute; covering analysis; information-theoretic; static: structure analysis; dynamic: re-cluster stability. Measures named: F-Measure, Purity, Rand statistics; Kullback-Leibler, entropy; Davies-Bouldin, Dunn, ρ, λ; elbow criterion, GAP statistics; dilution analysis.]
Cluster Evaluation
(1) External Validity Measures: F-Measure (for a Target Class)

[Figure: class i and cluster j (node-based analysis).]

                   Truth
                   P         N
Hypothesis   P     TP (a)    FP (b)
             N     FN (c)    TN (d)

Precision: $\frac{a}{a + b}$

Recall: $\frac{a}{a + c}$

F-measure: $F_\alpha = \dfrac{1 + \alpha}{\frac{1}{\text{precision}} + \frac{\alpha}{\text{recall}}}$

- $\alpha = 1$: harmonic mean
- $\alpha \in (0; 1)$: favor precision over recall
- $\alpha > 1$: favor recall over precision
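The formulas above can be sketched in a few lines of Python. This is only an illustration: the function name `f_measure` and the example counts are assumptions, not part of the slides, and the sketch assumes a > 0 so that no division by zero occurs.

```python
def f_measure(a, b, c, alpha=1.0):
    """Return (precision, recall, F_alpha) from contingency counts.

    a = true positives, b = false positives, c = false negatives,
    following the lettering of the contingency table. Assumes a > 0.
    """
    precision = a / (a + b)
    recall = a / (a + c)
    # F_alpha = (1 + alpha) / (1/precision + alpha/recall)
    f = (1 + alpha) / (1 / precision + alpha / recall)
    return precision, recall, f


# Hypothetical counts: 10 target objects inside the cluster, 4 foreign
# objects inside the cluster, no target object outside the cluster.
precision, recall, f1 = f_measure(a=10, b=4, c=0)
# Larger alpha puts more weight on recall.
_, _, f2 = f_measure(a=10, b=4, c=0, alpha=2.0)
```

Since recall is higher than precision in this toy case, raising α from 1 to 2 raises the F-value.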
[Figure: three example clusters; legend: In cluster, Target, Classes.]

Example 1: Recall = 0.26, Precision = 0.94, F-measure = 0.40. High precision, low recall ⇒ low F-measure.
Example 2: Recall = 0.59, Precision = 0.53, F-measure = 0.56. Low precision, low recall ⇒ low F-measure.
Example 3: Recall = 0.92, Precision = 0.99, F-measure = 0.95. High precision, high recall ⇒ high F-measure.
Cluster Evaluation
(1) External Validity Measures: F-Measure (for a Clustering)

[Figure: cluster j (node-based analysis).]

- Clustering $\mathcal{C} = \{C_1, \ldots, C_k\}$ and classification $\mathcal{C}^* = \{C^*_1, \ldots, C^*_l\}$ of $D$.
- $F_{i,j}$ is the F-measure of a cluster $j$ computed with respect to a class $i$.

Precision of cluster $j$ with respect to class $i$ is $|C_j \cap C^*_i| / |C_j|$ (here: $\mathit{Prec}_{i,j} = 0.71$)
Recall of cluster $j$ with respect to class $i$ is $|C_j \cap C^*_i| / |C^*_i|$ (here: $\mathit{Rec}_{i,j} = 1.0$)

⇒ Micro-averaged F-measure for $\langle D, \mathcal{C}, \mathcal{C}^* \rangle$:

$$F = \sum_{i=1}^{l} \frac{|C^*_i|}{|D|} \cdot \max_{j=1,\ldots,k} \{F_{i,j}\}$$
⇒ Macro-averaged F-measure for $\langle D, \mathcal{C}, \mathcal{C}^* \rangle$:

$$F = \frac{1}{l} \sum_{i=1}^{l} \max_{j=1,\ldots,k} \{F_{i,j}\}$$
Remarks:
- Micro averaging treats objects (documents) equally, whereas macro averaging treats classes equally.
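The difference between the two averages can be sketched as follows. Clusters and classes are represented as Python sets of object ids; the helper names and the toy data are assumptions for illustration, and α = 1 is used for every $F_{i,j}$.

```python
def f1(cluster, cls):
    """F-measure (alpha = 1) of a cluster with respect to a class."""
    overlap = len(cluster & cls)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(cls)
    return 2 * precision * recall / (precision + recall)


def averaged_f(clusters, classes):
    """Return (micro, macro) averaged F-measure of a clustering."""
    n = sum(len(c) for c in classes)  # |D|, assuming the classes partition D
    # For each class, take the best matching cluster.
    best = [max(f1(cl, c) for cl in clusters) for c in classes]
    micro = sum(len(c) / n * b for c, b in zip(classes, best))
    macro = sum(best) / len(classes)
    return micro, macro


# Hypothetical toy data: a large class that is recovered perfectly and a
# small class that is split across two clusters.
classes = [set(range(0, 8)), {8, 9}]
clusters = [set(range(0, 8)), {8}, {9}]
micro, macro = averaged_f(clusters, classes)
```

In this toy case the poorly recovered class is small, so it lowers the macro average more than the micro average.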
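Cluster Evaluation
(1) External Validity Measures: Entropy

[Figure: cluster j (node-based analysis).]

- A cluster $C$ acts as information source $L$. $L$ emits cluster labels $L_1, \ldots, L_l$ with probabilities $P(L_1), \ldots, P(L_l)$.
  Example: two labels $L_1$, $L_2$ (shown as symbols in the figure) with estimated probabilities $\hat{P} = 10/14$ and $\hat{P} = 4/14$.
- Entropy of $L$: $H(L) = -\sum_{i=1}^{l} P(L_i) \cdot \log_2(P(L_i))$
  Entropy of $C_j$ wrt. $\mathcal{C}^*$: $H(C_j) = -\sum_{C_j \cap C^*_i \neq \emptyset} \frac{|C_j \cap C^*_i|}{|C_j|} \cdot \log_2 \frac{|C_j \cap C^*_i|}{|C_j|}$

⇒ Entropy of $\mathcal{C}$ wrt. $\mathcal{C}^*$: $H(\mathcal{C}) = \sum_{C_j \in \mathcal{C}} \frac{|C_j|}{|D|} \cdot H(C_j)$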
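A minimal sketch of the two entropy formulas, with clusters and classes given as sets of object ids. The function names and the object ids are assumptions; the 10 : 4 split mirrors the estimated probabilities of the example.

```python
from math import log2


def cluster_entropy(cluster, classes):
    """H(C_j): entropy of the class distribution inside one cluster."""
    h = 0.0
    for cls in classes:
        overlap = len(cluster & cls)
        if overlap:  # only non-empty intersections contribute
            p = overlap / len(cluster)
            h -= p * log2(p)
    return h


def clustering_entropy(clusters, classes):
    """H(C): cluster entropies weighted by relative cluster size."""
    n = sum(len(c) for c in clusters)  # |D|, assuming the clusters partition D
    return sum(len(c) / n * cluster_entropy(c, classes) for c in clusters)


# One cluster of 14 objects containing a 10 : 4 mix of two classes.
classes = [set(range(0, 10)), set(range(10, 14))]
mixed = set(range(0, 14))
h_mixed = cluster_entropy(mixed, classes)      # about 0.863 bits
h_pure = clustering_entropy(classes, classes)  # pure clusters give 0
```

Lower values are better: a clustering whose clusters each contain a single class has entropy 0.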
Cluster Evaluation
(1) External Validity Measures: Rand, Jaccard

[Figure: cluster j (edge-based analysis); edges are marked as true positive, false positive, true negative, or false negative.]

- $R(\mathcal{C}) = \dfrac{|TP| + |TN|}{|TP| + |TN| + |FP| + |FN|} = \dfrac{|TP| + |TN|}{n(n-1)/2}$, with $n = |D|$

- $J(\mathcal{C}) = \dfrac{|TP|}{|TP| + |FP| + |FN|}$
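The edge-based counts can be sketched by enumerating all object pairs. The reading used here is an assumption consistent with the formulas: a pair is "positive" in the hypothesis if both objects share a cluster, and "positive" in the truth if both share a class. Function names and data are illustrative.

```python
from itertools import combinations


def pair_counts(cluster_of, class_of):
    """Count object pairs as (TP, FP, FN, TN).

    cluster_of and class_of map each object to its cluster and class label.
    """
    tp = fp = fn = tn = 0
    for x, y in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[x] == cluster_of[y]
        same_class = class_of[x] == class_of[y]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def jaccard_index(tp, fp, fn, tn):
    return tp / (tp + fp + fn)


# Hypothetical toy example with five objects.
class_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
counts = pair_counts(cluster_of, class_of)
rand = rand_index(*counts)
jaccard = jaccard_index(*counts)
```

The four counts always sum to n(n−1)/2, which is the denominator of the Rand statistic. Jaccard ignores the true negatives, which usually dominate when there are many clusters.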
Cluster Evaluation
(2) Internal Validity Measures: Edge Correlation [Tan/Steinbach/Kumar 2005]

[Figure: two k-means clusterings, with τ = 0.58 and τ = 0.92.]

$$
\begin{pmatrix}
1.0 & 0.2 & 0.1 & 0.3 & \ldots & 0.1 & 0.0 \\
- & 1.0 & 0.1 & 0.0 & \ldots & 0.0 & 0.2 \\
 & & & \vdots & & & \\
- & - & - & - & - & 1.0 & 0.6 \\
- & - & - & - & - & - & 1.0
\end{pmatrix}
\sim
\begin{pmatrix}
1 & 0 & 0 & 1 & \ldots & 0 & 0 \\
- & 1 & 0 & 0 & \ldots & 0 & 1 \\
 & & & \vdots & & & \\
- & - & - & - & - & 1 & 1 \\
- & - & - & - & - & - & 1
\end{pmatrix}
$$

- Construct occurrence matrix based on cluster analysis.
- Compare similarity matrix to occurrence matrix: correlation τ
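The two steps can be sketched as follows. The slides do not say which correlation coefficient τ denotes; this sketch assumes the Pearson correlation over the entries above the diagonal. The function name and the similarity values are illustrative.

```python
from math import sqrt


def edge_correlation(sim, labels):
    """Pearson correlation between similarity and occurrence matrix.

    sim is a symmetric n x n similarity matrix (list of lists); labels[i]
    is the cluster label of object i. Only entries above the diagonal are
    used, since both matrices are symmetric.
    """
    n = len(labels)
    xs, ys = [], []
    for i in range(n):
        for j in range(i + 1, n):
            xs.append(sim[i][j])
            # Occurrence matrix entry: 1 if i and j share a cluster.
            ys.append(1.0 if labels[i] == labels[j] else 0.0)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical similarities for four objects forming two tight pairs.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
tau_good = edge_correlation(sim, [0, 0, 1, 1])  # clusters match the pairs
tau_bad = edge_correlation(sim, [0, 1, 0, 1])   # clusters cut the pairs
```

A clustering that groups similar objects yields a correlation close to 1, whereas a clustering that separates them yields a low or negative value.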
[Figure: k-means on structured data. Similarity matrix sorted by cluster label.]
[Figure: DBSCAN on random data. Similarity matrix sorted by cluster label.]
[Figure: k-means on random data. Similarity matrix sorted by cluster label.]
[Figure: Complete link on random data. Similarity matrix sorted by cluster label.]
[Figure: DBSCAN on structured data. Similarity matrix sorted by cluster label.]
Cluster Evaluation
(2) Internal Validity Measures: Structural Analysis

[Figure: illustration of $d_C(C_1, C_2)$, $\Delta(C)$, and $\sigma^2(C)$.]

- Distance between two clusters, $d_C(C_1, C_2)$.
- Diameter of a cluster, $\Delta(C)$.
- Scatter within a cluster, $\sigma^2(C)$, SSE.
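The slides do not give formulas for the scatter. A common reading, assumed here, is that SSE is the sum of squared Euclidean distances of the cluster's points to their centroid and that $\sigma^2(C) = \mathrm{SSE}(C) / |C|$. The function names and the points are illustrative.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sse(points):
    """Sum of squared Euclidean distances of the points to their centroid."""
    mu = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in points)


def scatter(points):
    """Assumed definition: sigma^2(C) = SSE(C) / |C|."""
    return sse(points) / len(points)


# Hypothetical cluster: the four corners of a square with side length 2.
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
cluster_sse = sse(cluster)
cluster_scatter = scatter(cluster)
```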
Cluster Evaluation
(2) Internal Validity Measures: Dunn Index

[Figure: three clusters $C_1$, $C_2$, $C_3$ with cluster diameter $\Delta(C_1)$ and cluster distance $d_C(C_1, C_2)$.]

$$I(\mathcal{C}) = \frac{\min_{i \neq j} \{d_C(C_i, C_j)\}}{\max_{1 \leq l \leq k} \{\Delta(C_l)\}}, \qquad I(\mathcal{C}) \to \max$$

- Dunn is susceptible to noise.
- Dunn is biased towards the worst substructure in a clustering (cf. the min).
- The Dunn value is too low since distances and diameters are not put into relation.
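A small sketch of the Dunn index. The slides leave $d_C$ and $\Delta$ open; this sketch assumes the smallest distance between points of two clusters for $d_C$ and the largest pairwise distance within a cluster for $\Delta$. It also assumes at least two clusters and at least one cluster with two or more distinct points. Names and data are illustrative.

```python
from itertools import combinations
from math import dist


def diameter(cluster):
    """Assumed definition: largest pairwise distance within a cluster."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def cluster_distance(c1, c2):
    """Assumed definition: smallest distance between points of two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)


def dunn_index(clusters):
    """Smallest cluster distance divided by largest cluster diameter."""
    min_dist = min(cluster_distance(a, b) for a, b in combinations(clusters, 2))
    max_diam = max(diameter(c) for c in clusters)
    return min_dist / max_diam


# Hypothetical 2-D clustering with two compact, well separated clusters.
compact = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]
# The same clustering with one noise point assigned to the second cluster.
noisy = [[(0, 0), (0, 1)], [(5, 0), (5, 1), (2, 0)]]
dunn_compact = dunn_index(compact)
dunn_noisy = dunn_index(noisy)
```

A single noise point shrinks the minimum cluster distance and enlarges the maximum diameter at the same time, which illustrates the susceptibility to noise.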
![Page 54: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/54.jpg)
Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]
DM:II-251 Cluster Analysis © STEIN 2006-2019
![Page 55: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/55.jpg)
Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]
Different models (feature sets) yield different similarity graphs.
DM:II-252 Cluster Analysis © STEIN 2006-2019
![Page 56: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/56.jpg)
Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]
Different models (feature sets) yield different similarity graphs.
DM:II-253 Cluster Analysis © STEIN 2006-2019
![Page 57: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/57.jpg)
Cluster Evaluation(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]
Compare (for alternative clusterings) the similarity density within the clusters to theaverage similarity of the entire graph.
![Page 62: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/62.jpg)
Cluster Evaluation
(2) Internal Validity Measures: Expected Density ρ [Stein/Meyer zu Eissen 2007]
Graph $G = \langle V, E \rangle$:
- $G$ is called sparse if $|E| = O(|V|)$; $G$ is called dense if $|E| = O(|V|^2)$
⇒ the density $\theta$ is computed from the equation $|E| = |V|^{\theta}$
Similarity graph $G = \langle V, E, w \rangle$:
- $|E| \sim w(G) = \sum_{e \in E} w(e)$
⇒ the density $\theta$ is computed from the equation $w(G) = |V|^{\theta}$
Cluster $C_i$ induces subgraph $G_i$:
⇒ the expected density $\rho$ relates the density of $G_i$ to the average density in $G$
$$\rho(G_i) = \frac{w(G_i)}{|V_i|^{\theta}}$$
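A small worked example may help to read the definition. The numbers are made up for illustration and do not come from the slides: assume a similarity graph with 100 nodes and total edge weight 1000, and a cluster of 25 nodes whose induced subgraph has total weight 250.

```latex
% Hypothetical numbers, chosen only so that the arithmetic is easy to follow.
\begin{align*}
  w(G) = |V|^{\theta}
    &\;\Rightarrow\; \theta = \frac{\ln w(G)}{\ln |V|}
       = \frac{\ln 1000}{\ln 100} = 1.5 \\
  \rho(G_i) &= \frac{w(G_i)}{|V_i|^{\theta}}
       = \frac{250}{25^{1.5}} = \frac{250}{125} = 2
\end{align*}
```

Under these assumptions the cluster carries twice the weight that a subgraph of its size would have at the average density of $G$, so a value above 1 suggests a cluster that is denser than average.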
![Page 65: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/65.jpg)
Cluster Evaluation
(3) Relative Validity Measures: Elbow Criterion
1. Hyperparameter alternatives of a clustering algorithm: $\pi_1, \dots, \pi_m$
   - number of centroids for k-means
   - stopping level for hierarchical algorithms
   - neighborhood size for DBSCAN
2. Set of clusterings $\mathbf{C} = \{\mathcal{C}_{\pi_1}, \dots, \mathcal{C}_{\pi_m}\}$ associated with $\pi_1, \dots, \pi_m$.
3. Points of an error curve $\{(\pi_i, e(\mathcal{C}_{\pi_i})) \mid i = 1, \dots, m\}$.
[Figure: error curve with π = cluster number k (x-axis, from 1 to |V|) and e = SSE (y-axis)]
4. Find the point that maximizes error reduction with regard to its successor.
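The four steps above can be sketched in a few lines of Python. The slide does not fix how "error reduction with regard to its successor" is measured, so this sketch assumes one common reading: pick the interior point where the error drop from the predecessor exceeds the drop to the successor by the largest margin. The function name `elbow_point` and the SSE values are hypothetical.

```python
def elbow_point(params, errors):
    """Return the hyperparameter value at the 'elbow' of an error curve.

    Assumption (not stated on the slide): the elbow is the interior point
    with the largest discrete second difference, i.e. the largest gap
    between the error drop before it and the error drop after it.
    """
    if len(params) != len(errors) or len(errors) < 3:
        raise ValueError("need at least three (parameter, error) points")
    best_index, best_gain = None, float("-inf")
    for i in range(1, len(errors) - 1):
        drop_before = errors[i - 1] - errors[i]   # reduction gained by reaching point i
        drop_after = errors[i] - errors[i + 1]    # reduction still gained by the successor
        gain = drop_before - drop_after
        if gain > best_gain:
            best_index, best_gain = i, gain
    return params[best_index]


# Hypothetical error curve: SSE of k-means for k = 1, ..., 6.
ks = [1, 2, 3, 4, 5, 6]
sse = [100.0, 60.0, 25.0, 20.0, 17.0, 15.0]
best_k = elbow_point(ks, sse)
print(best_k)
```

With these made-up values the curve flattens after k = 3, which is the point the function selects.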
![Page 66: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/66.jpg)
Cluster Evaluation
(3) Relative Validity Measures: Elbow Criterion
$d_C$: Hamming distance
Merging: complete link
http://cs.jhu.edu/~razvanm/fs-expedition/2.6.x.html
Relations between 1377 file systems for Linux Kernel 2.6.0. [Razvan Musaloiu 2009]
![Page 67: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/67.jpg)
Cluster Evaluation
(3) Relative Validity Measures: Elbow Criterion
$d_C$: Hamming distance
Merging: group average link
http://cs.jhu.edu/~razvanm/fs-expedition/2.6.x.html
Relations between 1377 file systems for Linux Kernel 2.6.0. [Razvan Musaloiu 2009]
![Page 69: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/69.jpg)
Cluster Evaluation
Correlation between External and Internal Measures
In the wild, we are not given a reference classification.
⇒ An external evaluation is not possible (though many papers report on such experiments).
⇒ Resort to an internal evaluation (connectivity, squared error sums, distance-diameter heuristics, etc.).
“To which extent can an internal evaluation φ be used to predict for a clustering its distance from the best reference classification, say, to predict the F-measure?”
$$\operatorname*{argmax}_{\varphi} \; \{\, \tau\langle X, Y \rangle \mid x = F(\mathcal{C}),\ y = \varphi(\mathcal{C}),\ \mathcal{C} \in \mathbf{C} \,\}$$ [Stein/Meyer zu Eissen 2007]
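The selection rule above can be illustrated with a short Python sketch. The slide does not say which correlation τ denotes; the sketch assumes a rank correlation and uses Kendall's τ (variant a, no tie correction). The measure names `phi_a` and `phi_b` and all scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant pairs) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two values")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def best_internal_measure(f_values, internal_scores):
    """Pick the internal measure whose scores correlate best with the F-measure."""
    return max(internal_scores,
               key=lambda name: kendall_tau(f_values, internal_scores[name]))


# Hypothetical data: four clusterings, their F-measure, and two internal measures.
f_values = [0.4, 0.6, 0.7, 0.9]
internal_scores = {
    "phi_a": [1.0, 2.0, 3.0, 4.0],   # ranks the clusterings exactly like F
    "phi_b": [3.0, 1.0, 4.0, 2.0],   # unrelated ranking
}
best = best_internal_measure(f_values, internal_scores)
print(best)
```

Such an experiment is only possible on benchmark data where a reference classification exists; the measure found this way would then be used on data without one.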
![Page 70: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/70.jpg)
Cluster Evaluation
Correlation between External and Internal Measures
[Figure: scatter plot; x-axis: Cluster Validity; y-axis: F-Measure (0.3 to 1)]
Perfect correlation (desired).
![Page 71: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/71.jpg)
Cluster Evaluation
Correlation between External and Internal Measures
[Figure: scatter plot; x-axis: Davies-Bouldin (-35 to 0); y-axis: F-Measure (0.3 to 1); 5 classes, 800 documents]
Davies-Bouldin: $\frac{1}{k} \cdot \sum_{i=1}^{k} \max_{j} \frac{s(C_i) + s(C_j)}{d_C(C_i, C_j)}$
Prefers spherical clusters.
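A minimal sketch of the Davies-Bouldin formula in Python follows. This slide does not define $s$ and $d_C$, so the sketch assumes the usual choices: $s(C)$ is the mean Euclidean distance of the cluster's points to its centroid, and $d_C$ is the distance between centroids. The maximum is taken over $j \neq i$ to avoid dividing by zero. The function names and the toy data are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equally long coordinate tuples."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def scatter(points, center):
    """Assumed s(C): mean Euclidean distance of the points to the centroid."""
    return sum(math.dist(p, center) for p in points) / len(points)


def davies_bouldin(clusters):
    """Average, over all clusters, of the worst (scatter sum / separation) ratio."""
    centers = [centroid(c) for c in clusters]
    scatters = [scatter(c, m) for c, m in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i   # assumed j != i
        )
    return total / k


# Two compact, well separated toy clusters.
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
db = davies_bouldin(clusters)
print(db)
```

Smaller values indicate compact, well separated clusters. Because scatter is measured around a centroid, elongated or nested clusters are penalized, which matches the remark that the measure prefers spherical clusters.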
![Page 72: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/72.jpg)
Cluster Evaluation
Correlation between External and Internal Measures
[Figure: scatter plot; x-axis: Dunn Index (0 to 1.6); y-axis: F-Measure (0.3 to 1); 5 classes, 800 documents]
Dunn Index: $\frac{\min_{i \neq j} \{ d_C(C_i, C_j) \}}{\max_{1 \leq l \leq k} \{ \Delta(C_l) \}}$
Maximizes dilatation = inter/intra-cluster-diameter.
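The Dunn index can be sketched the same way. The slide leaves $d_C$ and $\Delta$ open; this sketch assumes $d_C$ is the smallest Euclidean distance between points of two clusters and $\Delta(C)$ is the cluster diameter, i.e. the largest distance between two of its points. The function names and the data are hypothetical.

```python
import math
from itertools import combinations


def cluster_distance(a, b):
    """Assumed d_C: smallest distance between a point of a and a point of b."""
    return min(math.dist(p, q) for p in a for q in b)


def diameter(cluster):
    """Assumed Delta(C): largest distance between two points of the cluster."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def dunn_index(clusters):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    min_separation = min(cluster_distance(a, b) for a, b in combinations(clusters, 2))
    max_diameter = max(diameter(c) for c in clusters)
    # Undefined if every cluster is a single point (max_diameter == 0).
    return min_separation / max_diameter


# Two compact, well separated toy clusters.
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
dunn = dunn_index(clusters)
print(dunn)
```

Larger values are better here. Since both numerator and denominator are extreme values, a single outlier can change the index considerably.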
![Page 73: Chapter DM:II (continued) - webis.de · Cluster Evaluation [Tan/Steinbach/Kumar 2005] Random points DM:II-200 Cluster Analysis ©STEIN 2006-2019](https://reader030.vdocument.in/reader030/viewer/2022041222/5e0bb0aabeb12f5aad2f8024/html5/thumbnails/73.jpg)
Cluster Evaluation
Correlation between External and Internal Measures
[Figure: scatter plot; x-axis: Expected Density (600 to 1000); y-axis: F-Measure (0.3 to 1); 5 classes, 800 documents]
Expected Density: $\bar{\rho} = \sum_{i=1}^{k} \frac{|V_i|}{|V|} \cdot \frac{w(G_i)}{|V_i|^{\theta}}$
Independent of cluster forms and sizes.
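As a rough illustration, the averaged expected density can be computed directly from a weighted edge list. The sketch follows the definitions given earlier in the deck ($w(G)$ as the sum of edge weights, $\theta$ from $w(G) = |V|^{\theta}$); the edge-dictionary representation, the function names, and the tiny graph are assumptions made for this example.

```python
import math


def total_weight(edges, nodes=None):
    """Sum of edge weights; if `nodes` is given, only edges inside that node set."""
    return sum(w for (u, v), w in edges.items()
               if nodes is None or (u in nodes and v in nodes))


def expected_density(edges, clusters):
    """Size-weighted average of w(G_i) / |V_i|**theta over all clusters."""
    all_nodes = set().union(*clusters)
    n = len(all_nodes)
    # theta solves w(G) = |V|**theta; requires w(G) > 0 and |V| > 1.
    theta = math.log(total_weight(edges)) / math.log(n)
    rho_bar = 0.0
    for cluster in clusters:
        rho_i = total_weight(edges, cluster) / len(cluster) ** theta
        rho_bar += len(cluster) / n * rho_i
    return rho_bar


# Hypothetical similarity graph: two strongly connected pairs, one weak bridge.
edges = {("a", "b"): 0.9, ("c", "d"): 0.8, ("b", "c"): 0.1}
good = expected_density(edges, [{"a", "b"}, {"c", "d"}])
bad = expected_density(edges, [{"a", "c"}, {"b", "d"}])
print(good, bad)
```

The clustering that keeps the heavy edges inside clusters scores higher than the one that cuts them, which is the behavior a validity measure should show when comparing alternative clusterings of the same graph.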