Statistics 202: Data Mining, Week 9
Based in part on slides from textbook, slides of Susan Holmes
© Jonathan Taylor
December 2, 2012
Part I
Hierarchical clustering
Hierarchical clustering
Description
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
Hierarchical clustering
A clustering and its dendrogram.
Hierarchical clustering
Strengths
Do not have to assume any particular number of clusters. Each horizontal cut of the tree yields a clustering.
The tree may correspond to a meaningful taxonomy (e.g., animal kingdom, phylogeny reconstruction, ...).
Need only a similarity or distance matrix for implementation.
Hierarchical clustering
Agglomerative
Start with the points as individual clusters.
At each step, merge the closest pair of clusters until only one cluster (or some fixed number k of clusters) remains.
Hierarchical clustering
Divisive
Start with one, all-inclusive cluster.
At each step, split a cluster until each cluster contains a point (or there are k clusters).
Hierarchical clustering
Agglomerative Clustering Algorithm
1 Compute the proximity matrix.
2 Let each data point be a cluster.
3 While there is more than one cluster:
   1 Merge the two closest clusters.
   2 Update the proximity matrix.
The major difference between algorithms is how the proximity of two clusters is computed.
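To make the loop concrete, here is a minimal sketch of the algorithm in R, using single linkage as the proximity between clusters. It assumes D is a symmetric distance matrix with row and column names (an assumption for illustration); it is written for clarity rather than efficiency, and in practice one would use the built-in dist() and hclust() functions.

    # A minimal agglomerative loop (single linkage), assuming D is a symmetric
    # distance matrix whose rows and columns are named.  For real work, use hclust().
    agglomerate <- function(D) {
      clusters <- as.list(rownames(D))      # step 2: each point starts as its own cluster
      merges <- list()
      while (length(clusters) > 1) {        # step 3: repeat until one cluster remains
        best <- c(1, 2); best_d <- Inf
        for (i in 1:(length(clusters) - 1)) {
          for (j in (i + 1):length(clusters)) {
            # single-linkage proximity: the minimum pairwise distance between members
            d_ij <- min(D[clusters[[i]], clusters[[j]]])
            if (d_ij < best_d) { best_d <- d_ij; best <- c(i, j) }
          }
        }
        merges[[length(merges) + 1]] <- list(merged = clusters[best], height = best_d)
        # merge the closest pair; the proximity "update" is implicit because
        # distances are recomputed from D on the next pass
        clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
        clusters[[best[2]]] <- NULL
      }
      merges
    }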
Hierarchical clustering
Starting point for agglomerative clustering.
Hierarchical clustering
After some merging steps, we have some clusters
Intermediate point, with 5 clusters.
Hierarchical clustering
We will merge C2 and C5.
Hierarchical clustering
How do we update the proximity matrix?
Hierarchical clustering
How do we define inter-cluster similarity? We need a notion of similarity (proximity) between clusters.
Candidates: MIN, MAX, group average, distance between centroids, and other methods driven by an objective function (Ward's method uses squared error).
Hierarchical clustering
Single linkage uses the minimum distance.
Hierarchical clustering
Complete linkage uses the maximum distance.
Hierarchical clustering
Group average linkage uses the average distance between groups.
Hierarchical clustering
Centroid linkage uses the distance between the centroids of the clusters (presumes one can compute centroids...).
Hierarchical clustering
MIN or single link: the similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.
Proximity matrix and dendrogram of single linkage.
Hierarchical clustering
Distance matrix for nested clusterings
     1     2     3     4     5     6
1  0.00  0.24  0.22  0.37  0.34  0.23
2  0.24  0.00  0.15  0.20  0.14  0.25
3  0.22  0.15  0.00  0.15  0.28  0.11
4  0.37  0.20  0.15  0.00  0.29  0.22
5  0.34  0.14  0.28  0.29  0.00  0.39
6  0.23  0.25  0.11  0.22  0.39  0.00
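As a sketch, this matrix can be entered directly in R and clustered with single linkage; the first merge joins points 3 and 6 at height 0.11, the smallest off-diagonal entry (the labels 1-6 are simply the point indices above).

    # The distance matrix above, clustered with single linkage in base R.
    D <- matrix(c(0.00, 0.24, 0.22, 0.37, 0.34, 0.23,
                  0.24, 0.00, 0.15, 0.20, 0.14, 0.25,
                  0.22, 0.15, 0.00, 0.15, 0.28, 0.11,
                  0.37, 0.20, 0.15, 0.00, 0.29, 0.22,
                  0.34, 0.14, 0.28, 0.29, 0.00, 0.39,
                  0.23, 0.25, 0.11, 0.22, 0.39, 0.00),
                nrow = 6, byrow = TRUE, dimnames = list(1:6, 1:6))
    hc <- hclust(as.dist(D), method = "single")
    plot(hc)    # dendrogram: the first merge is {3, 6} at height 0.11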
Hierarchical clustering
Nested cluster representation and dendrogram of single linkage.
Hierarchical clustering
Single linkage
Can handle irregularly shaped regions fairly naturally.
Sensitive to noise and outliers in the form of “chaining”.
The Iris data (single linkage)
Hierarchical clustering
MAX or complete linkage: the similarity of two clusters is based on the two least similar (most distant) points in the different clusters; it is determined by all pairs of points in the two clusters.
Proximity matrix and dendrogram of complete linkage.
Hierarchical clustering
Nested cluster and dendrogram of complete linkage.
Hierarchical clustering
Complete linkage
Less sensitive to noise and outliers than single linkage.
Regions are generally compact, but may violate "closeness": points may be much closer to some points in a neighbouring cluster than to points in their own cluster.
This manifests itself as breaking large clusters.
Clusters are biased to be globular.
The Iris data (complete linkage)
Hierarchical clustering
Nested cluster and dendrogram of group average linkage.
Hierarchical clustering
Average linkage
Given two elements of the partition, C_r and C_s, we might consider

d_GA(C_r, C_s) = (1 / (|C_r| |C_s|)) Σ_{x ∈ C_r, y ∈ C_s} d(x, y)
A compromise between single and complete linkage.
Shares the globular-cluster bias of complete linkage, but is less sensitive to noise than single linkage.
The Iris data (average linkage)
Hierarchical clustering
Ward’s linkage
Similarity of two clusters is based on the increase in squared error when two clusters are merged.
Similar to average linkage if the dissimilarity between points is the squared distance; hence, it shares many properties of average linkage.
A hierarchical analogue of K-means.
Sometimes used to initialize K-means.
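As an illustration, here is a sketch of Ward's linkage on the four Iris measurements (the data shown in the figures that follow), cutting the dendrogram into three clusters and cross-tabulating against the species labels; in current versions of R, the method name "ward.D2" gives Ward's criterion on Euclidean distances.

    # Ward's linkage on the Iris measurements, cut into k = 3 clusters.
    data(iris)
    hc <- hclust(dist(iris[, 1:4]), method = "ward.D2")
    cl <- cutree(hc, k = 3)
    table(cl, iris$Species)   # compare the three clusters with the species labels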
The Iris data (Ward’s linkage)
NCI data (complete linkage)
NCI data (single linkage)
NCI data (average linkage)
NCI data (Ward’s linkage)
Hierarchical clustering
Computational issues
O(n^2) space, since it uses the proximity matrix.
O(n^3) time in many cases, as there are n steps, and at each step a matrix of size n^2 must be updated and/or searched.
Hierarchical clustering
Statistical issues
Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers.
Difficulty handling different-sized clusters and convex shapes.
Breaking large clusters.
Part II
Model-based clustering
Model-based clustering
General approach
Choose a type of mixture model (e.g., multivariate normal) and a maximum number of clusters, K.
Use a specialized hierarchical clustering technique.
Use some criterion to determine the optimal model and number of clusters.
Model-based clustering
Choosing a mixture model
General form of the mixture model:

f(x) = Σ_{j=1}^{K} π_j f(x; θ_j)

For the multivariate normal, θ_j = (μ_j, Σ_j).
The EM algorithm we discussed before assumed the Σ_j are all different in the different classes.
Other possibilities: Σ_j = Σ, Σ_j = λ_j · I, etc.
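As a small illustration of this density, here is a one-dimensional two-component normal mixture in R; the weights, means, and standard deviations below are made-up values for illustration only.

    # f(x) = p_1 N(x; mu_1, sigma_1^2) + p_2 N(x; mu_2, sigma_2^2)
    dmix <- function(x, p = c(0.3, 0.7), mu = c(0, 3), sigma = c(1, 0.5)) {
      p[1] * dnorm(x, mu[1], sigma[1]) + p[2] * dnorm(x, mu[2], sigma[2])
    }
    curve(dmix(x), from = -4, to = 6)   # a bimodal density, one mode per component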
Model-based clustering
Choosing a mixture model
Generally, we can write

Σ_j = c_j D_j A_j D_j^T

with c_j diag(A_j) the eigenvalues of Σ_j, normalized so that max(A_j) = 1.
The parameter c_j is the size, A_j is the shape, and D_j is the orientation.
Model-based agglomerative clustering
Ward’s criterion
A hierarchical clustering algorithm that merges k clusters {C^k_1, ..., C^k_k} into k − 1 clusters based on

WSS = Σ_{j=1}^{k−1} WSS(C^{k−1}_j)

where WSS is the within-cluster sum of squared distances.
The procedure merges the two clusters C^k_i, C^k_l that produce the smallest increase in WSS.
NCI data (Ward’s linkage)
Model-based agglomerative clustering
If Σ_j = σ^2 · I, then Ward's criterion is equivalent to merging based on the criterion

−2 log L(θ, l)

where

L(θ, l) = Π_{i=1}^{n} f(x_i; θ_{l_i})

is called the classification likelihood.
This idea can be used to make a hierarchical clustering algorithm for other types of multivariate normal models, e.g., equal shape, same size, etc.
Model selection & BIC
Bayesian Information Criterion
After a merge, the clusters are taken as initial starting points for the EM algorithm.
This results in several mixture models: one for each number of clusters and each type of mixture model considered.
How do we choose?
This raises the topic of model selection
Model selection & BIC
Bayesian Information Criterion
Suppose we have several possible models M = {M_1, ..., M_T} for a data set, which we assume is given by a data matrix X_{n×p}.
These models have parameters Θ = {θ_1, ..., θ_T}. Further, suppose that each one has a likelihood L_j, and Θ̂ = {θ̂_1, ..., θ̂_T} are the maximum likelihood estimators.
We can compare

−2 log L_j(θ̂_j)

but this ignores how much "fitting" each model does.
A common approach is to add a penalty that makes different models comparable.
Model selection & BIC
Bayesian Information Criterion
The BIC of a model is usually

BIC(M_j) = −2 log L_j(θ̂_j) + log n · (# parameters in M_j).

The BIC can be thought of as approximating

P(M_j is correct | X_{n×p})

under an appropriate Bayesian model for X.
Model selection & BIC
Bayesian Information Criterion
Typically, statisticians will try to prove that choosing the model with the best BIC yields the "correct model".
Some theoretical justification is needed for this, and it breaks down for mixture models. Nevertheless, BIC is still used.
Another common criterion is AIC (Akaike Information Criterion):

AIC(M_j) = −2 log L_j(θ̂_j) + 2 · (# parameters in M_j).
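To make the two penalties concrete, here is a sketch computing BIC and AIC by hand for a single-component normal model of one Iris variable; the choice of variable is arbitrary, and the model has two parameters, μ and σ^2.

    # BIC and AIC from the definitions above, for a one-component normal model.
    z  <- iris$Sepal.Length
    n  <- length(z)
    mu <- mean(z); s2 <- mean((z - mu)^2)               # maximum likelihood estimates
    loglik <- sum(dnorm(z, mu, sqrt(s2), log = TRUE))
    p  <- 2                                             # parameters: mu and sigma^2
    c(BIC = -2 * loglik + log(n) * p,
      AIC = -2 * loglik + 2 * p)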
Model-based clustering
Summary
1 Choose a type of mixture model (e.g., multivariate normal) and a maximum number of clusters, K.
2 Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.
3 Use clusters from the previous step to initialize EM for the mixture model.
4 Use BIC to compare different mixture models and models with different numbers of clusters.
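A sketch of this pipeline in R using the mclust package (assumed to be installed): Mclust performs the model-based hierarchical agglomeration for initialization, runs EM for each covariance model and each number of components, and reports the model with the best BIC.

    # Model-based clustering of the Iris measurements with mclust.
    library(mclust)
    mc <- Mclust(iris[, 1:4])     # tries several covariance models and numbers of components
    summary(mc)                   # the BIC-best model and its number of components
    plot(mc, what = "BIC")        # BIC traces across models and numbers of clusters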
The Iris data "best" model: equal shape, 2 components
Part III
Outliers
Outliers
Concepts
What is an outlier? A set of data points that are considerably different from the remainder of the data...
When do they appear in data mining tasks?
Given a data matrix X, find all the cases x_i in X with anomaly/outlier scores greater than some threshold t, or find the top n outlier scores.
Given a data matrix X containing mostly normal (but unlabeled) data points, and a test case x_new, compute an anomaly/outlier score for x_new with respect to X.
Applications
Credit card fraud detection; network intrusion detection; misspecification of a model.
What is an outlier?
Outliers
Issues
How many outliers are there in the data?
The method is unsupervised, similar to clustering or to finding clusters with only one point in them.
Usual assumption: there are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data.
Outliers
General steps
Build a profile of the "normal" behavior. The profile generally consists of summary statistics of this "normal" population.
Use these summary statistics to detect anomalies, i.e., points whose characteristics are very far from the normal profile.
General types of schemes involve a statistical model of "normal", and "far" is measured in terms of likelihood.
Other schemes based on distances can be quasi-motivated by such statistical techniques...
Outliers
Statistical approach
Assume a parametric model describing the distribution of the data (e.g., a normal distribution).
Apply a statistical test that depends on:
the data distribution (e.g., normal);
the parameters of the distribution (e.g., mean, variance);
the number of expected outliers (confidence limit, α or Type I error).
Outliers
Grubbs’ Test
Suppose we have a sample of n numbers Z = {Z_1, ..., Z_n}, i.e., an n × 1 data matrix.
Assuming the data come from a normal distribution, Grubbs' test uses the distribution of

(max_{1≤i≤n} Z_i − Z̄) / SD(Z)

to search for outlying large values.
Outliers
Grubbs’ Test
Lower-tail variant:

(min_{1≤i≤n} Z_i − Z̄) / SD(Z)

Two-sided variant:

max_{1≤i≤n} |Z_i − Z̄| / SD(Z)
Outliers
Grubbs’ Test
Having chosen a test statistic, we must determine a threshold that defines our decision rule.
Often this is set via a hypothesis test to control the Type I error.
For a large positive outlier, the threshold is based on choosing some acceptable Type I error α and finding c_α so that

P_0( max_{1≤i≤n} |Z_i − Z̄| / SD(Z) ≥ c_α ) ≈ α

Above, P_0 denotes the distribution of Z under the assumption that there are no outliers.
If the Z_i are IID N(μ, σ^2), it is generally possible to compute a decent approximation of this probability using Bonferroni.
Outliers
Grubbs’ Test
The two-sided critical value has the form

c_α = ((n − 1)/√n) · sqrt( t^2_{α/(2n), n−2} / (n − 2 + t^2_{α/(2n), n−2}) )

where t_{γ,k}, defined by P(T_k ≥ t_{γ,k}) = γ, is the upper-tail quantile of T_k.
In R, you can use the functions pnorm, qnorm, pt, qt for these quantities.
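Putting the pieces together, here is a sketch of the two-sided Grubbs procedure in R, with the critical value computed from the formula above via qt; the simulated sample with one planted outlier is only for illustration.

    # Two-sided Grubbs statistic and its critical value c_alpha.
    grubbs_stat <- function(z) max(abs(z - mean(z))) / sd(z)

    grubbs_crit <- function(n, alpha = 0.05) {
      t2 <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)^2   # squared upper-tail t quantile
      (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
    }

    set.seed(1)
    z <- c(rnorm(50), 6)                        # 50 N(0,1) points plus one planted outlier
    grubbs_stat(z) > grubbs_crit(length(z))     # TRUE here: the sample is flagged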
Model based: linear regression with outliers
First build a model.
Points which don't fit the model well are identified as outliers.
For the example in the figure, a least squares regression model would be appropriate.
Residuals can be fed into Grubbs' test.
Figure: Residuals from the model can be fed into Grubbs' test or a Bonferroni variant.
Outliers
Multivariate data
If the non-outlying data are assumed to be multivariate Gaussian, what is the analogue of Grubbs' statistic

max_{1≤i≤n} |Z_i − Z̄| / SD(Z) ?

Answer: use the Mahalanobis distance

max_{1≤i≤n} (Z_i − Z̄)^T Σ̂^{−1} (Z_i − Z̄)

Above, each individual statistic has what looks like a Hotelling's T^2 distribution.
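A sketch of the multivariate version in R using the built-in mahalanobis() function, with the sample mean and covariance plugged in for the unknown parameters; the Iris measurements are used only as a convenient numeric data set.

    # Squared Mahalanobis distance of each observation from the sample mean.
    X  <- as.matrix(iris[, 1:4])
    d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
    head(sort(d2, decreasing = TRUE))   # the largest distances are the outlier candidates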
Outliers
Likelihood approach
Assume the data follow a mixture

F = (1 − λ) M + λ A.

Above, M is the distribution of "most of the data."
The distribution A is an "outlier" distribution; it could be uniform on a bounding box for the data.
This is a mixture model. If M is parametric, then the EM algorithm fits naturally here.
Any points assigned to A are “outliers.”
Outliers
Likelihood approach
Do we estimate λ or fix it?
The book starts by describing an algorithm that tries to maximize the equivalent classification likelihood

L(θ_M, θ_A; l) = (1 − λ)^{#l_M} Π_{i ∈ l_M} f_M(x_i; θ_M) × λ^{#l_A} Π_{i ∈ l_A} f_A(x_i; θ_A)
Outliers
Likelihood approach: Algorithm
The algorithm tries to maximize this by forming iterative estimates (M_t, A_t) of the "normal" and "outlying" data points.
1 At each stage, try to move individual points from M_t to A_t.
2 Find (θ̂_M, θ̂_A) based on the new partition (if necessary).
3 If the increase in likelihood is large enough, call these the new sets (M_{t+1}, A_{t+1}).
4 Repeat until no further changes.
Outliers
Nearest neighbour approach
Many ways to define outliers.
Example: data points for which there are fewer than k neighboring points within a distance ε.
Example: the n points whose distance to the k-th nearest neighbour is largest.
Example: the n points whose average distance to the first k nearest neighbours is largest.
Each of these methods depends on the choice of some parameters: k, n, ε. It is difficult to choose these in a systematic way.
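As an illustration of the second definition in the list above, here is a sketch scoring each point by its distance to its k-th nearest neighbour; the helper name and the choice k = 5 are arbitrary.

    # Distance to the k-th nearest neighbour as an outlier score.
    kth_nn_dist <- function(X, k = 5) {
      D <- as.matrix(dist(X))
      diag(D) <- Inf                             # exclude self-distances
      apply(D, 1, function(row) sort(row)[k])    # k-th smallest distance to another point
    }
    scores <- kth_nn_dist(scale(iris[, 1:4]), k = 5)
    head(order(scores, decreasing = TRUE))       # indices of the strongest outlier candidates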
Outliers
Density approach
For each point x_i, compute a density estimate f_{x_i, k} using its k nearest neighbours.
The density estimate used is

f_{x_i, k} = ( Σ_{y ∈ N(x_i, k)} d(x_i, y) / #N(x_i, k) )^{−1}

Define

LOF(x_i) = f_{x_i, k} / ( (Σ_{y ∈ N(x_i, k)} f_{y, k}) / #N(x_i, k) )
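A direct transcription of these two formulas in R, as a sketch; note this is not the original LOF of Breunig et al., which is built from reachability distances and uses the reciprocal orientation of this ratio so that outliers receive large values.

    # Density estimate and LOF exactly as defined above, with k nearest neighbours.
    lof_slides <- function(X, k = 5) {
      D <- as.matrix(dist(X))
      diag(D) <- Inf
      nbrs <- t(apply(D, 1, function(row) order(row)[1:k]))   # k nearest neighbours of each point
      dens <- sapply(seq_len(nrow(X)),
                     function(i) 1 / mean(D[i, nbrs[i, ]]))   # f_{x_i, k}
      sapply(seq_len(nrow(X)),
             function(i) dens[i] / mean(dens[nbrs[i, ]]))     # LOF(x_i)
    }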
Outliers
For each point, compute the density of its local neighborhood.
Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors.
Outliers are points with the largest LOF values.
In the nearest-neighbour approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
Figure: Nearest neighbour vs. density-based approaches.
Outliers
Detection rate
Set P(O) to be the proportion of outliers or anomalies.
Set P(D|O) to be the probability of declaring an outlier if it truly is an outlier. This is the detection rate.
Set P(D|O^c) to be the probability of declaring an outlier if it is truly not an outlier.
Outliers
Bayesian detection rate
The Bayesian detection rate is

P(O|D) = P(D|O) P(O) / ( P(D|O) P(O) + P(D|O^c) P(O^c) ).

The false alarm rate, or false discovery rate, is

P(O^c|D) = P(D|O^c) P(O^c) / ( P(D|O^c) P(O^c) + P(D|O) P(O) ).
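A toy calculation with the first formula (all numbers are hypothetical): even a detector with a high detection rate and a small false alarm probability per case yields a modest Bayesian detection rate when outliers are rare.

    # Bayesian detection rate for hypothetical rates.
    P_O   <- 0.001   # proportion of outliers, P(O)
    P_DO  <- 0.99    # detection rate, P(D | O)
    P_DOc <- 0.01    # P(D | O^c), declaring an outlier when there is none
    P_DO * P_O / (P_DO * P_O + P_DOc * (1 - P_O))   # about 0.09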