Statistics 202: Data Mining, Week 9
Based in part on slides from textbook, slides of Susan Holmes
© Jonathan Taylor
December 2, 2012
Part I
Hierarchical clustering
Hierarchical clustering
Description
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
Hierarchical clustering
A clustering and its dendrogram.
Hierarchical clustering
Strengths
Do not have to assume any particular number of clusters. Each horizontal cut of the tree yields a clustering.
The tree may correspond to a meaningful taxonomy (e.g., animal kingdom, phylogeny reconstruction, ...).
Need only a similarity or distance matrix for implementation.
Hierarchical clustering
Agglomerative
Start with the points as individual clusters.
At each step, merge the closest pair of clusters until only one cluster (or some fixed number k of clusters) remains.
Hierarchical clustering
Divisive
Start with one, all-inclusive cluster.
At each step, split a cluster until each cluster contains a point (or there are k clusters).
Hierarchical clustering
Agglomerative Clustering Algorithm
1 Compute the proximity matrix.
2 Let each data point be a cluster.
3 While there is more than one cluster:
   1 Merge the two closest clusters.
   2 Update the proximity matrix.
The major difference between algorithms is how the proximity of two clusters is computed.
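To make the loop concrete, here is a minimal sketch of the algorithm in R, using single linkage as the proximity between clusters. It assumes D is a symmetric distance matrix with row and column names (an assumption for illustration); it is written for clarity rather than efficiency, and in practice one would use the built-in dist() and hclust() functions.

    # A minimal agglomerative loop (single linkage), assuming D is a symmetric
    # distance matrix whose rows and columns are named.  For real work, use hclust().
    agglomerate <- function(D) {
      clusters <- as.list(rownames(D))      # step 2: each point starts as its own cluster
      merges <- list()
      while (length(clusters) > 1) {        # step 3: repeat until one cluster remains
        best <- c(1, 2); best_d <- Inf
        for (i in 1:(length(clusters) - 1)) {
          for (j in (i + 1):length(clusters)) {
            # single-linkage proximity: the minimum pairwise distance between members
            d_ij <- min(D[clusters[[i]], clusters[[j]]])
            if (d_ij < best_d) { best_d <- d_ij; best <- c(i, j) }
          }
        }
        merges[[length(merges) + 1]] <- list(merged = clusters[best], height = best_d)
        # merge the closest pair; the proximity "update" is implicit because
        # distances are recomputed from D on the next pass
        clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
        clusters[[best[2]]] <- NULL
      }
      merges
    }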
Hierarchical clustering
Starting point for agglomerative clustering.
Hierarchical clustering
After some merging steps, we have some clusters
Intermediate point, with 5 clusters.
Hierarchical clustering
We will merge C2 and C5.
Hierarchical clustering
How do we update the proximity matrix?
Hierarchical clustering
How do we define inter-cluster similarity? We need a notion of similarity (proximity) between clusters.
Candidates: MIN, MAX, group average, distance between centroids, and other methods driven by an objective function (Ward's method uses squared error).
Hierarchical clustering
Single linkage uses the minimum distance.
Hierarchical clustering
Complete linkage uses the maximum distance.
Hierarchical clustering
Group average linkage uses the average distance between groups.
Hierarchical clustering
Centroid linkage uses the distance between the centroids of the clusters (presumes one can compute centroids...).
Hierarchical clustering
MIN or single link: the similarity of two clusters is based on the two most similar (closest) points in the different clusters; it is determined by one pair of points, i.e., by one link in the proximity graph.
Proximity matrix and dendrogram of single linkage.
Hierarchical clustering
Distance matrix for nested clusterings
     1     2     3     4     5     6
1  0.00  0.24  0.22  0.37  0.34  0.23
2  0.24  0.00  0.15  0.20  0.14  0.25
3  0.22  0.15  0.00  0.15  0.28  0.11
4  0.37  0.20  0.15  0.00  0.29  0.22
5  0.34  0.14  0.28  0.29  0.00  0.39
6  0.23  0.25  0.11  0.22  0.39  0.00
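As a sketch, this matrix can be entered directly in R and clustered with single linkage; the first merge joins points 3 and 6 at height 0.11, the smallest off-diagonal entry (the labels 1-6 are simply the point indices above).

    # The distance matrix above, clustered with single linkage in base R.
    D <- matrix(c(0.00, 0.24, 0.22, 0.37, 0.34, 0.23,
                  0.24, 0.00, 0.15, 0.20, 0.14, 0.25,
                  0.22, 0.15, 0.00, 0.15, 0.28, 0.11,
                  0.37, 0.20, 0.15, 0.00, 0.29, 0.22,
                  0.34, 0.14, 0.28, 0.29, 0.00, 0.39,
                  0.23, 0.25, 0.11, 0.22, 0.39, 0.00),
                nrow = 6, byrow = TRUE, dimnames = list(1:6, 1:6))
    hc <- hclust(as.dist(D), method = "single")
    plot(hc)    # dendrogram: the first merge is {3, 6} at height 0.11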
Hierarchical clustering
Nested cluster representation and dendrogram of single linkage.
Hierarchical clustering
Single linkage
Can handle irregularly shaped regions fairly naturally.
Sensitive to noise and outliers in the form of “chaining”.
The Iris data (single linkage)
Hierarchical clustering
MAX or complete linkage: the similarity of two clusters is based on the two least similar (most distant) points in the different clusters; it is determined by all pairs of points in the two clusters.
Proximity matrix and dendrogram of complete linkage.
Hierarchical clustering
Nested cluster and dendrogram of complete linkage.
Hierarchical clustering
Complete linkage
Less sensitive to noise and outliers than single linkage.
Regions are generally compact, but may violate "closeness": points may be much closer to some points in a neighbouring cluster than to points in their own cluster.
This manifests itself as breaking large clusters.
Clusters are biased to be globular.
The Iris data (complete linkage)
Hierarchical clustering
Nested cluster and dendrogram of group average linkage.
Hierarchical clustering
Average linkage
Given two elements of the partition, C_r and C_s, we might consider

d_GA(C_r, C_s) = (1 / (|C_r| |C_s|)) Σ_{x ∈ C_r, y ∈ C_s} d(x, y)
A compromise between single and complete linkage.
Shares the globular-cluster bias of complete linkage, but is less sensitive to noise than single linkage.
The Iris data (average linkage)
Hierarchical clustering
Ward’s linkage
Similarity of two clusters is based on the increase in squared error when two clusters are merged.
Similar to average linkage if the dissimilarity between points is the squared distance; hence, it shares many properties of average linkage.
A hierarchical analogue of K-means.
Sometimes used to initialize K-means.
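As an illustration, here is a sketch of Ward's linkage on the four Iris measurements (the data shown in the figures that follow), cutting the dendrogram into three clusters and cross-tabulating against the species labels; in current versions of R, the method name "ward.D2" gives Ward's criterion on Euclidean distances.

    # Ward's linkage on the Iris measurements, cut into k = 3 clusters.
    data(iris)
    hc <- hclust(dist(iris[, 1:4]), method = "ward.D2")
    cl <- cutree(hc, k = 3)
    table(cl, iris$Species)   # compare the three clusters with the species labels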
The Iris data (Ward’s linkage)
NCI data (complete linkage)
NCI data (single linkage)
NCI data (average linkage)
NCI data (Ward’s linkage)
Hierarchical clustering
Computational issues
O(n^2) space, since it uses the proximity matrix.
O(n^3) time in many cases, as there are n steps, and at each step a matrix of size n^2 must be updated and/or searched.
Hierarchical clustering
Statistical issues
Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers.
Difficulty handling different-sized clusters and convex shapes.
Breaking large clusters.
Part II
Model-based clustering
Model-based clustering
General approach
Choose a type of mixture model (e.g., multivariate normal) and a maximum number of clusters, K.
Use a specialized hierarchical clustering technique.
Use some criterion to determine the optimal model and number of clusters.
Model-based clustering
Choosing a mixture model
General form of the mixture model:

f(x) = Σ_{j=1}^{K} π_j f(x; θ_j)

For the multivariate normal, θ_j = (μ_j, Σ_j).
The EM algorithm we discussed before assumed the Σ_j are all different in the different classes.
Other possibilities: Σ_j = Σ, Σ_j = λ_j · I, etc.
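As a small illustration of this density, here is a one-dimensional two-component normal mixture in R; the weights, means, and standard deviations below are made-up values for illustration only.

    # f(x) = p_1 N(x; mu_1, sigma_1^2) + p_2 N(x; mu_2, sigma_2^2)
    dmix <- function(x, p = c(0.3, 0.7), mu = c(0, 3), sigma = c(1, 0.5)) {
      p[1] * dnorm(x, mu[1], sigma[1]) + p[2] * dnorm(x, mu[2], sigma[2])
    }
    curve(dmix(x), from = -4, to = 6)   # a bimodal density, one mode per component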
Model-based clustering
Choosing a mixture model
Generally, we can write

Σ_j = c_j D_j A_j D_j^T

with c_j diag(A_j) the eigenvalues of Σ_j, normalized so that max(A_j) = 1.
The parameter c_j is the size, A_j is the shape, and D_j is the orientation.
Model-based agglomerative clustering
Ward’s criterion
A hierarchical clustering algorithm that merges k clusters {C^k_1, ..., C^k_k} into k − 1 clusters based on

WSS = Σ_{j=1}^{k−1} WSS(C^{k−1}_j)

where WSS is the within-cluster sum of squared distances.
The procedure merges the two clusters C^k_i, C^k_l that produce the smallest increase in WSS.
NCI data (Ward’s linkage)
Model-based agglomerative clustering
If Σ_j = σ^2 · I, then Ward's criterion is equivalent to merging based on the criterion

−2 log L(θ, l)

where

L(θ, l) = Π_{i=1}^{n} f(x_i; θ_{l_i})

is called the classification likelihood.
This idea can be used to make a hierarchical clustering algorithm for other types of multivariate normal models, e.g., equal shape, same size, etc.
Model selection & BIC
Bayesian Information Criterion
After a merge, the clusters are taken as initial starting points for the EM algorithm.
This results in several mixture models: one for each number of clusters and each type of mixture model considered.
How do we choose?
This raises the topic of model selection
Model selection & BIC
Bayesian Information Criterion
Suppose we have several possible models M = {M_1, ..., M_T} for a data set, which we assume is given by a data matrix X_{n×p}.
These models have parameters Θ = {θ_1, ..., θ_T}. Further, suppose that each one has a likelihood L_j, and Θ̂ = {θ̂_1, ..., θ̂_T} are the maximum likelihood estimators.
We can compare

−2 log L_j(θ̂_j)

but this ignores how much "fitting" each model does.
A common approach is to add a penalty that makes different models comparable.
Model selection & BIC
Bayesian Information Criterion
The BIC of a model is usually

BIC(M_j) = −2 log L_j(θ̂_j) + log n · (# parameters in M_j).

The BIC can be thought of as approximating

P(M_j is correct | X_{n×p})

under an appropriate Bayesian model for X.
Model selection & BIC
Bayesian Information Criterion
Typically, statisticians will try to prove that choosing the model with the best BIC yields the "correct model".
Some theoretical justification is needed for this, and it breaks down for mixture models. Nevertheless, BIC is still used.
Another common criterion is AIC (Akaike Information Criterion):

AIC(M_j) = −2 log L_j(θ̂_j) + 2 · (# parameters in M_j).
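To make the two penalties concrete, here is a sketch computing BIC and AIC by hand for a single-component normal model of one Iris variable; the choice of variable is arbitrary, and the model has two parameters, μ and σ^2.

    # BIC and AIC from the definitions above, for a one-component normal model.
    z  <- iris$Sepal.Length
    n  <- length(z)
    mu <- mean(z); s2 <- mean((z - mu)^2)               # maximum likelihood estimates
    loglik <- sum(dnorm(z, mu, sqrt(s2), log = TRUE))
    p  <- 2                                             # parameters: mu and sigma^2
    c(BIC = -2 * loglik + log(n) * p,
      AIC = -2 * loglik + 2 * p)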
Model-based clustering
Summary
1 Choose a type of mixture model (e.g., multivariate normal) and a maximum number of clusters, K.
2 Use a specialized hierarchical clustering technique: model-based hierarchical agglomeration.
3 Use clusters from the previous step to initialize EM for the mixture model.
4 Use BIC to compare different mixture models and models with different numbers of clusters.
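A sketch of this pipeline in R using the mclust package (assumed to be installed): Mclust performs the model-based hierarchical agglomeration for initialization, runs EM for each covariance model and each number of components, and reports the model with the best BIC.

    # Model-based clustering of the Iris measurements with mclust.
    library(mclust)
    mc <- Mclust(iris[, 1:4])     # tries several covariance models and numbers of components
    summary(mc)                   # the BIC-best model and its number of components
    plot(mc, what = "BIC")        # BIC traces across models and numbers of clusters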
The Iris data "best" model: equal shape, 2 components
Part III
Outliers
Outliers
Concepts
What is an outlier? A set of data points that are considerably different from the remainder of the data...
When do they appear in data mining tasks?
Given a data matrix X, find all the cases x_i in X with anomaly/outlier scores greater than some threshold t, or find the top n outlier scores.
Given a data matrix X containing mostly normal (but unlabeled) data points, and a test case x_new, compute an anomaly/outlier score for x_new with respect to X.
Applications
Credit card fraud detection; network intrusion detection; misspecification of a model.
What is an outlier?
Outliers
Issues
How many outliers are there in the data?
The method is unsupervised, similar to clustering or to finding clusters with only one point in them.
Usual assumption: there are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data.
Outliers
General steps
Build a profile of the "normal" behavior. The profile generally consists of summary statistics of this "normal" population.
Use these summary statistics to detect anomalies, i.e., points whose characteristics are very far from the normal profile.
General types of schemes involve a statistical model of "normal", and "far" is measured in terms of likelihood.
Other schemes based on distances can be quasi-motivated by such statistical techniques...
Outliers
Statistical approach
Assume a parametric model describing the distribution of the data (e.g., a normal distribution).
Apply a statistical test that depends on:
the data distribution (e.g., normal);
the parameters of the distribution (e.g., mean, variance);
the number of expected outliers (confidence limit, α or Type I error).
Outliers
Grubbs’ Test
Suppose we have a sample of n numbers Z = {Z_1, ..., Z_n}, i.e., an n × 1 data matrix.
Assuming the data come from a normal distribution, Grubbs' test uses the distribution of

(max_{1≤i≤n} Z_i − Z̄) / SD(Z)

to search for outlying large values.
Outliers
Grubbs’ Test
Lower-tail variant:

(min_{1≤i≤n} Z_i − Z̄) / SD(Z)

Two-sided variant:

max_{1≤i≤n} |Z_i − Z̄| / SD(Z)
Outliers
Grubbs’ Test
Having chosen a test statistic, we must determine a threshold that defines our decision rule.
Often this is set via a hypothesis test to control the Type I error.
For a large positive outlier, the threshold is based on choosing some acceptable Type I error α and finding c_α so that

P_0( max_{1≤i≤n} |Z_i − Z̄| / SD(Z) ≥ c_α ) ≈ α

Above, P_0 denotes the distribution of Z under the assumption that there are no outliers.
If the Z_i are IID N(μ, σ^2), it is generally possible to compute a decent approximation of this probability using Bonferroni.
Outliers
Grubbs’ Test
The two-sided critical value has the form

c_α = ((n − 1)/√n) · sqrt( t^2_{α/(2n), n−2} / (n − 2 + t^2_{α/(2n), n−2}) )

where t_{γ,k}, defined by P(T_k ≥ t_{γ,k}) = γ, is the upper-tail quantile of T_k.
In R, you can use the functions pnorm, qnorm, pt, qt for these quantities.
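Putting the pieces together, here is a sketch of the two-sided Grubbs procedure in R, with the critical value computed from the formula above via qt; the simulated sample with one planted outlier is only for illustration.

    # Two-sided Grubbs statistic and its critical value c_alpha.
    grubbs_stat <- function(z) max(abs(z - mean(z))) / sd(z)

    grubbs_crit <- function(n, alpha = 0.05) {
      t2 <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)^2   # squared upper-tail t quantile
      (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
    }

    set.seed(1)
    z <- c(rnorm(50), 6)                        # 50 N(0,1) points plus one planted outlier
    grubbs_stat(z) > grubbs_crit(length(z))     # TRUE here: the sample is flagged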
Model based: linear regression with outliers
First build a model.
Points which don't fit the model well are identified as outliers.
For the example in the figure, a least squares regression model would be appropriate.
Residuals can be fed into Grubbs' test.
Figure: Residuals from the model can be fed into Grubbs' test or a Bonferroni variant.
Outliers
Multivariate data
If the non-outlying data are assumed to be multivariate Gaussian, what is the analogue of Grubbs' statistic

max_{1≤i≤n} |Z_i − Z̄| / SD(Z) ?

Answer: use the Mahalanobis distance

max_{1≤i≤n} (Z_i − Z̄)^T Σ̂^{−1} (Z_i − Z̄)

Above, each individual statistic has what looks like a Hotelling's T^2 distribution.
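A sketch of the multivariate version in R using the built-in mahalanobis() function, with the sample mean and covariance plugged in for the unknown parameters; the Iris measurements are used only as a convenient numeric data set.

    # Squared Mahalanobis distance of each observation from the sample mean.
    X  <- as.matrix(iris[, 1:4])
    d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
    head(sort(d2, decreasing = TRUE))   # the largest distances are the outlier candidates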
Outliers
Likelihood approach
Assume the data follow a mixture

F = (1 − λ) M + λ A.

Above, M is the distribution of "most of the data."
The distribution A is an "outlier" distribution; it could be uniform on a bounding box for the data.
This is a mixture model. If M is parametric, then the EM algorithm fits naturally here.
Any points assigned to A are “outliers.”
Outliers
Likelihood approach
Do we estimate λ or fix it?
The book starts by describing an algorithm that tries to maximize the equivalent classification likelihood

L(θ_M, θ_A; l) = (1 − λ)^{#l_M} Π_{i ∈ l_M} f_M(x_i; θ_M) × λ^{#l_A} Π_{i ∈ l_A} f_A(x_i; θ_A)
Outliers
Likelihood approach: Algorithm
The algorithm tries to maximize this by forming iterative estimates (M_t, A_t) of the "normal" and "outlying" data points.
1 At each stage, try to move individual points from M_t to A_t.
2 Find (θ̂_M, θ̂_A) based on the new partition (if necessary).
3 If the increase in likelihood is large enough, call these the new sets (M_{t+1}, A_{t+1}).
4 Repeat until no further changes.
Outliers
Nearest neighbour approach
Many ways to define outliers.
Example: data points for which there are fewer than k neighboring points within a distance ε.
Example: the n points whose distance to the k-th nearest neighbour is largest.
Example: the n points whose average distance to the first k nearest neighbours is largest.
Each of these methods depends on the choice of some parameters: k, n, ε. It is difficult to choose these in a systematic way.
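As an illustration of the second definition in the list above, here is a sketch scoring each point by its distance to its k-th nearest neighbour; the helper name and the choice k = 5 are arbitrary.

    # Distance to the k-th nearest neighbour as an outlier score.
    kth_nn_dist <- function(X, k = 5) {
      D <- as.matrix(dist(X))
      diag(D) <- Inf                             # exclude self-distances
      apply(D, 1, function(row) sort(row)[k])    # k-th smallest distance to another point
    }
    scores <- kth_nn_dist(scale(iris[, 1:4]), k = 5)
    head(order(scores, decreasing = TRUE))       # indices of the strongest outlier candidates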
Outliers
Density approach
For each point x_i, compute a density estimate f_{x_i, k} using its k nearest neighbours.
The density estimate used is

f_{x_i, k} = ( Σ_{y ∈ N(x_i, k)} d(x_i, y) / #N(x_i, k) )^{−1}

Define

LOF(x_i) = f_{x_i, k} / ( (Σ_{y ∈ N(x_i, k)} f_{y, k}) / #N(x_i, k) )
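A direct transcription of these two formulas in R, as a sketch; note this is not the original LOF of Breunig et al., which is built from reachability distances and uses the reciprocal orientation of this ratio so that outliers receive large values.

    # Density estimate and LOF exactly as defined above, with k nearest neighbours.
    lof_slides <- function(X, k = 5) {
      D <- as.matrix(dist(X))
      diag(D) <- Inf
      nbrs <- t(apply(D, 1, function(row) order(row)[1:k]))   # k nearest neighbours of each point
      dens <- sapply(seq_len(nrow(X)),
                     function(i) 1 / mean(D[i, nbrs[i, ]]))   # f_{x_i, k}
      sapply(seq_len(nrow(X)),
             function(i) dens[i] / mean(dens[nbrs[i, ]]))     # LOF(x_i)
    }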
Outliers
For each point, compute the density of its local neighborhood.
Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors.
Outliers are points with the largest LOF values.
In the nearest-neighbour approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
Figure: Nearest neighbour vs. density-based approaches.
Outliers
Detection rate
Set P(O) to be the proportion of outliers or anomalies.
Set P(D|O) to be the probability of declaring an outlier if it truly is an outlier. This is the detection rate.
Set P(D|O^c) to be the probability of declaring an outlier if it is truly not an outlier.
Outliers
Bayesian detection rate
The Bayesian detection rate is

P(O|D) = P(D|O) P(O) / ( P(D|O) P(O) + P(D|O^c) P(O^c) ).

The false alarm rate, or false discovery rate, is

P(O^c|D) = P(D|O^c) P(O^c) / ( P(D|O^c) P(O^c) + P(D|O) P(O) ).
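A toy calculation with the first formula (all numbers are hypothetical): even a detector with a high detection rate and a small false alarm probability per case yields a modest Bayesian detection rate when outliers are rare.

    # Bayesian detection rate for hypothetical rates.
    P_O   <- 0.001   # proportion of outliers, P(O)
    P_DO  <- 0.99    # detection rate, P(D | O)
    P_DOc <- 0.01    # P(D | O^c), declaring an outlier when there is none
    P_DO * P_O / (P_DO * P_O + P_DOc * (1 - P_O))   # about 0.09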